Notes: INFO-371: Data Mining and Its Applications (Spring, 2025)

Chapter 1: Introduction

What is Data Mining?

  • Definition:
    • The core step in the Knowledge Discovery in Databases (KDD) process. It involves extracting patterns or knowledge from large amounts of data.
    • A natural evolution of database technology, in great demand, with wide applications.
  • Not Data Mining: Simple search and query processing, (Deductive) expert systems.

Knowledge Discovery (KDD) Process Steps

Data mining is a key step within a larger iterative process:

  1. Learning the Application Domain: Understanding prior knowledge and goals.
  2. Data Selection: Creating a target dataset.
  3. Data Cleaning & Preprocessing: Handling noise, incomplete data (often takes significant effort).
  4. Data Reduction & Transformation: Finding useful features, reducing dimensionality, invariant representation.
  5. Choosing Data Mining Functions: Defining the type of analysis (e.g., summarization, classification, clustering).
  6. Choosing the Mining Algorithm(s): Selecting appropriate methods.
  7. Data Mining: The actual search for patterns of interest.
  8. Pattern Evaluation & Knowledge Presentation: Visualizing and presenting discovered patterns, removing redundant ones.
  9. Use of Discovered Knowledge: Integrating and applying the findings.

Multi-Dimensional View & Classification Schemes

Data mining can be classified based on different dimensions:

  • Data to be Mined: Relational, data warehouse, transactional, stream, object-oriented, spatial, time-series, text, multimedia, heterogeneous, Web.
  • General Functionality:
    • Descriptive Data Mining: Summarizes and characterizes properties of data.
    • Predictive Data Mining: Builds models to predict future data trends or behavior.

Data Mining Functionalities (Detailed)

  • Characterization: Summarizing characteristics of a target class.
  • Discrimination: Comparing target class characteristics with contrasting classes.
  • Association: Discovering patterns that occur frequently together.
  • Classification: Building models to categorize data into predefined classes.
  • Prediction: Forecasting unknown or missing numerical values.
  • Cluster Analysis: Grouping data objects into classes (clusters) where class labels are unknown, maximizing intra-class similarity and minimizing inter-class similarity.
  • Outlier Analysis: Detecting data objects that do not conform to the general behavior of the data.
  • Trend & Evolution Analysis: Analyzing data changes over time.

(Secondary) Interestingness of Discovered Patterns

  • Objective Measures: Based on statistics and structure (e.g., support, confidence).
  • Subjective Measures: Based on user's belief (e.g., unexpectedness, novelty, actionability).

(Secondary) Integration with DB/DW Systems

  • Coupling Schemes:
    • No coupling: Flat file processing (not recommended).
    • Loose coupling: Fetching data from DB/DW.
    • Semi-tight coupling: Efficient implementation of basic data mining primitives within DB/DW (e.g., sorting, indexing, aggregation).
    • Tight coupling: Data mining fully integrated into DB/DW, with optimized mining queries and processing methods (ideal for uniform information processing).

Typical Data Mining System Architecture

A typical data mining system includes:

  • Graphical User Interface: For user interaction.
  • Pattern Evaluation Module: Assesses interestingness of discovered patterns.
  • Data Mining Engine: Contains various mining algorithms.
  • Database or Data Warehouse Server: Stores and manages data.
  • Knowledge-Base: Stores domain knowledge, concept hierarchies, etc.
  • Data Cleaning, Integration, and Selection Module: Prepares data from various sources (databases, files, Web).

Major Issues in Data Mining

  • Mining Methodology: Diverse data types, efficiency, effectiveness, scalability, pattern evaluation, background knowledge incorporation, handling noise and incomplete data, parallel/distributed/incremental mining, knowledge integration.
  • User Interaction: Data mining query languages, expression and visualization of results, interactive multi-level mining.
  • Applications & Social Impacts: Domain-specific mining, data security, integrity, and privacy.

Chapter 2: Getting to Know Your Data

Types of Data Sets

Data can be structured in various ways:

  • Record Data:
    • Relational records (tables).
    • Data matrix (numerical matrix, crosstabs).
    • Document data (term-frequency vectors).
    • Transactional data (sets of items).
  • Graph and Network Data:
    • World Wide Web.
    • Social or information networks.
    • Molecular structures.
  • Ordered Data:
    • Video data (sequence of images).
    • Temporal data (time-series).
    • Sequential data (transaction sequences).
    • Genetic sequence data.
  • Spatial, Image, and Multimedia Data: Maps, images, video.

(Secondary) Important Characteristics of Structured Data

  • Dimensionality: The number of attributes; high dimensionality can lead to the "curse of dimensionality."
  • Sparsity: Data where most attribute values are zero or missing (e.g., only presence counts).
  • Resolution: Patterns can depend on the scale at which data is viewed.
  • Distribution: Describes the centrality and dispersion of data values.

(Secondary) Data Objects and Attributes

  • Data Object (also called sample, example, instance, data point, tuple): Represents an entity.
  • Attribute (also called dimension, feature, variable): A data field that represents a characteristic or feature of a data object.

Attribute Types

Attributes can be categorized by the type of values they hold:

  • Nominal Attributes: Categories, states, or "names of things" without inherent order or quantitative meaning.
  • Binary Attributes: Nominal attributes with only two states (0 or 1).
    • Symmetric Binary: Both outcomes are equally important (e.g., gender).
    • Asymmetric Binary: The outcomes are not equally important (e.g., a medical test's positive vs. negative result, where the positive is the more significant).
  • Ordinal Attributes: Values have a meaningful order or ranking, but the magnitude of difference between successive values is unknown or undefined.
  • Numeric Attributes: Quantitative attributes that are integer or real-valued.
    • Interval-Scaled: Measured on a scale of equal-sized units, values have an order, but there is no true zero-point (ratios are not meaningful).
    • Ratio-Scaled: Has an inherent zero-point, allowing for meaningful ratios between values.

Basic Statistical Descriptions of Data

  • Motivation: To understand the central tendency, variation, and spread of data.
  • Dispersion Characteristics: Include measures like median, max, min, quantiles, outliers, variance.

Data Visualization Techniques

  • Motivation: Gain insight, provide qualitative overview, search for patterns, find interesting regions, provide visual proof.
  • Categorization of Methods:
    • Pixel-oriented
    • Geometric projection
    • Icon-based
    • Hierarchical
    • Visualizing complex data and relations

Measuring Data Similarity and Dissimilarity

  • Similarity: A numerical measure of how alike two data objects are. Higher values indicate greater likeness. Often falls in the range [0, 1].
  • Dissimilarity (e.g., Distance): A numerical measure of how different two data objects are. Lower values indicate greater likeness. Minimum dissimilarity is often 0, while the upper limit can vary.
  • Proximity: A general term referring to either similarity or dissimilarity.

(Secondary) Data Matrix and Dissimilarity Matrix

  • Data Matrix: Represents 'n' data points, each with 'p' dimensions (attributes).
  • Dissimilarity Matrix: Stores the pairwise dissimilarity values between 'n' data points. It is typically a triangular matrix since d(i,j) = d(j,i) and d(i,i) = 0.

(Secondary) Proximity Measures for Different Attribute Types

  • Nominal Attributes:
    • Simple Matching: Dissimilarity $d(i,j) = (p - m) / p$, where $m$ is the number of matches and $p$ is the total number of variables.
    • Can also be handled by converting nominal attributes into a large number of binary attributes.
  • Binary Attributes:
    • Often represented using a contingency table ($q$, $r$, $s$, $t$ for the combinations of 0/1 values for two objects).
    • Symmetric Binary Dissimilarity: $d(i,j) = (r + s) / (q + r + s + t)$. Both outcomes are equally important.
    • Asymmetric Binary Dissimilarity: $d(i,j) = (r + s) / (q + r + s)$. Ignores "negative matches" ($t$), focusing on positive presences.
    • Jaccard Coefficient (Similarity for Asymmetric Binary): $sim_{Jaccard}(i,j) = q / (q + r + s)$. Measures the ratio of common positive presences to total presences.
  • Standardizing Numeric Data:
    • Z-score Normalization: $z = (x - \mu) / \sigma$. Transforms data to have a mean of 0 and a standard deviation of 1.
    • Can also use the mean absolute deviation for more robustness to outliers.
  • Minkowski Distance (for Numeric Data): A popular distance metric, also known as the $L_h$ norm (see the sketch after this list).
    • Properties of a Metric:
      • Positive definiteness: $d(i,j) > 0$ if $i \neq j$, and $d(i,i) = 0$.
      • Symmetry: $d(i,j) = d(j,i)$.
      • Triangle inequality: $d(i,j) \leq d(i,k) + d(k,j)$.
    • Special Cases of Minkowski Distance:
      • $h = 1$: Manhattan Distance (L1 norm): Sum of the absolute differences of the coordinates, $d(i,j) = \sum_f |x_{if} - x_{jf}|$ (e.g., city-block distance).
      • $h = 2$: Euclidean Distance (L2 norm): The straight-line distance between two points, $d(i,j) = \sqrt{\sum_f (x_{if} - x_{jf})^2}$.
      • $h = \infty$: Supremum Distance (Lmax norm): The maximum absolute difference over all coordinates, $d(i,j) = \max_f |x_{if} - x_{jf}|$.
  • Ordinal Variables:
    • Order is important. Can be treated like interval-scaled data by replacing values with ranks and then normalizing the ranks to a [0, 1] range before applying interval-scaled distance measures.
  • Attributes of Mixed Type:
    • A weighted formula combines dissimilarities from the different attribute types (nominal, binary, numeric, ordinal). An indicator variable $\delta_{ij}^{(f)}$ denotes whether $x_{if}$ and $x_{jf}$ are comparable for attribute $f$.
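
A minimal sketch of these measures in plain Python (the toy vectors are hypothetical, not from the course):

```python
# Minkowski special cases and the Jaccard coefficient on hypothetical toy data.

def minkowski(x, y, h):
    """L_h norm distance between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """L_max norm: the largest per-coordinate absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def jaccard_sim(x, y):
    """Jaccard similarity for asymmetric binary vectors: q / (q + r + s)."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return q / (q + r + s)

x, y = [1.0, 2.0, 0.0], [4.0, 6.0, 0.0]
print(minkowski(x, y, 1))  # Manhattan: |1-4| + |2-6| + 0 = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
print(supremum(x, y))      # Supremum: max(3, 4) = 4.0
print(jaccard_sim([1, 1, 0, 1], [1, 0, 0, 1]))  # q=2, r=1, s=0 -> 2/3
```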

(Secondary) Cosine Similarity

  • Used for vector objects, such as documents (represented as term-frequency vectors) or gene features.
  • Measures the cosine of the angle between two vectors.
  • Formula: cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
    • d1 • d2 is the dot product of vectors d1 and d2.
    • ||d|| is the Euclidean norm (length) of vector d.
    • A cosine similarity close to 1 indicates high similarity (small angle), while 0 indicates no similarity (orthogonal vectors).
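
A small illustration of the formula above, on hypothetical term-frequency vectors:

```python
import math

def cosine_sim(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Hypothetical term-frequency vectors for two documents.
doc1 = [5, 0, 3, 0, 2]
doc2 = [3, 0, 2, 0, 1]
print(round(cosine_sim(doc1, doc2), 4))  # ~0.997: nearly identical direction
```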

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

  • Purpose: To improve data quality and prepare data for effective data mining.
  • Data Quality Dimensions:
    • Accuracy: Correctness of data.
    • Completeness: Presence of all required data.
    • Consistency: Reliability and coherence across datasets.
    • Timeliness: Data being up-to-date.
    • Believability: How trustworthy data is.
    • Interpretability: Ease of understanding the data.
  • Major Tasks: Data Cleaning, Data Integration, Data Reduction, Data Transformation, and Data Discretization.

Data Cleaning

  • Real-World Data Issues (Dirty Data):
    • Incomplete: Missing attribute values or entire attributes.
      • Handling Missing Data: Ignore tuples (not always effective), manual fill (impractical for large data), automatic fill (global constant, attribute mean, class-specific mean, most probable value via inference).
    • Noisy: Contains random errors, variance, or outliers.
      • Handling Noisy Data:
        • Binning: Sort data, partition into "bins" (equal-frequency or equal-width), then "smooth" by replacing values with bin means, medians, or boundaries.
        • Regression: Smooth data by fitting it to regression functions.
        • Clustering: Detect and remove outliers by grouping similar data points.
        • Combined Computer & Human Inspection: Automated detection with human verification.
    • Inconsistent: Discrepancies in codes or names.
    • Intentional: Deliberately disguised missing data.
  • Data Cleaning as a Process:
    • Discrepancy Detection: Using metadata, checking rules (uniqueness, null, consecutive), using commercial tools, data scrubbing (domain knowledge for corrections), data auditing (analyzing rules and relationships for violators).
    • Data Migration & Integration Tools: ETL (Extraction/Transformation/Loading) tools for data movement and transformation.
    • Iterative & Interactive: Data cleaning is often a multi-step, iterative process.
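
As a concrete illustration of the binning technique above, a minimal sketch of equal-depth (equal-frequency) binning with smoothing by bin means (the price values are hypothetical):

```python
# Equal-depth binning, then smoothing each bin by its mean.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical, pre-sorted

depth = 4  # tuples per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
for b in bins:
    mean = sum(b) / len(b)
    print(b, "->", [round(mean, 1)] * len(b))
# e.g. [4, 8, 9, 15] -> [9.0, 9.0, 9.0, 9.0]
```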

Data Integration

  • Definition: Combining data from multiple disparate sources into a coherent data store.
  • Challenges:
    • Schema Integration: Matching equivalent real-world entities from different sources (e.g., customer IDs with different names).
    • Entity Identification Problem: Identifying the same real-world entity with different names (e.g., "Bill Clinton" vs. "William Clinton").
    • Data Value Conflicts: Different representations or scales for the same attribute (e.g., metric vs. British units).
  • Handling Redundancy:
    • Correlation Analysis & Covariance Analysis: Used to detect redundant attributes.
    • Careful integration reduces redundancies and inconsistencies, improving mining speed and quality.

(次)Correlation Analysis

  • Nominal Data (Qualitative):
    • Chi-square ($\chi^2$) test: Measures association between nominal attributes. A larger $\chi^2$ value indicates a stronger likelihood of relationship.
    • Important Note: Correlation does not imply causation.
  • Numeric Data (Quantitative):
    • Pearson's Product-Moment Coefficient (Correlation Coefficient): Measures the linear relationship between two numeric attributes A and B.
      • $r_{A,B} > 0$: Positive linear correlation (A increases as B increases).
      • $r_{A,B} = 0$: No linear correlation (potentially independent).
      • $r_{A,B} < 0$: Negative linear correlation (A increases as B decreases).
    • Covariance: Measures how two variables change together.
      • $Cov(A,B) > 0$: A and B both tend to be larger than their expected values.
      • $Cov(A,B) < 0$: When A is larger than its expected value, B is likely to be smaller than its expected value.
      • $Cov(A,B) = 0$: Indicates no linear relationship, but not necessarily independence (except under specific assumptions such as a multivariate normal distribution).
    • Visual Evaluation: Scatter plots provide a visual representation of correlation.
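
A quick numeric check of the covariance and correlation definitions above, on hypothetical data (assumes numpy is available):

```python
import numpy as np

# Hypothetical paired observations of two attributes A and B.
A = np.array([6, 5, 4, 3, 2], dtype=float)
B = np.array([20, 10, 14, 5, 5], dtype=float)

cov = ((A - A.mean()) * (B - B.mean())).mean()  # population covariance
r = cov / (A.std() * B.std())                   # Pearson correlation coefficient
print(round(cov, 2), round(r, 2))  # cov > 0 and r > 0: A and B rise together
```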

Data Reduction Strategies

  • Goal: Obtain a reduced representation of the data set that is much smaller in volume but produces similar analytical results, addressing issues of large data volume and analysis time.
  • Strategies:
    • Dimensionality Reduction: Reducing the number of attributes.
    • Numerosity Reduction: Reducing the number of data objects.
    • Data Compression: Encoding data to reduce size.

(Secondary) Dimensionality Reduction

  • Curse of Dimensionality: As dimensionality increases, data becomes sparse, distances and densities become less meaningful, and computational complexity rises exponentially.
  • Benefits: Avoids the curse, eliminates irrelevant features and noise, reduces mining time/space, facilitates visualization.
  • Techniques:
    • Wavelet Transforms: Decomposes a signal into different frequency subbands. Applicable to multi-dimensional signals, preserves relative distances, allows natural clustering, used in image compression. Produces a compressed approximation by storing only significant coefficients.
    • Principal Component Analysis (PCA): Finds a projection of data (principal components) that captures the maximum variance. Projects original high-dimensional data onto a much smaller space using eigenvectors of the covariance matrix. Works primarily for numeric data.
    • Attribute Subset Selection: Identifies and removes redundant or irrelevant attributes.
      • Heuristic Search Methods: Include best single attribute selection, step-wise forward selection, step-wise backward elimination, combined selection and elimination, and optimal branch and bound.
    • Attribute Creation (Feature Generation): Creates new attributes that more effectively capture important information. This can involve domain-specific extraction, mapping data to new spaces (e.g., Fourier/wavelet transforms), or combining existing features.
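
A minimal PCA sketch along the lines described above: center the numeric data, take eigenvectors of the covariance matrix, and project onto the top components (numpy only; the data is randomly generated for illustration):

```python
import numpy as np

def pca(X, k):
    """Project an n x p data matrix X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]   # directions of largest variance first
    return Xc @ eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # hypothetical 5-dimensional data
print(pca(X, 2).shape)         # (100, 2): reduced to two dimensions
```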

(Secondary) Numerosity Reduction

  • Goal: Reduce the number of data records.
  • Methods:
    • Parametric Methods (e.g., Regression and Log-Linear Models): Assume data fits a model; only store the model parameters (discarding most data).
      • Linear Regression: Models data as a straight line.
      • Multiple Regression: Models a response variable as a linear function of multiple predictor variables.
      • Log-linear Models: Approximate discrete multidimensional probability distributions. Useful for dimensionality reduction and data smoothing.
    • Non-parametric Methods: Do not assume a data model.
      • Histograms: Groups data into bins and stores aggregated summaries (e.g., count, sum, average) for each bin.
      • Clustering: Partitions data into clusters; stores only a representation for each cluster (e.g., centroid, diameter).
      • Sampling: Selects a small, representative sample ($s$) from the larger dataset ($N$).
        • Simple Random Sampling: Equal probability for each item.
        • Sampling Without Replacement: Once an item is selected, it's removed.
        • Sampling With Replacement: An item can be selected multiple times.
        • Stratified Sampling: Partitions data into strata (subgroups) and draws samples proportionally from each. Useful for skewed data.
      • Data Cube Aggregation: Uses pre-computed aggregate data from data cubes for queries, reducing the volume of data processed.

(Secondary) Data Compression

  • Types:
    • String Compression: Typically lossless, but limits direct manipulation without decompression.
    • Audio/Video Compression: Often lossy, with progressive refinement.
  • Dimensionality and numerosity reduction can also be considered forms of data compression.

Data Transformation and Discretization

  • Data Transformation: A function that maps existing attribute values to a new set of replacement values.
    • Techniques:
      • Smoothing: Removing noise (e.g., using binning, regression).
      • Attribute/Feature Construction: Creating new attributes from existing ones.
      • Aggregation: Summarization, data cube construction.
      • Normalization: Scaling attribute values to a specified range.
        • Min-Max Normalization: Scales values to a custom range (e.g., [0, 1]).
        • Z-score Normalization: Scales values based on mean and standard deviation.
        • Normalization by Decimal Scaling: Scales values by moving the decimal point.
      • (Key) Discretization: Dividing the range of a continuous attribute into intervals (bins) and replacing actual values with interval labels.
        • Benefits: Reduces data size; can be supervised (using class labels) or unsupervised; can be top-down (split) or bottom-up (merge).
        • Methods: Binning (equal-width, equal-depth), histogram analysis, clustering analysis, decision-tree analysis (supervised, uses entropy for splits), correlation analysis (e.g., ChiMerge, a bottom-up merge based on $\chi^2$ values).
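
A short sketch of the three normalization techniques listed above, on hypothetical salary values:

```python
import math
import statistics

values = [73600.0, 54000.0, 32000.0, 98000.0]  # hypothetical salaries

# Min-max normalization to [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: mean 0, standard deviation 1.
mu, sigma = statistics.mean(values), statistics.pstdev(values)
z_score = [(v - mu) / sigma for v in values]

# Decimal scaling: divide by 10^j, with the smallest j making max |v'| < 1.
j = math.floor(math.log10(max(abs(v) for v in values))) + 1
decimal = [v / 10 ** j for v in values]

print([round(v, 3) for v in min_max])
print([round(v, 3) for v in z_score])
print(decimal)
```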

Concept Hierarchy Generation

  • Definition: Organizes attribute values hierarchically (e.g., street < city < state < country).
  • Purpose: Facilitates "drilling" and "rolling" in data warehouses to view data at multiple granularities.
  • Formation:
    • Explicit Specification: Defined by domain experts or data warehouse designers.
    • Automatic Generation: Based on analysis of distinct values (attributes with more distinct values are placed at lower levels). Can be applied to both numeric (using discretization methods) and nominal data.

Chapter 5: Mining Frequent Patterns, Association and Correlations

What is Frequent Pattern Analysis?

  • Frequent Pattern: A pattern (e.g., a set of items, subsequences, substructures) that occurs frequently in a dataset.
  • Applications: Market basket analysis, cross-marketing, Web log analysis, DNA sequence analysis, etc.
  • Importance:
    • Reveals intrinsic and important properties of data sets.
    • Forms the foundation for many data mining tasks, including:
      • Association, correlation, and causality analysis.
      • Sequential and structural pattern mining.
      • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data.
      • Associative classification and frequent pattern-based clustering.
      • Data warehousing tasks (e.g., iceberg cube).
      • Semantic data compression.

Basic Concepts: Frequent Patterns and Association Rules

  • Itemset: A collection of one or more items.
  • Frequent Itemset: An itemset whose occurrence frequency (or support) is above a user-specified min_support threshold.
  • Association Rule (X → Y): An implication expression showing that the presence of itemset X in a transaction implies the presence of itemset Y.
    • Support (s): The probability that a transaction contains both X and Y (i.e., $P(X \cup Y)$).
    • Confidence (c): The conditional probability that a transaction containing X also contains Y (i.e., $P(Y|X)$).
  • Mining Goal: Find all association rules X → Y that satisfy both min_support and min_confidence.
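
A worked toy example of the support and confidence definitions (the five transactions are hypothetical):

```python
# Support and confidence of the rule {beer} -> {diaper} on toy transactions.
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

X, Y = {"beer"}, {"diaper"}
n_X = sum(1 for t in transactions if X <= t)         # transactions containing X
n_XY = sum(1 for t in transactions if (X | Y) <= t)  # transactions containing X and Y

support = n_XY / len(transactions)  # P(X u Y) = 3/5
confidence = n_XY / n_X             # P(Y | X) = 3/3
print(support, confidence)          # 0.6 1.0
```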

Scalable Methods for Mining Frequent Patterns

  • Downward Closure Property (Apriori Property): A fundamental property stating that if an itemset is frequent, then all of its non-empty subsets must also be frequent. Conversely, if any subset of an itemset is infrequent, then the itemset itself must be infrequent. This property is crucial for pruning the search space.

  • (Computation Focus) Apriori Algorithm:

    • Principle: Iteratively finds frequent k-itemsets by using the frequent (k-1)-itemsets to generate candidates. Infrequent candidates are pruned.
    • Steps:
      1. Scan the database to find all frequent 1-itemsets (L1).
      2. Generate Candidates (Ck+1): Use Lk (frequent k-itemsets) to generate Ck+1 (candidate (k+1)-itemsets) through self-joining.
      3. Prune Candidates: Remove any candidate in Ck+1 that has an infrequent k-subset (based on Lk).
      4. Count Support: Scan the database to count the support for each candidate in Ck+1.
      5. Find Frequent Itemsets (Lk+1): Collect candidates from Ck+1 whose support meets min_support.
      6. Repeat until no more frequent or candidate sets can be generated.
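
A compact sketch of the Apriori loop above, with self-join candidate generation and subset-based pruning (pure Python; treating min_support as an absolute count is an implementation choice):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (frozenset) with its support count."""
    level = {frozenset([item]) for t in transactions for item in t}  # C1
    frequent = {}
    while level:
        # One database scan: count support for the current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}  # Lk
        frequent.update(current)
        # Self-join Lk with itself to build C(k+1), then prune candidates
        # that have an infrequent k-subset (the Apriori property).
        k = len(next(iter(current), frozenset())) + 1
        level = {a | b for a in current for b in current if len(a | b) == k}
        level = {c for c in level
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

T = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(T, 2).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)  # e.g. {'B', 'C', 'E'} is frequent with count 2
```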

Mining Various Kinds of Association Rules

  • Multilevel Association Rules:
    • Handle items that exist at different levels of abstraction (e.g., "Milk" vs. "2% Milk").
    • Often require flexible support settings, as lower-level items typically have lower support (e.g., "reduced support" thresholds for lower levels).
    • Redundancy Filtering: Identify and remove redundant rules where a more general "ancestor" rule effectively explains a more specific rule (e.g., if milk → bread is strong, 2% milk → bread may be redundant if its support is close to what the ancestor rule predicts).
  • Multidimensional Association Rules:
    • Involve predicates/attributes from multiple dimensions.
    • Inter-dimension association rules: Involve attributes from different dimensions with no repeated predicates (e.g., age(X) ∧ occupation(X) → buys(X, "coke")).
    • Hybrid-dimension association rules: Allow repeated predicates (e.g., age(X) ∧ buys(X, "popcorn") → buys(X, "coke")).
    • Categorical Attributes: Typically handled using a data cube approach.
  • Quantitative Association Rules:
    • Involve numeric attributes (e.g., age, salary).
    • Handling:
      • Static Discretization: Pre-define concept hierarchies to convert numeric values into ranges (e.g., age(X,"30-39")).
      • Dynamic Discretization: Discretize numeric attributes "on the fly" during mining, often maximizing rule confidence or compactness.
      • Clustering: Use clustering to group similar numeric values, then derive associations from these clusters.
      • Deviation Analysis: Identify associations based on deviations from overall means.

Constraint-Based (Query-Directed) Mining

  • Motivation: Finding all possible patterns autonomously is unrealistic due to the sheer volume and lack of focus. Data mining should be an interactive process.
  • User-Directed: Users guide the mining process by specifying constraints, often through a data mining query language (like DMQL) or a graphical user interface.
  • Benefits:
    • User Flexibility: Allows users to focus on patterns relevant to their specific interests or domain knowledge.
    • System Optimization (Constraint-Pushing): Constraints can be used to significantly prune the search space and improve the efficiency of the mining algorithms, similar to how selections are pushed down in database query processing.
  • Types of Constraints:
    • Knowledge Type Constraint: Specify the type of knowledge to be mined (e.g., classification rules, association rules).
    • Data Constraint: SQL-like conditions to select relevant data for mining.
    • Dimension/Level Constraint: Specify the dimensions or hierarchy levels of interest.
    • Rule (or Pattern) Constraint: Define conditions on the structure or properties of the patterns themselves (e.g., "patterns must contain X" or "rules must involve a certain value range").
    • Interestingness Constraint: Specify minimum thresholds for measures like support, confidence, or lift.

Chapter 6: Classification and Prediction

Classification vs. Prediction

  • Classification:
    • Predicts categorical (discrete or nominal) class labels.
    • Constructs a model based on a training set with known class labels.
    • Uses the model to classify new, unseen data.
  • Prediction:
    • Models continuous-valued functions.
    • Predicts unknown or missing numerical values.

Classification: A Two-Step Process

  1. Model Construction (Learning):
    • Each training tuple is assumed to belong to one of a set of predefined classes, as indicated by its class label.
    • A training set of tuples (samples) with known class labels is used to build the model.
    • The model can be represented as classification rules, decision trees, or mathematical formulae.
  2. Model Usage (Prediction):
    • The constructed model is used to classify future or unknown objects.
    • Accuracy of the model is estimated by comparing classified results with known labels of a test set (which is independent of the training set to avoid overfitting).
    • If accuracy is acceptable, the model is deployed.

Issues Regarding Classification and Prediction

  • Data Preparation:
    • Data Cleaning: Preprocessing to reduce noise and handle missing values.
    • Relevance Analysis (Feature Selection): Removing irrelevant or redundant attributes.
    • Data Transformation: Generalizing and/or normalizing data.
  • Evaluating Classification Methods:
    • Accuracy: Percentage of correctly classified test samples (for classifiers) or how close predicted values are to actual values (for predictors).
    • Speed: Time to construct (training time) and use (classification/prediction time) the model.
    • Robustness: Ability to handle noisy and missing data.
    • Scalability: Efficiency in handling large, disk-resident databases.
    • Interpretability: Ease of understanding the model and the insights it provides (e.g., clarity of rules or decision tree size).
    • Other Measures: Goodness/compactness of rules.

(Computation Focus) Classification by Decision Tree Induction
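
Flagged as a computation focus. A minimal sketch of the core calculation in decision tree induction, information gain for a candidate split, assuming the standard entropy-based formulation (the toy tuples are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution of D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - sum |D_j|/|D| * Info(D_j), over the values of A."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    expected = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - expected

# Hypothetical training tuples (outlook, windy) with class label "play?".
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"),
        ("overcast", "no"), ("overcast", "yes")]
labels = ["no", "no", "yes", "no", "yes", "yes"]
print(round(info_gain(rows, 0, labels), 3))  # outlook: 0.667 -> chosen for the split
print(round(info_gain(rows, 1, labels), 3))  # windy: 0.082
```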


(Computation Focus) Bayesian Classification
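
Also flagged as a computation focus. A minimal naive Bayes sketch that picks the class maximizing $P(C) \prod_k P(x_k|C)$; Laplace smoothing is an assumed implementation choice, and the toy data is hypothetical:

```python
from collections import Counter

def naive_bayes(rows, labels, query):
    """Pick the class maximizing P(C) * prod_k P(x_k | C), with Laplace smoothing."""
    n = len(labels)
    scores = {}
    for c, n_c in Counter(labels).items():
        score = n_c / n  # prior P(C)
        for k, value in enumerate(query):
            n_match = sum(1 for row, lab in zip(rows, labels)
                          if lab == c and row[k] == value)
            n_values = len({row[k] for row in rows})   # distinct values of attribute k
            score *= (n_match + 1) / (n_c + n_values)  # smoothed P(x_k | C)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical tuples (outlook, temperature) with class label "play?".
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print(naive_bayes(rows, labels, ("rain", "mild")))  # -> ('yes', {...})
```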


(Secondary) Rule-Based Classification

  • Representation: Knowledge is represented as IF-THEN rules (IF antecedent THEN consequent).
  • Rule Assessment:
    • Coverage: Number of tuples satisfied by the rule's antecedent.
    • Accuracy: Percentage of covered tuples that are correctly classified by the rule's consequent.
  • Conflict Resolution (if multiple rules are triggered):
    • Size Ordering: Priority to rules with more attribute tests (more specific).
    • Class-based Ordering: Priority based on class prevalence or misclassification cost.
    • Rule-based Ordering (Decision List): Rules are ordered by quality or expert knowledge.
  • Rule Extraction from Decision Tree: Each path from the root to a leaf in a decision tree can be converted directly into an IF-THEN rule. These rules are mutually exclusive and exhaustive.
  • Rule Extraction from Training Data (Sequential Covering Algorithm):
    • Extracts rules one at a time.
    • After a rule is learned, the tuples it covers are removed.
    • Process repeats until a termination condition is met (e.g., no more training examples, or rule quality falls below a threshold).

(Secondary) Support Vector Machines (SVMs)

  • Concept: A powerful classification method for both linear and nonlinear data.
  • Principle:
    • Transforms original training data into a higher dimension using a nonlinear mapping.
    • In this higher dimension, it searches for a linear optimal separating hyperplane (the "decision boundary") with the largest margin.
    • With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
    • Uses support vectors (the "essential" training tuples lying closest to the decision boundary) to define the hyperplane and its margin.
  • Effectiveness on High-Dimensional Data:
    • Complexity depends on the number of support vectors, not the data dimensionality.
    • Support vectors are critical examples; removing others doesn't change the hyperplane.
    • A small number of support vectors enables good generalization even in high dimensions.
  • SVM vs. Neural Networks:
    • SVM: Newer concept, deterministic, good generalization, learned in batch mode (quadratic programming), can handle complex functions with kernels.
    • Neural Network: Older concept, nondeterministic, generalizes well but with a weaker mathematical foundation; can be learned incrementally; multi-layer perceptrons can learn complex functions.

(Secondary) Associative Classification

  • Concept: Integrates association rule mining with classification.
  • Process:
    • Generates and analyzes association rules that involve frequent patterns (conjunctions of attribute-value pairs) and class labels.
    • Rules are typically of the form P1 ∧ P2 ∧ ... ∧ Pn → "Aclass = C" (confidence, support).
  • Effectiveness: Explores highly confident associations among multiple attributes, often overcoming single-attribute constraints of decision trees, and has shown higher accuracy than some traditional methods (e.g., C4.5).

(Secondary) Lazy Learners (Instance-Based Methods)

  • Concept: Delays processing of training data until a new instance needs to be classified ("lazy evaluation"). Stores training data with minimal processing.
  • Contrast with Eager Learners: Eager learners (like decision trees, Bayes classifiers) construct a model before receiving new data.
  • Advantages of Lazy Learning: Less training time, effectively uses a richer hypothesis space with many local functions.
  • Disadvantages of Lazy Learning: More time in prediction.
  • Typical Approaches:
    • k-Nearest Neighbor (k-NN) (see the sketch after this list):
      • All instances are points in n-dimensional space.
      • Nearest neighbors are defined by Euclidean distance.
      • For discrete-valued target functions: returns the most common class among the k nearest training examples.
      • For real-valued prediction: returns the mean of the k nearest neighbors.
      • Distance-weighted k-NN: Gives greater weight to closer neighbors.
      • Curse of Dimensionality: Distances can be dominated by irrelevant attributes; overcome by axes stretch or attribute elimination.
    • Locally Weighted Regression: Constructs local approximations for predictions.
    • Case-Based Reasoning (CBR):
      • Uses a database of problem solutions (cases) represented by symbolic descriptions.
      • Searches for similar cases to solve new problems.
      • Challenges: Finding good similarity metrics, indexing, and adapting solutions.
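
A minimal k-NN classifier following the description above (Euclidean distance, majority vote; the 2-D training points are hypothetical):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points as ((x1, x2), class_label).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify(train, (1.1, 1.0)))  # "A": all three nearest neighbors are A
print(knn_classify(train, (5.1, 5.0)))  # "B": two B votes outweigh one A
```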

Prediction (for Numerical Data)

  • Similarity to Classification: Constructs a model to predict values.
  • Difference from Classification: Predicts continuous or ordered values, not categorical class labels.
  • Major Method: Regression Analysis: Models the relationship between one or more independent/predictor variables and a dependent/response variable.
    • Linear Regression: Models linear relationships ($y = w_0 + w_1 x$). Parameters (coefficients) are estimated using the least squares method.
    • Multiple Linear Regression: Models a response variable as a linear function of multiple predictor variables.
    • Nonlinear Regression: Models nonlinear relationships, sometimes transformable into linear models (e.g., polynomial regression).
    • Other Regression-Based Models:
      • Generalized Linear Models: Foundation for applying linear regression to categorical response variables (e.g., Logistic Regression, Poisson Regression).
      • Log-Linear Models: Used for categorical data to approximate discrete multidimensional probability distributions (also useful for data compression and smoothing).
      • Regression Trees and Model Trees: Tree-based models that predict continuous values rather than class labels. Leaves in a regression tree store the average prediction, while leaves in a model tree store a regression model (e.g., linear equation).

Accuracy and Error Measures (Detailed)

  • Classifier Accuracy: Percentage of correctly classified test tuples.
    • Error Rate (Misclassification Rate): 1 - accuracy.
    • Confusion Matrix: Summarizes classification results for multiple classes (True Positives, False Negatives, False Positives, True Negatives).
    • Alternative Measures (e.g., for imbalanced data like cancer diagnosis):
      • Sensitivity (Recall): True Positive Rate (t-pos/pos).
      • Specificity: True Negative Rate (t-neg/neg).
      • Precision: t-pos/(t-pos + f-pos).
      • These measures are used for cost-benefit analysis.
  • Predictor Error Measures (for Numerical Prediction):
    • Loss Function: Measures the error between the actual value $y_i$ and the predicted value $y_i'$.
    • Absolute Error: $|y_i - y_i'|$.
    • Squared Error: $(y_i - y_i')^2$.
    • Test Error (Generalization Error): Average loss over the test set.
    • Mean Absolute Error (MAE): Average of absolute errors.
    • Mean Squared Error (MSE): Average of squared errors (exaggerates outliers).
    • Root Mean Squared Error (RMSE): Square root of MSE.
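
A small sketch computing the measures defined above, from a hypothetical confusion matrix and hypothetical numeric predictions:

```python
import math

# Hypothetical binary confusion-matrix counts.
t_pos, f_neg, f_pos, t_neg = 90, 10, 30, 870

accuracy = (t_pos + t_neg) / (t_pos + f_neg + f_pos + t_neg)
sensitivity = t_pos / (t_pos + f_neg)  # recall, true positive rate
specificity = t_neg / (t_neg + f_pos)  # true negative rate
precision = t_pos / (t_pos + f_pos)
print(accuracy, sensitivity, round(specificity, 3), precision)

# Hypothetical actual values vs. predictions.
y = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.5, 5.0, 4.0, 7.0]
errors = [yi - pi for yi, pi in zip(y, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error
print(mae, mse, round(rmse, 3))
```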

Evaluating Classifier or Predictor Accuracy (Methods)

  • Holdout Method: Randomly partitions data into two independent sets:
    • Training Set (e.g., 2/3 of data) for model construction.
    • Test Set (e.g., 1/3 of data) for accuracy estimation.
    • Random Sampling: Repeat holdout multiple times and average the accuracies.
  • Cross-Validation (k-fold):
    • Randomly partitions data into k mutually exclusive subsets of equal size (e.g., k=10 is popular).
    • In each iteration i, subset D_i is used as the test set, and remaining k-1 subsets are used as training data.
    • Leave-one-out: k equals the number of tuples (for very small datasets).
    • Stratified Cross-Validation: Ensures class distribution in each fold is approximately the same as in the original data.
  • Bootstrap (.632 Bootstrap):
    • Works well with small datasets.
    • Samples training tuples uniformly with replacement.
    • A training set of d samples is created by sampling d times with replacement from the original d tuples. (Approx. 63.2% of original data forms the training set, remaining ~36.8% form the test set).
    • Repeats sampling k times, and overall accuracy is a weighted average of test and training set accuracies.
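
A minimal sketch of k-fold cross-validation as described above; the train_and_score callback stands in for any classifier and is purely hypothetical:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k (near-)equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_and_score):
    """Average the score over k train/test splits (fold i is the test set)."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score([data[j] for j in train_idx],
                                      [data[j] for j in test_idx]))
    return sum(scores) / k

# Stand-in scorer: pretend every trained model scores 0.9 on its test fold.
data = list(range(100))
print(cross_validate(data, 10, lambda train, test: 0.9))  # 0.9
```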

Ensemble Methods (Increasing Accuracy)

  • Concept: Combines multiple learned models to improve overall accuracy.
  • Types:
    • Bagging (Bootstrap Aggregating): Generates multiple training sets by bootstrapping, builds a classifier for each, and averages their predictions (or uses majority vote). Reduces variance.
    • Boosting: Sequentially builds models, with each new model focusing on correcting errors made by previous models. Weights misclassified instances more heavily. Reduces bias.
    • Ensemble: Combining a set of heterogeneous classifiers (different types of models).

Model Selection (ROC Curves)

  • ROC (Receiver Operating Characteristics) Curves:
    • Purpose: Visual comparison of classification models, especially useful for imbalanced datasets or when the costs of false positives/negatives are different.
    • Plot: Plots True Positive Rate (Sensitivity) on the Y-axis against False Positive Rate (1-Specificity) on the X-axis for various classification thresholds.
    • Interpretation:
      • A model with perfect accuracy has an area of 1.0 (top-left corner).
      • The closer the curve is to the diagonal line (area close to 0.5), the less accurate the model (random guessing).
    • AUC (Area Under the Curve): A single scalar value representing the overall accuracy of the model; higher AUC indicates better performance.

Chapter 7: Cluster Analysis

What is Cluster Analysis?

  • Definition: Grouping a collection of data objects into clusters.
  • Goal: Objects within a cluster are similar to one another, and dissimilar to objects in other clusters.
  • Nature: It is an unsupervised learning technique, meaning there are no predefined classes or labels.
  • Applications: Can be used as a stand-alone tool to understand data distribution or as a preprocessing step for other algorithms (e.g., classification).
    • Examples: Marketing (customer segmentation), Land use identification, Insurance (grouping high-claim policyholders), City planning, Earthquake studies.

Quality of Clustering

  • Good Clustering: Produces clusters with:
    • High intra-class similarity: Objects within the same cluster are very alike.
    • Low inter-class similarity: Objects in different clusters are very different.
  • Quality Measures: Dependent on the similarity measure used and its implementation. Also, the ability to discover hidden patterns.
  • Subjectivity: Defining "similar enough" or "good enough" is often subjective.

(Secondary) Requirements of Clustering in Data Mining

  • Scalability: Efficiently handle large datasets.
  • Attribute Types: Ability to deal with various data types (numeric, categorical, mixed).
  • Dynamic Data: Adapt to changing data.
  • Arbitrary Shapes: Discover clusters of non-spherical shapes.
  • Minimal Domain Knowledge: Require few input parameters or prior knowledge.
  • Noise and Outliers: Robustly handle erroneous or unusual data points.
  • Order Insensitivity: Results should not depend on the order of input records.
  • High Dimensionality: Work effectively in spaces with many attributes.
  • User-Specified Constraints: Incorporate user-defined rules or preferences.
  • Interpretability and Usability: Results should be easy to understand and apply.

Types of Data in Cluster Analysis (and their Proximity Measures)

  • Data Structures:
    • Data Matrix: Represents data points with dimensions (attributes).
    • Dissimilarity Matrix: Stores pairwise distances/dissimilarities between objects.
  • Interval-Scaled Variables (Numeric):
    • Standardization: Data often needs to be normalized (e.g., z-score, mean absolute deviation) before distance calculation to prevent attributes with large ranges from dominating.
    • Minkowski Distance: A generalized distance metric.
      • Manhattan Distance (L1 norm, q=1): Sum of absolute differences.
      • Euclidean Distance (L2 norm, q=2): Square root of the sum of squared differences.
      • Supremum Distance (Lmax norm, $q = \infty$): Maximum difference.
      • Properties (for a metric): Positive definiteness, Symmetry, Triangle Inequality.
  • Binary Variables:
    • Symmetric Binary: Both outcomes (0/1) are equally important for dissimilarity.
    • Asymmetric Binary: Only the presence (1) matters for similarity.
    • Jaccard Coefficient: A similarity measure for asymmetric binary variables.
  • Nominal Variables:
    • Simple Matching: Dissimilarity is the proportion of non-matching attributes.
    • Can be converted to multiple binary variables.
  • Ordinal Variables:
    • Order is important (rank). Can be treated like interval-scaled data by mapping ranks to a [0, 1] range.
  • Ratio-Scaled Variables:
    • Positive measurements on a nonlinear (exponential) scale.
    • Often, a logarithmic transformation ($y_{if} = \log(x_{if})$) is applied to linearize the scale.
  • Variables of Mixed Types: A weighted formula is used to combine dissimilarities from different attribute types.
  • Vector Objects (e.g., documents, gene features):
    • Cosine Similarity: Measures the cosine of the angle between two vectors (higher value = more similar).
    • Tanimoto Coefficient: A variant of cosine similarity.

Categorization of Major Clustering Methods

  • Partitioning Methods: Construct k partitions, optimizing a criterion (e.g., minimizing squared error).
  • Hierarchical Methods: Create a tree-like hierarchy of clusters.
  • Density-Based Methods: Discover arbitrarily shaped clusters based on density connectivity.
  • Grid-Based Methods: Quantize space into a finite number of cells, perform clustering on the grid structure.
  • Model-Based Methods: Hypothesize a model for each cluster and find the best fit.
  • Frequent Pattern-Based: Based on the analysis of frequent patterns.
  • User-Guided or Constraint-Based: Incorporate user-specified or application-specific constraints.

主要聚类方法分类

  • 划分方法: 构建 k 个分区,优化某个准则(例如,最小化平方误差)。
  • 层次方法: 创建一个树状的簇层次结构。
  • 基于密度的方法: 根据密度连通性发现任意形状的簇。
  • 基于网格的方法: 将空间量化为有限数量的单元格,在网格结构上执行聚类。
  • 基于模型的方法: 为每个簇假设一个模型并找到最佳拟合。
  • 基于频繁模式: 基于频繁模式的分析。
  • 用户引导或基于约束: 整合用户指定或特定于应用的约束。

(次)Partitioning Methods (Details)

  • K-Means:
    • Concept: Partitions n objects into k clusters. Each cluster is represented by its centroid (mean point).
    • Algorithm Steps:
      1. Arbitrarily choose k initial centroids (seed points).
      2. Assign each object to the cluster with the nearest centroid.
      3. Recompute new centroids for each cluster based on assigned objects.
      4. Repeat steps 2-3 until no object changes its cluster assignment.
    • Strengths: Relatively efficient for large datasets: O(tkn) time, where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t ≪ n. (A minimal sketch follows this list.)
    • Weaknesses:
      • Applicable only when mean is defined (numerical data).
      • Requires specifying k in advance.
      • Sensitive to noisy data and outliers (means are easily influenced).
      • Only discovers clusters with convex (e.g., spherical) shapes.
      • Can converge to a local optimum.
    • Variations: Different initial centroid selections, dissimilarity calculations, and strategies for updating cluster means.
      • K-Modes: For categorical data, replaces means with modes.
      • K-Prototypes: For mixed categorical and numerical data.
  • K-Medoids (PAM - Partitioning Around Medoids):
    • Problem with K-Means: Sensitivity to outliers.
    • Concept: Uses medoids (the most centrally located actual objects in each cluster) as reference points instead of means, which makes the method more robust to noise and outliers.
    • PAM Algorithm: Starts with an initial set of k medoids and iteratively replaces a medoid with a non-medoid object if it improves the total distance of the clustering.
    • Scalability Issue: PAM works well for small datasets but does not scale well for large datasets (O(k(n-k)^2) per iteration).
    • CLARA (Clustering Large Applications): A sampling-based method that improves PAM's scalability by applying PAM to multiple samples.
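
A minimal NumPy sketch of the k-means loop described above (steps 1-4). It assumes numeric data and a pre-chosen k, and it does not handle empty clusters, so read it as an illustration rather than a production implementation:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: seed points
        for _ in range(max_iter):
            # step 2: assign each object to the nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step 3: recompute each centroid as the mean of its cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # step 4: stop when centroids (hence assignments) no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage: two well-separated Gaussian blobs
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
    labels, centers = kmeans(X, k=2)

Note that replacing the mean in step 3 with a mode (k-modes) or with an actual medoid object (PAM) changes only the cluster representative and the dissimilarity used; the assign-and-update loop stays the same.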

(次)划分方法(详细)

  • K-Means:
    • 概念: 将 n 个对象划分为 k 个簇。每个簇由其质心(均值点)表示。
    • 算法步骤:
      1. 任意选择 k 个初始质心(种子点)。
      2. 将每个对象分配到最近质心的簇中。
      3. 根据分配的对象重新计算每个簇的新质心。
      4. 重复步骤2-3,直到没有对象改变其簇分配。
    • 优点: 对大型数据集相对高效(时间复杂度为O(tkn))。
    • 缺点:
      • 仅适用于均值有定义的情况(数值数据)。
      • 需要提前指定 k
      • 对噪声数据和异常值敏感(均值易受影响)。
      • 只发现凸形(例如球形)的簇。
      • 可能收敛到局部最优。
    • 变体: 不同的初始质心选择、相异性计算和更新簇均值的策略。
      • K-Modes: 用于类别数据,用众数代替均值。
      • K-Prototypes: 用于混合类别和数值数据。
  • K-Medoids (PAM - 基于中心点的划分):
    • K-Means的问题: 对异常值敏感。
    • 概念: 使用中心点(数据集中实际存在的对象,它们是簇中最中心的位置)作为参考点而不是均值。对噪声和异常值更鲁棒。
    • PAM算法: 从 k 个初始中心点集开始,如果用某个非中心点对象替换中心点能降低聚类的总距离,则迭代地进行替换。
    • 可伸缩性问题: PAM 对小型数据集效果良好,但对大型数据集可伸缩性不佳(每次迭代O(k(n-k)^2))。
    • CLARA(大型应用聚类算法): 一种基于采样的K-Medoids改进方法,通过对多个样本应用PAM来提高可伸缩性。

(次)Hierarchical Methods (Details)

  • Concept: Creates a hierarchical decomposition of the data objects. Does not require specifying k in advance. Uses a distance matrix.
  • Agglomerative (AGNES - Agglomerative Nesting):
    • Bottom-up approach: Starts with each object as a separate cluster, and iteratively merges the closest pairs of clusters until all objects are in a single cluster or a termination condition is met.
    • Uses a dissimilarity matrix.
    • Linkage Criteria for Merging: Single-link (minimum inter-cluster distance), Complete-link (maximum inter-cluster distance), Average-link (mean pairwise distance); see the SciPy sketch after this list.
  • Divisive (DIANA - Divisive Analysis):
    • Top-down approach: Starts with all objects in one cluster and recursively splits the largest cluster into smaller clusters.
  • Dendrogram: A tree-like diagram that visually represents the hierarchical clustering process, showing how clusters are merged (AGNES) or split (DIANA) at different dissimilarity levels. Cutting the dendrogram at a desired level yields a clustering.
  • Weaknesses of Traditional Hierarchical Methods:
    • Scalability: Do not scale well (at least O(n^2) time complexity).
    • Irreversibility: Cannot undo what was done previously (merges/splits).
  • Recent Hierarchical Clustering Methods (Address Scalability/Limitations):
    • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Uses a CF-tree (Clustering Feature Tree) to summarize clusters, handles large datasets efficiently.
    • ROCK (Robust Clustering using Links): For categorical data, uses neighbor and link analysis.
    • CHAMELEON: Uses dynamic modeling to merge subclusters based on their interconnectivity and closeness.
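
A short sketch of bottom-up (AGNES-style) clustering using SciPy's hierarchical-clustering routines, which implement the linkage criteria listed above; the random data and the cut level are arbitrary:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)

    # Agglomerative merging under different linkage criteria
    Z_single   = linkage(X, method='single')    # minimum inter-cluster distance
    Z_complete = linkage(X, method='complete')  # maximum inter-cluster distance
    Z_average  = linkage(X, method='average')   # mean pairwise distance

    # Cutting the dendrogram encoded in Z at a desired level yields a flat clustering
    labels = fcluster(Z_average, t=3, criterion='maxclust')  # ask for 3 clusters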

(次)层次方法(详细)

  • 概念: 创建数据对象的分层分解。不需要预先指定 k。使用距离矩阵
  • 聚合式(AGNES - 凝聚式嵌套):
    • 自底向上方法: 从每个对象作为一个单独的簇开始,迭代地合并最近的簇对,直到所有对象都在一个簇中或达到终止条件。
    • 使用相异性矩阵
    • 合并的连接准则: 单一连接(最小距离)、完全连接(最大距离)、平均连接。
  • 分裂式(DIANA - 分裂分析):
    • 自顶向下方法: 从所有对象在一个簇中开始,递归地将最大的簇分成更小的簇。
  • 树状图(Dendrogram): 一种可视化表示层次聚类过程的树形图,显示簇在不同相异性级别上如何合并(AGNES)或分裂(DIANA)。在所需级别切割树状图即可得到一个聚类结果。
  • 传统层次方法的缺点:
    • 可伸缩性: 可伸缩性不佳(至少O(n^2)时间复杂度)。
    • 不可逆性: 无法撤销之前所做的操作(合并/分裂)。
  • 近期层次聚类方法(解决可伸缩性/限制):
    • BIRCH(使用层次结构的平衡迭代规约和聚类): 使用CF-树(聚类特征树)来汇总簇,高效处理大型数据集。
    • ROCK(使用链接的鲁棒聚类): 用于类别数据,使用邻居和链接分析。
    • CHAMELEON: 使用动态建模根据它们的互连性和紧密性合并子簇。

Outlier Analysis

  • What are Outliers? Objects that are considerably dissimilar or "deviate" from the majority of the data.
  • Applications: Fraud detection (credit card, telecom), unusual pattern detection, medical analysis, security.
  • Approaches to Outlier Discovery:
    • Statistical Approaches:
      • Assume an underlying data distribution model (e.g., normal distribution).
      • Use discordancy tests to determine if an object deviates significantly based on the distribution's parameters (mean, variance) and the number of expected outliers.
      • Drawbacks: Most tests are for single attributes; the data distribution is often unknown.
    • Distance-Based Approach:
      • An object O is a DB(p, D)-outlier if at least a fraction p of other objects in the dataset are at a distance greater than D from O.
      • Allows multi-dimensional analysis without assuming a data distribution.
      • Algorithms: Index-based, Nested-loop, Cell-based (a naive nested-loop sketch follows this list).
    • Density-Based Approach (Local Outlier Factor - LOF):
      • Addresses limitations of distance-based methods when data is not uniformly distributed (e.g., clusters of varying densities).
      • Defines a Local Outlier Factor (LOF) for each object, comparing its local density with that of its neighbors; an LOF well above 1 means the object sits in a much sparser region than its neighbors, suggesting an outlier.
    • Deviation-Based Approach:
      • Identifies outliers by examining the main characteristics of objects in a group and finding objects that "deviate" from this description.
      • Sequential Exception Technique: Simulates human pattern recognition to distinguish unusual objects.
      • OLAP Data Cube Technique: Uses data cubes to identify regions of anomalies in multidimensional data.
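
A naive nested-loop sketch of the DB(p, D)-outlier definition above; it runs in O(n^2) and assumes D >= 0, so an object's zero distance to itself is never counted as "far":

    import numpy as np

    def db_outliers(X, p, D):
        n = len(X)
        flags = np.zeros(n, dtype=bool)
        for i in range(n):
            dists = np.linalg.norm(X - X[i], axis=1)
            frac_far = np.sum(dists > D) / (n - 1)  # fraction of *other* objects beyond D
            flags[i] = frac_far >= p
        return flags

    # Toy usage: one far-away point appended to a compact cloud
    X = np.vstack([np.random.randn(100, 2), [[10.0, 10.0]]])
    print(np.where(db_outliers(X, p=0.95, D=4.0))[0])  # likely flags index 100

For the density-based alternative, scikit-learn ships an LOF implementation as sklearn.neighbors.LocalOutlierFactor.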

异常值分析

  • 什么是异常值? 与大多数数据显著不相似或“偏离”的对象。
  • 应用: 欺诈检测(信用卡、电信)、异常模式检测、医学分析、安全。
  • 异常值发现方法:
    • 统计方法:
      • 假设一个潜在的数据分布模型(例如,正态分布)。
      • 使用不一致性测试来判断对象是否基于分布参数(均值、方差)和预期异常值数量而显著偏离。
      • 缺点: 大多数测试仅适用于单个属性;数据分布通常未知。
    • 基于距离的方法:
      • 如果数据集中至少 p 比例的其他对象与对象 O 的距离大于 D,则对象 O 是 DB(p, D)-异常值。
      • 允许在不假设数据分布的情况下进行多维分析。
      • 算法: 基于索引、嵌套循环、基于单元格。
    • 基于密度的方法(局部异常因子 - LOF):
      • 解决了当数据分布不均匀时(例如,密度不同的簇)基于距离方法的局限性。
      • 为每个对象定义一个局部异常因子(LOF),它衡量其相对于邻居的密度。LOF越高表示密度越低,表明是异常值。
    • 基于偏差的方法:
      • 通过检查组中对象的主要特征并找到“偏离”该描述的对象来识别异常值。
      • 序列异常技术: 模拟人类模式识别以区分异常对象。
      • OLAP数据立方体技术: 使用数据立方体识别多维数据中的异常区域。

Chapter 8: Text Data Mining

Introduction to Text Data Mining

  • Text Data: Large collections of documents (news articles, research papers, emails, web pages) that are usually semi-structured.
  • Goal: Extract knowledge and discover patterns from unstructured or semi-structured text.

文本数据挖掘简介

  • 文本数据: 大量文档集合(新闻文章、研究论文、电子邮件、网页),通常是半结构化的。
  • 目标: 从非结构化或半结构化文本中提取知识并发现模式。

Challenges in Text Data Mining (from NLP perspective)

  • Bag-of-Tokens Approach: Represents documents as a collection of words (tokens) with their frequencies. Loses all order-specific information and severely limits context (see the two-line example after this list).
  • Natural Language Processing (NLP) Difficulties:
    • Ambiguity:
      • Word-level: Lexical ambiguity (e.g., "design" as noun/verb), Word sense ambiguity (e.g., "root" having multiple meanings).
      • Syntactic: Parsing ambiguity (e.g., "A man saw a boy with a telescope" - who has the telescope?).
    • Anaphora Resolution: Determining what pronouns refer to (e.g., "himself").
    • Presupposition: Implied meanings.
    • Computational Intensity: Humans rely heavily on context, which can extend beyond a single document.
  • Shallow Linguistics (Practical Approach): Focuses on "useful sub-goals" to make NLP feasible.
    • Lexicon: Machine-understandable linguistic knowledge (senses, definitions, synonyms, antonyms).
    • Part-of-Speech (POS) Tagging: Identifying grammatical categories of words (e.g., noun, verb). Helps limit ambiguity.
    • Word Sense Disambiguation (WSD): Resolving the meaning of a word based on context.
    • Phrase Detection / Parsing: Identifying meaningful phrases or syntactic structures.
    • WordNet: An extensive lexical network organizing words into synonym sets (synsets) and showing various semantic relationships.
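
A two-line illustration of the order loss inherent in the bag-of-tokens representation mentioned at the top of this list:

    from collections import Counter

    s1 = "the dog bit the man"
    s2 = "the man bit the dog"
    # Opposite meanings, identical bags of tokens:
    print(Counter(s1.split()) == Counter(s2.split()))  # True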

文本数据挖掘的挑战(从NLP角度)

  • 词袋法: 将文档表示为词(token)及其频率的集合。丢失所有顺序特异性信息,并严重限制上下文。
  • 自然语言处理(NLP)的困难:
    • 歧义性:
      • 词级别: 词汇歧义(例如,“design”既是名词也是动词),词义歧义(例如,“root”有多种含义)。
      • 句法: 解析歧义(例如,“一个人用望远镜看到了一个男孩”——谁有望远镜?)。
    • 回指消解: 确定代词指代的对象(例如,“他自己”)。
    • 预设: 隐含的意义。
    • 计算密集度: 人类严重依赖上下文,这可能超出单个文档的范围。
  • 浅层语言学(实用方法): 专注于“有用的子目标”以使NLP可行。
    • 词典: 机器可理解的语言知识(含义、定义、同义词、反义词)。
    • 词性标注(POS Tagging): 识别词的语法类别(例如,名词、动词)。有助于限制歧义。
    • 词义消歧(WSD): 根据上下文解决词的含义。
    • 短语检测/解析: 识别有意义的短语或句法结构。
    • WordNet: 一个广泛的词汇网络,将词组织成同义词集(synsets)并显示各种语义关系。

Types of Text Data Mining Applications

  • Keyword-Based Association Analysis: Discovering frequently co-occurring keywords or terms within documents.
  • Automatic Document Classification (Text Categorization):
    • Assigning pre-given categories (with labeled examples) to new documents. A supervised learning problem.
    • Methods: Vector space model based classifiers (K-NN, Decision Trees, Neural Networks, SVM) and Probabilistic models (Naïve Bayes).
    • Evaluation: Precision (TP / (TP + FP)) and Recall (TP / (TP + FN)), summarized via confusion matrices. (A small sketch follows this list.)
  • Document Clustering: Automatically grouping related documents when no predefined categories exist (unsupervised learning). Creates a taxonomy at runtime.
  • Similarity Detection: Clustering documents by common author or source.
  • Link Analysis: Discovering unusual correlations between entities.
  • Sequence Analysis: Predicting recurring events in text data.
  • Anomaly Detection: Finding information that violates usual patterns.
  • Hypertext Analysis: Analyzing patterns in anchors, links, and text correlations with linked objects.
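
A minimal scikit-learn sketch of vector-space text categorization as described above (TF-IDF features plus a Naive Bayes probabilistic model); the four-document corpus and its labels are invented for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["stock prices rose sharply",
            "the team wins the final",
            "market falls on rate fears",
            "striker scores twice in derby"]
    labels = ["finance", "sports", "finance", "sports"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(docs, labels)                       # supervised: labeled examples
    print(clf.predict(["market prices rose"]))  # expected: ['finance']

On a held-out test set, sklearn.metrics.classification_report would report the per-class precision and recall used for evaluation.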

文本数据挖掘应用类型

  • 基于关键词的关联分析: 发现文档中频繁共同出现的关键词或术语。
  • 自动文档分类(文本分类):
    • 将预先给定的类别(带有标签示例)分配给新文档。一个监督学习问题。
    • 方法: 基于向量空间模型的分类器(K-NN、决策树、神经网络、SVM)和概率模型(朴素贝叶斯)。
    • 评估: 使用精确率和召回率、混淆矩阵。
  • 文档聚类: 当没有预定义类别时自动对相关文档进行分组(无监督学习)。在运行时创建分类。
  • 相似性检测: 按共同作者或来源对文档进行聚类。
  • 链接分析: 发现实体之间不寻常的关联。
  • 序列分析: 预测文本数据中反复发生的事件。
  • 异常检测: 发现违反常规模式的信息。
  • 超文本分析: 分析锚点、链接以及文本与链接对象之间关联的模式。

Research Problems in Text Mining

  • Developing methods for more sophisticated document matching, incorporating user profiles/preferences.
  • Building efficient indices for complex documents.
  • Improving similarity search with such indices.

文本挖掘中的研究问题

  • 开发更复杂的文档匹配方法,结合用户档案/偏好。
  • 为复杂文档构建高效索引。
  • 利用此类索引改进相似性搜索。

Chapter 9: Data Mining Trends and Research Frontiers

Mining Complex Types of Data (Research Frontiers)

The field is expanding to handle increasingly diverse and complex data:

  • Sequence Data
  • Graphs and Networks
  • Other Kinds of Data:
    • Spatial Data: Spatial frequent/co-located patterns, clustering, and classification.
    • Spatiotemporal & Moving Object Data: Trajectory mining, periodic patterns.
    • Cyber-Physical System Data: Applications in healthcare, air-traffic control, flood simulation.
    • Multimedia Data: Social media data, geo-tagged spatial clustering, periodicity analysis.
    • Text Data: Topic modeling, integration with geo- and networked data.
    • Web Data: Web content, structure, and usage mining.
    • Data Streams: Dynamics, one-pass processing, pattern mining, clustering, classification, outlier detection for continuous data flows.

复杂类型数据的挖掘(研究前沿)

该领域正在扩展以处理日益多样和复杂的数据:

  • 序列数据
  • 图和网络
  • 其他类型数据:
    • 空间数据: 空间频繁/共存模式、聚类和分类。
    • 时空与移动对象数据: 轨迹挖掘、周期性模式。
    • 信息物理系统数据: 在医疗保健、空中交通管制、洪水模拟中的应用。
    • 多媒体数据: 社交媒体数据、地理标记的空间聚类、周期性分析。
    • 文本数据: 主题建模、与地理和网络数据的集成。
    • Web数据: Web内容、结构和使用挖掘。
    • 数据流: 动态性、单次处理、模式挖掘、聚类、分类、连续数据流的异常值检测。

Other Methodologies of Data Mining

  • Statistical Data Mining: Utilizes well-established statistical techniques, especially for numeric data, in scientific and economic applications.
    • Methods: Regression (linear, multiple, polynomial, nonparametric), Generalized Linear Models (for categorical response variables, including logistic and Poisson regression), Mixed-Effect Models (for grouped data), Analysis of Variance (ANOVA), Factor Analysis, Discriminant Analysis, Survival Analysis (life span prediction, time-to-event).
  • Views on Data Mining Foundations:
    • Data Reduction: Data mining as a way to reduce data representation, trading accuracy for speed.
    • Data Compression: Data mining as a way to compress data (e.g., into rules, trees, clusters).
    • Probability and Statistical Theory: Data mining as discovering joint probability distributions of random variables.
    • Microeconomic View (Utility): Finding patterns useful for decision-making.
    • Pattern Discovery & Inductive Databases: Data mining as performing inductive logic on databases, querying both data and patterns.
  • Visual and Audio Data Mining:
    • Visual Data Mining: Uses computer graphics to discover implicit but useful knowledge from large datasets through visual images.
      • Purpose of Visualization: Gain insight, qualitative overview, search for patterns, identify interesting regions, provide visual proof.
      • Integration: Data visualization (visualizing data, mining results, mining process) and interactive visual data mining (enabling smart decisions through visual exploration).
      • Results Visualization: Scatter plots, boxplots, decision trees, association rules, clusters, outliers.
    • Audio Data Mining: Uses audio signals to represent patterns or features of data mining results, transforming patterns into sounds to identify interesting or unusual findings.

其他数据挖掘方法论

  • 统计数据挖掘: 利用成熟的统计技术,特别是针对数值数据,应用于科学和经济领域。
    • 方法: 回归(线性、多元、多项式、非参数)、广义线性模型(用于类别响应变量,包括逻辑回归和泊松回归)、混合效应模型(用于分组数据)、方差分析(ANOVA)、因子分析、判别分析、生存分析(寿命预测、事件发生时间)。
  • 数据挖掘基础的观点:
    • 数据规约: 数据挖掘作为一种减少数据表示的方法,以准确性换取速度。
    • 数据压缩: 数据挖掘作为一种压缩数据的方法(例如,压缩成规则、树、簇)。
    • 概率和统计理论: 数据挖掘作为发现随机变量的联合概率分布。
    • 微观经济学观点(效用): 寻找对决策有用的模式。
    • 模式发现和归纳数据库: 数据挖掘作为在数据库上执行归纳逻辑,查询数据和模式。
  • 视觉和音频数据挖掘:
    • 视觉数据挖掘: 使用计算机图形学从大型数据集中通过视觉图像发现隐含但有用的知识。
      • 可视化目的: 获取洞察力、定性概述、搜索模式、识别有趣区域、提供视觉证明。
      • 集成: 数据可视化(可视化数据、挖掘结果、挖掘过程)和交互式视觉数据挖掘(通过视觉探索实现智能决策)。
      • 结果可视化: 散点图、箱线图、决策树、关联规则、聚类、异常值。
    • 音频数据挖掘: 使用音频信号表示数据挖掘结果的模式或特征,将模式转换为声音以识别有趣或异常的发现。

Data Mining Applications

  • Financial Data Analysis
  • Retail and Telecommunication Industries
  • Science and Engineering
  • Intrusion Detection and Prevention
  • Recommender Systems

数据挖掘应用

  • 金融数据分析
  • 零售和电信行业
  • 科学与工程
  • 入侵检测和防御
  • 推荐系统

Data Mining and Society

  • Ubiquitous Data Mining: Data mining is pervasive in daily life (e.g., online shopping, CRM).
  • Invisible Data Mining: Data mining functions are increasingly built into operations, often without user awareness (e.g., Google search results). This is highly desirable for seamless integration.
  • Privacy, Security, and Social Impacts:
    • Many applications do not involve personal data (e.g., scientific data).
    • The primary concern is unconstrained access to individual, privacy-sensitive records.
    • Methods for Privacy Preservation:
      1. Removing Sensitive IDs: Disconnecting sensitive identifiers.
      2. Data Security-Enhancing Methods: Multi-level security models, encryption (blind signatures, biometric encryption, anonymous databases).
      3. Privacy-Preserving Data Mining (PPDM): A field dedicated to obtaining valid mining results without disclosing underlying sensitive data.
        • Often involves a trade-off between information loss and privacy.
        • Techniques:
          • Randomization (Perturbation): Adding random noise to mask individual attribute values while keeping aggregate distributions approximately intact (see the sketch after this list).
          • K-anonymity and L-diversity: k-anonymity generalizes or suppresses quasi-identifiers so each record is indistinguishable from at least k − 1 others; l-diversity additionally requires at least l well-represented sensitive values within each such group.
          • Distributed Privacy Preservation: Partitioning and distributing data horizontally, vertically, or both.
          • Downgrading Effectiveness: Modifying data or mining results (e.g., hiding association rules, distorting classification models) to prevent privacy violations.
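
A minimal sketch of the randomization (perturbation) idea: additive noise masks individual attribute values while aggregate statistics remain approximately recoverable. The attribute, sample size, and noise scale are all illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    ages = rng.integers(20, 70, size=1000).astype(float)  # sensitive attribute (illustrative)

    noise = rng.normal(loc=0.0, scale=5.0, size=ages.shape)
    published = ages + noise                              # released, perturbed values

    # Individual records are masked, but aggregates survive:
    print(ages.mean(), published.mean())  # means nearly identical
    print(ages.var(),  published.var())   # variance inflated by roughly scale**2 = 25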

数据挖掘与社会

  • 无处不在的数据挖掘: 数据挖掘在日常生活中无处不在(例如,在线购物、客户关系管理)。
  • 无形的数据挖掘: 数据挖掘功能越来越多地内置到操作中,通常用户不知情(例如,谷歌搜索结果)。这对于无缝集成非常理想。
  • 隐私、安全和社会影响:
    • 许多应用不涉及个人数据(例如,科学数据)。
    • 主要关注的是对个人、隐私敏感记录不受限制的访问。
    • 隐私保护方法:
      1. 移除敏感ID: 断开敏感标识符。
      2. 数据安全增强方法: 多级安全模型、加密(盲签名、生物识别加密、匿名数据库)。
      3. 隐私保护数据挖掘(PPDM): 一个致力于在不泄露底层敏感数据的情况下获得有效挖掘结果的领域。
        • 通常涉及信息损失和隐私之间的权衡
        • 技术:
          • 随机化(扰动): 添加噪声以掩盖属性值。
          • K-匿名和L-多样性: 修改单个记录,使其无法被唯一识别,或强制组内敏感值的多样性。
          • 分布式隐私保护: 水平、垂直或同时对数据进行分区和分发。
          • 降级有效性: 修改数据或挖掘结果(例如,隐藏关联规则、扭曲分类模型)以防止隐私侵犯。

Data Mining Trends

  • Application exploration for domain-specific problems.
  • Development of scalable and interactive data mining methods.
  • Deeper integration with Web search engines, databases, data warehouses, and cloud computing.
  • Increased focus on mining complex data types: social/information networks, spatiotemporal data, multimedia, text, Web, biological/biomedical data, cyber-physical systems.
  • Integration with software and system engineering.
  • Growth of visual and audio data mining.
  • Distributed and real-time data stream mining.
  • Continuous emphasis on privacy protection and information security.

数据挖掘趋势

  • 针对领域特定问题的应用探索。
  • 可扩展和交互式数据挖掘方法的开发。
  • 与Web搜索引擎、数据库、数据仓库和云计算的更深层次集成。
  • 越来越关注复杂数据类型的挖掘:社交/信息网络、时空数据、多媒体、文本、Web、生物/生物医学数据、信息物理系统。
  • 与软件和系统工程的集成。
  • 视觉和音频数据挖掘的发展。
  • 分布式和实时数据流挖掘。
  • 持续强调隐私保护和信息安全。