DolphinDB工业数据清洗：缺失值与异常值处理

- 摘要
- 一、数据清洗概述
- - [1.1 数据质量问题](#1.1 数据质量问题)
  - [1.2 工业数据特点](#1.2 工业数据特点)
- 二、缺失值检测
- - [2.1 检测缺失值](#2.1 检测缺失值)
  - [2.2 缺失值模式分析](#2.2 缺失值模式分析)
- 三、缺失值处理
- - [3.1 删除缺失值](#3.1 删除缺失值)
  - [3.2 填充缺失值](#3.2 填充缺失值)
  - [3.3 分组填充](#3.3 分组填充)
- 四、异常值检测
- - [4.1 统计方法检测](#4.1 统计方法检测)
  - [4.2 Z-Score检测](#4.2 Z-Score检测)
  - [4.3 分位数检测](#4.3 分位数检测)
- 五、异常值处理
- - [5.1 删除异常值](#5.1 删除异常值)
  - [5.2 替换异常值](#5.2 替换异常值)
  - [5.3 插值替换](#5.3 插值替换)
- 六、数据质量报告
- - [6.1 生成质量报告](#6.1 生成质量报告)
- 七、自动化清洗流程
- - [7.1 清洗流水线](#7.1 清洗流水线)
- 八、实战案例
- - [8.1 工业传感器数据清洗](#8.1 工业传感器数据清洗)
- 九、总结
- 参考资料

摘要

本文深入讲解DolphinDB工业数据清洗技术。从数据质量检测到缺失值处理，从异常值识别到数据修复，从清洗策略到自动化流程，全面介绍数据清洗的核心方法。通过丰富的代码示例，帮助读者掌握工业数据清洗的核心技能。

一、数据清洗概述

1.1 数据质量问题

#mermaid-svg-v295rCfRm2FV2w5O{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-v295rCfRm2FV2w5O .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-v295rCfRm2FV2w5O .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-v295rCfRm2FV2w5O .error-icon{fill:#552222;}#mermaid-svg-v295rCfRm2FV2w5O .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-v295rCfRm2FV2w5O .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-v295rCfRm2FV2w5O .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-v295rCfRm2FV2w5O .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-v295rCfRm2FV2w5O .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-v295rCfRm2FV2w5O .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-v295rCfRm2FV2w5O .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-v295rCfRm2FV2w5O .marker{fill:#333333;stroke:#333333;}#mermaid-svg-v295rCfRm2FV2w5O .marker.cross{stroke:#333333;}#mermaid-svg-v295rCfRm2FV2w5O svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-v295rCfRm2FV2w5O p{margin:0;}#mermaid-svg-v295rCfRm2FV2w5O .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-v295rCfRm2FV2w5O .cluster-label text{fill:#333;}#mermaid-svg-v295rCfRm2FV2w5O .cluster-label span{color:#333;}#mermaid-svg-v295rCfRm2FV2w5O .cluster-label span p{background-color:transparent;}#mermaid-svg-v295rCfRm2FV2w5O .label text,#mermaid-svg-v295rCfRm2FV2w5O span{fill:#333;color:#333;}#mermaid-svg-v295rCfRm2FV2w5O .node rect,#mermaid-svg-v295rCfRm2FV2w5O .node circle,#mermaid-svg-v295rCfRm2FV2w5O .node ellipse,#mermaid-svg-v295rCfRm2FV2w5O .node polygon,#mermaid-svg-v295rCfRm2FV2w5O .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-v295rCfRm2FV2w5O .rough-node .label text,#mermaid-svg-v295rCfRm2FV2w5O .node .label text,#mermaid-svg-v295rCfRm2FV2w5O .image-shape .label,#mermaid-svg-v295rCfRm2FV2w5O .icon-shape .label{text-anchor:middle;}#mermaid-svg-v295rCfRm2FV2w5O .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-v295rCfRm2FV2w5O .rough-node .label,#mermaid-svg-v295rCfRm2FV2w5O .node .label,#mermaid-svg-v295rCfRm2FV2w5O .image-shape .label,#mermaid-svg-v295rCfRm2FV2w5O .icon-shape .label{text-align:center;}#mermaid-svg-v295rCfRm2FV2w5O .node.clickable{cursor:pointer;}#mermaid-svg-v295rCfRm2FV2w5O .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-v295rCfRm2FV2w5O .arrowheadPath{fill:#333333;}#mermaid-svg-v295rCfRm2FV2w5O .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-v295rCfRm2FV2w5O .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-v295rCfRm2FV2w5O .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-v295rCfRm2FV2w5O .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-v295rCfRm2FV2w5O .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-v295rCfRm2FV2w5O .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-v295rCfRm2FV2w5O .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-v295rCfRm2FV2w5O .cluster text{fill:#333;}#mermaid-svg-v295rCfRm2FV2w5O .cluster span{color:#333;}#mermaid-svg-v295rCfRm2FV2w5O div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-v295rCfRm2FV2w5O .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-v295rCfRm2FV2w5O rect.text{fill:none;stroke-width:0;}#mermaid-svg-v295rCfRm2FV2w5O .icon-shape,#mermaid-svg-v295rCfRm2FV2w5O .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-v295rCfRm2FV2w5O .icon-shape p,#mermaid-svg-v295rCfRm2FV2w5O .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-v295rCfRm2FV2w5O .icon-shape .label rect,#mermaid-svg-v295rCfRm2FV2w5O .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-v295rCfRm2FV2w5O .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-v295rCfRm2FV2w5O .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-v295rCfRm2FV2w5O :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 数据质量问题
缺失值
数据清洗
异常值
重复值
不一致
高质量数据

1.2 工业数据特点

问题	说明
缺失值	设备故障、网络中断
异常值	传感器故障、干扰
重复值	数据重复上报
不一致	格式不统一

二、缺失值检测

2.1 检测缺失值

python 复制代码

// 创建测试数据
t = table(
    1..10 as id,
    [1, NULL, 3, NULL, 5, 6, NULL, 8, 9, 10] as value,
    [25.0, 26.0, NULL, 28.0, 29.0, NULL, 31.0, 32.0, NULL, 34.0] as temperature
)

// 检测缺失值
select id, value, temperature,
       isNull(value) as value_null,
       isNull(temperature) as temp_null
from t

// 统计缺失值
select 
    count(*) as total,
    sum(isNull(value)) as value_null_count,
    sum(isNull(temperature)) as temp_null_count,
    sum(isNull(value)) * 100.0 / count(*) as value_null_rate
from t

2.2 缺失值模式分析

python 复制代码

// 分析缺失值模式
def analyzeMissingPattern(data) {
    result = dict(STRING, ANY)
    
    for (col in data.columnNames()) {
        nullCount = sum(isNull(data[col]))
        nullRate = nullCount * 100.0 / data.rows()
        result[col] = dict(STRING, ANY, [
            ["nullCount", nullCount],
            ["nullRate", nullRate]
        ])
    }
    
    return result
}

// 使用
analyzeMissingPattern(t)

三、缺失值处理

3.1 删除缺失值

python 复制代码

// 删除包含缺失值的行
cleaned = select * from t where value is not null and temperature is not null

// 删除特定列缺失的行
cleaned = select * from t where temperature is not null

3.2 填充缺失值

python 复制代码

// 均值填充
avgTemp = avg(t.temperature)
filled = select id, value,
         iif(temperature is null, avgTemp, temperature) as temperature
         from t

// 中位数填充
medTemp = med(t.temperature)
filled = select id, value,
         iif(temperature is null, medTemp, temperature) as temperature
         from t

// 前向填充
filled = select id, value,
         ffill(temperature) as temperature
         from t

// 线性插值
filled = select id, value,
         interpolate(temperature, "linear") as temperature
         from t

3.3 分组填充

python 复制代码

// 按设备分组填充
t = table(
    take(1..3, 30) as device_id,
    1..30 as id,
    [25.0, NULL, 27.0, NULL, 29.0, ...] as temperature
)

// 按设备均值填充
filled = select id, device_id,
         iif(temperature is null, avg(temperature), temperature) as temperature
         from t
         context by device_id

四、异常值检测

4.1 统计方法检测

python 复制代码

// 创建测试数据
t = table(
    1..100 as id,
    concat([rand(20.0..30.0, 95), [100.0, -50.0, 200.0, -100.0, 150.0]]) as temperature
)

// 3σ原则检测
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)

select id, temperature,
       abs(temperature - avgTemp) > 3 * stdTemp as is_outlier_3sigma
from t

// IQR方法检测
q1 = percentile(t.temperature, 25)
q3 = percentile(t.temperature, 75)
iqr = q3 - q1
lowerBound = q1 - 1.5 * iqr
upperBound = q3 + 1.5 * iqr

select id, temperature,
       temperature < lowerBound or temperature > upperBound as is_outlier_iqr
from t

4.2 Z-Score检测

python 复制代码

// Z-Score检测
def detectOutliersZScore(data, threshold = 3) {
    meanVal = avg(data)
    stdVal = std(data)
    zScore = abs(data - meanVal) / stdVal
    return zScore > threshold
}

// 使用
select id, temperature,
       detectOutliersZScore(temperature) as is_outlier
from t

4.3 分位数检测

python 复制代码

// 分位数检测
def detectOutliersQuantile(data, lowerQ = 0.01, upperQ = 0.99) {
    lower = percentile(data, lowerQ * 100)
    upper = percentile(data, upperQ * 100)
    return data < lower or data > upper
}

// 使用
select id, temperature,
       detectOutliersQuantile(temperature) as is_outlier
from t

五、异常值处理

5.1 删除异常值

python 复制代码

// 删除异常值
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)

cleaned = select * from t
          where abs(temperature - avgTemp) <= 3 * stdTemp

5.2 替换异常值

python 复制代码

// 替换为边界值
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)
lowerBound = avgTemp - 3 * stdTemp
upperBound = avgTemp + 3 * stdTemp

cleaned = select id,
          iif(temperature < lowerBound, lowerBound,
              iif(temperature > upperBound, upperBound, temperature)) as temperature
          from t

5.3 插值替换

python 复制代码

// 用相邻值替换
cleaned = select id, temperature,
          iif(abs(temperature - avgTemp) > 3 * stdTemp,
              ffill(temperature), temperature) as temperature_fixed
          from t

六、数据质量报告

6.1 生成质量报告

python 复制代码

// 数据质量报告函数
def generateQualityReport(data) {
    report = dict(STRING, ANY)
    
    report["totalRows"] = data.rows()
    report["totalColumns"] = data.columns()
    
    // 缺失值统计
    nullStats = dict(STRING, ANY)
    for (col in data.columnNames()) {
        nullStats[col] = dict(STRING, ANY, [
            ["nullCount", sum(isNull(data[col]))],
            ["nullRate", sum(isNull(data[col])) * 100.0 / data.rows()]
        ])
    }
    report["nullStats"] = nullStats
    
    // 异常值统计（数值列）
    outlierStats = dict(STRING, ANY)
    for (col in data.columnNames()) {
        if (type(data[col]) in [INT, LONG, FLOAT, DOUBLE]) {
            avgVal = avg(data[col])
            stdVal = std(data[col])
            outlierCount = sum(abs(data[col] - avgVal) > 3 * stdVal)
            outlierStats[col] = dict(STRING, ANY, [
                ["outlierCount", outlierCount],
                ["outlierRate", outlierCount * 100.0 / data.rows()]
            ])
        }
    }
    report["outlierStats"] = outlierStats
    
    return report
}

// 使用
report = generateQualityReport(t)
print(report)

七、自动化清洗流程

7.1 清洗流水线

python 复制代码

// 数据清洗流水线
def dataCleaningPipeline(data, config) {
    result = data
    
    // 1. 缺失值处理
    if (config.handleMissing == "drop") {
        result = select * from result where not hasNull(*)
    } else if (config.handleMissing == "fill_mean") {
        for (col in config.numericColumns) {
            result[col] = iif(isNull(result[col]), avg(result[col]), result[col])
        }
    }
    
    // 2. 异常值处理
    if (config.handleOutlier == "drop") {
        for (col in config.numericColumns) {
            avgVal = avg(result[col])
            stdVal = std(result[col])
            result = select * from result 
                     where abs(result[col] - avgVal) <= 3 * stdVal
        }
    }
    
    // 3. 重复值处理
    if (config.handleDuplicate) {
        result = select distinct * from result
    }
    
    return result
}

八、实战案例

8.1 工业传感器数据清洗

python 复制代码

// ========== 工业传感器数据清洗 ==========

// 1. 创建测试数据
t = table(
    take(1..10, 1000) as device_id,
    2024.01.01T00:00:00 + 0..999 * 60000 as timestamp,
    concat([rand(20.0..30.0, 950), take(NULL, 30), rand(100.0..200.0, 20)]) as temperature,
    concat([rand(40.0..60.0, 970), take(NULL, 30)]) as humidity
)

// 2. 数据质量报告
print("=== 清洗前数据质量 ===")
print("总行数: " + string(t.rows()))
print("温度缺失: " + string(sum(isNull(t.temperature))))
print("湿度缺失: " + string(sum(isNull(t.humidity))))

// 3. 缺失值处理
cleaned = select device_id, timestamp,
         iif(temperature is null, avg(temperature), temperature) as temperature,
         iif(humidity is null, avg(humidity), humidity) as humidity
         from t

// 4. 异常值处理
avgTemp = avg(cleaned.temperature)
stdTemp = std(cleaned.temperature)

cleaned = select device_id, timestamp,
         iif(abs(temperature - avgTemp) > 3 * stdTemp, avgTemp, temperature) as temperature,
         humidity
         from cleaned

// 5. 清洗后报告
print("=== 清洗后数据质量 ===")
print("总行数: " + string(cleaned.rows()))
print("温度缺失: " + string(sum(isNull(cleaned.temperature))))

// 6. 写入分布式表
db = database("dfs://cleaned_db", VALUE, 1..10)
db.createPartitionedTable(cleaned, `cleaned_data, `device_id)
loadTable("dfs://cleaned_db", "cleaned_data").append!(cleaned)

print("数据清洗完成")

九、总结

本文详细介绍了DolphinDB工业数据清洗：

数据质量问题：缺失值、异常值、重复值
缺失值检测：检测方法、模式分析
缺失值处理：删除、填充、插值
异常值检测：3σ、IQR、Z-Score
异常值处理：删除、替换、插值
自动化流程：清洗流水线、质量报告

思考题：

如何选择合适的缺失值处理方法？
如何平衡异常值检测的准确性和误报率？
如何设计自动化数据清洗流程？

DolphinDB工业数据清洗：缺失值与异常值处理

目录

摘要

一、数据清洗概述

1.1 数据质量问题

1.2 工业数据特点

二、缺失值检测

2.1 检测缺失值

2.2 缺失值模式分析

三、缺失值处理

3.1 删除缺失值

3.2 填充缺失值

3.3 分组填充

四、异常值检测

4.1 统计方法检测

4.2 Z-Score检测

4.3 分位数检测

五、异常值处理

5.1 删除异常值

5.2 替换异常值

5.3 插值替换

六、数据质量报告

6.1 生成质量报告

七、自动化清洗流程

7.1 清洗流水线

八、实战案例

8.1 工业传感器数据清洗

九、总结

参考资料