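Table of Contents
- Abstract
- 1. Data Cleaning Overview
  - [1.1 Data Quality Problems](#1.1 Data Quality Problems)
  - [1.2 Characteristics of Industrial Data](#1.2 Characteristics of Industrial Data)
- 2. Missing Value Detection
  - [2.1 Detecting Missing Values](#2.1 Detecting Missing Values)
  - [2.2 Missing Value Pattern Analysis](#2.2 Missing Value Pattern Analysis)
- 3. Missing Value Handling
  - [3.1 Dropping Missing Values](#3.1 Dropping Missing Values)
  - [3.2 Filling Missing Values](#3.2 Filling Missing Values)
  - [3.3 Group-wise Filling](#3.3 Group-wise Filling)
- 4. Outlier Detection
  - [4.1 Statistical Detection](#4.1 Statistical Detection)
  - [4.2 Z-Score Detection](#4.2 Z-Score Detection)
  - [4.3 Quantile Detection](#4.3 Quantile Detection)
- 5. Outlier Handling
  - [5.1 Dropping Outliers](#5.1 Dropping Outliers)
  - [5.2 Replacing Outliers](#5.2 Replacing Outliers)
  - [5.3 Replacing with Neighboring Values](#5.3 Replacing with Neighboring Values)
- 6. Data Quality Report
  - [6.1 Generating a Quality Report](#6.1 Generating a Quality Report)
- 7. Automated Cleaning Workflow
  - [7.1 Cleaning Pipeline](#7.1 Cleaning Pipeline)
- 8. Case Study
  - [8.1 Cleaning Industrial Sensor Data](#8.1 Cleaning Industrial Sensor Data)
- 9. Summary
- References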
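Abstract
This article explains how to clean industrial data in DolphinDB. It covers data quality checks, missing value detection and handling, outlier detection and repair, quality reporting, and an automated cleaning pipeline, with a code example for each step.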
1. Data Cleaning Overview
1.1 Data Quality Problems
[Flowchart: data quality problems (missing values, outliers, duplicates, inconsistencies) pass through data cleaning to produce high-quality data.]
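1.2 Characteristics of Industrial Data
| Problem | Typical causes |
|---|---|
| Missing values | Equipment failure, network outage |
| Outliers | Sensor faults, interference |
| Duplicates | The same data reported more than once |
| Inconsistencies | Non-uniform formats |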
2. Missing Value Detection
2.1 Detecting Missing Values
python
// Create test data
t = table(
    1..10 as id,
    [1, NULL, 3, NULL, 5, 6, NULL, 8, 9, 10] as value,
    [25.0, 26.0, NULL, 28.0, 29.0, NULL, 31.0, 32.0, NULL, 34.0] as temperature
)
// Flag missing values row by row
select id, value, temperature,
    isNull(value) as value_null,
    isNull(temperature) as temp_null
from t
// Count missing values and compute the missing rate (%)
select
    count(*) as total,
    sum(isNull(value)) as value_null_count,
    sum(isNull(temperature)) as temp_null_count,
    sum(isNull(value)) * 100.0 / count(*) as value_null_rate
from t
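The counting logic above is easy to check by hand. The following is a minimal plain-Python sketch of the same arithmetic, not DolphinDB code; the function name `missing_stats` is hypothetical and `None` stands in for NULL.

```python
from typing import Optional, Sequence


def missing_stats(values: Sequence[Optional[float]]) -> dict:
    """Return the row count, missing count, and missing rate (%) of a column."""
    total = len(values)
    null_count = sum(v is None for v in values)
    # Guard against an empty column to avoid dividing by zero.
    null_rate = null_count * 100.0 / total if total else 0.0
    return {"total": total, "null_count": null_count, "null_rate": null_rate}


# Same layout as the `value` column of the test table, with None for NULL.
value = [1, None, 3, None, 5, 6, None, 8, 9, 10]
stats = missing_stats(value)
```

Three of the ten entries are missing, so the rate comes out at 30%.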
2.2 Missing Value Pattern Analysis
python
// Summarize the missing count and missing rate (%) of every column
def analyzeMissingPattern(data) {
    result = dict(STRING, ANY)
    for (col in data.columnNames()) {
        nullCount = sum(isNull(data[col]))
        nullRate = nullCount * 100.0 / data.rows()
        // Build the per-column entry from a key vector and a value vector
        result[col] = dict(["nullCount", "nullRate"], [nullCount, nullRate])
    }
    return result
}
// Usage
analyzeMissingPattern(t)
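3. Missing Value Handling
3.1 Dropping Missing Values
python
// Drop rows that have a missing value in any column
cleaned = select * from t where value is not null and temperature is not null
// Drop rows only where one specific column is missing
cleaned = select * from t where temperature is not null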
3.2 Filling Missing Values
python
// Fill with the mean (avg ignores NULLs)
avgTemp = avg(t.temperature)
filled = select id, value,
    nullFill(temperature, avgTemp) as temperature
from t
// Fill with the median
medTemp = med(t.temperature)
filled = select id, value,
    nullFill(temperature, medTemp) as temperature
from t
// Forward fill: carry the last valid reading forward
filled = select id, value,
    ffill(temperature) as temperature
from t
// Linear interpolation between neighboring valid readings
filled = select id, value,
    interpolate(temperature, "linear") as temperature
from t
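To make the difference between forward filling and linear interpolation concrete, here is a small plain-Python sketch of both. It is an illustration under stated assumptions rather than a description of DolphinDB internals: the helper names `forward_fill` and `linear_interpolate` are hypothetical, `None` stands in for NULL, and this sketch leaves leading or trailing gaps unfilled.

```python
from typing import List, Optional


def forward_fill(xs: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each gap with the most recent valid value before it."""
    out = []
    last = None
    for x in xs:
        if x is not None:
            last = x
        out.append(last)
    return out


def linear_interpolate(xs: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps on a straight line between the two valid neighbors."""
    out = list(xs)
    known = [i for i, x in enumerate(xs) if x is not None]
    for left, right in zip(known, known[1:]):
        step = (xs[right] - xs[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = xs[left] + step * (i - left)
    return out


# Same values as the temperature column of the test table.
temperature = [25.0, 26.0, None, 28.0, 29.0, None, 31.0, 32.0, None, 34.0]
ffilled = forward_fill(temperature)
interpolated = linear_interpolate(temperature)
```

On this column forward fill repeats 26.0, 29.0 and 32.0, while interpolation recovers 27.0, 30.0 and 33.0. For a slowly drifting signal such as temperature, interpolation usually tracks the trend better; forward fill is the safer choice when only past values may be used.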
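3.3 Group-wise Filling
python
// Test data with a device_id column
// (the temperature values are elided in the source; 30 values are needed
// to match the length of device_id and id)
t = table(
    take(1..3, 30) as device_id,
    1..30 as id,
    [25.0, NULL, 27.0, NULL, 29.0, ...] as temperature
)
// Fill each gap with the mean of its own device;
// context by evaluates avg(temperature) within each device_id group
filled = select id, device_id,
    nullFill(temperature, avg(temperature)) as temperature
from t
context by device_id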
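Group-wise filling matters when devices run at different levels, because a global mean would be wrong for every one of them. The plain-Python sketch below illustrates the idea; the function name `fill_by_group_mean` and the sample readings are hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[float]]  # (device_id, temperature)


def fill_by_group_mean(rows: List[Row]) -> List[Row]:
    """Fill each missing temperature with the mean of the same device."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for device, temp in rows:
        if temp is not None:
            sums[device] += temp
            counts[device] += 1
    means = {device: sums[device] / counts[device] for device in counts}
    # A device with no valid reading at all stays unfilled (None).
    return [(device, temp if temp is not None else means.get(device))
            for device, temp in rows]


# Hypothetical readings: device 1 runs near 21, device 2 near 52.
rows = [(1, 20.0), (2, 50.0), (1, None), (2, None), (1, 22.0), (2, 54.0)]
filled = fill_by_group_mean(rows)
```

Here the gaps become 21.0 for device 1 and 52.0 for device 2, whereas the global mean of the four valid readings (36.5) would fit neither device.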
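4. Outlier Detection
4.1 Statistical Detection
python
// Create test data: 95 readings in [20, 30) plus 5 injected outliers
// (rand(10.0, 95) draws 95 uniform doubles in [0, 10); join concatenates vectors)
t = table(
    1..100 as id,
    join(20.0 + rand(10.0, 95), [100.0, -50.0, 200.0, -100.0, 150.0]) as temperature
)
// 3σ rule: flag values more than 3 standard deviations from the mean
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)
select id, temperature,
    abs(temperature - avgTemp) > 3 * stdTemp as is_outlier_3sigma
from t
// IQR method: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1 = percentile(t.temperature, 25)
q3 = percentile(t.temperature, 75)
iqr = q3 - q1
lowerBound = q1 - 1.5 * iqr
upperBound = q3 + 1.5 * iqr
select id, temperature,
    temperature < lowerBound or temperature > upperBound as is_outlier_iqr
from t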
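The two rules can disagree on the same data, because extreme values inflate the standard deviation that the 3σ rule depends on. The plain-Python sketch below illustrates this on a deterministic stand-in for the test data (95 evenly spaced readings between 20 and 30 plus the same five injected outliers). The helper names are hypothetical, and the percentile uses linear interpolation between ranks, which is an assumption of this sketch.

```python
import statistics
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def three_sigma_outliers(values: List[float]) -> List[float]:
    """Values more than 3 sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > 3 * std]


def iqr_outliers(values: List[float]) -> List[float]:
    """Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


normal = [20.0 + 10.0 * i / 94 for i in range(95)]
data = normal + [100.0, -50.0, 200.0, -100.0, 150.0]
sigma_flagged = three_sigma_outliers(data)
iqr_flagged = iqr_outliers(data)
```

On this data the standard deviation is about 27, so the 3σ rule flags only 200.0, -100.0 and 150.0 and misses 100.0 and -50.0. The IQR bounds are computed from the middle half of the data, so they flag all five.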
4.2 Z-Score Detection
python
// Z-Score detection: flag values whose distance from the mean,
// measured in standard deviations, exceeds the threshold
def detectOutliersZScore(data, threshold = 3) {
    meanVal = avg(data)
    stdVal = std(data)
    zScore = abs(data - meanVal) / stdVal
    return zScore > threshold
}
// Usage
select id, temperature,
    detectOutliersZScore(temperature) as is_outlier
from t
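Two edge cases are worth keeping in mind with a Z-Score threshold. A constant column has zero standard deviation, and with the sample standard deviation no Z-Score can exceed (n - 1) / sqrt(n), so a threshold of 3 can never fire on 10 or fewer points. The plain-Python sketch below illustrates both; the function name `zscore_outliers` is hypothetical and the zero-deviation guard is a choice made for this sketch.

```python
import statistics
from typing import List


def zscore_outliers(values: List[float], threshold: float = 3.0) -> List[bool]:
    """Flag values whose absolute Z-Score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if std == 0:
        # Constant column: nothing deviates, so nothing is flagged.
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


# 10 points: the spike reaches only 9 / sqrt(10), about 2.85, so it is not flagged.
small = zscore_outliers([0.0] * 9 + [1000.0])
# 11 points: the spike reaches 10 / sqrt(11), about 3.02, so it is flagged.
larger = zscore_outliers([0.0] * 10 + [1000.0])
```

For short windows, a lower threshold or the IQR method is therefore a more practical choice than a fixed threshold of 3.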
4.3 Quantile Detection
python
// Quantile detection: flag values below the lowerQ quantile
// or above the upperQ quantile (1% and 99% by default)
def detectOutliersQuantile(data, lowerQ = 0.01, upperQ = 0.99) {
    lower = percentile(data, lowerQ * 100)
    upper = percentile(data, upperQ * 100)
    return data < lower or data > upper
}
// Usage
select id, temperature,
    detectOutliersQuantile(temperature) as is_outlier
from t
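Quantile bounds are relative to the data itself, so this method flags roughly the same share of points whether or not any of them are truly abnormal. The plain-Python sketch below shows this on perfectly clean, evenly spaced data; the helper names are hypothetical and the percentile again assumes linear interpolation between ranks.

```python
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def quantile_outliers(values: List[float],
                      lower_q: float = 0.01,
                      upper_q: float = 0.99) -> List[bool]:
    """Flag values strictly outside the [lower_q, upper_q] quantile range."""
    lower = percentile(values, lower_q * 100)
    upper = percentile(values, upper_q * 100)
    return [v < lower or v > upper for v in values]


# 100 clean, evenly spaced readings with no real outliers.
clean = [20.0 + 0.1 * i for i in range(100)]
flags = quantile_outliers(clean)
flagged_count = sum(flags)
```

Even on this clean series the smallest and the largest reading are flagged. Quantile detection therefore works best as a trimming rule with a known budget, not as evidence that a flagged point is faulty.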
5. Outlier Handling
5.1 Dropping Outliers
python
// Keep only rows within 3 standard deviations of the mean
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)
cleaned = select * from t
where abs(temperature - avgTemp) <= 3 * stdTemp
5.2 Replacing Outliers
python
// Clip outliers to the 3σ boundary values
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)
lowerBound = avgTemp - 3 * stdTemp
upperBound = avgTemp + 3 * stdTemp
cleaned = select id,
    iif(temperature < lowerBound, lowerBound,
        iif(temperature > upperBound, upperBound, temperature)) as temperature
from t
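Clipping to boundary values keeps every row and only pulls the extremes back inside the allowed range. The plain-Python sketch below shows the same nested comparison as a single expression; the function name `clip` and the fixed bounds of 15 and 35 are hypothetical values chosen for illustration.

```python
from typing import List


def clip(values: List[float], lower: float, upper: float) -> List[float]:
    """Pull values below `lower` up to it and values above `upper` down to it."""
    return [min(max(v, lower), upper) for v in values]


readings = [25.0, 100.0, -50.0, 30.0]
clipped = clip(readings, 15.0, 35.0)
```

The row count is unchanged, which matters for time series where dropping rows would leave gaps in the timeline.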
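5.3 Replacing with Neighboring Values
python
// Replace each outlier with the previous valid reading:
// set outliers to NULL first, then forward-fill the resulting gaps
// (calling ffill on the original column would leave the outliers in place)
avgTemp = avg(t.temperature)
stdTemp = std(t.temperature)
cleaned = select id, temperature,
    ffill(iif(abs(temperature - avgTemp) > 3 * stdTemp, NULL, temperature)) as temperature_fixed
from t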
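The order of operations is the key point here: an outlier has to be treated as a gap before a neighboring value can take its place. A plain-Python sketch of that mask-then-fill step follows; the function name `replace_with_previous` and the sample flags are hypothetical.

```python
from typing import List, Optional


def replace_with_previous(values: List[float],
                          is_outlier: List[bool]) -> List[Optional[float]]:
    """Replace each flagged value with the most recent unflagged value."""
    out = []
    last_good = None
    for value, bad in zip(values, is_outlier):
        if not bad:
            last_good = value
        # A flagged value at the very start has no predecessor and becomes None.
        out.append(last_good if bad else value)
    return out


values = [25.0, 26.0, 100.0, 27.0, -50.0, 28.0]
flags = [False, False, True, False, True, False]
fixed = replace_with_previous(values, flags)
```

The spikes 100.0 and -50.0 are replaced by 26.0 and 27.0, the readings immediately before them.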
6. Data Quality Report
6.1 Generating a Quality Report
python
// Build a data quality report for a table
def generateQualityReport(data) {
    report = dict(STRING, ANY)
    report["totalRows"] = data.rows()
    report["totalColumns"] = data.columns()
    // Missing value statistics for every column
    nullStats = dict(STRING, ANY)
    for (col in data.columnNames()) {
        nullCount = sum(isNull(data[col]))
        nullStats[col] = dict(["nullCount", "nullRate"],
            [nullCount, nullCount * 100.0 / data.rows()])
    }
    report["nullStats"] = nullStats
    // Outlier statistics (3σ rule) for numeric columns only
    outlierStats = dict(STRING, ANY)
    for (col in data.columnNames()) {
        if (type(data[col]) in [INT, LONG, FLOAT, DOUBLE]) {
            avgVal = avg(data[col])
            stdVal = std(data[col])
            outlierCount = sum(abs(data[col] - avgVal) > 3 * stdVal)
            outlierStats[col] = dict(["outlierCount", "outlierRate"],
                [outlierCount, outlierCount * 100.0 / data.rows()])
        }
    }
    report["outlierStats"] = outlierStats
    return report
}
// Usage
report = generateQualityReport(t)
print(report)
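7. Automated Cleaning Workflow
7.1 Cleaning Pipeline
python
// Data cleaning pipeline
// config is assumed to be a dictionary with the keys handleMissing,
// handleOutlier, handleDuplicate and numericColumns
def dataCleaningPipeline(data, config) {
    result = data
    // 1. Missing values
    if (config["handleMissing"] == "drop") {
        // Keep only rows that are non-NULL in every column
        for (col in result.columnNames()) {
            result = select * from result
                where isValid(result[col])
        }
    } else if (config["handleMissing"] == "fill_mean") {
        for (col in config["numericColumns"]) {
            result[col] = iif(isNull(result[col]), avg(result[col]), result[col])
        }
    }
    // 2. Outliers (3σ rule, applied column by column)
    if (config["handleOutlier"] == "drop") {
        for (col in config["numericColumns"]) {
            avgVal = avg(result[col])
            stdVal = std(result[col])
            result = select * from result
                where abs(result[col] - avgVal) <= 3 * stdVal
        }
    }
    // 3. Duplicates
    if (config["handleDuplicate"]) {
        result = select distinct * from result
    }
    return result
}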
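The pipeline is driven entirely by its configuration, and the three stages run in a fixed order: missing values, then outliers, then duplicates. The plain-Python sketch below mirrors that structure on a list of row dictionaries so the control flow can be tried without a database. It is an illustration, not a port: the function name `clean` and the sample rows are hypothetical, and the configuration keys are taken from the pipeline above.

```python
import statistics
from typing import Any, Dict, List

Row = Dict[str, Any]


def clean(rows: List[Row], config: Dict[str, Any]) -> List[Row]:
    """Run the missing-value, outlier, and duplicate stages in order."""
    result = [dict(r) for r in rows]  # work on copies, leave the input intact
    cols = config["numericColumns"]

    # 1. Missing values
    if config["handleMissing"] == "drop":
        result = [r for r in result if all(v is not None for v in r.values())]
    elif config["handleMissing"] == "fill_mean":
        for c in cols:
            present = [r[c] for r in result if r[c] is not None]
            if present:
                mean = statistics.mean(present)
                for r in result:
                    if r[c] is None:
                        r[c] = mean

    # 2. Outliers (3-sigma rule, column by column)
    if config["handleOutlier"] == "drop":
        for c in cols:
            present = [r[c] for r in result if r[c] is not None]
            if len(present) < 2:
                continue  # a standard deviation needs at least two values
            mean = statistics.mean(present)
            std = statistics.stdev(present)
            result = [r for r in result
                      if r[c] is None or abs(r[c] - mean) <= 3 * std]

    # 3. Duplicates (keep the first occurrence of each identical row)
    if config["handleDuplicate"]:
        seen = set()
        unique = []
        for r in result:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                unique.append(r)
        result = unique
    return result


# Hypothetical input: 12 identical readings, one gap, one spike.
rows = ([{"device": 1, "temp": 25.0}] * 12
        + [{"device": 1, "temp": None}, {"device": 2, "temp": 500.0}])
config = {
    "handleMissing": "drop",
    "handleOutlier": "drop",
    "handleDuplicate": True,
    "numericColumns": ["temp"],
}
cleaned = clean(rows, config)
```

Deduplication runs last on purpose in this sketch: removing the 12 identical readings first would leave too few points for the 3σ rule to flag the spike.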
8. Case Study
8.1 Cleaning Industrial Sensor Data
python
// ========== Cleaning industrial sensor data ==========
// 1. Create test data: 1000 rows, one reading per minute
// (DATETIME has second precision, so 60 is one minute;
// join concatenates vectors and 00F is a DOUBLE NULL)
t = table(
    take(1..10, 1000) as device_id,
    2024.01.01T00:00:00 + (0..999) * 60 as timestamp,
    join(join(20.0 + rand(10.0, 950), take(00F, 30)), 100.0 + rand(100.0, 20)) as temperature,
    join(40.0 + rand(20.0, 970), take(00F, 30)) as humidity
)
// 2. Quality report before cleaning
print("=== Data quality before cleaning ===")
print("Total rows: " + string(t.rows()))
print("Missing temperature: " + string(sum(isNull(t.temperature))))
print("Missing humidity: " + string(sum(isNull(t.humidity))))
// 3. Missing values: fill with the column mean
// (the temperature mean still includes the injected outliers at this point)
cleaned = select device_id, timestamp,
    nullFill(temperature, avg(temperature)) as temperature,
    nullFill(humidity, avg(humidity)) as humidity
from t
// 4. Outliers: replace values beyond 3σ with the mean
avgTemp = avg(cleaned.temperature)
stdTemp = std(cleaned.temperature)
cleaned = select device_id, timestamp,
    iif(abs(temperature - avgTemp) > 3 * stdTemp, avgTemp, temperature) as temperature,
    humidity
from cleaned
// 5. Quality report after cleaning
print("=== Data quality after cleaning ===")
print("Total rows: " + string(cleaned.rows()))
print("Missing temperature: " + string(sum(isNull(cleaned.temperature))))
// 6. Write to a distributed table partitioned by device_id
db = database("dfs://cleaned_db", VALUE, 1..10)
db.createPartitionedTable(cleaned, `cleaned_data, `device_id)
loadTable("dfs://cleaned_db", "cleaned_data").append!(cleaned)
print("Data cleaning finished")
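One design choice in this case study deserves a second look: the gaps are filled with a mean that was computed while the outliers were still in the column. The plain-Python sketch below shows how much a few spikes shift the mean compared with the median; the sample values are hypothetical.

```python
import statistics

# Hypothetical column: 95 normal readings and 2 spikes.
temperature = [25.0] * 95 + [150.0] * 2

fill_with_mean = statistics.mean(temperature)
fill_with_median = statistics.median(temperature)
```

The mean is pulled up to about 27.6 by the two spikes, while the median stays at 25.0. Handling outliers before filling, or filling with the median, keeps the spikes from leaking into the repaired values.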
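9. Summary
This article covered industrial data cleaning in DolphinDB:
- Data quality problems: missing values, outliers, duplicates
- Missing value detection: detection methods, pattern analysis
- Missing value handling: dropping, filling, interpolation
- Outlier detection: 3σ, IQR, Z-Score
- Outlier handling: dropping, replacing, interpolation
- Automation: cleaning pipeline, quality report
Questions for further thought:
- How do you choose a suitable method for handling missing values?
- How do you balance detection accuracy against the false-positive rate in outlier detection?
- How would you design an automated data cleaning workflow?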