Contents
- Abstract
- 1. Data Quality Overview
  - 1.1 Data Quality Dimensions
  - 1.2 Quality Metrics
- 2. Completeness Checks
  - 2.1 Field Completeness
  - 2.2 Record Completeness
  - 2.3 Time Completeness
- 3. Consistency Checks
  - 3.1 Data Consistency
  - 3.2 Referential Consistency
  - 3.3 Business Consistency
- 4. Data Quality Scoring
  - 4.1 Quality Scoring Function
  - 4.2 Quality Report
- 5. Automatic Repair
  - 5.1 Missing Value Repair
  - 5.2 Outlier Repair
  - 5.3 Duplicate Repair
- 6. Quality Monitoring
  - 6.1 Quality Monitoring Table
  - 6.2 Scheduled Quality Checks
- 7. Case Study
  - 7.1 Data Quality Management Platform
- 8. Summary
- References
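Abstract
This article walks through industrial data quality management with DolphinDB. It covers completeness checks, consistency validation, quality scoring, automatic repair, and ongoing quality monitoring. Each topic comes with a code example that readers can adapt to their own sensor data.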
1. Data Quality Overview
1.1 Data Quality Dimensions
[Figure: data quality dimensions. Completeness, accuracy, consistency, timeliness, and validity are each linked to high-quality data.]
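1.2 Quality Metrics
| Metric | Description |
|---|---|
| Completeness | Whether the data is complete (no missing fields, records, or time points) |
| Accuracy | Whether the values are correct |
| Consistency | Whether the data is consistent (value rules, references, business rules) |
| Timeliness | Whether the data arrives on time |
| Validity | Whether the data is valid |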
2. Completeness Checks
2.1 Field Completeness
dolphindb
// Create test data: one in three values of each sensor column is NULL
t = table(
    1..100 as id,
    take([1, NULL, 3], 100) as device_id,
    take([25.0, NULL, 27.0], 100) as temperature,
    take([50.0, 51.0, NULL], 100) as humidity
)
// Field completeness: NULL count and NULL rate (%) per column
def checkFieldCompleteness(data) {
    cols = data.columnNames()
    nullCounts = array(LONG, 0)
    // Loop over the columns explicitly so that the body
    // only uses its own parameters and local variables
    for (col in cols) {
        nullCounts.append!(sum(isNull(data[col])))
    }
    return table(cols as field_name, nullCounts as null_count,
        nullCounts * 100.0 / data.rows() as null_rate)
}
// Usage
checkFieldCompleteness(t)
2.2 Record Completeness
dolphindb
// Record completeness for a single key column: NULL keys and repeated keys
def checkRecordCompleteness(data, keyColumn) {
    keys = data[keyColumn]
    result = dict(STRING, ANY)
    // Rows whose key is NULL
    result["keyNullCount"] = sum(isNull(keys))
    // Rows that repeat a key value already seen earlier in the table
    result["duplicateCount"] = sum(isDuplicated(keys, FIRST))
    return result
}
// Usage
checkRecordCompleteness(t, `id)
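2.3 Time Completeness
dolphindb
// Time-series completeness: compare the actual and expected record counts.
// Assumes a single series with at most one record per timestamp.
// `interval` must use the same unit as the time column
// (for example, milliseconds for a TIMESTAMP column).
def checkTimeCompleteness(data, timeCol, interval) {
    // Time range covered by the data
    minTime = min(data[timeCol])
    maxTime = max(data[timeCol])
    // Expected number of records, counting both endpoints
    expectedCount = (maxTime - minTime) / interval + 1
    // Actual number of records
    actualCount = data.rows()
    // Missing records
    missingCount = expectedCount - actualCount
    result = dict(STRING, ANY)
    result["expectedCount"] = expectedCount
    result["actualCount"] = actualCount
    result["missingCount"] = missingCount
    result["completenessRate"] = actualCount * 100.0 / expectedCount
    return result
}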
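To make the arithmetic concrete, here is a minimal sketch of the same expected-versus-actual count in plain Python rather than DolphinDB script. The function name `time_completeness` and the sample timestamps are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

def time_completeness(timestamps, interval):
    """Compare the number of distinct timestamps with the number
    expected for a regular series sampled every `interval`."""
    unique = sorted(set(timestamps))
    span = unique[-1] - unique[0]
    expected = span // interval + 1      # both endpoints count
    actual = len(unique)
    return {
        "expected": expected,
        "actual": actual,
        "missing": expected - actual,
        "rate": actual * 100.0 / expected,
    }

start = datetime(2024, 1, 1)
step = timedelta(minutes=1)
# Ten one-minute slots, with the samples at minutes 3 and 7 missing
samples = [start + i * step for i in range(10) if i not in (3, 7)]
result = time_completeness(samples, step)
print(result)
```

With ten one-minute slots and two gaps, the sketch reports 10 expected records, 8 actual records, and a completeness rate of 80%. Counting distinct timestamps is a design choice here: it keeps duplicated records from hiding gaps.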
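3. Consistency Checks
3.1 Data Consistency
dolphindb
// Rule-based consistency check: each rule is a dictionary with a name
// and a condition string that valid rows must satisfy
def checkDataConsistency(data, rules) {
    results = array(STRING, 0)
    for (rule in rules) {
        // Parse the condition string and select the rows that violate it.
        // Depending on how NULL compares, rows with missing values may also be counted.
        violationCond = parseExpr("not(" + rule["condition"] + ")")
        violations = sql(select=sqlCol("*"), from=data, where=violationCond).eval()
        if (violations.rows() > 0) {
            results.append!(rule["name"] + ": " + string(violations.rows()) + " violations")
        }
    }
    return results
}
// Define the rules
rules = [
    dict(["name", "condition"], ["temperature range", "temperature >= -40 && temperature <= 100"]),
    dict(["name", "condition"], ["humidity range", "humidity >= 0 && humidity <= 100"])
]
// Usage
checkDataConsistency(t, rules)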
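The rule-driven idea, a list of named conditions applied to every row, can also be sketched in plain Python. This is an illustration, not DolphinDB code: `check_consistency`, the rule tuples, and the sample rows are hypothetical, and missing values are skipped here on the assumption that the completeness checks already report them.

```python
def check_consistency(rows, rules):
    """Return {rule name: number of rows that violate the rule}."""
    report = {}
    for name, field, predicate in rules:
        violations = sum(
            1 for row in rows
            # Missing values are left to the completeness checks
            if row.get(field) is not None and not predicate(row[field])
        )
        if violations:
            report[name] = violations
    return report

rules = [
    ("temperature range", "temperature", lambda v: -40 <= v <= 100),
    ("humidity range", "humidity", lambda v: 0 <= v <= 100),
]
rows = [
    {"temperature": 25.0, "humidity": 50.0},
    {"temperature": 150.0, "humidity": 51.0},   # temperature out of range
    {"temperature": None, "humidity": 120.0},   # humidity out of range
]
report = check_consistency(rows, rules)
print(report)
```

Keeping the rules as data rather than hard-coding them means a new range or cross-field check only needs a new list entry.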
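3.2 Referential Consistency
dolphindb
// Referential consistency: every value in data[dataCol] must exist in refTable[refCol].
// Both tables are assumed to be in-memory tables.
def checkReferenceConsistency(data, refTable, dataCol, refCol) {
    // Flag rows whose value does not appear in the reference column
    isInvalid = not(data[dataCol] in refTable[refCol])
    invalidRefs = data[at(isInvalid)]
    result = dict(STRING, ANY)
    result["invalidCount"] = invalidRefs.rows()
    result["invalidRecords"] = invalidRefs
    return result
}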
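3.3 Business Consistency
dolphindb
// Business consistency check
def checkBusinessConsistency(data) {
    // Example rule: humidity is expected to fall as temperature rises, so a
    // high temperature (> 30) together with a high humidity (> 70) is flagged
    inconsistent = select * from data
        where temperature > 30 and humidity > 70
    return inconsistent
}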
4. Data Quality Scoring
4.1 Quality Scoring Function
dolphindb
// Data quality score on a 0-100 scale
def calculateQualityScore(data) {
    scores = dict(STRING, DOUBLE)
    nullRates = array(DOUBLE, 0)
    outlierRates = array(DOUBLE, 0)
    for (col in data.columnNames()) {
        colData = data[col]
        // Completeness: share of NULL values in each column
        nullRates.append!(sum(isNull(colData)) * 100.0 / data.rows())
        // Accuracy: share of values more than 3 standard deviations from the mean
        if (type(colData) in [INT, LONG, FLOAT, DOUBLE]) {
            avgVal = avg(colData)
            stdVal = std(colData)
            outlierRates.append!(sum(abs(colData - avgVal) > 3 * stdVal) * 100.0 / data.rows())
        }
    }
    scores["completeness"] = 100 - avg(nullRates)
    scores["accuracy"] = 100 - avg(outlierRates)
    // Total score: unweighted mean of the two
    scores["total"] = (scores["completeness"] + scores["accuracy"]) / 2
    return scores
}
// Usage
calculateQualityScore(t)
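A small worked example helps to check the scoring arithmetic by hand. The sketch below restates the formula in plain Python, not DolphinDB. The name `quality_score` and the two sample columns are assumptions for illustration, and the sketch uses the sample standard deviation (`statistics.stdev`).

```python
from statistics import mean, stdev

def quality_score(columns):
    """columns maps a column name to a list of numbers, with None for missing."""
    null_rates, outlier_rates = [], []
    for values in columns.values():
        n = len(values)
        present = [v for v in values if v is not None]
        null_rates.append((n - len(present)) * 100.0 / n)
        if len(present) >= 2:
            mu, sigma = mean(present), stdev(present)
            outliers = sum(1 for v in present if abs(v - mu) > 3 * sigma)
            outlier_rates.append(outliers * 100.0 / n)
    completeness = 100 - mean(null_rates)
    accuracy = 100 - mean(outlier_rates) if outlier_rates else 100.0
    return {
        "completeness": completeness,
        "accuracy": accuracy,
        "total": (completeness + accuracy) / 2,
    }

sample = {
    "temperature": [25.0] * 19 + [500.0],       # one extreme reading
    "humidity": [50.0] * 18 + [None, None],     # two missing readings
}
scores = quality_score(sample)
print(scores)
```

Here the NULL rates are 0% and 10%, so completeness is 95. The outlier rates are 5% and 0%, so accuracy is 97.5 and the total is 96.25. Because both scores are unweighted column averages, a single badly affected column is diluted as the number of columns grows.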
4.2 Quality Report
dolphindb
// Generate a quality report
def generateQualityReport(data) {
    report = dict(STRING, ANY)
    // Basic information
    report["totalRows"] = data.rows()
    report["totalColumns"] = data.columns()
    // Completeness per field
    report["completeness"] = checkFieldCompleteness(data)
    // Quality scores
    report["scores"] = calculateQualityScore(data)
    return report
}
// Usage
report = generateQualityReport(t)
print(report)
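5. Automatic Repair
5.1 Missing Value Repair
dolphindb
// Fill missing values in numeric columns with the column mean, median, or zero.
// Note: this also touches numeric identifier columns such as device_id,
// where a mean is not meaningful; restrict the columns in practice.
def autoFixMissingValues(data, strategy = "mean") {
    result = data
    for (col in data.columnNames()) {
        if (type(data[col]) in [INT, LONG, FLOAT, DOUBLE]) {
            if (strategy == "mean") {
                result[col] = iif(isNull(data[col]), avg(data[col]), data[col])
            } else if (strategy == "median") {
                result[col] = iif(isNull(data[col]), med(data[col]), data[col])
            } else if (strategy == "zero") {
                result[col] = iif(isNull(data[col]), 0, data[col])
            }
        }
    }
    return result
}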
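5.2 Outlier Repair
dolphindb
// Repair outliers with the 3-sigma rule: clip them to the bounds or remove the rows.
// Run this after missing values are filled, because NULL handling in
// comparisons can otherwise affect the result.
def autoFixOutliers(data, method = "clip") {
    result = data
    for (col in data.columnNames()) {
        if (type(data[col]) in [INT, LONG, FLOAT, DOUBLE]) {
            // Bounds are computed from the original data
            avgVal = avg(data[col])
            stdVal = std(data[col])
            lower = avgVal - 3 * stdVal
            upper = avgVal + 3 * stdVal
            // Use the current rows of result, which may already have been filtered
            colData = result[col]
            if (method == "clip") {
                result[col] = iif(colData < lower, lower,
                    iif(colData > upper, upper, colData))
            } else if (method == "remove") {
                result = result[at((colData >= lower) && (colData <= upper))]
            }
        }
    }
    return result
}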
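The 3-sigma clipping step can be illustrated in plain Python (a sketch, not DolphinDB code). The helper name `clip_outliers`, the parameter `k`, and the sample readings are hypothetical.

```python
from statistics import mean, stdev

def clip_outliers(values, k=3.0):
    """Clip values to [mean - k*std, mean + k*std]; None entries pass through."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return list(values)
    mu, sigma = mean(present), stdev(present)
    lower, upper = mu - k * sigma, mu + k * sigma
    return [v if v is None else min(max(v, lower), upper) for v in values]

readings = [25.0] * 19 + [500.0]
clipped = clip_outliers(readings)
print(clipped[-1])
```

One limitation shows up in this example: the extreme reading inflates the standard deviation, so the clipped value lands well above the normal readings of 25.0 even though it is below 500.0. Bounds based on the median and interquartile range are a common alternative when outliers are large.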
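5.3 Duplicate Repair
dolphindb
// Remove duplicates: keep the first record for each value of the key column
def autoFixDuplicates(data, keyColumn) {
    isRepeat = isDuplicated(data[keyColumn], FIRST)
    return data[at(not(isRepeat))]
}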
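Deduplicating on the key columns, rather than on whole rows, can be sketched in plain Python as follows. The function `drop_duplicates_by_key` and the sample records are assumptions for illustration. The sketch keeps the first record seen for each key, which is only one possible policy.

```python
def drop_duplicates_by_key(records, key_fields):
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    kept = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

records = [
    {"device_id": 1, "ts": "2024-01-01T00:00", "temperature": 25.0},
    {"device_id": 1, "ts": "2024-01-01T00:00", "temperature": 25.4},  # same key, new value
    {"device_id": 2, "ts": "2024-01-01T00:00", "temperature": 26.1},
]
deduped = drop_duplicates_by_key(records, ["device_id", "ts"])
print(len(deduped))
```

Two records with the same key but different readings are not exact duplicates, so a whole-row comparison would keep both. Matching on the key makes the conflict explicit and forces a choice of which record wins.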
6. Quality Monitoring
6.1 Quality Monitoring Table
dolphindb
// Create a shared in-memory table that stores quality check results
share table(1:0,
    `check_time`table_name`check_type`score`details,
    [TIMESTAMP, STRING, STRING, DOUBLE, STRING]) as quality_log
// Record one quality check result
def logQualityCheck(tableName, checkType, score, details) {
    insert into quality_log values (now(), tableName, checkType, score, details)
}
6.2 Scheduled Quality Checks
dolphindb
// Scheduled quality check task
def scheduledQualityCheck() {
    // Load the data into memory; for large tables, add a where clause on the time column
    data = select * from loadTable("dfs://iot_db", "sensor_data")
    // Compute the scores once, then log each dimension
    scores = calculateQualityScore(data)
    logQualityCheck("sensor_data", "completeness", scores["completeness"], "")
    logQualityCheck("sensor_data", "accuracy", scores["accuracy"], "")
}
// Run daily at 00:00 from 2024.01.01 to 2030.12.31
scheduleJob("quality_check", "Data quality check", scheduledQualityCheck,
    00:00m, 2024.01.01, 2030.12.31, 'D')
7. Case Study
7.1 Data Quality Management Platform
dolphindb
// ========== Data quality management platform ==========
// 1. Define the quality check pipeline
def qualityCheckPipeline(data, tableName) {
    print("=== Data quality check: " + tableName + " ===")
    // Completeness check
    completeness = checkFieldCompleteness(data)
    print("Field completeness:")
    print(completeness)
    // Consistency check
    rules = [
        dict(["name", "condition"], ["temperature range", "temperature >= -40 && temperature <= 100"]),
        dict(["name", "condition"], ["humidity range", "humidity >= 0 && humidity <= 100"])
    ]
    consistency = checkDataConsistency(data, rules)
    print("Consistency check:")
    print(consistency)
    // Quality scores
    scores = calculateQualityScore(data)
    print("Quality scores:")
    print(scores)
    // Log the result
    logQualityCheck(tableName, "total", scores["total"], "")
    return scores
}
// 2. Create test data: 1000 rows at one-minute intervals.
// temperature: 950 normal readings in [20, 30), 30 missing values, 20 outliers in [100, 200)
// humidity: 970 normal readings in [40, 60), 30 missing values
t = table(
    1..1000 as id,
    take(1..10, 1000) as device_id,
    2024.01.01T00:00:00.000 + (0..999) * 60000 as timestamp,
    join(join(rand(10.0, 950) + 20.0, take(00F, 30)), rand(100.0, 20) + 100.0) as temperature,
    join(rand(20.0, 970) + 40.0, take(00F, 30)) as humidity
)
// 3. Run the quality check
qualityCheckPipeline(t, "sensor_data")
// 4. Repair automatically: fill missing values first, then clip outliers
fixed = autoFixMissingValues(t, "mean")
fixed = autoFixOutliers(fixed, "clip")
// 5. Check again
qualityCheckPipeline(fixed, "sensor_data_fixed")
print("Data quality management finished")
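8. Summary
This article covered industrial data quality management with DolphinDB:
- Completeness checks: field completeness, record completeness, time completeness
- Consistency checks: data consistency, referential consistency, business consistency
- Quality scoring: scoring function, quality report
- Automatic repair: missing values, outliers, duplicates
- Quality monitoring: monitoring table, scheduled checks
Questions for reflection:
- How would you design a system of data quality metrics?
- How much of the data repair process should be automated?
- How can data quality be improved continuously?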