Contents
- Abstract
- 1. Data Preprocessing Overview
  - 1.1 Preprocessing Requirements
  - 1.2 Preprocessing Scenarios
- 2. Time Alignment
  - 2.1 Timestamp Alignment
  - 2.2 Multi-Source Data Alignment
- 3. Resampling
  - 3.1 Downsampling
  - 3.2 Upsampling
  - 3.3 OHLC Resampling
- 4. Interpolation
  - 4.1 Linear Interpolation
  - 4.2 Forward Fill
  - 4.3 Backward Fill
  - 4.4 Spline Interpolation
- 5. Data Aggregation
  - 5.1 Time-Window Aggregation
  - 5.2 Sliding-Window Aggregation
  - 5.3 Grouped Aggregation
- 6. Batch Preprocessing
  - 6.1 Preprocessing Pipeline
- 7. Case Study
  - 7.1 Multi-Source Data Preprocessing System
- 8. Summary
- References
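Abstract
This article explains industrial data preprocessing in DolphinDB. It covers time alignment, resampling, interpolation, and data aggregation, then combines them for multi-source alignment and batch processing. Each technique is illustrated with a code example so that readers can apply it to their own sensor data.
1. Data Preprocessing Overview
1.1 Preprocessing Requirements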
[Figure: flowchart "Data preprocessing" with the nodes Raw data, Time alignment, Resampling, Interpolation and filling, Standardized data]
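1.2 Preprocessing Scenarios
| Scenario | Description |
|---|---|
| Time alignment | Synchronize timestamps across multiple data sources |
| Resampling | Bring all series to a common sampling frequency |
| Interpolation and filling | Fill in missing data |
| Data aggregation | Reduce data granularity |
2. Time Alignment
2.1 Timestamp Alignment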
python
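// Create two sources sampled at different rates.
// The .000 suffix makes the literal a millisecond TIMESTAMP, so the offsets are in ms.
t1 = table(
    2024.01.01T00:00:00.000 + (0..99) * 60000 as timestamp,    // one row per minute
    20.0 + rand(10.0, 100) as temperature                       // uniform in [20, 30)
)
t2 = table(
    2024.01.01T00:00:00.000 + (0..19) * 300000 as timestamp,   // one row per 5 minutes
    40.0 + rand(20.0, 20) as humidity                           // uniform in [40, 60)
)
// Align on 5-minute windows: average the temperature per window,
// then left join the humidity reading taken at the window start
temp5m = select avg(temperature) as avg_temp
         from t1
         group by bar(timestamp, 5m) as window_5m
hum5m = select timestamp as window_5m, humidity from t2
aligned = lj(temp5m, hum5m, `window_5m)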
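To make the windowing rule concrete, here is a minimal plain-Python sketch (not DolphinDB script) of what flooring timestamps to 5-minute windows and joining on the window start amounts to. It assumes integer millisecond timestamps; `bar_floor` is a hypothetical helper that imitates a left-closed `bar` window, and the sample values are made up for illustration.

```python
def bar_floor(ts_ms, interval_ms):
    """Floor a millisecond timestamp to the start of its window."""
    return ts_ms - ts_ms % interval_ms

FIVE_MIN = 5 * 60 * 1000

# Minute-level readings: (timestamp in ms, temperature)
temps = [(m * 60_000, 20.0 + m) for m in range(10)]
# 5-minute readings keyed by timestamp: humidity
humid = {0: 45.0, FIVE_MIN: 50.0}

# Collect the temperature readings of each 5-minute window
buckets = {}
for ts, value in temps:
    buckets.setdefault(bar_floor(ts, FIVE_MIN), []).append(value)

# Average per window, then left join the humidity on the window start
aligned = [
    (start, sum(vals) / len(vals), humid.get(start))
    for start, vals in sorted(buckets.items())
]
```

Each output row holds the window start, the average temperature of that window, and the humidity reading (or `None` when no reading matches).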
2.2 Multi-Source Data Alignment
python
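// Align several sources onto a common time grid.
// Assumes every table in sources has a TIMESTAMP column `timestamp` and a numeric column `value`.
def alignMultiSource(sources, interval) {
    n = size(sources)
    // Collect the window start of every row across all sources
    allWindows = array(TIMESTAMP, 0)
    for (i in 0..(n - 1)) {
        source = sources[i]
        allWindows.append!(bar(source.timestamp, interval))
    }
    result = table(sort(distinct(allWindows)) as timestamp)
    // Average each source per window and left join it onto the grid
    for (i in 0..(n - 1)) {
        source = sources[i]
        agg = select avg(value) as value
              from source
              group by bar(timestamp, interval) as timestamp
        agg.rename!(`value, "value" + string(i))   // keep column names unique
        result = lj(result, agg, `timestamp)
    }
    return result
}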
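The same idea can be sketched in plain Python (again an illustration, not DolphinDB script): build the union of all window starts, average each source per window, and leave a gap where a source has no data. The function name `align_sources` and the sample values are assumptions made for this example.

```python
def align_sources(sources, interval_ms):
    """sources: dict of name -> list of (ts_ms, value).

    Returns one dict per window start, sorted by time, with the
    per-window mean of every source (None where a source has no rows).
    """
    per_source = {}
    windows = set()
    for name, rows in sources.items():
        sums = {}
        for ts, value in rows:
            start = ts - ts % interval_ms
            total, count = sums.get(start, (0.0, 0))
            sums[start] = (total + value, count + 1)
        per_source[name] = {s: total / count for s, (total, count) in sums.items()}
        windows.update(sums)
    return [
        {"timestamp": start,
         **{name: means.get(start) for name, means in per_source.items()}}
        for start in sorted(windows)
    ]

rows = align_sources(
    {
        "temp": [(0, 20.0), (60_000, 22.0), (300_000, 24.0)],
        "press": [(0, 1000.0), (600_000, 1010.0)],
    },
    interval_ms=300_000,
)
```

The `None` entries mark exactly the cells that a later interpolation or fill step has to handle.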
3. Resampling
3.1 Downsampling
python
// Create high-frequency data: one reading every 6 seconds
t = table(
    2024.01.01T00:00:00.000 + (0..999) * 6000 as timestamp,
    20.0 + rand(10.0, 1000) as temperature    // uniform in [20, 30)
)
// Downsample: 6 seconds -> 1 minute
select first(temperature) as open,
       max(temperature) as high,
       min(temperature) as low,
       last(temperature) as close,
       avg(temperature) as avg_temp,
       count(*) as cnt
from t
group by bar(timestamp, 1m) as minute
// Downsample: 6 seconds -> 5 minutes
select first(temperature) as open,
       max(temperature) as high,
       min(temperature) as low,
       last(temperature) as close
from t
group by bar(timestamp, 5m) as minute_5
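As a cross-check of what the downsampling queries compute, the following plain-Python sketch (not DolphinDB script) builds the same open/high/low/close summary in a single pass. It assumes the rows arrive in time order; the function name `ohlc` and the sample readings are made up for illustration.

```python
def ohlc(rows, interval_ms):
    """rows: time-ordered (ts_ms, value). Returns {window_start: summary}."""
    out = {}
    for ts, v in rows:
        start = ts - ts % interval_ms
        bar = out.get(start)
        if bar is None:
            # First sample of the window sets every field
            out[start] = {"open": v, "high": v, "low": v, "close": v, "cnt": 1}
        else:
            bar["high"] = max(bar["high"], v)
            bar["low"] = min(bar["low"], v)
            bar["close"] = v          # last sample seen so far
            bar["cnt"] += 1
    return out

# Twelve readings taken 6 seconds apart
readings = [21, 25, 19, 23, 22, 24, 20, 26, 18, 22, 30, 28]
samples = [(i * 6_000, float(v)) for i, v in enumerate(readings)]
bars = ohlc(samples, 60_000)
```

The first ten readings fall into the window starting at 0 ms and the last two into the window starting at 60000 ms.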
3.2 Upsampling
python
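// Create low-frequency data: one reading per hour
t = table(
    2024.01.01T00:00:00.000 + (0..23) * 3600000 as timestamp,
    20.0 + rand(10.0, 24) as temperature
)
// Upsample: hour -> minute (requires interpolation)
// Build the target per-minute time grid
target = table(2024.01.01T00:00:00.000 + (0..1439) * 60000 as timestamp)
// Left join the hourly readings onto the grid; minutes without a reading become NULL
joined = lj(target, t, `timestamp)
// Fill the NULLs by linear interpolation
resampled = select timestamp,
                   interpolate(temperature, "linear") as temperature
            from joined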
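Upsampling only produces new values if the low-frequency readings are first placed on the finer time grid and the gaps are then interpolated. The plain-Python sketch below (not DolphinDB script) shows that two-step idea on a small scale. It assumes the known points lie on the target grid; `upsample_linear` and the sample values are hypothetical.

```python
def upsample_linear(points, step):
    """points: time-ordered (t, value) with integer t on the target grid.

    Returns one (t, value) pair for every step between the first and the
    last known point, interpolating linearly between neighbours.
    """
    out = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        t = t0
        while t < t1:
            out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
            t += step
    out.append(points[-1])
    return out

# Hourly readings, time given in minutes since midnight
hourly = [(0, 20.0), (60, 26.0), (120, 23.0)]
per_minute = upsample_linear(hourly, 1)
```

Here `per_minute` contains 121 rows, and the value half-way between two hourly readings is the mean of the two.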
3.3 OHLC Resampling
python
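// OHLC (open, high, low, close) resampling.
// timeCol and valueCol are column names passed as strings.
def ohlcResample(data, timeCol, valueCol, interval) {
    // Look the columns up by name and build a two-column working table
    tmp = table(bar(data[timeCol], interval) as window_start, data[valueCol] as val)
    return select first(val) as open,
                  max(val) as high,
                  min(val) as low,
                  last(val) as close,
                  count(*) as volume     // number of samples in the window
           from tmp
           group by window_start
}
// Usage: t should be a high-frequency table, such as the 6-second table from section 3.1
ohlc = ohlcResample(t, `timestamp, `temperature, 5m)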
4. Interpolation
4.1 Linear Interpolation
python
// Create data with missing values
t = table(
    2024.01.01T00:00:00.000 + [0, 1, 3, 5, 8, 10] * 60000 as timestamp,
    [25.0, 26.0, NULL, 27.0, NULL, 28.0] as temperature
)
// Linear interpolation
select timestamp, temperature,
       interpolate(temperature, "linear") as temp_linear
from t
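The timestamps in this example are unevenly spaced, so it matters whether a linear fill works on row positions or on the actual time axis. The plain-Python sketch below (not DolphinDB script) computes both variants for the same six readings. Which variant DolphinDB's `interpolate` applies depends on its parameters and version, so treat this only as an illustration of the distinction; `interp_linear` is a hypothetical helper.

```python
def interp_linear(xs, ys):
    """Fill None entries of ys by linear interpolation over coordinates xs.

    Leading and trailing None values are left untouched.
    """
    known = [i for i, y in enumerate(ys) if y is not None]
    out = list(ys)
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            frac = (xs[i] - xs[a]) / (xs[b] - xs[a])
            out[i] = ys[a] + (ys[b] - ys[a]) * frac
    return out

minutes = [0, 1, 3, 5, 8, 10]
temps = [25.0, 26.0, None, 27.0, None, 28.0]

# Rows treated as equally spaced
by_position = interp_linear(list(range(len(temps))), temps)
# Rows weighted by their real distance in time
by_time = interp_linear(minutes, temps)
```

The gap at minute 3 gets the same value either way because it sits exactly half-way between its neighbours; the gap at minute 8 does not, so the two variants differ there (27.5 by position, about 27.6 by time).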
4.2 Forward Fill
python
// Forward fill: replace each NULL with the last non-NULL value before it
select timestamp, temperature,
       ffill(temperature) as temp_ffill
from t
4.3 Backward Fill
python
// Backward fill: replace each NULL with the next non-NULL value after it
select timestamp, temperature,
       bfill(temperature) as temp_bfill
from t
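Forward and backward fill differ only in the direction from which the replacement value is taken, which also decides which end of the series can stay empty. A minimal plain-Python sketch (not DolphinDB script; the function names simply echo the DolphinDB ones):

```python
def ffill(values):
    """Replace each None with the most recent non-None value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def bfill(values):
    """Replace each None with the next non-None value after it."""
    return ffill(values[::-1])[::-1]

temps = [None, 25.0, None, None, 27.0, None]
forward = ffill(temps)
backward = bfill(temps)
```

A forward fill cannot fill a gap at the start of the series, and a backward fill cannot fill one at the end.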
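4.4 Spline Interpolation
python
// Spline interpolation
// Check the signature of spline in the function reference for your DolphinDB version;
// it may require more arguments than the single column shown here
select timestamp, temperature,
       spline(temperature) as temp_spline
from t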
5. Data Aggregation
5.1 Time-Window Aggregation
python
// Time-window aggregation
// One row every 6 seconds, so each 1-minute window holds 10 samples
// (with a single sample per window the standard deviation would be NULL)
t = table(
    2024.01.01T00:00:00.000 + (0..999) * 6000 as timestamp,
    20.0 + rand(10.0, 1000) as temperature,
    40.0 + rand(20.0, 1000) as humidity
)
// Aggregate over 1-minute windows
select avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       std(temperature) as std_temp,
       avg(humidity) as avg_humidity
from t
group by bar(timestamp, 1m) as minute
5.2 Sliding-Window Aggregation
python
// Sliding-window aggregation over the last 10 rows
select timestamp, temperature,
       mavg(temperature, 10) as moving_avg_10,
       mstd(temperature, 10) as moving_std_10,
       mmax(temperature, 10) as moving_max_10,
       mmin(temperature, 10) as moving_min_10
from t
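Unlike a time-window aggregate, a sliding-window aggregate returns one value per input row. The plain-Python sketch below (not DolphinDB script) shows a moving average over the last `window` rows, assuming that no value is emitted until the window is full; whether `mavg` behaves this way depends on its minimum-periods setting, so that part is an assumption.

```python
def mavg(values, window):
    """Moving average over the last `window` values.

    Returns None for the first window - 1 positions, where the
    window is not yet full.
    """
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out

smoothed = mavg([1.0, 2.0, 3.0, 4.0, 5.0], 3)
```

The running sum keeps the cost at one addition and one subtraction per row, regardless of the window length.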
5.3 Grouped Aggregation
python
// Grouped aggregation
t = table(
    take(1..10, 1000) as device_id,
    2024.01.01T00:00:00.000 + (0..999) * 60000 as timestamp,
    20.0 + rand(10.0, 1000) as temperature
)
// Aggregate per device and per hour (the hour unit of a duration is an uppercase H)
select avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from t
group by device_id, bar(timestamp, 1H) as hour
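Grouping by device and by time window means the group key is a pair. The plain-Python sketch below (not DolphinDB script) makes that explicit; `group_stats` and the sample rows are made up for illustration.

```python
from collections import defaultdict

HOUR_MS = 3_600_000

def group_stats(rows):
    """rows: (device_id, ts_ms, value).

    Returns {(device_id, hour_start): (avg, max, min)}.
    """
    groups = defaultdict(list)
    for device_id, ts, value in rows:
        groups[(device_id, ts - ts % HOUR_MS)].append(value)
    return {key: (sum(v) / len(v), max(v), min(v)) for key, v in groups.items()}

rows = [
    (1, 0, 20.0), (2, 60_000, 30.0), (1, 120_000, 24.0),
    (2, 3_600_000, 28.0), (1, 3_660_000, 22.0),
]
stats = group_stats(rows)
```

Readings from different devices never mix, even when they fall into the same hour.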
6. Batch Preprocessing
6.1 Preprocessing Pipeline
python
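// Preprocessing pipeline
// config is a dictionary with the keys alignTime, interval, resample and interpolate.
// Steps 2 and 3 read the aligned_time column created in step 1, so alignTime
// must be enabled whenever resample or interpolate is enabled.
def preprocessingPipeline(data, config) {
    result = data
    // 1. Time alignment: tag each row with the start of its window
    if (config[`alignTime]) {
        result = select *, bar(timestamp, config[`interval]) as aligned_time
                 from result
    }
    // 2. Resampling: one averaged row per window
    if (config[`resample]) {
        result = select avg(temperature) as temperature,
                        avg(humidity) as humidity
                 from result
                 group by aligned_time
    }
    // 3. Interpolation: fill the remaining NULLs
    if (config[`interpolate]) {
        result = select aligned_time,
                        interpolate(temperature, "linear") as temperature,
                        interpolate(humidity, "linear") as humidity
                 from result
    }
    return result
}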
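The pipeline pattern itself, a sequence of optional steps switched on by a configuration, can be sketched in plain Python (not DolphinDB script). The step functions, the configuration keys `resample`, `interval_ms` and `interpolate`, and the sample data are all assumptions made for this illustration.

```python
def resample_mean(rows, interval_ms):
    """Average the non-missing values of each window; None if a window has none."""
    buckets = {}
    for ts, v in rows:
        buckets.setdefault(ts - ts % interval_ms, []).append(v)
    out = []
    for start in sorted(buckets):
        present = [v for v in buckets[start] if v is not None]
        out.append((start, sum(present) / len(present) if present else None))
    return out

def fill_linear(rows):
    """Fill interior None values by position-based linear interpolation."""
    values = [v for _, v in rows]
    known = [i for i, v in enumerate(values) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            values[i] = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
    return [(ts, v) for (ts, _), v in zip(rows, values)]

def run_pipeline(rows, config):
    """Apply the enabled steps in a fixed order."""
    result = rows
    if config.get("resample"):
        result = resample_mean(result, config["interval_ms"])
    if config.get("interpolate"):
        result = fill_linear(result)
    return result

raw = [(0, 20.0), (30_000, 22.0), (60_000, None), (120_000, 25.0), (150_000, 27.0)]
clean = run_pipeline(raw, {"resample": True, "interval_ms": 60_000, "interpolate": True})
gappy = run_pipeline(raw, {"resample": True, "interval_ms": 60_000})
```

Keeping each step a separate function makes the order explicit: resampling first reduces the data, and interpolation then only has to fill the windows that are still empty.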
7. Case Study
7.1 Multi-Source Data Preprocessing System
python
// ========== Multi-source data preprocessing system ==========
// 1. Create the source data
// Temperature (one reading per minute)
tempData = table(
    2024.01.01T00:00:00.000 + (0..1439) * 60000 as timestamp,
    20.0 + rand(10.0, 1440) as temperature
)
// Humidity (one reading every 5 minutes)
humidData = table(
    2024.01.01T00:00:00.000 + (0..287) * 300000 as timestamp,
    40.0 + rand(20.0, 288) as humidity
)
// Pressure (one reading every 10 minutes)
pressData = table(
    2024.01.01T00:00:00.000 + (0..143) * 600000 as timestamp,
    1000.0 + rand(20.0, 144) as pressure
)
// 2. Time alignment: bring every source onto 5-minute windows
temp5m = select avg(temperature) as temperature
         from tempData
         group by bar(timestamp, 5m) as window_5m
humid5m = select last(humidity) as humidity
          from humidData
          group by bar(timestamp, 5m) as window_5m
press5m = select avg(pressure) as pressure
          from pressData
          group by bar(timestamp, 5m) as window_5m
aligned = lj(lj(temp5m, humid5m, `window_5m), press5m, `window_5m)
// 3. Interpolation: pressure arrives every 10 minutes,
//    so every other 5-minute window is NULL after the join
preprocessed = select window_5m,
                      temperature,
                      humidity,
                      interpolate(pressure, "linear") as pressure
               from aligned
// 4. Write to a distributed table (created only if the database does not exist yet)
dbPath = "dfs://preprocessed_db"
if (!existsDatabase(dbPath)) {
    db = database(dbPath, VALUE, 2024.01.01..2024.12.31)
    db.createPartitionedTable(preprocessed, `preprocessed_data, `window_5m)
}
loadTable(dbPath, "preprocessed_data").append!(preprocessed)
// 5. Verify
select top 20 * from preprocessed
print("Multi-source data preprocessing complete")
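8. Summary
This article covered industrial data preprocessing in DolphinDB:
- Time alignment: timestamp alignment, multi-source alignment
- Resampling: downsampling, upsampling, OHLC resampling
- Interpolation: linear interpolation, forward fill, backward fill, spline interpolation
- Data aggregation: time-window, sliding-window, and grouped aggregation
- Batch processing: the preprocessing pipeline
Questions for further thought:
- How do you choose a suitable resampling interval?
- What are the strengths and weaknesses of the different interpolation methods?
- How do you handle time alignment across multiple data sources?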