DolphinDB Industrial Data Preprocessing: Resampling and Alignment

- Abstract
- 1. Overview of Data Preprocessing
  - 1.1 Preprocessing Requirements
  - 1.2 Preprocessing Scenarios
- 2. Time Alignment
  - 2.1 Timestamp Alignment
  - 2.2 Multi-Source Data Alignment
- 3. Resampling
  - 3.1 Downsampling
  - 3.2 Upsampling
  - 3.3 OHLC Resampling
- 4. Interpolation
  - 4.1 Linear Interpolation
  - 4.2 Forward Fill
  - 4.3 Backward Fill
  - 4.4 Spline Interpolation
- 5. Data Aggregation
  - 5.1 Time-Window Aggregation
  - 5.2 Sliding-Window Aggregation
  - 5.3 Grouped Aggregation
- 6. Batch Preprocessing
  - 6.1 Preprocessing Pipeline
- 7. Practical Case
  - 7.1 Multi-Source Data Preprocessing System
- 8. Summary

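Abstract

This article walks through industrial data preprocessing in DolphinDB. It covers time alignment, resampling, interpolation, and aggregation, then combines them into multi-source alignment and a batch pipeline. Each technique comes with a DolphinDB script example.

1. Overview of Data Preprocessing

1.1 Preprocessing Requirements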

[Figure: flowchart "Data preprocessing": raw data → time alignment → resampling → interpolation and filling → standardized data]

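1.2 Preprocessing Scenarios

Scenario	Description
Time alignment	Synchronize timestamps across multiple data sources
Resampling	Bring all series to a single sampling frequency
Interpolation and filling	Fill in missing values
Data aggregation	Coarsen data granularity by summarizing rows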

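2. Time Alignment

2.1 Timestamp Alignment

// Create two sources sampled at different rates.
// TIMESTAMP literals (with .000) have millisecond precision, so the offsets below are in ms.
t1 = table(
    2024.01.01T00:00:00.000 + (0..99) * 60000 as timestamp,   // every minute
    20.0 + rand(10.0, 100) as temperature                     // uniform in [20, 30)
)

t2 = table(
    2024.01.01T00:00:00.000 + (0..19) * 300000 as timestamp,  // every 5 minutes
    40.0 + rand(20.0, 20) as humidity                         // uniform in [40, 60)
)

// Time alignment on 5-minute windows: average the temperature per window,
// then attach the humidity reading taken at each window start
temp5m = select avg(temperature) as avg_temp
         from t1
         group by bar(timestamp, 5m) as window_5m
aligned = lj(temp5m, t2, `window_5m, `timestamp)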
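
To make the windowing step concrete, here is a minimal plain-Python sketch of the arithmetic that `bar(timestamp, 5m)` is used for above: flooring each timestamp to the start of its fixed-width window. The helper name `floor_to_window` is hypothetical, and the code does not call DolphinDB; it only illustrates the bucketing under the assumption that windows are anchored at midnight.

```python
from datetime import datetime, timedelta


def floor_to_window(ts: datetime, window: timedelta) -> datetime:
    """Return the start of the fixed-width window that contains ts.

    Windows are anchored at midnight of ts's own day (an assumption of
    this sketch), so the width should divide 24 hours evenly.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # timedelta // timedelta gives the number of whole windows elapsed
    return midnight + (elapsed // window) * window


five_min = timedelta(minutes=5)
start = floor_to_window(datetime(2024, 1, 1, 0, 4, 59), five_min)
```

Here `start` is 00:00:00, while a reading at exactly 00:05:00 would open the next window. Readings from different sources that share a window start can then be joined on that key.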

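2.2 Multi-Source Data Alignment

// Align several sources on a common time grid.
// Assumes sources is a tuple of tables, each with columns timestamp and value.
def alignMultiSource(sources, interval) {
    // Collect the window start of every timestamp across all sources
    windows = bar(sources[0].timestamp, interval)
    for (source in sources) {
        windows = join(windows, bar(source.timestamp, interval))
    }
    result = table(sort(distinct(windows)) as timestamp)

    // Left-join each source's per-window average under its own column name
    for (i in 0:size(sources)) {
        source = sources[i]
        agg = select avg(value) as value
              from source
              group by bar(timestamp, interval) as timestamp
        agg.rename!(`value, "value" + string(i))
        result = lj(result, agg, `timestamp)
    }

    return result
}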
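
The same idea can be sketched outside DolphinDB. The plain-Python example below buckets each source into fixed windows, averages per window, and lines the sources up on the union of all windows, leaving `None` where a source has no sample. The function name `align_sources` and the `(epoch_seconds, value)` row layout are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def align_sources(sources, window_s):
    """Align several (epoch_seconds, value) series on shared fixed windows.

    Returns rows [window_start, v0, v1, ...] where v_i is the mean of
    source i inside that window, or None if source i has no sample there.
    """
    per_source = []
    for series in sources:
        buckets = defaultdict(list)
        for ts, value in series:
            buckets[ts - ts % window_s].append(value)  # floor to window start
        per_source.append({w: mean(vs) for w, vs in buckets.items()})

    # The output grid is the union of windows seen in any source
    all_windows = sorted(set().union(*per_source))
    return [[w] + [agg.get(w) for agg in per_source] for w in all_windows]


temperature = [(0, 20.0), (60, 22.0), (300, 24.0)]  # dense source
humidity = [(0, 50.0), (600, 55.0)]                 # sparser source
rows = align_sources([temperature, humidity], 300)
# rows == [[0, 21.0, 50.0], [300, 24.0, None], [600, None, 55.0]]
```

The `None` entries are exactly the gaps that the interpolation and fill methods in section 4 are meant to handle.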

3. Resampling

3.1 Downsampling

// Create high-frequency data
t = table(
    2024.01.01T00:00:00.000 + (0..999) * 6000 as timestamp,  // every 6 seconds
    20.0 + rand(10.0, 1000) as temperature
)

// Downsample: 6 seconds -> 1 minute
// The grouping expression is aliased in the group by clause and appears as the first output column
select first(temperature) as open,
       max(temperature) as high,
       min(temperature) as low,
       last(temperature) as close,
       avg(temperature) as avg_temp,
       count(*) as cnt
from t
group by bar(timestamp, 1m) as window_1m

// Downsample: 6 seconds -> 5 minutes
select first(temperature) as open,
       max(temperature) as high,
       min(temperature) as low,
       last(temperature) as close
from t
group by bar(timestamp, 5m) as window_5m

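3.2 Upsampling

// Create low-frequency data
t = table(
    2024.01.01T00:00:00.000 + (0..23) * 3600000 as timestamp,  // every hour
    20.0 + rand(10.0, 24) as temperature
)

// Upsample: hourly -> per minute (requires interpolation)
// Build the target time grid; left-joining it with t leaves NULLs between the hourly readings
target = table(2024.01.01T00:00:00.000 + (0..1439) * 60000 as timestamp)  // every minute

// Linear interpolation fills the NULLs between known readings
resampled = select timestamp,
            interpolate(temperature, "linear") as temperature
            from lj(target, t, `timestamp)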
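3.3 OHLC Resampling

// OHLC (open, high, low, close) resampling.
// Column names arrive as strings, so the query is assembled with SQL metaprogramming
// instead of being written as a literal select statement.
def ohlcResample(data, timeCol, valueCol, interval) {
    windowCol = sqlColAlias(makeCall(bar, sqlCol(timeCol), interval), `window)
    cols = [sqlCol(valueCol, first, `open),
            sqlCol(valueCol, max, `high),
            sqlCol(valueCol, min, `low),
            sqlCol(valueCol, last, `close),
            sqlCol(valueCol, count, `volume)]  // number of non-NULL samples per window
    return sql(select=cols, from=data, groupBy=windowCol).eval()
}

// Usage
ohlc = ohlcResample(t, `timestamp, `temperature, 5m)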
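
As a language-neutral illustration of what each OHLC bar contains, the following plain-Python sketch collapses time-ordered samples into bars. The name `ohlc_resample` and the tuple layout are assumptions of this example. It relies on the input being sorted by time, because `itertools.groupby` only groups consecutive items.

```python
from itertools import groupby


def ohlc_resample(samples, window_s):
    """Collapse time-ordered (epoch_seconds, value) samples into OHLC bars.

    Each bar is (window_start, open, high, low, close, count).
    """
    bars = []
    for start, group in groupby(samples, key=lambda s: s[0] - s[0] % window_s):
        vals = [v for _, v in group]
        # open/close are the first/last samples in arrival order
        bars.append((start, vals[0], max(vals), min(vals), vals[-1], len(vals)))
    return bars


samples = [(0, 21.0), (6, 23.5), (12, 20.5), (60, 22.0), (66, 22.5)]
bars = ohlc_resample(samples, 60)
# bars == [(0, 21.0, 23.5, 20.5, 20.5, 3), (60, 22.0, 22.5, 22.0, 22.5, 2)]
```

For sensor data, the high and low preserve short excursions that a plain average would smooth away, which is the main reason to borrow this financial-style summary.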

4. Interpolation

4.1 Linear Interpolation

// Create data with missing values (note the irregular spacing of the timestamps)
t = table(
    2024.01.01T00:00:00.000 + [0, 1, 3, 5, 8, 10] * 60000 as timestamp,
    [25.0, 26.0, NULL, 27.0, NULL, 28.0] as temperature
)

// Linear interpolation
select timestamp, temperature,
       interpolate(temperature, "linear") as temp_linear
from t
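
Because the sample timestamps above are irregular, it matters whether linear interpolation treats rows as equally spaced or weights them by time. The plain-Python sketch below computes both variants on the same data so the difference is visible. The helper `fill_linear` is hypothetical; which variant a given database function implements should be checked against its reference documentation.

```python
def fill_linear(values, xs=None):
    """Fill interior None gaps by linear interpolation.

    With xs=None, samples are treated as equally spaced (position-based).
    With xs given (e.g. minutes since start), gaps are weighted by x.
    Leading and trailing None values are left untouched.
    """
    if xs is None:
        xs = list(range(len(values)))
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (xs[i] - xs[left]) / (xs[right] - xs[left])
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


minutes = [0, 1, 3, 5, 8, 10]
temps = [25.0, 26.0, None, 27.0, None, 28.0]
by_position = fill_linear(temps)           # second gap filled with 27.5
by_time = fill_linear(temps, minutes)      # second gap filled with about 27.6
```

The first gap comes out as 26.5 either way because minute 3 sits halfway between minutes 1 and 5. The second gap differs because minute 8 is three fifths of the way from minute 5 to minute 10.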

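4.2 Forward Fill

// Forward fill: replace each NULL with the last non-NULL value before it
select timestamp, temperature,
       ffill(temperature) as temp_ffill
from t

4.3 Backward Fill

// Backward fill: replace each NULL with the next non-NULL value after it
select timestamp, temperature,
       bfill(temperature) as temp_bfill
from t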
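
The two fill directions differ only in which neighbour they copy from, which a short plain-Python sketch makes explicit. The function names mirror the ones used above, but these are standalone illustrations, not the DolphinDB implementations.

```python
def ffill(values):
    """Replace each None with the most recent non-None value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def bfill(values):
    """Replace each None with the next non-None value after it."""
    # Backward fill is forward fill applied to the reversed sequence
    return ffill(values[::-1])[::-1]


temps = [None, 26.0, None, 27.0, None]
forward = ffill(temps)    # leading None has nothing before it, so it stays
backward = bfill(temps)   # trailing None has nothing after it, so it stays
```

Forward fill uses only past values, so it is safe for streaming data. Backward fill looks ahead, so it suits offline batch cleaning only.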

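4.4 Spline Interpolation

// Spline interpolation
// Note: depending on the DolphinDB version, spline may require additional arguments
// (for example explicit x values and a resampling rule); check the function reference
// for the signature your server supports before running this query.
select timestamp, temperature,
       spline(temperature) as temp_spline
from t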

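5. Data Aggregation

5.1 Time-Window Aggregation

// Time-window aggregation
t = table(
    2024.01.01T00:00:00.000 + (0..999) * 60000 as timestamp,  // every minute
    20.0 + rand(10.0, 1000) as temperature,
    40.0 + rand(20.0, 1000) as humidity
)

// 10-minute window aggregation
// (the data is sampled once per minute, so a 1-minute window would hold a single row
// and statistics such as std would not be meaningful)
select avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       std(temperature) as std_temp,
       avg(humidity) as avg_humidity
from t
group by bar(timestamp, 10m) as window_10m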

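5.2 Sliding-Window Aggregation

// Sliding-window aggregation
// The window length is counted in rows: 10 rows cover 10 minutes at this sampling rate
select timestamp, temperature,
       mavg(temperature, 10) as moving_avg_10,
       mstd(temperature, 10) as moving_std_10,
       mmax(temperature, 10) as moving_max_10,
       mmin(temperature, 10) as moving_min_10
from t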
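
Unlike the fixed windows of section 5.1, a sliding window produces one output per input row, computed over the trailing rows. The plain-Python sketch below shows the mechanics with a bounded deque. The name `moving_stats` is hypothetical, and returning `None` until the window is full is one common convention assumed here.

```python
from collections import deque


def moving_stats(values, window):
    """Return (mean, max, min) over the trailing `window` samples.

    Positions with fewer than `window` samples so far yield None.
    """
    buf = deque(maxlen=window)   # oldest sample drops out automatically
    out = []
    for v in values:
        buf.append(v)
        if len(buf) < window:
            out.append(None)
        else:
            out.append((sum(buf) / window, max(buf), min(buf)))
    return out


stats = moving_stats([20.0, 22.0, 24.0, 21.0, 27.0], 3)
# stats[2] == (22.0, 24.0, 20.0); stats[4] == (24.0, 27.0, 21.0)
```

Recomputing `sum`, `max`, and `min` on every step costs time proportional to the window length per row. That is fine for a sketch, but large windows call for incremental updates.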

5.3 Grouped Aggregation

// Grouped aggregation
t = table(
    take(1..10, 1000) as device_id,
    2024.01.01T00:00:00.000 + (0..999) * 60000 as timestamp,
    20.0 + rand(10.0, 1000) as temperature
)

// Aggregate per device and per hour
// (DolphinDB duration units are case-sensitive: H is hour, m is minute, M is month)
select avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from t
group by device_id, bar(timestamp, 1H) as hour_start
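
Grouping by device and time window amounts to aggregating over a composite key. A minimal plain-Python sketch of that idea follows. The name `aggregate_by_device_hour` and the `(device_id, epoch_seconds, value)` row layout are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_by_device_hour(rows):
    """Group (device_id, epoch_seconds, value) rows by device and hour.

    Returns {(device_id, hour_start): (mean, max, min)}.
    """
    groups = defaultdict(list)
    for device_id, ts, value in rows:
        groups[(device_id, ts - ts % 3600)].append(value)  # composite key
    return {key: (sum(v) / len(v), max(v), min(v)) for key, v in groups.items()}


rows = [(1, 0, 20.0), (2, 60, 30.0), (1, 1800, 24.0), (1, 3600, 28.0)]
summary = aggregate_by_device_hour(rows)
# summary[(1, 0)] == (22.0, 24.0, 20.0)
```

Keeping the device in the key prevents readings from different devices that fall in the same hour from being averaged together.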

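6. Batch Preprocessing

6.1 Preprocessing Pipeline

// Preprocessing pipeline.
// Assumes config is a dictionary with BOOL entries alignTime, resample, interpolate
// and a DURATION entry interval, and that data has columns timestamp, temperature, humidity.
def preprocessingPipeline(data, config) {
    result = data
    interval = config[`interval]

    // 1. Time alignment: tag each row with the start of its window
    if (config[`alignTime]) {
        result = select *, bar(timestamp, interval) as aligned_time
                 from result
    }

    // 2. Resampling: one averaged row per window (needs aligned_time from step 1)
    if (config[`resample]) {
        result = select avg(temperature) as temperature,
                        avg(humidity) as humidity
                 from result
                 group by aligned_time
    }

    // 3. Interpolation: fill remaining NULLs (also needs aligned_time from step 1)
    if (config[`interpolate]) {
        result = select aligned_time,
                        interpolate(temperature, "linear") as temperature,
                        interpolate(humidity, "linear") as humidity
                 from result
    }

    return result
}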
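
The pipeline above is a sequence of optional steps switched on by a configuration object. One way to express that pattern is as a list of `(config_key, function)` pairs, sketched below in plain Python. All names here (`run_pipeline`, `align`, `resample`, `interval_s`) are hypothetical, and the steps operate on simple `(epoch_seconds, value)` tuples rather than tables.

```python
def run_pipeline(rows, steps, config):
    """Apply each enabled step in order; a step is (config_key, function)."""
    for key, step in steps:
        if config.get(key, False):
            rows = step(rows, config)
    return rows


def align(rows, config):
    """Floor each (epoch_seconds, value) timestamp to its window start."""
    w = config["interval_s"]
    return [(ts - ts % w, v) for ts, v in rows]


def resample(rows, config):
    """Average all values that share the same (already aligned) timestamp."""
    sums = {}
    for ts, v in rows:
        total, n = sums.get(ts, (0.0, 0))
        sums[ts] = (total + v, n + 1)
    return [(ts, total / n) for ts, (total, n) in sorted(sums.items())]


STEPS = [("align_time", align), ("resample", resample)]
config = {"align_time": True, "resample": True, "interval_s": 300}
result = run_pipeline([(0, 20.0), (120, 22.0), (310, 25.0)], STEPS, config)
# result == [(0, 21.0), (300, 25.0)]
```

Keeping the steps in an ordered list makes their dependencies visible: resampling only makes sense after alignment, so it appears later in `STEPS`.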

7. Practical Case

7.1 Multi-Source Data Preprocessing System

// ========== Multi-source data preprocessing system ==========

// 1. Create the source data
// Temperature (every minute)
tempData = table(
    2024.01.01T00:00:00.000 + (0..1439) * 60000 as timestamp,
    20.0 + rand(10.0, 1440) as temperature
)

// Humidity (every 5 minutes)
humidData = table(
    2024.01.01T00:00:00.000 + (0..287) * 300000 as timestamp,
    40.0 + rand(20.0, 288) as humidity
)

// Pressure (every 10 minutes)
pressData = table(
    2024.01.01T00:00:00.000 + (0..143) * 600000 as timestamp,
    1000.0 + rand(20.0, 144) as pressure
)

// 2. Time alignment (unify everything to 5-minute windows)
aligned = select avg(temperature) as temperature
          from tempData
          group by bar(timestamp, 5m) as window_5m

humid5m = select last(humidity) as humidity
          from humidData
          group by bar(timestamp, 5m) as window_5m
aligned = lj(aligned, humid5m, `window_5m)

press5m = select avg(pressure) as pressure
          from pressData
          group by bar(timestamp, 5m) as window_5m
aligned = lj(aligned, press5m, `window_5m)

// 3. Interpolation: pressure arrives every 10 minutes, so every other 5-minute window is NULL
preprocessed = select window_5m,
               temperature,
               humidity,
               interpolate(pressure, "linear") as pressure
               from aligned

// 4. Write to a distributed (DFS) table, value-partitioned by date
db = database("dfs://preprocessed_db", VALUE, 2024.01.01..2024.12.31)
db.createPartitionedTable(preprocessed, `preprocessed_data, `window_5m)
loadTable("dfs://preprocessed_db", "preprocessed_data").append!(preprocessed)

// 5. Verify
select top 20 * from preprocessed

print("Multi-source data preprocessing complete")

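8. Summary

This article covered industrial data preprocessing in DolphinDB:

Time alignment: timestamp alignment and multi-source alignment
Resampling: downsampling, upsampling, and OHLC resampling
Interpolation: linear interpolation, forward fill, backward fill, and spline interpolation
Data aggregation: time-window, sliding-window, and grouped aggregation
Batch processing: the preprocessing pipeline

Questions for further thought:

How do you choose a suitable resampling interval?
What are the strengths and weaknesses of the different interpolation methods?
How do you handle time alignment across multiple data sources?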
