DolphinDB Industrial Data Preprocessing: Resampling and Alignment

Contents

    • Abstract
    • 1. Data Preprocessing Overview
      • 1.1 Preprocessing Requirements
      • 1.2 Preprocessing Scenarios
    • 2. Time Alignment
      • 2.1 Timestamp Alignment
      • 2.2 Multi-Source Data Alignment
    • 3. Resampling
      • 3.1 Downsampling
      • 3.2 Upsampling
      • 3.3 OHLC Resampling
    • 4. Interpolation
      • 4.1 Linear Interpolation
      • 4.2 Forward Fill
      • 4.3 Backward Fill
      • 4.4 Spline Interpolation
    • 5. Data Aggregation
      • 5.1 Time-Window Aggregation
      • 5.2 Sliding-Window Aggregation
      • 5.3 Grouped Aggregation
    • 6. Batch Preprocessing
      • 6.1 Preprocessing Pipeline
    • 7. Practical Case Study
      • 7.1 Multi-Source Data Preprocessing System
    • 8. Summary
    • References

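Abstract

This article walks through industrial data preprocessing in DolphinDB: time alignment, resampling, interpolation, aggregation, multi-source alignment, and batch processing. Each technique is shown with a DolphinDB script example that runs on simulated sensor readings (temperature, humidity, and pressure).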


1. Data Preprocessing Overview

1.1 Preprocessing Requirements

[Flowchart: Data preprocessing. Raw data → time alignment → resampling → interpolation and filling → standardized data]

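1.2 Preprocessing Scenarios

| Scenario | Description |
| --- | --- |
| Time alignment | Synchronize timestamps across multiple data sources |
| Resampling | Bring all series to a uniform sampling frequency |
| Interpolation and filling | Fill in missing values |
| Data aggregation | Reduce data granularity |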

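2. Time Alignment

2.1 Timestamp Alignment

// Create two data sources with different sampling frequencies
t1 = table(
    2024.01.01T00:00:00.000 + (0..99) * 60000 as timestamp,   // one sample per minute (offsets in ms)
    20.0 + rand(10.0, 100) as temperature                     // uniform in [20, 30)
)

t2 = table(
    2024.01.01T00:00:00.000 + (0..19) * 300000 as timestamp,  // one sample every 5 minutes
    40.0 + rand(20.0, 20) as humidity                         // uniform in [40, 60)
)

// Time alignment: aggregate the 1-minute series into 5-minute windows,
// then left join the 5-minute series on the window start
temp5m = select avg(temperature) as avg_temp
         from t1
         group by bar(timestamp, 5m) as window_5m

aligned = lj(temp5m, t2, `window_5m, `timestamp)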
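
To make the windowing step concrete, here is a minimal sketch in plain Python (not DolphinDB script) of what flooring timestamps to a 5-minute window and averaging per window amounts to. The helper names `bar_floor` and `window_mean` are hypothetical and exist only for this illustration; the sketch assumes windows are counted from a fixed origin, which may differ from how a particular DolphinDB version anchors its windows.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bar_floor(ts, interval, origin=datetime(1970, 1, 1)):
    """Round ts down to the start of the interval-sized window that contains it."""
    steps = (ts - origin) // interval      # whole windows elapsed since the origin
    return origin + steps * interval

def window_mean(samples, interval):
    """Average (timestamp, value) pairs per window; returns {window_start: mean}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[bar_floor(ts, interval)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

start = datetime(2024, 1, 1)
# Ten one-minute samples whose value equals the minute index
samples = [(start + timedelta(minutes=i), float(i)) for i in range(10)]
means = window_mean(samples, timedelta(minutes=5))
for window_start, mean in means.items():
    print(window_start, mean)
```

Once both series are keyed by the same window start, joining them is an ordinary equality join, which is why the aggregation comes before the join.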

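2.2 Multi-Source Data Alignment

// Align multiple sources onto a common time grid.
// Assumes every table in sources has a TIMESTAMP column `timestamp` and a numeric column `value`.
def alignMultiSource(sources, interval) {
    // Collect the window start of every timestamp across all sources
    allWindows = array(TIMESTAMP, 0)
    for (source in sources) {
        allWindows.append!(bar(source.timestamp, interval))
    }
    result = table(sort(distinct(allWindows)) as window_start)

    // Aggregate each source per window and join it onto the grid
    for (i in 0..(sources.size() - 1)) {
        source = sources[i]
        agg = select avg(value) as val
              from source
              group by bar(timestamp, interval) as window_start
        agg.rename!(`val, "value_" + string(i))   // one distinct column per source
        result = lj(result, agg, `window_start)
    }

    return result
}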

3. Resampling

3.1 Downsampling

// Create high-frequency data
t = table(
    2024.01.01T00:00:00.000 + (0..999) * 6000 as timestamp,  // one sample every 6 seconds (offsets in ms)
    20.0 + rand(10.0, 1000) as temperature                   // uniform in [20, 30)
)

// Downsample: 6 seconds -> 1 minute
select first(temperature) as open,
       max(temperature) as high,
       min(temperature) as low,
       last(temperature) as close,
       avg(temperature) as avg_temp,
       count(*) as cnt
from t
group by bar(timestamp, 1m) as minute

// Downsample: 6 seconds -> 5 minutes
select first(temperature) as open,
       max(temperature) as high,
       min(temperature) as low,
       last(temperature) as close
from t
group by bar(timestamp, 5m) as minute_5

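3.2 Upsampling

// Create low-frequency data
t = table(
    2024.01.01T00:00:00.000 + (0..23) * 3600000 as timestamp,  // one sample per hour
    20.0 + rand(10.0, 24) as temperature
)

// Upsample: hourly -> per minute (requires interpolation)
// Build the target time grid, one row per minute
target = table(2024.01.01T00:00:00.000 + (0..1439) * 60000 as timestamp)

// Left join the hourly samples onto the grid; minutes without a sample become NULL
joined = lj(target, t, `timestamp)

// Fill the NULL rows between hourly samples by linear interpolation
resampled = select timestamp,
            interpolate(temperature, "linear") as temperature
            from joined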
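
The idea behind upsampling can be sketched in plain Python (not DolphinDB script): each target time that falls between two known samples gets a value on the straight line joining them. The function name `upsample_linear` and the sample values are hypothetical, and targets outside the sampled range are deliberately left empty here, which is a choice of this sketch rather than a statement about DolphinDB's behavior.

```python
from bisect import bisect_right

def upsample_linear(times, values, target_times):
    """Linearly interpolate (times, values) onto target_times.

    times must be sorted ascending; targets outside [times[0], times[-1]] get None.
    """
    out = []
    for x in target_times:
        if x < times[0] or x > times[-1]:
            out.append(None)
            continue
        i = bisect_right(times, x) - 1       # index of the last sample at or before x
        if times[i] == x:
            out.append(values[i])            # target coincides with a sample
            continue
        x0, x1 = times[i], times[i + 1]
        y0, y1 = values[i], values[i + 1]
        out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out

hourly_minutes = [0, 60, 120]                # sample times, in minutes
hourly_temps = [20.0, 26.0, 23.0]
targets = [0, 15, 30, 60, 90, 120, 121]
print(upsample_linear(hourly_minutes, hourly_temps, targets))
# → [20.0, 21.5, 23.0, 26.0, 24.5, 23.0, None]
```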

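3.3 OHLC Resampling

// OHLC (open, high, low, close) resampling.
// timeCol and valueCol are column names passed as strings, so the columns are
// looked up by name and copied into a temporary table before aggregating.
def ohlcResample(data, timeCol, valueCol, interval) {
    tmp = table(bar(data[timeCol], interval) as window_start, data[valueCol] as val)
    return select first(val) as open,
                  max(val) as high,
                  min(val) as low,
                  last(val) as close,
                  count(val) as volume      // number of samples in the window
           from tmp
           group by window_start
}

// Usage (t is a table with timestamp and temperature columns, e.g. the 6-second table from section 3.1)
ohlc = ohlcResample(t, `timestamp, `temperature, 5m)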
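
For readers who want to see the OHLC logic without SQL, the following plain Python sketch (not DolphinDB script) builds the same five statistics in a single pass. The function name `ohlc_resample` and the readings are hypothetical; the sketch assumes the samples arrive in time order, because "open" and "close" depend on it.

```python
def ohlc_resample(samples, interval):
    """Build per-window open/high/low/close/count from time-ordered (t, value) pairs.

    t and interval are numbers in the same unit (seconds in the example below).
    """
    bars = {}
    for t, v in samples:
        start = t - t % interval             # start of the window containing t
        bar = bars.get(start)
        if bar is None:
            # First sample of the window sets every field
            bars[start] = {"open": v, "high": v, "low": v, "close": v, "count": 1}
        else:
            bar["high"] = max(bar["high"], v)
            bar["low"] = min(bar["low"], v)
            bar["close"] = v                 # the latest sample seen so far
            bar["count"] += 1
    return bars

readings = [(0, 21.0), (6, 23.5), (12, 20.5), (54, 22.0), (60, 24.0), (66, 23.0)]
bars = ohlc_resample(readings, 60)
for start, bar in bars.items():
    print(start, bar)
```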

4. Interpolation

4.1 Linear Interpolation

// Create data with missing values (note the uneven spacing: minutes 0, 1, 3, 5, 8, 10)
t = table(
    2024.01.01T00:00:00.000 + [0, 1, 3, 5, 8, 10] * 60000 as timestamp,
    [25.0, 26.0, NULL, 27.0, NULL, 28.0] as temperature
)

// Linear interpolation of the NULL values
select timestamp, temperature,
       interpolate(temperature, "linear") as temp_linear
from t
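
The sample data is unevenly spaced, so it matters whether gaps are filled by row position or by elapsed time. The plain Python sketch below (not DolphinDB script) computes both variants with one hypothetical helper, `fill_linear`. It assumes, as with pandas-style linear interpolation, that a column-only call works by row position; check the function reference of your DolphinDB version before relying on either behavior.

```python
def fill_linear(xs, ys):
    """Fill None entries of ys by linear interpolation over the coordinates xs.

    Gaps before the first or after the last known value are left as None.
    """
    known = [i for i, y in enumerate(ys) if y is not None]
    out = list(ys)
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (xs[i] - xs[left]) / (xs[right] - xs[left])
            out[i] = ys[left] + (ys[right] - ys[left]) * frac
    return out

minutes = [0, 1, 3, 5, 8, 10]
temps = [25.0, 26.0, None, 27.0, None, 28.0]

by_position = fill_linear(list(range(len(temps))), temps)  # rows treated as evenly spaced
by_time = fill_linear(minutes, temps)                      # weighted by elapsed minutes

print(by_position)
print(by_time)
```

The two results agree for the gap at minute 3 (26.5 either way) but differ at minute 8: 27.5 by position versus roughly 27.6 by time, because minute 8 lies three fifths of the way from minute 5 to minute 10.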

4.2 Forward Fill

// Forward fill: replace each NULL with the most recent non-NULL value before it
select timestamp, temperature,
       ffill(temperature) as temp_ffill
from t

4.3 Backward Fill

// Backward fill: replace each NULL with the next non-NULL value after it
select timestamp, temperature,
       bfill(temperature) as temp_bfill
from t
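
Forward and backward fill differ only in direction, which the following plain Python sketch (not DolphinDB script) makes explicit. The functions are hypothetical stand-ins written for this illustration, and the input values are made up.

```python
def ffill(values):
    """Replace each None with the most recent non-None value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def bfill(values):
    """Replace each None with the next non-None value after it."""
    return ffill(values[::-1])[::-1]

temps = [None, 25.0, None, None, 27.0, None]
print(ffill(temps))  # → [None, 25.0, 25.0, 25.0, 27.0, 27.0]
print(bfill(temps))  # → [25.0, 25.0, 27.0, 27.0, 27.0, None]
```

Note the edge cases: forward fill cannot repair a gap at the start of the series, and backward fill cannot repair one at the end, so the two are sometimes chained.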

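4.4 Spline Interpolation

// Spline interpolation: fit a cubic spline on the non-NULL samples, then evaluate it at every timestamp
// (assumes a server version that provides cubicSpline and cubicSplinePredict)
known = select * from t where isValid(temperature)
t0 = long(t.timestamp[0])
model = cubicSpline((long(known.timestamp) - t0) / 60000.0, known.temperature)  // x in minutes since t0
select timestamp, temperature,
       cubicSplinePredict(model, (long(timestamp) - t0) / 60000.0) as temp_spline
from t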

5. Data Aggregation

5.1 Time-Window Aggregation

// Time-window aggregation
t = table(
    2024.01.01T00:00:00.000 + (0..999) * 6000 as timestamp,  // one sample every 6 seconds, so each 1-minute window holds 10 samples
    20.0 + rand(10.0, 1000) as temperature,
    40.0 + rand(20.0, 1000) as humidity
)

// Aggregate over 1-minute windows
select avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp,
       std(temperature) as std_temp,
       avg(humidity) as avg_humidity
from t
group by bar(timestamp, 1m) as minute

5.2 Sliding-Window Aggregation

// Sliding-window aggregation over the latest 10 rows (the current row included)
select timestamp, temperature,
       mavg(temperature, 10) as moving_avg_10,
       mstd(temperature, 10) as moving_std_10,
       mmax(temperature, 10) as moving_max_10,
       mmin(temperature, 10) as moving_min_10
from t
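
Unlike time-window aggregation, a sliding window produces one output row per input row. The plain Python sketch below (not DolphinDB script) shows a trailing moving average with a running sum. The function `moving_avg` is hypothetical, and it assumes that the first `window - 1` rows have no result because the window is not yet full, which is the usual default for count-based moving functions.

```python
def moving_avg(values, window):
    """Trailing moving average; None until `window` samples are available."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]    # drop the sample that left the window
        out.append(running / window if i >= window - 1 else None)
    return out

print(moving_avg([1.0, 2.0, 3.0, 4.0, 5.0], 3))
# → [None, None, 2.0, 3.0, 4.0]
```

Keeping a running sum makes each step constant-time regardless of the window size, which matters for long high-frequency series.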

5.3 Grouped Aggregation

// Grouped aggregation
t = table(
    take(1..10, 1000) as device_id,                            // 10 devices, assigned round-robin
    2024.01.01T00:00:00.000 + (0..999) * 60000 as timestamp,   // one sample per minute
    20.0 + rand(10.0, 1000) as temperature
)

// Aggregate per device and per hour
select avg(temperature) as avg_temp,
       max(temperature) as max_temp,
       min(temperature) as min_temp
from t
group by device_id, bar(timestamp, 1h) as hour

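6. Batch Preprocessing

6.1 Preprocessing Pipeline

// Preprocessing pipeline
// config: dict(STRING, ANY) with BOOL entries alignTime, resample, interpolate
//         and a DURATION entry interval (for example 5m)
def preprocessingPipeline(data, config) {
    result = data
    interval = config[`interval]

    // 1. Time alignment (fall back to the raw timestamp so later steps always find aligned_time)
    if (config[`alignTime]) {
        result = select *, bar(timestamp, interval) as aligned_time
                 from result
    } else {
        result = select *, timestamp as aligned_time
                 from result
    }

    // 2. Resampling: one averaged row per aligned window
    if (config[`resample]) {
        result = select avg(temperature) as temperature,
                        avg(humidity) as humidity
                 from result
                 group by aligned_time
    }

    // 3. Interpolation of remaining NULL values
    if (config[`interpolate]) {
        result = select aligned_time,
                        interpolate(temperature, "linear") as temperature,
                        interpolate(humidity, "linear") as humidity
                 from result
    }

    return result
}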

7. Practical Case Study

7.1 Multi-Source Data Preprocessing System

// ========== Multi-source data preprocessing system ==========

// 1. Create the data sources
// Temperature (one sample per minute)
tempData = table(
    2024.01.01T00:00:00.000 + (0..1439) * 60000 as timestamp,
    20.0 + rand(10.0, 1440) as temperature
)

// Humidity (one sample every 5 minutes)
humidData = table(
    2024.01.01T00:00:00.000 + (0..287) * 300000 as timestamp,
    40.0 + rand(20.0, 288) as humidity
)

// Pressure (one sample every 10 minutes)
pressData = table(
    2024.01.01T00:00:00.000 + (0..143) * 600000 as timestamp,
    1000.0 + rand(20.0, 144) as pressure
)

// 2. Time alignment (bring every source to 5-minute windows)
temp5m = select avg(temperature) as temperature
         from tempData
         group by bar(timestamp, 5m) as window_5m

humid5m = select last(humidity) as humidity
          from humidData
          group by bar(timestamp, 5m) as window_5m

press5m = select avg(pressure) as pressure
          from pressData
          group by bar(timestamp, 5m) as window_5m

aligned = lj(temp5m, humid5m, `window_5m)
aligned = lj(aligned, press5m, `window_5m)

// 3. Interpolation: pressure arrives every 10 minutes, so every other 5-minute window is NULL
preprocessed = select window_5m,
               temperature,
               humidity,
               interpolate(pressure, "linear") as pressure
               from aligned

// 4. Write to a distributed table (value-partitioned by date)
db = database("dfs://preprocessed_db", VALUE, 2024.01.01..2024.12.31)
db.createPartitionedTable(preprocessed, `preprocessed_data, `window_5m)
loadTable("dfs://preprocessed_db", "preprocessed_data").append!(preprocessed)

// 5. Verify
select top 20 * from preprocessed

print("Multi-source data preprocessing complete")

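8. Summary

This article covered the following industrial data preprocessing techniques in DolphinDB:

  1. Time alignment: timestamp alignment and multi-source alignment
  2. Resampling: downsampling, upsampling, and OHLC resampling
  3. Interpolation: linear interpolation, forward fill, backward fill, and spline interpolation
  4. Data aggregation: time-window, sliding-window, and grouped aggregation
  5. Batch processing: the preprocessing pipeline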

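Questions for Further Thought

  1. How do you choose an appropriate resampling interval?
  2. What are the strengths and weaknesses of the different interpolation methods?
  3. How do you handle time alignment across multiple data sources?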

References