Contents
- Abstract
- 1. Stream Computing Overview
  - 1.1 What Is Stream Computing
  - 1.2 Stream Computing vs. Batch Processing
  - 1.3 Features of DolphinDB Stream Computing
- 2. Stream Tables
  - 2.1 Creating a Stream Table
  - 2.2 Writing to a Stream Table
  - 2.3 Stream Table Persistence
  - 2.4 Stream Table Operations
- 3. Publish/Subscribe
  - 3.1 How Publish/Subscribe Works
  - 3.2 Subscribing to a Stream Table
  - 3.3 Handler Functions
  - 3.4 Cancelling a Subscription
- 4. Real-Time Data Processing
  - 4.1 Real-Time Filtering
  - 4.2 Real-Time Aggregation
  - 4.3 Real-Time Writes to a Distributed Table
- 5. Streaming Engines
  - 5.1 Time-Series Engine
  - 5.2 Cross-Sectional Engine
  - 5.3 Anomaly Detection Engine
- 6. Monitoring Stream Computing
  - 6.1 Checking Stream Table Status
  - 6.2 Checking Subscription Status
  - 6.3 Monitoring Streaming Performance
- 7. Case Study
  - 7.1 Real-Time Device Monitoring System
- 8. Summary
- References
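Abstract
This article covers the basics of stream computing in DolphinDB. It moves from the concept of stream computing to creating stream tables, and from publish/subscribe to real-time processing, explaining the core principles and methods along the way. Code examples throughout help readers build the core skills of real-time data processing.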
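1. Stream Computing Overview
1.1 What Is Stream Computing
Stream computing is a processing model that runs computations continuously over live data as it arrives.
Stream computing architecture: data source → stream table → streaming engine → result output
Characteristics: real-time processing, continuous computation, low latency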
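1.2 Stream Computing vs. Batch Processing
| Aspect | Batch processing | Stream computing |
|---|---|---|
| Data | Static data | Live data streams |
| Processing | One-off | Continuous |
| Latency | Minutes to hours | Milliseconds to seconds |
| Typical use | Historical analysis | Real-time monitoring |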
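To make the contrast in the table concrete, here is a minimal sketch in plain Python (not DolphinDB script; the names `batch_mean`, `RunningMean`, and `stream_means` are invented for illustration). The batch function needs the complete data set before it can answer, while the streaming version keeps a small state and refreshes its answer with every arriving value.

```python
from typing import Iterable, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the whole data set must exist before computing."""
    return sum(values) / len(values)


class RunningMean:
    """Streaming style: keep a small state and update it per record."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def stream_means(values: Iterable[float]) -> List[float]:
    """Feed values one at a time and record the result after each one."""
    running = RunningMean()
    return [running.update(v) for v in values]


print(batch_mean([20.0, 30.0, 40.0]))    # one answer, after all data is in
print(stream_means([20.0, 30.0, 40.0]))  # an updated answer after each value
```

The streaming version never stores the history, which is what keeps per-record latency low.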
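1.3 Features of DolphinDB Stream Computing
| Feature | Description |
|---|---|
| Stream tables | Tables that hold live data |
| Publish/subscribe | Mechanism for distributing data |
| Streaming engines | Built-in computation engines |
| Low latency | Millisecond-level processing |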
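2. Stream Tables
2.1 Creating a Stream Table
dolphindb
// Create a shared stream table
share streamTable(1:0,
    `device_id`timestamp`temperature`humidity,
    [INT, TIMESTAMP, DOUBLE, DOUBLE]) as sensor_stream
// Parameters:
// - 1:0 -> capacity:size, i.e. an initial capacity of 1 and 0 initial rows
// - column names
// - column types
// - share ... as -> shares the table with all sessions under the given name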
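2.2 Writing to a Stream Table
dolphindb
// Insert a single row
insert into sensor_stream values(1, now(), 25.5, 50.0)
// Append a batch of rows
// rand(10.0, 100) draws 100 doubles uniformly from [0, 10), shifted here to [20, 30)
sensor_stream.append!(
    table(
        1..100 as device_id,
        take(now(), 100) as timestamp,
        20.0 + rand(10.0, 100) as temperature,
        40.0 + rand(20.0, 100) as humidity
    )
)
// Check how many rows the stream table holds
select count(*) from sensor_stream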
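2.3 Stream Table Persistence
dolphindb
// Enable persistence (persistenceDir must be set in the server configuration)
enableTablePersistence(sensor_stream, true, true, 1000000)
// Parameters:
// - asynWrite=true -> write to disk asynchronously
// - compress=true -> compress the data saved to disk
// - cacheSize=1000000 -> maximum number of rows kept in memory
// Inspect the persistence metadata of the table
getPersistenceMeta(sensor_stream)
// Disable persistence
disableTablePersistence(sensor_stream)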
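2.4 Stream Table Operations
dolphindb
// Inspect the table schema
schema(sensor_stream)
// A stream table is append-only, so its rows cannot be deleted in place.
// To clear it, drop the table and create it again.
// Drop the shared stream table (cancel all of its subscriptions first)
undef("sensor_stream", SHARED)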
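3. Publish/Subscribe
3.1 How Publish/Subscribe Works
Publish/subscribe model: a publisher writes data into a stream table, and the stream table distributes that data to every subscriber (subscriber 1, subscriber 2, subscriber 3).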
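The fan-out described above can be modelled in a few lines of plain Python. This is a toy model, not DolphinDB code: `ToyStreamTable` and its methods are hypothetical names, and delivery here is synchronous, whereas a real server hands messages to subscribers through queues.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Handler = Callable[[List[Row]], None]


class ToyStreamTable:
    """Hypothetical stand-in for a stream table: an append-only list of rows
    plus a registry of named handlers."""

    def __init__(self) -> None:
        self.rows: List[Row] = []
        self.handlers: Dict[str, Handler] = {}

    def subscribe(self, action_name: str, handler: Handler) -> None:
        self.handlers[action_name] = handler

    def unsubscribe(self, action_name: str) -> None:
        self.handlers.pop(action_name, None)

    def append(self, batch: List[Row]) -> None:
        # The publisher only writes; every subscriber receives the same batch.
        self.rows.extend(batch)
        for handler in self.handlers.values():
            handler(batch)


stream = ToyStreamTable()
archive: List[Row] = []   # subscriber 1 keeps everything
hot: List[Row] = []       # subscriber 2 keeps only high temperatures

stream.subscribe("archive", archive.extend)
stream.subscribe("hot_only",
                 lambda b: hot.extend(r for r in b if r["temperature"] > 25))

stream.append([{"device_id": 1, "temperature": 24.0},
               {"device_id": 2, "temperature": 31.5}])
stream.unsubscribe("hot_only")
stream.append([{"device_id": 3, "temperature": 40.0}])

print(len(archive), len(hot))
```

The publisher never needs to know who is listening: subscribers are added and removed by name without touching the write path.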
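3.2 Subscribing to a Stream Table
dolphindb
// Create the stream table
share streamTable(1:0,
    `device_id`timestamp`temperature`humidity,
    [INT, TIMESTAMP, DOUBLE, DOUBLE]) as sensor_stream
// Create the result table
share table(1:0,
    `device_id`timestamp`temperature`humidity,
    [INT, TIMESTAMP, DOUBLE, DOUBLE]) as result_table
// Subscribe to the stream table
subscribeTable(, "sensor_stream", "handler1", -1,
    def(msg) {
        result_table.append!(msg)
    }, true)
// Arguments, in order:
// - server: omitted here, so the subscription targets the local node
// - tableName: name of the stream table
// - actionName: name of this subscription
// - offset: -1 starts from new data only; 0 starts from the first row
// - handler: function applied to the incoming messages
// - msgAsTable: true passes msg to the handler as a table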
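3.3 Handler Functions
dolphindb
// Simple handler: write messages straight through
def simpleHandler(msg) {
    result_table.append!(msg)
}
// Filtering handler: keep only high-temperature rows
def filterHandler(msg) {
    filtered = select * from msg where temperature > 25
    result_table.append!(filtered)
}
// Table that receives the per-device averages
share table(1:0, `device_id`avg_temp, [INT, DOUBLE]) as avg_by_device
// Aggregating handler: compute the average per device
// (the group-by column device_id is included in the result automatically)
def aggHandler(msg) {
    agg = select avg(temperature) as avg_temp
          from msg
          group by device_id
    avg_by_device.append!(agg)
}
// Use a named handler in a subscription
subscribeTable(, "sensor_stream", "filter_handler", -1, filterHandler, true)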
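3.4 Cancelling a Subscription
dolphindb
// Inspect the subscription workers
getStreamingStat().subWorkers
// Cancel one subscription
unsubscribeTable(, "sensor_stream", "handler1")
// Cancel the remaining subscriptions the same way, one action name at a time
unsubscribeTable(, "sensor_stream", "filter_handler")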
4. Real-Time Data Processing
4.1 Real-Time Filtering
dolphindb
// Create the stream table and the alert table
share streamTable(1:0,
    `device_id`timestamp`temperature`humidity,
    [INT, TIMESTAMP, DOUBLE, DOUBLE]) as sensor_stream
share table(1:0,
    `device_id`timestamp`alert_type`value,
    [INT, TIMESTAMP, SYMBOL, DOUBLE]) as alert_table
// Subscription: real-time alerts
subscribeTable(, "sensor_stream", "alert_handler", -1,
    def(msg) {
        // Temperature alerts
        temp_alerts = select device_id, timestamp,
                             `temperature_high as alert_type,
                             temperature as value
                      from msg
                      where temperature > 30
        // Humidity alerts
        humidity_alerts = select device_id, timestamp,
                                 `humidity_low as alert_type,
                                 humidity as value
                          from msg
                          where humidity < 40
        alert_table.append!(temp_alerts)
        alert_table.append!(humidity_alerts)
    }, true)
// Write a row that triggers the temperature alert (35.0 > 30)
sensor_stream.append!(
    table(
        1 as device_id,
        now() as timestamp,
        35.0 as temperature,
        50.0 as humidity
    )
)
// Handlers run asynchronously, so wait briefly before checking the alerts
sleep(500)
select * from alert_table
4.2 Real-Time Aggregation
dolphindb
// Create the aggregation result table
share table(1:0,
    `time_window`device_id`avg_temp`max_temp`min_temp`cnt,
    [TIMESTAMP, INT, DOUBLE, DOUBLE, DOUBLE, LONG]) as agg_result
// Subscription: real-time aggregation
subscribeTable(, "sensor_stream", "agg_handler", -1,
    def(msg) {
        // Aggregate per device; the group-by column device_id
        // is included in the result automatically
        agg = select avg(temperature) as avg_temp,
                     max(temperature) as max_temp,
                     min(temperature) as min_temp,
                     count(*) as cnt
              from msg
              group by device_id
        agg_result.append!(
            select now() as time_window, * from agg
        )
    }, true)
4.3 Real-Time Writes to a Distributed Table
dolphindb
// Create the distributed table, value-partitioned on device_id
db = database("dfs://stream_db", VALUE, 1..100)
schema_tb = table(1:0, `device_id`timestamp`temperature`humidity,
    [INT, TIMESTAMP, DOUBLE, DOUBLE])
db.createPartitionedTable(schema_tb, `sensor_data, `device_id)
// Subscription: write into the distributed table
// msgAsTable=true, batchSize=10000 rows, throttle=1 second
subscribeTable(, "sensor_stream", "persist_handler", -1,
    def(msg) {
        loadTable("dfs://stream_db", "sensor_data").append!(msg)
    }, true, 10000, 1)
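Batching writes this way trades a little latency for far fewer, larger writes to the distributed table. The Python sketch below models the idea under one stated assumption: the handler fires once `batch_size` rows are waiting, or once `throttle_s` seconds have passed since the last flush, whichever comes first. `MicroBatcher` is a hypothetical class, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import Callable, List


class MicroBatcher:
    """Buffers rows and flushes when batch_size rows are waiting or when
    throttle_s seconds have passed since the last flush."""

    def __init__(self, handler: Callable[[List[dict]], None],
                 batch_size: int, throttle_s: float,
                 start_time: float = 0.0) -> None:
        self.handler = handler
        self.batch_size = batch_size
        self.throttle_s = throttle_s
        self.buffer: List[dict] = []
        self.last_flush = start_time

    def on_row(self, row: dict, now: float) -> None:
        self.buffer.append(row)
        self._maybe_flush(now)

    def on_tick(self, now: float) -> None:
        """Called periodically so a quiet stream still gets flushed."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        if not self.buffer:
            return
        size_hit = len(self.buffer) >= self.batch_size
        time_hit = now - self.last_flush >= self.throttle_s
        if size_hit or time_hit:
            self.handler(self.buffer)
            self.buffer = []
            self.last_flush = now


writes: List[int] = []   # size of each batch handed to the "database"
batcher = MicroBatcher(lambda rows: writes.append(len(rows)),
                       batch_size=3, throttle_s=1.0)
for i in range(3):
    batcher.on_row({"device_id": i}, now=0.1 * i)  # third row reaches batch_size
batcher.on_row({"device_id": 3}, now=0.5)          # waits: 1 row, 0.3 s elapsed
batcher.on_tick(now=1.3)                           # 1.1 s elapsed, so flush
print(writes)
```

A larger batch size raises throughput, while the time limit caps how stale a buffered row can become when traffic is light.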
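5. Streaming Engines
5.1 Time-Series Engine
dolphindb
// Input stream table
share streamTable(1:0,
    `device_id`timestamp`temperature,
    [INT, TIMESTAMP, DOUBLE]) as input_stream
// Output table: time column first, then the key column, then one column per metric
share table(1:0,
    `timestamp`device_id`avg_temp`max_temp`min_temp,
    [TIMESTAMP, INT, DOUBLE, DOUBLE, DOUBLE]) as output_table
// Create a time-series aggregation engine.
// windowSize and step are both 60000 ms, giving non-overlapping 60-second windows.
agg = createTimeSeriesEngine(name="ts_engine",
    windowSize=60000, step=60000,
    metrics=<[avg(temperature), max(temperature), min(temperature)]>,
    dummyTable=input_stream, outputTable=output_table,
    timeColumn=`timestamp, keyColumn=`device_id)
// Subscribe the engine to the stream table
subscribeTable(, "input_stream", "ts_agg", -1, append!{agg}, true)
// Write data
input_stream.append!(
    table(
        1 as device_id,
        now() as timestamp,
        25.0 as temperature
    )
)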
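To show what a 60-second windowed aggregation computes, here is a simplified Python model. It is not the engine's implementation: it processes a finite list rather than an unbounded stream, labels each window by its start time for simplicity, and the function name `tumbling_window_agg` is made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Reading = Tuple[int, int, float]                 # (device_id, timestamp_ms, temperature)
Summary = Tuple[int, int, float, float, float]   # (window_start_ms, device_id, avg, max, min)


def tumbling_window_agg(readings: List[Reading],
                        window_ms: int = 60_000) -> List[Summary]:
    """Group readings into fixed, non-overlapping windows per device."""
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for device_id, ts, temp in readings:
        window_start = ts - ts % window_ms   # align to a window boundary
        buckets[(window_start, device_id)].append(temp)
    return [
        (start, dev, sum(vals) / len(vals), max(vals), min(vals))
        for (start, dev), vals in sorted(buckets.items())
    ]


rows = tumbling_window_agg([
    (1, 5_000, 24.0),
    (1, 30_000, 26.0),
    (2, 45_000, 30.0),
    (1, 61_000, 28.0),   # falls into the second window
])
for row in rows:
    print(row)
```

Each (window, device) pair yields exactly one output row, which is why the output table carries the time column and the key column ahead of the metrics.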
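5.2 Cross-Sectional Engine
dolphindb
// Input stream table
share streamTable(1:0,
    `device_id`timestamp`temperature,
    [INT, TIMESTAMP, DOUBLE]) as input_stream
// Output table: time column first, then one column per metric
share table(1:0,
    `timestamp`avg_temp`max_temp`device_count,
    [TIMESTAMP, DOUBLE, DOUBLE, INT]) as output_table
// Create a cross-sectional aggregation engine.
// It keeps the latest row per device_id and computes the metrics across all devices.
agg = createCrossSectionalEngine(name="cs_engine",
    metrics=<[avg(temperature), max(temperature), count(temperature)]>,
    dummyTable=input_stream, outputTable=output_table,
    keyColumn=`device_id, triggeringPattern="perBatch")
// Subscribe the engine to the stream table
subscribeTable(, "input_stream", "cs_agg", -1, append!{agg}, true)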
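A cross-sectional calculation differs from a windowed one: it looks at the most recent value of every device at a single moment. The Python sketch below models that idea, recomputing after each batch. `CrossSection` is a hypothetical class, and the real engine offers further triggering options that are not modelled here.

```python
from typing import Dict, List, Tuple


class CrossSection:
    """Keeps only the latest temperature per device and summarizes all devices."""

    def __init__(self) -> None:
        self.latest: Dict[int, float] = {}

    def on_batch(self, batch: List[Tuple[int, float]]) -> Tuple[float, float, int]:
        """Apply a batch of (device_id, temperature) rows, then return
        (average, maximum, device count) over the current snapshot."""
        for device_id, temperature in batch:
            self.latest[device_id] = temperature   # newer rows overwrite older ones
        if not self.latest:
            raise ValueError("no device has reported yet")
        temps = list(self.latest.values())
        return (sum(temps) / len(temps), max(temps), len(temps))


cs = CrossSection()
first = cs.on_batch([(1, 20.0), (2, 30.0)])
second = cs.on_batch([(1, 40.0), (3, 20.0)])   # device 1 updated, device 3 is new
print(first)
print(second)
```

Device 2 sent nothing in the second batch, yet its last reading still counts toward the snapshot, which is the defining property of a cross-section.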
5.3 Anomaly Detection Engine
dolphindb
// Input stream table
share streamTable(1:0,
    `device_id`timestamp`temperature,
    [INT, TIMESTAMP, DOUBLE]) as input_stream
// Output table: time column, key column, index of the violated rule, text of the rule
share table(1:0,
    `timestamp`device_id`anomaly_type`anomaly_string,
    [TIMESTAMP, INT, INT, SYMBOL]) as anomaly_table
// Create the anomaly detection engine.
// Rule 0: temperature > 30; rule 1: temperature < 10
agg = createAnomalyDetectionEngine(name="anomaly_engine",
    metrics=<[temperature > 30, temperature < 10]>,
    dummyTable=input_stream, outputTable=anomaly_table,
    timeColumn=`timestamp, keyColumn=`device_id)
// Subscribe the engine to the stream table
subscribeTable(, "input_stream", "anomaly_detect", -1, append!{agg}, true)
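Rule-based detection of this kind reduces to evaluating a list of boolean conditions against every incoming row. The following Python sketch shows the idea with the two temperature thresholds used above; `RULES` and `detect` are illustrative names, and the output layout (time, device, rule index, rule text) is an assumption chosen for readability.

```python
from typing import Callable, List, Tuple

Rule = Tuple[str, Callable[[float], bool]]

# Each rule pairs a human-readable description with its predicate.
RULES: List[Rule] = [
    ("temperature > 30", lambda t: t > 30),
    ("temperature < 10", lambda t: t < 10),
]


def detect(readings: List[Tuple[int, int, float]]) -> List[Tuple[int, int, int, str]]:
    """Take (device_id, timestamp_ms, temperature) rows and return one
    (timestamp_ms, device_id, rule_index, rule_text) row per violated rule."""
    anomalies = []
    for device_id, ts, temp in readings:
        for index, (text, predicate) in enumerate(RULES):
            if predicate(temp):
                anomalies.append((ts, device_id, index, text))
    return anomalies


found = detect([
    (1, 1_000, 35.0),   # violates rule 0
    (2, 2_000, 20.0),   # normal
    (3, 3_000, 5.0),    # violates rule 1
])
for row in found:
    print(row)
```

Storing the rule index alongside its text keeps the output compact to filter on while staying readable in a dashboard.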
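6. Monitoring Stream Computing
6.1 Checking Stream Table Status
dolphindb
// Inspect overall streaming status (returns a dictionary of status tables)
getStreamingStat()
// Inspect the persistence metadata of a stream table
getPersistenceMeta(sensor_stream)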
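6.2 Checking Subscription Status
dolphindb
// Inspect subscriber-side status: one row per subscription worker
getStreamingStat().subWorkers
// Inspect publisher-side status: one row per published table
getStreamingStat().pubTables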
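6.3 Monitoring Streaming Performance
dolphindb
// Report the number of subscriptions and how far they lag behind
def monitorStreamLatency() {
    stat = getStreamingStat()
    workers = stat.subWorkers
    print("Subscriptions: " + string(workers.rows()))
    // queueDepth is the number of messages waiting in each worker's queue
    print("Max queue depth: " + string(max(workers.queueDepth)))
    // subConns reports message latency per publisher connection
    print(stat.subConns)
}
monitorStreamLatency()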
7. Case Study
7.1 Real-Time Device Monitoring System
dolphindb
// ========== 1. Create the stream table ==========
share streamTable(1:0,
    `device_id`timestamp`temperature`humidity`pressure,
    [INT, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE]) as sensor_stream
// ========== 2. Create the result tables ==========
// Live data table
share table(1:0,
    `device_id`timestamp`temperature`humidity`pressure,
    [INT, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE]) as realtime_data
// Alert table
share table(1:0,
    `device_id`timestamp`alert_type`alert_level`value,
    [INT, TIMESTAMP, SYMBOL, INT, DOUBLE]) as alert_table
// Statistics table
share table(1:0,
    `time_window`device_id`avg_temp`max_temp`min_temp`cnt,
    [TIMESTAMP, INT, DOUBLE, DOUBLE, DOUBLE, LONG]) as stats_table
// ========== 3. Subscriptions ==========
// Live data subscription
subscribeTable(, "sensor_stream", "realtime_handler", -1,
    def(msg) {
        realtime_data.append!(msg)
    }, true)
// Alert subscription
subscribeTable(, "sensor_stream", "alert_handler", -1,
    def(msg) {
        alerts = select device_id, timestamp,
                        `temperature_high as alert_type,
                        2 as alert_level,
                        temperature as value
                 from msg
                 where temperature > 30
        alert_table.append!(alerts)
    }, true)
// Statistics subscription
subscribeTable(, "sensor_stream", "stats_handler", -1,
    def(msg) {
        // The group-by column device_id is included in the result automatically
        stats = select avg(temperature) as avg_temp,
                       max(temperature) as max_temp,
                       min(temperature) as min_temp,
                       count(*) as cnt
                from msg
                group by device_id
        stats_table.append!(
            select now() as time_window, * from stats
        )
    }, true)
// ========== 4. Enable persistence ==========
// asynWrite=true, compress=true, cacheSize=1000000 rows
// (persistenceDir must be set in the server configuration)
enableTablePersistence(sensor_stream, true, true, 1000000)
print("Real-time device monitoring system started")
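8. Summary
This article introduced the basics of stream computing in DolphinDB:
- Stream computing concepts: a model for processing data in real time
- Stream table operations: creating, writing, persisting
- Publish/subscribe: the data distribution mechanism
- Real-time processing: filtering, aggregation, writing
- Streaming engines: time-series, cross-sectional, anomaly detection
- Monitoring and operations: checking status, monitoring performance
Questions to consider:
- How do stream computing and batch processing differ?
- How would you design a real-time alerting system?
- What is stream table persistence for?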