Contents
- Abstract
- 1. Kafka Overview
  - 1.1 What Is Kafka
  - 1.2 Kafka Features
  - 1.3 Core Concepts
- 2. The DolphinDB Kafka Plugin
  - 2.1 Installing the Plugin
  - 2.2 Consumer Configuration
- 3. Creating a Consumer
  - 3.1 Basic Consumer
  - 3.2 Consuming Messages
  - 3.3 Batch Consumption
- 4. Data Parsing
  - 4.1 JSON Parsing
  - 4.2 Avro Parsing
  - 4.3 Custom Formats
- 5. Offset Management
  - 5.1 Committing Offsets Manually
  - 5.2 Consuming from a Specific Offset
  - 5.3 Storing Offsets
- 6. High-Availability Deployment
  - 6.1 Consumer Groups
  - 6.2 Reconnecting After a Disconnect
- 7. Case Study
  - 7.1 Real-Time Data Collection System
- 8. Summary
- References
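Abstract
This article explains how to ingest Kafka data into DolphinDB. It covers Kafka fundamentals, plugin and consumer configuration, message parsing, batch consumption, offset management, and high-availability deployment, with code examples for each step.
1. Kafka Overview
1.1 What Is Kafka
Kafka is a distributed message queue system: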
[Figure: Kafka architecture. Nodes shown: producers, Kafka Broker, Consumer 1, Consumer 2, DolphinDB]
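1.2 Kafka Features
| Feature | Description |
|---|---|
| High throughput | Millions of messages per second |
| Persistence | Messages are stored durably on disk |
| Distributed | Scales horizontally |
| High availability | Replication mechanism |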
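1.3 Core Concepts
| Concept | Description |
|---|---|
| Topic | A named category that messages are published to |
| Partition | An ordered shard of a topic; the unit of parallelism |
| Consumer Group | A set of consumers that share the partitions of a topic |
| Offset | The position of a message within a partition |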
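To make the relationship between these four concepts concrete, here is a minimal in-memory sketch in Python. It is not Kafka or DolphinDB code: the `Topic` and `GroupOffsets` classes are hypothetical names invented for illustration, and the model assumes a single process with no replication.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for a Kafka topic: one append-only list per partition."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, value):
        """Append a message and return its offset (its index in the partition)."""
        log = self.partitions[partition]
        log.append(value)
        return len(log) - 1


class GroupOffsets:
    """Read positions, tracked separately for each (group, partition)."""

    def __init__(self):
        self._next = defaultdict(int)  # (group, partition) -> next offset to read

    def poll(self, topic, group, partition, max_messages=10):
        """Return unread messages for this group without committing them."""
        start = self._next[(group, partition)]
        return topic.partitions[partition][start:start + max_messages]

    def commit(self, group, partition, count):
        """Advance the group's position by the number of processed messages."""
        self._next[(group, partition)] += count


topic = Topic("sensor_data", num_partitions=2)
first = topic.append(0, "a")
second = topic.append(0, "b")

offsets = GroupOffsets()
batch = offsets.poll(topic, "dolphindb_group", 0)
offsets.commit("dolphindb_group", 0, len(batch))
after_commit = offsets.poll(topic, "dolphindb_group", 0)
other_group = offsets.poll(topic, "another_group", 0)
```

Each group keeps its own position, so `another_group` still sees both messages after `dolphindb_group` has committed them.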
2. The DolphinDB Kafka Plugin
2.1 Installing the Plugin
python
// Load the Kafka plugin
loadPlugin("kafka")
// List the functions the plugin provides
kafka::getPluginFunctions()
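2.2 Consumer Configuration
python
// Kafka consumer configuration.
// dict(STRING, ANY) creates an empty dictionary; entries are then set by key.
config = dict(STRING, ANY)
config["bootstrap.servers"] = "localhost:9092"
config["group.id"] = "dolphindb_consumer"
config["auto.offset.reset"] = "earliest"
config["enable.auto.commit"] = "false"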
3. Creating a Consumer
3.1 Basic Consumer
python
// Create a consumer
consumer = kafka::consumer("localhost:9092", "dolphindb_group")
// Subscribe to a topic
kafka::subscribe(consumer, "sensor_data")
// Check the current subscription
kafka::subscription(consumer)
3.2 Consuming Messages
python
// Create a stream table to receive the data
share streamTable(1:0,
    `device_id`timestamp`temperature`humidity,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE]) as kafka_stream
// Consume messages
kafka::consume(consumer, "sensor_data", kafka_stream,
    def(msg) {
        // Parse the JSON message
        data = parseJson(msg.value)
        return table(
            data.device_id as device_id,
            data.timestamp as timestamp,
            data.temperature as temperature,
            data.humidity as humidity
        )
    })
3.3 Batch Consumption
python
// Batch consumption: the last two arguments set the batch size and the throttle
kafka::consume(consumer, "sensor_data", kafka_stream,
    def(msg) {
        data = parseJson(msg.value)
        return table(
            data.device_id as device_id,
            data.timestamp as timestamp,
            data.temperature as temperature,
            data.humidity as humidity
        )
    },
    1000,  // batchSize
    5000)  // throttle (ms)
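The interplay of a batch size and a throttle can be sketched independently of the plugin. The Python example below assumes the usual meaning of the two parameters: flush when `batch_size` rows have accumulated, or when `throttle_ms` has elapsed since the last flush, whichever comes first. The `Batcher` class is a hypothetical helper, not part of DolphinDB, and the injected `clock` exists only so the behavior can be demonstrated without waiting.

```python
import time


class Batcher:
    """Collect rows and flush when batch_size is reached or throttle_ms has elapsed."""

    def __init__(self, flush, batch_size=1000, throttle_ms=5000, clock=time.monotonic):
        self.flush = flush
        self.batch_size = batch_size
        self.throttle_s = throttle_ms / 1000.0
        self.clock = clock
        self.rows = []
        self.last_flush = clock()

    def add(self, row):
        self.rows.append(row)
        self._maybe_flush()

    def tick(self):
        """Call periodically so a partial batch is still flushed after the throttle."""
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.rows) >= self.batch_size
        stale = bool(self.rows) and (self.clock() - self.last_flush) >= self.throttle_s
        if full or stale:
            self.flush(self.rows)
            self.rows = []
            self.last_flush = self.clock()


flushed = []
now = [0.0]  # fake clock, in seconds
batcher = Batcher(flushed.append, batch_size=3, throttle_ms=5000, clock=lambda: now[0])
for i in range(4):
    batcher.add(i)   # the third add fills the batch and triggers a flush
now[0] = 6.0         # simulate six seconds passing
batcher.tick()       # the leftover row is flushed because the throttle expired
```

A larger batch size favors throughput, while a shorter throttle bounds how long a row can wait when traffic is light.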
4. Data Parsing
4.1 JSON Parsing
python
// JSON message format
/*
{
"device_id": "D001",
"timestamp": "2024-01-01T00:00:00",
"temperature": 25.5,
"humidity": 50.0
}
*/
// Parsing function: converts each field to the column type of the target table
def parseJsonMessage(msg) {
    data = parseJson(msg.value)
    return table(
        data.device_id as device_id,
        timestamp(data.timestamp) as timestamp,
        double(data.temperature) as temperature,
        double(data.humidity) as humidity
    )
}
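The parser above assumes every message is well formed. As a language-neutral illustration of the same step with validation added, the Python sketch below rejects payloads that are not valid JSON, lack a field, or carry a value of the wrong type. The function name `parse_sensor_json` and the choice to return `None` for bad input are assumptions made for this example.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("device_id", "timestamp", "temperature", "humidity")


def parse_sensor_json(raw):
    """Convert one JSON payload into a typed row, or return None if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(f not in data for f in REQUIRED_FIELDS):
        return None
    try:
        return (
            str(data["device_id"]),
            datetime.fromisoformat(data["timestamp"]),
            float(data["temperature"]),
            float(data["humidity"]),
        )
    except (TypeError, ValueError):
        return None


good = parse_sensor_json(
    '{"device_id": "D001", "timestamp": "2024-01-01T00:00:00", '
    '"temperature": 25.5, "humidity": 50.0}'
)
bad = parse_sensor_json('{"device_id": "D001"}')  # missing fields
broken = parse_sensor_json("not json")            # not JSON at all
```

Returning `None` lets the caller decide whether to drop, log, or route bad messages to a separate queue instead of aborting the whole consumer.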
4.2 Avro Parsing
python
// Avro parsing
def parseAvroMessage(msg, schema) {
    // Decode the payload using the Avro schema
    data = avroDecode(msg.value, schema)
    return table(
        data.device_id as device_id,
        data.timestamp as timestamp,
        data.temperature as temperature,
        data.humidity as humidity
    )
}
4.3 Custom Formats
python
// Parsing a custom format
def parseCustomMessage(msg) {
    // Assumed format: device_id,timestamp,temperature,humidity
    parts = split(msg.value, ",")
    return table(
        parts[0] as device_id,
        timestamp(parts[1]) as timestamp,
        double(parts[2]) as temperature,
        double(parts[3]) as humidity
    )
}
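A delimited format has no schema, so a parser usually needs to check the field count and the numeric conversions itself. The Python sketch below shows one way to do that for the comma-separated layout assumed above. `parse_sensor_csv` and `parse_lines` are hypothetical helpers, and the ISO 8601 timestamp format is an assumption carried over from the JSON example.

```python
from datetime import datetime


def parse_sensor_csv(line):
    """Parse 'device_id,timestamp,temperature,humidity'; return None if malformed."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4 or not parts[0]:
        return None
    try:
        return (
            parts[0],
            datetime.fromisoformat(parts[1]),
            float(parts[2]),
            float(parts[3]),
        )
    except ValueError:
        return None


def parse_lines(lines):
    """Split the input into parsed rows and rejected raw lines."""
    rows, rejected = [], []
    for line in lines:
        row = parse_sensor_csv(line)
        if row is None:
            rejected.append(line)
        else:
            rows.append(row)
    return rows, rejected


rows, rejected = parse_lines([
    "D001,2024-01-01T00:00:00,25.5,50.0",
    "D002,2024-01-01T00:00:01,abc,50.0",  # non-numeric temperature
    "D003,2024-01-01T00:00:02,26.0",      # missing field
])
```

Keeping the rejected lines makes it possible to inspect or replay them later rather than losing them silently.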
5. Offset Management
5.1 Committing Offsets Manually
python
// Commit offsets synchronously
kafka::commitSync(consumer)
// Commit offsets asynchronously
kafka::commitAsync(consumer)
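Manual commits matter because of their ordering relative to processing: committing only after a message has been handled gives at-least-once delivery, at the cost of possible redelivery after a failure. The Python sketch below illustrates that ordering together with a simple duplicate check keyed on (partition, offset). It is a model of the idea, not plugin code; `process_then_commit` and its arguments are hypothetical.

```python
def process_then_commit(messages, handle, committed, seen):
    """Handle each message, then advance the committed offset.

    messages  : list of (partition, offset, value) tuples, in offset order
    handle    : callback that stores the value; may raise on failure
    committed : dict partition -> next offset to read (updated in place)
    seen      : set of (partition, offset) already handled, used to skip duplicates
    """
    for partition, offset, value in messages:
        if (partition, offset) in seen:
            # Redelivered after a crash or retry: already handled, just commit.
            committed[partition] = max(committed.get(partition, 0), offset + 1)
            continue
        handle(value)                      # if this raises, nothing is committed
        seen.add((partition, offset))
        committed[partition] = offset + 1  # commit only after handling succeeded


stored = []
committed, seen = {}, set()
batch = [(0, 0, "a"), (0, 1, "b")]
process_then_commit(batch, stored.append, committed, seen)
# Simulate redelivery of the same batch plus one new message.
process_then_commit(batch + [(0, 2, "c")], stored.append, committed, seen)
```

After both calls `stored` holds each value once, even though the first two messages were delivered twice.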
5.2 Consuming from a Specific Offset
python
// Start consuming from a specific offset
kafka::seek(consumer, "sensor_data", 0, 1000)  // partition 0, offset 1000
// Start from the earliest available message
kafka::seekToBeginning(consumer, "sensor_data")
// Start from the latest message
kafka::seekToEnd(consumer, "sensor_data")
5.3 Storing Offsets
python
// Store offsets in a DolphinDB table
share table(1:0,
    `topic`partition`offset`timestamp,
    [STRING, INT, LONG, TIMESTAMP]) as offset_table
// Append one record per saved offset, stamped with the current time
def saveOffset(topic, partition, offset) {
    insert into offset_table values (topic, partition, offset, now())
}
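Storing offsets is only half of the pattern; on restart the consumer also has to read the stored value back and resume from the next message. The Python sketch below shows both halves using the standard-library `sqlite3` module as a stand-in for the storage table. The `OffsetStore` class is hypothetical, and unlike the append-only table above it keeps just the latest offset per (topic, partition).

```python
import sqlite3


class OffsetStore:
    """Keep the latest processed offset per (topic, partition) in SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS offsets ("
            " topic TEXT NOT NULL,"
            " part INTEGER NOT NULL,"
            " last_offset INTEGER NOT NULL,"
            " PRIMARY KEY (topic, part))"
        )

    def save(self, topic, partition, offset):
        """Record the offset of the last message that was fully processed."""
        self.conn.execute(
            "INSERT OR REPLACE INTO offsets (topic, part, last_offset) VALUES (?, ?, ?)",
            (topic, partition, offset),
        )
        self.conn.commit()

    def resume_position(self, topic, partition, default=0):
        """Return the offset to start reading from after a restart."""
        row = self.conn.execute(
            "SELECT last_offset FROM offsets WHERE topic = ? AND part = ?",
            (topic, partition),
        ).fetchone()
        return default if row is None else row[0] + 1


store = OffsetStore()
fresh = store.resume_position("sensor_data", 0)
store.save("sensor_data", 0, 41)
store.save("sensor_data", 0, 99)  # overwrites the previous row for this partition
resumed = store.resume_position("sensor_data", 0)
```

The resume position is the stored offset plus one, because the stored offset refers to a message that has already been processed.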
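6. High-Availability Deployment
6.1 Consumer Groups
python
// Consumer groups provide load balancing:
// run several consumer instances with the same group.id
// Instance 1
consumer1 = kafka::consumer("localhost:9092", "dolphindb_group")
// Instance 2
consumer2 = kafka::consumer("localhost:9092", "dolphindb_group")
// Kafka assigns the topic's partitions across the instances automatically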
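The effect of automatic partition assignment can be illustrated with a simplified round-robin model in Python. Kafka's real assignors (range, round-robin, sticky, cooperative) are more involved, so treat `assign_partitions` as a hypothetical teaching aid rather than a description of the broker's algorithm.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; returns consumer -> partitions."""
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    if not members:
        return assignment
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment


two = assign_partitions(range(6), ["consumer1", "consumer2"])
after_failure = assign_partitions(range(6), ["consumer2"])
idle = assign_partitions(range(2), ["c1", "c2", "c3"])
```

Two points carry over to real deployments: when a member leaves, its partitions move to the survivors, and a group with more consumers than partitions leaves the extra consumers idle.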
6.2 Reconnecting After a Disconnect
python
// Reconnect after a disconnect: recreate the consumer and retry up to maxRetries times
def consumeWithRetry(brokers, groupId, topic, handler, maxRetries = 5) {
    retries = 0
    while (retries < maxRetries) {
        try {
            consumer = kafka::consumer(brokers, groupId)
            kafka::subscribe(consumer, topic)
            kafka::consume(consumer, topic, handler)
            break
        } catch (ex) {
            retries += 1
            print("Consume failed, retry " + string(retries))
            // Wait 5 seconds before the next attempt
            sleep(5000)
        }
    }
}
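A fixed 5-second wait works, but a common refinement is exponential backoff with a cap, combined with re-raising the error once the retry budget is exhausted so the failure is not silent. The Python sketch below shows that variant. `run_with_retry` and its parameters are hypothetical, and the injectable `sleep` is there only to make the behavior observable without real delays.

```python
import time


def run_with_retry(connect_and_consume, max_retries=5, base_delay=1.0,
                   max_delay=30.0, sleep=time.sleep):
    """Call connect_and_consume until it returns; back off exponentially on failure.

    Returns the number of failed attempts before success, and re-raises the last
    error once max_retries failures have occurred.
    """
    failures = 0
    while True:
        try:
            connect_and_consume()
            return failures
        except Exception:
            failures += 1
            if failures >= max_retries:
                raise
            # Delays grow as base_delay * 1, 2, 4, ... up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))


attempts = {"n": 0}
delays = []


def flaky():
    """Fail twice, then succeed, to imitate a broker that comes back."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("broker unavailable")


failed = run_with_retry(flaky, sleep=delays.append)
```

Capping the delay keeps recovery time bounded, while the growing interval avoids hammering a broker that is still restarting.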
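7. Case Study
7.1 Real-Time Data Collection System
python
// ========== Kafka real-time data collection system ==========
// 1. Create the distributed (DFS) table.
// The partitioning column device_id is a SYMBOL, so the database is HASH-partitioned
// on SYMBOL; the bucket count of 50 is an illustrative choice.
db = database("dfs://kafka_db", HASH, [SYMBOL, 50])
schema = table(1:0,
    `device_id`timestamp`temperature`humidity`pressure,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE])
db.createPartitionedTable(schema, `sensor_data, `device_id)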
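// 2. Create the stream table
share streamTable(100000:0,
    `device_id`timestamp`temperature`humidity`pressure,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE]) as kafka_stream
// 3. Enable persistence
enableTablePersistence(kafka_stream, true, true, 1000000)
// 4. Subscribe to the stream table and write into the distributed table.
// Named arguments keep batchSize and throttle out of the msgAsTable position;
// throttle is given in seconds.
def persistToDfs(msg) {
    loadTable("dfs://kafka_db", "sensor_data").append!(msg)
}
subscribeTable(tableName="kafka_stream", actionName="persist", offset=-1,
    handler=persistToDfs, msgAsTable=true, batchSize=10000, throttle=5)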
// 5. Create the Kafka consumer
consumer = kafka::consumer("localhost:9092", "dolphindb_iot")
// 6. Subscribe to the topic
kafka::subscribe(consumer, "iot_sensor_data")
// 7. Consume messages into the stream table
kafka::consume(consumer, "iot_sensor_data", kafka_stream,
    def(msg) {
        data = parseJson(msg.value)
        return table(
            data.device_id as device_id,
            timestamp(data.timestamp) as timestamp,
            double(data.temperature) as temperature,
            double(data.humidity) as humidity,
            double(data.pressure) as pressure
        )
    }, 1000, 5000)
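// 8. Monitoring: compare row counts in the stream table and the distributed table
def monitorKafka() {
    print("=== Kafka consumption monitor ===")
    print("Stream table rows: " + string(exec count(*) from kafka_stream))
    t = loadTable("dfs://kafka_db", "sensor_data")
    print("Distributed table rows: " + string(exec count(*) from t))
}
monitorKafka()
print("Kafka real-time data collection system started")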
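8. Summary
This article covered Kafka data ingestion in DolphinDB:
- Kafka fundamentals: message queues, topics, partitions
- Plugin configuration: consumer configuration, connection management
- Message consumption: basic and batch consumption
- Data parsing: JSON, Avro, custom formats
- Offset management: manual commits, consuming from a specific offset
- High availability: consumer groups, reconnecting after a disconnect
Questions for further thought:
- What is the purpose of a Kafka consumer group?
- How can you ensure that no messages are lost?
- How should duplicate messages be handled?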