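Contents
- Abstract
- 1. High Availability Overview
- [1.1 What Is High Availability](#1.1 What Is High Availability)
- [1.2 Availability Metrics](#1.2 Availability Metrics)
- [1.3 High Availability Design Principles](#1.3 High Availability Design Principles)
- 2. Failure Detection
- [2.1 Heartbeat Detection](#2.1 Heartbeat Detection)
- [2.2 Health Checks](#2.2 Health Checks)
- [2.3 Failure Alerts](#2.3 Failure Alerts)
- 3. Data Backup
- [3.1 Backup Strategy](#3.1 Backup Strategy)
- [3.2 Scheduled Backup Jobs](#3.2 Scheduled Backup Jobs)
- [3.3 Backup Cleanup](#3.3 Backup Cleanup)
- 4. Resumable Collection
- [4.1 Recording the Collection Position](#4.1 Recording the Collection Position)
- [4.2 Implementing Resume](#4.2 Implementing Resume)
- [4.3 Data Buffering](#4.3 Data Buffering)
- 5. Primary/Standby Failover
- [5.1 Primary/Standby Architecture](#5.1 Primary/Standby Architecture)
- [5.2 Failure Detection and Failover](#5.2 Failure Detection and Failover)
- [5.3 Data Synchronization](#5.3 Data Synchronization)
- 6. Disaster Recovery
- [6.1 Disaster Recovery Procedure](#6.1 Disaster Recovery Procedure)
- [6.2 Data Validation](#6.2 Data Validation)
- [6.3 Recovery Testing](#6.3 Recovery Testing)
- 7. Monitoring and Alerting
- [7.1 Monitoring Metrics](#7.1 Monitoring Metrics)
- [7.2 Alert Rules](#7.2 Alert Rules)
- 8. Worked Example
- [8.1 A Highly Available Data Collection System](#8.1 A Highly Available Data Collection System)
- 9. Summary
- References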
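Abstract
This article covers high-availability techniques for data collection with DolphinDB: fault-tolerant design, failure detection, data backup, resumable collection, primary/standby failover, and disaster recovery. Each topic comes with code examples that illustrate the fault-tolerance and recovery technique involved.
1. High Availability Overview
1.1 What Is High Availability
High availability (HA) means that a system keeps providing service even when some of its components fail: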
[Figure: high-availability architecture. Nodes: failure detection, failover, service recovery. Safeguards: redundant design, data backup, automatic recovery.]
1.2 Availability Metrics
| Level | Availability | Annual downtime |
|---|---|---|
| Two nines | 99% | 87.6 hours |
| Three nines | 99.9% | 8.76 hours |
| Four nines | 99.99% | 52.6 minutes |
| Five nines | 99.999% | 5.26 minutes |
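The downtime column follows directly from the availability percentage. The sketch below reproduces the arithmetic in Python (used here instead of DolphinDB script so that it runs standalone). It assumes a 365-day year; the function name `annual_downtime_hours` is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} h ({hours * 60:.1f} min)")
```

Each extra nine divides the permitted downtime by ten, which is why moving from three to four nines usually requires automated failover rather than manual intervention.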
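1.3 High Availability Design Principles
| Principle | Description |
|---|---|
| Redundancy | Run redundant copies of critical components |
| Fault isolation | A failure in one component does not bring down the whole system |
| Fast recovery | Fail over automatically |
| Data protection | No data is lost |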
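The redundancy principle can be quantified. As a rough sketch, and assuming replicas fail independently and a single healthy replica is enough to serve requests, the combined availability of `n` replicas is `1 - (1 - a)^n`. The Python function below is illustrative; its name and the independence assumption are not from the original text.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies, any one of which suffices.

    `single` is the availability of one copy as a fraction, e.g. 0.99.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

In practice failures are often correlated (shared power, network, or software bugs), so the real figure is lower than this estimate.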
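2. Failure Detection
2.1 Heartbeat Detection
python
// Heartbeat table: one row is appended per heartbeat
share table(1:0,
    `node_id`last_heartbeat`status,
    [STRING, TIMESTAMP, STRING]) as heartbeat_table
// Send a heartbeat
def sendHeartbeat(nodeId) {
    insert into heartbeat_table values (nodeId, now(), "alive")
}
// Return the nodes whose most recent heartbeat is older than timeout (ms).
// Every heartbeat appends a row, so the check must use the latest row
// per node; filtering all rows would also report healthy nodes.
def checkHeartbeat(timeout = 30000) {
    threshold = now() - timeout
    latest = select max(last_heartbeat) as last_heartbeat from heartbeat_table group by node_id
    deadNodes = select node_id from latest where last_heartbeat < threshold
    return deadNodes
}
// Send a heartbeat every interval (ms)
def heartbeatTask(nodeId, interval = 10000) {
    while (true) {
        sendHeartbeat(nodeId)
        sleep(interval)
    }
}
// Start the heartbeat as a background job; job arguments follow the function
submitJob("heartbeat", "heartbeat sender", heartbeatTask, "node_001")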
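The key point of heartbeat detection is that a node counts as dead only when its most recent heartbeat is too old. A minimal Python sketch of that rule follows (Python rather than DolphinDB script so it can be run on its own). The function name `dead_nodes` and the use of seconds as the time unit are assumptions made for this illustration.

```python
from typing import Dict, Iterable, List, Tuple


def dead_nodes(
    heartbeats: Iterable[Tuple[str, float]], now: float, timeout: float = 30.0
) -> List[str]:
    """Return nodes whose latest heartbeat is more than `timeout` seconds old.

    `heartbeats` is an append-only log of (node_id, timestamp) pairs.
    """
    latest: Dict[str, float] = {}
    for node_id, ts in heartbeats:
        # Keep only the newest timestamp seen for each node.
        latest[node_id] = max(ts, latest.get(node_id, ts))
    return sorted(node for node, ts in latest.items() if now - ts > timeout)


log = [("node_001", 0.0), ("node_002", 0.0), ("node_001", 50.0)]
print(dead_nodes(log, now=60.0))
```

Here `node_001` sent a heartbeat 10 seconds before `now`, so only `node_002` is reported, even though the log still contains an old row for `node_001`.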
2.2 Health Checks
python
// System health check.
// getMemoryStatus, getDiskStatus, getConnectionStatus and getSubscriptionStat
// are kept from the original and assumed to be helpers that you provide.
def healthCheck() {
    report = dict(STRING, ANY)
    // Memory
    memStatus = getMemoryStatus()
    mem = dict(STRING, ANY)
    mem["used"] = memStatus.used
    mem["total"] = memStatus.total
    mem["rate"] = memStatus.used * 100.0 / memStatus.total
    report["memory"] = mem
    // Disk
    diskStatus = getDiskStatus()
    disk = dict(STRING, ANY)
    disk["free"] = diskStatus.free
    disk["total"] = diskStatus.total
    report["disk"] = disk
    // Connections
    report["connections"] = getConnectionStatus()
    // Subscriptions
    report["subscriptions"] = getSubscriptionStat().rows()
    return report
}
// Scheduled health check
def scheduledHealthCheck() {
    report = healthCheck()
    // Alert when memory usage exceeds 80%
    memRate = report["memory"]["rate"]
    if (memRate > 80) {
        // sendAlert is defined in section 2.3
        sendAlert("memory", 2, "Memory usage too high: " + string(memRate) + "%")
    }
    print(report)
}
// Run daily at 00:05; scheduleTime is a MINUTE literal, hence the m suffix
scheduleJob("health_check", "health check", scheduledHealthCheck,
    00:05m, 2024.01.01, 2030.12.31, 'D')
2.3 Failure Alerts
python
// Alert log
share table(1:0,
    `alert_time`alert_type`alert_level`message`status,
    [TIMESTAMP, STRING, INT, STRING, STRING]) as alert_table
// Deliver the notification. Defined before sendAlert, which calls it.
def notifyAlert(alertType, level, message) {
    // Placeholder: send email / SMS / WeChat here
    print("ALERT: [" + alertType + "] " + message)
}
// Record the alert, then notify
def sendAlert(alertType, level, message) {
    insert into alert_table values (now(), alertType, level, message, "new")
    notifyAlert(alertType, level, message)
}
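3. Data Backup
3.1 Backup Strategy
python
// Backup configuration
backupConfig = dict(STRING, ANY)
backupConfig["fullBackupInterval"] = 86400000    // full backup interval (24 hours, in ms)
backupConfig["incrementalInterval"] = 3600000    // incremental backup interval (1 hour, in ms)
backupConfig["retentionDays"] = 30               // days to keep backups
backupConfig["backupPath"] = "/backup/dolphindb"
// Full backup into a new timestamped directory
def fullBackup(dbPath, backupPath) {
    timestamp = format(now(), "yyyyMMdd_HHmmss")
    backupDir = backupPath + "/full_" + timestamp
    // backup takes the backup directory first, then the database path;
    // force=true copies every partition
    backup(backupDir, dbPath, true)
    print("Full backup finished: " + backupDir)
    return backupDir
}
// Incremental backup. The signature is kept from the original, but
// lastBackupTime is only logged because backup has no "since" argument.
// With force=false, backup copies only the partitions that changed since the
// previous backup in the same directory, so a fixed directory is used here.
def incrementalBackup(dbPath, lastBackupTime, backupPath) {
    backupDir = backupPath + "/incr"
    backup(backupDir, dbPath, false)
    print("Incremental backup finished: " + backupDir + " (previous backup: " + string(lastBackupTime) + ")")
    return backupDir
}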
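The two intervals in the configuration imply a simple decision rule: take a full backup when the last full one is at least 24 hours old, otherwise an incremental one when at least an hour has passed since any backup. The Python sketch below illustrates that rule only; the function name and the convention of passing times as epoch milliseconds are assumptions, not part of the original configuration.

```python
from typing import Optional

FULL_INTERVAL_MS = 86_400_000        # 24 hours, as in fullBackupInterval
INCREMENTAL_INTERVAL_MS = 3_600_000  # 1 hour, as in incrementalInterval


def next_backup_type(
    now_ms: int, last_full_ms: Optional[int], last_any_ms: Optional[int]
) -> Optional[str]:
    """Decide which backup is due: "full", "incremental", or None."""
    # A full backup takes priority: incrementals are only useful on top of one.
    if last_full_ms is None or now_ms - last_full_ms >= FULL_INTERVAL_MS:
        return "full"
    if last_any_ms is None or now_ms - last_any_ms >= INCREMENTAL_INTERVAL_MS:
        return "incremental"
    return None


print(next_backup_type(now_ms=3_600_000, last_full_ms=0, last_any_ms=0))
```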
3.2 Scheduled Backup Jobs
python
// Log of completed backups
share table(1:0,
    `backup_type`backup_time`backup_path,
    [STRING, TIMESTAMP, STRING]) as backup_log
// Scheduled full backup
def scheduledFullBackup() {
    backupDir = fullBackup("dfs://iot_db", "/backup/dolphindb")
    insert into backup_log values ("full", now(), backupDir)
}
// Run daily at 02:00; scheduleTime is a MINUTE literal, hence the m suffix
scheduleJob("full_backup", "full backup", scheduledFullBackup,
    02:00m, 2024.01.01, 2030.12.31, 'D')
// Scheduled incremental backup
def scheduledIncrementalBackup() {
    lastTime = exec max(backup_time) from backup_log
    backupDir = incrementalBackup("dfs://iot_db", lastTime, "/backup/dolphindb")
    insert into backup_log values ("incremental", now(), backupDir)
}
// scheduleJob accepts only 'D', 'W' and 'M' as frequency, so an hourly job is
// written as a daily job with 24 schedule times (00:00, 01:00, ..., 23:00)
scheduleJob("incr_backup", "incremental backup", scheduledIncrementalBackup,
    00:00m + 60 * (0..23), 2024.01.01, 2030.12.31, 'D')
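3.3 Backup Cleanup
python
// Remove expired backups
def cleanupOldBackups(backupPath, retentionDays) {
    // temporalAdd avoids the INT overflow of retentionDays * 86400000
    threshold = temporalAdd(now(), -retentionDays, "d")
    // files() returns one row per entry, with columns that include
    // filename, isDir and lastModified
    backupDirs = files(backupPath)
    for (dir in backupDirs) {
        if (dir["isDir"] and dir["lastModified"] < threshold) {
            // Delete the expired backup directory recursively
            rmdir(backupPath + "/" + dir["filename"], true)
            print("Removed expired backup: " + dir["filename"])
        }
    }
}
// Scheduled cleanup
def scheduledCleanup() {
    cleanupOldBackups("/backup/dolphindb", 30)
}
// Run daily at 03:00
scheduleJob("backup_cleanup", "backup cleanup", scheduledCleanup,
    03:00m, 2024.01.01, 2030.12.31, 'D')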
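Before deleting anything, it helps to compute the list of expired backups separately so it can be reviewed or logged. The Python sketch below shows only that selection step, under the assumption that each backup is described by a name and a last-modified time; the function name `expired_backups` is illustrative.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def expired_backups(
    entries: Iterable[Tuple[str, datetime]], now: datetime, retention_days: int
) -> List[str]:
    """Return the names of backups last modified before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    # Strictly older than the cutoff: a backup exactly at the cutoff is kept.
    return [name for name, modified in entries if modified < cutoff]


entries = [
    ("full_20240101_020000", datetime(2024, 1, 1, 2, 0)),
    ("full_20240220_020000", datetime(2024, 2, 20, 2, 0)),
]
print(expired_backups(entries, now=datetime(2024, 3, 1), retention_days=30))
```

Separating selection from deletion also makes a dry run trivial: print the list first, and call the delete step only once the output looks right.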
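4. Resumable Collection
4.1 Recording the Collection Position
python
// Collection position table
share table(1:0,
    `source`position`update_time,
    [STRING, STRING, TIMESTAMP]) as collection_position
// Record a position. The parameters are named src and pos so that they do not
// collide with the source and position columns: in a where or set clause a
// column name wins, so "where source = source" would match every row.
def recordPosition(src, pos) {
    existing = select * from collection_position where source = src
    if (existing.rows() > 0) {
        update collection_position set position = pos, update_time = now() where source = src
    } else {
        insert into collection_position values (src, pos, now())
    }
}
// Read a position; returns NULL when none has been recorded
def getPosition(src) {
    pos = exec position from collection_position where source = src
    if (pos.size() > 0) {
        return pos[0]
    }
    return NULL
}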
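A position record is only useful for resuming if it survives a process restart. The Python sketch below shows one way to keep such a record on disk: it rewrites a small JSON file and swaps it into place with `os.replace`, so a crash mid-write leaves the previous file intact. The class name `PositionStore` and the file layout are assumptions for illustration, not something the original specifies.

```python
import json
import os
import tempfile


class PositionStore:
    """File-backed map from a source name to its last recorded position."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.positions = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.positions = json.load(f)

    def record(self, source: str, position: str) -> None:
        """Insert or overwrite the position for `source` and persist it."""
        self.positions[source] = position
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self.positions, f)
        os.replace(tmp_path, self.path)  # atomic swap of the whole file

    def get(self, source: str):
        """Return the last position for `source`, or None if never recorded."""
        return self.positions.get(source)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "positions.json")
    store = PositionStore(path)
    store.record("kafka_sensor", "41")
    store.record("kafka_sensor", "42")
    restarted = PositionStore(path)  # simulates a process restart
    print(restarted.get("kafka_sensor"))  # prints 42
```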
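4.2 Implementing Resume
python
// Process one Kafka message, then checkpoint its offset.
// processMessage is assumed to be defined by you.
def handleKafkaMessage(topic, msg) {
    processMessage(msg)
    recordPosition("kafka_" + topic, string(msg.offset))
}
// Resume Kafka consumption.
// kafka::seek, kafka::seekToEnd and kafka::consume are kept as written in the
// original; check them against the function list of your Kafka plugin version.
def kafkaResumeConsume(consumer, topic, positionTable) {
    // Look up the last recorded offset
    lastOffset = getPosition("kafka_" + topic)
    if (not isNull(lastOffset)) {
        // Continue with the message after the last processed one (partition 0)
        kafka::seek(consumer, topic, 0, long(lastOffset) + 1)
    } else {
        // No checkpoint yet: start from the latest message
        kafka::seekToEnd(consumer, topic)
    }
    // Consume; topic is bound by partial application because an anonymous
    // function cannot capture it from the enclosing function
    kafka::consume(consumer, topic, handleKafkaMessage{topic})
}
// Process one MQTT message, then checkpoint the time it was handled
def handleMqttMessage(topic, msg) {
    processMessage(msg)
    recordPosition("mqtt_" + topic, string(now()))
}
// Resume an MQTT subscription.
// mqtt::subscribe is kept as written in the original; verify its arguments
// against your MQTT plugin version.
def mqttResumeSubscribe(conn, topic, positionTable) {
    // Time of the last processed message. It is informational only: the
    // subscription below does not seek to it.
    lastTime = getPosition("mqtt_" + topic)
    mqtt::subscribe(conn, topic, handleMqttMessage{topic})
}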
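The order of operations matters when resuming: the position is recorded after a message has been processed, and the next run starts one past the recorded offset. That gives at-least-once delivery without re-reading the last message. The Python sketch below models this with a plain list standing in for a topic partition; `resume_consume` and the choice to start from offset 0 when no checkpoint exists are assumptions made for the illustration.

```python
from typing import Callable, Dict, List


def resume_consume(
    messages: List[str],
    checkpoint: Dict[str, int],
    source: str,
    process: Callable[[str], None],
) -> None:
    """Process messages after the checkpointed offset, checkpointing each one."""
    last = checkpoint.get(source)
    start = 0 if last is None else last + 1  # resume after the last processed
    for offset in range(start, len(messages)):
        process(messages[offset])
        # Checkpoint only after processing succeeds: a crash in between causes
        # a repeat of this message on restart, never a loss.
        checkpoint[source] = offset


seen: List[str] = []
checkpoint: Dict[str, int] = {}
resume_consume(["a", "b"], checkpoint, "demo", seen.append)            # first run
resume_consume(["a", "b", "c", "d"], checkpoint, "demo", seen.append)  # after restart
print(seen, checkpoint)
```

Because a message can be repeated after a crash, the processing step should be idempotent, for example by writing with a unique key.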
4.3 Data Buffering
python
// Buffer queue
share table(100000:0,
    `source`data`timestamp`synced,
    [STRING, STRING, TIMESTAMP, BOOL]) as data_buffer
// Write a record to the buffer as JSON
def writeToBuffer(source, data) {
    insert into data_buffer values (source, toJson(data), now(), false)
}
// Flush buffered rows to the target
def syncBuffer() {
    unsynced = select * from data_buffer where synced = false limit 10000
    if (unsynced.rows() > 0) {
        try {
            // syncToTarget is assumed to be defined by you;
            // fromJson is the counterpart of toJson
            for (row in unsynced) {
                syncToTarget(row["source"], fromJson(row["data"]))
            }
            // Mark the batch as synced only after every row succeeded
            update data_buffer set synced = true
                where timestamp in unsynced.timestamp
        } catch (ex) {
            print("Sync failed: " + ex)
        }
    }
}
// Run the sync daily at 00:01; scheduleTime is a MINUTE literal
scheduleJob("sync_buffer", "buffer sync", syncBuffer,
    00:01m, 2024.01.01, 2030.12.31, 'D')
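One design question for a buffer is what to do when the target fails partway through a batch. The Python sketch below takes one possible approach, which differs from marking the whole batch at the end: it marks each row as soon as it is delivered and stops at the first failure, so delivered rows are not resent and ordering is kept. The names `sync_buffer` and `sync_to_target` are hypothetical.

```python
from typing import Callable, Dict, List


def sync_buffer(
    buffer: List[Dict],
    sync_to_target: Callable[[str, str], None],
    batch_size: int = 10000,
) -> int:
    """Deliver up to `batch_size` unsynced rows in order; return how many succeeded."""
    pending = [row for row in buffer if not row["synced"]][:batch_size]
    delivered = 0
    for row in pending:
        try:
            sync_to_target(row["source"], row["data"])
        except Exception as exc:
            # Stop at the first failure so later rows are not delivered out of order.
            print(f"sync failed, will retry on the next run: {exc}")
            break
        row["synced"] = True
        delivered += 1
    return delivered


def flaky_target(source: str, data: str) -> None:
    if data == "bad":
        raise RuntimeError("target unavailable")


rows = [
    {"source": "s1", "data": "ok", "synced": False},
    {"source": "s1", "data": "bad", "synced": False},
    {"source": "s1", "data": "later", "synced": False},
]
print(sync_buffer(rows, flaky_target))
```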
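5. Primary/Standby Failover
5.1 Primary/Standby Architecture
python
// Primary/standby configuration
haConfig = dict(STRING, ANY)
haConfig["mode"] = "active-standby"
haConfig["primary"] = "node_001"
haConfig["standby"] = "node_002"
haConfig["heartbeatInterval"] = 5000   // ms between checks
haConfig["failoverThreshold"] = 3      // consecutive failures before failover
// Role and status of each node
share table(1:0,
    `node_id`role`status`last_update,
    [STRING, STRING, STRING, TIMESTAMP]) as ha_status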
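5.2 Failure Detection and Failover
python
// Block until nodeId has failed `threshold` consecutive checks, then return true.
// checkNodeStatus is assumed to be defined by you and to return "down" for an
// unreachable node. interval is the delay between checks in ms.
def detectFailure(nodeId, interval, threshold = 3) {
    failCount = 0
    while (failCount < threshold) {
        status = checkNodeStatus(nodeId)
        if (status == "down") {
            failCount += 1
            print("Node " + nodeId + " not responding, count: " + string(failCount))
        } else {
            failCount = 0   // any successful check resets the counter
        }
        sleep(interval)
    }
    return true   // failure confirmed
}
// Activate the standby node. Defined before failover, which calls it.
// startDataCollection and resumeFromCheckpoint are assumed to be defined by you.
def activateStandby(nodeId) {
    // Start data collection
    startDataCollection(nodeId)
    // Resume from the last checkpoint
    resumeFromCheckpoint(nodeId)
}
// Failover. The configuration is passed in as an argument because a function
// body cannot read session variables such as haConfig.
def failover(config) {
    print("Starting failover...")
    // 1. Confirm that the primary has failed
    if (detectFailure(config["primary"], config["heartbeatInterval"])) {
        standby = config["standby"]
        // 2. Activate the standby
        activateStandby(standby)
        // 3. Update the status table
        update ha_status set role = "primary", status = "active" where node_id = standby
        // 4. Raise an alert
        sendAlert("failover", 2, "Primary node failed; switched to the standby node")
        print("Failover finished")
    }
}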
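The threshold exists to avoid failing over on a single missed check: only an unbroken run of failures confirms the outage, and any successful check resets the count. The Python sketch below isolates that counting rule from the polling loop so it can be checked on a fixed sequence of statuses; the function name `failure_confirmed` is an assumption for the illustration.

```python
from typing import Iterable


def failure_confirmed(statuses: Iterable[str], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive "down" statuses are observed."""
    fail_count = 0
    for status in statuses:
        # A healthy check resets the run; only consecutive failures count.
        fail_count = fail_count + 1 if status == "down" else 0
        if fail_count >= threshold:
            return True
    return False


print(failure_confirmed(["down", "down", "up", "down"]))  # transient blips only
print(failure_confirmed(["up", "down", "down", "down"]))  # sustained outage
```

With a 5-second check interval and a threshold of 3, an outage is confirmed roughly 15 seconds after it starts; lowering either value speeds up failover at the cost of more false positives.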
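5.3 Data Synchronization
python
// Runs on the standby node.
// Assumes a shared table named sensor_data exists there.
def appendSensorData(data) {
    objByName("sensor_data").append!(data)
}
// Replicate a batch to the standby node. rpc addresses a node of the same
// cluster by its alias; for a standby outside the cluster, use a remote
// connection instead.
def syncToStandby(standbyNode, data) {
    rpc(standbyNode, appendSensorData, data)
}
// Replicate every new batch of sensor_stream. The standby alias is bound by
// partial application because the handler cannot read haConfig.
subscribeTable(tableName="sensor_stream", actionName="sync_standby", offset=-1,
    handler=syncToStandby{"node_002"}, msgAsTable=true)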
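6. Disaster Recovery
6.1 Disaster Recovery Procedure
python
// Restore the most recent backup. Defined before disasterRecovery, which calls it.
def restoreData(backupPath) {
    // files() returns columns that include filename and lastModified
    backups = files(backupPath)
    latestBackup = select top 1 * from backups order by lastModified desc
    backupDir = backupPath + "/" + latestBackup.filename[0]
    // restore needs a table name and a partition pattern. "sensor_data" is an
    // assumed table name; "%" matches every partition.
    restore(backupDir, "dfs://iot_db", "sensor_data", "%", true)
}
// Disaster recovery. stopServices, startServices and resumeCollection are
// assumed to be defined by you; validateData is shown in section 6.2.
def disasterRecovery(backupPath) {
    print("Starting disaster recovery...")
    // 1. Stop services
    stopServices()
    // 2. Restore data
    restoreData(backupPath)
    // 3. Validate data
    validateData()
    // 4. Restart services
    startServices()
    // 5. Resume collection
    resumeCollection()
    print("Disaster recovery finished")
}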
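6.2 Data Validation
python
// Data validation
def validateData() {
    // Table completeness: print the row count of every table.
    // Variables are named tableName and rowCount so they do not shadow the
    // built-in functions table and count.
    tableNames = getTables(database("dfs://iot_db"))
    for (tableName in tableNames) {
        t = loadTable("dfs://iot_db", tableName)
        rowCount = exec count(*) from t
        print("Table " + tableName + ": " + string(rowCount) + " rows")
    }
    // Data consistency checks
    // ...
}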
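Printing row counts only helps if they are compared with something. One simple consistency check, sketched below in Python, is to record per-table row counts at backup time and compare them with the counts after the restore. The function name `compare_row_counts` and the idea of storing counts alongside the backup are assumptions, not part of the original procedure.

```python
from typing import Dict, Optional, Tuple


def compare_row_counts(
    expected: Dict[str, int], actual: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {table: (expected, actual)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    problems = {}
    for table in sorted(set(expected) | set(actual)):
        want, got = expected.get(table), actual.get(table)
        if want != got:
            problems[table] = (want, got)
    return problems


at_backup = {"sensor_data": 1000, "device_info": 25}
after_restore = {"sensor_data": 990}
print(compare_row_counts(at_backup, after_restore))
```

An empty result means every table was restored with the expected number of rows; it does not prove the contents match, so checksums on key columns are a natural next step.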
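6.3 Recovery Testing
python
// Periodic recovery test.
// createTestEnvironment, validateRecovery and cleanupTestEnvironment are
// assumed to be defined by you.
def recoveryTest() {
    print("Starting recovery test...")
    // 1. Create a test environment
    testEnv = createTestEnvironment()
    // 2. Run the recovery
    disasterRecovery("/backup/test")
    // 3. Verify the result
    result = validateRecovery()
    // 4. Clean up the test environment
    cleanupTestEnvironment(testEnv)
    print("Recovery test finished: " + string(result))
}
// Run on the 1st of every month at 00:00; frequency 'M' requires the day of month
scheduleJob("recovery_test", "recovery test", recoveryTest,
    00:00m, 2024.01.01, 2030.12.31, 'M', 1)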
7. Monitoring and Alerting
7.1 Monitoring Metrics
python
// Collect monitoring metrics.
// Every get* helper below is assumed to be defined by you.
def collectMetrics() {
    metrics = dict(STRING, ANY)
    // System metrics
    metrics["cpu_usage"] = getCpuUsage()
    metrics["memory_usage"] = getMemoryUsage()
    metrics["disk_usage"] = getDiskUsage()
    // Business metrics
    metrics["data_rate"] = getDataRate()
    metrics["queue_size"] = getQueueSize()
    metrics["error_rate"] = getErrorRate()
    // High-availability metrics
    metrics["node_status"] = getNodeStatus()
    metrics["replication_lag"] = getReplicationLag()
    return metrics
}
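7.2 Alert Rules
python
// Alert rules. The metric column names the key in the metrics dictionary that
// each rule checks; the rule names alone do not match the keys from 7.1.
alertRules = table(
    ["cpu_high", "memory_high", "disk_high", "node_down", "replication_lag"] as rule_name,
    ["cpu_usage", "memory_usage", "disk_usage", "node_status", "replication_lag"] as metric,
    [80, 85, 90, 0, 60000] as threshold,
    [">", ">", ">", "==", ">"] as operator,
    [2, 2, 2, 1, 2] as alert_level
)
// Check every rule against the collected metrics. The rules are passed in
// because a function body cannot read session variables such as alertRules.
def checkAlerts(metrics, rules) {
    for (rule in rules) {
        value = metrics[rule["metric"]]
        // eval expects metacode, so the comparison string is parsed first
        if (eval(parseExpr(string(value) + rule["operator"] + string(rule["threshold"])))) {
            sendAlert(rule["rule_name"], rule["alert_level"],
                rule["rule_name"] + ": " + string(value))
        }
    }
}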
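Rule evaluation does not need to build and evaluate an expression string: mapping each operator symbol to a comparison function is safer and easier to test. The Python sketch below shows that approach. The mapping from rule names to metric keys (for example `cpu_high` to `cpu_usage`) is an assumption introduced for this illustration, as are the names `OPERATORS`, `RULES` and `check_alerts`.

```python
import operator
from typing import Dict, List, Tuple

# Comparison functions keyed by the operator symbols used in the rule table.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}

# (rule_name, metric key, operator, threshold, alert_level)
RULES = [
    ("cpu_high", "cpu_usage", ">", 80, 2),
    ("memory_high", "memory_usage", ">", 85, 2),
    ("disk_high", "disk_usage", ">", 90, 2),
    ("node_down", "node_status", "==", 0, 1),
    ("replication_lag", "replication_lag", ">", 60000, 2),
]


def check_alerts(metrics: Dict[str, float], rules=RULES) -> List[Tuple[str, int, str]]:
    """Return (rule_name, alert_level, message) for every rule that fires."""
    fired = []
    for rule_name, metric, op, threshold, level in rules:
        if metric not in metrics:
            continue  # a missing metric is skipped, not treated as a breach
        value = metrics[metric]
        if OPERATORS[op](value, threshold):
            fired.append((rule_name, level, f"{rule_name}: {value}"))
    return fired


sample = {"cpu_usage": 91, "memory_usage": 40, "disk_usage": 95,
          "node_status": 1, "replication_lag": 500}
for alert in check_alerts(sample):
    print(alert)
```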
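8. Worked Example
8.1 A Highly Available Data Collection System
python
// ========== Highly available data collection system ==========
// 1. High-availability configuration
haConfig = dict(STRING, ANY)
haConfig["primary"] = "node_001"
haConfig["standby"] = "node_002"
haConfig["heartbeatInterval"] = 5000
// 2. Stream table with persistence enabled
share streamTable(100000:0,
    `device_id`timestamp`temperature`humidity,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE]) as ha_stream
enableTablePersistence(ha_stream, true, true, 1000000)
// 3. Distributed table. device_id is a SYMBOL column, so the database is
//    hash-partitioned on SYMBOL; 50 buckets is an illustrative choice.
db = database("dfs://ha_db", HASH, [SYMBOL, 50])
schema = table(1:0, `device_id`timestamp`temperature`humidity,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE])
db.createPartitionedTable(schema, `sensor_data, `device_id)
// 4. Subscribe and write to the distributed table, in batches of up to
//    10000 rows or every 5 seconds (throttle is given in seconds)
subscribeTable(tableName="ha_stream", actionName="persist", offset=-1,
    handler=def(msg) {
        loadTable("dfs://ha_db", "sensor_data").append!(msg)
    }, msgAsTable=true, batchSize=10000, throttle=5)
// 5. Collection position table
share table(1:0, `source`position`update_time,
    [STRING, STRING, TIMESTAMP]) as collection_position
// 6. Data buffer
share table(100000:0, `source`data`timestamp`synced,
    [STRING, STRING, TIMESTAMP, BOOL]) as data_buffer
// 7. Heartbeat. recordPosition (4.1), checkNodeStatus and failover (5.2) must
//    already be defined. The configuration is passed in as an argument
//    because a function body cannot read session variables.
def heartbeatTask(config) {
    while (true) {
        // Send a heartbeat
        recordPosition("heartbeat", string(now()))
        // Check the primary; checkNodeStatus returns "down" for a failed node
        if (checkNodeStatus(config["primary"]) == "down") {
            failover(config)
        }
        sleep(config["heartbeatInterval"])
    }
}
submitJob("ha_heartbeat", "HA heartbeat", heartbeatTask, haConfig)
// 8. Scheduled backup, daily at 02:00 (fullBackup is defined in section 3.1)
def backupTask() {
    fullBackup("dfs://ha_db", "/backup/dolphindb")
}
scheduleJob("ha_backup", "HA backup", backupTask,
    02:00m, 2024.01.01, 2030.12.31, 'D')
// 9. Monitoring
def monitorHA(config) {
    print("=== HA monitoring ===")
    print("Primary node: " + config["primary"])
    print("Standby node: " + config["standby"])
    print("Stream table rows: " + string(exec count(*) from ha_stream))
    print("Unsynced buffer rows: " + string(exec count(*) from data_buffer where synced = false))
}
monitorHA(haConfig)
print("High-availability data collection system started")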
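9. Summary
This article covered high availability for data collection with DolphinDB:
- High availability principles: redundant design, fault isolation, fast recovery
- Failure detection: heartbeat detection, health checks, failure alerts
- Data backup: full backup, incremental backup, backup cleanup
- Resumable collection: position recording, resuming from a checkpoint, data buffering
- Primary/standby failover: failure detection, failover, data synchronization
- Disaster recovery: recovery procedure, data validation, recovery testing
- Monitoring and alerting: monitoring metrics, alert rules
Questions for reflection:
- How would you design a highly available data collection architecture?
- How do you make data collection reliable?
- How do you recover quickly from a failure?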