DolphinDB Data Collection High Availability: Fault Tolerance and Recovery

- Abstract
- 1. High Availability Overview
  - 1.1 What Is High Availability
  - 1.2 Availability Levels
  - 1.3 High Availability Design Principles
- 2. Fault Detection
  - 2.1 Heartbeat Detection
  - 2.2 Health Checks
  - 2.3 Fault Alerts
- 3. Data Backup
  - 3.1 Backup Strategy
  - 3.2 Scheduled Backup Jobs
  - 3.3 Backup Cleanup
- 4. Resumable Collection
  - 4.1 Recording the Collection Position
  - 4.2 Implementing Resume
  - 4.3 Data Buffering
- 5. Primary/Standby Switchover
  - 5.1 Primary/Standby Architecture
  - 5.2 Failure Detection and Switchover
  - 5.3 Data Synchronization
- 6. Disaster Recovery
  - 6.1 Disaster Recovery Procedure
  - 6.2 Data Validation
  - 6.3 Recovery Testing
- 7. Monitoring and Alerting
  - 7.1 Monitoring Metrics
  - 7.2 Alert Rules
- 8. Case Study
  - 8.1 A High-Availability Data Collection System
- 9. Summary

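Abstract

This article explains how to make data collection on DolphinDB highly available. It covers fault-tolerant design, fault detection, data backup, resumable collection, primary/standby switchover, and disaster recovery, and illustrates each technique with code examples.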

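1. High Availability Overview

1.1 What Is High Availability

High availability (HA) means that a system keeps providing service when individual components fail: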
[Figure: high-availability architecture. Core mechanisms: fault detection, failover, service recovery. Safeguards: redundant design, data backup, automatic recovery.]

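1.2 Availability Levels

| Level | Availability | Annual downtime |
|---|---|---|
| Two nines | 99% | 87.6 hours |
| Three nines | 99.9% | 8.76 hours |
| Four nines | 99.99% | 52.6 minutes |
| Five nines | 99.999% | 5.26 minutes |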
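
The downtime figures above follow directly from the availability percentage. The sketch below reproduces them, assuming a 365-day year; the function name `annual_downtime_hours` is illustrative and not part of DolphinDB. Python is used here only as a neutral way to show the arithmetic.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Print the downtime budget for each level in the table
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the recovery mechanisms in the later sections have to be automatic rather than manual once the target exceeds three nines.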

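1.3 High Availability Design Principles

| Principle | Description |
|---|---|
| Redundancy | Deploy critical components redundantly |
| Fault isolation | A local fault does not bring down the whole system |
| Fast recovery | Fail over automatically |
| Data protection | No data is lost |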

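2. Fault Detection

2.1 Heartbeat Detection

// Heartbeat table: every heartbeat appends one row for the sending node
share table(1:0, 
    `node_id`last_heartbeat`status,
    [STRING, TIMESTAMP, STRING]) as heartbeat_table

// Send a heartbeat for the given node
def sendHeartbeat(nodeId) {
    insert into heartbeat_table values (nodeId, now(), "alive")
}

// Return the nodes whose most recent heartbeat is older than `timeout` milliseconds.
// Only the latest row per node is compared; older rows are stale by definition.
def checkHeartbeat(timeout = 30000) {
    threshold = now() - timeout
    
    deadNodes = select max(last_heartbeat) as last_seen from heartbeat_table
                 group by node_id having max(last_heartbeat) < threshold
    
    return deadNodes
}

// Send a heartbeat every `interval` milliseconds.
// DolphinDB script has do-while loops but no standalone while loop.
def heartbeatTask(nodeId, interval = 10000) {
    do {
        sendHeartbeat(nodeId)
        sleep(interval)
    } while (true)
}

// Start the heartbeat as a background job
submitJob("heartbeat", "heartbeat sender", heartbeatTask, "node_001", 10000)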
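
The detection rule above (a node is dead when its latest heartbeat is older than the timeout) can be checked in isolation. The following is a minimal, language-neutral sketch in Python; `record_heartbeat` and `find_dead_nodes` are hypothetical names, and timestamps are plain millisecond integers rather than DolphinDB TIMESTAMP values.

```python
from typing import Dict, List


def record_heartbeat(last_seen_ms: Dict[str, int], node_id: str, ts_ms: int) -> None:
    """Keep only the newest heartbeat timestamp per node."""
    last_seen_ms[node_id] = max(ts_ms, last_seen_ms.get(node_id, ts_ms))


def find_dead_nodes(last_seen_ms: Dict[str, int], now_ms: int,
                    timeout_ms: int = 30_000) -> List[str]:
    """Return the nodes whose latest heartbeat is older than timeout_ms."""
    return sorted(node for node, seen in last_seen_ms.items()
                  if now_ms - seen > timeout_ms)


last_seen: Dict[str, int] = {}
record_heartbeat(last_seen, "node_001", 1_000)
record_heartbeat(last_seen, "node_002", 1_000)
record_heartbeat(last_seen, "node_001", 40_000)  # node_001 keeps reporting

dead = find_dead_nodes(last_seen, now_ms=45_000)
print(dead)
```

Keeping one timestamp per node, instead of scanning every historical heartbeat, is what prevents a healthy node from being reported dead because of its own old rows.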

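2.2 Health Checks

// System health check.
// NOTE: getMemoryStatus, getDiskStatus, getConnectionStatus and getSubscriptionStat are
// assumed user-supplied helpers, not confirmed DolphinDB built-ins; built-ins such as
// getPerf() and getStreamingStat() are possible data sources for them.
def healthCheck() {
    report = dict(STRING, ANY)
    
    // Memory
    memStatus = getMemoryStatus()
    mem = dict(STRING, DOUBLE)
    mem["used"] = memStatus["used"]
    mem["total"] = memStatus["total"]
    mem["rate"] = memStatus["used"] * 100.0 / memStatus["total"]
    report["memory"] = mem
    
    // Disk
    diskStatus = getDiskStatus()
    disk = dict(STRING, DOUBLE)
    disk["free"] = diskStatus["free"]
    disk["total"] = diskStatus["total"]
    report["disk"] = disk
    
    // Connections
    report["connections"] = getConnectionStatus()
    
    // Subscriptions
    report["subscriptions"] = rows(getSubscriptionStat())
    
    return report
}

// Scheduled health check. sendAlert is defined in section 2.3 and must be
// defined before this function.
def scheduledHealthCheck() {
    report = healthCheck()
    
    // Check the alert threshold
    rate = report["memory"]["rate"]
    if (rate > 80) {
        sendAlert("memory_high", 2, "memory usage too high: " + string(rate) + "%")
    }
    
    print(report)
}

// The schedule time must be a MINUTE literal (note the trailing m)
scheduleJob("health_check", "health check", scheduledHealthCheck,
            00:05m, 2024.01.01, 2030.12.31, 'D')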

2.3 Fault Alerts

// Alert log
share table(1:0, 
    `alert_time`alert_type`alert_level`message`status,
    [TIMESTAMP, STRING, INT, STRING, STRING]) as alert_table

// Deliver a notification. Defined before sendAlert, which calls it.
def notifyAlert(alertType, level, message) {
    // Send email / SMS / WeChat here
    print("ALERT: [" + alertType + "] " + message)
}

// Record the alert, then notify
def sendAlert(alertType, level, message) {
    insert into alert_table values (now(), alertType, level, message, "new")
    notifyAlert(alertType, level, message)
}

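3. Data Backup

3.1 Backup Strategy

// Backup configuration
backupConfig = dict(STRING, ANY)
backupConfig["fullBackupInterval"] = 86400000   // full backup interval (24 hours, in ms)
backupConfig["incrementalInterval"] = 3600000   // incremental backup interval (1 hour, in ms)
backupConfig["retentionDays"] = 30              // days to keep backups
backupConfig["backupPath"] = "/backup/dolphindb"

// Full backup into a new timestamped directory.
// backup() takes the backup directory first. This assumes a server version whose
// backup() accepts a database path; older versions take a SQL metacode object instead.
def fullBackup(dbPath, backupPath) {
    timestamp = format(now(), "yyyyMMdd_HHmmss")
    backupDir = backupPath + "/full_" + timestamp
    
    // force=true copies every partition
    backup(backupDir, dbPath, true)
    
    print("full backup finished: " + backupDir)
    return backupDir
}

// Incremental backup. backup() has no "since" argument: with force=false it copies
// only the partitions that changed since the previous backup in the same directory,
// so incremental runs reuse one directory. lastBackupTime is kept for logging only.
def incrementalBackup(dbPath, lastBackupTime, backupPath) {
    backupDir = backupPath + "/incr"
    
    backup(backupDir, dbPath, false)
    
    print("incremental backup finished: " + backupDir +
          " (previous backup: " + string(lastBackupTime) + ")")
    return backupDir
}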

3.2 Scheduled Backup Jobs

// Log of completed backups
share table(1:0, 
    `backup_type`backup_time`backup_path,
    [STRING, TIMESTAMP, STRING]) as backup_log

// Scheduled full backup
def scheduledFullBackup() {
    backupDir = fullBackup("dfs://iot_db", "/backup/dolphindb")
    insert into backup_log values ("full", now(), backupDir)
}

scheduleJob("full_backup", "full backup", scheduledFullBackup,
            02:00m, 2024.01.01, 2030.12.31, 'D')

// Scheduled incremental backup
def scheduledIncrementalBackup() {
    lastTime = exec max(backup_time) from backup_log
    backupDir = incrementalBackup("dfs://iot_db", lastTime, "/backup/dolphindb")
    insert into backup_log values ("incremental", now(), backupDir)
}

// scheduleJob has no hourly frequency; pass one MINUTE value per hour with a daily frequency
scheduleJob("incr_backup", "incremental backup", scheduledIncrementalBackup,
            00:00m + 60 * (0..23), 2024.01.01, 2030.12.31, 'D')

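3.3 Backup Cleanup

// Delete backups older than retentionDays
def cleanupOldBackups(backupPath, retentionDays) {
    // temporalAdd sidesteps a possible INT overflow in retentionDays * 86400000
    threshold = temporalAdd(now(), -retentionDays, "d")
    
    // files() returns one row per entry, with columns such as filename, isDir and lastModified
    expired = select filename from files(backupPath)
              where isDir = true, lastModified < threshold
    
    for (name in expired.filename) {
        // Remove the expired backup directory recursively
        rmdir(backupPath + "/" + name, true)
        print("deleted expired backup: " + name)
    }
}

// Scheduled cleanup
def scheduledCleanup() {
    cleanupOldBackups("/backup/dolphindb", 30)
}

scheduleJob("backup_cleanup", "backup cleanup", scheduledCleanup,
            03:00m, 2024.01.01, 2030.12.31, 'D')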
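
Because cleanup deletes data, it helps to separate the decision (which backups are expired) from the deletion itself, so the decision can be dry-run and tested. The sketch below shows only the selection step in Python; `expired_backups` and the directory names are hypothetical, and modification times are passed in explicitly instead of being read from disk.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def expired_backups(backups: Dict[str, datetime], now: datetime,
                    retention_days: int = 30) -> List[str]:
    """Return the names of backups last modified before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, modified in backups.items() if modified < cutoff)


now = datetime(2024, 3, 1)
backups = {
    "full_20240101_020000": datetime(2024, 1, 1, 2, 0),
    "full_20240215_020000": datetime(2024, 2, 15, 2, 0),
    "incr_20240229_230000": datetime(2024, 2, 29, 23, 0),
}

# Dry run: list what would be deleted without touching the file system
to_delete = expired_backups(backups, now)
print(to_delete)
```

A production job would typically log this list first and also refuse to delete the most recent full backup, whatever its age.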

4. Resumable Collection

4.1 Recording the Collection Position

// Collection position table: one row per data source
share table(1:0, 
    `source`position`update_time,
    [STRING, STRING, TIMESTAMP]) as collection_position

// Record the position of a source. The parameters are named src/pos so that they do
// not collide with the column names in SQL (where source = source matches every row).
def recordPosition(src, pos) {
    existing = select * from collection_position where source = src
    
    if (existing.rows() > 0) {
        update collection_position 
        set position = pos, update_time = now()
        where source = src
    } else {
        insert into collection_position values (src, pos, now())
    }
}

// Return the last recorded position of a source, or NULL if none exists
def getPosition(src) {
    pos = exec position from collection_position where source = src
    if (pos.size() > 0) {
        return pos[0]
    }
    return NULL
}

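4.2 Implementing Resume

// NOTE: the kafka:: and mqtt:: calls below follow the original example and are
// illustrative; function names and signatures depend on the plugin version, so check
// them against the installed plugin. processMessage is a user-supplied handler.

// Kafka handler: process the message first, then record its offset (at-least-once)
def handleKafkaMessage(topic, msg) {
    processMessage(msg)
    recordPosition("kafka_" + topic, string(msg.offset))
}

// Resume Kafka consumption
def kafkaResumeConsume(consumer, topic, positionTable) {
    // Look up the last processed offset
    lastOffset = getPosition("kafka_" + topic)
    
    if (!isNull(lastOffset)) {
        // Continue with the message after the last processed one
        kafka::seek(consumer, topic, 0, long(lastOffset) + 1)
    } else {
        // No checkpoint: start from the newest message
        kafka::seekToEnd(consumer, topic)
    }
    
    // DolphinDB functions cannot capture outer local variables,
    // so topic is bound by partial application
    kafka::consume(consumer, topic, handleKafkaMessage{topic})
}

// MQTT handler: process the message, then record the handling time
def handleMqttMessage(topic, msg) {
    processMessage(msg)
    recordPosition("mqtt_" + topic, string(now()))
}

// Resume an MQTT subscription
def mqttResumeSubscribe(conn, topic, positionTable) {
    // MQTT brokers do not replay by offset; redelivery depends on the QoS level and
    // a persistent session. The recorded time only shows how long the gap was.
    lastTime = getPosition("mqtt_" + topic)
    if (!isNull(lastTime)) {
        print("last message for " + topic + " was handled at " + lastTime)
    }
    
    mqtt::subscribe(conn, topic, handleMqttMessage{topic})
}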
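
The essential ordering in resumable collection is: process a message, then persist its offset, and on restart continue from the stored offset plus one. The Python sketch below simulates this with an in-memory log; `CheckpointStore` and `consume` are hypothetical names standing in for the `collection_position` table and a real consumer, and for simplicity a source without a checkpoint starts from offset 0 rather than from the newest message.

```python
from typing import Callable, Dict, List, Optional


class CheckpointStore:
    """In-memory stand-in for the collection_position table."""

    def __init__(self) -> None:
        self._positions: Dict[str, int] = {}

    def get(self, source: str) -> Optional[int]:
        return self._positions.get(source)

    def put(self, source: str, offset: int) -> None:
        self._positions[source] = offset


def consume(source: str, messages: List[str], store: CheckpointStore,
            handler: Callable[[str], None], stop_after: Optional[int] = None) -> int:
    """Handle messages after the stored offset; return how many were handled."""
    last = store.get(source)
    start = 0 if last is None else last + 1  # resume after the last handled offset
    handled = 0
    for offset in range(start, len(messages)):
        if stop_after is not None and handled >= stop_after:
            break  # simulate a crash or shutdown
        handler(messages[offset])
        store.put(source, offset)  # checkpoint only after the handler succeeded
        handled += 1
    return handled


store = CheckpointStore()
log = ["m0", "m1", "m2", "m3", "m4"]
seen: List[str] = []

consume("kafka_demo", log, store, seen.append, stop_after=2)  # interrupted run
consume("kafka_demo", log, store, seen.append)                # resumed run
print(seen, store.get("kafka_demo"))
```

Checkpointing after the handler gives at-least-once delivery: a crash between handling and checkpointing replays one message, so the handler should be idempotent.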

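4.3 Data Buffering

// Buffer table: rows wait here until they have been synchronized to the target
share table(100000:0, 
    `source`data`timestamp`synced,
    [STRING, STRING, TIMESTAMP, BOOL]) as data_buffer

// Write one record to the buffer as JSON
def writeToBuffer(source, data) {
    insert into data_buffer values (source, toJson(data), now(), false)
}

// Synchronize buffered rows. syncToTarget is a user-supplied function.
def syncBuffer() {
    unsynced = select * from data_buffer where synced = false limit 10000
    
    if (unsynced.rows() > 0) {
        try {
            // Send each row to the target; fromJson reverses toJson
            for (row in unsynced) {
                syncToTarget(row["source"], fromJson(row["data"]))
            }
            
            // Mark the batch as synchronized only after every row succeeded
            update data_buffer set synced = true 
            where synced = false, timestamp in unsynced.timestamp
        } catch (ex) {
            print("sync failed: " + ex)
        }
    }
}

// Scheduled sync (once a day at 00:01; pass a vector of MINUTE values to run more often)
scheduleJob("sync_buffer", "sync buffer", syncBuffer,
            00:01m, 2024.01.01, 2030.12.31, 'D')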
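
The buffer pattern above can be exercised without a database. In the Python sketch below, rows are marked as synced one by one through a unique id, so a failure in the middle of a batch neither loses rows nor re-sends the ones that already succeeded; `Buffer` and `flaky_send` are hypothetical names, and the in-memory list stands in for the `data_buffer` table.

```python
from typing import Any, Callable, Dict, List


class Buffer:
    """In-memory stand-in for the data_buffer table."""

    def __init__(self) -> None:
        self._rows: List[Dict[str, Any]] = []

    def write(self, source: str, data: Any) -> int:
        row_id = len(self._rows)  # unique id, used instead of a timestamp
        self._rows.append({"id": row_id, "source": source, "data": data, "synced": False})
        return row_id

    def sync(self, send: Callable[[str, Any], None], batch_size: int = 10_000) -> int:
        """Send unsynced rows in order; stop at the first failure. Return rows synced."""
        batch = [r for r in self._rows if not r["synced"]][:batch_size]
        done = 0
        for row in batch:
            try:
                send(row["source"], row["data"])
            except Exception:
                break  # remaining rows stay unsynced and are retried next time
            row["synced"] = True
            done += 1
        return done

    def pending(self) -> int:
        return sum(1 for r in self._rows if not r["synced"])


calls = {"n": 0}


def flaky_send(source: str, data: Any) -> None:
    """Hypothetical target that fails once, on the third call."""
    calls["n"] += 1
    if calls["n"] == 3:
        raise ConnectionError("target unavailable")


buf = Buffer()
for i in range(5):
    buf.write("sensor", {"value": i})

first = buf.sync(flaky_send)   # stops at the failing row
second = buf.sync(flaky_send)  # retries the rest
print(first, second, buf.pending())
```

Marking by a unique row id rather than by timestamp avoids flagging rows that merely share a timestamp with a synced row.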

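5. Primary/Standby Switchover

5.1 Primary/Standby Architecture

// Primary/standby configuration
haConfig = dict(STRING, ANY)
haConfig["mode"] = "active-standby"
haConfig["primary"] = "node_001"
haConfig["standby"] = "node_002"
haConfig["heartbeatInterval"] = 5000   // ms between probes
haConfig["failoverThreshold"] = 3      // consecutive failures before failover

// Primary/standby status
share table(1:0, 
    `node_id`role`status`last_update,
    [STRING, STRING, STRING, TIMESTAMP]) as ha_status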

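5.2 Failure Detection and Switchover

// checkNodeStatus, startDataCollection and resumeFromCheckpoint are user-supplied helpers.
// A function cannot read session variables such as haConfig, so the configuration
// is passed in as a parameter.

// Block until the node fails `threshold` consecutive probes, then return true
def detectFailure(nodeId, threshold = 3, interval = 5000) {
    failCount = 0
    
    do {
        status = checkNodeStatus(nodeId)
        
        if (status == "down") {
            failCount += 1
            print("node " + nodeId + " not responding, count: " + string(failCount))
        } else {
            failCount = 0   // any successful probe resets the counter
        }
        
        sleep(interval)
    } while (failCount < threshold)
    
    return true  // failure confirmed
}

// Activate the standby node (defined before failover, which calls it)
def activateStandby(nodeId) {
    // Start data collection
    startDataCollection(nodeId)
    
    // Resume from the last checkpoint
    resumeFromCheckpoint(nodeId)
}

// Failover
def failover(config) {
    print("starting failover...")
    standbyId = config["standby"]
    
    // 1. Confirm that the primary has failed
    if (detectFailure(config["primary"], 3, config["heartbeatInterval"])) {
        // 2. Activate the standby
        activateStandby(standbyId)
        
        // 3. Update the status table
        update ha_status set role = "primary", status = "active", last_update = now()
        where node_id = standbyId
        
        // 4. Raise an alert
        sendAlert("failover", 2, "primary node failed; switched to standby")
        
        print("failover finished")
    }
}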
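
The consecutive-failure rule is what keeps a single lost probe from triggering a switchover. The Python sketch below isolates that rule so it can be tested without sleeping or real nodes; `FailureDetector` is a hypothetical name, and each boolean in `probes` stands for the result of one status check.

```python
from typing import List


class FailureDetector:
    """Confirm a failure only after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, healthy: bool) -> bool:
        """Feed one probe result; return True once the failure is confirmed."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the counter
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


detector = FailureDetector(threshold=3)
probes: List[bool] = [False, False, True, False, False, False]
results = [detector.observe(p) for p in probes]
print(results)
```

With a 5-second probe interval and a threshold of 3, a real failure is confirmed after roughly 15 seconds; lowering either value speeds up failover at the cost of more false positives.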

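5.3 Data Synchronization

// Primary-to-standby data synchronization.
// Assumptions: the standby is a node of the same cluster and is addressed by its
// alias through rpc, and the target table is dfs://iot_db/sensor_data. For a separate
// cluster, open a connection with xdb and use remoteRun instead.

// Runs on the standby: append one batch to the target table
def appendSensorData(data) {
    loadTable("dfs://iot_db", "sensor_data").append!(data)
}

// Runs on the primary: ship the batch to the standby
def syncToStandby(standbyNode, data) {
    rpc(standbyNode, appendSensorData, data)
}

// Subscribe to the stream table; the standby alias is bound by partial application
// because a handler cannot capture outer variables
subscribeTable(, "sensor_stream", "sync_standby", -1,
    syncToStandby{haConfig["standby"]}, true)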

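6. Disaster Recovery

6.1 Disaster Recovery Procedure

// stopServices, startServices and resumeCollection are user-supplied helpers;
// validateData is defined in section 6.2 and must be defined before disasterRecovery.

// Restore from the most recent backup directory.
// The table name is an assumption; restore() works on one table at a time.
def restoreData(backupPath, tableName = "sensor_data") {
    // files() returns columns such as filename, isDir and lastModified
    backups = select * from files(backupPath) where isDir = true
              order by lastModified desc
    latestBackup = backups.filename[0]
    
    // partition "%" matches every partition; force=true restores all of them
    restore(backupPath + "/" + latestBackup, "dfs://iot_db", tableName, "%", true)
}

// Disaster recovery
def disasterRecovery(backupPath) {
    print("starting disaster recovery...")
    
    // 1. Stop services
    stopServices()
    
    // 2. Restore data
    restoreData(backupPath)
    
    // 3. Validate data
    validateData()
    
    // 4. Restart services
    startServices()
    
    // 5. Resume collection
    resumeCollection()
    
    print("disaster recovery finished")
}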

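6.2 Data Validation

// Data validation
def validateData() {
    // Check that every table can be loaded and counted
    tableNames = getTables(database("dfs://iot_db"))
    
    // Avoid `table` and `count` as variable names: both are built-in functions
    for (tableName in tableNames) {
        t = loadTable("dfs://iot_db", tableName)
        rowCount = exec count(*) from t
        
        print("table " + tableName + ": " + string(rowCount) + " rows")
    }
    
    // Check data consistency
    // ...
}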

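6.3 Recovery Testing

// Periodic recovery test.
// createTestEnvironment, validateRecovery and cleanupTestEnvironment are user-supplied
// helpers. Run this against an isolated test environment: disasterRecovery stops
// services and overwrites data.
def recoveryTest() {
    print("starting recovery test...")
    
    // 1. Create the test environment
    testEnv = createTestEnvironment()
    
    // 2. Run the recovery
    disasterRecovery("/backup/test")
    
    // 3. Verify the result
    result = validateRecovery()
    
    // 4. Clean up the test environment
    cleanupTestEnvironment(testEnv)
    
    print("recovery test finished: " + string(result))
}

// Monthly schedule: frequency 'M' also needs the day of the month (here the 1st)
scheduleJob("recovery_test", "recovery test", recoveryTest,
            00:00m, 2024.01.01, 2030.12.31, 'M', 1)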

7. Monitoring and Alerting

7.1 Monitoring Metrics

// Collect monitoring metrics.
// The get* functions below are user-supplied helpers, not confirmed DolphinDB built-ins.
def collectMetrics() {
    metrics = dict(STRING, ANY)
    
    // System metrics
    metrics["cpu_usage"] = getCpuUsage()
    metrics["memory_usage"] = getMemoryUsage()
    metrics["disk_usage"] = getDiskUsage()
    
    // Business metrics
    metrics["data_rate"] = getDataRate()
    metrics["queue_size"] = getQueueSize()
    metrics["error_rate"] = getErrorRate()
    
    // High-availability metrics
    metrics["node_status"] = getNodeStatus()
    metrics["replication_lag"] = getReplicationLag()
    
    return metrics
}

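7.2 Alert Rules

// Alert rules. The metric column maps each rule to a key returned by collectMetrics.
// node_down assumes that node_status is numeric, with 0 meaning down.
alertRules = table(
    ["cpu_high", "memory_high", "disk_high", "node_down", "replication_lag"] as rule_name,
    ["cpu_usage", "memory_usage", "disk_usage", "node_status", "replication_lag"] as metric,
    [80, 85, 90, 0, 60000] as threshold,
    [">", ">", ">", "==", ">"] as operator,
    [2, 2, 2, 1, 2] as alert_level
)

// Evaluate every rule against the metrics. The rule table is passed in because a
// function cannot read session variables. Call as checkAlerts(collectMetrics(), alertRules).
def checkAlerts(metrics, rules) {
    for (rule in rules) {
        value = metrics[rule["metric"]]
        
        // eval expects metacode, so parse the expression string first
        expr = string(value) + rule["operator"] + string(rule["threshold"])
        if (eval(parseExpr(expr))) {
            sendAlert(rule["rule_name"], rule["alert_level"], 
                     rule["rule_name"] + ": " + string(value))
        }
    }
}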
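
A table-driven rule check does not have to build and evaluate expression strings. As an alternative sketch, the Python version below maps each operator symbol to a comparison function; `Rule`, `RULES` and `check_alerts` are hypothetical names, the thresholds mirror the rule table above, and `node_status` is assumed to be numeric with 0 meaning down.

```python
import operator
from typing import Callable, Dict, List, NamedTuple

# Map operator symbols to comparison functions instead of evaluating strings
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


class Rule(NamedTuple):
    name: str
    metric: str
    op: str
    threshold: float
    level: int


RULES = [
    Rule("cpu_high", "cpu_usage", ">", 80, 2),
    Rule("memory_high", "memory_usage", ">", 85, 2),
    Rule("disk_high", "disk_usage", ">", 90, 2),
    Rule("node_down", "node_status", "==", 0, 1),
    Rule("replication_lag", "replication_lag", ">", 60000, 2),
]


def check_alerts(metrics: Dict[str, float], rules: List[Rule] = RULES) -> List[str]:
    """Return the names of the rules that fire for the given metrics."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # A missing metric never fires a rule
        if value is not None and OPERATORS[rule.op](value, rule.threshold):
            fired.append(rule.name)
    return fired


sample = {"cpu_usage": 92.5, "memory_usage": 60.0, "disk_usage": 91.0,
          "node_status": 1, "replication_lag": 1200}
fired = check_alerts(sample)
print(fired)
```

An operator lookup rejects unknown symbols with a `KeyError` instead of executing them, which is the main reason to prefer it over string evaluation when rules come from a configuration table.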

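8. Case Study

8.1 A High-Availability Data Collection System

// ========== High-availability data collection system ==========
// Relies on recordPosition (4.1), fullBackup (3.1), failover(config) (5.2)
// and a user-supplied checkNodeStatus helper that returns "down" for a dead node.

// 1. Initialize the HA configuration
haConfig = dict(STRING, ANY)
haConfig["primary"] = "node_001"
haConfig["standby"] = "node_002"
haConfig["heartbeatInterval"] = 5000

// 2. Create the stream table (persistence requires persistenceDir to be configured)
share streamTable(100000:0, 
    `device_id`timestamp`temperature`humidity,
    [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE]) as ha_stream

enableTablePersistence(ha_stream, true, true, 1000000)

// 3. Create the distributed table. device_id is a SYMBOL column, so the database
//    is HASH-partitioned on SYMBOL rather than VALUE-partitioned on integers.
db = database("dfs://ha_db", HASH, [SYMBOL, 50])
schema = table(1:0, `device_id`timestamp`temperature`humidity,
               [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE])
db.createPartitionedTable(schema, `sensor_data, `device_id)

// 4. Subscribe and write to the distributed table.
//    Named arguments keep batchSize and throttle (seconds) from being
//    taken as msgAsTable and batchSize.
subscribeTable(tableName="ha_stream", actionName="persist", offset=-1,
    handler=append!{loadTable("dfs://ha_db", "sensor_data")},
    msgAsTable=true, batchSize=10000, throttle=5)

// 5. Collection position table
share table(1:0, `source`position`update_time,
            [STRING, STRING, TIMESTAMP]) as collection_position

// 6. Data buffer
share table(100000:0, `source`data`timestamp`synced,
            [STRING, STRING, TIMESTAMP, BOOL]) as data_buffer

// 7. Heartbeat. The configuration is passed in because a function
//    cannot read session variables.
def heartbeatTask(config) {
    do {
        // Send a heartbeat
        recordPosition("heartbeat", string(now()))
        
        // Check the primary node
        if (checkNodeStatus(config["primary"]) == "down") {
            failover(config)
        }
        
        sleep(config["heartbeatInterval"])
    } while (true)
}

submitJob("ha_heartbeat", "HA heartbeat", heartbeatTask, haConfig)

// 8. Scheduled backup
def backupTask() {
    fullBackup("dfs://ha_db", "/backup/dolphindb")
}

scheduleJob("ha_backup", "HA backup", backupTask,
            02:00m, 2024.01.01, 2030.12.31, 'D')

// 9. Monitoring
def monitorHA(config) {
    print("=== HA monitor ===")
    print("primary node: " + config["primary"])
    print("standby node: " + config["standby"])
    print("stream table rows: " + string(exec count(*) from ha_stream))
    print("unsynced buffer rows: " + string(exec count(*) from data_buffer where synced = false))
}

monitorHA(haConfig)

print("HA data collection system started")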

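9. Summary

This article covered high availability for DolphinDB data collection:

- HA principles: redundant design, fault isolation, fast recovery
- Fault detection: heartbeat detection, health checks, fault alerts
- Data backup: full backup, incremental backup, backup cleanup
- Resumable collection: position recording, resuming from a checkpoint, data buffering
- Primary/standby switchover: failure detection, failover, data synchronization
- Disaster recovery: recovery procedure, data validation, recovery testing
- Monitoring and alerting: monitoring metrics, alert rules

Questions for further thought:

- How would you design a highly available data collection architecture?
- How can the reliability of data collection be guaranteed?
- How can a failure be recovered from quickly?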
