Remote Write: Efficient Data Push

A detailed look at the Remote Write feature of AI Observability Agent, which pushes monitoring data efficiently and reliably.

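Prometheus Remote Write Protocol

Protocol Overview

Prometheus Remote Write is a data transfer protocol in the Prometheus ecosystem. It pushes monitoring data from a collector to a storage backend.

Key characteristics:

HTTP-based: each write is an HTTP POST request
Protobuf encoding: a compact binary encoding
Snappy compression: reduces the size of the payload on the wire
Batched sending: many samples travel in one request, which improves transfer efficiency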

Protocol Version

Protocol version: 0.1.0
Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Prometheus-Remote-Write-Version: 0.1.0

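Compatible Storage Backends

Storage backend	Compatibility	Notes
Prometheus	✅ Fully compatible	Native support (the remote write receiver must be enabled)
VictoriaMetrics	✅ Fully compatible	High-performance time series database
Cortex	✅ Fully compatible	Horizontally scalable Prometheus
Mimir	✅ Fully compatible	Grafana's open-source, Prometheus-compatible storage
Thanos	✅ Fully compatible	Highly available Prometheus setup
InfluxDB	✅ Compatible (adapter required)	Time series database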

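Data Encoding Pipeline

1. Data Preparation

Sample collection: gather Sample data from the collectors and scrapers
Aggregation: group samples by metric_name + labels
Label processing: add the __name__ label and sort the labels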
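The preparation step can be sketched with the standard library alone. This is a minimal illustration, not the Agent's actual code: the `RawSample` type and both function names are hypothetical, and it assumes the incoming labels do not already contain `__name__`.

```rust
use std::collections::BTreeMap;

/// A raw sample as it might arrive from a collector (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct RawSample {
    pub metric_name: String,
    pub labels: Vec<(String, String)>,
    pub value: f64,
    pub timestamp_ms: i64,
}

/// Builds the label set of a series: the original labels plus `__name__`,
/// sorted by label name so the same series always yields the same key.
pub fn series_labels(sample: &RawSample) -> Vec<(String, String)> {
    let mut labels = sample.labels.clone();
    labels.push(("__name__".to_string(), sample.metric_name.clone()));
    labels.sort();
    labels
}

/// Groups samples by their full label set. Each map entry corresponds to
/// one time series holding its (value, timestamp_ms) pairs in arrival order.
pub fn group_by_series(
    samples: &[RawSample],
) -> BTreeMap<Vec<(String, String)>, Vec<(f64, i64)>> {
    let mut groups: BTreeMap<Vec<(String, String)>, Vec<(f64, i64)>> = BTreeMap::new();
    for s in samples {
        groups
            .entry(series_labels(s))
            .or_default()
            .push((s.value, s.timestamp_ms));
    }
    groups
}
```

Two samples of `cpu_usage` with identical labels end up in the same group, so they are sent as one time series with two samples rather than as two separate series.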

2. Protobuf Encoding

Protobuf encoding is done with the prost library:

```rust
use prost::Message;

// Core data structures. The field tags must match the remote write protobuf schema.
#[derive(Clone, PartialEq, Message)]
pub struct Label {
    #[prost(string, tag = "1")]
    pub name: String,
    #[prost(string, tag = "2")]
    pub value: String,
}

#[derive(Clone, PartialEq, Message)]
pub struct Sample {
    #[prost(double, tag = "1")]
    pub value: f64,
    // Milliseconds since the Unix epoch
    #[prost(int64, tag = "2")]
    pub timestamp_ms: i64,
}

#[derive(Clone, PartialEq, Message)]
pub struct TimeSeries {
    #[prost(message, repeated, tag = "1")]
    pub labels: Vec<Label>,
    #[prost(message, repeated, tag = "2")]
    pub samples: Vec<Sample>,
}

#[derive(Clone, PartialEq, Message)]
pub struct WriteRequest {
    #[prost(message, repeated, tag = "1")]
    pub timeseries: Vec<TimeSeries>,
}
```
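To make the wire format concrete, the following sketch hand-encodes a Label and a Sample using only the standard library. It is an illustration of what the field tags mean, not a replacement for prost; the function names are hypothetical, and unlike prost this sketch always writes every field, even when it holds a default value.

```rust
/// Appends `value` as a Protobuf base-128 varint.
fn put_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push(((value as u8) & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Appends a length-delimited string field (wire type 2).
fn put_string_field(out: &mut Vec<u8>, field_number: u32, s: &str) {
    put_varint(out, u64::from((field_number << 3) | 2));
    put_varint(out, s.len() as u64);
    out.extend_from_slice(s.as_bytes());
}

/// Encodes a Label message: name is field 1, value is field 2.
pub fn encode_label(name: &str, value: &str) -> Vec<u8> {
    let mut out = Vec::new();
    put_string_field(&mut out, 1, name);
    put_string_field(&mut out, 2, value);
    out
}

/// Encodes a Sample message: value is field 1 (fixed 64-bit double),
/// timestamp is field 2 (varint-encoded int64).
pub fn encode_sample(value: f64, timestamp_ms: i64) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(0x09); // key for field 1, wire type 1
    out.extend_from_slice(&value.to_le_bytes());
    out.push(0x10); // key for field 2, wire type 0
    put_varint(&mut out, timestamp_ms as u64);
    out
}
```

For example, `encode_label("job", "a")` yields the eight bytes `0a 03 6a 6f 62 12 01 61`: a key byte, a length, and the UTF-8 text for each of the two fields.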

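3. Snappy Compression

Snappy compression is done with the snap library:

```rust
// Compress an encoded WriteRequest. CompressError is assumed to implement From<snap::Error>.
fn compress(data: &[u8]) -> Result<Vec<u8>, CompressError> {
    let mut encoder = snap::raw::Encoder::new();
    // compress_vec allocates an output buffer of the required size.
    // Calling compress() with an empty Vec as output fails, because the output slice has length zero.
    let compressed = encoder.compress_vec(data)?;
    Ok(compressed)
}
```

Note: use the raw Snappy format, not the framed format, to stay compatible with Prometheus.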
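When a backend rejects a payload, it can help to check which Snappy format was actually produced. The sketch below relies on two facts about the formats: a framed stream always starts with a fixed stream identifier chunk, and a raw block starts with the uncompressed length as a little-endian varint. The function names are hypothetical, and this is a debugging aid under those assumptions, not part of the Agent.

```rust
/// Stream identifier chunk that starts every framed Snappy stream.
pub const FRAMED_MAGIC: [u8; 10] =
    [0xff, 0x06, 0x00, 0x00, b's', b'N', b'a', b'P', b'p', b'Y'];

/// Returns true if the body starts with the framed-format stream identifier.
pub fn looks_framed(body: &[u8]) -> bool {
    body.starts_with(&FRAMED_MAGIC)
}

/// Reads the varint preamble of a raw Snappy block, which holds the
/// uncompressed length. Returns None if no complete varint is found in the
/// first five bytes. Call `looks_framed` first: a framed stream would
/// otherwise be misread as a raw block.
pub fn raw_uncompressed_len(body: &[u8]) -> Option<u64> {
    let mut len: u64 = 0;
    for (i, &byte) in body.iter().enumerate().take(5) {
        len |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some(len);
        }
    }
    None
}
```

If `looks_framed` returns true for a request body, the sender used the wrong encoder; for a raw block, `raw_uncompressed_len` should match the size of the encoded WriteRequest.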

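4. HTTP Sending

HTTP requests are sent with the reqwest library:

```rust
// Send one compressed WriteRequest. AuthConfig and SendError are defined elsewhere;
// SendError is assumed to implement From<reqwest::Error>.
async fn send_request(
    client: &reqwest::Client,
    endpoint: &str,
    body: &[u8],
    auth: Option<&AuthConfig>,
) -> Result<reqwest::Response, SendError> {
    let mut request = client
        .post(endpoint)
        .header("Content-Type", "application/x-protobuf")
        .header("Content-Encoding", "snappy")
        .header("X-Prometheus-Remote-Write-Version", "0.1.0")
        .body(body.to_vec());

    // Attach credentials if authentication is configured
    if let Some(auth) = auth {
        request = auth.apply(request);
    }

    // send() fails with reqwest::Error, so `?` converts it into SendError
    Ok(request.send().await?)
}
```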

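Batching and Buffering

Batcher Design

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

pub struct Batcher {
    buffer: Arc<Mutex<Vec<Sample>>>, // pending samples
    capacity: usize,                 // maximum number of buffered samples
    max_samples_per_send: usize,     // upper bound on samples per request
    batch_send_deadline: Duration,   // longest wait between two sends
    last_send: Arc<Mutex<Instant>>,  // time of the most recent send
}
```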

Batch Triggers

Size trigger: the number of buffered samples is ≥ max_samples_per_send
Time trigger: the time since the last send is ≥ batch_send_deadline
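The two triggers can be expressed as one small decision function. This is a sketch under stated assumptions: `FlushReason` and `flush_reason` are hypothetical names, and treating an empty buffer as "nothing to flush" even after the deadline is a choice made here, not something the text above specifies.

```rust
use std::time::{Duration, Instant};

/// Why a flush should happen now.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FlushReason {
    /// The buffer holds at least max_samples_per_send samples.
    Size,
    /// batch_send_deadline has elapsed since the last send.
    Deadline,
}

/// Returns the trigger that fires, or None if the batcher should keep waiting.
pub fn flush_reason(
    buffered: usize,
    max_samples_per_send: usize,
    last_send: Instant,
    now: Instant,
    batch_send_deadline: Duration,
) -> Option<FlushReason> {
    if buffered == 0 {
        return None; // nothing to send, regardless of elapsed time
    }
    if buffered >= max_samples_per_send {
        return Some(FlushReason::Size);
    }
    if now.duration_since(last_send) >= batch_send_deadline {
        return Some(FlushReason::Deadline);
    }
    None
}
```

Passing `now` as a parameter instead of calling `Instant::now()` inside the function keeps the logic easy to test.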

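Buffer Management

Capacity limit: when the buffer exceeds capacity, the oldest data is evicted
Backpressure control: the drain operation removes data in batches of at most max_samples_per_send
Thread safety: a Mutex guards the buffer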
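A minimal sketch of these three rules follows. It is not the Agent's Batcher: `BoundedBuffer` is a hypothetical type, it assumes a capacity greater than zero, and it uses a `VecDeque` rather than the `Vec` shown in the Batcher struct because removing the oldest element from the front of a `Vec` costs O(n).

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// A thread-safe FIFO buffer that drops its oldest item when full.
pub struct BoundedBuffer<T> {
    inner: Mutex<VecDeque<T>>,
    capacity: usize,
}

impl<T> BoundedBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(VecDeque::with_capacity(capacity)),
            capacity,
        }
    }

    /// Adds an item. If the buffer was full, the oldest item is evicted
    /// and returned so the caller can count or log the loss.
    pub fn push(&self, item: T) -> Option<T> {
        let mut queue = self.inner.lock().unwrap();
        let evicted = if queue.len() >= self.capacity {
            queue.pop_front()
        } else {
            None
        };
        queue.push_back(item);
        evicted
    }

    /// Removes and returns up to `max_samples_per_send` of the oldest items.
    pub fn drain_batch(&self, max_samples_per_send: usize) -> Vec<T> {
        let mut queue = self.inner.lock().unwrap();
        let n = max_samples_per_send.min(queue.len());
        queue.drain(..n).collect()
    }

    /// Number of items currently buffered.
    pub fn len(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}
```

Returning the evicted item from `push` makes data loss visible to the caller, which is useful when diagnosing the "data loss" symptoms described in the troubleshooting section.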

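Batching Flow

Data enters the Batcher buffer
A background flush thread checks the buffer periodically
When a trigger condition is met, one batch is taken out
The batch is encoded, compressed, and sent
Data that has been sent is cleared from the buffer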

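Retry Strategy

Core Design

```rust
use std::time::Duration;

pub struct RetryPolicy {
    max_retries: u32,      // give up after this many retries
    min_backoff: Duration, // wait before the first retry
    max_backoff: Duration, // upper bound on any single wait
    current_retry: u32,    // retries performed so far
}
```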

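Exponential Backoff

```rust
use rand::Rng;

// Method on RetryPolicy: the wait doubles with each retry, plus up to 100 ms of random jitter.
fn calculate_backoff(&self) -> Duration {
    // Saturating arithmetic avoids the overflow panic that mul_f64 raises for large retry counts
    let factor = 2u32.saturating_pow(self.current_retry);
    let backoff = self.min_backoff.saturating_mul(factor);
    let jitter = Duration::from_millis(rand::thread_rng().gen_range(0..100));
    backoff.saturating_add(jitter).min(self.max_backoff)
}
```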
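To see what schedule a given configuration produces, the backoff can be computed without the random jitter. The sketch below is a deterministic variant for illustration; both function names are hypothetical, and the real wait times would be up to 100 ms longer because of jitter (still capped at the maximum).

```rust
use std::time::Duration;

/// Backoff for the given retry index (0 = first retry), without jitter:
/// min_backoff * 2^retry, capped at max_backoff.
pub fn backoff_without_jitter(
    retry: u32,
    min_backoff: Duration,
    max_backoff: Duration,
) -> Duration {
    let factor = 2u32.saturating_pow(retry);
    min_backoff.saturating_mul(factor).min(max_backoff)
}

/// The full list of waits for a policy with `max_retries` retries.
pub fn backoff_schedule(
    max_retries: u32,
    min_backoff: Duration,
    max_backoff: Duration,
) -> Vec<Duration> {
    (0..max_retries)
        .map(|retry| backoff_without_jitter(retry, min_backoff, max_backoff))
        .collect()
}
```

With `max_retries = 5`, `min_backoff = 1s`, and `max_backoff = 10s`, the schedule is 1s, 2s, 4s, 8s, 10s, so a batch is abandoned after roughly 25 seconds of waiting in total.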

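Retry Flow

When a send fails, compute the backoff time
Wait for the backoff time, then retry
Increment the retry counter
Give up once the maximum number of retries is reached
Reset the retry counter after a success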
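The flow above can be sketched as a loop that takes the send operation and the sleep function as parameters, which keeps it testable without a network or real delays. Everything here is hypothetical naming. The status classification is an assumption about the Agent; it follows the Prometheus remote write specification, which calls for retrying on 5xx responses, allows retrying on 429, and forbids retrying on other 4xx responses. Jitter is left out so the behavior is deterministic.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SendOutcome {
    Success,
    Retryable,
    Fatal,
}

/// Maps an HTTP status code to a retry decision.
pub fn classify_status(status: u16) -> SendOutcome {
    match status {
        200..=299 => SendOutcome::Success,
        429 | 500..=599 => SendOutcome::Retryable,
        _ => SendOutcome::Fatal,
    }
}

/// Calls `send` until it succeeds, fails fatally, or `max_retries` retries
/// have been used. Returns the number of attempts on success, or the last
/// status code on failure.
pub fn send_with_retry(
    max_retries: u32,
    min_backoff: Duration,
    max_backoff: Duration,
    mut send: impl FnMut() -> u16,
    mut sleep: impl FnMut(Duration),
) -> Result<u32, u16> {
    let mut retry: u32 = 0;
    loop {
        let status = send();
        match classify_status(status) {
            SendOutcome::Success => return Ok(retry + 1),
            SendOutcome::Fatal => return Err(status),
            SendOutcome::Retryable if retry >= max_retries => return Err(status),
            SendOutcome::Retryable => {
                let backoff = min_backoff
                    .saturating_mul(2u32.saturating_pow(retry))
                    .min(max_backoff);
                sleep(backoff);
                retry += 1;
            }
        }
    }
}
```

Because the counter lives inside the function, each batch starts from zero, which corresponds to the "reset after success" step. In production code `sleep` would be `std::thread::sleep` or an async timer.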

Multi-Endpoint Failover

Example Configuration

```yaml
remote_write:
  endpoints:
    - name: primary
      endpoint: http://prometheus-1:9090/api/v1/write
      priority: 1
      enabled: true
    - name: backup
      endpoint: http://prometheus-2:9090/api/v1/write
      priority: 2
      enabled: true
  failover:
    enabled: true
    health_check_interval_secs: 30
```

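How It Works

Endpoint ordering: the endpoint list is sorted by priority
Health checks: endpoint health is checked periodically
Failure detection: failed sends are detected
Automatic switchover: traffic moves to the next healthy endpoint
Recovery detection: the Agent notices when an endpoint becomes healthy again
Automatic failback: traffic returns to the highest-priority healthy endpoint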
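Switchover and failback both reduce to one rule: always send to the best endpoint that is currently usable. The sketch below illustrates that rule; the `Endpoint` struct and `select_active` function are hypothetical, and it assumes that a lower `priority` number means a higher priority, as the example configuration (primary = 1, backup = 2) suggests.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub name: String,
    pub priority: u32,
    pub enabled: bool,
    pub healthy: bool,
}

/// Picks the enabled, healthy endpoint with the lowest priority number.
/// Returns None when no endpoint is usable.
pub fn select_active(endpoints: &[Endpoint]) -> Option<&Endpoint> {
    endpoints
        .iter()
        .filter(|e| e.enabled && e.healthy)
        .min_by_key(|e| e.priority)
}
```

Re-evaluating `select_active` after every health state change gives switchover when the primary turns unhealthy and failback as soon as it recovers, with no separate code path for either.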

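Health Checks

Method: send an empty WriteRequest
Frequency: set by health_check_interval_secs
Failure rule: an endpoint is marked unhealthy after a number of consecutive failures
Recovery rule: an endpoint is marked healthy after a number of consecutive successes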
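The failure and recovery rules form a small state machine with two counters. The sketch below is one way to write it; `HealthTracker` and its threshold parameters are hypothetical, since the text does not name the corresponding configuration options, and starting in the healthy state is an assumption.

```rust
/// Tracks the health of one endpoint from a stream of probe results.
pub struct HealthTracker {
    healthy: bool,
    consecutive_failures: u32,
    consecutive_successes: u32,
    failure_threshold: u32,
    success_threshold: u32,
}

impl HealthTracker {
    /// Creates a tracker that starts out healthy.
    pub fn new(failure_threshold: u32, success_threshold: u32) -> Self {
        Self {
            healthy: true,
            consecutive_failures: 0,
            consecutive_successes: 0,
            failure_threshold,
            success_threshold,
        }
    }

    /// Records one probe result and returns the resulting health state.
    pub fn record(&mut self, probe_ok: bool) -> bool {
        if probe_ok {
            // A success breaks any run of failures.
            self.consecutive_failures = 0;
            self.consecutive_successes = self.consecutive_successes.saturating_add(1);
            if !self.healthy && self.consecutive_successes >= self.success_threshold {
                self.healthy = true;
            }
        } else {
            // A failure breaks any run of successes.
            self.consecutive_successes = 0;
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
            if self.healthy && self.consecutive_failures >= self.failure_threshold {
                self.healthy = false;
            }
        }
        self.healthy
    }

    pub fn is_healthy(&self) -> bool {
        self.healthy
    }
}
```

Requiring several consecutive results in each direction prevents a single dropped probe from causing a switchover, and a single lucky probe from causing a premature failback.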

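Performance Tuning

1. Batching Settings

Setting	Suggested value	Description
`capacity`	10000-50000	Buffer capacity, in samples
`max_samples_per_send`	1000-5000	Maximum number of samples per send
`batch_send_deadline_secs`	1-5	Deadline for sending a batch, in seconds

2. Concurrency Settings

Setting	Suggested value	Description
`max_shards`	3-10	Maximum number of concurrent shards

3. Retry Settings

Setting	Suggested value	Description
`max_retries`	3-5	Maximum number of retries
`min_backoff_secs`	1-3	Minimum backoff time, in seconds
`max_backoff_secs`	10-30	Maximum backoff time, in seconds

4. Network Settings

Connection reuse: enable the HTTP connection pool
Timeouts: set HTTP timeouts that fit the network and the backend
Compression: enable Snappy compression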

5. Tuning Recommendations

Low load (<1000 samples/second):

max_shards: 1-2
max_samples_per_send: 500-1000
batch_send_deadline_secs: 5

Medium load (1000-10000 samples/second):

max_shards: 3-5
max_samples_per_send: 1000-2000
batch_send_deadline_secs: 3

High load (>10000 samples/second):

max_shards: 5-10
max_samples_per_send: 2000-5000
batch_send_deadline_secs: 1
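A quick back-of-the-envelope check helps when choosing values from these ranges. The two helper functions below are hypothetical and rest on simplifying assumptions: the ingest rate is constant, every request carries a full batch, and load is spread evenly across shards.

```rust
/// Seconds of ingest the buffer can absorb while nothing is being sent,
/// before the oldest samples start to be evicted.
pub fn buffer_headroom_secs(capacity: usize, samples_per_sec: f64) -> f64 {
    capacity as f64 / samples_per_sec
}

/// Requests per second each shard must complete to keep up with ingest.
pub fn requests_per_sec_per_shard(
    samples_per_sec: f64,
    max_samples_per_send: usize,
    max_shards: usize,
) -> f64 {
    samples_per_sec / (max_samples_per_send as f64 * max_shards as f64)
}
```

At 10,000 samples/second with `capacity = 50000`, the buffer covers only 5 seconds of outage, which is shorter than a typical retry sequence. With `max_samples_per_send = 2000` and `max_shards = 5`, each shard needs to complete one request per second, so request latency should stay well under one second.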

API Endpoints

1. Endpoint Management

List all endpoints:

```bash
curl http://localhost:9090/api/v1/endpoints | jq
```

Example response:

```json
{
  "success": true,
  "data": {
    "endpoints": [
      {
        "name": "primary",
        "endpoint": "http://prometheus-1:9090/api/v1/write",
        "priority": 1,
        "enabled": true,
        "healthy": true
      },
      {
        "name": "backup",
        "endpoint": "http://prometheus-2:9090/api/v1/write",
        "priority": 2,
        "enabled": true,
        "healthy": true
      }
    ],
    "buffer_size": 150,
    "max_shards": 5
  }
}
```

2. Add an Endpoint

```bash
curl -X POST http://localhost:9090/api/v1/endpoints \
  -H "Content-Type: application/json" \
  -d '{
    "name": "new-endpoint",
    "endpoint": "http://prometheus-3:9090/api/v1/write",
    "priority": 3,
    "enabled": true
  }'
```

3. Update Endpoints

The request body is a JSON array of endpoint definitions:

```bash
curl -X PUT http://localhost:9090/api/v1/endpoints \
  -H "Content-Type: application/json" \
  -d '[
    {
      "name": "primary",
      "endpoint": "http://new-prometheus:9090/api/v1/write",
      "priority": 1,
      "enabled": true
    }
  ]'
```

4. Delete an Endpoint

```bash
curl -X DELETE http://localhost:9090/api/v1/endpoints/primary
```

5. Enable or Disable an Endpoint

```bash
# Enable
curl -X POST http://localhost:9090/api/v1/endpoints/primary/enable

# Disable
curl -X POST http://localhost:9090/api/v1/endpoints/primary/disable
```

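Troubleshooting

1. Send Failures

Symptom: remote write failed errors appear in the logs

Diagnosis steps:

Check network connectivity to the endpoint
Check the endpoint URL
Check the authentication configuration
Check that Prometheus is running
Read the detailed error message

Solutions:

Restore network connectivity
Correct the endpoint URL
Correct the authentication configuration
Make sure the Remote Write receiver is enabled in Prometheus (the --web.enable-remote-write-receiver flag)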

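2. Data Loss

Symptom: data does not arrive at the remote storage

Diagnosis steps:

Check the Batcher configuration
Check the retry policy
Check local persistence
Review the send logs

Solutions:

Adjust the Batcher configuration
Tune the retry policy
Enable local persistence
Increase the buffer capacity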

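3. Performance Problems

Symptom: high send latency, or high CPU or memory usage

Diagnosis steps:

Check system resources
Check network bandwidth
Check the batching configuration
Check the concurrency configuration

Solutions:

Add system resources
Improve network bandwidth
Adjust the batching configuration
Tune the concurrency settings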

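4. Failover Does Not Take Effect

Symptom: traffic does not switch to the backup endpoint when the primary endpoint fails

Diagnosis steps:

Check the failover configuration
Check the health status of each endpoint
Review the failover logs
Test the backup endpoint directly

Solutions:

Enable failover
Correct the endpoint configuration
Make sure the backup endpoint is available
Adjust the health check configuration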

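Best Practices

1. Configuration

Endpoints:

Configure multiple endpoints for high availability
Assign sensible priorities
Check endpoint health regularly

Batching:

Adjust batching parameters to match the load
Balance latency against throughput
Avoid overly large batches

Retries:

Set a reasonable number of retries
Avoid backoff times that are too short
Prevent retry storms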

2. Monitoring

Key metrics:

agent_remote_write_requests_total: total number of send requests
agent_remote_write_errors_total: total number of send errors
agent_remote_write_retries_total: total number of retries
agent_buffer_size: current buffer size

Alerting rules:

```yaml
groups:
- name: remote_write
  rules:
  - alert: RemoteWriteErrors
    expr: rate(agent_remote_write_errors_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Remote Write errors"
      description: "Remote Write has reported errors during the past 5 minutes"

  - alert: RemoteWriteBacklog
    expr: agent_buffer_size > 5000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Remote Write backlog"
      description: "The Remote Write buffer backlog exceeds 5000 samples"
```

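3. Deployment

Network:

Make sure the network connection is stable
Configure appropriate network timeouts
Consider a dedicated line or a VPN

Storage:

Choose a high-performance storage backend
Make sure the storage backend has enough capacity
Configure an appropriate retention policy

Security:

Use HTTPS connections
Configure appropriate authentication
Restrict network access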

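Summary

The Remote Write feature of AI Observability Agent provides efficient, reliable data push:

Efficient encoding: Protobuf encoding + Snappy compression
Batched sending: improves transfer efficiency
Smart retries: exponential backoff + jitter
Failover: automatic switching across multiple endpoints
Performance tuning: a wide range of configuration options
Easy management: a complete API

With Remote Write, the Agent can push monitoring data efficiently and reliably to a variety of storage backends.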

Next Steps

Grafana Visualization - ready-to-use monitoring dashboards
Quick Start - 5-minute deployment guide