[Elasticsearch] ES 7.2: Migrating Large Volumes of Data Across Clusters

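Overview

In an Elasticsearch 7.2 environment, when migrating more than 100GB of data across clusters, a single synchronous _reindex call is a poor fit: the HTTP request tends to time out, and the node can run out of memory. A better strategy is asynchronous tasks + migrating in time-based segments + performance tuning. The test environment runs two Elasticsearch 7.2 instances in Docker, exposed through port mappings as follows:

Source cluster: host port 9200 → container port 9200
Target cluster: host port 19200 → container port 9200

Both containers run on the same host (this is a key precondition). In my tests I used the host IP for communication between the containers; substitute the values from your own environment.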

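The main steps are:

Source cluster: create the index + insert test data
Target cluster: create the target index (optional) + configure the whitelist
Run the remote _reindex
Verify data consistency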

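Environment

Component	Address (host view)	Notes
Source ES	http://192.168.1.10:9200	Single node; holds the large index logs-2025-03
Target ES	http://192.168.1.10:19200	Single node or cluster
Network	Same host (Docker default bridge)	
ES version	7.2.0	

Data size: >100GB, hundreds of millions of documents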

Assume the two containers have already been started:

# The password is single-quoted so that an interactive bash does not try to history-expand "!".
docker run -d --name es-source -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=true" -e 'ELASTIC_PASSWORD=fC3!eG5#iF' docker.elastic.co/elasticsearch/elasticsearch:7.2.0
docker run -d --name es-target -p 19200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=true" -e 'ELASTIC_PASSWORD=fC3!eG5#iF' docker.elastic.co/elasticsearch/elasticsearch:7.2.0

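Prepare the source index (simulating a large data set)

In real production the index already exists; this section only shows a typical structure. The key points: the documents carry an @timestamp field, and _id is either auto-generated or a unique business ID.

Create the source index (with a time field, which makes segmented migration possible)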

curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:9200/logs-2025-03" \
-H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "index.codec": "best_compression"
  },
  "mappings": {
    "properties": {
      "message": { "type": "text" },
      "service": { "type": "keyword" },
      "level": { "type": "keyword" },
      "host.ip": { "type": "ip" },
      "@timestamp": { "type": "date", "format": "strict_date_optional_time||epoch_millis" }
    }
  }
}'

Key point: the index contains an @timestamp field, which is used to migrate the data in time-based segments.
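The step list mentions inserting test data, but the article does not show how. Below is a minimal sketch of one way to do it with the _bulk API. The function name gen_bulk_lines, the file path /tmp/bulk.ndjson, and all field values are made up for illustration; only the field names come from the mapping above.

```shell
#!/usr/bin/env bash
# Emit N documents as _bulk NDJSON (one action line + one source line per doc).
# Field names follow the mapping above; the values are invented test data.
gen_bulk_lines() {
  local n=$1 day=${2:-2025-03-01}
  local i level
  for ((i = 0; i < n; i++)); do
    # Alternate log levels so the keyword field has more than one value.
    if ((i % 2 == 0)); then level="INFO"; else level="ERROR"; fi
    printf '{"index":{}}\n'
    printf '{"message":"test message %d","service":"demo","level":"%s","host.ip":"10.0.0.%d","@timestamp":"%sT%02d:%02d:%02dZ"}\n' \
      "$i" "$level" $((i % 250 + 1)) "$day" $((i / 3600 % 24)) $((i / 60 % 60)) $((i % 60))
  done
}

# Usage (not executed here): write a batch to a file and send it to the source cluster.
#   gen_bulk_lines 1000 2025-03-01 > /tmp/bulk.ndjson
#   curl -u elastic:'fC3!eG5#iF' -s -H 'Content-Type: application/x-ndjson' \
#     -X POST "http://localhost:9200/logs-2025-03/_bulk" --data-binary @/tmp/bulk.ndjson
```

Timestamps advance by one second per document, so a batch of up to 86,400 documents stays within the given day. A data set of realistic size would need many such batches spread over several days.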

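Configure the target cluster

Note: this step must be performed on the target cluster, and it requires a restart.

Edit elasticsearch.yml

Log in to the target cluster machine. In this environment you can enter the Docker container directly and edit /usr/share/elasticsearch/config/elasticsearch.yml.

Add the following at the end of the file: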

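# Allow this cluster to pull reindex data from the source cluster
reindex.remote.whitelist: "192.168.1.10:9200"
# Multiple addresses are comma-separated. Use this form INSTEAD of the line above,
# because the key may appear only once in the file:
# reindex.remote.whitelist: "192.168.1.10:9200,192.168.1.11:9200"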
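Since the whitelist key should appear only once in elasticsearch.yml, it can help to script the edit so that re-running it does not add a second copy. The following is a rough sketch that operates on a local copy of the file; the function name set_reindex_whitelist is my own, and it assumes the setting is written as a single flat line as in the snippet above.

```shell
#!/usr/bin/env bash
# Set reindex.remote.whitelist in a local copy of elasticsearch.yml.
# Any existing (uncommented) whitelist line is dropped first, so the key
# never ends up in the file twice.
set_reindex_whitelist() {
  local file=$1 hosts=$2
  local tmp
  tmp=$(mktemp)
  # grep exits non-zero when nothing is left to print; that is fine here.
  grep -v '^reindex\.remote\.whitelist:' "$file" > "$tmp" || true
  printf 'reindex.remote.whitelist: "%s"\n' "$hosts" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (not executed here), with the container name from this article:
#   docker cp es-target:/usr/share/elasticsearch/config/elasticsearch.yml ./elasticsearch.yml
#   set_reindex_whitelist ./elasticsearch.yml "192.168.1.10:9200"
#   docker cp ./elasticsearch.yml es-target:/usr/share/elasticsearch/config/elasticsearch.yml
```

After copying the file back, it is worth checking that it is still readable by the user Elasticsearch runs as inside the container, since docker cp may change ownership.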

Restart the Elasticsearch service

docker restart es-target
# Wait until the service has fully started

No restart = the whitelist does not take effect = the remote reindex fails!
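"Wait until the service has fully started" can be automated by polling the cluster health endpoint. Here is a small sketch; the function name wait_for_es and its retry defaults (30 tries, 5 seconds apart) are arbitrary choices of mine, not something the setup requires.

```shell
#!/usr/bin/env bash
# Poll the cluster health endpoint until it reports at least yellow status.
# Returns 0 once the node is ready, 1 if it never became ready.
wait_for_es() {
  local url=$1 auth=$2 tries=${3:-30} delay=${4:-5}
  local i
  for ((i = 0; i < tries; i++)); do
    # -f makes curl exit non-zero on HTTP error responses, and a refused
    # connection (node still starting) fails as well.
    if curl -sf -u "$auth" "$url/_cluster/health?wait_for_status=yellow&timeout=5s" > /dev/null; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Usage (not executed here):
#   docker restart es-target
#   wait_for_es "http://localhost:19200" 'elastic:fC3!eG5#iF' && echo "target is up"
```

Once the node is up, the node settings API (GET _nodes/settings) should show the reindex.remote.whitelist value if the configuration was picked up.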

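(Optional) Pre-create the target index (recommended)

If you want to customize the target index (for example, change the mapping or the shard count), you must create it before migrating. Otherwise the target index is created automatically on the first write, with default settings and dynamically inferred mappings; _reindex does not copy the source index's settings or mappings. A practical approach is to look up the index definition on the source cluster and adapt it.

# number_of_shards: adjust to the size of the target cluster
# number_of_replicas: 0 disables replicas during the migration to speed up writes
# refresh_interval: -1 temporarily disables refresh
# migrated_at: a new field that does not exist in the source mapping
curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:19200/logs-2025-03" \
-H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "refresh_interval": "-1",
    "index.translog.durability": "async"
  },
  "mappings": {
    "properties": {
      "message": { "type": "text" },
      "service": { "type": "keyword" },
      "level": { "type": "keyword" },
      "host.ip": { "type": "ip" },
      "@timestamp": { "type": "date" },
      "migrated_at": { "type": "date" }
    }
  }
}'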

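Large-data migration strategy

Do not reindex the whole index in one go! Split it into time-based segments, run each as an asynchronous task, and throttle the rate.

Migrating in time-based segments (example: one batch per day)

Finding the time range: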

MIN_TS=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/logs-2025-03/_search?size=1&sort=@timestamp:asc" -s | jq -r '.hits.hits[0]._source["@timestamp"]')
MAX_TS=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/logs-2025-03/_search?size=1&sort=@timestamp:desc" -s | jq -r '.hits.hits[0]._source["@timestamp"]')

echo "Time range: $MIN_TS → $MAX_TS"

Assume the data spans 2025-03-01 to 2025-03-31.
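Given a start and end date, the daily windows can be generated mechanically. A minimal sketch follows, assuming GNU date is available (as in the rest of this article); the function name day_ranges is mine.

```shell
#!/usr/bin/env bash
# Print one "gte lt" pair per day for the inclusive range [start, end].
# Relies on GNU date (-d); BSD/macOS date uses different flags.
day_ranges() {
  local current=$1 end=$2 next
  # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
  while [[ ! "$current" > "$end" ]]; do
    next=$(date -u -d "$current +1 day" +%Y-%m-%d)
    echo "$current $next"
    current=$next
  done
}

day_ranges 2025-03-01 2025-03-03
```

This prints three lines: "2025-03-01 2025-03-02", "2025-03-02 2025-03-03", and "2025-03-03 2025-03-04". Each pair maps onto the gte (inclusive) and lt (exclusive) bounds of a range query, so consecutive windows neither overlap nor leave gaps.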

Migrate the data for 2025-03-01:

curl -u elastic:'fC3!eG5#iF' -X POST "http://localhost:19200/_reindex?wait_for_completion=false" \
-H 'Content-Type: application/json' -d '
{
  "source": {
    "remote": {
      "host": "http://192.168.1.10:9200",
      "username": "elastic",
      "password": "fC3!eG5#iF"
    },
    "index": "logs-2025-03",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "2025-03-01T00:00:00Z",
          "lt": "2025-03-02T00:00:00Z"
        }
      }
    }
  },
  "dest": {
    "index": "logs-2025-03"
  },
  "script": {
    "source": "ctx._source.migrated_at = new Date();",
    "lang": "painless"
  }
}'

The call returns {"task":"drtUe-2aQmuZBf0iwE34Eg:34799"}; make a note of the task ID.

Monitor task progress

curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/_tasks/drtUe-2aQmuZBf0iwE34Eg:34799"
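The task response contains running counters under task.status, which can be turned into a rough progress figure. The sketch below keeps the arithmetic in a small helper (progress_pct, a name I made up) and shows the jq extraction only as commented usage, since it needs a live task.

```shell
#!/usr/bin/env bash
# Integer percentage of documents processed so far.
# Prints 0 while total is still 0 (the task has not reported a total yet).
progress_pct() {
  local processed=$1 total=$2
  if ((total == 0)); then
    echo 0
  else
    echo $((processed * 100 / total))
  fi
}

# Usage (not executed here): pull the counters out of the task status with jq.
#   status=$(curl -u elastic:'fC3!eG5#iF' -s "http://localhost:19200/_tasks/<task_id>")
#   total=$(echo "$status" | jq '.task.status.total')
#   processed=$(echo "$status" | jq '.task.status | .created + .updated + .deleted')
#   echo "$(progress_pct "$processed" "$total")% done"
```

The percentage is only an approximation of remaining time, because throughput can vary between batches.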

Throttling parameter (to avoid overwhelming the source cluster)

Add the following to the URL:

?requests_per_second=500&wait_for_completion=false

Full example:

curl -u elastic:'fC3!eG5#iF' -X POST "http://localhost:19200/_reindex?requests_per_second=500&wait_for_completion=false" -H 'Content-Type: application/json' -d '{...}'

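💡 requests_per_second=500 caps the task at roughly 500 documents per second; Elasticsearch enforces this by pausing between batches. Tune the value to the load on the source cluster.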
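Before settling on a throttle value, it is worth estimating how long the whole migration would take at that rate. A back-of-the-envelope sketch: the figure of 300 million documents is an assumption chosen to match "hundreds of millions", and estimate_hours is a name of my own.

```shell
#!/usr/bin/env bash
# Rough lower bound, in whole hours, for moving `docs` documents at a
# throttle of `rps` documents per second (integer division, rounds down).
estimate_hours() {
  local docs=$1 rps=$2
  echo $((docs / rps / 3600))
}

estimate_hours 300000000 500
```

This prints 166, i.e. close to a week at 500 documents per second, so a low throttle protects the source cluster at the cost of a much longer migration. If the source cluster has headroom, the rate of a running task can be changed through the _rethrottle endpoint (POST _reindex/<task_id>/_rethrottle?requests_per_second=...) instead of restarting the task.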

Batch automation script (recommended)

Create migrate.sh:

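#!/bin/bash
# Requires: curl, jq, GNU date.
INDEX="logs-2025-03"
START="2025-03-01"
END="2025-03-10"
# ES_USER/ES_PASS rather than USER, which is already a standard shell variable.
ES_USER="elastic"
ES_PASS='fC3!eG5#iF'
SRC_HOST="http://192.168.1.10:9200"
DST_HOST="http://192.168.1.10:19200"

# Dates that have been migrated successfully, one per line.
# The file is kept between runs so that an interrupted migration can resume.
CHECKPOINT_FILE="/tmp/es_migrate_checkpoint_${INDEX}.txt"
touch "$CHECKPOINT_FILE"

current=$START
# ISO dates compare correctly as strings; loop while current <= END.
while [[ ! "$current" > "$END" ]]; do
  next=$(date -d "$current +1 day" +%Y-%m-%d)

  # Skip days already recorded in the checkpoint file.
  if grep -qx "$current" "$CHECKPOINT_FILE"; then
    echo "Skipping [$current, $next): already migrated"
    current=$next
    continue
  fi

  echo "Migrating [$current, $next) ..."

  # Submit the asynchronous reindex task.
  response=$(curl -u "$ES_USER:$ES_PASS" -s -X POST "$DST_HOST/_reindex?wait_for_completion=false" \
    -H 'Content-Type: application/json' -d "
    {
      \"source\": {
        \"remote\": {
          \"host\": \"$SRC_HOST\",
          \"username\": \"$ES_USER\",
          \"password\": \"$ES_PASS\"
        },
        \"index\": \"$INDEX\",
        \"query\": {
          \"range\": {
            \"@timestamp\": {
              \"gte\": \"$current\",
              \"lt\": \"$next\"
            }
          }
        }
      },
      \"dest\": {
        \"index\": \"$INDEX\"
      },
      \"script\": {
        \"source\": \"ctx._source.migrated_at = new Date();\",
        \"lang\": \"painless\"
      }
    }")

  # Stop early if the request was rejected and no task ID came back.
  task_id=$(echo "$response" | jq -r '.task // empty')
  if [ -z "$task_id" ]; then
    echo "❌ Could not start task for [$current, $next): $response"
    exit 1
  fi
  echo "Task: $task_id for [$current, $next)"

  # Poll until the task reports completed=true.
  while true; do
    status=$(curl -u "$ES_USER:$ES_PASS" -s "$DST_HOST/_tasks/$task_id")
    if echo "$status" | jq -e '.completed' > /dev/null; then
      break
    fi
    sleep 10
  done

  # Check for a task-level error as well as per-document failures.
  error=$(echo "$status" | jq -c '.error // empty')
  failures=$(echo "$status" | jq '.response.failures | length')
  total=$(echo "$status" | jq '.response.total')
  created=$(echo "$status" | jq '.response.created')

  if [ -n "$error" ] || [ "$failures" -gt 0 ]; then
    echo "❌ FAILED for [$current, $next): $failures failures $error"
    exit 1
  else
    echo "✅ Success: $created/$total docs migrated for [$current, $next)"
    echo "$current" >> "$CHECKPOINT_FILE"
  fi

  current=$next
done

echo "All segments migrated."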

Make it executable and run it:

chmod +x migrate.sh

./migrate.sh

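In real production setups, indices are usually rolled over daily or at some other fixed interval so that no single index grows too large. In that case there is no need to split a single index by time as shown above; each index can simply be migrated as its own batch.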

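Optimize the target index after migration

Restore replicas, refresh, and translog durability

# translog.durability goes back to the default "request" if it was set to "async" for the migration.
curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:19200/logs-2025-03/_settings" \
-H 'Content-Type: application/json' -d '
{
  "index": {
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "translog.durability": "request"
  }
}'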

Force merge (fewer segments, better query performance)

curl -u elastic:'fC3!eG5#iF' -X POST "http://localhost:19200/logs-2025-03/_forcemerge?max_num_segments=1"

forcemerge is a heavy operation; run it during off-peak hours.

Verify data consistency

Compare total document counts

src_count=$(curl -u elastic:'fC3!eG5#iF' "http://192.168.1.10:9200/logs-2025-03/_count" -s | jq '.count')
dst_count=$(curl -u elastic:'fC3!eG5#iF' "http://192.168.1.10:19200/logs-2025-03/_count" -s | jq '.count')


if [ "$src_count" -eq "$dst_count" ]; then
  echo "✅ Total count matches: $src_count"
else
  echo "❌ Count mismatch: src=$src_count, dst=$dst_count"
fi

Check each time segment (to catch missed segments)

# Re-check every day recorded in the checkpoint file
while read -r day; do
  next=$(date -d "$day +1 day" +%Y-%m-%d)
  # Same range query for both clusters: [day, next)
  body="{\"query\":{\"range\":{\"@timestamp\":{\"gte\":\"$day\",\"lt\":\"$next\"}}}}"
  count_src=$(curl -u elastic:'fC3!eG5#iF' -s "http://192.168.1.10:9200/logs-2025-03/_count" -H 'Content-Type: application/json' -d "$body" | jq '.count')
  count_dst=$(curl -u elastic:'fC3!eG5#iF' -s "http://192.168.1.10:19200/logs-2025-03/_count" -H 'Content-Type: application/json' -d "$body" | jq '.count')

  if [ "$count_src" -ne "$count_dst" ]; then
    echo "❌ Segment $day mismatch: src=$count_src, dst=$count_dst"
  fi
done < /tmp/es_migrate_checkpoint_logs-2025-03.txt

Spot-check by ID (to catch duplicates or losses)

# Take 10 _ids from the source (the first 10 hits returned, so not a truly random sample)
ids=$(curl -u elastic:'fC3!eG5#iF' "http://192.168.1.10:9200/logs-2025-03/_search?size=10" -s | jq -r '.hits.hits[]."_id"')


for id in $ids; do
  exists=$(curl -u elastic:'fC3!eG5#iF' "http://192.168.1.10:19200/logs-2025-03/_doc/$id" -s | jq -r 'if .found then "yes" else "no" end')
  if [ "$exists" != "yes" ]; then
    echo "❌ Document $id missing in target!"
  fi
done

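Error handling and retries

The checkpoint file records the dates that migrated successfully, so an interrupted run can resume from where it stopped, provided the script does not empty the file on startup.
If a particular day fails, that time window can be retried on its own, for example (migrate_single_day.sh is not shown in this article; it would be a one-day variant of migrate.sh):
./migrate_single_day.sh 2025-03-05
Setting "op_type": "create" in the dest section of the _reindex request avoids overwriting documents that already exist in the target. It is a request parameter, not an index setting; the default op_type is index, which overwrites.
For idempotent retries, combine it with "conflicts": "proceed" so that already-existing _ids are counted as version conflicts instead of aborting the task. Note that ctx.op in a _reindex script accepts only noop and delete, so it cannot be set to "create". This approach relies on _id values being carried over unchanged from the source.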
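As a concrete illustration of the retry settings, here is a sketch of a helper that builds the request body for retrying one day. The function name build_retry_body is hypothetical, and the sketch assumes that the credentials contain no double quotes or backslashes, since it does no JSON escaping.

```shell
#!/usr/bin/env bash
# Build a remote-reindex body for one day that only creates missing documents.
# op_type=create turns already-existing _ids into version conflicts instead of
# overwriting them; conflicts=proceed lets the task count them and carry on.
build_retry_body() {
  local src_host=$1 user=$2 pass=$3 index=$4 day=$5 next=$6
  cat <<EOF
{
  "conflicts": "proceed",
  "source": {
    "remote": { "host": "$src_host", "username": "$user", "password": "$pass" },
    "index": "$index",
    "query": { "range": { "@timestamp": { "gte": "$day", "lt": "$next" } } }
  },
  "dest": { "index": "$index", "op_type": "create" }
}
EOF
}

# Usage (not executed here):
#   body=$(build_retry_body "http://192.168.1.10:9200" elastic 'fC3!eG5#iF' logs-2025-03 2025-03-05 2025-03-06)
#   curl -u elastic:'fC3!eG5#iF' -s -X POST "http://localhost:19200/_reindex?wait_for_completion=false" \
#     -H 'Content-Type: application/json' -d "$body"
```

With this body, the version_conflicts counter in the task result shows how many documents were already present in the target, which is a useful sanity check after a partial run.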

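Notes for large-data migrations

Item	Recommendation
Shard count	Target index shard count ≤ source, to avoid too many small shards
Replicas	Set to 0 during migration; restore afterwards
Refresh interval	Set to -1 (disabled) during migration; restore afterwards
Throttling	requests_per_second=200~1000; watch source cluster CPU/IO
Network	Make sure the host has enough bandwidth (>1Gbps)
Disk	Reserve 2x the data size on the target cluster (covers translog and merges)
Monitoring	Use _tasks, _cat/nodes, and _nodes/stats to watch progress in real time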

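Summary: checklist for large cross-cluster migrations

✅ Source index has a time field (such as @timestamp)

✅ Target ES has reindex.remote.whitelist configured

✅ Target index pre-created (with tuned settings/mappings)

✅ _reindex run in time-based segments, asynchronously, and throttled

✅ Replicas and refresh restored, then force merge, after migration

✅ Data consistency verified with full counts + spot checks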

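That covers migrating a large volume of data between clusters; feel free to try it out in your own test environment. Mistakes are inevitable, so corrections are welcome. Thanks!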