[Elasticsearch] ES 7.2: Migrating Large Volumes of Data Across Clusters, Method 2

Overview

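In an Elasticsearch 7.2 environment, Snapshot + Restore is one way to migrate large data volumes across clusters, including multiple indices of more than 100GB each. The test environment runs two Elasticsearch 7.2 instances deployed with Docker and exposed through the following port mappings: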

Source cluster: host port 9200 → container port 9200
Target cluster: host port 19200 → container port 9200

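Both containers run on the same host (in the current test environment). During testing I used the host's address to reach both containers; in real use, substitute the details of your own environment.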

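The main steps are:

Source cluster: create the indices and insert test data
Both clusters: configure path.repo and register the snapshot repository
Source cluster: create the snapshot, then copy the snapshot files to the target's repository directory
Target cluster: restore the snapshot
Verify data consistency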

Overall architecture

Host
│
├── /es-snapshots/   ← (each container maps a directory to the host; the snapshot data is copied between them later)
│
├── es-source (9200)   ← source cluster
│   └── indices: logs-2025-03, metrics-2025-03
│
└── es-target (19200)   ← target cluster
    └── restore → logs-2025-03, metrics-2025-03

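Snapshot + Restore is the officially recommended approach for migrating large volumes of data. It offers:

Strong consistency (atomic snapshots built on Lucene segment files)

Built-in checksums (corruption is detected automatically)

Incremental backups

Verifiable integrity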

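Deploy two ES instances (Docker)

Start two ES instances, each with its own data directory mounted from the host. The snapshot files are later written under this directory and then synchronized from the source's directory to the target's.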

# Start the source ES (9200)
docker run -d \
  --name es-source \
  --hostname es-source \
  -p 9200:9200 \
  -v /data/elasticsearch/data:/usr/share/elasticsearch/data:rw \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=true" \
  -e "ELASTIC_PASSWORD=fC3!eG5#iF" \
  -e "path.repo=/usr/share/elasticsearch/data/backups" \
  docker.elastic.co/elasticsearch/elasticsearch:7.2.0

# Start the target ES (19200)
docker run -d \
  --name es-target \
  --hostname es-target \
  -p 19200:9200 \
  -v /data/elasticsearch-bak/data:/usr/share/elasticsearch/data:rw \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=true" \
  -e "ELASTIC_PASSWORD=fC3!eG5#iF" \
  -e "path.repo=/usr/share/elasticsearch/data/backups" \
  docker.elastic.co/elasticsearch/elasticsearch:7.2.0

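Key point: mount a host directory into each container so the snapshot files can be copied later. A more convenient option is to mount the same host directory into both containers as the repository location; in that case it is safer to register it as read-only on the target ("readonly": true), so that only one cluster writes to the repository.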

Configure the snapshot repository (source and target)

Register the repository on the source cluster

curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:9200/_snapshot/my_backup" \
-H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": {
    "location": "/usr/share/elasticsearch/data/backups",
    "compress": true,
    "chunk_size": "64mb"
  }
}'

Register a repository with the same name on the target cluster

curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:19200/_snapshot/my_backup" \
-H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": {
    "location": "/usr/share/elasticsearch/data/backups",
    "compress": true,
    "chunk_size": "64mb"
  }
}'

✅ Verify:

curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/_snapshot/my_backup/_all"
curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/_snapshot/my_backup/_all"
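The two registration calls differ only in the port, so they are easy to script. The following is a minimal sketch, assuming Python 3 and only the standard library; `repo_request` and `register_repo` are hypothetical helper names, and the settings simply mirror the curl bodies above.

```python
import base64
import json
import urllib.request

# Same repository definition as in the curl commands above.
REPO_SETTINGS = {
    "type": "fs",
    "settings": {
        "location": "/usr/share/elasticsearch/data/backups",
        "compress": True,
        "chunk_size": "64mb",
    },
}


def repo_request(base_url, repo="my_backup"):
    """Return (url, body_bytes) for registering the fs repository."""
    url = "{}/_snapshot/{}".format(base_url.rstrip("/"), repo)
    return url, json.dumps(REPO_SETTINGS).encode("utf-8")


def register_repo(base_url, user, password, repo="my_backup"):
    """Send the PUT request; urllib raises HTTPError on a non-2xx reply."""
    url, body = repo_request(base_url, repo)
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, data=body, method="PUT", headers={
        "Content-Type": "application/json",
        "Authorization": "Basic " + token,
    })
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `register_repo("http://localhost:9200", "elastic", "...")` and then the same with port 19200 would register the repository on both clusters under these assumptions.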

Create the source indices and write test data (multiple indices)

Create the logs-2025-03 index

curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:9200/logs-2025-03" \
-H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "message": { "type": "text" },
      "service": { "type": "keyword" },
      "@timestamp": { "type": "date" }
    }
  }
}'

Create the metrics-2025-03 index

curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:9200/metrics-2025-03" \
-H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "cpu_usage": { "type": "float" },
      "host": { "type": "keyword" },
      "ts": { "type": "date" }
    }
  }
}'

Bulk-write test data (Python 2 script example; mind your Python version)

from __future__ import print_function
import requests
import json
from datetime import datetime, timedelta


def bulk_index(index_name, docs):
    url = "http://localhost:9200/{}/_bulk".format(index_name)
    auth = ("elastic", "fC3!eG5#iF")
    headers = {"Content-Type": "application/x-ndjson"}

    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc, ensure_ascii=False))

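    # Send bytes so that non-ASCII text is not mangled by the HTTP client
    data = ("\n".join(lines) + "\n").encode("utf-8")
    resp = requests.post(url, auth=auth, headers=headers, data=data)
    # HTTP 200 only means the request was accepted; per-item failures are flagged in "errors"
    print("{}: {} errors={}".format(index_name, resp.status_code, resp.json().get("errors")))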


log_docs = []
for i in range(100):
    ts = (datetime.utcnow() - timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:%S") + "Z"
    log_docs.append({
        "message": "User login {}".format(i),
        "service": "auth",
        "@timestamp": ts
    })

metric_docs = []
for i in range(50000):
    ts = (datetime.utcnow() - timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:%S") + "Z"
    metric_docs.append({
        "cpu_usage": round(20 + (i % 80), 2),
        "host": "host-{}".format(i % 10),
        "ts": ts
    })

bulk_index("logs-2025-03", log_docs)
bulk_index("metrics-2025-03", metric_docs)

In real production, data is written by Logstash/Filebeat and naturally carries a time field.
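The test script above sends all 50,000 metric documents in one _bulk request. For larger test sets it is usually safer to split the documents into batches and to inspect the per-item results. A minimal sketch, assuming Python 3; `bulk_bodies`, `bulk_failures` and the batch size of 5000 are illustrative choices, not values taken from the original test.

```python
import json


def bulk_bodies(docs, batch_size=5000):
    """Yield NDJSON _bulk bodies, each holding at most batch_size documents."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            lines.append(json.dumps({"index": {}}))
            lines.append(json.dumps(doc, ensure_ascii=False))
        # The _bulk API requires a trailing newline after the last line.
        yield ("\n".join(lines) + "\n").encode("utf-8")


def bulk_failures(response_json):
    """Return the per-item results that carry an error in a _bulk response."""
    if not response_json.get("errors"):
        return []
    return [item for item in response_json["items"]
            if "error" in next(iter(item.values()))]
```

Each yielded body can be posted to `/{index}/_bulk` exactly as in `bulk_index`, and `bulk_failures(resp.json())` then lists the items that were rejected.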

Take the snapshot (source cluster)

Create a snapshot that contains multiple indices

curl -u elastic:'fC3!eG5#iF' -X PUT "http://localhost:9200/_snapshot/my_backup/snapshot_multi_20250320?wait_for_completion=false" \
-H 'Content-Type: application/json' -d '
{
  "indices": "logs-2025-03,metrics-2025-03",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "migrated_by": "snapshot-migration",
    "version": "1.0"
  }
}'

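Note: for large data volumes, use wait_for_completion=false and poll the status through the API. As far as I know, the custom "metadata" field in the request above is only supported from Elasticsearch 7.3 onward, so on 7.2 it may be ignored and can be dropped.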

Check snapshot progress

curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/_snapshot/my_backup/snapshot_multi_20250320?pretty"

# A successful response should include:
{
  "snapshots": [{
    "snapshot": "snapshot_multi_20250320",
    "uuid": "...",
    "state": "SUCCESS",
    "indices": ["logs-2025-03", "metrics-2025-03"],
    "shards": { "total": 9, "successful": 9 }
  }]
}
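Polling can be automated by calling the status request above until the state is no longer IN_PROGRESS. A minimal sketch, assuming Python 3; `wait_for_snapshot` is a hypothetical helper, and `fetch` stands for any function of yours that performs the GET shown above and returns the parsed JSON.

```python
import time

# States after which the snapshot will not change any more.
TERMINAL_STATES = {"SUCCESS", "PARTIAL", "FAILED"}


def snapshot_state(response_json):
    """Extract the state of the first snapshot in a GET _snapshot response."""
    return response_json["snapshots"][0]["state"]


def wait_for_snapshot(fetch, interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Call fetch() until the snapshot reaches a terminal state or the timeout expires."""
    waited = 0.0
    while True:
        state = snapshot_state(fetch())
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout:
            raise TimeoutError("snapshot still {} after {}s".format(state, waited))
        sleep(interval)
        waited += interval
```

Only a returned state of SUCCESS means every shard was captured; PARTIAL or FAILED should stop the migration before anything is copied.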

Restore to the target cluster

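# First copy the snapshot files from the source's host-mapped directory to the target's.
# -r is required because the repository contains subdirectories (indices/);
# the copied files must be readable by the elasticsearch user inside the target container.
cp -r /data/elasticsearch/data/backups/. /data/elasticsearch-bak/data/backups/

# Restore all indices (optionally renaming them)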
curl -u elastic:'fC3!eG5#iF' -X POST "http://localhost:19200/_snapshot/my_backup/snapshot_multi_20250320/_restore?wait_for_completion=false" \
-H 'Content-Type: application/json' -d '
{
  "indices": "logs-2025-03,metrics-2025-03",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}'

Renaming on restore avoids overwriting existing indices; switch the aliases only after verification.
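The manual copy of the repository directory is the one step that happens outside Elasticsearch, so it can be worth checking before the restore is started. A minimal sketch, assuming Python 3 and that both host directories are readable by the current user; `tree_digest` and `copy_problems` are hypothetical helper names, not part of any Elasticsearch tooling.

```python
import hashlib
import os


def tree_digest(root):
    """Map each file's path relative to root to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            sha = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large segment files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, root)] = sha.hexdigest()
    return digests


def copy_problems(src_root, dst_root):
    """Return sorted relative paths that are missing or different under dst_root."""
    src, dst = tree_digest(src_root), tree_digest(dst_root)
    return sorted(path for path in src if dst.get(path) != src[path])
```

With the directories used in this test, `copy_problems("/data/elasticsearch/data/backups", "/data/elasticsearch-bak/data/backups")` should return an empty list; a non-recursive copy would show up as missing files under `indices/`.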

Check restore progress

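# The restore shows up as shard recoveries of type "snapshot"
curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/_cat/recovery/restored_*?v&active_only=true"
# Restore-related tasks that are still running
curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/_tasks?detailed=true&actions=*restore&pretty"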

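How is data integrity guaranteed? (The core question!)

✅ 1. Atomicity and consistency of the snapshot itself

Snapshots are built on Lucene's immutable segment files

Even if writes arrive while the snapshot is running, each shard's snapshot still reflects a consistent view as of the moment it was taken

There is no "torn" data that is partly new and partly old

✅ 2. Built-in checksums

A CRC32 checksum is computed automatically for every segment file as it is written

Checksums are verified automatically during restore; any corruption makes the restore fail with an error

Data is never lost or corrupted silently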
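The principle behind the checksum check can be illustrated in a few lines. This is only a sketch of the idea using Python 3's `zlib.crc32`, not Lucene's actual file format or verification code; `crc32_of` and `is_intact` are hypothetical names.

```python
import zlib


def crc32_of(data):
    """Return the unsigned 32-bit CRC of a bytes object."""
    return zlib.crc32(data) & 0xFFFFFFFF


def is_intact(data, expected_crc):
    """True when the data still matches the checksum recorded at write time."""
    return crc32_of(data) == expected_crc


segment = b"immutable segment bytes"
stored_crc = crc32_of(segment)            # recorded when the file is written
damaged = b"immutable segment bytez"      # one byte changed in transit

print(is_intact(segment, stored_crc))     # intact copy passes
print(is_intact(damaged, stored_crc))     # corrupted copy is rejected
```

Because the checksum is stored next to the data, a copy that was damaged during the transfer between repositories is caught at restore time rather than at query time.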

✅ 3. Three-level integrity verification

(1) Document count check

# Source cluster
src_logs=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/logs-2025-03/_count" -s | jq '.count')
src_metrics=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/metrics-2025-03/_count" -s | jq '.count')

# Target cluster
dst_logs=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/restored_logs-2025-03/_count" -s | jq '.count')
dst_metrics=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/restored_metrics-2025-03/_count" -s | jq '.count')

echo "Logs: src=$src_logs, dst=$dst_logs"
echo "Metrics: src=$src_metrics, dst=$dst_metrics"
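With more than a couple of indices, comparing the echoed numbers by eye becomes error-prone. A minimal sketch of the same check, assuming Python 3 and that the counts have already been collected into dictionaries (for example from the `_count` calls above); `count_mismatches` is a hypothetical helper and `restored_` is the prefix used in the restore request.

```python
def count_mismatches(src_counts, dst_counts, prefix="restored_"):
    """Return {index: (src_count, dst_count)} where the restored copy differs.

    A dst_count of None means the restored index is missing entirely.
    """
    mismatches = {}
    for index, src in src_counts.items():
        dst = dst_counts.get(prefix + index)
        if dst != src:
            mismatches[index] = (src, dst)
    return mismatches
```

An empty result means every source index has a restored counterpart with the same document count.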

(2) Shard-level metadata check (advanced)

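# Compare the _segments output of source and target (document counts and sizes per segment).
# This only matches if the source index has not been written to or merged since the snapshot.
curl -s -u elastic:'fC3!eG5#iF' "http://localhost:9200/logs-2025-03/_segments" > src_segments.json

curl -s -u elastic:'fC3!eG5#iF' "http://localhost:19200/restored_logs-2025-03/_segments" > dst_segments.json

# Compare the key fields; _segments reports num_docs, deleted_docs and size_in_bytes, not checksums
diff <(jq '.indices."logs-2025-03".shards[][]|.segments|to_entries[].value|{num_docs,deleted_docs,size_in_bytes}' src_segments.json) \
     <(jq '.indices."restored_logs-2025-03".shards[][]|.segments|to_entries[].value|{num_docs,deleted_docs,size_in_bytes}' dst_segments.json)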

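(3) Content sampling check

Because the indices were renamed during the restore, take care to compare each source index with its restored_ counterpart.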

# Compare the full content of 10 sample documents (the first 10 search hits, not a true random sample)
ids=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/logs-2025-03/_search?size=10" -s | jq -r '.hits.hits[]."_id"')
mismatch=0
for id in $ids; do
  src=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:9200/logs-2025-03/_doc/$id" -s)
  dst=$(curl -u elastic:'fC3!eG5#iF' "http://localhost:19200/restored_logs-2025-03/_doc/$id" -s)
  if [ "$(echo "$src" | jq -S '._source')" != "$(echo "$dst" | jq -S '._source')" ]; then
    echo "❌ Content mismatch on ID: $id"
    mismatch=1
  fi
done
# Report success only when no sampled document differed
[ "$mismatch" -eq 0 ] && echo "Content sample verified"
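The same comparison can be done on fingerprints instead of full documents, which is handy when the sample is large. A minimal sketch, assuming Python 3 and that the `GET _doc` responses have already been fetched into dictionaries keyed by document ID; `source_fingerprint` and `mismatched_ids` are hypothetical helper names.

```python
import hashlib
import json


def source_fingerprint(doc):
    """Hash a document's _source using an encoding that ignores key order."""
    canonical = json.dumps(doc["_source"], sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mismatched_ids(src_docs, dst_docs):
    """Return sorted IDs that are missing on the target or whose _source differs.

    Both arguments map document ID -> parsed GET _doc response.
    """
    return sorted(
        doc_id for doc_id, doc in src_docs.items()
        if doc_id not in dst_docs
        or source_fingerprint(doc) != source_fingerprint(dst_docs[doc_id])
    )
```

Only `_source` is hashed, so the differing `_index` names (`logs-2025-03` versus `restored_logs-2025-03`) do not cause false alarms.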

(4) Zero-downtime cutover with aliases

# Create the write aliases
curl -u elastic:'fC3!eG5#iF' -X POST "http://localhost:19200/_aliases" -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "add": { "index": "restored_logs-2025-03", "alias": "logs-write" }},
    { "add": { "index": "restored_metrics-2025-03", "alias": "metrics-write" }}
  ]
}'
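If an alias already points at an older index on the target cluster, the switch can be done in a single _aliases request that removes the old binding and adds the new one, so clients never see the alias without an index. A minimal sketch, assuming Python 3; `alias_swap_actions` is a hypothetical helper and the old index name in the usage note is made up for illustration.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build an _aliases request body that moves alias from old_index to new_index.

    Both actions travel in one request, so the change is applied atomically.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

For example, `alias_swap_actions("logs-write", "logs-old", "restored_logs-2025-03")` yields a body that can be posted to `/_aliases` in the same way as the curl command above.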

Optimization tips for large data volumes (>100GB)

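Item	Recommendation
Storage	Use SSDs or a high-performance NAS (for example, NFS v4)
Compression	"compress": true only compresses repository metadata files (mappings and settings); segment data files are stored as they are, so expect little space saving
Shards	Keep each shard at or below 50GB and avoid large numbers of small shards
Replicas	Snapshots copy primary shards only; to speed up the restore, restore with index.number_of_replicas=0 (via index_settings) and add replicas back afterward
Monitoring	Track progress with _snapshot/{repository}/{snapshot}/_status and the _tasks API
Incremental migration	One full snapshot first, then incremental snapshots (shortens the business interruption)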
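The shard guideline in the table translates into simple arithmetic when planning the target index. A minimal sketch, assuming Python 3; `primary_shards_for` is a hypothetical helper and the 50GB ceiling is the rule of thumb from the table, not a hard limit.

```python
import math


def primary_shards_for(index_size_gb, max_shard_gb=50):
    """Smallest number of primary shards that keeps each shard at or below max_shard_gb."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))


# A 120GB index needs at least three primaries to stay under 50GB per shard.
print(primary_shards_for(120))
```

Note that a restored index keeps the shard count it had when the snapshot was taken, so this calculation applies when the source index is created, or to a later reindex or shrink on the target.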

Summary

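Step	Key points
Environment setup	Shared backups directory + path.repo
Snapshot creation	Specify multiple indices, include_global_state=false
Restore	Use rename to avoid conflicts
Integrity guarantees	Atomic snapshot + checksums + three-level verification
Cutover	Smooth transition through aliases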

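Snapshot + Restore is the officially recommended way to move large volumes of data between clusters with a consistent point-in-time view. It also works across versions, within the documented compatibility limits (a snapshot can generally be restored only to the same or a newer compatible version).

Following the process above, a >100GB multi-index migration is verifiable and can be rolled back, with no data loss as long as the checks pass. Writes that reach the source after the snapshot was taken are not included, so stop writes or take a final incremental snapshot before the cutover.

That concludes method 2 for migrating large volumes of data between clusters; feel free to try it in your own test environment. Mistakes are hard to avoid along the way, so corrections are welcome. Thank you.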