Elasticsearch: _reindex

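_reindex is a genuinely useful tool, especially for developers. On the small scale, stored data often turns out to be unsearchable or badly structured because of a wrong field type, inconsistent letter case in values, the wrong analyzer, or an unsuitable number of shards or replicas. On the large scale, migrating data between clusters relies on the same command. Anyone who has used a database knows the situation: sooner or later a table's structure or the case of its values has to change. The brute-force fix is to create a correct temporary table, write a script that reads the rows from the bad table, transform them in code until they match expectations, insert them into the temporary table, drop the old table, recreate it under the original name, and finally load the temporary table's data back into it. By the end of that sequence you are worn out. The idea carries over to Elasticsearch, but there is no script to write: the _reindex command does the whole job. Enough preamble; those who know, know.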

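Notes

- The source and destination must differ; for example, a data stream cannot be reindexed into itself.
- The _source field must be enabled for the documents in the source index (it is enabled by default).
- _reindex does not copy the source's settings or the templates that match the source, so set up the destination index's mapping before calling _reindex. This is required when action.auto_create_index is false or -.*, because the destination will not be created automatically.
- Configuring the destination index's mapping, primary shard count, and replica count in advance is recommended.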
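
As a minimal sketch of preparing the destination up front, the helper below prints a create-index request body with explicit shard, replica, and mapping settings. The function name dest_index_body, the shard and replica counts, and the keyword fields username and sex are placeholder assumptions, and the typeless mappings format assumes Elasticsearch 7.x or later. The function only prints JSON; it does not contact a cluster.

```shell
# Print a create-index request body for the destination index.
# $1: number of primary shards, $2: number of replicas (placeholder values).
dest_index_body() {
  shards="$1"
  replicas="$2"
  printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d},' "$shards" "$replicas"
  printf '"mappings":{"properties":{"username":{"type":"keyword"},"sex":{"type":"keyword"}}}}\n'
}

# Show the body for 3 primary shards and 1 replica.
dest_index_body 3 1

# To send it to a cluster (assumes a node at localhost:9200):
# dest_index_body 3 1 | curl -X PUT 'http://localhost:9200/dest_index' \
#   -H 'Content-Type: application/json' --data-binary @-
```

Creating the index first means _reindex writes into the mapping you chose rather than one guessed by dynamic mapping.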

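If security and permission policies are configured

If the Elasticsearch cluster has security and permission policies configured, reindexing requires the following:
- If the source is a remote cluster, the remote host must be listed in reindex.remote.whitelist in the elasticsearch.yml file of the node in the current cluster that handles the request.
- Index-level read privileges on the source data stream, index, or index alias.
- Write privileges on the destination data stream, index, or index alias.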
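
The whitelist entry can be added by hand, or with a small helper like the sketch below. The function name allow_remote_reindex and both arguments are hypothetical; the only part taken from Elasticsearch itself is the setting name reindex.remote.whitelist and its host:port value format. Because this is a setting in elasticsearch.yml, the node needs a restart before it takes effect.

```shell
# Append a reindex.remote.whitelist entry to an elasticsearch.yml file,
# unless the setting is already present there.
# $1: path to elasticsearch.yml, $2: remote host as host:port (no http://).
allow_remote_reindex() {
  config_file="$1"
  remote_host="$2"
  if grep -q '^reindex\.remote\.whitelist' "$config_file" 2>/dev/null; then
    echo "reindex.remote.whitelist already set in $config_file; edit it by hand" >&2
    return 1
  fi
  printf 'reindex.remote.whitelist: "%s"\n' "$remote_host" >> "$config_file"
}

# Example (the path is an assumption about your installation):
# allow_remote_reindex /etc/elasticsearch/elasticsearch.yml otherhost:9200
```

The helper refuses to touch a file that already defines the setting, since merging host lists automatically is easy to get wrong.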

Simplest usage

curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "index": "old_index"
    },
    "dest": {
        "index": "new_index"
    }
}'

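Limit the number of documents copied with size (everything is copied if it is omitted)

Since Elasticsearch 7.3 the top-level size field is deprecated in favor of max_docs, so use max_docs on recent versions.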

curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "size": 100,
    "source": {
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index"
    }
}'

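Reindex several indices into one destination

Mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0, so drop the type array below on recent versions.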

curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "index": [
            "source_index_1",
            "source_index_2"
        ],
        "type": [
            "source_type_1",
            "source_type_2"
        ]
    },
    "dest": {
        "index": "dest_index"
    }
}'

Copy only specific fields

curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "index": "source_index_1",
        "_source": [
            "username",
            "sex"
        ]
    },
    "dest": {
        "index": "dest_index"
    }
}'

Using a script (example: the _id value must be uppercase)

curl --location 'http://192.168.5.235:9210/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "script": {
        "source": "String uppercaseId = ctx._id.toUpperCase(); ctx._source.remove(\"id\"); ctx._id = uppercaseId;  ",
        "lang": "painless"
    },
    "source": {
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index"
    }
}'

# To uppercase a value inside _source instead, assign it back to the same field: ctx._source.ENTITY_UUID = ctx._source.ENTITY_UUID.toUpperCase();

Cross-cluster reindex with the remote attribute (including query match and sort)

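# Reindex from a remote cluster uses an on-heap buffer that defaults to a maximum of 100mb.
# With the default batch of 1000 documents, an average document size above roughly 100 KB
# can overflow it and fail the request, so set size in source to lower the number of
# documents transferred per batch.
# sort, query, and size are optional; remove the ones you do not need.
curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "sort": {
            "date": "desc"
        },
        "query": {
            "match": {
                "test": "data"
            }
        },
        "size": 100,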
        "remote": {
            "host": "http://otherhost:9200",
            "username": "username",
            "password": "password"
        },
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index"
    }
}'
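
One rough way to pick the size value for a remote reindex is to divide the buffer by the average document size. The sketch below does that arithmetic; the function name remote_batch_size, the halving safety margin, and the cap at 1000 are my own assumptions rather than anything Elasticsearch prescribes, and the average document size has to be measured or estimated separately.

```shell
# Suggest a source.size value for reindex-from-remote.
# $1: average document size in KB (measure or estimate it yourself)
# $2: buffer size in MB, default 100 (the documented default maximum)
remote_batch_size() {
  avg_doc_kb="$1"
  buffer_mb="${2:-100}"
  if [ "$avg_doc_kb" -le 0 ]; then
    echo "average document size must be positive" >&2
    return 1
  fi
  # Halving the budget is an arbitrary safety margin, not an official rule.
  size=$(( buffer_mb * 1024 / avg_doc_kb / 2 ))
  if [ "$size" -gt 1000 ]; then size=1000; fi   # 1000 is the default batch size
  if [ "$size" -lt 1 ]; then size=1; fi
  echo "$size"
}

remote_batch_size 200   # 200 KB documents: prints 256
```

Treat the result as a starting point and lower it further if the request still fails.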

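When the destination index already contains data and conflicts are possible

# version_type internal (the default): Elasticsearch blindly dumps documents into the destination and overwrites any document with the same type and ID.
# version_type external: the version from the source is preserved; missing documents are created, and documents whose version in the destination is older than in the source are updated.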
curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index",
        "version_type": "internal"
    }
}'

op_type set to create

# Only documents that do not yet exist in the destination index are created. If the same document already exists, a version conflict error is reported.
curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index",
        "op_type": "create"
    }
}'

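Handling the version conflicts caused by op_type create

With conflicts set to proceed, the request no longer aborts on a version conflict; it counts the conflicts and carries on.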

curl --location 'http://localhost:9200/_reindex' \
--header 'Content-Type: application/json' \
--data '{
    "conflicts": "proceed",
    "source": {
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index",
        "op_type": "create"
    }
}'
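
When conflicts is set to proceed, the _reindex response reports how many documents were skipped in its version_conflicts field. The helper below is a small sketch for pulling that number out of a saved response with sed; the function name version_conflicts and the sample response are made up for illustration, and a real JSON tool such as jq would be more robust if it is available.

```shell
# Read a _reindex response on stdin and print its version_conflicts count.
version_conflicts() {
  sed -n 's/.*"version_conflicts"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sample response (invented numbers) with 20 skipped documents: prints 20
printf '%s\n' '{"took":147,"total":120,"created":100,"version_conflicts":20,"failures":[]}' | version_conflicts
```

A non-zero count here is expected with op_type create; it is the number of documents that already existed in the destination.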

Checking reindex progress

curl --location 'http://localhost:9200/_tasks?detailed=true&actions=*reindex'
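
The task status returned by this call contains total, created, updated, and deleted counters, and progress can be estimated by comparing the sum of the last three with total. The function below is a minimal sketch of that arithmetic; the name reindex_progress is hypothetical and the numbers in the example are invented.

```shell
# Estimate reindex progress as a whole-number percentage.
# $1: total, $2: created, $3: updated, $4: deleted (from the task status).
reindex_progress() {
  total="$1"
  created="$2"
  updated="$3"
  deleted="$4"
  if [ "$total" -le 0 ]; then
    echo 0
    return 0
  fi
  echo $(( (created + updated + deleted) * 100 / total ))
}

reindex_progress 1000 200 50 0   # prints 25
```

The task finishes when created, updated, and deleted add up to total.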

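Troubleshooting: slow reindex

At its core, reindex migrates data across indices and across clusters. The causes of slowness, and the matching optimizations, come down to:
1) The batch size may be too small. Tune it in light of heap memory and the thread pools.
2) reindex is implemented on top of scroll, so scroll's parallelization can be used to improve throughput.
3) Copying across indices or clusters is fundamentally a write workload, so write-side tuning also helps.
- Raise the batch size: set size in source to change how many documents each batch carries.
- Sliced parallelism: each scroll request can be split into multiple slice requests that run independently and in parallel, which makes rebuilding or traversing an index with scroll considerably faster. Slices are configured as follows:
  - 1) Set slices to an explicit number, or to auto. With auto, a single index gets one slice per shard; for multiple indices, the number of slices is the smallest shard count among them.
  - 2) Query performance is most efficient when the number of slices equals the number of shards in the index. More slices than shards does not improve efficiency and adds overhead.
  - 3) If that number is large (for example, 500), choose a lower one, because too many slices hurt performance.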

curl --location 'http://localhost:9200/_reindex?slices=5' \
--header 'Content-Type: application/json' \
--data '{
    "source": {
        "index": "source_index"
    },
    "dest": {
        "index": "dest_index"
    }
}'
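
For the write-side tuning mentioned above, one common approach is to relax the destination index's refresh interval and replica count while the reindex runs, then restore them afterwards. The sketch below only prints the index-settings bodies; the function name index_settings_body and the restored values (1s and 1 replica) are assumptions that should match whatever your index used before.

```shell
# Print an index-settings body.
# $1: refresh interval (for example -1 or 1s), $2: number of replicas.
index_settings_body() {
  printf '{"index":{"refresh_interval":"%s","number_of_replicas":%d}}\n' "$1" "$2"
}

# Before the reindex: disable refresh and drop replicas on the destination.
index_settings_body -1 0

# After the reindex: restore the previous values (assumed here).
index_settings_body 1s 1

# To apply either body (assumes a node at localhost:9200):
# index_settings_body -1 0 | curl -X PUT 'http://localhost:9200/dest_index/_settings' \
#   -H 'Content-Type: application/json' --data-binary @-
```

With zero replicas the destination has no redundancy until the settings are restored, so this suits a freshly created index that can be rebuilt from the source.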