Elasticsearch Notes @1

Contents

- Lucene
  - 1. What is Lucene?
  - 2. Lucene's role in Elasticsearch
  - 3. Core features of Lucene
    - (1) Inverted index
    - (2) Tokenization
    - (3) Query parsing
    - (4) Relevance scoring
  - 4. Why does Elasticsearch use Lucene?
  - 5. Differences between Lucene and Elasticsearch
  - 6. Summary
- Shards
  - Column explanations
  - Increasing the shard count
    - 1. Create a new index and reindex
    - 2. Use an index alias
    - 3. Adjust the replica count (replica shards)
    - 4. Use cross-cluster replication (CCR)
    - Summary
- esrally
- Backup

Lucene

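Lucene is the low-level search library at the core of Elasticsearch (ES). It provides the core full-text indexing and search functionality that Elasticsearch builds on. The sections below describe it in more detail.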

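1. What is Lucene?

Lucene is an open-source full-text search library maintained by the Apache Software Foundation. It was originally created by Doug Cutting to give Java applications efficient text indexing and search.

It is not a standalone search engine.
It is a foundational library for building search features, with highly customizable indexing and querying.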

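2. Lucene's role in Elasticsearch

Elasticsearch is a distributed search engine built on Lucene. It wraps Lucene's capabilities and adds an easy-to-use RESTful API and a distributed architecture.

Indexing data: Elasticsearch uses Lucene's inverted index to organize and store data, which makes full-text retrieval efficient.
Executing queries: Elasticsearch translates each request into Lucene queries (Boolean queries, range queries, and so on), and Lucene executes them.
Scoring: Lucene computes each document's relevance to the query with TF-IDF or BM25 (BM25 has been the default since Elasticsearch 5.0).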

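3. Core features of Lucene

(1) Inverted index

Lucene stores data in an inverted index, a data structure designed for fast text search: it maps each term to the documents that contain it.

Example: suppose there are two documents: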

Doc1: "Lucene is a search library"
Doc2: "Elasticsearch uses Lucene"

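The inverted index then stores something like the following (assuming the stop words "is" and "a" are dropped):

Term            -> DocID
"Lucene"        -> Doc1, Doc2
"search"        -> Doc1
"library"       -> Doc1
"Elasticsearch" -> Doc2
"uses"          -> Doc2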
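To make the term-to-document mapping concrete, here is a minimal Python sketch that builds a toy inverted index for the two example documents. It is only an illustration: the function name `build_inverted_index` is made up for this note, terms are lowercased, no stop words are removed, and Lucene's real on-disk structures are far more elaborate.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Split on non-word characters; a real analyzer does much more.
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    "Doc1": "Lucene is a search library",
    "Doc2": "Elasticsearch uses Lucene",
}
index = build_inverted_index(docs)

for term in sorted(index):
    print(f"{term:15} -> {', '.join(index[term])}")
```

Looking up a term is then a single dictionary access (`index["lucene"]`), which is why this layout suits full-text search better than scanning every document.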

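(2) Tokenization

Lucene uses an analyzer (Analyzer) to split input text into smaller units called tokens, which are then indexed.

Example: the input "Lucene is powerful!" may be tokenized as ["lucene", "powerful"] when the analyzer lowercases tokens and removes stop words such as "is".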
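The analysis step can be imitated in a few lines of Python. This is a rough sketch, not Lucene's Analyzer API: the `analyze` function and the tiny `STOP_WORDS` set are assumptions made up for illustration. It shows how the result depends on whether stop words are removed.

```python
import re

# Illustrative stop-word list only; this is not Lucene's built-in list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def analyze(text, stop_words=STOP_WORDS):
    """Tokenize on letters/digits, lowercase, then drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stop_words]


print(analyze("Lucene is powerful!"))                    # stop words removed
print(analyze("Lucene is powerful!", stop_words=set()))  # stop words kept
```

With stop-word removal the example text yields two tokens; with an empty stop-word set, "is" is kept as a third token.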

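(3) Query parsing

Lucene supports a rich query syntax, including Boolean queries (AND/OR/NOT), range queries, and wildcard queries.

(4) Relevance scoring

Lucene scores how well each document matches the query and returns the best matches first. Current versions use BM25 by default; older versions used a TF-IDF-based formula.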
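As a rough illustration of how BM25 ranks documents, the sketch below implements the textbook form of the formula with the commonly used defaults k1 = 1.2 and b = 0.75. The function `bm25_score` is a hypothetical helper written for this note; Lucene's actual implementation differs in details such as length-norm encoding, so the numbers will not match Elasticsearch's `_score` exactly.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document against a query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Longer-than-average documents are penalized through b.
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


corpus = [
    ["lucene", "is", "a", "search", "library"],
    ["elasticsearch", "uses", "lucene"],
]
search_scores = [bm25_score(["search"], doc, corpus) for doc in corpus]
lucene_scores = [bm25_score(["lucene"], doc, corpus) for doc in corpus]
print(search_scores, lucene_scores)
```

Only the first document contains "search", so only it gets a non-zero score for that query. Both documents contain "lucene" once, and the shorter second document scores higher because of length normalization.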

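4. Why does Elasticsearch use Lucene?

High performance: Lucene provides very efficient indexing and search.
Mature and reliable: Lucene has been developed and optimized for many years and is widely regarded as an industry standard.
Flexibility: Lucene offers a rich feature set, including tokenization, query parsing, and scoring.
Open source: Lucene is an open-source Apache Software Foundation project, which fits Elasticsearch's open-source origins.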

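5. Differences between Lucene and Elasticsearch

| Feature | Lucene | Elasticsearch |
|---|---|---|
| Type | Library | Distributed search engine |
| User interface | Java API | RESTful API |
| Functionality | Core search and indexing | Adds distribution, horizontal scaling, and high availability |
| Setup and use | Must be integrated and configured by the developer | Works out of the box with many built-in features |
| Distributed support | None | Native |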

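6. Summary

Lucene is the foundation of Elasticsearch and provides its core search and indexing capabilities. Elasticsearch is more than Lucene, though: it adds a distributed architecture, cluster management, a REST API, and more, which makes search solutions much easier to build and operate.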

Shards

curl -u admin:OLP5HNAT1NU2WQGUJ441M5TP2MPXZBRM "10.10.x.x:xx/_cat/shards?v"

GET _cat/shards/woqu02?v

The following line of output comes from Elasticsearch's _cat/shards API and shows the shard information for one index. Each column is explained below:

filebeat-7.16.2-2024.12.23-000001  0 r STARTED 11533  3.1mb  245.0.2.227 es-510083ba-es-data-0

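Column explanations:

Index name (filebeat-7.16.2-2024.12.23-000001): the index this shard belongs to. The name follows the pattern filebeat-<version>-<date>-<sequence number>; the trailing 000001 is the index's rollover sequence number, not a shard number.
Shard number (0): the ID of the shard within the index. Shard numbering starts at 0, so this row describes a copy of the index's first shard.
Shard type (r): whether the copy is a primary (p) or a replica (r). Here r means a replica, a copy of the primary shard that provides redundancy and helps balance load.
Shard state (STARTED): the current state of the shard. STARTED means the shard is active and has been successfully allocated to a node in the cluster.
Document count (11533): the number of documents stored in this shard, 11533 here.
Disk usage (3.1mb): the disk space used by this shard, 3.1 MB here.
Node IP address (245.0.2.227): the IP address of the Elasticsearch node that holds this shard.
Node name (es-510083ba-es-data-0): the name of the Elasticsearch node that holds this shard, which identifies the node serving it.

This output shows how data is distributed across the Elasticsearch cluster and what state each shard is in.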
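When the output has many rows, it can help to parse it into named fields. The sketch below assumes the default eight columns of `_cat/shards` (index, shard, prirep, state, docs, store, ip, node) and handles only fully assigned rows like the one above; `ShardRow` and `parse_shard_line` are names invented for this example.

```python
from typing import NamedTuple


class ShardRow(NamedTuple):
    index: str
    shard: int
    prirep: str  # "p" for primary, "r" for replica
    state: str
    docs: int
    store: str
    ip: str
    node: str


def parse_shard_line(line):
    """Parse one whitespace-separated data row of _cat/shards output."""
    parts = line.split()
    if len(parts) != 8:
        # Unassigned or relocating shards have a different column count.
        raise ValueError(f"expected 8 columns, got {len(parts)}")
    index, shard, prirep, state, docs, store, ip, node = parts
    return ShardRow(index, int(shard), prirep, state, int(docs), store, ip, node)


row = parse_shard_line(
    "filebeat-7.16.2-2024.12.23-000001  0 r STARTED 11533  3.1mb  "
    "245.0.2.227 es-510083ba-es-data-0"
)
kind = "replica" if row.prirep == "r" else "primary"
print(f"shard {row.shard} ({kind}) on {row.node}: {row.docs} docs")
```

For scripted use, requesting JSON from the cat API is usually more robust than splitting text, but the text form matches what is shown above.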

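Increasing the shard count

In Elasticsearch, increasing the shard count is not done by changing the number of shards of an existing index in place, because the number of primary shards of an index is fixed when the index is created. To get more shards, the usual options are:

1. Create a new index and reindex

To increase the number of shards, create a new index configured with more primary shards, then use reindex to copy the data from the old index into the new one.

Steps:

Create the new index with the required shard count. For example, with 6 primary shards and 1 replica:

PUT /new_index_name
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}

Reindex the data from the old index into the new index: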

POST /_reindex
{
  "source": {
    "index": "old_index_name"
  },
  "dest": {
    "index": "new_index_name"
  }
}

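The same request with the index names used in this test; "conflicts": "proceed" lets the reindex continue past version conflicts instead of aborting: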

POST /_reindex
{
  "source": {
    "index": "nyc_taxis"
  },
  "dest": {
    "index": "woqu01"
  },
  "conflicts": "proceed"
}

GET /nyc_taxis/_search?size=10

GET /woqu02/_search?size=10

GET /nyc_taxis/_mapping

GET /woqu02/_mapping

GET _cat/indices?v

GET /woqu01/_count

The process takes a very long time...

Switch the application over to the new index.
Delete the old index (if it is no longer needed):
```
DELETE /old_index_name
```
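Because reindexing a large index can run for a long time, one option is to start it as a background task with the `wait_for_completion=false` query parameter and poll the returned task ID through the tasks API. The sketch below only builds the request (method, path, JSON body) so it can be sent with whatever HTTP client is at hand; `build_reindex_request` is a hypothetical helper, and the index names are the ones used earlier in these notes.

```python
import json


def build_reindex_request(source, dest, proceed_on_conflicts=False, background=True):
    """Return (method, path, body) for an Elasticsearch _reindex call."""
    wait = "false" if background else "true"
    path = f"/_reindex?wait_for_completion={wait}"
    body = {"source": {"index": source}, "dest": {"index": dest}}
    if proceed_on_conflicts:
        # Continue past version conflicts instead of aborting the reindex.
        body["conflicts"] = "proceed"
    return "POST", path, body


method, path, body = build_reindex_request(
    "nyc_taxis", "woqu01", proceed_on_conflicts=True
)
print(method, path)
print(json.dumps(body, indent=2))
```

When run in the background, the response contains a task ID, and progress can then be checked with `GET _tasks/<task_id>`.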

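2. Use an index alias

If you do not want to change the index name in your application, use an index alias for a seamless switch. An alias lets you switch between indices without touching the index name the application uses.

Steps:

Create a new index with more shards.
Point the alias at the old index, and have the application use the alias.
Use _reindex to copy the data into the new index.
Update the alias to point at the new index.

Example: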

POST /_aliases
{
  "actions": [
    { "add": { "index": "new_index_name", "alias": "index_alias" } },
    { "remove": { "index": "old_index_name", "alias": "index_alias" } }
  ]
}
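The alias switch works because both actions are sent in one `_aliases` request, so the alias never points at zero or two indices from the application's point of view. A small sketch of building that body programmatically follows; `build_alias_swap` is an illustrative helper, and the names are the placeholders from the example above.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from old_index to new_index."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            {"remove": {"index": old_index, "alias": alias}},
        ]
    }


swap = build_alias_swap("index_alias", "old_index_name", "new_index_name")
print("POST /_aliases")
print(json.dumps(swap, indent=2))
```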

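3. Adjust the replica count (replica shards)

If the goal is load balancing or higher availability, adding replica shards is an effective option. Replicas do not increase the number of primary shards, but they improve search throughput and resilience to node failures.

Steps:

Change the number of replicas of the index (2 in this example):

PUT /your_index_name/_settings
{
  "settings": {
    "number_of_replicas": 2
  }
}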
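Before raising the replica count it is worth checking what it implies for the cluster. The arithmetic below is a small sketch with made-up helper names. It relies on the fact that Elasticsearch does not place two copies of the same shard on one node, so all copies can only be allocated if there are at least replicas + 1 data nodes.

```python
def total_shard_copies(primaries, replicas):
    """Each primary shard has `replicas` additional copies."""
    return primaries * (1 + replicas)


def min_nodes_for_full_allocation(replicas):
    """Copies of one shard must sit on different nodes."""
    return replicas + 1


primaries, replicas = 6, 2  # the values used in the examples above
print(total_shard_copies(primaries, replicas), "shard copies in total")
print(min_nodes_for_full_allocation(replicas), "data nodes needed at minimum")
```

With fewer nodes than that, the extra replicas stay unassigned and the cluster health stays yellow.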

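4. Use cross-cluster replication (CCR)

If you need to replicate data across multiple clusters, consider cross-cluster replication (CCR). Note that a follower index keeps the same number of primary shards as its leader, so CCR spreads copies of the data over more clusters rather than adding primary shards to an index.

Summary:

Elasticsearch does not support directly increasing the number of primary shards of an existing index.
You can reindex the data into a new index that has more shards.
If you only need better fault tolerance or load balancing, increase the number of replica shards.

The generally recommended approach is to combine an index alias with reindexing, so that the shard increase does not disrupt existing applications and data.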

esrally

Installation

https://cloud.tencent.com/developer/article/1959723

docker load -i rally.tar

docker run -t -i -v /bpx:/bpx elastic/rally bash

esrally race --track=nyc_taxis --test-mode --pipeline=benchmark-only --target-hosts="http://10.10.x.x:xx" --client-options="basic_auth_user:'admin',basic_auth_password:'OLP5HNAT1NU2WQGUJ441M5TP2MPXZBRM'"

Dataset overview

https://cloud.tencent.com/developer/article/1595636

/rally/.rally/benchmarks/tracks/default/download.sh geonames

tar -xf rally-track-data-geonames.tar -C ~/.rally/benchmarks

esrally race --track=geonames --test-mode --pipeline=benchmark-only --target-hosts="http://10.10.66.231:12219" --client-options="basic_auth_user:'admin',basic_auth_password:'OLP5HNAT1NU2WQGUJ441M5TP2MPXZBRM'" --report-format=csv --report-file=~/result.csv


    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [e69491a8-20d4-4ca2-9503-bbd37817f2ba]
[INFO] Downloading track data (30.6 kB total size)                                [100.0%]
[INFO] Decompressing track data from [/rally/.rally/benchmarks/data/nyc_taxis/documents-1k.json.bz2] to [/rally/.rally/benchmarks/data/nyc_taxis/documents-1k.json] ... [OK]
[INFO] Preparing file offset table for [/rally/.rally/benchmarks/data/nyc_taxis/documents-1k.json] ... [OK]
[INFO] Racing on track [nyc_taxis], challenge [append-no-conflicts] and car ['external'] with version [7.16.2].

[WARNING] merges_total_time is 58649 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 15700 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 79088 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running check-cluster-health                                                   [100% done]
Running index                                                                  [100% done]
Running refresh-after-index                                                    [100% done]
Running force-merge                                                            [100% done]
Running refresh-after-force-merge                                              [100% done]
Running wait-until-merges-finish                                               [100% done]
Running default                                                                [100% done]
Running range                                                                  [100% done]
Running distance_amount_agg                                                    [100% done]
Running autohisto_agg                                                          [100% done]
Running date_histogram_agg                                                     [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |                Task |          Value |   Unit |
|---------------------------------------------------------------:|--------------------:|---------------:|-------:|
|                     Cumulative indexing time of primary shards |                     |     0.265783   |    min |
|             Min cumulative indexing time across primary shards |                     |     0          |    min |
|          Median cumulative indexing time across primary shards |                     |     0          |    min |
|             Max cumulative indexing time across primary shards |                     |     0.136317   |    min |
|            Cumulative indexing throttle time of primary shards |                     |     0          |    min |
|    Min cumulative indexing throttle time across primary shards |                     |     0          |    min |
| Median cumulative indexing throttle time across primary shards |                     |     0          |    min |
|    Max cumulative indexing throttle time across primary shards |                     |     0          |    min |
|                        Cumulative merge time of primary shards |                     |     0.977483   |    min |
|                       Cumulative merge count of primary shards |                     |  1030          |        |
|                Min cumulative merge time across primary shards |                     |     0          |    min |
|             Median cumulative merge time across primary shards |                     |     0          |    min |
|                Max cumulative merge time across primary shards |                     |     0.652033   |    min |
|               Cumulative merge throttle time of primary shards |                     |     0          |    min |
|       Min cumulative merge throttle time across primary shards |                     |     0          |    min |
|    Median cumulative merge throttle time across primary shards |                     |     0          |    min |
|       Max cumulative merge throttle time across primary shards |                     |     0          |    min |
|                      Cumulative refresh time of primary shards |                     |     1.3198     |    min |
|                     Cumulative refresh count of primary shards |                     |  9938          |        |
|              Min cumulative refresh time across primary shards |                     |     0          |    min |
|           Median cumulative refresh time across primary shards |                     |     0          |    min |
|              Max cumulative refresh time across primary shards |                     |     0.98985    |    min |
|                        Cumulative flush time of primary shards |                     |     0.03155    |    min |
|                       Cumulative flush count of primary shards |                     |    21          |        |
|                Min cumulative flush time across primary shards |                     |     0          |    min |
|             Median cumulative flush time across primary shards |                     |     0          |    min |
|                Max cumulative flush time across primary shards |                     |     0.0176833  |    min |
|                                        Total Young Gen GC time |                     |     0.009      |      s |
|                                       Total Young Gen GC count |                     |     1          |        |
|                                          Total Old Gen GC time |                     |     0          |      s |
|                                         Total Old Gen GC count |                     |     0          |        |
|                                                   Dataset size |                     |     0.083899   |     GB |
|                                                     Store size |                     |     0.083899   |     GB |
|                                                  Translog size |                     |     0.0555213  |     GB |
|                                         Heap used for segments |                     |     0.161285   |     MB |
|                                       Heap used for doc values |                     |     0.0156403  |     MB |
|                                            Heap used for terms |                     |     0.124207   |     MB |
|                                            Heap used for norms |                     |     0.00256348 |     MB |
|                                           Heap used for points |                     |     0          |     MB |
|                                    Heap used for stored fields |                     |     0.0188751  |     MB |
|                                                  Segment count |                     |    40          |        |
|                                    Total Ingest Pipeline count |                     |     6          |        |
|                                     Total Ingest Pipeline time |                     |     0.002      |      s |
|                                   Total Ingest Pipeline failed |                     |     0          |        |
|                                                 Min Throughput |               index | 16438.7        | docs/s |
|                                                Mean Throughput |               index | 16438.7        | docs/s |
|                                              Median Throughput |               index | 16438.7        | docs/s |
|                                                 Max Throughput |               index | 16438.7        | docs/s |
|                                        50th percentile latency |               index |    72.4146     |     ms |
|                                       100th percentile latency |               index |    94.6473     |     ms |
|                                   50th percentile service time |               index |    72.4146     |     ms |
|                                  100th percentile service time |               index |    94.6473     |     ms |
|                                                     error rate |               index |     0          |      % |
|                                                 Min Throughput |             default |   133.61       |  ops/s |
|                                                Mean Throughput |             default |   133.61       |  ops/s |
|                                              Median Throughput |             default |   133.61       |  ops/s |
|                                                 Max Throughput |             default |   133.61       |  ops/s |
|                                       100th percentile latency |             default |    12.4156     |     ms |
|                                  100th percentile service time |             default |     4.49154    |     ms |
|                                                     error rate |             default |     0          |      % |
|                                                 Min Throughput |               range |   155.02       |  ops/s |
|                                                Mean Throughput |               range |   155.02       |  ops/s |
|                                              Median Throughput |               range |   155.02       |  ops/s |
|                                                 Max Throughput |               range |   155.02       |  ops/s |
|                                       100th percentile latency |               range |     3.1497     |     ms |
|                                  100th percentile service time |               range |     3.1497     |     ms |
|                                                     error rate |               range |     0          |      % |
|                                                 Min Throughput | distance_amount_agg |    40.07       |  ops/s |
|                                                Mean Throughput | distance_amount_agg |    40.07       |  ops/s |
|                                              Median Throughput | distance_amount_agg |    40.07       |  ops/s |
|                                                 Max Throughput | distance_amount_agg |    40.07       |  ops/s |
|                                       100th percentile latency | distance_amount_agg |     6.27117    |     ms |
|                                  100th percentile service time | distance_amount_agg |     6.27117    |     ms |
|                                                     error rate | distance_amount_agg |     0          |      % |
|                                                 Min Throughput |       autohisto_agg |    37.05       |  ops/s |
|                                                Mean Throughput |       autohisto_agg |    37.05       |  ops/s |
|                                              Median Throughput |       autohisto_agg |    37.05       |  ops/s |
|                                                 Max Throughput |       autohisto_agg |    37.05       |  ops/s |
|                                       100th percentile latency |       autohisto_agg |     5.20097    |     ms |
|                                  100th percentile service time |       autohisto_agg |     5.20097    |     ms |
|                                                     error rate |       autohisto_agg |     0          |      % |
|                                                 Min Throughput |  date_histogram_agg |    73.79       |  ops/s |
|                                                Mean Throughput |  date_histogram_agg |    73.79       |  ops/s |
|                                              Median Throughput |  date_histogram_agg |    73.79       |  ops/s |
|                                                 Max Throughput |  date_histogram_agg |    73.79       |  ops/s |
|                                       100th percentile latency |  date_histogram_agg |     4.10032    |     ms |
|                                  100th percentile service time |  date_histogram_agg |     4.10032    |     ms |
|                                                     error rate |  date_histogram_agg |     0          |      % |


--------------------------------
[INFO] SUCCESS (took 18 seconds)
--------------------------------
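The summary table is easier to compare across runs once it is parsed. The sketch below reads the pipe-delimited rows printed above into dictionaries and pulls out the mean throughput per task. `parse_report` and `mean_throughput` are helper names invented here, and the sample rows are copied from the report above with the padding trimmed.

```python
def parse_report(text):
    """Parse Rally's pipe-delimited summary rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if len(cells) != 4:
            continue
        metric, task, value, unit = cells
        # Skip the header row and the |---:| separator row.
        if metric == "Metric" or not metric.strip("-:"):
            continue
        rows.append(
            {"metric": metric, "task": task, "value": float(value), "unit": unit}
        )
    return rows


def mean_throughput(rows):
    """Map task name to its 'Mean Throughput' value."""
    return {r["task"]: r["value"] for r in rows if r["metric"] == "Mean Throughput"}


sample = """
| Metric | Task | Value | Unit |
|-------:|-----:|------:|-----:|
| Segment count | | 40 | |
| Mean Throughput | index | 16438.7 | docs/s |
| Mean Throughput | range | 155.02 | ops/s |
| 100th percentile latency | range | 3.1497 | ms |
"""
rows = parse_report(sample)
print(mean_throughput(rows))
```

The same function can be pointed at a saved copy of the full console output, since lines that are not table rows are ignored.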

Interpreting the results

https://xie.infoq.cn/article/75db26198fd344c4db563f43f

GET /_cat/indices/nyc_taxis?v

GET /nyc_taxis/_search?pretty

GET nyc_taxis/_mapping?pretty

Backup

https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html