Learn Tech, Learn English: The Elasticsearch Document ID Generation Algorithm

Auto-Generated Document IDs in Elasticsearch

When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID for that document. This ID is a Base64-encoded UUID, which is composed of several parts, each serving a specific purpose.

The ID generation process is optimized for both indexing speed and storage efficiency. The code responsible for this process can be found in Elasticsearch's TimeBasedUUIDGenerator class on GitHub.

elasticsearch/server/src/main/java/org/elasticsearch/common/TimeBasedUUIDGenerator.java at...


How are the IDs Generated?

The first two bytes of the ID are derived from a sequence ID, which is incremented for each document that's indexed. Specifically, the first (lowest) and third bytes of the sequence ID are used. Because these leading bytes vary a lot from one ID to the next, a sort can usually tell two IDs apart after comparing only the first few bytes, which helps with indexing speed.

The next four bytes are derived from the current timestamp. These bytes change less frequently, which helps with storage efficiency because it makes the IDs compress well. The timestamp is shifted by different amounts to generate these four bytes, which means they change at different rates.

The next six bytes are the MAC address of the machine where Elasticsearch is running. This helps ensure the uniqueness of the IDs across different machines.

The final three bytes are the remaining bytes of the timestamp and sequence ID. These bytes are likely not to be compressed at all.

The resulting byte array is then Base64-encoded to create the final ID. The Base64 encoding is URL-safe and does not include padding, which makes the IDs safe to use in URLs and efficient to store.
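The byte layout described above can be sketched in Python. This is an illustrative sketch, not the Elasticsearch implementation: the function name `build_id`, the exact shift amounts for the four timestamp bytes, and the order of the final three bytes are assumptions chosen to be consistent with the description, and the real generator may differ between versions.

```python
import base64


def build_id(sequence_id: int, timestamp_ms: int, mac: bytes) -> str:
    """Assemble a 15-byte ID in the order described above, then Base64-encode it.

    Hypothetical helper for illustration only; shift amounts and the order of
    the last three bytes are assumptions, not taken from the article.
    """
    if len(mac) != 6:
        raise ValueError("expected a 6-byte MAC address")

    head = bytes([
        sequence_id & 0xFF,            # first byte of the sequence ID (changes fastest)
        (sequence_id >> 16) & 0xFF,    # third byte of the sequence ID
        (timestamp_ms >> 16) & 0xFF,   # four timestamp bytes, each shifted by a
        (timestamp_ms >> 24) & 0xFF,   # different amount, so each one changes
        (timestamp_ms >> 32) & 0xFF,   # at a different rate
        (timestamp_ms >> 40) & 0xFF,
    ])
    tail = bytes([
        (timestamp_ms >> 8) & 0xFF,    # remaining timestamp and sequence bytes
        (sequence_id >> 8) & 0xFF,
        timestamp_ms & 0xFF,
    ])
    raw = head + mac + tail            # 2 + 4 + 6 + 3 = 15 bytes

    # URL-safe Base64 without padding; 15 bytes always encode to 20 characters.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # Made-up inputs: sequence 1, an arbitrary millisecond timestamp, a zero MAC.
    print(build_id(1, 1_700_000_000_000, bytes(6)))
```

Because 15 is a multiple of 3, the Base64 output never needs padding in the first place, so stripping `=` is only a safeguard.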

Probability of Collision

The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. Although these IDs are often called UUIDs (Universally Unique Identifiers), the layout described above is not purely random and adds up to 15 bytes (120 bits) rather than the 128 bits of a standard UUID. Uniqueness comes from combining a timestamp, a per-document sequence ID, and the machine's MAC address, so two IDs can only collide if all three components coincide.

Example of an Auto-Generated ID

Let's consider an example auto-generated ID: "5PMM3nYBgTGA2v2S6qve". This ID is a Base64-encoded UUID. The first two bytes are derived from a sequence ID, the next four bytes are derived from the current timestamp, the next six bytes are the MAC address of the machine where Elasticsearch is running, and the final three bytes are the remaining bytes of the timestamp and sequence ID.
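To make the breakdown concrete, the example ID can be decoded back into its parts. The following is a minimal sketch under stated assumptions: `decode_id` is a hypothetical helper, and the positions of the individual timestamp and sequence bytes (in particular the order of the last three bytes) are assumed rather than given in the text above.

```python
import base64
from datetime import datetime, timezone


def decode_id(doc_id: str) -> dict:
    """Split an auto-generated ID into sequence, timestamp, and MAC parts.

    Hypothetical helper; byte positions are assumptions for illustration.
    """
    # Re-add padding if needed so the standard decoder accepts the string.
    raw = base64.urlsafe_b64decode(doc_id + "=" * (-len(doc_id) % 4))
    if len(raw) != 15:
        raise ValueError("expected a 15-byte ID")

    # Bytes 0-1: first and third sequence bytes; byte 13: second sequence byte.
    sequence = raw[0] | (raw[13] << 8) | (raw[1] << 16)

    # Bytes 2-5: upper timestamp bytes; bytes 12 and 14: the two lowest bytes.
    timestamp_ms = (
        raw[14]
        | (raw[12] << 8)
        | (raw[2] << 16)
        | (raw[3] << 24)
        | (raw[4] << 32)
        | (raw[5] << 40)
    )

    return {
        "sequence": sequence,
        "timestamp_ms": timestamp_ms,
        "time_utc": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "mac": raw[6:12].hex(),  # bytes 6-11: the MAC address portion
    }


fields = decode_id("5PMM3nYBgTGA2v2S6qve")
print(fields)
```

Under these assumptions the timestamp part of the example ID decodes to a date in January 2021, which is a plausible creation time and suggests the assumed byte order is at least consistent with this ID.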

Q&A

Q: Are auto-generated IDs unique across all indices in a cluster?

A: Elasticsearch only guarantees that an ID is unique within an index, not across all indices in a cluster. A document is identified by its index together with its ID, so if two documents in two different indices have the same ID, they are considered two different documents.

Q: What is the probability of collision in auto-generated IDs?

A: The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. Each auto-generated ID combines a timestamp, a sequence ID, and the machine's MAC address, so a collision would require all of these components to coincide.

To give you an idea of how low collision odds are even for purely random identifiers: the number of random version 4 UUIDs that need to be generated in order to have a 50% probability of at least one collision is about 2.71 quintillion (2.71 × 10¹⁸). Note that this is only a reference point, since the time-based IDs described above are structured rather than random. The number is so large that even if you were generating 1 billion UUIDs per second, it would take you over 85 years to generate this many.
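The 2.71 quintillion figure can be reproduced with the standard birthday approximation, n ≈ sqrt(2 · 2^b · ln(1 / (1 − p))), where b is the number of random bits and p the target collision probability. The sketch below assumes b = 122, since a version 4 UUID fixes 6 of its 128 bits for version and variant; the function name is made up for illustration.

```python
import math


def ids_for_collision_probability(random_bits: int, p: float = 0.5) -> float:
    """Approximate how many random IDs give collision probability p.

    Uses the birthday approximation n ~ sqrt(2 * N * ln(1 / (1 - p))),
    where N = 2 ** random_bits is the number of possible IDs.
    """
    return math.sqrt(2 * (2 ** random_bits) * math.log(1 / (1 - p)))


n = ids_for_collision_probability(122)        # version 4 UUID: 122 random bits
seconds = n / 1e9                             # at 1 billion IDs per second
years = seconds / (365.25 * 24 * 3600)
print(f"{n:.3e} IDs, roughly {years:.0f} years at 1e9 IDs per second")
```

This calculation applies to purely random identifiers only; it says nothing about structured IDs built from a timestamp, a sequence ID, and a MAC address.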

Conclusion

Elasticsearch's approach to ID generation is a trade-off between indexing speed, storage efficiency, and lookup speed. It's optimized for append-only workloads, where documents are continually being added to the index and rarely updated or deleted.

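  1. When you index a document without specifying an ID, Elasticsearch automatically generates a unique one. This ID is a Base64-encoded UUID composed of several parts, each serving a specific purpose.

  2. The ID generation process is optimized for indexing speed and storage efficiency. The relevant code can be found in Elasticsearch's TimeBasedUUIDGenerator class.

  3. The probability of generating a duplicate ID is almost negligible, because each ID combines a timestamp, a sequence ID, and the machine's MAC address, so a collision requires all of them to coincide.

  4. Auto-generated IDs are unique within an index, but uniqueness is not guaranteed across all indices in a cluster. If two documents in different indices have the same auto-generated ID, they are treated as two different documents.

  5. The example auto-generated ID "5PMM3nYBgTGA2v2S6qve" is a Base64-encoded UUID: the first two bytes are derived from the sequence ID, the next four bytes from the current timestamp, the next six bytes are the MAC address of the machine running Elasticsearch, and the final three bytes are the remaining bytes of the timestamp and sequence ID.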
