Learn Tech, Learn English: Elasticsearch Document ID Generation Algorithm

Auto-Generated Document IDs in Elasticsearch

When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID for that document. This ID is a Base64-encoded UUID, which is composed of several parts, each serving a specific purpose.

The ID generation process is optimized for both indexing speed and storage efficiency. The code responsible for this process can be found in Elasticsearch's TimeBasedUUIDGenerator class on GitHub.

elasticsearch/server/src/main/java/org/elasticsearch/common/TimeBasedUUIDGenerator.java at...


How are the IDs Generated?

The first two bytes of the ID are derived from a sequence ID, which is incremented for each document that's indexed. Of the sequence ID's three bytes, the first (lowest) and the third are used here. Placing bytes that vary between documents at the front of the ID helps indexing speed: two IDs usually differ within their first bytes, so they can be compared and sorted quickly.

The next four bytes are derived from the current timestamp. Each one is obtained by shifting the timestamp right by a different amount, so the four bytes change at different, progressively slower rates. Because many consecutive IDs share the same values in these positions, the IDs compress well, which helps storage efficiency.
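To make the different rates concrete, the sketch below computes how often a byte taken from a millisecond timestamp changes for a given right shift. The shift amounts 16, 24, 32, and 40 are an assumption based on a reading of the class, not something the text above states, and `change_period_ms` is a hypothetical helper.

```python
from datetime import timedelta

# Assumed shift amounts for the four timestamp bytes (not stated in the text).
ASSUMED_SHIFTS = (16, 24, 32, 40)

def change_period_ms(shift: int) -> int:
    """Milliseconds between changes of the byte (timestamp_ms >> shift) & 0xFF."""
    return 1 << shift

for shift in ASSUMED_SHIFTS:
    period = timedelta(milliseconds=change_period_ms(shift))
    print(f"timestamp >> {shift}: changes about every {period}")
```

Under these assumptions the four bytes change roughly every 65 seconds, 4.7 hours, 50 days, and 35 years, which is why long runs of IDs share them.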

The next six bytes are the MAC address of the machine where Elasticsearch is running. This helps ensure the uniqueness of the IDs across different machines.

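The final three bytes are what is left over: the two remaining (lowest) bytes of the timestamp and the remaining (second) byte of the sequence ID. These bytes differ between almost any two IDs, so they are unlikely to be compressed at all.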

The resulting byte array is then Base64-encoded to create the final ID. The Base64 encoding is URL-safe and does not include padding, which makes the IDs safe to use in URLs and efficient to store.
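The layout described above can be sketched in a few lines of Python. This is an illustration, not the Elasticsearch implementation: `build_id` is a hypothetical function, the exact shift amounts and the order of the last three bytes are assumptions, and six random bytes stand in for the MAC address.

```python
import base64
import os
import time

def build_id(sequence_id: int, timestamp_ms: int, mac: bytes) -> str:
    """Assemble a 15-byte ID in the order described above and Base64-encode it."""
    if len(mac) != 6:
        raise ValueError("mac must be exactly 6 bytes")
    seq = sequence_id & 0xFFFFFF        # assumed: 3-byte sequence ID
    ts = timestamp_ms & 0xFFFFFFFFFFFF  # assumed: 6-byte millisecond timestamp
    head = bytes([
        seq & 0xFF,          # first (lowest) byte of the sequence ID
        (seq >> 16) & 0xFF,  # third byte of the sequence ID
        (ts >> 16) & 0xFF,   # four slower-changing timestamp bytes
        (ts >> 24) & 0xFF,
        (ts >> 32) & 0xFF,
        (ts >> 40) & 0xFF,
    ])
    tail = bytes([
        (ts >> 8) & 0xFF,    # remaining timestamp byte
        (seq >> 8) & 0xFF,   # remaining sequence byte
        ts & 0xFF,           # remaining (lowest) timestamp byte
    ])
    raw = head + mac + tail
    # URL-safe Base64 alphabet with any padding stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

# os.urandom(6) is only a placeholder for the machine's MAC address.
example = build_id(sequence_id=1, timestamp_ms=int(time.time() * 1000), mac=os.urandom(6))
print(example, len(example))
```

Fifteen bytes encode to exactly 20 Base64 characters, so no padding is ever needed for IDs of this shape.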

Probability of Collision

The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. The IDs described above are not purely random values: each one combines a sequence ID, a timestamp, and a MAC address into 15 bytes (120 bits). Two IDs can only collide if all three parts coincide, that is, if the same machine component, the same timestamp, and the same sequence value occur together more than once.

Example of an Auto-Generated ID

Let's consider an example auto-generated ID: "5PMM3nYBgTGA2v2S6qve". Its 20 Base64 characters decode to 15 bytes. The first two bytes are derived from a sequence ID, the next four bytes are derived from the current timestamp, the next six bytes are the MAC address of the machine where Elasticsearch is running, and the final three bytes are the remaining bytes of the timestamp and sequence ID.
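The example ID can be taken apart again with a short script. This is a sketch under the same layout assumptions as the description above (in particular the order of the last three bytes and the shift amounts for the timestamp bytes); `decode_id` is a hypothetical helper, not an Elasticsearch API.

```python
import base64
from datetime import datetime, timezone

def decode_id(doc_id: str) -> dict:
    """Split an auto-generated ID into its parts, under the assumed layout."""
    padded = doc_id + "=" * (-len(doc_id) % 4)  # restore stripped padding, if any
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) != 15:
        raise ValueError("expected an ID that decodes to 15 bytes")
    # Assumed byte order: seq0, seq2, ts2, ts3, ts4, ts5, mac[0..5], ts1, seq1, ts0
    sequence = raw[0] | (raw[13] << 8) | (raw[1] << 16)
    timestamp_ms = (raw[14] | (raw[12] << 8) | (raw[2] << 16)
                    | (raw[3] << 24) | (raw[4] << 32) | (raw[5] << 40))
    return {
        "sequence": sequence,
        "timestamp_ms": timestamp_ms,
        "timestamp_utc": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "mac_field": raw[6:12].hex(":"),
    }

parts = decode_id("5PMM3nYBgTGA2v2S6qve")
for name, value in parts.items():
    print(f"{name}: {value}")
```

With this layout the timestamp field of the example decodes to a date in January 2021, which is a reasonable sanity check that the assumed byte order is right.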

Q&A

Q: Are auto-generated IDs unique across all indices in a cluster?

A: Auto-generated IDs are unique within an index, but they are not guaranteed to be globally unique across all indices in a cluster. A document is identified by its index together with its ID, so two documents with the same auto-generated ID in two different indices are treated as two different documents.

Q: What is the probability of collision in auto-generated IDs?

A: The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. Each auto-generated ID combines a sequence ID, a timestamp, and a MAC address, so a duplicate would require all three parts to coincide.

To give you an idea of how low collision rates are even for purely random identifiers: the number of random version 4 UUIDs that need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion (2.71 × 10¹⁸). This number is so large that even if you were generating 1 billion UUIDs per second, it would take you over 85 years to generate this many UUIDs. Elasticsearch's time-based IDs do not rely on randomness in this way, so the figure serves only as a point of comparison.
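The 2.71 quintillion figure for random version 4 UUIDs follows from the standard birthday approximation, which the sketch below evaluates. It assumes 122 random bits per UUID (128 bits minus the 6 fixed version and variant bits); `draws_for_collision` is a hypothetical helper.

```python
import math

RANDOM_BITS = 122  # random bits in a version 4 UUID (128 minus 6 fixed bits)

def draws_for_collision(bits: int, probability: float = 0.5) -> float:
    """Approximate number of random values needed for the given collision probability."""
    # Birthday approximation: n ~ sqrt(2 * N * ln(1 / (1 - p))), with N = 2**bits
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - probability)))

n = draws_for_collision(RANDOM_BITS)
years = n / 1e9 / (365.25 * 24 * 3600)  # at one billion UUIDs per second
print(f"{n:.3g} UUIDs, about {years:.0f} years at 1e9 per second")
```

The result comes out close to 2.71 × 10¹⁸, and dividing by one billion per second gives a little over 85 years.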

Conclusion

Elasticsearch's approach to ID generation is a trade-off between indexing speed, storage efficiency, and lookup speed. It's optimized for append-only workloads, where documents are continually being added to the index and rarely updated or deleted.

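  1. When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID. This ID is a Base64-encoded UUID made up of several parts, each serving a specific purpose.

  2. The ID generation process is optimized for indexing speed and storage efficiency. The relevant code can be found in Elasticsearch's TimeBasedUUIDGenerator class.

  3. The probability of generating a duplicate ID is almost negligible, because each ID combines a sequence ID, a timestamp, and a MAC address, all of which would have to coincide for a collision.

  4. Auto-generated IDs are unique within an index, but they are not globally unique across all indices in a cluster. If two documents in different indices have the same auto-generated ID, they are treated as two different documents.

  5. The example auto-generated ID "5PMM3nYBgTGA2v2S6qve" is a Base64-encoded UUID: the first two bytes are derived from the sequence ID, the next four bytes from the current timestamp, the next six bytes are the MAC address of the machine running Elasticsearch, and the final three bytes are the remaining bytes of the timestamp and sequence ID.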
