Learn Tech, Learn English: The Elasticsearch Document ID Generation Algorithm

Auto-Generated Document IDs in Elasticsearch

When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID for that document. This ID is a Base64-encoded UUID, which is composed of several parts, each serving a specific purpose.

The ID generation process is optimized for both indexing speed and storage efficiency. The code responsible for this process can be found in Elasticsearch's TimeBasedUUIDGenerator class on GitHub.

elasticsearch/server/src/main/java/org/elasticsearch/common/TimeBasedUUIDGenerator.java at...


How are the IDs Generated?

The first two bytes of the ID are derived from a sequence ID, which is incremented for each document that's indexed. Specifically, the first (lowest) and third bytes of the sequence ID are used. Because these bytes change frequently, IDs tend to differ within their first bytes, which helps indexing speed: sorting can usually decide the order of two IDs after comparing only a byte or two.

The next four bytes are derived from the current timestamp. These bytes change less frequently, so IDs generated close together share them, which helps storage efficiency because shared bytes compress well. The timestamp is shifted by a different amount to produce each of the four bytes, which means they change at different rates.
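To make the "different rates" concrete, here is a small sketch that computes how long each of the four timestamp bytes stays constant. The shift amounts (16, 24, 32, and 40 bits) and the millisecond resolution of the timestamp are assumptions based on one reading of the TimeBasedUUIDGenerator source; they may differ between Elasticsearch versions.

```python
# Assumed bit shifts applied to a millisecond timestamp (not stated in this article).
ASSUMED_SHIFTS = (16, 24, 32, 40)


def change_interval_seconds(shift: int) -> float:
    """Seconds that pass before the byte (timestamp_ms >> shift) & 0xFF changes."""
    return (1 << shift) / 1000.0


for shift in ASSUMED_SHIFTS:
    seconds = change_interval_seconds(shift)
    print(f"timestamp >> {shift}: constant for about {seconds:,.0f} s "
          f"({seconds / 86400:.2f} days)")
```

Under these assumptions the four bytes change roughly every 65 seconds, 4.7 hours, 50 days, and 35 years, so documents indexed around the same time share most of them.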

The next six bytes are the MAC address of the machine where Elasticsearch is running. This helps ensure the uniqueness of the IDs across different machines.

The final three bytes are the remaining bytes of the timestamp and sequence ID. These bytes change quickly from one ID to the next and will likely not compress at all.

The resulting byte array is then Base64-encoded to create the final ID. The Base64 encoding is URL-safe and does not include padding, which makes the IDs safe to use in URLs and efficient to store.
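The layout described above can be sketched in a few lines of Python. This is an illustration, not the Elasticsearch implementation: the function name `time_based_id` is hypothetical, and the exact shift amounts are assumptions based on one reading of the TimeBasedUUIDGenerator source. Only the byte order (two sequence bytes, four timestamp bytes, six MAC bytes, three remaining bytes) comes from the description in this article.

```python
import base64


def time_based_id(sequence_id: int, timestamp_ms: int, mac: bytes) -> str:
    """Assemble a 15-byte ID in the order described above and Base64-encode it."""
    if len(mac) != 6:
        raise ValueError("expected a 6-byte MAC address")
    head = bytes([
        sequence_id & 0xFF,            # first byte of the sequence ID
        (sequence_id >> 16) & 0xFF,    # third byte of the sequence ID
        (timestamp_ms >> 16) & 0xFF,   # timestamp bytes (assumed shifts),
        (timestamp_ms >> 24) & 0xFF,   # each changing more slowly
        (timestamp_ms >> 32) & 0xFF,   # than the one before it
        (timestamp_ms >> 40) & 0xFF,
    ])
    tail = bytes([
        (timestamp_ms >> 8) & 0xFF,    # remaining timestamp byte
        (sequence_id >> 8) & 0xFF,     # remaining sequence byte
        timestamp_ms & 0xFF,           # lowest timestamp byte
    ])
    raw = head + mac + tail
    # URL-safe Base64 alphabet with padding removed.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
```

For example, `time_based_id(1, 1_700_000_000_000, bytes(6))` returns a 20-character string, because 15 bytes encode to exactly 20 Base64 characters with no padding needed.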

Probability of Collision

The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. Elasticsearch calls the auto-generated ID a UUID (Universally Unique Identifier), but unlike a standard 128-bit random UUID, the ID described above is 15 bytes long and gets its uniqueness from its structure rather than from randomness alone: two IDs can only coincide if they share the same MAC address bytes, the same timestamp, and the same sequence ID bytes.

Example of an Auto-Generated ID

Let's consider an example auto-generated ID: "5PMM3nYBgTGA2v2S6qve". Its 20 Base64 characters encode 15 bytes. The first two bytes are derived from a sequence ID, the next four bytes are derived from the current timestamp, the next six bytes are the MAC address of the machine where Elasticsearch is running, and the final three bytes are the remaining bytes of the timestamp and sequence ID.
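As a rough check of this layout, the example ID can be decoded and split back into its parts. The helper `decode_id` below is hypothetical, and it assumes the byte positions from one reading of the TimeBasedUUIDGenerator source (a millisecond timestamp spread over bytes 2 to 5, 12, and 14, and the sequence ID over bytes 0, 1, and 13). IDs produced by other Elasticsearch versions may not follow this layout.

```python
import base64
from datetime import datetime, timezone


def decode_id(doc_id: str) -> dict:
    """Split an auto-generated ID into sequence, timestamp, and MAC parts."""
    padded = doc_id + "=" * (-len(doc_id) % 4)  # restore stripped padding, if any
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) != 15:
        raise ValueError("expected an ID that decodes to 15 bytes")
    # Reassemble the 24-bit sequence value from its three scattered bytes.
    sequence = raw[0] | (raw[13] << 8) | (raw[1] << 16)
    # Reassemble the 48-bit millisecond timestamp from its six scattered bytes.
    timestamp_ms = (raw[14] | (raw[12] << 8) | (raw[2] << 16)
                    | (raw[3] << 24) | (raw[4] << 32) | (raw[5] << 40))
    return {
        "sequence": sequence,
        "timestamp_ms": timestamp_ms,
        "time_utc": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "mac": ":".join(f"{b:02x}" for b in raw[6:12]),
    }


parts = decode_id("5PMM3nYBgTGA2v2S6qve")
print(parts)
```

Under these assumptions, the example ID decodes to a timestamp in January 2021, which is a plausible creation time and suggests the assumed layout matches this ID.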

Q&A

Q: Are auto-generated IDs unique across all indices in a cluster?

A: An auto-generated ID identifies a document only within its index; uniqueness is not enforced across all indices in a cluster. If two documents in two different indices have the same ID, they are treated as two different documents.

Q: What is the probability of collision in auto-generated IDs?

A: The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. Each auto-generated ID combines a sequence ID, the current timestamp, and the machine's MAC address, so a collision would require all three to match.

To give you an idea of how low such probabilities get, consider random version 4 UUIDs as a reference point (Elasticsearch's auto-generated IDs are time-based rather than purely random, so the figure is an illustration, not an exact model). The number of version 4 UUIDs that need to be generated in order to have a 50% probability of at least one collision is about 2.71 quintillion (2.71 × 10^18). This number is so large that even if you were generating 1 billion UUIDs per second, it would take you over 85 years to generate this many.
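The 2.71 quintillion figure can be reproduced with the standard birthday-problem approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible values. The sketch below assumes 122 random bits, which is what a version 4 UUID has once its fixed version and variant bits are excluded; the function name is hypothetical.

```python
import math


def ids_for_collision_probability(random_bits: int, p: float) -> float:
    """Approximate count of random IDs needed to reach collision probability p."""
    space = 2 ** random_bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


n = ids_for_collision_probability(122, 0.5)
seconds = n / 1e9                        # at 1 billion IDs per second
years = seconds / (365.25 * 24 * 3600)   # convert seconds to years
print(f"{n:.3e} IDs, about {years:.0f} years at 1e9 IDs per second")
```

The result is roughly 2.71 × 10^18 IDs and about 86 years, consistent with the numbers quoted above.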

Conclusion

Elasticsearch's approach to ID generation is a trade-off between indexing speed, storage efficiency, and lookup speed. It's optimized for append-only workloads, where documents are continually being added to the index and rarely updated or deleted.

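When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID. This ID is a Base64-encoded UUID composed of several parts, each serving a specific purpose.
The ID generation process is optimized for indexing speed and storage efficiency. The relevant code can be found in Elasticsearch's TimeBasedUUIDGenerator class.
The probability of generating a duplicate ID is almost negligible, because each ID combines a sequence ID, a timestamp, and the machine's MAC address.
Auto-generated IDs are unique within an index, but they are not globally unique across all indices in a cluster. If two documents in different indices have the same auto-generated ID, they are treated as two different documents.
The example auto-generated ID "5PMM3nYBgTGA2v2S6qve" is a Base64-encoded UUID: the first two bytes are derived from the sequence ID, the next four bytes from the current timestamp, the next six bytes are the MAC address of the machine running Elasticsearch, and the final three bytes are the remaining bytes of the timestamp and sequence ID.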