If you want to get serious about Elasticsearch, you'll have to learn about hardware. It might be an unpopular opinion in 2017, but don't run Elasticsearch in the cloud. It has nothing to do with latency or losing your AWS spot instances because Netflix has just released a new show; it has to do with picking the right hardware for your needs. Cloud providers such as AWS sell vCPUs, but there's no way to know exactly what you're going to get.
Memory management is the first thing people ask for advice about, because Java garbage collection gives them trouble. What you should actually care about is, in no particular order:
- CPU
- Memory
- Network
- Storage
CPU
Running complex filtered queries, intensive indexing, percolation, and queries against non-Latin charsets all have a strong impact on the CPU, so picking the right one is critical.
Dive into the CPU specs to understand how they behave with Java. It's been more than a decade since I last read Intel specs (I even used to have them as physical books), but it proved critical for picking the right hardware.
For example, Xeon E5 v4 provides 60% better performance than the v3 version when running Java. Xeon D works well when you want to scale your cluster horizontally, as long as heavy indexing is split evenly amongst the cluster nodes; otherwise, prepare for trouble, with nodes popping out of the cluster like popcorn. So picking the right CPU and the right horizontal design is critical.
Speaking of CPU, Elasticsearch divides CPU use into thread pools of various types:
- generic: for standard operations such as discovery
- index: for indexing
- get: for get operations, obviously
- bulk: for bulk operations such as bulk indexing
These are the most important ones you'll have to deal with; RTFM for everything else.
Each pool runs a number of threads, which can be configured, and has a queue, which can be configured too. The default number of threads is derived from a system variable called allocated_processors, which is never greater than 32, even if you have 48 cores and the available_processors variable shows 48.
I wouldn't recommend changing the thread pool sizes unless you really know what you're doing; the default settings are quite sensible. You might want to adapt the queue sizes though, since a full queue means operations get rejected. You can get a glance at your thread pool state using the thread_pool API.
curl -XGET "localhost:9200/_cat/thread_pool/search?v&h=host,name,active,rejected,completed"
A special trick when your cluster is CPU bound and every node can hold a full replica of your data set: run your cluster behind HAProxy and bypass the ingest nodes to hit the data nodes directly. If you have heterogeneous nodes, give a greater weight to the nodes with the highest number of cores.
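A minimal HAProxy sketch of that setup, assuming three hypothetical data nodes named es-data-1 to es-data-3 listening on port 9200, the last one getting a higher weight because it has more cores:
# haproxy.cfg, illustrative only
listen elasticsearch
  bind :9200
  mode http
  balance roundrobin
  option httpchk GET /_cluster/health
  server es-data-1 es-data-1:9200 check weight 10
  server es-data-2 es-data-2:9200 check weight 10
  server es-data-3 es-data-3:9200 check weight 20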
Memory
Elasticsearch runs on Java, and Java is a garbage collected language, which means you'll run into memory management problems.
Memory is divided into 2 parts: what you allocate to the Java heap space, and everything else. Elasticsearch does not rely on the Java heap only; for example, every thread created within a thread pool allocates 256KB of off-heap memory.
The basic thing to understand with heap allocation is: the more memory you give it, the more time Java spends garbage collecting.
Elasticsearch comes with Concurrent Mark Sweep as its default garbage collector. CMS runs multiple concurrent threads to scan the heap for objects that can be recycled. The main problem with CMS is how it can fall back to a "stop the world" phase, during which the JVM stops responding to the application until the collection completes. This happens when the application has changed the state of the heap while CMS was running concurrently, forcing it to start over until it has marked all the objects for deletion. Let's put it this way: CMS sucks donkey balls when the heap is over 4GB, which is almost always the case with Elasticsearch.
Java 8 brings a brand new garbage collector called Garbage First, or G1, designed for heaps greater than 4GB. G1 uses background threads to divide the heap into 1 to 32MB regions, then scans the regions that contain the most garbage objects first. Elasticsearch and Lucene do not recommend using G1GC, for many reasons, one of them being a nasty bug on 32-bit JVMs that might lead to data corruption. From an operational point of view, though, switching to G1GC was miraculous: no more stop the world, and only a few memory management issues left.
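If you decide to go against that recommendation anyway, the switch lives in jvm.options; a minimal sketch, assuming a 64-bit JVM running Java 8:
# jvm.options: comment out the default CMS flags...
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=75
#-XX:+UseCMSInitiatingOccupancyOnly
# ...and enable G1 instead (unsupported by Elasticsearch, use at your own risk)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200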
That said, choosing the right amount of memory for the heap is the trickiest part of designing an Elasticsearch cluster. Whatever you pick, never allocate more than 31GB to the heap, so the JVM can keep using compressed object pointers.
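For reference, the heap size is also set in jvm.options; keep the minimum and maximum identical so the whole heap is committed upfront. A sketch for a hypothetical 64GB machine:
# jvm.options: half the RAM, capped below 32GB to keep compressed object pointers
-Xms31g
-Xmx31g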
Elasticsearch provides plenty of metrics to understand how the workload weighs on memory.
curl -XGET "localhost:9200/_nodes/stats"
Elasticsearch uses multiple buffers to perform in-memory operations, as well as caches that store query results and evict entries in LRU order when they fill up. When the results are mostly large data sets and queries are rarely repeated, disabling the caches might be a good idea, as shown below.
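As a sketch, with a hypothetical index named myindex, the shard request cache can be disabled per index or bypassed on a single search:
curl -XPUT "localhost:9200/myindex/_settings" -d '{"index.requests.cache.enable": false}'
curl -XGET "localhost:9200/myindex/_search?request_cache=false" -d '{"query": {"match_all": {}}}'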
The caches you need to monitor are the following; their size settings are sketched right after the list:
- the query cache: defined per node, with a default of 10% of the heap.
- the shard request cache: used to cache the results of queries run on multiple shards.
- the fielddata cache: limited to 30% of the total heap.
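Those limits map to node-level settings in elasticsearch.yml; a sketch using the figures above (treat them as starting points, not recommendations):
# elasticsearch.yml, illustrative values
indices:
  queries:
    cache:
      size: "10%"
  fielddata:
    cache:
      size: "30%"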
Since version 5, the Elasticsearch buffers have been simplified, and there are only 2 buffers left to monitor (see the sketch after this list):
- the indexing buffer: it is used to buffer data during the indexing process.
- the buffer_pools.
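The indexing buffer is sized through indices.memory.index_buffer_size (10% of the heap by default), and the buffer pools show up under the jvm section of the node stats; a quick sketch of both:
# elasticsearch.yml: default value shown, tune with care
indices:
  memory:
    index_buffer_size: "10%"
curl -XGET "localhost:9200/_nodes/stats/jvm?filter_path=**.buffer_pools"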
Elasticsearch is not idiot proof and won't tell you if the buffers and caches you configure add up to more than 100% of the heap; you'll only find out when you try to fill them all at the same time and get an out of memory error.
I said earlier that too much memory might lead to management issues. Actually, the more memory the better when you play outside of the heap: off-heap memory is used to manage threads and by the filesystem to cache the data.
The Elasticsearch file system storage type has an important impact on cluster performance. After trying both the Elasticsearch default_fs and mmapfs, I've picked niofs for file system storage.
The NIO FS type stores the shard index on the file system (maps to Lucene NIOFSDirectory) using NIO. It allows multiple threads to read from the same file concurrently.
Niofs lets the kernel manage the file system cache instead of relying on the broken, out of memory error generator mmapfs.
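The store type can be set per index at creation time and, at least on the 5.x line, as a default in elasticsearch.yml; a sketch of both, myindex being a hypothetical name:
# elasticsearch.yml
index:
  store:
    type: niofs
curl -XPUT "localhost:9200/myindex" -d '{"settings": {"index.store.type": "niofs"}}'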
You might also want to lock the memory allocated to the heap at startup. This prevents the node from swapping that memory out when the system runs short of available memory.
# previously bootstrap.mlockall
bootstrap:
  memory_lock: true
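Once the node is up, you can check whether the lock actually took effect; if mlockall comes back false, the memlock ulimit of the user running Elasticsearch is the usual suspect:
curl -XGET "localhost:9200/_nodes?filter_path=**.mlockall"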
Network
Let's put it this way: you never have too much bandwidth. A 1Gb link is good, 10Gb is better, and low latency is even better. Elasticsearch performs lots of network-heavy operations, from transferring data during queries to reallocating shards, so networking matters.
The multicast discovery plugin was removed from Elasticsearch 5, so discovery is done either using unicast or a cloud plugin, but you won't run Elasticsearch in the cloud, will you?
If your hosting provider allows it, activate jumbo frames on your network interfaces. Jumbo frames might reduce network latency by about 15%, which is noticeable when transferring large amounts of data.
ifconfig eth0 mtu 9000
Elasticsearch comes with some interesting network-related settings, which are quite low by default; notably the recovery transfer rate is limited to 40mb/s. Here it is raised to 2g:
indices:
  recovery:
    max_bytes_per_sec: "2g"
Raise this value only if your storage can handle it while serving queries, indexing, and performing administrative tasks such as merges.
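The same limit can also be changed at runtime through the cluster settings API rather than editing the configuration and restarting; the 500mb below is only an example value:
curl -XPUT "localhost:9200/_cluster/settings" -d '{"transient": {"indices.recovery.max_bytes_per_sec": "500mb"}}'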
Storage
After memory, storage is often the bottleneck of an Elasticsearch cluster. Unless you have a good reason to, don't use spinning disks. Large SSD drives are now affordable, with a decent life expectancy, and you'll need fast I/O. Also, prefer local storage to anything else to reduce read and write latency. Consider your data nodes as disposable whenever possible.
A question that comes often about storage design is: should I go with RAID0, RAID1(0) or JBOD?
RAID0 offers the best cost / speed / storage space ratio. It fits perfectly in large clusters where losing a node is not a problem. RAID0 offers the maximum storage space on a single file system, which is convenient when managing large shards; without enough available storage on a single node, operations like index optimization won't be possible. RAID0 also offers the maximum number of spindles, without the RAID1(0) replication overhead, which is useful for intensive indexing.
On the other hand, losing a single disk means losing a whole data node, so choosing RAID0 means having enough spare data nodes to store the whole dataset in case of a crash.
JBOD offers the best cost / storage / security ratio. Each disk is assigned to a mountpoint, and the mountpoints are listed in the Elasticsearch configuration, as sketched below. JBOD is a good choice when you can't afford to lose a whole data node but losing a single disk is OK; the tradeoff is lower read and write performance. Running large shards on JBOD can also make administrative tasks like index optimization a problem.
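A sketch of that configuration with hypothetical mountpoints; each entry of path.data should point to a different physical disk:
# elasticsearch.yml
path:
  data:
    - /mnt/disk0
    - /mnt/disk1
    - /mnt/disk2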
Depending on how many data nodes you can afford to lose, running many hosts with software RAID0 is the best speed / storage space / cost setup. Otherwise, use JBOD to avoid the extra I/O caused by RAID1 and RAID10 replication.
RAID1(0) is the option for people who run Elasticsearch on a single node. It provides the maximum data availability, since losing a disk is survivable, but at the cost of space and performance. RAID1(0) means dedicating half of the storage to RAID replication, and the replication overhead is something to take into account.
Elasticsearch comes with 2 storage-related throttling protections. The first one limits the storage bandwidth you can use, and its default is as low as 10mb/s. You can change it in the node settings:
indices:
  store:
    throttle:
      max_bytes_per_sec: "2g"
The second one throttles merges so too many of them don't run at once, which as a side effect can slow down your indexing process. If you run bulk indexing or don't care about search speed, you can disable merge throttling entirely:
indices:
  store:
    throttle:
      type: "none"
Key takeaways
- Hardware matters: when running Elasticsearch, picking the right hardware for your needs beats the convenience of running in the cloud, especially for CPU, memory, network, and storage.
- CPU choice and configuration: different CPU models (such as Xeon E5 v4 versus v3) perform very differently on Java workloads, so picking the right CPU is critical for complex queries and indexing performance.
- Thread pool management: Elasticsearch handles different operations (indexing, search, and so on) in separate thread pools; understanding and configuring their thread counts and queue sizes matters for performance tuning.
- Memory management: Java garbage collection has a big impact on Elasticsearch performance, so heap allocation and the choice of garbage collector (CMS versus G1GC) deserve care.
- Heap sizing: never allocate more than 31GB to the heap; too much heap increases garbage collection overhead and hurts performance.
- Caches and buffers: Elasticsearch has several caches and buffers (query cache, shard request cache, fielddata cache) that need monitoring and proper sizing to optimize memory use.
- File system storage: prefer niofs over mmapfs, since the latter can lead to out of memory errors; niofs lets the kernel manage the file system cache.
- Network performance: bandwidth and latency are critical for Elasticsearch; jumbo frames can reduce network latency, and the network transfer settings are worth tuning.
- Storage configuration: use SSDs rather than spinning disks; RAID0, JBOD, and RAID1(0) trade off data safety, performance, and cost differently, so pick the layout that matches your constraints.
- Storage throttling and optimization: Elasticsearch throttles storage bandwidth and merge operations by default; adjust those limits to optimize storage performance.