Elasticsearch聚合查询优化实战
一、聚合查询概述
Elasticsearch的聚合功能是数据分析的核心,支持多种聚合类型来满足不同的分析需求。
1.1 聚合类型
| 类型 | 说明 | 使用场景 |
|---|---|---|
| Metric | 指标聚合 | 求和、平均值、最大值、最小值 |
| Bucket | 桶聚合 | 分组统计、区间统计 |
| Pipeline | 管道聚合 | 基于聚合结果的二次聚合 |
| Matrix | 矩阵聚合 | 多字段统计 |
1.2 聚合执行流程
┌─────────────────────────────────────────────────────────────┐
│ 查询请求 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 分片查询 │
│ (每个分片独立执行聚合) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 结果合并 │
│ (Coordinator节点合并各分片结果) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 返回结果 │
└─────────────────────────────────────────────────────────────┘
二、常见聚合查询
2.1 基础指标聚合
json
{
"aggs": {
"total_sales": { "sum": { "field": "amount" } },
"avg_price": { "avg": { "field": "price" } },
"max_quantity": { "max": { "field": "quantity" } },
"min_quantity": { "min": { "field": "quantity" } },
"order_count": { "value_count": { "field": "order_id" } }
}
}
2.2 桶聚合
json
{
"aggs": {
"group_by_category": {
"terms": {
"field": "category.keyword",
"size": 10,
"order": { "total_sales": "desc" }
},
"aggs": {
"total_sales": { "sum": { "field": "amount" } },
"avg_rating": { "avg": { "field": "rating" } }
}
}
}
}
2.3 日期范围聚合
json
{
"aggs": {
"sales_over_time": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "month",
"format": "yyyy-MM"
},
"aggs": {
"total_sales": { "sum": { "field": "amount" } },
"unique_customers": { "cardinality": { "field": "customer_id" } }
}
}
}
}
三、聚合优化策略
3.1 使用Filter减少数据量
json
{
"query": {
"bool": {
"filter": [
{ "range": { "order_date": { "gte": "2024-01-01" } } },
{ "term": { "status": "completed" } }
]
}
},
"aggs": {
"total_sales": { "sum": { "field": "amount" } }
}
}
3.2 使用分片优先聚合
json
{
"aggs": {
"fast_cardinality": {
"cardinality": {
"field": "user_id",
"precision_threshold": 10000
}
}
}
}
3.3 避免深度嵌套聚合
json
// 避免:三层以上嵌套
{
"aggs": {
"category": { "terms": { "field": "category" } },
"aggs": {
"sub_category": { "terms": { "field": "sub_category" } },
"aggs": {
"brand": { "terms": { "field": "brand" } },
"aggs": {
"total": { "sum": { "field": "amount" } }
}
}
}
}
}
3.4 使用采样聚合
json
{
"size": 0,
"aggs": {
"sample": {
"sampler": { "shard_size": 1000 },
"aggs": {
"top_sales": {
"terms": { "field": "product_id", "size": 10 },
"aggs": { "revenue": { "sum": { "field": "amount" } } }
}
}
}
}
}
四、性能监控与诊断
4.1 启用慢查询日志
yaml
logger.org.elasticsearch.action.search: DEBUG
logger.org.elasticsearch.search: DEBUG
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
4.2 分析聚合耗时
json
{
"profile": true,
"aggs": {
"category_sales": {
"terms": { "field": "category.keyword" },
"aggs": { "total": { "sum": { "field": "amount" } } }
}
}
}
4.3 监控指标
bash
# 查询聚合相关指标
GET _nodes/stats/search
# 查询索引统计
GET /orders/_stats
五、高级优化技巧
5.1 使用近似聚合
json
{
"aggs": {
"approx_unique_users": {
"cardinality": {
"field": "user_id",
"precision_threshold": 50000
}
},
"approx_percentile": {
"percentiles": {
"field": "amount",
"percents": [95, 99],
"tdigest": { "compression": 100 }
}
}
}
}
5.2 预聚合数据
json
// 使用Transform创建物化视图
PUT _transform/sales_summary
{
"source": { "index": ["orders"] },
"dest": { "index": "sales_summary" },
"pivot": {
"group_by": {
"category": { "terms": { "field": "category.keyword" } },
"date": { "date_histogram": { "field": "order_date", "calendar_interval": "day" } }
},
"aggregations": {
"total_sales": { "sum": { "field": "amount" } },
"order_count": { "value_count": { "field": "order_id" } }
}
},
"sync": {
"time": { "field": "order_date" }
}
}
5.3 使用Runtime Fields
json
{
"runtime_mappings": {
"profit_margin": {
"type": "double",
"script": "emit(doc['revenue'].value - doc['cost'].value)"
}
},
"aggs": {
"avg_profit_margin": { "avg": { "field": "profit_margin" } }
}
}
六、实战案例
6.1 电商销售分析
json
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "range": { "order_date": { "gte": "2024-01-01", "lt": "2024-02-01" } } },
{ "term": { "status": "completed" } }
]
}
},
"aggs": {
"by_category": {
"terms": {
"field": "category.keyword",
"size": 5,
"shard_size": 100
},
"aggs": {
"total_sales": { "sum": { "field": "amount" } },
"avg_order_value": { "avg": { "field": "amount" } },
"top_products": {
"terms": {
"field": "product_id",
"size": 3,
"shard_size": 50
},
"aggs": {
"product_sales": { "sum": { "field": "amount" } }
}
}
}
}
}
}
6.2 用户行为分析
json
{
"size": 0,
"aggs": {
"active_users": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "hour",
"min_doc_count": 0
},
"aggs": {
"unique_users": { "cardinality": { "field": "user_id", "precision_threshold": 10000 } },
"avg_session_duration": { "avg": { "field": "session_duration" } },
"page_views": { "sum": { "field": "page_views" } }
}
}
}
}
七、常见问题与解决方案
7.1 聚合结果不准确
json
// 使用shard_size提高准确性
{
"aggs": {
"top_products": {
"terms": {
"field": "product_id",
"size": 10,
"shard_size": 100
}
}
}
}
7.2 内存不足
json
// 增加内存限制
{
"index": {
"memory": {
"max_result_window": 100000,
"max_terms_count": 1000000
}
}
}
7.3 慢聚合查询
json
// 使用索引排序优化
PUT /orders/_mapping
{
"properties": {
"category": {
"type": "keyword",
"doc_values": true
}
}
}
八、最佳实践总结
8.1 索引设计
- 使用keyword类型:桶聚合字段应使用keyword类型
- 启用doc_values:加速排序和聚合
- 合理分片:根据数据量和查询模式调整分片数
8.2 查询优化
- 先过滤后聚合:使用filter减少聚合数据量
- 限制结果数量:设置合理的size和shard_size
- 避免深度嵌套:三层以上嵌套会显著影响性能
8.3 数据预热
bash
# 预热缓存
POST /orders/_cache/clear
POST /orders/_search?request_cache=true
{
"size": 0,
"aggs": { "warmup": { "terms": { "field": "category.keyword" } } }
}
通过合理设计索引和优化聚合查询,可以显著提升Elasticsearch的分析性能。