Table of Contents

- Elasticsearch Search Engine Integration Guide
- Introduction: Why Modern Search Matters
- 1. Elasticsearch Core Concepts and Architecture
  - 1.1 What Is Elasticsearch?
  - 1.2 Core Concepts
    - 1.2.1 Cluster
    - 1.2.2 Node
    - 1.2.3 Index
    - 1.2.4 Type (deprecated since 7.x)
    - 1.2.5 Document
    - 1.2.6 Shards and Replicas
  - 1.3 Architecture Diagram
- 2. Core Principles
  - 2.1 The Inverted Index
  - 2.2 Relevance Scoring
    - 2.2.1 The TF-IDF Formula
    - 2.2.2 The BM25 Algorithm
  - 2.3 Distributed Search
- 3. Environment Setup and Configuration
  - 3.1 Installing Elasticsearch
    - 3.1.1 Docker Install
    - 3.1.2 Local Install (macOS)
  - 3.2 Installing the Python Client
  - 3.3 Verifying the Installation
- 4. Complete Python Integration
  - 4.1 Project Structure and Configuration
  - 4.2 Configuration File
  - 4.3 Elasticsearch Client Wrapper
  - 4.4 Data Models and Mappings
  - 4.5 Index Operations Service
  - 4.6 Search Service
  - 4.7 Helper Utilities
  - 4.8 Main Program Example
- 5. Performance Optimization and Best Practices
  - 5.1 Index Design Optimization
    - 5.1.1 Mapping Design Principles
    - 5.1.2 Shard Strategy Optimization
# Elasticsearch Search Engine Integration Guide

## Introduction: Why Modern Search Matters

In an era of information overload, finding the right information quickly and accurately in massive datasets has become a core requirement of nearly every application. Relational databases excel at storage and transaction processing, but they fall short at full-text search and complex queries. Elasticsearch, the most popular open-source search engine today, has become the default choice for enterprise search thanks to its distributed architecture, high performance, and easy scalability.

This article digs into Elasticsearch's core concepts and inner workings, then shows how to integrate it into a real project through a complete Python implementation.
## 1. Elasticsearch Core Concepts and Architecture

### 1.1 What Is Elasticsearch?

Elasticsearch is a distributed, RESTful search and analytics engine built on Lucene. It stores, searches, and analyzes large volumes of data in near real time, and it commonly serves as the underlying engine for complex search features, log analytics, and business intelligence.

### 1.2 Core Concepts

#### 1.2.1 Cluster

An Elasticsearch cluster consists of one or more nodes that together hold all of the data and provide federated indexing and search across every node. Each cluster is identified by a unique name.
#### 1.2.2 Node

A node is a single server in the cluster; it stores data and takes part in indexing and search. Nodes can be configured for different roles:

- Master node: manages the cluster
- Data node: stores data and executes data-related operations
- Coordinating node: handles client requests
#### 1.2.3 Index

An index is a collection of documents with similar characteristics, roughly analogous to a "database" in the relational world. Each index has a unique name, and that name is used to index, search, update, and delete its documents.

#### 1.2.4 Type (deprecated since 7.x)

Before 7.x, a type was a logical category/partition inside an index, analogous to a relational "table". Since 6.x a new index may contain only a single type; the concept was deprecated in 7.x and removed entirely in 8.x.

#### 1.2.5 Document

A document is the basic unit of information that can be indexed, expressed as JSON. It corresponds to a row in a relational table.
#### 1.2.6 Shards and Replicas

- Shard: an index can be split into multiple shards, each of which is a self-contained Lucene index
- Replica: a copy of a shard, providing data redundancy and high availability

Both are configured when an index is created, as shown in the sketch below.
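As a minimal illustration (assuming a local single-node cluster on `localhost:9200`), an index can be created with explicit shard settings via the low-level client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Create an index with 3 primary shards, each with 1 replica.
# The primary shard count is fixed at creation time; the replica
# count can be changed later through the index settings API.
es.indices.create(
    index="products",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
```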
### 1.3 Architecture Diagram

The original article includes a diagram of the `products` index: client requests arrive at a coordinating node, which routes them to data nodes 1-3. The index is split into primary shards P0-P2 with matching replica shards R0-R2 spread across the nodes, and the shards hold the documents themselves (documents 1-4 in the figure).
## 2. Core Principles

### 2.1 The Inverted Index

At Elasticsearch's core is Lucene's inverted index. Unlike the forward index of a relational database, an inverted index maps terms to documents:

- Forward index: document ID → document content → words
- Inverted index: word → document IDs

Mathematically, the structure can be described as follows.
Let the document collection be $D = \{d_1, d_2, \ldots, d_n\}$ and the term set $T = \{t_1, t_2, \ldots, t_m\}$. The inverted index $I$ maps each term to the set of documents that contain it:

$$I(t_i) = \{d_j \mid t_i \in d_j\}$$
For each term $t_i$, the index also stores:

- Document frequency (DF): the number of documents containing the term
- Term frequency (TF): how often the term occurs in each document
- Position information: where the term appears inside each document

The toy sketch after this list makes the mapping concrete.
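Here is a minimal in-memory sketch (plain Python, whitespace tokenization only, no stemming) that records, for each term, the postings described above:

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is a search engine",
    2: "a search engine indexes documents",
}

# term -> list of (doc_id, positions); term frequency is len(positions)
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        inverted_index[term].append((doc_id, pos_list))

# Document frequency of "search" = number of postings for the term
print(len(inverted_index["search"]))  # 2
print(inverted_index["engine"])       # [(1, [4]), (2, [2])]
```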
### 2.2 Relevance Scoring

Elasticsearch ranks documents by a relevance score. Its scoring evolved from classic TF-IDF (term frequency-inverse document frequency) and uses the BM25 algorithm by default since version 5.0.
#### 2.2.1 The TF-IDF Formula

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$

where:

- $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$
- $\text{idf}(t, D) = \log \dfrac{N}{|\{d \in D : t \in d\}|}$, where $N$ is the total number of documents

The small scorer below plugs these definitions straight into code.
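This is a sketch for intuition only, not how Lucene organizes the computation internally:

```python
import math

def tf_idf(term: str, doc_terms: list, corpus: list) -> float:
    """Toy tf-idf following the formula above.

    doc_terms: one tokenized document; corpus: list of tokenized documents.
    """
    tf = doc_terms.count(term)
    df = sum(1 for d in corpus if term in d)  # |{d in D : t in d}|
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["search", "engine"], ["search", "database"], ["cache"]]
print(tf_idf("engine", corpus[0], corpus))  # 1 * ln(3/1) ≈ 1.10
```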
#### 2.2.2 The BM25 Algorithm

BM25 is a refinement of TF-IDF; the formula is more involved, but it ranks better in practice:

$$\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \times \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)}$$

where:

- $f(t, d)$: frequency of term $t$ in document $d$
- $|d|$: length of document $d$
- $\text{avgdl}$: average document length across the collection
- $k_1$ and $b$: tunable parameters (typically $k_1 = 1.2$, $b = 0.75$)

A direct translation of this formula into code follows the list below.
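A toy implementation of the formula, using Lucene's smoothed, non-negative IDF variant $\ln\bigl(1 + \frac{N - \text{df} + 0.5}{\text{df} + 0.5}\bigr)$ rather than the plain IDF above:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Toy BM25 scorer implementing the formula above."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in query_terms:
        f = doc_terms.count(t)                 # f(t, d)
        df = sum(1 for d in corpus if t in d)  # document frequency
        # Lucene's BM25 IDF variant; always non-negative
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (f * (k1 + 1)) / denom if denom else 0.0
    return score

corpus = [["fast", "search", "engine"], ["slow", "database"], ["search", "index"]]
print(bm25_score(["search"], corpus[0], corpus))
```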
### 2.3 Distributed Search

A distributed search in Elasticsearch runs in two phases ("query then fetch"), mimicked by the sketch after this list:

1. Query phase: the coordinating node broadcasts the query to every relevant shard; each shard executes it locally and returns lightweight results (document IDs and scores)
2. Fetch phase: the coordinating node merges the per-shard results, sorts them by score, and then fetches the full documents for the winners
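The following toy scatter-gather sketch is purely illustrative (the `FakeShard` class is hypothetical; real shards are Lucene indices living inside data nodes), but it shows why only document IDs and scores travel in the first phase:

```python
import heapq

class FakeShard:
    """Hypothetical stand-in for one shard's local index."""
    def __init__(self, docs):
        self.docs = docs  # doc_id -> text

    def search_ids(self, term, size):
        # Query phase: return lightweight (score, doc_id) pairs only
        return [(text.count(term), doc_id)
                for doc_id, text in self.docs.items() if term in text][:size]

shards = [
    FakeShard({"a1": "search search engine", "a2": "database"}),
    FakeShard({"b1": "search index", "b2": "search engine search search"}),
]

# Query phase: scatter the query, gather per-shard top hits
hits = [hit for shard in shards for hit in shard.search_ids("search", size=2)]

# Coordinator merges to a global top-2 by score
top = heapq.nlargest(2, hits)

# Fetch phase: retrieve full documents for the winners only
docs = {doc_id: text for shard in shards for doc_id, text in shard.docs.items()}
print([(doc_id, docs[doc_id]) for _, doc_id in top])
```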
## 3. Environment Setup and Configuration

### 3.1 Installing Elasticsearch

#### 3.1.1 Docker Install

```bash
# Pull the Elasticsearch image
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.15.0

# Run the Elasticsearch container
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:7.15.0
```
#### 3.1.2 Local Install (macOS)

```bash
# Install with Homebrew
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full

# Start Elasticsearch
elasticsearch
```
### 3.2 Installing the Python Client

```bash
# Pin the client to the same major version as the server (7.x here)
pip install "elasticsearch>=7,<8"
pip install elasticsearch-dsl  # higher-level client with a friendlier API
```
### 3.3 Verifying the Installation

```python
import elasticsearch

# Test the connection
es = elasticsearch.Elasticsearch(["http://localhost:9200"])
print(es.info())
```
## 4. Complete Python Integration

### 4.1 Project Structure and Configuration

First, create the project directory layout:

```text
elasticsearch-integration/
├── config/
│   ├── __init__.py
│   └── settings.py
├── es_client/
│   ├── __init__.py
│   ├── connection.py
│   └── operations.py
├── models/
│   ├── __init__.py
│   ├── document.py
│   └── mappings.py
├── services/
│   ├── __init__.py
│   ├── indexer.py
│   └── searcher.py
├── utils/
│   ├── __init__.py
│   └── helpers.py
├── tests/
│   └── test_es_operations.py
├── requirements.txt
└── main.py
```
### 4.2 Configuration File

```python
# config/settings.py
"""Elasticsearch configuration settings."""


class ElasticsearchConfig:
    """Elasticsearch configuration."""

    # Connection settings
    HOSTS = ["http://localhost:9200"]
    TIMEOUT = 30  # request timeout in seconds

    # Index settings
    DEFAULT_INDEX_SETTINGS = {
        "number_of_shards": 3,    # number of primary shards
        "number_of_replicas": 1,  # number of replicas
        "refresh_interval": "1s"  # refresh interval
    }

    # Bulk operation settings
    BULK_SIZE = 1000           # documents per bulk batch
    BULK_REFRESH = "wait_for"  # refresh policy after bulk operations

    # Search settings
    DEFAULT_PAGE_SIZE = 10
    MAX_RESULT_WINDOW = 10000  # maximum result window

    # Retry settings
    MAX_RETRIES = 3
    RETRY_ON_TIMEOUT = True

    # Security settings (if security features are enabled)
    USERNAME = None
    PASSWORD = None
    API_KEY = None

    @classmethod
    def get_connection_args(cls):
        """Build the keyword arguments for the Elasticsearch client."""
        connection_args = {
            "hosts": cls.HOSTS,
            "timeout": cls.TIMEOUT,
            "max_retries": cls.MAX_RETRIES,
            "retry_on_timeout": cls.RETRY_ON_TIMEOUT
        }
        # Attach authentication credentials if configured
        if cls.USERNAME and cls.PASSWORD:
            connection_args["http_auth"] = (cls.USERNAME, cls.PASSWORD)
        elif cls.API_KEY:
            connection_args["api_key"] = cls.API_KEY
        return connection_args
```
### 4.3 Elasticsearch Client Wrapper

```python
# es_client/connection.py
"""Elasticsearch connection management."""
import logging
from typing import Dict

from elasticsearch import Elasticsearch

from config.settings import ElasticsearchConfig

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ElasticsearchClient:
    """Wrapper around the Elasticsearch client."""

    def __init__(self):
        """Initialize the client and connect immediately."""
        self._client = None
        self._connect()

    def _connect(self):
        """Establish the Elasticsearch connection."""
        try:
            connection_args = ElasticsearchConfig.get_connection_args()
            self._client = Elasticsearch(**connection_args)
            # Verify the connection
            if self._client.ping():
                logger.info("Connected to Elasticsearch")
                logger.info(f"Cluster info: {self._client.info()}")
            else:
                logger.error("Unable to connect to Elasticsearch")
                raise ConnectionError("Unable to connect to Elasticsearch")
        except Exception as e:
            logger.error(f"Failed to connect to Elasticsearch: {e}")
            raise

    @property
    def client(self) -> Elasticsearch:
        """Return the underlying client, reconnecting if needed."""
        if not self._client:
            self._connect()
        return self._client

    def reconnect(self):
        """Close and re-establish the connection."""
        self.close()
        self._connect()

    def close(self):
        """Close the connection."""
        if self._client:
            self._client.close()
            self._client = None
            logger.info("Elasticsearch connection closed")

    def get_cluster_health(self) -> Dict:
        """Return cluster health, or an empty dict on failure."""
        try:
            return self.client.cluster.health()
        except Exception as e:
            logger.error(f"Failed to fetch cluster health: {e}")
            return {}

    def get_indices_stats(self) -> Dict:
        """Return statistics for all indices, or an empty dict on failure."""
        try:
            return self.client.indices.stats()
        except Exception as e:
            logger.error(f"Failed to fetch index statistics: {e}")
            return {}

    def __enter__(self):
        """Context manager entry."""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        self.close()


# Global client instance
es_client = ElasticsearchClient()
```
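Because the module creates a global `es_client` on import, importing it anywhere yields a ready connection. The class also works as a context manager when a scoped connection is preferable; a small sketch:

```python
from es_client.connection import ElasticsearchClient

# Scoped usage: the connection is closed when the block exits
with ElasticsearchClient() as es:
    print(es.get_cluster_health().get("status"))
    print(es.client.cat.indices(format="json"))
```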
### 4.4 Data Models and Mappings

```python
# models/document.py
"""Elasticsearch document models."""
from datetime import datetime
from typing import Dict

from elasticsearch_dsl import (
    Boolean, Date, Document, Float, Integer, Keyword, Text, analyzer,
)

# Custom analyzer.
# Chinese tokenization requires the IK plugin; the standard tokenizer
# is used here as an example.
custom_analyzer = analyzer(
    'custom_analyzer',
    tokenizer='standard',
    filter=['lowercase', 'stop', 'snowball']
)


class ProductDocument(Document):
    """Product document model."""

    # Identity fields
    product_id = Keyword()
    name = Text(analyzer=custom_analyzer, fields={'raw': Keyword()})
    description = Text(analyzer=custom_analyzer)

    # Category information
    category = Keyword()
    subcategory = Keyword()
    tags = Keyword(multi=True)

    # Numeric fields
    price = Float()
    discount_price = Float()
    stock = Integer()
    sales_count = Integer()

    # Rating information
    rating = Float()
    review_count = Integer()

    # Timestamps
    created_at = Date()
    updated_at = Date()

    # Boolean flags
    is_active = Boolean()
    is_featured = Boolean()

    # Nested fields would require a Nested() definition
    # attributes = Nested()

    class Index:
        """Index configuration."""
        name = 'products'
        settings = {
            'number_of_shards': 3,
            'number_of_replicas': 1,
            'refresh_interval': '1s'
        }

    def save(self, **kwargs):
        """Save the document, stamping timestamps automatically."""
        # getattr() because accessing an unset DSL field raises AttributeError
        if not getattr(self, 'created_at', None):
            self.created_at = datetime.now()
        self.updated_at = datetime.now()
        # Delegate to the parent save method
        return super().save(**kwargs)

    @classmethod
    def get_mapping(cls) -> Dict:
        """Return the index mapping as a dict."""
        return cls._doc_type.mapping.to_dict()


class ArticleDocument(Document):
    """Article document model."""

    article_id = Keyword()
    title = Text(analyzer=custom_analyzer, fields={'raw': Keyword()})
    content = Text(analyzer=custom_analyzer)
    author = Keyword()
    category = Keyword()
    tags = Keyword(multi=True)

    # Reading statistics
    view_count = Integer()
    like_count = Integer()
    comment_count = Integer()

    # Timestamps
    published_at = Date()
    created_at = Date()
    updated_at = Date()

    # Status: draft, published, archived
    status = Keyword()

    class Index:
        name = 'articles'
        settings = {
            'number_of_shards': 2,
            'number_of_replicas': 1
        }


class UserDocument(Document):
    """User document model."""

    user_id = Keyword()
    username = Keyword()
    email = Keyword()
    full_name = Text(analyzer=custom_analyzer, fields={'raw': Keyword()})

    # Profile information
    age = Integer()
    gender = Keyword()
    location = Keyword()

    # Behaviour
    interests = Keyword(multi=True)
    last_login = Date()
    login_count = Integer()

    # Statistics
    order_count = Integer()
    total_spent = Float()

    # Timestamps
    created_at = Date()
    updated_at = Date()

    class Index:
        name = 'users'
        settings = {
            'number_of_shards': 2,
            'number_of_replicas': 1
        }
```
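These Document classes declare their index names and mappings, but elasticsearch_dsl still needs a default connection before they can be used directly. A minimal usage sketch (assuming a local cluster; the sample field values are made up):

```python
from elasticsearch_dsl import connections

from models.document import ProductDocument

# elasticsearch_dsl resolves the "default" connection from this registry
connections.create_connection(hosts=["http://localhost:9200"])

# Create the index with the mapping derived from the class definition
ProductDocument.init()

# Save a document; save() fills in the timestamps automatically
product = ProductDocument(
    meta={"id": "prod_001"},
    product_id="PROD00001",
    name="Wireless Mouse",
    category="Electronics",
    price=29.99,
    is_active=True,
)
product.save()

# Retrieve it back by ID
print(ProductDocument.get(id="prod_001").to_dict())
```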
### 4.5 Index Operations Service

```python
# services/indexer.py
"""Elasticsearch index operations."""
import logging
from datetime import datetime
from typing import Any, Dict, List, Optional

from elasticsearch.helpers import bulk

from es_client.connection import es_client
from models.document import ArticleDocument, ProductDocument, UserDocument

logger = logging.getLogger(__name__)


class IndexService:
    """Base class for index services."""

    def __init__(self, document_class):
        self.document_class = document_class
        self.index_name = document_class.Index.name
        self.client = es_client.client

    def create_index(self, delete_if_exists: bool = False) -> bool:
        """Create the index.

        Args:
            delete_if_exists: drop the index first if it already exists
        Returns:
            bool: True if the index was created (or already present)
        """
        try:
            # Check whether the index already exists
            if self.client.indices.exists(index=self.index_name):
                if delete_if_exists:
                    logger.info(f"Index {self.index_name} exists, deleting...")
                    self.client.indices.delete(index=self.index_name)
                else:
                    logger.info(f"Index {self.index_name} already exists")
                    return True
            # Create the index from the Document class; pass the client
            # explicitly so no default elasticsearch_dsl connection is needed
            self.document_class.init(using=self.client)
            logger.info(f"Created index: {self.index_name}")
            return True
        except Exception as e:
            logger.error(f"Failed to create index {self.index_name}: {e}")
            return False

    def delete_index(self) -> bool:
        """Delete the index."""
        try:
            if self.client.indices.exists(index=self.index_name):
                self.client.indices.delete(index=self.index_name)
                logger.info(f"Deleted index: {self.index_name}")
                return True
            else:
                logger.warning(f"Index {self.index_name} does not exist")
                return False
        except Exception as e:
            logger.error(f"Failed to delete index {self.index_name}: {e}")
            return False

    def index_document(self, document_id: str, document_data: Dict,
                       refresh: str = 'wait_for') -> bool:
        """Index a single document.

        Args:
            document_id: document ID
            document_data: document body
            refresh: refresh policy
        Returns:
            bool: True on success
        """
        try:
            result = self.client.index(
                index=self.index_name,
                id=document_id,
                body=document_data,
                refresh=refresh
            )
            if result['result'] in ['created', 'updated']:
                logger.debug(f"Indexed document {document_id} into {self.index_name}")
                return True
            else:
                logger.warning(f"Indexing document {document_id} returned an unexpected result: {result}")
                return False
        except Exception as e:
            logger.error(f"Failed to index document {document_id}: {e}")
            return False

    def bulk_index_documents(self, documents: List[Dict],
                             id_field: str = 'id',
                             batch_size: int = 1000) -> Dict[str, Any]:
        """Bulk-index documents.

        Args:
            documents: list of document dicts
            id_field: name of the field holding the document ID
            batch_size: chunk size per bulk request
        Returns:
            Dict: bulk operation summary
        """
        actions = []
        success_count = 0
        error_count = 0
        errors = []

        for doc in documents:
            # Extract the document ID
            doc_id = str(doc.get(id_field, ''))
            if not doc_id:
                logger.warning(f"Document is missing its ID field: {doc}")
                continue
            # Build the bulk action
            actions.append({
                "_index": self.index_name,
                "_id": doc_id,
                "_source": doc
            })

        try:
            # Execute the bulk request
            success, failed = bulk(
                self.client,
                actions,
                chunk_size=batch_size,
                raise_on_error=False,
                refresh='wait_for'
            )
            success_count = success
            error_count = len(failed)
            if failed:
                logger.error(f"Bulk indexing failed for {error_count} documents")
                errors = failed
            else:
                logger.info(f"Bulk-indexed {success_count} documents")
        except Exception as e:
            logger.error(f"Bulk indexing error: {e}")
            error_count = len(actions)
            errors = [{"error": str(e)} for _ in actions]

        return {
            "total": len(documents),
            "success": success_count,
            "failed": error_count,
            "errors": errors
        }

    def update_document(self, document_id: str, update_data: Dict,
                        refresh: str = 'wait_for') -> bool:
        """Update a document.

        Args:
            document_id: document ID
            update_data: partial document carrying the new values
            refresh: refresh policy
        Returns:
            bool: True on success
        """
        try:
            result = self.client.update(
                index=self.index_name,
                id=document_id,
                body={"doc": update_data},
                refresh=refresh
            )
            if result['result'] in ['updated', 'noop']:
                logger.debug(f"Updated document {document_id}")
                return True
            else:
                logger.warning(f"Updating document {document_id} returned an unexpected result: {result}")
                return False
        except Exception as e:
            logger.error(f"Failed to update document {document_id}: {e}")
            return False

    def delete_document(self, document_id: str,
                        refresh: str = 'wait_for') -> bool:
        """Delete a document."""
        try:
            result = self.client.delete(
                index=self.index_name,
                id=document_id,
                refresh=refresh
            )
            if result['result'] == 'deleted':
                logger.debug(f"Deleted document {document_id}")
                return True
            else:
                logger.warning(f"Deleting document {document_id} returned an unexpected result: {result}")
                return False
        except Exception as e:
            logger.error(f"Failed to delete document {document_id}: {e}")
            return False

    def get_document(self, document_id: str) -> Optional[Dict]:
        """Fetch a document by ID."""
        try:
            result = self.client.get(
                index=self.index_name,
                id=document_id
            )
            if result['found']:
                return result['_source']
            else:
                logger.warning(f"Document {document_id} does not exist")
                return None
        except Exception as e:
            logger.error(f"Failed to fetch document {document_id}: {e}")
            return None

    def document_exists(self, document_id: str) -> bool:
        """Check whether a document exists."""
        try:
            return self.client.exists(
                index=self.index_name,
                id=document_id
            )
        except Exception as e:
            logger.error(f"Existence check failed: {e}")
            return False


class ProductIndexService(IndexService):
    """Product index service."""

    def __init__(self):
        super().__init__(ProductDocument)

    def generate_sample_products(self, count: int = 100) -> List[Dict]:
        """Generate sample product data."""
        import random

        categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
        subcategories = {
            'Electronics': ['Phones', 'Laptops', 'Tablets', 'Accessories'],
            'Clothing': ['Men', 'Women', 'Kids', 'Shoes'],
            'Books': ['Fiction', 'Non-Fiction', 'Science', 'History'],
            'Home': ['Furniture', 'Kitchen', 'Decor', 'Lighting'],
            'Sports': ['Fitness', 'Outdoor', 'Team Sports', 'Water Sports']
        }

        products = []
        for i in range(1, count + 1):
            category = random.choice(categories)
            subcategory = random.choice(subcategories[category])
            products.append({
                'id': f'prod_{i:03d}',
                'product_id': f'PROD{i:05d}',
                'name': f'Product {i} - {category}',
                'description': f'This is a sample product {i} in {category} category',
                'category': category,
                'subcategory': subcategory,
                'tags': [category, subcategory, f'tag{i % 10}'],
                'price': round(random.uniform(10, 1000), 2),
                'discount_price': round(random.uniform(5, 500), 2),
                'stock': random.randint(0, 1000),
                'sales_count': random.randint(0, 500),
                'rating': round(random.uniform(1, 5), 1),
                'review_count': random.randint(0, 200),
                'created_at': datetime.now().isoformat(),
                'updated_at': datetime.now().isoformat(),
                'is_active': random.choice([True, False]),
                'is_featured': random.choice([True, False])
            })
        return products


class ArticleIndexService(IndexService):
    """Article index service."""

    def __init__(self):
        super().__init__(ArticleDocument)


class UserIndexService(IndexService):
    """User index service."""

    def __init__(self):
        super().__init__(UserDocument)
```
### 4.6 Search Service

```python
# services/searcher.py
"""Elasticsearch search services."""
import logging
from typing import Dict, List

from elasticsearch_dsl import Search
from elasticsearch_dsl.aggs import Avg, DateHistogram, Sum, Terms as AggTerms
from elasticsearch_dsl.query import (
    Bool, Match, MultiMatch, Prefix, Range, Term, Terms,
)

from es_client.connection import es_client

logger = logging.getLogger(__name__)


class SearchService:
    """Base class for search services."""

    def __init__(self, index_name: str):
        self.index_name = index_name
        self.client = es_client.client
    def basic_search(self, query: str, fields: List[str] = None,
                     page: int = 1, page_size: int = 10) -> Dict:
        """Basic search.

        Args:
            query: search keywords
            fields: fields to search
            page: page number
            page_size: results per page
        Returns:
            Dict: search results
        """
        try:
            # Build the search object
            s = Search(using=self.client, index=self.index_name)

            # Pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]

            # Build the query
            if fields:
                # Match across the given fields
                q = MultiMatch(query=query, fields=fields)
            else:
                # Default: search across all fields
                q = MultiMatch(query=query, fields=['*'])
            s = s.query(q)

            # Execute
            response = s.execute()

            # Shape the response
            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'total_pages': (response.hits.total.value + page_size - 1) // page_size,
                'results': [hit.to_dict() for hit in response],
                'took': response.took,
                'timed_out': response.timed_out
            }
            return results
        except Exception as e:
            logger.error(f"Search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'total_pages': 0,
                'results': [],
                'error': str(e)
            }
    def advanced_search(self, search_params: Dict) -> Dict:
        """Advanced search.

        Args:
            search_params: search parameters
        Returns:
            Dict: search results
        """
        # Extract pagination up front so the except branch can use it
        page = search_params.get('page', 1)
        page_size = search_params.get('page_size', 10)
        try:
            # Build the search object
            s = Search(using=self.client, index=self.index_name)

            # Remaining parameters
            query = search_params.get('query', '')
            filters = search_params.get('filters', {})
            sort_field = search_params.get('sort_field')
            sort_order = search_params.get('sort_order', 'desc')

            # Pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]

            # Build the query
            if query:
                # Combine multiple conditions with a bool query
                bool_query = Bool()
                # Full-text clause
                search_fields = search_params.get('search_fields', ['name', 'description'])
                bool_query.must.append(
                    MultiMatch(query=query, fields=search_fields)
                )
                # Filter clauses
                for field, value in filters.items():
                    if isinstance(value, list):
                        # Multi-value filter
                        bool_query.filter.append(Terms(**{field: value}))
                    else:
                        # Single-value filter
                        bool_query.filter.append(Term(**{field: value}))
                s = s.query(bool_query)
            else:
                # No keywords: apply filters only
                for field, value in filters.items():
                    if isinstance(value, list):
                        s = s.filter(Terms(**{field: value}))
                    else:
                        s = s.filter(Term(**{field: value}))

            # Sorting
            if sort_field:
                if sort_order == 'desc':
                    s = s.sort(f'-{sort_field}')
                else:
                    s = s.sort(f'{sort_field}')

            # Execute
            response = s.execute()

            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'total_pages': (response.hits.total.value + page_size - 1) // page_size,
                'results': [hit.to_dict() for hit in response],
                'took': response.took,
                'timed_out': response.timed_out
            }
            return results
        except Exception as e:
            logger.error(f"Advanced search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'total_pages': 0,
                'results': [],
                'error': str(e)
            }
    def search_with_aggregations(self, search_params: Dict) -> Dict:
        """Search with aggregations.

        Args:
            search_params: search parameters
        Returns:
            Dict: search results including aggregations
        """
        # Extract pagination up front so the except branch can use it
        page = search_params.get('page', 1)
        page_size = search_params.get('page_size', 10)
        try:
            s = Search(using=self.client, index=self.index_name)

            # Base query
            query = search_params.get('query', '')
            if query:
                s = s.query(MultiMatch(query=query, fields=['name', 'description']))

            # Aggregations
            aggregations = search_params.get('aggregations', {})
            for agg_name, agg_config in aggregations.items():
                agg_type = agg_config.get('type', 'terms')
                field = agg_config.get('field')
                size = agg_config.get('size', 10)
                if agg_type == 'terms' and field:
                    # Terms aggregation
                    s.aggs.bucket(agg_name, AggTerms(field=field, size=size))
                elif agg_type == 'date_histogram' and field:
                    # Date histogram (calendar_interval replaces interval in 7.x)
                    interval = agg_config.get('interval', 'day')
                    s.aggs.bucket(agg_name, DateHistogram(field=field, calendar_interval=interval))
                elif agg_type == 'avg' and field:
                    # Average metric
                    s.aggs.metric(agg_name, Avg(field=field))
                elif agg_type == 'sum' and field:
                    # Sum metric
                    s.aggs.metric(agg_name, Sum(field=field))

            # Pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]

            # Execute
            response = s.execute()

            # Extract aggregation results
            agg_results = {}
            for agg_name in aggregations.keys():
                if hasattr(response.aggregations, agg_name):
                    agg = getattr(response.aggregations, agg_name)
                    if hasattr(agg, 'buckets'):
                        agg_results[agg_name] = [
                            {'key': bucket.key, 'doc_count': bucket.doc_count}
                            for bucket in agg.buckets
                        ]
                    else:
                        agg_results[agg_name] = agg.value

            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'results': [hit.to_dict() for hit in response],
                'aggregations': agg_results,
                'took': response.took
            }
            return results
        except Exception as e:
            logger.error(f"Aggregation search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'results': [],
                'aggregations': {},
                'error': str(e)
            }
    def autocomplete_search(self, prefix: str, field: str = 'name',
                            size: int = 10) -> List[str]:
        """Autocomplete search.

        Args:
            prefix: prefix to complete
            field: field to complete against
            size: number of suggestions
        Returns:
            List[str]: suggestions
        """
        try:
            s = Search(using=self.client, index=self.index_name)
            # Prefix query
            s = s.query(Prefix(**{field: prefix}))
            s = s.extra(size=size)
            # Only fetch the field we need
            s = s.source([field])
            response = s.execute()
            # Collect distinct suggestions
            suggestions = list(set(hit[field] for hit in response if field in hit))
            return suggestions[:size]
        except Exception as e:
            logger.error(f"Autocomplete search failed: {e}")
            return []

    def fuzzy_search(self, query: str, field: str = 'name',
                     fuzziness: str = 'AUTO', size: int = 10) -> List[Dict]:
        """Fuzzy search.

        Args:
            query: search term
            field: field to search
            fuzziness: fuzziness level
            size: number of results
        Returns:
            List[Dict]: results
        """
        try:
            s = Search(using=self.client, index=self.index_name)
            # Match query with a fuzziness parameter
            s = s.query(
                Match(**{field: {"query": query, "fuzziness": fuzziness}})
            )
            s = s.extra(size=size)
            response = s.execute()
            return [hit.to_dict() for hit in response]
        except Exception as e:
            logger.error(f"Fuzzy search failed: {e}")
            return []
    def search_similar(self, document_id: str, fields: List[str] = None,
                       size: int = 10) -> List[Dict]:
        """Find documents similar to a reference document.

        Args:
            document_id: reference document ID
            fields: fields used for similarity
            size: number of results
        Returns:
            List[Dict]: similar documents
        """
        try:
            # Make sure the reference document exists (get() raises if missing)
            doc = self.client.get(index=self.index_name, id=document_id)
            if not doc['found']:
                return []

            s = Search(using=self.client, index=self.index_name)
            # Exclude the reference document itself
            s = s.filter(~Term(_id=document_id))
            # more_like_this query against the reference document
            mlt_query = {
                "more_like_this": {
                    "fields": fields or ['name', 'description'],
                    "like": [{"_id": document_id}],
                    "min_term_freq": 1,
                    "max_query_terms": 25
                }
            }
            s = s.query(mlt_query)
            s = s.extra(size=size)
            response = s.execute()
            return [hit.to_dict() for hit in response]
        except Exception as e:
            logger.error(f"Similar-document search failed: {e}")
            return []
class ProductSearchService(SearchService):
    """Product search service."""

    def __init__(self):
        super().__init__('products')

    def search_products(self, keyword: str = '', category: str = '',
                        min_price: float = None, max_price: float = None,
                        min_rating: float = None, in_stock: bool = None,
                        page: int = 1, page_size: int = 10,
                        sort_by: str = 'relevance') -> Dict:
        """Product search.

        Args:
            keyword: search keywords
            category: product category
            min_price: minimum price
            max_price: maximum price
            min_rating: minimum rating
            in_stock: whether the product must be in stock
            page: page number
            page_size: results per page
            sort_by: sort mode
        Returns:
            Dict: search results
        """
        try:
            s = Search(using=self.client, index=self.index_name)

            # Compose a bool query
            bool_query = Bool()

            # Keyword clause
            if keyword:
                bool_query.must.append(
                    MultiMatch(
                        query=keyword,
                        fields=['name^3', 'description^2', 'tags'],
                        fuzziness='AUTO'
                    )
                )

            # Filters
            if category:
                bool_query.filter.append(Term(category=category))
            if min_price is not None or max_price is not None:
                price_range = {}
                if min_price is not None:
                    price_range['gte'] = min_price
                if max_price is not None:
                    price_range['lte'] = max_price
                bool_query.filter.append(Range(price=price_range))
            if min_rating is not None:
                bool_query.filter.append(Range(rating={'gte': min_rating}))
            if in_stock is not None:
                if in_stock:
                    bool_query.filter.append(Range(stock={'gt': 0}))
                else:
                    bool_query.filter.append(Term(stock=0))

            # Apply the query (an empty bool query matches all documents)
            s = s.query(bool_query)

            # Sorting
            if sort_by == 'price_asc':
                s = s.sort('price')
            elif sort_by == 'price_desc':
                s = s.sort('-price')
            elif sort_by == 'rating':
                s = s.sort('-rating')
            elif sort_by == 'sales':
                s = s.sort('-sales_count')
            elif sort_by == 'newest':
                s = s.sort('-created_at')

            # Pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]

            # Execute
            response = s.execute()

            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'total_pages': (response.hits.total.value + page_size - 1) // page_size,
                'products': [hit.to_dict() for hit in response],
                'took': response.took
            }
            return results
        except Exception as e:
            logger.error(f"Product search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'total_pages': 0,
                'products': [],
                'error': str(e)
            }

    def get_category_facets(self) -> Dict:
        """Fetch category facets (for filter UIs)."""
        try:
            s = Search(using=self.client, index=self.index_name)
            # Category aggregation
            s.aggs.bucket('categories', AggTerms(field='category', size=100))
            # Aggregations only, no documents
            s = s.extra(size=0)
            response = s.execute()
            # Extract the buckets
            categories = []
            if hasattr(response.aggregations, 'categories'):
                for bucket in response.aggregations.categories.buckets:
                    categories.append({
                        'name': bucket.key,
                        'count': bucket.doc_count
                    })
            return {'categories': categories}
        except Exception as e:
            logger.error(f"Failed to fetch category facets: {e}")
            return {'categories': []}


class ArticleSearchService(SearchService):
    """Article search service."""

    def __init__(self):
        super().__init__('articles')
```
### 4.7 Helper Utilities

```python
# utils/helpers.py
"""Helper functions for working with Elasticsearch."""
import hashlib
import json
import logging
from datetime import date, datetime
from typing import Any, Dict, List, Union

logger = logging.getLogger(__name__)


def generate_document_id(data: Dict, id_fields: List[str] = None) -> str:
    """Generate a document ID.

    Args:
        data: document data
        id_fields: fields used to derive the ID
    Returns:
        str: generated document ID
    """
    if not id_fields:
        # No fields given: hash the whole document
        id_string = json.dumps(data, sort_keys=True)
    else:
        # Use the specified fields
        id_parts = []
        for field in id_fields:
            if field in data:
                id_parts.append(str(data[field]))
        if not id_parts:
            raise ValueError("Cannot derive an ID from the given fields")
        id_string = '_'.join(id_parts)
    # SHA-256 hash, truncated to 32 hex characters
    return hashlib.sha256(id_string.encode()).hexdigest()[:32]


def format_elasticsearch_response(response: Dict) -> Dict:
    """Normalize a raw Elasticsearch response.

    Args:
        response: raw Elasticsearch response
    Returns:
        Dict: normalized response
    """
    if 'hits' in response:
        # Search response
        formatted = {
            'total': response['hits']['total']['value'],
            'took': response.get('took', 0),
            'timed_out': response.get('timed_out', False),
            'results': []
        }
        for hit in response['hits']['hits']:
            result = hit['_source']
            result['_id'] = hit['_id']
            result['_score'] = hit.get('_score', 0)
            formatted['results'].append(result)
        return formatted
    else:
        # Any other kind of response
        return response


def convert_to_elasticsearch_date(date_value: Union[str, datetime, date]) -> str:
    """Convert a value to an Elasticsearch-friendly date string.

    Args:
        date_value: date value
    Returns:
        str: ISO-formatted date string
    """
    if isinstance(date_value, str):
        # Try to parse strings
        try:
            dt = datetime.fromisoformat(date_value.replace('Z', '+00:00'))
            return dt.isoformat()
        except ValueError:
            return date_value
    elif isinstance(date_value, datetime):
        return date_value.isoformat()
    elif isinstance(date_value, date):
        return datetime.combine(date_value, datetime.min.time()).isoformat()
    else:
        raise ValueError(f"Unsupported date type: {type(date_value)}")
def build_range_query(field: str, min_value: Any = None,
                      max_value: Any = None) -> Dict:
    """Build a range query.

    Args:
        field: field name
        min_value: lower bound
        max_value: upper bound
    Returns:
        Dict: range query
    """
    range_query = {}
    if min_value is not None:
        range_query['gte'] = min_value
    if max_value is not None:
        range_query['lte'] = max_value
    if range_query:
        return {'range': {field: range_query}}
    else:
        return {}


def build_terms_query(field: str, values: List[Any]) -> Dict:
    """Build a terms query.

    Args:
        field: field name
        values: list of values
    Returns:
        Dict: terms query
    """
    if not values:
        return {}
    return {'terms': {field: values}}


def build_bool_query(must: List[Dict] = None,
                     filter: List[Dict] = None,
                     should: List[Dict] = None,
                     must_not: List[Dict] = None,
                     minimum_should_match: int = 1) -> Dict:
    """Build a bool query.

    Args:
        must: clauses that must match
        filter: filter clauses
        should: clauses that should match
        must_not: clauses that must not match
        minimum_should_match: minimum number of should clauses to match
    Returns:
        Dict: bool query
    """
    bool_query = {}
    if must:
        bool_query['must'] = must
    if filter:
        bool_query['filter'] = filter
    if should:
        bool_query['should'] = should
    if must_not:
        bool_query['must_not'] = must_not
    if should and minimum_should_match != 1:
        bool_query['minimum_should_match'] = minimum_should_match
    if bool_query:
        return {'bool': bool_query}
    else:
        return {}


def estimate_index_size(doc_count: int, avg_doc_size: int) -> Dict[str, str]:
    """Estimate an index's size.

    Args:
        doc_count: number of documents
        avg_doc_size: average document size in bytes
    Returns:
        Dict: size estimate
    """
    # Naive estimate
    total_size = doc_count * avg_doc_size
    # Add roughly 30% Elasticsearch overhead
    estimated_size = total_size * 1.3
    # Convert to a human-readable unit
    size_units = ['B', 'KB', 'MB', 'GB', 'TB']
    size = estimated_size
    unit_index = 0
    while size >= 1024 and unit_index < len(size_units) - 1:
        size /= 1024
        unit_index += 1
    return {
        'estimated_size': f"{size:.2f} {size_units[unit_index]}",
        'doc_count': doc_count,
        'avg_doc_size': f"{avg_doc_size} B"
    }


def validate_index_name(name: str) -> bool:
    """Validate an index name.

    Args:
        name: index name
    Returns:
        bool: True if valid
    """
    import re
    # Elasticsearch index naming rules: lowercase, starting with a
    # letter or digit, then letters, digits, underscores, or hyphens
    pattern = r'^[a-z0-9][a-z0-9_\-]*$'
    if not re.match(pattern, name):
        return False
    # Defensive check (the regex already forbids these prefixes)
    invalid_prefixes = ['_', '-', '+']
    if any(name.startswith(prefix) for prefix in invalid_prefixes):
        return False
    # Length limit
    if len(name) > 255:
        return False
    return True
```
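These builders compose plain query dicts that can be passed straight to the low-level client. A small hypothetical example (the field names mirror the product index defined earlier):

```python
from es_client.connection import es_client
from utils.helpers import build_bool_query, build_range_query, build_terms_query

# Active electronics or books priced between 100 and 500
query = build_bool_query(
    must=[{"match": {"description": "sample"}}],
    filter=[
        build_terms_query("category", ["Electronics", "Books"]),
        build_range_query("price", min_value=100, max_value=500),
    ],
)

response = es_client.client.search(index="products", body={"query": query})
print(response["hits"]["total"]["value"])
```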
### 4.8 Main Program Example

```python
# main.py
"""Elasticsearch integration demo."""
import logging
import sys
import time

from elasticsearch_dsl import Search

from es_client.connection import es_client
from services.indexer import ProductIndexService
from services.searcher import ProductSearchService
from utils.helpers import estimate_index_size

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
def demonstrate_basic_operations():
    """Demonstrate basic operations."""
    logger.info("=" * 60)
    logger.info("Elasticsearch basic operations demo")
    logger.info("=" * 60)

    # Product index service
    product_service = ProductIndexService()

    # 1. Create the index
    logger.info("\n1. Creating the product index...")
    if product_service.create_index(delete_if_exists=True):
        logger.info("✓ Product index created")
    else:
        logger.error("✗ Failed to create product index")
        return

    # 2. Generate and index sample data
    logger.info("\n2. Generating and indexing sample data...")
    sample_products = product_service.generate_sample_products(50)
    bulk_result = product_service.bulk_index_documents(
        sample_products,
        id_field='id',
        batch_size=20
    )
    logger.info(f"Bulk result: success={bulk_result['success']}, failed={bulk_result['failed']}")

    # Wait for the index to refresh
    time.sleep(2)

    # 3. Search demo
    logger.info("\n3. Search demo...")
    search_service = ProductSearchService()

    # Basic search
    logger.info("\n  Basic search - keyword: 'electronics'")
    results = search_service.basic_search(
        query="electronics",
        fields=['name', 'description', 'category'],
        page=1,
        page_size=5
    )
    logger.info(f"  Found {results['total']} results")
    for i, product in enumerate(results['results'][:3], 1):
        logger.info(f"  {i}. {product.get('name')} - ${product.get('price')}")

    # Advanced search: category filter plus a price range,
    # expressed directly with elasticsearch_dsl
    logger.info("\n  Advanced search - category: 'Electronics', price range: $100-$500")
    s = Search(using=es_client.client, index='products')
    s = s.filter('term', category='Electronics')
    s = s.filter('range', price={'gte': 100, 'lte': 500})
    s = s.sort('price')
    s = s.extra(size=5)
    response = s.execute()
    logger.info(f"  Found {response.hits.total.value} matching electronics")
    for i, hit in enumerate(response[:3], 1):
        logger.info(f"  {i}. {hit.name} - ${hit.price}")

    # 4. Aggregation demo
    logger.info("\n4. Aggregation demo...")
    # Group by category
    s = Search(using=es_client.client, index='products')
    s.aggs.bucket('by_category', 'terms', field='category', size=10)
    s = s.extra(size=0)  # aggregations only, no documents
    response = s.execute()
    logger.info("  Counts per category:")
    for bucket in response.aggregations.by_category.buckets:
        logger.info(f"    {bucket.key}: {bucket.doc_count} products")

    # 5. Update and delete demo
    logger.info("\n5. Update and delete demo...")
    # Update a document
    sample_product_id = sample_products[0]['id']
    update_data = {
        'price': 999.99,
        'updated_at': time.strftime('%Y-%m-%dT%H:%M:%S')
    }
    if product_service.update_document(sample_product_id, update_data):
        logger.info(f"✓ Updated product {sample_product_id}")
        # Verify the update
        updated_doc = product_service.get_document(sample_product_id)
        if updated_doc:
            logger.info(f"  New price: ${updated_doc.get('price')}")

    # Delete a document
    if product_service.delete_document(sample_product_id):
        logger.info(f"✓ Deleted product {sample_product_id}")
        # Verify the deletion
        if not product_service.document_exists(sample_product_id):
            logger.info("  Document is gone")

    # 6. Index statistics
    logger.info("\n6. Index statistics...")
    indices_stats = es_client.client.indices.stats(index='products')
    if 'indices' in indices_stats and 'products' in indices_stats['indices']:
        products_stats = indices_stats['indices']['products']
        total_docs = products_stats['total']['docs']['count']
        total_size = products_stats['total']['store']['size_in_bytes']
        logger.info(f"  Total documents: {total_docs}")
        logger.info(f"  Index size: {total_size / 1024 / 1024:.2f} MB")

    # Estimate the index size at larger scale
    estimation = estimate_index_size(1000000, 2000)  # 1M docs, ~2 KB each
    logger.info(f"  Estimated size at 1M documents: {estimation['estimated_size']}")
def demonstrate_advanced_features():
    """Demonstrate advanced features."""
    logger.info("\n" + "=" * 60)
    logger.info("Elasticsearch advanced features demo")
    logger.info("=" * 60)

    search_service = ProductSearchService()

    # 1. Autocomplete
    logger.info("\n1. Autocomplete demo...")
    suggestions = search_service.autocomplete_search(
        prefix='prod',
        field='name',
        size=5
    )
    logger.info("  Product name suggestions:")
    for i, suggestion in enumerate(suggestions, 1):
        logger.info(f"  {i}. {suggestion}")

    # 2. Fuzzy search
    logger.info("\n2. Fuzzy search demo...")
    fuzzy_results = search_service.fuzzy_search(
        query='electrnics',  # deliberately misspelled
        field='category',
        fuzziness='AUTO',
        size=3
    )
    logger.info(f"  Fuzzy search returned {len(fuzzy_results)} results")
    for i, result in enumerate(fuzzy_results[:3], 1):
        logger.info(f"  {i}. {result.get('name')} - category: {result.get('category')}")

    # 3. Similar-document search
    logger.info("\n3. Similar-document search demo...")
    # Grab any document ID first
    s = Search(using=es_client.client, index='products')
    s = s.extra(size=1)
    response = s.execute()
    if response.hits:
        sample_doc_id = response.hits[0].meta.id
        similar_docs = search_service.search_similar(
            document_id=sample_doc_id,
            fields=['name', 'description', 'category'],
            size=3
        )
        logger.info(f"  Documents similar to {sample_doc_id}:")
        for i, doc in enumerate(similar_docs[:3], 1):
            logger.info(f"  {i}. {doc.get('name')} - category: {doc.get('category')}")

    # 4. Nested aggregations
    logger.info("\n4. Nested aggregation demo...")
    # Group by category, then compute the average price per bucket
    s = Search(using=es_client.client, index='products')
    s.aggs.bucket('categories', 'terms', field='category', size=5) \
        .metric('avg_price', 'avg', field='price')
    s = s.extra(size=0)
    response = s.execute()
    logger.info("  Average price per category:")
    for bucket in response.aggregations.categories.buckets:
        avg_price = bucket.avg_price.value
        logger.info(f"  {bucket.key}: ${avg_price:.2f} ({bucket.doc_count} products)")
def check_system_health():
    """Check system health."""
    logger.info("\n" + "=" * 60)
    logger.info("System health check")
    logger.info("=" * 60)

    # 1. Cluster health
    health = es_client.get_cluster_health()
    if health:
        logger.info(f"Cluster status: {health.get('status', 'unknown')}")
        logger.info(f"Nodes: {health.get('number_of_nodes', 0)}")
        logger.info(f"Data nodes: {health.get('number_of_data_nodes', 0)}")
        logger.info(f"Active shards: {health.get('active_shards', 0)}")

    # 2. Index status
    indices = es_client.client.indices.stats()
    if 'indices' in indices:
        logger.info(f"\nIndex count: {len(indices['indices'])}")
        for index_name, index_stats in indices['indices'].items():
            docs_count = index_stats['total']['docs']['count']
            size_bytes = index_stats['total']['store']['size_in_bytes']
            size_mb = size_bytes / 1024 / 1024
            logger.info(f"  {index_name}: {docs_count} docs, {size_mb:.2f} MB")

    # 3. Connectivity test
    try:
        if es_client.client.ping():
            logger.info("\n✓ Elasticsearch connection OK")
        else:
            logger.error("\n✗ Elasticsearch connection failed")
    except Exception as e:
        logger.error(f"\n✗ Elasticsearch connection error: {e}")
def main():
    """Entry point."""
    try:
        logger.info("Starting the Elasticsearch integration demo")
        # Health check
        check_system_health()
        # Basic operations
        demonstrate_basic_operations()
        # Advanced features
        demonstrate_advanced_features()
        logger.info("\n" + "=" * 60)
        logger.info("Demo finished!")
        logger.info("=" * 60)
    except KeyboardInterrupt:
        logger.info("\nDemo interrupted by user")
    except Exception as e:
        logger.error(f"Error during the demo: {e}", exc_info=True)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
## 5. Performance Optimization and Best Practices

### 5.1 Index Design Optimization

#### 5.1.1 Mapping Design Principles

```python
# Example of an optimized mapping design
optimized_mapping = {
    "mappings": {
        "dynamic": "strict",  # reject documents with unmapped fields
        "properties": {
            "product_id": {
                "type": "keyword",
                "ignore_above": 256
            },
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",  # Chinese analyzer (requires the IK plugin)
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "price": {
                "type": "scaled_float",  # scaled float saves space
                "scaling_factor": 100
            },
            "created_at": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
            },
            "tags": {
                "type": "keyword",
                "eager_global_ordinals": True  # preload global ordinals
            }
        }
    },
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "refresh_interval": "30s",  # less frequent refresh speeds up indexing
        "index": {
            "max_result_window": 100000  # widen the maximum result window
        }
    }
}
```
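The mapping body above is a plain JSON-compatible dict and can be applied directly when creating the index. A sketch (the index name is hypothetical, and `ik_max_word` will only resolve if the IK analysis plugin is installed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Creating the index applies mappings and settings in one call.
# With "dynamic": "strict", indexing a document containing an
# unmapped field is rejected instead of silently extending the mapping.
es.indices.create(index="products_v2", body=optimized_mapping)
```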
#### 5.1.2 Shard Strategy Optimization

The original article presents its shard-sizing guidance as a flowchart: assess the data volume first, then choose the shard count, keeping individual shards roughly in the 20-50 GB range with 1-2 replicas. Summarized as a table:

| Data volume | Suggested primary shards | Target shard size |
| --- | --- | --- |
| Small (< 100 GB) | 1 | < 50 GB |
| Medium (100 GB - 1 TB) | 3-5 | 20-50 GB |
| Large (> 1 TB) | 10+ | 30-50 GB |
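As a rough rule of thumb derived from the table (a heuristic only, not an official formula), the shard count can be estimated from the expected index size and a target shard size:

```python
import math

def suggest_shard_count(index_size_gb: float, target_shard_gb: float = 40) -> int:
    """Heuristic only: aim for shards in the 20-50 GB range."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

for size in (30, 300, 3000):  # expected index size in GB
    print(f"{size} GB -> {suggest_shard_count(size)} shard(s)")
# 30 GB -> 1, 300 GB -> 8, 3000 GB -> 75
```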