Elasticsearch Search Engine Integration Guide

Table of Contents

  • Elasticsearch Search Engine Integration Guide
    • Introduction: The Importance of Modern Search Technology
    • 1. Elasticsearch Core Concepts and Architecture
      • 1.1 What Is Elasticsearch?
      • 1.2 Elasticsearch Core Concepts
        • 1.2.1 Cluster
        • 1.2.2 Node
        • 1.2.3 Index
        • 1.2.4 Type (deprecated since 7.x)
        • 1.2.5 Document
        • 1.2.6 Shards and Replicas
      • 1.3 Elasticsearch Architecture Diagram
    • 2. Elasticsearch Core Principles
      • 2.1 Inverted Index
      • 2.2 Relevance Scoring
        • 2.2.1 TF-IDF Formula
        • 2.2.2 BM25 Algorithm
      • 2.3 Distributed Search
    • 3. Environment Setup and Configuration
      • 3.1 Installing Elasticsearch
        • 3.1.1 Installing with Docker
        • 3.1.2 Local Installation (macOS)
      • 3.2 Installing the Python Client
      • 3.3 Verifying the Installation
    • 4. Complete Python Integration with Elasticsearch
      • 4.1 Project Structure and Configuration
      • 4.2 Configuration File
      • 4.3 Elasticsearch Client Wrapper
      • 4.4 Data Models and Mappings
      • 4.5 Indexing Service
      • 4.6 Search Service
      • 4.7 Helper Utilities
      • 4.8 Main Program Example
    • 5. Performance Optimization and Best Practices
      • 5.1 Index Design Optimization
        • 5.1.1 Mapping Design Principles
        • 5.1.2 Shard Strategy Optimization

Introduction: The Importance of Modern Search Technology

In an era of information explosion, finding the right information in massive data sets quickly and accurately has become a core requirement for almost every application. Traditional relational databases excel at data storage and transaction processing, but they fall short on full-text search and complex queries. Elasticsearch, the most popular open-source search engine today, has become the first choice for enterprise search thanks to its distributed, high-performance, and easily scalable design.

This article explores Elasticsearch's core concepts and how it works, then shows how to integrate and use it in a real project through a complete Python implementation.

1. Elasticsearch Core Concepts and Architecture

1.1 What Is Elasticsearch?

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It can store, search, and analyze large volumes of data in near real time, and it is commonly used as the underlying engine for complex search scenarios, log analytics, and business intelligence.

1.2 Elasticsearch Core Concepts

1.2.1 Cluster

An Elasticsearch cluster consists of one or more nodes that together hold all of the data and provide federated indexing and search across every node. Each cluster is identified by a unique name.

1.2.2 Node

A node is a single server within the cluster; it stores data and participates in the cluster's indexing and search. Nodes can be configured for different roles:

  • Master node: responsible for cluster management
  • Data node: stores data and executes data-related operations
  • Coordinating node: handles client requests

1.2.3 Index

An index is a collection of documents with similar characteristics, analogous to a "database" in a relational system. Every index has a unique name, which is used to index, search, update, and delete the documents it contains.

1.2.4 Type (deprecated since 7.x)

Before version 7.x, a type was a logical category or partition inside an index, analogous to a "table" in a relational database. From 7.x onward each index can contain only a single type, and types are removed entirely in 8.x.

1.2.5 Document

A document is the basic unit of information that can be indexed, expressed as JSON. It corresponds to a row in a relational database.
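
For example, a product document is just a JSON object; in Python it is naturally a dict. The field names below are illustrative, not required by Elasticsearch:

```python
# A minimal sample document; any JSON-serializable structure works.
product_doc = {
    "product_id": "PROD00001",
    "name": "Wireless Mouse",
    "price": 29.99,
    "tags": ["Electronics", "Accessories"],
    "created_at": "2024-01-15T09:30:00",
}
```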

1.2.6 Shards and Replicas

  • Shard: an index can be split into multiple shards, each of which is a complete Lucene index
  • Replica: a copy of a shard, providing data redundancy and high availability
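
The primary shard count is fixed when an index is created (the replica count can be changed later), so it is set in the index settings. A minimal sketch with the official Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Create an index with 3 primary shards, each with 1 replica.
es.indices.create(
    index="products",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
```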

1.3 Elasticsearch Architecture Diagram

(Figure: architecture diagram. A client request reaches a coordinating node, which fans out to data nodes 1-3. The `products` index is divided into primary shards P0, P1, and P2, each with a replica shard R0, R1, and R2 placed on a different data node; documents 1-4 are distributed across the shards.)

2. Elasticsearch Core Principles

2.1 Inverted Index

At the core of Elasticsearch is Lucene's inverted index. Unlike the forward index used by relational databases, an inverted index maps each term to the documents that contain it.

Forward index: document ID → document content → terms
Inverted index: term → document IDs

The structure of an inverted index can be expressed mathematically:

Let the document collection be $D = \{d_1, d_2, \ldots, d_n\}$ and the term set be $T = \{t_1, t_2, \ldots, t_m\}$.

The inverted index $I$ is a mapping from each term to the list of documents containing it:

$$I(t_i) = \{d_j \mid t_i \in d_j\}$$

For each term $t_i$, the index also stores:

  • Document frequency (DF): the number of documents containing the term
  • Term frequency (TF): how often the term occurs in each document
  • Position information: the positions of the term within each document
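
To make this concrete, here is a toy sketch (not Lucene's actual data structures) that builds an inverted index recording DF, TF, and positions for a tiny corpus:

```python
from collections import defaultdict

docs = {
    "d1": "elasticsearch is a search engine",
    "d2": "lucene powers the elasticsearch search engine",
}

# term -> {doc_id -> [positions]}; TF is len(positions), DF is len(postings).
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].setdefault(doc_id, []).append(pos)

postings = index["search"]
print("DF:", len(postings))              # 2
print("TF in d2:", len(postings["d2"]))  # 1
print("positions in d2:", postings["d2"])  # [4]
```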

2.2 Relevance Scoring

Elasticsearch assigns each matching document a relevance score. Classic Lucene scoring is based on TF-IDF (term frequency-inverse document frequency); since Elasticsearch 5.0 the default similarity is BM25, a refinement of TF-IDF.

2.2.1 TF-IDF Formula

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$

where:

  • $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$
  • $\text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$, where $N$ is the total number of documents
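
A quick worked example with toy numbers (natural log): a term that appears twice in a document and occurs in 2 of $N = 4$ documents scores $2 \times \ln(4/2) \approx 1.39$. The same in code:

```python
import math

def tf_idf(tf: int, n_docs: int, df: int) -> float:
    """Plain TF-IDF as defined above (not Lucene's exact variant)."""
    return tf * math.log(n_docs / df)

print(round(tf_idf(tf=2, n_docs=4, df=2), 2))  # 1.39
```
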
2.2.2 BM25 Algorithm

BM25 is an improved version of TF-IDF; the formula is more involved, but it ranks better in practice:

$$\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \times \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)}$$

where:

  • $f(t, d)$: the frequency of term $t$ in document $d$
  • $|d|$: the length of document $d$
  • $\text{avgdl}$: the average document length across the collection
  • $k_1$ and $b$: tunable parameters (typically $k_1 = 1.2$, $b = 0.75$)
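
A minimal sketch of the formula in Python (using the plain IDF defined above; Lucene's BM25 uses a slightly smoothed IDF):

```python
import math

def bm25_term(tf: float, df: int, n_docs: int, doc_len: int,
              avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Per-term BM25 contribution; sum over the query terms for the full score."""
    idf = math.log(n_docs / df)
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Term occurs 3 times in a 100-token doc; corpus: 1000 docs, avgdl 120, df 50.
print(round(bm25_term(tf=3, df=50, n_docs=1000, doc_len=100, avgdl=120), 2))  # 4.88
```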

2.3 Distributed Search

A distributed search in Elasticsearch runs in two phases:

  1. Query phase: the coordinating node broadcasts the query to all relevant shards; each shard executes it locally and returns lightweight results (document IDs and scores)
  2. Fetch phase: the coordinating node merges the per-shard results, sorts them by score, and then fetches the full documents for the top hits
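
The control flow is a classic scatter/gather; the toy sketch below imitates the two phases over in-memory "shards" (an illustration of the idea, not Elasticsearch code):

```python
import heapq

# Each "shard" holds the (doc_id, score) pairs it matched locally.
shards = [
    [("d1", 3.2), ("d7", 1.1)],
    [("d4", 2.5)],
    [("d9", 4.0), ("d2", 0.9)],
]
doc_store = {"d1": "...", "d2": "...", "d4": "...", "d7": "...", "d9": "..."}

def search(top_k: int):
    # Query phase: every shard returns only IDs and scores.
    candidates = [hit for shard in shards for hit in shard]
    top = heapq.nlargest(top_k, candidates, key=lambda h: h[1])
    # Fetch phase: retrieve full documents for the merged top hits only.
    return [(doc_id, score, doc_store[doc_id]) for doc_id, score in top]

print(search(top_k=2))  # [('d9', 4.0, '...'), ('d1', 3.2, '...')]
```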

3. Environment Setup and Configuration

3.1 Installing Elasticsearch

3.1.1 Installing with Docker

```bash
# Pull the Elasticsearch image
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.15.0

# Run the Elasticsearch container
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:7.15.0
```

3.1.2 Local Installation (macOS)
```bash
# Install with Homebrew
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full

# Start Elasticsearch
elasticsearch
```

3.2 Installing the Python Client

```bash
# Pin to 7.x to match the 7.15 server used in this guide
pip install "elasticsearch>=7,<8"
pip install "elasticsearch-dsl>=7,<8"  # high-level client with a friendlier API
```

3.3 Verifying the Installation

```python
import elasticsearch

# Test the connection
es = elasticsearch.Elasticsearch(["http://localhost:9200"])
print(es.info())
```
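
If the connection works, `es.info()` prints the cluster name and version. A quick follow-up health check is also worth running (a single-node Docker setup typically reports "yellow", since replica shards have nowhere to be allocated):

```python
health = es.cluster.health()
print(health["status"])  # "green", "yellow", or "red"
```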

4. Complete Python Integration with Elasticsearch

4.1 Project Structure and Configuration

First, create the project directory structure:

```text
elasticsearch-integration/
├── config/
│   ├── __init__.py
│   └── settings.py
├── es_client/
│   ├── __init__.py
│   ├── connection.py
│   └── operations.py
├── models/
│   ├── __init__.py
│   ├── document.py
│   └── mappings.py
├── services/
│   ├── __init__.py
│   ├── indexer.py
│   └── searcher.py
├── utils/
│   ├── __init__.py
│   └── helpers.py
├── tests/
│   └── test_es_operations.py
├── requirements.txt
└── main.py
```

4.2 Configuration File

```python
# config/settings.py
"""
Elasticsearch configuration settings.
"""

class ElasticsearchConfig:
    """Elasticsearch configuration class."""
    
    # Connection settings
    HOSTS = ["http://localhost:9200"]
    TIMEOUT = 30  # timeout in seconds
    
    # Index settings
    DEFAULT_INDEX_SETTINGS = {
        "number_of_shards": 3,      # number of primary shards
        "number_of_replicas": 1,    # number of replicas
        "refresh_interval": "1s"    # refresh interval
    }
    
    # Bulk-operation settings
    BULK_SIZE = 1000                # bulk batch size
    BULK_REFRESH = "wait_for"       # refresh policy after bulk operations
    
    # Search settings
    DEFAULT_PAGE_SIZE = 10
    MAX_RESULT_WINDOW = 10000       # maximum result window
    
    # Retry settings
    MAX_RETRIES = 3
    RETRY_ON_TIMEOUT = True
    
    # Security settings (if security is enabled)
    USERNAME = None
    PASSWORD = None
    API_KEY = None
    
    @classmethod
    def get_connection_args(cls):
        """Build the keyword arguments for the Elasticsearch client."""
        connection_args = {
            "hosts": cls.HOSTS,
            "timeout": cls.TIMEOUT,
            "max_retries": cls.MAX_RETRIES,
            "retry_on_timeout": cls.RETRY_ON_TIMEOUT
        }
        
        # Add authentication, if configured
        if cls.USERNAME and cls.PASSWORD:
            connection_args["http_auth"] = (cls.USERNAME, cls.PASSWORD)
        elif cls.API_KEY:
            connection_args["api_key"] = cls.API_KEY
            
        return connection_args
```

4.3 Elasticsearch Client Wrapper

```python
# es_client/connection.py
"""
Elasticsearch connection management.
"""
from elasticsearch import Elasticsearch
from config.settings import ElasticsearchConfig
import logging
from typing import Dict

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ElasticsearchClient:
    """Wrapper around the Elasticsearch client."""
    
    def __init__(self):
        """Initialize the Elasticsearch client."""
        self._client = None
        self._connect()
    
    def _connect(self):
        """Establish the Elasticsearch connection."""
        try:
            connection_args = ElasticsearchConfig.get_connection_args()
            self._client = Elasticsearch(**connection_args)
            
            # Test the connection
            if self._client.ping():
                logger.info("Connected to Elasticsearch")
                logger.info(f"Cluster info: {self._client.info()}")
            else:
                logger.error("Unable to connect to Elasticsearch")
                raise ConnectionError("Unable to connect to Elasticsearch")
                
        except Exception as e:
            logger.error(f"Failed to connect to Elasticsearch: {e}")
            raise
    
    @property
    def client(self) -> Elasticsearch:
        """Return the underlying Elasticsearch client instance."""
        if not self._client:
            self._connect()
        return self._client
    
    def reconnect(self):
        """Reconnect."""
        self.close()
        self._connect()
    
    def close(self):
        """Close the connection."""
        if self._client:
            self._client.close()
            self._client = None
            logger.info("Elasticsearch connection closed")
    
    def get_cluster_health(self) -> Dict:
        """Return the cluster health status."""
        try:
            return self.client.cluster.health()
        except Exception as e:
            logger.error(f"Failed to get cluster health: {e}")
            return {}
    
    def get_indices_stats(self) -> Dict:
        """Return statistics for all indices."""
        try:
            return self.client.indices.stats()
        except Exception as e:
            logger.error(f"Failed to get index statistics: {e}")
            return {}
    
    def __enter__(self):
        """Context-manager entry."""
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context-manager exit."""
        self.close()

# Create a global client instance
es_client = ElasticsearchClient()
```
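
Since the wrapper implements `__enter__`/`__exit__`, it can also be used as a context manager when a short-lived connection is preferable to the module-level `es_client` singleton — a usage sketch:

```python
from es_client.connection import ElasticsearchClient

with ElasticsearchClient() as es:
    print(es.get_cluster_health().get("status"))
# The connection is closed automatically on exit.
```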

4.4 Data Models and Mappings

```python
# models/document.py
"""
Elasticsearch document data models.
"""
from datetime import datetime
from typing import Dict
from elasticsearch_dsl import Document, Date, Keyword, Text, Integer, Float, Boolean
from elasticsearch_dsl import analyzer

# Custom analyzer.
# Proper Chinese tokenization requires the ik plugin; the standard
# tokenizer is used here as an example.
custom_analyzer = analyzer(
    'custom_analyzer',
    tokenizer='standard',
    filter=['lowercase', 'stop', 'snowball']
)

class ProductDocument(Document):
    """Product document model."""
    
    # Identity fields
    product_id = Keyword()
    name = Text(analyzer=custom_analyzer, fields={'raw': Keyword()})
    description = Text(analyzer=custom_analyzer)
    
    # Category information
    category = Keyword()
    subcategory = Keyword()
    tags = Keyword(multi=True)
    
    # Numeric fields
    price = Float()
    discount_price = Float()
    stock = Integer()
    sales_count = Integer()
    
    # Rating information
    rating = Float()
    review_count = Integer()
    
    # Timestamps
    created_at = Date()
    updated_at = Date()
    
    # Boolean flags
    is_active = Boolean()
    is_featured = Boolean()
    
    # Nested fields (would require a Nested type definition)
    # attributes = Nested()
    
    class Index:
        """Index configuration."""
        name = 'products'
        settings = {
            'number_of_shards': 3,
            'number_of_replicas': 1,
            'refresh_interval': '1s'
        }
    
    def save(self, **kwargs):
        """Save the document."""
        # Set timestamps automatically
        if not self.created_at:
            self.created_at = datetime.now()
        self.updated_at = datetime.now()
        
        # Delegate to the parent save method
        return super().save(**kwargs)
    
    @classmethod
    def get_mapping(cls) -> Dict:
        """Return the index mapping."""
        return cls._doc_type.mapping.to_dict()

class ArticleDocument(Document):
    """Article document model."""
    
    article_id = Keyword()
    title = Text(analyzer=custom_analyzer, fields={'raw': Keyword()})
    content = Text(analyzer=custom_analyzer)
    author = Keyword()
    category = Keyword()
    tags = Keyword(multi=True)
    
    # Reading statistics
    view_count = Integer()
    like_count = Integer()
    comment_count = Integer()
    
    # Timestamps
    published_at = Date()
    created_at = Date()
    updated_at = Date()
    
    # Status
    status = Keyword()  # draft, published, archived
    
    class Index:
        name = 'articles'
        settings = {
            'number_of_shards': 2,
            'number_of_replicas': 1
        }

class UserDocument(Document):
    """User document model."""
    
    user_id = Keyword()
    username = Keyword()
    email = Keyword()
    full_name = Text(analyzer=custom_analyzer, fields={'raw': Keyword()})
    
    # Profile information
    age = Integer()
    gender = Keyword()
    location = Keyword()
    
    # User behavior
    interests = Keyword(multi=True)
    last_login = Date()
    login_count = Integer()
    
    # Statistics
    order_count = Integer()
    total_spent = Float()
    
    # Timestamps
    created_at = Date()
    updated_at = Date()
    
    class Index:
        name = 'users'
        settings = {
            'number_of_shards': 2,
            'number_of_replicas': 1
        }
```
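
With the models defined, each index can be created from its declared mapping. A minimal sketch using the wrapper from section 4.3 (passing `using=` avoids relying on a registered default connection):

```python
from es_client.connection import es_client
from models.document import ProductDocument, ArticleDocument, UserDocument

# Document.init() creates the index with the declared mapping and settings.
for doc_class in (ProductDocument, ArticleDocument, UserDocument):
    doc_class.init(using=es_client.client)
```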

4.5 Indexing Service

```python
# services/indexer.py
"""
Elasticsearch indexing service.
"""
from typing import Dict, List, Optional, Any
from datetime import datetime
import logging
from es_client.connection import es_client
from models.document import ProductDocument, ArticleDocument, UserDocument

logger = logging.getLogger(__name__)

class IndexService:
    """Base class for index services."""
    
    def __init__(self, document_class):
        self.document_class = document_class
        self.index_name = document_class.Index.name
        self.client = es_client.client
    
    def create_index(self, delete_if_exists: bool = False) -> bool:
        """Create the index.
        
        Args:
            delete_if_exists: whether to delete the index if it already exists
            
        Returns:
            bool: True on success
        """
        try:
            # Check whether the index already exists
            if self.client.indices.exists(index=self.index_name):
                if delete_if_exists:
                    logger.info(f"Index {self.index_name} exists, deleting...")
                    self.client.indices.delete(index=self.index_name)
                else:
                    logger.info(f"Index {self.index_name} already exists")
                    return True
            
            # Create the index from the model's mapping; pass the client
            # explicitly so no default connection has to be registered
            self.document_class.init(using=self.client)
            logger.info(f"Created index: {self.index_name}")
            return True
            
        except Exception as e:
            logger.error(f"Failed to create index {self.index_name}: {e}")
            return False
    
    def delete_index(self) -> bool:
        """Delete the index."""
        try:
            if self.client.indices.exists(index=self.index_name):
                self.client.indices.delete(index=self.index_name)
                logger.info(f"Deleted index: {self.index_name}")
                return True
            else:
                logger.warning(f"Index {self.index_name} does not exist")
                return False
                
        except Exception as e:
            logger.error(f"Failed to delete index {self.index_name}: {e}")
            return False
    
    def index_document(self, document_id: str, document_data: Dict, 
                      refresh: str = 'wait_for') -> bool:
        """Index a single document.
        
        Args:
            document_id: document ID
            document_data: document body
            refresh: refresh policy
            
        Returns:
            bool: True on success
        """
        try:
            result = self.client.index(
                index=self.index_name,
                id=document_id,
                body=document_data,
                refresh=refresh
            )
            
            if result['result'] in ['created', 'updated']:
                logger.debug(f"Indexed document {document_id} into {self.index_name}")
                return True
            else:
                logger.warning(f"Indexing document {document_id} returned an unexpected result: {result}")
                return False
                
        except Exception as e:
            logger.error(f"Failed to index document {document_id}: {e}")
            return False
    
    def bulk_index_documents(self, documents: List[Dict], 
                           id_field: str = 'id',
                           batch_size: int = 1000) -> Dict[str, Any]:
        """Bulk-index documents.
        
        Args:
            documents: list of documents
            id_field: name of the document-ID field
            batch_size: batch size
            
        Returns:
            Dict: bulk operation result
        """
        from elasticsearch.helpers import bulk
        
        actions = []
        success_count = 0
        error_count = 0
        errors = []
        
        for doc in documents:
            # Extract the document ID
            doc_id = str(doc.get(id_field, ''))
            if not doc_id:
                logger.warning(f"Document is missing its ID field: {doc}")
                continue
            
            # Build the bulk action
            action = {
                "_index": self.index_name,
                "_id": doc_id,
                "_source": doc
            }
            actions.append(action)
        
        try:
            # Execute the bulk request
            success, failed = bulk(
                self.client,
                actions,
                chunk_size=batch_size,
                raise_on_error=False,
                refresh='wait_for'
            )
            
            success_count = success
            error_count = len(failed)
            
            if failed:
                logger.error(f"Bulk indexing failed for {error_count} documents")
                errors = failed
            else:
                logger.info(f"Bulk-indexed {success_count} documents")
                
        except Exception as e:
            logger.error(f"Bulk indexing raised an exception: {e}")
            error_count = len(actions)
            errors = [{"error": str(e)} for _ in actions]
        
        return {
            "total": len(documents),
            "success": success_count,
            "failed": error_count,
            "errors": errors
        }
    
    def update_document(self, document_id: str, update_data: Dict,
                       refresh: str = 'wait_for') -> bool:
        """Update a document.
        
        Args:
            document_id: document ID
            update_data: partial document with the fields to update
            refresh: refresh policy
            
        Returns:
            bool: True on success
        """
        try:
            result = self.client.update(
                index=self.index_name,
                id=document_id,
                body={"doc": update_data},
                refresh=refresh
            )
            
            if result['result'] in ['updated', 'noop']:
                logger.debug(f"Updated document {document_id}")
                return True
            else:
                logger.warning(f"Updating document {document_id} returned an unexpected result: {result}")
                return False
                
        except Exception as e:
            logger.error(f"Failed to update document {document_id}: {e}")
            return False
    
    def delete_document(self, document_id: str, 
                       refresh: str = 'wait_for') -> bool:
        """Delete a document."""
        try:
            result = self.client.delete(
                index=self.index_name,
                id=document_id,
                refresh=refresh
            )
            
            if result['result'] == 'deleted':
                logger.debug(f"Deleted document {document_id}")
                return True
            else:
                logger.warning(f"Deleting document {document_id} returned an unexpected result: {result}")
                return False
                
        except Exception as e:
            logger.error(f"Failed to delete document {document_id}: {e}")
            return False
    
    def get_document(self, document_id: str) -> Optional[Dict]:
        """Fetch a document."""
        try:
            result = self.client.get(
                index=self.index_name,
                id=document_id
            )
            
            if result['found']:
                return result['_source']
            else:
                logger.warning(f"Document {document_id} does not exist")
                return None
                
        except Exception as e:
            logger.error(f"Failed to fetch document {document_id}: {e}")
            return None
    
    def document_exists(self, document_id: str) -> bool:
        """Check whether a document exists."""
        try:
            return self.client.exists(
                index=self.index_name,
                id=document_id
            )
        except Exception as e:
            logger.error(f"Failed to check document existence: {e}")
            return False

class ProductIndexService(IndexService):
    """Product index service."""
    
    def __init__(self):
        super().__init__(ProductDocument)
    
    def generate_sample_products(self, count: int = 100) -> List[Dict]:
        """Generate sample product data."""
        import random
        
        categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
        subcategories = {
            'Electronics': ['Phones', 'Laptops', 'Tablets', 'Accessories'],
            'Clothing': ['Men', 'Women', 'Kids', 'Shoes'],
            'Books': ['Fiction', 'Non-Fiction', 'Science', 'History'],
            'Home': ['Furniture', 'Kitchen', 'Decor', 'Lighting'],
            'Sports': ['Fitness', 'Outdoor', 'Team Sports', 'Water Sports']
        }
        
        products = []
        for i in range(1, count + 1):
            category = random.choice(categories)
            subcategory = random.choice(subcategories[category])
            
            product = {
                'id': f'prod_{i:03d}',
                'product_id': f'PROD{i:05d}',
                'name': f'Product {i} - {category}',
                'description': f'This is a sample product {i} in {category} category',
                'category': category,
                'subcategory': subcategory,
                'tags': [category, subcategory, f'tag{i%10}'],
                'price': round(random.uniform(10, 1000), 2),
                'discount_price': round(random.uniform(5, 500), 2),
                'stock': random.randint(0, 1000),
                'sales_count': random.randint(0, 500),
                'rating': round(random.uniform(1, 5), 1),
                'review_count': random.randint(0, 200),
                'created_at': datetime.now().isoformat(),
                'updated_at': datetime.now().isoformat(),
                'is_active': random.choice([True, False]),
                'is_featured': random.choice([True, False])
            }
            products.append(product)
        
        return products

class ArticleIndexService(IndexService):
    """Article index service."""
    
    def __init__(self):
        super().__init__(ArticleDocument)

class UserIndexService(IndexService):
    """User index service."""
    
    def __init__(self):
        super().__init__(UserDocument)
```

4.6 Search Service

```python
# services/searcher.py
"""
Elasticsearch search service.
"""
from typing import Dict, List
import logging
from es_client.connection import es_client
from elasticsearch_dsl import Search
from elasticsearch_dsl.query import MultiMatch, Match, Term, Terms, Range
from elasticsearch_dsl.query import Bool, Prefix
from elasticsearch_dsl.aggs import Terms as AggTerms, DateHistogram, Avg, Sum

logger = logging.getLogger(__name__)

class SearchService:
    """Base class for search services."""
    
    def __init__(self, index_name: str):
        self.index_name = index_name
        self.client = es_client.client
    
    def basic_search(self, query: str, fields: List[str] = None,
                    page: int = 1, page_size: int = 10) -> Dict:
        """Basic search.
        
        Args:
            query: search keywords
            fields: list of fields to search
            page: page number
            page_size: page size
            
        Returns:
            Dict: search results
        """
        try:
            # Create the search object
            s = Search(using=self.client, index=self.index_name)
            
            # Set up pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]
            
            # Build the query
            if fields:
                # Multi-field match
                q = MultiMatch(query=query, fields=fields)
            else:
                # By default, search across all fields
                q = MultiMatch(query=query, fields=['*'])
            
            s = s.query(q)
            
            # Execute the search
            response = s.execute()
            
            # Build the result payload
            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'total_pages': (response.hits.total.value + page_size - 1) // page_size,
                'results': [hit.to_dict() for hit in response],
                'took': response.took,
                'timed_out': response.timed_out
            }
            
            return results
            
        except Exception as e:
            logger.error(f"Search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'total_pages': 0,
                'results': [],
                'error': str(e)
            }
    
    def advanced_search(self, search_params: Dict) -> Dict:
        """Advanced search.
        
        Args:
            search_params: search parameters
            
        Returns:
            Dict: search results
        """
        # Extract pagination up front so the except-branch can use it safely
        page = search_params.get('page', 1)
        page_size = search_params.get('page_size', 10)
        
        try:
            # Create the search object
            s = Search(using=self.client, index=self.index_name)
            
            # Extract the remaining parameters
            query = search_params.get('query', '')
            filters = search_params.get('filters', {})
            sort_field = search_params.get('sort_field')
            sort_order = search_params.get('sort_order', 'desc')
            
            # Set up pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]
            
            # Build the query
            if query:
                # Combine conditions with a bool query
                bool_query = Bool()
                
                # Full-text clause
                search_fields = search_params.get('search_fields', ['name', 'description'])
                bool_query.must.append(
                    MultiMatch(query=query, fields=search_fields)
                )
                
                # Filter clauses
                for field, value in filters.items():
                    if isinstance(value, list):
                        # Multi-value filter
                        bool_query.filter.append(Terms(**{field: value}))
                    else:
                        # Single-value filter
                        bool_query.filter.append(Term(**{field: value}))
                
                s = s.query(bool_query)
            else:
                # No keywords; apply the filters only
                for field, value in filters.items():
                    if isinstance(value, list):
                        s = s.filter(Terms(**{field: value}))
                    else:
                        s = s.filter(Term(**{field: value}))
            
            # Sorting
            if sort_field:
                if sort_order == 'desc':
                    s = s.sort(f'-{sort_field}')
                else:
                    s = s.sort(f'{sort_field}')
            
            # Execute the search
            response = s.execute()
            
            # Build the result payload
            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'total_pages': (response.hits.total.value + page_size - 1) // page_size,
                'results': [hit.to_dict() for hit in response],
                'took': response.took,
                'timed_out': response.timed_out
            }
            
            return results
            
        except Exception as e:
            logger.error(f"Advanced search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'total_pages': 0,
                'results': [],
                'error': str(e)
            }
    
    def search_with_aggregations(self, search_params: Dict) -> Dict:
        """Search with aggregations.
        
        Args:
            search_params: search parameters
            
        Returns:
            Dict: search results including aggregations
        """
        # Extract pagination up front so the except-branch can use it safely
        page = search_params.get('page', 1)
        page_size = search_params.get('page_size', 10)
        
        try:
            s = Search(using=self.client, index=self.index_name)
            
            # Base query
            query = search_params.get('query', '')
            if query:
                s = s.query(MultiMatch(query=query, fields=['name', 'description']))
            
            # Aggregations
            aggregations = search_params.get('aggregations', {})
            
            for agg_name, agg_config in aggregations.items():
                agg_type = agg_config.get('type', 'terms')
                field = agg_config.get('field')
                size = agg_config.get('size', 10)
                
                if agg_type == 'terms' and field:
                    # Terms aggregation
                    s.aggs.bucket(agg_name, AggTerms(field=field, size=size))
                elif agg_type == 'date_histogram' and field:
                    # Date-histogram aggregation
                    interval = agg_config.get('interval', 'day')
                    s.aggs.bucket(agg_name, DateHistogram(field=field, interval=interval))
                elif agg_type == 'avg' and field:
                    # Average aggregation
                    s.aggs.metric(agg_name, Avg(field=field))
                elif agg_type == 'sum' and field:
                    # Sum aggregation
                    s.aggs.metric(agg_name, Sum(field=field))
            
            # Pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]
            
            # Execute the search
            response = s.execute()
            
            # Extract the aggregation results
            agg_results = {}
            for agg_name in aggregations.keys():
                if hasattr(response.aggregations, agg_name):
                    agg = getattr(response.aggregations, agg_name)
                    if hasattr(agg, 'buckets'):
                        agg_results[agg_name] = [
                            {'key': bucket.key, 'doc_count': bucket.doc_count}
                            for bucket in agg.buckets
                        ]
                    else:
                        agg_results[agg_name] = agg.value
            
            # Build the result payload
            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'results': [hit.to_dict() for hit in response],
                'aggregations': agg_results,
                'took': response.took
            }
            
            return results
            
        except Exception as e:
            logger.error(f"Aggregation search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'results': [],
                'aggregations': {},
                'error': str(e)
            }
    
    def autocomplete_search(self, prefix: str, field: str = 'name',
                          size: int = 10) -> List[str]:
        """Autocomplete search.
        
        Args:
            prefix: prefix to complete
            field: field to complete against
            size: number of results to return
            
        Returns:
            List[str]: completion suggestions
        """
        try:
            s = Search(using=self.client, index=self.index_name)
            
            # Prefix query
            s = s.query(Prefix(**{field: prefix}))
            s = s.extra(size=size)
            
            # Return only the requested field
            s = s.source([field])
            
            response = s.execute()
            
            # Extract and de-duplicate the suggestions
            suggestions = list(set([hit[field] for hit in response if field in hit]))
            
            return suggestions[:size]
            
        except Exception as e:
            logger.error(f"Autocomplete search failed: {e}")
            return []
    
    def fuzzy_search(self, query: str, field: str = 'name',
                    fuzziness: str = 'AUTO', size: int = 10) -> List[Dict]:
        """Fuzzy search.
        
        Args:
            query: search term
            field: field to search
            fuzziness: fuzziness setting
            size: number of results to return
            
        Returns:
            List[Dict]: search results
        """
        try:
            s = Search(using=self.client, index=self.index_name)
            
            # Use the fuzziness parameter of a match query
            s = s.query(
                Match(**{field: {"query": query, "fuzziness": fuzziness}})
            )
            s = s.extra(size=size)
            
            response = s.execute()
            
            return [hit.to_dict() for hit in response]
            
        except Exception as e:
            logger.error(f"Fuzzy search failed: {e}")
            return []
    
    def search_similar(self, document_id: str, fields: List[str] = None,
                      size: int = 10) -> List[Dict]:
        """Find similar documents.
        
        Args:
            document_id: ID of the reference document
            fields: fields used for the similarity calculation
            size: number of results to return
            
        Returns:
            List[Dict]: similar documents
        """
        try:
            # Make sure the reference document exists
            if not self.client.exists(index=self.index_name, id=document_id):
                return []
            
            s = Search(using=self.client, index=self.index_name)
            
            # Exclude the reference document itself
            # (an ids query; term queries on _id are deprecated)
            s = s.exclude('ids', values=[document_id])
            
            # Build the more_like_this query
            mlt_query = {
                "more_like_this": {
                    "fields": fields or ['name', 'description'],
                    "like": [{"_id": document_id}],
                    "min_term_freq": 1,
                    "max_query_terms": 25
                }
            }
            
            s = s.query(mlt_query)
            s = s.extra(size=size)
            
            response = s.execute()
            
            return [hit.to_dict() for hit in response]
            
        except Exception as e:
            logger.error(f"Similar-document search failed: {e}")
            return []

class ProductSearchService(SearchService):
    """Product search service."""
    
    def __init__(self):
        super().__init__('products')
    
    def search_products(self, keyword: str = '', category: str = '',
                       min_price: float = None, max_price: float = None,
                       min_rating: float = None, in_stock: bool = None,
                       page: int = 1, page_size: int = 10,
                       sort_by: str = 'relevance') -> Dict:
        """Product search.
        
        Args:
            keyword: search keywords
            category: product category
            min_price: minimum price
            max_price: maximum price
            min_rating: minimum rating
            in_stock: whether the product must be in stock
            page: page number
            page_size: page size
            sort_by: sort order
            
        Returns:
            Dict: search results
        """
        try:
            s = Search(using=self.client, index=self.index_name)
            
            # Build the bool query
            bool_query = Bool()
            
            # Keyword search
            if keyword:
                bool_query.must.append(
                    MultiMatch(
                        query=keyword,
                        fields=['name^3', 'description^2', 'tags'],
                        fuzziness='AUTO'
                    )
                )
            
            # Filters
            if category:
                bool_query.filter.append(Term(category=category))
            
            if min_price is not None or max_price is not None:
                price_range = {}
                if min_price is not None:
                    price_range['gte'] = min_price
                if max_price is not None:
                    price_range['lte'] = max_price
                bool_query.filter.append(Range(price=price_range))
            
            if min_rating is not None:
                bool_query.filter.append(Range(rating={'gte': min_rating}))
            
            if in_stock is not None:
                if in_stock:
                    bool_query.filter.append(Range(stock={'gt': 0}))
                else:
                    bool_query.filter.append(Term(stock=0))
            
            # Apply the query
            if bool_query:
                s = s.query(bool_query)
            
            # Sorting
            if sort_by == 'price_asc':
                s = s.sort('price')
            elif sort_by == 'price_desc':
                s = s.sort('-price')
            elif sort_by == 'rating':
                s = s.sort('-rating')
            elif sort_by == 'sales':
                s = s.sort('-sales_count')
            elif sort_by == 'newest':
                s = s.sort('-created_at')
            
            # Pagination
            start_from = (page - 1) * page_size
            s = s[start_from:start_from + page_size]
            
            # Execute the search
            response = s.execute()
            
            # Build the result payload
            results = {
                'total': response.hits.total.value,
                'page': page,
                'page_size': page_size,
                'total_pages': (response.hits.total.value + page_size - 1) // page_size,
                'products': [hit.to_dict() for hit in response],
                'took': response.took
            }
            
            return results
            
        except Exception as e:
            logger.error(f"Product search failed: {e}")
            return {
                'total': 0,
                'page': page,
                'page_size': page_size,
                'total_pages': 0,
                'products': [],
                'error': str(e)
            }
    
    def get_category_facets(self) -> Dict:
        """Get category facets (for filter UIs)."""
        try:
            s = Search(using=self.client, index=self.index_name)
            
            # Category aggregation
            s.aggs.bucket('categories', AggTerms(field='category', size=100))
            
            # Fetch only the aggregations, no documents
            s = s.extra(size=0)
            
            response = s.execute()
            
            # Extract the buckets
            categories = []
            if hasattr(response.aggregations, 'categories'):
                for bucket in response.aggregations.categories.buckets:
                    categories.append({
                        'name': bucket.key,
                        'count': bucket.doc_count
                    })
            
            return {'categories': categories}
            
        except Exception as e:
            logger.error(f"Failed to get category facets: {e}")
            return {'categories': []}

class ArticleSearchService(SearchService):
    """Article search service."""
    
    def __init__(self):
        super().__init__('articles')
```

4.7 Helper Utilities

```python
# utils/helpers.py
"""
Elasticsearch helper utilities.
"""
import json
import hashlib
from datetime import datetime, date
from typing import Any, Dict, List, Union
import logging

logger = logging.getLogger(__name__)

def generate_document_id(data: Dict, id_fields: List[str] = None) -> str:
    """Generate a document ID.
    
    Args:
        data: document data
        id_fields: fields used to build the ID
        
    Returns:
        str: the generated document ID
    """
    if not id_fields:
        # No fields specified: hash the whole document
        id_string = json.dumps(data, sort_keys=True)
    else:
        # Use the specified fields
        id_parts = []
        for field in id_fields:
            if field in data:
                id_parts.append(str(data[field]))
        
        if not id_parts:
            raise ValueError("Cannot build an ID from the specified fields")
        
        id_string = '_'.join(id_parts)
    
    # Hash with SHA-256 to produce a stable ID
    return hashlib.sha256(id_string.encode()).hexdigest()[:32]

def format_elasticsearch_response(response: Dict) -> Dict:
    """Format a raw Elasticsearch response.
    
    Args:
        response: raw Elasticsearch response
        
    Returns:
        Dict: formatted response
    """
    if 'hits' in response:
        # Search response
        formatted = {
            'total': response['hits']['total']['value'],
            'took': response.get('took', 0),
            'timed_out': response.get('timed_out', False),
            'results': []
        }
        
        for hit in response['hits']['hits']:
            result = hit['_source']
            result['_id'] = hit['_id']
            result['_score'] = hit.get('_score', 0)
            formatted['results'].append(result)
        
        return formatted
    else:
        # Any other response type
        return response

def convert_to_elasticsearch_date(date_value: Union[str, datetime, date]) -> str:
    """Convert a value to an Elasticsearch date string.
    
    Args:
        date_value: the date value
        
    Returns:
        str: ISO-format date string
    """
    if isinstance(date_value, str):
        # Strings: try to parse them
        try:
            dt = datetime.fromisoformat(date_value.replace('Z', '+00:00'))
            return dt.isoformat()
        except ValueError:
            return date_value
    elif isinstance(date_value, datetime):
        return date_value.isoformat()
    elif isinstance(date_value, date):
        return datetime.combine(date_value, datetime.min.time()).isoformat()
    else:
        raise ValueError(f"Unsupported date type: {type(date_value)}")

def build_range_query(field: str, min_value: Any = None, 
                     max_value: Any = None) -> Dict:
    """Build a range query.
    
    Args:
        field: field name
        min_value: minimum value
        max_value: maximum value
        
    Returns:
        Dict: range query
    """
    range_query = {}
    
    if min_value is not None:
        range_query['gte'] = min_value
    
    if max_value is not None:
        range_query['lte'] = max_value
    
    if range_query:
        return {'range': {field: range_query}}
    else:
        return {}

def build_terms_query(field: str, values: List[Any]) -> Dict:
    """Build a terms query.
    
    Args:
        field: field name
        values: list of values
        
    Returns:
        Dict: terms query
    """
    if not values:
        return {}
    
    return {'terms': {field: values}}

def build_bool_query(must: List[Dict] = None, 
                    filter: List[Dict] = None,
                    should: List[Dict] = None,
                    must_not: List[Dict] = None,
                    minimum_should_match: int = 1) -> Dict:
    """Build a bool query.
    
    Args:
        must: clauses that must match
        filter: filter clauses
        should: clauses that should match
        must_not: clauses that must not match
        minimum_should_match: minimum number of should clauses to match
        
    Returns:
        Dict: bool query
    """
    bool_query = {}
    
    if must:
        bool_query['must'] = must
    
    if filter:
        bool_query['filter'] = filter
    
    if should:
        bool_query['should'] = should
    
    if must_not:
        bool_query['must_not'] = must_not
    
    if should and minimum_should_match != 1:
        bool_query['minimum_should_match'] = minimum_should_match
    
    if bool_query:
        return {'bool': bool_query}
    else:
        return {}

def estimate_index_size(doc_count: int, avg_doc_size: int) -> Dict[str, str]:
    """Estimate an index's size.
    
    Args:
        doc_count: number of documents
        avg_doc_size: average document size in bytes
        
    Returns:
        Dict: size estimate
    """
    # Simple estimation formula
    total_size = doc_count * avg_doc_size
    
    # Account for Elasticsearch overhead (roughly 30%)
    estimated_size = total_size * 1.3
    
    # Convert to a human-readable unit
    size_units = ['B', 'KB', 'MB', 'GB', 'TB']
    size = estimated_size
    unit_index = 0
    
    while size >= 1024 and unit_index < len(size_units) - 1:
        size /= 1024
        unit_index += 1
    
    return {
        'estimated_size': f"{size:.2f} {size_units[unit_index]}",
        'doc_count': doc_count,
        'avg_doc_size': f"{avg_doc_size} B"
    }

def validate_index_name(name: str) -> bool:
    """Validate an index name.
    
    Args:
        name: index name
        
    Returns:
        bool: whether the name is valid
    """
    import re
    
    # Simplified Elasticsearch index-naming rules:
    # lowercase, starting with a letter or digit
    pattern = r'^[a-z0-9][a-z0-9_\-]*$'
    
    if not re.match(pattern, name):
        return False
    
    # Must not start with these characters
    # (already enforced by the regex; kept as a defensive check)
    invalid_prefixes = ['_', '-', '+']
    if any(name.startswith(prefix) for prefix in invalid_prefixes):
        return False
    
    # Length limit
    if len(name) > 255:
        return False
    
    return True
```

4.8 Main Program Example

```python
# main.py
"""
Elasticsearch integration demo.
"""
import sys
import time
import logging
# Import these at module level: both demo functions below use them
from elasticsearch_dsl import Search
from es_client.connection import es_client
from services.indexer import ProductIndexService
from services.searcher import ProductSearchService
from utils.helpers import estimate_index_size

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def demonstrate_basic_operations():
    """Demonstrate basic operations."""
    logger.info("=" * 60)
    logger.info("Elasticsearch basic operations demo")
    logger.info("=" * 60)
    
    # Create the product index service
    product_service = ProductIndexService()
    
    # 1. Create the index
    logger.info("\n1. Creating the product index...")
    if product_service.create_index(delete_if_exists=True):
        logger.info("✓ Product index created")
    else:
        logger.error("✗ Failed to create the product index")
        return
    
    # 2. Generate and index sample data
    logger.info("\n2. Generating and indexing sample data...")
    sample_products = product_service.generate_sample_products(50)
    
    bulk_result = product_service.bulk_index_documents(
        sample_products,
        id_field='id',
        batch_size=20
    )
    
    logger.info(f"Bulk indexing result: success={bulk_result['success']}, failed={bulk_result['failed']}")
    
    # Wait for the index to refresh
    time.sleep(2)
    
    # 3. Search demo
    logger.info("\n3. Search demo...")
    search_service = ProductSearchService()
    
    # Basic search
    logger.info("\n  Basic search - keyword: 'electronics'")
    results = search_service.basic_search(
        query="electronics",
        fields=['name', 'description', 'category'],
        page=1,
        page_size=5
    )
    
    logger.info(f"  Found {results['total']} results")
    for i, product in enumerate(results['results'][:3], 1):
        logger.info(f"    {i}. {product.get('name')} - ${product.get('price')}")
    
    # Advanced search, expressed directly with elasticsearch_dsl
    logger.info("\n  Advanced search - category: 'Electronics', price range: $100-$500")
    price_range = {'gte': 100, 'lte': 500}
    
    s = Search(using=es_client.client, index='products')
    s = s.filter('term', category='Electronics')
    s = s.filter('range', price=price_range)
    s = s.sort('price')
    s = s.extra(size=5)
    
    response = s.execute()
    
    logger.info(f"  Found {response.hits.total.value} matching electronics")
    for i, hit in enumerate(response[:3], 1):
        logger.info(f"    {i}. {hit.name} - ${hit.price}")
    
    # 4. Aggregation demo
    logger.info("\n4. Aggregation demo...")
    
    # Aggregate by category
    s = Search(using=es_client.client, index='products')
    s.aggs.bucket('by_category', 'terms', field='category', size=10)
    s = s.extra(size=0)  # do not return documents, only aggregations
    
    response = s.execute()
    
    logger.info("  Counts per category:")
    for bucket in response.aggregations.by_category.buckets:
        logger.info(f"    {bucket.key}: {bucket.doc_count} products")
    
    # 5. Update and delete demo
    logger.info("\n5. Update and delete demo...")
    
    # Update a document
    sample_product_id = sample_products[0]['id']
    update_data = {
        'price': 999.99,
        'updated_at': time.strftime('%Y-%m-%dT%H:%M:%S')
    }
    
    if product_service.update_document(sample_product_id, update_data):
        logger.info(f"✓ Updated product {sample_product_id}")
        
        # Verify the update
        updated_doc = product_service.get_document(sample_product_id)
        if updated_doc:
            logger.info(f"  Price after update: ${updated_doc.get('price')}")
    
    # Delete a document
    if product_service.delete_document(sample_product_id):
        logger.info(f"✓ Deleted product {sample_product_id}")
        
        # Verify the deletion
        if not product_service.document_exists(sample_product_id):
            logger.info("  Document deleted successfully")
    
    # 6. Index statistics
    logger.info("\n6. Index statistics...")
    indices_stats = es_client.client.indices.stats(index='products')
    
    if 'indices' in indices_stats and 'products' in indices_stats['indices']:
        products_stats = indices_stats['indices']['products']
        total_docs = products_stats['total']['docs']['count']
        total_size = products_stats['total']['store']['size_in_bytes']
        
        logger.info(f"  Total documents: {total_docs}")
        logger.info(f"  Index size: {total_size / 1024 / 1024:.2f} MB")
        
        # Estimate the index size for a larger data set
        estimation = estimate_index_size(1000000, 2000)  # 1M docs, ~2 KB each
        logger.info(f"  Estimated size at 1M documents: {estimation['estimated_size']}")

def demonstrate_advanced_features():
    """Demonstrate advanced features."""
    logger.info("\n" + "=" * 60)
    logger.info("Elasticsearch advanced features demo")
    logger.info("=" * 60)
    
    search_service = ProductSearchService()
    
    # 1. Autocomplete
    logger.info("\n1. Autocomplete demo...")
    suggestions = search_service.autocomplete_search(
        prefix='prod',
        field='name',
        size=5
    )
    
    logger.info("  Product name suggestions:")
    for i, suggestion in enumerate(suggestions, 1):
        logger.info(f"    {i}. {suggestion}")
    
    # 2. Fuzzy search
    logger.info("\n2. Fuzzy search demo...")
    fuzzy_results = search_service.fuzzy_search(
        query='electrnics',  # deliberately misspelled
        field='category',
        fuzziness='AUTO',
        size=3
    )
    
    logger.info(f"  Fuzzy search found {len(fuzzy_results)} results")
    for i, result in enumerate(fuzzy_results[:3], 1):
        logger.info(f"    {i}. {result.get('name')} - category: {result.get('category')}")
    
    # 3. Similar-document search
    logger.info("\n3. Similar-document search demo...")
    
    # First grab any document ID
    s = Search(using=es_client.client, index='products')
    s = s.extra(size=1)
    response = s.execute()
    
    if response.hits:
        sample_doc_id = response.hits[0].meta.id
        similar_docs = search_service.search_similar(
            document_id=sample_doc_id,
            fields=['name', 'description', 'category'],
            size=3
        )
        
        logger.info(f"  Documents similar to {sample_doc_id}:")
        for i, doc in enumerate(similar_docs[:3], 1):
            logger.info(f"    {i}. {doc.get('name')} - category: {doc.get('category')}")
    
    # 4. Complex aggregations
    logger.info("\n4. Complex aggregation demo...")
    
    # Group by category, then compute the average price within each bucket
    s = Search(using=es_client.client, index='products')
    
    # Nested aggregation: bucket by category, then an avg-price metric
    s.aggs.bucket('categories', 'terms', field='category', size=5) \
         .metric('avg_price', 'avg', field='price')
    
    s = s.extra(size=0)
    
    response = s.execute()
    
    logger.info("  Average price per category:")
    for bucket in response.aggregations.categories.buckets:
        avg_price = bucket.avg_price.value
        logger.info(f"    {bucket.key}: ${avg_price:.2f} ({bucket.doc_count} products)")

def check_system_health():
    """Check system health."""
    logger.info("\n" + "=" * 60)
    logger.info("System health check")
    logger.info("=" * 60)
    
    # 1. Cluster health
    health = es_client.get_cluster_health()
    if health:
        logger.info(f"Cluster status: {health.get('status', 'unknown')}")
        logger.info(f"Nodes: {health.get('number_of_nodes', 0)}")
        logger.info(f"Data nodes: {health.get('number_of_data_nodes', 0)}")
        logger.info(f"Active shards: {health.get('active_shards', 0)}")
    
    # 2. Index status
    indices = es_client.client.indices.stats()
    if 'indices' in indices:
        logger.info(f"\nNumber of indices: {len(indices['indices'])}")
        for index_name, index_stats in indices['indices'].items():
            docs_count = index_stats['total']['docs']['count']
            size_bytes = index_stats['total']['store']['size_in_bytes']
            size_mb = size_bytes / 1024 / 1024
            
            logger.info(f"  {index_name}: {docs_count} docs, {size_mb:.2f} MB")
    
    # 3. Connection test
    try:
        if es_client.client.ping():
            logger.info("\n✓ Elasticsearch connection OK")
        else:
            logger.error("\n✗ Elasticsearch connection failed")
    except Exception as e:
        logger.error(f"\n✗ Elasticsearch connection error: {e}")

def main():
    """Entry point."""
    try:
        logger.info("Starting the Elasticsearch integration demo")
        
        # Check system health
        check_system_health()
        
        # Demonstrate basic operations
        demonstrate_basic_operations()
        
        # Demonstrate advanced features
        demonstrate_advanced_features()
        
        logger.info("\n" + "=" * 60)
        logger.info("Demo finished!")
        logger.info("=" * 60)
        
    except KeyboardInterrupt:
        logger.info("\nDemo interrupted by user")
    except Exception as e:
        logger.error(f"Error during the demo: {e}", exc_info=True)
        return 1
    
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

5. Performance Optimization and Best Practices

5.1 Index Design Optimization

5.1.1 Mapping Design Principles

```python
# Example of an optimized mapping design
optimized_mapping = {
    "mappings": {
        "dynamic": "strict",  # reject documents with unmapped fields
        "properties": {
            "product_id": {
                "type": "keyword",
                "ignore_above": 256
            },
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",  # Chinese analyzer (requires the ik plugin)
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "price": {
                "type": "scaled_float",  # scaled floats save space
                "scaling_factor": 100
            },
            "created_at": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
            },
            "tags": {
                "type": "keyword",
                "eager_global_ordinals": True  # preload global ordinals
            }
        }
    },
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "refresh_interval": "30s",  # a lower refresh rate speeds up indexing
        "index": {
            "max_result_window": 100000  # raise the maximum result window
        }
    }
}
```
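
To apply this mapping, pass the dict to the create-index API. A minimal sketch; the index name `products_v2` is illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
es.indices.create(index="products_v2", body=optimized_mapping)
```
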
5.1.2 Shard Strategy Optimization

Choose the shard count based on the volume of data the index is expected to hold:

| Data volume | Primary shards | Recommended shard size | Replicas |
|---|---|---|---|
| Small (< 100 GB) | 1 | < 50 GB | 1-2 |
| Medium (100 GB - 1 TB) | 3-5 | 20-50 GB | 1-2 |
| Large (> 1 TB) | 10+ | 30-50 GB | 1-2 |
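
A back-of-the-envelope helper for choosing a primary shard count from the table above; the 40 GB per-shard target is an assumption within the recommended 30-50 GB range:

```python
def recommend_shards(expected_index_size_gb: float, target_shard_gb: float = 40) -> int:
    """Rough primary-shard count so each shard lands near the target size."""
    return max(1, round(expected_index_size_gb / target_shard_gb))

print(recommend_shards(120))   # 3
print(recommend_shards(2000))  # 50
```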
