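[Python] Whoosh: Building Your Own Search Engine End to End

Whoosh is a full-text search library written in pure Python, well suited to building a search engine quickly.

Environment Setup

Before using Whoosh, make sure your development environment is set up correctly. The steps below cover the setup in detail.

Installing Python

First, make sure Python is installed on your system. Whoosh supports Python 2.7 and Python 3.x; Python 3.x is recommended.

You can download and install the version for your operating system from the official Python website.

Creating a Virtual Environment (Optional)

A virtual environment isolates the dependencies of different projects. You can create one with the venv module.

bash

# Create the virtual environment
python -m venv whoosh_env

# Activate the virtual environment
# Windows
whoosh_env\Scripts\activate
# macOS/Linux
source whoosh_env/bin/activate

Installing Whoosh

With the virtual environment activated, install Whoosh with pip:

bash

pip install Whoosh

To confirm that Whoosh was installed, run the following command in a terminal:

bash

pip show Whoosh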
Verifying the Installation

You can verify the installation with a short Python script. Create a file named test_whoosh.py with the following contents:

python

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
import os

# Define a schema
schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True), path=ID(stored=True))

# Create a directory to hold the index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create the index
ix = create_in("indexdir", schema)

print("Whoosh environment set up successfully!")

Run the script:

bash

python test_whoosh.py

If it prints "Whoosh environment set up successfully!", Whoosh is installed and working.

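Creating an Index

Creating an index is the core step of full-text search with Whoosh. The index is what makes retrieval of stored data fast. This section covers the concepts and the details of creating one.

Index Concepts

Index: a data structure that speeds up search. Whoosh stores documents and related information in the index so that lookups are fast.
Schema: the definition of the index structure, that is, which fields exist and what type each one has. For example, you can define title, content and path fields for a document.
Document: the unit that gets indexed. Each document has one or more fields, and Whoosh stores documents in the index for fast retrieval.

Defining the Schema

Before creating an index you must define its schema, using the whoosh.fields module. The commonly used field types are:

Field type	Description	Use cases
TEXT	Indexes text content with tokenization, for full-text search	Article bodies, comments, descriptions
ID	Indexes the whole value as a single term, typically a document identifier	Document IDs, paths, user IDs
NUMERIC	Stores numbers (int or float); supports range queries	Age, price, rating
BOOLEAN	Stores a boolean value (True/False)	Status flags, enabled/disabled
DATETIME	Stores dates and times; supports date-range queries	Creation time, update time
KEYWORD	Indexes space- or comma-separated keywords for exact matching	Tags, categories, keywords
TEXT(stored=True)	Same as TEXT, and also stores the original value so it can be returned with search results	Text fields that must appear in results
STORED	Stores a value with the document without indexing it (not searchable)	Data to return with results but never search on

TEXT and ID are the most frequently used types and cover most text and document-identifier needs.
NUMERIC and BOOLEAN suit fields that hold a number or a state.
DATETIME is for time-related data.
KEYWORD is for exact matching on a fixed set of values.

python

from whoosh.fields import Schema, TEXT, ID

# Define a schema
schema = Schema(
    title=TEXT(stored=True),             # document title
    content=TEXT(stored=True),           # document content
    path=ID(stored=True, unique=True)    # document path; unique=True lets update_document() replace by path
)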

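Creating the Index Directory

When you create an index you specify a directory for the index files. create_in() does not create the directory for you, so create it first if it does not exist. Calling create_in() on a directory that already holds an index clears that index; use open_dir() to reopen an existing one.

python

import os
from whoosh.index import create_in

# Create a directory to hold the index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create the index
ix = create_in("indexdir", schema)

Adding Documents to the Index

Once the index exists, you can add documents to it. Writes go through an IndexWriter, which you obtain by calling ix.writer().

python

writer = ix.writer()  # open a writer
writer.add_document(title="First document", content="This is the content of the first document.", path="/a")
writer.add_document(title="Second document", content="This is the content of the second document.", path="/b")
writer.commit()  # commit the changes

add_document(): adds a document to the index. The keyword arguments are the field names defined in the schema.
commit(): saves all pending changes and closes the writer.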

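Updating and Deleting Documents

Over time you may need to update or delete documents, and Whoosh provides a method for each. A writer is closed once commit() is called, so open a new writer for each batch of changes. update_document() replaces the existing document only if the schema declares a unique field (here, path=ID(stored=True, unique=True)); otherwise it just adds another document.

python

writer = ix.writer()
writer.update_document(path="/a", title="Updated document title", content="Updated content.")
writer.commit()

python

writer = ix.writer()
writer.delete_by_term('path', '/b')
writer.commit()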

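Index Storage

Whoosh can keep an index in more than one kind of storage, to suit different scenarios and performance needs. The options are described below.

In-Memory Storage

Overview: the index lives in memory, which suits short-lived tasks, tests and development.
Advantages:
- Fast, with high retrieval performance.
- No files or database to manage; simple to use.
Disadvantages:
- The data is lost when the program exits; nothing is persisted.
- Unsuitable for large data sets or applications that need long-term storage.

python

from whoosh.fields import Schema, TEXT, ID
from whoosh.filedb.filestore import RamStorage

# Define the schema
schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    path=ID(stored=True)
)

# Create an in-memory index (create_in() only accepts a directory path)
ix = RamStorage().create_index(schema)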

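File Storage

Overview: the index is stored on the file system, so the index data is persistent.
Advantages:
- The data is persistent and survives the end of the program.
- Handles larger indexes than would fit comfortably in memory.
Disadvantages:
- Slower than in-memory storage.
- You have to manage where the index files live and clean them up.

python

import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID

# Define the schema
schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    path=ID(stored=True)
)

# Create the index directory
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create the Whoosh index
ix = create_in("indexdir", schema)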

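Working with a Database

Overview: Whoosh does not ship a storage backend for SQLite or other relational databases, and create_in() expects a directory path, so passing it a database connection fails. A common arrangement is to keep the source records in a database, which provides data integrity and transactions, and to build a file-based Whoosh index from those records.
Advantages:
- The source data is persistent and protected by transaction management.
- The database supports multi-user access, and the index can be rebuilt from it at any time.
Disadvantages:
- Requires additional database knowledge and configuration.
- The index must be kept in sync with the database, and indexing speed is bounded by database reads.

python

import os
import sqlite3
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID

# Open the SQLite database that holds the source documents
# (the docs table and its columns are illustrative)
conn = sqlite3.connect('whoosh_index.db')
conn.execute("CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, title TEXT, content TEXT)")
conn.executemany(
    "INSERT OR REPLACE INTO docs VALUES (?, ?, ?)",
    [("/a", "Document 1", "This is the first document."),
     ("/b", "Document 2", "This is the second document.")],
)
conn.commit()

# Define the schema
schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    path=ID(stored=True)
)

# Create a file-based Whoosh index
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Index every row from the database
with ix.writer() as writer:
    for path, title, content in conn.execute("SELECT path, title, content FROM docs"):
        writer.add_document(title=title, content=content, path=path)
# Leaving the with block commits the changes; calling commit() again would raise an error

conn.close()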

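Querying the Index

Querying is the key step of full-text search with Whoosh: a query retrieves documents stored in the index. This section covers the concepts and the details of querying.

Query Concepts

Query: a search operation against the index that returns the matching documents. Whoosh supports several kinds of query, including basic term queries, boolean queries and phrase queries.
Searcher: the object that executes queries. You search the documents in an index through a searcher.

Creating a Searcher

Before running a query you need a searcher. You can open several searchers on one index, but each searcher belongs to a single index.

python

from whoosh.index import open_dir

# Open an existing index
ix = open_dir("indexdir")

# Create a searcher
with ix.searcher() as searcher:
    # Run queries
    pass  # query code goes here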

Basic Queries

A basic query can be built with a QueryParser, which parses the user's query string into a query object that Whoosh understands.

python

from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("first")
    results = searcher.search(query)
    for result in results:
        print(result['title'], result['path'])

QueryParser("field_name", schema): names the default field to search and the index schema.
parse("query_string"): parses the query string into a query object.

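Boolean Queries

Whoosh supports boolean operators (AND, OR, NOT) for building complex queries. A boolean query combines several conditions.

python

from whoosh.qparser import QueryParser
from whoosh.query import Or, And

# Match documents whose content contains "first" or "second"
query = Or([
    QueryParser("content", ix.schema).parse("first"),
    QueryParser("content", ix.schema).parse("second")
])

with ix.searcher() as searcher:
    results = searcher.search(query)
    for result in results:
        print(result['title'], result['path'])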

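Phrase Queries

To find documents that contain a specific phrase, use a phrase query. It matches several words as a unit, in order. In the query string, a phrase is written in double quotes.

python

query = QueryParser("content", ix.schema).parse('"first document"')

with ix.searcher() as searcher:
    results = searcher.search(query)
    for result in results:
        print(result['title'], result['path'])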

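Sorting and Scoring

Whoosh can sort query results by a specific field. You name the sort field and control the order in which results are returned.

python

with ix.searcher() as searcher:
    results = searcher.search(query, sortedby="title")  # sort by title
    for result in results:
        print(result['title'], result['path'])

When results are ranked by relevance (that is, without sortedby), each hit also carries a score that indicates how well the document matches the query. You can print these scores alongside the results.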

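Highlighting Query Results

Whoosh can highlight the matched parts of a result, which helps users see why a result is relevant. The formatter is set on the results object; Hit.highlights() itself takes no formatter argument. A field only produces highlights if it is stored and contains terms from the query.

python

from whoosh.highlight import UppercaseFormatter

with ix.searcher() as searcher:
    results = searcher.search(query)
    results.formatter = UppercaseFormatter()  # show matched terms in upper case
    for result in results:
        title = result.highlights("title")
        content = result.highlights("content")
        print(f"Title: {title}\nContent: {content}\n")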

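Query Options

Searches accept some optional parameters, for example:

limit: caps the number of results returned (the default is 10; limit=None returns every match).
search_page(): search() has no offset parameter. For paginated queries use searcher.search_page(query, pagenum, pagelen), where pagenum starts at 1.

python

with ix.searcher() as searcher:
    results = searcher.search_page(query, 3, pagelen=10)  # page 3: results 21 to 30
    for result in results:
        print(result['title'], result['path'])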

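Range Queries

A range query finds documents whose field value falls within a given range. It is typically used on numeric and date fields.

Suppose we have a date field; we can find the documents within a date range:

python

from datetime import datetime
from whoosh.fields import Schema, TEXT, ID, DATETIME
from whoosh.query import DateRange

# The index must be created with a schema that includes a DATETIME field,
# and each document must be added with a value for it
schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    path=ID(stored=True),
    date=DATETIME(stored=True)  # date field
)

# Range query example (the bounds are illustrative values)
start_date = datetime(2024, 1, 1)
end_date = datetime(2024, 12, 31)
date_query = DateRange("date", start_date, end_date)

with ix.searcher() as searcher:
    results = searcher.search(date_query)
    print("Date range query results:")
    for result in results:
        print(result['title'], result['path'])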

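Combined Queries

Different kinds of query can be combined to express more complex search conditions. For example, a boolean query can be combined with a phrase query.

python

from whoosh.qparser import QueryParser
from whoosh.query import And, Or

parser = QueryParser("content", ix.schema)
query1 = parser.parse("first")
query2 = parser.parse("second")
phrase_query = parser.parse('"second document"')

# (first OR second) AND the phrase "second document"
combined_query = And([
    Or([query1, query2]),
    phrase_query
])

with ix.searcher() as searcher:
    results = searcher.search(combined_query)
    print("Combined query results:")
    for result in results:
        print(result['title'], result['path'])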

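Using the Query Parser

The query parser lets users build complex queries from a single query string written in Whoosh's query syntax, which includes operators such as AND, OR and NOT, quoted phrases and field:term prefixes. The parser converts the input into the corresponding query objects automatically.

python

from whoosh.qparser import QueryParser

query_string = "first AND document"
query = QueryParser("content", ix.schema).parse(query_string)

with ix.searcher() as searcher:
    results = searcher.search(query)
    print("Query parser results:")
    for result in results:
        print(result['title'], result['path'])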

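Sorting and Scoring

Sorting and scoring have a large effect on the quality of search results. A suitable sorting and scoring setup helps users find relevant documents faster. This section covers the concepts and the details.

Scoring Concepts

Score: a number Whoosh computes for each document to express its relevance to the query. The higher the score, the better the document matches.
Scoring algorithm: by default Whoosh scores with BM25F, a ranking function in the TF-IDF family. TF is how often a term occurs in a document, IDF measures how rare the term is across the whole collection, and BM25F additionally normalizes for field length. A plain TF-IDF model is also available as whoosh.scoring.TF_IDF.

Scores in Query Results

When you run a query, Whoosh computes a score for every matching document. You can read each document's score from the results.

python

from whoosh.qparser import QueryParser

query = QueryParser("content", ix.schema).parse("first")

with ix.searcher() as searcher:
    results = searcher.search(query)
    for result in results:
        print(f"Title: {result['title']}, Score: {result.score}")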

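Sorting Concepts

Sorting: ordering the query results by the value of a field instead of by relevance. Whoosh can sort by one or more fields, ascending or descending, so users can view results in the order most useful to them.
Sort fields: any indexed field in the document can be used, such as the title or the creation date. For efficient sorting, declare the field with sortable=True in the schema.

Sorting Query Results

Use the sortedby parameter to name the field to sort by. You can pass several fields; they take priority from left to right.

python

with ix.searcher() as searcher:
    results = searcher.search(query, sortedby="title")  # sort by title
    for result in results:
        print(result['title'], result['path'])

python

with ix.searcher() as searcher:
    results = searcher.search(query, sortedby=None)  # default: sort by score
    for result in results:
        print(f"Title: {result['title']}, Score: {result.score}")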

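Custom Scoring

If you want scores to follow your own logic or take extra factors into account, you can supply a custom scoring function. For simple cases, wrap a function in scoring.FunctionWeighting; the function receives searcher, fieldname, text and matcher and returns the score for the current document. For full control, subclass scoring.WeightingModel and implement its scorer() method.

python

from whoosh import scoring

def doubled_weight(searcher, fieldname, text, matcher):
    # Custom scoring logic: as a simple example, twice the term's weight in the current document
    return matcher.weight() * 2

with ix.searcher(weighting=scoring.FunctionWeighting(doubled_weight)) as searcher:
    results = searcher.search(query)
    for result in results:
        print(f"Title: {result['title']}, Custom Score: {result.score}")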

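Handling Updates and Deletions

Keeping the index current depends on handling updates and deletions. Whoosh provides simple methods for both. This section covers the concepts and the details.

Update Concepts

Update: modifying a document that already exists, usually by changing field values. In Whoosh an update replaces the existing document with a new one.
Unique identifier: an update locates the old document through a field declared with unique=True in the schema, for example path=ID(stored=True, unique=True). If the schema has no unique field, update_document() simply adds another document.

Updating a Document

Use the writer's update_document method to update a document. Pass the unique identifier together with the field values for the new version.

python

with ix.writer() as writer:
    # Update the document whose path is "/a"
    writer.update_document(path="/a", title="Updated Title", content="Updated content.")

update_document(path="/a", ...): finds the document to replace through the unique field path.
The new document replaces the old one entirely, so supply every field you want to keep, not only the changed ones.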

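Deletion Concepts

Delete: removing specific documents from the index, usually identified by their unique identifier.
A deletion does not remove data from disk immediately. The document is marked as deleted and is cleaned up later, when the index is optimized.

Deleting a Document

Use the writer's delete_by_term method to delete documents. Specify the field and the value that identify the documents to delete.

python

with ix.writer() as writer:
    # Delete the document whose path is "/b"
    writer.delete_by_term('path', '/b')

delete_by_term('path', '/b'): deletes the documents whose path field has the value '/b'.
Once a document is deleted, later queries no longer return it.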

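Committing Changes

Updates and deletions only take effect once they are committed, which saves all changes to the index. With a writer obtained directly from ix.writer(), call commit() yourself. With a with block, the writer commits automatically on exit, and calling commit() again inside the block raises an error because the writer is already closed.

python

writer = ix.writer()
writer.update_document(path="/a", title="Updated Title", content="Updated content.")
writer.delete_by_term('path', '/b')
writer.commit()  # commit all changes

Batch Updates and Deletions

To update or delete many documents, perform all the operations on one writer and commit them together.

python

with ix.writer() as writer:
    # Batch updates (each call replaces the whole document)
    writer.update_document(path="/a", title="First Updated Title", content="First updated content.")
    writer.update_document(path="/c", title="Second Updated Title", content="Second updated content.")

    # Batch deletions
    writer.delete_by_term('path', '/b')
    writer.delete_by_term('path', '/d')
# All changes are committed together when the with block exits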

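Optimizing the Index

A deletion only marks documents as deleted; the disk space is not released immediately. Call optimize() to optimize the index and remove the deleted documents. You can also pass optimize=True to a writer's commit().

python

ix.optimize()  # optimize the index and purge deleted documents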

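Adjusting Weights

Weight adjustment is an important way to influence the relevance of search results. By giving fields different weights, you control how results are ranked and improve the search experience. This section covers the details.

Weight Concepts

Weight: how strongly a field contributes to the final score. The higher a field's weight, the more a match in that field counts.
Adjusting weights: weights can be tuned per field to suit the application and business requirements, which changes the ranking of results.

Setting Field Weights

When defining the schema you can give each field its own weight through the field_boost parameter. The value is a float, and a larger value means a higher weight.

python

from whoosh.fields import Schema, TEXT, ID

# Define the schema and set field weights
schema = Schema(
    title=TEXT(stored=True, field_boost=3.0),    # higher weight for the title field
    content=TEXT(stored=True, field_boost=1.0),  # lower weight for the content field
    path=ID(stored=True)
)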

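Adjusting Weights at Query Time

Field weights can also be adjusted dynamically when a query runs, usually through a custom weighting class. The example below builds on scoring.FunctionWeighting and scores each term by its weight in the current document, doubled for the title field.

python

from whoosh import scoring

class CustomWeighting(scoring.FunctionWeighting):
    def __init__(self):
        super().__init__(self.score_term)

    @staticmethod
    def score_term(searcher, fieldname, text, matcher):
        base_score = matcher.weight()  # weight of the term in the current document

        # Adjust the score depending on the field
        if fieldname == "title":
            return base_score * 2  # extra weight for matches in the title field
        return base_score

Querying with Adjusted Weights

Pass the custom weighting class to the searcher to influence scoring. The query must search the title field for the title adjustment to have any effect.

python

from whoosh.qparser import MultifieldParser

# Search both fields so that the title adjustment can apply
query = MultifieldParser(["title", "content"], ix.schema).parse("document")
with ix.searcher(weighting=CustomWeighting()) as searcher:
    results = searcher.search(query)
    for result in results:
        print(f"Title: {result['title']}, Score: {result.score}")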

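Combining Weights

Weights can be combined in several ways, for example:

Field combination: score several fields at once, each with its own weight.
Dynamic weights: change the weights depending on the query, for example raising a field's weight when the query contains particular keywords.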

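Tokenization

Tokenization is an essential step in a text search engine: it splits text into the basic units (terms) that are indexed and searched. Whoosh's text analysis is flexible enough to support many languages and scenarios. This section covers the concepts and the details.

Tokenization Concepts

Tokenization: breaking text into individual words or symbols. A search engine tokenizes text to build the index, which makes search efficient.
Token: a single unit produced by tokenization. Tokens are the basis of the index.

How Whoosh Tokenizes

Whoosh ships several built-in analyzers, so you can pick the strategy that fits your data. An analyzer is a tokenizer followed by zero or more filters. By default a TEXT field uses StandardAnalyzer, which applies common processing such as lowercasing and stop-word removal.

Common Analyzers

StandardAnalyzer: the default analyzer; splits on word boundaries (dropping punctuation), lowercases, and removes stop words.
SimpleAnalyzer: splits into words and lowercases only; suits cases that need no further processing.
KeywordAnalyzer: splits on whitespace only (or on commas with commas=True) and leaves the tokens otherwise unchanged; suits specially formatted text such as tag lists.
StemmingAnalyzer: tokenizes and also applies stemming; suits cases where different word forms should match.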

Using a Custom Analyzer

You can build a custom analyzer for specific needs by choosing the tokenization rule and the processing steps. Filters operate on the whole token stream rather than on individual tokens, so a tokenizer and its filters are chained with the | operator.

python

from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter
from whoosh.fields import Schema, TEXT, ID

def CustomAnalyzer():
    # Chain a tokenizer and filters into an analyzer
    return (
        RegexTokenizer()     # split with a regular expression
        | LowercaseFilter()  # convert to lowercase
        | StopFilter()       # remove stop words
    )

# Use the custom analyzer in the schema
schema = Schema(
    title=TEXT(analyzer=CustomAnalyzer(), stored=True),
    content=TEXT(analyzer=CustomAnalyzer(), stored=True),
    path=ID(stored=True)
)

Tokenization Example

The analyzer is applied to field contents automatically. When documents are added to the index, Whoosh processes the text with the analyzer defined for each field.

python

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
import os

# Create the index directory
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Define the schema
schema = Schema(
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    path=ID(stored=True)
)

# Create the index
ix = create_in("indexdir", schema)

# Add documents
writer = ix.writer()
writer.add_document(title="First Document", content="This is the first example.", path="/a")
writer.add_document(title="Second Document", content="Another example is here.", path="/b")
writer.commit()

During this process the contents of the content field are tokenized.

Tokenization at Query Time

At query time Whoosh runs the user's query string through the same analyzer, so that the query terms line up with the tokens in the index.

python

from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("first example")
    results = searcher.search(query)
    for result in results:
        print(result['title'], result['path'])

When the query runs, "first example" is tokenized as well.

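Chinese Search

The main challenge of Chinese search is tokenization, because Chinese text has no explicit word boundaries. Correct word segmentation is therefore the key to effective search. Whoosh can be extended to handle this; this section covers the concepts and the details.

Challenges of Chinese Search

Segmentation: words in a Chinese sentence are not separated by spaces, so the tokenizer must recognize words and phrases itself.
Stop words: very common words (such as "的", "是", "在") usually carry no meaning for search and should be removed.
Synonyms: handling synonyms and near-synonyms is also important for search relevance.

Chinese Tokenizers in Whoosh

Whoosh does not include a Chinese tokenizer, but it can be combined with a third-party segmentation library such as jieba.

Chinese Tokenization with `jieba`

First, install the jieba library:

bash

pip install jieba

Then define a custom tokenizer for Whoosh. A Whoosh tokenizer must accept the standard tokenizer arguments and yield Token objects with their text attribute set. jieba also ships a ready-made analyzer for Whoosh, jieba.analyse.ChineseAnalyzer.

python

import jieba
from whoosh.analysis import Tokenizer, Token
from whoosh.fields import Schema, TEXT, ID

class JiebaTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False, keeporiginal=False,
                 removestops=True, start_pos=0, start_char=0, mode="", **kwargs):
        # Whoosh tokenizers reuse one Token object and update it for each word
        token = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        pos = start_pos
        # jieba.tokenize yields (word, start offset, end offset)
        for word, start, end in jieba.tokenize(value):
            if not word.strip():
                continue  # skip whitespace
            token.original = token.text = word
            token.boost = 1.0
            token.stopped = False
            if positions:
                token.pos = pos
            if chars:
                token.startchar = start_char + start
                token.endchar = start_char + end
            pos += 1
            yield token

# Define the schema
schema = Schema(
    title=TEXT(analyzer=JiebaTokenizer(), stored=True),
    content=TEXT(analyzer=JiebaTokenizer(), stored=True),
    path=ID(stored=True)
)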

Creating the Index

When creating the index, use the tokenizer defined above to process the Chinese content.

python

from whoosh.index import create_in
import os

# Create the index directory
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create the index
ix = create_in("indexdir", schema)

# Add documents
writer = ix.writer()
writer.add_document(title="第一篇文档", content="这是一个中文搜索的示例。", path="/a")
writer.add_document(title="第二篇文档", content="另一篇文档也在这里。", path="/b")
writer.commit()

Querying the Index

At query time Whoosh processes the user's query string with the same tokenizer.

python

from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("中文 搜索")
    results = searcher.search(query)
    for result in results:
        print(f"Title: {result['title']}, Path: {result['path']}")

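Stop-Word Handling

For Chinese search you can define a stop-word list and filter those words out during analysis, by chaining a StopFilter after the jieba tokenizer. Set minsize=1, because StopFilter otherwise drops single-character tokens, which are often meaningful words in Chinese.

python

from whoosh.analysis import StopFilter

stop_words = {"的", "是", "在", "和"}

# Chain the jieba tokenizer defined earlier with a stop-word filter
jieba_analyzer_with_stopwords = JiebaTokenizer() | StopFilter(stoplist=stop_words, minsize=1)

# Define the schema with the new analyzer
schema = Schema(
    title=TEXT(analyzer=jieba_analyzer_with_stopwords, stored=True),
    content=TEXT(analyzer=jieba_analyzer_with_stopwords, stored=True),
    path=ID(stored=True)
)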

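Synonym Handling

A synonym dictionary can improve how many relevant documents a search finds. After tokenizing the query, look up the synonyms of each term and add them to the search terms.

python

synonyms = {
    "搜索": ["查找", "检索"],
}

def expand_synonyms(query_terms):
    # Collect the original terms plus every synonym listed for them
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(synonyms.get(term, []))
    return list(expanded)

# Expand the query terms
query_terms = expand_synonyms(["中文", "搜索"])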

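Statistics and Analysis

Whoosh offers some basic statistics that help you understand the characteristics of the indexed data and of the search results. This section covers them.

Document Count

Concept: the total number of documents in the index, which is useful for judging the size of the index.
Implementation:
Use the doc_count() method, available on both the index and the searcher, to get the number of documents currently in the index. Deleted documents are excluded; doc_count_all() includes them.

python

with ix.searcher() as searcher:
    total_docs = searcher.doc_count()
    print(f"Total documents in the index: {total_docs}")

Field Statistics

Concept: counting the distinct values of a field, or how often each occurs, which helps you understand how the content is distributed.
Implementation:
Iterate over the stored fields of every document with the reader's all_stored_fields() method and tally the values with Python's collections.Counter.

python

from collections import Counter

field_values = Counter()
with ix.searcher() as searcher:
    # Iterate over the stored fields of every non-deleted document
    for fields in searcher.reader().all_stored_fields():
        field_values[fields['title']] += 1

print("Title field statistics:", field_values)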

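Analyzing Query Results

Concept: examining the relevance and distribution of query results, for example the score and field values of each document.
Implementation:
After running a query, iterate over the results and inspect each document's score.

python

from whoosh.qparser import QueryParser

query = QueryParser("content", ix.schema).parse("document")
with ix.searcher() as searcher:
    results = searcher.search(query)
    for result in results:
        print(f"Document number: {result.docnum}, Score: {result.score}, Title: {result['title']}")

Aggregate Statistics

Concept: aggregate analysis over a value, such as computing its average, maximum or minimum.
Implementation:
Whoosh has no built-in aggregation functions, but you can compute aggregates yourself by iterating over the query results.

python

from whoosh.qparser import QueryParser

query = QueryParser("content", ix.schema).parse("document")
with ix.searcher() as searcher:
    # limit=None returns every match rather than only the first 10
    results = searcher.search(query, limit=None)
    scores = [result.score for result in results]

average_score = sum(scores) / len(scores) if scores else 0
print(f"Average score: {average_score}")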

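Data Visualization

Concept: Whoosh has no visualization support of its own, but the statistics can be passed to other tools (such as Matplotlib or Seaborn) for plotting.
Implementation:
Arrange the statistics in a suitable form and use a plotting library to show the distribution.

python

import matplotlib.pyplot as plt

# Assumes the field value counts (field_values) have already been computed
titles = list(field_values.keys())
counts = list(field_values.values())

plt.bar(titles, counts)
plt.xlabel('Title')
plt.ylabel('Number of documents')
plt.title('Number of documents per title')
plt.xticks(rotation=45)
plt.show()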
