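Configuring and Optimizing a Chinese Tokenizer for PostgreSQL Full-Text Search

Introduction

When building a RAG (retrieval-augmented generation) system, retrieval efficiency and accuracy need continuous tuning. Beyond the usual embedding-based vector search, adding full-text search can further improve results. This article, based on PostgreSQL, walks through configuring a Chinese tokenizer for full-text search, creating the indexes, and querying them, and records the problems we hit in a real deployment along with their fixes.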


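I. Background

To improve retrieval quality in our RAG system, we explored a hybrid approach that combines full-text search with vector search. PostgreSQL ships with capable full-text search and supports extensions that add parsers for other languages. For Chinese, we chose the zhparser parsing extension and paired it with the pg_textsearch extension, which provides BM25-ranked full-text indexes.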


II. Environment Setup: Installing the Extensions

Two extensions need to be installed first:

CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS zhparser;
  • pg_textsearch: provides BM25-based full-text search
  • zhparser: a Chinese parser that splits Chinese text into words
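
Before going further it can help to confirm that both extensions are available on the server and installed in the current database. A minimal check using PostgreSQL's standard catalog views (the extension names come from the statements above; nothing else is assumed):

```sql
-- Extensions the server is able to install (files present on disk).
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('pg_textsearch', 'zhparser');

-- Extensions actually installed in the current database.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_textsearch', 'zhparser');
```

Extensions are installed per database, so the second query has to return both rows in every database that will hold a BM25 index.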

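III. Configuring the Chinese Tokenizer

1. Create the text search configuration

Create the Chinese text search configuration in the pg_catalog schema: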

CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);

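2. Add token mappings

Map all of zhparser's token types (the single-letter part-of-speech tags a through z) to the simple dictionary: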

ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese 
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m, 
                n, o, p, q, r, s, t, u, v, w, x, y, z 
WITH simple;
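
The letters in the mapping are zhparser token types. One way to see what each letter stands for is PostgreSQL's built-in ts_token_type function, and the pg_ts_config_map catalog shows how many of them are mapped for a configuration. This is only an inspection sketch; the descriptions returned depend on the installed zhparser version:

```sql
-- List the token types that the zhparser parser can emit.
SELECT tokid, alias, description
FROM ts_token_type('zhparser');

-- Count how many distinct token types are mapped for pg_catalog.chinese.
SELECT count(DISTINCT m.maptokentype) AS mapped_token_types
FROM pg_ts_config_map m
WHERE m.mapcfg = 'pg_catalog.chinese'::regconfig;
```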

3. Verify the configuration

List all text search configurations and confirm that the Chinese one uses the zhparser parser:

SELECT                                                  
    n.nspname as schema_name, 
    c.cfgname as config_name, 
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;

Result

scorpio=# SELECT                                                  
    n.nspname as schema_name, 
    c.cfgname as config_name, 
    p.prsname as parser_name
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
JOIN pg_ts_parser p ON p.oid = c.cfgparser;
 schema_name | config_name | parser_name 
-------------+-------------+-------------
 pg_catalog  | simple      | default
 pg_catalog  | arabic      | default
 pg_catalog  | armenian    | default
 pg_catalog  | basque      | default
 pg_catalog  | catalan     | default
 pg_catalog  | danish      | default
 pg_catalog  | dutch       | default
 pg_catalog  | english     | default
 pg_catalog  | finnish     | default
 pg_catalog  | french      | default
 pg_catalog  | german      | default
 pg_catalog  | greek       | default
 pg_catalog  | hindi       | default
 pg_catalog  | hungarian   | default
 pg_catalog  | indonesian  | default
 pg_catalog  | irish       | default
 pg_catalog  | italian     | default
 pg_catalog  | lithuanian  | default
 pg_catalog  | nepali      | default
 pg_catalog  | norwegian   | default
 pg_catalog  | portuguese  | default
 pg_catalog  | romanian    | default
 pg_catalog  | russian     | default
 pg_catalog  | serbian     | default
 pg_catalog  | spanish     | default
 pg_catalog  | swedish     | default
 pg_catalog  | tamil       | default
 pg_catalog  | turkish     | default
 pg_catalog  | yiddish     | default
 pg_catalog  | chinese     | zhparser
(30 rows)

The output should include the chinese configuration with zhparser as its parser.
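
The catalog query only proves that the configuration exists. To see how it actually segments text, the built-in ts_debug and to_tsvector functions can be run against a sample string. The sample below is the search phrase used later in this article, and the exact segmentation depends on zhparser's dictionary and settings:

```sql
-- Show each token, its zhparser token type, and the dictionary that handled it.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('chinese', '什么是RAG');

-- The resulting tsvector, as it would be computed for a stored document.
SELECT to_tsvector('chinese', '什么是RAG');
```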


IV. Creating the Full-Text Indexes

1. Create a BM25 index with the Chinese configuration

CREATE INDEX idx_chunks_content_bm25_zh 
ON alpha.chunks 
USING bm25 (content) 
WITH (text_config = 'chinese');

Result

scorpio=# CREATE INDEX idx_chunks_content_bm25_zh ON alpha.chunks 
USING bm25 (content) 
WITH (text_config = 'chinese');
NOTICE:  BM25 index build started for relation idx_chunks_content_bm25_zh
NOTICE:  Using text search configuration: chinese
NOTICE:  Using index options: k1=1.20, b=0.75
NOTICE:  BM25 index build completed: 64 documents, avg_length=194.86, text_config='chinese' (k1=1.20, b=0.75)
CREATE INDEX

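The build emits NOTICE messages that report the text search configuration in use, the number of documents, the average document length, and the BM25 parameters (k1=1.20, b=0.75).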
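
For context on the two parameters in the log, the textbook Okapi BM25 score of a document D for a query Q is shown below. This is the standard formulation, not something taken from the pg_textsearch source, so the extension's exact IDF variant and normalization may differ:

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q, D) is the frequency of term q in D, |D| is the length of D, and avgdl is the average document length (reported as avg_length in the log above). k1 controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly long documents are penalized.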

2. Also create an index with the English configuration (optional, for comparison)

CREATE INDEX idx_chunks_content_bm25_en 
ON alpha.chunks 
USING bm25(content) 
WITH (text_config='english');

english is one of PostgreSQL's built-in text search configurations, so it needs no extra setup and the index builds without further steps.
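
With both indexes built, the standard pg_indexes view gives a quick way to confirm what exists on the table. The schema and table names are the ones used above:

```sql
-- Show every index on alpha.chunks together with its definition,
-- which includes the access method and any WITH (...) options.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'alpha'
  AND tablename  = 'chunks'
ORDER BY indexname;
```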


V. Verifying Full-Text Search

An example Chinese full-text query:

SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;

Result

scorpio=# SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
 id  |                                       left                                        |   score    
-----+-----------------------------------------------------------------------------------+------------
 216 | # RAG系统介绍                                                                    +| 0.51396555
     |                                                                                  +| 
     | ## 什么是RAG?                                                                   +| 
     |                                                                                  +| 
     | RAG(Retrieval-Augmented Generation,检索增强生成)是一种结合了信息检索和文本生成 | 
(1 row)

scorpio=# 

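The query returns the chunks that contain the phrase "什么是RAG" ("What is RAG"), ordered by relevance. Note that it matches with the @@ operator and ranks with ts_rank, both part of PostgreSQL's built-in full-text search, rather than with BM25.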
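
To rank with BM25 through the bm25 index instead, pg_textsearch documents a distance-style operator, <@>, together with a to_bm25query helper that names the index to use. The sketch below follows that documented usage as we understand it. The operator, the helper's argument order, and the sign convention of the score (lower meaning more relevant) are assumptions to verify against the installed pg_textsearch version; none of them were exercised in the session recorded here:

```sql
-- Assumption: pg_textsearch's <@> operator and to_bm25query(query, index_name).
-- Naming the index matters here because content carries two bm25 indexes
-- (chinese and english). The index name may need to be schema-qualified
-- if alpha is not on the search_path.
SELECT
    id,
    LEFT(content, 80) AS preview,
    content <@> to_bm25query('什么是RAG', 'idx_chunks_content_bm25_zh') AS score
FROM alpha.chunks
ORDER BY score      -- ascending: assumed to put the best match first
LIMIT 5;
```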

EXPLAIN shows the execution plan of the ts_rank query and whether it uses an index scan:

scorpio=# EXPLAIN (ANALYZE) 
SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@ 
      phraseto_tsquery('chinese', '什么是RAG')
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=33.06..33.07 rows=1 width=44) (actual time=13.349..13.351 rows=1 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..33.05 rows=1 width=44) (actual time=13.211..13.315 rows=1 loops=1)
         Filter: (to_tsvector('chinese'::regconfig, content) @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 63
 Planning Time: 0.482 ms
 Execution Time: 13.391 ms
(8 rows)

-- Second attempt, intended to use the bm25 index: match on content directly

scorpio=# EXPLAIN (ANALYZE)
SELECT 
    id,
    LEFT(content, 80),
    ts_rank(to_tsvector('chinese', content), 
            phraseto_tsquery('chinese', '什么是RAG')) AS score
FROM alpha.chunks
WHERE content @@ phraseto_tsquery('chinese', '什么是RAG')  -- match on content directly
ORDER BY score DESC;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=32.91..32.91 rows=1 width=44) (actual time=13.940..13.941 rows=0 loops=1)
   Sort Key: (ts_rank(to_tsvector('chinese'::regconfig, content), '''什么'' <-> ''是'' <-> ''rag'''::tsquery)) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on chunks  (cost=0.00..32.90 rows=1 width=44) (actual time=13.723..13.723 rows=0 loops=1)
         Filter: (content @@ '''什么'' <-> ''是'' <-> ''rag'''::tsquery)
         Rows Removed by Filter: 64
 Planning Time: 65.847 ms
 Execution Time: 14.656 ms
(8 rows)

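Both plans use a sequential scan. With only 64 rows a sequential scan is cheap, but the more important reason is that the bm25 index does not apply to these predicates: @@ is PostgreSQL's built-in tsvector match operator, which is accelerated by a GIN or GiST index on to_tsvector('chinese', content), whereas the bm25 index is reached through pg_textsearch's own ranking operator (<@> in the pg_textsearch documentation). The second query also returns 0 rows (Rows Removed by Filter: 64), presumably because content @@ tsquery is equivalent to to_tsvector(content) @@ tsquery, which tokenizes content with default_text_search_config rather than with the chinese configuration.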
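
If the goal is to have the @@ predicate itself use an index, the usual PostgreSQL approach is a GIN expression index on the same to_tsvector expression that the query uses. The index name below is hypothetical, and on a 64-row table the planner may still prefer a sequential scan, so enable_seqscan is turned off for the session purely to check that the index is usable:

```sql
-- Hypothetical index name; the expression must match the query's
-- to_tsvector('chinese', content) exactly for the planner to consider it.
CREATE INDEX IF NOT EXISTS idx_chunks_content_gin_zh
ON alpha.chunks
USING gin (to_tsvector('chinese', content));

-- For testing only: discourage sequential scans in this session.
SET enable_seqscan = off;

EXPLAIN
SELECT id
FROM alpha.chunks
WHERE to_tsvector('chinese', content) @@
      phraseto_tsquery('chinese', '什么是RAG');

RESET enable_seqscan;
```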


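VI. Key Issues and Solutions

🔧 The text search configuration must live in pg_catalog

In our setup, creating the text search configuration in any other schema could cause index creation to fail. The TEXT SEARCH CONFIGURATION had to be created in the pg_catalog schema; otherwise the pg_textsearch extension did not recognize it.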
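
To check which schema a configuration ended up in, and which schemas the session searches for unqualified names, the catalog can be queried directly. pg_catalog is always searched implicitly, which may be why an unqualified name such as 'chinese' resolves reliably there; that explanation is our guess rather than something confirmed from the pg_textsearch source:

```sql
-- Where does every configuration named like 'chinese%' live?
SELECT cfgnamespace::regnamespace AS schema_name,
       cfgname                    AS config_name
FROM pg_ts_config
WHERE cfgname LIKE 'chinese%';

-- Schemas searched for unqualified names (pg_catalog is implicit).
SHOW search_path;
```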

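🔧 Dropping a misconfigured configuration

If a configuration was created incorrectly (for example, a chinese_zh configuration in the public schema), it can be cleaned up with: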

DROP TEXT SEARCH CONFIGURATION IF EXISTS chinese_zh CASCADE;
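
CASCADE also drops objects that depend on the configuration, for example an expression index that references it. A more cautious sketch is to schema-qualify the name and try the drop without CASCADE first, so that PostgreSQL reports any dependents before anything is removed (public.chinese_zh is the example name from the text):

```sql
-- Without CASCADE the statement fails if anything still depends on the
-- configuration, and the error lists the dependent objects.
DROP TEXT SEARCH CONFIGURATION IF EXISTS public.chinese_zh;

-- Only after reviewing the dependents:
-- DROP TEXT SEARCH CONFIGURATION IF EXISTS public.chinese_zh CASCADE;
```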

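🔧 Where the configuration lives

We compared the Chinese configuration in two databases on the same PostgreSQL instance.

  • The Chinese configuration in database scorpio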

    scorpio=# \dF
    List of text search configurations
    Schema | Name | Description
    ------------+------------+---------------------------------------
    pg_catalog | arabic | configuration for arabic language
    pg_catalog | armenian | configuration for armenian language
    pg_catalog | basque | configuration for basque language
    pg_catalog | catalan | configuration for catalan language
    pg_catalog | chinese |
    pg_catalog | danish | configuration for danish language
    pg_catalog | dutch | configuration for dutch language
    pg_catalog | english | configuration for english language
    pg_catalog | finnish | configuration for finnish language
    pg_catalog | french | configuration for french language
    pg_catalog | german | configuration for german language
    pg_catalog | greek | configuration for greek language
    pg_catalog | hindi | configuration for hindi language
    pg_catalog | hungarian | configuration for hungarian language
    pg_catalog | indonesian | configuration for indonesian language
    pg_catalog | irish | configuration for irish language
    pg_catalog | italian | configuration for italian language
    pg_catalog | lithuanian | configuration for lithuanian language
    pg_catalog | nepali | configuration for nepali language
    pg_catalog | norwegian | configuration for norwegian language
    pg_catalog | portuguese | configuration for portuguese language
    pg_catalog | romanian | configuration for romanian language
    pg_catalog | russian | configuration for russian language
    pg_catalog | serbian | configuration for serbian language
    pg_catalog | simple | simple configuration
    pg_catalog | spanish | configuration for spanish language
    pg_catalog | swedish | configuration for swedish language
    pg_catalog | tamil | configuration for tamil language
    pg_catalog | turkish | configuration for turkish language
    pg_catalog | yiddish | configuration for yiddish language
    (30 rows)

    scorpio=# \dF+ chinese
    Text search configuration "pg_catalog.chinese"
    Parser: "public.zhparser"
    Token | Dictionaries
    -------+--------------
    a | simple
    b | simple
    c | simple
    d | simple
    e | simple
    f | simple
    g | simple
    h | simple
    i | simple
    j | simple
    k | simple
    l | simple
    m | simple
    n | simple
    o | simple
    p | simple
    q | simple
    r | simple
    s | simple
    t | simple
    u | simple
    v | simple
    w | simple
    x | simple
    y | simple
    z | simple

  • The Chinese configuration in database hbu

    scorpio=# \c hbu
    You are now connected to database "hbu" as user "hbu".
    hbu=# \dF
    List of text search configurations
    Schema | Name | Description
    ------------+------------+---------------------------------------
    pg_catalog | arabic | configuration for arabic language
    pg_catalog | armenian | configuration for armenian language
    pg_catalog | basque | configuration for basque language
    pg_catalog | catalan | configuration for catalan language
    pg_catalog | danish | configuration for danish language
    pg_catalog | dutch | configuration for dutch language
    pg_catalog | english | configuration for english language
    pg_catalog | finnish | configuration for finnish language
    pg_catalog | french | configuration for french language
    pg_catalog | german | configuration for german language
    pg_catalog | greek | configuration for greek language
    pg_catalog | hindi | configuration for hindi language
    pg_catalog | hungarian | configuration for hungarian language
    pg_catalog | indonesian | configuration for indonesian language
    pg_catalog | irish | configuration for irish language
    pg_catalog | italian | configuration for italian language
    pg_catalog | lithuanian | configuration for lithuanian language
    pg_catalog | nepali | configuration for nepali language
    pg_catalog | norwegian | configuration for norwegian language
    pg_catalog | portuguese | configuration for portuguese language
    pg_catalog | romanian | configuration for romanian language
    pg_catalog | russian | configuration for russian language
    pg_catalog | serbian | configuration for serbian language
    pg_catalog | simple | simple configuration
    pg_catalog | spanish | configuration for spanish language
    pg_catalog | swedish | configuration for swedish language
    pg_catalog | tamil | configuration for tamil language
    pg_catalog | turkish | configuration for turkish language
    pg_catalog | yiddish | configuration for yiddish language
    public | chinese |
    (30 rows)

    hbu=# \dF+ chinese
    Text search configuration "public.chinese"
    Parser: "public.zhparser"
    Token | Dictionaries
    -------+--------------
    a | simple
    e | simple
    i | simple
    j | simple
    l | simple
    m | simple
    n | simple
    t | simple
    v | simple
    x | simple

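A text search configuration such as chinese belongs to a single database (a database in the PostgreSQL sense), so it cannot be used from another database on the same instance and has to be created again there. In hbu it was created separately, in the public schema and with a narrower set of token mappings. Within one database, the configuration can be shared by objects in different schemas.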
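
Given that, making hbu match scorpio means repeating the setup while connected to hbu. The following is a sketch under two assumptions: the session has sufficient privileges to create objects in pg_catalog (typically a superuser), and nothing still depends on public.chinese:

```sql
-- Run while connected to the hbu database.
CREATE EXTENSION IF NOT EXISTS zhparser;

-- Remove the schema-local configuration shown in the \dF output above.
DROP TEXT SEARCH CONFIGURATION IF EXISTS public.chinese;

-- Recreate it in pg_catalog with the same mappings used in scorpio.
CREATE TEXT SEARCH CONFIGURATION pg_catalog.chinese (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION pg_catalog.chinese
ADD MAPPING FOR a, b, c, d, e, f, g, h, i, j, k, l, m,
                n, o, p, q, r, s, t, u, v, w, x, y, z
WITH simple;
```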


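VII. Summary

With this setup we got Chinese full-text search working in PostgreSQL with zhparser, and built a BM25 index on top of it with pg_textsearch. The main takeaways:

  1. The configuration must live in the system schema: the chinese text search configuration has to be created in pg_catalog, otherwise index creation fails.
  2. Chinese and English configurations can coexist: the same column can carry full-text indexes for different languages, which suits multilingual content.
  3. BM25 has tunable parameters: the k1 and b parameters can be adjusted when the index is built and tuned to the document collection.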
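
As an illustration of the third point, the k1 and b values reported in the build log are index options, so a differently tuned index could be declared roughly as follows. The index name and the values 1.5 and 0.8 are arbitrary examples, and passing k1 and b in the WITH clause is our reading of the "Using index options" log line and the pg_textsearch documentation rather than something exercised in this article:

```sql
-- Hypothetical index with non-default BM25 parameters.
-- k1: term-frequency saturation; b: document-length normalization.
CREATE INDEX idx_chunks_content_bm25_zh_tuned
ON alpha.chunks
USING bm25 (content)
WITH (text_config = 'chinese', k1 = 1.5, b = 0.8);
```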

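This approach gives the RAG system stable, efficient full-text search and is particularly suited to precise recall over Chinese documents.


This article is based on an actual configuration session on PostgreSQL 17.7 with the pg_textsearch and zhparser extensions. In a real deployment, index parameters and query structure should be tuned further to the data volume and query patterns.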
