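Linux C++ Project Development Log: A Boost Search Engine Based on Forward and Inverted Indexes (Part 3: Building the Index Module with the cppjieba Library)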

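A Boost search engine based on forward and inverted indexes

- The cppjieba segmentation library
- Building the index module
  - Overview
  - Core data structures
  - Index construction flow
  - Query interfaces
  - Technical characteristics
    - 1. Weighting strategy
    - 2. Text preprocessing
    - 3. Memory optimization
  - Complete code
- Other updated modules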

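The cppjieba segmentation library

Introduction to cppjieba

cppjieba is the C++ implementation of the jieba Chinese word segmentation library. It keeps the core features and segmentation quality of the Python version while taking advantage of C++ performance, which makes it a better fit for performance-sensitive production environments and for C++ projects that need Chinese word segmentation.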

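Features and advantages of cppjieba

Full feature set:

- Supports three segmentation modes (precise, full, and search-engine), matching the Python version of jieba.
- Provides part-of-speech tagging, so each token can be labeled as a noun, verb, adjective, and so on.
- Supports user-defined dictionaries, which makes it easy to add domain terms and new words.
- Supports segmentation of Traditional Chinese text.

Performance:

- Implemented in C++, so it generally segments faster and uses less memory than the Python version.
- Uses a trie-based dictionary structure to speed up dictionary lookup and matching.

Ease of use:

- Offers a small API that is easy to integrate into a C++ project.
- Builds on multiple platforms, including Windows, Linux, and macOS.

Typical use cases:

- C++ projects that need efficient Chinese word segmentation, such as search engines, text mining, and natural language processing systems.
- Processing large volumes of Chinese text.

To use it, you first load the dictionaries (main dictionary, stop-word list, and so on) and then call the segmentation interface you need. The core segmentation logic follows the Python version of jieba, so the results are generally consistent with it.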

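Installing cppjieba

cppjieba is an open-source project that is normally obtained as source code and integrated directly into a C++ project. The steps below cover downloading, integrating, and using it.

1. Download the source

The source is hosted on GitHub and can be obtained in either of the following ways.

Method 1: clone the repository with git (recommended)

Open a terminal (Linux/macOS) or a command prompt (Windows) and run: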

git clone https://github.com/yanyiwu/cppjieba.git

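This downloads the source into a cppjieba folder in the current directory. If deps/limonp is empty after cloning, that version tracks the limonp dependency as a git submodule; running git submodule update --init inside the repository fetches it.

Method 2: download the source archive manually

Open the GitHub repository at https://github.com/yanyiwu/cppjieba

Click the Code button near the top right, choose Download ZIP, and extract the archive to a local directory.

2. Directory layout

The downloaded cppjieba directory contains the core code and the resources it needs. The main layout is roughly as follows (details vary between versions):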

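cppjieba/
├── include/
│   └── cppjieba/         # header-only implementation
│       ├── Jieba.hpp     # main segmenter class
│       ├── DictTrie.hpp  # dictionary trie
│       └── ...
├── deps/
│   └── limonp/           # utility library that cppjieba depends on
├── dict/                 # bundled dictionaries (required at run time)
│   ├── jieba.dict.utf8   # main dictionary
│   ├── hmm_model.utf8    # hidden Markov model data
│   ├── user.dict.utf8    # user dictionary (editable)
│   ├── idf.utf8          # IDF data used for keyword extraction
│   └── stop_words.utf8   # stop-word list
├── test/                 # tests (and, in some versions, demo.cpp)
└── CMakeLists.txt        # CMake configuration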

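3. Integrate into a C++ project

cppjieba does not need to be installed into a system directory. Instead, you add its headers and dictionaries to your own project as follows.

Step 1: copy the required files into your project

Copy the include/ directory (and the limonp headers it depends on) into your project's source tree, or point the compiler at them with include-path options. cppjieba is header-only, so there are no library source files to compile separately.

Copy the dict/ directory to the directory your program runs from. The path is configurable, as long as the program can find the dictionaries.

Step 2: use it in code

Include the header in your C++ code and initialize the segmenter, for example: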
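#include "cppjieba/Jieba.hpp"
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main() {
    // Dictionary paths (adjust to where dict/ actually lives)
    const string DICT_PATH = "./dict/jieba.dict.utf8";
    const string HMM_PATH = "./dict/hmm_model.utf8";
    const string USER_DICT_PATH = "./dict/user.dict.utf8";
    const string IDF_PATH = "./dict/idf.utf8";
    const string STOP_WORD_PATH = "./dict/stop_words.utf8";

    // Initialize the segmenter (same argument order as the call in util.hpp later in this post)
    cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);

    // Text to segment
    string text = "我爱自然语言处理";
    vector<string> words;

    // Cut() performs precise-mode segmentation; the third argument toggles the
    // HMM used for out-of-vocabulary words (false disables it)
    jieba.Cut(text, words, false);

    // Print the tokens separated by spaces
    for (const auto& word : words) {
        cout << word << " ";
    }
    cout << endl;
    return 0;
}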

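4. Compile the project

Because cppjieba is header-only, only the include paths need to be passed to the compiler. With g++:

# Assumes main.cpp sits next to cppjieba's include/ and deps/ directories
g++ -std=c++11 main.cpp -I./include -I./deps/limonp/include -o jieba_demo

-I./include: the cppjieba header directory.
-I./deps/limonp/include: the limonp headers that cppjieba includes (this path assumes the repository layout; adjust it if limonp lives elsewhere).
-std=c++11: cppjieba requires C++11 or later.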

5. Run the test

After a successful build, run the resulting executable:

./jieba_demo  # Linux/macOS

# or
jieba_demo.exe  # Windows

If the segmentation result is printed, the integration works.

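Notes

- Dictionary paths: the program must be able to read the dictionary files under dict/. A wrong path causes segmentation to fail.
- Encoding: cppjieba works on UTF-8 text, so make sure the input is UTF-8 encoded.
- User dictionary: new words can be added by editing user.dict.utf8. Each line has the form "word frequency part-of-speech", separated by spaces, for example: cppjieba 1000 n
- Cross-platform: Linux, macOS, and Windows are supported. On Windows you can build with MinGW or MSVC.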
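To make the user-dictionary line format concrete, here is a minimal sketch that splits one line into its three fields. `UserDictEntry` and `ParseUserDictLine` are hypothetical names used only for illustration; they are not part of cppjieba, which does its own parsing when it loads the dictionary. Treating the tag as optional is an assumption of this sketch.

```cpp
#include <sstream>
#include <string>

// Hypothetical record for one user-dictionary line: "word frequency tag".
struct UserDictEntry {
    std::string word;
    int freq = 0;
    std::string tag;
};

// Parses a whitespace-separated line. Returns false if the word or the
// frequency is missing; the tag is treated as optional in this sketch.
bool ParseUserDictLine(const std::string& line, UserDictEntry& out) {
    std::istringstream iss(line);
    UserDictEntry e;
    if (!(iss >> e.word >> e.freq)) return false;
    iss >> e.tag;  // stays empty if the line has no third field
    out = e;
    return true;
}
```

Calling `ParseUserDictLine("cppjieba 1000 n", entry)` fills `entry` with the word `cppjieba`, the frequency `1000`, and the tag `n`.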

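With these steps you can use cppjieba for Chinese word segmentation in your own C++ project. For more advanced features such as part-of-speech tagging, see the example code that ships with the repository or ask an AI assistant.

Supplement:

Besides downloading the source directly from GitHub, there are a few other ways to obtain and integrate cppjieba, suited to different situations.

1. Use a mirror repository hosted in China (for slow GitHub access)

If GitHub is slow to reach, you can use a mirror on a code hosting platform in China, such as the synchronized repository on Gitee:

Method: clone the Gitee mirror

git clone https://gitee.com/mirrors/cppjieba.git

The mirror is synchronized with the original GitHub repository, and the integration steps are the same as for the GitHub source.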

2. Install through a C++ package manager (recommended for modern projects)

If your project already uses a package manager, it can fetch and integrate cppjieba for you.

(1) vcpkg (Microsoft's open-source C++ package manager)

Install vcpkg if it is not installed yet:

# Clone the vcpkg repository
git clone https://github.com/microsoft/vcpkg

# Bootstrap it (PowerShell on Windows, bash on Linux/macOS)
.\vcpkg\bootstrap-vcpkg.bat  # Windows
./vcpkg/bootstrap-vcpkg.sh   # Linux/macOS

Install cppjieba:

# Install for an explicit target triplet
vcpkg install cppjieba:x64-windows  # Windows
vcpkg install cppjieba:x64-linux    # Linux
vcpkg install cppjieba:x64-osx      # macOS

# Or install for the default triplet of the current platform
vcpkg install cppjieba

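Use it in your project: after installation, pass the vcpkg toolchain file when configuring CMake. vcpkg prints the recommended CMake usage for each port when the install finishes, and that output is authoritative. The snippet below assumes the port exports a CMake package named cppjieba with a cppjieba::cppjieba target. If your version of the port does not, locate the headers with find_path and add them with target_include_directories instead, since the library is header-only.

cmake_minimum_required(VERSION 3.10)
project(my_project)
find_package(cppjieba REQUIRED)
add_executable(my_app main.cpp)
target_link_libraries(my_app cppjieba::cppjieba)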

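Building the index module

Overview

The module combines a forward index with an inverted index to provide efficient document retrieval.

Core data structures

1. Forward index element (ForwordElem)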

typedef struct ForwordElem {
    std::string title_;      // document title
    std::string content_;    // document content
    std::string url_;        // document URL
    size_t doc_id_ = 0;      // unique document ID
    
    void Set(std::string title, std::string content, std::string url, size_t doc_id);
} Forword_t;

Purpose: stores the original document data so that a complete document can be located quickly from its document ID.
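As a rough illustration of the idea, the sketch below keeps documents in a vector and uses the position in that vector as the document ID. `Doc`, `AddDoc`, and `FindDoc` are simplified, hypothetical stand-ins rather than the types used in this project.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for Forword_t (illustrative only).
struct Doc {
    std::string title;
    std::string url;
    std::size_t doc_id = 0;
};

// Appends a document and returns its ID, which equals its position in the vector.
std::size_t AddDoc(std::vector<Doc>& index, std::string title, std::string url) {
    Doc d;
    d.title = std::move(title);
    d.url = std::move(url);
    d.doc_id = index.size();
    index.push_back(std::move(d));
    return index.back().doc_id;
}

// Looks up a document by ID; returns nullptr when the ID is out of range.
const Doc* FindDoc(const std::vector<Doc>& index, std::size_t id) {
    return id < index.size() ? &index[id] : nullptr;
}
```

Because IDs are vector positions, the first document gets ID 0 and a lookup is a single bounds check followed by an array access.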

2. Inverted index element (InvertedElem)

typedef struct InvertedElem {
    size_t doc_id_ = 0;      // document ID
    std::string word_;       // keyword
    size_t weight_ = 0;      // weight
    
    void Set(size_t doc_id, std::string word, size_t weight);
} Inverted_t;

typedef std::vector<Inverted_t> InvertedList;  // posting list (inverted list)

Purpose: maps each keyword to the documents that contain it, which enables fast keyword lookup.
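The keyword-to-documents mapping can be sketched with a hash map from word to posting list. The names `Posting`, `ToyInvertedIndex`, `AddPosting`, and `Lookup` are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified posting: which document, and how strongly the word matches it.
struct Posting {
    std::size_t doc_id = 0;
    std::size_t weight = 0;
};

using PostingList = std::vector<Posting>;
using ToyInvertedIndex = std::unordered_map<std::string, PostingList>;

// Records that `word` occurs in document `doc_id` with the given weight.
void AddPosting(ToyInvertedIndex& index, const std::string& word,
                std::size_t doc_id, std::size_t weight) {
    Posting p;
    p.doc_id = doc_id;
    p.weight = weight;
    index[word].push_back(p);
}

// Returns the posting list for `word`, or nullptr if the word was never indexed.
const PostingList* Lookup(const ToyInvertedIndex& index, const std::string& word) {
    auto it = index.find(word);
    return it == index.end() ? nullptr : &it->second;
}
```

A word that appears in several documents ends up with one posting per document, all reachable through a single hash lookup.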

Index construction flow

Overall structure

class Inedx : public NonCopyable {
private:
    std::vector<Forword_t> Forword_Index_;                        // forward index
    std::unordered_map<std::string, InvertedList> Inverted_Index_; // inverted index
};

Step 1: index construction entry point (BulidIndex)

bool BulidIndex() {
    // 1. Open the target file
    std::ifstream in(Tragetfile, std::ios::binary | std::ios::in);
    if(!in.is_open()) return false;
    
    // 2. Read the document data line by line (one document per line)
    std::string singlefile;
    while (std::getline(in, singlefile)) {
        // 3. Build the forward index entry
        bool b = BuildForwordIndex(singlefile);
        if(!b) continue;
        
        // 4. Build the inverted index from the forward entry just added
        b = BuildInvertedIndex(Forword_Index_.size() - 1);
        if(!b) continue;
    }
    in.close();
    return true;
}

Step 2: build the forward index (BuildForwordIndex)

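bool BuildForwordIndex(std::string& singlefile) {
    // 1. Split the line into its fields (CutDoc splits on the '\3' separator;
    //    no word segmentation happens at this stage)
    std::vector<std::string> sepfile;
    bool b = ns_util::JiebaUtile::CutDoc(singlefile, sepfile);
    if(!b) return false;
    
    // 2. Validate the split result: title, content, and URL
    if(sepfile.size() != 3) return false;
    
    // 3. Create the forward index entry
    Forword_t ft;
    ft.Set(std::move(sepfile[0]),    // title
           std::move(sepfile[1]),    // content
           std::move(sepfile[2]),    // URL
           Forword_Index_.size());   // document ID (current size of the index)
    
    // 4. Append it to the forward index
    Forword_Index_.push_back(std::move(ft));
    return true;
}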
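Each line of the target file is expected to hold a title, the content, and a URL separated by the `\3` character (`Output_sep` in common.h). The project uses `boost::split` for this step; the sketch below shows the same splitting and validation with the standard library only. `SplitRecord` and `IsValidRecord` are hypothetical helper names.

```cpp
#include <string>
#include <vector>

// Splits `line` on the single-character separator `sep`, keeping empty fields
// (which is also what boost::split does with token compression off).
std::vector<std::string> SplitRecord(const std::string& line, char sep = '\3') {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// A record is usable only if it has exactly three fields: title, content, URL.
bool IsValidRecord(const std::vector<std::string>& fields) {
    return fields.size() == 3;
}
```

A line without any separator yields a single field and is rejected, which corresponds to the `sepfile.size() != 3` check above.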

Step 3: build the inverted index (BuildInvertedIndex)

bool BuildInvertedIndex(size_t findex) {
    // 1. Get the forward index entry by reference (avoids copying the whole document)
    Forword_t& ft = Forword_Index_[findex];
    
    // 2. Per-word occurrence counters
    std::unordered_map<std::string, DocCount_t> map_s;
    
    // 3. Segment the title and count word occurrences
    std::vector<std::string> titlesegmentation;
    ns_util::JiebaUtile::CutPhrase(ft.title_, titlesegmentation);
    for(auto& s : titlesegmentation) {
        boost::to_lower(s);  // normalize to lowercase
        map_s[s].title_cnt_++;
    }
    
    // 4. Segment the content and count word occurrences
    std::vector<std::string> contentsegmentation;
    ns_util::JiebaUtile::CutPhrase(ft.content_, contentsegmentation);
    for(auto& s : contentsegmentation) {
        boost::to_lower(s);  // normalize to lowercase
        map_s[s].content_cnt_++;
    }
    
    // 5. Compute weights and create the inverted entries
    const int X = 10;  // title weight factor
    const int Y = 1;   // content weight factor
    
    for(auto& p : map_s) {
        Inverted_t it;
        // weight = title count × 10 + content count × 1
        it.Set(findex, p.first, 
               p.second.title_cnt_ * X + p.second.content_cnt_ * Y);
        
        // Append to the posting list for this word
        InvertedList& list = Inverted_Index_[p.first];
        list.push_back(std::move(it));
    }
    return true;
}

Query interfaces

1. Forward lookup (by document ID)

Forword_t* QueryById(size_t id) {
    // Document IDs start at 0 and size_t is never negative,
    // so only the upper bound needs checking
    if(id >= Forword_Index_.size()) 
        return nullptr;
    return &Forword_Index_[id];
}

2. Inverted lookup (by keyword)

InvertedList* QueryByWord(std::string word) {
    auto it = Inverted_Index_.find(word);
    if(it == Inverted_Index_.end()) 
        return nullptr;
    return &it->second;
}
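The posting list comes back in insertion order, so a caller that wants the best matches first would typically sort it by weight. This post does not show that step; the sketch below is one possible way to do it, using a hypothetical `Hit` struct in place of `Inverted_t` and an arbitrary tie-break on document ID.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one posting returned by a keyword lookup.
struct Hit {
    std::size_t doc_id = 0;
    std::size_t weight = 0;
};

// Returns a copy of `hits` ordered by descending weight;
// equal weights are ordered by ascending doc_id.
std::vector<Hit> RankByWeight(std::vector<Hit> hits) {
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id;
    });
    return hits;
}
```

Sorting a copy leaves the index's own posting list untouched, so the stored order does not depend on which queries have run.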

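Technical characteristics

1. Weighting strategy

Title weight: term frequency × 10 (higher priority)

Content weight: term frequency × 1

This reflects that a keyword in the title matters more than the same keyword in the content.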
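A small worked example of this formula, written as a standalone sketch: `ComputeWeights` is a hypothetical helper that takes already-segmented title and content words and applies the ×10 and ×1 factors described above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Per-word occurrence counters, mirroring the idea of DocCount_t.
struct Count {
    std::size_t title_cnt = 0;
    std::size_t content_cnt = 0;
};

// Computes weight = title_cnt * 10 + content_cnt * 1 for every distinct word.
std::unordered_map<std::string, std::size_t> ComputeWeights(
        const std::vector<std::string>& title_words,
        const std::vector<std::string>& content_words) {
    const std::size_t kTitleFactor = 10;
    const std::size_t kContentFactor = 1;

    std::unordered_map<std::string, Count> counts;
    for (const auto& w : title_words) counts[w].title_cnt++;
    for (const auto& w : content_words) counts[w].content_cnt++;

    std::unordered_map<std::string, std::size_t> weights;
    for (const auto& kv : counts) {
        weights[kv.first] = kv.second.title_cnt * kTitleFactor
                          + kv.second.content_cnt * kContentFactor;
    }
    return weights;
}
```

For a title segmented into `split`, `string` and a content segmented into `split`, `a`, `string`, `split`, the word `split` gets 1 × 10 + 2 × 1 = 12, `string` gets 11, and `a` gets 1.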

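2. Text preprocessing

Chinese text is segmented with jieba.

All tokens are converted to lowercase so that matching is case-insensitive.

Mixed Chinese and English text is supported.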
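The project relies on `boost::to_lower` for case normalization. As a standard-library-only illustration of why this is safe for mixed Chinese and English text, the hypothetical `ToLowerAscii` below changes only the bytes `A` to `Z`. Every byte of a multi-byte UTF-8 sequence has its high bit set, so Chinese characters pass through untouched.

```cpp
#include <string>

// Lowercases ASCII letters in place and leaves every other byte unchanged,
// so multi-byte UTF-8 sequences (for example Chinese characters) are preserved.
void ToLowerAscii(std::string& s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
}
```

With this normalization applied at indexing time, a query has to be lowercased the same way before it is looked up, otherwise `Split` and `split` would not match.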

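3. Memory optimization

std::move is used to avoid unnecessary copies.

A hash table provides fast keyword lookup.

Storing entries in vectors keeps them contiguous in memory.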
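One detail worth spelling out: passing `std::move(x)` to a by-value parameter only saves a copy if the function then moves that parameter into its destination. The sketch below shows the pattern with a hypothetical `Record` type and `MakeRecord` helper.

```cpp
#include <string>
#include <utility>

// Illustrative record with a by-value "sink" setter.
struct Record {
    std::string title_;

    // The caller decides whether to copy or move into `title`;
    // moving it again into the member avoids a second copy.
    void Set(std::string title) {
        title_ = std::move(title);
    }
};

// Moves `src` into a Record and returns it.
// Afterwards `src` is in a valid but unspecified state.
Record MakeRecord(std::string& src) {
    Record r;
    r.Set(std::move(src));
    return r;
}
```

If `Set` assigned with `title_ = title;` instead, the string buffer would still be copied once inside the setter even though the caller used `std::move`.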

Complete code

#pragma once
#include "Log.hpp"
#include "Util.hpp"
#include "common.h"
#include <boost/algorithm/string/case_conv.hpp>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

namespace  ns_index
{
    // Forward index element
    typedef struct ForwordElem
    {
        std::string title_;
        std::string content_;
        std::string url_;
        size_t doc_id_ = 0;

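        // Parameters are taken by value and moved into the members, so callers
        // that pass std::move(...) do not pay for an extra copy
        void Set(std::string title, std::string content, std::string url, size_t doc_id)
        {
            title_ = std::move(title);
            content_ = std::move(content);
            url_ = std::move(url);
            doc_id_ = doc_id;
        }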
    }Forword_t;

    typedef struct InvertedElem
    {
        size_t doc_id_ = 0;
        std::string word_;
        size_t weight_ = 0;

        void Set(size_t doc_id, std::string word, size_t weight)
        {
            doc_id_ = doc_id;
            word_ = std::move(word);  // move the by-value parameter instead of copying it
            weight_ = weight;
        }
    }Inverted_t;

    typedef std::vector<Inverted_t> InvertedList;

    class Inedx : public NonCopyable
    {
        public:
            Inedx() = default;

            Forword_t* QueryById(size_t id)
            {
                if(id >= Forword_Index_.size()) // IDs start at 0; size_t cannot be negative
                {
                    Log(LogModule::DEBUG) << "id invalid!";
                    return nullptr;
                }

                return &Forword_Index_[id];
            }

            InvertedList* QueryByWord(std::string word)
            {
                auto it = Inverted_Index_.find(word);
                if(it == Inverted_Index_.end())
                {
                    Log(LogModule::DEBUG) << word << " find fail!";
                    return nullptr;
                }

                return &it->second;
            }

            bool BulidIndex()
            {
                std::ifstream in(Tragetfile, std::ios::binary | std::ios::in);
                if(!in.is_open())
                {
                    Log(LogModule::ERROR) << "Targetfile open fail!BulidIndex fail!";
                    return false;
                }
                std::string singlefile;
                while (std::getline(in, singlefile))
                {
                    bool b = BuildForwordIndex(singlefile);
                    if(!b)
                    {
                        Log(LogModule::DEBUG) << "Build Forword Index Error!";
                        continue;
                    }

                    b = BuildInvertedIndex(Forword_Index_.size() - 1);
                    if(!b)
                    {
                        Log(LogModule::DEBUG) << "Build Inverted Index Error!";
                        continue;
                    }

                }
                in.close();
                return true;
            }

            ~Inedx() = default;

        private:
            typedef struct DocCount
            {
                size_t title_cnt_ = 0;
                size_t content_cnt_ = 0;
            }DocCount_t;

            bool BuildForwordIndex(std::string& singlefile)
            {
                std::vector<std::string> sepfile;
                bool b = ns_util::JiebaUtile::CutDoc(singlefile, sepfile);
                if(!b)
                    return false;

                if(sepfile.size() != 3)
                {
                    Log(LogModule::DEBUG) << "Segmentation fail!";
                    return false;
                }

                Forword_t ft;
                ft.Set(std::move(sepfile[0]), std::move(sepfile[1])
                , std::move(sepfile[2]), Forword_Index_.size());

                Forword_Index_.push_back(std::move(ft));
                return true;
            }

            bool BuildInvertedIndex(size_t findex)
            {
                // Bind by reference: a copy would duplicate the whole document body
                Forword_t& ft = Forword_Index_[findex];
                std::unordered_map<std::string, DocCount_t> map_s;

                std::vector<std::string> titlesegmentation;
                ns_util::JiebaUtile::CutPhrase(ft.title_, titlesegmentation);
                for(auto& s : titlesegmentation)
                {
                    boost::to_lower(s);
                    map_s[s].title_cnt_++;
                }

                std::vector<std::string> contentsegmentation;
                ns_util::JiebaUtile::CutPhrase(ft.content_, contentsegmentation);
                for(auto& s : contentsegmentation)
                {
                    boost::to_lower(s);
                    map_s[s].content_cnt_++;
                }

                const int X = 10;
                const int Y = 1;
                for(auto& p : map_s)
                {
                    Inverted_t it;
                    it.Set(findex, p.first
                        , p.second.title_cnt_ * X + p.second.content_cnt_ * Y);
                    InvertedList& list = Inverted_Index_[p.first];
                    list.push_back(std::move(it));
                }
                return true;
            }
        private:
            std::vector<Forword_t> Forword_Index_;
            std::unordered_map<std::string, InvertedList> Inverted_Index_;
    };
};

Other updated modules

common.h

#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <cstddef>
#include "Log.hpp"
using std::cout;
using std::endl;
using namespace LogModule;

const std::string Boost_Url_Head = "https://www.boost.org/doc/libs/1_89_0/doc/html";

const std::string Orignaldir = "../Data/html";
const std::string Tragetfile = "../Data/output.txt";

const std::string Output_sep = "\3";
const std::string Line_sep = "\n";

// Dictionary paths (adjust to your actual layout)
const std::string DICT_PATH = "dict/jieba.dict.utf8";
const std::string HMM_PATH = "dict/hmm_model.utf8";
const std::string USER_DICT_PATH = "dict/user.dict.utf8";
const std::string IDF_PATH = "dict/idf.utf8";
const std::string STOP_WORD_PATH = "dict/stop_words.utf8";

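// Non-copyable base class
class NonCopyable {
protected:
    // Derived classes may construct and destroy
    NonCopyable() = default;
    ~NonCopyable() = default;

    // Declaring the copy operations below already suppresses the implicit move
    // operations, so these explicit deletes are optional
    // NonCopyable(NonCopyable&&) = delete;
    // NonCopyable& operator=(NonCopyable&&) = delete;

private:
    // Copy construction and copy assignment are disabled
    NonCopyable(const NonCopyable&) = delete;
    NonCopyable& operator=(const NonCopyable&) = delete;
};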
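The effect of such a base class can be checked at compile time with `<type_traits>`. The sketch below repeats the same shape under the hypothetical names `NonCopyableBase` and `Widget` so that it compiles on its own.

```cpp
#include <type_traits>

// Same shape as the NonCopyable base in common.h, repeated so the sketch is self-contained.
class NonCopyableBase {
protected:
    NonCopyableBase() = default;
    ~NonCopyableBase() = default;

private:
    NonCopyableBase(const NonCopyableBase&) = delete;
    NonCopyableBase& operator=(const NonCopyableBase&) = delete;
};

// Hypothetical derived class used only for the checks below.
class Widget : public NonCopyableBase {
public:
    Widget() = default;
};

// Compile-time confirmation that deriving removes copy construction and copy assignment.
static_assert(!std::is_copy_constructible<Widget>::value, "Widget must not be copyable");
static_assert(!std::is_copy_assignable<Widget>::value, "Widget must not be copy-assignable");
```

Because the base class declares its copy operations, the compiler does not generate move operations for it either, so a derived class like `Widget` ends up neither copyable nor movable.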

util.hpp


#pragma once
#include "Log.hpp"
#include "common.h"
#include "cppjieba/Jieba.hpp"
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
namespace ns_util
{
    class FileUtil : public NonCopyable
    {
    public:
        static bool ReadFile(std::string path, std::string* out)
        {
            std::fstream in(path, std::ios::binary | std::ios::in);
            if(!in.is_open())
            {
                Log(LogModule::DEBUG) << "file-" << path << " open fail!";
                return false;
            }

            std::stringstream ss;
            ss << in.rdbuf();
            *out = ss.str();

            in.close();
            return true;
        }
    };

    class JiebaUtile : public NonCopyable
    {
    public:
        static cppjieba::Jieba* GetInstace()
        {
            static cppjieba::Jieba jieba_(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
            return &jieba_;
        }
        
        static bool CutPhrase(std::string& src, std::vector<std::string>& out)
        {
            try
            {
                GetInstace()->CutForSearch(src, out, true);
            }
            catch (const std::exception& e)
            {
                Log(LogModule::ERROR) << "CutString Error!" << e.what();
                return false;
            }
            catch (...)
            {
                Log(ERROR) << "Unknow Error!";
                return false;
            }
            return true;
        }

        static bool CutDoc(std::string& filestr, std::vector<std::string>& out)
        {
            try
            {
                boost::split(out, filestr, boost::is_any_of("\3"));
            }
            catch (const std::exception& e)
            {
                Log(LogModule::ERROR) << "std Error-" << e.what();
                return false;
            }            
            catch(...)
            {
                Log(LogModule::ERROR) << "UnKnown Error!";
                return false;
            }

            return true;
        }
    private:
        JiebaUtile() = default;
        ~JiebaUtile() = default;
    };
    
};
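`JiebaUtile::GetInstace` keeps the segmenter in a function-local static, so the dictionaries are loaded once on first use rather than on every call. The sketch below demonstrates that behavior with a hypothetical `Expensive` class standing in for `cppjieba::Jieba`; the construction counter exists only to make the effect observable.

```cpp
#include <string>

// Stand-in for an object that is expensive to construct (for example, one that loads dictionaries).
class Expensive {
public:
    explicit Expensive(const std::string& path) : path_(path) { ++ConstructCount(); }
    const std::string& path() const { return path_; }

    // Number of times the constructor has run; used only to demonstrate the behavior.
    static int& ConstructCount() {
        static int count = 0;
        return count;
    }

private:
    std::string path_;
};

// The local static is initialized on the first call only; since C++11 that
// initialization is thread-safe.
Expensive* GetInstance() {
    static Expensive instance("dict/jieba.dict.utf8");
    return &instance;
}
```

Every call to `GetInstance()` returns the same pointer, and the constructor runs exactly once no matter how many times the function is called.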

main.cc test

#include "Log.hpp"
#include "common.h"
#include "Parser.h"
#include "index.hpp"

const bool INIT = false;



int main()
{
    if(INIT)
    {
        Parser parser(Orignaldir, Tragetfile);
        parser.Init();
    }

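    ns_index::Inedx index;
    if(!index.BulidIndex())
    {
        Log(LogModule::ERROR) << "build index fail!";
        return 1;
    }

    auto a = index.QueryByWord("split");
    if(a == nullptr)
    {
        Log(LogModule::DEBUG) << "find fail!";
        return 0;
    }
    for(const auto& e : *a)
    {
        cout << "id:" << e.doc_id_ << " weight: " << e.weight_ << endl;
    }

    // QueryById returns nullptr for an out-of-range ID, so check before dereferencing
    auto b = index.QueryById(764);
    if(b == nullptr)
    {
        Log(LogModule::DEBUG) << "id 764 not found!";
        return 0;
    }
    cout << "id:764--title-- " << b->title_ << endl;
    return 0;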
}