[Project] A Boost Search Engine Built on Forward and Inverted Indexes: Writing the Search Module (Searcher)

Prerequisites

Index module

[Project] A Boost Search Engine Built on Forward and Inverted Indexes: Writing the Index-Building Module (Index) - CSDN blog

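I. Overall Framework

1. Overall architecture goals

Purpose: provide a lightweight, embeddable back-end module for a local search engine.
Input: the path to the raw document collection (used to build the index) and the user's query string (used to search).
Output: search results as structured JSON (title, summary, URL, relevance weight, and so on).

2. Features

① Initialization module

Function: build the index.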

void InitSearcher(const std::string &input)
{
    // 1. Get (or create) the Index object
    // 2. Use the Index object to build the index
}

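② Search module

Function: the main module. It looks up the index using the search terms and returns the documents that contain them.

void Search(std::string& query, std::string *json_string)
{}

Parameters: query: the text the user searches for

json_string: the search results returned to the user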

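II. Initialization Module

1. Singleton pattern

Initialization creates the Index object and uses it to build the index. The index only needs to be built once, so to avoid wasting resources on repeated construction, the Index class is implemented as a singleton.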

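#include <mutex>

class Index
{
private:
    Index() {}
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;
    static Index* instance;
    static std::mutex mutex;
public:
    // Get the singleton
    static Index* GetInstance()
    {
        if (nullptr == instance) // first check: skip the lock once the instance exists
        {
            // lock_guard releases the mutex automatically, even if new throws
            std::lock_guard<std::mutex> lock(mutex);
            if (nullptr == instance) // second check: another thread may have created it meanwhile
            {
                instance = new Index();
            }
        }
        return instance;
    }
};
Index* Index::instance = nullptr;
std::mutex Index::mutex; // the static mutex needs an out-of-class definition as well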

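How the singleton works: double-checked locking ensures that only one instance of Index is created in a multithreaded environment, and that the instance can only be reached through the static method GetInstance. To avoid the cost of locking on every call, the lock is taken only after a first check finds that no instance exists yet; a second check inside the lock then prevents several threads from each creating an instance. The copy constructor and copy assignment operator are deleted so that no second instance can be produced by copying. Instantiation is lazy: the instance is created on the first call to GetInstance, which saves resources and suits objects that are expensive to initialize. One caveat: with a plain pointer, the unlocked first read of instance is formally a data race under the C++11 memory model; a strictly conforming version would make the pointer std::atomic<Index*> or use a function-local static.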
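As a point of comparison, the same lazy, thread-safe behavior can be had without an explicit mutex. The sketch below is not the article's implementation: it uses a function-local static (often called a Meyers singleton), whose initialization is guaranteed to be thread-safe since C++11. The class here is a stripped-down stand-in for the real Index and contains only the singleton plumbing.

```cpp
// Stand-in for the article's Index class: only the singleton plumbing is shown.
class Index
{
public:
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

    // The local static is constructed on the first call only. Since C++11 the
    // language guarantees that concurrent first calls construct it exactly once.
    static Index* GetInstance()
    {
        static Index instance;
        return &instance;
    }

private:
    Index() = default;
};
```

Every call to Index::GetInstance() returns the same address. The trade-off is that the object lives until program exit and cannot be destroyed or rebuilt on demand, which is usually acceptable for an index that is built once.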

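2. Building the initializer

Approach: obtain the Index object through the singleton, then call its index-building method. That is all there is to it.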

void InitSearcher(const std::string &input)
{
    // 1. Get (or create) the Index object; index is a member of Searcher
    index = ns_index::Index::GetInstance();
    std::cout << "Got the Index singleton" << std::endl;

    // 2. Use the Index object to build the index
    index->BuildIndex(input);
    std::cout << "Index built successfully" << std::endl;
}

The code is self-explanatory.

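III. Search Module

1. Approach

① Tokenize: split the query into words in the way the searcher requires.

② Trigger: look up each of those words in the index.

③ Merge and sort: collect the lookup results and sort them by relevance (weight) in descending order.

④ Build: construct the JSON string from the results.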

2. Tokenizing

Use the CutString function from the Util helper class.

std::vector<std::string> words;
Util::JiebaUtil::CutString(query, &words);

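3. Trigger

① Function: given the words produced from the user's query (words), look up each word's document list (its inverted list) in the inverted index and merge all of the results into a single list, inverted_list_all.

② Approach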

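struct InvertElem
{
    std::string word;
    uint64_t doc_id;
    int weight; // relevance weight
};
// Inverted list (posting list): the InvertElem entries for one word across all documents.
// Each entry records a document id and the word's weight in it, so results can be ranked by weight.
typedef std::vector<InvertElem> InvertList;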

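Create an empty inverted list, inverted_list_all, to collect the entries for every matching term (each InvertElem carries the word, doc_id, and weight).
For each word, use boost::to_lower to convert it to lowercase so that the query and the index agree on case (the index is normally built from lowercased words too, which makes the search case-insensitive).
Call GetInvertList(word) on the index object to get that word's inverted list, that is, the entries for all documents containing the word.
If it returns nullptr, the word is not in the index (for example, a stop word or a word that was never indexed), so skip it.
Otherwise, append every element of that inverted list to the end of inverted_list_all.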

ns_index::InvertList inverted_list_all; // collects the entries found for every word
for(auto word : words) // copy on purpose: to_lower modifies word in place
{
    boost::to_lower(word);

    // Get the inverted list for this word
    ns_index::InvertList* inverted_list = index->GetInvertList(word);
    if(nullptr == inverted_list)
    {
        continue;
    }
    inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
}

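After this step, inverted_list_all holds the InvertElem entries for every document that contains at least one of the search words. A document that matches several words appears once per matching word.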
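Because the lists are simply concatenated, a document that matches two query words would show up twice in the results. The article does not handle this; the following is only a sketch of one possible extension, which collapses entries by doc_id and sums their weights. The helper name MergeByDocId and the choice to keep the first matching word for the summary are assumptions of this sketch, and InvertElem is repeated here so the block compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct InvertElem
{
    std::string word;
    uint64_t doc_id;
    int weight;
};
typedef std::vector<InvertElem> InvertList;

// Hypothetical helper: one entry per document, weights of duplicates summed.
// The first entry seen for a document keeps its word and its position.
InvertList MergeByDocId(const InvertList& all)
{
    std::unordered_map<uint64_t, std::size_t> slot; // doc_id -> index in merged
    InvertList merged;
    for (const auto& elem : all)
    {
        auto it = slot.find(elem.doc_id);
        if (it == slot.end())
        {
            slot.emplace(elem.doc_id, merged.size());
            merged.push_back(elem);
        }
        else
        {
            merged[it->second].weight += elem.weight;
        }
    }
    return merged;
}
```

Called between the trigger step and the sort, this would make a document that matches more query words rank higher, since its summed weight grows with each match.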

4. Merge and sort

Function: sort the results by relevance in descending order.

std::sort(inverted_list_all.begin(), inverted_list_all.end(),
[](const ns_index::InvertElem& e1, const ns_index::InvertElem& e2){
    return e1.weight > e2.weight; // > for descending, < for ascending
});

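After this step, the InvertElem entries in inverted_list_all are ordered by weight from highest to lowest.

Tip: when writing the comparison function, use > for descending order and < for ascending order.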
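One detail worth knowing is that std::sort is not stable, so entries with equal weight can come out in any order. The sketch below, which is an illustration and not part of the article's code, breaks ties by doc_id so that the output order is deterministic. The helper name SortByRelevance is hypothetical, and InvertElem is repeated so the block compiles on its own.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct InvertElem
{
    std::string word;
    uint64_t doc_id;
    int weight;
};
typedef std::vector<InvertElem> InvertList;

// Hypothetical helper: descending by weight, ties broken by ascending doc_id.
void SortByRelevance(InvertList& list)
{
    std::sort(list.begin(), list.end(),
              [](const InvertElem& a, const InvertElem& b) {
                  if (a.weight != b.weight)
                      return a.weight > b.weight; // higher weight first
                  return a.doc_id < b.doc_id;     // deterministic tie-break
              });
}
```

The comparator is a strict weak ordering (it never returns true for two equal elements), which std::sort requires; writing >= instead of > would violate that requirement.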

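5. Build

① Function: construct the JSON string from the lookup results.

② Using Json

See the following article:

Linux: Serialization and Deserialization - CSDN blog

③ Approach

Iterate over inverted_list_all and use each element's doc_id to look up the forward index. That yields the full document (title, content, URL), from which the JSON string can be built.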

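Json::Value root;
for(auto &item : inverted_list_all)
{
    // GetForwordIndex is spelled this way in the index module
    ns_index::DocInfo* doc = index->GetForwordIndex(item.doc_id);
    if(nullptr == doc)
    {
        continue;
    }
    Json::Value elem;
    elem["title"] = doc->title;
    elem["desc"] = GetDesc(doc->content, item.word);
    // elem["desc"] = doc->content; // content is the whole de-tagged document; we only want an excerpt
    elem["url"] = doc->url;
    // Explicit cast avoids an ambiguous overload on platforms where uint64_t is not Json::UInt64
    elem["id"] = static_cast<Json::UInt64>(item.doc_id);
    elem["weight"] = item.weight;

    root.append(elem);
}
// StyledWriter is deprecated in newer jsoncpp releases in favor of Json::StreamWriterBuilder
Json::StyledWriter writer;
*json_string = writer.write(root);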

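④ Implementing GetDesc

Function: extract an excerpt of the document content instead of showing all of it.

This is like the snippet shown beneath each title on a search results page (the boxed area in the original screenshot). Without it, the entire document content would be displayed, which is hard to read.

Approach: the summary is the 50 bytes before the first occurrence of the search word plus the 100 bytes starting at that occurrence.

① Find the first occurrence.

② Compute the start and end positions.

③ Return the substring.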

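std::string GetDesc(const std::string& content, const std::string& word)
{
    // Find the first occurrence of word in content, then take 50 bytes before it (from the start
    // if there are fewer) and 100 bytes from it onward (to the end if there are fewer).
    const std::size_t prev_step = 50;
    const std::size_t next_step = 100;
    // 1. Find the first occurrence, ignoring case
    auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
    [](unsigned char x, unsigned char y){
        // unsigned char: passing a negative char value to std::tolower is undefined behavior
        return std::tolower(x) == std::tolower(y);
    });
    if(iter == content.end()){
        return "None";
    }
    std::size_t pos = std::distance(content.begin(), iter);
    // Pitfall: content.find(word) is case-sensitive. The query word was lowercased, so a
    // document containing "Split" would not be found when word is "split":
    // std::size_t pos = content.find(word);
    // if(pos == std::string::npos)
    //     return "None";
    // 2. Compute the start and end positions
    std::size_t start = 0;
    std::size_t end = content.size();
    if(pos > start + prev_step) start = pos - prev_step;
    if(pos + next_step < end) end = pos + next_step;
    // 3. Return the substring
    if(start >= end) return "None1";
    return content.substr(start, end - start);
}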
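To make the window arithmetic concrete, the sketch below repeats GetDesc as a free function so that it can be compiled and tried in isolation. The const-reference parameters and the unsigned char predicate are choices made for this sketch; the logic is otherwise the one described above.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Standalone copy of the summary logic, for experimentation only.
std::string GetDesc(const std::string& content, const std::string& word)
{
    const std::size_t prev_step = 50;  // bytes kept before the match
    const std::size_t next_step = 100; // bytes kept from the match onward

    // Case-insensitive search for the first occurrence of word.
    auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == content.end())
        return "None";

    std::size_t pos = std::distance(content.begin(), iter);
    std::size_t start = 0;
    std::size_t end = content.size();
    if (pos > start + prev_step) start = pos - prev_step; // clamp to the beginning
    if (pos + next_step < end) end = pos + next_step;     // clamp to the end
    if (start >= end)
        return "None1";
    return content.substr(start, end - start);
}
```

For example, with content made of 60 'a' characters, then "Split", then 120 'b' characters, and word set to "split", the match is at pos = 60, so start = 10 and end = 160. The result is 150 bytes long: 50 'a' characters, "Split", and 95 'b' characters. Note that the window is measured in bytes, so with multi-byte UTF-8 text it can begin or end in the middle of a character.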

IV. Full Code

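#pragma once
#include "index.hpp"
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <jsoncpp/json/json.h>

namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index* index = nullptr; // used to look up the index
    public:
        void InitSearcher(const std::string &input)
        {
            // 1. Get (or create) the Index object
            index = ns_index::Index::GetInstance();
            std::cout << "Got the Index singleton" << std::endl;

            // 2. Use the Index object to build the index
            index->BuildIndex(input);
            std::cout << "Index built successfully" << std::endl;
        }
        // query: the text the user searches for
        // json_string: the search results returned to the user
        void Search(std::string &query, std::string *json_string)
        {
            // 1. Tokenize: split the query into words in the way the searcher requires
            std::vector<std::string> words;
            Util::JiebaUtil::CutString(query, &words);
            // 2. Trigger: look up each word in the index
            ns_index::InvertList inverted_list_all; // collects the entries found for every word
            for(auto word : words) // copy on purpose: to_lower modifies word in place
            {
                boost::to_lower(word);

                // Get the inverted list for this word
                ns_index::InvertList* inverted_list = index->GetInvertList(word);
                if(nullptr == inverted_list)
                {
                    continue;
                }
                inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
            }
            // 3. Merge and sort: sort the collected results by relevance (weight), descending
            std::sort(inverted_list_all.begin(), inverted_list_all.end(),
            [](const ns_index::InvertElem& e1, const ns_index::InvertElem& e2){
                return e1.weight > e2.weight; // > for descending, < for ascending
            });
            // 4. Build: construct the JSON string from the results
            Json::Value root;
            for(auto &item : inverted_list_all)
            {
                // GetForwordIndex is spelled this way in the index module
                ns_index::DocInfo* doc = index->GetForwordIndex(item.doc_id);
                if(nullptr == doc)
                {
                    continue;
                }
                Json::Value elem;
                elem["title"] = doc->title;
                elem["desc"] = GetDesc(doc->content, item.word);
                // elem["desc"] = doc->content; // content is the whole de-tagged document; we only want an excerpt
                elem["url"] = doc->url;
                // Explicit cast avoids an ambiguous overload on platforms where uint64_t is not Json::UInt64
                elem["id"] = static_cast<Json::UInt64>(item.doc_id);
                elem["weight"] = item.weight;

                root.append(elem);
            }
            // StyledWriter is deprecated in newer jsoncpp releases in favor of Json::StreamWriterBuilder
            Json::StyledWriter writer;
            *json_string = writer.write(root);
        }

        std::string GetDesc(const std::string& content, const std::string& word)
        {
            // Find the first occurrence of word in content, then take 50 bytes before it (from the
            // start if there are fewer) and 100 bytes from it onward (to the end if there are fewer).
            const std::size_t prev_step = 50;
            const std::size_t next_step = 100;
            // 1. Find the first occurrence, ignoring case
            auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
            [](unsigned char x, unsigned char y){
                // unsigned char: passing a negative char value to std::tolower is undefined behavior
                return std::tolower(x) == std::tolower(y);
            });
            if(iter == content.end()){
                return "None";
            }
            std::size_t pos = std::distance(content.begin(), iter);
            // Pitfall: content.find(word) is case-sensitive. The query word was lowercased, so a
            // document containing "Split" would not be found when word is "split":
            // std::size_t pos = content.find(word);
            // if(pos == std::string::npos)
            //     return "None";
            // 2. Compute the start and end positions
            std::size_t start = 0;
            std::size_t end = content.size();
            if(pos > start + prev_step) start = pos - prev_step;
            if(pos + next_step < end) end = pos + next_step;
            // 3. Return the substring
            if(start >= end) return "None1";
            return content.substr(start, end - start);
        }
    };
}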

Thanks for reading.