C/C++ Boost Lightweight Search Engine in Practice: Architecture, Tech Stack, and Engineering Guide. Building the Forward and Inverted Indexes (Part 2)

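Introduction: This is a hands-on project focused on the core workflow of a basic search engine, built on the C/C++ ecosystem. A crawler fetches pages from the web, the server preprocesses them (tag stripping, data cleaning, index construction), and an HTTP service then accepts client requests, searches the index, and assembles a results page to return. Together these steps cover the end-to-end logic of a lightweight search engine. The project uses C++11, the STL, and Boost, and is deployed on a CentOS 7 cloud server with GCC (or developed with Visual Studio). It meets back-end performance needs, and optional front-end technologies (HTML5, JavaScript, and so on) can improve the user experience. It is a representative case for understanding how search engines work underneath and for practicing C++ engineering.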

【1】 The cppjieba word segmentation tool

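Using the inverted index requires keywords, and those keywords come from the title and content of each .html document. That means we need word segmentation, which we do with cppjieba. The upload steps are as follows:

If you need cppjieba, you can search for it on GitCode and fetch it with Git. Here I uploaded it from my local machine to the server:

Then create symbolic links to cppjieba/include/cppjieba (the headers) and cppjieba/dict (the dictionaries):

We then move cppjieba to the parent directory and update the two symbolic links: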

【2】 Forward and inverted index structure design

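#include <cstdint>
#include <string>

// Forward index entry
typedef struct Forward_index
{
    // Document fields
    std::string title;   // page title
    std::string source;  // cleaned page content
    std::string chain;   // page URL
    // Forward index ID of this document
    uint64_t doc_id;
}Forwardindex;
// Inverted index entry
typedef struct Inverted_index
{
    // Forward index ID of the document (same type as Forwardindex::doc_id)
    uint64_t doc_id;
    // Keyword that occurs in the document
    std::string word;
    // Weight
    int weight;
}Invertedindex;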

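Forward index: maps an ID to its document (title, content, URL). The ID is simply the document's index in the vector.

Inverted index: maps a keyword to the IDs of the documents that contain it, so we use unordered_map for its fast lookup.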

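#include <unordered_map>
#include <vector>

// Forward index storage: doc_id is the position in this vector
std::vector<Forwardindex> Forward;

typedef std::vector<Invertedindex> Stock_Inverted;
// Inverted index storage: keyword -> inverted list
std::unordered_map<std::string,Stock_Inverted> Inverted;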
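
To make the relationship between the two containers concrete, here is a minimal sketch. The names `MiniIndex`, `Doc`, `Posting`, `Add`, and `Link` are hypothetical stand-ins for illustration and are not part of the project; the example.com URLs in the usage note are placeholders as well.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the article's Forwardindex / Invertedindex.
struct Doc { std::string title; std::string source; std::string chain; uint64_t doc_id; };
struct Posting { uint64_t doc_id; std::string word; int weight; };

struct MiniIndex {
    std::vector<Doc> forward;                                        // doc_id == position
    std::unordered_map<std::string, std::vector<Posting>> inverted;  // word -> postings

    // Append a document and return its ID (the vector index it landed at).
    uint64_t Add(const std::string &title, const std::string &source,
                 const std::string &chain) {
        uint64_t id = forward.size();
        forward.push_back(Doc{title, source, chain, id});
        assert(forward[id].doc_id == id);  // ID and position always agree
        return id;
    }

    // Record that `word` occurs in document `id` with the given weight.
    void Link(const std::string &word, uint64_t id, int weight) {
        inverted[word].push_back(Posting{id, word, weight});
    }
};
```

Calling `Add` twice yields IDs 0 and 1. Linking the word "boost" to both documents leaves `inverted["boost"]` with two postings, and each posting's `doc_id` can be used directly as an index into `forward`.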

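【3】 Key function design

(1) Return a document by its ID

Meaning: this is the public interface of the forward index. Given an ID, it returns the corresponding entry in the vector.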

// Return the document for a given ID (nullptr if the ID is out of range)
Forwardindex* GetForward_index(uint64_t id)
{
     if(id>=Forward.size())
     {
          std::cerr<<"GetForward_index: doc_id out of range"<<std::endl;
          return nullptr;
     }
     return &Forward[id];
}

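(2) Return the inverted list for a keyword

Meaning: this is the public interface of the inverted index. Given a keyword, it returns the matching document IDs, that is, the vector<Invertedindex> stored for that keyword.

Explanation: a keyword almost always occurs in more than one document, so a vector<Invertedindex> is needed to hold every document that the keyword appears in.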

// Return the inverted list for a keyword (nullptr if the keyword is not indexed)
Stock_Inverted* GetInverted_index(const std::string &word)
{
     auto it =Inverted.find(word);
     if(it == Inverted.end())
     {
          std::cerr<<"GetInverted_index: keyword not found"<<std::endl;
          return nullptr;
     }
     return &it->second;
}
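
The two lookups are meant to be chained: a keyword gives an inverted list, and each entry in that list gives a document. The sketch below shows one way this could look, with results ordered by weight. `TitlesFor`, `Doc`, and `Posting` are hypothetical names, the tables are passed in as parameters instead of being globals, and the sorting step is an assumption about how the search stage might use the weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Doc { std::string title; std::string chain; };
struct Posting { uint64_t doc_id; int weight; };

using ForwardTable = std::vector<Doc>;
using InvertedTable = std::unordered_map<std::string, std::vector<Posting>>;

// Return the titles of all documents containing `word`, highest weight first.
// An unknown word yields an empty result.
std::vector<std::string> TitlesFor(const ForwardTable &forward,
                                   const InvertedTable &inverted,
                                   const std::string &word)
{
    std::vector<std::string> titles;
    auto it = inverted.find(word);
    if (it == inverted.end()) return titles;

    std::vector<Posting> postings = it->second;  // copy, so the index itself stays untouched
    std::stable_sort(postings.begin(), postings.end(),
                     [](const Posting &a, const Posting &b) { return a.weight > b.weight; });

    for (const Posting &p : postings) {
        assert(p.doc_id < forward.size());  // every posting must point at an existing document
        titles.push_back(forward[p.doc_id].title);
    }
    return titles;
}
```

Returning an empty vector for an unknown keyword lets the caller treat "no hits" and "some hits" uniformly, whereas the pointer-returning interface above requires a null check first.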

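(3) Notes

We now have the two public interfaces: ID to document, and keyword to document IDs.

The next step is to fill these containers and establish the index relationships: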

// Forward index storage
std::vector<Forwardindex> Forward;

typedef std::vector<Invertedindex> Stock_Inverted;
// Inverted index storage
std::unordered_map<std::string,Stock_Inverted> Inverted;

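(4) Building the index

How it works: the data cleaning step from the previous article wrote the content of every .html document to data_clean.txt.

We read the file line by line with getline into the variable line. Because data_clean.txt stores one document per line, each read yields exactly one document, and the while loop walks through every document in the file.

Building the index means building the forward and inverted indexes together, that is, filling the vector and the unordered_map.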

// Build both indexes from the cleaned data file
bool Buildindex(const std::string &input)
{
     std::ifstream in(input,std::ios::in | std::ios::binary);
     if(!in.is_open())
     {
          std::cerr<<"Buildindex: failed to open "<<input<<std::endl;
          return false;
     }
     // Holds one document (one line) at a time
     std::string line;
     while(std::getline(in,line))
     {
          // Forward index
          Forwardindex * doc = BUild_Forward_Index(line);
          // Check for a malformed line before dereferencing doc
          if(doc==nullptr)continue;
          printf("Building index:%llu\ntitle:%s\nchain:%s\n",
                 static_cast<unsigned long long>(doc->doc_id),doc->title.c_str(),doc->chain.c_str());

          // Inverted index
          Build_Inverted_Index(*doc);
     }
     return true;
}
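
The read loop can be tried without the real data file by feeding it an in-memory stream. The sketch below only counts how many lines would be indexable. `CountIndexableLines` is a hypothetical helper, and it assumes that each line holds three fields separated by the '\3' character.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count the lines that contain at least two '\3' separators, i.e. three fields.
// Malformed lines are skipped, mirroring the `continue` in Buildindex.
std::size_t CountIndexableLines(std::istream &in)
{
    std::size_t ok = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t first = line.find('\3');
        if (first == std::string::npos) continue;                       // no title separator
        if (line.find('\3', first + 1) == std::string::npos) continue;  // no content separator
        ++ok;
    }
    return ok;
}
```

Because the function takes a `std::istream&`, the same loop works with a `std::ifstream` in production and a `std::istringstream` in a quick test.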

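(5) Building the forward index

Approach: we start with the content of a single document in line (a std::string). We parse it into title, content, and URL, then append the result to the vector. Every line has the layout title, content, URL, with the fields separated by the '\3' character, so splitting on that character is enough. The document's ID is exactly its index in the vector. The forward index struct is repeated below for reference: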

// Build the forward index entry for one line: title \3 content \3 URL
Forwardindex *BUild_Forward_Index(const std::string&line)
{
     // Use a local object so nothing leaks on the early returns
     Forwardindex index;

     // Extract the title
     size_t set_pos1 = line.find('\3');
     if(set_pos1 == std::string::npos) return nullptr;
     index.title = line.substr(0, set_pos1);
     if (index.title.empty()) index.title = "(empty)";

     // Extract the content (source)
     size_t set_pos2 = line.find('\3', set_pos1+1);
     if(set_pos2 == std::string::npos) return nullptr;
     index.source = line.substr(set_pos1+1, set_pos2 - (set_pos1+1));
     if (index.source.empty()) index.source = "(empty)";

     // Extract the URL (chain)
     index.chain = line.substr(set_pos2+1);
     if (index.chain.empty()) index.chain = "(empty)";

     // The ID is the position the entry is about to occupy
     index.doc_id = Forward.size();
     Forward.push_back(std::move(index));
     // Points at the stored entry; valid until the next push_back reallocates
     return &Forward.back();
}

// Forward index entry
typedef struct Forward_index
{
    // Document fields
    std::string title;
    std::string source;
    std::string chain;
    // Forward index ID of this document
    uint64_t doc_id;
}Forwardindex;
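
The two find/substr steps follow the same pattern, so the splitting can also be written as a small general helper. This is a sketch under stated assumptions: `SplitFields`, `ParsedDoc`, and `ParseLine` are hypothetical names, and this version requires exactly three fields, which is stricter than the function above (where any extra separators end up inside the URL).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split `line` on `sep`, keeping empty fields.
std::vector<std::string> SplitFields(const std::string &line, char sep = '\3')
{
    std::vector<std::string> fields;
    std::size_t begin = 0;
    while (true) {
        std::size_t end = line.find(sep, begin);
        if (end == std::string::npos) {
            fields.push_back(line.substr(begin));  // last field runs to the end of the line
            break;
        }
        fields.push_back(line.substr(begin, end - begin));
        begin = end + 1;
    }
    return fields;
}

struct ParsedDoc { std::string title; std::string source; std::string chain; };

// Fill `out` from one cleaned line; return false unless the line has exactly three fields.
bool ParseLine(const std::string &line, ParsedDoc *out)
{
    assert(out != nullptr);
    std::vector<std::string> f = SplitFields(line);
    if (f.size() != 3) return false;
    out->title = f[0];
    out->source = f[1];
    out->chain = f[2];
    return true;
}
```

A line such as "Title\3Body text\3https://example.com/x" (a placeholder URL) parses into its three parts, while a line with no separators is rejected.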

(6) Building the inverted index

// Build the inverted index entries for one document
bool Build_Inverted_Index(const Forwardindex& doc)
{
     // Create the tokenizer object
     JiebaUtil jieba;
     std::vector<std::string> S;
     // Tokenize the title
     S=jieba.Tokenize(doc.title);
     // S now holds the title tokens; V counts how often each word occurs
     struct culculate
     {
         int title_size=0;   // occurrences in the title
         int source_size=0;  // occurrences in the content
     };
     std::unordered_map<std::string,culculate> V;
     // Count occurrences in the title
     for(const auto &e:S)
     {
          (V[e].title_size)++;
     }
     // Count occurrences in the content
     S.clear();
     S=jieba.Tokenize(doc.source);
     for(const auto &e:S)
     {
         (V[e].source_size)++;
     }
     // Write the counts in V into Inverted
     for(const auto &it:V)
     {
          Invertedindex index_t;
          index_t.word=it.first;
          index_t.doc_id=doc.doc_id;
          // A title hit counts twice as much as a content hit
          index_t.weight=((it.second.title_size)*2+(it.second.source_size)*1);
          Inverted[it.first].push_back(std::move(index_t));
     }
     return true;
}
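
The weighting rule is easy to check by hand on a small input. The sketch below reproduces only the arithmetic, weight = 2 × (title occurrences) + 1 × (content occurrences). `SplitWords` is a hypothetical whitespace tokenizer standing in for JiebaUtil::Tokenize, so its segmentation will differ from cppjieba's, and `Weights` is likewise an illustrative name.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for JiebaUtil::Tokenize: splits on whitespace only.
std::vector<std::string> SplitWords(const std::string &text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// weight = 2 * (occurrences in title) + 1 * (occurrences in content)
std::unordered_map<std::string, int> Weights(const std::string &title,
                                             const std::string &source)
{
    std::unordered_map<std::string, int> weight;
    for (const std::string &w : SplitWords(title))  weight[w] += 2;
    for (const std::string &w : SplitWords(source)) weight[w] += 1;
    return weight;
}
```

For the title "boost search" and the content "search engine with boost and more search", the word "boost" scores 2 + 1 = 3 and "search" scores 2 + 2 = 4. This sketch compares words exactly as written, so "Boost" and "boost" would be counted separately.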

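【4】 Singleton pattern

Singleton pattern: only one Index object may ever be created, and from then on that object can only be used through the Index pointer.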

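#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

class Index
{
public:
     typedef std::vector<Invertedindex> Stock_Inverted;
     // Disable copy construction and copy assignment
     Index(const Index&)=delete;
     Index& operator=(const Index&)=delete;
        
     // Double-checked locking: only take the lock while instance is still null.
     // Note: the first, unlocked read of instance is not synchronized, so it is
     // formally a data race under the C++11 memory model.
     static Index*handle()
     {
          if(instance==nullptr)
          {
             pthread_mutex_lock(&mutex);
             if (instance == nullptr)
             {
                  instance = new Index;
             }
             pthread_mutex_unlock(&mutex);
          }
          return instance;
     }

......

};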

// Definition and initialization of the static member
Index* Index::instance=nullptr;
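
As a point of comparison, C++11 also allows a singleton without an explicit mutex, because initialization of a function-local static is guaranteed to happen exactly once even when several threads make the first call at the same time. The sketch below swaps in that technique. `IndexSketch` is a hypothetical class name used so that it does not clash with the Index class above, and this is not the approach the project itself takes.

```cpp
// A singleton built on a function-local static (often called a Meyers singleton).
class IndexSketch
{
public:
    // Disable copy construction and copy assignment, as in the original class.
    IndexSketch(const IndexSketch &) = delete;
    IndexSketch &operator=(const IndexSketch &) = delete;

    // The static object is constructed on the first call; C++11 guarantees that
    // concurrent first calls wait until that construction has finished.
    static IndexSketch *handle()
    {
        static IndexSketch instance;
        return &instance;
    }

private:
    IndexSketch() = default;  // only handle() can create the object
};
```

Every call to `IndexSketch::handle()` returns the same address. One trade-off is that the object is destroyed automatically at program exit, whereas the `new`-based version above is never deleted.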