[Standard Project] A C++ Boost Search Engine Based on Forward and Inverted Indexes

Table of Contents

1. Project background
2. High-level principles of search engines
3. Search engine technology stack and project environment
4. Forward index vs. inverted index: how the search engine works
5. Writing the Parser module for tag removal and data cleaning
- 5.1 Extract the Boost library on Linux
- 5.2 Create the directories on Linux
- 5.3 Write the parse module
6. Writing the Index module for building the indexes
7. Writing the Searcher module
8. Writing the http_server module
9. Writing the front-end module
Summary

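1. Project background

2. High-level principles of search engines

3. Search engine technology stack and project environment

4. Forward index vs. inverted index: how the search engine works

5. Writing the Parser module for tag removal and data cleaning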

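5.1 Extract the Boost library on Linux

The documentation comes from the official Boost website.

For now, only the HTML files under boost_1_78_0/doc/html are needed; they are the input for building the index.

I did not download the complete Boost release from the official site. I downloaded only the HTML files. Even so, the parser still has to filter out every file that is not an .html file.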

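5.2 Create the directories on Linux

Boost_Search is the directory that holds all code modules and data.

5.3 Write the parse module

Create the file raw.txt here; it stores the cleaned documents.

The skeleton of the parser (example):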

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// The parser reads files, so define the paths up front:
// the data source directory and the output file for the cleaned documents.
const std::string src_path = "data/input";          // path of the data source
const std::string output = "data/raw_html/raw.txt"; // path of the cleaned output file

// DocInfo: the information extracted from one document
typedef struct DocInfo
{
    std::string title;   // title of the document
    std::string content; // content of the document
    std::string url;     // URL of the document on the official site
} DocInfo_t;

// Parameter conventions
// const & ---> input
// *       ---> output
// &       ---> input and output

// Save the path of every HTML file into files_list
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);

// Read each file named in files_list and parse its content
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);

// Write the parsed content of every file to output
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main()
{
    std::vector<std::string> files_list; // paths of all HTML files

    // Step 1: recursively collect the path of every HTML file under src_path
    // into files_list, so that the files can be read one by one later.
    if(!EnumFile(src_path, &files_list)) // EnumFile: enumerate files
    {
        std::cerr << "enum file name error! " << std::endl;
        return 1;
    }

    // Step 2: read each .html file named in files_list and parse it.
    // The parsed documents are stored in results.
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list, &results)) // ParseHtml: parse the HTML files
    {
        std::cerr << "parse html error! " << std::endl;
        return 2;
    }

    // Step 3: write the parsed documents to output, using '\3' between the
    // fields of a document and '\n' between documents.
    if(!SaveHtml(results, output)) // SaveHtml: save the parsed documents
    {
        std::cerr << "save html error! " << std::endl;
        return 3;
    }

    return 0;
}

Command to install the Boost development libraries: sudo apt install -y libboost-all-dev

Enumerating the files (example):

// Add this header to the existing includes
#include <boost/filesystem.hpp>

// Save the path of every HTML file into files_list
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
    // Namespace alias to shorten the qualified names
    namespace fs = boost::filesystem;
    fs::path root_path(src_path); // the path where enumeration starts
    // Check that the path exists
    if(!fs::exists(root_path))
    {
        std::cerr << src_path << " not exists" << std::endl;
        return false;
    }
    // Walk the directory tree recursively
    fs::recursive_directory_iterator end; // a default-constructed iterator is the end iterator
    for(fs::recursive_directory_iterator iter(root_path); iter != end; ++iter)
    {
        // Skip anything that is not a regular file, such as a directory
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }

        // A regular file must also end in .html; otherwise skip it
        // (this is what filters out images and other non-HTML files).
        // iter->path() returns the path of the current entry,
        // and extension() returns the extension of that path.
        if(iter->path().extension() != ".html")
        {
            continue;
        }

        //std::cout << "debug: " << iter->path().string() << std::endl; // debug output

        // At this point the entry is a regular file whose name ends in .html
        files_list->push_back(iter->path().string()); // keep the full path for the later parsing step
    }
    return true;
}

The Makefile (example):

# Recipe lines must start with a tab character, not spaces
cpp=g++
parser:parser.cpp
	$(cpp) -o $@ $^ -lboost_system -lboost_filesystem -std=c++11
.PHONY:clean
clean:
	rm -f parser

Parsing the files, ParseHtml (example):

bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
    for(const std::string &file : files_list)
    {
        // 1. Read the whole file into result
        std::string result;
        if(!ns_util::FileUtil::ReadFile(file, &result))
        {
            continue;
        }
        // 2. Extract the title from the file content
        DocInfo_t doc;
        if(!ParseTitle(result, &doc.title))
        {
            continue;
        }
        // 3. Extract the content from the file content
        if(!ParseContent(result, &doc.content))
        {
            continue;
        }
        // 4. Build the URL from the file path
        if(!ParseUrl(file, &doc.url))
        {
            continue;
        }

        // Parsing succeeded; everything extracted from this document is now in doc.
        // push_back(doc) would copy the strings, which can be slow for large documents.
        // std::move casts doc to an rvalue so that the rvalue-reference overload of push_back is chosen.
        results->push_back(std::move(doc));
    }
    return true;
}
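
ParseHtml calls ParseTitle, ParseContent, and ParseUrl, which are not shown above. The following is a minimal sketch of how they might look, not the author's implementation. It assumes the title sits between <title> and </title>, treats everything between '<' and '>' as a tag to drop, and builds the URL by swapping the local source directory for a URL prefix. The constants kSrcPath and kUrlHead are hypothetical names, and the value of kUrlHead is an assumed prefix for the 1.78.0 documentation.

```cpp
#include <cstddef>
#include <string>

// Assumption: same value as src_path in the parser skeleton.
const std::string kSrcPath = "data/input";
// Assumption: URL prefix under which the 1.78.0 HTML documentation is served.
const std::string kUrlHead = "https://www.boost.org/doc/libs/1_78_0/doc/html";

// Copy the text between <title> and </title> into *title.
bool ParseTitle(const std::string &file, std::string *title)
{
    const std::string open_tag = "<title>";
    std::size_t begin = file.find(open_tag);
    if (begin == std::string::npos)
        return false;
    begin += open_tag.size(); // first character after "<title>"
    std::size_t end = file.find("</title>", begin);
    if (end == std::string::npos)
        return false;
    *title = file.substr(begin, end - begin);
    return true;
}

// Two-state scanner: characters inside <...> are dropped, the rest is kept.
bool ParseContent(const std::string &file, std::string *content)
{
    enum class State { kContent, kLabel };
    State s = State::kContent;
    for (char c : file)
    {
        if (s == State::kLabel)
        {
            if (c == '>')
                s = State::kContent; // the tag has ended
        }
        else if (c == '<')
        {
            s = State::kLabel; // a tag starts here
        }
        else
        {
            // '\n' separates documents in raw.txt, so it must not appear in a field
            content->push_back(c == '\n' ? ' ' : c);
        }
    }
    return true;
}

// Replace the local directory prefix with the URL prefix.
bool ParseUrl(const std::string &file_path, std::string *url)
{
    if (file_path.compare(0, kSrcPath.size(), kSrcPath) != 0)
        return false; // the file is not under the source directory
    *url = kUrlHead + file_path.substr(kSrcPath.size());
    return true;
}
```

Under these assumptions, data/input/array.html maps to kUrlHead followed by /array.html, and `<p>Hello <b>Boost</b></p>` is reduced to the text Hello Boost.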

The file-reading utility, ns_util::FileUtil (example):

#pragma once
#include <iostream>
#include <string>
#include <fstream>
#include <vector>

namespace ns_util
{
    class FileUtil
    {
    public:
        // Read the content of the file named file_path into *out
        static bool ReadFile(const std::string &file_path, std::string *out)
        {
            // Open file_path (one .html file) for reading
            std::ifstream in(file_path, std::ios::in);
            // Check that the file was opened
            if(!in.is_open())
            {
                std::cerr << "open file " << file_path << " error" << std::endl;
                return false;
            }
            // Read the file content
            std::string line;
            // std::getline returns a reference to the stream. In the while condition the
            // stream is converted to bool through its explicit operator bool. When a read
            // fails, for example at end of file where eofbit and failbit are set, the
            // conversion yields false and the loop ends.
            while(std::getline(in, line)) // each iteration reads one line of the file
            {
                *out += line;    // append the line (getline has already discarded the '\n')
            }
            in.close(); // close the file
            return true;
        }
    };
}
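
Because std::getline discards the line terminator, ReadFile joins the last word of one line directly to the first word of the next. If that matters for later word segmentation, one possible variant appends a space after each line. The sketch below is an assumption, not the author's code: ReadAllLines is a hypothetical name, and it takes a std::istream so it can be tried without a file on disk.

```cpp
#include <istream>
#include <sstream> // std::istringstream, used only in the usage example
#include <string>

// Same getline loop as ReadFile, but a single space is appended after
// every line so that words on adjacent lines stay separated.
bool ReadAllLines(std::istream &in, std::string *out)
{
    std::string line;
    while (std::getline(in, line))
    {
        *out += line;
        *out += ' ';
    }
    return true;
}
```

For example, feeding it a std::istringstream that holds "forward\nindex\n" yields "forward index " rather than "forwardindex".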

Saving the parsed documents, SaveHtml (example):

bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
    #define SEP '\3' // field separator between title, content, and URL

    // Open the output file for writing.
    // Binary mode writes the bytes exactly as they are given.
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if(!out.is_open())
    {
        std::cerr << "open " << output << " failed!" << std::endl;
        return false;
    }

    // Write one line per document: title\3content\3url\n
    for(const auto &item : results)
    {
        std::string out_string;
        out_string = item.title;    // title
        out_string += SEP;          // field separator
        out_string += item.content; // content
        out_string += SEP;          // field separator
        out_string += item.url;     // URL
        out_string += '\n';         // newline marks the end of one document

        // Write the record to the file
        out.write(out_string.c_str(), out_string.size());
    }

    out.close();
    return true;
}
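
Each line that SaveHtml writes has the form title\3content\3url. A later stage has to split such a line back into its three fields; a minimal sketch of that step follows. SplitRecord is a hypothetical helper name and is not part of the code above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one line of raw.txt into its fields at every occurrence of sep.
std::vector<std::string> SplitRecord(const std::string &line, char sep = '\3')
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1; // continue after the separator
    }
    return fields;
}
```

A well-formed record produces exactly three fields, so a caller could treat any other count as a damaged line and skip it.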
