[C++ Project in Practice] A Boost Search Engine Based on Forward and Inverted Indexes (1)

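1. Project Background and Goals

The Boost website has no built-in search or navigation feature, so this project adds search over the Boost documentation.

Site search: the data is more vertical (limited to one domain) and the data volume is small.

It is similar to the search on cplusplus.com.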

2. How a Search Engine Works at a High Level

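3. Tech Stack and Project Environment

Tech stack: C/C++, C++11, STL, the quasi-standard library Boost (file operations), jsoncpp (data exchange between the client and the data side), cppjieba (splitting search keywords into words), cpp-httplib (building the HTTP server)

Other technologies (front end): HTML5, CSS, JS, jQuery, Ajax

Project environment: CentOS 7 cloud server, vim/gcc (g++)/Makefile, VS2019/VS Code (for the web pages)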

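4. Forward Index and Inverted Index

Forward index: maps a document ID to the document's content (the keywords inside the document).

A forward index is like a book's table of contents: given a page number, we can look up the corresponding content.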
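To make the ID-to-content mapping concrete, here is a minimal sketch of a forward index. The names `Doc` and `ForwardIndex` are hypothetical, and using the vector position as the document ID is an assumption made for illustration, not necessarily how the project ends up storing it.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for one cleaned document.
struct Doc {
    std::string title;
    std::string content;
    std::string url;
};

// Forward index: the document ID is simply the position in the vector.
class ForwardIndex {
public:
    // Stores a document and returns the ID assigned to it.
    std::uint64_t Add(const Doc &doc) {
        docs_.push_back(doc);
        return docs_.size() - 1;
    }

    // Returns the document for an ID, or nullptr if the ID is out of range.
    const Doc *Get(std::uint64_t doc_id) const {
        if (doc_id >= docs_.size()) return nullptr;
        return &docs_[doc_id];
    }

private:
    std::vector<Doc> docs_;
};
```

With this layout a lookup by ID is a single array access, which is why the forward index is the natural place to fetch title, content, and url once the IDs are known.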

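Segment the target documents into words. Purpose: to make it easier to build the inverted index and to search.

Stop words: 了, 吗, 的, the, a. These can usually be ignored during word segmentation.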
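Dropping stop words can be sketched as a simple filter over the tokens. This assumes the text has already been split into words by a segmenter such as cppjieba; the function name `RemoveStopWords` and the way the stop-word set is supplied are hypothetical.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Returns the tokens that are not in the stop-word set, preserving order.
std::vector<std::string> RemoveStopWords(
    const std::vector<std::string> &tokens,
    const std::unordered_set<std::string> &stop_words) {
    std::vector<std::string> kept;
    for (const std::string &token : tokens) {
        if (stop_words.count(token) == 0) kept.push_back(token);
    }
    return kept;
}
```

For example, with the stop words "the" and "a", the tokens "the", "boost", "a", "filesystem" are reduced to "boost" and "filesystem".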

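Inverted index: segment each document's content into words, collect the distinct keywords, and map each keyword to the IDs of the documents that contain it.

Within each keyword's list of document IDs, the IDs are ordered by weight.

The inverted index is the opposite of the forward index: starting from a piece of content, we can find out which files it appears in, and from there locate those files.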
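The keyword-to-IDs mapping and the ordering by weight can be sketched as follows. Two things here are assumptions for illustration: words are split on whitespace as a stand-in for cppjieba, and the weight is simply the number of times the keyword occurs in the document. `Posting` and `BuildInvertedIndex` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// One entry in a keyword's posting list.
struct Posting {
    std::uint64_t doc_id;
    int weight;  // Assumption: occurrences of the keyword in the document.
};

using InvertedIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Builds the index; docs[i] is the cleaned text of the document with ID i.
InvertedIndex BuildInvertedIndex(const std::vector<std::string> &docs) {
    InvertedIndex index;
    for (std::uint64_t id = 0; id < docs.size(); ++id) {
        // Count each word of this document (whitespace split stands in
        // for a real segmenter).
        std::unordered_map<std::string, int> counts;
        std::istringstream words(docs[id]);
        std::string word;
        while (words >> word) ++counts[word];
        for (const auto &entry : counts) {
            index[entry.first].push_back(Posting{id, entry.second});
        }
    }
    // Order every posting list by descending weight.
    for (auto &entry : index) {
        std::sort(entry.second.begin(), entry.second.end(),
                  [](const Posting &a, const Posting &b) {
                      return a.weight > b.weight;
                  });
    }
    return index;
}
```

A real weight would likely treat a hit in the title differently from a hit in the body; the plain count is only there to give the sort something to work with.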

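Simulating a Lookup

User input:

keyword -> look it up in the inverted index -> extract the document IDs (x, y, z, ...) -> go to the forward index -> find each document's content -> build a summary for each result from title + content (desc) + url -> build the response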
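The lookup chain above can be sketched in a few lines. The containers are simplified stand-ins (a vector as the forward index, a map from keyword to already-sorted document IDs as the inverted index), and the summary is assumed to be just the first `desc_len` bytes of the content. `Doc`, `Result`, and `Search` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct Doc { std::string title, content, url; };

struct Result { std::string title, desc, url; };

// forward:  document ID -> document (the ID is the vector position)
// inverted: keyword -> document IDs, assumed to be sorted by weight already
std::vector<Result> Search(
    const std::string &keyword,
    const std::vector<Doc> &forward,
    const std::unordered_map<std::string, std::vector<std::size_t>> &inverted,
    std::size_t desc_len = 50) {
    std::vector<Result> results;
    auto it = inverted.find(keyword);
    if (it == inverted.end()) return results;  // keyword is not indexed
    for (std::size_t doc_id : it->second) {
        if (doc_id >= forward.size()) continue;  // skip unknown IDs
        const Doc &doc = forward[doc_id];
        // Assumption: the summary is the first desc_len bytes of the content.
        results.push_back(
            Result{doc.title, doc.content.substr(0, desc_len), doc.url});
    }
    return results;
}
```

Because the inverted index already stores the IDs in weight order, the results come out ranked without any extra sorting at query time.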

5. Writing the Parser Module for Tag Removal and Data Cleaning


Boost website: https://www.boost.org/
// For now we only need the HTML files under boost_1_78_0/doc/html; the index is built from them

Tag Removal


[@VM-0-3-centos boost_searcher]$ touch parser.cc
// raw data -> data after tag removal
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html> <!-- this is a tag -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Chapter 30. Boost.Process</title>
<link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">
<link rel="up" href="libraries.html" title="Part I. The Boost C++ Libraries (BoostBook Subset)">
<link rel="prev" href="poly_collection/acknowledgments.html" title="Acknowledgments">
<link rel="next" href="boost_process/concepts.html" title="Concepts">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../boost.png"></td>
<td align="center"><a href="../../index.html">Home</a></td>
<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../more/index.htm">More</a></td>
</tr></table>
.........
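// <>: HTML tags. They carry no value for search and must be removed. Tags usually come in pairs!
[@VM-0-3-centos data]$ mkdir raw_html
[@VM-0-3-centos data]$ ll
total 20
drwxrwxr-x 60 16384 Mar 24 16:49 input //the original HTML documents go here
drwxrwxr-x 2 4096 Mar 24 16:56 raw_html //the clean documents, after tag removal, go here
[@VM-0-3-centos input]$ ls -Rl | grep -E '*.html' | wc -l
8141
Goal: strip the tags from every document and write all of them into the same file. The content of a document must not contain any \n. Documents are separated from each other by \3.
version1:
Like: XXXXXXXXXXXXXXXXX\3YYYYYYYYYYYYYYYYYYYYY\3ZZZZZZZZZZZZZZZZZZZZZZZZZ\3
We use the following scheme instead:
version2: when writing the file, also make sure it will be easy to read back the next time.
Like: title\3content\3url \n title\3content\3url \n title\3content\3url \n ...
This lets getline(ifstream, line) fetch the whole of one document directly: title\3content\3url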
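The tag removal described above can be sketched as a tiny two-state machine: characters inside `<...>` are skipped, everything else is kept. This is a simplified illustration rather than the project's final code. The name `StripTags` is hypothetical, and the sketch assumes that no `>` appears inside an attribute value or a comment.

```cpp
#include <string>

// Copies only the text outside <...> into the result.
// Newlines become spaces so that one document fits on one line.
std::string StripTags(const std::string &html) {
    enum State { kTag, kText };
    State state = kText;
    std::string out;
    for (char c : html) {
        if (state == kText) {
            if (c == '<') {
                state = kTag;  // a tag starts here
            } else {
                out.push_back(c == '\n' ? ' ' : c);
            }
        } else if (c == '>') {
            state = kText;  // the tag ends here
        }
    }
    return out;
}
```

Applied to `<title>Chapter 30. Boost.Process</title>` from the sample above, this returns `Chapter 30. Boost.Process`. Replacing `\n` with a space is what keeps the one-document-per-line format of version2 intact.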

Writing the Parser


// Basic structure of the code:
#include <iostream>
#include <string>
#include <vector>

// A directory that holds all of the HTML pages
const std::string src_path = "data/input/";
// The file that receives the cleaned documents
const std::string output = "data/raw_html/raw.txt";

typedef struct DocInfo{
    std::string title;   // title of the document
    std::string content; // content of the document
    std::string url;     // URL of the document on the official site
}DocInfo_t;

// const &: input
// *: output
// &: input and output
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main()
{
    std::vector<std::string> files_list;
    // Step 1: recursively collect the name of every HTML file, with its path,
    // into files_list so that the files can be read one by one later
    if(!EnumFile(src_path, &files_list)){
        std::cerr << "enum file name error!" << std::endl;
        return 1;
    }
    // Step 2: read the content of each file listed in files_list and parse it
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list, &results)){
        std::cerr << "parse html error" << std::endl;
        return 2;
    }
    // Step 3: write the parsed content of every file to output, using \3 as the separator
    if(!SaveHtml(results, output)){
        std::cerr << "save html error" << std::endl;
        return 3;
    }
    return 0;
}

bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
    return true;
}

bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
    return true;
}

bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
    return true;
}
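One of the things ParseHtml will have to do is fill in the title field of each DocInfo_t. A minimal sketch of that step, assuming the title is whatever sits between `<title>` and `</title>` as in the sample page shown earlier, could look like this. `ExtractTitle` is a hypothetical helper name, not something the skeleton above declares.

```cpp
#include <cstddef>
#include <string>

// Writes the text between <title> and </title> into *title.
// Returns false if either tag is missing or the closing tag comes first.
bool ExtractTitle(const std::string &html, std::string *title) {
    const std::string open_tag = "<title>";
    const std::string close_tag = "</title>";
    std::size_t begin = html.find(open_tag);
    if (begin == std::string::npos) return false;
    begin += open_tag.size();  // move past the opening tag
    std::size_t end = html.find(close_tag, begin);
    if (end == std::string::npos) return false;
    *title = html.substr(begin, end - begin);
    return true;
}
```

The signature follows the same convention as the skeleton: `const &` for input, a pointer for output, and a `bool` to report failure.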

Installing the Boost Development Library


[whb@VM-0-3-centos boost_searcher]$ sudo yum install -y boost-devel //this is the Boost development library

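Goal:

Strip the tags from every document and write all of them into the same file, with each document taking up exactly one line and '\3' separating the title, content, and url within that line.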
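The one-line-per-document format can be sketched as a pair of helpers: one that joins the three fields when saving, and one that splits a line again when loading. `ToLine`, `SplitLine`, and `kSep` are hypothetical names, and the hand-written split is only a stand-in for whatever string utility the project actually uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const char kSep = '\3';  // field separator inside one line

// Joins the three fields into one line: title\3content\3url\n
std::string ToLine(const std::string &title, const std::string &content,
                   const std::string &url) {
    return title + kSep + content + kSep + url + '\n';
}

// Splits one line (without the trailing '\n') back into its fields.
std::vector<std::string> SplitLine(const std::string &line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = line.find(kSep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue after the separator
    }
    return fields;
}
```

Since `std::getline` drops the trailing `\n`, each line it returns can be passed to `SplitLine` as is. `\3` is a control character that should not appear in the visible text of an HTML page, which is what makes it a reasonable separator here.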