[Boost Search Engine] Building a Site Search Engine for Boost in Practice

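| Keyword (unique) | Document ID and weight |
|------------|---------------|
| 雷军 (Lei Jun) | Document 1, Document 2 |
| 买 (buy) | Document 1 |
| 四斤 (four jin) | Document 1 |
| 小米 (Xiaomi / millet) | Document 1, Document 2 |
| 四斤小米 (four jin of millet) | Document 1 |
| 小米手机 (Xiaomi phone) | Document 2 |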

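Walking through one lookup:

User enters 小米 -> look it up in the inverted index -> get the document IDs (1, 2) -> look those IDs up in the forward index -> fetch each document's content -> summarize each result as title + content (desc) + url -> build the response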
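The lookup flow above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `Doc`, `ForwardIndex`, `InvertedIndex`, and `Lookup` names are hypothetical, weights are left out, and document IDs start at 0 because they double as vector positions.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct Doc
{
    std::string title;
    std::string content;
    std::string url;
};

// Forward index: the position in the vector is the document ID.
using ForwardIndex = std::vector<Doc>;
// Inverted index: keyword -> IDs of the documents that contain it.
using InvertedIndex = std::unordered_map<std::string, std::vector<std::uint64_t>>;

// Returns the titles of all documents containing `word`, or an empty list.
std::vector<std::string> Lookup(const ForwardIndex &forward,
                                const InvertedIndex &inverted,
                                const std::string &word)
{
    std::vector<std::string> titles;
    auto it = inverted.find(word);
    if (it == inverted.end())
        return titles; // keyword was never indexed

    for (std::uint64_t id : it->second)
    {
        if (id < forward.size()) // ignore IDs that are out of range
            titles.push_back(forward[id].title);
    }
    return titles;
}
```

The inverted index answers "which documents?", and the forward index answers "what is in that document?"; the real modules later in this post add weights and summaries on top of the same two steps.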

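3. Writing the tag-stripping and data-cleaning module: Parser

Boost website: https://www.boost.org/

Download the latest release.

// For now we only need the HTML files under boost_1_88_0/doc/html; the index is built from them.

With the search source now under data, we only need to build paths relative to it.

Stripping tags

  touch parser.cc
// Raw data -> data with tags stripped (an arbitrary excerpt as an example)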
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html> <!-- this is a tag -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Chapter 30. Boost.Process</title>
<link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook
Documentation Subset">
<link rel="up" href="libraries.html" title="Part I. The Boost C++ Libraries
(BoostBook Subset)">
<link rel="prev" href="poly_collection/acknowledgments.html"
title="Acknowledgments">
<link rel="next" href="boost_process/concepts.html" title="Concepts">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084"
alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86"
src="../../boost.png"></td>
<td align="center"><a href="../../index.html">Home</a></td>
<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a>
</td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../more/index.htm">More</a></td>
</tr></table>
.........
// <> : HTML tags. They carry no value for searching and have to be removed. Tags usually come in pairs.

yang@hcss-ecs-5a79:~/boost$ cd data
yang@hcss-ecs-5a79:~/boost/data$ mkdir raw_html
yang@hcss-ecs-5a79:~/boost/data$ ll
drwxrwxr-x 56 yang yang 12288 Apr 24 21:27 input/     // the original HTML documents go here
drwxrwxr-x  2 yang yang  4096 Apr 24 21:37 raw_html/  // the clean, tag-stripped documents go here

yang@hcss-ecs-5a79:~/boost/data$ ls -Rl | grep -E '*.html' | wc -l
8637

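Goal: strip the tags from every document and write all of them into one file. A document's content must not contain any \n.
version 1: documents are separated from each other by \3
e.g. XXXXXXXXXXXXXXXXX\3YYYYYYYYYYYYYYYYYYYYY\3ZZZZZZZZZZZZZZZZZZZZZZZZZ\3

We use the following scheme instead:
version 2: when writing the file, keep in mind that it also has to be easy to read back. Here \3 separates the fields of one document and \n separates documents.
e.g. title\3content\3url \n title\3content\3url \n title\3content\3url \n ...

This way a single getline(ifstream, line) fetches one document's complete record: title\3content\3url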

Writing the parser

Start by laying out the overall code structure.

#include <iostream>
#include <string>
#include <vector>

// Directory that holds all of the HTML pages.
// No trailing '/', so what remains of a file path after this prefix starts with '/'.
const std::string src_path = "data/input";
const std::string output = "data/raw_html/raw.txt";

typedef struct DocInfo
{
    std::string title;   // document title
    std::string content; // document content
    std::string url;     // URL of this document on the official site
}DocInfo_t;

bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main()
{
    std::vector<std::string> files_list;
    // Step 1: recursively collect every HTML file name, with its path, into files_list so the files can be read one by one later
    if(!EnumFile(src_path, &files_list))
    {
        std::cerr << "enum file name error!" << std::endl;
        return 1;
    }
    // Step 2: read each file listed in files_list and parse it
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list, &results))
    {
        std::cerr << "parse html error" << std::endl;
        return 2;
    }
    // Step 3: write the parsed documents to output, one per line, with \3 separating the fields of each document
    if(!SaveHtml(results, output))
    {
        std::cerr << "save html error" << std::endl;
        return 3;
    }

    return 0;
}

Installing the Boost development libraries

We need functionality from the Boost library here.

sudo apt-get install libboost-all-dev

Enumerating file names with Boost

#include <boost/filesystem.hpp>

// Enumerate file names
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
    namespace fs = boost::filesystem;
    fs::path root_path(src_path);

    // If the path does not exist there is no point in going any further
    if(!fs::exists(root_path))
    {
        std::cerr << src_path << " not exists" << std::endl;
        return false;
    }

    // A default-constructed iterator marks the end of the recursive traversal
    fs::recursive_directory_iterator end;
    for(fs::recursive_directory_iterator iter(root_path); iter != end; ++iter)
    {
        // Only regular files are of interest; every HTML page is a regular file
        if(!fs::is_regular_file(*iter))
            continue;
        // Skip files whose extension is not .html
        if(iter->path().extension() != ".html")
            continue;

        std::cout << "debug: " << iter->path().string() << std::endl;
        // At this point the path is a regular file ending in .html.
        // Save the full path in files_list for the parsing step.
        files_list->push_back(iter->path().string());
    }

    return true;
}

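Parsing the HTML

ParseHtml handles each file in three steps, covered in the sections below: extract the title, strip the tags to get the content, and build the URL.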


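Extracting the title

Use std::string::find to locate <title> and </title>. The title is the half-open range between them: it starts right after <title> and ends where </title> begins.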
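A minimal sketch of that title extraction, assuming the page contains a plain lowercase `<title>...</title>` pair with no attributes. `ParseTitle` is a name chosen here for illustration.

```cpp
#include <cstddef>
#include <string>

// Extracts the text between <title> and </title> into *title.
// Returns false if either tag is missing.
bool ParseTitle(const std::string &html, std::string *title)
{
    const std::string open_tag = "<title>";
    const std::string close_tag = "</title>";

    std::size_t begin = html.find(open_tag);
    if (begin == std::string::npos)
        return false;
    begin += open_tag.size(); // first character of the title text

    // Search for the closing tag only after the opening one.
    std::size_t end = html.find(close_tag, begin);
    if (end == std::string::npos)
        return false;

    *title = html.substr(begin, end - begin); // half-open range [begin, end)
    return true;
}
```

Searching for the closing tag from `begin` onward also guarantees `end >= begin`, so the subtraction cannot wrap around.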

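Stripping tags

Extracting the content is essentially tag stripping: only the text outside the tags is kept.

While scanning, a > means the current tag has been fully consumed, and a < means a new tag is starting.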
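That scanning rule maps naturally onto a two-state machine. The following is a sketch rather than the project's actual code: `ParseContent` is an assumed name, the input is assumed to begin with a tag, and a literal `<` or `>` inside text or attribute values is not handled.

```cpp
#include <string>

// Appends everything outside of <...> to *content.
// Newlines become spaces because '\n' is reserved as the record separator.
bool ParseContent(const std::string &html, std::string *content)
{
    enum class State { kTag, kText };
    State state = State::kTag; // assumption: the file starts with a tag

    for (char c : html)
    {
        if (state == State::kTag)
        {
            if (c == '>')
                state = State::kText; // the current tag is finished
        }
        else if (c == '<')
        {
            state = State::kTag; // a new tag begins
        }
        else
        {
            content->push_back(c == '\n' ? ' ' : c);
        }
    }
    return true;
}
```

Because two adjacent tags switch straight from `kText` back to `kTag`, no empty text is emitted between them.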

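Building the URL

Paths in the official Boost documentation correspond directly to paths in the copy we downloaded.

Sample URL on the official site:
https://www.boost.org/doc/libs/1_88_0/doc/html/accumulators.html

Sample path in the downloaded copy: boost_1_88_0/doc/html/accumulators.html

Sample path after copying into our project: data/input/accumulators.html // we copied doc/html/* from the downloaded Boost tree into data/input/

url_head = "https://www.boost.org/doc/libs/1_88_0/doc/html";
url_tail = data/input/accumulators.html with the data/input prefix removed -> url_tail = /accumulators.html

url = url_head + url_tail; the result is the link on the official site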

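static bool ParseUrl(const std::string &file_path, std::string *url)
{
    std::string url_head = "https://www.boost.org/doc/libs/1_88_0/doc/html";
    // Drop the local src_path prefix; what remains is the path below doc/html
    std::string url_tail = file_path.substr(src_path.size());
    // If src_path ends with '/', the tail has lost its leading '/', so restore it
    if(url_tail.empty() || url_tail[0] != '/')
        url_tail.insert(0, "/");

    *url = url_head + url_tail;

    return true;
}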

Writing the parsed results to a file

The records are written in the version 2 format chosen earlier: title\3content\3url\n, one document per line, so that getline(ifstream, line) reads back exactly one document.

#include <fstream>

bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
#define SEP '\3'
    // Open in binary mode so the bytes are written exactly as they are
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if(!out.is_open())
    {
        std::cerr << "open " << output << " failed" << std::endl;
        return false;
    }

    // Write one line per document: title\3content\3url\n
    for(auto &item : results)
    {
        std::string out_string;
        out_string = item.title;
        out_string += SEP;
        out_string += item.content;
        out_string += SEP;
        out_string += item.url;
        out_string += '\n';

        out.write(out_string.c_str(), out_string.size());
    }

    out.close();

    return true;
}

4. Writing the index-building module: Index

Start by putting the basic structure in place.

#pragma once
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

namespace ns_index
{
    struct DocInfo
    {
        std::string title;      // title
        std::string content;    // document content with tags stripped
        std::string url;        // URL of the document on the official site
        uint64_t doc_id;        // document ID
    };

    struct InvertedElem
    {
        uint64_t doc_id;
        std::string word;
        int weight;
    };
    // Posting list
    typedef std::vector<InvertedElem> InvertedList;

    class Index
    {
    public:
        Index()
        {}
        ~Index()
        {}

    public:
        // Look up a document in the forward index by its doc_id
        DocInfo *GetForwardIndex(uint64_t doc_id)
        {
            return nullptr;
        }

        // Look up the posting list for a keyword
        InvertedList *GetInvertedList(const std::string &word)
        {
            return nullptr;
        }

        // Build the forward and inverted indexes from the tag-stripped, formatted documents
        // input: data/raw_html/raw.txt, the output of the parser
        bool BuildIndex(const std::string &input)
        {
            return true; // placeholder until the body is filled in
        }

    private:
        // The forward index is a vector: an element's position is its document ID
        std::vector<DocInfo> forward_index;
        // The inverted index maps each keyword to its posting list (a group of InvertedElem)
        std::unordered_map<std::string, InvertedList> inverted_index;
    };
}

Basic code for building the forward index

DocInfo *BuildForwardIndex(const std::string &line)
{
    // 1. Parse line by splitting the string
    // line -> 3 strings: title, content, url
    std::vector<std::string> results;
    const std::string sep = "\3"; // field separator within a line
    ns_util::StringUtil::CutString(line, &results, sep);
    if(results.size() != 3)
        return nullptr;

    // 2. Fill a DocInfo with the fields
    DocInfo doc;
    doc.title = results[0];   // title
    doc.content = results[1]; // content
    doc.url = results[2];     // url
    // Set the ID before inserting: it is the position doc will occupy in the vector
    doc.doc_id = forward_index.size();
    // 3. Insert into the forward index vector
    forward_index.push_back(std::move(doc)); // doc holds the content of one HTML file

    return &forward_index.back();
}
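The post does not show `ns_util::StringUtil::CutString`, so the following is only a guess at what such a helper could look like, written with the standard library alone. The free function `CutString` below is a hypothetical stand-in with the same argument order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ns_util::StringUtil::CutString:
// splits `target` on every occurrence of `sep` and appends the pieces to *out.
// Empty fields are kept, so "a\3\3c" yields three pieces.
void CutString(const std::string &target, std::vector<std::string> *out,
               const std::string &sep)
{
    if (sep.empty())
    {
        out->push_back(target); // nothing to split on
        return;
    }

    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = target.find(sep, start);
        if (pos == std::string::npos)
        {
            out->push_back(target.substr(start)); // last field
            break;
        }
        out->push_back(target.substr(start, pos - start));
        start = pos + sep.size();
    }
}
```

Keeping empty fields matters for the `results.size() != 3` check above: a document whose title is empty still produces three fields instead of being rejected.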

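Basic code for building the inverted index

// Principle:
struct InvertedElem
{
    uint64_t doc_id;
    std::string word;
    int weight;
};

// Posting list
typedef std::vector<InvertedElem> InvertedList;
// The inverted index maps each keyword to one or more InvertedElem (keyword -> posting list)
std::unordered_map<std::string, InvertedList> inverted_index;

// The document we are given
struct DocInfo
{
    std::string title;   // document title
    std::string content; // document content with tags stripped
    std::string url;     // URL on the official site
    uint64_t doc_id;     // document ID; no need to dwell on it for now
};
// Document:
title: 吃葡萄
content: 吃葡萄不吐葡萄皮
url: http://XXXX
doc_id: 123

From one document's content we produce one or more InvertedElem, each of which goes into a posting list.
Documents are processed one at a time. A document contains many "words", and every one of them must map to the current doc_id.

1. Segment both title and content into words, using jieba
title: 吃/葡萄/吃葡萄 (title_word)
content: 吃/葡萄/不吐/葡萄皮 (content_word)

Relevance between a word and a document is based on word frequency: a word that appears in the title counts as more relevant, a word that appears in the content as less relevant.
2. Count word frequencies
struct word_cnt
{
    title_cnt;
    content_cnt;
}
unordered_map<std::string, word_cnt> word_cnt;
for &word : title_word
{
    word_cnt[word].title_cnt++; // 吃(1)/葡萄(1)/吃葡萄(1)
}
for &word : content_word
{
    word_cnt[word].content_cnt++; // 吃(1)/葡萄(1)/不吐(1)/葡萄皮(1)
}

We now know how many times each word occurs in the title and in the content of the document.
3. Define the relevance ourselves
for &word : word_cnt
{
    // The relationship between one word and document 123. When several words point to the same document, which result is shown first? Relevance decides.
    struct InvertedElem elem;
    elem.doc_id = 123;
    elem.word = word.first;
    elem.weight = 10*word.second.title_cnt + word.second.content_cnt; // relevance
    inverted_index[word.first].push_back(elem);
}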


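// Using jieba: cppjieba
Source: https://gitcode.com/gh_mirrors/cp/cppjieba
Usage: all of the source code lives in the headers under include/cppjieba/*.hpp, so including them is all it takes.

One detail to watch: we have to run this ourselves: cp -rf deps/limonp include/cppjieba/

In other words, limonp must be copied into the cppjieba directory, otherwise the build fails.

// Usage: demo.cpp
#include "inc/cppjieba/Jieba.hpp"
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>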
using namespace std;

const char* const DICT_PATH = "./dict/jieba.dict.utf8";
const char* const HMM_PATH = "./dict/hmm_model.utf8";
const char* const USER_DICT_PATH = "./dict/user.dict.utf8";
const char* const IDF_PATH = "./dict/idf.utf8";
const char* const STOP_WORD_PATH = "./dict/stop_words.utf8";

int main(int argc, char** argv) 
{
    cppjieba::Jieba jieba(DICT_PATH,
    HMM_PATH,
    USER_DICT_PATH,
    IDF_PATH,
    STOP_WORD_PATH);
    vector<string> words;
    string s;
    s = "小明硕士毕业于中国科学院计算所，后在日本京都大学深造";
    cout << s << endl;
    cout << "[demo] CutForSearch" << endl;
    jieba.CutForSearch(s, words);
    cout << limonp::Join(words.begin(), words.end(), "/") << endl;
    return EXIT_SUCCESS;
}

        // requires #include <boost/algorithm/string.hpp> for boost::to_lower
        bool BuildInvertedIndex(const DocInfo &doc)
        {
            // DocInfo(title, content, url, doc_id)
            // word -> posting list
            struct word_cnt
            {
                int title_cnt;
                int content_cnt;
                word_cnt():title_cnt(0),content_cnt(0){}
            };
            std::unordered_map<std::string,word_cnt> word_map; // temporary word-frequency table

            // Segment the title
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title,&title_words);

            // Count title word frequencies
            for(std::string s : title_words)
            {
                boost::to_lower(s); // index in lowercase so that searching can ignore case
                word_map[s].title_cnt++; // operator[] returns the existing entry or creates a new one
            }

            // Segment the document content
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content,&content_words);

            // Count content word frequencies
            for(std::string c : content_words)
            {
                boost::to_lower(c);
                word_map[c].content_cnt++;
            }

        #define X 10
        #define Y 1
            for(auto &word_pair : word_map)
            {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                // A hit in the title weighs X times as much as a hit in the content
                item.weight = X*word_pair.second.title_cnt + Y*word_pair.second.content_cnt;
                InvertedList &inverted_list = inverted_index[word_pair.first];
                inverted_list.push_back(item);
            }

            return true;
        }

5. Writing the search module: Searcher

Start by putting the basic code structure in place.

namespace ns_searcher
{
    class Searcher
    {
    public:
        Searcher() {}
        ~Searcher() {}

    public:
        void InitSearcher(const std::string &input)
        {
            // 1. Get or create the Index object

            // 2. Build the index through the Index object
        }

        void Search(const std::string &query, std::string *json_string)
        {
            // 1. [Segment]: split the query into words the way the searcher requires
            // 2. [Trigger]: look up each word in the index
            // 3. [Merge]: collect the results and sort by relevance (weight), descending
            // 4. [Build]: build a JSON string from the results, using jsoncpp
        }

    private:
        ns_index::Index *index; // the index
    };
}

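Search: 雷军小米 -> segmented into 雷军 and 小米 -> look both up in the inverted index -> two posting lists (Document 1 and Document 2 for 雷军; Document 1 and Document 2 for 小米)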

Installing jsoncpp

sudo apt-get install libjsoncpp-dev

Full implementation

https://gitee.com/uyeonashi/boost

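Getting the summary

When defining the start and end positions, watch out for comparisons that mix size_t and int; they are an easy source of bugs.

Because size_t is unsigned, careless use (for example, arithmetic whose result would be negative) produces surprising results: assigning a negative value to a size_t turns it into a very large positive number.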
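One way to avoid the pitfall is to stay in unsigned arithmetic and compare before subtracting. The sketch below illustrates that idea only; `Window` and its parameters are hypothetical and are not the function used later in this post, which works with int instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: returns up to `prev` bytes before `pos` and up to
// `next` bytes from `pos` onward, using only unsigned arithmetic.
std::string Window(const std::string &text, std::size_t pos,
                   std::size_t prev, std::size_t next)
{
    if (pos > text.size())
        pos = text.size(); // clamp an out-of-range position

    // pos - prev would wrap around to a huge value when pos < prev,
    // so compare first and subtract only when it is safe.
    std::size_t start = (pos > prev) ? pos - prev : 0;

    // text.size() - pos cannot wrap because pos <= text.size() here.
    std::size_t end = pos + std::min(next, text.size() - pos);

    return text.substr(start, end - start);
}
```

For instance, `Window("0123456789", 2, 5, 3)` starts at 0 rather than at the wrapped-around value of `2 - 5`.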

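Code

// requires <algorithm> (std::search), <cctype> (std::tolower), <iterator> (std::distance)
std::string GetDesc(const std::string &html_content, const std::string &word)
{
    // Find the first occurrence of word in html_content, then take 50 bytes before it (or start at the beginning if there are fewer)
    // and 100 bytes after it (or stop at the end if there are fewer), and cut out that part
    const int prev_step = 50;
    const int next_step = 100;
    // 1. Find the first occurrence
    // std::string::find is case-sensitive and would miss a match that differs only in case, so use a case-insensitive std::search
    // The parameters are unsigned char because passing a negative char value to std::tolower is undefined behavior
    auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(), [](unsigned char x, unsigned char y)
                            { return (std::tolower(x) == std::tolower(y)); });
    if (iter == html_content.end())
    {
        return "None1";
    }
    int pos = std::distance(html_content.begin(), iter);

    // 2. Compute start and end as int; std::size_t is unsigned and would wrap around in the subtractions below
    int start = 0;
    int end = html_content.size(); // one past the last byte
    // If there are more than 50 bytes before pos, move start forward
    if (pos > start + prev_step)
        start = pos - prev_step;
    // If there are more than 100 bytes after pos, pull end back
    if (pos < end - next_step)
        end = pos + next_step;

    // 3. Cut out the substring and return it
    if (start >= end)
        return "None2";
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}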

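For example, when we search for 你是一个好人 ("you are a good person"), jieba splits the query into 你/一个/好人/一个好人. In the inverted index several of those words may point to the same document, so the same result would show up more than once.

So the results have to be deduplicated. How do we decide that two hits are the same? Just compare their document IDs. The weight has to be adjusted as well: the weights of all hits on the same document are summed, and the sum becomes that document's actual weight.

Code after deduplication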

#pragma once

#include "index.hpp"
#include "util.hpp"
#include "log.hpp"
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <jsoncpp/json/json.h>

namespace ns_searcher
{

    struct InvertedElemPrint
    {
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint() : doc_id(0), weight(0) {}
    };

    class Searcher
    {
    private:
        ns_index::Index *index; // the index used for lookups
    public:
        Searcher() {}
        ~Searcher() {}

    public:
        void InitSearcher(const std::string &input)
        {
            // 1. Get or create the Index object
            index = ns_index::Index::GetInstance();
            std::cout << "Got the Index singleton..." << std::endl;
            // LOG(NORMAL, "Got the Index singleton...");
            // 2. Build the index through the Index object
            index->BuildIndex(input);
            std::cout << "Forward and inverted indexes built..." << std::endl;
            // LOG(NORMAL, "Forward and inverted indexes built...");
        }
        // query: the search keywords
        // json_string: the search results returned to the user's browser
        void Search(const std::string &query, std::string *json_string)
        {
            // 1. [Segment]: split the query into words the way the searcher requires
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(query, &words);
            // 2. [Trigger]: look up each word in the index. The index is built ignoring case, so the query words are lowercased too
            // ns_index::InvertedList inverted_list_all; // holds InvertedElem
            std::vector<InvertedElemPrint> inverted_list_all;

            std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;

            for (std::string word : words)
            {
                boost::to_lower(word);

                ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
                if (nullptr == inverted_list)
                {
                    continue;
                }
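                // Weakness of simply concatenating the lists: 你/是/一个/好人 can all hit the same document (say, document 100), which would then be listed once per word
                // inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
                for (const auto &elem : *inverted_list)
                {
                    auto &item = tokens_map[elem.doc_id]; // operator[]: returns the existing entry or creates a new one
                    // item is always the print node for this doc_id
                    item.doc_id = elem.doc_id;
                    item.weight += elem.weight;
                    item.words.push_back(elem.word);
                }
            }
            // item is non-const here so that std::move really moves instead of copying
            for (auto &item : tokens_map)
            {
                inverted_list_all.push_back(std::move(item.second));
            }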

            // 3. [Merge and sort]: collect the results and sort by relevance (weight), descending
            std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                      [](const InvertedElemPrint &e1, const InvertedElemPrint &e2)
                      {
                          return e1.weight > e2.weight;
                      });
            // 4. [Build]: build a JSON string from the results; jsoncpp handles serialization and deserialization
            Json::Value root;
            for (auto &item : inverted_list_all)
            {
                ns_index::DocInfo *doc = index->GetForwardIndex(item.doc_id);
                if (nullptr == doc)
                {
                    continue;
                }
                Json::Value elem;
                elem["title"] = doc->title;
                elem["desc"] = GetDesc(doc->content, item.words[0]); // content is the whole tag-stripped document; we only want a short excerpt around the matched word
                elem["url"] = doc->url;
                // for debugging; remove later
                elem["id"] = (int)item.doc_id;
                elem["weight"] = item.weight;

                root.append(elem);
            }

            Json::StyledWriter writer;
            // Json::FastWriter writer;
            *json_string = writer.write(root);
        }

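        std::string GetDesc(const std::string &html_content, const std::string &word)
        {
            // Find the first occurrence of word in html_content, then take 50 bytes before it (or start at the beginning if there are fewer)
            // and 100 bytes after it (or stop at the end if there are fewer), and cut out that part
            const int prev_step = 50;
            const int next_step = 100;
            // 1. Find the first occurrence
            // std::string::find is case-sensitive and would miss a match that differs only in case, so use a case-insensitive std::search
            // The parameters are unsigned char because passing a negative char value to std::tolower is undefined behavior
            auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(), [](unsigned char x, unsigned char y)
                                    { return (std::tolower(x) == std::tolower(y)); });
            if (iter == html_content.end())
            {
                return "None1";
            }
            int pos = std::distance(html_content.begin(), iter);

            // 2. Compute start and end as int; std::size_t is unsigned and would wrap around in the subtractions below
            int start = 0;
            int end = html_content.size(); // one past the last byte
            // If there are more than 50 bytes before pos, move start forward
            if (pos > start + prev_step)
                start = pos - prev_step;
            // If there are more than 100 bytes after pos, pull end back
            if (pos < end - next_step)
                end = pos + next_step;

            // 3. Cut out the substring and return it
            if (start >= end)
                return "None2";
            std::string desc = html_content.substr(start, end - start);
            desc += "...";
            return desc;
        }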
    };
}

Testing

It is best to test each module as soon as it is written; otherwise the bugs pile up at the end.

Test code

#include "searcher.hpp"
#include <iostream>
#include <string>

const std::string input = "data/raw_html/raw.txt";

int main()
{
    ns_searcher::Searcher search;
    search.InitSearcher(input);

    std::string query;
    std::string json_string;
    while (true)
    {
        std::cout << "Please Enter Your Search Query# ";
        // std::getline drops the trailing newline; stop on EOF or a read error
        if (!std::getline(std::cin, query))
            break;
        search.Search(query, &json_string);
        std::cout << json_string << std::endl;
    }

    return 0;
}

Compile, run, and check the output.

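Writing the http_server module

There is no need to reinvent the wheel here: the open-source cpp-httplib library is enough to set up the network communication.
All we need is to know its basic interface.
cpp-httplib: https://gitee.com/royzou/cpp-httplib
Note: cpp-httplib requires a reasonably recent version of gcc.

Basic usage test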

#include"httplib.h"
 
int main()
{
    httplib::Server svr;
    
    svr.Get("/hi", [](const httplib::Request &req, httplib::Response &rsp){
        rsp.set_content("你好,世界!", "text/plain; charset=utf-8");
        });

        svr.listen("0.0.0.0",8081);
    return 0;
}

Writing a simple front end for testing

#include"httplib.h"

const std::string root_path = "./wwwroot";

int main()
{
    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());
    
    svr.Get("/hi", [](const httplib::Request &req, httplib::Response &rsp){
        rsp.set_content("你好,世界!", "text/plain; charset=utf-8");
        });
        svr.listen("0.0.0.0",8081);

    return 0;
}


<!DOCTYPE html>

<html>
<head>
    <meta charset="UTF-8">
    <title>for test</title>
</head>
<body>
    <h1>你好,世界</h1>
    <p>这是一个测试网页</p>
</body>
</html>

#include "httplib.h"
#include "searcher.hpp"

const std::string input = "data/raw_html/raw.txt";
const std::string root_path = "./wwwroot";

int main()
{
    ns_searcher::Searcher search;
    search.InitSearcher(input);

    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());

    svr.Get("/s", [&search](const httplib::Request &req, httplib::Response &rsp){
            if(!req.has_param("word"))
            {
                rsp.set_content("A search keyword is required!", "text/plain; charset=utf-8");
                return;
            }
            std::string word = req.get_param_value("word");
            std::cout << "User is searching for: " << word << std::endl;
            std::string json_string;
            search.Search(word, &json_string);
            rsp.set_content(json_string, "application/json");
        });
    svr.listen("0.0.0.0", 8081);

    return 0;
}

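Writing a simple logger

#pragma once

#include <iostream>
#include <string>
#include <ctime>

#define NORMAL  1
#define WARNING 2
#define DEBUG   3
#define FATAL   4

// #LEVEL turns the level name into a string, so LOG(NORMAL, ...) prints "NORMAL"
#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__)

// inline, so that including this header from more than one source file does not cause multiple-definition link errors
inline void log(const std::string &level, const std::string &message, const std::string &file, int line)
{
    std::cout << "[" << level << "]" << "[" << time(nullptr) << "]" << "[" << message << "]" << "[" << file << " : " << line << "]" << std::endl;
}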

Writing the front-end module

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>

    <title>Boost Search Engine</title>
    <style>
        /* Remove every default margin and padding on the page (HTML box model) */
        * {
            /* outer margin */
            margin: 0;
            /* inner padding */
            padding: 0;
        }

        /* Make the body fill 100% of the html element */
        html,
        body {
            height: 100%;
        }

        /* Class selector: .container */
        .container {
            /* width of the div */
            width: 800px;
            /* center horizontally through the margins */
            margin: 0px auto;
            /* top margin, to keep some distance from the top of the page */
            margin-top: 15px;
        }

        /* Descendant selector: .search inside .container */
        .container .search {
            /* same width as the parent */
            width: 100%;
            /* height of 52px */
            height: 52px;
        }

        /* Select the input element first, then set its properties; input is a tag selector */
        /* The height set on input does not include its border */
        .container .search input {
            /* float left */
            float: left;
            width: 600px;
            height: 50px;
            /* border: width, style, color */
            border: 1px solid black;
            /* remove the right border of the input box */
            border-right: none;
            /* left padding, so the text does not touch the left border */
            padding-left: 10px;
            /* color and size of the text inside the input */
            color: #000be2;
            font-size: 14px;
        }

        /* Select the button element first, then set its properties; button is a tag selector */
        .container .search button {
            /* float left */
            float: left;
            width: 150px;
            height: 52px;
            /* button background color, #4e6ef2 */
            background-color: #4e6ef2;
            /* button text color */
            color: #FFF;
            /* font size */
            font-size: 19px;
            font-family: Georgia, 'Times New Roman', Times, serif;
        }

        .container .result {
            width: 100%;
        }

        .container .result .item {
            margin-top: 15px;
        }

        .container .result .item a {
            /* block-level element, on a line of its own */
            display: block;
            /* remove the underline from the link */
            text-decoration: none;
            /* font size of the link text */
            font-size: 20px;
            /* font color */
            color: #4e6ef2;
        }

        .container .result .item a:hover {
            text-decoration: underline;
        }

        .container .result .item p {
            margin-top: 5px;
            font-size: 16px;
            font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
        }

        .container .result .item i {
            /* block-level element, on a line of its own */
            display: block;
            /* cancel the italic style */
            font-style: normal;
            color: rgba(103, 6, 193, 0.838);
        }
    </style>
</head>

<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords">
            <button onclick="Search()">Search</button>
        </div>
        <div class="result">
            <!-- page content is generated dynamically -->
        </div>
    </div>
    <script>
        function Search() {
            // alert() would show a browser pop-up
            // alert("hello js!");
            // 1. Read the input; $ is simply an alias for jQuery
            let query = $(".container .search input").val();
            console.log("query = " + query); // console is the browser's developer console, handy for inspecting JS data

            // 2. Send the HTTP request; $.ajax is jQuery's function for exchanging data with the back end
            $.ajax({
                type: "GET",
                // encodeURIComponent keeps spaces, &, and non-ASCII characters in the query from breaking the URL
                url: "/s?word=" + encodeURIComponent(query),
                success: function (data) {
                    console.log(data);
                    BuildHtml(data);
                }
            });
        }

        function BuildHtml(data) {
            // Get the result element from the page
            let result_label = $(".container .result");
            // Clear the previous search results
            result_label.empty();

            for (let elem of data) {
                // console.log(elem.title);
                // console.log(elem.url);
                let a_label = $("<a>", {
                    text: elem.title,
                    href: elem.url,
                    // open in a new page
                    target: "_blank"
                });
                let p_label = $("<p>", {
                    text: elem.desc
                });
                let i_label = $("<i>", {
                    text: elem.url
                });
                let div_label = $("<div>", {
                    class: "item"
                });
                a_label.appendTo(div_label);
                p_label.appendTo(div_label);
                i_label.appendTo(div_label);
                div_label.appendTo(result_label);
            }
        }
    </script>
</body>

</html>

Finished product

Search demo

OK, it works quite well. That's all for this post!