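Building an AI Search Engine from Scratch

1. Why do we need an AI search engine?

Existing search engines return web pages, not the content we actually need.

The value of an AI search engine is that it reads through those pages, extracts the relevant content, organizes it logically, and presents the result in one step. Traditional search engine technology cannot do this, which is why I believe large language models are bound to reshape the search field.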


2. Demo

Let's look at the result first. This is an open-source project, not an ad~


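3. Which AI search engines are already out there?

The two AI search engines I consider the best are Devv and Perplexity.

Neither has open-sourced its code, so I used them as references and, standing on the shoulders of giants, built this open-source AI search engine: github.com/code-moss/c...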


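4. Core code walkthrough

Enough talk. Let the code do the introductions; here is the core flow.

Code location: src/pages/api/query.ts

Step 1: Fetch Google results related to the user's question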

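import axios from "axios";

// Query the Serper API (a Google Search wrapper) and return the organic results.
async function SerperApi(query: string) {
  // Return an empty list (not "") so callers can always call .map() on the result.
  if (!query) return [];

  const res = await axios.post(
    "https://google.serper.dev/search",
    // gl: country code, hl: result language
    { q: query, gl: "cn", hl: "zh-cn" },
    {
      headers: {
        "X-API-KEY": process.env.SERPER_API_KEY,
        "Content-Type": "application/json",
      },
    },
  );

  // Fall back to an empty list if the response has no organic results.
  return res.data.organic ?? [];
}

export default SerperApi;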

Step 2: Pass the Google results and the original question to OpenAI (this is where the prompt comes in)

/**
 * Prompt for the answer.
 * @param {Object[]} serperData The related context data.
 * @returns {string} The system message content.
 */
function generateSystemMessageContent(serperData: any) {
  return `
  You are a large language AI assistant built by CodeMoss AI. You are given a user question, and please write clean, concise and accurate answer to the question. You will be given a set of related contexts to the question, each starting with a reference number like [[citation:x]], where x is a number. Please use the context and cite the context at the end of each sentence if applicable.

  Your answer must be correct, accurate and written by an expert using an unbiased and professional tone. Please limit to 1024 tokens. Do not give any information that is not related to the question, and do not repeat. Say "information is missing on" followed by the related topic, if the given context do not provide sufficient information.  

  Please cite the contexts with the reference numbers, in the format [citation:x]. If a sentence comes from multiple contexts, please list all applicable citations, like [citation:3][citation:5]. Other than code and specific names and citations, your answer must be written in the same language as the question.  
  
  Here are the set of contexts:

  ${serperData.map((c: any, i: number) => `[[citation:${i + 1}]] ${c.snippet}`).join("\n\n")}

  Remember, don't blindly repeat the contexts verbatim. And here is the user question: \n\n`;
}
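
Since the prompt asks the model to cite sources in the form [citation:x], the client can map those markers back to the search results. Below is a minimal sketch of that idea; `extractCitations` is a hypothetical helper of mine, not something taken from the project.

```typescript
// Hypothetical helper: collect the distinct citation numbers found in an answer,
// in ascending order, so the UI can link each one to its search result.
function extractCitations(answer: string): number[] {
  const pattern = /\[citation:(\d+)\]/g;
  const found: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    // Keep each citation number only once.
    if (found.indexOf(n) === -1) found.push(n);
  }
  return found.sort((a, b) => a - b);
}

const cited = extractCitations(
  "Serper wraps Google Search [citation:3][citation:5]. It returns JSON [citation:3].",
);
console.log(cited);
```

Citation numbers are 1-based in the prompt's own example, so subtract 1 before indexing into the results array.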

Step 3: Generate related questions

/**
 * Prompt for the related questions.
 * @param {Object[]} serperData The related context data.
 * @returns {string} The system message content.
 */
function generateRelatedMessageContent(serperData: any) {
  return `
  You are a helpful assistant that helps the user to ask related questions, based on user's original question and the related contexts. Please identify worthwhile topics that can be follow-ups, and write questions no longer than 20 words each. Please make sure that specifics, like events, names, locations, are included in follow up questions so they can be asked standalone. For example, if the original question asks about "the Manhattan project", in the follow up question, do not just say "the project", but use the full name "the Manhattan project". Your related questions must be in the same language as the original question.
  
  Here are the contexts of the question:

  ${serperData.map((c: any) => c.snippet).join("\n\n")}

  Remember, based on the original question and related contexts, suggest three such further questions. Do NOT repeat the original question. Each related question should be no longer than 20 words. Here is the original question:
  `;
}
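
The prompt asks for three follow-up questions but does not pin down the output format, so the model may well reply with a numbered or bulleted list. Here is a rough sketch of how such a reply could be cleaned up before display. `parseRelatedQuestions` and the list-marker pattern are my own assumptions, not part of the project.

```typescript
interface RelatedQuestion {
  question: string;
}

// Hypothetical helper: turn the model's raw reply into clean question objects.
function parseRelatedQuestions(raw: string | null): RelatedQuestion[] {
  if (!raw) return [];
  return raw
    .split("\n")
    // Drop leading list markers such as "1. ", "2) ", "- " or "* ".
    .map((line) => line.trim().replace(/^(\d+[.)]|[-*])\s+/, ""))
    // Skip blank lines between questions.
    .filter((line) => line.length > 0)
    .map((question) => ({ question }));
}

console.log(parseRelatedQuestions("1. What is Serper?\n\n- Who built Devv?"));
```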

Once you peel away the layers, the core really is just these three steps.


5. Full core code

Code location: src/pages/api/query.ts

// Import the required modules
import type { NextApiRequest, NextApiResponse } from "next";
import { Readable } from "stream";
import SerperApi from "../../utils/serper.api";
import OpenAI from "openai";

let MODEL = "gpt-3.5-turbo";

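/**
 * Entry point: handles the API request, fetches context for the user's question,
 * and calls OpenAI to generate the answer and the related questions.
 * @param {NextApiRequest} req The request object.
 * @param {NextApiResponse} res The response object.
 */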
export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse,
) {
  const { query, rid, model } = req.body;

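  // Prefer the model from the request, then the environment, then the default.
  MODEL = model || process.env.CHAT_MODEL || "gpt-3.5-turbo";

  // Set the response headers so the content is streamed to the client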
  res.setHeader("Content-Type", "text/event-stream; charset=utf-8");
  res.setHeader("Transfer-Encoding", "chunked");
  res.setHeader("Access-Control-Allow-Origin", "*");
  res.setHeader("Cache-Control", "no-cache, no-transform");
  res.setHeader("X-Accel-Buffering", "no");
  // Create a Readable stream for the response
  const readable = new Readable({ read() {} });
  readable.pipe(res);

  // Step 1: fetch data related to the user's question
  const serperData = await SerperApi(query);

  const initialPayload = createInitialPayload(query, rid, serperData);
  readable.push(initialPayload);

  // Step 2: send the data to OpenAI
  const openai = initializeOpenAI();
  const stream = await requestOpenAICompletion(openai, query, serperData);

  // Read the OpenAI stream and forward each non-empty text delta
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content;
    if (text) readable.push(text);
  }

  // Step 3: generate related questions
  const relatedQuestions = await generateRelatedQuestions(
    openai,
    query,
    serperData,
  );
  readable.push("\n\n__RELATED_QUESTIONS__\n\n");
  readable.push(JSON.stringify(relatedQuestions));

  readable.push(null); // End the stream
}

/**
 * Initializes the OpenAI client.
 * @returns {OpenAI} The OpenAI client instance.
 */
function initializeOpenAI() {
  return new OpenAI({
    apiKey: process.env.API_KEY,
    baseURL: process.env.BASE_URL,
  });
}

/**
 * Builds the initial payload.
 * @param {string} query The user's query.
 * @param {string} rid The request ID.
 * @param {Object[]} serperData Data returned by SerperApi.
 * @returns {string} The initial payload.
 */
function createInitialPayload(query: string, rid: string, serperData: any) {
  // JSON.stringify escapes quotes and newlines in the query, so the header stays valid JSON.
  const header = JSON.stringify({ query: query.trim(), rid, contexts: serperData });
  return `${header}\n\n__LLM_RESPONSE__\n\n`;
}

/**
 * Requests a streamed answer from OpenAI.
 * @param {OpenAI} openai The OpenAI client instance.
 * @param {string} query The user's query.
 * @param {Object[]} serperData Data returned by SerperApi.
 * @returns {AsyncIterableIterator<any>} The stream of the generated answer.
 */
async function requestOpenAICompletion(
  openai: OpenAI,
  query: string,
  serperData: any,
) {
  return openai.chat.completions.create({
    model: MODEL || "gpt-3.5-turbo",
    messages: createOpenAIMessages(query, serperData, "answer"),
    stream: true,
  });
}

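/**
 * Builds the messages for an OpenAI request from the user's query and the related contexts.
 * @param {string} query The user's query.
 * @param {Object[]} serperData The related context data.
 * @param {string} type "answer" selects the answer prompt; anything else selects the related-questions prompt.
 * @returns {Object[]} The messages for the OpenAI request.
 */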
function createOpenAIMessages(query: string, serperData: any, type: any): any {
  const systemMessageContent =
    type === "answer"
      ? generateSystemMessageContent(serperData)
      : generateRelatedMessageContent(serperData);
  return [
    { role: "system", content: systemMessageContent },
    { role: "user", content: query },
  ];
}

/**
 * Prompt for the answer.
 * @param {Object[]} serperData The related context data.
 * @returns {string} The system message content.
 */
function generateSystemMessageContent(serperData: any) {
  return `
  You are a large language AI assistant built by CodeMoss AI. You are given a user question, and please write clean, concise and accurate answer to the question. You will be given a set of related contexts to the question, each starting with a reference number like [[citation:x]], where x is a number. Please use the context and cite the context at the end of each sentence if applicable.

  Your answer must be correct, accurate and written by an expert using an unbiased and professional tone. Please limit to 1024 tokens. Do not give any information that is not related to the question, and do not repeat. Say "information is missing on" followed by the related topic, if the given context do not provide sufficient information.  

  Please cite the contexts with the reference numbers, in the format [citation:x]. If a sentence comes from multiple contexts, please list all applicable citations, like [citation:3][citation:5]. Other than code and specific names and citations, your answer must be written in the same language as the question.  
  
  Here are the set of contexts:

  ${serperData.map((c: any, i: number) => `[[citation:${i + 1}]] ${c.snippet}`).join("\n\n")}

  Remember, don't blindly repeat the contexts verbatim. And here is the user question: \n\n`;
}

/**
 * Prompt for the related questions.
 * @param {Object[]} serperData The related context data.
 * @returns {string} The system message content.
 */
function generateRelatedMessageContent(serperData: any) {
  return `
  You are a helpful assistant that helps the user to ask related questions, based on user's original question and the related contexts. Please identify worthwhile topics that can be follow-ups, and write questions no longer than 20 words each. Please make sure that specifics, like events, names, locations, are included in follow up questions so they can be asked standalone. For example, if the original question asks about "the Manhattan project", in the follow up question, do not just say "the project", but use the full name "the Manhattan project". Your related questions must be in the same language as the original question.
  
  Here are the contexts of the question:

  ${serperData.map((c: any) => c.snippet).join("\n\n")}

  Remember, based on the original question and related contexts, suggest three such further questions. Do NOT repeat the original question. Each related question should be no longer than 20 words. Here is the original question:
  `;
}

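/**
 * Generates related questions from the user's original query and the related contexts.
 * @param {OpenAI} openai The OpenAI client instance.
 * @param {string} query The user's query.
 * @param {Object[]} serperData The related context data.
 * @returns {Promise<Object[]>} The array of related questions.
 */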
async function generateRelatedQuestions(
  openai: OpenAI,
  query: string,
  serperData: any,
) {
  const chatCompletion = await openai.chat.completions.create({
    model: MODEL,
    messages: createOpenAIMessages(query, serperData, "related"),
  });
  return transformString(chatCompletion.choices[0].message.content);
}

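/**
 * Utility: splits a string into lines and converts them into an array of question objects.
 * @param {string | null} str The string to convert.
 * @returns {Object[]} The array of question objects.
 */
function transformString(str: string | null) {
  return (str ?? "")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines between questions
    .map((line) => ({ question: line }));
}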
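
On the receiving side, the stream produced by the handler has three parts: a JSON header, the answer text after `__LLM_RESPONSE__`, and a JSON array after `__RELATED_QUESTIONS__`. The sketch below shows one way a client could split the text received so far. `parseStream` and its types are hypothetical names of mine, and the sketch assumes the header is valid JSON.

```typescript
// Markers copied from the server code above.
const LLM_MARKER = "\n\n__LLM_RESPONSE__\n\n";
const RELATED_MARKER = "\n\n__RELATED_QUESTIONS__\n\n";

interface StreamHeader {
  query: string;
  rid: string;
  contexts: unknown[];
}

interface ParsedStream {
  header: StreamHeader;
  answer: string;
  related: { question: string }[] | null; // null until the tail has fully arrived
}

// Parse everything received so far; safe to call again after each new chunk.
function parseStream(raw: string): ParsedStream | null {
  const llmAt = raw.indexOf(LLM_MARKER);
  if (llmAt === -1) return null; // header not complete yet

  const header: StreamHeader = JSON.parse(raw.slice(0, llmAt));
  const body = raw.slice(llmAt + LLM_MARKER.length);

  const relatedAt = body.indexOf(RELATED_MARKER);
  if (relatedAt === -1) return { header, answer: body, related: null };

  let related: { question: string }[] | null = null;
  try {
    related = JSON.parse(body.slice(relatedAt + RELATED_MARKER.length));
  } catch {
    // The JSON tail is still incomplete; keep related as null for now.
  }
  return { header, answer: body.slice(0, relatedAt), related };
}
```

Calling this on the accumulated text after every chunk lets the UI show the sources first, then the growing answer, then the related questions. While the second marker is still arriving, the answer may briefly end with a fragment of it.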

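6. Deployment

Open-source repository: github.com/code-moss/c...

Running the project requires two things: an OpenAI key (you can debug with the free keys I posted below, or use your own) and a Serper key (follow the instructions in .env.template to get one for free).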

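# Step 1: clone the repository
git clone https://github.com/code-moss/codemoss-search
cd codemoss-search

# Step 2: copy .env.template to .env
cp .env.template .env

# Step 3: set the OpenAI key and the Serper key in .env

# Step 4: install dependencies
pnpm install

# Step 5: start the dev server
npm run dev

# Step 6: open the app at http://localhost:3000/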
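
If the project starts but every query fails, a missing key is the usual suspect. Here is a small sketch of a startup check. The names `API_KEY` and `SERPER_API_KEY` are the ones the code above reads from `process.env`; treating exactly these two as required is my assumption, and `missingEnvKeys` is a hypothetical helper.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: these two keys are the ones the project cannot run without.
const REQUIRED_KEYS = ["API_KEY", "SERPER_API_KEY"];

// Return the required keys that are unset or blank in the given environment.
function missingEnvKeys(env: Env): string[] {
  return REQUIRED_KEYS.filter((key) => (env[key] ?? "").trim() === "");
}

const missing = missingEnvKeys({ API_KEY: "sk-your-key", SERPER_API_KEY: "" });
if (missing.length > 0) {
  console.log("Missing keys in .env: " + missing.join(", "));
}
```

In the project itself this could be called with `process.env` before the first request is handled.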

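7. Free OpenAI keys

I've gone out of pocket so that everyone can actually run the project, so a free star for the repo isn't too much to ask, right?

Open-source repository: github.com/code-moss/c...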

# 20 OpenAI 3.5 keys, last updated March 11, 2024 (no telling when they will stop working)
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
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
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
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
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
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
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
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
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
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
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
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
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
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
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
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
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
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
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
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

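Off topic

Feel free to share in the comments how you see AI technology at its current stage of development.

For example: are you worried that the technical edge you have built up in your daily work will one day be erased by AI at low cost?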
