Basic RAG Implementation: The Best Place to Start (Part 3)

Part 2, "Basic RAG Implementation: The Best Place to Start (Part 2)", is available on my website: www.shuyixiao.cloud

Running a query against the extracted chunks


import json
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any
import os

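def compute_embeddings(texts: List[str], model: SentenceTransformer) -> Dict[str, np.ndarray]:
    """
    Compute embedding vectors for a list of texts.
    Args:
        texts: list of texts
        model: SentenceTransformer model
    Returns:
        dict mapping each text to its embedding vector
    """
    if not texts:
        return {}
    # Encode all texts in one batched call instead of one call per text
    vectors = model.encode(texts)
    return dict(zip(texts, vectors))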

def semantic_search(query: str, text_chunks: List[str], model: SentenceTransformer, k: int = 2) -> List[str]:
    """
    Run a semantic search and return the chunks most relevant to the query.
    Args:
        query: query text
        text_chunks: list of text chunks
        model: SentenceTransformer model
        k: number of top chunks to return
    Returns:
        list of the k most relevant text chunks
    """
    try:
        # Embed the query
        query_embedding = model.encode(query)
        
        # Embed all text chunks
        chunk_embeddings = compute_embeddings(text_chunks, model)
        
        # Score each chunk by its cosine similarity to the query
        scores = []
        for chunk, embedding in chunk_embeddings.items():
            similarity = np.dot(query_embedding, embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(embedding)
            )
            scores.append((chunk, similarity))
        
        # Sort by similarity, highest first, and return the top k chunks
        scores.sort(key=lambda x: x[1], reverse=True)
        return [chunk for chunk, _ in scores[:k]]
    except Exception as e:
        print(f"执行语义搜索时出错: {e}")
        return []

def main():
    """
    Entry point: load the data and run a semantic search.
    """
    try:
        # Check that the data files exist
        required_files = ['data/val.json', 'data/text_chunks.json']
        for file in required_files:
            if not os.path.exists(file):
                print(f"错误：找不到文件 {file}")
                return

        # Load the validation data
        print("正在加载验证数据...")
        with open('data/val.json', 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        # The validation data must be a non-empty list
        if not isinstance(data, list) or not data:
            print("错误：验证数据格式不正确或为空")
            return
        
        # Load the text chunks
        print("正在加载文本块...")
        with open('data/text_chunks.json', 'r', encoding='utf-8') as f:
            text_chunks = json.load(f)
        
        # Initialize the model
        print("正在初始化模型...")
        model = SentenceTransformer('all-MiniLM-L6-v2')
        
        # Take the first query from the validation data
        query = data[0]['question']
        print("\n查询:", query)
        
        # Run the semantic search
        print("\n正在执行语义搜索...")
        top_chunks = semantic_search(query, text_chunks, model, k=2)
        
        if not top_chunks:
            print("未找到相关结果")
            return
            
        # Print the search results
        print("\n搜索结果:")
        for i, chunk in enumerate(top_chunks, 1):
            print(f"\n上下文 {i}:")
            print(chunk)
            print("=" * 50)
            
    except FileNotFoundError as e:
        print(f"错误：找不到必要的文件 - {e}")
    except json.JSONDecodeError as e:
        print(f"错误：JSON文件格式不正确 - {e}")
    except Exception as e:
        print(f"发生错误：{e}")

if __name__ == "__main__":
    main()


embeddings.json 
{
    "机器学习是人工智能的一个分支，它使计算机系统能够从数据中学习和改进，而无需明确编程。这种方法使计算机能够自动从经验中学习，并随着时间的推移提高其性能。": [0.1, 0.2, 0.3, 0.4, 0.5],
    "深度学习是机器学习的一个子集，它使用多层神经网络来学习数据的层次表示。这些网络可以自动学习特征，而不需要人工特征工程。": [0.2, 0.3, 0.4, 0.5, 0.6],
    "传统机器学习通常需要人工特征工程，而深度学习可以自动学习特征。深度学习在处理大规模数据时特别有效，但需要更多的计算资源。": [0.3, 0.4, 0.5, 0.6, 0.7],
    "机器学习算法可以分为监督学习、无监督学习和强化学习。监督学习使用标记数据进行训练，无监督学习使用未标记数据，而强化学习通过与环境交互来学习。": [0.4, 0.5, 0.6, 0.7, 0.8]
}
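
The toy vectors in embeddings.json can be ranked with the same cosine-similarity formula that semantic_search uses. The following is a minimal pure-Python sketch of that ranking, not the script's own code path: the labels `chunk_1` to `chunk_4` stand in for the long text keys, and the query vector is a hypothetical one, chosen equal to the first vector so the ordering is easy to check by hand.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: List[float],
                embeddings: Dict[str, List[float]],
                k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (key, score) pairs with the highest cosine similarity."""
    scored = [(key, cosine_similarity(query_vec, vec)) for key, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Vectors copied from embeddings.json; the keys are shortened, hypothetical labels
toy_embeddings = {
    "chunk_1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "chunk_2": [0.2, 0.3, 0.4, 0.5, 0.6],
    "chunk_3": [0.3, 0.4, 0.5, 0.6, 0.7],
    "chunk_4": [0.4, 0.5, 0.6, 0.7, 0.8],
}

# Hypothetical query vector (identical to chunk_1 on purpose)
query_vec = [0.1, 0.2, 0.3, 0.4, 0.5]
top = rank_chunks(query_vec, toy_embeddings, k=2)
for key, score in top:
    print(key, round(score, 4))
```

Because the four toy vectors point in almost the same direction, their scores are all close to 1. Real model embeddings spread out far more.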


text_chunks.json
[
    "机器学习是人工智能的一个分支，它使计算机系统能够从数据中学习和改进，而无需明确编程。这种方法使计算机能够自动从经验中学习，并随着时间的推移提高其性能。",
    "深度学习是机器学习的一个子集，它使用多层神经网络来学习数据的层次表示。这些网络可以自动学习特征，而不需要人工特征工程。",
    "传统机器学习通常需要人工特征工程，而深度学习可以自动学习特征。深度学习在处理大规模数据时特别有效，但需要更多的计算资源。",
    "机器学习算法可以分为监督学习、无监督学习和强化学习。监督学习使用标记数据进行训练，无监督学习使用未标记数据，而强化学习通过与环境交互来学习。"
]


val.json
[
    {
      "question": "What is 'Explainable AI' and why is it considered important?",
      "ideal_answer": "Explainable AI (XAI) aims to make AI systems more transparent and understandable, providing insights into how they make decisions. It's considered important for building trust, accountability, and ensuring fairness in AI systems.",
      "reference": "Chapter 5: The Future of Artificial Intelligence - Explainable AI (XAI); Chapter 19: AI and Ethics",
      "has_answer": true,
      "reasoning": "The document directly defines and explains the importance of XAI."
    },
    {
      "question": "Can AI be used to predict earthquakes?",
      "ideal_answer": "I don't have enough information to answer that.",
      "reference": "None",
      "has_answer": false,
      "reasoning": "The document does not mention the use of AI for earthquake prediction."
    },
    {
      "question": "What are some of the ethical concerns related to AI-powered facial recognition?",
      "ideal_answer": "I don't have enough information to answer that.",
      "reference": "None, although related concepts appear in Chapter 4 (Ethical and Societal Implications) and Chapter 2 (Computer Vision)",
      "has_answer": false,
      "reasoning": "While the document discusses ethical concerns about AI *in general* and mentions facial recognition as a *technology*, it doesn't specifically discuss the ethical concerns *of* facial recognition."
    },
    {
      "question": "How does AI contribute to personalized medicine?",
      "ideal_answer": "AI enables personalized medicine by analyzing individual patient data, predicting treatment responses, and tailoring interventions to specific needs. This enhances treatment effectiveness and reduces adverse effects.",
      "reference": "Chapter 11: AI and Healthcare - Personalized Medicine",
      "has_answer": true,
      "reasoning": "The document directly explains AI's role in personalized medicine."
    },
    {
      "question": "Does the document mention any specific companies developing AI technology?",
      "ideal_answer": "I don't have enough information to answer that.",
      "reference": "None",
      "has_answer": false,
      "reasoning": "The document focuses on AI concepts and applications, not specific companies."
    },
    {
      "question": "What is the role of AI in smart grids?",
      "ideal_answer": "AI optimizes energy distribution in smart grids by enabling real-time monitoring, demand response, and integration of distributed energy resources. This enhances grid reliability, reduces energy waste, and supports renewable energy.",
      "reference": "Chapter 5: The Future of Artificial Intelligence - Energy Storage and Grid Management - Smart Grids; Chapter 15",
      "has_answer": true,
      "reasoning": "The document directly describes the function of AI in smart grids."
    },
    {
      "question": "Can AI write a complete, original novel?",
      "ideal_answer": "I don't have enough information to answer that.",
      "reference": "Chapter 9: AI, Creativity, and Innovation - AI in Writing and Content Creation (and potentially Chapter 16)",
      "has_answer": false,
      "reasoning": "The document mentions AI being used for writing and content creation, assisting with research and editing. It does *not* state that AI can write a *complete, original novel* independently."
    },
    {
      "question": "What is a 'cobot'?",
      "ideal_answer": "It mentions collaborative settings (cobots) in industrial robots.",
      "reference": "Chapter 6: AI and Robotics- Types of Robots- Industrial Robots",
      "has_answer": true,
      "reasoning": "The document defines 'cobot'."
    },
    {
      "question": "What is Direct Air Capture (DAC) used for?",
      "ideal_answer": "DAC technology removes CO2 directly from the atmosphere. The captured CO2 can be stored or used in various applications.",
      "reference": "Chapter 5: The Future of Artificial Intelligence - Carbon Capture and Utilization - Direct Air Capture; Chapter 15",
      "has_answer": true,
      "reasoning": "The document directly explains the purpose of Direct Air Capture."
    },
    {
      "question": "Is AI currently being used to control nuclear weapons systems?",
      "ideal_answer": "I don't have enough information to answer that.",
      "reference": "None (although Chapter 4 discusses the weaponization of AI)",
      "has_answer": false,
      "reasoning": "The document discusses the *ethical concerns* about weaponizing AI, but it doesn't state whether AI is *currently* used to control nuclear weapons."
    },
    {
      "question": "什么是机器学习？",
      "answer": "机器学习是人工智能的一个分支，它使计算机系统能够从数据中学习和改进，而无需明确编程。"
    },
    {
      "question": "深度学习与传统机器学习有什么区别？",
      "answer": "深度学习是机器学习的一个子集，它使用多层神经网络来学习数据的层次表示。"
    }
  ]
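
The entries in val.json do not all share one schema: most carry `ideal_answer` and `has_answer`, while the last two carry only `answer`. The script above reads only `question`, so it is unaffected, but any code that compares answers would need to handle both shapes. A small sketch of one way to normalize the records follows. The `normalize_record` helper is hypothetical, and treating a missing `has_answer` as `True` is an assumption made for this example.

```python
import json
from typing import Any, Dict, List


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map a val.json entry onto one shape regardless of which answer key it uses."""
    # Accept either "ideal_answer" or "answer" as the reference answer
    answer = record.get("ideal_answer", record.get("answer", ""))
    return {
        "question": record["question"],
        "ideal_answer": answer,
        # Assumption: entries without "has_answer" are treated as answerable
        "has_answer": record.get("has_answer", True),
    }


# Two entries in the two shapes that appear in val.json
sample: List[Dict[str, Any]] = json.loads("""
[
  {"question": "Can AI be used to predict earthquakes?",
   "ideal_answer": "I don't have enough information to answer that.",
   "has_answer": false},
  {"question": "什么是机器学习？",
   "answer": "机器学习是人工智能的一个分支，它使计算机系统能够从数据中学习和改进，而无需明确编程。"}
]
""")

records = [normalize_record(r) for r in sample]
answerable = [r["question"] for r in records if r["has_answer"]]
print(answerable)
```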

Output


E:\AworkNew2025\all-rag-techniques\venv\Scripts\python.exe E:\AworkNew2025\all-rag-techniques\test\RAG05.py 
正在加载验证数据...
正在加载文本块...
正在初始化模型...

查询: What is 'Explainable AI' and why is it considered important?

正在执行语义搜索...

搜索结果:

上下文 1:
深度学习是机器学习的一个子集，它使用多层神经网络来学习数据的层次表示。这些网络可以自动学习特征，而不需要人工特征工程。
==================================================

上下文 2:
传统机器学习通常需要人工特征工程，而深度学习可以自动学习特征。深度学习在处理大规模数据时特别有效，但需要更多的计算资源。
==================================================

进程已结束，退出代码为 0
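
Note that the query asks about Explainable AI, while text_chunks.json only contains four passages about machine learning and deep learning. Because semantic_search always returns the k highest-scoring chunks, it returns two results even when none of them addresses the question. One possible way to handle this is a minimum-score cutoff, sketched below. The `filter_by_score` helper, the `min_score` value of 0.5, and the scores in the example are all hypothetical; the scores were not measured from the run above, and a usable threshold would have to be tuned for the embedding model in use.

```python
from typing import List, Tuple


def filter_by_score(scored: List[Tuple[str, float]],
                    k: int = 2,
                    min_score: float = 0.5) -> List[str]:
    """Keep at most k chunks, and only those whose similarity reaches min_score."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [chunk for chunk, score in ranked[:k] if score >= min_score]


# Made-up (chunk, score) pairs for illustration only
weak_matches = [("chunk A", 0.12), ("chunk B", 0.08), ("chunk C", 0.05)]
mixed_matches = [("chunk A", 0.81), ("chunk B", 0.33), ("chunk C", 0.64)]

no_hits = filter_by_score(weak_matches)   # nothing clears the threshold
hits = filter_by_score(mixed_matches)     # the two best chunks clear it
print(no_hits, hits)
```

Returning an empty list fits the existing main(), which already prints a "no results" message when top_chunks is empty.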

Generating a response from the retrieved chunks

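The system prompt was improved:
- It is now written in Chinese
- It sets out more detailed requirements for the answer
- It allows an answer based on partial information

The search parameters were tuned:
- More text chunks are retrieved (up from 3 to 5)
- The content of the retrieved chunks is displayed

The generation parameters were improved:
- The more capable qwen-max model is used
- The temperature is raised to 0.7 so that answers are more varied
- The maximum output length is raised to 2000 tokens

The query was refined:
- The question is described in more detail
- The technical areas to cover are stated explicitly

More information is printed:
- The retrieved chunks are shown
- Progress messages are more detailed

With these changes, the code should be able to:
- Find more relevant text
- Generate more detailed and professional answers
- Make better use of the context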
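
The retrieved chunks and the question reach the model as a single user message. The sketch below shows that assembly step in isolation, using the same "Context N / separator / Question" layout as the script that follows. The `build_user_prompt` helper and the exact separator length are illustrative assumptions.

```python
from typing import List

# Visual separator between context blocks (the length is arbitrary)
SEPARATOR = "=" * 37


def build_user_prompt(chunks: List[str], query: str) -> str:
    """Number each retrieved chunk, then append the question at the end."""
    parts = [
        f"Context {i}:\n{chunk}\n{SEPARATOR}\n"
        for i, chunk in enumerate(chunks, 1)
    ]
    return "\n".join(parts) + f"\nQuestion: {query}"


prompt = build_user_prompt(["first chunk", "second chunk"], "What is RAG?")
print(prompt)
```

Numbering the contexts lets the model, and anyone reading the logs, refer back to a specific chunk.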


# Import the required libraries
from dashscope import Generation
import os
from typing import List, Dict
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import PyPDF2
import re
import dashscope

# Read the Alibaba Cloud DashScope API key from the environment instead of hardcoding it
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Define the system prompt (written in Chinese)
system_prompt = """你是一个专业的Java技术面试助手。请基于给定的上下文信息，详细回答用户的问题。
如果上下文中的信息不足以完整回答问题，请尽可能基于已有信息给出部分回答，并说明哪些方面需要补充信息。
回答时请：
1. 保持专业性和准确性
2. 使用清晰的结构和示例
3. 如果可能，提供具体的代码示例
4. 说明每个概念的重要性和应用场景"""

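def load_documents(file_path: str) -> List[str]:
    """
    Load a document and return its content as text chunks. PDF files are supported.

    Args:
    file_path (str): path to the document

    Returns:
    List[str]: list of text chunks
    """
    try:
        if file_path.lower().endswith('.pdf'):
            # Handle PDF files
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                text_chunks = []
                
                # Walk through every page
                for page in pdf_reader.pages:
                    # Guard against pages with no extractable text (e.g. scanned images)
                    text = page.extract_text() or ""
                    # Split the page text into smaller chunks
                    chunks = split_text_into_chunks(text)
                    text_chunks.extend(chunks)
                
                return text_chunks
        else:
            # Handle plain-text files: one chunk per non-empty line
            with open(file_path, 'r', encoding='utf-8') as f:
                return [line.strip() for line in f if line.strip()]
    except Exception as e:
        print(f"读取文档时发生错误: {str(e)}")
        return []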

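def split_text_into_chunks(text: str, chunk_size: int = 1000) -> List[str]:
    """
    Split text into smaller chunks along sentence boundaries.

    Args:
    text (str): text to split
    chunk_size (int): target maximum number of characters per chunk

    Returns:
    List[str]: list of text chunks
    """
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Split into sentences while keeping the punctuation: break right after a CJK
    # sentence ender, and after ASCII . ! ? only when whitespace follows, so that
    # strings such as "v1.4.0" or "tools.jar" stay intact (needs Python 3.7+)
    sentences = re.split(r'(?<=[。！？])|(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = []
    current_size = 0
    
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
            
        # Start a new chunk once adding this sentence would exceed chunk_size
        if current_size + len(sentence) > chunk_size and current_chunk:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_size = 0
            
        current_chunk.append(sentence)
        current_size += len(sentence)
    
    # Flush the last, partially filled chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks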

def generate_response(system_prompt: str, user_message: str, model: str = "qwen-max") -> Dict:
    """
    Generate an AI response from a system prompt and a user message.

    Args:
    system_prompt (str): system prompt that steers the model's behavior
    user_message (str): the user's message or query
    model (str): name of the model to call, qwen-max by default

    Returns:
    The response object returned by DashScope, or None if the call raised an exception
    """
    try:
        response = Generation.call(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message}
            ],
            result_format='message',  # return the result in message format
            temperature=0.7,  # allow some variety in the answer
            max_tokens=2000,  # allow longer output
        )
        return response
    except Exception as e:
        print(f"生成响应时发生错误: {str(e)}")
        return None

def create_embeddings(texts: List[str]) -> np.ndarray:
    """
    Create embedding vectors for a list of texts.

    Args:
    texts (List[str]): list of texts

    Returns:
    np.ndarray: array of embedding vectors, empty on failure
    """
    try:
        model = SentenceTransformer('all-MiniLM-L6-v2')
        return model.encode(texts)
    except Exception as e:
        print(f"创建嵌入向量时发生错误: {str(e)}")
        return np.array([])

def create_faiss_index(embeddings: np.ndarray) -> faiss.Index:
    """
    Build a FAISS index over the embeddings.

    Args:
    embeddings (np.ndarray): array of embedding vectors

    Returns:
    faiss.Index: the FAISS index, or None on failure
    """
    try:
        dimension = embeddings.shape[1]
        # Flat (exact) index that compares vectors by L2 distance
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings.astype('float32'))
        return index
    except Exception as e:
        print(f"创建FAISS索引时发生错误: {str(e)}")
        return None

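def search_similar_chunks(query: str, index: faiss.Index, texts: List[str], k: int = 5) -> List[str]:
    """
    Search for the text chunks most similar to the query.

    Args:
    query (str): query text
    index (faiss.Index): FAISS index
    texts (List[str]): the original texts, in the order they were indexed
    k (int): number of similar chunks to return

    Returns:
    List[str]: list of similar text chunks
    """
    try:
        # Note: this reloads the model on every call; reuse one instance if you search repeatedly
        model = SentenceTransformer('all-MiniLM-L6-v2')
        query_embedding = model.encode([query])
        # Never ask for more neighbours than the index holds; FAISS pads missing results with -1
        k = min(k, index.ntotal)
        distances, indices = index.search(query_embedding.astype('float32'), k)
        return [texts[i] for i in indices[0] if i >= 0]
    except Exception as e:
        print(f"搜索相似文本块时发生错误: {str(e)}")
        return []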

def main():
    # Check that the API key is set
    if not dashscope.api_key:
        print("错误：请设置 dashscope.api_key")
        return

    # Path to the example document
    doc_path = "2888年Java程序员找工作最新场景题.pdf"
    
    # Load the document
    documents = load_documents(doc_path)
    if not documents:
        print("错误：无法加载文档")
        return
    
    print(f"成功加载文档，共 {len(documents)} 个文本块")
    
    # Create the embeddings
    embeddings = create_embeddings(documents)
    if embeddings.size == 0:
        print("错误：无法创建嵌入向量")
        return
    
    # Build the FAISS index
    index = create_faiss_index(embeddings)
    if index is None:
        print("错误：无法创建FAISS索引")
        return
    
    # Example query
    query = "请详细说明Java程序员面试中常见的技术问题和解决方案，包括但不限于：Java基础、多线程、JVM、Spring框架等核心知识点。"
    
    # Search for similar text chunks
    top_chunks = search_similar_chunks(query, index, documents)
    if not top_chunks:
        print("错误：无法找到相似文本块")
        return
    
    print("\n找到的相关文本块：")
    for i, chunk in enumerate(top_chunks, 1):
        print(f"\n文本块 {i}:")
        print(chunk[:200] + "..." if len(chunk) > 200 else chunk)
    
    # Build the user prompt from the retrieved chunks and the question
    user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
    user_prompt = f"{user_prompt}\nQuestion: {query}"
    
    print("\n正在生成回答，请稍候...")
    
    # Generate the AI response
    ai_response = generate_response(system_prompt, user_prompt)
    # A failed call still returns a response object, so check the status before reading the output
    if ai_response is not None and getattr(ai_response, 'status_code', None) == 200:
        print("\nAI回答：")
        print(ai_response.output.choices[0].message.content)
    else:
        print("错误：无法生成AI响应")

if __name__ == "__main__":
    main()
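
The first script in this article ranks chunks by cosine similarity, while this one uses faiss.IndexFlatL2, which ranks by L2 distance. For unit-length vectors the two orderings agree, because the squared L2 distance between a and b equals 2 - 2·cos(a, b). Whether the embeddings are already unit length depends on the model configuration, so the sketch below normalizes explicitly. It is a pure-Python illustration with made-up 3-dimensional vectors and does not use FAISS itself.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors for illustration only
query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.5], [-1.0, 0.5, 0.0])]

# Smallest distance first versus largest cosine first
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
print(by_l2, by_cosine)
```

If the vectors are not normalized, the two rankings can differ, which is worth keeping in mind when comparing results between the two scripts.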

Output of the run


E:\AworkNew2025\all-rag-techniques\venv\Scripts\python.exe E:\AworkNew2025\all-rag-techniques\test\RAG06.py 
成功加载文档，共 465 个文本块

找到的相关文本块：

文本块 1:
33 5 4jinfo jinfo可以用来查看正在运行的java应用程序的扩展参数，包括JavaSystem属性和 JVM命令行参数；也可以动态的修改正在运行的JVM一些参数 1jinfopid

文本块 2:
贝jar包，而且arthas会拖慢应用本身，如果条件不允许，又该如何诊断呢 这边简 单介绍下jdk自带的命令行工具 33 5JVM问题定位命令 在JDK安装目录的bin目录下默认提供了很多有价值的命令行工具 每个小工具体积 基本都比较小，因为这些工具只是jdk\lib\tools jar的简单封装 其中，定位排查问题时最为常用命令包括： jps（进程）、jmap（内存）、jstack（线 程）、j...

文本块 3:
当然，我们也可以使用Linux提供的查询进程状态命令也能快速获取Tomcat服务的 进程id 比如： 1ps-ef|greptomcat 33 5 2jmap jmap（JavaMemoryMap）可以输出所有内存中对象的工具，甚至可以将VM中的 heap，以二进制输 出成文本，使用方式如下： jmap-heap： 1jmap-heappid 输出当前进程JVM堆内存新生代、老年代、持久代、 GC...

文本块 4:
那我们要如何做呢 参照阿里发布的《阿里巴巴Java开发手册v1 4 0（详尽版）》， 我们可以将原先的三层架构细化成下面的样子： 解释一下这个分层架构中的每一层的作用

文本块 5:
Java虚拟机GC日志是用于定位问题重要的日志信息，频繁的GC将导致应用吞吐量 下降、响应时间增加，甚至导致服务不可用 1-XX:+PrintGCDetails-XX:+PrintGCDateStamps-Xloggc:/apps/logs/gc/gc log- 2XX:+UseConcMarkSweepGC 我们可以在Java应用的启动参数中增加-XX:+PrintGCDetails可以输出GC...

正在生成回答，请稍候...

AI回答：
在Java程序员面试中，常见的技术问题主要集中在以下几个方面：Java基础、多线程编程、JVM（Java虚拟机）调优及监控、Spring框架等。下面我将针对这些领域分别介绍一些典型问题及其解决方案。

### 1. Java基础
**常见问题**：
- String, StringBuilder, 和 StringBuffer 的区别。
- == 和 equals() 方法的区别。
- Java 中的异常处理机制。
- final, finally, finalize 的含义与使用场景。
- Java集合框架的理解（List, Set, Map等）。

**解决方案/知识点**：
- 理解String是不可变对象，而StringBuilder和StringBuffer则用于构建可变字符串。其中StringBuffer是线程安全的。
- `==` 比较的是对象引用，而 `equals()` 默认比较也是引用，但通常被重写以比较对象内容。
- 掌握try-catch-finally结构，以及如何自定义异常。
- `final` 可以用来修饰变量、方法或类；`finally` 块总是被执行来清理资源；`finalize()` 是Object类的一个方法，垃圾回收器会在对象销毁前调用它。
- 熟悉不同集合的特点，如ArrayList适合随机访问，LinkedList适合频繁插入删除操作等。

### 2. 多线程
**常见问题**：
- 创建线程的方式有哪些？
- synchronized关键字的作用是什么？
- volatile关键字的作用是什么？
- 如何实现线程间的通信？

**解决方案/知识点**：
- 线程可以通过继承Thread类或者实现Runnable接口来创建。
- `synchronized` 用于控制多个线程对共享资源的安全访问。
- `volatile` 保证了变量的可见性，即当一个线程修改了该变量的值，新值对其他线程来说是可以立即获取到的。
- 使用wait(), notify(), notifyAll()方法可以实现线程间的基本通信。

### 3. JVM
**常见问题**：
- 什么是GC (Garbage Collection)？有哪些类型的GC算法？
- 如何查看和调整JVM参数？
- OutOfMemoryError(OOM)的原因可能有哪些？

**解决方案/知识点**：
- GC负责自动管理内存分配和释放。主要类型包括Serial GC, Parallel GC, CMS, G1等。
- 使用jinfo命令可以查看和修改运行中的JVM配置。例如，`jinfo -flag <name=value> <pid>` 可以动态改变某些JVM设置。
- OOM错误通常由内存泄漏或不合理的堆大小设置引起。合理配置-Xms和-Xmx参数可以帮助缓解这类问题。

### 4. Spring框架
**常见问题**：
- Spring的核心功能是什么？
- IoC容器是什么？DI (Dependency Injection) 的好处是什么？
- AOP (Aspect Oriented Programming) 在Spring中的应用？

**解决方案/知识点**：
- Spring提供了一个轻量级的IoC容器，支持AOP编程模型，并提供了事务管理和数据访问抽象等功能。
- IoC容器管理应用程序组件及其依赖关系。DI允许更灵活地配置对象之间的依赖关系，从而提高代码的可测试性和可维护性。
- AOP允许开发者将横切关注点（如日志记录、安全性检查等）从核心业务逻辑中分离出来，增强模块化程度。

通过以上几个方面的深入理解和准备，应聘者可以在Java技术面试中展现出扎实的基础知识和技术能力。此外，保持对最新技术和最佳实践的关注也是非常重要的。

进程已结束，退出代码为 0