RAG Inference Pipeline: The Complete Retrieval-Augmented Generation Flow

This article is based on Ascend CANN and Ascend NPUs, and covers techniques related to the cann-recipes-infer repository.

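RAG (Retrieval-Augmented Generation) turns a single inference request into a pipeline: query the knowledge base first, then generate the answer with the retrieved context. CANN's role in this pipeline is not limited to running the LLM: embedding generation, reranking, and LLM generation all run on the NPU. Only when the whole pipeline stays on the NPU can the data transfers between the CPU and the NPU be avoided.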

Steps of the RAG inference pipeline

# Full RAG flow: every step runs on the NPU

class RAGPipeline:
    """
    The complete path of one RAG request:

    user query → [NPU] embedding model → vector retrieval → [NPU] rerank
               → [NPU] LLM generation → answer
    """
    def __init__(self):
        # All models are loaded onto the NPU.
        # load_model and VectorStoreOnNPU are helpers that this article does not define.
        self.embed_model = load_model("bge-large-zh.om")        # embedding
        self.rerank_model = load_model("bge-reranker-large.om") # rerank
        self.llm = load_model("qwen-14b-chat.om")               # generation

        # Vector store, kept in NPU device memory
        self.vector_store = VectorStoreOnNPU(dim=1024)
    
    def query(self, user_question):
        # Step 1: Embedding, turn the question into a vector on the NPU
        print("Step 1: Embedding...")
        q_embedding = self.embed_model.infer(
            [user_question],
            pooling="cls"
        )  # [1, 1024]

        # Step 2: Vector retrieval, Top-K search on the NPU
        # The vector store lives in device memory, so the search never goes through the CPU
        top_k_docs = self.vector_store.search(
            q_embedding, k=20
        )  # recall 20 documents
        
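        # Step 3: Rerank, fine-grained ranking on the NPU
        rerank_pairs = [(user_question, doc) for doc in top_k_docs]
        rerank_scores = self.rerank_model.infer(rerank_pairs)

        # Keep the 3 highest-scoring documents as context
        # (assumes rerank_scores is a sequence with one score per pair)
        order = sorted(range(len(top_k_docs)), key=lambda i: rerank_scores[i], reverse=True)
        best_docs = [top_k_docs[i] for i in order[:3]]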
        
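        # Step 4: Build the prompt and let the LLM generate
        context = "\n\n".join(best_docs)  # separate documents with a blank line
        prompt = (
            "Answer the question based on the following documents:\n\n"
            f"Documents:\n{context}\n\n"
            f"Question: {user_question}\n"
            "Answer:"
        )

        answer = self.llm.chat(prompt)
        return answer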
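
The document-selection and prompt-building steps can be tried without any NPU. The following is a minimal sketch, assuming the rerank scores arrive as a plain list of floats; `select_top_docs` and `build_prompt` are hypothetical helper names introduced here for illustration and are not part of CANN or cann-recipes-infer.

```python
from typing import List, Sequence


def select_top_docs(docs: Sequence[str], scores: Sequence[float], n: int = 3) -> List[str]:
    """Return the n documents with the highest rerank scores, best first."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # Sort document positions by score, highest first, and keep the first n.
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:n]]


def build_prompt(question: str, docs: Sequence[str]) -> str:
    """Join the context documents with blank lines and append the question."""
    context = "\n\n".join(docs)
    return (
        "Answer the question based on the following documents:\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = ["doc A", "doc B", "doc C", "doc D"]
scores = [0.10, 0.90, 0.40, 0.75]   # made-up rerank scores for the demo

best = select_top_docs(docs, scores, n=3)
prompt = build_prompt("What is CANN?", best)
print(best)     # ['doc B', 'doc D', 'doc C']
print(prompt)
```

Keeping the selection separate from the model call makes it easy to unit-test the ordering logic independently of whatever type the rerank model actually returns.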

Implementing vector retrieval on the NPU

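// Top-K vector retrieval on CANN: the Cube Unit computes cosine similarity
// (simplified pseudocode; the AscendC:: calls below are not exact Ascend C signatures)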

class VectorSearchOnNPU {
    // The vector database lives in NPU device memory
    void* database;     // [num_vectors, dim] FP16
    int num_vectors;
    int dim;

    // Search result (fixed capacity, so k must be <= 20)
    struct SearchResult {
        int indices[20];
        float scores[20];
    };
    
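    SearchResult TopK(float* query, int k) {
        // This sketch does not use a FAISS-style library; a MatMul computes the inner products
        // cos(q, x) = q·x / (|q|·|x|)
        // With the query and the database both L2-normalized, cos = q·x

        // Allocate the output buffer
        // (for 1M vectors this is 4 MB of FP32 scores; a real kernel would
        //  likely keep it in global memory and stage tiles locally)
        AscendC::LocalTensor<float> scores_local;
        AscendC::LocalAlloc(scores_local, num_vectors);

        // The query does not need to be broadcast:
        // a single MatMul yields every inner product
        // q: [1, dim], db^T: [dim, num_vectors] → scores: [1, num_vectors]
        // query_tensor / database_tensor: tensor views of `query` and `database` (setup omitted)
        AscendC::MatMul(scores_local, query_tensor, database_tensor,
                      AscendC::CUBE_MATRIX_TYPE::NORMAL);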
        
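        // Top-K selection: done by hand in this sketch,
        // using tiling plus insertion into a small sorted array
        float top_scores[20];
        int top_indices[20];
        if (k > 20) k = 20;  // the arrays hold at most 20 entries

        // Initialize
        for (int i = 0; i < k; i++) {
            top_scores[i] = -1e10;
            top_indices[i] = -1;
        }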
        
        // Walk the scores tile by tile
        int tile_size = 4096;  // number of scores loaded into local memory per tile
        for (int t = 0; t < num_vectors; t += tile_size) {
            int this_tile = min(tile_size, num_vectors - t);

            // Copy the tile into local memory
            AscendC::LocalTensor<float> tile_scores;
            AscendC::LocalAlloc(tile_scores, this_tile);
            AscendC::DataCopy(tile_scores, scores_local + t, this_tile);

            // Run Top-K on the tile held in local memory
            for (int i = 0; i < this_tile; i++) {
                float val = tile_scores[i];
                // Insertion into the sorted top-k array: K=20, worst case O(num_vectors * 20)
                for (int j = k-1; j >= 0; j--) {
                    if (val > top_scores[j]) {
                        if (j < k-1) {
                            top_scores[j+1] = top_scores[j];
                            top_indices[j+1] = top_indices[j];
                        }
                        top_scores[j] = val;
                        top_indices[j] = t + i;
                    } else {
                        break;
                    }
                }
            }
        }
        
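        // C arrays cannot be brace-initialized from other arrays, so copy element by element
        SearchResult result{};
        for (int i = 0; i < k && i < 20; i++) {
            result.indices[i] = top_indices[i];
            result.scores[i] = top_scores[i];
        }
        return result;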
    }
};
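
A host-side reference is useful for checking that a kernel like the one above returns the right neighbours. The following is a minimal pure-Python sketch of the same idea (inner product on L2-normalized vectors, scanned tile by tile), with a size-k min-heap standing in for the fixed-size sorted array. The names `l2_normalize` and `topk_streaming` are hypothetical, and the data is random; it is meant for small validation sets, not for speed.

```python
import heapq
import math
import random
from typing import List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def topk_streaming(query, database, k: int, tile_size: int = 4096) -> List[Tuple[float, int]]:
    """Exact top-k by inner product, scanning the database tile by tile.

    A min-heap of at most k (score, index) pairs holds the best results so far;
    heap[0] is always the weakest of the current top k.
    """
    heap: List[Tuple[float, int]] = []
    for start in range(0, len(database), tile_size):
        tile = database[start:start + tile_size]
        for offset, vec in enumerate(tile):
            item = (dot(query, vec), start + offset)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)   # drop the weakest, insert the new one
    return sorted(heap, reverse=True)           # best first


random.seed(0)
dim = 16
db = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(2000)]
q = l2_normalize([random.gauss(0, 1) for _ in range(dim)])

result = topk_streaming(q, db, k=20, tile_size=256)
brute = sorted(((dot(q, v), i) for i, v in enumerate(db)), reverse=True)[:20]
print(result == brute)   # True: the tiled scan agrees with a full sort
```

The tile size only changes how the scan is chunked, not the result, so the same check can be repeated with different `tile_size` values to confirm that tile boundaries are handled correctly.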

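The payoff of moving Top-K retrieval onto the NPU is that the embedding result no longer makes two PCIe trips (NPU→CPU→NPU). Retrieval over 1 million vectors takes about 8 ms on the NPU, comparable to FAISS IVF on a CPU, while saving the 3-5 ms of data movement.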
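
A back-of-envelope calculation helps put the 8 ms figure in context. The sketch below assumes a brute-force scan that reads every vector once, with the 1024-dimensional FP16 layout used in the code above; the derived rates are implied by the article's numbers, not measured.

```python
num_vectors = 1_000_000
dim = 1024
bytes_per_value = 2          # FP16 storage, as in the kernel sketch
scan_time_s = 8e-3           # 8 ms, the figure quoted in the text

database_bytes = num_vectors * dim * bytes_per_value
flops = 2 * num_vectors * dim        # one multiply and one add per element
score_bytes = num_vectors * 4        # one FP32 score per vector

print(f"database size      : {database_bytes / 1e9:.2f} GB")                # 2.05 GB
print(f"work per query     : {flops / 1e9:.2f} GFLOP")                      # 2.05 GFLOP
print(f"implied read rate  : {database_bytes / scan_time_s / 1e9:.0f} GB/s")  # 256 GB/s
print(f"implied throughput : {flops / scan_time_s / 1e12:.3f} TFLOP/s")     # 0.256 TFLOP/s
print(f"score buffer       : {score_bytes / 1e6:.1f} MB")                   # 4.0 MB
```

Under these assumptions a single query performs roughly one multiply-add for every two bytes read, so memory traffic is worth watching alongside raw compute when tuning the scan; batching several queries into one MatMul would reuse each database read.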

End-to-end performance of the RAG pipeline

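Stage	Device	Latency	Bottleneck
Embedding	Single NPU	12 ms (4K tokens)	Small model, hard to speed up
Top-20 retrieval	NPU MatMul	8 ms (1M vectors)	Pure compute
Rerank	Single NPU	35 ms (20 pairs)	20 small inferences
LLM generation	Single NPU	280 ms (200 tokens)	Decode phase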
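
The table's figures can be turned into a latency budget with a few lines of arithmetic. This sketch only re-uses the numbers from the table; the per-pair and per-token values assume the stage time is spread evenly, which ignores prefill and batching effects.

```python
stage_latency_ms = {
    "Embedding": 12,
    "Top-20 retrieval": 8,
    "Rerank": 35,
    "LLM generation": 280,
}

total_ms = sum(stage_latency_ms.values())
for stage, ms in stage_latency_ms.items():
    print(f"{stage:<18}{ms:>5} ms  {ms / total_ms:6.1%}")
print(f"{'Total':<18}{total_ms:>5} ms")

# Even split of each stage over its work items (a simplification).
rerank_per_pair_ms = stage_latency_ms["Rerank"] / 20            # 20 (query, doc) pairs
decode_per_token_ms = stage_latency_ms["LLM generation"] / 200  # 200 generated tokens
print(f"rerank per pair : {rerank_per_pair_ms:.2f} ms")
print(f"LLM per token   : {decode_per_token_ms:.2f} ms")
```

LLM generation accounts for roughly 84% of the total, so optimizations to retrieval or rerank move the end-to-end number far less than changes to the decode phase.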

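The all-NPU pipeline totals about 335 ms, and it avoids the 15-20 ms that an Embed→CPU→retrieval→CPU→NPU transfer path would add. The CANN Runtime lets multiple model contexts share device memory: the embedding model and the LLM do not compete for memory, so both can stay resident.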

Reference repositories

RAG inference example

Embedding model adaptation

pyasc NPU programming interface