Core LLM System Design: Why Must Context Be Compressed, and What Are the Engineering Strategies?

Why context compression is needed

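The input we finally send to a large model (the context window) has a size limit, so there is only so much information we can fit in. For example, gpt-4o has a context window limit of 128k.

k is a unit of quantity: 1k = 1,000 tokens.

As a rough rule of thumb, 1 token is about 1 to 1.5 Chinese characters or about 4 English characters; the exact ratio depends on the tokenizer.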
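The rule of thumb above can be turned into a quick estimator. This is only a sketch: `estimate_tokens` is a hypothetical helper, it assumes about 1 token per Chinese character (the conservative end of the 1 to 1.5 range) and about 4 other characters per token, and it is not a substitute for a real tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate from the rule of thumb (not a real tokenizer)."""
    # Count CJK ideographs separately from all other characters.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    # Assumption: ~1 token per Chinese character, ~4 other characters per token.
    return cjk + math.ceil(other / 4)


print(estimate_tokens("hello world!"))  # 12 characters -> 3
print(estimate_tokens("你好，世界"))      # 4 ideographs + 1 punctuation mark -> 5
```

An estimate like this is good enough for deciding when to trigger compression; for billing or hard limits, count with the model's actual tokenizer.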

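Some claims online say that context compression exists to reduce the tokens consumed per request and therefore save money. That is not the main reason. Compression itself costs money: either you send the current history to a model to be summarized (that call is billed), or you store it in a vector database (storage and every embedding call are billed too).

Context compression is mainly about increasing information density: keeping the most important information inside a limited context window so that the conversation can continue and the model's reasoning quality improves. Saving money is only a side effect, not the main goal.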

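The context window is roughly made up of: conversation history + system prompt + RAG content (if any) + user input + model output

When we talk about compressing context, we mean compressing the conversation history in the context window.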

How to compress context

The conversation history is a very important part of the context window, and compressing context mostly means compressing that history.

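Method 1: Summary compression

That is: take the whole history, send it through a large model in a separate call, and have the model compress it into a summary that keeps the important information. Then splice the summary back into the context window in place of the history.

When to trigger: before each call to the model, we usually use a library or a custom function to check whether the current context window exceeds the limit. Once it does, we run the summary compression logic and then continue with the current question.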

Example

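Original history:  
User: Hi, I'd like to learn Python today.  
AI: Sure, do you want the basics or advanced topics?  
User: The basics.  
AI: You can start with variables, loops, and functions.  
...  
  
After compression:  
"The user wants to learn Python basics, including variables, loops, and functions."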

JavaScript example

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

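// Conversation history
let history = [];

// Assume the model's context window is 4,000 tokens
const MAX_TOKENS = 4000;

/**
 * Rough token estimate.
 * Production code usually uses a tokenizer library such as tiktoken.
 */
function estimateTokens(messages) {
  return Math.ceil(JSON.stringify(messages).length / 4);
}

/**
 * Ask a model to summarize the given messages.
 * A cheap, small model is recommended here.
 */
async function summarizeHistory(messages) {

  // Join the messages into plain text
  const historyText = messages
    .map(msg => `${msg.role}: ${msg.content}`)
    .join("\n");

  const response = await client.responses.create({
    model: "gpt-4.1-mini", // a small model makes summarizing cheaper
    input: `Summarize the following conversation, keeping only key facts and context:\n${historyText}`
  });

  return response.output_text;
}

/**
 * Core chat function
 */
async function chat(userInput) {

  // 1. Add the user input to the history
  const userMessage = { role: "user", content: userInput };
  history.push(userMessage);

  // 2. Check whether the current tokens exceed the context window
  if (estimateTokens(history) > MAX_TOKENS) {

    console.log("⚠️ Context too long, compressing history");

    // 3. Summarize everything before the latest user message
    const summary = await summarizeHistory(history.slice(0, -1));

    // 4. Replace the old history with the summary, keeping the latest
    //    user message so the model still sees the current question
    history = [
      { role: "system", content: `Summary of the earlier conversation: ${summary}` },
      userMessage
    ];
  }

  // 5. Call the main model to answer
  const response = await client.responses.create({
    model: "gpt-4.1",
    input: history
  });

  const reply = response.output_text;

  // 6. Save the model's reply
  history.push({
    role: "assistant",
    content: reply
  });

  return reply;
}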

Python example

from openai import OpenAI
import json

client = OpenAI()

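# Conversation history
history = []

# Assumed maximum token count
MAX_TOKENS = 4000


def estimate_tokens(messages):
    """
    Rough token estimate.
    Real projects should use tiktoken instead.
    """
    # ensure_ascii=False keeps non-ASCII text from being inflated into \uXXXX escapes
    return len(json.dumps(messages, ensure_ascii=False)) / 4


def summarize_history(messages):
    """
    Summarize the given messages with a small model
    """

    history_text = "\n".join(
        [f"{m['role']}: {m['content']}" for m in messages]
    )

    res = client.responses.create(
        model="gpt-4.1-mini",  # a small model is cheaper
        input=f"Summarize the following conversation, keeping only the key context:\n{history_text}"
    )

    return res.output_text


def chat(user_input):

    global history

    # 1. Save the user input
    user_message = {
        "role": "user",
        "content": user_input
    }
    history.append(user_message)

    # 2. Check whether the context window is exceeded
    if estimate_tokens(history) > MAX_TOKENS:

        print("⚠️ Context too long, triggering summary compression")

        # 3. Summarize everything before the latest user message
        summary = summarize_history(history[:-1])

        # 4. Replace the old history with the summary, keeping the latest
        #    user message so the model still sees the current question
        history = [
            {"role": "system", "content": f"Summary of the earlier conversation: {summary}"},
            user_message,
        ]

    # 5. Call the main model to generate the answer
    res = client.responses.create(
        model="gpt-4.1",
        input=history
    )

    reply = res.output_text

    # 6. Save the AI reply
    history.append({
        "role": "assistant",
        "content": reply
    })

    return reply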

Java example

import java.net.URI;
import java.net.http.*;
import java.util.*;

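public class ChatMemory {

    // Conversation history
    static List<String> history = new ArrayList<>();

    // Simplified context window limit (in characters, not tokens)
    static int MAX_SIZE = 4000;

    /**
     * Check whether the current context is too long
     */
    static boolean exceedLimit() {
        return history.toString().length() > MAX_SIZE;
    }

    /**
     * Escape text so it can be embedded in a JSON string literal
     */
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
    }

    /**
     * Call a model to generate the summary.
     * Usually a cheaper, small model is used.
     * Note: this returns the raw JSON response body; real code should
     * parse it with a JSON library and extract the output text.
     */
    static String summarize() throws Exception {

        String historyText = String.join("\n", history);

        String body = """
        {
          "model": "gpt-4.1-mini",
          "input": "Summarize the following conversation, keeping only the key context: %s"
        }
        """.formatted(jsonEscape(historyText));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/responses"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpClient client = HttpClient.newHttpClient();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        return response.body();
    }

    static void chat(String userInput) throws Exception {

        // 1. Save the user input
        history.add("user:" + userInput);

        // 2. Check whether the context limit is exceeded
        if (exceedLimit()) {

            System.out.println("⚠️ Context too long, compressing");

            // 3. Summarize everything before the latest user message
            String latest = history.remove(history.size() - 1);
            String summary = summarize();

            // 4. Replace the old history with the summary, keeping the
            //    latest user message so the current question is not lost
            history.clear();
            history.add("system:history summary:" + summary);
            history.add(latest);
        }

        // 5. Call the model to answer (omitted here)
    }
}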

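Pros and cons of this approach

Pros:
- Simple and intuitive
Cons:
- The summary may drop some details

This approach is simple and blunt, and only suits chatbots with fairly simple business logic.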

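Method 2: Semantic vector compression

The first method compresses the current history directly, but when the history is really long its drawback becomes obvious: each round of summarization loses more content. Semantic vector compression addresses this problem.

Semantic vector compression stores the conversation history directly in a vector database, then on every later turn runs a vector search over it and splices the relevant content into the context window.

The steps are:

Embed the conversation history and store it in a vector database.
On a new turn, retrieve only the most relevant entries and assemble them into the new prompt.

Flow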

Conversation history
   ↓
Embedding
   ↓
Store in vector DB
   ↓
New user question → Embedding
   ↓
Vector similarity search
   ↓
Take the top-K relevant entries
   ↓
Splice into the context window
   ↓
Call the LLM to answer

JavaScript example

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// Simulated vector database
const vectorDB = [];

/**
 * Cosine similarity of two vectors
 */
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;

  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/**
 * Convert text into an embedding vector
 */
async function embedding(text) {

  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });

  return res.data[0].embedding;
}

/**
 * Store a piece of conversation history
 */
async function saveMemory(text) {

  const vector = await embedding(text);

  vectorDB.push({
    text,
    vector
  });
}

/**
 * Retrieve the most relevant history from the vector database
 */
async function searchMemory(query) {

  const queryVector = await embedding(query);

  const scored = vectorDB.map(item => ({
    text: item.text,
    score: cosineSimilarity(queryVector, item.vector)
  }));

  // Sort by similarity, highest first
  scored.sort((a, b) => b.score - a.score);

  // Take the top 3
  return scored.slice(0, 3).map(i => i.text);
}

/**
 * Chat function
 */
async function chat(userInput) {

  // 1. Find the most relevant history in the vector database
  const memories = await searchMemory(userInput);

  // 2. Splice it into the prompt
  const context = memories.join("\n");

  const res = await client.responses.create({
    model: "gpt-4.1",
    input: `
Relevant history:
${context}

User question:
${userInput}
`
  });

  const reply = res.output_text;

  // 3. Save the new exchange to the vector database
  await saveMemory(`user:${userInput}`);
  await saveMemory(`assistant:${reply}`);

  return reply;
}

Python example

from openai import OpenAI
import numpy as np

client = OpenAI()

# Simulated vector database
vector_db = []


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def embedding(text):
    """
    Convert text into a vector
    """
    res = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )

    return res.data[0].embedding


def save_memory(text):
    """
    Save a piece of history to the vector database
    """
    vector = embedding(text)

    vector_db.append({
        "text": text,
        "vector": vector
    })


def search_memory(query):
    """
    Retrieve the history most relevant to the user's question
    """
    query_vector = embedding(query)

    scored = []

    for item in vector_db:
        score = cosine_similarity(query_vector, item["vector"])
        scored.append((score, item["text"]))

    # Sort by similarity only, highest first
    scored.sort(key=lambda x: x[0], reverse=True)

    # Take the top 3
    return [x[1] for x in scored[:3]]


def chat(user_input):

    memories = search_memory(user_input)

    context = "\n".join(memories)

    res = client.responses.create(
        model="gpt-4.1",
        input=f"""
Relevant history:
{context}

User question:
{user_input}
"""
    )

    reply = res.output_text

    # Save the new exchange
    save_memory(f"user:{user_input}")
    save_memory(f"assistant:{reply}")

    return reply

Java example

import java.util.*;

public class VectorMemory {

    static class Memory {
        String text;
        float[] vector;

        Memory(String text, float[] vector) {
            this.text = text;
            this.vector = vector;
        }
    }

    // Simulated vector database
    static List<Memory> vectorDB = new ArrayList<>();

    /**
     * Cosine similarity of two vectors
     */
    static double cosineSimilarity(float[] a, float[] b) {

        double dot = 0;
        double normA = 0;
        double normB = 0;

        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Search for the most relevant history
     */
    static List<String> search(float[] queryVector) {

        // Sort a copy so the stored order of the database is left untouched
        List<Memory> ranked = new ArrayList<>(vectorDB);

        ranked.sort((a, b) ->
            Double.compare(
                cosineSimilarity(queryVector, b.vector),
                cosineSimilarity(queryVector, a.vector)
            )
        );

        List<String> result = new ArrayList<>();

        // Take the top 3
        for (int i = 0; i < Math.min(3, ranked.size()); i++) {
            result.add(ranked.get(i).text);
        }

        return result;
    }
}

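Pros and cons of this approach

Pros:
- Supports long-term memory
- Can retrieve the most relevant context
Cons:
- Requires an extra database
- The model does not know the running context of the chat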

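Method 3: Layered memory

This is probably the most common approach in AI products today.

Mainstream AI systems mostly use a structure like this: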

Short-term memory (recent messages)
+ Summary memory (compressed history)
+ Long-term memory (vector retrieval)

Many frameworks provide building blocks for this kind of architecture, for example:

LangChain
LlamaIndex
AutoGPT
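To make the three layers concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `LayeredMemory`, `fake_summarize`, and `word_overlap` are illustrative names, the summarizer stands in for a call to a small LLM, and word overlap stands in for embedding similarity against a vector database. It does not reflect how any of the frameworks above implement memory.

```python
from collections import deque
from typing import Callable, List


class LayeredMemory:
    """Hypothetical three-layer memory: recent window + summary + archive."""

    def __init__(self, summarize: Callable[[str], str],
                 score: Callable[[str, str], float],
                 window_size: int = 4, top_k: int = 2):
        self.summarize = summarize      # stand-in for a small-LLM call
        self.score = score              # stand-in for embedding similarity
        self.window_size = window_size
        self.top_k = top_k
        self.recent: deque = deque()    # short-term memory
        self.summary = ""               # summary memory
        self.archive: List[str] = []    # long-term memory (stand-in for a vector DB)

    def add(self, role: str, content: str) -> None:
        self.recent.append(f"{role}: {content}")
        # Messages that fall out of the window are archived and folded into the summary.
        while len(self.recent) > self.window_size:
            old = self.recent.popleft()
            self.archive.append(old)
            self.summary = self.summarize((self.summary + "\n" + old).strip())

    def build_context(self, query: str) -> str:
        # Rank archived messages by relevance to the query and keep the top K.
        ranked = sorted(self.archive, key=lambda t: self.score(query, t), reverse=True)
        parts = [
            "Summary:\n" + self.summary,
            "Relevant history:\n" + "\n".join(ranked[: self.top_k]),
            "Recent messages:\n" + "\n".join(self.recent),
            "User question:\n" + query,
        ]
        return "\n\n".join(parts)


def fake_summarize(text: str) -> str:
    # Placeholder: keep only the tail of the text instead of calling a model.
    return text[-200:]


def word_overlap(a: str, b: str) -> float:
    # Placeholder relevance score: number of shared lowercase words.
    return len(set(a.lower().split()) & set(b.lower().split()))


memory = LayeredMemory(fake_summarize, word_overlap, window_size=2, top_k=1)
memory.add("user", "my name is Zhang San")
memory.add("assistant", "nice to meet you")
memory.add("user", "write a Java tutorial")
memory.add("assistant", "sure")
print(memory.build_context("what is my name"))
```

With a window of two messages, the introduction has already left short-term memory by the time the question arrives, but it is still reachable through the summary and the archive.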

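Why layered memory is the most common

Because it solves three problems:

1. Keeping the context coherent

Using vector retrieval alone has a problem:

The model does not know the running context of the chat.

For example: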

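User: Write me a Java tutorial
AI: Sure

User: Add a bit of Spring Boot

If the second request relies only on vector retrieval:

The model may not realize that Spring Boot is an addition to the tutorial just requested.

So we must keep: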

Recent messages (sliding window)
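A sliding window over recent messages can be sketched with a bounded `deque`. The helper name `make_window` and the choice of counting one turn as a user message plus an assistant reply are assumptions for illustration.

```python
from collections import deque


def make_window(max_turns: int) -> deque:
    # One turn = one user message + one assistant reply, hence 2 * max_turns.
    return deque(maxlen=2 * max_turns)


window = make_window(max_turns=2)
for i in range(1, 5):
    window.append({"role": "user", "content": f"question {i}"})
    window.append({"role": "assistant", "content": f"answer {i}"})

# Only the last two turns remain; older messages were dropped automatically.
print([m["content"] for m in window])
# ['question 3', 'answer 3', 'question 4', 'answer 4']
```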

2. Preventing the context window from blowing up

The history keeps growing:

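100 turns
200 turns
500 turns

If all of it is stuffed into the context window, it will overflow, or it will have to be trimmed and a lot of content will be lost.

So we need: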

History → Summary

3. Supporting long-term memory

For example:

User: My name is Zhang San
User: I live in Shanghai

A few days later the user asks:

Do you still remember my name?

Without vector memory:

The model has already forgotten.

So we need:

Embedding → Vector DB

Context structure of a real AI system

Assume the model's context window limit is 128k.

A complete prompt usually looks like this:

SYSTEM PROMPT    // system prompt
+
Conversation Summary   // conversation summary
+
Recent Messages (last 5-10 turns) // conversation history
+
RAG Documents // RAG content, if any
+
User Question   // user input

Example:

system prompt         8k
summary memory        16k
recent messages       32k
RAG documents         64k
user input            7k

Total:

127k tokens
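The arithmetic behind this example can be checked with a few lines of Python. The dictionary keys and the helper `remaining_for_output` are hypothetical names. The sketch also assumes, as in the composition listed earlier, that the model's output shares the same 128k window, which means a 127k prompt leaves very little room for the reply.

```python
CONTEXT_LIMIT = 128_000  # assumed context window, in tokens

budget = {
    "system_prompt": 8_000,
    "summary_memory": 16_000,
    "recent_messages": 32_000,
    "rag_documents": 64_000,
    "user_input": 7_000,
}


def remaining_for_output(parts: dict, limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for the model's reply after all input parts are counted."""
    return limit - sum(parts.values())


print(sum(budget.values()))          # 127000
print(remaining_for_output(budget))  # 1000
```

In practice it may be worth reserving a larger share for the output up front and sizing the other parts around that reserve.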

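Trade-offs when over the limit

If the final prompt exceeds the target model's context window limit, what should we give up?

It depends on how your business logic weights each part. Generally the choice is between compressing the summary memory a second time and truncating the RAG documents; the other three parts are all very important.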
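One way to express this trade-off is a small function that works on token counts. This is a sketch under assumptions: `fit_budget` is a hypothetical helper, and cutting the RAG documents before the summary is just one possible priority order. It only computes the new token allowances; the actual truncation or re-summarization would happen elsewhere.

```python
def fit_budget(parts: dict, limit: int, reserve_output: int = 0) -> dict:
    """Shrink RAG documents first, then the summary, until the total fits."""
    fitted = dict(parts)
    for key in ("rag_documents", "summary_memory"):  # assumed priority order
        overflow = sum(fitted.values()) + reserve_output - limit
        if overflow <= 0:
            break
        # Take as much of the overflow as this part can absorb.
        fitted[key] = max(0, fitted[key] - overflow)
    return fitted


parts = {
    "system_prompt": 8_000,
    "summary_memory": 16_000,
    "recent_messages": 32_000,
    "rag_documents": 80_000,
    "user_input": 7_000,
}

fitted = fit_budget(parts, limit=128_000)
print(fitted["rag_documents"])  # 65000
print(sum(fitted.values()))     # 128000
```

The system prompt, recent messages, and user input are never touched, which matches the point above that those three parts are the ones to protect.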

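Finally

To sum up:

- The most common AI memory structure is layered memory: short-term memory + long-term memory + summary memory
- The complete prompt of a real AI system is roughly: system prompt + conversation summary + conversation history + RAG data + user input
- Context compression is mainly about increasing information density: keeping the most important information inside a limited context window so that the conversation can continue and the model's reasoning quality improves
- In token counts, k is a unit of quantity: 1k = 1,000 tokens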
