Meituan LongCat Open-Sources LongCat-Flash-Lite

Model Introduction

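We introduce LongCat-Flash-Lite, a non-thinking Mixture-of-Experts (MoE) model with a 256k context window (achieved through the YaRN method), 68.5B total parameters, and roughly 3B activated parameters. Building on the LongCat-Flash architecture, it integrates an N-gram embedding table, which improves both model performance and inference speed. Although more than 30B parameters are allocated to the embeddings, LongCat-Flash-Lite not only outperforms parameter-equivalent MoE baselines but is also highly competitive with existing models of comparable scale, particularly in agentic and coding tasks.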

Key Features

🌟 Superior Scaling Efficiency: A Better Alternative to MoE

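Through scaling experiments across multiple settings, we find that in specific regimes embedding scaling yields a better Pareto frontier than increasing the number of experts, offering an efficient new direction for model scaling. We systematically characterize the architectural factors that determine its effectiveness: integration timing, parameter budget, hash collision mitigation, hyperparameter settings, embedding initialization, and the effects of model width and depth.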
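
To make the idea of a hashed N-gram embedding table more concrete, here is a minimal sketch in plain Python. It is not the LongCat-Flash-Lite implementation: the rolling hash, the table sizes, the handling of short prefixes, and the function names `ngram_bucket_ids` and `embed` are all assumptions chosen for illustration. Each position's n-gram is hashed into a fixed number of buckets, and the bucket's vector is added to the ordinary token embedding. Because distinct n-grams can share a bucket, hash collisions are one of the design factors listed above.

```python
import random

def ngram_bucket_ids(token_ids, n, num_buckets, base=1_000_003):
    """Map the n-gram ending at each position to a bucket index.

    Positions with fewer than n-1 preceding tokens hash the shorter
    prefix that is available (an arbitrary choice for this sketch).
    """
    ids = []
    for i in range(len(token_ids)):
        window = token_ids[max(0, i - n + 1): i + 1]
        h = 0
        for t in window:
            h = (h * base + t + 1) % num_buckets  # polynomial rolling hash
        ids.append(h)
    return ids

def embed(token_ids, token_table, ngram_table, n):
    """Sum the per-token embedding with the hashed n-gram embedding."""
    buckets = ngram_bucket_ids(token_ids, n, len(ngram_table))
    return [
        [a + b for a, b in zip(token_table[tok], ngram_table[bucket])]
        for tok, bucket in zip(token_ids, buckets)
    ]

# Toy tables with random values; real tables would be learned.
rng = random.Random(0)
dim, vocab, num_buckets = 4, 10, 97
token_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
ngram_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_buckets)]

vectors = embed([3, 1, 4, 1, 5], token_table, ngram_table, n=2)
print(len(vectors), len(vectors[0]))
```

In this toy input the token `1` appears twice but follows different tokens, so its two occurrences hash to different buckets and receive different n-gram vectors.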

🌟 Inference Efficiency Through Specialized System Optimization

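Compared with conventional FFN-based experts, the N-gram embedding table inherently relieves the I/O bottleneck in MoE layers, which significantly reduces inference latency. We also developed a dedicated N-gram cache and synchronized kernels, which together bring an order-of-magnitude efficiency improvement.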

🌟 Strong Agentic and Coding Performance

LongCat-Flash-Lite outperforms models of comparable scale on tool-use and coding tasks.

For more details, please refer to our technical report.

Evaluation Results

Benchmark	Kimi-Linear-48B-A3B	Qwen3-Next-80B-A3B-Instruct	Gemini 2.5 Flash-Lite	LongCat-Flash-Lite
Architecture	MoE	MoE	-	MoE + NE
# Total Params	48B	80B	-	68.5B
# Activated Params	3B	3B	-	2.9B~4.5B
Agentic Tool Use
Tau2-Airline(avg@8)	44.00	45.5*	35.00	58.00
Tau2-Retail(avg@8)	18.86	57.3*	37.50	73.10
Tau2-Telecom(avg@8)	15.68	13.2*	21.93	72.80
Agentic Coding
SWE-Bench(acc)	32.80	37.60	41.3*	54.40
TerminalBench(acc)	20.00	15.19	20.00	33.75
SWE-Bench Multilingual	37.20	31.30	-	38.10
PRDBench	-	15.36	-	39.63
General Domains
GPQA-Diamond(avg@16)	69.89	74.33	70.20*	66.78
MMLU(acc)	79.91	89.28	84.68	85.52
MMLU-Pro(acc)	67.22	82.93	78.95	78.29
CEval(acc)	78.48	90.91	75.16	86.55
CMMLU(acc)	76.26	86.50	72.06	82.48
Mathematical Reasoning
MATH500(acc)	94.20	98.00	95.20	96.80
AIME24(avg@32)	70.52	81.35	63.33	72.19
AIME25(avg@32)	59.58	68.44	50.1*	63.23

Note: Values marked with * are taken from public reports. NE stands for N-gram Embedding.
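
Several rows above are reported as avg@k. The model card does not define the metric, so the sketch below assumes the common reading: each problem is attempted in k independent runs, and the score is the mean pass rate over all problems, in percent. The function name `avg_at_k` and the sample outcomes are hypothetical.

```python
def avg_at_k(results):
    """results[i][j] is True if run j solved problem i; every row has k runs."""
    k = len(results[0])
    assert all(len(row) == k for row in results), "every problem needs k runs"
    per_problem = [sum(row) / k for row in results]
    return 100.0 * sum(per_problem) / len(per_problem)

# Hypothetical outcomes for 3 problems with k = 4 runs each.
runs = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
score = avg_at_k(runs)
print(f"avg@4 = {score:.2f}")
```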

Quick Start

To use LongCat-Flash-Lite with transformers, you need at least 2 GPUs with 80 GB of memory each (e.g., H100 or A100 80GB). We recommend the following environment:

python >= 3.10
torch >= 2.6
transformers >= 4.57.6
accelerate >= 1.10.0

pip install -U transformers==4.57.6 accelerate==1.10.0
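
The two-GPU minimum is consistent with a rough estimate of the weight footprint. The calculation below assumes 16-bit weights (2 bytes per parameter), which the model card does not state explicitly, and it ignores activations and the KV cache, so treat it as a lower bound rather than a sizing guide.

```python
import math

TOTAL_PARAMS = 68.5e9      # total parameter count from the model card
BYTES_PER_PARAM = 2        # assumption: 16-bit (bf16/fp16) weights
GPU_MEMORY_BYTES = 80e9    # one 80 GB GPU

weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
min_gpus = math.ceil(weight_bytes / GPU_MEMORY_BYTES)

print(f"weights: {weight_bytes / 1e9:.0f} GB, GPUs needed for weights alone: {min_gpus}")
```

Under these assumptions the weights take about 137 GB, so two 80 GB GPUs leave roughly 23 GB in total for everything else.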

Basic usage example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meituan-longcat/LongCat-Flash-Lite"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a brief introduction to large language models."}
]
# Render the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
# Drop the prompt tokens so that only the newly generated text is decoded.
output_ids = generated_ids[0][len(input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(response)

Tool calling example:

tools = [
    {
        "type": "function",
        "function": {
            "name": "func_add",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "x1": {"type": "number", "description": "The first addend"},
                    "x2": {"type": "number", "description": "The second addend"}
                },
                "required": ["x1", "x2"]
            }
        }
    }
]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please tell me what is $$125679 + 234519$$?"},
    {
        "role": "assistant", 
        "content": "I'll calculate the sum of 125679 and 234519 for you.", 
        "tool_calls": [{"type": "function", "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}]
    },
    {"role": "tool", "name": "func_add", "content": '{"ans": 360198}'}
]

input_ids = tokenizer.apply_chat_template(
    messages, 
    tools=tools,
    add_generation_prompt=True, 
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
output_ids = generated_ids[0][len(input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(response)
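
The tool-calling example above hard-codes the `tool` message. In an application, that message would normally come from executing the requested function. A minimal sketch of that step follows. The registry `TOOL_REGISTRY` and the helper `run_tool_call` are hypothetical names introduced for illustration, and the message layout simply mirrors the example above.

```python
import json

def func_add(x1, x2):
    """Local implementation matching the `func_add` tool schema above."""
    return {"ans": x1 + x2}

# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"func_add": func_add}

def run_tool_call(tool_call):
    """Execute one tool call and return the message to append to `messages`."""
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):  # some stacks pass arguments as a JSON string
        args = json.loads(args)
    result = TOOL_REGISTRY[fn["name"]](**args)
    return {"role": "tool", "name": fn["name"], "content": json.dumps(result)}

call = {
    "type": "function",
    "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}},
}
tool_message = run_tool_call(call)
print(tool_message)
```

The resulting dictionary has the same shape as the `tool` entry in `messages`, so it can be appended before calling `apply_chat_template` again.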

Response parsing:

from parse_model_response import parse_model_response

response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
parsed_message = parse_model_response(response, tools)

See parse_model_response.py for the detailed implementation and examples.

Deployment

We have implemented basic adaptations in SGLang (PR) to support deployment of LongCat-Flash-Lite.

By combining tensor parallelism and expert parallelism, LongCat-Flash-Lite can be served on a single node (e.g., 8x H20-141G).

First, build and update sgl-kernel.

cd sgl-kernel
python3 -m uv build --wheel --color=always --no-build-isolation \
        -Ccmake.define.SGL_KERNEL_ENABLE_SM90A=1 \
        -Ccmake.define.CMAKE_POLICY_VERSION_MINIMUM=3.5 \
        -Cbuild-dir=build .
pip3 install dist/sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl --force-reinstall

Then launch the server.

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Lite \
    --port 8080 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --trust-remote-code \
    --skip-server-warmup \
    --attention-backend flashinfer \
    --ep 8 \
    --tp 8 \
    --disable-cuda-graph
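
Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and targets SGLang's OpenAI-compatible chat completions route. The port comes from the launch command above. The `localhost` host and the assumption that the model is served under its model path are illustrative, and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # port taken from the launch command above
MODEL = "meituan-longcat/LongCat-Flash-Lite"  # assumption: served under its model path

def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Building the request does not contact the server; call chat(...) to send it.
request = build_chat_request("Give me a brief introduction to large language models.")
print(request.full_url)
```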

License Agreement

This repository, including both the model weights and the source code, is released under the MIT License.

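Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.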

For details, see the LICENSE file.

Usage Considerations

This model has not been specifically designed or comprehensively evaluated for every possible downstream application.

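Developers should take into account the known limitations of large language models, including performance variations across languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios.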

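Developers and downstream users are responsible for understanding and complying with all laws and regulations applicable to their use case, including but not limited to data protection, privacy, and content safety requirements.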

Nothing in this model card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.

Contact

If you have any questions, please email longcat-team@meituan.com or open an issue.