【AIGC魔童】DeepSeek v3 Inference Deployment: Huawei Ascend NPU / TRT-LLM

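(1) Deploying DeepSeek for inference on Huawei Ascend NPUs

Reference blog post: "Huawei Ascend runs DeepSeek-R1 inference on par with high-end GPUs, with a free, unlimited API! Luchen's self-developed inference engine steps in"

The MindIE framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3.

For step-by-step guidance on Ascend NPUs, follow the instructions published by the Ascend community.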

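(2) Deploying DeepSeek for inference with TRT-LLM

GitHub: https://github.com/NVIDIA/TensorRT-LLM

TensorRT-LLM now supports the DeepSeek-V3 model, with precision options such as BF16 and INT4/INT8 weight-only quantization. FP8 support is in progress and will be released soon.

You can try the new features directly through TRT-LLM's custom branch dedicated to DeepSeek-V3 support: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3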

2.8.1 Download the DeepSeek model weights

Download the DeepSeek-V3 weights from Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

git lfs install
# Clone into ./DeepSeek-V3, the directory name the later commands expect.
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3-Base DeepSeek-V3

Optional: convert the FP8 weights to BF16.

This step is only needed if you want to run the model end to end in BF16 precision.

git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference/
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-V3 --output-bf16-hf-path /path/to/deepseek-v3-bf16
cp /path/to/DeepSeek-V3/config.json /path/to/DeepSeek-V3/configuration_deepseek.py /path/to/deepseek-v3-bf16/
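
The cp step above is easy to forget, and a converted directory without those two files will not load as a Hugging Face model. A minimal sketch of a pre-flight check, assuming only the two file names from the command above (the helper name `missing_files` is made up for illustration and is not part of DeepSeek-V3 or TensorRT-LLM):

```python
from pathlib import Path

# The two files the cp command above copies next to the converted weights.
REQUIRED_FILES = ("config.json", "configuration_deepseek.py")


def missing_files(bf16_dir):
    """Return the required files that are absent from bf16_dir (hypothetical helper)."""
    root = Path(bf16_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Placeholder path taken from the commands above; replace it with your own.
missing = missing_files("/path/to/deepseek-v3-bf16")
print("all files present" if not missing else "missing: " + ", ".join(missing))
```

Running this before convert_checkpoint.py catches a forgotten copy early instead of partway through a long conversion.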

2.8.2 Build the TensorRT engine

First, use convert_checkpoint.py to convert the DeepSeek weights into a TensorRT-LLM checkpoint; then build the TensorRT engine from that checkpoint.

  • Model conversion

Convert to FP8 weights:

# Convert Deepseek-v3 HF Native FP8 weights to TensorRT-LLM checkpoint.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
                            --output_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
                            --dtype bfloat16 \
                            --use_fp8_weights \
                            --tp_size 8 \
                            --workers 8 # using multiple workers can accelerate the conversion process

Optional: convert to BF16 weights:

# Convert Deepseek-v3 HF weights to TensorRT-LLM checkpoint in BF16.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
                            --output_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
                            --dtype bfloat16 \
                            --tp_size 32 \
                            --workers 8 # using multiple workers can accelerate the conversion process

  • Build the TensorRT engine

For the FP8 model:

# Build FP8 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
            --output_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
            --max_batch_size 4 \
            --max_seq_len 4096 \
            --max_input_len 2048 \
            --use_paged_context_fmha enable \
            --workers 8

For the BF16 model:

# Build BF16 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
            --output_dir ./trtllm_engines/deepseek_v3/bf16/tp32-sel4096-isl2048-bs4 \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --max_batch_size 4 \
            --max_seq_len 4096 \
            --max_input_len 2048 \
            --use_paged_context_fmha enable \
            --workers 8
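
The FP8 commands use tp8 while the BF16 commands use tp32. A back-of-the-envelope calculation suggests why. The sketch below assumes roughly 671e9 total parameters (the commonly cited figure for DeepSeek-V3, not stated in this post), 1 byte per FP8 weight, 2 bytes per BF16 weight, and a perfectly even split across tensor-parallel ranks. It ignores scale factors, replicated tensors, and runtime buffers, so treat the numbers as rough lower bounds.

```python
# ASSUMPTION: approximate total parameter count; not taken from this post.
TOTAL_PARAMS = 671e9
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weights_gb_per_gpu(precision, tp_size):
    """Rough weight footprint per GPU in GB, assuming an even tensor-parallel split."""
    total_bytes = TOTAL_PARAMS * BYTES_PER_PARAM[precision]
    return total_bytes / tp_size / 1e9


# The two configurations used in the commands above.
for precision, tp in (("fp8", 8), ("bf16", 32)):
    gb = weights_gb_per_gpu(precision, tp)
    print(f"{precision} tp{tp}: ~{gb:.0f} GB of weights per GPU")
```

Under these assumptions the weights alone come to about 84 GB per GPU for FP8 on 8 GPUs and about 42 GB per GPU for BF16 on 32 GPUs.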

Caution: --max_batch_size and --max_seq_len are the main factors that determine how much GPU memory is used at runtime. If summarize.py, mmlu.py, or gptManagerBenchmark.cpp later runs out of memory, rebuild the TensorRT engine with smaller --max_batch_size and --max_seq_len values that fit your GPU memory. The perf-best-practices.md document (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md) explains the mechanism.
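
When a rebuild is needed, it helps to keep the engine directory name in step with the limits it was built with. The sketch below regenerates the FP8 build command for smaller limits. It uses only the flags shown in the commands above and assumes that the directory pattern tp…-sel…-isl…-bs… encodes the TP size, --max_seq_len, --max_input_len, and --max_batch_size, which is inferred from the example paths. The function names are made up for illustration.

```python
import shlex


def engine_dir(precision, tp_size, max_seq_len, max_input_len, max_batch_size):
    """Engine path following the naming pattern seen in the commands above."""
    return (f"./trtllm_engines/deepseek_v3/{precision}/"
            f"tp{tp_size}-sel{max_seq_len}-isl{max_input_len}-bs{max_batch_size}")


def build_command(checkpoint_dir, precision, tp_size,
                  max_seq_len, max_input_len, max_batch_size, workers=8):
    """Assemble a trtllm-build command line from the flags used in this post."""
    args = [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir(precision, tp_size, max_seq_len,
                                   max_input_len, max_batch_size),
        "--max_batch_size", str(max_batch_size),
        "--max_seq_len", str(max_seq_len),
        "--max_input_len", str(max_input_len),
        "--use_paged_context_fmha", "enable",
        "--workers", str(workers),
    ]
    return shlex.join(args)


# Example: halve every limit of the FP8 build after an out-of-memory error.
print(build_command("./trtllm_checkpoint_deepseek_v3_8gpu_fp8", "fp8", 8,
                    max_seq_len=2048, max_input_len=1024, max_batch_size=2))
```

The halved values are only an illustration; the right limits depend on your GPU memory and workload.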

2.8.3 Model inference

Test the FP8 model with the run.py script:

# run.sh
python3 ../run.py --input_text "Today is a nice day." \
        --max_output_len 30 \
        --tokenizer_dir ./DeepSeek-V3 \
        --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
        --top_p 0.95 \
        --temperature 0.3
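
The command above passes --temperature 0.3 and --top_p 0.95. For readers unfamiliar with these knobs, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) sampling over one step's logits. It illustrates the general technique only and is not TensorRT-LLM's implementation; the function name and the toy logits are made up.

```python
import math
import random


def sample_top_p(logits, temperature=0.3, top_p=0.95, rng=random):
    """Pick one token index using temperature scaling and nucleus sampling."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits for a three-token vocabulary.
print(sample_top_p([2.0, 1.0, 0.0]))
```

With these toy logits and a temperature of 0.3, the first token alone holds more than 95% of the probability mass, so it is the only candidate left after the top-p cut. A low temperature combined with top-p therefore makes output close to deterministic.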

Multi-node inference:

srun -N 2 -w node-[1-2] --gres=gpu:8 --ntasks-per-node 8 \
    --container-image tensorrt_llm/release:latest \
    --container-mounts ${PWD}:/workspace \
    sh /workspace/command/run.sh
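
The srun line above starts 2 nodes × 8 tasks per node. As a sanity check, the sketch below compares the number of launched ranks with the world size an engine was built for. It assumes that TensorRT-LLM expects one MPI rank per engine shard, so that the world size equals tp_size × pp_size; verify this against your TensorRT-LLM version. The function names are made up for illustration.

```python
def launched_ranks(nodes, ntasks_per_node):
    """Number of MPI ranks started by srun -N <nodes> --ntasks-per-node <n>."""
    return nodes * ntasks_per_node


def engine_world_size(tp_size, pp_size=1):
    """World size an engine is built for (ASSUMPTION: tp_size * pp_size)."""
    return tp_size * pp_size


ranks = launched_ranks(nodes=2, ntasks_per_node=8)
print(f"srun starts {ranks} ranks")
for tp in (8, 16):
    needed = engine_world_size(tp)
    verdict = "matches" if needed == ranks else "does not match"
    print(f"a tp{tp} engine needs {needed} ranks, which {verdict} this launch")
```

Under that assumption, a 16-rank launch pairs with an engine built with --tp_size 16, while a tp8 engine such as the one built earlier would fit a single 8-GPU node.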

Output:

...
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
Input [Text 0]: "Today is a nice day."
Output [Text 0 Beam 0]: " I am going to the park with my friends. We are going to play soccer. We are going"

2.8.4 Model evaluation

Evaluate the model with the mmlu.py script:

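# Download the MMLU dataset
mkdir mmlu_data && cd mmlu_data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
cd ..  # return to the directory that contains mmlu_data
# MODEL_DIR and ENGINE_DIR are not defined elsewhere in this post; these values match the earlier steps
MODEL_DIR=./DeepSeek-V3
ENGINE_DIR=./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4
# Run the MMLU evaluation
# If data.tar unpacks into a data/ subdirectory, pass --data_dir mmlu_data/data instead
python3 mmlu.py \
        --hf_model_dir ${MODEL_DIR} \
        --engine_dir ${ENGINE_DIR} \
        --data_dir mmlu_data \
        --test_trt_llm 2>&1 | tee ${ENGINE_DIR}/test_with_mmlu.log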

Output:

Average accuracy 0.926 - high_school_macroeconomics
Average accuracy 0.752 - high_school_mathematics
Average accuracy 0.954 - high_school_microeconomics
Average accuracy 0.848 - high_school_physics
Average accuracy 0.967 - high_school_psychology
Average accuracy 0.861 - high_school_statistics
Average accuracy 0.956 - high_school_us_history
Average accuracy 0.954 - high_school_world_history
Average accuracy 0.861 - human_aging
Average accuracy 0.931 - human_sexuality
Average accuracy 0.975 - international_law
Average accuracy 0.907 - jurisprudence
Average accuracy 0.920 - logical_fallacies
Average accuracy 0.848 - machine_learning
Average accuracy 0.951 - management
Average accuracy 0.957 - marketing
Average accuracy 0.950 - medical_genetics
Average accuracy 0.957 - miscellaneous
Average accuracy 0.870 - moral_disputes
Average accuracy 0.798 - moral_scenarios
Average accuracy 0.918 - nutrition
Average accuracy 0.916 - philosophy
Average accuracy 0.932 - prehistory
Average accuracy 0.869 - professional_accounting
Average accuracy 0.714 - professional_law
Average accuracy 0.956 - professional_medicine
Average accuracy 0.908 - professional_psychology
Average accuracy 0.800 - public_relations
Average accuracy 0.869 - security_studies
Average accuracy 0.960 - sociology
Average accuracy 0.950 - us_foreign_policy
Average accuracy 0.578 - virology
Average accuracy 0.930 - world_religions
Average accuracy 0.852 - math
Average accuracy 0.874 - health
Average accuracy 0.905 - physics
Average accuracy 0.936 - business
Average accuracy 0.958 - biology
Average accuracy 0.825 - chemistry
Average accuracy 0.888 - computer science
Average accuracy 0.912 - economics
Average accuracy 0.890 - engineering
Average accuracy 0.851 - philosophy
Average accuracy 0.917 - other
Average accuracy 0.932 - history
Average accuracy 0.944 - geography
Average accuracy 0.904 - politics
Average accuracy 0.936 - psychology
Average accuracy 0.949 - culture
Average accuracy 0.744 - law
Average accuracy 0.883 - STEM
Average accuracy 0.827 - humanities
Average accuracy 0.926 - social sciences
Average accuracy 0.898 - other (business, health, misc.)
Average accuracy: 0.877
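
A log this long is easier to read once it is parsed. The sketch below assumes the two line shapes seen above ("Average accuracy <score> - <name>" for subjects and categories, "Average accuracy: <score>" for the overall result) and picks out the weakest entry. It keeps entries in a list rather than a dict because some names, such as philosophy, appear twice: once as a subject and once as a category. The function name is made up for illustration.

```python
import re

# Matches both the per-subject lines and the final overall line.
LINE = re.compile(r"^Average accuracy:? (\d+\.\d+)(?: - (.+))?$")


def parse_mmlu_log(text):
    """Return ([(name, score), ...], overall_score_or_None) from mmlu.py output."""
    entries, overall = [], None
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip unrelated log lines
        score = float(match.group(1))
        if match.group(2) is None:
            overall = score
        else:
            entries.append((match.group(2), score))
    return entries, overall


# A few lines copied from the output above.
sample = """Average accuracy 0.578 - virology
Average accuracy 0.930 - world_religions
Average accuracy 0.898 - other (business, health, misc.)
Average accuracy: 0.877"""

entries, overall = parse_mmlu_log(sample)
weakest = min(entries, key=lambda entry: entry[1])
print(f"overall: {overall}, weakest: {weakest[0]} ({weakest[1]})")
```

To run it on the real log, pass in the contents of the test_with_mmlu.log file written by the tee command.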