【AIGC魔童】DeepSeek v3推理部署：华为昇腾NPU/TRT-LLM

（1）使用华为昇腾NPU推理部署DeepSeek

参考博客 ：华为昇腾推理DeepSeek-R1，性能比肩高端GPU，API免费无限量！潞晨自研推理引擎出手了

来自华为昇腾社区的 MindIE 框架成功适配了 DeepSeek-V3 的 BF16 版本。

有关 Ascend NPU 的分步指南，请按照此处的说明进行操作。

（2）使用TRT-LLM推理部署DeepSeek

GitHub地址 ：https://github.com/NVIDIA/TensorRT-LLM

TensorRT-LLM 现在支持 DeepSeek-V3 模型，仅提供 BF16 和 INT4/INT8 权重等精度选项。对 FP8 的支持目前正在进行中，并将很快发布。

您可以通过以下链接访问 TRTLLM 专门用于 DeepSeek-V3 支持的自定义分支，直接体验新功能：https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3。

2.8.1 下载DeepSeek模型权重

Download DeepSeek-V3 weights from HF https://huggingface.co/deepseek-ai/DeepSeek-V3-Base.

shell 复制代码

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

可选项: 转化 FP8 权重到 BF16.

This is not necessary unless you want to run the model E2E in BF16 precision.

shell 复制代码

git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference/
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-V3 --output-bf16-hf-path /path/to/deepseek-v3-bf16
cp /path/to/DeepSeek-V3/config.json /path/to/DeepSeek-V3/configuration_deepseek.py /path/to/deepseek-v3-bf16/

2.8.2 构建TensorRT引擎

首先，利用convert_checkpoint.py将DeepSeek权重转换为TensorRT-LLM权重，然后，使用TensorRT-LLM权重构建TensorRT引擎。

模型转化

转化为 FP8 权重:

shell 复制代码

# Convert Deepseek-v3 HF Native FP8 weights to TensorRT-LLM checkpoint.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
                            --output_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
                            --dtype bfloat16 \
                            --use_fp8_weights \
                            --tp_size 8 \
                            --workers 8 # using multiple workers can accelerate the conversion process

可选项: 转化为 BF16 权重:

shell 复制代码

# Convert Deepseek-v3 HF weights to TensorRT-LLM checkpoint in BF16.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
                            --output_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
                            --dtype bfloat16 \
                            --tp_size 32 \
                            --workers 8 # using multiple workers can accelerate the conversion process

构建TensorRT引擎

对于FP8模型:

shell 复制代码

# Build FP8 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
            --output_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
            --max_batch_size 4 \
            --max_seq_len 4096 \
            --max_input_len 2048 \
            --use_paged_context_fmha enable \
            --workers 8

对于BF16模型:

shell 复制代码

# Build BF16 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
            --output_dir ./trtllm_engines/deepseek_v3/bf16/tp32-sel4096-isl2048-bs4 \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --max_batch_size 4 \
            --max_seq_len 4096 \
            --max_input_len 2048 \
            --use_paged_context_fmha enable \
            --workers 8

Caution: --max_batch_size and --max_seq_len are the main factors to determine how many GPU memory will be used during runtime, so later when try to run e.g., summarize.py or mmlu.py or gptManagerBenchmark.cppmay need adjust --max_batch_size and --max_seq_len accordingly to avoid OOM.(meaning rebuild TensorRT engine with smaller --max_batch_size and --max_seq_len if needed based on GPU memory size), there is beautiful technical log perf-best-practices.md (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md) explained the mechanism.

2.8.3 模型推理

使用run.py 脚本测试FP8模型:

shell 复制代码

# run.sh
python3 ../run.py --input_text "Today is a nice day." \
        --max_output_len 30 \
        --tokenizer_dir ./DeepSeek-V3 \
        --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
        --top_p 0.95 \
        --temperature 0.3

多节点推理：

shell 复制代码

srun -N 2 -w node-[1-2] --gres=gpu:8 --ntasks-per-node 8 \
    --container-image tensorrt_llm/release:latest \
    --container-mounts ${PWD}:/workspace \
    sh /workspace/command/run.sh

输出结果:

shell 复制代码

...
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
Input [Text 0]: "Today is a nice day."
Output [Text 0 Beam 0]: " I am going to the park with my friends. We are going to play soccer. We are going"

2.8.4 模型评估

使用 mmlu.py 脚本实现模型评估:

shell 复制代码

# Download MMLU dataset
mkdir mmlu_data && cd mmlu_data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
# Run MMLU evaluation
python3 mmlu.py \
        --hf_model_dir ${MODEL_DIR} \
        --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
        --data_dir mmlu_data \
        --test_trt_llm 2>&1 | tee ${ENGINE_DIR}/test_with_mmlu.log

输出结果：

复制代码

Average accuracy 0.926 - high_school_macroeconomics
Average accuracy 0.752 - high_school_mathematics
Average accuracy 0.954 - high_school_microeconomics
Average accuracy 0.848 - high_school_physics
Average accuracy 0.967 - high_school_psychology
Average accuracy 0.861 - high_school_statistics
Average accuracy 0.956 - high_school_us_history
Average accuracy 0.954 - high_school_world_history
Average accuracy 0.861 - human_aging
Average accuracy 0.931 - human_sexuality
Average accuracy 0.975 - international_law
Average accuracy 0.907 - jurisprudence
Average accuracy 0.920 - logical_fallacies
Average accuracy 0.848 - machine_learning
Average accuracy 0.951 - management
Average accuracy 0.957 - marketing
Average accuracy 0.950 - medical_genetics
Average accuracy 0.957 - miscellaneous
Average accuracy 0.870 - moral_disputes
Average accuracy 0.798 - moral_scenarios
Average accuracy 0.918 - nutrition
Average accuracy 0.916 - philosophy
Average accuracy 0.932 - prehistory
Average accuracy 0.869 - professional_accounting
Average accuracy 0.714 - professional_law
Average accuracy 0.956 - professional_medicine
Average accuracy 0.908 - professional_psychology
Average accuracy 0.800 - public_relations
Average accuracy 0.869 - security_studies
Average accuracy 0.960 - sociology
Average accuracy 0.950 - us_foreign_policy
Average accuracy 0.578 - virology
Average accuracy 0.930 - world_religions
Average accuracy 0.852 - math
Average accuracy 0.874 - health
Average accuracy 0.905 - physics
Average accuracy 0.936 - business
Average accuracy 0.958 - biology
Average accuracy 0.825 - chemistry
Average accuracy 0.888 - computer science
Average accuracy 0.912 - economics
Average accuracy 0.890 - engineering
Average accuracy 0.851 - philosophy
Average accuracy 0.917 - other
Average accuracy 0.932 - history
Average accuracy 0.944 - geography
Average accuracy 0.904 - politics
Average accuracy 0.936 - psychology
Average accuracy 0.949 - culture
Average accuracy 0.744 - law
Average accuracy 0.883 - STEM
Average accuracy 0.827 - humanities
Average accuracy 0.926 - social sciences
Average accuracy 0.898 - other (business, health, misc.)
Average accuracy: 0.877