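Preface
💡 Pain points: Is the model too large to run on a phone or edge device? Are latency, power draw, and the risk of leaking private data too high? Does accuracy drop sharply after quantization? How do you deploy efficiently on iOS, Android, NVIDIA Jetson, or a Raspberry Pi?
🎯 Solution: work through the full on-device AI deployment pipeline, from quantization to inference frameworks, iOS/Android/embedded deployment, NPU optimization, and privacy protection.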
[Figure: flowchart with four groups. Cloud model: user input, large language model (70B/130B). Quantization and compression: INT4 quantization (size ↓4x, speed ↑2x), INT8 quantization (size ↓2x, speed ↑1.5x), pruning (remove redundant weights), knowledge distillation (a small model learns from a large one). Inference frameworks: MNN (Alibaba), TNN (Tencent), TensorFlow Lite (Google), Core ML (Apple), ONNX Runtime (cross-platform), LLM Runtime (LLM-specific). Target platforms: iOS (Neural Engine), Android (Hexagon NPU), NVIDIA Jetson (GPU), RK3588 (NPU, 6 TOPS), Raspberry Pi (CPU/Edge TPU).]
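The on-device AI technology landscape in 2026:
| Technique | What it does | Effect |
|---|---|---|
| INT4 quantization | Weights FP16→INT4 | Size ↓4x, memory ↓4x |
| INT8 quantization | Weights FP16→INT8 | Size ↓2x, power ↓~30% |
| AWQ quantization | Activation-aware weight quantization | Small accuracy loss at 4 bits |
| GPTQ quantization | Layer-by-layer quantization with Hessian-based error compensation | More accurate than naive round-to-nearest INT4 |
| Knowledge distillation | A small model learns from a large one | Can retain 95%+ of the teacher's capability on the target task |
| NPU acceleration | Dedicated neural-network processor | Energy efficiency ~10x that of a CPU |
| Speculative decoding | Small model drafts, large model verifies | Inference speed ↑2-3x |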
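The speculative-decoding row can be illustrated with a toy sketch. It assumes greedy decoding and two made-up "models" (`target_model` and `draft_model` are hypothetical functions, not real LLMs): the draft proposes up to `k` tokens, the target checks them, and the longest matching prefix is kept. A real implementation verifies all proposed positions in one batched forward pass of the large model, which is where the speedup comes from.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next greedy token


def target_model(prefix: List[int]) -> int:
    """Toy 'large' model: the next token is (sum of prefix) mod 7."""
    return sum(prefix) % 7


def draft_model(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target unless len(prefix) is a multiple of 4."""
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Returns the generated tokens and the number of verification rounds."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target model verifies the proposal (one round).
        rounds += 1
        for i, proposed in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if proposed != expected:
                # Keep the accepted prefix and take the target's token at the mismatch.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every proposed token was accepted
    return tokens, rounds


prompt = [1, 2, 3]
spec_tokens, rounds = speculative_decode(target_model, draft_model, prompt, n_new=12)
print(spec_tokens, "verification rounds:", rounds)
```

Because every kept token equals the target's own greedy choice, the output matches plain greedy decoding with the target model; only the number of large-model rounds changes.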
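1. Model Quantization Techniques
1.1 How Quantization Works
python
# ===== Quantization basics =====
"""
Core idea of quantization: FP32 → INT8/INT4
Principle: approximate floating-point values with integers
FP32: 1 sign bit + 8 exponent bits + 23 mantissa bits = 32 bits
INT8: 8-bit two's-complement integer, range [-128, 127]
INT4: 4-bit two's-complement integer, range [-8, 7]
Mapping (symmetric): quantized = clamp(round(float_value / scale), qmin, qmax)
Dequantization: float_value ≈ quantized * scale
Sources of quantization error:
- Clipping error: values outside the representable integer range are clamped to its limits
- Rounding error: the approximation introduced by round()
"""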
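import torch
import torch.nn as nn
# ===== Dynamic quantization (post-training dynamic quantization) =====
def dynamic_quantize(model: nn.Module) -> nn.Module:
    """
    Dynamic quantization: weights are quantized offline, activations at run time.
    The simplest option; the gains are moderate.
    """
    quantized_model = torch.quantization.quantize_dynamic(
        model,                         # model to quantize
        {nn.Linear, nn.LSTM, nn.GRU},  # layer types to quantize
        dtype=torch.qint8,             # target dtype
    )
    return quantized_model
# Usage
# model = YourLLM()
# quantized_model = dynamic_quantize(model)
# ===== Static quantization (post-training static quantization) =====
def static_quantize(model: nn.Module, calibration_data) -> nn.Module:
    """
    Static quantization: uses calibration data to collect activation statistics.
    Usually more accurate, at the cost of an extra calibration step.
    In eager mode the model must wrap its forward pass in QuantStub/DeQuantStub.
    """
    # 1. Fuse modules (the names below are placeholders; use your model's submodule names)
    model.eval()
    model = torch.quantization.fuse_modules(model, [
        ["conv1", "bn1", "relu"],
    ])
    # 2. Choose the quantization config ("fbgemm" targets x86; use "qnnpack" on ARM)
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
    torch.quantization.prepare(model, inplace=True)
    # 3. Calibrate: run a few batches to collect activation ranges
    with torch.no_grad():
        for batch in calibration_data:
            model(batch)
    # 4. Convert to the quantized model
    quantized_model = torch.quantization.convert(model, inplace=False)
    return quantized_model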
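The mapping `quantized = round(float_value / scale)` and its inverse can be checked by hand. The sketch below is a standard-library illustration, not PyTorch's implementation: it assumes symmetric per-tensor quantization with `scale = max|x| / qmax`, and the weight values are made up. With that choice no value is clipped, so the round-trip error is bounded by `scale / 2`.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: q = clamp(round(x / scale), -qmax, qmax)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Maps integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


def max_abs_error(values: List[float], bits: int) -> Tuple[float, float]:
    """Returns (largest round-trip error, scale) for the given bit width."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored)), scale


weights = [0.42, -1.30, 0.07, 0.98, -0.55]  # made-up example weights
for bits in (8, 4):
    error, scale = max_abs_error(weights, bits)
    print(f"INT{bits}: scale={scale:.5f}, max round-trip error={error:.5f}")
```

Dropping from 8 to 4 bits makes the scale, and with it the worst-case rounding error, roughly 18 times larger (127 / 7). That gap is what group-wise scales and methods such as GPTQ and AWQ try to close.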
# ===== INT4 quantization (GPTQ / AWQ) =====
# Install: pip install auto-gptq transformers accelerate
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
def gptq_quantize_model(model_name: str, output_path: str, bits: int = 4):
    """
    GPTQ: post-training quantization for generative pretrained Transformers.
    Quantizes layer by layer and uses Hessian information to reduce accuracy loss.
    """
    # Load the tokenizer and the full-precision model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    quantize_config = BaseQuantizeConfig(bits=bits, desc_act=True)
    model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
    # Calibration data: quantize() expects tokenized samples, not raw strings
    calibration_texts = [
        "This is a sample text for calibration.",
        # ... more samples
    ]
    examples = [tokenizer(text) for text in calibration_texts]
    # Quantize and save
    model.quantize(examples)
    model.save_quantized(output_path)
    return model, tokenizer
# Usage
model, tokenizer = gptq_quantize_model(
    "Qwen/Qwen2-7B-Instruct",
    "./qwen2-7b-int4",
    bits=4,
)
# ===== AWQ quantization (Activation-aware Weight Quantization) =====
# Install: pip install autoawq
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
def awq_quantize(model_path: str, output_path: str):
    """
    AWQ: activation-aware weight quantization.
    Typically more accurate than naive INT4, especially for LLMs.
    """
    # Load through the AWQ wrapper (a plain transformers model has no quantize())
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Quantization config
    quant_config = {
        "zero_point": True,
        "q_group_size": 128,  # quantization group size
        "w_bit": 4,
        "version": "GEMM",    # or "GEMV" (memory-constrained scenarios)
    }
    # Quantize and save
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
    return model, tokenizer
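# ===== GGUF format (llama.cpp) =====
"""
GGUF: a quantized model file format designed for running LLMs locally
Advantages: single file, cross-platform, memory-mapping (mmap) support
Common precisions:
- Q4_K_M: 4-bit, medium quality, the mainstream choice
- Q5_K_S: 5-bit, high quality
- Q8_0: 8-bit, almost no accuracy loss
- F16: 16-bit float, unquantized
Tools: llama.cpp / ollama / LocalAI
"""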
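1.2 Model Pruning
python
# ===== Pruning techniques =====
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
# 1. Structured pruning (removes whole neurons/channels)
def structured_pruning(model: nn.Module, amount: float = 0.3):
    """
    Structured pruning: zeroes entire filters/channels.
    Inference-friendly, but costs more accuracy.
    """
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # Prune whole output channels (dim=0), ranked by L1 norm (n=1)
            prune.ln_structured(
                module, name="weight", amount=amount, n=1, dim=0
            )
            prune.remove(module, "weight")  # make the pruning permanent
# 2. Unstructured pruning (removes individual weights)
def unstructured_pruning(model: nn.Module, sparsity: float = 0.5):
    """
    Unstructured pruning: prunes by weight magnitude.
    Costs little accuracy, but needs sparse storage and kernels to pay off.
    """
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(
                module, name="weight", amount=sparsity
            )
# 3. Attention-head pruning (for Transformers)
def transformer_pruning(model, head_importance, mlp_importance=None, num_heads_to_remove: int = 4):
    """
    Prunes the least important attention heads in each layer.
    Assumes a GPT-2-style layout (model.transformer.h[i].attn) and a
    head_importance tensor of shape [num_layers, num_heads].
    FFN pruning from mlp_importance is not implemented in this sketch.
    """
    for layer_idx, layer in enumerate(model.transformer.h):
        # Find the least important heads of this layer
        heads_to_remove = torch.argsort(head_importance[layer_idx])[:num_heads_to_remove]
        # Remove those heads
        layer.attn.prune_heads(heads_to_remove.tolist())
    return model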
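The difference between the two pruning styles is easier to see on a small matrix. The following standard-library sketch (the weights and the helper names `magnitude_prune` and `prune_rows` are made up for illustration) zeroes individual small weights in the unstructured case and whole rows, ranked by L1 norm, in the structured case.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Unstructured: zero the `sparsity` fraction of weights with the smallest |w|."""
    positions = [(r, c) for r, row in enumerate(weights) for c in range(len(row))]
    positions.sort(key=lambda rc: abs(weights[rc[0]][rc[1]]))
    pruned = [row[:] for row in weights]
    for r, c in positions[: int(len(positions) * sparsity)]:
        pruned[r][c] = 0.0
    return pruned


def prune_rows(weights: Matrix, amount: float) -> Matrix:
    """Structured: zero whole rows (output neurons) with the smallest L1 norm."""
    order = sorted(range(len(weights)), key=lambda r: sum(abs(w) for w in weights[r]))
    drop = set(order[: int(len(weights) * amount)])
    return [[0.0] * len(row) if r in drop else row[:] for r, row in enumerate(weights)]


layer = [
    [0.90, -0.05, 0.40],
    [0.01, -0.02, 0.03],
    [-0.70, 0.60, -0.10],
    [0.20, -0.30, 0.08],
]
print(magnitude_prune(layer, sparsity=0.5))  # zeros scattered across the matrix
print(prune_rows(layer, amount=0.25))        # one entire row removed
```

A zeroed row can be dropped from the matrix entirely, which shrinks the dense computation; scattered zeros keep the matrix shape and only help when the runtime has sparse kernels.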
# 4. SparseGPT (one-shot, OBC-style pruning)
"""
SparseGPT: a one-shot pruning method designed for LLMs
No fine-tuning required; suitable for very large models
Paper: https://arxiv.org/abs/2301.00774
"""
1.3 Knowledge Distillation
python
# ===== Knowledge distillation =====
import torch
import torch.nn as nn
import torch.nn.functional as F
class DistillationLoss(nn.Module):
    """
    Knowledge distillation: a large model (teacher) guides a small model (student).
    Loss = α * task loss + (1 - α) * KD loss
    """
    def __init__(self, temperature: float = 2.0, alpha: float = 0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
    def forward(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor,
    ) -> torch.Tensor:
        # Task loss (hard labels)
        task_loss = F.cross_entropy(student_logits, labels)
        # Distillation loss (soft labels, softened by the temperature)
        soft_teacher = F.softmax(teacher_logits / self.temperature, dim=-1)
        soft_student = F.log_softmax(student_logits / self.temperature, dim=-1)
        # Multiplying by T^2 keeps gradient magnitudes comparable across temperatures
        distill_loss = F.kl_div(
            soft_student, soft_teacher, reduction="batchmean"
        ) * (self.temperature ** 2)
        # Combine
        return self.alpha * task_loss + (1 - self.alpha) * distill_loss
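To make the soft-label term concrete, here is a standard-library sketch with made-up logits for a three-token vocabulary. It computes the temperature-scaled KD term and compares the two KL directions: forward KL(teacher || student), which is what the loss above uses, and reverse KL(student || teacher), which MiniLLM-style distillation minimizes. The function names are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_term(student_logits: List[float], teacher_logits: List[float], temperature: float) -> float:
    """Soft-label loss: T^2 * KL(teacher_T || student_T)."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(teacher, student)


teacher_logits = [4.0, 3.8, -2.0]  # the teacher finds two tokens plausible
student_logits = [4.0, 0.0, -2.0]  # the student commits to only one of them

teacher = softmax(teacher_logits)
student = softmax(student_logits)
forward_kl = kl_divergence(teacher, student)  # penalizes the student for missing a mode
reverse_kl = kl_divergence(student, teacher)  # penalizes mass where the teacher has little
print(f"forward KL = {forward_kl:.3f}, reverse KL = {reverse_kl:.3f}")
print(f"KD term at T=2: {kd_term(student_logits, teacher_logits, 2.0):.3f}")
```

For this student, which covers one teacher mode and ignores the other, the forward KL is the larger of the two. Forward KL pushes a student to cover every mode; reverse KL tolerates a student that commits to a subset of them.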
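# ===== TinyBERT distillation =====
"""
TinyBERT: distillation tailored to BERT
12-layer BERT-base teacher → 4-layer student (a 6-layer variant also exists)
Key distillation points:
- Embedding layer
- Attention layers
- Feed-forward (hidden state) layers
- Output layer
"""
# ===== MiniLLM distillation (LLM distillation) =====
"""
MiniLLM: distilling an LLM into a smaller model
Core idea: minimize the reverse KL, KL(student || teacher), instead of the usual forward KL
Training:
1. Use the teacher to produce high-quality supervision
2. The student learns to generate similar text
3. Reverse KL keeps the student from putting probability mass on regions the teacher considers unlikely
"""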
2. Inference Frameworks
2.1 Cross-Platform Deployment with ONNX Runtime
python
# ===== ONNX export and optimization =====
# Install: pip install onnx onnxruntime onnxoptimizer
import torch
import onnx
from transformers import AutoModelForCausalLM, AutoTokenizer
# 1. Export the model to ONNX
def export_to_onnx(model, tokenizer, output_path: str):
    """Export a PyTorch model to ONNX."""
    # Prepare example inputs
    model.eval()
    encoded = tokenizer("example input", return_tensors="pt")
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]
    # Export
    torch.onnx.export(
        model,
        (input_ids, attention_mask),
        output_path,
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
            "logits": {0: "batch", 1: "sequence"},
        },
        opset_version=17,  # ONNX operator set version
    )
    print(f"Model exported to {output_path}")
# 2. ONNX Runtime optimization
import onnxruntime as ort
from onnxruntime.transformers import optimizer
def optimize_onnx_model(onnx_path: str, optimized_path: str):
    """Optimize an ONNX model with the Transformer graph optimizer."""
    # num_heads and hidden_size must match the exported model
    optimized_model = optimizer.optimize_model(
        onnx_path,
        model_type="bert",
        num_heads=12,       # attention heads
        hidden_size=768,
        opt_level=99,       # enable all graph optimizations
    )
    # Save
    optimized_model.save_model_to_file(optimized_path)
    return optimized_model
# 3. ONNX Runtime inference
def run_onnx_inference(onnx_path: str, input_data: dict):
    """Run inference with ONNX Runtime; input_data maps input names to NumPy arrays."""
    # Create the session
    session_options = ort.SessionOptions()
    session_options.graph_optimization_level = (
        ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    )
    session_options.intra_op_num_threads = 4
    session = ort.InferenceSession(
        onnx_path, session_options, providers=["CPUExecutionProvider"]
    )
    # Run
    outputs = session.run(None, input_data)
    return outputs
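ONNX Runtime picks hardware through execution providers, and a session falls back along the provider list it is given. The sketch below shows one way to build that list per platform. The preference table is an assumption made for illustration; the provider strings are ONNX Runtime's identifiers, and in a real app `available` would come from `onnxruntime.get_available_providers()`.

```python
from typing import Dict, List

# Assumed preference order per platform, most specialized accelerator first.
PREFERENCE: Dict[str, List[str]] = {
    "ios": ["CoreMLExecutionProvider", "CPUExecutionProvider"],
    "android": ["NnapiExecutionProvider", "XnnpackExecutionProvider", "CPUExecutionProvider"],
    "jetson": ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
}


def choose_providers(platform: str, available: List[str]) -> List[str]:
    """Keeps the preferred providers that this build actually offers, in order."""
    wanted = PREFERENCE.get(platform, ["CPUExecutionProvider"])
    chosen = [name for name in wanted if name in available]
    return chosen or ["CPUExecutionProvider"]  # CPU is the last-resort fallback


# Example: a Jetson build without TensorRT support compiled in.
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
print(choose_providers("jetson", available))
```

The resulting list can be passed as the `providers` argument of `InferenceSession`, so the same application code runs on devices with and without an accelerator.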
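# ===== MNN inference framework (Alibaba) =====
"""
MNN: Alibaba's on-device inference framework
Supports: iOS / Android / Linux / Windows / macOS
Advantages: lightweight (~2 MB), low power, unified CPU/GPU/NPU scheduling
Install: build from source or download the prebuilt libraries
"""
# ===== TNN inference framework (Tencent) =====
"""
TNN: on-device inference framework from Tencent Youtu Lab
Supports: iOS / Android / Linux, with GPU backends
Advantages: one API across platforms, memory optimization, 8-bit quantization support
"""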
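2.2 Core ML in the Apple Ecosystem
python
# ===== Core ML deployment (iOS/macOS) =====
# 1. Export a Core ML model
import numpy as np
import torch
import coremltools as cml
import coremltools.optimize.coreml as cto
from transformers import AutoModelForCausalLM
def export_to_coreml(pytorch_model, output_path: str):
    """Export to Core ML format."""
    # Trace the model with an example input
    example_input = torch.tensor([[1, 2, 3]])
    pytorch_model.eval()
    traced_model = torch.jit.trace(pytorch_model, (example_input,))
    # Convert to Core ML (TorchScript sources need explicit input specs)
    mlmodel = cml.convert(
        traced_model,
        inputs=[cml.TensorType(name="input_ids", shape=example_input.shape, dtype=np.int32)],
        compute_units=cml.ComputeUnit.ALL,  # lets Core ML use the Neural Engine
    )
    # Save
    mlmodel.save(output_path)
    return mlmodel
# 2. Core ML quantization
def quantize_coreml(mlmodel_path: str, output_path: str):
    """Core ML 8-bit linear weight quantization."""
    model = cml.models.MLModel(mlmodel_path)
    op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric")  # INT8 by default
    config = cto.OptimizationConfig(global_config=op_config)
    quantized_model = cto.linear_quantize_weights(model, config=config)
    quantized_model.save(output_path)
    return quantized_model
# 3. Calling the model from Swift on iOS
"""
import CoreML
// modelURL points at the compiled model (.mlmodelc); "input_ids" and "output"
// must match the feature names of your model
func runInference(modelURL: URL, input: MLMultiArray) throws -> MLMultiArray {
    let config = MLModelConfiguration()
    config.computeUnits = .all  // allow the Neural Engine
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    let features = try MLDictionaryFeatureProvider(
        dictionary: ["input_ids": MLFeatureValue(multiArray: input)]
    )
    let output = try model.prediction(from: features)
    guard let result = output.featureValue(for: "output")?.multiArrayValue else {
        throw CocoaError(.featureUnsupported)
    }
    return result
}
"""
// ===== Browser deployment with Transformers.js =====
// Install: npm i @xenova/transformers
import { pipeline, env } from '@xenova/transformers';
// Set the cache directory (optional)
env.cacheDir = '/models';
// Create the pipeline
const classifier = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
// Run inference
const result = await classifier('I love transformers.js!');
console.log(result);
// [{ label: 'POSITIVE', score: 0.9998 }]
// ===== Supported task types =====
// 'feature-extraction' - text embeddings
// 'sentiment-analysis' - sentiment analysis
// 'question-answering' - question answering
// 'fill-mask' - masked-token prediction
// 'text-generation' - text generation (requires a model that supports it)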
3. iOS Deployment in Practice
3.1 Running an LLM on iOS
swift
// ===== LLM inference on iOS (llama.cpp + Swift) =====
import Foundation
// 1. llama.cpp Swift wrapper
// Uses LLaMA Swift: https://github.com/yanunon/LLaMA Swift
// Note: the llama_* calls below are simplified, wrapper-style pseudocode. The real
// llama.cpp C API uses different names and signatures that change between releases.
enum LLMError: Error {
    case initFailed
}
class LLMEngine {
    private var model: OpaquePointer?
    private var context: OpaquePointer?
    // Model parameters
    let modelPath: String
    let contextSize: Int = 2048
    let threads: Int = 4
    let gpuLayers: Int = 32  // layers offloaded to the Metal GPU
    init(modelPath: String) throws {
        self.modelPath = modelPath
        // Initialize llama
        guard llama_init_backend() == 0 else {
            throw LLMError.initFailed
        }
        // Load the model
        model = try llama_load_model_from_file(modelPath)
        // Create the context
        context = llama_init_context(model: model!, ctxSize: contextSize)
    }
    func generate(prompt: String, maxTokens: Int = 256) throws -> String {
        // Tokenize
        let tokens = llama_tokenize(model!, prompt)
        // Sampling parameters
        var params = llama_sampling_params_default()
        params.temperature = 0.7
        params.top_p = 0.9
        params.repeat_penalty = 1.1
        // Generate token by token until EOS or the token limit
        var outputTokens: [Int32] = []
        for _ in 0..<maxTokens {
            let token = llama_sample(
                ctx: context!,
                params: params,
                candidates: ... // fetch the token candidates
            )
            if token == llama_token_eos() {
                break
            }
            outputTokens.append(token)
            llama_eval(ctx: context!, token: token, nTokens: 1, ...)
        }
        // Decode
        return decodeTokens(outputTokens)
    }
}
// ===== MLX (Apple Silicon only) =====
"""
MLX: a machine-learning framework for Apple Silicon (M1/M2/M3)
Advantages: unified memory architecture, Metal GPU acceleration, Python/C++/Swift APIs
Install: pip install mlx mlx-lm
Example:
"""
# Using MLX from Python (through the mlx-lm package)
try:
    from mlx_lm import load, generate
    # Load a 4-bit quantized model (a local directory or a Hugging Face repo id)
    model_path = "./models/llama-3-8b-instruct-4bit"
    model, tokenizer = load(model_path)
    # Generate
    def run(prompt: str, max_tokens: int = 256) -> str:
        return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
except ImportError:
    print("MLX not available, use llama.cpp")
// ===== Calling MLX from Swift =====
// MLX also ships Swift bindings (the mlx-swift package), alongside the Python and C++ APIs
3.2 Deploying Image Models on iOS
swift
// ===== Image classification on iOS (Vision + Core ML) =====
import Vision
import CoreML
struct ClassificationResult {
    let label: String
    let confidence: Float
}
class ImageClassifier {
    private let model: VNCoreMLModel
    init(modelURL: URL) async throws {
        // Load the compiled Core ML model
        let config = MLModelConfiguration()
        config.computeUnits = .all  // allow the Neural Engine
        let mlModel = try await MLModel.load(
            contentsOf: modelURL,
            configuration: config
        )
        self.model = try VNCoreMLModel(for: mlModel)
    }
    func classify(image: CGImage) async throws -> [ClassificationResult] {
        let request = VNCoreMLRequest(model: model)
        let handler = VNImageRequestHandler(cgImage: image, options: [:])
        // perform() is synchronous; the results are available once it returns
        try handler.perform([request])
        guard let results = request.results as? [VNClassificationObservation] else {
            return []
        }
        return results.map {
            ClassificationResult(label: $0.identifier, confidence: $0.confidence)
        }
    }
}
// ===== Stable Diffusion on iOS =====
"""
CoreSD: Stable Diffusion on iOS
- Uses Core ML
- Targets iPhone 15 Pro (8 GB RAM) and later
- The model needs to be quantized to about 2 GB
"""
// Uses CoreML-diffusion-swift: https://github.com/danielgatis/CoreML-Diffusion
4. Android Deployment in Practice
4.1 Deploying with TFLite
python
# ===== TensorFlow Lite conversion =====
import tensorflow as tf
# 1. Save the TF model
model = ... # your model
model.save("saved_model")
# 2. Convert to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
# Quantization settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# representative_dataset_fn: your generator that yields lists of sample input arrays
converter.representative_dataset = representative_dataset_fn # calibration data
# INT8 quantization
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Convert
tflite_model = converter.convert()
# Save
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
# ===== Android integration =====
"""
1. Add the dependency:
implementation 'org.tensorflow:tensorflow-lite:2.14.0'
2. Put the model file in the assets/ directory
3. Java/Kotlin code:
"""
// Kotlin TFLite inference
class TFLiteClassifier(
    private val assetManager: AssetManager,
    private val numClasses: Int,
) {
    private val interpreter: Interpreter
    init {
        val model = loadModelFile("model.tflite")
        interpreter = Interpreter(model)
    }
    // Memory-map the model from assets/ (the asset must be stored uncompressed)
    private fun loadModelFile(name: String): MappedByteBuffer {
        assetManager.openFd(name).use { fd ->
            FileInputStream(fd.fileDescriptor).use { stream ->
                return stream.channel.map(
                    FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength
                )
            }
        }
    }
    fun classify(inputBuffer: ByteBuffer): FloatArray {
        // 4 bytes per float; TFLite expects buffers in native byte order
        val outputBuffer = ByteBuffer.allocateDirect(4 * numClasses)
            .order(ByteOrder.nativeOrder())
        interpreter.run(inputBuffer, outputBuffer)
        outputBuffer.rewind()
        val outputs = FloatArray(numClasses)
        outputBuffer.asFloatBuffer().get(outputs)
        return outputs
    }
}
// ===== TFLite GPU delegate =====
"""
Accelerate TFLite inference on the GPU
"""
// Add GPU support (Gradle dependency):
// implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0'
// Enable the GPU
val gpuDelegate = GpuDelegate(
    GpuDelegate.Options().setPrecisionLossAllowed(false)
)
val gpuOptions = Interpreter.Options().apply {
    addDelegate(gpuDelegate)
}
val interpreter = Interpreter(modelFile, gpuOptions)
// ===== Android NNAPI delegate =====
"""
Use the Android Neural Networks API (NNAPI)
to run on the DSP/NPU
"""
val nnapiDelegate = NnApiDelegate()
val nnapiOptions = Interpreter.Options().apply {
    addDelegate(nnapiDelegate)
}
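4.2 Deploying NCNN on Android
cpp
// ===== NCNN deployment (efficient C++ inference framework) =====
/*
NCNN: a high-performance neural-network inference framework open-sourced by Tencent
Optimized for mobile; supports Android/iOS/Linux
*/
// 1. Model conversion
// PyTorch → ONNX → NCNN
// Use ncnn/tools/onnx2ncnn
// 2. Android JNI code
#include <jni.h>
#include <vector>
#include <ncnn/net.h>
#include <android/asset_manager_jni.h>
class NCNNClassifier {
private:
    ncnn::Net net;
public:
    void loadModel(AAssetManager* mgr) {
        // Load the network structure and the weights
        net.load_param(mgr, "model.param");
        net.load_model(mgr, "model.bin");
    }
    std::vector<float> inference(const unsigned char* rgb, int width, int height) {
        // Preprocess: resize to 224x224, then normalize
        // (ImageNet mean/std on a 0-255 scale; an assumption, use your model's values)
        ncnn::Mat in = ncnn::Mat::from_pixels_resize(
            rgb, ncnn::Mat::PIXEL_RGB, width, height, 224, 224);
        const float mean_vals[3] = {123.675f, 116.28f, 103.53f};
        const float norm_vals[3] = {1 / 58.395f, 1 / 57.12f, 1 / 57.375f};
        in.substract_mean_normalize(mean_vals, norm_vals);
        // Input ("data" and "prob" are blob names taken from model.param)
        ncnn::Extractor ex = net.create_extractor();
        ex.input("data", in);
        // Output
        ncnn::Mat out;
        ex.extract("prob", out);
        // Copy into a float vector
        std::vector<float> result(out.total());
        for (size_t i = 0; i < out.total(); i++) {
            result[i] = out[i];
        }
        return result;
    }
};
static NCNNClassifier classifier;  // loadModel() must be called before inference
// JNI binding
extern "C" JNIEXPORT jfloatArray JNICALL
Java_com_example_ncnn_NCNNClassifier_inference(
    JNIEnv* env, jobject thiz,
    jbyteArray imageData, jint width, jint height
) {
    // imageData holds packed RGB pixels; inference() converts them to ncnn::Mat
    jbyte* pixels = env->GetByteArrayElements(imageData, nullptr);
    auto result = classifier.inference(
        reinterpret_cast<const unsigned char*>(pixels), width, height);
    env->ReleaseByteArrayElements(imageData, pixels, JNI_ABORT);
    // Return a Java float[]
    jfloatArray output = env->NewFloatArray(result.size());
    env->SetFloatArrayRegion(output, 0, result.size(), result.data());
    return output;
}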
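4.3 Qualcomm Hexagon NPU
kotlin
// ===== Qualcomm Hexagon NPU acceleration =====
/*
Hexagon NPU: the dedicated AI accelerator in Qualcomm processors
Snapdragon 8 Gen 3: 45 TOPS
*/
// 1. Prepare a quantized model (TFLite INT8 format required)
val model = loadModelFile("model_int8.tflite")
// 2. Create Options
// TFLite has no "prefer NPU" switch. The NNAPI delegate lets the vendor driver place
// supported ops on the NPU/DSP; unsupported ops fall back to the CPU.
val nnapiDelegate = NnApiDelegate(
    NnApiDelegate.Options().apply {
        setExecutionPreference(NnApiDelegate.Options.EXECUTION_PREFERENCE_SUSTAINED_SPEED)
    }
)
val options = Interpreter.Options().apply {
    addDelegate(nnapiDelegate)
}
// 3. Create the Interpreter
val interpreter = Interpreter(model, options)
// 4. Memory (avoiding OOM): reuse a single Interpreter and release native memory when done
// interpreter.close(); nnapiDelegate.close()
// ===== Benchmarking =====
/*
Benchmark NPU vs CPU vs GPU performance
*/
class PerformanceBenchmark {
    // runInference, cpuOptions, gpuOptions and npuOptions are placeholders defined by the app.
    // The Interpreter is created once, outside the timed loop, so model loading is not measured.
    private fun timeRuns(interpreter: Interpreter, numRuns: Int): List<Long> {
        val times = mutableListOf<Long>()
        repeat(numRuns) {
            val start = System.nanoTime()
            runInference(interpreter)
            times.add(System.nanoTime() - start)
        }
        return times
    }
    fun benchmark(modelFile: File, numRuns: Int = 100) {
        val cpuTimes = timeRuns(Interpreter(modelFile, cpuOptions), numRuns)
        val gpuTimes = timeRuns(Interpreter(modelFile, gpuOptions), numRuns)
        val npuTimes = timeRuns(Interpreter(modelFile, npuOptions), numRuns)
        println("CPU avg: ${cpuTimes.average() / 1_000_000}ms")
        println("GPU avg: ${gpuTimes.average() / 1_000_000}ms")
        println("NPU avg: ${npuTimes.average() / 1_000_000}ms")
    }
}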
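Averages alone can hide the warm-up cost and tail latency that matter on thermally constrained devices. As a language-neutral illustration of the measurement logic (shown in Python with a stand-in workload; `benchmark` and `fake_inference` are hypothetical names), the sketch below discards warm-up runs and reports the median and 95th percentile next to the mean.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Times `run_once` after a warm-up phase and summarizes latency in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm caches, JIT and delegates; these runs are not recorded
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "runs": runs,
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


def fake_inference() -> int:
    """Stand-in workload; replace with a real interpreter call."""
    return sum(i * i for i in range(20_000))


print(benchmark(fake_inference, warmup=3, runs=30))
```

Reporting p95 alongside the mean makes throttling and scheduler hiccups visible, which a single average over CPU, GPU, and NPU runs would smooth over.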
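5. Embedded Device Deployment
5.1 Deploying on NVIDIA Jetson
bash
# ===== Jetson environment setup =====
# 1. Install JetPack (includes CUDA/cuDNN/TensorRT)
# JetPack 6.0 + L4T 36.3
# 2. Install PyTorch for Jetson (check the wheel index against NVIDIA's instructions for your JetPack version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/jp6
# 3. Install TensorRT
# TensorRT is already included in JetPack
# 4. Install an inference framework
pip install onnxruntime-gpu # CUDA support; on Jetson a JetPack-specific wheel may be required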
python
# ===== TensorRT inference =====
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt
class TensorRTEngine:
    """Runs a serialized engine with static shapes (tensor-name API, TensorRT 8.5+)."""
    def __init__(self, engine_path: str):
        # Initialize CUDA (kept separate from the TensorRT execution context)
        cuda.init()
        self.device = cuda.Device(0)
        self.cuda_ctx = self.device.make_context()
        # Load the engine
        logger = trt.Logger(trt.Logger.WARNING)
        runtime = trt.Runtime(logger)
        with open(engine_path, "rb") as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Allocate host and device buffers for every I/O tensor
        self.inputs = []
        self.outputs = []
        self.stream = cuda.Stream()
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            size = trt.volume(self.engine.get_tensor_shape(name))
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host_mem = cuda.pagelocked_empty(size, dtype=dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.context.set_tensor_address(name, int(device_mem))
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs.append((host_mem, device_mem))
            else:
                self.outputs.append((host_mem, device_mem))
    def inference(self, input_data: np.ndarray) -> np.ndarray:
        # Copy the input to the device
        np.copyto(self.inputs[0][0], input_data.ravel())
        cuda.memcpy_htod_async(
            self.inputs[0][1],
            self.inputs[0][0],
            self.stream
        )
        # Execute (tensor addresses were bound in __init__)
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        # Copy the output back to the host
        cuda.memcpy_dtoh_async(
            self.outputs[0][0],
            self.outputs[0][1],
            self.stream
        )
        self.stream.synchronize()
        return self.outputs[0][0]
    def __del__(self):
        self.cuda_ctx.pop()
# ===== TensorRT model conversion =====
"""
ONNX → TensorRT
"""
def onnx_to_tensorrt(onnx_path: str, trt_path: str):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0).desc())
    config = builder.create_builder_config()
    config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE, 1 << 30  # 1 GB
    )
    # FP16 acceleration
    config.set_flag(trt.BuilderFlag.FP16)
    engine = builder.build_serialized_network(network, config)
    if engine is None:
        raise RuntimeError("TensorRT engine build failed")
    with open(trt_path, "wb") as f:
        f.write(engine)
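5.2 Deploying on the RK3588 NPU
bash
# ===== RK3588 environment setup =====
# 1. Install RKNN Toolkit2 (Rockchip's NPU toolchain)
pip install rknn-toolkit2
# 2. Model conversion
# PyTorch/TensorFlow/ONNX → RKNN
python
# ===== RKNN model conversion =====
import numpy as np
from rknn.api import RKNN
def convert_to_rknn(
    model_path: str,
    input_nodes: list,
    output_nodes: list,
    quantize: bool = True
):
    """
    Convert to RKNN format (for the RK3588 NPU).
    input_nodes is passed as the input size list when loading a PyTorch model.
    """
    rknn = RKNN()
    # Configure before loading (applies to every source format)
    rknn.config(
        mean_values=[[123.675, 116.28, 103.53]],  # RGB mean
        std_values=[[58.395, 57.12, 57.375]],     # RGB std
        target_platform="rk3588",
    )
    # Load the model
    if model_path.endswith(".onnx"):
        ret = rknn.load_onnx(model=model_path)
    elif model_path.endswith(".pt"):
        ret = rknn.load_pytorch(model=model_path, input_size_list=input_nodes)
    else:
        raise ValueError(f"Unsupported model format: {model_path}")
    if ret != 0:
        raise RuntimeError("Failed to load the model")
    # Build
    ret = rknn.build(
        do_quantization=quantize,
        dataset="calibration_data.txt",  # quantization dataset
    )
    if ret != 0:
        raise RuntimeError("RKNN build failed")
    # Export
    ret = rknn.export_rknn("model.rknn")
    return rknn
# ===== RKNN inference =====
def inference_rknn(rknn_path: str, input_data: np.ndarray):
    rknn = RKNN()
    rknn.load_rknn(rknn_path)
    # Initialize the runtime on a connected RK3588 board
    # (on the board itself, the lighter rknn-toolkit-lite2 runtime is the usual choice)
    ret = rknn.init_runtime(target="rk3588")
    if ret != 0:
        raise RuntimeError("Failed to initialize the RKNN runtime")
    # Run inference
    output = rknn.inference(inputs=[input_data])
    rknn.release()
    return output[0]
# ===== RKNN performance comparison =====
"""
RK3588 NPU performance:
- 6 TOPS of compute
- INT8 support
- About 10x the energy efficiency of the CPU
Measured comparison (MobileNetV3):
- CPU: ~120 ms
- GPU: ~35 ms
- NPU: ~8 ms
"""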
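6. Production Deployment Architecture
6.1 Device-Cloud Collaboration
python
# ===== Device-cloud collaborative inference =====
"""
Scenario split:
- Simple inference → on-device (low latency, better privacy)
- Complex inference → cloud (more compute, higher accuracy)
- Hybrid → fast on-device response + precise cloud result
"""
import asyncio
import os
from typing import Optional, Dict, Any
class HybridInference:
    """Hybrid device-cloud inference engine.
    LocalEngine, CloudClient, InferenceRouter, Task and InferenceResult are application-defined.
    """
    def __init__(self, config: Dict[str, Any]):
        self.local_engine = LocalEngine(config["local"])
        self.cloud_client = CloudClient(config["cloud"])
        self.router = InferenceRouter(config["routing"])
    async def infer(
        self,
        task: "Task",
        priority: str = "balanced"
    ) -> "InferenceResult":
        # 1. Routing decision
        routing = await self.router.decide(task)
        # 2. Run inference
        if routing["strategy"] == "local_only":
            return await self.local_engine.infer(task)
        elif routing["strategy"] == "cloud_only":
            return await self.cloud_client.infer(task)
        elif routing["strategy"] == "hybrid":
            # 3.1 Start the cloud request first so it runs concurrently
            cloud_task = asyncio.create_task(
                self.cloud_client.infer(task)
            )
            # 3.2 Fast on-device response
            local_result = await self.local_engine.infer(task)
            # 3.3 Pick the result (by timeout/confidence)
            try:
                cloud_result = await asyncio.wait_for(
                    cloud_task,
                    timeout=routing["cloud_timeout"]
                )
                return self.merge_results(local_result, cloud_result)
            except asyncio.TimeoutError:
                return local_result
        else:  # cascade
            # 4. Cascade: the device decides whether the cloud is needed
            local_result = await self.local_engine.infer(task)
            if local_result.confidence < routing["confidence_threshold"]:
                return await self.cloud_client.infer(task)
            return local_result
# ===== Model version management =====
class ModelManager:
    """On-device model version management and updates (model stored as a single file)."""
    def __init__(self, storage_path: str, cloud_client):
        self.storage_path = storage_path
        self.cloud_client = cloud_client
        self.current_path = os.path.join(storage_path, "current.model")
        self.backup_path = os.path.join(storage_path, "backup.model")
        self.current_version: Optional[str] = None
    @staticmethod
    def _version_key(version: str):
        # "1.10.0" → (1, 10, 0), so versions compare numerically rather than as strings
        return tuple(int(part) for part in version.split("."))
    async def check_updates(self) -> Optional[str]:
        """Check for a model update."""
        remote_version = await self.cloud_client.get_latest_version()
        if self.current_version is None or (
            self._version_key(remote_version) > self._version_key(self.current_version)
        ):
            return remote_version
        return None
    async def download_update(self, version: str) -> bool:
        """Download and install a new model."""
        # 1. Download to a temporary path
        temp_path = f"{self.storage_path}/temp_{version}"
        await self.cloud_client.download_model(version, temp_path)
        # 2. Validate the model (validate_model is application-defined)
        if not self.validate_model(temp_path):
            return False
        # 3. Swap files; each os.replace call is an atomic rename
        if os.path.exists(self.current_path):
            os.replace(self.current_path, self.backup_path)
        os.replace(temp_path, self.current_path)
        # 4. Update the version number
        self.current_version = version
        return True
    def rollback(self):
        """Roll back to the previous version."""
        if os.path.exists(self.backup_path):
            os.replace(self.backup_path, self.current_path)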
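The update flow above (download, validate, swap, roll back) can be exercised end to end with plain files. The sketch below is a minimal standard-library version under a few assumptions: the model is a single file, validation is a SHA-256 comparison, and the backup lives next to the current file with a `.bak` suffix. All function names are illustrative.

```python
import hashlib
import os
import tempfile
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """'1.10.0' -> (1, 10, 0), so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def install_model(new_file: str, expected_sha256: str, current_path: str) -> bool:
    """Installs new_file as current_path if its checksum matches; keeps a .bak copy."""
    if sha256_of(new_file) != expected_sha256:
        os.remove(new_file)  # reject a corrupted or tampered download
        return False
    if os.path.exists(current_path):
        os.replace(current_path, current_path + ".bak")
    os.replace(new_file, current_path)  # atomic rename on the same filesystem
    return True


def rollback(current_path: str) -> bool:
    """Restores the backup written by the last successful install."""
    backup = current_path + ".bak"
    if not os.path.exists(backup):
        return False
    os.replace(backup, current_path)
    return True


def read_bytes(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def demo() -> List[object]:
    """Runs install -> corrupted install -> rollback inside a temporary directory."""
    events: List[object] = []
    with tempfile.TemporaryDirectory() as root:
        current = os.path.join(root, "model.bin")
        download = os.path.join(root, "download.tmp")
        with open(current, "wb") as f:
            f.write(b"v1")
        with open(download, "wb") as f:
            f.write(b"v2")
        events.append(install_model(download, hashlib.sha256(b"v2").hexdigest(), current))
        events.append(read_bytes(current))
        with open(download, "wb") as f:
            f.write(b"corrupted")
        events.append(install_model(download, hashlib.sha256(b"v3").hexdigest(), current))
        events.append(read_bytes(current))
        events.append(rollback(current))
        events.append(read_bytes(current))
    return events


print(demo())
print(parse_version("1.10.0") > parse_version("1.9.3"))
```

Parsing versions into integer tuples avoids the string-comparison trap where "1.10.0" sorts before "1.9.3".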
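6.2 Privacy Protection
python
# ===== Privacy-protection techniques =====
"""
Key techniques:
1. Differential privacy
2. Federated learning
3. Secure multi-party computation (MPC)
4. Homomorphic encryption
5. TEE (trusted execution environment)
"""
# 1. Local differential privacy
import numpy as np
from typing import Dict
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
def local_differential_privacy(
    value: float, epsilon: float = 1.0, sensitivity: float = 1.0
) -> float:
    """
    Local differential privacy: add noise on the device.
    The smaller epsilon is, the stronger the privacy and the lower the accuracy.
    """
    # Laplace noise with scale = sensitivity / epsilon
    noise = np.random.laplace(0, sensitivity / epsilon)
    return value + noise
# 2. Data minimization
class PrivacyFilter:
    """Filters sensitive information before data is sent."""
    def filter(self, data: Dict) -> Dict:
        # Remove PII
        filtered = {k: v for k, v in data.items() if not self._is_pii(k)}
        return filtered
    def _is_pii(self, key: str) -> bool:
        pii_keywords = ["name", "email", "phone", "address", "ssn"]
        return any(kw in key.lower() for kw in pii_keywords)
# 3. TEE inference (ARM TrustZone)
"""
ARM TrustZone: run sensitive operations in the secure world
Android: Keymaster / StrongBox
iOS: Secure Enclave
"""
# 4. Model encryption
class EncryptedModel:
    """Protects the model file with AES-GCM encryption."""
    NONCE_SIZE = 12  # assumed file layout: 12-byte nonce, then ciphertext + tag
    def __init__(self, encrypted_path: str, key: bytes):
        self.cipher = AESGCM(key)
        self.model_path = encrypted_path
    def _load_encrypted(self, path: str) -> bytes:
        with open(path, "rb") as f:
            blob = f.read()
        nonce, ciphertext = blob[:self.NONCE_SIZE], blob[self.NONCE_SIZE:]
        return self.cipher.decrypt(nonce, ciphertext, None)
    def infer(self, input_data):
        # The model is decrypted in memory and used immediately
        # (_run_inference is application-defined)
        decrypted = self._load_encrypted(self.model_path)
        return self._run_inference(decrypted, input_data)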
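The relationship between epsilon and noise can be checked numerically. The sketch below is a standard-library illustration of the Laplace mechanism with scale `sensitivity / epsilon`; it draws Laplace noise as the difference of two exponential samples, and the helper names are made up for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(value: float, sensitivity: float, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism: adds noise with scale = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_noise(epsilon: float, samples: int, seed: int = 0) -> float:
    """Average |noise| added to a value of 0.0; its expectation is sensitivity / epsilon."""
    rng = random.Random(seed)
    return sum(abs(privatize(0.0, 1.0, epsilon, rng)) for _ in range(samples)) / samples


def mean_report(value: float, epsilon: float, samples: int, seed: int = 1) -> float:
    """Average of many independent noisy reports of the same value."""
    rng = random.Random(seed)
    return sum(privatize(value, 1.0, epsilon, rng) for _ in range(samples)) / samples


print(f"epsilon=1.0: mean |noise| = {mean_abs_noise(1.0, 5000):.2f}")
print(f"epsilon=0.1: mean |noise| = {mean_abs_noise(0.1, 5000):.2f}")
print(f"mean of 20000 noisy reports of 50.0 = {mean_report(50.0, 1.0, 20000):.2f}")
```

Each individual report is noisy, but the noise has zero mean, so aggregates over many devices stay close to the true value. That is the trade local differential privacy makes.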
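7. Summary
On-device AI deployment decision tree
Input scenario:
├── Text generation (LLM)
│   ├── iOS/Mac → MLX (Apple Silicon) / llama.cpp
│   ├── Android → GGUF + llama.cpp for Android
│   └── Embedded → ONNX Runtime / TensorRT
│
├── Image classification/detection
│   ├── iOS → Core ML + Neural Engine
│   ├── Android → TFLite / NCNN + NPU
│   └── Embedded → TensorRT / RKNN
│
└── Device-cloud collaboration
    ├── Simple tasks → on-device
    ├── Complex tasks → cloud
    └── Privacy-sensitive → on-device + differential privacy
Quantization selection guide
| Quantization | Accuracy (vs. FP16) | Compression (vs. FP16) | Speedup (vs. FP16) | Best for |
|---|---|---|---|---|
| FP16 | 100% | 1x | 1x (baseline) | Accuracy-sensitive workloads |
| INT8 | ~98% | 2x | 1.5x | A balanced choice |
| INT4 | ~95% | 4x | 2x | Memory-constrained devices |
| INT4 + AWQ | ~97% | 4x | 2x | Recommended |
| INT4 + GPTQ | ~96% | 4x | 2x | LLM-specific |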
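The compression column translates directly into weight-file sizes. The arithmetic below is a rough sketch for a 7B-parameter model; it counts weights only (no KV cache or activations). The per-group overhead of one 16-bit scale plus one 4-bit zero point per 128 weights is an assumed layout, not the exact format of any particular library.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes: parameters * bits / 8 / 1e9."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(w_bit: int, group_size: int, scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits per weight once per-group scale and zero-point metadata are included."""
    return w_bit + (scale_bits + zero_bits) / group_size


N_PARAMS = 7e9  # a 7B-parameter model
sizes = {
    "FP16": weight_size_gb(N_PARAMS, 16),
    "INT8": weight_size_gb(N_PARAMS, 8),
    "INT4": weight_size_gb(N_PARAMS, 4),
    "INT4, group size 128": weight_size_gb(N_PARAMS, effective_bits(4, 128)),
}
for name, size in sizes.items():
    print(f"{name:>22}: {size:5.2f} GB ({sizes['FP16'] / size:.2f}x smaller than FP16)")
```

The group metadata costs only about 0.16 bits per weight, so a grouped INT4 model stays close to the nominal 4x compression while being noticeably more accurate than a single per-tensor scale.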
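Recommended frameworks by platform
| Platform | Recommended frameworks | Hardware acceleration |
|---|---|---|
| iOS | Core ML / MLX / llama.cpp | Neural Engine |
| Android | TFLite / NCNN / ONNX Runtime | Hexagon NPU |
| NVIDIA Jetson | TensorRT | GPU |
| RK3588 | RKNN Toolkit2 | NPU, 6 TOPS |
| Raspberry Pi 5 | ONNX Runtime (CPU) / TFLite with Edge TPU | CPU / optional Edge TPU |
| Generic Linux | ONNX Runtime / llama.cpp | Vulkan/OpenCL |
Production deployment checklist
□ Model quantization (INT4 AWQ) to reduce size
□ Inference framework selection (TFLite/Core ML/ONNX Runtime)
□ Hardware acceleration (GPU/NPU/DSP)
□ Memory optimization (memory mapping, quantization)
□ Fallback strategy (CPU fallback)
□ Privacy protection (data filtering/differential privacy)
□ Model version management (hot updates/rollback)
□ A/B testing (old vs. new model)
□ Monitoring (latency/memory/accuracy)
This article covers the full on-device AI deployment stack: quantization (dynamic/static/INT4 GPTQ/AWQ + GGUF), pruning (structured/unstructured/SparseGPT), knowledge distillation (KL divergence/MiniLLM), inference frameworks (ONNX Runtime/MNN/TNN), iOS deployment (Core ML/llama.cpp/MLX), Android deployment (TFLite GPU delegate/NPU/NCNN JNI), embedded devices (Jetson TensorRT/RK3588 RKNN), and production architecture (device-cloud collaboration/model version management/differential privacy/TEE).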