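Optimizing a sentence-transformers deployment on AWS Lambda
Problems with the original approach
I first tried running the full sentence-transformers library directly on AWS Lambda to generate text embeddings. That approach had three main problems: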
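- Package size: sentence-transformers and its dependencies (notably PyTorch) total more than 250 MB, while Lambda limits a zip deployment package, layers included, to 250 MB unzipped.
- Slow cold starts: the large dependency set makes initialization slow whenever Lambda has to start a new execution environment.
- High memory use: the full model stack needs a lot of memory, which raises the cost of running the function.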
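To see how close a build is to the size limit before uploading, it helps to measure the unpacked directory. The following is a minimal sketch using only the standard library; the helper names `dir_size_mb` and `fits_in_lambda` are made up for illustration, and the 250 MB figure is the unzipped limit quoted above.

```python
import os

LAMBDA_UNZIPPED_LIMIT_MB = 250  # unzipped limit for zip-based deployments, as quoted above


def dir_size_mb(root: str) -> float:
    """Return the total size of all regular files under root, in MiB."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks so nothing is counted twice
                total += os.path.getsize(path)
    return total / (1024 * 1024)


def fits_in_lambda(root: str, limit_mb: float = LAMBDA_UNZIPPED_LIMIT_MB) -> bool:
    """True if the unpacked directory is within the given limit."""
    return dir_size_mb(root) <= limit_mb
```

Calling `dir_size_mb("./package")` on the build directory gives the number to compare against the limit; layers attached to the function count toward the same total.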
The optimized approach
Stack changes
I ended up with the following replacements:
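- ONNX Runtime: replaces PyTorch as the inference engine
  - Much smaller package (ONNX Runtime is about 50 MB)
  - Optimized inference performance
  - Supports several hardware-acceleration options
- HuggingFace Tokenizers: the tokenizers library used on its own
  - Contains only the text preprocessing that is actually needed
  - Small and fast (about 10 MB)
- Model: the ONNX version of all-MiniLM-L6-v2
  - Lightweight model (about 90 MB)
  - ONNX format, optimized for production use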
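Implementation details
- Lambda configuration:
  - Python 3.12 runtime
  - Memory set to 512 MB (in my tests it also ran with 256 MB)
  - Timeout of 15 seconds (enough for batched requests)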
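- Deployment package:

```bash
pip install --target ./package onnxruntime tokenizers
# Add the pre-converted ONNX model file manually
cd package && zip -r ../function.zip .
```

- Code structure:

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load the model and tokenizer once, outside the handler, so warm invocations reuse them
sess = ort.InferenceSession("model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
input_names = {i.name for i in sess.get_inputs()}

def lambda_handler(event, context):
    # Tokenize; ONNX Runtime expects int64 NumPy arrays shaped (batch, seq_len)
    enc = tokenizer.encode(event["text"])
    feeds = {
        "input_ids": np.array([enc.ids], dtype=np.int64),
        "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
        "token_type_ids": np.array([enc.type_ids], dtype=np.int64),
    }
    # Feed only the inputs this ONNX export actually declares
    outputs = sess.run(None, {k: v for k, v in feeds.items() if k in input_names})
    # Assumes outputs[0] holds per-token vectors (1, seq_len, dim): mean-pool over real tokens
    mask = feeds["attention_mask"][0][:, None]
    embedding = (outputs[0][0] * mask).sum(axis=0) / mask.sum()
    return {"embedding": embedding.tolist()}
```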
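A transformer export of this kind typically returns one vector per token, so turning the raw output into a sentence embedding needs a pooling step. The sketch below shows masked mean pooling followed by L2 normalization for a whole batch, on toy data rather than real model output. The function names are illustrative, and it assumes the model output has shape (batch, seq_len, dim); check the actual output shape of your export before relying on it.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) float array
    attention_mask:   (batch, seq_len) array of 0/1
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# Toy data standing in for model output: 2 sequences, 3 token slots, 2 dimensions.
tokens = np.array(
    [[[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],   # last slot is padding and must be ignored
     [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]],
    dtype=np.float32,
)
mask = np.array([[1, 1, 0], [1, 1, 1]])
embeddings = l2_normalize(mean_pool(tokens, mask))
```

With normalized vectors, `embeddings @ embeddings.T` gives the pairwise cosine similarities directly, which is convenient for the semantic search use case.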
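Performance comparison
| Metric | Original approach | Optimized approach |
|---|---|---|
| Package size | >250 MB | ~150 MB |
| Cold start time | 3-5 s | 1-2 s |
| Latency per invocation | 300-500 ms | 150-300 ms |
| Memory usage | 700 MB+ | 300-400 MB |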
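Lambda bills compute by configured memory multiplied by execution time (GB-seconds), so the memory and latency rows combine into one cost figure. The arithmetic can be sketched as below. Every input here is an assumption for illustration, not a measurement: a 1024 MB setting for the original approach (chosen only because it used 700 MB+), the midpoints of the latency ranges in the table, one million invocations a month, and a placeholder per-GB-second price that should be checked against current AWS pricing for your region and architecture.

```python
def gb_seconds(memory_mb: float, duration_ms: float) -> float:
    """Billable compute for one invocation: configured memory (GB) x duration (s)."""
    return (memory_mb / 1024) * (duration_ms / 1000)


def monthly_compute_cost(memory_mb: float, duration_ms: float,
                         invocations: int, price_per_gb_second: float) -> float:
    """Compute charge only; per-request charges and free tier are not included."""
    return gb_seconds(memory_mb, duration_ms) * invocations * price_per_gb_second


# Assumed inputs (see the paragraph above); replace them with your own numbers.
PRICE_PER_GB_SECOND = 0.0000166667   # placeholder price, verify before use
INVOCATIONS = 1_000_000

original = monthly_compute_cost(1024, 400, INVOCATIONS, PRICE_PER_GB_SECOND)
optimized = monthly_compute_cost(512, 225, INVOCATIONS, PRICE_PER_GB_SECOND)
ratio = optimized / original
```

Because the price cancels out in `ratio`, the relative saving depends only on the memory setting and the duration, which is why lowering both at once pays off.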
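Use cases
This optimized setup is a particularly good fit for:
- Services that need real-time text embeddings
- Semantic search applications on a serverless architecture
- Large-scale text similarity workloads
- Cost-sensitive deployments of AI services
Further optimization is possible, for example by using a quantized ONNX model or trying a smaller distilled model.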
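To give a feel for what quantization buys, here is a small NumPy illustration of symmetric int8 weight quantization. It only demonstrates the underlying idea (store weights as 8-bit integers plus a scale factor); it is not ONNX Runtime's quantization tooling, and the random 384 x 384 matrix is a stand-in for a real weight tensor.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus one float scale."""
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; any scale works
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(384, 384)).astype(np.float32)  # illustrative weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by roughly half a quantization step
```

The int8 copy takes a quarter of the bytes of the float32 original, at the price of a rounding error of at most about `scale / 2` per weight. Whether that error is acceptable for embedding quality has to be checked on real data.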
Installation log

```bash
# 1. Create the layer directory on the host (required)
mkdir -p "$(pwd)/layer" && \
# 2. Install dependencies only, with no cleanup yet (make sure the install succeeds first)
docker run --rm -v "$(pwd)/layer":/layer --entrypoint "" public.ecr.aws/lambda/python:3.12 \
bash -c "
# Make sure the target path exists inside the container
mkdir -p /layer/python/lib/python3.12/site-packages/ && \
# Install the core dependencies (NumPy 1.x + ONNX Runtime + tokenizers)
pip install numpy==1.26.4 onnxruntime==1.17.0 tokenizers==0.15.2 \
--no-cache-dir -t /layer/python/lib/python3.12/site-packages/ \
--force-reinstall
"
```

Collecting numpy==1.26.4
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting onnxruntime==1.17.0
Downloading onnxruntime-1.17.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.2 kB)
Collecting tokenizers==0.15.2
Downloading tokenizers-0.15.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting coloredlogs (from onnxruntime==1.17.0)
Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting flatbuffers (from onnxruntime==1.17.0)
Downloading flatbuffers-25.12.19-py2.py3-none-any.whl.metadata (1.0 kB)
Collecting packaging (from onnxruntime==1.17.0)
Downloading packaging-26.0-py3-none-any.whl.metadata (3.3 kB)
Collecting protobuf (from onnxruntime==1.17.0)
Downloading protobuf-7.34.0-cp310-abi3-manylinux2014_x86_64.whl.metadata (595 bytes)
Collecting sympy (from onnxruntime==1.17.0)
Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting huggingface_hub<1.0,>=0.16.4 (from tokenizers==0.15.2)
Downloading huggingface_hub-0.36.2-py3-none-any.whl.metadata (15 kB)
Collecting filelock (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading filelock-3.25.0-py3-none-any.whl.metadata (2.0 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading fsspec-2026.2.0-py3-none-any.whl.metadata (10 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading hf_xet-1.3.2-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (4.9 kB)
Collecting pyyaml>=5.1 (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.4 kB)
Collecting requests (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting tqdm>=4.42.1 (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading tqdm-4.67.3-py3-none-any.whl.metadata (57 kB)
Collecting typing-extensions>=3.7.4.3 (from huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime==1.17.0)
Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy->onnxruntime==1.17.0)
Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting charset_normalizer<4,>=2 (from requests->huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading charset_normalizer-3.4.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (39 kB)
Collecting idna<4,>=2.5 (from requests->huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading idna-3.11-py3-none-any.whl.metadata (8.4 kB)
Collecting urllib3<3,>=1.21.1 (from requests->huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading urllib3-2.6.3-py3-none-any.whl.metadata (6.9 kB)
Collecting certifi>=2017.4.17 (from requests->huggingface_hub<1.0,>=0.16.4->tokenizers==0.15.2)
Downloading certifi-2026.2.25-py3-none-any.whl.metadata (2.5 kB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.0/18.0 MB 8.6 MB/s eta 0:00:00
Downloading onnxruntime-1.17.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 9.6 MB/s eta 0:00:00
Downloading tokenizers-0.15.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 11.2 MB/s eta 0:00:00
Downloading huggingface_hub-0.36.2-py3-none-any.whl (566 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 566.4/566.4 kB 11.8 MB/s eta 0:00:00
Downloading packaging-26.0-py3-none-any.whl (74 kB)
Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Downloading flatbuffers-25.12.19-py2.py3-none-any.whl (26 kB)
Downloading protobuf-7.34.0-cp310-abi3-manylinux2014_x86_64.whl (324 kB)
Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 9.7 MB/s eta 0:00:00
Downloading fsspec-2026.2.0-py3-none-any.whl (202 kB)
Downloading hf_xet-1.3.2-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (4.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.2/4.2 MB 11.3 MB/s eta 0:00:00
Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 11.3 MB/s eta 0:00:00
Downloading pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (807 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 807.9/807.9 kB 13.0 MB/s eta 0:00:00
Downloading tqdm-4.67.3-py3-none-any.whl (78 kB)
Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
Downloading filelock-3.25.0-py3-none-any.whl (26 kB)
Downloading requests-2.32.5-py3-none-any.whl (64 kB)
Downloading certifi-2026.2.25-py3-none-any.whl (153 kB)
Downloading charset_normalizer-3.4.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (196 kB)
Downloading idna-3.11-py3-none-any.whl (71 kB)
Downloading urllib3-2.6.3-py3-none-any.whl (131 kB)
Installing collected packages: mpmath, flatbuffers, urllib3, typing-extensions, tqdm, sympy, pyyaml, protobuf, packaging, numpy, idna, humanfriendly, hf-xet, fsspec, filelock, charset_normalizer, certifi, requests, coloredlogs, onnxruntime, huggingface_hub, tokenizers
Successfully installed certifi-2026.2.25 charset_normalizer-3.4.5 coloredlogs-15.0.1 filelock-3.25.0 flatbuffers-25.12.19 fsspec-2026.2.0 hf-xet-1.3.2 huggingface_hub-0.36.2 humanfriendly-10.0 idna-3.11 mpmath-1.3.0 numpy-1.26.4 onnxruntime-1.17.0 packaging-26.0 protobuf-7.34.0 pyyaml-6.0.3 requests-2.32.5 sympy-1.14.0 tokenizers-0.15.2 tqdm-4.67.3 typing-extensions-4.15.0 urllib3-2.6.3
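
The last line of the log lists the versions pip actually resolved. A quick way to confirm that the pinned packages landed in the layer directory is to read the installed metadata with the standard library. This is a sketch only: `installed_versions` and `check_pins` are illustrative helper names, and the pins are copied from the pip command in this section.

```python
from importlib import metadata

# Pins taken from the pip install command in this section
EXPECTED = {"numpy": "1.26.4", "onnxruntime": "1.17.0", "tokenizers": "0.15.2"}


def installed_versions(site_packages: str) -> dict:
    """Map normalized distribution name -> version for everything under site_packages."""
    found = {}
    for dist in metadata.distributions(path=[site_packages]):
        name = dist.metadata["Name"]
        if name:
            found[name.lower().replace("_", "-")] = dist.version
    return found


def check_pins(site_packages: str, expected: dict = EXPECTED) -> dict:
    """Return {name: (expected, found)} for every pin that is missing or different."""
    found = installed_versions(site_packages)
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }
```

Running `check_pins("layer/python/lib/python3.12/site-packages")` on the host should return an empty dict when all three pins match.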
Slimming down

```bash
# Enter the layer's site-packages directory on the host
cd "$(pwd)/layer/python/lib/python3.12/site-packages/" && \
# Remove redundant files (run on the host, where this path is guaranteed to exist)
rm -rf tests test examples docs doc *.egg-info && \
find . -name '*.pyc' -delete && \
find . -type d -name '__pycache__' -exec rm -rf {} + && \
rm -rf onnxruntime/{tools,experimental,contrib} && \
rm -rf tokenizers/{bindings,scripts}
```
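
Once the directory is trimmed, the layer still has to be zipped with the `python/` folder at the root of the archive, since that is where Lambda looks for Python layer content. A minimal sketch of that step with the standard library follows; `zip_layer` is an illustrative helper name, and the archive path must be outside the layer directory so the zip does not try to include itself.

```python
import os
import zipfile


def zip_layer(layer_dir: str, zip_path: str) -> int:
    """Zip layer_dir so that its python/ folder sits at the archive root.

    zip_path must be outside layer_dir. Returns the archive size in bytes.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(layer_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Store paths relative to layer_dir, e.g. python/lib/python3.12/...
                zf.write(full, os.path.relpath(full, layer_dir))
    return os.path.getsize(zip_path)
```

For example, `zip_layer("layer", "layer.zip")` run from the directory that contains `layer/` produces an archive that can be uploaded as a layer, and the returned size shows how much the cleanup saved.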