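1. Introduction
The previous article surveyed the overall architecture of a content-globalization toolchain. This one is hands-on: using Python, Docker, and GitHub Actions, we build an automated translation-and-dubbing pipeline for publishing content overseas. The goal is to upload one Chinese-language video, automatically produce English, Japanese, and Spanish versions, and publish them to YouTube.
The core code is listed below and can be copied as a starting point (`dubbing.py`, `distribute.py`, and `notify.py` are imported by the main flow but not listed in this article). The architecture is:
Git Push → GitHub Actions trigger → Docker container starts
→ Whisper speech recognition → DeepL/GPT translation → ElevenLabs/Cutrix dubbing
→ FFmpeg compositing → YouTube API upload → DingTalk notification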
2. Environment Setup
2.1 Dependencies
bash
# System dependencies
apt-get install -y ffmpeg python3.11 python3-pip
# Python dependencies
pip install faster-whisper openai deepl elevenlabs pyyaml google-api-python-client
pip install google-auth-oauthlib boto3 requests
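2.2 API Keys
| Service | Purpose | Where to get it | Free quota |
|---|---|---|---|
| DeepL API | Text translation | deepl.com/pro-api | 500,000 characters/month |
| ElevenLabs | TTS dubbing | elevenlabs.io/api | 10,000 characters/month |
| OpenAI API (GPT-4o) | Translation correction and terminology checks | platform.openai.com | N/A |
| YouTube Data API v3 | Video upload | console.cloud.google.com | 10,000 units/day |
| Cutrix API | Translation and dubbing (alternative) | cutrix.cc | N/A |
For production, consider replacing the DeepL + ElevenLabs combination with the Cutrix API, which handles translation, dubbing, and subtitles through a single interface and saves the integration and maintenance work for three separate APIs.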
2.3 Directory Layout
content-globalization-pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── config.yaml              # languages / API keys / platform settings
├── src/
│   ├── main.py              # pipeline entry point
│   ├── transcribe.py        # Whisper speech recognition
│   ├── translate.py         # translation engine
│   ├── dubbing.py           # TTS dubbing
│   ├── compose.py           # FFmpeg compositing
│   ├── distribute.py        # multi-platform distribution
│   └── notify.py            # notifications
├── .github/workflows/
│   └── pipeline.yml         # CI/CD configuration
└── tests/
    └── test_pipeline.py
3. Core Implementation
3.1 Configuration (config.yaml)
yaml
pipeline:
  source_lang: "zh"
  target_langs: ["en", "ja", "es"]
  video_dir: "/data/videos"
  output_dir: "/data/output"
asr:
  engine: "faster-whisper"
  model: "large-v3"
  compute_type: "float16"  # float16 on GPU, int8 on CPU
translation:
  engine: "deepl"  # deepl | gpt4o | cutrix
  glossary_path: "/data/glossary.yaml"  # glossary file
dubbing:
  engine: "elevenlabs"  # elevenlabs | cutrix
  voice_map:  # per-speaker voice configuration
    default: "zh-CN-YunxiNeural"  # placeholder: this is an Azure-style voice name; ElevenLabs expects one of its own voice IDs
distribution:
  youtube:
    enabled: true
    category_id: "28"  # Science & Technology
    privacy_status: "private"  # upload as private first, publish after review
notify:
  dingtalk_webhook: "${DINGTALK_WEBHOOK}"  # yaml.safe_load does not expand this; substitute the environment variable after loading
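The `${DINGTALK_WEBHOOK}` placeholder in the config is not resolved by `yaml.safe_load`, which returns the string literally. One way to handle this, sketched here under the assumption that expansion happens right after loading, is to walk the loaded structure and expand environment references. The helper name `expand_env` and the placeholder URL are illustrative and not part of the original project.

```python
import os


def expand_env(value):
    """Recursively expand ${VAR} references in strings inside dicts and lists."""
    if isinstance(value, str):
        # os.path.expandvars leaves references to unset variables unchanged
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # numbers, booleans, None pass through untouched


if __name__ == "__main__":
    # Placeholder value for demonstration only
    os.environ["DINGTALK_WEBHOOK"] = "https://example.invalid/hook"
    loaded = {
        "notify": {"dingtalk_webhook": "${DINGTALK_WEBHOOK}"},
        "pipeline": {"target_langs": ["en", "ja", "es"]},
    }
    print(expand_env(loaded)["notify"]["dingtalk_webhook"])
```

In `main.py` this would be applied as `config = expand_env(yaml.safe_load(f))`, so every module sees resolved values.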
3.2 Main Flow (src/main.py)
python
import logging
from pathlib import Path

import yaml

from transcribe import transcribe_video
from translate import translate_segments
from dubbing import generate_dubbing
from compose import compose_video
from distribute import upload_to_youtube
from notify import send_dingtalk

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_pipeline(video_path: str, config: dict) -> dict:
    """Full pipeline: one source video -> one output video per target language."""
    video_name = Path(video_path).stem
    results = {}
    # Step 1: Whisper speech recognition. The transcript does not depend on the
    # target language, so ASR runs once per video instead of once per language.
    segments = transcribe_video(
        video_path,
        model=config["asr"]["model"],
        compute_type=config["asr"]["compute_type"]
    )
    logger.info(f"[{video_name}] ASR done: {len(segments)} segments")
    for target_lang in config["pipeline"]["target_langs"]:
        logger.info(f"[{target_lang}] Processing: {video_name}")
        # Step 2: sentence-by-sentence translation
        translated = translate_segments(
            segments,
            source_lang=config["pipeline"]["source_lang"],
            target_lang=target_lang,
            engine=config["translation"]["engine"],
            glossary=config["translation"].get("glossary_path")
        )
        logger.info(f"[{target_lang}] Translation done")
        # Step 3: TTS dubbing
        audio_path = generate_dubbing(
            translated,
            target_lang=target_lang,
            engine=config["dubbing"]["engine"],
            voice=config["dubbing"]["voice_map"].get("default")
        )
        logger.info(f"[{target_lang}] Dubbing done: {audio_path}")
        # Step 4: FFmpeg compositing
        output_path = compose_video(
            video_path=video_path,
            audio_path=audio_path,
            translated_segments=translated,
            target_lang=target_lang,
            output_dir=config["pipeline"]["output_dir"]
        )
        logger.info(f"[{target_lang}] Compositing done: {output_path}")
        # Step 5: YouTube upload
        if config["distribution"]["youtube"]["enabled"]:
            video_id = upload_to_youtube(
                video_path=output_path,
                title=f"{video_name} [{target_lang.upper()}]",
                target_lang=target_lang,
                config=config["distribution"]["youtube"]
            )
            results[target_lang] = video_id
            logger.info(f"[{target_lang}] Upload done: {video_id}")
    return results


if __name__ == "__main__":
    with open("config.yaml", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    for video in Path(config["pipeline"]["video_dir"]).glob("*.mp4"):
        try:
            results = run_pipeline(str(video), config)
            send_dingtalk(f"✅ {video.name} finished\n" +
                          "\n".join(f"{l}: https://youtu.be/{v}" for l, v in results.items()))
        except Exception as e:
            logger.error(f"❌ {video.name} failed: {e}")
            send_dingtalk(f"❌ {video.name} failed: {str(e)[:200]}")
            raise  # stop the run so CI marks the job as failed
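`main.py` imports `send_dingtalk` from `notify.py`, which is not listed in the article. A minimal sketch of what it might look like follows, assuming a DingTalk custom-robot webhook that accepts a JSON text message and assuming the URL arrives through the `DINGTALK_WEBHOOK` environment variable. The helper `build_text_payload` is a hypothetical name, and the sketch uses only the standard library rather than `requests`.

```python
import json
import os
import urllib.request
from typing import Optional


def build_text_payload(message: str) -> bytes:
    """Encode a plain-text message in the JSON shape used by DingTalk custom robots."""
    body = {"msgtype": "text", "text": {"content": message}}
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def send_dingtalk(message: str, webhook: Optional[str] = None,
                  timeout: float = 10.0) -> bool:
    """POST the message to the webhook; return False instead of raising on network errors."""
    webhook = webhook or os.environ.get("DINGTALK_WEBHOOK", "")
    if not webhook:
        return False  # nothing configured, so skip the notification
    request = urllib.request.Request(
        webhook,
        data=build_text_payload(message),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Checks the HTTP status only; robot-level errors are reported in
            # the JSON response body, which this sketch does not inspect.
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False
```

Returning a boolean rather than raising keeps a notification outage from masking the pipeline's own success or failure.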
3.3 Whisper Speech Recognition (src/transcribe.py)
python
from dataclasses import dataclass

from faster_whisper import WhisperModel


@dataclass
class Segment:
    start: float
    end: float
    text: str
    translated: str = ""  # filled in later by translate_segments


def transcribe_video(video_path: str, model: str = "large-v3",
                     compute_type: str = "float16",
                     device: str = "cuda") -> list[Segment]:
    """Transcribe with Whisper and return a list of timestamped segments."""
    # On CPU-only hosts pass device="cpu" together with compute_type="int8"
    whisper = WhisperModel(model, device=device, compute_type=compute_type)
    segments, _info = whisper.transcribe(
        video_path,
        beam_size=5,
        vad_filter=True,  # drop silent stretches
        vad_parameters=dict(
            min_silence_duration_ms=500  # minimum silence gap: 500 ms
        )
    )
    results = []
    for seg in segments:
        # Merge very short segments (< 1 second) into the previous one
        if results and seg.end - seg.start < 1.0:
            results[-1].text += " " + seg.text.strip()
            results[-1].end = seg.end
        else:
            results.append(Segment(
                start=seg.start,
                end=seg.end,
                text=seg.text.strip()
            ))
    return results
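The merge rule in the loop above can be tried without loading a model. The sketch below reimplements it on a small hand-written sample; `Seg` and `merge_short` are illustrative names, and the sample timings are made up. Note that a short first segment has nothing to merge into, so it is kept as-is.

```python
from dataclasses import dataclass


@dataclass
class Seg:
    start: float
    end: float
    text: str


def merge_short(segments, min_duration=1.0):
    """Fold segments shorter than min_duration into the preceding segment."""
    merged = []
    for seg in segments:
        if merged and seg.end - seg.start < min_duration:
            previous = merged[-1]
            previous.text += " " + seg.text.strip()
            previous.end = seg.end  # extend the previous segment to cover this one
        else:
            merged.append(Seg(seg.start, seg.end, seg.text.strip()))
    return merged


sample = [
    Seg(0.0, 2.4, "大家好"),
    Seg(2.4, 2.9, "嗯"),  # 0.5 s long, so it is merged into the previous segment
    Seg(3.2, 6.0, "今天聊出海"),
]
for seg in merge_short(sample):
    print(f"{seg.start:.1f}-{seg.end:.1f} {seg.text}")
# 0.0-2.9 大家好 嗯
# 3.2-6.0 今天聊出海
```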
3.4 Translation Engine (src/translate.py)
python
import os
from pathlib import Path

import deepl
import yaml


def translate_segments(segments, source_lang, target_lang,
                       engine="deepl", glossary=None):
    """Translate segment by segment, with optional glossary enforcement."""
    # NOTE: `engine` is accepted for configuration symmetry, but only the
    # DeepL path is implemented in this listing.
    # Load the glossary so proper nouns are rendered consistently
    glossary_map = {}
    if glossary and Path(glossary).exists():
        with open(glossary, encoding="utf-8") as f:
            glossary_map = (yaml.safe_load(f) or {}).get(target_lang, {})
    translator = deepl.Translator(os.environ["DEEPL_API_KEY"])
    lang_map = {"zh": "ZH", "en": "EN-US", "ja": "JA", "es": "ES"}
    results = []
    for seg in segments:
        text = seg.text
        # Wrap glossary terms in <glossary> tags before translating
        for cn_term in glossary_map:
            text = text.replace(cn_term, f"<glossary>{cn_term}</glossary>")
        result = translator.translate_text(
            text,
            source_lang=lang_map.get(source_lang, "ZH"),
            target_lang=lang_map.get(target_lang, "EN-US"),
            formality="prefer_less",   # colloquial register for short dramas and short videos
            tag_handling="xml",        # treat <glossary> as markup, not as text
            ignore_tags=["glossary"]   # leave the tagged terms untranslated
        )
        translated = result.text
        # Swap each protected term for its target-language rendering
        for cn_term, target_term in glossary_map.items():
            translated = translated.replace(
                f"<glossary>{cn_term}</glossary>", target_term
            )
        seg.translated = translated
        results.append(seg)
    return results
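The protect-then-restore idea behind the glossary handling can be exercised without an API key. The sketch below is an illustration under stated assumptions: `protect_terms`, `restore_terms`, and `fake_translate` are hypothetical names, the glossary entries are invented, and the stand-in "translator" merely upper-cases everything outside the tags. It also does the wrapping in a single regex pass, which avoids nested tags when one glossary term is a prefix of another.

```python
import re


def protect_terms(text: str, glossary: dict) -> str:
    """Wrap glossary source terms in <glossary> tags in a single pass."""
    if not glossary:
        return text
    # Longest terms first so "短剧出海" wins over its prefix "短剧"
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda m: f"<glossary>{m.group(0)}</glossary>", text)


def restore_terms(text: str, glossary: dict) -> str:
    """Replace each tagged source term with its target-language rendering."""
    def swap(match):
        term = match.group(1)
        return glossary.get(term, term)  # unknown terms are left as they were
    return re.sub(r"<glossary>(.*?)</glossary>", swap, text)


def fake_translate(text: str) -> str:
    """Stand-in for a real engine: upper-cases text outside the tags."""
    parts = re.split(r"(<glossary>.*?</glossary>)", text)
    return "".join(p if p.startswith("<glossary>") else p.upper() for p in parts)


glossary_en = {"短剧出海": "short-drama export", "短剧": "short drama"}
tagged = protect_terms("短剧出海 is growing, and 短剧 too", glossary_en)
print(restore_terms(fake_translate(tagged), glossary_en))
# short-drama export IS GROWING, AND short drama TOO
```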
3.5 FFmpeg Compositing (src/compose.py)
python
import subprocess
from pathlib import Path


def compose_video(video_path, audio_path, translated_segments,
                  target_lang, output_dir):
    """Build the localized video: replace the audio track and burn in subtitles."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    video_name = Path(video_path).stem
    output_path = output_dir / f"{video_name}_{target_lang}.mp4"
    # Write the SRT subtitle file (UTF-8 so Japanese and Spanish text survives)
    srt_path = output_dir / f"{video_name}_{target_lang}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(translated_segments, 1):
            f.write(f"{i}\n")
            f.write(f"{_format_time(seg.start)} --> {_format_time(seg.end)}\n")
            f.write(f"{seg.translated}\n\n")
    # FFmpeg: video from input 0, dubbed audio from input 1, subtitles burned in.
    # NOTE: the subtitles filter argument needs escaping if srt_path contains ':' or quotes.
    cmd = [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-vf", f"subtitles={srt_path}:force_style='FontSize=18,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,BackColour=&H80000000'",
        "-c:v", "libx264", "-preset", "medium", "-crf", "23",
        "-c:a", "aac", "-b:a", "128k",
        "-map", "0:v:0", "-map", "1:a:0",
        str(output_path)
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    return str(output_path)


def _format_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Round to whole milliseconds first to avoid float artifacts (1.001 s -> ,001)
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
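The SRT output is easy to verify on its own before involving FFmpeg. The following sketch renders two cues from plain tuples; `srt_timestamp` and `render_srt` are illustrative names, and this version rounds to whole milliseconds before splitting with `divmod`, which is one reasonable way to do the timestamp arithmetic rather than the only one.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, rounding to the nearest millisecond."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def render_srt(cues) -> str:
    """Render (start_seconds, end_seconds, text) tuples as SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, 1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


print(render_srt([
    (0.0, 2.5, "Hello everyone"),
    (3661.25, 3663.0, "See you next time"),
]))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello everyone
#
# 2
# 01:01:01,250 --> 01:01:03,000
# See you next time
```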
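4. Docker Deployment
4.1 Dockerfile
dockerfile
# NOTE: faster-whisper's GPU path also needs cuDNN. If this plain runtime image
# lacks it, switch to a matching cudnn runtime tag of nvidia/cuda.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
    ffmpeg python3.11 python3-pip \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
# Install with the same interpreter the ENTRYPOINT uses (on Ubuntu 22.04,
# `python3` is 3.10, not the python3.11 package installed above)
RUN python3 -m pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
COPY config.yaml .
ENTRYPOINT ["python3", "src/main.py"]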
4.2 docker-compose.yml
yaml
version: "3.8"
services:
  pipeline:
    build: .
    volumes:
      - ./data/videos:/data/videos
      - ./data/output:/data/output
      - ./config.yaml:/app/config.yaml
      - ./data/glossary.yaml:/data/glossary.yaml
    environment:
      - DEEPL_API_KEY=${DEEPL_API_KEY}
      - ELEVENLABS_API_KEY=${ELEVENLABS_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DINGTALK_WEBHOOK=${DINGTALK_WEBHOOK}
      - YOUTUBE_CLIENT_SECRET=${YOUTUBE_CLIENT_SECRET}  # set by the workflow; must be passed through to reach the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: "no"  # run once and exit; not a long-lived service
5. CI/CD Integration (GitHub Actions)
5.1 .github/workflows/pipeline.yml
yaml
name: Content Globalization Pipeline
on:
  push:
    paths:
      - "data/videos/**.mp4"  # trigger when a video file is pushed
  workflow_dispatch:  # manual trigger
    inputs:
      video_file:
        # NOTE: main.py as listed always processes every file in video_dir
        # and does not read this input yet.
        description: "Video file name (leave empty to process all)"
        required: false
permissions:
  contents: write  # lets the "Commit output files" step push
jobs:
  translate-and-distribute:
    runs-on: [self-hosted, gpu]  # Whisper transcription needs a GPU
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build pipeline image
        run: docker compose build
      - name: Run pipeline
        env:
          DEEPL_API_KEY: ${{ secrets.DEEPL_API_KEY }}
          ELEVENLABS_API_KEY: ${{ secrets.ELEVENLABS_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DINGTALK_WEBHOOK: ${{ secrets.DINGTALK_WEBHOOK }}
          YOUTUBE_CLIENT_SECRET: ${{ secrets.YOUTUBE_CLIENT_SECRET }}
        run: docker compose run --rm pipeline
      - name: Commit output files
        if: success()
        run: |
          git config user.name "pipeline-bot"
          git config user.email "bot@cutrix.cc"
          # data/output/ already includes the .srt files. Rendered videos can be
          # large, so consider Git LFS or object storage for them.
          git add data/output/
          git commit -m "auto: content globalization run finished $(date +%Y-%m-%d_%H:%M)" || true
          git push
      - name: Notify failure
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const runId = context.runId;
            const msg = `❌ Pipeline failed: ${context.repo.owner}/${context.repo.repo}\nRun: https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${runId}`;
            // DingTalk notification is built into main.py; this step is a fallback
            // that surfaces the failure as an error annotation on the run.
            core.error(msg);
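5.2 Glossary Change Check (Git pre-commit hook)
bash
#!/bin/bash
# .git/hooks/pre-commit: when the glossary changes, prompt for a translation-consistency check
if git diff --cached --name-only | grep -q 'glossary\.yaml$'; then
    echo "⚠️ The glossary has changed. Please confirm:"
    echo "  1. Do previously translated videos need to be reprocessed?"
    echo "  2. Have the new terms been made consistent across every language version?"
    echo ""
    echo "To skip this check: git commit --no-verify"
    exit 1
fi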
6. Monitoring and Alerting
6.1 Pipeline Health Metrics
python
# src/monitor.py: record key metrics for each pipeline run
import json
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class PipelineMetrics:
    video_name: str
    duration_seconds: float         # wall-clock processing time
    source_duration_minutes: float  # length of the source video
    target_langs: list[str]
    asr_wer: float                  # word error rate (requires a reference transcript)
    translation_time: float
    dubbing_time: float
    upload_success: bool
    timestamp: str = ""

    def __post_init__(self):
        # Stamp the record at creation time unless the caller supplied a value
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()

    def log(self, path="/data/metrics.jsonl"):
        """Append this record to the metrics log (JSON Lines)."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(self), ensure_ascii=False) + "\n")

    @property
    def realtime_factor(self) -> float:
        """Real-time factor: processing time / video duration; < 1 means faster than real time."""
        return self.duration_seconds / (self.source_duration_minutes * 60)
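The `asr_wer` field has to come from somewhere: word error rate can only be computed against a human-checked reference transcript for a sample of videos. A minimal sketch of the standard calculation (token-level edit distance divided by the reference length) follows; the function name `word_error_rate` is illustrative. For Chinese source audio, passing `list(text)` gives a character-level rate, which is an assumption about how one might tokenize rather than something the article specifies.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Levenshtein distance over tokens divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    # previous[j] holds the distance between the reference prefix processed
    # so far and hypothesis[:j]; it starts as the row for an empty reference.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, 1):
        current = [i]  # deleting i reference tokens matches an empty hypothesis
        for j, hyp_token in enumerate(hypothesis, 1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words
print(word_error_rate("the cat sat on the mat".split(),
                      "the cat sit on mat".split()))

# Character-level rate for Chinese: one wrong character out of six
print(word_error_rate(list("今天天气很好"), list("今天天汽很好")))
```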
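6.2 Alert Rules
| Metric | Alert threshold | Action |
|---|---|---|
| Processing time per video | > 3 × video duration | Check the Whisper configuration and GPU status |
| ASR word error rate | > 15% | Check audio quality; denoising preprocessing may be needed |
| API call timeouts | 3 in a row | Switch to the backup translation/TTS engine |
| Upload failure rate | > 10% | Check the YouTube API quota |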
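The thresholds in the table translate directly into a small rule check. The sketch below is one possible encoding, with `check_alerts` and its parameter names chosen for illustration; how the inputs are collected (for example from `metrics.jsonl`) is left open.

```python
def check_alerts(processing_seconds, video_seconds, asr_wer,
                 consecutive_timeouts, uploads_failed, uploads_total):
    """Return the alert messages triggered by one set of measurements."""
    alerts = []
    if processing_seconds > video_seconds * 3:
        alerts.append("slow processing: check Whisper config / GPU status")
    if asr_wer > 0.15:
        alerts.append("high WER: check audio quality, consider denoising")
    if consecutive_timeouts >= 3:
        alerts.append("API timeouts: switch to backup translation/TTS engine")
    # Guard against division by zero when nothing was uploaded
    if uploads_total and uploads_failed / uploads_total > 0.10:
        alerts.append("upload failures: check YouTube API quota")
    return alerts


# A 60 s video that took 200 s, with 20% WER, 3 timeouts, and 2 of 10 uploads failed
for alert in check_alerts(200, 60, 0.20, 3, 2, 10):
    print(alert)
```

Values sitting exactly on a threshold (for instance a WER of exactly 15%) do not trigger, matching the strict ">" in the table.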
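7. Common Problems and Fixes
7.1 Low Whisper accuracy on short dramas and livestreams
Short dramas often have background music, overlapping speakers, and regional accents. The fix is to add a vocal-separation pass (UVR5 or Demucs) ahead of ASR and feed only the isolated vocals to Whisper.
python
# Vocal-separation preprocessing
import subprocess
from pathlib import Path


def separate_vocals(video_path: str, out_dir: str = "/tmp/demucs_output",
                    model: str = "htdemucs") -> str:
    """Isolate the vocal stem with Demucs and return the path to vocals.wav."""
    subprocess.run([
        "demucs", "--two-stems=vocals",
        "-n", model,  # name the model explicitly so the output folder is predictable
        "-o", out_dir,
        video_path
    ], check=True)
    # Demucs writes its stems to <out_dir>/<model>/<track name>/vocals.wav
    return str(Path(out_dir) / model / Path(video_path).stem / "vocals.wav")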
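7.2 Translation quality drops on long videos (> 30 minutes)
When translating long texts, an LLM tends to "forget" earlier passages, which leads to inconsistent terminology. The fix: when translating sentence by sentence, maintain a context window (the 3 preceding sentences plus the 2 following ones) and pass that condensed context to the translation engine so it can interpret the current sentence in context.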
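The context window described above (3 sentences before, 2 after) can be sketched as a slicing helper plus a prompt builder. Both `context_window` and `build_prompt` are hypothetical names, and the prompt wording is only an illustration of how the context might be handed to an LLM-based engine.

```python
def context_window(sentences, index, before=3, after=2):
    """Return (preceding, current, following) for the sentence at index."""
    start = max(0, index - before)  # clamp at the beginning of the transcript
    preceding = sentences[start:index]
    following = sentences[index + 1:index + 1 + after]
    return preceding, sentences[index], following


def build_prompt(sentences, index, target_lang):
    """Assemble a translation prompt; the wording here is illustrative only."""
    preceding, current, following = context_window(sentences, index)
    return "\n".join([
        f"Translate the CURRENT line into {target_lang}.",
        "Use the context only to resolve references and keep terminology consistent.",
        "PREVIOUS: " + " / ".join(preceding),
        "CURRENT: " + current,
        "NEXT: " + " / ".join(following),
    ])


lines = [f"sentence {i}" for i in range(8)]
print(build_prompt(lines, 4, "en"))
```

Only the CURRENT line is translated on each call, so the output stays aligned one-to-one with the ASR segments and their timestamps.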
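7.3 Running out of YouTube API quota
The YouTube Data API v3 allows 10,000 quota units per day by default, and uploading one video costs about 1,600 units (including the metadata update), so the default quota covers roughly six uploads a day. Options:
- Apply to Google for a quota increase for the API project (reportedly up to 1,000,000 units). Note that uploads to a regular channel authenticate with OAuth user credentials; a service account is not sufficient for this.
- For non-urgent content, upload with privacy_status: private first, then switch the videos to public in bulk once the quota resets.
- For high-volume use, integrate with YouTube Studio's Content ID bulk upload.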
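The quota figures above imply a hard ceiling on daily throughput. A quick calculation, using only the numbers quoted in this section (10,000 units per day, about 1,600 units per upload, three target languages), makes that explicit; the function names are illustrative.

```python
def uploads_per_day(daily_quota=10_000, cost_per_upload=1_600):
    """Whole uploads that fit into one day's quota."""
    return daily_quota // cost_per_upload


def source_videos_per_day(target_langs=3, daily_quota=10_000, cost_per_upload=1_600):
    """Source videos that can be fully published when each needs one upload per language."""
    return uploads_per_day(daily_quota, cost_per_upload) // target_langs


print(uploads_per_day())        # 6
print(source_videos_per_day())  # 2
```

With the default quota, the three-language pipeline can publish two source videos a day, which is why the quota options listed above matter in practice.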
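8. Summary
The pipeline built in this article automates the whole flow: push a video file to the Git repository → multilingual translation and dubbing run automatically → distribution to YouTube → DingTalk notification.
Key design decisions:
| Decision | Choice | Rationale |
|---|---|---|
| ASR engine | faster-whisper large-v3 | CTranslate2 inference, up to 4x faster than the original Whisper |
| Translation engine (primary) | DeepL API | Best quality for Chinese → Japanese/Spanish |
| Translation engine (fallback) | GPT-4o | Handles colloquial speech and internet slang that DeepL handles poorly |
| TTS engine | ElevenLabs | Best emotional fidelity |
| Distribution | Direct upload via the YouTube API | Fewer intermediate steps, 95%+ success rate |
| Runtime | Docker + self-hosted GPU runner | Lower cloud GPU cost |
If your team has no GPU server, or would rather not maintain the whole pipeline, the Cutrix API can replace the four-stage Whisper → DeepL → ElevenLabs → FFmpeg chain: a single API call handles translation, dubbing, and subtitle output, cutting the pipeline code by more than 70%.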