Video to Audio, Audio to Text

Ubuntu 24

Environment setup

# System-level dependencies
sudo apt update && sudo apt install -y ffmpeg python3-venv git build-essential python3-dev

# Python virtual environment
python3 -m venv ~/ai_summary
source ~/ai_summary/bin/activate

Core toolchain

Tool	Purpose	Install command
Whisper	Speech recognition	`pip install openai-whisper -i https://pypi.tuna.tsinghua.edu.cn/simple`
FFmpeg	Audio/video processing	`sudo apt install -y ffmpeg`

1. Audio extraction

ffmpeg -i 视频.mp4 -vn -ar 16000 -ac 1 -b:a 192k 音频.mp3
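
When the extraction step has to run over many files, it can help to wrap the command in a script. The following is a minimal Python sketch rather than part of the original workflow: the helper names `build_extract_cmd` and `extract_all` and the one-folder layout are assumptions, and the flags simply mirror the command above.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(video: str, audio: str, sample_rate: int = 16000) -> list[str]:
    """Build the ffmpeg argument list used above (hypothetical helper)."""
    return [
        "ffmpeg",
        "-i", video,              # input video
        "-vn",                    # drop the video stream
        "-ar", str(sample_rate),  # resample to 16 kHz
        "-ac", "1",               # downmix to mono
        "-b:a", "192k",           # audio bitrate, as in the command above
        audio,
    ]


def extract_all(folder: str) -> None:
    """Run the extraction for every .mp4 in a folder (assumed layout)."""
    for video in sorted(Path(folder).glob("*.mp4")):
        cmd = build_extract_cmd(str(video), str(video.with_suffix(".mp3")))
        subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII file names such as 视频.mp4.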

2. Speech transcription (optimized for Chinese)

whisper 音频.mp3 --model tiny --language zh --threads 4 --output_format txt --output_dir transcripts

# Model comparison (approximate memory needs, lowest to highest)
# tiny(1GB) < base(1.2GB) < small(2GB) < medium(5GB) < large(10GB)
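
The comparison above can be turned into a small selection helper. This is a minimal sketch, assuming the memory figures quoted above are accurate enough for a rough choice; `MODEL_MEMORY_GB`, `pick_model`, and the 0.5 GB headroom are illustrative names and values, not part of Whisper.

```python
# Approximate memory needs in GB, copied from the comparison above.
MODEL_MEMORY_GB = {"tiny": 1.0, "base": 1.2, "small": 2.0, "medium": 5.0, "large": 10.0}


def pick_model(available_gb: float, headroom_gb: float = 0.5) -> str:
    """Return the largest model whose estimate fits, falling back to tiny."""
    best = "tiny"
    # Dicts preserve insertion order, so this walks from smallest to largest.
    for name, need_gb in MODEL_MEMORY_GB.items():
        if need_gb + headroom_gb <= available_gb:
            best = name
    return best
```

With 4 GB available, `pick_model(4.0)` selects `small`, because `medium` plus headroom would need 5.5 GB.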

Handling out-of-memory errors in Whisper

Solutions (in priority order)

Option 1: Use a smaller model

# Pick the model with the lowest memory footprint
whisper 教学音频.mp3 \
  --model tiny \
  --language zh \
  --device cpu \
  --threads 2

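Option 2: Memory-related settings

# 1. Force CPU mode (avoids using GPU memory; --fp16 False silences the FP16-on-CPU warning)
whisper 教学音频.mp3 --model tiny --device cpu --fp16 False

# 2. Environment variables sometimes suggested for lower memory use.
#    Note: openai-whisper does not read HF_DATASETS_IN_MEMORY_MAX_SIZE (a Hugging Face
#    datasets setting), and PYTORCH_NO_CUDA_MEMORY_CACHING only disables PyTorch's CUDA
#    caching allocator, so neither enables memory-mapped model loading on CPU.
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 \
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
whisper 教学音频.mp3 --model tiny

# 3. Limit the number of threads
export OMP_NUM_THREADS=2  # caps parallel compute threads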

Option 3: System-level tuning

# Create a swap file (temporarily adds virtual memory)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify the swap space
free -h
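
To check from a script whether the added swap leaves enough room for a given model, the same numbers that `free -h` prints can be read from `/proc/meminfo`. The sketch below is an illustration under assumptions: the helper names `parse_meminfo` and `usable_gb` are hypothetical, and adding free swap to available RAM is only a rough upper bound, since swap is far slower than RAM.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def usable_gb(info: dict[str, int]) -> float:
    """Available RAM plus free swap, converted from kB to GB."""
    kb = info.get("MemAvailable", 0) + info.get("SwapFree", 0)
    return kb / (1024 ** 2)
```

On Linux, the input would come from reading `/proc/meminfo` as text and passing it to `parse_meminfo`.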

Option 4: Process long audio in chunks

# Split the audio into 10-minute segments
ffmpeg -i 教学音频.mp3 -f segment -segment_time 600 -c copy part_%03d.mp3

# Transcribe the segments one by one
for file in part_*.mp3; do
  whisper "$file" --model tiny --language zh --output_dir transcripts
done
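
Each chunk is transcribed on its own, so any timestamps in its output restart from zero. To map them back to the original recording, the chunk index in the file name can be converted into a start offset. This is a minimal sketch with hypothetical helper names (`chunk_offset`, `to_hms`); it assumes the `part_%03d.mp3` naming and 600-second segment length used above, with numbering starting at 0, and treats the offsets as approximate because `-c copy` cuts on frame boundaries.

```python
import re


def chunk_offset(filename: str, segment_seconds: int = 600) -> int:
    """Start time (in seconds) of a chunk named like part_003.mp3."""
    match = re.search(r"_(\d+)\.\w+$", filename)
    if match is None:
        raise ValueError(f"no chunk index in {filename!r}")
    return int(match.group(1)) * segment_seconds


def to_hms(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
```

For example, a phrase heard 75 seconds into `part_003.mp3` sits at `to_hms(chunk_offset("part_003.mp3") + 75)` in the full recording.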

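Option 5: Use an optimized implementation

# Build whisper.cpp, a lower-memory C/C++ port of Whisper
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Download a model already converted to ggml format
./models/download-ggml-model.sh tiny

# whisper.cpp has traditionally expected 16 kHz 16-bit WAV input, so convert first
ffmpeg -i 教学音频.mp3 -ar 16000 -ac 1 -c:a pcm_s16le 教学音频.wav

# Run inference (recent releases build with CMake and name the binary build/bin/whisper-cli)
./main -m models/ggml-tiny.bin -l zh -f 教学音频.wav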

Verification

# Monitor memory usage
watch -n 1 "free -h | grep Mem"

# Smallest end-to-end test case
whisper --model tiny --language zh --output_format txt test.wav

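Alternatives

If a large model is required:

Upgrade server memory to at least 8 GB (enough for medium by the figures above; large needs roughly 10 GB)
Use a cloud API (the official OpenAI API is recommended)

from openai import OpenAI  # requires openai>=1.0; reads OPENAI_API_KEY from the environment

client = OpenAI()
with open("教学音频.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(transcript.text)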

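Technical notes

Optimization	Memory savings (rough estimate)	Use case
tiny model	~80%	Quick summaries
CPU mode	~30%	No GPU available
Chunked processing	~70%	Very long audio (>1 hour)
Memory mapping	~50%	Linux systems

Start with Option 1 (smaller model) combined with Option 3 (swap file): together they cut memory requirements the most while keeping the pipeline usable.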

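Converting Traditional Chinese to Simplified

# Install the lightweight conversion library first (shell): pip install zhconv

# Add a conversion step to the existing pipeline
from zhconv import convert

def traditional_to_simple(text):
    return convert(text, 'zh-cn')  # Mainland China Simplified

# Read the transcript as UTF-8 and convert it
with open("transcripts/教学音频.txt", "r", encoding="utf-8") as f:
    content = traditional_to_simple(f.read())

# Write the converted text back so later steps use the Simplified version
with open("transcripts/教学音频.txt", "w", encoding="utf-8") as f:
    f.write(content)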

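Core parameters supported by Whisper

The whisper CLI defines only long option names for the parameters below; it has no short aliases for them.

Parameter	Default	Description
`--temperature`	`0`	Sampling temperature (0 is deterministic; higher values add randomness)
`--best_of`	`5`	Number of candidates sampled when the temperature is above 0
`--beam_size`	`5`	Beam width for beam search (used when the temperature is 0)
`--patience`	`None`	Beam search patience; `None` is equivalent to 1.0 (conventional beam search)
`--length_penalty`	`None`	Token length penalty coefficient (alpha); `None` uses simple length normalization
`--suppress_tokens`	`-1`	Comma-separated token IDs to suppress; `-1` suppresses most special characters except common punctuation
`--initial_prompt`	`None`	Optional text supplied as a prompt for the first window
`--condition_on_previous_text`	`True`	Whether to feed the previous output as the prompt for the next window
`--fp16`	`True`	Run inference in FP16 (GPU only; falls back to FP32 on CPU)
`--temperature_increment_on_fallback`	`0.2`	Temperature increase applied when decoding fails the thresholds below
`--compression_ratio_threshold`	`2.4`	Treat decoding as failed if the gzip compression ratio is higher than this (repetitive output)
`--logprob_threshold`	`-1.0`	Treat decoding as failed if the average log probability is lower than this
`--no_speech_threshold`	`0.6`	Treat a segment as silence if the no-speech probability is higher than this and decoding failed the `--logprob_threshold` check
`--word_timestamps`	`False`	Extract word-level timestamps
`--prepend_punctuations`	`"'“¿([{-`	With `--word_timestamps True`, punctuation merged with the following word
`--append_punctuations`	`"'.。,，!！?？:：”)]}、`	With `--word_timestamps True`, punctuation merged with the preceding word
`--highlight_words`	`False`	Underline each word as it is spoken in SRT/VTT output (requires `--word_timestamps True`)
`--max_line_width`	`None`	Maximum characters per line before breaking (requires `--word_timestamps True`)
`--max_line_count`	`None`	Maximum lines per segment (requires `--word_timestamps True`)
`--max_words_per_line`	`None`	Maximum words per line (requires `--word_timestamps True`; no effect with `--max_line_width`)
`--threads`	`0`	Number of CPU threads used by torch (0 keeps the default)
`--clip_timestamps`	`0`	Comma-separated `start,end,start,end,...` timestamps in seconds of the clips to process
`--hallucination_silence_threshold`	`None`	With `--word_timestamps True`, skip silent periods longer than this many seconds when a possible hallucination is detected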
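
The interaction between `--temperature`, `--temperature_increment_on_fallback`, and the two quality thresholds is easier to see in code. The following is a simplified sketch of the fallback behaviour as I understand it, not Whisper's actual implementation: the function names are hypothetical, and the silence exception tied to `--no_speech_threshold` is left out.

```python
def temperature_schedule(start: float = 0.0, increment: float = 0.2) -> list[float]:
    """Temperatures tried in order: start, start + increment, ... up to 1.0."""
    temps = []
    t = start
    while t <= 1.0 + 1e-6:
        temps.append(round(t, 6))  # rounding hides float accumulation error
        t += increment
    return temps


def needs_fallback(compression_ratio: float, avg_logprob: float,
                   compression_ratio_threshold: float = 2.4,
                   logprob_threshold: float = -1.0) -> bool:
    """Retry at the next temperature if the text is too repetitive or too unlikely."""
    too_repetitive = compression_ratio > compression_ratio_threshold
    too_unlikely = avg_logprob < logprob_threshold
    return too_repetitive or too_unlikely
```

Under these assumptions, raising the increment to 0.4 shortens the schedule to three attempts, so a noisy segment is retried fewer times but with larger jumps in randomness.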

1. Split with FFmpeg

# Split the audio into 300-second segments
ffmpeg -i 长讲座.mp3 -f segment -segment_time 300 -c copy 分片_%03d.mp3

2. Transcribe the segments in a batch

# Transcribe every segment with Whisper
for file in 分片_*.mp3; do
    whisper "$file" --model large-v3 --language zh --output_dir transcripts
done

3. Merge the transcripts

# Concatenate the transcripts of all segments (the glob expands in name order)
cat transcripts/*.txt > 完整转录.txt
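
The merge step can also be done in Python, which makes the ordering explicit and keeps the encoding fixed to UTF-8. This is a minimal sketch; `merge_transcripts` is a hypothetical helper, and it assumes the zero-padded segment names produced above so that sorting by name matches playback order.

```python
from pathlib import Path


def merge_transcripts(folder: str, output: str) -> int:
    """Concatenate all .txt files in name order; return how many were merged."""
    parts = sorted(Path(folder).glob("*.txt"))
    with open(output, "w", encoding="utf-8") as out:
        for part in parts:
            # Normalize trailing whitespace so each segment ends with one newline.
            out.write(part.read_text(encoding="utf-8").strip() + "\n")
    return len(parts)
```

Writing the merged file outside the `transcripts` folder, as the `cat` command above does, keeps it from being picked up as an input on a second run.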

Example commands

1. High-accuracy transcription

whisper 教学音频.mp3 --model large-v3 --language zh --beam_size 5 --best_of 5

2. Word-level timestamps

whisper 会议录音.mp3 --model medium --word_timestamps True --output_format vtt

3. Tuning for low-quality audio

whisper 低质量音频.mp3 --model small --temperature_increment_on_fallback 0.4 --compression_ratio_threshold 2.8

4. Custom punctuation handling

# Single quotes keep the shell from interpreting ", ( and !; '\'' inserts a literal single quote
whisper 音频.mp3 --model base --word_timestamps True --prepend_punctuations '"'\''“¿([{-' --append_punctuations '"'\''.。,，!！?？:：”)]}、'

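Recommended parameter combinations

Scenario	Recommended parameters
Fast, near-real-time transcription	`--model tiny --temperature 0 --threads 2`
High-accuracy transcription	`--model large-v3 --beam_size 5 --best_of 5`
Low-quality audio	`--model small --temperature_increment_on_fallback 0.4 --compression_ratio_threshold 2.8`
Word-level timestamps	`--model medium --word_timestamps True --output_format vtt`
Long audio	Split with FFmpeg, then transcribe the segments in a batch

Notes

Hardware: large-v3 needs roughly 10 GB of VRAM; an NVIDIA RTX 30-series GPU or newer is recommended.
Language support: the tiny and base models have limited accuracy on non-English speech, so use at least the small model for Chinese.
Accuracy trade-off: the small model is good enough for most scenarios; there is no need to reach for a larger model by default.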