Speech recognition with the whisper filter in FFmpeg 8.0

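FFmpeg 8.0 was released in August 2025. This release integrates whisper.cpp as an audio filter named whisper. The filter is only present in builds configured with --enable-whisper; many recent prebuilt binaries include it, but not all do.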
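It can be worth confirming that the local binary really exposes the filter before trying the examples. The following is a minimal sketch, assuming a POSIX shell; the helper name `has_whisper_filter` is made up for illustration, and it assumes the filter name is the second column of the `ffmpeg -filters` listing, which it reads from standard input:

```shell
# Hypothetical helper: succeeds if a filter named "whisper" appears
# in the output of `ffmpeg -filters` supplied on stdin.
has_whisper_filter() {
    # Assumption: each filter row looks like "<flags> <name> <io> <description>",
    # so the filter name is the second whitespace-separated field.
    awk '$2 == "whisper" { found = 1 } END { exit !found }'
}

# Typical use (requires ffmpeg on PATH):
#   ffmpeg -hide_banner -filters | has_whisper_filter && echo "whisper filter available"
```

Matching on the exact column value rather than a plain substring avoids false positives from descriptions that merely mention whisper.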

The filter accepts the options below. Options are separated by : and values are assigned with =, for example :vad_threshold=0.3

model: The file path of the downloaded whisper.cpp model (mandatory).
language: The language to use for transcription ('auto' for auto-detect). Default value: "auto"
queue: The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
use_gpu: If the GPU support should be enabled. Default value: "true"
gpu_device: The GPU device index to use. Default value: "0"
destination: If set, the transcription output will be sent to the specified file or URL (use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages. The output will also be set in the "lavfi.whisper.text" frame metadata. If the destination is a file and it already exists, it will be overwritten.
format: The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json". Default value: "text"
vad_model: Path to the VAD model file. If set, the filter will load an additional voice activity detection module (https://github.com/snakers4/silero-vad) that will be used to fragment the audio queue; use this option setting a valid path obtained from the whisper.cpp repository (e.g. "../whisper.cpp/models/ggml-silero-v5.1.2.bin") and increase the queue parameter to a higher value (e.g. 20).
vad_threshold: The VAD threshold to use. Default value: "0.5"
vad_min_speech_duration: The minimum VAD speaking duration. Default value: "0.1"
vad_min_silence_duration: The minimum VAD silence duration. Default value: "0.5"
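The key=value syntax with : separators can be assembled mechanically, which helps when the option list gets long. Below is a minimal sketch, assuming a POSIX shell; `whisper_filter` is a hypothetical helper name, and it does no escaping, so a value that itself contains a : (such as a Windows drive letter) would still need to be escaped by hand:

```shell
# Hypothetical helper: join key=value options into a whisper filter string.
# Usage: whisper_filter model=PATH [key=value ...]
# The body runs in a subshell, so the IFS change does not leak to the caller.
whisper_filter() (
    IFS=':'
    # "$*" joins all arguments with the first character of IFS, here ":".
    printf 'whisper=%s\n' "$*"
)

# Example (requires ffmpeg and the model files; paths are placeholders):
#   af=$(whisper_filter model=./models/ggml-medium.bin queue=20 \
#        vad_model=./models/ggml-silero-v5.1.2.bin destination=output.srt format=srt)
#   ffmpeg -i input.mp4 -vn -af "$af" -f null -
```

The example pairs vad_model with a larger queue value, as the option descriptions above suggest.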

Example usage:
ffmpeg -i input.mp4 -vn -af "whisper=model=../whisper.cpp/models/ggml-base.en.bin\
:language=en\
:queue=3\
:destination=output.srt\
:format=srt" -f null -

Filter documentation for FFmpeg 8.0 (a versioned mirror of the official filter docs): https://ayosec.github.io/ffmpeg-filters-docs/8.0/Filters/Audio/whisper.html

Another example, this time with auto language detection and VAD enabled:

ffmpeg -i H:\a.mp4 -vn -af "whisper=model=./models/ggml-medium.bin:language=auto:queue=3:destination=./output.srt:format=srt:vad_model=./models/ggml-silero-v5.1.2.bin:vad_threshold=0.3" -f null -

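In my tests, however, with ggml-medium.bin used in both cases, the filter's recognition quality was worse than a two-step approach: first extract the audio to an mp3 file with ffmpeg, then generate the subtitle file with whisper.cpp's whisper-cli.exe. The steps are: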

ffmpeg -i /path/to/video.mp4 -vn -af aresample=async=1 -ar 16000 -ac 1 -c:a libmp3lame -loglevel fatal /path/to/audio.mp3

./whisper-cli.exe -l auto -osrt --vad --vad-threshold 0.3 --vad-model .\models\ggml-silero-v5.1.2.bin -m .\models\ggml-medium.bin H:\a.mp3
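The two steps can be wrapped into a single function so that each video needs only one call. This is a rough sketch, assuming a POSIX shell and an ffmpeg build that includes the libmp3lame encoder; the function name `transcribe_two_step` and the `FFMPEG` / `WHISPER_CLI` variables are hypothetical, introduced only so the commands can be swapped out for a dry run:

```shell
# Hypothetical wrapper around the extract-then-transcribe workflow.
# FFMPEG and WHISPER_CLI may be overridden, e.g. to point at other binaries.
FFMPEG=${FFMPEG:-ffmpeg}
WHISPER_CLI=${WHISPER_CLI:-./whisper-cli.exe}

# Usage: transcribe_two_step VIDEO MODEL VAD_MODEL
transcribe_two_step() {
    video=$1
    model=$2
    vad_model=$3
    audio="${video%.*}.mp3"   # video path with its extension replaced by .mp3

    # Step 1: extract 16 kHz mono audio as mp3 (assumes libmp3lame is available).
    "$FFMPEG" -i "$video" -vn -af aresample=async=1 -ar 16000 -ac 1 \
        -c:a libmp3lame -loglevel fatal "$audio" || return 1

    # Step 2: transcribe to SRT with VAD enabled, same flags as the command above.
    "$WHISPER_CLI" -l auto -osrt --vad --vad-threshold 0.3 \
        --vad-model "$vad_model" -m "$model" "$audio"
}

# Example (paths are placeholders):
#   transcribe_two_step ./a.mp4 ./models/ggml-medium.bin ./models/ggml-silero-v5.1.2.bin
```

Returning early when the extraction fails keeps whisper-cli from running against a missing or partial audio file.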

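I recommend mp3 as the intermediate format: in my tests, text generated from mp3 input included punctuation, while text generated from wav input did not.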
