Netherworld Continent (69): Whisper-CLI in the Eastern Fairy Alliance, Qi-Refining Stage

Full command reference

supported audio formats: flac, mp3, ogg, wav

options:
  -h,        --help                 [default] show this help message and exit
  -t N,      --threads N            [4      ] number of threads to use during computation
  -p N,      --processors N         [1      ] number of processors to use during computation
  -ot N,     --offset-t N           [0      ] time offset in milliseconds
  -on N,     --offset-n N           [0      ] segment index offset
  -d  N,     --duration N           [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N        [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N            [0      ] maximum segment length in characters
  -sow,      --split-on-word        [false  ] split on word rather than on token
  -bo N,     --best-of N            [5      ] number of best candidates to keep
  -bs N,     --beam-size N          [5      ] beam size for beam search
  -ac N,     --audio-ctx N          [0      ] audio context size (0 - all)
  -wt N,     --word-thold N         [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N      [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N      [-1.00  ] log probability threshold for decoder fail
  -nth N,    --no-speech-thold N    [0.60   ] no speech threshold
  -tp,       --temperature N        [0.00   ] The sampling temperature, between 0 and 1
  -tpi,      --temperature-inc N    [0.20   ] The increment of temperature, between 0 and 1
  -debug,    --debug-mode           [false  ] enable debug mode (eg. dump log_mel)
  -tr,       --translate            [false  ] translate from source language to english
  -di,       --diarize              [false  ] stereo audio diarization
  -tdrz,     --tinydiarize          [false  ] enable tinydiarize (requires a tdrz model)
  -nf,       --no-fallback          [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt           [false  ] output result in a text file
  -ovtt,     --output-vtt           [false  ] output result in a vtt file
  -osrt,     --output-srt           [false  ] output result in a srt file
  -olrc,     --output-lrc           [false  ] output result in a lrc file
  -owts,     --output-words         [false  ] output script for generating karaoke video
  -fp,       --font-path            [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
  -ocsv,     --output-csv           [false  ] output result in a CSV file
  -oj,       --output-json          [false  ] output result in a JSON file
  -ojf,      --output-json-full     [false  ] include more information in the JSON file
  -of FNAME, --output-file FNAME    [       ] output file path (without file extension)
  -np,       --no-prints            [false  ] do not print anything other than the results
  -ps,       --print-special        [false  ] print special tokens
  -pc,       --print-colors         [false  ] print colors
             --print-confidence     [false  ] print confidence
  -pp,       --print-progress       [false  ] print progress
  -nt,       --no-timestamps        [false  ] do not print timestamps
  -l LANG,   --language LANG        [en     ] spoken language ('auto' for auto-detect)
  -dl,       --detect-language      [false  ] exit after automatically detecting language
             --prompt PROMPT        [       ] initial prompt (max n_text_ctx/2 tokens)
             --carry-initial-prompt [false  ] always prepend initial prompt
  -m FNAME,  --model FNAME          [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME           [       ] input audio file path
  -oved D,   --ov-e-device DNAME    [CPU    ] the OpenVINO device used for encode inference
  -dtw MODEL --dtw MODEL            [       ] compute token-level timestamps
  -ls,       --log-score            [false  ] log best decoder scores of tokens
  -ng,       --no-gpu               [false  ] disable GPU
  -fa,       --flash-attn           [true   ] enable flash attention
  -nfa,      --no-flash-attn        [false  ] disable flash attention
  -sns,      --suppress-nst         [false  ] suppress non-speech tokens
  --suppress-regex REGEX            [       ] regular expression matching tokens to suppress
  --grammar GRAMMAR                 [       ] GBNF grammar to guide decoding
  --grammar-rule RULE               [       ] top-level GBNF grammar rule name
  --grammar-penalty N               [100.0  ] scales down logits of nongrammar tokens

Voice Activity Detection (VAD) options:
             --vad                           [false  ] enable Voice Activity Detection (VAD)
  -vm FNAME, --vad-model FNAME               [       ] VAD model path
  -vt N,     --vad-threshold N               [0.50   ] VAD threshold for speech recognition
  -vspd N,   --vad-min-speech-duration-ms  N [250    ] VAD min speech duration (0.0-1.0)
  -vsd N,    --vad-min-silence-duration-ms N [100    ] VAD min silence duration (to split segments)
  -vmsd N,   --vad-max-speech-duration-s   N [FLT_MAX] VAD max speech duration (auto-split longer)
  -vp N,     --vad-speech-pad-ms           N [30     ] VAD speech padding (extend segments)
  -vo N,     --vad-samples-overlap         N [0.10   ] VAD samples overlap (seconds between segments)



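Whisper-CLI is the command-line tool shipped by the whisper.cpp project and built on OpenAI's Whisper model. It runs speech-to-text locally and, with a grammar file, can be steered toward a fixed set of commands. Beginners need no custom code: a handful of commands is enough to get speech recognition working.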

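The core logic of Whisper-CLI maps neatly onto the world of the Eastern Fairy Alliance:

Alliance Headquarters = the whisper.cpp project itself: the foundation that provides every tool and capability;
Command Elder = whisper-cli.exe: the core tool that executes commands and schedules resources;
Spell Manual = the pretrained model (e.g. ggml-tiny.bin): model size determines how strong the "spell" is; small models are light and fast, large models are more accurate and thorough;
Sect Precepts = the GBNF grammar file: constrains recognition so that only content matching the "precepts" is favored;
Secret Sound Transmission = the audio file: the speech input to be recognized;
True-Word Extraction = speech-to-text / hot-word extraction: the final recognition result.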

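I. Core concepts and components

1. Core components (the Alliance's core setup)

| Component | Alliance metaphor | Role |
| --- | --- | --- |
| `whisper-cli.exe` | Command Elder | Command-line entry point: takes parameters, drives the model, prints results |
| Pretrained model (.bin) | Spell Manual | Determines recognition quality; comes in several tiers from `tiny` (beginner's manual) to `large` (the sect's treasured canon) |
| GBNF grammar file | Sect Precepts | Constrains recognition to content that matches the rules (e.g. hotel commands, device-control commands) |
| Audio file (wav/mp3) | Sound-Transmission Jade Slip | The speech input to recognize; 16 kHz mono WAV is recommended (a jade slip with clean sound) |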

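2. Key parameters (the Elder's command tokens)

| Parameter | What it does (the token's power) | Starter example |
| --- | --- | --- |
| `-m` | Selects the spell manual (model path) | `-m D:/ai/asr/models/ggml-tiny.bin` |
| `--language` | Sets the spoken language (e.g. Chinese) | `--language zh` |
| `--grammar` | Enables the sect precepts (GBNF grammar file) | `--grammar D:/ai/asr/rule/hotel_rule.gbnf` |
| `--output-txt` | Saves the result to a text file (records the true words) | `--output-txt` |
| `--no-timestamps` | Prints plain text without timestamps (trims the true words) | `--no-timestamps` |
| Audio file path | Points at the jade slip (the audio to recognize) | `D:/ai/asr/audio/test.wav` |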
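
The parameters in the table can be assembled programmatically. The following is a minimal Python sketch, not part of whisper.cpp: `build_whisper_command` is a hypothetical helper that only builds the argument list and runs nothing. It passes the audio path through `-f`, the input flag listed in the help output above.

```python
from typing import List, Optional


def build_whisper_command(
    model: str,
    audio: str,
    language: str = "zh",
    grammar: Optional[str] = None,
    output_txt: bool = False,
    no_timestamps: bool = False,
    exe: str = "whisper-cli.exe",
) -> List[str]:
    """Assemble the argument list for one whisper-cli run (nothing is executed)."""
    cmd = [exe, "-m", model, "--language", language]
    if grammar is not None:
        cmd += ["--grammar", grammar]
    if output_txt:
        cmd.append("--output-txt")
    if no_timestamps:
        cmd.append("--no-timestamps")
    # -f / --file is the input flag shown in the help output.
    cmd += ["-f", audio]
    return cmd


# Example: print the argument list for the starter setup used in this post.
print(build_whisper_command(
    "D:/ai/asr/models/ggml-tiny.bin",
    "D:/ai/asr/audio/test.wav",
    output_txt=True,
))
```

Building a list rather than one long string keeps each path a single argument, so it can be handed to `subprocess.run` without extra quoting.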

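II. Getting started (joining the Alliance from scratch)

Step 1: Gather the "Alliance supplies" (environment and files)

Download the Alliance Headquarters: get the project archive from the official whisper.cpp repository, unpack it locally (e.g. D:/ai/asr/whisper.cpp), and locate whisper-cli.exe;
Collect a spell manual: download a pretrained model and put it in the models folder; beginners should start with ggml-tiny.bin (small and fast, well suited to a first run);
Prepare the jade slip: get the audio you want to recognize and convert it with a tool such as Audacity to 16 kHz mono WAV (so noise and format issues do not interfere with recognition);
Write the sect precepts (optional): create a GBNF file with a simple rule (e.g. `root ::= "查询酒店" | "打开房间"`) and put it in the rule folder.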
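
Before running recognition it can help to confirm that a converted file really is 16 kHz mono. The sketch below uses only Python's standard `wave` module; `check_wav` is a hypothetical helper name, and it only inspects the WAV header, it does not convert anything.

```python
import wave
from typing import List


def check_wav(path: str) -> List[str]:
    """Return a list of problems; an empty list means 16 kHz mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != 16000:
            problems.append(f"sample rate is {rate} Hz, expected 16000")
        if channels != 1:
            problems.append(f"{channels} channels, expected 1 (mono)")
    return problems
```

For example, `check_wav("D:/ai/asr/audio/test.wav")` returning an empty list means the file matches the recommended format; any returned messages say what to fix in Audacity.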

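Step 2: Issue your first command (basic speech-to-text)

Open the Command Hall: start the Windows Command Prompt (CMD) or PowerShell;
Enter the Elder's residence: change to the directory that contains whisper-cli.exe:
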
```
cd D:/ai/asr/whisper.cpp
```
Present the command token: enter the basic recognition command to run your first speech-to-text pass:

```
whisper-cli.exe -m D:/ai/asr/models/ggml-tiny.bin --language zh D:/ai/asr/audio/test.wav
```
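Read the true words: when the command finishes, the console prints the text recognized from the audio. This is the most basic form of speech recognition the tool offers.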

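Step 3: Enable the sect precepts (constrain recognition)

To make the tool recognize only specific commands (such as opening a hotel room or controlling a device), enable a GBNF grammar file:

Write a simple precept: create hotel_rule.gbnf with a minimal rule (simple enough that it is sure to parse):
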
```
root ::= "帮我打开酒店房间3105" | "客人李四的手机号1348883468" | "送一瓶可乐"
```

Add the precept parameter: include --grammar in the command:

```
whisper-cli.exe -m D:/ai/asr/models/ggml-tiny.bin --language zh --grammar D:/ai/asr/rule/hotel_rule.gbnf D:/ai/asr/audio/test.wav
```

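Check the effect: with the grammar active, decoding is steered toward the phrases in the GBNF rule, so casual filler speech is much less likely to show up in the output. This is a bias rather than a hard filter: per the help output, `--grammar-penalty` (default 100.0) scales down the logits of non-grammar tokens. If the grammar seems to have no effect, try also passing `--grammar-rule root`; in some whisper.cpp versions the grammar is applied only when the start rule is named explicitly.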
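
When the list of allowed commands changes often, the grammar file can be generated rather than edited by hand. This is a minimal sketch under stated assumptions: `make_gbnf` and `write_gbnf` are hypothetical helpers, and they emit only the flat `root ::= "a" | "b"` form used in this post, escaping backslashes and double quotes so each phrase stays one string literal.

```python
from pathlib import Path
from typing import Iterable


def make_gbnf(phrases: Iterable[str]) -> str:
    """Build a single `root` rule that lists the given phrases as alternatives."""
    quoted = []
    for phrase in phrases:
        # Escape backslashes first, then double quotes.
        escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    if not quoted:
        raise ValueError("at least one phrase is required")
    return "root ::= " + " | ".join(quoted) + "\n"


def write_gbnf(path: str, phrases: Iterable[str]) -> None:
    """Write the rule as UTF-8 so Chinese phrases are stored unchanged."""
    Path(path).write_text(make_gbnf(phrases), encoding="utf-8")
```

Calling `write_gbnf("D:/ai/asr/rule/hotel_rule.gbnf", ["送一瓶可乐", "打开房间"])` would produce a one-line file in the same shape as the hand-written rule above.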

Step 4: Save the result (record the true words)

Add --output-txt to save the recognition result to a text file for later reference:

```
whisper-cli.exe -m D:/ai/asr/models/ggml-tiny.bin --language zh --grammar D:/ai/asr/rule/hotel_rule.gbnf --output-txt --no-timestamps D:/ai/asr/audio/test.wav
```

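After it runs, a text file containing the clean transcript is written next to the audio file. whisper-cli appends the extension to the full input name, so expect test.wav.txt rather than test.txt; pass -of to choose a different base name (given without the extension).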
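
A script that consumes the saved transcript needs to locate it first. The sketch below is an assumption-laden helper, not part of whisper.cpp: `find_transcript` checks both plausible names (`test.wav.txt` and `test.txt`) so it works whichever naming your build uses, and `read_transcript` returns the text with surrounding whitespace removed.

```python
from pathlib import Path
from typing import Optional


def find_transcript(audio: str) -> Optional[Path]:
    """Return the transcript written for `audio`, or None if none exists."""
    candidates = [
        Path(audio + ".txt"),            # extension appended: test.wav.txt
        Path(audio).with_suffix(".txt"),  # extension replaced: test.txt
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def read_transcript(audio: str) -> str:
    """Read the transcript as UTF-8 and strip leading/trailing whitespace."""
    path = find_transcript(audio)
    if path is None:
        raise FileNotFoundError(f"no transcript found for {audio}")
    return path.read_text(encoding="utf-8").strip()
```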

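III. Common use cases (the Alliance's sect business)

1. Everyday transcription (recording mortal messages)

Suited to turning meeting recordings into text, organizing interview audio, and similar work. No GBNF constraint is needed; the basic command converts speech to text and saves you from typing it out by hand.

Key advantage: runs locally with no network connection, which protects privacy;
Starter command: `whisper-cli.exe -m <model path> --language zh <audio path> --output-txt`.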

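2. Specific command recognition (disciples relaying orders)

Suited to smart-device control, business command extraction, and similar scenarios, where a GBNF grammar restricts recognition to a preset list of commands.

Example scenarios: voice commands at a hotel front desk (such as "打开房间 3105" or "送一瓶可乐") and smart-home control (such as "打开灯光" or "关闭空调");
Key advantage: suppresses unrelated content and pulls out the key command, which reduces misrecognition;
Key configuration: a GBNF grammar file (the sect precepts) written for the scenario.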

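3. Precise hot-word extraction (true words for treasure hunting)

Suited to pulling specific pieces of information out of speech, such as the room number, phone number, or delivery item in a hotel call.

Approach: first narrow recognition with a minimal GBNF grammar, then extract the hot words with regular expressions;
Key advantage: more focused than general-purpose transcription, a good fit for vertical business scenarios.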
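
The regular-expression half of that approach can be sketched as follows. The patterns and the `KNOWN_ITEMS` list are illustrative assumptions based on the sample phrases in this post, not a general solution; real transcripts will need patterns tuned to how your callers actually speak.

```python
import re
from typing import Dict, List

# Hypothetical patterns matching the sample phrases used in this post.
ROOM_RE = re.compile(r"房间\s*(\d{3,4})")      # "房间3105" or "房间 3105"
PHONE_RE = re.compile(r"手机号\s*(\d{7,11})")  # digits following "手机号"
KNOWN_ITEMS = ["可乐"]                          # hypothetical delivery items


def extract_hotwords(text: str) -> Dict[str, List[str]]:
    """Pull room numbers, phone numbers and known items out of a transcript."""
    return {
        "rooms": ROOM_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "items": [item for item in KNOWN_ITEMS if item in text],
    }


sample = "帮我打开酒店房间3105，客人李四的手机号1348883468，送一瓶可乐"
print(extract_hotwords(sample))
# {'rooms': ['3105'], 'phones': ['1348883468'], 'items': ['可乐']}
```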

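IV. Pitfalls for beginners (a handbook for dodging lightning on the immortal path)

Choose a model within your means: start with tiny or base; larger models such as large are more accurate but demand far more from your machine and can run very slowly;
Audio format is the key prerequisite: 16 kHz mono WAV is the safest input. The help output lists flac, mp3, ogg and wav as supported, but older builds may accept only WAV, and the wrong format can produce garbled output or an outright failure;
Keep GBNF grammars minimal at first: avoid complex nested rules as a beginner and start from `root ::= "指令1" | "指令2"` to avoid parse errors;
Keep paths plain: avoid Chinese characters and spaces in the model, audio, and grammar paths (or at least quote paths that contain spaces); otherwise the tool may report that the file cannot be found.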
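
The path rule is easy to check before launching a run. A minimal sketch, with `path_warnings` as a hypothetical helper name: it flags spaces and non-ASCII characters, the two cases named above, and nothing else.

```python
from typing import List


def path_warnings(path: str) -> List[str]:
    """Return warnings for characters that commonly break whisper-cli paths."""
    warnings = []
    if " " in path:
        warnings.append("contains a space (quote the path or rename it)")
    if not path.isascii():
        warnings.append("contains non-ASCII characters (e.g. Chinese)")
    return warnings


for p in ["D:/ai/asr/audio/test.wav", "D:/我的 音频/test.wav"]:
    print(p, "->", path_warnings(p) or "ok")
```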

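V. Next steps (the path of advancement)

Try larger models: move from tiny up to base or small for better accuracy;
Write richer GBNF rules: tailor the grammar to your business scenario for tighter command constraints;
Automate with scripts: use a Python script to batch-process audio files, extract hot words automatically, and generate reports;
Deploy to embedded devices: whisper.cpp supports devices such as the Raspberry Pi, so a fully local voice assistant is possible.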
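
Batch processing can be sketched in a few lines of Python. Everything here is an assumption for illustration: `batch_commands` and `run_batch` are hypothetical helpers, the flags are the ones used earlier in this post, and the script presumes whisper-cli.exe is reachable from the working directory.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def batch_commands(audio_dir: str, model: str,
                   exe: str = "whisper-cli.exe") -> List[List[str]]:
    """Build one whisper-cli command per .wav file, in sorted order."""
    commands = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        commands.append([
            exe, "-m", model, "--language", "zh",
            "--output-txt", "--no-timestamps", "-f", str(wav),
        ])
    return commands


def run_batch(commands: List[List[str]],
              runner: Callable = subprocess.run) -> List[List[str]]:
    """Run each command in turn and return the ones that exited non-zero."""
    failed = []
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(cmd)
    return failed
```

A typical call would be `run_batch(batch_commands("D:/ai/asr/audio", "D:/ai/asr/models/ggml-tiny.bin"))`. The `runner` parameter exists so the loop can be exercised with a stand-in function before pointing it at real audio.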

A-Xue's View on Technology

Hey folks, in this wild tech-driven world, why not dive headfirst into the whole tech-sharing scene? Don't just be the one reaping all the benefits; step up and be a contributor too. Whether you're sharing code, writing tech blogs, or getting your hands dirty maintaining and improving open-source projects, every little thing you do might end up being a massive force that pushes tech forward. The Eastern Fairy Alliance is a place where we all come together: we team up to explore silicon-based life and, in the process, add our own bricks to the growth of technology.