1. Full command-line help text
```bash
(pp2) livingbody@192 workspace % whisper --help
usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]
[--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE]
[--task {transcribe,translate}]
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
[--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE]
[--patience PATIENCE] [--length_penalty LENGTH_PENALTY]
[--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT]
[--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]
[--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
[--logprob_threshold LOGPROB_THRESHOLD] [--no_speech_threshold NO_SPEECH_THRESHOLD]
[--word_timestamps WORD_TIMESTAMPS] [--prepend_punctuations PREPEND_PUNCTUATIONS]
[--append_punctuations APPEND_PUNCTUATIONS] [--highlight_words HIGHLIGHT_WORDS]
[--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]
[--max_words_per_line MAX_WORDS_PER_LINE] [--threads THREADS]
[--clip_timestamps CLIP_TIMESTAMPS]
[--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
audio [audio ...]
positional arguments:
audio audio file(s) to transcribe
optional arguments:
-h, --help show this help message and exit
--model MODEL name of the Whisper model to use (default: turbo)
--model_dir MODEL_DIR
the path to save model files; uses ~/.cache/whisper by default (default: None)
--device DEVICE device to use for PyTorch inference (default: cpu)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
format of the output file; if not specified, all available formats will be
produced (default: all)
--verbose VERBOSE whether to print out the progress and debug messages (default: True)
--task {transcribe,translate}
whether to perform X->X speech recognition ('transcribe') or X->English
translation ('translate') (default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
language spoken in the audio, specify None to perform language detection
(default: None)
--temperature TEMPERATURE
temperature to use for sampling (default: 0)
--best_of BEST_OF number of candidates when sampling with non-zero temperature (default: 5)
--beam_size BEAM_SIZE
number of beams in beam search, only applicable when temperature is zero
(default: 5)
--patience PATIENCE optional patience value to use in beam decoding, as in
https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to
conventional beam search (default: None)
--length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as in
https://arxiv.org/abs/1609.08144, uses simple length normalization by default
(default: None)
--suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during sampling; '-1' will
suppress most special characters except common punctuations (default: -1)
--initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first window. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a prompt for the next
window; disabling may make the text inconsistent across windows, but the model
becomes less prone to getting stuck in a failure loop (default: True)
--fp16 FP16 whether to perform inference in fp16; True by default (default: True)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the decoding fails to meet either
of the thresholds below (default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this value, treat the decoding as
failed (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this value, treat the decoding as
failed (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher than this value AND the
decoding has failed due to `logprob_threshold`, consider the segment as silence
(default: 0.6)
--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and refine the results based on
them (default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the next word
(default: "'"¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the previous
word (default: "'.。,,!!??::")]}、)
--highlight_words HIGHLIGHT_WORDS
(requires --word_timestamps True) underline each word as it is spoken in srt and
vtt (default: False)
--max_line_width MAX_LINE_WIDTH
(requires --word_timestamps True) the maximum number of characters in a line
before breaking the line (default: None)
--max_line_count MAX_LINE_COUNT
(requires --word_timestamps True) the maximum number of lines in a segment
(default: None)
--max_words_per_line MAX_WORDS_PER_LINE
(requires --word_timestamps True, no effect with --max_line_width) the maximum
number of words in a segment (default: None)
--threads THREADS number of threads used by torch for CPU inference; supercedes
MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
--clip_timestamps CLIP_TIMESTAMPS
comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
process, where the last end timestamp defaults to the end of the file (default:
0)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
(requires --word_timestamps True) skip silent periods longer than this threshold
(in seconds) when a possible hallucination is detected (default: None)
```
2. Command-line options explained
2.1 Model-related options
From the help output, three options relate to the model: the model name, the model directory, and the device used to run it (a usage sketch follows the list):
- --model MODEL name of the Whisper model to use (default: turbo)
- --model_dir MODEL_DIR
the path to save model files; uses ~/.cache/whisper by default (default: None)
- --device DEVICE device to use for PyTorch inference (default: cpu)
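A minimal sketch combining these three options; the audio filename `speech.wav` and the cache directory are placeholders chosen here for illustration, not files from the article:

```bash
# Run the small model, cache its weights in a custom directory,
# and force CPU inference (use "cuda" instead if a GPU is available).
whisper speech.wav --model small --model_dir ./whisper-models --device cpu
```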
2.2 Input/output options
There are three of these: the output directory, the output format, and the verbosity switch for progress and debug messages (a usage sketch follows the list):
- --output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
- --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
format of the output file; if not specified, all available formats will be
produced (default: all)
- --verbose VERBOSE whether to print out the progress and debug messages (default: True)
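For example, to keep only an SRT subtitle file and silence the progress log (the filenames here are placeholders):

```bash
# Write only an SRT file into ./out and suppress progress/debug messages.
whisper speech.wav --output_dir ./out --output_format srt --verbose False
```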
2.3 Task-related options
This group is the largest. It covers the task type (speech-to-text transcription or translation into English), plus the language setting, sampling and beam-search parameters, fallback thresholds, and the word-timestamp/subtitle options; a usage sketch follows the list.
- --task {transcribe,translate}
whether to perform X->X speech recognition ('transcribe') or X->English
translation ('translate') (default: transcribe)
- --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
language spoken in the audio, specify None to perform language detection
(default: None)
- --temperature TEMPERATURE
temperature to use for sampling (default: 0)
- --best_of BEST_OF number of candidates when sampling with non-zero temperature (default: 5)
- --beam_size BEAM_SIZE
number of beams in beam search, only applicable when temperature is zero
(default: 5)
- --patience PATIENCE optional patience value to use in beam decoding, as in
https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to
conventional beam search (default: None)
- --length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as in
https://arxiv.org/abs/1609.08144, uses simple length normalization by default
(default: None)
- --suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during sampling; '-1' will
suppress most special characters except common punctuations (default: -1)
- --initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first window. (default: None)
- --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a prompt for the next
window; disabling may make the text inconsistent across windows, but the model
becomes less prone to getting stuck in a failure loop (default: True)
- --fp16 FP16 whether to perform inference in fp16; True by default (default: True)
- --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the decoding fails to meet either
of the thresholds below (default: 0.2)
- --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this value, treat the decoding as
failed (default: 2.4)
- --logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this value, treat the decoding as
failed (default: -1.0)
- --no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher than this value AND the
decoding has failed due to `logprob_threshold`, consider the segment as silence
(default: 0.6)
- --word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and refine the results based on
them (default: False)
- --prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the next word
(default: "'"¿([{-)
- --append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the previous
word (default: "'.。,,!!??::")]}、)
- --highlight_words HIGHLIGHT_WORDS
(requires --word_timestamps True) underline each word as it is spoken in srt and
vtt (default: False)
- --max_line_width MAX_LINE_WIDTH
(requires --word_timestamps True) the maximum number of characters in a line
before breaking the line (default: None)
- --max_line_count MAX_LINE_COUNT
(requires --word_timestamps True) the maximum number of lines in a segment
(default: None)
- --max_words_per_line MAX_WORDS_PER_LINE
(requires --word_timestamps True, no effect with --max_line_width) the maximum
number of words in a segment (default: None)
- --threads THREADS number of threads used by torch for CPU inference; supercedes
MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
- --clip_timestamps CLIP_TIMESTAMPS
comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
process, where the last end timestamp defaults to the end of the file (default: 0)
- --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
(requires --word_timestamps True) skip silent periods longer than this threshold
(in seconds) when a possible hallucination is detected (default: None)
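The sketch below pulls together the options from this group that are most commonly adjusted; the audio filename and the chosen values are illustrative, not recommendations:

```bash
# Transcribe Chinese audio with beam search (beam_size applies because temperature is 0),
# seed the first window with an initial prompt, and emit word-level timestamps
# so the subtitle-layout options below take effect.
whisper speech.wav \
  --task transcribe \
  --language Chinese \
  --temperature 0 \
  --beam_size 5 \
  --initial_prompt "以下是普通话的句子。" \
  --word_timestamps True \
  --highlight_words True \
  --max_line_width 42 \
  --max_line_count 2
```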
3. Running the demo
3.1 Chinese speech recognition
The basic command-line format is shown below; let's see how it performs on a Mac.
```bash
whisper /path/to/audio/file --model /path/to/custom/model --language Chinese
whisper zh.wav --model tiny.pt --language Chinese
```
Output:
```bash
(pp2) livingbody@192 sound4 % whisper zh.wav --model tiny.pt
/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.000 --> 00:04.480] 我認為跑步最重要的就是給我帶來了身體健康
```
As you can see, the tiny model does not handle Simplified Chinese well (it transcribed the audio into Traditional characters), so if you use it you still need a conversion step; one option is sketched below.
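One way to do that conversion (my own suggestion, not something the Whisper CLI provides) is the OpenCC tool; the sketch assumes OpenCC 1.x is installed and that the transcript was written to zh.txt by the default txt output:

```bash
# Convert the Traditional Chinese transcript to Simplified Chinese with OpenCC
# (assumes OpenCC is installed, e.g. via `brew install opencc`), using its t2s.json config.
opencc -i zh.txt -o zh_simplified.txt -c t2s.json
```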
3.2 English speech recognition
```bash
whisper /path/to/audio/file --model /path/to/custom/model --language English
whisper en.wav --model tiny.pt --language English
```
Output:
```bash
(pp2) livingbody@192 sound4 % whisper en.wav --model tiny.pt
/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.000] I knocked at the door on the ancient side of the building.
```
The English output is quite accurate, and the tiny model used here is only around 70 MB, small enough to deploy on devices such as a Raspberry Pi; a CPU-oriented sketch follows.
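For a CPU-only board like a Raspberry Pi, the command below makes the CPU-friendly settings explicit (the audio filename is a placeholder):

```bash
# Force FP32 on CPU (avoids the FP16 warning seen above) and pin torch to 4 threads.
whisper en.wav --model tiny --device cpu --fp16 False --threads 4
```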
