Whisper Command-Line Options Explained [2]

1. Full command-line help

(pp2) livingbody@192 workspace % whisper --help

usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]

               [--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE]

               [--task {transcribe,translate}]

               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]

               [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE]

               [--patience PATIENCE] [--length_penalty LENGTH_PENALTY]

               [--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT]

               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]

               [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]

               [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]

               [--logprob_threshold LOGPROB_THRESHOLD] [--no_speech_threshold NO_SPEECH_THRESHOLD]

               [--word_timestamps WORD_TIMESTAMPS] [--prepend_punctuations PREPEND_PUNCTUATIONS]

               [--append_punctuations APPEND_PUNCTUATIONS] [--highlight_words HIGHLIGHT_WORDS]

               [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]

               [--max_words_per_line MAX_WORDS_PER_LINE] [--threads THREADS]

               [--clip_timestamps CLIP_TIMESTAMPS]

               [--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]

               audio [audio ...]

  


positional arguments:

  audio                 audio file(s) to transcribe

  


optional arguments:

  -h, --help            show this help message and exit

  --model MODEL         name of the Whisper model to use (default: turbo)

  --model_dir MODEL_DIR

                        the path to save model files; uses ~/.cache/whisper by default (default: None)

  --device DEVICE       device to use for PyTorch inference (default: cpu)

  --output_dir OUTPUT_DIR, -o OUTPUT_DIR

                        directory to save the outputs (default: .)

  --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}

                        format of the output file; if not specified, all available formats will be

                        produced (default: all)

  --verbose VERBOSE     whether to print out the progress and debug messages (default: True)

  --task {transcribe,translate}

                        whether to perform X->X speech recognition ('transcribe') or X->English

                        translation ('translate') (default: transcribe)

  --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

                        language spoken in the audio, specify None to perform language detection

                        (default: None)

  --temperature TEMPERATURE

                        temperature to use for sampling (default: 0)

  --best_of BEST_OF     number of candidates when sampling with non-zero temperature (default: 5)

  --beam_size BEAM_SIZE

                        number of beams in beam search, only applicable when temperature is zero

                        (default: 5)

  --patience PATIENCE   optional patience value to use in beam decoding, as in

                        https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to

                        conventional beam search (default: None)

  --length_penalty LENGTH_PENALTY

                        optional token length penalty coefficient (alpha) as in

                        https://arxiv.org/abs/1609.08144, uses simple length normalization by default

                        (default: None)

  --suppress_tokens SUPPRESS_TOKENS

                        comma-separated list of token ids to suppress during sampling; '-1' will

                        suppress most special characters except common punctuations (default: -1)

  --initial_prompt INITIAL_PROMPT

                        optional text to provide as a prompt for the first window. (default: None)

  --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT

                        if True, provide the previous output of the model as a prompt for the next

                        window; disabling may make the text inconsistent across windows, but the model

                        becomes less prone to getting stuck in a failure loop (default: True)

  --fp16 FP16           whether to perform inference in fp16; True by default (default: True)

  --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK

                        temperature to increase when falling back when the decoding fails to meet either

                        of the thresholds below (default: 0.2)

  --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD

                        if the gzip compression ratio is higher than this value, treat the decoding as

                        failed (default: 2.4)

  --logprob_threshold LOGPROB_THRESHOLD

                        if the average log probability is lower than this value, treat the decoding as

                        failed (default: -1.0)

  --no_speech_threshold NO_SPEECH_THRESHOLD

                        if the probability of the <|nospeech|> token is higher than this value AND the

                        decoding has failed due to `logprob_threshold`, consider the segment as silence

                        (default: 0.6)

  --word_timestamps WORD_TIMESTAMPS

                        (experimental) extract word-level timestamps and refine the results based on

                        them (default: False)

  --prepend_punctuations PREPEND_PUNCTUATIONS

                        if word_timestamps is True, merge these punctuation symbols with the next word

                        (default: "'"¿([{-)

  --append_punctuations APPEND_PUNCTUATIONS

                        if word_timestamps is True, merge these punctuation symbols with the previous

                        word (default: "'.。,,!!??::")]}、)

  --highlight_words HIGHLIGHT_WORDS

                        (requires --word_timestamps True) underline each word as it is spoken in srt and

                        vtt (default: False)

  --max_line_width MAX_LINE_WIDTH

                        (requires --word_timestamps True) the maximum number of characters in a line

                        before breaking the line (default: None)

  --max_line_count MAX_LINE_COUNT

                        (requires --word_timestamps True) the maximum number of lines in a segment

                        (default: None)

  --max_words_per_line MAX_WORDS_PER_LINE

                        (requires --word_timestamps True, no effect with --max_line_width) the maximum

                        number of words in a segment (default: None)

  --threads THREADS     number of threads used by torch for CPU inference; supercedes

                        MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)

  --clip_timestamps CLIP_TIMESTAMPS

                        comma-separated list start,end,start,end,... timestamps (in seconds) of clips to

                        process, where the last end timestamp defaults to the end of the file (default:

                        0)

  --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD

                        (requires --word_timestamps True) skip silent periods longer than this threshold

                        (in seconds) when a possible hallucination is detected (default: None)

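2. Command-line options explained

2.1 Model options

The help text lists three model-related options: the model name, the model directory, and the device the model runs on: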

  • --model MODEL name of the Whisper model to use (default: turbo)

  • --model_dir MODEL_DIR

the path to save model files; uses ~/.cache/whisper by default (default: None)

  • --device DEVICE device to use for PyTorch inference (default: cpu)
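
To make the three options concrete, here is a minimal Python sketch that assembles a whisper command from them. The helper name `build_whisper_command` and the `./models` directory are made up for illustration; only the flags and their defaults come from the help text above.

```python
import shlex


def build_whisper_command(audio, model="turbo", model_dir=None, device="cpu"):
    """Assemble a whisper argv list from the three model-related options."""
    cmd = ["whisper", audio, "--model", model, "--device", device]
    # --model_dir is added only when set; per the help text, whisper
    # otherwise uses ~/.cache/whisper.
    if model_dir is not None:
        cmd += ["--model_dir", model_dir]
    return cmd


# "./models" is a hypothetical directory used only for this example.
cmd = build_whisper_command("zh.wav", model="tiny", model_dir="./models")
print(shlex.join(cmd))
```

On a machine where whisper is installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`.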

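2.2 Input and output options

There are three main options: the output directory, the output format, and whether to print progress and debug messages: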

  • --output_dir OUTPUT_DIR, -o OUTPUT_DIR

directory to save the outputs (default: .)

  • --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}

format of the output file; if not specified, all available formats will be

produced (default: all)

  • --verbose VERBOSE whether to print out the progress and debug messages (default: True)
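
As a rough illustration of how `--output_dir` and `--output_format` combine, the sketch below lists the files a run would be expected to produce. It assumes each output file is named after the audio file's base name plus the format extension, which the help text does not state. `expected_outputs` and the `out` directory are hypothetical names.

```python
from pathlib import Path

# The concrete formats listed in the help text; "all" selects every one of them.
FORMATS = ("txt", "vtt", "srt", "tsv", "json")


def expected_outputs(audio, output_dir=".", output_format="all"):
    """List the files a run is expected to write (assumed naming: <audio stem>.<format>)."""
    if output_format != "all" and output_format not in FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    chosen = FORMATS if output_format == "all" else (output_format,)
    stem = Path(audio).stem  # "sound4/zh.wav" -> "zh"
    return [Path(output_dir) / f"{stem}.{ext}" for ext in chosen]


for path in expected_outputs("sound4/zh.wav", output_dir="out"):
    print(path)
```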

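2.3 Task options

This is the largest group. It covers the task type (speech-to-text transcription or translation), the language setting, sampling parameters, and more.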

  • --task {transcribe,translate}

whether to perform X->X speech recognition ('transcribe') or X->English

translation ('translate') (default: transcribe)

  • --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

language spoken in the audio, specify None to perform language detection

(default: None)

  • --temperature TEMPERATURE

temperature to use for sampling (default: 0)

  • --best_of BEST_OF number of candidates when sampling with non-zero temperature (default: 5)

  • --beam_size BEAM_SIZE

number of beams in beam search, only applicable when temperature is zero

(default: 5)

  • --patience PATIENCE optional patience value to use in beam decoding, as in

https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to

conventional beam search (default: None)

  • --length_penalty LENGTH_PENALTY

optional token length penalty coefficient (alpha) as in

https://arxiv.org/abs/1609.08144, uses simple length normalization by default

(default: None)

  • --suppress_tokens SUPPRESS_TOKENS

comma-separated list of token ids to suppress during sampling; '-1' will

suppress most special characters except common punctuations (default: -1)

  • --initial_prompt INITIAL_PROMPT

optional text to provide as a prompt for the first window. (default: None)

  • --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT

if True, provide the previous output of the model as a prompt for the next

window; disabling may make the text inconsistent across windows, but the model

becomes less prone to getting stuck in a failure loop (default: True)

  • --fp16 FP16 whether to perform inference in fp16; True by default (default: True)

  • --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK

temperature to increase when falling back when the decoding fails to meet either

of the thresholds below (default: 0.2)

  • --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD

if the gzip compression ratio is higher than this value, treat the decoding as

failed (default: 2.4)

  • --logprob_threshold LOGPROB_THRESHOLD

if the average log probability is lower than this value, treat the decoding as

failed (default: -1.0)

  • --no_speech_threshold NO_SPEECH_THRESHOLD

if the probability of the <|nospeech|> token is higher than this value AND the

decoding has failed due to logprob_threshold, consider the segment as silence

(default: 0.6)

  • --word_timestamps WORD_TIMESTAMPS

(experimental) extract word-level timestamps and refine the results based on

them (default: False)

  • --prepend_punctuations PREPEND_PUNCTUATIONS

if word_timestamps is True, merge these punctuation symbols with the next word

(default: "'"¿([{-)

  • --append_punctuations APPEND_PUNCTUATIONS

if word_timestamps is True, merge these punctuation symbols with the previous

word (default: "'.。,,!!??::")]}、)

  • --highlight_words HIGHLIGHT_WORDS

(requires --word_timestamps True) underline each word as it is spoken in srt and

vtt (default: False)

  • --max_line_width MAX_LINE_WIDTH

(requires --word_timestamps True) the maximum number of characters in a line

before breaking the line (default: None)

  • --max_line_count MAX_LINE_COUNT

(requires --word_timestamps True) the maximum number of lines in a segment

(default: None)

  • --max_words_per_line MAX_WORDS_PER_LINE

(requires --word_timestamps True, no effect with --max_line_width) the maximum

number of words in a segment (default: None)

  • --threads THREADS number of threads used by torch for CPU inference; supercedes

MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)

  • --clip_timestamps CLIP_TIMESTAMPS

comma-separated list start,end,start,end,... timestamps (in seconds) of clips to

process, where the last end timestamp defaults to the end of the file (default: 0)

  • --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD

(requires --word_timestamps True) skip silent periods longer than this threshold

(in seconds) when a possible hallucination is detected (default: None)
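
The fallback options are easier to follow with numbers. The sketch below is a simplified reading of the help text, not Whisper's actual code. It assumes the temperature is raised in steps of `--temperature_increment_on_fallback` up to 1.0, and it uses `zlib` from the standard library as a stand-in for the gzip compression ratio. All function names are illustrative.

```python
import zlib


def temperature_schedule(temperature=0.0, increment=0.2):
    """Temperatures tried in order: the start value, then +increment steps up to 1.0."""
    if increment is None or increment <= 0:
        return (temperature,)
    steps = max(int((1.0 - temperature) / increment + 1e-6) + 1, 1)
    return tuple(round(temperature + i * increment, 6) for i in range(steps))


def compression_ratio(text):
    """Raw size divided by compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def needs_fallback(text, avg_logprob,
                   compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """True when a decoding misses either threshold and should be retried."""
    too_repetitive = compression_ratio(text) > compression_ratio_threshold
    too_uncertain = avg_logprob < logprob_threshold
    return too_repetitive or too_uncertain


print(temperature_schedule())                       # defaults from the help text
print(needs_fallback("la " * 200, avg_logprob=-0.3))  # a looping, repetitive output
```

With the defaults, a decoding that loops on the same phrase compresses very well, exceeds the 2.4 ratio, and would be retried at the next temperature in the schedule.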

3. Running the demo

3.1 Chinese speech recognition

The basic command format is shown below. Let's see how it performs on a Mac.

whisper /path/to/audio/file --model /path/to/custom/model --language Chinese
whisper zh.wav --model tiny.pt --language Chinese

Output:

(pp2) livingbody@192 sound4 % whisper zh.wav --model tiny.pt              

/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead

  warnings.warn("FP16 is not supported on CPU; using FP32 instead")

Detecting language using up to the first 30 seconds. Use `--language` to specify the language

Detected language: Chinese

[00:00.000 --> 00:04.480] 我認為跑步最重要的就是給我帶來了身體健康

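As the output shows, the tiny model does not handle Simplified Chinese well: the transcript comes back in Traditional Chinese characters, so it needs a further conversion step before use.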
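
The conversion step could look something like the sketch below. The mapping table is hand-written and covers only the characters in the sample sentence above, so it is purely illustrative. In practice a dedicated Traditional-to-Simplified converter such as OpenCC would be used instead.

```python
# Hand-written table covering only the traditional characters in the sample
# transcript. It is not a general converter.
T2S_SAMPLE = str.maketrans({
    "認": "认",
    "為": "为",
    "給": "给",
    "帶": "带",
    "來": "来",
    "體": "体",
})


def to_simplified(text):
    """Replace the traditional characters listed in T2S_SAMPLE with simplified ones."""
    return text.translate(T2S_SAMPLE)


print(to_simplified("我認為跑步最重要的就是給我帶來了身體健康"))
```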

3.2 English speech recognition

whisper /path/to/audio/file --model /path/to/custom/model --language English
whisper en.wav --model tiny.pt

Output:

(pp2) livingbody@192 sound4 % whisper en.wav --model tiny.pt 

/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead

  warnings.warn("FP16 is not supported on CPU; using FP32 instead")

Detecting language using up to the first 30 seconds. Use `--language` to specify the language

Detected language: English

[00:00.000 --> 00:03.000]  I knocked at the door on the ancient side of the building.

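The English output is fairly accurate. The tiny model used here is only a little over 70 MB, which makes it a good fit for deployment on devices such as the Raspberry Pi.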
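
If the printed transcript needs further processing, the segment lines can be parsed with a few lines of Python. This is a minimal sketch that assumes the `[start --> end] text` layout seen in the two outputs above. `parse_segment` and `to_seconds` are illustrative names, not part of Whisper.

```python
import re

# Matches lines such as "[00:00.000 --> 00:03.000]  some text".
SEGMENT_RE = re.compile(
    r"^\[(?P<start>[\d:.]+) --> (?P<end>[\d:.]+)\]\s*(?P<text>.*)$"
)


def to_seconds(stamp):
    """Convert mm:ss.mmm or hh:mm:ss.mmm into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_segment(line):
    """Return (start, end, text) for a segment line, or None for any other line."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None
    return to_seconds(match["start"]), to_seconds(match["end"]), match["text"]


line = "[00:00.000 --> 00:03.000]  I knocked at the door on the ancient side of the building."
print(parse_segment(line))
```

Lines that are not segments, such as the FP16 warning or the detected-language message, return `None` and can simply be skipped.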
