Parsing the whisper command line [2]

1. Full command-line help text

(pp2) livingbody@192 workspace % whisper --help
usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE]
               [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]

               [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE]
               [--patience PATIENCE] [--length_penalty LENGTH_PENALTY]
               [--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT]
               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]
               [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
               [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
               [--logprob_threshold LOGPROB_THRESHOLD] [--no_speech_threshold NO_SPEECH_THRESHOLD]
               [--word_timestamps WORD_TIMESTAMPS] [--prepend_punctuations PREPEND_PUNCTUATIONS]
               [--append_punctuations APPEND_PUNCTUATIONS] [--highlight_words HIGHLIGHT_WORDS]
               [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]
               [--max_words_per_line MAX_WORDS_PER_LINE] [--threads THREADS]
               [--clip_timestamps CLIP_TIMESTAMPS]
               [--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
               audio [audio ...]

positional arguments:
  audio                 audio file(s) to transcribe

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         name of the Whisper model to use (default: turbo)
  --model_dir MODEL_DIR
                        the path to save model files; uses ~/.cache/whisper by default (default: None)
  --device DEVICE       device to use for PyTorch inference (default: cpu)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        directory to save the outputs (default: .)
  --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                        format of the output file; if not specified, all available formats will be
                        produced (default: all)
  --verbose VERBOSE     whether to print out the progress and debug messages (default: True)
  --task {transcribe,translate}
                        whether to perform X->X speech recognition ('transcribe') or X->English
                        translation ('translate') (default: transcribe)
  --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

                        language spoken in the audio, specify None to perform language detection
                        (default: None)
  --temperature TEMPERATURE
                        temperature to use for sampling (default: 0)
  --best_of BEST_OF     number of candidates when sampling with non-zero temperature (default: 5)
  --beam_size BEAM_SIZE
                        number of beams in beam search, only applicable when temperature is zero
                        (default: 5)
  --patience PATIENCE   optional patience value to use in beam decoding, as in
                        https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to
                        conventional beam search (default: None)
  --length_penalty LENGTH_PENALTY
                        optional token length penalty coefficient (alpha) as in
                        https://arxiv.org/abs/1609.08144, uses simple length normalization by default
                        (default: None)
  --suppress_tokens SUPPRESS_TOKENS
                        comma-separated list of token ids to suppress during sampling; '-1' will
                        suppress most special characters except common punctuations (default: -1)
  --initial_prompt INITIAL_PROMPT
                        optional text to provide as a prompt for the first window. (default: None)
  --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
                        if True, provide the previous output of the model as a prompt for the next
                        window; disabling may make the text inconsistent across windows, but the model
                        becomes less prone to getting stuck in a failure loop (default: True)
  --fp16 FP16           whether to perform inference in fp16; True by default (default: True)
  --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
                        temperature to increase when falling back when the decoding fails to meet either
                        of the thresholds below (default: 0.2)
  --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
                        if the gzip compression ratio is higher than this value, treat the decoding as
                        failed (default: 2.4)
  --logprob_threshold LOGPROB_THRESHOLD
                        if the average log probability is lower than this value, treat the decoding as
                        failed (default: -1.0)
  --no_speech_threshold NO_SPEECH_THRESHOLD
                        if the probability of the <|nospeech|> token is higher than this value AND the
                        decoding has failed due to `logprob_threshold`, consider the segment as silence
                        (default: 0.6)
  --word_timestamps WORD_TIMESTAMPS
                        (experimental) extract word-level timestamps and refine the results based on
                        them (default: False)
  --prepend_punctuations PREPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the next word
                        (default: "'“¿([{-)
  --append_punctuations APPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the previous
                        word (default: "'.。,,!!??::”)]}、)
  --highlight_words HIGHLIGHT_WORDS
                        (requires --word_timestamps True) underline each word as it is spoken in srt and
                        vtt (default: False)
  --max_line_width MAX_LINE_WIDTH
                        (requires --word_timestamps True) the maximum number of characters in a line
                        before breaking the line (default: None)
  --max_line_count MAX_LINE_COUNT
                        (requires --word_timestamps True) the maximum number of lines in a segment
                        (default: None)
  --max_words_per_line MAX_WORDS_PER_LINE
                        (requires --word_timestamps True, no effect with --max_line_width) the maximum
                        number of words in a segment (default: None)
  --threads THREADS     number of threads used by torch for CPU inference; supercedes
                        MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
  --clip_timestamps CLIP_TIMESTAMPS
                        comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
                        process, where the last end timestamp defaults to the end of the file (default:
                        0)
  --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
                        (requires --word_timestamps True) skip silent periods longer than this threshold
                        (in seconds) when a possible hallucination is detected (default: None)

2. Breaking down the command line

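2.1 Model-related parameters

According to the help output, three options relate to the model: the model name, the model directory, and the device the model runs on. Specifically: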

  • --model MODEL: name of the Whisper model to use (default: turbo)

  • --model_dir MODEL_DIR: the path to save model files; uses ~/.cache/whisper by default (default: None)

  • --device DEVICE: device to use for PyTorch inference (default: cpu)
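
To make the `--model_dir` default concrete, here is a minimal Python sketch of how a value of None could resolve to ~/.cache/whisper. The function name `resolve_model_dir` is hypothetical, and the `XDG_CACHE_HOME` override is an assumption about the installed version rather than something the help text states, so treat this as an illustration only.

```python
import os


def resolve_model_dir(model_dir=None, env=None):
    """Return the directory where model files would be stored.

    Hypothetical helper: follows the documented default (~/.cache/whisper).
    The XDG_CACHE_HOME override is an assumption; verify it against your
    installed version before relying on it.
    """
    env = os.environ if env is None else env
    if model_dir is not None:
        # An explicit --model_dir always wins.
        return model_dir
    default_cache = os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(env.get("XDG_CACHE_HOME", default_cache), "whisper")


# With no --model_dir given, this prints a path ending in "whisper".
print(resolve_model_dir())
```

Passing an explicit directory is handy on small devices where the home partition is tight, since the model files can then live on external storage.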

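2.2 Input/output parameters

There are three options here: the output directory, the output format, and whether to print progress and debug messages. Specifically: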

  • --output_dir OUTPUT_DIR, -o OUTPUT_DIR: directory to save the outputs (default: .)

  • --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}: format of the output file; if not specified, all available formats will be produced (default: all)

  • --verbose VERBOSE: whether to print out the progress and debug messages (default: True)
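
As a rough illustration of how `--output_dir` and `--output_format` interact, the sketch below lists the files one would expect for a given audio file. The helper `expected_outputs` is hypothetical, and the naming rule (the audio file's base name with the extension swapped) is an assumption to check against your own output directory.

```python
import os

# The concrete formats behind "all", taken from the --output_format choices.
ALL_FORMATS = ("txt", "vtt", "srt", "tsv", "json")


def expected_outputs(audio_path, output_dir=".", output_format="all"):
    """List the output files a run would be expected to produce.

    Hypothetical helper; assumes each file is named after the audio file's
    base name with the extension replaced by the format name.
    """
    formats = ALL_FORMATS if output_format == "all" else (output_format,)
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    return [os.path.join(output_dir, f"{stem}.{ext}") for ext in formats]


# One path per format for the default "all" setting.
for path in expected_outputs("zh.wav", output_dir="out"):
    print(path)
```

If only subtitles are needed, passing `-f srt` keeps the output directory from filling up with the other four formats.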

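2.3 Task-related parameters

This is the largest group. It covers the task type (speech-to-text transcription or translation), the language setting, sampling parameters, and more.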

  • --task {transcribe,translate}: whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate') (default: transcribe)

  • --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

language spoken in the audio, specify None to perform language detection (default: None)

  • --temperature TEMPERATURE: temperature to use for sampling (default: 0)

  • --best_of BEST_OF: number of candidates when sampling with non-zero temperature (default: 5)

  • --beam_size BEAM_SIZE: number of beams in beam search, only applicable when temperature is zero (default: 5)

  • --patience PATIENCE: optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search (default: None)

  • --length_penalty LENGTH_PENALTY: optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default (default: None)

  • --suppress_tokens SUPPRESS_TOKENS: comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations (default: -1)

  • --initial_prompt INITIAL_PROMPT: optional text to provide as a prompt for the first window. (default: None)

  • --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT: if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop (default: True)

  • --fp16 FP16: whether to perform inference in fp16; True by default (default: True)

  • --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK: temperature to increase when falling back when the decoding fails to meet either of the thresholds below (default: 0.2)

  • --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD: if the gzip compression ratio is higher than this value, treat the decoding as failed (default: 2.4)

  • --logprob_threshold LOGPROB_THRESHOLD: if the average log probability is lower than this value, treat the decoding as failed (default: -1.0)

  • --no_speech_threshold NO_SPEECH_THRESHOLD: if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence (default: 0.6)

  • --word_timestamps WORD_TIMESTAMPS: (experimental) extract word-level timestamps and refine the results based on them (default: False)

  • --prepend_punctuations PREPEND_PUNCTUATIONS: if word_timestamps is True, merge these punctuation symbols with the next word (default: "'“¿([{-)

  • --append_punctuations APPEND_PUNCTUATIONS: if word_timestamps is True, merge these punctuation symbols with the previous word (default: "'.。,,!!??::”)]}、)

  • --highlight_words HIGHLIGHT_WORDS: (requires --word_timestamps True) underline each word as it is spoken in srt and vtt (default: False)

  • --max_line_width MAX_LINE_WIDTH: (requires --word_timestamps True) the maximum number of characters in a line before breaking the line (default: None)

  • --max_line_count MAX_LINE_COUNT: (requires --word_timestamps True) the maximum number of lines in a segment (default: None)

  • --max_words_per_line MAX_WORDS_PER_LINE: (requires --word_timestamps True, no effect with --max_line_width) the maximum number of words in a segment (default: None)

  • --threads THREADS: number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)

  • --clip_timestamps CLIP_TIMESTAMPS: comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process, where the last end timestamp defaults to the end of the file (default: 0)

  • --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD: (requires --word_timestamps True) skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected (default: None)
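
The fallback-related options (`--temperature_increment_on_fallback`, `--compression_ratio_threshold`, `--logprob_threshold`, `--no_speech_threshold`) are easier to follow together. The Python sketch below is a simplified reading of the help text, not Whisper's actual code: the function names are hypothetical, the 1.0 upper bound on the temperature is an assumption, and zlib stands in for the "gzip compression ratio" the help mentions.

```python
import zlib


def temperature_schedule(temperature=0.0, increment=0.2):
    """Temperatures to try in order, from the start value up to 1.0 (assumed cap)."""
    if not increment:  # None or 0: no fallback steps, use the single value
        return [temperature]
    steps = int((1.0 + 1e-6 - temperature) / increment)
    return [round(temperature + i * increment, 6) for i in range(steps + 1)]


def compression_ratio(text):
    """Raw size divided by compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def needs_fallback(text, avg_logprob, no_speech_prob,
                   compression_ratio_threshold=2.4,
                   logprob_threshold=-1.0,
                   no_speech_threshold=0.6):
    """Decide whether a decoding should be retried at a higher temperature."""
    too_repetitive = compression_ratio(text) > compression_ratio_threshold
    too_unlikely = avg_logprob < logprob_threshold
    # Silence: high no-speech probability AND a failed logprob check.
    is_silence = no_speech_prob > no_speech_threshold and too_unlikely
    if is_silence:
        return False  # treat the segment as silence instead of retrying
    return too_repetitive or too_unlikely


print(temperature_schedule())  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
print(needs_fallback("ha " * 200, avg_logprob=-0.3, no_speech_prob=0.1))  # True
```

A looping transcript such as a repeated "ha ha ha" compresses very well, so its ratio is far above 2.4 and the decoding is retried at the next temperature in the schedule.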

3. Running a demo

3.1 Chinese speech recognition

The basic command format is shown below. Let's see how it performs on a Mac.

whisper /path/to/audio/file --model /path/to/custom/model --language Chinese
whisper zh.wav --model tiny.pt --language Chinese

Output:

(pp2) livingbody@192 sound4 % whisper zh.wav --model tiny.pt
/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.000 --> 00:04.480] 我認為跑步最重要的就是給我帶來了身體健康

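As the output shows, the tiny model returned Traditional Chinese characters here, so its Simplified Chinese support is not great. If you use it, the text needs a further conversion step to Simplified Chinese.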
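
The conversion step could look something like the toy sketch below. The mapping table is hypothetical and covers only the Traditional characters that appear in the sample output above; a real pipeline would more likely rely on a dedicated converter such as OpenCC.

```python
# Toy Traditional-to-Simplified table: only the characters found in the
# sample transcript. This is an illustration, not a complete mapping.
T2S = str.maketrans({
    "認": "认",
    "為": "为",
    "給": "给",
    "帶": "带",
    "來": "来",
    "體": "体",
})


def to_simplified(text):
    """Replace the known Traditional characters, leaving all others as-is."""
    return text.translate(T2S)


print(to_simplified("我認為跑步最重要的就是給我帶來了身體健康"))
# 我认为跑步最重要的就是给我带来了身体健康
```

Characters missing from the table pass through unchanged, so any gap shows up as leftover Traditional text rather than as an error.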

3.2 English speech recognition

whisper /path/to/audio/file --model /path/to/custom/model --language English
whisper en.wav --model tiny.pt --language English

Output:

(pp2) livingbody@192 sound4 % whisper en.wav --model tiny.pt
/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.000]  I knocked at the door on the ancient side of the building.

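The English output is fairly accurate. The tiny model used here is only a little over 70 MB, which makes it well suited to deployment on devices such as a Raspberry Pi.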
