Pitfalls of deploying NVIDIA Parakeet-TDT on Windows

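Parakeet-TDT is a speech recognition model from the NVIDIA NeMo toolkit. It uses a Transformer-family architecture (a FastConformer encoder with a TDT decoder) and is well suited to transcribing English speech to text. Its main strengths are a small model size (the checkpoints used here have 0.6B parameters), decent accuracy, and good speed; it runs fine even on a CPU-only machine, which makes it a good fit for learning or small projects. NeMo ships pretrained models and the setup is simple, so ordinary developers can get started quickly.

Parakeet-TDT on Hugging Face: huggingface.co/nvidia/para...

When I deployed Parakeet-TDT on Windows I ran into a few problems, mostly caused by differences from Linux and by dependency version conflicts. Below are the problems I hit and how I solved them, shared for anyone with similar needs.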

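Deployment environment and preparation

I plan to keep the whole parakeet-tdt project under D:/parakeet.

I use the embeddable build of Python 3.10 (python-embed-3.10), unpacked into D:/parakeet/runtime; Python 3.10 through 3.12 is recommended. The embeddable build is convenient because the folder can be packaged and shared as-is. The dependencies are installed as follows: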

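Download get-pip.py into the runtime directory (it is published at bootstrap.pypa.io/get-pip.py).
In a CMD terminal, change to D:/parakeet and run runtime\python runtime\get-pip.py to install pip. If pip cannot be imported afterwards, uncomment the import site line in the ._pth file in the runtime directory (python310._pth for Python 3.10); the embeddable build ships with it disabled.
Prepare a requirements.txt file that lists the required packages, for example: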

numpy<2.0
waitress
flask
typing_extensions
torch
nemo_toolkit[asr]

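Run runtime\python -m pip install -r requirements.txt; all packages are installed into runtime\Lib\site-packages.

The advantage of this approach is that the whole D:/parakeet folder can be compressed and shared; after unpacking it works right away. Note: the NumPy version must be below 2.0, otherwise you will hit compatibility problems.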

Problem 1: the SIGKILL error

When the test script (app.py) loaded the Parakeet-TDT model, the following error appeared:

AttributeError: module 'signal' has no attribute 'SIGKILL'. Did you mean: 'SIGILL'?

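NeMo defaults to the Unix SIGKILL signal. Python's signal module does not define SIGKILL on Windows, so the attribute lookup fails and the program crashes. The error is raised in runtime\Lib\site-packages\nemo\utils\exp_manager.py.

Solution

The code has to be patched by hand. Open runtime\Lib\site-packages\nemo\utils\exp_manager.py and find the following line (around line 176; the exact number may differ between NeMo versions):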

rank_termination_signal: signal.Signals = signal.SIGKILL

Change it to:

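# Put this import with the other imports at the top of exp_manager.py
from platform import system

# Replaces the original line: use SIGTERM on Windows, SIGKILL everywhere else
rank_termination_signal: signal.Signals = signal.SIGTERM if system() == "Windows" else signal.SIGKILL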

Save the file and rerun the script; the error goes away. The change makes Windows use SIGTERM in place of SIGKILL, which avoids the unsupported signal.
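If you would rather not edit files inside site-packages (the change is lost whenever NeMo is reinstalled or upgraded), one alternative is to define the missing attribute before NeMo is imported. The sketch below is only a workaround idea, not something NeMo documents; it assumes the attribute lookup at import time is the only problem, and the helper name ensure_sigkill is made up for illustration.

```python
import signal


def ensure_sigkill() -> None:
    """Alias signal.SIGKILL to SIGTERM on platforms that lack it (e.g. Windows)."""
    if not hasattr(signal, "SIGKILL"):
        # Assumption: falling back to SIGTERM is acceptable, as in the manual patch.
        signal.SIGKILL = signal.SIGTERM


# Call this at the very top of app.py, before any NeMo import.
ensure_sigkill()

# import nemo.collections.asr as nemo_asr  # import NeMo only after the alias exists
```

On Linux or macOS the function does nothing, so the same app.py can run unchanged on every platform.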

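Problem 2: the NumPy version

The other problem is the NumPy version. With NumPy 2.0 or later installed, you may get runtime errors such as failing array operations or modules that will not load, because some of NeMo's dependencies are not compatible with the NumPy 2.0 API. For example: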

  File "E:\python\parakeet\runtime\./Lib/site-packages\numpy\__init__.py", line 411, in __getattr__
    raise AttributeError(
AttributeError: `np.sctypes` was removed in the NumPy 2.0 release. Access dtypes explicitly instead.. Did you mean: 'dtypes'?

If you see this error, reinstall a 1.26.x release: runtime\python -m pip install --force-reinstall numpy==1.26.4

Solution

Pin NumPy below 2.0 in requirements.txt:

numpy<2.0

Or install a specific stable version by hand, for example:

runtime\python -m pip uninstall numpy
runtime\python -m pip install numpy==1.26.4

I used NumPy 1.26.4 and it tested fine. After installing, you can confirm the version with runtime\python -m pip list.
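The version check can also be scripted, which is handy when the folder is shared and someone else has to verify it. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes "compatible" simply means a major version below 2, as described above.

```python
from importlib import metadata


def numpy_is_compatible(version: str) -> bool:
    """Return True when the version string has a major version below 2."""
    major = int(version.split(".")[0])
    return major < 2


def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is missing."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_numpy_version()
    if found is None:
        print("NumPy is not installed")
    elif numpy_is_compatible(found):
        print(f"NumPy {found} is below 2.0")
    else:
        print(f"NumPy {found} is too new; install a 1.x release")
```

Saved under any name (say check_numpy.py), it would be run with the embedded interpreter, for example runtime\python check_numpy.py.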

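Other notes

Check dependencies: after installation, check that runtime\Lib\site-packages contains every required package so that nothing is missing.
Package and share: the benefit of the embeddable Python is that the whole D:/parakeet/ folder can be compressed and shared; others can unpack it and run it without any extra environment setup.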
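The dependency check above can be automated instead of browsing site-packages by eye. This is a small sketch, assuming the package list from the requirements.txt shown earlier; the names REQUIRED and missing_packages are illustrative only.

```python
from importlib import metadata

# Distribution names taken from the requirements.txt in this post.
REQUIRED = ["numpy", "waitress", "flask", "typing_extensions", "torch", "nemo_toolkit"]


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(REQUIRED)
    if absent:
        print("Missing packages:", ", ".join(absent))
    else:
        print("All required packages are installed")
```

Running it with runtime\python before zipping the folder gives a quick sanity check that the package is complete.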

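Implementing a simple OpenAI-compatible speech-to-text endpoint

The core code is below, built on flask and waitress.

parakeet-tdt is not limited to English speech recognition; it now also supports Japanese, with fairly good results. The Japanese model is named parakeet-tdt_ctc-0.6b-ja.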
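Because the service picks a model per language, and the handler calls from_pretrained inside each request, it may be worth loading every model only once and reusing it. The sketch below shows one way to do that; ModelCache, load_nemo_model and model_cache are hypothetical names that are not part of the original app.py, and only the from_pretrained call comes from the handler.

```python
import threading


class ModelCache:
    """Load each model at most once and return the same instance afterwards."""

    def __init__(self, loader):
        self._loader = loader          # callable: model name -> loaded model
        self._models = {}
        self._lock = threading.Lock()  # waitress serves requests from several threads

    def get(self, name):
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loader(name)
            return self._models[name]


def load_nemo_model(name):
    """Real loader; NeMo is imported lazily so the cache itself does not depend on it."""
    import nemo.collections.asr as nemo_asr
    return nemo_asr.models.ASRModel.from_pretrained(model_name=name)


model_cache = ModelCache(load_nemo_model)
# Inside the request handler (illustrative):
# asr_model = model_cache.get("nvidia/parakeet-tdt-0.6b-v2")
```

The lock is held while a model loads, so the first request for a language blocks other requests until loading finishes; for a small single-machine service that trade-off is usually acceptable.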

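import math
import os
import shutil
import subprocess
import uuid

import nemo.collections.asr as nemo_asr
from flask import Flask, Response, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
# app.config['UPLOAD_FOLDER'], CHUNK_MINITE, get_audio_duration() and
# segments_to_srt() are defined in the parts of app.py that are not shown here.

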
@app.route('/v1/audio/transcriptions', methods=['POST'])
def transcribe_audio():
    """
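    OpenAI-compatible speech-to-text endpoint; long audio is split into chunks.
    """
    if 'file' not in request.files:
        return jsonify({"error": "No file part found in the request"}), 400
    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No file selected"}), 400
    if not shutil.which('ffmpeg'):
        return jsonify({"error": "FFmpeg is not installed or not on the system PATH"}), 500
    if not shutil.which('ffprobe'):
        return jsonify({"error": "ffprobe is not installed or not on the system PATH"}), 500
    # The 'model' form field is reused to pass special requirements (unused in this excerpt)
    return_type = request.form.get('model', '')
    # The 'prompt' form field is reused to carry the language code
    language = request.form.get('prompt', 'en')
    model_list = {
        "en": "parakeet-tdt-0.6b-v2",
        "ja": "parakeet-tdt_ctc-0.6b-ja"
    }
    if language not in model_list:
        # An unsupported language is a client error, so answer 400 rather than 500
        return jsonify({"error": f"Unsupported language: {language}"}), 400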


    original_filename = secure_filename(file.filename)
    unique_id = str(uuid.uuid4())
    temp_original_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{unique_id}_{original_filename}")
    target_wav_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{unique_id}.wav")
    
    temp_files_to_clean = []

    try:
        file.save(temp_original_path)
        temp_files_to_clean.append(temp_original_path)
        
        ffmpeg_command = [
            'ffmpeg', '-y', '-i', temp_original_path,
            '-ac', '1', '-ar', '16000', target_wav_path
        ]
        result = subprocess.run(ffmpeg_command, capture_output=True, text=True)
        if result.returncode != 0:
            return jsonify({"error": "File conversion failed", "details": result.stderr}), 500
        temp_files_to_clean.append(target_wav_path)

        # CHUNK_MINITE is the chunk length in minutes; convert it to seconds
        CHUNK_DURATION_SECONDS = CHUNK_MINITE * 60
        total_duration = get_audio_duration(target_wav_path)
        if total_duration == 0:
            return jsonify({"error": "Cannot process audio with zero duration"}), 400
        asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=f"nvidia/{model_list[language]}")
        num_chunks = math.ceil(total_duration / CHUNK_DURATION_SECONDS)
        chunk_paths = []
        if num_chunks>1:
            for i in range(num_chunks):
                start_time = i * CHUNK_DURATION_SECONDS
                chunk_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{unique_id}_chunk_{i}.wav")
                chunk_paths.append(chunk_path)
                temp_files_to_clean.append(chunk_path)                
                chunk_command = [
                    'ffmpeg', '-y', '-i', target_wav_path,
                    '-ss', str(start_time),
                    '-t', str(CHUNK_DURATION_SECONDS),
                    '-c', 'copy',
                    chunk_path
                ]
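                chunk_result = subprocess.run(chunk_command, capture_output=True, text=True)
                # Stop early if ffmpeg could not write this chunk
                if chunk_result.returncode != 0:
                    return jsonify({"error": "Failed to split audio into chunks", "details": chunk_result.stderr}), 500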
        else:
            chunk_paths.append(target_wav_path)
        all_segments = []
        all_words = []
        cumulative_time_offset = 0.0

        for i, chunk_path in enumerate(chunk_paths):
            output = asr_model.transcribe([chunk_path], timestamps=True)            
            if output and output[0].timestamp:
                if 'segment' in output[0].timestamp:
                    for seg in output[0].timestamp['segment']:
                        seg['start'] += cumulative_time_offset
                        seg['end'] += cumulative_time_offset
                        all_segments.append(seg)
                
                if 'word' in output[0].timestamp:
                    for word in output[0].timestamp['word']:
                        word['start'] += cumulative_time_offset
                        word['end'] += cumulative_time_offset
                        all_words.append(word)

            # Advance the offset for the next chunk by this chunk's actual duration,
            # which is more precise than the nominal chunk length
            chunk_actual_duration = get_audio_duration(chunk_path)
            cumulative_time_offset += chunk_actual_duration

        if not all_segments:
            return jsonify({"error": "Transcription failed: the model returned no usable content"}), 500

        srt_result = segments_to_srt(all_segments)
        
        return Response(srt_result, mimetype='text/plain')

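    except Exception as e:
        return jsonify({"error": "Internal server error", "details": str(e)}), 500
    finally:
        # Remove the uploaded file, the converted WAV and any chunk files
        for path in temp_files_to_clean:
            try:
                os.remove(path)
            except OSError:
                pass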

...
from waitress import serve
serve(app, host=host, port=port, threads=threads)
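The handler relies on two helpers, get_audio_duration and segments_to_srt, whose code is not shown above. The sketch below is one possible implementation, not the original one. It makes two assumptions: the duration is read with the standard wave module (the original presumably calls ffprobe, since the handler checks for it; wave works here only because the input has already been converted to a PCM WAV file), and each segment is a dict with 'start' and 'end' in seconds and the text under 'segment'. format_srt_time is an extra helper introduced for this sketch.

```python
import wave


def get_audio_duration(wav_path: str) -> float:
    """Return the duration of a PCM WAV file in seconds, or 0.0 if it cannot be read."""
    try:
        with wave.open(wav_path, "rb") as wav:
            rate = wav.getframerate()
            return wav.getnframes() / rate if rate else 0.0
    except (wave.Error, OSError, EOFError):
        return 0.0


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = max(0, int(round(seconds * 1000)))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of {'start', 'end', 'segment'} dicts into SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = str(seg.get("segment", "")).strip()
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive subtitle entries
    return "\n".join(blocks)
```

For example, segments_to_srt([{"start": 0.0, "end": 1.5, "segment": "hello"}]) yields a single entry numbered 1 that spans 00:00:00,000 --> 00:00:01,500.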