Putting Open-Source Models into Practice: Speech-to-Text with the Whisper Model, AIGC Application Exploration (Part 5)

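I. Introduction

The previous part showed how to deploy the Whisper-large-v3-turbo model with vLLM. In practice, however, the model can only process 30 seconds of audio per request. This part uses a real business scenario to show how to process a complete audio file and generate a matching subtitle file.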

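II. Terminology

2.1. Speech-to-text

Speech-to-text, also called speech recognition or automatic speech recognition (ASR), is a technology that converts spoken audio into text. It uses computer programs and algorithms to process speech input and turn it into readable text output.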

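2.2. Whisper-large-v3-turbo

Whisper-large-v3-turbo is an optimized speech transcription model released by OpenAI in October 2024. It is derived from Whisper large-v3 and aims to balance speed and accuracy. Its main characteristics are:

1. Technical improvements

Fewer decoder layers: reduced from 32 to 4, which significantly lowers the computational cost.
Faster transcription: about 8 times faster than large-v3, close to the speed of the tiny model, which makes real-time use practical.
Inference optimization: torch.compile and scaled dot-product attention (F.scaled_dot_product_attention) speed up inference further and reduce latency.
Parameter count: 809 million parameters, between medium (769 million) and large (1.55 billion); the model is about 1.6 GB.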
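The quoted model size follows from the parameter count if the weights are stored as 16-bit floats. The short calculation below is a sanity check under that assumption; the helper name `fp16_weight_size_gb` is made up for illustration.

```python
def fp16_weight_size_gb(num_params, bytes_per_param=2):
    """Approximate size of the weights in GB (10^9 bytes).

    Assumes every parameter is stored in 16-bit precision (2 bytes).
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # 809 million parameters, as quoted for whisper-large-v3-turbo
    print(f"{fp16_weight_size_gb(809_000_000):.2f} GB")
```

The result is close to the 1.6 GB figure above. Memory use at inference time is higher, because activations and runtime buffers come on top of the weights.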

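2. Performance

Quality: on high-quality recordings (for example the FLEURS dataset) it performs close to large-v2, and its cross-lingual ability is comparable to large-v2.
Multilingual support: it covers 99 languages, but accuracy is weaker for some of them, such as Thai and Cantonese.
VRAM requirement: about 6 GB, well below the roughly 10 GB needed by the large models, which suits deployment on edge devices.

3. Application scenarios

Real-time transcription: suitable for low-latency scenarios such as meeting notes and live captions.
Long-audio processing: very long audio can be handled with a chunked or a sequential algorithm, trading off speed and accuracy.
Local deployment: the lightweight design makes it easier to integrate on mobile devices or local servers.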
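The chunked approach mentioned under long-audio processing comes down to computing consecutive time windows over the recording. Below is a minimal sketch of that step; the function name `chunk_bounds` and the 30-second default are assumptions for illustration, and the sketch does no audio I/O.

```python
def chunk_bounds(total_ms, chunk_ms=30_000):
    """Return consecutive (start_ms, end_ms) windows covering [0, total_ms).

    The last window is shortened so that it ends exactly at total_ms.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    bounds = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 75-second recording yields two full windows and one 15-second remainder
    print(chunk_bounds(75_000))
```

Fixed windows are simple, but they can cut a word in half at a boundary. Overlapping windows or splitting on silence are common refinements; neither is shown here.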

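4. Integration and usage

Developer friendly: it can be called through the Hugging Face Transformers library or OpenAI's official tooling, both of which come with Python examples.
Transcription only: the training data contains no translation data, so the model does not support speech translation; in exchange, pure transcription performs better.

5. Comparative advantages

Speed and quality: noticeably faster than large-v3 with very little loss in quality.
Cost effectiveness: the parameter count is close to medium but performance is better, which suits resource-constrained scenarios.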

III. Environment Setup

3.1. Basic environment


conda create -n test python=3.10
conda activate test

pip install pydub -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install openai -i https://pypi.tuna.tsinghua.edu.cn/simple
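pydub relies on an external ffmpeg (or avconv) binary to decode and encode MP3 files, and pip does not install it. The check below is a small sketch for confirming that one of them is on the PATH before running the scripts in this article; the helper name `find_audio_backend` is made up for illustration.

```python
import shutil


def find_audio_backend(candidates=("ffmpeg", "avconv")):
    """Return the path of the first converter found on the PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    backend = find_audio_backend()
    if backend is None:
        print("No ffmpeg/avconv found; pydub will not be able to handle MP3 files.")
    else:
        print(f"Using audio backend: {backend}")
```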

3.2. Downloading the model

Hugging Face:

https://huggingface.co/openai/whisper-large-v3-turbo/tree/main

ModelScope:

git clone https://www.modelscope.cn/iic/Whisper-large-v3-turbo.git

Hugging Face is the recommended download source.

IV. Implementation

4.1. Starting the vLLM service


vllm serve /data/model/whisper-large-v3-turbo  --swap-space 16 --disable-log-requests --max-num-seqs 256 --host 0.0.0.0 --port 9000  --dtype float16 --max-parallel-loading-workers 1  --max-model-len 448 --enforce-eager --gpu-memory-utilization 0.99 --task transcription
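Before sending audio, it helps to confirm that the service is reachable. The sketch below queries the OpenAI-compatible `/v1/models` endpoint using only the standard library. The helper name `list_served_models` is made up for illustration, and the base URL assumes the host and port from the command above.

```python
import http.client
import json
import urllib.request


def list_served_models(base_url="http://127.0.0.1:9000/v1", timeout=3.0):
    """Return the model ids reported by the server, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply
        return []
    return [item.get("id") for item in payload.get("data", [])]


if __name__ == "__main__":
    models = list_served_models()
    print(models if models else "vLLM service is not reachable yet")
```

If the service is up, the list should contain the model path passed to `vllm serve`. That same string is the `model` value the client has to send.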

[Screenshot: result of the call]

[Screenshot: GPU usage]

4.2. Defining the STT utility class

The class sends requests to the privately deployed speech-to-text service:


# -*- coding: utf-8 -*-

from openai import OpenAI

# Placeholder key: the vLLM service above was started without an API key
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9000/v1"
# Must match the model path passed to `vllm serve`
model = "/data/model/whisper-large-v3-turbo"
language = "en"
response_format = "json"
temperature = 0.0


class STT:
    def __init__(self):
        # OpenAI client pointed at the local vLLM endpoint
        self.client = OpenAI(
            api_key=openai_api_key,
            base_url=openai_api_base,
        )

    def request(self, audio_path):
        """Transcribe one audio file (30 seconds at most) and return its text."""
        with open(str(audio_path), "rb") as f:
            transcription = self.client.audio.transcriptions.create(
                file=f,
                model=model,
                language=language,
                response_format=response_format,
                temperature=temperature)
        # Fall back to an empty string if nothing came back
        return transcription.text if transcription else ''


if __name__ == '__main__':
    audio_path = r'E:\temp\0.mp3'
    stt = STT()
    text = stt.request(audio_path)
    print(f'text: {text}')

[Screenshot: result of the call]

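4.3. Splitting the audio and generating the subtitle file

Requirements:

Aggregate the subtitle data into one-minute entries.
Save the subtitle file in JSON format, with each entry structured as follows (times are in milliseconds):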


{
        "time_begin": 0.0,
        "time_end": 60000.0,
        "text": "Hello World,Hello World,Hello World,Hello World,Hello World!"
}
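The aggregation rule can be sketched on its own, without audio I/O or the transcription service. The function below groups already-transcribed chunks by the minute in which they start. This is a slightly different technique from pairing chunks by an even/odd index, and the name `aggregate_by_minute`, the tuple layout, and joining texts with a single space are all assumptions for illustration.

```python
def aggregate_by_minute(chunks, window_ms=60_000):
    """Merge (start_ms, end_ms, text) chunks, given in time order, into
    one entry per window of window_ms milliseconds."""
    entries = []
    current_bucket = None
    for start_ms, end_ms, text in chunks:
        bucket = int(start_ms // window_ms)
        if bucket != current_bucket:
            # First chunk of a new minute: open a new entry
            entries.append({
                "time_begin": float(start_ms),
                "time_end": float(end_ms),
                "text": text.strip(),
            })
            current_bucket = bucket
        else:
            # Later chunk in the same minute: extend the open entry
            entry = entries[-1]
            entry["time_end"] = float(end_ms)
            entry["text"] = f'{entry["text"]} {text.strip()}'.strip()
    return entries


if __name__ == "__main__":
    demo = [(0, 30_000, "Hello World,"), (30_000, 60_000, "Hello World!"),
            (60_000, 75_000, "Goodbye.")]
    for entry in aggregate_by_minute(demo):
        print(entry)
```

Grouping by start time keeps working if the chunk length changes, for example to 20 seconds. Pairing by index assumes exactly two chunks per minute.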

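The complete script splits the audio into 30-second chunks, transcribes each chunk, and aggregates the results: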

import json
import os.path

from pydub import AudioSegment

from com.ai.uitl.stt_util import STT

stt = STT()

def create_directory_if_not_exists(directory_path):
    # Check whether the directory exists
    if not os.path.exists(directory_path):
        try:
            # Create the directory
            os.makedirs(directory_path)
            print(f"Directory '{directory_path}' created.")
        except Exception as e:
            print(f"Error while creating directory '{directory_path}': {e}")
    else:
        print(f"Directory '{directory_path}' already exists.")

def split(file_name,input_dir,output_dir,duration,json_file_output):
    create_directory_if_not_exists(output_dir)

    input_path = os.path.join(input_dir,file_name)
    # Load the audio file
    audio = AudioSegment.from_file(input_path, format="mp3")
    # Total length of the audio, converted to milliseconds
    duration_seconds = audio.duration_seconds
    duration_milliseconds = duration_seconds * 1000


    start_time,end_time = 0.00,0.00
    index = 0
    text = ''
    all_objs = []
    one_minute_obj = {}

    # Set the start and end time of each cut (in milliseconds)
    while end_time < duration_milliseconds:
        start_time = end_time
        end_time = start_time+duration

        if end_time > duration_milliseconds:
            end_time = duration_milliseconds

       
        # Cut the audio
        cropped_audio = audio[start_time:end_time]
        output_file_name = f'{file_name}_{index}.mp3'
        output_path = os.path.join(output_dir,output_file_name)
        # Save the cut segment
        cropped_audio.export(output_path, format="mp3")

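        # Two consecutive chunks form one entry (one minute when duration is 30000 ms)
        result = index % 2

        if result == 0:
            # First half of the minute: start a new entry
            text = stt.request(output_path)
            one_minute_obj['time_begin'] = start_time
        else:
            # Second half: append its text and close the entry.
            # Join with a single space so words at the chunk boundary do not run together.
            second_half = stt.request(output_path)
            text = f'{text.strip()} {second_half.strip()}'.strip()
            one_minute_obj['time_end'] = end_time
            one_minute_obj['text'] = text

            all_objs.append(one_minute_obj)
            one_minute_obj = {}

        index += 1

    # An odd number of chunks leaves a final, shorter entry open; close it here
    result = index % 2
    if result != 0:
        one_minute_obj['text'] = text
        one_minute_obj['time_end'] = end_time

        all_objs.append(one_minute_obj)

    # Open the output file and write the JSON data
    with open(json_file_output, 'w', encoding='utf-8') as json_file:
        json.dump(all_objs, json_file, ensure_ascii=False, indent=4)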


if __name__ == '__main__':
    file_arr = ['1277.mp3', '1279.mp3']

    input_dir = r"E:\temp"

    for file_name in file_arr:
        temp_json_file_name = file_name+'_字幕文件.json'
        output_dir = r"E:\temp\output"
        output_dir = os.path.join(output_dir,file_name)
        json_file_output = os.path.join(output_dir,temp_json_file_name)

        split(file_name,input_dir,output_dir,30000.00,json_file_output)
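Many players expect subtitles in SRT rather than JSON. As an optional follow-up, the sketch below converts entries of the shape written above (`time_begin` and `time_end` in milliseconds, plus `text`) into SRT text. The helper names `ms_to_srt_timestamp` and `entries_to_srt` are made up for illustration and are not part of the original script.

```python
def ms_to_srt_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(ms))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def entries_to_srt(entries):
    """Turn a list of subtitle entries into the text of an SRT file."""
    blocks = []
    for number, entry in enumerate(entries, start=1):
        begin = ms_to_srt_timestamp(entry["time_begin"])
        end = ms_to_srt_timestamp(entry["time_end"])
        blocks.append(f"{number}\n{begin} --> {end}\n{entry['text']}\n")
    # Cues are separated by a blank line
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [{"time_begin": 0.0, "time_end": 60000.0, "text": "Hello World!"}]
    print(entries_to_srt(demo))
```

A one-minute cue is long for on-screen display. If the subtitles are meant for playback rather than for indexing, shorter aggregation windows would probably read better.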
