Deep Learning Series 83: Using outetts

1. Introduction

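The model builds on the LLaMa architecture and uses WavTokenizer audio tokenization to convert continuous audio waveforms into discrete token sequences, at 150 tokens per second. The v2 version uses CTC forced alignment to map text precisely onto audio, so it can generate timestamp-aligned speech without any preprocessing; the v3 version uses whisper for text alignment. A windowed repetition penalty noticeably improves the coherence and naturalness of the speech output, and it stays stable on long-text synthesis in particular.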
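To make the windowed repetition penalty concrete, here is a minimal sketch in plain Python. It assumes the common formulation (positive logits are divided by the penalty, negative logits are multiplied by it) and applies it only to tokens that appear in the most recent window of generated tokens. The function name, the penalty value 1.1, and the window size 64 are illustrative assumptions, not values taken from the outetts source.

```python
def apply_windowed_repetition_penalty(logits, generated, penalty=1.1, window=64):
    """Return a copy of `logits` with recently generated tokens penalised.

    logits    -- list of floats, one score per vocabulary token
    generated -- list of token ids produced so far
    penalty   -- assumed penalty factor (> 1 discourages repetition)
    window    -- assumed number of most recent tokens to look at
    """
    # Only the last `window` tokens count; older tokens are forgotten.
    recent = set(generated[-window:]) if window > 0 else set()
    out = list(logits)
    for tok in recent:
        score = out[tok]
        # Push the score towards "less likely" regardless of its sign.
        out[tok] = score / penalty if score > 0 else score * penalty
    return out


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5, 3.0]
    history = [0, 1, 3]
    # With window=2 only tokens 1 and 3 are penalised; token 0 is too old.
    print(apply_windowed_repetition_penalty(logits, history, window=2))
```

Limiting the penalty to a recent window lets the model reuse audio tokens that occurred much earlier, which long utterances need, while still discouraging short loops.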

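Multilingual support is one of its main strengths. The model accepts text directly in more than 20 languages, including English, Chinese, and Arabic, with no romanization step. The training data covers high-resource languages (such as English and Chinese) and medium-resource languages (such as Portuguese and Persian); languages not seen in training can still be synthesized, but with limited quality.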

2. Using it with Chinese

The outetts package does not currently work with Chinese out of the box, so a few manual changes are needed:

First install whisper-timestamped: pip3 install whisper-timestamped
Then install outetts: pip3 install outetts==0.4.4
Open outetts/whisper/transcribe.py under site-packages and change the first line to:
import whisper_timestamped as whisper
Change line 17 to: text = whisper.transcribe(model, audio_path, language='zh', initial_prompt='普通话')
Open outetts/version/v3/audio_processor.py under site-packages and change line 226 to:
words.extend([{'word': i['text'].strip(), 'start': float(i['start']), 'end': float(i['end'])} for i in s['words']])
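The sketch below may help with the patching steps above. The first helper finds where a file of an installed package lives, so you do not have to hunt through site-packages by hand. The second wraps the replacement line for audio_processor.py in a function so you can see what it does. The helper names and the sample data are hypothetical; the sample only mimics the per-word 'text' / 'start' / 'end' fields that the replacement line reads, and it assumes that `s` in that line is one transcription segment.

```python
import importlib.util
from pathlib import Path


def locate_package_file(package, relative_path):
    """Return the path of `relative_path` inside an installed package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {package!r} is not installed")
    # spec.origin points at the package's __init__.py
    return Path(spec.origin).parent / relative_path


def flatten_words(segments):
    """Collect every word of every segment into one flat list of dicts."""
    words = []
    for s in segments:
        # Same expression as the patched line 226
        words.extend([{'word': i['text'].strip(), 'start': float(i['start']),
                       'end': float(i['end'])} for i in s['words']])
    return words


if __name__ == "__main__":
    # Hand-written sample, not real whisper output
    sample_segments = [
        {"words": [{"text": " 你好", "start": 0.0, "end": 0.4},
                   {"text": "世界 ", "start": 0.4, "end": 0.9}]},
    ]
    print(flatten_words(sample_segments))

    # Only meaningful where outetts is actually installed
    if importlib.util.find_spec("outetts") is not None:
        print(locate_package_file("outetts", "whisper/transcribe.py"))
        print(locate_package_file("outetts", "version/v3/audio_processor.py"))
```

Because these edits are made directly in site-packages, they are lost whenever outetts is reinstalled or upgraded, which is one reason to pin the version as above.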

3. Test code

First, create an interface:

import outetts

# Build the interface: 1B model, llama.cpp backend, 8-bit quantization
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.Q8_0,
    )
)

If you do not have a speaker file yet, create one from a short voice clip:

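# "2.wav" is the reference voice clip; whisper transcribes and aligns it
speaker = interface.create_speaker("2.wav", whisper_model="base")
# Save the speaker profile so it can be reused later
interface.save_speaker(speaker, path="zh.json")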
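The "create it only if it is missing" logic can be wrapped in a small helper, sketched below. It uses only the three interface methods that appear in this post (create_speaker, save_speaker, load_speaker); the helper name load_or_create_speaker and its parameters are hypothetical.

```python
import os


def load_or_create_speaker(interface, profile_path, reference_wav, whisper_model="base"):
    """Load a saved speaker profile, or build and save one from a voice clip."""
    if os.path.exists(profile_path):
        # Reuse the existing profile and skip the whisper transcription
        return interface.load_speaker(profile_path)
    speaker = interface.create_speaker(reference_wav, whisper_model=whisper_model)
    interface.save_speaker(speaker, path=profile_path)
    return speaker
```

With the interface created earlier, the call would be speaker = load_or_create_speaker(interface, "zh.json", "2.wav"), so the voice clip is transcribed only on the first run.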

Then generate the speech:

from outetts import GenerationConfig

output = interface.generate(
    config=GenerationConfig(
        text="要确保回答简洁明了，不使用复杂的术语。同时，保持语气友好，让用户感到舒适",
        speaker=interface.load_speaker("zh.json"),
    )
)
output.save("output.wav")