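Large AI models are already very capable with text and images, and they are making rapid progress in audio generation as well. This article shows how to use Spring AI with OpenAI's TTS model for text-to-speech and with the Whisper model for speech processing such as speech-to-text and language translation.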

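I do not yet understand the TTS and Whisper models deeply enough to explain how they work internally, so this article only demonstrates how to use them through Spring AI.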

OpenAI TTS (text-to-speech) model

Spring AI supports the OpenAI speech API. It also extracts common SpeechModel and StreamingSpeechModel interfaces, so other TTS models can be added and integrated quickly later on.

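What can it do?

It can read an article aloud for you.
It can help you produce videos in multiple languages.
It can provide real-time audio output through streaming.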

TTS model properties

To get the most out of the TTS model, let's first look at the parameters that control it:

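Property	Description	Default
spring.ai.openai.audio.speech.options.model	ID of the model to use. Spring AI defaults to tts-1; OpenAI also offers tts-1-hd	tts-1
spring.ai.openai.audio.speech.options.voice	Voice used for the output (for example a more male or female sounding voice). Available options: alloy, echo, fable, onyx, nova, shimmer	alloy
spring.ai.openai.audio.speech.options.response-format	Format of the audio output. Supported formats: mp3, opus, aac, flac, wav, and pcm	mp3
spring.ai.openai.audio.speech.options.speed	Speech synthesis speed. The OpenAI API accepts values from 0.25 (slowest) to 4.0 (fastest)	1.0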
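
As a rough sketch, the properties in the table could be set in application.properties like this. The values simply repeat the defaults listed above, and reading the API key from an environment variable named OPENAI_API_KEY is an assumption made for illustration.

```properties
# Assumption: the API key is supplied through the OPENAI_API_KEY environment variable
spring.ai.openai.api-key=${OPENAI_API_KEY}

# Text-to-speech defaults (same values as in the table above)
spring.ai.openai.audio.speech.options.model=tts-1
spring.ai.openai.audio.speech.options.voice=alloy
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0
```

Options passed in code through OpenAiAudioSpeechOptions, as in the example that follows, can then be used to adjust individual requests.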

Example (text to speech)

package org.ivy.controller;

import org.springframework.ai.openai.OpenAiAudioSpeechModel;
import org.springframework.ai.openai.OpenAiAudioSpeechOptions;
import org.springframework.ai.openai.api.OpenAiAudioApi;
import org.springframework.ai.openai.audio.speech.SpeechPrompt;
import org.springframework.ai.openai.audio.speech.SpeechResponse;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;


@RestController
public class AudioController {

    private final OpenAiAudioSpeechModel openAiAudioSpeechModel;
    public AudioController(OpenAiAudioSpeechModel openAiAudioSpeechModel) {
        this.openAiAudioSpeechModel = openAiAudioSpeechModel;
    }
    // Synchronous text-to-speech: returns the whole audio file in one response
    @GetMapping(value = "tts", produces = "audio/mpeg") // the body is MP3 audio, not an event stream
    public byte[] speech(@RequestParam(defaultValue = "Hello, this is a text-to-speech example.") String text) {
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
                .withModel("tts-1") // Model to use; tts-1 is the Spring AI default, so this call is optional
                .withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY) // Voice of the generated speech
                .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3) // Format of the generated audio
                .withSpeed(1.0f) // Playback speed of the generated speech
                .build();
        SpeechPrompt speechPrompt = new SpeechPrompt(text, speechOptions);
        SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);
        return response.getResult().getOutput(); // Audio as a byte array
    }

    // Streaming text-to-speech: emits audio chunks as they are produced
    @GetMapping(value = "stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<byte[]> stream(@RequestParam(defaultValue = "Today is a wonderful day to build something people love!") String text) {
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
                .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
                .build();
        SpeechPrompt speechPrompt = new SpeechPrompt(text, speechOptions);
        Flux<SpeechResponse> stream = openAiAudioSpeechModel.stream(speechPrompt);
        // Unwrap each response into its raw audio chunk
        return stream.map(speechResponse -> speechResponse.getResult().getOutput());
    }
}

The controller above implements text-to-speech in both synchronous and streaming form.

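Testing with Postman

Use Postman to save the binary response as a local mp3 file, then play it to hear the result. You could also build a page that plays the returned audio with a player library, which gives you a simple text-to-speech website. I plan to implement this in my sample code and push it to GitHub for reference.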
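
If you would rather script this than use Postman, here is a minimal sketch using only the JDK's built-in HTTP client. It assumes the application runs locally on port 8080 and exposes the tts endpoint with its text parameter as in the controller above. TtsClient, buildTtsUri, and downloadSpeech are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Spring AI
class TtsClient {

    // Builds the request URI for the tts endpoint, URL-encoding the text parameter
    static URI buildTtsUri(String baseUrl, String text) {
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/tts?text=" + encoded);
    }

    // Downloads the synthesized audio and writes the raw bytes to the target file
    static Path downloadSpeech(String baseUrl, String text, Path target)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildTtsUri(baseUrl, text)).GET().build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status: " + response.statusCode());
        }
        return Files.write(target, response.body());
    }
}
```

With the application running, a call such as TtsClient.downloadSpeech("http://localhost:8080", "Hello from Spring AI", Path.of("speech.mp3")) should leave a playable speech.mp3 in the working directory.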

Related: Seed-TTS, a high-quality text-to-speech model released by ByteDance

Whisper model

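In the previous section we used the TTS model to turn text into speech. This section shows how to connect to the Whisper model for speech-to-text and language translation. Spring AI supports OpenAI's transcription API, and a common AudioTranscriptionModel interface is to be extracted once additional transcription providers are implemented.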

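What can it do?

It can help with meeting notes, for example by turning what was said in a meeting into minutes.
It can help with language translation, for example from Chinese to English.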

Whisper model properties

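Property	Description	Default
spring.ai.openai.audio.transcription.options.model	ID of the model to use. Only whisper-1 (powered by OpenAI's open-source Whisper V2 model) is currently available	whisper-1
spring.ai.openai.audio.transcription.options.response-format	Format of the output, one of: json, text, srt, verbose_json, or vtt	json
spring.ai.openai.audio.transcription.options.prompt	An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language
spring.ai.openai.audio.transcription.options.language	Language of the input audio. Supplying the input language in ISO-639-1 format improves accuracy and latency
spring.ai.openai.audio.transcription.options.temperature	Sampling temperature, between 0 and 1. Higher values such as 0.8 make the output more random, while lower values such as 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit	0
spring.ai.openai.audio.transcription.options.timestamp_granularities	Timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word or segment. Note: segment timestamps add no extra latency, but generating word timestamps does	segment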
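
For illustration, a few of these properties could be set in application.properties as sketched below. The language value en is an assumption (English input audio), and text is chosen as the response format only to match the plain-text output used in the example that follows.

```properties
# Transcription settings (illustrative values)
spring.ai.openai.audio.transcription.options.model=whisper-1
spring.ai.openai.audio.transcription.options.response-format=text
# Assumption: the input audio is in English (ISO-639-1 code)
spring.ai.openai.audio.transcription.options.language=en
spring.ai.openai.audio.transcription.options.temperature=0
```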

Example (speech to text)

package org.ivy.controller;

import org.springframework.ai.openai.OpenAiAudioTranscriptionModel;
import org.springframework.ai.openai.OpenAiAudioTranscriptionOptions;
import org.springframework.ai.openai.api.OpenAiAudioApi;
import org.springframework.ai.openai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.openai.audio.transcription.AudioTranscriptionResponse;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TranscriptionController {

    @Value("classpath:audio.mp3")
    private org.springframework.core.io.Resource audioResource;

    private final OpenAiAudioTranscriptionModel openAiTranscriptionModel;

    public TranscriptionController(OpenAiAudioTranscriptionModel openAiTranscriptionModel) {
        this.openAiTranscriptionModel = openAiTranscriptionModel;
    }

    @GetMapping("audio2Text")
    public String audio2Text() {
        var transcriptionOptions = OpenAiAudioTranscriptionOptions.builder()
                .withResponseFormat(OpenAiAudioApi.TranscriptResponseFormat.TEXT)
                .withTemperature(0f)
                .build();
        AudioTranscriptionPrompt transcriptionRequest = new AudioTranscriptionPrompt(audioResource, transcriptionOptions);
        AudioTranscriptionResponse response = openAiTranscriptionModel.call(transcriptionRequest);
        return response.getResult().getOutput();
    }

}

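This converts the mp3 file generated in the previous section back into text. I could not test it at the time of writing because my token quota was used up, so please try it yourself with Postman to see the result.

Source code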

github.com/fangjieDevp...

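Summary

This article briefly explained the configuration parameters based on the official documentation and provided simple working examples. It did not cover how the TTS and Whisper models work internally. I will add that once I have filled in the gaps in my own knowledge.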

08. Spring AI with OpenAI: text-to-speech, speech-to-text, and language translation