Making AI Speak: Text-to-Speech (TTS) in Practice with Spring AI Alibaba

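I have asked quite a few people which feature of an AI product most makes users think "wow, that's impressive." Many of them don't say text generation; they say voice. The moment the AI starts talking, the experience moves up a level.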

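This section is a hands-on walkthrough of speech synthesis (TTS). Combined with the chat features covered earlier, text plus speech is enough to build a complete voice assistant. On its own, TTS is commonly used for notification announcements, audio content, and scheduled morning briefings.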

1. Spring AI Alibaba and TTS

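One thing to make clear first: Spring AI Alibaba does not currently wrap TTS.
ChatModel, ImageModel and similar capabilities have standard Spring AI interfaces, but TTS does not yet. Alibaba Cloud's Tongyi speech synthesis (CosyVoice) has to be called through the native DashScope Java SDK, whose classes live in the com.alibaba.dashscope.audio.ttsv2 package.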

Because spring-ai-alibaba-starter-dashscope does not pull in this native SDK transitively, we need to add the dependency to pom.xml ourselves:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>dashscope-sdk-java</artifactId>
    <version>2.22.4</version>
</dependency>

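There are only two core classes:

| Class | Description |
| --- | --- |
| `SpeechSynthesizer` | Entry point for TTS; call `call()` to start synthesis |
| `SpeechSynthesisParam` | Holds the model, voice, speech rate, format and other parameters |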

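2. CosyVoice Configuration and Voice Selection

The latest version at the moment is cosyvoice-v3-flash. It is fast and sounds good, so I recommend using it directly.

Note: each model version needs its matching set of voices. v3-flash uses the longanyang, longxiaocheng family; v2 uses longxiaochun_v2 and similar IDs. The two sets cannot be mixed.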

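Common voices (cosyvoice-v3-flash):

| Voice ID | Characteristics | Typical use |
| --- | --- | --- |
| `longanyang` | Female, warm and natural | Customer service, smart assistants |
| `longxiaocheng` | Male, mature and steady | News broadcasts |
| `longxiaoxia` | Female, lively and cute | Entertainment, children's content |

For the full voice list, see: https://help.aliyun.com/zh/model-studio/cosyvoice-voice-list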
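
Since a mismatched model and voice is an easy mistake to make, it can help to check the pair before calling the service. Below is a minimal sketch of such a check. `VoiceCatalog` is a name invented for this example, the lookup only contains the voice IDs mentioned in this article, and "cosyvoice-v2" is my assumption for the v2 model ID; consult the official voice list for the real, complete mapping.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative lookup of which voice IDs go with which CosyVoice model. */
final class VoiceCatalog {

    // Assumption: only the voice IDs named in this article are listed here,
    // and "cosyvoice-v2" is assumed to be the model ID of the v2 family.
    private static final Map<String, Set<String>> VOICES_BY_MODEL = Map.of(
            "cosyvoice-v3-flash", Set.of("longanyang", "longxiaocheng", "longxiaoxia"),
            "cosyvoice-v2", Set.of("longxiaochun_v2"));

    private VoiceCatalog() {}

    /** Returns true when the voice is listed for the given model. */
    static boolean isCompatible(String model, String voice) {
        // Immutable maps and sets reject null lookups, so guard first.
        if (model == null || voice == null) {
            return false;
        }
        return VOICES_BY_MODEL.getOrDefault(model, Set.of()).contains(voice);
    }

    /** Throws IllegalArgumentException when the pair is not listed. */
    static void requireCompatible(String model, String voice) {
        if (!isCompatible(model, voice)) {
            throw new IllegalArgumentException(
                    "Voice '" + voice + "' is not listed for model '" + model + "'");
        }
    }
}
```

Calling `VoiceCatalog.requireCompatible("cosyvoice-v3-flash", voice)` at the top of a request handler would turn a confusing remote error into an immediate, readable one.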

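3. Basic Usage: Text to Speech

The code below implements two endpoints:

GET /api/tts/synthesize: basic text-to-speech; returns MP3 and also saves it locally

GET /api/tts/synthesize-custom: custom voice and speech rate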

package com.studying.controller.voice;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

@RestController
@RequestMapping("/api/tts")
public class TextToSpeechController {

  @Value("${spring.ai.dashscope.api-key}")
  private String apiKey;

  /**
   * Basic synthesis: converts text to speech, returns the MP3 bytes, and also saves them to the current directory
   */
  @GetMapping(value = "/synthesize", produces = "audio/mpeg")
  public ResponseEntity<byte[]> synthesize(@RequestParam String text) throws Exception {
      Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";

      SpeechSynthesisParam param = SpeechSynthesisParam.builder()
              .apiKey(apiKey)
              .model("cosyvoice-v3-flash")
              .voice("longanyang")
              .build();

      SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
      ByteBuffer audio;
      try {
          audio = synthesizer.call(text);
      } finally {
          // Close the WebSocket connection even if synthesis throws
          synthesizer.getDuplexApi().close(1000, "bye");
      }

      byte[] audioBytes = audio.array();
      Files.write(Path.of("speech.mp3"), audioBytes);

      return ResponseEntity.ok()
              .contentType(MediaType.parseMediaType("audio/mpeg"))
              .header(HttpHeaders.CONTENT_DISPOSITION, "inline; filename=\"speech.mp3\"")
              .body(audioBytes);
  }

  /**
   * Custom parameters: choose the voice and speech rate
   */
  @GetMapping(value = "/synthesize-custom", produces = "audio/mpeg")
  public ResponseEntity<byte[]> synthesizeCustom(
          @RequestParam String text,
          @RequestParam(defaultValue = "longanyang") String voice,
          @RequestParam(defaultValue = "1.0") float speed) throws Exception {

      Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";

      SpeechSynthesisParam param = SpeechSynthesisParam.builder()
              .apiKey(apiKey)
              .model("cosyvoice-v3-flash")
              .voice(voice)
              .speechRate(speed)   // speech rate: 0.5 (slow) to 2.0 (fast); 1.0 is normal
              .build();

      SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
      ByteBuffer audio;
      try {
          audio = synthesizer.call(text);
      } finally {
          // Close the WebSocket connection even if synthesis throws
          synthesizer.getDuplexApi().close(1000, "bye");
      }

      byte[] audioBytes = audio.array();
      Files.write(Path.of("speech-custom.mp3"), audioBytes);

      return ResponseEntity.ok()
              .contentType(MediaType.parseMediaType("audio/mpeg"))
              .body(audioBytes);
  }

}
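
A side note on the `audio.array()` calls in the controller: `ByteBuffer.array()` returns the whole backing array regardless of the buffer's position and limit, and it throws for direct or read-only buffers. I have not checked which kind of buffer the SDK hands back, so the following is only a defensive sketch that copies exactly the readable bytes. `AudioBuffers` and `toByteArray` are names made up for this example, not part of the DashScope SDK.

```java
import java.nio.ByteBuffer;

/** Illustrative helper for turning a ByteBuffer into a byte[] safely. */
final class AudioBuffers {

    private AudioBuffers() {}

    /** Copies only the readable bytes (position up to limit) into a new array. */
    static byte[] toByteArray(ByteBuffer buffer) {
        if (buffer == null) {
            return new byte[0];
        }
        // duplicate() shares the content but has its own position,
        // so the caller's buffer is left untouched.
        ByteBuffer view = buffer.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }
}
```

With this in place, `byte[] audioBytes = AudioBuffers.toByteArray(audio);` could stand in for `audio.array()` in any of the handlers.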

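Test commands:

# Basic synthesis: curl writes the response to out.mp3; the server also saves speech.mp3.
# The sample text is Chinese; --data-urlencode percent-encodes it.
curl -G "http://localhost:8080/api/tts/synthesize" \
     --data-urlencode "text=你好，欢迎使用智能语音助手" \
     --output out.mp3

# Custom voice and speech rate (the server also saves speech-custom.mp3)
curl -G "http://localhost:8080/api/tts/synthesize-custom" \
     --data-urlencode "text=今日播报" \
     -d "voice=longxiaocheng" -d "speed=0.9" \
     --output out-custom.mp3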
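
The speed parameter of /synthesize-custom is handed to `speechRate()` unchecked, so a request with something like speed=10 goes straight to the service. One option is to clamp the value to the 0.5 to 2.0 range noted in the code comment before building the parameters. This is a small sketch of that idea; `SpeechRates` is a hypothetical helper name, and the bounds are simply the ones quoted in this article.

```java
/** Illustrative helper that keeps a requested speech rate inside the supported range. */
final class SpeechRates {

    // Bounds taken from the comment in the controller: 0.5 (slow) to 2.0 (fast).
    static final float MIN_RATE = 0.5f;
    static final float MAX_RATE = 2.0f;
    static final float DEFAULT_RATE = 1.0f;

    private SpeechRates() {}

    /** Returns the rate limited to [MIN_RATE, MAX_RATE]; NaN falls back to the default. */
    static float clamp(float requested) {
        if (Float.isNaN(requested)) {
            return DEFAULT_RATE;
        }
        return Math.max(MIN_RATE, Math.min(MAX_RATE, requested));
    }
}
```

In the handler this would read `.speechRate(SpeechRates.clamp(speed))`. Rejecting out-of-range values with a 400 response is an equally reasonable choice if silent correction is not wanted.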

4. In Practice: AI Chat + Voice Playback

Chaining ChatClient (text chat) with SpeechSynthesizer (speech synthesis) closes the loop: the user asks in text and the AI answers by voice.

package com.studying.controller.voice;

import com.alibaba.cloud.ai.dashscope.chat.DashScopeChatModel;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.SimpleLoggerAdvisor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

@RestController
@RequestMapping("/api/voice-chat")
public class VoiceChatController {

    private final ChatClient chatClient;

    @Value("${spring.ai.dashscope.api-key}")
    private String apiKey;

    public VoiceChatController(DashScopeChatModel dashScopeChatModel) {
        this.chatClient = ChatClient.builder(dashScopeChatModel)
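                // System prompt (kept in the original Chinese): the answer will be read aloud,
                // so be conversational, avoid Markdown, code blocks and list symbols (*, -),
                // keep sentences natural, and stay within 100 characters.
                .defaultSystem("""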
                        你是一个语音助手，回答会被转成语音播放。
                        因此：
                        - 回答要口语化，避免用 Markdown 格式
                        - 不要输出代码块、列表符号（*、-）等
                        - 句子要自然流畅，适合朗读
                        - 回答控制在 100 字以内
                        """)
                .defaultAdvisors(new SimpleLoggerAdvisor())
                .build();
    }

    /**
     * Ask in text, get a spoken answer back (MP3 audio stream)
     */
    @GetMapping(value = "/ask", produces = "audio/mpeg")
    public ResponseEntity<byte[]> askWithVoice(
            @RequestParam String question,
            @RequestParam(defaultValue = "longanyang") String voice) throws Exception {

        // Step 1: get the text answer
        String textAnswer = chatClient.prompt()
                .user(question)
                .call()
                .content();

        // Step 2: turn the text into speech
        Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(apiKey)
                .model("cosyvoice-v3-flash")
                .voice(voice)
                .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio;
        try {
            audio = synthesizer.call(textAnswer);
        } finally {
            // Close the WebSocket connection even if synthesis throws
            synthesizer.getDuplexApi().close(1000, "bye");
        }

        byte[] audioBytes = audio.array();
        Files.write(Path.of("voice-answer.mp3"), audioBytes); // optional: keep a local copy

        return ResponseEntity.ok()
                .contentType(MediaType.parseMediaType("audio/mpeg"))
                .body(audioBytes);
    }
}

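Pay attention to the special instructions in the system prompt: when text is read aloud, Markdown symbols such as the ** around bold text or the - in front of list items are spoken literally, and it sounds terrible. Always ask the model explicitly for conversational, unformatted text. This is an easy trap to fall into.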
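
A prompt is a request, not a guarantee, so it can be worth cleaning the model's answer once more before sending it to TTS. The following is a rough sketch of such a filter, not a full Markdown parser: it drops fenced code, heading, list and quote markers, asterisks and backticks, then joins lines. `SpeechTextSanitizer` is a name invented for this example, and the rules are deliberately blunt (an asterisk used as a multiplication sign would be removed too).

```java
import java.util.regex.Pattern;

/** Illustrative, deliberately simple Markdown stripper for text that will be read aloud. */
final class SpeechTextSanitizer {

    // Fenced code blocks: three backticks, anything (including newlines), three backticks.
    private static final Pattern CODE_FENCE = Pattern.compile("(?s)`{3}.*?`{3}");
    // Heading, bullet and quote markers at the start of a line.
    private static final Pattern LINE_PREFIX =
            Pattern.compile("(?m)^\\s*(?:#{1,6}|[-*+]|>)\\s+");
    // Remaining emphasis and inline-code marks.
    private static final Pattern INLINE_MARKS = Pattern.compile("[*`]+");
    // Line breaks with any surrounding whitespace.
    private static final Pattern LINE_BREAKS = Pattern.compile("\\s*\\n\\s*");

    private SpeechTextSanitizer() {}

    /** Returns the text with common Markdown markers removed and lines joined by spaces. */
    static String sanitize(String text) {
        if (text == null) {
            return "";
        }
        String s = CODE_FENCE.matcher(text).replaceAll(" ");
        s = LINE_PREFIX.matcher(s).replaceAll("");
        s = INLINE_MARKS.matcher(s).replaceAll("");
        s = LINE_BREAKS.matcher(s).replaceAll(" ");
        return s.trim();
    }
}
```

In the controller this would be used as `synthesizer.call(SpeechTextSanitizer.sanitize(textAnswer))`, keeping the prompt as the first line of defense and the filter as the second.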

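Test:

# The question is kept in Chinese; --data-urlencode percent-encodes it
curl -G "http://localhost:8080/api/voice-chat/ask" \
     --data-urlencode "question=今天天气怎么样" \
     --output answer.mp3

Opening the same URL in a browser plays the returned MP3 directly, and the server also writes voice-answer.mp3 locally.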

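5. In Practice: Scheduled Voice Broadcasts (Morning Briefing / System Notifications)

This suits periodic scenarios such as a daily morning briefing or system notifications. You first need to add @EnableScheduling to the application's main class.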

package com.jichi.springaialibaba.service;

import com.alibaba.cloud.ai.dashscope.chat.DashScopeChatModel;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

@Service
public class DailyBroadcastService {

    private final ChatClient chatClient;

    @Value("${spring.ai.dashscope.api-key}")
    private String apiKey;

    public DailyBroadcastService(DashScopeChatModel dashScopeChatModel) {
        this.chatClient = ChatClient.builder(dashScopeChatModel).build();
    }

    /**
     * Generates the daily briefing audio at 8:00 every morning
     */
    @Scheduled(cron = "0 0 8 * * ?")
    public void generateDailyBroadcast() throws Exception {
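        // 1. Have the model write the broadcast script.
        //    The prompts are kept in the original Chinese: the system prompt asks for a concise,
        //    Markdown-free morning script; the user prompt gives today's date and asks for a
        //    30-second broadcast with a greeting and the day's key reminders.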
        String script = chatClient.prompt()
                .system("你是一个播音员，生成简洁的早间播报文案，不要用 Markdown 格式")
                .user("今天是 " + LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy年M月d日")) +
                      "，请生成一段 30 秒的早间播报，包括问候语和今日关键提示")
                .call()
                .content();

        // 2. Synthesize the speech
        Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(apiKey)
                .model("cosyvoice-v3-flash")
                .voice("longxiaocheng")   // a male voice sounds more formal for broadcasts
                .speechRate(0.9f)         // slightly slower, for a broadcast feel
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio;
        try {
            audio = synthesizer.call(script);
        } finally {
            // Close the WebSocket connection even if synthesis throws
            synthesizer.getDuplexApi().close(1000, "bye");
        }

        // 3. Save to a file (it can then be pushed to a front end or stored in OSS)
        String filename = "broadcast-" + LocalDate.now() + ".mp3";
        Files.write(Path.of(filename), audio.array());
    }
}

You can distribute the generated audio file by email, a DingTalk bot, or object storage to get a fully automated "morning briefing voice push."
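
Before distributing the files, it helps to collect them in one known place instead of the process's working directory. Here is a minimal sketch of that, assuming a dedicated output directory; `BroadcastFiles` and both method names are hypothetical, and the file naming simply mirrors the `broadcast-<date>.mp3` pattern used in the service.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;

/** Illustrative helper that stores each day's broadcast in a dedicated directory. */
final class BroadcastFiles {

    private BroadcastFiles() {}

    /** Builds <directory>/broadcast-<yyyy-MM-dd>.mp3; LocalDate prints in ISO format. */
    static Path pathFor(Path directory, LocalDate date) {
        return directory.resolve("broadcast-" + date + ".mp3");
    }

    /** Creates the directory if needed, writes the audio, and returns the file path. */
    static Path save(Path directory, LocalDate date, byte[] audio) {
        try {
            Files.createDirectories(directory);
            // Files.write overwrites an existing file for the same date.
            return Files.write(pathFor(directory, date), audio);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The scheduled method could then end with something like `BroadcastFiles.save(Path.of("broadcasts"), LocalDate.now(), audio.array())`, where "broadcasts" is just a placeholder directory name.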

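6. Summary

With the native DashScope Java SDK, CosyVoice speech synthesis can be integrated into a Spring Boot application with little effort.

Basic usage: text → MP3 audio
Going further: text chat + voice playback (a single API call returns playable audio)
Automation: a scheduled task that generates the morning briefing audio

Three tips:

The model and voice versions must match (v3-flash uses the longanyang family; do not mix in v2 voices).
Remember to close the WebSocket connection once the audio has been generated (synthesizer.getDuplexApi().close(...)).
The voice assistant's prompt must insist on conversational text with no Markdown; otherwise playback sounds awful.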