Spring AI Alibaba Multimodal Application Development in Practice: From Model Integration to Image, Speech, and Video Generation

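In Java AI application development, a text-only chat model no longer covers a growing share of business requirements. Content production, intelligent customer service, tutoring and practice, marketing asset generation, and audio/video processing often need image generation, speech synthesis, speech recognition, and video generation together.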

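Spring AI Alibaba is built on Spring AI and wraps Alibaba Cloud's Tongyi model family and the DashScope service in the familiar Spring Boot development model. Developers can use unified Spring configuration, auto-configuration, and model abstractions to integrate multimodal AI capabilities quickly.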

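This article follows the actual development flow and covers, in order, the core usage for environment setup, image generation, speech synthesis, speech recognition, and video generation.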

Preparation

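Before calling model services through Alibaba Cloud's Bailian platform, you need to activate the model service and obtain an API Key. Bailian is a one-stop platform for large-model development and application building; through its API or SDK you can call the Tongyi model family for chat, content creation, summarization, and multimodal tasks.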

After obtaining the API Key, store it in an environment variable:

export DASHSCOPE_API_KEY=your_api_key

This keeps the key out of the codebase and makes it easy to switch configuration between environments.
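As a minimal sketch of how plain Java code could read this variable and fail fast when it is missing, the helper below looks up DASHSCOPE_API_KEY in a map of environment variables. The class name ApiKeyResolver and its error message are illustrative assumptions and not part of Spring AI Alibaba.

```java
import java.util.Map;

/** Hypothetical helper: resolves the DashScope API key and fails fast when it is absent. */
final class ApiKeyResolver {

    // Variable name used in the setup step of this article.
    static final String ENV_NAME = "DASHSCOPE_API_KEY";

    private ApiKeyResolver() {
    }

    /** Returns the trimmed key, or throws if the variable is unset or blank. */
    static String resolve(Map<String, String> env) {
        String key = env.get(ENV_NAME);
        if (key == null || key.isBlank()) {
            throw new IllegalStateException(
                    ENV_NAME + " is not set; export it before starting the application");
        }
        return key.trim();
    }
}
```

In an application this would be called as ApiKeyResolver.resolve(System.getenv()). Taking the map as a parameter keeps the lookup easy to test without touching the real environment.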

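Next, create a Spring Boot project, for example spring-alibaba-demo, and add the Web, test, WebFlux, and Spring AI Alibaba DashScope Starter dependencies. The Spring dependencies below omit versions, which assumes the project inherits from spring-boot-starter-parent or imports the Spring Boot BOM: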

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-webflux</artifactId>
    </dependency>
    <dependency>
        <groupId>com.alibaba.cloud.ai</groupId>
        <artifactId>spring-ai-alibaba-starter-dashscope</artifactId>
    </dependency>
</dependencies>

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>com.alibaba.cloud.ai</groupId>
            <artifactId>spring-ai-alibaba-bom</artifactId>
            <version>1.0.0.2</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

Configure the DashScope API Key in application.yml:

spring:
  ai:
    dashscope:
      api-key: ${DASHSCOPE_API_KEY}

The main class is a standard Spring Boot entry point:

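import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication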
public class AlibabaApplicationDemo {

    public static void main(String[] args) {
        SpringApplication.run(AlibabaApplicationDemo.class, args);
    }
}

With the basic configuration in place, run a connectivity test against the chat model first:

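import com.alibaba.cloud.ai.dashscope.chat.DashScopeChatModel;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
public class ChatModelTest {

    @Autowired
    private DashScopeChatModel dashScopeChatModel;

    @Test
    void testChat() {
        // Prompt: "Who are you?"
        String content = dashScopeChatModel.call("你是谁");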
        System.out.println(content);
    }
}

If a response comes back, the project is connected to the DashScope service and you can go on to add image, speech, and video capabilities.

Image Generation

Image generation uses DashScopeImageModel. It implements Spring AI's ImageModel interface, so it is called the same way as Spring AI's native image models.

The simplest text-to-image call looks like this:

@Autowired
DashScopeImageModel imageModel;

@Test
void text2Img() {
    ImageResponse imageResponse = imageModel.call(
            new ImagePrompt("孩子在海边玩耍") // prompt: "a child playing at the seaside"
    );
    String imageUrl = imageResponse.getResult().getOutput().getUrl();
    System.out.println(imageUrl);
}

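The response contains the image URL. The application can show it to the user directly, or download the image and store it in its own object storage or asset library.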
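The download-and-store step could be sketched as follows, using only the JDK. The class name GeneratedAssetDownloader and the choice of a local file as the storage target are assumptions for illustration; a real system would more likely upload to object storage and use an HTTP client with explicit timeouts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: copies a generated asset from its URL to a local file. */
final class GeneratedAssetDownloader {

    private GeneratedAssetDownloader() {
    }

    /** Streams the resource at sourceUrl into target, replacing any existing file; returns the bytes written. */
    static long download(String sourceUrl, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // make sure the destination folder exists
        }
        try (InputStream in = URI.create(sourceUrl).toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A call such as GeneratedAssetDownloader.download(imageUrl, Path.of("assets", "beach.png")) would persist the image while the returned URL is still valid; the target path here is only an example.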

Image model properties are managed by DashScopeImageProperties under the configuration prefix:

spring.ai.dashscope.image

The default image model is wanx-v1 and the default number of images is 1. A typical configuration that overrides both looks like this:

spring:
  ai:
    dashscope:
      api-key: ${DASHSCOPE_API_KEY}
      image:
        options:
          model: wan2.2-t2i-flash
          n: 4

To set parameters for a single call in code, use DashScopeImageOptions:

@Test
void text2ImgWithOptions() {
    DashScopeImageOptions imageOptions = DashScopeImageOptions.builder()
            .withModel("wanx-v1")
            .withStyle("<flat illustration>")
            .build();

    ImageResponse imageResponse = imageModel.call(
            new ImagePrompt("孩子在海边玩耍", imageOptions)
    );

    String imageUrl = imageResponse.getResult().getOutput().getUrl();
    System.out.println(imageUrl);
}

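Common Image Parameters

DashScopeImageOptions describes the parameters passed to the model for image generation. Common fields include:

Parameter	Description
`model`	Image generation model to use
`n`	Number of images to generate
`width`	Width of the generated image
`height`	Height of the generated image
`size`	Size of the generated image; no longer recommended for some models
`style`	Generation style
`seed`	Random seed; fixing it makes results more reproducible
`refImg`	URL of the reference image
`refMode`	Reference mode, for example content reference or style reference
`refStrength`	How closely the result follows the reference image
`watermark`	Whether to add an "AI-generated" watermark
`negativePrompt`	Negative prompt describing content that should not appear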

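Choose a model according to the target scenario:

For complex text rendering, such as posters, couplets, and visual designs that contain text, prefer qwen-image.
For photorealistic scenes, commercial photography, and general text-to-image, use the newer Tongyi Wanxiang models, such as wan2.2-t2i-plus or wan2.2-t2i-flash.
When image quality matters most, choose a plus model.
When response time matters most, choose a flash or turbo model.
When you need a custom resolution, the Tongyi Wanxiang models are more flexible; wan2.2-t2i-flash, for example, supports width and height combinations within a certain range.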
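The selection rules above can be captured in a small lookup, sketched here under the assumption that the application distinguishes three scenarios. The enum, the class name ImageModelSelector, and the mapping itself are illustrative; only the model names come from the list above.

```java
/** Hypothetical helper: maps an application scenario to one of the model names listed above. */
final class ImageModelSelector {

    /** Scenarios assumed for this sketch; a real application may need different ones. */
    enum Scenario { TEXT_RENDERING, QUALITY_FIRST, SPEED_FIRST }

    private ImageModelSelector() {
    }

    static String select(Scenario scenario) {
        switch (scenario) {
            case TEXT_RENDERING:
                return "qwen-image";        // posters, couplets, designs that contain text
            case QUALITY_FIRST:
                return "wan2.2-t2i-plus";   // plus models favor image quality
            case SPEED_FIRST:
                return "wan2.2-t2i-flash";  // flash models favor response time
            default:
                throw new IllegalArgumentException("Unknown scenario: " + scenario);
        }
    }
}
```

The returned name can then be passed to the withModel(...) call on the options builder, which keeps model choice in one place instead of scattering string literals through the code.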

To generate several images in one call:

@Test
public void text2ImgBatch() {
    String prompt = "孩子在海边玩耍"; // prompt: "a child playing at the seaside"

    DashScopeImageOptions imageOptions = DashScopeImageOptions.builder()
            .withModel("wan2.2-t2i-flash")
            .withN(3)
            .build();

    ImageResponse imageResponse = imageModel.call(
            new ImagePrompt(prompt, imageOptions)
    );

    List<String> urls = new ArrayList<>();
    for (ImageGeneration imageGeneration : imageResponse.getResults()) {
        urls.add(imageGeneration.getOutput().getUrl());
    }

    System.out.println(urls);
}

To add a watermark, enable watermark:

DashScopeImageOptions imageOptions = DashScopeImageOptions.builder()
        .withModel("wan2.2-t2i-flash")
        .withN(1)
        .withWatermark(true)
        .build();

For a custom resolution, set the width and height separately:

DashScopeImageOptions imageOptions = DashScopeImageOptions.builder()
        .withModel("wan2.2-t2i-flash")
        .withN(1)
        .withWidth(512)
        .withHeight(1440)
        .build();

Note that the width and height must fall within the range the model supports; otherwise the generation request fails.
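One way to catch an unsupported size before sending the request is a local check, sketched below. The class name ImageSizeRule is hypothetical, and the bounds are constructor arguments because the actual limits depend on the model; the values used in the usage note are assumptions and should be taken from the model documentation.

```java
/** Hypothetical helper: checks a requested image size against per-side bounds. */
final class ImageSizeRule {

    private final int minSide;
    private final int maxSide;

    ImageSizeRule(int minSide, int maxSide) {
        if (minSide <= 0 || maxSide < minSide) {
            throw new IllegalArgumentException("Invalid bounds: " + minSide + ".." + maxSide);
        }
        this.minSide = minSide;
        this.maxSide = maxSide;
    }

    /** True when both sides lie inside the configured bounds (inclusive). */
    boolean accepts(int width, int height) {
        return inRange(width) && inRange(height);
    }

    /** Throws with a readable message instead of letting the remote call fail. */
    void require(int width, int height) {
        if (!accepts(width, height)) {
            throw new IllegalArgumentException(
                    "Size " + width + "x" + height + " is outside " + minSide + ".." + maxSide);
        }
    }

    private boolean inRange(int side) {
        return side >= minSide && side <= maxSide;
    }
}
```

For example, new ImageSizeRule(512, 1440).require(512, 1440) passes under the assumed bounds, while a 256-pixel side is rejected locally. Models may also restrict the total pixel count, which this sketch does not check.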

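For image-to-image and style-transfer scenarios, wanx-v1 is better suited to generating a specific artistic style from a reference image. Pass the reference image URL and control the result with refMode and refStrength: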

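String prompt = "一只黑色的小猫"; // prompt: "a little black cat"

DashScopeImageOptions imageOptions = DashScopeImageOptions.builder()
        .withModel("wanx-v1") // set explicitly in case the configured default is another model
        .withRefImg("https://example.com/house.png")
        .withRefMode("refonly")
        .withRefStrength(1.0f)
        .build();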

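Here refonly means the reference image is used mainly for its style; the higher refStrength is, the more closely the result resembles the reference image.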

Speech Synthesis

Speech synthesis uses DashScopeSpeechSynthesisModel, which implements SpeechSynthesisModel and converts text to audio.

A basic text-to-speech example:

@Autowired
private DashScopeSpeechSynthesisModel speechSynthesisModel;

@Test
public void text2Audio() throws IOException {
    DashScopeSpeechSynthesisOptions options =
            DashScopeSpeechSynthesisOptions.builder().build();

    SpeechSynthesisResponse response = speechSynthesisModel.call(
            new SpeechSynthesisPrompt(
                    "小池，泉眼无声惜细流，树阴照水爱晴柔。小荷才露尖尖角，早有蜻蜓立上头。", // Chinese sample text: a classical poem
                    options
            )
    );

    File file = new File(System.getProperty("user.dir") + "/output.mp3");
    try (FileOutputStream fos = new FileOutputStream(file)) {
        ByteBuffer byteBuffer = response.getResult().getOutput().getAudio();
        fos.write(byteBuffer.array());
    }
}
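The example above calls ByteBuffer.array(), which works for a writable, array-backed buffer whose content starts at index 0 and fills the whole array. As a hedged alternative for cases where that is not guaranteed (read-only, direct, or sliced buffers), the sketch below copies only the readable bytes. The class name AudioBuffers is an assumption for illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: extracts the readable bytes of a buffer without relying on array(). */
final class AudioBuffers {

    private AudioBuffers() {
    }

    /** Copies the bytes between position and limit; the caller's buffer is left unchanged. */
    static byte[] toBytes(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate();   // independent position and limit
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }
}
```

With this helper, the write in the example could become fos.write(AudioBuffers.toBytes(byteBuffer)), which behaves the same for ordinary heap buffers and stays safe for the other kinds.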

The configuration prefix for speech synthesis is:

spring.ai.dashscope.audio.synthesis

The default model is sambert-zhichu-v1 and the default output format is MP3. You can also switch to a CosyVoice model and specify a voice:

spring:
  ai:
    dashscope:
      api-key: ${DASHSCOPE_API_KEY}
      audio:
        synthesis:
          options:
            model: cosyvoice-v1
            voice: longwan

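Common Speech Synthesis Parameters

Parameter	Description
`model`	Speech synthesis model to use
`voice`	Voice to use
`text`	Text to synthesize
`sampleRate`	Audio sample rate
`volume`	Volume, typically 0 to 100
`speed`	Speech rate, typically 0.5 to 2
`pitch`	Pitch, typically 0.5 to 2
`responseFormat`	Output format: MP3, WAV, or PCM
`enableWordTimestamp`	Whether to enable word-level timestamps
`enablePhonemeTimestamp`	Whether to enable phoneme-level timestamps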

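The model and the voice must match. Different model versions support different voices, and some voices require additional permissions to be enabled.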
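When volume, speed, or pitch come from user input, it can help to clamp them to the typical ranges from the table before building the options. The sketch below does only that; the class name SynthesisParams is hypothetical, and the ranges are the "typical" ones quoted above, so a given model may accept a narrower or wider range.

```java
/** Hypothetical helper: clamps synthesis parameters to the typical ranges from the table. */
final class SynthesisParams {

    private SynthesisParams() {
    }

    /** Volume is kept within 0..100. */
    static int clampVolume(int volume) {
        return Math.max(0, Math.min(100, volume));
    }

    /** Speech rate is kept within 0.5..2. */
    static float clampSpeed(float speed) {
        return Math.max(0.5f, Math.min(2.0f, speed));
    }

    /** Pitch is kept within 0.5..2. */
    static double clampPitch(double pitch) {
        return Math.max(0.5, Math.min(2.0, pitch));
    }
}
```

Clamping silently corrects the value; an application that prefers to reject bad input could throw an exception instead.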

Male voice example:

DashScopeSpeechSynthesisOptions options =
        DashScopeSpeechSynthesisOptions.builder()
                .model("cosyvoice-v2")
                .voice("longsanshu")
                .build();

SpeechSynthesisResponse response = speechSynthesisModel.call(
        new SpeechSynthesisPrompt(
                "山村咏怀，一去二三里，烟村四五家。亭台六七座，八九十枝花。", // Chinese sample text: a classical poem
                options
        )
);

Female voice example:

DashScopeSpeechSynthesisOptions options =
        DashScopeSpeechSynthesisOptions.builder()
                .model("cosyvoice-v2")
                .voice("longmiao_v2")
                .build();

Adjusting the speech rate:

DashScopeSpeechSynthesisOptions options =
        DashScopeSpeechSynthesisOptions.builder()
                .model("cosyvoice-v2")
                .voice("longmiao_v2")
                .speed(0.5f)
                .build();

Adjusting the pitch:

DashScopeSpeechSynthesisOptions options =
        DashScopeSpeechSynthesisOptions.builder()
                .model("cosyvoice-v2")
                .voice("longmiao_v2")
                .speed(0.5f)
                .pitch(1.2d)
                .build();

Adjusting the volume:

DashScopeSpeechSynthesisOptions options =
        DashScopeSpeechSynthesisOptions.builder()
                .model("cosyvoice-v2")
                .voice("longmiao_v2")
                .volume(10)
                .build();

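Besides the Spring Boot Starter, you can also call the models directly with Alibaba Cloud's official dashscope-sdk-java. The Starter fits Spring Boot projects and brings auto-configuration, configuration management, and the Spring AI abstractions; the SDK is lower level and can be used on its own in a plain Java application.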

SDK dependency:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>dashscope-sdk-java</artifactId>
    <version>2.21.2</version>
</dependency>

Speech synthesis with the SDK:

@Test
public void text2AudioWithSdk() throws IOException {
    SpeechSynthesisParam param = SpeechSynthesisParam.builder()
            .model("cosyvoice-v2")
            .voice("longhuhu")
            .build();

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    ByteBuffer audio = synthesizer.call("今天天气怎么样？"); // text: "How is the weather today?"

    File file = new File("output.mp3");
    try (FileOutputStream fos = new FileOutputStream(file)) {
        fos.write(audio.array());
    }
}

Speech Recognition

Speech recognition converts speech in audio or video into text. For recorded files, use DashScopeAudioTranscriptionModel.

Basic example:

@Autowired
private DashScopeAudioTranscriptionModel transcriptionModel;

private static final String DEFAULT_MODEL = "paraformer-v2";

@Test
void stt() {
    Resource resource = new DefaultResourceLoader()
            .getResource("https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav");

    AudioTranscriptionResponse response = transcriptionModel.call(
            new AudioTranscriptionPrompt(
                    resource,
                    DashScopeAudioTranscriptionOptions.builder()
                            .withModel(DEFAULT_MODEL)
                            .build()
            )
    );

    System.out.println(response.getResult().getOutput());
}

With the DashScope SDK, you can submit several audio files as an asynchronous task:

@Test
void sttWithDashscopeSdk() {
    TranscriptionParam param = TranscriptionParam.builder()
            .model("paraformer-v2")
            .parameter("language_hints", new String[]{"zh", "en"})
            .fileUrls(Arrays.asList(
                    "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                    "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"
            ))
            .build();

    try {
        Transcription transcription = new Transcription();
        TranscriptionResult result = transcription.asyncCall(param);
        System.out.println("RequestId: " + result.getRequestId());

        result = transcription.wait(
                TranscriptionQueryParam.FromTranscriptionParam(
                        param,
                        result.getTaskId()
                )
        );

        System.out.println(
                new GsonBuilder().setPrettyPrinting().create().toJson(result.getOutput())
        );
    } catch (Exception e) {
        System.out.println("error: " + e);
    }
}

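There are two core objects here:

TranscriptionParam: wraps request parameters such as the model, language hints, and file URLs.
TranscriptionResult: wraps the task result, including the request ID, task ID, task status, and subtask results.

Each audio file corresponds to one subtask. A subtask result is represented by TranscriptionTaskResult, whose most important field is transcriptionUrl. It points to the recognition result file, usually in JSON format. The link is valid only for a limited time and can no longer be downloaded after it expires, so production systems should fetch and save the recognition result promptly.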

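For real-time speech recognition, use a Paraformer real-time model. The example below passes a local WAV file to paraformer-realtime-v2 in a single call: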

@Test
void testParaformerRealtime() {
    Recognition recognizer = new Recognition();

    RecognitionParam param = RecognitionParam.builder()
            .model("paraformer-realtime-v2")
            .format("wav")
            .sampleRate(16000)
            .parameter("language_hints", new String[]{"zh", "en"})
            .build();

    try {
        ClassPathResource resource =
                new ClassPathResource("hello_world_male_16k_16bit_mono.wav");

        String result = recognizer.call(param, resource.getFile());
        System.out.println(result);

        Gson gson = new GsonBuilder().setPrettyPrinting().create();
        JsonObject jsonObject = gson.fromJson(result, JsonObject.class);

        if (jsonObject.has("sentences")) {
            for (JsonElement sent : jsonObject.get("sentences").getAsJsonArray()) {
                JsonObject sentObj = sent.getAsJsonObject();
                System.out.println(sentObj.get("text").getAsString());
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        recognizer.getDuplexApi().close(1000, "bye");
    }
}

After recognition finishes, close the WebSocket connection explicitly so that it does not hold resources indefinitely.
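If the explicit finally block is easy to forget, one option is to wrap the close call in an AutoCloseable so that try-with-resources runs it automatically. The sketch below is a generic wrapper, not an SDK class; the name CloseGuard is an assumption.

```java
import java.util.Objects;

/** Hypothetical wrapper: runs a cleanup action exactly once when the try block exits. */
final class CloseGuard implements AutoCloseable {

    private final Runnable closeAction;
    private boolean closed;

    CloseGuard(Runnable closeAction) {
        this.closeAction = Objects.requireNonNull(closeAction, "closeAction");
    }

    @Override
    public void close() {
        if (closed) {
            return; // repeated calls are ignored
        }
        closed = true;
        closeAction.run();
    }
}
```

Applied to the example above, the cleanup could be written as try (CloseGuard guard = new CloseGuard(() -> recognizer.getDuplexApi().close(1000, "bye"))) { ... }, which closes the connection whether recognition succeeds or throws.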

Video Generation

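Spring AI Alibaba DashScope can also integrate video generation, including text-to-video, image-to-video, first-frame video generation, first-and-last-frame video generation, and video effects.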

Text-to-video can use VideoSynthesis from the DashScope SDK:

@Test
public void textToVideo() throws NoApiKeyException, InputRequiredException {
    VideoSynthesis vs = new VideoSynthesis();

    VideoSynthesisParam param = VideoSynthesisParam.builder()
            .model("wan2.2-t2v-plus")
            .prompt("一只小猫在月光下奔跑") // prompt: "a kitten running in the moonlight"
            .size("1920*1080")
            .build();

    System.out.println("please wait...");
    VideoSynthesisResult result = vs.call(param);
    System.out.println(JsonUtils.toJson(result));
}

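Request parameters are set on VideoSynthesisParam through chained builder calls. Common fields include:

model: the video generation model.
prompt: a description of the video to generate.
size: the video dimensions.
imgUrl: the input image for image-to-video.
resolution: the video resolution.
parameters: additional parameters, such as prompt extension.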

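The response is wrapped in VideoSynthesisResult, whose output contains the task ID, task status, error information, and videoUrl. videoUrl is the download address of the generated video; it is typically valid for 24 hours, and the output format is MP4. Once the application has the link, it should download the video promptly and persist it.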
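To keep track of how long a temporary link remains usable, the application can record when it received the link and compare against the validity period. The sketch below assumes the 24-hour figure mentioned above and uses the receipt time as the start of the window; the class name TemporaryLink is hypothetical, and the real expiry may be earlier if the service starts counting before the response arrives.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;

/** Hypothetical value object: a temporary URL together with its assumed validity window. */
final class TemporaryLink {

    private final String url;
    private final Instant expiresAt;

    TemporaryLink(String url, Instant receivedAt, Duration validFor) {
        this.url = Objects.requireNonNull(url, "url");
        this.expiresAt = Objects.requireNonNull(receivedAt, "receivedAt")
                .plus(Objects.requireNonNull(validFor, "validFor"));
    }

    String url() {
        return url;
    }

    /** True once the assumed validity window has fully elapsed. */
    boolean isExpired(Instant now) {
        return !now.isBefore(expiresAt);
    }
}
```

A scheduled job could scan stored links and download any that have not been persisted yet, well before isExpired would return true.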

Image-to-video generates a video from a first-frame image:

@Test
public void image2Video() throws NoApiKeyException, InputRequiredException {
    String imgUrl = "https://example.com/input.jpg";

    Map<String, Object> parameters = new HashMap<>();
    parameters.put("prompt_extend", true);

    VideoSynthesis vs = new VideoSynthesis();

    VideoSynthesisParam param = VideoSynthesisParam.builder()
            .model("wan2.2-i2v-plus")
            .prompt("喝水") // prompt: "drinking water"
            .imgUrl(imgUrl)
            .parameters(parameters)
            .resolution("1080P")
            .build();

    System.out.println("please wait...");
    VideoSynthesisResult result = vs.call(param);
    System.out.println(JsonUtils.toJson(result));
}

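Video generation usually takes longer, so in a real project it is best designed as an asynchronous task: after the user submits a request, the system records the task status, a background job obtains the result by polling or callback, and the video URL is then returned to the front end.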
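The polling half of that design can be sketched independently of any SDK. In the example below, the status source is a plain Supplier so that it can be backed by whatever call the project uses to query a task; the class name TaskPoller, the four status values, and the fixed delay are all assumptions made for illustration.

```java
import java.util.function.Supplier;

/** Hypothetical helper: polls a task status until it reaches a terminal state. */
final class TaskPoller {

    /** Status values assumed for this sketch; map the real task status onto them. */
    enum Status { PENDING, RUNNING, SUCCEEDED, FAILED }

    private TaskPoller() {
    }

    /**
     * Calls fetchStatus up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns SUCCEEDED or FAILED, or throws if the task is still unfinished afterwards.
     */
    static Status awaitTerminal(Supplier<Status> fetchStatus, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Status status = fetchStatus.get();
            if (status == Status.SUCCEEDED || status == Status.FAILED) {
                return status;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("Interrupted while polling", e);
                }
            }
        }
        throw new IllegalStateException("Task did not finish within " + maxAttempts + " polls");
    }
}
```

This loop would run in a background worker rather than in the request thread, with the web endpoint returning the task ID immediately. A production version would likely add backoff between polls and persist the task state so that polling survives restarts.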

Practical Recommendations

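The strength of Spring AI Alibaba is that it brings DashScope's multimodal capabilities into the Spring Boot development model, so Java developers can integrate AI capabilities with the dependency management, configuration files, auto-configuration, and model abstractions they already know.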

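When putting this into practice, the following approach works well:

First complete the API Key, dependencies, and basic connectivity test, then add multimodal capabilities.
For image generation, focus on model choice, prompt quality, resolution, image count, style, and reference-image parameters.
For speech synthesis, focus on matching the model and voice, along with speech rate, pitch, volume, and output format.
For speech recognition, distinguish recorded-file recognition from real-time recognition, and promptly save results that have a limited validity period.
Video generation is best built as an asynchronous flow so that requests do not block for a long time.
In production, do not hard-code the API Key, and do not rely on temporary URLs as long-term resource addresses.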

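From text to image, text to speech, speech to text, and on to video generation, Spring AI Alibaba offers a fairly complete path for building multimodal applications in Java. For teams already on Spring Boot, it can substantially reduce the cost of integrating the Tongyi model family and lets AI capabilities fit more naturally into existing business systems.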