Lightweight Invocation
Call third-party large models over HTTP APIs (e.g., the OpenAI API, Alibaba's Tongyi Qianwen), or deploy an open-source model locally (e.g., the LLaMA family) and wrap it as a RESTful service that Java systems can call.
Option 1: Development steps for calling a third-party LLM API (e.g., OpenAI / Tongyi Qianwen)
Build a model gateway service (a Spring Boot application) that accepts requests from the Java business system, forwards them to the third-party API (e.g., OpenAI), handles authentication, errors, caching and monitoring, and returns a standardized response.
text
User → Java business-system frontend → Java business-system backend → Model gateway service → Third-party LLM API
↓
(Example: intelligent customer-service scenario)
1. The user types in the frontend: "How do I request a refund?"
↓
2. The frontend calls the business-system API: POST /api/customer-service/ask-ai
↓
3. CustomerController receives the request and calls CustomerService.getAIResponseForCustomer()
↓
4. CustomerService builds a ModelRequest and calls the model gateway: POST /api/v1/ai/chat
↓
5. ModelGatewayController receives the request and delegates to ModelGatewayService
↓
6. ModelGatewayService calls OpenAIService.callOpenAI()
↓
7. OpenAIService sends an HTTP request to https://api.openai.com/v1/chat/completions
↓
8. OpenAI returns the result, which is passed back up through the layers to CustomerController
↓
9. The frontend displays the AI reply: "You can click the refund button on the order page..."
①、Spring Boot project
xml
<dependencies>
<!-- Web RESTful API-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Redis cache -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<!-- Resilience4j: circuit breaker / retry -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot2</artifactId>
<version>2.1.0</version>
</dependency>
<!-- OkHttp HTTP client (replaces RestTemplate, better performance) -->
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>4.12.0</version>
</dependency>
<!-- Micrometer metrics + Prometheus -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<!-- Lombok -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<!-- Tests -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
②、Configuration management (API keys, third-party API parameters)
API keys (e.g., OpenAI's sk-xxx, Tongyi Qianwen's API_KEY) must not be hard-coded; read them from environment variables or a configuration center (e.g., Spring Cloud Config, HashiCorp Vault).
yml
server:
  port: 8080

# Third-party model configuration (extensible to multiple providers)
model:
  providers:
    openai:
      base-url: https://api.openai.com/v1
      api-key: ${OPENAI_API_KEY:default_key_for_dev}  # read from the OPENAI_API_KEY environment variable first
      models:
        chat: gpt-3.5-turbo
        completion: text-davinci-003
    tongyi:
      base-url: https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation
      api-key: ${TONGYI_API_KEY:default_tongyi_key}
      models:
        chat: qwen-turbo

# Redis cache configuration
spring:
  redis:
    host: localhost
    port: 6379
    password: ${REDIS_PASSWORD:}
    lettuce:
      pool:
        max-active: 8
        max-idle: 8

# Resilience4j circuit breaker / retry configuration
resilience4j:
  retry:
    instances:
      openai-api:
        max-attempts: 3
        wait-duration: 1s
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
  circuitbreaker:
    instances:
      openai-api:
        failure-rate-threshold: 50
        minimum-number-of-calls: 10
        sliding-window-size: 20

# Monitoring configuration
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        enabled: true
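The model.providers.* settings above can be bound to a typed configuration class so the rest of the gateway never touches raw key strings. A minimal sketch; the class name ModelProviderProperties is an assumption, not part of the original code:
java
import java.util.HashMap;
import java.util.Map;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

// Hypothetical binding class for the model.providers.* configuration shown above
@Data
@Component
@ConfigurationProperties(prefix = "model")
public class ModelProviderProperties {

    /** Keyed by provider name: "openai", "tongyi", ... */
    private Map<String, Provider> providers = new HashMap<>();

    @Data
    public static class Provider {
        private String baseUrl;                                // e.g. https://api.openai.com/v1
        private String apiKey;                                 // resolved from an environment variable
        private Map<String, String> models = new HashMap<>(); // model type (chat/completion) -> model name
    }
}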
③、Request and response models
java
/**
 * Unified request body accepted from the Java business system.
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class ModelRequest {
    private String provider;        // model provider: openai / tongyi
    private String modelType;       // model type: chat / completion
    private List<Message> messages; // chat format: [{"role":"user","content":"..."}]
    private String prompt;          // completion format: pass the prompt directly
    private Integer maxTokens = 500;
    private Double temperature = 0.7;
    private Boolean stream = false; // whether to stream the output
    // other parameters: topP, presencePenalty, etc.
}

@Data
@NoArgsConstructor
@AllArgsConstructor
public class Message {
    private String role;
    private String content;
}
java
/**
 * Unified response returned to the Java business system.
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class ModelResponse {
    private String id;       // request ID (generated UUID)
    private String provider; // provider actually called
    private String model;    // model actually called
    private String content;  // generated text
    private Long created;    // timestamp
    private Usage usage;     // token usage
    private String error;    // error message (null on success)
}

// Token usage
@Data
@NoArgsConstructor
@AllArgsConstructor
public class Usage {
    private Integer promptTokens;
    private Integer completionTokens;
    private Integer totalTokens;
}
java
/**
 * Internal DTO that maps the raw OpenAI /chat/completions response
 * (declared as a nested class inside OpenAIService).
 */
@Data
private static class OpenAiResponse {
    private String id;
    private String model;
    private Long created;
    private List<Choice> choices;
    private Usage usage; // note: field names match our own Usage class, so it can be reused directly

    @Data
    private static class Choice {
        private Message message;     // contains role and content
        private String finishReason;
    }
}
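The flow above refers to OpenAIService.callOpenAI(), but the class itself is not shown. The following is a minimal sketch (not the original implementation) of how it could send the chat request with the OkHttp client from the dependency list and map the OpenAiResponse DTO above to the unified ModelResponse; error handling, retries and the Tongyi counterpart are omitted:
java
import java.io.IOException;
import java.util.Map;
import java.util.UUID;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.PropertyNamingStrategies;
import okhttp3.*;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

// Hypothetical sketch of the OpenAIService referenced above; the OpenAiResponse DTO
// from the previous block is assumed to be nested inside this class.
@Service
public class OpenAIService {

    private final OkHttpClient httpClient = new OkHttpClient();
    // snake_case mapping for OpenAI field names (prompt_tokens, finish_reason, ...)
    private final ObjectMapper objectMapper = new ObjectMapper()
            .setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

    @Value("${model.providers.openai.base-url}")
    private String baseUrl;
    @Value("${model.providers.openai.api-key}")
    private String apiKey;
    @Value("${model.providers.openai.models.chat}")
    private String chatModel;

    public ModelResponse callOpenAI(ModelRequest request) {
        try {
            // Build the OpenAI chat/completions payload from the unified ModelRequest
            Map<String, Object> payload = Map.of(
                    "model", chatModel,
                    "messages", request.getMessages(),
                    "max_tokens", request.getMaxTokens(),
                    "temperature", request.getTemperature());

            Request httpRequest = new Request.Builder()
                    .url(baseUrl + "/chat/completions")
                    .header("Authorization", "Bearer " + apiKey)
                    .post(RequestBody.create(objectMapper.writeValueAsString(payload),
                            MediaType.get("application/json")))
                    .build();

            try (Response httpResponse = httpClient.newCall(httpRequest).execute()) {
                OpenAiResponse openAiResponse =
                        objectMapper.readValue(httpResponse.body().string(), OpenAiResponse.class);

                // Map the provider-specific response to the unified ModelResponse
                ModelResponse result = new ModelResponse();
                result.setId(UUID.randomUUID().toString());
                result.setProvider("openai");
                result.setModel(openAiResponse.getModel());
                result.setContent(openAiResponse.getChoices().get(0).getMessage().getContent());
                result.setCreated(openAiResponse.getCreated());
                result.setUsage(openAiResponse.getUsage());
                return result;
            }
        } catch (IOException e) {
            throw new RuntimeException("OpenAI call failed", e);
        }
    }
}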
④、Java business system: port 8090, exposes the /api/customer-service/ask-ai endpoint, called by the frontend
yml
server:
  port: 8090  # business-system port

# Model gateway address
model:
  gateway:
    url: http://localhost:8080/api/v1/ai/chat
    timeout: 60s
java
@RestController
@RequestMapping("/api/customer-service")
@RequiredArgsConstructor
public class CustomerController {
    private final CustomerService customerService;

    /**
     * Customer-service endpoint called by the frontend (frontend/backend interaction).
     */
    @PostMapping("/ask-ai")
    public ResponseEntity<ApiResponse> askAI(@RequestBody CustomerRequest customerRequest) {
        String aiResponse = customerService.getAIResponseForCustomer(
                customerRequest.getQuestion(),
                customerRequest.getSessionId()
        );
        ApiResponse response = new ApiResponse();
        response.setCode(200);
        response.setMessage("success");
        response.setData(aiResponse);
        return ResponseEntity.ok(response);
    }
}

// Data models
@Data
class CustomerRequest {
    private String question;  // user question
    private String sessionId; // session ID
    private String userId;    // user ID
}

@Data
class ApiResponse {
    private int code;
    private String message;
    private Object data;
}
java
@Slf4j
@Service
@RequiredArgsConstructor
public class CustomerService {
    // HTTP client used to call the model gateway
    private final RestTemplate restTemplate;

    @Value("${model.gateway.url}") // gateway address, from the configuration above
    private String gatewayUrl;

    /**
     * Business method: get an AI reply for a customer question.
     */
    public String getAIResponseForCustomer(String customerQuestion, String sessionId) {
        try {
            // 1. Build the request object
            ModelRequest request = new ModelRequest();
            request.setProvider("openai");   // use OpenAI
            request.setModelType("chat");    // chat model (the gateway resolves the concrete model name)
            request.setMessages(List.of(new Message("user", customerQuestion))); // user question (conversation history / sessionId handling omitted here)
            request.setMaxTokens(500);       // max tokens to generate

            // 2. Call the model gateway
            ModelResponse response = restTemplate.postForObject(gatewayUrl, request, ModelResponse.class);

            // 3. Return the AI reply
            return response != null ? response.getContent() : "Sorry, the AI service is temporarily unavailable";
        } catch (Exception e) {
            log.error("Failed to call the AI gateway: {}", e.getMessage());
            return "The AI service is unavailable, please try again later";
        }
    }
}
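The CustomerService above injects a RestTemplate, which Spring Boot does not register automatically. A minimal configuration sketch (class name and connect timeout are assumptions; the 60s read timeout mirrors model.gateway.timeout above), for Spring Boot 2.x:
java
import java.time.Duration;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

// Hypothetical configuration class; registers the RestTemplate injected by CustomerService
@Configuration
public class HttpClientConfig {

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
                .setConnectTimeout(Duration.ofSeconds(5))  // connection timeout (assumed value)
                .setReadTimeout(Duration.ofSeconds(60))    // matches model.gateway.timeout: 60s
                .build();
    }
}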
Model gateway service: port 8080, exposes the /api/v1/ai/chat endpoint, called only by the Java business system
yml
server:
  port: 8080  # gateway port
model:
  providers:
    openai:
      base-url: https://api.openai.com/v1
      api-key: ${OPENAI_API_KEY}
      timeout: 30s
    tongyi:
      base-url: https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation
      api-key: ${TONGYI_API_KEY}
      timeout: 30s
java
@RestController
@RequestMapping("/api/v1/ai")
@RequiredArgsConstructor
public class ModelGatewayController {
    private final ModelGatewayService modelGatewayService;

    /**
     * Core gateway endpoint - the Java business system calls this.
     */
    @PostMapping("/chat")
    public ResponseEntity<ModelResponse> chatCompletion(@RequestBody ModelRequest request) {
        ModelResponse response = modelGatewayService.processChatRequest(request);
        return ResponseEntity.ok(response);
    }

    /**
     * Health-check endpoint.
     */
    @GetMapping("/health")
    public ResponseEntity<String> health() {
        return ResponseEntity.ok("Model Gateway is running");
    }
}
java
@Service
@RequiredArgsConstructor // generates a constructor for the final fields
public class ModelGatewayService {
    private final OpenAIService openAIService;
    private final TongyiService tongyiService;
    private final CacheManager cacheManager;

    public ModelResponse processChatRequest(ModelRequest request) {
        // 1. Validate parameters
        validateRequest(request);
        // 2. Check the cache
        String cacheKey = generateCacheKey(request);
        ModelResponse cached = cacheManager.get(cacheKey, ModelResponse.class);
        if (cached != null) {
            return cached;
        }
        // 3. Route to the provider-specific service
        ModelResponse response;
        switch (request.getProvider().toLowerCase()) {
            case "openai":
                response = openAIService.callOpenAI(request);
                break;
            case "tongyi":
                response = tongyiService.callTongyi(request);
                break;
            default:
                throw new IllegalArgumentException("Unsupported provider: " + request.getProvider());
        }
        // 4. Cache the result
        cacheManager.set(cacheKey, response, Duration.ofHours(1));
        return response;
    }

    private void validateRequest(ModelRequest request) {
        if (StringUtils.isEmpty(request.getProvider())) {
            throw new IllegalArgumentException("Provider is required");
        }
        if (CollectionUtils.isEmpty(request.getMessages())) {
            throw new IllegalArgumentException("Messages are required");
        }
    }

    private String generateCacheKey(ModelRequest request) {
        // Hash the last user message so the key stays short and stable
        String lastMessage = request.getMessages().get(request.getMessages().size() - 1).getContent();
        return String.format("ai_chat:%s:%s:%s",
                request.getProvider(),
                request.getModelType(),
                DigestUtils.md5DigestAsHex(lastMessage.getBytes()));
    }
}
⑤、Cache utility class
java
@Component
@RequiredArgsConstructor
public class CacheManager {
    private final StringRedisTemplate redisTemplate;
    // reuse a single ObjectMapper instead of creating one per call
    private final ObjectMapper objectMapper = new ObjectMapper();

    public void set(String key, Object value, Duration ttl) {
        try {
            String json = objectMapper.writeValueAsString(value);
            redisTemplate.opsForValue().set(key, json, ttl);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Cache serialization failed", e);
        }
    }

    public <T> T get(String key, Class<T> clazz) {
        String json = redisTemplate.opsForValue().get(key);
        if (json == null) {
            return null;
        }
        try {
            return objectMapper.readValue(json, clazz);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Cache deserialization failed", e);
        }
    }
}
⑥、The business system (e.g., an intelligent customer-service backend) calls the gateway through an HTTP client
java
@Service
@RequiredArgsConstructor
public class CustomerService {
    // Spring WebClient (reactive); requires the spring-boot-starter-webflux dependency
    private final WebClient webClient;

    public String getAiSuggestion(String userQuestion) {
        // Build the request body
        ModelRequest request = new ModelRequest();
        request.setProvider("openai");
        request.setModelType("chat");
        request.setMessages(List.of(new Message("user", userQuestion)));
        request.setMaxTokens(300);

        // Call the gateway
        ModelResponse response = webClient.post()
                .uri("http://localhost:8080/api/v1/ai/chat")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(ModelResponse.class)
                .block(); // synchronous call (an asynchronous call also works in practice)
        return response != null ? response.getContent() : "AI assistance is temporarily unavailable";
    }
}
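The injected WebClient also needs to be registered as a bean (and assumes the spring-boot-starter-webflux dependency, which is not in the dependency list above). A minimal sketch; the class name is an assumption:
java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.reactive.function.client.WebClient;

// Hypothetical configuration; assumes spring-boot-starter-webflux is on the classpath
@Configuration
public class WebClientConfig {

    @Bean
    public WebClient webClient(WebClient.Builder builder) {
        return builder
                .baseUrl("http://localhost:8080") // model gateway host; could also come from model.gateway.url
                .build();
    }
}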
Option 2: Development steps for wrapping a locally deployed open-source model as a service
Compared with Option 1, the model gateway in Option 2 talks to a locally deployed inference server (e.g., vLLM or TGI) instead of a third-party API.
①、Choose an inference server: vLLM is preferred (high performance, OpenAI-compatible API). Example deployment command:
bash
# Start the vLLM server (Mistral-7B model, port 8000, OpenAI-compatible API)
# --quantization awq enables AWQ quantized inference (optional speed/memory optimization)
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.1 \
  --quantization awq \
  --port 8000 \
  --gpu-memory-utilization 0.9
Verify the service: open http://localhost:8000/v1/models; it returns the list of models being served.
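If you prefer to verify from Java rather than a browser, a small sketch using the JDK's built-in HttpClient (Java 11+) to call the /v1/models endpoint mentioned above:
java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Quick check that the local vLLM server is up and lists its models
public class VllmHealthCheck {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8000/v1/models"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // expect 200
        System.out.println(response.body());       // JSON list of served models
    }
}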
②、Point the gateway service at the local model server
yml
model:
  providers:
    local-vllm:                             # local model provider
      base-url: http://localhost:8000/v1    # vLLM's OpenAI-compatible API address
      api-key: "no-key-needed"              # vLLM requires no API key by default
      models:
        chat: mistral-7b-instruct
java
private ModelResponse callLocalVllmApi(String url, Headers headers, String body, ProviderConfig config) throws IOException {
    // Local model call: the API is OpenAI-compatible, so the callOpenAiApi logic can be reused directly
    return callOpenAiApi(url, headers, body, config);
}
③、Multi-model routing and dynamic configuration
java
// Route to a different locally deployed model based on the provider in the request
if ("local-llama2".equals(request.getProvider())) {
    config = providerConfigs.get("local-llama2"); // points to the vLLM instance serving LLaMA-2
}
Deep Integration
When high performance is needed (e.g., real-time dialogue), wrap the model inference engine (e.g., vLLM, TGI) as a Java-callable SDK (via JNI or gRPC), or use a message queue (Kafka) for asynchronous interaction (e.g., the customer-service system receives a user question → pushes it to the model service → returns the result).
Option 1: Wrap the inference engine as a Java SDK (gRPC) -- real-time, high-performance scenarios
text
+----------------------+        +------------------------+        +--------------------------+
| Java business system |        | Model SDK (Java)       |        | Model inference cluster  |
| (real-time dialogue) |  gRPC  | (gRPC stub + wrapper)  |  gRPC  | (vLLM/TGI + GPU nodes)   |
| (Spring Boot)        |◄──────►| (pool/breaker/metrics) |◄──────►| (K8s StatefulSet)        |
+----------------------+        +------------------------+        +--------------------------+
                                                                                ▲
                                                                                │ Service discovery (Consul/Nacos)
                                                                   +------------------------------+
                                                                   | Load balancing (K8s Service) |
                                                                   +------------------------------+
①、Deploy the inference service (vLLM/TGI) in the cluster
Choose an inference engine: prefer vLLM (PagedAttention memory optimization, continuous batching and streaming output), or TGI (Hugging Face Text Generation Inference, native gRPC interface).
- vLLM deployment (exposing a gRPC interface)
bash
# Start the vLLM server (Mistral-7B model, gRPC port 50051, quantized inference)
# --quantization awq: 4-bit quantization, GPU memory usage ~8 GB
# --tensor-parallel-size 2: 2-GPU tensor parallelism (if multiple GPUs are available)
python -m vllm.entrypoints.grpc.server \
  --model mistralai/Mistral-7B-Instruct-v0.1 \
  --quantization awq \
  --port 50051 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2
- TGI deployment (native gRPC)
bash
docker run -p 50051:80 -v /path/to/models:/data ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.1 \
  --grpc-port 80   # gRPC port inside the container, mapped to 50051 on the host
Verify the service: test the interface with a gRPC client tool such as grpcurl
bash
grpcurl -plaintext -d '{"inputs":"Hello"}' localhost:50051 text_generation.InferenceService/Generate
②、Define the gRPC service interface (using TGI as an example)
protobuf
syntax = "proto3";
package text_generation;

service InferenceService {
  rpc Generate(GenerateRequest) returns (GenerateResponse) {}                     // synchronous call
  rpc GenerateStream(GenerateRequest) returns (stream GenerateStreamResponse) {}  // streaming call
}

message GenerateRequest {
  string inputs = 1;          // prompt text
  Parameters parameters = 2;  // inference parameters (temperature, max tokens, ...)
  bool stream = 3;            // whether to stream the output
}

message Parameters {
  float temperature = 1;               // randomness (0~1)
  float top_p = 2;                     // nucleus sampling (0~1)
  int32 max_new_tokens = 3;            // maximum number of generated tokens
  repeated string stop_sequences = 4;  // stop sequences
}

message GenerateResponse {
  string generated_text = 1;  // full generated text
  Usage usage = 2;            // token usage
}

message GenerateStreamResponse {
  string text = 1;  // streamed chunk (token/sentence at a time)
  bool stop = 2;    // whether generation has finished
}

message Usage {
  int32 input_tokens = 1;
  int32 output_tokens = 2;
}
③、Develop the Java SDK (wrapping the gRPC client)
On the Java side, call the inference service through the gRPC stub and wrap it as an easy-to-use SDK that hides the low-level details.
xml
<dependencies>
<!-- gRPC -->
<dependency>
<groupId>io.grpc</groupId>
<artifactId>grpc-netty-shaded</artifactId>
<version>1.58.0</version>
</dependency>
<dependency>
<groupId>io.grpc</groupId>
<artifactId>grpc-protobuf</artifactId>
<version>1.58.0</version>
</dependency>
<dependency>
<groupId>io.grpc</groupId>
<artifactId>grpc-stub</artifactId>
<version>1.58.0</version>
</dependency>
<!-- Circuit breaker / retry -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-grpc</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;
import text_generation.*;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class LLMSDK {
    private final ManagedChannel channel;
    private final InferenceServiceGrpc.InferenceServiceBlockingStub blockingStub; // synchronous calls
    private final InferenceServiceGrpc.InferenceServiceStub asyncStub;            // streaming calls

    // Initialize the connection (the target address can come from service discovery)
    public LLMSDK(String target) {
        this.channel = ManagedChannelBuilder.forTarget(target)
                .usePlaintext()                          // use TLS in production
                .maxInboundMessageSize(10 * 1024 * 1024) // 10 MB maximum message size
                .keepAliveTime(30, TimeUnit.SECONDS)     // keep-alive
                .build();
        this.blockingStub = InferenceServiceGrpc.newBlockingStub(channel);
        this.asyncStub = InferenceServiceGrpc.newStub(channel);
    }

    // Synchronous (non-streaming) call
    public String generateSync(String prompt, Parameters params) {
        GenerateRequest request = GenerateRequest.newBuilder()
                .setInputs(prompt)
                .setParameters(params)
                .setStream(false)
                .build();
        GenerateResponse response = blockingStub.generate(request);
        return response.getGeneratedText();
    }

    // Asynchronous streaming call (token-by-token display for real-time dialogue)
    public void generateStream(String prompt, Parameters params, Consumer<String> chunkHandler, Runnable onComplete) {
        GenerateRequest request = GenerateRequest.newBuilder()
                .setInputs(prompt)
                .setParameters(params)
                .setStream(true)
                .build();
        asyncStub.generateStream(request, new StreamObserver<GenerateStreamResponse>() {
            @Override
            public void onNext(GenerateStreamResponse value) {
                chunkHandler.accept(value.getText()); // callback per chunk (e.g., token by token)
            }
            @Override
            public void onError(Throwable t) { /* error handling */ }
            @Override
            public void onCompleted() { onComplete.run(); } // completion callback
        });
    }

    // Close the connection
    public void shutdown() throws InterruptedException {
        channel.shutdown().awaitTermination(5, TimeUnit.SECONDS);
    }
}
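To let Spring-managed services such as the RealtimeDialogueService below inject the SDK, it can be registered as a singleton bean. A minimal sketch; the property name llm.sdk.target and the configuration class name are assumptions:
java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical configuration class registering LLMSDK as a Spring bean
@Configuration
public class LlmSdkConfig {

    // e.g. "localhost:50051" or a service-discovery target such as "dns:///llm-inference:50051"
    @Value("${llm.sdk.target:localhost:50051}")
    private String target;

    @Bean(destroyMethod = "shutdown") // closes the gRPC channel on application shutdown
    public LLMSDK llmSdk() {
        return new LLMSDK(target);
    }
}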
④、Integrate the SDK in the business system
Flow: user asks a question → build the prompt (including conversation history) → streaming call via the SDK → push results to the frontend over WebSocket
java
@Service
@RequiredArgsConstructor
public class RealtimeDialogueService {
    private final LLMSDK llmSdk;                     // injected SDK
    private final WebSocketHandler webSocketHandler; // WebSocket push

    // Handle a real-time user dialogue request
    public void handleUserQuery(String sessionId, String userMessage) {
        // 1. Build the prompt (conversation history omitted in this simplified example)
        String prompt = "User: " + userMessage + "\nAssistant: ";

        // 2. Configure inference parameters (streaming is set by the SDK's generateStream call)
        Parameters params = Parameters.newBuilder()
                .setTemperature(0.7f)
                .setMaxNewTokens(200)
                .build();

        // 3. Call the SDK's streaming interface
        llmSdk.generateStream(
                prompt,
                params,
                chunk -> webSocketHandler.push(sessionId, chunk), // push each chunk to the frontend
                () -> webSocketHandler.push(sessionId, "[DONE]")  // completion marker
        );
    }
}
Option 2: Asynchronous interaction via a message queue (Kafka)
Scenario: an e-commerce customer-service system receives 100,000+ tickets per day and must automatically classify them (logistics / after-sales / inquiry) and generate a draft reply. Requirements: do not block user submission, minute-level latency is acceptable, and later human review must be supported.
text
+-------------------+        +----------------+        +----------------------------------+
|  User / frontend  |        |  Kafka topic   |        | LLM inference service (consumer) |
|  (submit ticket)  |------->|  ticket-req    |------->| (vLLM/TGI + business logic)      |
+-------------------+        +----------------+        +----------------------------------+
                                                                        |
                                                                        v
                                                        +----------------------------------+
                                                        |           Kafka topic            |
                                                        |           ticket-resp            |
                                                        +----------------------------------+
                                                                        |
                                                                        v
                                                        +----------------------------------+
                                                        |   Result processor (consumer)    |
                                                        |  (update DB / notify an agent)   |
                                                        +----------------------------------+
①、Design the Kafka message flow
- ticket-req: ticket request topic (partitions = number of inference instances × 2, to guarantee parallelism)
- ticket-resp: ticket result topic (partitioned by ticketId hash, to keep per-ticket ordering)
- ticket-dlq: dead-letter topic (for tickets that still fail after retries); a topic-declaration sketch follows this list
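A possible way to declare these topics with Spring Kafka's TopicBuilder; the partition count assumes 3 inference instances (3 × 2 = 6) and the replication factor of 3 assumes a 3-broker cluster, neither of which is fixed by the text above:
java
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

// Hypothetical topic declarations for the three topics described above
@Configuration
public class TicketTopicConfig {

    @Bean
    public NewTopic ticketReq() {
        return TopicBuilder.name("ticket-req").partitions(6).replicas(3).build();
    }

    @Bean
    public NewTopic ticketResp() {
        return TopicBuilder.name("ticket-resp").partitions(6).replicas(3).build();
    }

    @Bean
    public NewTopic ticketDlq() {
        return TopicBuilder.name("ticket-dlq").partitions(6).replicas(3).build();
    }
}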
Message format
json
// Request message (ticket-req)
{
  "type": "record",
  "name": "TicketRequest",
  "fields": [
    {"name": "ticketId",  "type": "string"},                                // ticket ID (UUID)
    {"name": "content",   "type": "string"},                                // ticket content
    {"name": "timestamp", "type": "long"},                                  // submission time
    {"name": "priority",  "type": ["null", "string"], "default": "medium"}  // priority
  ]
}
// Response message (ticket-resp)
{
  "type": "record",
  "name": "TicketResponse",
  "fields": [
    {"name": "ticketId", "type": "string"},
    {"name": "category", "type": "string"},                                 // classification result (logistics / after-sales / ...)
    {"name": "reply",    "type": "string"},                                 // generated reply
    {"name": "status",   "type": "string"},                                 // SUCCESS / FAILED
    {"name": "errorMsg", "type": ["null", "string"], "default": null}
  ]
}
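The TicketRequest / TicketResponse classes used by the producer and consumer below mirror these schemas. A plain-POJO sketch (the original may use Avro-generated classes instead):
java
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// Request message carried on ticket-req
@Data
@NoArgsConstructor
@AllArgsConstructor
public class TicketRequest {
    private String ticketId; // ticket ID (UUID)
    private String content;  // ticket text
    private long timestamp;  // submission time (epoch millis)
    private String priority; // priority, default "medium"
}

// Result message carried on ticket-resp
@Data
@NoArgsConstructor
@AllArgsConstructor
public class TicketResponse {
    private String ticketId;
    private String category; // classification result (logistics / after-sales / inquiry)
    private String reply;    // generated draft reply
    private String status;   // SUCCESS / FAILED
    private String errorMsg; // null on success
}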
②、Java business system (producer): publish ticket requests
java
@Slf4j
@Service
@RequiredArgsConstructor
public class TicketProducer {
    private final KafkaTemplate<String, TicketRequest> kafkaTemplate;

    // Publish a ticket request to Kafka
    public void sendTicketRequest(TicketRequest request) {
        // Partition key: hash of ticketId (keeps messages for the same ticket ordered)
        String key = String.valueOf(request.getTicketId().hashCode());
        // Send the message
        kafkaTemplate.send("ticket-req", key, request)
                .addCallback(
                        result -> log.info("Ticket {} sent successfully", request.getTicketId()),
                        ex -> log.error("Failed to send ticket {}", request.getTicketId(), ex)
                );
    }
}
③、LLM inference service (consumer): process tickets and produce results
Deployment: a Kubernetes Deployment; each Pod is a member of the Kafka consumer group and consumes the ticket-req topic.
java
@Slf4j
@Service
@RequiredArgsConstructor
public class TicketConsumer {
    private final KafkaTemplate<String, TicketResponse> respTemplate;
    private final LLMClient llmClient;

    @KafkaListener(topics = "ticket-req", groupId = "model-worker-group")
    public void consume(ConsumerRecord<String, TicketRequest> record, Acknowledgment ack) {
        try {
            TicketRequest req = record.value();
            // 1. Build the prompt (classification + draft reply)
            String prompt = String.format(
                    "Classify this ticket: %s%nCategories: logistics, after-sales, inquiry%nThen draft a reply:",
                    req.getContent());
            // 2. Call the model (synchronous, non-streaming)
            String result = llmClient.generateSync(prompt, Parameters.getDefaultInstance());
            // 3. Parse the result (simplified: assume the model returns "Category: logistics; Reply: your parcel is expected...")
            TicketResponse resp = parseResult(req.getTicketId(), result);
            // 4. Send to the result topic
            respTemplate.send("ticket-resp", req.getTicketId(), resp);
            // 5. Commit the offset manually (so messages are not lost)
            ack.acknowledge();
        } catch (Exception e) {
            log.error("Failed to process ticket {}", record.key(), e);
            // Rethrow so the container retries 3 times and then routes the record to the dead-letter topic
            // (see the listener configuration sketch below)
            throw new RuntimeException("Ticket processing failed: " + record.key(), e);
        }
    }
}
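The consumer above relies on manual offset commits and a retry-then-dead-letter policy. A sketch of the corresponding Spring Kafka listener configuration; the bean names, the 1-second retry interval, and the assumption that a suitably typed ConsumerFactory bean exists are not part of the original text:
java
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Hypothetical listener configuration: manual acks + 3 retries, then publish to ticket-dlq
@Configuration
public class TicketConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, TicketRequest> kafkaListenerContainerFactory(
            // assumes a ConsumerFactory<String, TicketRequest> bean (e.g. built with JsonDeserializer) is defined elsewhere
            ConsumerFactory<String, TicketRequest> consumerFactory,
            KafkaTemplate<Object, Object> dlqTemplate) {

        ConcurrentKafkaListenerContainerFactory<String, TicketRequest> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);

        // Manual ack mode so the listener's Acknowledgment.acknowledge() controls offset commits
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);

        // After 3 failed attempts (1s apart), forward the record to ticket-dlq
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
                dlqTemplate, (record, ex) -> new TopicPartition("ticket-dlq", record.partition()));
        factory.setCommonErrorHandler(new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L)));

        return factory;
    }
}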
④、Result handling (consumer): update status and notify
java
@Service
@RequiredArgsConstructor
public class ResultProcessor {
    private final TicketRepository ticketRepo;             // database access
    private final NotificationService notificationService; // notifies human agents

    @KafkaListener(topics = "ticket-resp", groupId = "result-processor-group")
    public void processResult(ConsumerRecord<String, TicketResponse> record) {
        TicketResponse resp = record.value();
        // 1. Update the ticket status
        ticketRepo.updateStatus(resp.getTicketId(), resp.getStatus(), resp.getCategory(), resp.getReply());
        // 2. Notify a human agent (for review, if needed)
        if ("SUCCESS".equals(resp.getStatus())) {
            notificationService.notifyAgent(resp.getTicketId());
        }
    }
}