FastAPI + Ollama in Practice: Building an AI Assistant That Can Check the Weather

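📝 Summary: Over Spring Festival, a reader left this comment on my WeChat official account: "Can you build an AI assistant that can chat about the weather, look up traffic restrictions, and run as a private deployment?" Don't panic! These notes use FastAPI + Ollama + an open-source model, plus a little fine-tuning on weather data, to get a demo running that answers weather questions in conversation. The tone is casual, the code and my pitfall log are included, and everything is meant to be adapted straight into your own project.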

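🚨 1. The Comment on the Official Account

I stared at my phone while a thousand thoughts raced through my head: train from scratch? No GPUs. Call a hosted API? The data leaves the premises. Use an open-source model? Then how does it learn about real-time weather?

Does this sound familiar? Don't ask how I know; I went through it just last week. After a few days of tinkering I finally got it working, so today I'll take this FastAPI + Ollama + fine-tuning combo apart for you. Copy the code, tweak it, and it should work.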

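🗺️ 2. Draw the Map First: Three Things to Do

🔹 First: run an open-source LLM (for example qwen2.5:3b) with Ollama so it can hold a conversation.

🔹 Second: give the model a "toolkit": call a weather API through FastAPI for real-time lookups.

🔹 Third: LoRA fine-tune on the past year's weather data so the model picks up the tone of weather Q&A.

🔹 Bonus: start everything with one Docker Compose command, so handing it to ops doesn't start a fight.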

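⚙️ 3. The Foundation: A Minimal Ollama + FastAPI Setup

I like to call Ollama "the Docker of LLMs": one command pulls a model, one command starts the service. Let's use it to run a lightweight model first: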

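# Install Ollama (this script targets Linux; macOS and Windows have installers on ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Start the service on the default port 11434
# (skip this if the installer already runs it in the background)
ollama serve

# In another terminal, pull a 3B model: capable enough and easy on the GPU
ollama pull qwen2.5:3b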

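Next comes FastAPI. It acts like a smart switchboard operator: it forwards the user's question to the model and wraps the model's answer as an API. Start with the simplest version: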

# main.py
from fastapi import FastAPI
from pydantic import BaseModel
import httpx

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    # Generation can easily take longer than httpx's default 5-second timeout
    async with httpx.AsyncClient(timeout=60) as client:
        payload = {
            "model": "qwen2.5:3b",
            "prompt": req.message,
            "stream": False,  # one JSON object instead of streamed chunks
        }
        resp = await client.post(OLLAMA_URL, json=payload)
        resp.raise_for_status()
        return {"reply": resp.json()["response"]}

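⚠️ Pitfall warning: the first time I wrote this I forgot "stream": False. Ollama streams by default, so I got back a pile of streamed chunks and the front end crashed. For a simple demo, keep streaming off.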
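To make that pitfall concrete: with streaming on, /api/generate answers with one JSON object per line, each carrying a fragment in "response" and a "done" flag on the last one. If you do need streaming later, a minimal sketch for stitching the fragments back together could look like this. The sample lines are made up for illustration, and join_stream_chunks is a hypothetical helper name, not something Ollama ships.

```python
import json

def join_stream_chunks(lines):
    """Concatenate the "response" fragments of a streamed /api/generate body."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the answer
    return "".join(parts)

# Made-up sample of what a streamed body looks like, one JSON object per line
sample = [
    '{"model": "qwen2.5:3b", "response": "Hel", "done": false}',
    '{"model": "qwen2.5:3b", "response": "lo", "done": false}',
    '{"model": "qwen2.5:3b", "response": "", "done": true}',
]
print(join_stream_chunks(sample))  # Hello
```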

🌤️ 4. Teaching the Model to "Check the Weather": Tool Calling + Fine-Tuning as a Double Safety Net

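Right now the model can only make small talk. It needs to learn that when a user asks about the weather, an external API has to be called. There are two ways to get there:
🔸 Option A: function calling. The model outputs a specific format, and we parse it and call the API. Simple and direct, good for quick validation.

🔸 Option B: fine-tuning. Train the model on a batch of weather Q&A data so it internalizes the "check the weather" action. The result feels more natural, but it needs data.

I tried both and ended up with a "fine-tuning + tool calling" hybrid: fine-tuning makes the model more proactive about asking for the location, and tool calling keeps the data live. Here's how the tool-calling part looks: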

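# Describe the tool in the prompt sent to Ollama. The prompt stays in Chinese, the
# assistant's language: it tells the model to emit JSON query parameters such as
# {"city": "北京"} for weather questions, then to phrase the API result naturally.
SYSTEM_PROMPT = """
你是一个天气助手。当用户询问天气时，你必须输出JSON格式的查询参数，例如：{"city": "北京"}。我会根据这个参数去调用天气API，然后把结果给你，你再生成自然语言回复。
"""
# Then add the tool-calling logic to the chat endpoint (outline only)
@app.post("/chat")
async def chat(req: ChatRequest):
    # Call Ollama first and let it decide whether a weather lookup is needed
    # ... repeated code omitted
    # Parse the reply; if it contains JSON, call the weather API
    # Then append the weather data to the prompt so the model writes the final answer
    ...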

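The full code is a bit long; the point is to understand the logic. The parsing details and the system prompt can be tuned further against the data you actually get back. One crash story: at first I let the model decide on its own whether to check the weather, and it kept ignoring the output format. It only behaved after I added few-shot examples.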
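Since few-shot examples were what finally fixed the format problem, here is a rough sketch of how such a prompt could be assembled. The example pairs and the build_few_shot_prompt helper are my own illustration, not the exact ones from the project; the 用户：/助手： turn markers simply mirror the format used elsewhere in this post.

```python
import json

# Hypothetical few-shot examples: (user message, expected model output).
# A dict means "emit a tool call"; a string means "answer in plain text".
FEW_SHOT_EXAMPLES = [
    ("北京今天天气怎么样？", {"city": "北京"}),
    ("上海下雨了吗？", {"city": "上海"}),
    ("今天天气如何？", "请问您想查询哪个城市的天气？"),
    ("你好", "你好！我可以帮你查询天气。"),
]

def build_few_shot_prompt(user_message: str) -> str:
    """Prefix the real question with worked examples of the expected format."""
    lines = []
    for question, answer in FEW_SHOT_EXAMPLES:
        if isinstance(answer, dict):
            # Tool calls are rendered as JSON, keeping Chinese characters readable
            rendered = json.dumps(answer, ensure_ascii=False)
        else:
            rendered = answer
        lines.append(f"用户：{question}\n助手：{rendered}")
    # Leave the last assistant turn open for the model to complete
    lines.append(f"用户：{user_message}\n助手：")
    return "\n".join(lines)

print(build_few_shot_prompt("广州热吗？"))
```

Mixing tool-call examples with a clarifying question and a plain greeting is deliberate: it shows the model when not to emit JSON as well as when to.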

# Shell: install the dependencies
pip install fastapi uvicorn httpx pydantic

# main.py
# Prompt and reply strings stay in Chinese because the assistant converses in Chinese.
import json
import os
import re
from typing import Dict, List, Optional

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Ollama base URL (inside Docker Compose this would be http://ollama:11434)
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434").rstrip("/") + "/api/generate"
# Model name (switch to weather-assistant after fine-tuning)
MODEL_NAME = "qwen2.5:3b"

# Weather API settings (QWeather as the example; register to get your own key)
WEATHER_API_KEY = "your_weather_api_key"
WEATHER_API_URL = "https://api.qweather.com/v7/weather/now"

# Per-session conversation history (in memory for simplicity; use Redis in production)
conversation_history: Dict[str, List[dict]] = {}

class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None  # distinguishes separate conversations

class ChatResponse(BaseModel):
    reply: str
    used_tool: bool = False

def extract_json(text: str) -> Optional[dict]:
    """Extract the first JSON object from the model's reply."""
    decoder = json.JSONDecoder()
    # Try each "{" as a starting point; a greedy regex would instead swallow
    # everything up to the last "}" and fail on replies with trailing text
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None

async def call_weather_api(city: str) -> Optional[str]:
    """Call the weather API for current conditions."""
    # NOTE: QWeather documents `location` as a LocationID or "lon,lat" coordinates,
    # so a plain city name may first need resolving through its GeoAPI city lookup.
    params = {
        "location": city,
        "key": WEATHER_API_KEY
    }
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get(WEATHER_API_URL, params=params, timeout=10)
            data = resp.json()
            if data.get("code") == "200":
                now = data["now"]
                return f"{city} 当前天气：{now['text']}，温度 {now['temp']}℃，体感温度 {now['feelsLike']}℃。"
            else:
                return None
        except Exception as e:
            print(f"Weather API call failed: {e}")
            return None

async def call_ollama(prompt: str, system: Optional[str] = None) -> str:
    """Call Ollama to generate a reply."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 500  # Ollama's option name for the max-token limit
        }
    }
    if system:
        payload["system"] = system  # only send the field when a system prompt is set
    async with httpx.AsyncClient() as client:
        resp = await client.post(OLLAMA_URL, json=payload, timeout=60)
        if resp.status_code == 200:
            return resp.json()["response"]
        else:
            raise HTTPException(status_code=500, detail="Ollama service error")

@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(req: ChatRequest):
    # Build a prompt with context from the last 5 turns of this session
    session_history = conversation_history.setdefault(req.session_id or "default", [])
    context = ""
    for turn in session_history[-5:]:
        context += f"用户：{turn['user']}\n助手：{turn['assistant']}\n"

    # The system prompt requires a JSON object like {"city": "城市名"} for weather
    # questions (this triggers the tool), allows asking back when no city is given,
    # and asks for normal conversation otherwise
    system_prompt = """你是一个天气助手，能够通过调用工具获取实时天气。
当用户询问天气时，你必须先输出一个 JSON 对象，格式为 {"city": "城市名"}，然后再输出自然语言回复。
如果用户没有明确城市，你可以反问用户。
对于其他问题，正常对话即可。"""

    full_prompt = f"{context}用户：{req.message}\n助手："

    # First pass through Ollama
    raw_reply = await call_ollama(full_prompt, system=system_prompt)

    # Try to extract the tool call
    tool_call = extract_json(raw_reply)
    used_tool = False

    if tool_call and "city" in tool_call:
        # Call the weather API
        city = tool_call["city"]
        weather_info = await call_weather_api(city)
        if weather_info:
            # Feed the weather data back so the model writes the final answer
            second_prompt = f"{full_prompt}（工具返回：{weather_info}）\n请根据这个信息生成自然语言回复。"
            final_reply = await call_ollama(second_prompt, system=system_prompt)
            used_tool = True
            # Strip any leftover JSON fragments
            final_reply = re.sub(r'\{.*?\}', '', final_reply, flags=re.DOTALL).strip()
        else:
            final_reply = f"抱歉，查询 {city} 的天气失败了，请稍后再试。"
    else:
        final_reply = raw_reply

    # Save the turn in this session's history
    session_history.append({
        "user": req.message,
        "assistant": final_reply
    })

    return ChatResponse(reply=final_reply, used_tool=used_tool)

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
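Before wiring up a real model and a weather key, it can help to check the two-pass flow on its own. The sketch below is not the endpoint itself: answer, fake_llm, and fake_weather are hypothetical stand-ins I made up so the logic runs with no Ollama server and no network, and the JSON detection is deliberately simpler than in the app.

```python
import asyncio
import json

async def answer(message, llm, weather):
    """Two-pass flow: ask the model, run the tool if it asked for one, ask again."""
    first = await llm(message)
    try:
        call = json.loads(first)
    except json.JSONDecodeError:
        return first, False  # plain chat, no tool call requested
    if not isinstance(call, dict) or "city" not in call:
        return first, False
    info = await weather(call["city"])
    if info is None:
        return f"抱歉，查询 {call['city']} 的天气失败了，请稍后再试。", False
    # Second pass: hand the tool result back to the model
    final = await llm(f"{message}（工具返回：{info}）")
    return final, True

async def fake_llm(prompt):
    # Stand-in for Ollama: reacts to keywords instead of generating text
    if "工具返回" in prompt:
        return "北京现在晴，5℃。"
    if "天气" in prompt:
        return '{"city": "北京"}'
    return "你好！"

async def fake_weather(city):
    # Stand-in for the weather API
    return f"{city} 当前天气：晴，温度 5℃"

print(asyncio.run(answer("北京天气怎么样", fake_llm, fake_weather)))  # ('北京现在晴，5℃。', True)
print(asyncio.run(answer("你好", fake_llm, fake_weather)))  # ('你好！', False)
```

Passing the model and the weather lookup in as arguments is just a convenience for testing; the real endpoint calls them directly.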

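📊 5. Add Some Data for Fine-Tuning: Make the Model More "In the Know"

Fine-tuning sounds fancy, but with a low-cost technique like LoRA, a small dataset is enough to see an effect. For example, take the past year's weather records (city, temperature, conditions) and build 500 Q&A pairs. The format looks roughly like this (the samples are in Chinese, the language the assistant answers in):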

{"instruction": "北京今天天气怎么样？", "output": "北京今天晴，气温-2℃到8℃，北风3级，空气质量优。建议穿羽绒服。"}
{"instruction": "上海下雨了吗？", "output": "上海目前小雨，气温10℃到12℃，东北风2级。出门记得带伞。"}
{"instruction": "广州明天适合穿什么？", "output": "广州明天多云，22℃到28℃，建议穿短袖+薄外套。"}
{"instruction": "帮我查查成都的限行", "output": "成都今天限行尾号3和8，限行时间7:30-20:00。"}

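Then LoRA fine-tune qwen2.5 with the unsloth library. The key code:

(The main reason I stayed away from LLMs for so long is how much hardware and time they eat. The training parameters below still need adjusting to your actual needs and are for reference only.)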

# Shell: install the dependencies (trl provides SFTTrainer)
pip install unsloth transformers datasets torch accelerate trl

# train.py
from unsloth import FastLanguageModel  # import unsloth before transformers/trl
import json
from datasets import Dataset
import torch
from transformers import TrainingArguments
from trl import SFTTrainer

# 1. Load the base model
# Ollama's qwen2.5:3b tag serves the instruct variant, so tune the matching checkpoint
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=512,
    dtype=None,  # choose automatically
    load_in_4bit=True,  # save GPU memory
)

# 2. Add the LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=42,
    max_seq_length=512,
)

# 3. Prepare the data
def format_instruction(example):
    # Same 用户：/助手： turn format the FastAPI app uses at inference time
    return {
        "text": f"用户：{example['instruction']}\n助手：{example['output']}"
    }

# Read as UTF-8 explicitly; the file contains Chinese text
with open("weather_train.jsonl", "r", encoding="utf-8") as f:
    raw_data = [json.loads(line) for line in f if line.strip()]

dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_instruction)

# 4. Configure training
# NOTE: these keyword arguments match older trl releases; newer ones move
# dataset_text_field and max_seq_length into SFTConfig
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
        save_strategy="epoch",
    ),
)

# 5. Train
trainer.train()

# 6. Save the LoRA weights
model.save_pretrained("lora-weather")
tokenizer.save_pretrained("lora-weather")
print("Fine-tuning done; weights saved in the lora-weather directory")

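Note: after fine-tuning, the LoRA weights must either be merged into the base model or imported through an Ollama Modelfile. The Modelfile's ADAPTER instruction can load a Safetensors adapter directly, but which architectures it accepts depends on the Ollama version; if it rejects the Qwen adapter, fall back to merging the weights and importing the merged model. A Modelfile looks like this: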

FROM qwen2.5:3b
# Attach the LoRA weights
ADAPTER ./lora-weather

Then create a new model with ollama, so that Ollama serves the fine-tuned model.

ollama create weather-assistant -f Modelfile

Verify it with a quick test run. Once the output looks right, change MODEL_NAME in the FastAPI app to weather-assistant to use the fine-tuned model:

ollama run weather-assistant "北京今天天气怎么样？"

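🐳 6. Containerize: One-Command Deployment, No More "It Works on My Machine"

It runs locally and now has to go to ops? Containerize it! One docker-compose file puts FastAPI and Ollama together: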

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ./ollama:/root/.ollama
    ports:
      - "11434:11434"
    command: serve
  
  fastapi:
    build: .
    ports:
      - "8000:8000"
    environment:
      # Base URL of the Ollama service; the app has to read it from the environment
      - OLLAMA_URL=http://ollama:11434
    depends_on:
      - ollama

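The fine-tuned model has to live on the Ollama side, not in the FastAPI image: put the Modelfile and the LoRA weights where the ollama container can see them (for example in the mounted ./ollama directory) and run ollama create inside that container at startup. One big pitfall: a fresh Ollama container has no models and has to download them, so on a slow network run ollama pull ahead of time and mount that directory.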
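To catch the "model isn't actually there" case early, the FastAPI container can ask Ollama which models it has before accepting traffic. Ollama lists local models at /api/tags; the sketch below assumes the response shape {"models": [{"name": ...}]} and the Compose hostname ollama, and has_model and fetch_tags are helper names I made up.

```python
import json
import urllib.request

def has_model(tags: dict, name: str) -> bool:
    """Return True if `name` appears in an /api/tags response body."""
    # Models created without an explicit tag are listed as "<name>:latest"
    wanted = name if ":" in name else f"{name}:latest"
    return any(m.get("name") == wanted for m in tags.get("models", []))

def fetch_tags(base_url: str = "http://ollama:11434") -> dict:
    """Fetch the model list from a running Ollama service."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)

# Made-up response body, so the check can be tried without a running service
sample = {"models": [{"name": "qwen2.5:3b"}, {"name": "weather-assistant:latest"}]}
print(has_model(sample, "weather-assistant"))  # True
print(has_model(sample, "llama3"))  # False
```

In the container you would call has_model(fetch_tags(), "weather-assistant") at startup and fail fast, or trigger ollama create, when it returns False.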

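🧠 7. A Few Further Thoughts (So You Don't Hit the Same Pitfalls)

🔸 Fine-tuning is optional: if all you need is simple weather lookups, a good prompt plus tool calling is plenty. Fine-tuning pays off when the model has to learn a complex conversational style or domain terminology.

🔸 Data quality > quantity: 100 high-quality examples can fine-tune better than 500 noisy ones. Clean your data, and make sure colloquial phrasings like "今天天气咋样" ("how's the weather today") are covered.

🔸 Conversation context management: don't dump the whole history into the model. A sliding window over the last few turns is enough; otherwise you can easily overflow the context window.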

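🔸 Traffic restrictions and outfit suggestions: same playbook. Line up the corresponding API (check what providers such as AMap or QWeather offer for restriction data), and call it once the model has recognized the intent.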
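For the sliding-window point above, a deque with maxlen does the trimming for you. This is a minimal in-memory sketch: the window size of 5, the histories store, and the helper names are assumptions for illustration, and a real deployment would likely keep this in Redis instead.

```python
from collections import defaultdict, deque

MAX_TURNS = 5  # assumed window size

# session_id -> most recent turns; deque(maxlen=...) drops the oldest automatically
histories = defaultdict(lambda: deque(maxlen=MAX_TURNS))

def remember(session_id, user, assistant):
    """Record one finished turn for a session."""
    histories[session_id].append({"user": user, "assistant": assistant})

def build_context(session_id):
    """Render the retained turns in the 用户：/助手： prompt format."""
    return "".join(
        f"用户：{t['user']}\n助手：{t['assistant']}\n" for t in histories[session_id]
    )

# Eight turns go in, only the last five stay
for i in range(8):
    remember("s1", f"q{i}", f"a{i}")

print(len(histories["s1"]))  # 5
print(build_context("s1").splitlines()[0])  # 用户：q3
```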

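👭 A Few Last Words

The best way to learn is to run the code yourself and fix the problems one at a time. If you get stuck, leave a comment and I'll reply when I see it. After all, if programmers don't help programmers, who will?

💡 Oh, and don't forget to follow and bookmark. Next time we'll talk about something even more fun.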