前端转agent-【python】-13 Ollama Python流式输出教程：stream=True 与 async 实践

从"打字机"到"实时对话"：Python流式输出与Ollama实战

用JS/TS的视角，理解Python中的流式输出

流式输出是什么？------用JS/TS来理解

如果你写过前端，一定见过类似这样的场景：

typescript 复制代码

// 前端使用SSE接收流式数据
const eventSource = new EventSource('/api/chat');
eventSource.onmessage = (event) => {
    const chunk = JSON.parse(event.data);
    // 逐字追加到页面
    messageElement.textContent += chunk.content;
};

这就是流式输出 的核心思想：数据不是一次性全部返回，而是一块一块（chunk）地推送给客户端。

在Node.js后端，你可能这样处理：

typescript 复制代码

// Node.js 流式响应
const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    body: JSON.stringify({ model: 'qwen3:4b', messages: [...] })
});

for await (const chunk of response.body) {
    // 逐块处理数据
    processChunk(chunk);
}

Python里的流式输出，本质上做的是同一件事------只是语法不同，底层逻辑完全一致。

基础流式输出：最简单的"打字机"效果

python 复制代码

# basic_stream.py
from ollama import chat

# stream=True 开启流式输出
stream = chat(
    model='qwen3:4b',
    messages=[{'role': 'user', 'content': '用一句话解释什么是递归'}],
    stream=True,
)

# 逐块打印，end='' 不换行，flush=True 立即输出
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

运行这段代码，你会看到文字像打字机一样一个字一个字地蹦出来------而不是等待几秒钟后一次性全部显示。

🆚 JS/TS 横向对比

如果用Node.js + Ollama的JS SDK，写法是这样的：

javascript 复制代码

# basic_stream.js
import { chat } from 'ollama';

const stream = await chat({
    model: 'qwen3:4b',
    messages: [{ role: 'user', content: '用一句话解释什么是递归' }],
    stream: true,
});

for await (const chunk of stream) {
    process.stdout.write(chunk.message.content);
}

看到了吗？ 两段代码的结构几乎是一模一样的：

都通过 stream: true / stream=True 开启流式
都通过迭代器逐块获取数据
区别只是语法糖不同（Python的for...in vs JS的for await...of）

多轮对话实战：保持上下文

流式输出真正的价值在于实时对话体验。下面我们实现一个带上下文的多轮对话：

python 复制代码

# multi_turn_chat.py
from ollama import chat

def streaming_chat(messages):
    """流式对话函数，实时打印模型回复"""
    stream = chat(
        model='qwen3:4b',
        messages=messages,
        stream=True,
    )
    
    print("🤖: ", end='', flush=True)
    full_response = ""
    for chunk in stream:
        content = chunk['message']['content']
        print(content, end='', flush=True)
        full_response += content
    print("\n")
    return full_response

# 对话历史
conversation = []

while True:
    user_input = input("👤 你: ")
    if user_input.lower() in ['exit', 'quit']:
        break
    
    # 将用户消息加入历史
    conversation.append({'role': 'user', 'content': user_input})
    
    # 获取流式回复
    response = streaming_chat(conversation)
    
    # 将助手回复加入历史
    conversation.append({'role': 'assistant', 'content': response})

🆚 JS/TS 横向对比

在TypeScript中，同样的逻辑是这样的：

typescript 复制代码

# multi_turn_chat.ts
import { chat } from 'ollama';

async function streamingChat(messages: any[]) {
    const stream = await chat({
        model: 'qwen3:4b',
        messages,
        stream: true,
    });
    
    process.stdout.write('🤖: ');
    let fullResponse = '';
    for await (const chunk of stream) {
        const content = chunk.message.content;
        process.stdout.write(content);
        fullResponse += content;
    }
    console.log('\n');
    return fullResponse;
}

const conversation: any[] = [];
// 省略readline循环，实际用法类似

核心差异：

Python用 for chunk in stream 同步迭代，因为 chat() 返回的是一个生成器（Generator）
JS/TS需要用 for await...of 异步迭代，因为 chat() 返回的是一个AsyncGenerator
这背后的原因是：Python的Ollama客户端默认是同步的，而JS客户端默认是异步的

进阶：处理"思考过程"（Thinking）

Qwen3系列模型支持思考模式（Thinking Mode） ------模型在给出最终答案之前，会先输出一段推理过程。

如果你想让用户看到模型的"思考过程"，可以这样处理：

python 复制代码

# thinking_stream.py
from ollama import chat

stream = chat(
    model='qwen3:4b',
    messages=[{'role': 'user', 'content': '鸡兔同笼，头共35个，脚共94只，鸡和兔各多少？'}],
    stream=True,
)

in_thinking = False
thinking = ''
content = ''

for chunk in stream:
    # 检测是否有思考内容
    if hasattr(chunk['message'], 'thinking') and chunk['message']['thinking']:
        if not in_thinking:
            in_thinking = True
            print('🧠 思考中:\n', end='', flush=True)
        print(chunk['message']['thinking'], end='', flush=True)
        thinking += chunk['message']['thinking']
    elif chunk['message']['content']:
        if in_thinking:
            in_thinking = False
            print('\n\n💬 回答:\n', end='', flush=True)
        print(chunk['message']['content'], end='', flush=True)
        content += chunk['message']['content']

🆚 JS/TS 横向对比

typescript 复制代码

// thinking_stream.ts
import { chat } from 'ollama';

const stream = await chat({
    model: 'qwen3:4b',
    messages: [{ role: 'user', content: '鸡兔同笼，头共35个，脚共94只，鸡和兔各多少？' }],
    stream: true,
});

let inThinking = false;
let thinking = '';
let content = '';

for await (const chunk of stream) {
    if (chunk.message.thinking) {
        if (!inThinking) {
            inThinking = true;
            process.stdout.write('🧠 思考中:\n');
        }
        process.stdout.write(chunk.message.thinking);
        thinking += chunk.message.thinking;
    } else if (chunk.message.content) {
        if (inThinking) {
            inThinking = false;
            process.stdout.write('\n\n💬 回答:\n');
        }
        process.stdout.write(chunk.message.content);
        content += chunk.message.content;
    }
}

两段代码的逻辑完全一致，只是迭代方式不同（同步 vs 异步）。

异步版本：适合高并发场景

如果你的应用需要处理多个并发请求（比如一个Web服务），建议使用异步客户端：

python 复制代码

# async_stream.py
import asyncio
from ollama import AsyncClient

async def stream_chat_async():
    client = AsyncClient(host='http://localhost:11434')
    
    async for chunk in await client.chat(
        model='qwen3:4b',
        messages=[{'role': 'user', 'content': '给我讲个笑话'}],
        stream=True,
    ):
        print(chunk['message']['content'], end='', flush=True)

asyncio.run(stream_chat_async())

🆚 JS/TS 横向对比

JavaScript本身就是异步优先的，所以写法上其实更自然：

javascript 复制代码

# async_stream.js
import { chat } from 'ollama';

async function streamChatAsync() {
    const stream = await chat({
        model: 'qwen3:4b',
        messages: [{ role: 'user', content: '给我讲个笑话' }],
        stream: true,
    });
    
    for await (const chunk of stream) {
        process.stdout.write(chunk.message.content);
    }
}

streamChatAsync();

有趣的是：Python的异步版本写起来反而更像JS的默认版本------因为两者现在都是异步迭代了。

总结：一张表看懂Python vs JS/TS的流式输出

特性	Python	JavaScript / TypeScript
开启流式	`stream=True`	`stream: true`
同步迭代	`for chunk in stream:`	不适用（默认异步）
异步迭代	`async for chunk in await client.chat()`	`for await (const chunk of stream)`
逐块输出	`print(content, end='', flush=True)`	`process.stdout.write(content)`
处理Thinking	`chunk['message']['thinking']`	`chunk.message.thinking`
适用场景	脚本、CLI工具、FastAPI后端	Web前端、Node.js后端

核心 takeaways

流式输出的本质：无论Python还是JS，都是通过迭代器逐块获取数据，只是语法表达不同
Qwen3:4b的优势：40亿参数，性能接近Qwen2.5-72B，适合本地部署
同步vs异步：Python提供了两种选择，而JS天生异步------根据你的应用场景选择即可
思考过程：Qwen3的thinking字段让模型推理透明化，是提升用户体验的利器