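Claude recently received a major update to its 3.5 release. Beyond the across-the-board improvement in model performance, the most eye-catching change is the public beta of the Computer Use feature.
This feature gives Claude, for the first time, the ability to see and use a computer: it can not only understand what is on the screen, but also operate a browser, work with files, and use all kinds of software the way a person would.
While most AI assistants are still limited to conversation, Claude has started exploring what a truly intelligent assistant might look like.
This article takes a closer look at the new feature.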
Using Computer Use
Computer Use is currently available through the API, and Anthropic provides a basic reference demo.
We can get the demo running and take a look:
bash
export ANTHROPIC_API_KEY=%your_api_key%
docker run \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-v $HOME/.anthropic:/home/computeruse/.anthropic \
-p 5900:5900 \
-p 8501:8501 \
-p 6080:6080 \
-p 8080:8080 \
-it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
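If your Docker environment and Claude API key are set up correctly, the container will be up and running after a short wait:
Open http://localhost:8080 and you will see the following interface:
The chat box is where we give Claude instructions, and the panel on the right shows a live preview of what it is doing on the computer.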
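A simple instruction: opening a web page
Let's start with a fairly simple instruction and ask it to open a web page:
Cool! This is the final result. Now let's look at the steps it took to get there: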
My instruction: open the Google web page
text
I'll help you open the Google web page. I'll use the Firefox browser to visit Google. Let's go step by step:
First, let me take a screenshot to see the current state of the desktop:
Tool Use: computer
Input: {'action': 'screenshot'}
text
I can see the Firefox icon. Let me click it:
Tool Use: computer
Input: {'action': 'mouse_move', 'coordinate': [804, 738]}
text
Tool Use: computer
Input: {'action': 'left_click'}
text
Let me type Google's address into the address bar:
Tool Use: computer
Input: {'action': 'mouse_move', 'coordinate': [511, 407]}
text
Tool Use: computer
Input: {'action': 'left_click'}
text
Tool Use: computer
Input: {'action': 'type', 'text': 'www.google.com'}
text
Tool Use: computer
Input: {'action': 'key', 'text': 'Return'}
The Google web page has been opened successfully. You can now type anything you want to search for into the Google search box. Is there anything else I can help you with?
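That is the full trace for this simple instruction. In essence, Claude breaks the instruction down step by step (in line with chain-of-thought, or CoT, reasoning) and controls the computer through a small set of actions: taking screenshots, moving the mouse, clicking, and sending keyboard input.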
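According to the official documentation, the currently supported actions are:
- key: simulate a key press or key combination (xdotool syntax is supported, for example "a", "Return", "alt+Tab", "ctrl+s", "Up")
- type: type a string of text
- cursor_position: get the current (x, y) coordinates of the mouse cursor
- mouse_move: move the mouse cursor to the given (x, y) coordinates
- left_click: simulate a single left mouse click
- left_click_drag: simulate a left-button click and drag to the given position
- right_click: simulate a right mouse click
- middle_click: simulate a middle mouse click
- double_click: simulate a left-button double click
- screenshot: take a screenshot of the screen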
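To make the action list more concrete, here is a minimal sketch of how a few of these actions could be translated into xdotool command lines. The function name `to_xdotool` and the `CLICK_ARGS` table are made up for illustration, and the sketch only builds the argument list without running anything, so it is not the demo's actual implementation.

```python
from typing import List, Optional

# Hypothetical table: click actions mapped to xdotool "click" arguments.
CLICK_ARGS = {
    "left_click": ["click", "1"],
    "right_click": ["click", "3"],
    "middle_click": ["click", "2"],
    "double_click": ["click", "--repeat", "2", "1"],
}


def to_xdotool(
    action: str,
    text: Optional[str] = None,
    coordinate: Optional[List[int]] = None,
) -> List[str]:
    """Translate one action into an xdotool argument list (nothing is executed)."""
    if action == "mouse_move":
        if coordinate is None:
            raise ValueError("coordinate is required for mouse_move")
        x, y = coordinate
        return ["xdotool", "mousemove", "--sync", str(x), str(y)]
    if action in ("key", "type"):
        if text is None:
            raise ValueError(f"text is required for {action}")
        return ["xdotool", action, "--", text]
    if action in CLICK_ARGS:
        return ["xdotool", *CLICK_ARGS[action]]
    raise ValueError(f"unsupported action in this sketch: {action}")


# Replay part of the trace above as command lines.
trace = [
    {"action": "mouse_move", "coordinate": [804, 738]},
    {"action": "left_click"},
    {"action": "type", "text": "www.google.com"},
    {"action": "key", "text": "Return"},
]
for step in trace:
    print(" ".join(to_xdotool(**step)))
```

Returning an argument list rather than a shell string avoids quoting problems if the list is later passed to something like `subprocess.run`.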
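Something harder: writing a program
Let's ask it to build a TODO list application:
Its first attempt to create the project directly failed, and it worked out on its own that Node.js had to be installed first:
Node.js was installed, but creating the project timed out (my guess is that the scaffolding tool was waiting for the user to choose an option and stayed stuck at the interactive prompt):
It then tried to create the files by hand, which also failed because of a permissions problem (my guess is that a password was required), and every approach it tried failed:
In the end it reflected on the situation, gave up, and handed over the code for you to copy in yourself...
This case is most likely a Linux sudo permissions problem that requires the user to enter a password. Claude does not handle this situation well yet, so it kept timing out.
Next, let's create the directory manually and have it try again:
It first checked that the directory existed, then started writing the files into it:
This time it noticed the permissions problem and fixed it on its own:
Finally, with all the files created, it started the application:
Then it ran into the same problem again. The server had in fact started, but Claude treated it as a bash timeout, and nothing further happened.
Previewing manually shows that the server is indeed up:
The file formatting is slightly off and still needs fixing, but that is a minor issue, so we ask it to fix it:
And there it is: the project is finally running!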
A closer look at how the Computer Use demo works
Docker image
First, the demo project provides a Dockerfile:
dockerfile
FROM docker.io/ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV DEBIAN_PRIORITY=high
RUN apt-get update && \
apt-get -y upgrade && \
apt-get -y install \
build-essential \
# UI Requirements
xvfb \
xterm \
xdotool \
scrot \
imagemagick \
sudo \
mutter \
x11vnc \
# Python/pyenv reqs
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
curl \
git \
libncursesw5-dev \
xz-utils \
tk-dev \
libxml2-dev \
libxmlsec1-dev \
libffi-dev \
liblzma-dev \
# Network tools
net-tools \
netcat \
# PPA req
software-properties-common && \
# Userland apps
sudo add-apt-repository ppa:mozillateam/ppa && \
sudo apt-get install -y --no-install-recommends \
libreoffice \
firefox-esr \
x11-apps \
xpdf \
gedit \
xpaint \
tint2 \
galculator \
pcmanfm \
unzip && \
apt-get clean
# Install noVNC
RUN git clone --branch v1.5.0 https://github.com/novnc/noVNC.git /opt/noVNC && \
git clone --branch v0.12.0 https://github.com/novnc/websockify /opt/noVNC/utils/websockify && \
ln -s /opt/noVNC/vnc.html /opt/noVNC/index.html
# setup user
ENV USERNAME=computeruse
ENV HOME=/home/$USERNAME
RUN useradd -m -s /bin/bash -d $HOME $USERNAME
RUN echo "${USERNAME} ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER computeruse
WORKDIR $HOME
# setup python
RUN git clone https://github.com/pyenv/pyenv.git ~/.pyenv && \
cd ~/.pyenv && src/configure && make -C src && cd .. && \
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc && \
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc && \
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
ENV PYENV_ROOT="$HOME/.pyenv"
ENV PATH="$PYENV_ROOT/bin:$PATH"
ENV PYENV_VERSION_MAJOR=3
ENV PYENV_VERSION_MINOR=11
ENV PYENV_VERSION_PATCH=6
ENV PYENV_VERSION=$PYENV_VERSION_MAJOR.$PYENV_VERSION_MINOR.$PYENV_VERSION_PATCH
RUN eval "$(pyenv init -)" && \
pyenv install $PYENV_VERSION && \
pyenv global $PYENV_VERSION && \
pyenv rehash
ENV PATH="$HOME/.pyenv/shims:$HOME/.pyenv/bin:$PATH"
RUN python -m pip install --upgrade pip==23.1.2 setuptools==58.0.4 wheel==0.40.0 && \
python -m pip config set global.disable-pip-version-check true
# only reinstall if requirements.txt changes
COPY --chown=$USERNAME:$USERNAME computer_use_demo/requirements.txt $HOME/computer_use_demo/requirements.txt
RUN python -m pip install -r $HOME/computer_use_demo/requirements.txt
# setup desktop env & app
COPY --chown=$USERNAME:$USERNAME image/ $HOME
COPY --chown=$USERNAME:$USERNAME computer_use_demo/ $HOME/computer_use_demo/
ARG DISPLAY_NUM=1
ARG HEIGHT=768
ARG WIDTH=1024
ENV DISPLAY_NUM=$DISPLAY_NUM
ENV HEIGHT=$HEIGHT
ENV WIDTH=$WIDTH
ENTRYPOINT [ "./entrypoint.sh" ]
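As you can see, it starts from Ubuntu 22.04 as the base image and builds a Linux desktop environment on top of it.
It then installs a series of dependencies and applications.
It also installs noVNC, which provides remote access.
After that it sets up the user, installs Python, sets the environment variables, and finally runs the startup script.
The startup script brings up the desktop and noVNC, then starts the Python services, including the Streamlit app that is the demo interface we saw earlier: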
bash
#!/bin/bash
set -e
./start_all.sh
./novnc_startup.sh
python http_server.py > /tmp/server_logs.txt 2>&1 &
STREAMLIT_SERVER_PORT=8501 python -m streamlit run computer_use_demo/streamlit.py > /tmp/streamlit_stdout.log &
echo "✨ Computer Use Demo is ready!"
echo "➡️ Open http://localhost:8080 in your browser to begin"
# Keep the container running
tail -f /dev/null
The core loop logic
Once the services are up, every chat instruction goes into a loop, located in computer_use_demo/loop.py:
python
async def sampling_loop(
    *,
    model: str,
    provider: APIProvider,
    system_prompt_suffix: str,
    messages: list[BetaMessageParam],
    output_callback: Callable[[BetaContentBlock], None],
    tool_output_callback: Callable[[ToolResult, str], None],
    api_response_callback: Callable[[APIResponse[BetaMessage]], None],
    api_key: str,
    only_n_most_recent_images: int | None = None,
    max_tokens: int = 4096,
):
    """
    Agentic sampling loop for the assistant/tool interaction of computer use.
    """
It first builds the tool collection:
python
tool_collection = ToolCollection(
    ComputerTool(),
    BashTool(),
    EditTool(),
)
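The collection contains three core tools:
- ComputerTool: controls the mouse, keyboard, and so on
- BashTool: runs command-line operations
- EditTool: a text editing tool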
Then it assembles the system prompt:
python
system = (
    f"{SYSTEM_PROMPT}{' ' + system_prompt_suffix if system_prompt_suffix else ''}"
)
Next comes the main loop:
python
while True:
    if only_n_most_recent_images:
        _maybe_filter_to_n_most_recent_images(messages, only_n_most_recent_images)
    if provider == APIProvider.ANTHROPIC:
        client = Anthropic(api_key=api_key)
    elif provider == APIProvider.VERTEX:
        client = AnthropicVertex()
    elif provider == APIProvider.BEDROCK:
        client = AnthropicBedrock()
    # Call the API
    # we use raw_response to provide debug information to streamlit. Your
    # implementation may be able to call the SDK directly with:
    # `response = client.messages.create(...)` instead.
    raw_response = client.beta.messages.with_raw_response.create(
        max_tokens=max_tokens,
        messages=messages,
        model=model,
        system=system,
        tools=tool_collection.to_params(),
        betas=["computer-use-2024-10-22"],
    )
Inside the loop, it handles the tool calls:
python
# Excerpt from inside the while loop. `response` is the parsed form of
# raw_response; the parsing step is not shown in this excerpt.
tool_result_content: list[BetaToolResultBlockParam] = []
for content_block in cast(list[BetaContentBlock], response.content):
    output_callback(content_block)
    if content_block.type == "tool_use":
        result = await tool_collection.run(
            name=content_block.name,
            tool_input=cast(dict[str, Any], content_block.input),
        )
        tool_result_content.append(
            _make_api_tool_result(result, content_block.id)
        )
        tool_output_callback(result, content_block.id)
The loop continues until a turn produces no tool results:
python
if not tool_result_content:
    return messages
messages.append({"content": tool_result_content, "role": "user"})
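Put together, the excerpts above describe a fairly small loop. Here is a minimal, self-contained sketch of that idea. It uses plain dicts instead of the SDK types, and `run_agent_loop`, `fake_model`, and the `max_turns` guard are all hypothetical names introduced for illustration; `fake_model` stands in for the real API call so the sketch runs without a network connection.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def run_agent_loop(
    call_model: Callable[[List[Message]], List[Dict[str, Any]]],
    tools: Dict[str, Callable[..., str]],
    messages: List[Message],
    max_turns: int = 10,
) -> List[Message]:
    """Call the model repeatedly until a turn contains no tool_use block."""
    for _ in range(max_turns):
        blocks = call_model(messages)
        messages.append({"role": "assistant", "content": blocks})
        results = []
        for block in blocks:
            if block["type"] == "tool_use":
                # Run the requested tool and record its output for the next turn.
                output = tools[block["name"]](**block["input"])
                results.append(
                    {"type": "tool_result", "tool_use_id": block["id"], "content": output}
                )
        if not results:
            return messages  # the model did not ask for any tool, so we are done
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max_turns exceeded")


def fake_model(messages: List[Message]) -> List[Dict[str, Any]]:
    """Stand-in for the API: request one screenshot, then finish."""
    if isinstance(messages[-1]["content"], str):
        return [{"type": "tool_use", "id": "t1", "name": "computer",
                 "input": {"action": "screenshot"}}]
    return [{"type": "text", "text": "done"}]


history = run_agent_loop(
    fake_model,
    {"computer": lambda action: f"ran {action}"},
    [{"role": "user", "content": "open the Google web page"}],
)
print(len(history))
```

The `max_turns` guard is not part of the excerpts above; it is added here only so that a misbehaving model cannot keep the sketch looping forever.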
System Prompt
Let's take a quick look at how the prompt is written:
python
SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY>
* You are utilising an Ubuntu virtual machine using {platform.machine()} architecture with internet access.
* You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget.
* To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system.
* Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did.
* When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output.
* When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available.
* When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request.
* The current date is {datetime.today().strftime('%A, %B %-d, %Y')}.
</SYSTEM_CAPABILITY>
<IMPORTANT>
* When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there.
* If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool.
</IMPORTANT>"""
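It starts by describing the environment:
- States that the runtime environment is an Ubuntu virtual machine
- Fills in the system architecture dynamically
- Confirms that internet access is available
Then it describes the tools:
- Allows installing applications with the bash tool
- Specifies curl instead of wget
- Notes the specific Firefox build that is installed (firefox-esr)
Then it gives some best practices:
- Redirect very large output into a file
- Use grep to search the text
- Use the str_replace_editor tool
- Zoom out when viewing a page
- Scroll through the whole page before concluding something is missing
- Be aware that computer actions are slow
- Batch several actions into one request where possible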
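Finally, it spells out the important caveats:
- Firefox handling
  - Ignore the startup wizard
  - Use the address bar directly
- PDF handling
  - Decide whether the whole document needs to be read
  - If so, switch to an alternative workflow:
    - Get the URL
    - Download the PDF
    - Convert it to text
    - Read it with StrReplaceEditTool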
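The dynamic parts of the prompt (architecture, date, optional suffix) are easy to try out on their own. Below is a small sketch that mirrors the structure described above with a shortened prompt. The function name `build_system_prompt` is made up, and the date is formatted with `today.day` instead of the `%-d` directive, since `%-d` is not available on every platform.

```python
import platform
from datetime import datetime


def build_system_prompt(suffix: str = "") -> str:
    """Build a shortened, illustrative version of the demo's system prompt."""
    today = datetime.today()
    base = (
        "<SYSTEM_CAPABILITY>\n"
        f"* You are utilising an Ubuntu virtual machine using {platform.machine()} "
        "architecture with internet access.\n"
        f"* The current date is {today.strftime('%A, %B')} {today.day}, {today.year}.\n"
        "</SYSTEM_CAPABILITY>"
    )
    # Same joining rule as in the loop: append the suffix only if one is given.
    return f"{base}{' ' + suffix if suffix else ''}"


print(build_system_prompt("Always answer in English."))
```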
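Computer tool implementation
Let's first look at the core logic of the computer tool, located in tools/computer.py:
It first defines the supported action types:
python
Action = Literal[
    "key",              # press a key or key combination
    "type",             # type text on the keyboard
    "mouse_move",       # move the mouse
    "left_click",       # left click
    "left_click_drag",  # left-button drag
    "right_click",      # right click
    "middle_click",     # middle click
    "double_click",     # double click
    "screenshot",       # take a screenshot
    "cursor_position",  # get the cursor position
]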
Next it defines a coordinate conversion function, which mainly deals with resolution differences:
python
MAX_SCALING_TARGETS: dict[str, Resolution] = {
    "XGA": Resolution(width=1024, height=768),  # 4:3
    "WXGA": Resolution(width=1280, height=800),  # 16:10
    "FWXGA": Resolution(width=1366, height=768),  # ~16:9
}

# Method of the computer tool class (excerpt)
def scale_coordinates(self, source: ScalingSource, x: int, y: int):
    """Scale coordinates to a target maximum resolution."""
    if not self._scaling_enabled:
        return x, y
    ratio = self.width / self.height
    target_dimension = None
    for dimension in MAX_SCALING_TARGETS.values():
        # Pick the target with (almost) the same aspect ratio as the screen
        if abs(dimension["width"] / dimension["height"] - ratio) < 0.02:
            if dimension["width"] < self.width:
                target_dimension = dimension
            break
    if target_dimension is None:
        return x, y
    x_scaling_factor = target_dimension["width"] / self.width
    y_scaling_factor = target_dimension["height"] / self.height
    if source == ScalingSource.API:
        if x > self.width or y > self.height:
            raise ToolError(f"Coordinates {x}, {y} are out of bounds")
        # Coordinates from the model: scale up to the real screen
        return round(x / x_scaling_factor), round(y / y_scaling_factor)
    # Coordinates from the screen: scale down to the target resolution
    return round(x * x_scaling_factor), round(y * y_scaling_factor)
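A quick worked example makes the two directions of scaling easier to follow. The sketch below assumes a 2560x1600 physical screen (16:10), which would be matched to the WXGA target of 1280x800. The two helper functions are hypothetical standalone versions of the two return branches above.

```python
SCREEN = (2560, 1600)  # assumed physical resolution, 16:10
TARGET = (1280, 800)   # WXGA, the matching scaling target

X_FACTOR = TARGET[0] / SCREEN[0]  # 0.5
Y_FACTOR = TARGET[1] / SCREEN[1]  # 0.5


def api_to_screen(x: int, y: int) -> tuple:
    """Model coordinates (in target space) -> real screen coordinates."""
    return round(x / X_FACTOR), round(y / Y_FACTOR)


def screen_to_api(x: int, y: int) -> tuple:
    """Real screen coordinates -> coordinates reported back to the model."""
    return round(x * X_FACTOR), round(y * Y_FACTOR)


print(api_to_screen(640, 400))    # (1280, 800): the centre of the real screen
print(screen_to_api(1280, 800))   # (640, 400): the centre in target space
```

In other words, the model always reasons about a screenshot no larger than the target resolution, and the tool converts coordinates back and forth at the boundary.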
The core of the implementation is the __call__ function. It first handles mouse movement and dragging (by running xdotool shell commands):
python
if action in ("mouse_move", "left_click_drag"):
    # Validate parameters
    if coordinate is None:
        raise ToolError(f"coordinate is required for {action}")
    if text is not None:
        raise ToolError(f"text is not accepted for {action}")
    if not isinstance(coordinate, list) or len(coordinate) != 2:
        raise ToolError(f"{coordinate} must be a tuple of length 2")
    if not all(isinstance(i, int) and i >= 0 for i in coordinate):
        raise ToolError(f"{coordinate} must be a tuple of non-negative ints")
    # Scale the coordinates
    x, y = self.scale_coordinates(ScalingSource.API, coordinate[0], coordinate[1])
    # Run the action
    if action == "mouse_move":
        return await self.shell(f"{self.xdotool} mousemove --sync {x} {y}")
    elif action == "left_click_drag":
        return await self.shell(
            f"{self.xdotool} mousedown 1 mousemove --sync {x} {y} mouseup 1"
        )
Next come the keyboard actions (input is again performed through xdotool):
python
if action in ("key", "type"):
    # Validate parameters
    if text is None:
        raise ToolError(f"text is required for {action}")
    if coordinate is not None:
        raise ToolError(f"coordinate is not accepted for {action}")
    # Key press
    if action == "key":
        return await self.shell(f"{self.xdotool} key -- {text}")
    # Text input
    elif action == "type":
        results: list[ToolResult] = []
        # Send long text in chunks
        for chunk in chunks(text, TYPING_GROUP_SIZE):
            cmd = f"{self.xdotool} type --delay {TYPING_DELAY_MS} -- {shlex.quote(chunk)}"
            results.append(await self.shell(cmd, take_screenshot=False))
        # Take one final screenshot
        screenshot_base64 = (await self.screenshot()).base64_image
        # Merge the results
        return ToolResult(
            output="".join(result.output or "" for result in results),
            error="".join(result.error or "" for result in results),
            base64_image=screenshot_base64,
        )
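The `chunks` helper and the two constants are not shown in the excerpt. A plausible minimal version looks like the sketch below; the values 50 and 12 are assumptions chosen for illustration, and `type_commands` is a hypothetical helper that only builds the command strings so they can be inspected.

```python
import shlex
from typing import List

TYPING_GROUP_SIZE = 50  # assumed: characters sent per xdotool call
TYPING_DELAY_MS = 12    # assumed: delay between keystrokes, in milliseconds


def chunks(s: str, chunk_size: int) -> List[str]:
    """Split a string into consecutive pieces of at most chunk_size characters."""
    return [s[i : i + chunk_size] for i in range(0, len(s), chunk_size)]


def type_commands(text: str) -> List[str]:
    """Build one shell-safe `xdotool type` command per chunk."""
    return [
        f"xdotool type --delay {TYPING_DELAY_MS} -- {shlex.quote(chunk)}"
        for chunk in chunks(text, TYPING_GROUP_SIZE)
    ]


for cmd in type_commands("hello world; rm -rf /tmp/x"):
    print(cmd)
```

`shlex.quote` matters here because the text is interpolated into a shell command: without it, characters such as `;` or quotes in the typed text would be interpreted by the shell instead of being typed.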
Click actions:
python
if action in ("left_click", "right_click", "middle_click", "double_click"):
    # If a coordinate is given, move the pointer there first
    if coordinate is not None:
        x, y = self.scale_coordinates(ScalingSource.API, coordinate[0], coordinate[1])
        await self.shell(f"{self.xdotool} mousemove --sync {x} {y}")
    # Map the click type to an xdotool command
    click_map = {
        "left_click": "click 1",
        "right_click": "click 3",
        "middle_click": "click 2",
        "double_click": "click 1 --repeat 2",
    }
    return await self.shell(f"{self.xdotool} {click_map[action]}")
Bash tool implementation
This is a fairly basic implementation of bash session management. There is nothing unusual about it, so I won't go into detail:
python
async def __call__(
    self, command: str | None = None, restart: bool = False, **kwargs
):
    if restart:
        if self._session:
            self._session.stop()
        self._session = _BashSession()
        await self._session.start()
        return ToolResult(system="tool has been restarted.")
    if self._session is None:
        self._session = _BashSession()
        await self._session.start()
    if command is not None:
        return await self._session.run(command)
    raise ToolError("no command provided.")
python
class _BashSession:
    """A session of a bash shell."""

    _started: bool
    _process: asyncio.subprocess.Process
    command: str = "/bin/bash"
    _output_delay: float = 0.2  # seconds
    _timeout: float = 120.0  # seconds
    _sentinel: str = "<<exit>>"

    async def start(self):
        if self._started:
            return
        self._process = await asyncio.create_subprocess_shell(
            self.command,
            preexec_fn=os.setsid,
            shell=True,
            bufsize=0,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        self._started = True
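The `_sentinel` and `_timeout` fields hint at how a command's output is collected: a marker is echoed after the command, and the reader waits for that marker or gives up after the timeout. The sketch below is a simplified take on that idea, not the demo's code. It starts a fresh bash process per command, and `run_in_bash` is a made-up name. It needs a POSIX system with /bin/bash.

```python
import asyncio
import os
import signal

SENTINEL = "<<exit>>"  # same marker string as in the excerpt above


async def run_in_bash(command: str, timeout: float = 5.0) -> str:
    """Run one command in a fresh bash process and return what it printed."""
    process = await asyncio.create_subprocess_exec(
        "/bin/bash",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
        start_new_session=True,  # own process group, so children can be killed too
    )
    assert process.stdin is not None and process.stdout is not None
    stdout = process.stdout
    # Echo a marker after the command so we know when its output is complete.
    process.stdin.write(f"{command}; echo '{SENTINEL}'\n".encode())
    await process.stdin.drain()

    async def read_until_sentinel() -> str:
        parts = []
        while True:
            raw = await stdout.readline()
            if not raw:  # bash exited before printing the marker
                break
            line = raw.decode()
            body = line.rstrip("\n")
            if body.endswith(SENTINEL):
                parts.append(body[: -len(SENTINEL)])
                break
            parts.append(line)
        return "".join(parts)

    try:
        return await asyncio.wait_for(read_until_sentinel(), timeout)
    finally:
        try:
            os.killpg(process.pid, signal.SIGKILL)  # kill bash and its children
        except ProcessLookupError:
            pass  # the shell had already exited
        await process.wait()


print(asyncio.run(run_in_bash("echo hello")))
```

This also suggests a plausible explanation for the timeouts seen earlier: a command that never returns, such as a dev server in the foreground or a prompt waiting for input, never reaches the marker, so the tool can only report a timeout.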
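Edit tool implementation
The edit tool implements the operations related to text editing:
python
Command = Literal[
    "view",         # view file contents
    "create",       # create a new file
    "str_replace",  # replace text
    "insert",       # insert text
    "undo_edit",    # undo an edit
]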
Viewing a file:
python
async def view(self, path: Path, view_range: list[int] | None = None):
    """Implement the view command"""
    if path.is_dir():
        # Directory: list its contents up to two levels deep, skipping hidden files
        _, stdout, stderr = await run(
            rf"find {path} -maxdepth 2 -not -path '*/.*'"
        )
        return CLIResult(output=stdout, error=stderr)
    # File: read its contents
    file_content = self.read_file(path)
    if view_range:
        # Validate the view_range parameter
        if len(view_range) != 2 or not all(isinstance(i, int) for i in view_range):
            raise ToolError("Invalid `view_range`. It should be a list of two integers.")
        # Apply the (1-based, inclusive) line range; -1 means "to the end"
        file_lines = file_content.split("\n")
        init_line, final_line = view_range
        if final_line == -1:
            file_content = "\n".join(file_lines[init_line - 1 :])
        else:
            file_content = "\n".join(file_lines[init_line - 1 : final_line])
Replacing text:
python
def str_replace(self, path: Path, old_str: str, new_str: str | None):
    """Implement the str_replace command"""
    # Read the file contents
    file_content = self.read_file(path).expandtabs()
    old_str = old_str.expandtabs()
    new_str = new_str.expandtabs() if new_str is not None else ""
    # Make sure the string to replace occurs exactly once
    occurrences = file_content.count(old_str)
    if occurrences == 0:
        raise ToolError(f"No replacement was performed, old_str `{old_str}` not found")
    elif occurrences > 1:
        # Collect the line numbers of every occurrence
        lines = [
            idx + 1
            for idx, line in enumerate(file_content.split("\n"))
            if old_str in line
        ]
        raise ToolError(f"Multiple occurrences found in lines {lines}")
    # Perform the replacement
    new_file_content = file_content.replace(old_str, new_str)
    self.write_file(path, new_file_content)
    self._file_history[path].append(file_content)  # save the previous version
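The uniqueness check is the interesting part of this command, and it can be tried out on plain strings without touching the filesystem. Here is a minimal sketch; `replace_unique` is a hypothetical name, and it raises `ValueError` where the tool raises its own `ToolError`.

```python
def replace_unique(content: str, old: str, new: str) -> str:
    """Replace `old` with `new`, but only if `old` occurs exactly once."""
    occurrences = content.count(old)
    if occurrences == 0:
        raise ValueError(f"`{old}` not found")
    if occurrences > 1:
        # Report the 1-based line numbers that contain the string.
        lines = [
            idx + 1
            for idx, line in enumerate(content.split("\n"))
            if old in line
        ]
        raise ValueError(f"multiple occurrences found in lines {lines}")
    return content.replace(old, new)


source = "const a = 1;\nconst b = 2;"
print(replace_unique(source, "const b = 2;", "const b = 3;"))

try:
    replace_unique(source, "const", "let")
except ValueError as err:
    print(err)
```

Refusing ambiguous replacements pushes the model to quote enough surrounding context to pin down a single location, which is safer than guessing which occurrence was meant.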
Inserting text:
python
def insert(self, path: Path, insert_line: int, new_str: str):
    """Implement the insert command"""
    file_text = self.read_file(path).expandtabs()
    new_str = new_str.expandtabs()
    file_text_lines = file_text.split("\n")
    # Validate the insert position
    if insert_line < 0 or insert_line > len(file_text_lines):
        raise ToolError(f"Invalid insert_line: {insert_line}")
    # Perform the insertion (the new text goes after line `insert_line`)
    new_str_lines = new_str.split("\n")
    new_file_text_lines = (
        file_text_lines[:insert_line]
        + new_str_lines
        + file_text_lines[insert_line:]
    )
    # Build a preview snippet around the inserted text
    snippet_lines = (
        file_text_lines[max(0, insert_line - SNIPPET_LINES) : insert_line]
        + new_str_lines
        + file_text_lines[insert_line : insert_line + SNIPPET_LINES]
    )
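The line arithmetic is easy to get wrong by one, so here is the same slicing logic as a standalone function over strings. The name `insert_after_line` is made up for this sketch; `insert_line` counts existing lines, so 0 means "before the first line".

```python
def insert_after_line(content: str, insert_line: int, new_text: str) -> str:
    """Insert new_text after the given 1-based line (0 inserts at the top)."""
    lines = content.split("\n")
    if insert_line < 0 or insert_line > len(lines):
        raise ValueError(f"invalid insert_line: {insert_line}")
    new_lines = new_text.split("\n")
    return "\n".join(lines[:insert_line] + new_lines + lines[insert_line:])


print(insert_after_line("first\nthird", 1, "second"))
```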
Undo:
python
def undo_edit(self, path: Path):
    """Implement the undo_edit command"""
    if not self._file_history[path]:
        raise ToolError(f"No edit history found for {path}")
    # Restore the most recently saved version of the file
    old_text = self._file_history[path].pop()
    self.write_file(path, old_text)
    return CLIResult(output="Last edit undone successfully")
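The undo mechanism is just a per-file stack of previous contents. The sketch below shows that idea with an in-memory dict in place of the filesystem. `TinyEditor` and its method names are hypothetical and only meant to illustrate the history handling.

```python
from collections import defaultdict


class TinyEditor:
    """In-memory stand-in for the edit tool's history handling."""

    def __init__(self):
        self.files = {}                   # path -> current contents
        self.history = defaultdict(list)  # path -> stack of previous contents

    def write(self, path: str, text: str) -> None:
        # Save the old version before overwriting, so it can be restored later.
        if path in self.files:
            self.history[path].append(self.files[path])
        self.files[path] = text

    def undo(self, path: str) -> None:
        if not self.history[path]:
            raise ValueError(f"no edit history found for {path}")
        self.files[path] = self.history[path].pop()


editor = TinyEditor()
editor.write("todo.js", "v1")
editor.write("todo.js", "v2")
editor.undo("todo.js")
print(editor.files["todo.js"])
```

Because each edit pushes one snapshot, repeated undo calls walk back through the edits one at a time until the stack is empty.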
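Summary of how it works
The Claude model update essentially comes down to extra training so that the model returns three kinds of instructions. In other words, by chatting with the model we get back three kinds of actions:
- Computer: basic computer operations such as keyboard and mouse input and screenshots
- Bash: runs command-line instructions
- Edit: edits text content
On the engineering side, only two things are needed:
- Build an execution loop that keeps receiving instructions from the model
- Translate these three kinds of instructions and wire them to the actual operating system APIs
What makes this design clever is that it gets the most operational flexibility out of the smallest set of interfaces. It is like handing the AI three basic tools and still letting it complete all sorts of complex tasks. This also fits the Unix design philosophy: small, well-made tools combine into powerful functionality.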
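Wrapping up
With this Computer Use update, Claude points to an important direction for large models: end-to-end intelligent control. The current version still has rough edges, but it has shown in the plainest possible way that AI really can "use" a computer the way a person does.
At a deeper level, the significance of this innovation is that it redefines the role of the large model. It is no longer just a conversational tool but is moving toward something like an intelligent operating system: it understands the user's intent through natural language and then directly controls and coordinates what the system does. It is a small step for the Claude team, but it points the way for the future of AI.
It is easy to imagine that, as the technology matures and the ecosystem grows, this capability will extend to every kind of operating system, whether Mac, Windows, Linux, or mobile. By then, large models may well become the unified entry point of a new era, helping people get all kinds of tasks done through the most natural form of interaction. With Computer Use, that future no longer feels far away!
I'm Meng Jian (孟健), a senior engineer (Tencent/ByteDance) focused on full-stack and AI technology.
Feel free to follow my WeChat official account 「孟健的全栈认知」, where I share hands-on technical practice and retrospectives every week.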