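Bytebot Source Code Study

This document walks through the packages in the Bytebot project: what each one does, how it works, and how it is implemented.

Repository: github.com/bytebot-ai/...

Overview

Bytebot is an open-source, self-hosted AI desktop agent. It combines an LLM's ability to understand instructions with precise desktop automation to realize the idea of "giving the AI a computer of its own", so the AI can operate a computer the way a person would to complete complex tasks.

In short, you can have the AI operate a virtual desktop automatically, either through a chat conversation or by creating a task.

Architecture diagram

Core concepts

Virtual desktop

A virtual desktop is a graphical desktop environment that is created and run on a computer without being shown on a physical screen. It either runs purely in memory or is accessed remotely over the network.

Physical desktop vs. virtual desktop

Feature	Physical desktop	Virtual desktop
Display output	Connected directly to a monitor	No physical display output
Hardware requirements	Needs a graphics card and a monitor	Only needs CPU and memory
Access	Viewed directly	Remote connection over the network
Typical use	Everyday personal use	Servers, automation, testing

How to set up a virtual desktop

Using Docker

Create the Dockerfile

Dockerfile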

FROM ubuntu:22.04

# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV DISPLAY=:0

# Install system dependencies
RUN apt-get update && apt-get install -y \
    # Virtual display and remote access
    xvfb x11vnc xauth \
    # Desktop environment
    xfce4 xfce4-goodies \
    # Tools and applications
    # (on Ubuntu 22.04, firefox-esr may need an extra APT source such as the mozillateam PPA)
    firefox-esr gedit mousepad \
    # Networking and process management
    net-tools supervisor \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install noVNC
RUN apt-get update && apt-get install -y \
    python3 python3-pip git \
    && git clone https://github.com/novnc/noVNC.git /opt/noVNC \
    && git clone https://github.com/novnc/websockify.git /opt/noVNC/utils/websockify

# Create a user
RUN useradd -m -s /bin/bash user && \
    echo 'user:password' | chpasswd

# Copy the configuration file
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Expose the port
EXPOSE 6080

# Start the services
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf", "-n"]

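The noVNC client makes the virtual desktop viewable in a browser.

Create the Supervisor configuration file

This is the Supervisor process-management configuration. It starts and coordinates the several services that run inside the Docker container.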

; supervisord.conf
; Note: supervisord has no depends_on option. priority only controls start order
; (lower values start first); it does not wait for a service to become ready.
[supervisord]
nodaemon=true
logfile=/var/log/supervisord.log
pidfile=/var/run/supervisord.pid

[program:xvfb]
command=/usr/bin/Xvfb :0 -screen 0 1024x768x24 -ac +extension GLX
autorestart=true
user=user
priority=10

[program:xfce4]
command=/bin/bash -c "export DISPLAY=:0 && /usr/bin/startxfce4"
environment=DISPLAY=":0"
autorestart=true
user=user
priority=20

[program:x11vnc]
command=/usr/bin/x11vnc -display :0 -forever -shared -noxdamage -passwd password123
environment=DISPLAY=":0"
autorestart=true
user=user
priority=30

[program:websockify]
command=python3 /opt/noVNC/utils/websockify/websockify.py --web /opt/noVNC 6080 localhost:5900
autorestart=true
user=user
priority=40

Build and run

# Build the image
docker build -t virtual-desktop .

# Run the container
docker run -d -p 6080:6080 --name vdesktop virtual-desktop

Access the virtual desktop

Open a browser at: http://localhost:6080/vnc.html

Password: password123

How Bytebot uses it

The frontend uses the VncScreen component from react-vnc (held in the VncComponent state below) to display the virtual desktop and interact with it.

export function VncViewer({ viewOnly = true }: VncViewerProps) {
  // Ref to the container DOM element
  const containerRef = useRef<HTMLDivElement>(null);
  // VNC component (imported dynamically)
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  const [VncComponent, setVncComponent] = useState<any>(null);
  // WebSocket URL
  const [wsUrl, setWsUrl] = useState<string | null>(null);

  // Import VncScreen dynamically (client side only, to avoid SSR issues)
  useEffect(() => {
    import("react-vnc").then(({ VncScreen }) => {
      setVncComponent(() => VncScreen);
    });
  }, []);

  // Build the WebSocket URL
  useEffect(() => {
    // SSR safety check
    if (typeof window === "undefined") return;

    // Pick the WebSocket scheme from the page protocol (https -> wss, http -> ws)
    const proto = window.location.protocol === "https:" ? "wss" : "ws";
    // The URL points at the proxy path, which is later forwarded to the real WebSocket endpoint
    setWsUrl(`${proto}://${window.location.host}/api/proxy/websockify`);
  }, []);

  return (
    <div ref={containerRef} style={{ width: "100%", height: "100%" }}>
      {/* Render only once the component and the URL are both available */}
      {VncComponent && wsUrl && (
        <VncComponent
          // RFB (Remote Frame Buffer) options
          rfbOptions={{
            secure: false, // no TLS
            shared: true, // shared mode (several clients may connect at once)
            wsProtocols: ["binary"], // WebSocket subprotocol
          }}
          // autoConnect={true} // auto-connect (commented out)
          // The key separates view-only and interactive mode so the component remounts correctly
          key={viewOnly ? "view-only" : "interactive"}
          url={wsUrl} // WebSocket URL
          scaleViewport // scale the viewport to fit the container
          viewOnly={viewOnly} // read-only or not
          style={{ width: "100%", height: "100%" }} // fill the container
        />
      )}
    </div>
  );
}

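What is the url passed to VncComponent?

When the bytebotd package starts, it brings up the desktop service and serves noVNC, so the virtual desktop can be viewed through a web URL.

The url in the code points at that noVNC endpoint (via the proxy path), which is why the virtual desktop shows up in the frontend.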
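
The URL construction can be isolated into a small pure function, which makes the scheme selection easy to check outside the browser. This is a minimal sketch: `buildVncWsUrl` is a hypothetical helper name, not something from the Bytebot code, and the path is the proxy path used in the component above.

```typescript
// Hypothetical helper (not part of Bytebot): derives the proxied WebSocket URL
// from the page protocol and host, mirroring the logic in the useEffect above.
function buildVncWsUrl(pageProtocol: string, host: string): string {
  // https pages must use wss, otherwise the browser blocks the connection as mixed content
  const proto = pageProtocol === "https:" ? "wss" : "ws";
  return `${proto}://${host}/api/proxy/websockify`;
}

const secureUrl = buildVncWsUrl("https:", "example.com");
const localUrl = buildVncWsUrl("http:", "localhost:9992");
```

In the component, the two arguments would come from `window.location.protocol` and `window.location.host`.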

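How browser access to the virtual desktop works internally

Client mouse/keyboard events → network request → VNC server executes them remotely → VNC sends back the screen image → client renders it locally

Input path (client → server)

  User mouse/keyboard input
    → react-vnc (encodes RFB input messages)
    → WebSocket binary frames
    → reverse proxy chain
    → websockify (WebSocket → RFB)
    → x11vnc (RFB → X11 input events)
    → X11 server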
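
To make the "encodes RFB input messages" step concrete, here is a minimal sketch of how a single mouse event becomes the 6-byte RFB PointerEvent message defined in RFC 6143. In practice this encoding is done by the noVNC code inside react-vnc; `encodePointerEvent` is a hypothetical name used only for illustration.

```typescript
// Illustrative only: builds an RFB PointerEvent message (RFC 6143, section 7.5.5).
// Layout: 1 byte message type (5), 1 byte button mask, 2 bytes x, 2 bytes y (big-endian).
function encodePointerEvent(x: number, y: number, buttonMask: number): Uint8Array {
  const buffer = new ArrayBuffer(6);
  const view = new DataView(buffer);
  view.setUint8(0, 5);          // message type 5 = PointerEvent
  view.setUint8(1, buttonMask); // bit 0 set = left button pressed
  view.setUint16(2, x);         // DataView writes big-endian by default
  view.setUint16(4, y);
  return new Uint8Array(buffer);
}

// Left button pressed at (300, 200)
const pointerFrame = encodePointerEvent(300, 200, 1);
```

The resulting bytes are what travels inside one WebSocket binary frame; websockify strips the WebSocket framing and passes the same bytes to x11vnc over TCP.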

Screen path (server → client)

  X11 screen contents
    → x11vnc (encodes with the RFB protocol)
    → websockify (RFB → WebSocket binary frames)
    → bytebotd (transparent forwarding)
    → Next.js (transparent forwarding)
    → react-vnc (decodes RFB and renders to a Canvas)

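Project details

bytebot-ui - frontend user interface

Description: a modern web interface built on Next.js that provides task management and desktop viewing.

Tech stack

Next.js 15
React 19
TypeScript
Tailwind CSS
Socket.IO Client
react-vnc: a VNC (Virtual Network Computing) client component that displays and controls a remote virtual desktop directly in a web page

Page design

Home page
- Create a new task
- View the task list
Task list and task detail pages

You type requests into the AI chat panel on the right, and the remote desktop UI is shown on the left.

Remote desktop
- Control the remote desktop online

API design

server.ts handles the WebSocket connections (/api/proxy/tasks and /api/proxy/websockify).

On line 74, server.ts hands all remaining requests to Next.js, including /api/*.

route.ts handles all other HTTP API requests (such as /api/tasks/models and /api/tasks).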
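
The split between the two files comes down to a prefix check on the request path plus a path rewrite for the VNC proxy. The sketch below restates that decision as two pure functions so it can be read on its own; `classifyRequest` and `rewriteVncPath` are hypothetical names, and the `/websockify` target path in the usage lines is an assumed value for `BYTEBOT_DESKTOP_VNC_URL`.

```typescript
type Destination = "tasks-socket" | "vnc-socket" | "next";

// Hypothetical helper: which handler a request path ends up at.
function classifyRequest(pathname: string): Destination {
  if (pathname.startsWith("/api/proxy/tasks")) return "tasks-socket";
  if (pathname.startsWith("/api/proxy/websockify")) return "vnc-socket";
  return "next"; // pages and the catch-all API route
}

// Hypothetical helper: swaps the proxy prefix for the target's own path,
// keeping any trailing path or query string.
function rewriteVncPath(requestUrl: string, targetPathname: string): string {
  return targetPathname + requestUrl.replace(/^\/api\/proxy\/websockify/, "");
}

const apiDestination = classifyRequest("/api/tasks/models");
const vncDestination = classifyRequest("/api/proxy/websockify");
const rewritten = rewriteVncPath("/api/proxy/websockify?token=abc", "/websockify");
```

Note that a plain `/api/tasks` request is not caught by the `/api/proxy/tasks` prefix, so it falls through to Next.js and is served by route.ts.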

server.ts

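/**
 * Bytebot UI server entry point
 *
 * This file starts the Next.js application server and configures the proxy
 * middleware that forwards requests to the backend services.
 * Main responsibilities:
 * 1. Start the Next.js app
 * 2. Proxy task-related requests to bytebot-agent
 * 3. Proxy VNC-related requests to bytebotd
 * 4. Pass every other request to Next.js
 */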

import express from "express";
import { createServer } from "http";
import { createProxyMiddleware } from "http-proxy-middleware";
import { createProxyServer } from "http-proxy";
import next from "next";
import dotenv from "dotenv";

// Load environment variables
dotenv.config();

// Runtime configuration
const dev = process.env.NODE_ENV !== "production";
const hostname = process.env.HOSTNAME || "localhost";
const port = parseInt(process.env.PORT || "9992", 10);

// Backend service URLs
const BYTEBOT_AGENT_BASE_URL = process.env.BYTEBOT_AGENT_BASE_URL; // AI agent service
const BYTEBOT_DESKTOP_VNC_URL = process.env.BYTEBOT_DESKTOP_VNC_URL; // desktop VNC service

// Initialize the Next.js app
const app = next({ dev, hostname, port });

app
  .prepare()
  .then(() => {
    // HTTP requests are handled by the Next.js request handler
    const handle = app.getRequestHandler();
    // WebSocket upgrades that are not proxied are handled by the Next.js upgrade handler
    const nextUpgradeHandler = app.getUpgradeHandler();

    const expressApp = express();

    // Socket.IO WebSocket proxy (connects to bytebot-agent)
    // Rewrites /api/proxy/tasks to /socket.io and forwards to the backend Socket.IO server
    const tasksProxy = createProxyMiddleware({
      target: BYTEBOT_AGENT_BASE_URL,
      ws: true, // support both HTTP and WebSocket
      pathRewrite: { "^/api/proxy/tasks": "/socket.io" },
    });

    // VNC WebSocket proxy server
    // http-proxy is used directly here for finer control over the request and response
    // changeOrigin: true changes the origin of the Host header to the target URL
    const vncProxy = createProxyServer({ changeOrigin: true, ws: true });

    // Apply the HTTP proxy middleware
    expressApp.use("/api/proxy/tasks", tasksProxy);

    // Proxy plain HTTP requests on the VNC path (used for desktop viewing)
    expressApp.use("/api/proxy/websockify", (req, res) => {
      // Rewrite /api/proxy/websockify to the path of the target VNC service
      const targetUrl = new URL(BYTEBOT_DESKTOP_VNC_URL!);
      req.url =
        targetUrl.pathname +
        (req.url?.replace(/^\/api\/proxy\/websockify/, "") || "");
      // Forward the HTTP request with http-proxy
      vncProxy.web(req, res, {
        target: `${targetUrl.protocol}//${targetUrl.host}`,
      });
    });

    // Everything else goes to Next.js (page routes, API routes, and so on)
    expressApp.all("*", (req, res) => handle(req, res));

    // Wrap Express in a raw HTTP server so that both Express HTTP handling
    // and native WebSocket upgrade handling are available
    const server = createServer(expressApp);

    // Handle WebSocket upgrade requests
    server.on("upgrade", (request, socket, head) => {
      const { pathname } = new URL(
        request.url!,
        `http://${request.headers.host}`,
      );

      // Task-related WebSocket connections are upgraded and forwarded to bytebot-agent
      if (pathname.startsWith("/api/proxy/tasks")) {
        return tasksProxy.upgrade(request, socket as any, head);
      }

      // VNC WebSocket connections are upgraded and forwarded to bytebotd
      if (pathname.startsWith("/api/proxy/websockify")) {
        const targetUrl = new URL(BYTEBOT_DESKTOP_VNC_URL!);
        request.url =
          targetUrl.pathname +
          (request.url?.replace(/^\/api\/proxy\/websockify/, "") || "");
        return vncProxy.ws(request, socket as any, head, {
          target: `${targetUrl.protocol}//${targetUrl.host}`,
        });
      }

      // Any other WebSocket connection goes to the Next.js handler, for example the HMR WebSocket
      nextUpgradeHandler(request, socket, head);
    });

    // Start the server
    server.listen(port, hostname, () => {
      console.log(`> Ready on http://${hostname}:${port}`);
    });
  })
  .catch((err) => {
    console.error("Server failed to start:", err);
    process.exit(1);
  });

src/app/api/[[...path]]/route.ts

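/**
 * API route proxy handler
 * This file implements a generic proxy on top of Next.js API routes.
 * Every request sent to /api/* is forwarded to the bytebot-agent backend.
 * The optional catch-all route [[...path]] captures all path segments, which gives wildcard proxying.
 */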

import { NextRequest } from "next/server";

/* -------------------------------------------------------------------- */
/* Generic proxy helper                                                 */
/* -------------------------------------------------------------------- */
/**
 * Proxies a request to the backend service
 * @param req - the Next.js request object
 * @param path - path segments captured by the dynamic route
 * @returns the proxied response
 */
async function proxy(req: NextRequest, path: string[]): Promise<Response> {
  // Base URL of the backend service
  const BASE_URL = process.env.BYTEBOT_AGENT_BASE_URL!;
  // Join the path segments into a sub-path
  const subPath = path.length ? path.join("/") : "";
  // Build the full target URL (including the query string)
  const url = `${BASE_URL}/${subPath}${req.nextUrl.search}`;

  // Extract the cookies from the request (used for authentication and similar)
  const cookies = req.headers.get("cookie");

  // Build the options for the forwarded request
  const init: RequestInit = {
    method: req.method,
    headers: {
      "Content-Type": "application/json",
      // Forward the cookies to the backend when present
      ...(cookies && { Cookie: cookies }),
    },
    // GET and HEAD requests carry no body
    body:
      req.method === "GET" || req.method === "HEAD"
        ? undefined
        : await req.text(),
  };

  // Send the request to the backend service
  const res = await fetch(url, init);
  const body = await res.text();

  // Extract Set-Cookie headers from the backend response (used to set auth cookies)
  const setCookieHeaders = res.headers.getSetCookie?.() || [];

  // Create the response headers
  const responseHeaders = new Headers({
    "Content-Type": "application/json",
  });

  // Copy any Set-Cookie headers onto the response (keeps the auth state)
  setCookieHeaders.forEach((cookie) => {
    responseHeaders.append("Set-Cookie", cookie);
  });

  // Return the proxied response
  return new Response(body, {
    status: res.status,
    headers: responseHeaders,
  });
}

/* -------------------------------------------------------------------- */
/* Route handlers                                                       */
/* -------------------------------------------------------------------- */
/**
 * Path parameter type
 * Note: in Next.js 15+, params is a Promise
 */
type PathParams = Promise<{ path?: string[] }>;

/**
 * Generic request handler
 * Extracts the path from the dynamic route, then calls the proxy function
 */
async function handler(req: NextRequest, { params }: { params: PathParams }) {
  const { path } = await params;
  return proxy(req, path ?? []);
}

// Export a handler for every HTTP method
export const GET = handler;
export const POST = handler;
export const PUT = handler;
export const PATCH = handler;
export const DELETE = handler;
export const OPTIONS = handler;
export const HEAD = handler;
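
To see what the proxy does to a concrete path, the URL-building step can be pulled out as a pure function. This is only a sketch of that one step: `buildTargetUrl` is a hypothetical name, and `http://agent:9991` is an assumed value for `BYTEBOT_AGENT_BASE_URL`.

```typescript
// Hypothetical helper: the URL-building step of proxy(), isolated.
// `path` is what Next.js captures for [[...path]], e.g. /api/tasks/models -> ["tasks", "models"].
function buildTargetUrl(baseUrl: string, path: string[], search: string): string {
  const subPath = path.length ? path.join("/") : "";
  return `${baseUrl}/${subPath}${search}`;
}

const modelsUrl = buildTargetUrl("http://agent:9991", ["tasks", "models"], "");
const pagedUrl = buildTargetUrl("http://agent:9991", ["tasks"], "?limit=5");
```

The `/api` prefix is dropped because the route file itself lives under `src/app/api`, so only the segments after it are captured.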

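Execution flow design

Task execution flow

1. The user enters a task description
   ↓
2. bytebot-ui → HTTP POST → bytebot-agent (create the task)
   ↓
3. bytebot-agent → database (save the task)
   ↓
4. bytebot-agent → LLM API (generate an action plan)
   ↓
5. The LLM returns a tool call (tool_use)
   ↓
6. bytebot-agent → bytebotd/computer-use (perform the desktop action)
   ↓
7. bytebotd → nut-js → desktop system (actual execution)
   ↓
8. bytebotd → returns the action result → bytebot-agent
   ↓
9. bytebot-agent → builds a message → LLM (continue the conversation)
   ↓
10. Repeat steps 5-9 until the task is complete
   ↓
11. bytebot-agent → WebSocket → bytebot-ui (status update)
   ↓
12. The user sees the task completed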
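
Steps 4 to 10 form a loop: ask the model, run the tool it picked, feed the result back, and stop when the model reports a status. The following is a minimal sketch of that loop with the LLM and the desktop replaced by injected functions. All names here (`runTask`, `LoopDeps`, `maxIterations`) are illustrative assumptions; the real logic lives in AgentProcessor and is covered further below.

```typescript
type ToolUse = { name: string; input: Record<string, unknown> };
type Turn = { role: "assistant" | "user"; content: string };

// Injected dependencies so the loop can run without a real LLM or desktop.
interface LoopDeps {
  askModel(history: Turn[]): Promise<ToolUse>;   // stands in for the LLM API call
  runOnDesktop(tool: ToolUse): Promise<string>;  // stands in for the bytebotd call
}

async function runTask(
  description: string,
  deps: LoopDeps,
  maxIterations = 20, // assumed safety cap, not taken from the source
): Promise<Turn[]> {
  const history: Turn[] = [{ role: "user", content: description }];
  for (let i = 0; i < maxIterations; i++) {
    const tool = await deps.askModel(history);
    history.push({ role: "assistant", content: `tool_use:${tool.name}` });
    // set_task_status ends the loop (completed or needs_help)
    if (tool.name === "set_task_status") break;
    const result = await deps.runOnDesktop(tool);
    // Tool results go back into the history as user-role messages
    history.push({ role: "user", content: result });
  }
  return history;
}
```

Passing the model and the desktop in as functions keeps the loop itself easy to exercise with scripted responses.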

Live desktop viewing flow

1. The user opens the desktop tab
   ↓
2. bytebot-ui → WebSocket → bytebotd/websockify
   ↓
3. bytebotd → proxy → noVNC (6080)
   ↓
4. noVNC → VNC server → desktop image
   ↓
5. Image stream → WebSocket → bytebot-ui
   ↓
6. The react-vnc component renders the image

Takeover mode flow

1. The user clicks the "Take over" button
   ↓
2. bytebot-ui → WebSocket → bytebot-agent (task paused)
   ↓
3. bytebot-agent → event → bytebotd (enable input capture)
   ↓
4. The user operates the desktop
   ↓
5. bytebotd (uiohook-napi) → captures input events → sends messages to bytebot-agent
   ↓
6. bytebot-agent → WebSocket → bytebot-ui (real-time sync)
   ↓
7. bytebot-ui → VNC → desktop (perform the actions)
   ↓
8. The user finishes and clicks "Resume"
   ↓
9. bytebot-agent → continues executing the task

bytebot-agent - AI agent service

Description: the "brain" of the system. It interprets user tasks, plans their execution, coordinates desktop actions, and talks to the LLM.

Tech stack:

NestJS framework
Prisma ORM: database access
Socket.IO: real-time communication
Multiple LLMs supported: Anthropic Claude, OpenAI GPT, Google Gemini

Directory structure

bytebot-agent/
├── src/
│   ├── agent/              ###  Core AI agent module
│   │   ├── agent.processor.ts      ###  Task processor (core)
│   │   ├── agent.scheduler.ts       ###  Task scheduler
│   │   ├── agent.tools.ts           ###  Tool definitions
│   │   ├── agent.computer-use.ts    ###  Computer action handling
│   │   ├── agent.analytics.ts       ###  Analytics service
│   │   ├── input-capture.service.ts ###  Input capture service
│   │   ├── agent.module.ts         ###  Agent module
│   │   ├── agent.types.ts          ###  Type definitions
│   │   └── agent.constants.ts      ###  Constants
│   ├── anthropic/         ###  Anthropic Claude API service
│   ├── openai/            ###  OpenAI API service
│   ├── google/            ###  Google Gemini API service
│   ├── proxy/             ###  LLM proxy service
│   ├── tasks/             ###  Task management module
│   ├── messages/          ###  Message management module
│   ├── summaries/         ###  Message summary module
│   ├── prisma/            ###  Prisma ORM module
│   ├── app.module.ts      ###  Application root module
│   └── main.ts            ###  Application entry point
└── prisma/                ###  Prisma database schema
    └── schema.prisma

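Modular design

The package follows the NestJS modular architecture. The main modules are:

AppModule: the application root module, which ties all feature modules together
AgentModule: the core AI agent module, containing the processor, the scheduler, and so on
TasksModule: the task management module, providing task CRUD and status management
MessagesModule: the message management module, which handles conversation messages
SummariesModule: the summary module, which manages summaries of long conversations
LLM service modules: AnthropicModule, OpenAIModule, GoogleModule, ProxyModule

main.ts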

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { webcrypto } from 'crypto';
import { json, urlencoded } from 'express';

/**
 * Application bootstrap function
 * Initializes the NestJS app, configures middleware and CORS, and starts the server
 */
async function bootstrap() {
  console.log('Starting bytebot-agent application...');

  try {
    const app = await NestFactory.create(AppModule);

    // Configure the body parsers with a larger payload limit (50MB) for large file uploads
    app.use(json({ limit: '50mb' }));
    app.use(urlencoded({ limit: '50mb', extended: true }));

    // Enable CORS and allow requests from any origin
    app.enableCors({
      origin: '*',
      methods: ['GET', 'POST', 'PUT', 'DELETE', 'OPTIONS', 'PATCH'],
    });

    // Start the server on the configured port (default 9991)
    await app.listen(process.env.PORT ?? 9991);
  } catch (error) {
    console.error('Error starting application:', error);
  }
}
bootstrap();

app.module.ts


import { Module } from '@nestjs/common';
import { AppController } from './app.controller';
import { AppService } from './app.service';
import { AgentModule } from './agent/agent.module';
import { TasksModule } from './tasks/tasks.module';
import { MessagesModule } from './messages/messages.module';
import { AnthropicModule } from './anthropic/anthropic.module';
import { OpenAIModule } from './openai/openai.module';
import { GoogleModule } from './google/google.module';
import { PrismaModule } from './prisma/prisma.module';
import { ConfigModule } from '@nestjs/config';
import { ScheduleModule } from '@nestjs/schedule';
import { EventEmitterModule } from '@nestjs/event-emitter';
import { SummariesModule } from './summaries/summaries.modue';
import { ProxyModule } from './proxy/proxy.module';

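/**
 * Application root module
 * Imports and configures all feature modules
 */
@Module({
  imports: [
    // Scheduling module, used to run tasks on a timer
    ScheduleModule.forRoot(),
    // Event emitter module, used for events between modules
    EventEmitterModule.forRoot(),
    // Config module, available globally, used to read environment variables
    ConfigModule.forRoot({
      isGlobal: true,
    }),
    // Core business modules
    AgentModule,        // AI agent processing
    TasksModule,         // task management
    MessagesModule,      // message management
    SummariesModule,     // message summaries
    // LLM service modules
    AnthropicModule,     // Anthropic Claude API service
    OpenAIModule,        // OpenAI API service
    GoogleModule,        // Google Gemini API service
    ProxyModule,         // LLM proxy service
    // Data access module
    PrismaModule,        // Prisma ORM database access
  ],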
  controllers: [AppController],
  providers: [AppService],
})
export class AppModule {}

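Database design

bytebot-project/
├── .env                    # Environment variables
├── prisma/
│   ├── schema.prisma       # Data model definitions
│   └── migrations/         # Database migration files
├── src/
│   ├── prisma.service.ts   # Prisma service
│   └── prisma.module.ts    # Prisma module
└── package.json

Creating the Prisma configuration

schema.prisma

schema.prisma is Prisma's core configuration file. It defines the data models, the database connection, and the generator configuration.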

generator client {
  provider = "prisma-client-js"
}

datasource db {
  provider = "postgresql"
  url      = env("DATABASE_URL")
}

enum TaskStatus {
  PENDING
  RUNNING
  NEEDS_HELP
  NEEDS_REVIEW
  COMPLETED
  CANCELLED
  FAILED
}

enum TaskPriority {
  LOW
  MEDIUM
  HIGH
  URGENT
}

enum Role {
  USER
  ASSISTANT
}

enum TaskType {
  IMMEDIATE
  SCHEDULED
}

model Task {
  id            String        @id @default(uuid())
  description   String
  type          TaskType      @default(IMMEDIATE)
  status        TaskStatus    @default(PENDING)
  priority      TaskPriority  @default(MEDIUM)
  control       Role          @default(ASSISTANT)
  createdAt     DateTime      @default(now())
  createdBy     Role          @default(USER)
  scheduledFor  DateTime?
  updatedAt     DateTime      @updatedAt
  executedAt    DateTime?
  completedAt   DateTime?
  queuedAt      DateTime?
  error         String?
  result        Json?
  // Example: 
  // { "provider": "anthropic", "name": "claude-opus-4-20250514", "title": "Claude Opus 4" }
  model         Json
  messages      Message[]
  summaries     Summary[]
  files         File[]
}

model Summary {
  id             String     @id @default(uuid())
  content        String
  createdAt      DateTime   @default(now())
  updatedAt      DateTime   @updatedAt
  messages       Message[]  // One-to-many relationship: one Summary has many Messages

  task      Task        @relation(fields: [taskId], references: [id], onDelete: Cascade)
  taskId    String
  
  // Self-referential relationship
  parentSummary  Summary?   @relation("SummaryHierarchy", fields: [parentId], references: [id])
  parentId       String?
  childSummaries Summary[]  @relation("SummaryHierarchy")
}

model Message {
  id        String      @id @default(uuid())
  // Content field follows Anthropic's content blocks structure
  // Example: 
  // [
  //   {"type": "text", "text": "Hello world"},
  //   {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "..."}}
  // ]
  content   Json
  role      Role @default(ASSISTANT)
  createdAt DateTime    @default(now())
  updatedAt DateTime    @updatedAt
  task      Task        @relation(fields: [taskId], references: [id], onDelete: Cascade)
  taskId    String
  summary   Summary?    @relation(fields: [summaryId], references: [id])
  summaryId String?     // Optional foreign key to Summary
}

model File {
  id            String      @id @default(uuid())
  name          String
  type          String      // MIME type
  size          Int         // Size in bytes
  data          String      // Base64 encoded file data
  createdAt     DateTime    @default(now())
  updatedAt     DateTime    @updatedAt
  
  // Relations
  task          Task        @relation(fields: [taskId], references: [id], onDelete: Cascade)
  taskId        String
}

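Using schema.prisma

Generate the Prisma Client
- This generates @prisma/client, which contains type-safe database access methods
```
  # Generate the client from the schema
  npx prisma generate
```

Create database migrations

  # Create a migration file from the schema changes
  npx prisma migrate dev --name "add_user_profile"

  # Apply the migrations to the database
  npx prisma migrate deploy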

Create the Prisma service

 // prisma.service.ts
 import { Injectable, OnModuleInit } from '@nestjs/common';
 import { PrismaClient } from '@prisma/client';

 @Injectable()
 export class PrismaService extends PrismaClient implements OnModuleInit {
   constructor() {
     super();
   }

   async onModuleInit() {
     await this.$connect();
   }
 }

Create the Prisma module

 import { Global, Module } from '@nestjs/common';

 import { PrismaService } from './prisma.service';

 @Global()
 @Module({
   providers: [PrismaService], // register the service
   exports: [PrismaService], // export it for use by other modules
 })
 export class PrismaModule {}

Using it in a business module (a generic example; the User model is not part of the Bytebot schema above)

 // user.module.ts
 import { Module } from '@nestjs/common';
 import { UserService } from './user.service';
 import { UserController } from './user.controller';
 import { PrismaModule } from '../prisma/prisma.module';

 @Module({
   imports: [PrismaModule], // import the Prisma module
   controllers: [UserController],
   providers: [UserService],
 })
 export class UserModule {}

Using Prisma in a service

 // user.service.ts
 import { Injectable } from '@nestjs/common';
 import { PrismaService } from '../prisma/prisma.service';
 import { CreateUserDto } from './dto/create-user.dto';
 import { UpdateUserDto } from './dto/update-user.dto';

 @Injectable()
 export class UserService {
   constructor(private prisma: PrismaService) {}

   // Create a user
   async create(createUserDto: CreateUserDto) {
     return this.prisma.user.create({
       data: createUserDto,
     });
   }

   // Find all users
   async findAll() {
     return this.prisma.user.findMany({
       include: {
         posts: true, // include the related posts
       },
     });
   }

   // Find a user by ID
   async findOne(id: number) {
     return this.prisma.user.findUnique({
       where: { id },
       include: { posts: true },
     });
   }

   // Find a user by email
   async findByEmail(email: string) {
     return this.prisma.user.findUnique({
       where: { email },
     });
   }

   // Update a user
   async update(id: number, updateUserDto: UpdateUserDto) {
     return this.prisma.user.update({
       where: { id },
       data: updateUserDto,
     });
   }

   // Delete a user
   async remove(id: number) {
     return this.prisma.user.delete({
       where: { id },
     });
   }
 }

Task processing flow design

Task scheduling

import { Injectable, Logger, OnModuleInit } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';
import { TasksService } from '../tasks/tasks.service';
import { AgentProcessor } from './agent.processor';
import { TaskStatus } from '@prisma/client';
import { writeFile } from './agent.computer-use';

/**
 * Task scheduler
 * Periodically checks for tasks that are due and hands them to AgentProcessor
 * Main responsibilities:
 * 1. Check for scheduled tasks that have come due and queue them
 * 2. Find the next task to run (ordered by priority)
 * 3. Write the task's files to the desktop
 * 4. Start processing the task
 */
@Injectable()
export class AgentScheduler implements OnModuleInit {
  private readonly logger = new Logger(AgentScheduler.name);

  constructor(
    private readonly tasksService: TasksService,
    private readonly agentProcessor: AgentProcessor,
  ) {}

  /**
   * Runs when the module is initialized
   * Performs one task check immediately; the cron job takes over afterwards
   */
  async onModuleInit() {
    this.logger.log('AgentScheduler initialized');
    await this.handleCron();
  }

  /**
   * Cron handler
   * Runs every 5 seconds to check for and process pending tasks
   */
  @Cron(CronExpression.EVERY_5_SECONDS)
  async handleCron() {
    const now = new Date();

    // Check all scheduled tasks and queue the ones that are due
    const scheduledTasks = await this.tasksService.findScheduledTasks();
    for (const scheduledTask of scheduledTasks) {
      if (scheduledTask.scheduledFor && scheduledTask.scheduledFor < now) {
        this.logger.debug(
          `Task ID: ${scheduledTask.id} is scheduled for ${scheduledTask.scheduledFor}, queuing it`,
        );
        await this.tasksService.update(scheduledTask.id, {
          queuedAt: now,
        });
      }
    }

    // Skip this check if the processor is already running
    if (this.agentProcessor.isRunning()) {
      return;
    }

    // Find the next task to run (ordered by priority)
    const task = await this.tasksService.findNextTask();
    if (task) {
      // If the task has files, write them to the desktop first
      if (task.files.length > 0) {
        this.logger.debug(
          `Task ID: ${task.id} has files, writing them to the desktop`,
        );
        for (const file of task.files) {
          await writeFile({
            path: `/home/user/Desktop/${file.name}`,
            content: file.data, // file.data is already base64-encoded in the database
          });
        }
      }

      // Mark the task as running in the database and record the execution time
      await this.tasksService.update(task.id, {
        status: TaskStatus.RUNNING,
        executedAt: new Date(),
      });
      this.logger.debug(`Processing task ID: ${task.id}`);

      // Start processing the task
      this.agentProcessor.processTask(task.id);
    }
  }
}
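
The scheduler relies on `findNextTask()` returning the highest-priority pending task. That method is not shown here, so the following is only a sketch of what "ordered by priority" could mean when done in memory. The `pickNextTask` name, the rank table, and the tie-break on `createdAt` are assumptions for illustration; the real service presumably does this with a database query.

```typescript
type Priority = "LOW" | "MEDIUM" | "HIGH" | "URGENT";

interface QueuedTask {
  id: string;
  priority: Priority;
  createdAt: Date;
}

// Lower rank runs first. The order mirrors the TaskPriority enum, reversed.
const priorityRank: Record<Priority, number> = { URGENT: 0, HIGH: 1, MEDIUM: 2, LOW: 3 };

// Hypothetical in-memory version of "find the next task, ordered by priority".
// Ties are broken by creation time, oldest first (an assumption).
function pickNextTask(tasks: QueuedTask[]): QueuedTask | undefined {
  return [...tasks].sort(
    (a, b) =>
      priorityRank[a.priority] - priorityRank[b.priority] ||
      a.createdAt.getTime() - b.createdAt.getTime(),
  )[0];
}

const queue: QueuedTask[] = [
  { id: "a", priority: "LOW", createdAt: new Date("2025-01-01T00:00:00Z") },
  { id: "b", priority: "URGENT", createdAt: new Date("2025-01-03T00:00:00Z") },
  { id: "c", priority: "URGENT", createdAt: new Date("2025-01-02T00:00:00Z") },
];
const nextTask = pickNextTask(queue);
```

Because the scheduler only starts a task when the processor is idle, one task runs at a time and priority decides which one goes next.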

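Task processing

Steps

Step 1: The LLM analyzes the task and its context

// 1. Load the task and its message history
const task = await this.tasksService.findById(taskId);
const latestSummary = await this.summariesService.findLatest(taskId);
const unsummarizedMessages = await this.messagesService.findUnsummarized(taskId);

// 2. Build the message list (the summary plus the messages not yet summarized)
const messages = [
  ...(latestSummary ? [summaryMessage] : []), // summaryMessage: placeholder for a message built from latestSummary
  ...unsummarizedMessages,  // user messages, assistant messages, tool results, user actions, etc.
];

// 3. Call the LLM to generate a response
agentResponse = await service.generateMessage(
  AGENT_SYSTEM_PROMPT,  // system prompt that defines the AI's behavior
  messages,              // full conversation history
  model.name,
  true,                 // enable tool calls
  abortController.signal
);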

Step 2: Execute the tool calls

// The content blocks returned by the LLM may contain tool calls
for (const block of messageContentBlocks) {
  if (isComputerToolUseContentBlock(block)) {
    // Run the computer-use tool
    const result = await handleComputerToolUse(block, this.logger);
    generatedToolResults.push(result);
  }
  // ... other tool types
}

Step 3: The virtual desktop performs the action

// Example: a mouse click
async function clickMouse(input) {
  // Sent to the virtual desktop through its REST API
  await fetch(`${BYTEBOT_DESKTOP_BASE_URL}/computer-use`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      action: 'click_mouse',
      coordinates: { x: 100, y: 200 },
      button: 'left',
      clickCount: 1
    })
  });
}

Execution flow

Agent (HTTP POST)
    ↓
bytebotd/computer-use (REST API)
    ↓
ComputerUseService.action()
    ↓
NutService (built on nut-js)
    ↓
Actual desktop action (mouse movement, clicks, keyboard input)

Step 4: User actions (optional)

If the user operates the desktop manually at this point:

// bytebotd side: InputTrackingService listens for user input
uIOhook.on('mousedown', (event) => {
  // Capture the mouse click
  const action = convertToComputerAction(event);
  gateway.emitScreenshotAndAction(screenshot, action);
});

// bytebot-agent side: InputCaptureService receives it
socket.on('screenshotAndAction', async (shot, action) => {
  // Convert it into a user-action message
  const userActionBlock = {
    type: MessageContentType.UserAction,
    content: [
      { type: MessageContentType.Image, source: { data: shot.image } },
      convertClickMouseActionToToolUseBlock(action, toolUseId)
    ]
  };

  // Save it as a user message
  await this.messagesService.create({
    content: [userActionBlock],
    role: Role.USER,
    taskId
  });
});

User actions are sent over WebSocket in real time.
They are converted to a UserActionContentBlock automatically and saved.

Step 5: Collect the action result

// After the tool runs, the result is collected automatically
let image: string | null = null;
try {
  await new Promise(resolve => setTimeout(resolve, 750)); // wait for the UI to settle
  image = await screenshot(); // take a screenshot after the action
} catch (error) {
  logger.error('Failed to take screenshot', error);
}

// Build the tool result
const toolResult: ToolResultContentBlock = {
  type: MessageContentType.ToolResult,
  tool_use_id: block.id,
  content: [
    { type: MessageContentType.Text, text: 'Tool executed successfully' },
    { type: MessageContentType.Image, source: { data: image } } // screenshot
  ]
};

Step 6: Continue with the next iteration

// Save the tool results
if (generatedToolResults.length > 0) {
  await this.messagesService.create({
    content: generatedToolResults,
    role: Role.USER,  // tool results are stored as user messages
    taskId
  });
}

// Check the task status
if (setTaskStatusToolUseBlock) {
  // The LLM called the set_task_status tool
  if (status === 'completed') {
    await this.tasksService.update(taskId, { status: TaskStatus.COMPLETED });
    this.isProcessing = false; // stop the loop
  } else if (status === 'needs_help') {
    await this.tasksService.update(taskId, { status: TaskStatus.NEEDS_HELP });
    // Triggers takeover mode and waits for the user's help
  }
}

// If the task is still running, continue with the next iteration
if (this.isProcessing) {
  setImmediate(() => this.runIteration(taskId)); // scheduled asynchronously, non-blocking
}

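Example

Iteration 1:
  LLM: "I need to take a screenshot first to see the current state"
  → calls computer_screenshot
  → returns a screenshot (showing the desktop)

Iteration 2:
  LLM: "I can see the desktop; I need to open the browser"
  → calls computer_application({ application: 'firefox' })
  → waits 750ms
  → screenshot (showing the browser open)
  → saves the tool result

Iteration 3:
  LLM: "The browser is open; I need to click the address bar"
  → calls computer_click_mouse({ coordinates: {x: 500, y: 50} })
  → waits 750ms
  → screenshot (showing the address bar selected)
  → saves the tool result

Iteration 4:
  LLM: "Now type the search text"
  → calls computer_type_text({ text: 'TypeScript' })
  → waits 750ms
  → screenshot (showing the text entered)
  → saves the tool result

Iteration 5:
  LLM: "Press Enter to search"
  → calls computer_type_keys({ keys: ['Enter'] })
  → waits 750ms
  → screenshot (showing the search results)
  → saves the tool result

Iteration 6:
  LLM: "The task is complete"
  → calls set_task_status({ status: 'completed', description: '...' })
  → updates the task status to COMPLETED
  → stops the loop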

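LLM API services

Anthropic Claude API service (src/anthropic)
Google Gemini API service (src/google)
OpenAI API service (src/openai)
LLM proxy API service (src/proxy)

All of these services share the same core job: given a model name, the messages, and the MCP tools, they call the model and convert its response into the message format that tasks use. The OpenAI implementation is shown here.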

  async generateMessage(
    systemPrompt: string,
    messages: Message[],
    model: string = DEFAULT_MODEL.name,
    useTools: boolean = true,
    signal?: AbortSignal,
  ): Promise<BytebotAgentResponse> {
    const isReasoning = model.startsWith('o');
    try {
      const openaiMessages = this.formatMessagesForOpenAI(messages);

      const maxTokens = 8192;
      const response = await this.openai.responses.create(
        {
          model,
          max_output_tokens: maxTokens,
          input: openaiMessages, // the task's messages, converted to the format the API expects
          instructions: systemPrompt, // system prompt
          tools: useTools ? openaiTools : [], // the MCP tools made available
          reasoning: isReasoning ? { effort: 'medium' } : null,
          store: false,
          include: isReasoning ? ['reasoning.encrypted_content'] : [],
        },
        { signal },
      );

      return {
        contentBlocks: this.formatOpenAIResponse(response.output),
        tokenUsage: {
          inputTokens: response.usage?.input_tokens || 0,
          outputTokens: response.usage?.output_tokens || 0,
          totalTokens: response.usage?.total_tokens || 0,
        },
      };
    } catch (error: any) {
      console.log('error', error);
      console.log('error name', error.name);

      if (error instanceof APIUserAbortError) {
        this.logger.log('OpenAI API call aborted');
        throw new BytebotAgentInterrupt();
      }
      this.logger.error(
        `Error sending message to OpenAI: ${error.message}`,
        error.stack,
      );
      throw error;
    }
  }

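MCP tools

These are all the tools available to the AI agent:

Mouse tools (move, click, drag, scroll, and so on)
Keyboard tools (type text, press keys, shortcuts, and so on)
Application management tools
File tools
Task management tools

Each LLM service defines its own converter, such as agentToolToOpenAITool for OpenAI, to turn these definitions into the tool format its API expects.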

function agentToolToOpenAITool(agentTool: any): OpenAI.Responses.FunctionTool {
  return {
    type: 'function',
    name: agentTool.name,
    description: agentTool.description,
    parameters: agentTool.input_schema,
  } as OpenAI.Responses.FunctionTool;
}
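
The conversion is a straight field mapping, which a self-contained version makes easy to see. The sketch below uses a local `FunctionToolLike` interface in place of the OpenAI SDK type so that it runs without the SDK. `AgentTool`, `toFunctionTool`, and `missingRequired` are hypothetical names, and the required-field check is an extra illustration rather than something the source describes.

```typescript
// Shape of the shared tool definitions (a simplified assumption).
interface AgentTool {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
}

// Local stand-in for OpenAI.Responses.FunctionTool, reduced to the fields used above.
interface FunctionToolLike {
  type: "function";
  name: string;
  description: string;
  parameters: AgentTool["input_schema"];
}

// Same mapping as agentToolToOpenAITool: input_schema becomes parameters.
function toFunctionTool(tool: AgentTool): FunctionToolLike {
  return {
    type: "function",
    name: tool.name,
    description: tool.description,
    parameters: tool.input_schema,
  };
}

// Illustrative extra: lists required fields that the model's tool input left out.
function missingRequired(tool: AgentTool, input: Record<string, unknown>): string[] {
  return (tool.input_schema.required ?? []).filter((key) => !(key in input));
}

const moveMouseTool: AgentTool = {
  name: "computer_move_mouse",
  description: "Moves the mouse cursor to the specified coordinates",
  input_schema: {
    type: "object",
    properties: { coordinates: { type: "object" } },
    required: ["coordinates"],
  },
};

const converted = toFunctionTool(moveMouseTool);
const missing = missingRequired(moveMouseTool, {});
```

Keeping one provider-neutral definition and converting at the edge means a new tool only has to be declared once.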

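Shared tool definitions (agentTool)

The mouse, keyboard, application management, and file tools are all implemented by calling the API exposed by the bytebotd virtual desktop service.

/**
 * Common schema definitions reused by multiple tools
 */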
const coordinateSchema = {
  type: 'object' as const,
  properties: {
    x: {
      type: 'number' as const,
      description: 'The x-coordinate',
    },
    y: {
      type: 'number' as const,
      description: 'The y-coordinate',
    },
  },
  required: ['x', 'y'],
};

/**
 * Mouse tool definitions
 */
export const _moveMouseTool = {
  name: 'computer_move_mouse',
  description: 'Moves the mouse cursor to the specified coordinates',
  input_schema: {
    type: 'object' as const,
    properties: {
      coordinates: {
        ...coordinateSchema,
        description: 'Target coordinates for mouse movement',
      },
    },
    required: ['coordinates'],
  },
};

/**
 * Keyboard tool definitions
 */
export const _typeKeysTool = {
  name: 'computer_type_keys',
  description: 'Types a sequence of keys (useful for keyboard shortcuts)',
  input_schema: {
    type: 'object' as const,
    properties: {
      keys: {
        type: 'array' as const,
        items: { type: 'string' as const },
        description: 'Array of key names to type in sequence',
      },
      delay: {
        type: 'number' as const,
        description: 'Optional delay in milliseconds between key presses',
        nullable: true,
      },
    },
    required: ['keys'],
  },
};

/**
 * Utility tool definitions
 */
export const _applicationTool = {
  name: 'computer_application',
  description: 'Opens or focuses an application and ensures it is fullscreen',
  input_schema: {
    type: 'object' as const,
    properties: {
      application: {
        type: 'string' as const,
        enum: [
          'firefox',
          '1password',
          'thunderbird',
          'vscode',
          'terminal',
          'desktop',
          'directory',
        ],
        description: 'The application to open or focus',
      },
    },
    required: ['application'],
  },
};

/**
 * Task management tool definitions
 */
export const _setTaskStatusTool = {
  name: 'set_task_status',
  description: 'Sets the status of the current task',
  input_schema: {
    type: 'object' as const,
    properties: {
      status: {
        type: 'string' as const,
        enum: ['completed', 'needs_help'],
        description: 'The status of the task',
      },
      description: {
        type: 'string' as const,
        description:
          'If the task is completed, a summary of the task. If the task needs help, a description of the issue or clarification needed.',
      },
    },
    required: ['status', 'description'],
  },
};

export const _createTaskTool = {
  name: 'create_task',
  description: 'Creates a new task',
  input_schema: {
    type: 'object' as const,
    properties: {
      description: {
        type: 'string' as const,
        description: 'The description of the task',
      },
      type: {
        type: 'string' as const,
        enum: ['IMMEDIATE', 'SCHEDULED'],
        description: 'The type of the task (defaults to IMMEDIATE)',
      },
      scheduledFor: {
        type: 'string' as const,
        format: 'date-time',
        description: 'RFC 3339 / ISO 8601 datetime for scheduled tasks',
      },
      priority: {
        type: 'string' as const,
        enum: ['LOW', 'MEDIUM', 'HIGH', 'URGENT'],
        description: 'The priority of the task (defaults to MEDIUM)',
      },
    },
    required: ['description'],
  },
};

/**
 * File reading tool definition
 */
export const _readFileTool = {
  name: 'computer_read_file',
  description:
    'Reads a file from the specified path and returns it as a document content block with base64 encoded data',
  input_schema: {
    type: 'object' as const,
    properties: {
      path: {
        type: 'string' as const,
        description: 'The file path to read from',
      },
    },
    required: ['path'],
  },
};
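Each definition's `required` array can be used to reject an incomplete tool call before it reaches the desktop. The following is a minimal sketch of such a check; `missingRequired` and the reduced `ToolDefinition` type are hypothetical, and the article does not show whether Bytebot validates inputs this way.

```typescript
// Reduced view of a tool definition: only the fields this check needs.
interface ToolDefinition {
  name: string;
  input_schema: { required: string[] };
}

// Returns the names of required parameters that are absent from `input`.
function missingRequired(tool: ToolDefinition, input: Record<string, unknown>): string[] {
  return tool.input_schema.required.filter((key) => input[key] === undefined);
}

const setTaskStatus: ToolDefinition = {
  name: 'set_task_status',
  input_schema: { required: ['status', 'description'] },
};

// The model supplied a status but forgot the description.
const missing = missingRequired(setTaskStatus, { status: 'completed' });
const complete = missingRequired(setTaskStatus, { status: 'completed', description: 'Done' });
```

Here `missing` lists `description`, while `complete` is empty, so the second call could be forwarded.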

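bytebot-agent-cc - Claude Code agent variant

bytebot-agent-cc is a simplified variant built specifically around Claude Code; it integrates more directly through the MCP protocol and streaming. bytebot-agent is the fuller, general-purpose version, with multiple providers, message summarization, and more flexible task management.

Main differences from bytebot-agent

How the LLM service is implemented

bytebot-agent-cc: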

// Use Claude Code's query function, connecting over MCP
import { query } from '@anthropic-ai/claude-code';
for await (const message of query({
  prompt: task.description,
  options: {
    mcpServers: {
      desktop: {
        type: 'sse',
        url: `${BYTEBOT_DESKTOP_BASE_URL}/mcp`,
      },
    },
  },
})) {
  // Handle the streamed response
}

bytebot-agent:

// Supports multiple LLM providers and calls their REST APIs directly
const service = this.services[model.provider]; // anthropic/openai/google/proxy
const agentResponse = await service.generateMessage(
  AGENT_SYSTEM_PROMPT,
  messages,
  model.name,
  true,
  this.abortController.signal
);
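The `this.services[model.provider]` lookup above is a plain registry keyed by provider name. A minimal sketch of that pattern follows, with stub services and a synchronous, simplified `generateMessage`; the `LlmService` interface, `makeStub`, and `selectService` are assumptions for illustration, not the Bytebot implementation.

```typescript
type Provider = 'anthropic' | 'openai' | 'google' | 'proxy';

// Simplified stand-in for the shared service interface; the real
// generateMessage is async and takes more arguments.
interface LlmService {
  generateMessage(systemPrompt: string, messages: string[], modelName: string): string;
}

// Stub services that only report which provider handled the call.
function makeStub(label: Provider): LlmService {
  return {
    generateMessage: (_systemPrompt, messages, modelName) =>
      `${label}:${modelName}:${messages.length}`,
  };
}

const services: Record<Provider, LlmService> = {
  anthropic: makeStub('anthropic'),
  openai: makeStub('openai'),
  google: makeStub('google'),
  proxy: makeStub('proxy'),
};

// Look up the service for a provider, failing loudly on unknown names.
function selectService(provider: string): LlmService {
  if (!Object.prototype.hasOwnProperty.call(services, provider)) {
    throw new Error(`Unsupported provider: ${provider}`);
  }
  return services[provider as Provider];
}

const reply = selectService('openai').generateMessage('system prompt', ['hi'], 'gpt-4o');
```

Because every service implements the same interface, adding a provider means adding one registry entry rather than another branch in the task loop.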

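Module structure differences

Modules missing from bytebot-agent-cc:

❌ anthropic/ - standard Anthropic service
❌ openai/ - OpenAI service
❌ google/ - Google Gemini service
❌ proxy/ - LLM proxy service
❌ summaries/ - message summarization service

Modules included in bytebot-agent:

✅ The complete set of LLM service modules
✅ Multi-provider support (Anthropic, OpenAI, Google, Proxy)
✅ Message summarization service (for managing long contexts)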

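Context management

bytebot-agent-cc:

❌ No summarization service
❌ Does not manage message summaries
Relies on Claude Code's own context management

bytebot-agent:

✅ Has a full summarization service
✅ Automatically generates a summary when token usage reaches 75% of the context window
✅ Actively manages the context of long conversations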
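The 75% rule can be expressed as a one-line threshold check. This is a minimal sketch under the assumption that the agent knows the tokens used so far and the model's context window; the function name `shouldSummarize` is hypothetical.

```typescript
// Returns true once usage reaches the given fraction of the context window.
function shouldSummarize(usedTokens: number, contextWindow: number, threshold = 0.75): boolean {
  return usedTokens >= contextWindow * threshold;
}

// With a 200,000-token window the threshold sits at 150,000 tokens.
const needsSummary = shouldSummarize(150000, 200000);
const stillFits = shouldSummarize(149999, 200000);
```

Summarizing before the window is full leaves room for the summary request itself and for the next few tool results.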

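When to use which

Use bytebot-agent-cc when you:
- ✅ Work on code generation and programming tasks
- ✅ Need Claude Code-specific features
- ✅ Want deep integration with Claude Code
- ✅ Need stronger code understanding

Use bytebot-agent when you:
- ✅ Need support for multiple LLM providers
- ✅ Need to switch flexibly between models
- ✅ Need message summarization to manage long contexts
- ✅ Need finer-grained task control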

bytebot-llm-proxy

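Description: a unified LLM proxy service built on LiteLLM that exposes a single API for reaching multiple LLM providers.

Tech stack:

LiteLLM: a unified LLM proxy framework

Architecture

bytebot-llm-proxy is built on LiteLLM and exposes a single OpenAI-compatible API. Every LLM request is routed through this proxy to the matching provider.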

bytebot-agent → bytebot-llm-proxy → LLM provider
                (unified API)       (Anthropic/OpenAI/Gemini/...)

Configuration file

litellm-config.yaml defines every available LLM model:

model_list:
  # Anthropic Models
  - model_name: claude-opus-4
    litellm_params:
      model: anthropic/claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  # OpenAI Models
  - model_name: gpt-4.1
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Gemini Models
  - model_name: gemini-2.5-pro
    litellm_params:
      model: gemini/gemini-2.5-pro
      api_key: os.environ/GEMINI_API_KEY
  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash
      api_key: os.environ/GEMINI_API_KEY
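Each entry maps a public alias (`model_name`) to a `provider/model` target. The sketch below illustrates that lookup using a subset of the model_list above as plain data. It only shows the mapping the config describes and is not how LiteLLM itself is implemented; `resolveModel` is a hypothetical name.

```typescript
interface ModelEntry {
  model_name: string;
  litellm_params: { model: string };
}

// A subset of the model_list above, written as plain data.
const modelList: ModelEntry[] = [
  { model_name: 'claude-opus-4', litellm_params: { model: 'anthropic/claude-opus-4-20250514' } },
  { model_name: 'gpt-4o', litellm_params: { model: 'openai/gpt-4o' } },
  { model_name: 'gemini-2.5-flash', litellm_params: { model: 'gemini/gemini-2.5-flash' } },
];

// Resolve a public alias to its provider prefix and upstream model id.
function resolveModel(alias: string): { provider: string; model: string } | undefined {
  const entry = modelList.find((e) => e.model_name === alias);
  if (!entry) return undefined;
  const target = entry.litellm_params.model;
  const slash = target.indexOf('/');
  return { provider: target.slice(0, slash), model: target.slice(slash + 1) };
}

const opus = resolveModel('claude-opus-4');
const unknownModel = resolveModel('not-configured');
```

Callers only ever see the alias, so the dated upstream model id can change in the config without touching bytebot-agent.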

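Main features

Unified LLM interface

Supported providers:

OpenAI: GPT-4 and GPT-3.5 series
Anthropic: Claude Opus and Claude Sonnet series
Google Gemini: Gemini Pro and Gemini Flash series
Azure OpenAI: OpenAI models hosted on Azure
AWS Bedrock: a range of models hosted on AWS
Ollama: locally deployed open-source models
100+ other providers: reachable through LiteLLM's unified interface

Docker deployment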

FROM ghcr.io/berriai/litellm:main-stable

# Copy the configuration file
COPY ./bytebot-llm-proxy/litellm-config.yaml /app/config.yaml

# Start the LiteLLM service
CMD ["--config", "/app/config.yaml", "--port", "4000"]

Usage in bytebot-agent

bytebot-agent connects to bytebot-llm-proxy through ProxyService:

proxy.service.ts

// Initialize the OpenAI client with baseURL pointing at the LiteLLM proxy
this.openai = new OpenAI({
  apiKey: 'dummy-key-for-proxy',
  baseURL: proxyUrl, // BYTEBOT_LLM_PROXY_URL
});

// At call time, LiteLLM routes the model name to the matching provider
const completion = await this.openai.chat.completions.create({
  model: 'claude-opus-4', // LiteLLM routes this to Anthropic
  messages: [...],
  tools: [...]
});

Listening port: 4000

bytebotd

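Description: Bytebot's core desktop control service. It interacts directly with the virtual desktop environment and carries out every desktop operation.

Purpose: remote desktop plus automation execution platform

Tech stack:

NestJS (Node.js service framework)
@nut-tree-fork/nut-js: cross-platform desktop automation library
- Cross-platform desktop automation (screenshots, control recognition)
uiohook-napi: input tracking library
- Provides OS-wide listening for mouse and keyboard events
Socket.IO: WebSocket communication
@rekog/mcp-nest:
- MCP protocol integration for NestJS (AI tool calls)
sharp:
- Image processing and analysis
http-proxy-middleware:
- HTTP request proxy middleware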

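What happens at deployment time?

Dockerfile

Builds a Linux desktop with a graphical interface inside the container, makes it reachable from a browser, and runs the automation service.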

# Base setup
FROM ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive
ENV DISPLAY=:0

# 1. Install the desktop environment
RUN apt-get update && apt-get install -y \
    xvfb x11vnc xfce4 firefox-esr

# 2. Install development tools
RUN apt-get install -y \
    nodejs python3-pip git code

# 3. Install noVNC
RUN git clone https://github.com/novnc/noVNC.git /opt/noVNC

# 4. Copy the application code
COPY ./bytebotd/ /bytebot/bytebotd/
WORKDIR /bytebot/bytebotd
RUN npm install && npm run build

# 5. Create the user
RUN useradd -ms /bin/bash user

# 6. Expose the port
EXPOSE 9990

# 7. Start the services
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]

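VNC desktop: creates the virtual desktop and installs software
Web interface: the noVNC web client, which lets you open the virtual desktop by URL and interact with it
Automation API: starts the bytebotd service, which exposes the related endpoints

Core features

Virtual desktop configuration
Virtual desktop operation module
- Handles automation such as mouse, keyboard, and screenshot operations
Virtual desktop input tracking module
- Listens for and records the user's mouse and keyboard input on the virtual desktop
MCP
- Provides Model Context Protocol support for AI models

Virtual desktop configuration

Provides configuration for the virtual desktop and the software on it.

policies.json (the // comments below are annotations for this article; strict JSON does not allow comments)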

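{
  "policies": {
    // Suppress user messaging
    "UserMessaging": {
      "ExtensionRecommendations": false,      // Disable extension recommendations
      "FeatureRecommendations": false,        // Disable feature recommendations
      "UrlbarInterventions": false,           // Disable address bar intervention tips
      "SkipOnboarding": true,                 // Skip onboarding
      "MoreFromMozilla": false                // Hide promotion of other Mozilla products
    },

    // Page overrides (clear the default pages)
    "OverrideFirstRunPage": "",         // Clear the first-run page (open a blank page)
    "OverridePostUpdatePage": "",       // Clear the page shown after an update

    // Disable in-app notifications
    "InAppNotification": {
      "DonationEnabled": false,     // Disable donation requests
      "SurveyEnabled": false,       // Disable user surveys
      "MessageEnabled": false       // Disable all other messages
    }
  }
}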

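50-autologin.conf

Auto-login configuration: logs the specified user into the desktop environment automatically at system startup.

[Seat:*]
autologin-user=user         # Log in automatically as the user "user"
autologin-user-timeout=0    # Timeout of 0 seconds (log in immediately, without showing the login screen)
autologin-session=xfce      # Start the Xfce desktop session after auto-login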

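Virtual desktop operation module (Computer Use Service)

Executes desktop operations, translating high-level instructions from bytebot-agent into low-level system calls.

Supported operation types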

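// Mouse operations
- move_mouse: move the mouse to the given coordinates
- trace_mouse: move the mouse along a path
- click_mouse: click (single or double)
- press_mouse: press or release a mouse button
- drag_mouse: drag the mouse
- scroll: scroll the wheel (up, down, left, right)

// Keyboard operations
- type_keys: send a key sequence (supports shortcuts)
- press_keys: press or release keys (for key combinations)
- type_text: type text character by character
- paste_text: paste text (via the clipboard)

// System operations
- screenshot: capture the screen (returns a base64 image)
- cursor_position: get the current mouse position
- application: application management (open/activate/maximize)
- write_file: write a file
- read_file: read a file (supports multiple formats)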

Operation execution flow

// ComputerUseController receives the request
@Post()
async action(@Body() params: ComputerActionDto) {
  return await this.computerUseService.action(params);
}

// ComputerUseService routes to the specific operation
async action(params: ComputerAction): Promise<any> {
  switch (params.action) {
    case 'click_mouse':
      await this.clickMouse(params);
      break;
    case 'type_text':
      await this.typeText(params);
      break;
    // ... other operations
  }
}

// NutService performs the low-level operation
async mouseClickEvent(button: Button): Promise<void> {
  await mouse.click(button);
}
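The `switch (params.action)` routing works well with a TypeScript discriminated union, because the compiler narrows `params` in each case and can flag unhandled actions. The following is a minimal sketch with three of the actions; the field shapes are assumptions modeled on the tool schemas earlier in this article, and `describeAction` is a hypothetical stand-in for the real handlers.

```typescript
// A reduced ComputerAction union, discriminated by the `action` field.
type ComputerAction =
  | { action: 'move_mouse'; coordinates: { x: number; y: number } }
  | { action: 'type_text'; text: string }
  | { action: 'screenshot' };

// Route on the discriminant; each case sees only its own fields.
function describeAction(params: ComputerAction): string {
  switch (params.action) {
    case 'move_mouse':
      return `move to (${params.coordinates.x}, ${params.coordinates.y})`;
    case 'type_text':
      return `type ${params.text.length} characters`;
    case 'screenshot':
      return 'capture screen';
    default: {
      // If a new action is added to the union but not handled above,
      // this assignment stops compiling.
      const unreachable: never = params;
      throw new Error(`Unsupported action: ${JSON.stringify(unreachable)}`);
    }
  }
}

const summary = describeAction({ action: 'move_mouse', coordinates: { x: 10, y: 20 } });
```

The `never` assignment in the default branch is what turns a forgotten case into a compile-time error instead of a silent no-op.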

Application management

private async application(action: ApplicationAction): Promise<void> {
  // Check whether the application is already open
  const appOpen = await this.checkApplicationOpen(action.application);

  if (appOpen) {
    // Activate and maximize the window
    await this.activateWindow(action.application);
    await this.maximizeWindow(action.application);
  } else {
    // Launch the application
    await this.launchApplication(action.application);
  }
}

// Manage windows with wmctrl
spawnAndForget('sudo', [
  '-u', 'user',
  'wmctrl', '-x', '-a',  // activate the window
  processMap[application]
]);
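The article shows only the activate call. As a sketch of how the two window commands might be assembled, the helpers below build the `sudo` argument lists as plain arrays. The maximize variant is an assumption based on wmctrl's `-r` (select window) and `-b` (change window state) options, and the `processMap` values here are hypothetical WM_CLASS strings, since the real map is not shown.

```typescript
// Hypothetical WM_CLASS values, for illustration only.
const processMap = new Map<string, string>([
  ['firefox', 'Navigator.firefox-esr'],
  ['terminal', 'xfce4-terminal.Xfce4-Terminal'],
]);

function windowClass(application: string): string {
  const cls = processMap.get(application);
  if (cls === undefined) throw new Error(`Unknown application: ${application}`);
  return cls;
}

// sudo argv that activates (focuses) the application's window.
function activateArgs(application: string): string[] {
  return ['-u', 'user', 'wmctrl', '-x', '-a', windowClass(application)];
}

// sudo argv that maximizes it: -r selects the window, -b changes its state.
function maximizeArgs(application: string): string[] {
  return [
    '-u', 'user', 'wmctrl', '-x', '-r', windowClass(application),
    '-b', 'add,maximized_vert,maximized_horz',
  ];
}

const args = maximizeArgs('firefox');
```

Passing arguments as an array to `spawn`, rather than building a shell string, avoids quoting problems if an application name ever contains spaces.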

File operations

// Write a file
private async writeFile(action: WriteFileAction): Promise<{ success: boolean; message: string }> {
  // 1. Decode the base64 data
  const buffer = Buffer.from(action.data, 'base64');

  // 2. Resolve the path (relative paths become absolute)
  let targetPath = action.path;
  if (!path.isAbsolute(targetPath)) {
    targetPath = path.join('/home/user/Desktop', targetPath);
  }

  // 3. Create the directory if it does not exist
  await execAsync(`sudo mkdir -p "${path.dirname(targetPath)}"`);

  // 4. Write to a temporary file, then copy it to the target location
  const tempFile = `/tmp/bytebot_temp_${Date.now()}`;
  await fs.writeFile(tempFile, buffer);
  await execAsync(`sudo cp "${tempFile}" "${targetPath}"`);
  await execAsync(`sudo chown user:user "${targetPath}"`);

  return { success: true, message: `File written to: ${targetPath}` };
}

// Read a file
private async readFile(action: ReadFileAction): Promise<{
  success: boolean;
  data?: string;
  name?: string;
  size?: number;
  mediaType?: string;
}> {
  // 1. Resolve the path
  let targetPath = action.path;
  if (!path.isAbsolute(targetPath)) {
    targetPath = path.join('/home/user/Desktop', targetPath);
  }

  // 2. Copy to a temporary location (using sudo)
  const tempFile = `/tmp/bytebot_read_${Date.now()}`;
  await execAsync(`sudo cp "${targetPath}" "${tempFile}"`);
  await execAsync(`sudo chmod 644 "${tempFile}"`);

  // 3. Read the file contents
  const buffer = await fs.readFile(tempFile);
  const base64Data = buffer.toString('base64');

  // 4. Determine the media type
  const ext = path.extname(targetPath).toLowerCase().slice(1);
  const mimeTypes: Record<string, string> = {
    pdf: 'application/pdf',
    docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    txt: 'text/plain',
    png: 'image/png',
    jpg: 'image/jpeg',
    // ... more types
  };

  return {
    success: true,
    data: base64Data,
    name: path.basename(targetPath),
    size: buffer.length,
    mediaType: mimeTypes[ext] || 'application/octet-stream'
  };
}
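Both methods share two pure steps: resolving a relative path against the desktop directory and mapping the extension to a media type. The sketch below isolates them so they can be tested without touching the filesystem. It is a POSIX-only simplification that does not normalize paths the way Node's `path.join` does; `resolveDesktopPath` and `mediaTypeFor` are hypothetical names.

```typescript
const DESKTOP_DIR = '/home/user/Desktop';

// POSIX-only stand-in for path.isAbsolute + path.join (no normalization).
function resolveDesktopPath(filePath: string): string {
  return filePath.startsWith('/') ? filePath : `${DESKTOP_DIR}/${filePath}`;
}

// A Map avoids accidental hits on inherited object keys such as "constructor".
const mimeTypes = new Map<string, string>([
  ['pdf', 'application/pdf'],
  ['txt', 'text/plain'],
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
]);

// Map the file extension to a media type, defaulting to a generic binary type.
function mediaTypeFor(filePath: string): string {
  const name = filePath.slice(filePath.lastIndexOf('/') + 1);
  const dot = name.lastIndexOf('.');
  // dot > 0 treats dotfiles such as ".bashrc" as having no extension.
  const ext = dot > 0 ? name.slice(dot + 1).toLowerCase() : '';
  return mimeTypes.get(ext) ?? 'application/octet-stream';
}

const target = resolveDesktopPath('notes.txt');
const mediaType = mediaTypeFor(target);
```

Here `target` is the file on the desktop and `mediaType` is `text/plain`; an unknown extension falls back to `application/octet-stream`.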

NutService: low-level operations

Calls the underlying system methods that perform the actual mouse, keyboard, and screen operations.

Imports

// nut.service.ts
import { Injectable, Logger } from '@nestjs/common';
import {
  keyboard,
  mouse,
  Point,
  screen,
  Key,
  Button,
  FileType,
} from '@nut-tree-fork/nut-js';
import { spawn } from 'child_process';
import * as path from 'path';

@Injectable()  // Marks the class as an injectable service
export class NutService {
  async sendKeys(keys: string[], delay: number = 100) {
    // Sends a key sequence (full version in the keyboard section)
  }

  async pasteText(text: string): Promise<void> {
    // Pastes text (full version in the keyboard section)
  }
}

Keyboard operations

// Send a key sequence
async sendKeys(keys: string[], delay: number = 100): Promise<any> {
  const nutKeys = keys.map((key) => this.validateKey(key));
  await keyboard.pressKey(...nutKeys);
  await this.delay(delay);
  await keyboard.releaseKey(...nutKeys);
}

// Hold or release keys (for key combinations)
async holdKeys(keys: string[], down: boolean): Promise<any> {
  for (const key of keys) {
    const nutKey = this.validateKey(key);
    if (down) {
      await keyboard.pressKey(nutKey);
    } else {
      await keyboard.releaseKey(nutKey);
    }
  }
}

// Type text (character by character)
async typeText(text: string, delayMs: number = 0): Promise<void> {
  for (let i = 0; i < text.length; i++) {
    const char = text[i];
    const keyInfo = this.charToKeyInfo(char);
    if (keyInfo.withShift) {
      await keyboard.pressKey(Key.LeftShift, keyInfo.keyCode);
      await keyboard.releaseKey(Key.LeftShift, keyInfo.keyCode);
    } else {
      await keyboard.pressKey(keyInfo.keyCode);
      await keyboard.releaseKey(keyInfo.keyCode);
    }
    if (delayMs > 0 && i < text.length - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Paste text (via the clipboard)
async pasteText(text: string): Promise<void> {
  // 1. Copy the text to the clipboard with xclip
  await new Promise<void>((resolve, reject) => {
    const child = spawn('xclip', ['-selection', 'clipboard'], {
      env: { ...process.env, DISPLAY: ':0.0' },
      stdio: ['pipe', 'ignore', 'inherit'],
    });
    child.stdin.write(text);
    child.stdin.end();
    child.once('close', (code) => {
      code === 0 ? resolve() : reject(new Error(`xclip exited with code ${code}`));
    });
  });

  // 2. Wait for the clipboard to settle
  await new Promise((resolve) => setTimeout(resolve, 100));

  // 3. Send Ctrl+V
  await keyboard.pressKey(Key.LeftControl, Key.V);
  await keyboard.releaseKey(Key.LeftControl, Key.V);
}
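`typeText` relies on a `charToKeyInfo` helper that the article does not show. The sketch below is one plausible shape for it, assuming a US keyboard layout and using plain string key names instead of nut-js `Key` values; the symbol table is deliberately partial and everything here is illustrative rather than the Bytebot implementation.

```typescript
interface KeyInfo {
  keyName: string;    // plain string here; the real code uses nut-js Key values
  withShift: boolean;
}

// Shifted symbols on a US keyboard, mapped to their unshifted key (partial list).
const shiftedSymbols = new Map<string, string>([
  ['!', '1'], ['@', '2'], ['#', '3'], ['?', '/'], [':', ';'], ['_', '-'], ['+', '='],
]);

function charToKeyInfo(char: string): KeyInfo {
  if (/^[a-z]$/.test(char)) return { keyName: char.toUpperCase(), withShift: false };
  if (/^[A-Z]$/.test(char)) return { keyName: char, withShift: true };
  const base = shiftedSymbols.get(char);
  if (base !== undefined) return { keyName: base, withShift: true };
  // Digits and unshifted punctuation map to themselves.
  return { keyName: char, withShift: false };
}

// Plan the key presses needed to type "Hi!".
const plan = Array.from('Hi!').map((char) => charToKeyInfo(char));
```

For "Hi!" the plan is Shift+H, then I, then Shift+1, which is exactly the branch structure `typeText` walks through for each character.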

Mouse operations

// Move the mouse
async mouseMoveEvent(coordinates: Coordinates): Promise<void> {
  await mouse.setPosition(new Point(coordinates.x, coordinates.y));
}

// Click the mouse
async mouseClickEvent(button: Button): Promise<void> {
  await mouse.click(button);
}

// Press or release a mouse button
async mouseButtonEvent(button: Button, down: boolean): Promise<void> {
  if (down) {
    await mouse.pressButton(button);
  } else {
    await mouse.releaseButton(button);
  }
}

// Scroll the wheel (nut-js exposes one scroll method per direction)
async mouseWheelEvent(direction: 'up' | 'down' | 'left' | 'right', amount: number): Promise<void> {
  if (direction === 'up') await mouse.scrollUp(amount);
  else if (direction === 'down') await mouse.scrollDown(amount);
  else if (direction === 'left') await mouse.scrollLeft(amount);
  else await mouse.scrollRight(amount);
}

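Screenshot

async screendump(): Promise<Buffer> {
  // screen.capture() writes the screenshot to disk and resolves to the file path
  const file = await screen.capture(`screenshot-${Date.now()}`, FileType.PNG, '/tmp');
  // Read the PNG back (assumes `fs` is Node's 'fs/promises'; '/tmp' is an assumed directory)
  return fs.readFile(file);
}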

Virtual desktop input tracking module

Monitors user input on the desktop and sends it to bytebot-agent.

input-tracking.service.ts

// Start input tracking
startTracking() {
  if (this.isTracking) return;
  this.registerListeners();
  uIOhook.start();
  this.isTracking = true;
}

// Register event listeners
private registerListeners() {
  // Mouse movement
  uIOhook.on('mousemove', (e: UiohookMouseEvent) => {
    if (this.isDragging && this.dragMouseAction) {
      this.dragMouseAction.path.push({ x: e.x, y: e.y });
    } else {
      // Debounced screenshot (250 ms delay)
      if (this.screenshotTimeout) clearTimeout(this.screenshotTimeout);
      this.screenshotTimeout = setTimeout(async () => {
        this.screenshot = await this.computerUseService.screenshot();
      }, 250);
    }
  });
  
  // Mouse click (debounced, 250 ms)
  uIOhook.on('click', (e: UiohookMouseEvent) => {
    const action: ClickMouseAction = {
      action: 'click_mouse',
      button: this.mapButton(e.button),
      coordinates: { x: e.x, y: e.y },
      clickCount: e.clicks,
      holdKeys: [
        e.altKey ? 'alt' : undefined,
        e.ctrlKey ? 'ctrl' : undefined,
        e.shiftKey ? 'shift' : undefined,
        e.metaKey ? 'meta' : undefined,
      ].filter((key) => key !== undefined),
    };
    this.clickMouseActionBuffer.push(action);
    
    // Debounce: collect all click events within 250 ms and keep the one with the highest clickCount
    if (this.clickMouseActionTimeout) clearTimeout(this.clickMouseActionTimeout);
    this.clickMouseActionTimeout = setTimeout(async () => {
      const final = this.clickMouseActionBuffer.reduce((a, b) =>
        b.clickCount > a.clickCount ? b : a
      );
      await this.logAction(final);
      this.clickMouseActionBuffer = [];
    }, 250);
  });
  
  // Keyboard input
  uIOhook.on('keydown', async (e: UiohookKeyboardEvent) => {
    // Printable character → buffer it as part of a TypeTextAction
    if (!this.isModifierKey(e) && keyInfoMap[e.keycode].isPrintable) {
      this.bufferChar(
        e.shiftKey
          ? keyInfoMap[e.keycode].shiftString!
          : keyInfoMap[e.keycode].string!
      );
      return;
    }
    
    // Modifier or non-printable key → flush the buffer; the TypeKeysAction is sent on keyup
    await this.flushTypingBuffer();
    if (!this.pressedKeys.has(e.keycode)) {
      this.pressedKeys.add(e.keycode);
    }
  });
  
  uIOhook.on('keyup', async (e: UiohookKeyboardEvent) => {
    await this.flushTypingBuffer();
    if (this.pressedKeys.size > 0) {
      const action: TypeKeysAction = {
        action: 'type_keys',
        keys: Array.from(this.pressedKeys.values()).map(
          (key) => keyInfoMap[key].name
        ),
      };
      this.pressedKeys.clear();
      await this.logAction(action);
    }
  });
}
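The click handler buffers every click event and, after 250 ms of quiet, logs only the event with the highest `clickCount`, so a double click is recorded as one action rather than two. The sketch below isolates that reduction. It assumes a double click arrives as two events with counts 1 and 2; `collapseClicks` and the reduced `ClickEvent` type are hypothetical, and the empty-buffer guard is an addition.

```typescript
interface ClickEvent {
  clickCount: number;
  x: number;
  y: number;
}

// Pick the event with the highest clickCount from one debounce window.
function collapseClicks(buffer: ClickEvent[]): ClickEvent | undefined {
  // reduce() without an initial value throws on an empty array, so guard first.
  if (buffer.length === 0) return undefined;
  return buffer.reduce((a, b) => (b.clickCount > a.clickCount ? b : a));
}

// A double click buffered as two events within the same window.
const logged = collapseClicks([
  { clickCount: 1, x: 100, y: 200 },
  { clickCount: 2, x: 100, y: 200 },
]);
```

Keeping the reduction separate from the timer makes the debounce logic testable without waiting on `setTimeout`.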

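MCP

Note: only bytebot-agent-cc uses the MCP service described here; bytebot-agent has its own custom MCP service internally.

Where the tools are defined

packages/bytebotd/src/mcp/computer-use.tools.ts defines the MCP tools for all desktop operations, including: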

import { Injectable } from '@nestjs/common';
import { Tool } from '@rekog/mcp-nest';
import { z } from 'zod';
import { ComputerUseService } from '../computer-use/computer-use.service';

@Injectable()
export class ComputerUseTools {
  constructor(private readonly computerUse: ComputerUseService) {}

  // Mouse operations
  @Tool({ name: 'computer_move_mouse', description: 'Move the mouse to the given coordinates' })
  async moveMouse({ coordinates }: { coordinates: { x: number; y: number } }) {
    await this.computerUse.action({ action: 'move_mouse', coordinates });
    return { content: [{ type: 'text', text: 'mouse moved' }] };
  }

  @Tool({ name: 'computer_click_mouse', description: 'Click the mouse' })
  async clickMouse({ coordinates, button, clickCount }: {
    coordinates?: { x: number; y: number };
    button: 'left' | 'right' | 'middle';
    clickCount: number;
  }) {
    await this.computerUse.action({ action: 'click_mouse', coordinates, button, clickCount });
    return { content: [{ type: 'text', text: 'mouse clicked' }] };
  }

  // Keyboard operations
  @Tool({ name: 'computer_type_keys', description: 'Simulate a key sequence' })
  async typeKeys({ keys, delay }: { keys: string[]; delay?: number }) {
    await this.computerUse.action({ action: 'type_keys', keys, delay });
    return { content: [{ type: 'text', text: 'keys typed' }] };
  }

  @Tool({ name: 'computer_type_text', description: 'Type text' })
  async typeText({ text, delay }: { text: string; delay?: number }) {
    await this.computerUse.action({ action: 'type_text', text, delay });
    return { content: [{ type: 'text', text: 'text typed' }] };
  }

  // System operations
  @Tool({ name: 'computer_screenshot', description: 'Take a screenshot' })
  async screenshot() {
    const shot = await this.computerUse.action({ action: 'screenshot' }) as { image: string };
    return { content: [{ type: 'image', data: shot.image, mimeType: 'image/png' }] };
  }

  @Tool({ name: 'computer_application', description: 'Open an application' })
  async application({ application }: {
    application: 'firefox' | '1password' | 'vscode' | 'terminal';
  }) {
    await this.computerUse.action({ action: 'application', application });
    return { content: [{ type: 'text', text: 'application opened' }] };
  }

  // File operations
  @Tool({ name: 'computer_write_file', description: 'Write a file' })
  async writeFile({ path, data }: { path: string; data: string }) {
    await this.computerUse.action({ action: 'write_file', path, data });
    return { content: [{ type: 'text', text: 'file written' }] };
  }

  @Tool({ name: 'computer_read_file', description: 'Read a file' })
  async readFile({ path }: { path: string }) {
    const result = await this.computerUse.action({ action: 'read_file', path });
    return { content: [{ type: 'document', source: { type: 'base64', data: result.data } }] };
  }
}

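Where the module is registered

packages/bytebotd/src/mcp/bytebot-mcp.module.ts

import { Module } from '@nestjs/common';
import { McpModule } from '@rekog/mcp-nest';
// Import paths below are assumed from the file locations mentioned in this article
import { ComputerUseModule } from '../computer-use/computer-use.module';
import { ComputerUseTools } from './computer-use.tools';

@Module({
  imports: [
    ComputerUseModule,
    McpModule.forRoot({
      name: 'bytebotd',
      version: '0.0.1',
      sseEndpoint: '/mcp',  // Expose the MCP SSE endpoint
    }),
  ],
  providers: [ComputerUseTools],  // Register the tool provider
})
export class BytebotMcpModule {}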

Where the tools are used (primary)

packages/bytebot-agent-cc/src/agent/agent.processor.ts is the main place where the MCP tools are used:

// 1. Connect to the MCP server
for await (const message of query({
  prompt: task.description,
  options: {
    mcpServers: {
      desktop: {
        type: 'sse',
        url: `${BYTEBOT_DESKTOP_BASE_URL}/mcp`,  // bytebotd's MCP endpoint
      },
    },
  },
})) {
  // 2. Handle tool calls
  // Claude Code calls tools under names of the form mcp__desktop__computer_xxx
}

// 3. Convert tool names
private formatAnthropicResponse(content: Anthropic.ContentBlock[]) {
  // Keep non-tool blocks and desktop MCP tool calls; drop other tool calls
  content = content.filter(
    (block) =>
      block.type !== 'tool_use' || block.name.startsWith('mcp__desktop__'),
  );

  // Strip the prefix to get the standard tool name,
  // e.g. mcp__desktop__computer_click_mouse → computer_click_mouse
  return content.map((block) =>
    block.type === 'tool_use'
      ? { ...block, name: block.name.replace('mcp__desktop__', '') }
      : block,
  );
}
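The prefix handling can be isolated into a small pure function, which makes the naming rule easy to test. This is a minimal sketch; `toStandardToolName` is a hypothetical helper, and `Bash` stands in for any non-desktop tool name.

```typescript
const MCP_PREFIX = 'mcp__desktop__';

// Returns the bare tool name, or undefined for tools that are not desktop MCP tools.
function toStandardToolName(name: string): string | undefined {
  return name.startsWith(MCP_PREFIX) ? name.slice(MCP_PREFIX.length) : undefined;
}

const clickTool = toStandardToolName('mcp__desktop__computer_click_mouse');
const otherTool = toStandardToolName('Bash');
```

Using `slice` after a `startsWith` check strips the prefix only at the start of the name, whereas a bare `replace` would also match the string if it appeared later in the name.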

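Tool call flow

bytebot-agent-cc (Claude Code) → bytebotd MCP endpoint (/mcp) → ComputerUseTools → ComputerUseService → NutService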

shared - shared types and utilities

Description: type definitions and utility functions shared by all packages, for type safety and code reuse.

Type definitions

// ComputerAction types: desktop operation types
MoveMouseAction
TraceMouseAction
ClickMouseAction
PressMouseAction
DragMouseAction
ScrollAction
TypeKeysAction
PressKeysAction
TypeTextAction
PasteTextAction
ApplicationAction
WriteFileAction
ReadFileAction

// MessageContentBlock: message content types
TextContentBlock
ImageContentBlock
ToolUseContentBlock
ToolResultContentBlock

Utility functions

// ComputerAction Utils:
isMoveMouseAction
isClickMouseAction
// ... more type guards

// MessageContent Utils:
isTextContentBlock
isImageContentBlock
isToolUseContentBlock
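// ... more type guards

/**
 * Type guard: checks whether an object is a tool-use content block (generic)
 * @param obj - the object to validate
 * @returns a type predicate indicating that obj is a ToolUseContentBlock
 */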
export function isToolUseContentBlock(
  obj: unknown
): obj is ToolUseContentBlock {
  if (!obj || typeof obj !== "object") {
    return false;
  }

  const block = obj as Partial<ToolUseContentBlock>;
  return (
    block.type === MessageContentType.ToolUse &&
    typeof block.name === "string" &&
    typeof block.id === "string" &&
    block.input !== undefined &&
    typeof block.input === "object"
  );
}
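A type guard pays off at the call site: passing it to `filter` narrows an untyped array so the following code can read fields safely. The sketch below repeats the guard in reduced form so it is self-contained. The `'tool_use'` string is an assumed value for `MessageContentType.ToolUse`, and the extra null check on `input` is an addition.

```typescript
const TOOL_USE = 'tool_use'; // assumed value of MessageContentType.ToolUse

interface ToolUseContentBlock {
  type: typeof TOOL_USE;
  id: string;
  name: string;
  input: Record<string, unknown>;
}

// Reduced form of the guard above, plus a null check on `input`.
function isToolUseContentBlock(obj: unknown): obj is ToolUseContentBlock {
  if (!obj || typeof obj !== 'object') return false;
  const block = obj as Partial<ToolUseContentBlock>;
  return (
    block.type === TOOL_USE &&
    typeof block.name === 'string' &&
    typeof block.id === 'string' &&
    typeof block.input === 'object' &&
    block.input !== null
  );
}

const content: unknown[] = [
  { type: 'text', text: 'Clicking the button now.' },
  { type: 'tool_use', id: 'call_1', name: 'computer_click_mouse', input: { button: 'left' } },
];

// filter() narrows unknown[] to ToolUseContentBlock[], so .name is type-safe.
const toolNames = content.filter(isToolUseContentBlock).map((block) => block.name);
```

Only the second element passes the guard, so `toolNames` contains just `computer_click_mouse`.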

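Summary

bytebotd: the execution layer for desktop control
bytebot-agent: the core layer for task coordination and AI interaction
bytebot-ui: the presentation layer for user interaction
bytebot-llm-proxy: the unified entry layer for LLM access
shared: the shared layer of type definitions and utility functions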
