Midscene.js: An Open-Source AI Tool

I. What is Midscene.js?

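It is an AI-driven UI automation tool open-sourced by ByteDance. The slogan on its website reads: "Midscene.js - AI-driven, for a joyful UI automation experience." As the slogan suggests, it uses a large language model to parse the user's natural-language instructions, then works through an underlying automation framework to turn the parsed instructions into concrete page operations. This makes operating the browser far more efficient and gives us a more pleasant UI automation experience.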

II. Why use Midscene.js?

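In day-to-day work, most of us end up repeating the same actions in the browser over and over, such as searching for product data every day or filling in the same forms. These repetitive actions consume a lot of our energy. In the AI era, can we find an AI tool to complete such tasks for us?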

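You have probably used AI tools that operate the browser, such as Nanobrowser or BrowserOS. So what is different about Midscene.js, the tool recommended today? All of these tools use large language models to parse natural-language instructions, but the models they support differ slightly, and using a model usually means paying for it. Midscene.js supports the qwen-vl models, and when you register on the Qwen model platform for the first time, you are given 1 million free tokens, so getting started is effectively free.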

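The tool supports not only the web but also Android automation on mobile: you describe the actions you want in natural language and they are carried out on the phone.

Its reports are also very intuitive, with video replay as well as screenshots of each step.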

III. What are the core features?

1. Web browser automation:

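1.1 Chrome extension. The simplest and quickest way to get started is to search for Midscene in the Chrome Web Store and install it, then configure the model you want to use, preferably a large language model with vision support such as qwen-vl. After that, you can type instructions into the extension's input window.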

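1.2 MCP server. Midscene supports MCP, which lets an AI assistant control the browser through natural language and run UI tasks automatically. This also requires installing the Chrome extension, then switching to bridge mode and clicking to allow the connection.

1.3 Integration with Playwright or Puppeteer.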

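2. Android automation:

2.1 MCP is also supported on mobile. The prerequisites are a configured AI model, the Android adb tool installed, and USB debugging enabled on the Android device.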
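
Before starting an Android run, it can help to confirm that adb actually sees the device and that USB debugging has been authorized. The sketch below is not part of Midscene; it only parses the text printed by the `adb devices` command. The function names `parseAdbDevices` and `readyDevices` and the sample serial numbers are made up for illustration.

```typescript
interface AdbDevice {
  serial: string;
  state: string; // e.g. "device", "unauthorized", "offline"
}

// Parse the text printed by `adb devices` into a list of devices.
// The first line is the "List of devices attached" header; each following
// non-empty line has the form "<serial><whitespace><state>".
function parseAdbDevices(output: string): AdbDevice[] {
  return output
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith('List of devices'))
    .filter((line) => !line.startsWith('*')) // skip adb daemon start-up notices
    .map((line) => {
      const [serial, state = 'unknown'] = line.split(/\s+/);
      return { serial, state };
    });
}

// A device is usable only when its state is exactly "device";
// "unauthorized" means the USB debugging prompt has not been accepted yet.
function readyDevices(output: string): string[] {
  return parseAdbDevices(output)
    .filter((d) => d.state === 'device')
    .map((d) => d.serial);
}

// Hypothetical sample output with one ready and one unauthorized device.
const sampleAdbOutput =
  'List of devices attached\nemulator-5554\tdevice\nR58M123ABC\tunauthorized\n';
console.log(readyDevices(sampleAdbOutput));
```

In a real script, the output text would come from running `adb devices` yourself, and an empty result would be a signal to check the cable and the USB debugging prompt on the phone.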

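3. YAML-based automation:

Automation can also be driven from a .yaml file. With this approach, we can focus on describing the flow rather than writing scripts. For example: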

web:
  url: https://www.bing.com
tasks:
  - name: Search for weather
    flow:
      - ai: search for "today's weather"
      - sleep: 3000
  - name: Check the results
    flow:
      - aiAssert: the results show weather information
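
To make the structure of the YAML above more concrete, here is a minimal sketch that models the same tasks as typed data and lists the agent call each step would roughly correspond to. This is not Midscene's actual YAML runner: the types `FlowStep`, `Task`, and `PlannedCall` and the function `planCalls` are hypothetical, and it assumes that the `ai` key corresponds to the `aiAction` call used in the script later in this article.

```typescript
// One step of a flow, mirroring the three keys used in the YAML above.
type FlowStep = { ai: string } | { sleep: number } | { aiAssert: string };

interface Task {
  name: string;
  flow: FlowStep[];
}

// A planned call: which task it belongs to, the method name, and its argument.
interface PlannedCall {
  task: string;
  method: 'aiAction' | 'sleep' | 'aiAssert';
  arg: string | number;
}

// Walk the tasks in order and translate every step into one planned call.
function planCalls(tasks: Task[]): PlannedCall[] {
  const calls: PlannedCall[] = [];
  for (const task of tasks) {
    for (const step of task.flow) {
      if ('ai' in step) {
        calls.push({ task: task.name, method: 'aiAction', arg: step.ai });
      } else if ('sleep' in step) {
        calls.push({ task: task.name, method: 'sleep', arg: step.sleep });
      } else {
        calls.push({ task: task.name, method: 'aiAssert', arg: step.aiAssert });
      }
    }
  }
  return calls;
}

// The same two tasks as in the YAML example.
const weatherTasks: Task[] = [
  {
    name: 'Search for weather',
    flow: [{ ai: 'search for "today\'s weather"' }, { sleep: 3000 }],
  },
  { name: 'Check the results', flow: [{ aiAssert: 'the results show weather information' }] },
];
console.log(planCalls(weatherTasks));
```

The point of the sketch is only that a flow is an ordered list of small steps, each of which is either an AI-driven action, a fixed wait, or an AI-checked assertion.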

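4. Multiple AI models:

Multiple AI models are supported, including the common GPT-4, Qwen-2.5-VL, Doubao-1.5-thinking-vision-pro, UI-TARS, and Gemini 2.5 Pro. Qwen-2.5-VL is strongly recommended because it comes with a free quota.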

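5. Prompt-writing tips:

Because the tool relies on natural language to work out and carry out the actions, prompts should be as detailed as possible so that the model can better understand the result you want. For example: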

Bad example:
Search headphones
Good example:
Find the search box, type 'headphones', and click the search button
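
One way to keep prompts consistently detailed is to assemble them from their parts: where to act, what to enter, and how to submit. The helper below is a small sketch of that idea. `InputInstruction` and `buildInstruction` are illustrative names, not part of Midscene.

```typescript
// The parts of a detailed instruction. All field names are illustrative.
interface InputInstruction {
  target: string; // element to locate, e.g. "the search box"
  text: string; // text to type into it
  submit?: string; // optional follow-up action, e.g. "click the search button"
}

// Join the parts into one explicit, step-by-step instruction.
function buildInstruction({ target, text, submit }: InputInstruction): string {
  const parts = [`Find ${target.trim()}`, `type '${text.trim()}'`];
  if (submit !== undefined && submit.trim() !== '') {
    parts.push(submit.trim());
  }
  return parts.join(', ');
}

const detailedInstruction = buildInstruction({
  target: 'the search box',
  text: 'headphones',
  submit: 'click the search button',
});
// prints "Find the search box, type 'headphones', click the search button"
console.log(detailedInstruction);
```

The resulting string can then be passed to the tool as the instruction in place of a terse phrase like the bad example.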

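IV. Main use cases

1. Repetitive browser actions: any action you repeat in the browser day after day can be handed to Midscene, saving a great deal of time.

2. Automated testing: testers can use Midscene for automated tests and avoid the headache of element locators that keep changing.

3. Mobile automation: Android is supported as well, so repetitive actions on the phone can be automated to save time.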

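V. How to deploy?

Deployment options:

Option 1 (no coding background needed; this is the recommended way)

1. Search for Midscene.js in the Chrome Web Store and install it.

2. In the settings, configure the API key and the model name. I chose Qwen: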

OPENAI_API_KEY="xxxxxxx"
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1
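
When the same variables are used from a script (Option 2), a missing or empty value usually only shows up as a failed model call. The sketch below checks the settings up front. `checkModelEnv` is a hypothetical helper, not a Midscene API, and the rule that qwen-vl models need `MIDSCENE_USE_QWEN_VL=1` is an assumption taken from the configuration shown above.

```typescript
type Env = Record<string, string | undefined>;

// Keys taken from the configuration above.
const REQUIRED_KEYS = ['OPENAI_API_KEY', 'OPENAI_BASE_URL', 'MIDSCENE_MODEL_NAME'] as const;

// Return a list of human-readable problems; an empty list means the
// configuration looks complete.
function checkModelEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_KEYS) {
    const value = env[key];
    if (value === undefined || value.trim() === '') {
      problems.push(`${key} is not set`);
    }
  }
  // Assumption: qwen-vl models are paired with MIDSCENE_USE_QWEN_VL=1,
  // as in the configuration shown above.
  const model = env['MIDSCENE_MODEL_NAME'] ?? '';
  if (model.startsWith('qwen-vl') && env['MIDSCENE_USE_QWEN_VL'] !== '1') {
    problems.push('MIDSCENE_USE_QWEN_VL should be 1 for qwen-vl models');
  }
  return problems;
}

// Example with an incomplete configuration: the API key is missing.
const incompleteEnv: Env = {
  OPENAI_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  MIDSCENE_MODEL_NAME: 'qwen-vl-max-latest',
  MIDSCENE_USE_QWEN_VL: '1',
};
console.log(checkModelEnv(incompleteEnv));
```

In a Node.js script the same function could be called with `process.env` after the .env file has been loaded.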

Option 2

1. Install Node.js

2. Install Midscene and Playwright

npm install @midscene/web playwright @playwright/test tsx dotenv --save-dev

3. Write the script

import { chromium } from 'playwright';
import { PlaywrightAgent } from '@midscene/web/playwright';
import 'dotenv/config'; // read environment variables from .env file
// resolve after `ms` milliseconds
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
Promise.resolve(
  (async () => {
    const browser = await chromium.launch({
      headless: true, // 'true' means we can't see the browser window
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    });
    const page = await browser.newPage();
    await page.setViewportSize({
      width: 1280,
      height: 768,
    });
    await page.goto('https://www.ebay.com');
    await sleep(5000);
    // init Midscene agent
    const agent = new PlaywrightAgent(page);
    // type keywords, perform a search
    await agent.aiAction('type "Headphones" in search box, hit Enter');
    // wait for the loading
    await agent.aiWaitFor('there is at least one headphone item on page');
    // or you may use a plain sleep:
    // await sleep(5000);
    // understand the page content, find the items
    const items = await agent.aiQuery(
      '{itemTitle: string, price: Number}[], find item in list and corresponding price',
    );
    console.log('headphones in stock', items);
    const isMoreThan1000 = await agent.aiBoolean(
      'Is the price of the headphones more than 1000?',
    );
    console.log('isMoreThan1000', isMoreThan1000);
    const price = await agent.aiNumber(
      'What is the price of the first headphone?',
    );
    console.log('price', price);
    const name = await agent.aiString(
      'What is the name of the first headphone?',
    );
    console.log('name', name);
    const location = await agent.aiLocate(
      'What is the location of the first headphone?',
    );
    console.log('location', location);
    // assert by AI
    await agent.aiAssert('There is a category filter on the left');
    // click on the first item
    await agent.aiTap('the first item in the list');
    await browser.close();
  })(),
);
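
The `aiQuery` call in the script asks for a list of `{itemTitle, price}` objects, but the data comes from a model, so it is safer not to assume every entry has exactly that shape. The sketch below normalizes such a result and picks the cheapest item. `toPrice`, `normalizeItems`, and `cheapest` are hypothetical helpers written for this article, and the sample data is invented.

```typescript
interface Item {
  itemTitle: string;
  price: number;
}

// Coerce a loosely typed price ("$1,299.00", 59.9, null, ...) to a number,
// or return null when no number can be read from it.
function toPrice(raw: unknown): number | null {
  if (typeof raw === 'number') return Number.isFinite(raw) ? raw : null;
  if (typeof raw !== 'string') return null;
  const cleaned = raw.replace(/[^0-9.]/g, ''); // drop currency signs and commas
  if (cleaned === '') return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

// Keep only entries that have a string title and a readable price.
function normalizeItems(raw: unknown): Item[] {
  if (!Array.isArray(raw)) return [];
  const items: Item[] = [];
  for (const entry of raw) {
    if (typeof entry !== 'object' || entry === null) continue;
    const { itemTitle, price } = entry as { itemTitle?: unknown; price?: unknown };
    const parsed = toPrice(price);
    if (typeof itemTitle === 'string' && parsed !== null) {
      items.push({ itemTitle, price: parsed });
    }
  }
  return items;
}

// Return the item with the lowest price, or undefined for an empty list.
function cheapest(items: Item[]): Item | undefined {
  return items.reduce<Item | undefined>(
    (best, item) => (best === undefined || item.price < best.price ? item : best),
    undefined,
  );
}

// Invented sample standing in for the value returned by aiQuery.
const rawQueryResult: unknown = [
  { itemTitle: 'Studio headphones', price: '$1,299.00' },
  { itemTitle: 'Budget earbuds', price: 59.9 },
  { itemTitle: 'Broken entry', price: 'n/a' },
];
console.log(cheapest(normalizeItems(rawQueryResult)));
```

In the script above, `normalizeItems(items)` could be applied to the value returned by `agent.aiQuery` before any comparison or reporting.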

4. Run the script

npx tsx demo.ts

VI. Project repository

https://github.com/web-infra-dev/midscene