浅玩一下 Mobile Use

上一篇文章简单调研了下利用 AI 进行 UI 自动化的技术,顺着这个思路,是不是可以给 AI 一个手机,让它帮我们做更多事情呢?搜索一下,发现已经有人开发了 mobile-use 这个项目,今天就来玩一下。

首先需要安装好 adb,并通过 USB 数据线连接安卓手机,打开开发者模式,并开启 USB 调试模式,在终端输入 adb devices,如果显示有设备,则说明安装成功。

然后,写个简单的 Demo 脚本:

js 复制代码
const {mobileUse} = require('mobile-use')
const {createOpenAI} = require('@ai-sdk/openai')

const {readFile, access} = require('node:fs/promises')
const yargs = require('yargs')
const {hideBin} = require('yargs/helpers')

const openai = createOpenAI({
  baseURL: process.env.OPENAI_API_BASE_URL,
  apiKey: process.env.OPENAI_API_KEY,
})

const argv = yargs(hideBin(process.argv))
  .usage('Usage: $0 <command> [options]')
  .demandCommand(1, 'Please provide either a command string or a file path')
  .help().argv

async function run() {
  const input = argv._[0].toString()
  let task
  try {
    await access(input)
    task = (await readFile(input, 'utf-8')).trim()
  } catch {
    task = input
  }
  const response = await mobileUse({
    task,
    llm: openai('gpt-4o'),
  })
  return response.text
}

run()
  .catch((error) => console.error(error))
  .then((result) => console.log(result))

编写一个 task.txt 文件,即将手机笔记本中的内容复制到腾讯元宝中并提交:

markdown 复制代码
1. open com.hihonor.notepad
2. Tap Taylor Swift
3. Get content of Taylor Swift as $content
4. open com.tencent.hunyuan.app.chat
5. Type $content
6. Click send

执行 node index.js task.txt

效果如下所示:

我们还是通过请求日志来分析一下是怎么实现的。

第一次发送的请求如下所示,可以看到定义了 openApplistAppscomputer 三个 tool,这些 tool 最终都是通过 adb 来操作设备。

json 复制代码
{
  "model": "gpt-4o",
  "temperature": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are an experienced mobile automation engineer. \nYour job is to navigate an android device and perform actions to fullfil request of the user.\n\n<steps>\nIf the user asks to use a specific app in the request, open it before performing any other action.\nDo not take ui dump more than once per action. If you think you don't need to take ui dump, skip it. Use it sparingly.\n</steps>\n"
    },
    {
      "role": "user",
      "content": "1. open com.hihonor.notepad\n2. Tap Taylor Swift\n3. Get content of Taylor Swift as $content\n4. open com.tencent.hunyuan.app.chat\n5. Type $content\n6. Click send"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "openApp",
        "description": "Open an on on android device.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "package name of the app to open such as com.google.android.dialer"
            }
          },
          "required": ["name"],
          "additionalProperties": false
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "listApps",
        "description": "Use this to list packages.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "Name of the package to filter."
            }
          },
          "required": ["name"],
          "additionalProperties": false
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "computer",
        "description": "Mobile tool to perform actions on a mobile device.",
        "parameters": {
          "type": "object",
          "properties": {
            "action": {
              "type": "string",
              "enum": [
                "ui_dump",
                "tap",
                "swipe",
                "type",
                "press",
                "wait",
                "screenshot"
              ],
              "description": "ui_dump: Get UI elements you can interact with for the current screen.\ntap: Tap on the provided coordinate.\nswipe: Swipe from start_coordinate to end_coordinate.\ntype: Type in the box.\npress: Press mobile key or button.\nscreenshot: Take a screenshot of the current screen if UI dump is not helpful or where you need to see visuals. \n      "
            },
            "coordinate": {"type": "array", "items": {"type": "number"}},
            "start_coordinate": {"type": "array", "items": {"type": "number"}},
            "end_coordinate": {"type": "array", "items": {"type": "number"}},
            "text": {"type": "string"},
            "duration": {"type": "number"}
          },
          "required": ["action"],
          "additionalProperties": false
        }
      }
    }
  ],
  "tool_choice": "auto"
}

大模型返回 openApp 的 tool call,打开笔记本:

json 复制代码
{
  "id": "chatcmpl-BN9A2Vtp0QnGxf9frNG8fMtk4tBHg",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": [],
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "call_ydC7cOmRppTsVEu0AneG3g6R",
            "function": {
              "arguments": "{\"name\":\"com.hihonor.notepad\"}",
              "name": "openApp"
            },
            "type": "function"
          }
        ]
      }
    }
  ],
  ...
}

第二次请求,将 openApp 以及执行结果追加到 messges 中给大模型:

json 复制代码
{
  "model": "gpt-4o",
  "temperature": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are an experienced mobile automation engineer. \nYour job is to navigate an android device and perform actions to fullfil request of the user.\n\n<steps>\nIf the user asks to use a specific app in the request, open it before performing any other action.\nDo not take ui dump more than once per action. If you think you don't need to take ui dump, skip it. Use it sparingly.\n</steps>\n"
    },
    {
      "role": "user",
      "content": "1. open com.hihonor.notepad\n2. Tap Taylor Swift\n3. Get content of Taylor Swift as $content\n4. open com.tencent.hunyuan.app.chat\n5. Type $content\n6. Click send"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "call_ydC7cOmRppTsVEu0AneG3g6R",
          "type": "function",
          "function": {
            "name": "openApp",
            "arguments": "{\"name\":\"com.hihonor.notepad\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_ydC7cOmRppTsVEu0AneG3g6R",
      "content": "\"Successfull opened com.hihonor.notepad\""
    }
  ],
  "tools": [
    ...
  ],
  "tool_choice": "auto"
}

大模型返回 computer 类型的 tool call,参数为 {"action": "ui_dump"},即获取页面的 UI 信息:

json 复制代码
{
  "id": "chatcmpl-BN9A4Mrj0JZ8QeSZYsdlz7gOMiwhW",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": [],
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "call_2T37XmWjfQhItVmw4vQzkoom",
            "function": {
              "arguments": "{\"action\":\"ui_dump\"}",
              "name": "computer"
            },
            "type": "function"
          }
        ]
      }
    }
  ],
  ...
}

看一下 ui_dump 返回的结果,是一个 JSON 格式的数据结构,包含所有页面的元素的坐标、类型等信息。它是通过解析 uiautomator dump 的结果得到的。

json 复制代码
{
  "model": "gpt-4o",
  "temperature": 0,
  "messages": [
    ...
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "call_2T37XmWjfQhItVmw4vQzkoom",
          "type": "function",
          "function": {
            "name": "computer",
            "arguments": "{\"action\":\"ui_dump\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_2T37XmWjfQhItVmw4vQzkoom",
      "content": "[{\"type\":\"text\",\"text\":\"{\\\"type\\\":\\\"view\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,0][1080,2316]\\\",\\\"children\\\":[{\\\"type\\\":\\\"view\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,84][1080,468]\\\",\\\"desc\\\":\\\"1 条笔记\\\",\\\"id\\\":\\\"collapsing_toolbar\\\",\\\"children\\\":[{\\\"type\\\":\\\"button\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[936,96][1044,240]\\\",\\\"desc\\\":\\\"更多选项\\\",\\\"id\\\":\\\"action_menu_more_button\\\"},{\\\"type\\\":\\\"view\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[72,245][564,366]\\\",\\\"children\\\":[{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[72,245][432,366]\\\",\\\"text\\\":\\\"全部笔记\\\",\\\"id\\\":\\\"extend_appbar_title\\\"}]}]},{\\\"type\\\":\\\"list\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,468][1080,2316]\\\",\\\"id\\\":\\\"notes_listview\\\",\\\"children\\\":[{\\\"type\\\":\\\"view\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,636][1080,1012]\\\",\\\"id\\\":\\\"swipelayout\\\",\\\"children\\\":[{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[72,666][996,731]\\\",\\\"text\\\":\\\"Taylor Swift\\\",\\\"id\\\":\\\"item_title\\\"},{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[72,749][996,861]\\\",\\\"text\\\":\\\"本服务将笔记中输入的语音内容转换成文本,需联网,需读取设备唯一可识别信息,需收集、读取并传送笔记中输入的语音到服务器,其中语音内容会委托给第三方服务提供商,用于将语音转换成文本。点击"同意",即可启用该服务。\\\",\\\"id\\\":\\\"item_content_hide\\\"},{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[72,749][996,855]\\\",\\\"text\\\":\\\"We were both young when I first saw you.\\\",\\\"id\\\":\\\"item_content\\\"}]}]},{\\\"type\\\":\\\"view\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[984,468][1080,2316]\\\",\\\"id\\\":\\\"scroll_bar\\\"},{\\\"type\\\":\\\"view\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,468][1080,636]\\\",\\\"id\\\":\\\"real_search_view\\\",\\\"children\\\":[{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[36,498][1044,606]\\\",\\\"text\\\":\\\"搜索笔记\\\",\\\"desc\\\":\\\"搜索查询\\\",\\\"id\\\":\\\"search_src_text\\\"}]},{\\\"type\\\":\\\"button\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[864,2016][1008,2160]\\\",\\\"desc\\\":\\\"新建\\\",\\\"id\\\":\\\"fab_add\\\"},{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,0][0,0]\\\",\\\"text\\\":\\\"笔记\\\",\\\"id\\\":\\\"content\\\"},{\\\"type\\\":\\\"text\\\",\\\"clickable\\\":false,\\\"bounds\\\":\\\"[0,0][0,0]\\\",\\\"text\\\":\\\"待办\\\",\\\"id\\\":\\\"content\\\"}]}\"}]"
    }
  ],
  "tools": [
    ...
  ],
  "tool_choice": "auto"
}

然后 LLM 通过分析页面信息,得到 Taylor Swift 的点击坐标返回 computer 类型的 tool call,参数为 {"action": "tap", "coordinate":[72,666]}

json 复制代码
{
  "id": "chatcmpl-BN9A8OmVB5R8o7cQ15spyO278XpJM",
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": null,
        "refusal": null,
        "role": "assistant",
        "annotations": [],
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "call_KELEnVExZ9eBD7nmYxoVFrUk",
            "function": {
              "arguments": "{\"action\":\"tap\",\"coordinate\":[72,666]}",
              "name": "computer"
            },
            "type": "function"
          }
        ]
      }
    }
  ],
  ...
}

之后就是不停地重复类似的步骤了。

给大模型装配上手和脚以后,可以催生出很多应用场景,比如让大模型同时陪我的几个女朋友一起聊天,我就可以安心学习 AI 了,想想就开心。

相关推荐
qq_433502183 分钟前
Codex cli 飞书文档创建进阶实用命令 + Skill 创建&使用 小白完整教程
java·前端·飞书
IT_陈寒9 分钟前
为什么我的Vite热更新老是重新加载整个页面?
前端·人工智能·后端
一袋米扛几楼9828 分钟前
【网络安全】SIEM -Security Information and Event Management 工具是什么?
前端·安全·web安全
小陈工39 分钟前
2026年4月7日技术资讯洞察:下一代数据库融合、AI基础设施竞赛与异步编程实战
开发语言·前端·数据库·人工智能·python
Cobyte1 小时前
3.响应式系统基础:从发布订阅模式的角度理解 Vue2 的数据响应式原理
前端·javascript·vue.js
竹林8181 小时前
从零到一:在React前端中集成The Graph查询Uniswap V3池数据实战
前端·javascript
Mintopia1 小时前
别再迷信"优化":大多数性能问题根本不在代码里
前端
倾颜1 小时前
接入 MCP,不一定要先平台化:一次 AI Runtime 的实战取舍
前端·后端·mcp
军军君011 小时前
Three.js基础功能学习十八:智能黑板实现实例五
前端·javascript·vue.js·3d·typescript·前端框架·threejs