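OmniParser Visual Mouse Automation in Practice

Part 1. Principles

I. How OmniParser-style mouse simulation works

The workflow combines visual parsing with keyboard and mouse execution. OmniParser handles the parsing step; the input actions are performed by a separate automation layer (pyautogui in this write-up).

Screenshot → a YOLO-based detector (YOLOv8) identifies UI elements (buttons / input fields) and outputs bounding boxes (bbox)
Compute each element's center coordinates → call pyautogui to move, click, drag, and so on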

II. Python implementation (closest to OmniParser, recommended)

1. Install dependencies

pip install pyautogui pillow opencv-python

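2. Basic mouse operations (the same actions OmniTool uses)

import pyautogui

# Safety settings (slamming the mouse into the top-left corner aborts the script)
pyautogui.FAILSAFE = True
pyautogui.PAUSE = 0.5  # pause after every pyautogui call, in seconds

# 1. Get the screen size
w, h = pyautogui.size()
print(f"Screen: {w}x{h}")

# 2. Move the mouse (absolute / relative)
pyautogui.moveTo(500, 300, duration=0.3)  # move to (500, 300) over 0.3 s
pyautogui.moveRel(100, 50, duration=0.2)  # move 100 px right and 50 px down

# 3. Click (left / right / double)
pyautogui.click()  # left click at the current position
pyautogui.click(600, 400, button='right')  # right click at a given position
pyautogui.doubleClick(700, 500)

# 4. Drag
pyautogui.dragTo(800, 600, duration=0.5)  # drag from the current position to the target
pyautogui.dragRel(100, 0, duration=0.3)   # relative drag

# 5. Scroll wheel
pyautogui.scroll(10)  # scroll up
pyautogui.scroll(-5)  # scroll down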

3. Combining with visual recognition (full OmniParser style)

import pyautogui

def bbox_to_center(bbox, screen_w, screen_h):
    """Convert a normalized bbox to its center in screen pixels (common with OmniParser output)."""
    x1, y1, x2, y2 = bbox
    cx = int((x1 + x2) / 2 * screen_w)
    cy = int((y1 + y2) / 2 * screen_h)
    return cx, cy

# Simulated OmniParser bbox output (button / input field)
button_bbox = [0.2, 0.3, 0.4, 0.4]  # normalized coordinates
screen_w, screen_h = pyautogui.size()
cx, cy = bbox_to_center(button_bbox, screen_w, screen_h)

# Perform the click
pyautogui.moveTo(cx, cy, duration=0.2)
pyautogui.click()
print(f"Clicked: ({cx}, {cy})")

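III. Other languages / options

C# (Windows low-level, InputSimulator)

using System.Windows.Forms;
using WindowsInput;

var sim = new InputSimulator();
var bounds = Screen.PrimaryScreen.Bounds;
// MoveMouseTo expects absolute coordinates in the 0-65535 range, not pixels
sim.Mouse.MoveMouseTo(500 * 65535.0 / bounds.Width, 300 * 65535.0 / bounds.Height);
sim.Mouse.LeftButtonClick();

Install: NuGet → InputSimulator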

Node.js (RobotJS)

const robot = require("robotjs");
robot.moveMouse(500, 300);
robot.mouseClick();

Install: npm install robotjs

C++ (Windows API SendInput, low-level)

#include <Windows.h>

// Move the cursor to pixel (x, y) on the primary display
void MouseMove(int x, int y) {
    INPUT inp = {0};
    inp.type = INPUT_MOUSE;
    // MOUSEEVENTF_ABSOLUTE expects coordinates scaled to the 0..65535 range
    inp.mi.dx = (LONG)(x * 65535.0 / GetSystemMetrics(SM_CXSCREEN));
    inp.mi.dy = (LONG)(y * 65535.0 / GetSystemMetrics(SM_CYSCREEN));
    inp.mi.dwFlags = MOUSEEVENTF_MOVE | MOUSEEVENTF_ABSOLUTE;
    SendInput(1, &inp, sizeof(INPUT));
}
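The pixel-to-absolute scaling is the step most often gotten wrong, so here is the same arithmetic as a minimal Python sketch. `to_absolute` is a hypothetical helper name introduced for illustration; it only mirrors the `x * 65535 / width` formula used in the C++ function above and does not send any input.

```python
def to_absolute(pixel: int, screen_extent: int) -> int:
    """Scale a pixel coordinate to the 0..65535 range used for absolute mouse moves.

    screen_extent is the screen width (for x) or height (for y) in pixels.
    """
    if screen_extent <= 0:
        raise ValueError("screen_extent must be positive")
    return int(pixel * 65535.0 / screen_extent)


# Example: the point (500, 300) on a 1920x1080 display
dx = to_absolute(500, 1920)
dy = to_absolute(300, 1080)
print(dx, dy)
```

Any API that takes absolute coordinates in this range needs the same conversion before it is given pixel positions.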

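IV. Key tips (making input look more human)

Natural movement: use duration to control speed and avoid instant jumps
Normalized coordinates: adapt to different resolutions (the usual practice with OmniParser output)
Anti-detection: add random offsets and smooth trajectories (OxyMouse offers Bezier / Gaussian trajectories)
Safety: keep FAILSAFE enabled so a runaway script can be stopped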
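As an illustration of the smooth-trajectory idea, the sketch below generates a curved path with a quadratic Bezier curve whose control point is randomly offset. This is not the OxyMouse implementation; `bezier_path`, `move_along`, and the `spread` parameter are hypothetical names chosen for this example.

```python
import random


def bezier_path(start, end, steps=20, spread=0.2, rng=None):
    """Return steps + 1 integer points on a quadratic Bezier curve from start to end.

    The control point is the midpoint shifted by a random offset, so every call
    bends a little differently. spread scales the offset relative to the larger
    of the horizontal and vertical distances.
    """
    rng = rng or random.Random()
    (x0, y0), (x2, y2) = start, end
    reach = max(abs(x2 - x0), abs(y2 - y0), 1)
    cx = (x0 + x2) / 2 + rng.uniform(-spread, spread) * reach
    cy = (y0 + y2) / 2 + rng.uniform(-spread, spread) * reach
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        points.append((round(x), round(y)))
    return points


def move_along(points, step_duration=0.01):
    """Replay the path with pyautogui (imported lazily because it needs a display)."""
    import pyautogui

    for x, y in points:
        pyautogui.moveTo(x, y, duration=step_duration)
```

A typical call would be `move_along(bezier_path(pyautogui.position(), (cx, cy)))` followed by `pyautogui.click()`. Because `pyautogui.PAUSE` is applied after every call, it should be set to a small value while replaying a path.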

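V. Full pipeline (replicating OmniParser)

Screenshot → pyautogui.screenshot()
Visual recognition (YOLO/OpenCV) → bounding boxes
Coordinate conversion → center point in screen pixels
Keyboard/mouse execution → pyautogui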
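The four steps above can be sketched as one function with each stage passed in as a callable, which keeps the detector swappable and lets the flow be tested without a display. `run_once` and `run_with_pyautogui` are hypothetical names, and the detector is assumed to return a list of normalized `[x1, y1, x2, y2]` boxes; a real detector (YOLO, OpenCV template matching, or a VLM) would have to be adapted to that shape.

```python
from typing import Callable, List, Optional, Sequence, Tuple

BBox = Sequence[float]  # assumed format: normalized [x1, y1, x2, y2]


def bbox_center(bbox: BBox, screen_w: int, screen_h: int) -> Tuple[int, int]:
    """Center of a normalized bbox, in screen pixels."""
    x1, y1, x2, y2 = bbox
    return int((x1 + x2) / 2 * screen_w), int((y1 + y2) / 2 * screen_h)


def run_once(
    capture: Callable[[], object],
    detect: Callable[[object], List[BBox]],
    screen_size: Callable[[], Tuple[int, int]],
    click: Callable[[int, int], None],
) -> Optional[Tuple[int, int]]:
    """Screenshot -> detect -> convert -> click. Returns the clicked point or None."""
    image = capture()
    boxes = detect(image)
    if not boxes:
        return None  # nothing detected, so do not click
    w, h = screen_size()
    cx, cy = bbox_center(boxes[0], w, h)
    click(cx, cy)
    return cx, cy


def run_with_pyautogui(detect: Callable[[object], List[BBox]]):
    """Wire the pipeline to pyautogui (imported lazily because it needs a display)."""
    import pyautogui

    pyautogui.FAILSAFE = True
    return run_once(pyautogui.screenshot, detect, pyautogui.size, pyautogui.click)
```

Passing the stages as arguments is a design choice for this sketch: the same `run_once` works with a stub detector during testing and with a real model in production.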

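Part 2. Models (recommendations Ollama gave me)

I. Recommended VLMs available in Ollama (suited to mouse automation)

Prefer models that can output coordinates / bboxes; they are the best fit for OmniParser-style visual clicking.

1. Qwen3-VL / Qwen2.5-VL (strongest UI grounding)

Features: built-in grounding that returns coordinates / bboxes directly, so the output maps straight to a clickable position. The coordinate convention can differ between model versions, so check whether a given model returns normalized or pixel values before clicking.
Versions: qwen3-vl:8b (16GB VRAM), qwen2.5vl:7b (8GB VRAM)
Pull:

ollama pull qwen3-vl:8b
# or the lighter option
ollama pull qwen2.5vl:7b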
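Since the click code in this write-up assumes boxes normalized to the 0 to 1 range, it helps to convert whatever a model returns into that form first. The sketch below does so for three conventions. The convention labels (`"unit"`, `"thousand"`, `"pixel"`) and the function name `to_unit_bbox` are hypothetical; which convention a particular model uses is an assumption to verify against that model's own documentation.

```python
def to_unit_bbox(bbox, convention, image_size=None):
    """Convert [x1, y1, x2, y2] to normalized 0..1 values.

    convention:
      "unit"     -> values are already in 0..1
      "thousand" -> values are on a 0..1000 grid
      "pixel"    -> values are image pixels; image_size=(width, height) is required
    """
    x1, y1, x2, y2 = (float(v) for v in bbox)
    if convention == "unit":
        sx = sy = 1.0
    elif convention == "thousand":
        sx = sy = 1000.0
    elif convention == "pixel":
        if image_size is None:
            raise ValueError("image_size is required for the pixel convention")
        sx, sy = (float(v) for v in image_size)
    else:
        raise ValueError(f"unknown convention: {convention!r}")
    unit = [x1 / sx, y1 / sy, x2 / sx, y2 / sy]
    # Clamp so a slightly out-of-range box cannot produce an off-screen click
    return [min(max(v, 0.0), 1.0) for v in unit]
```

Making the convention an explicit argument avoids guessing from the numbers alone, which is ambiguous for small images where pixel values also fall below 1000.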

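2. Llama 3.2 Vision (official Meta model)

Features: strong general vision, well suited to screenshot understanding; coordinate output needs prompt guidance
Versions: llama3.2-vision:11b (16GB+), llama3.2-vision:90b (32GB+)
Pull:

ollama pull llama3.2-vision:11b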

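3. LLaVA (classic lightweight model)

Features: open source and lightweight, good for getting started; coordinate output is comparatively weak
Versions: llava:7b, llava-llama3:8b
Pull:

ollama pull llava:7b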

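4. Moondream2 (ultra-light, for low-end machines)

Features: 1.8B parameters, runs in 8GB of RAM, good for quick tests
Pull (the Ollama library publishes this model under the name moondream):

ollama pull moondream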

II. Calling an Ollama VLM from Python + mouse click (complete example)

1. Install dependencies

pip install ollama pyautogui mss pillow

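2. Core code (screenshot → recognize → click)

import json

import mss
import mss.tools
import ollama
import pyautogui

# 1. Take a screenshot (mss is faster)
def capture_screen():
    with mss.mss() as sct:
        monitor = sct.monitors[1]  # primary monitor
        img = sct.grab(monitor)
        img_path = "screen.png"
        mss.tools.to_png(img.rgb, img.size, output=img_path)
    return img_path

# 2. Ask the Ollama VLM for coordinates (key point: the prompt must explicitly request JSON + bbox)
def get_click_coords(img_path, prompt='Find the submit button and return its normalized bbox as JSON: {"bbox": [x1, y1, x2, y2]}'):
    response = ollama.chat(
        model="qwen3-vl:8b",
        format="json",  # ask Ollama to constrain the reply to valid JSON
        messages=[{
            "role": "user",
            "content": prompt,
            "images": [img_path]
        }]
    )
    # Parse the JSON reply
    try:
        res = json.loads(response['message']['content'])
        bbox = res['bbox']  # expected to be normalized to [0, 1]
        return bbox
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

# 3. Convert the bbox to screen pixels and click
def click_by_bbox(bbox):
    w, h = pyautogui.size()
    x1, y1, x2, y2 = bbox
    # Refuse to click if the model did not return normalized values
    if not all(0.0 <= v <= 1.0 for v in (x1, y1, x2, y2)):
        print(f"bbox is not normalized, skipping: {bbox}")
        return
    cx = int((x1 + x2) / 2 * w)
    cy = int((y1 + y2) / 2 * h)
    pyautogui.moveTo(cx, cy, duration=0.2)
    pyautogui.click()
    print(f"Clicked: ({cx}, {cy})")

# 4. Main flow
if __name__ == "__main__":
    pyautogui.FAILSAFE = True
    img = capture_screen()
    bbox = get_click_coords(img)
    if bbox:
        click_by_bbox(bbox)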
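Model replies are not always bare JSON: they may wrap the object in a Markdown code fence or add a sentence around it. The sketch below is one way to pull a bbox out of such a reply and reject malformed ones before clicking. `extract_bbox` is a hypothetical helper, and the `{"bbox": [...]}` shape is an assumption taken from the prompt used in this write-up.

```python
import json
import re


def extract_bbox(reply: str):
    """Return [x1, y1, x2, y2] as floats, or None if the reply has no usable bbox."""
    # Take everything from the first "{" to the last "}", which skips code fences
    # and any text the model added around the JSON object.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    bbox = data.get("bbox") if isinstance(data, dict) else None
    if not isinstance(bbox, list) or len(bbox) != 4:
        return None
    try:
        x1, y1, x2, y2 = (float(v) for v in bbox)
    except (TypeError, ValueError):
        return None
    # A valid box has its top-left corner strictly before its bottom-right corner
    if x1 >= x2 or y1 >= y2:
        return None
    return [x1, y1, x2, y2]
```

This could stand in for the `json.loads` call in `get_click_coords`, for example `bbox = extract_bbox(response['message']['content'])`.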

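III. Key prompt (getting the model to output coordinates)

The prompt must explicitly ask for a normalized bbox in JSON, for example:

Analyze this screenshot, find the [login button], and return its normalized bbox [x1,y1,x2,y2]. Output JSON only, with no other text.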

IV. Recommended hardware (VRAM / RAM)

Qwen3-VL:8b / Llama3.2-Vision:11b → 16GB VRAM
Qwen2.5-VL:7b / LLaVA:7b → 8GB VRAM
Moondream2 → 8GB RAM (runs on CPU)

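Part 3. Memory usage

Actual RAM / VRAM usage of mainstream VLMs, ordered by how practical they are for automation, so you can pick one at a glance.

VLM VRAM / RAM usage (measured)

Terms: VRAM is GPU memory, RAM is system memory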

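1. Lightest (runs reliably on 8GB VRAM)

moondream (Moondream2)
- VRAM: ~2.5GB
- RAM: ~1GB
- Best for: low-end machines, quick tests
llava:7b / llava-llama3:8b
- VRAM: ~5-6GB
- RAM: ~2-3GB
qwen2.5vl:7b
- VRAM: ~6-7GB
- RAM: ~2-3GB
- Highlight: strongest UI recognition and coordinate output in this tier; first choice for automation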

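2. Mid-range (≥10GB VRAM recommended)

llama3.2-vision:11b
- VRAM: ~9-10GB
- RAM: ~3-4GB
- Strong general understanding; coordinates depend on prompting
qwen3-vl:8b
- VRAM: ~8-9GB
- RAM: ~3GB
- Stronger than the 7B model, with more accurate UI grounding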

3. Large models (only worth considering with well over 16GB VRAM)

llama3.2-vision:90b
- VRAM: ~35GB+
- Too heavy to be practical for automation

Bottom line (best picks for mouse automation)

GPU with ≤8GB VRAM → use qwen2.5vl:7b or moondream
GPU with ≥10GB VRAM → use qwen3-vl:8b (strongest UI grounding)
Lowest resource use → moondream
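The selection rule above can be written as a small helper so a script chooses its model from the available VRAM. This is a sketch under stated assumptions: `pick_model` is a hypothetical name, the 8GB and 10GB thresholds come from the tiers listed in this write-up, and the tags are written as the Ollama library is understood to name them, so confirm them on the library page before pulling.

```python
def pick_model(vram_gb: float) -> str:
    """Pick an Ollama model tag from available GPU memory, following the tiers above."""
    if vram_gb >= 10:
        return "qwen3-vl:8b"   # strongest UI grounding in this list
    if vram_gb >= 8:
        return "qwen2.5vl:7b"  # fits the 8GB tier
    return "moondream"         # smallest footprint, also runs on CPU


for vram in (4, 8, 12):
    print(vram, "GB ->", pick_model(vram))
```

The returned tag can be passed straight to the `model` argument of `ollama.chat`.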