Locating and Dragging Popup Images in a Python Crawler Using Text Detection

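I. Core Technical Principles

(1) Choosing a Text Detection Technology

The text inside a popup image is the key marker for locating the interactive region, so optical character recognition (OCR) is needed to extract that text together with its position. Tesseract-OCR is an open-source OCR engine that supports many languages and can be retrained to improve accuracy; combined with the Python pytesseract wrapper, it makes text detection on an image quick to implement. OpenCV handles image preprocessing (denoising, binarization) to raise OCR accuracy.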

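(2) Popup Locating Logic

A web popup is essentially a dynamically rendered DOM element, so driving a real browser with Selenium makes it possible to capture the HTML node behind the popup. The text detection result is then tied to it: the target text recognized by OCR (for example "拖动滑块", i.e. "drag the slider") is mapped to coordinates inside the popup image, which gives the slider's start position and the target region (usually the gap that the text hint refers to). These OCR coordinates are pixel positions relative to the top-left corner of the captured image, so they have to be converted before they are used as offsets on the page.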
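
The coordinate mapping described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the image is assumed to be a screenshot of the popup element itself, `element_left` and `element_top` are assumed to come from the element's page position (for instance Selenium's `element.location`), and `device_pixel_ratio` is assumed to be read from `window.devicePixelRatio`. The function name and its parameters are hypothetical, not part of any library.

```python
def image_to_page_coords(ocr_x, ocr_y, element_left, element_top, device_pixel_ratio=1.0):
    """Map a point given in screenshot pixels to page (CSS) pixels.

    ocr_x, ocr_y       -- point inside the element screenshot, e.g. an OCR box center
    element_left/top   -- top-left corner of the element on the page, in CSS pixels
    device_pixel_ratio -- screenshot pixels per CSS pixel (2.0 on many HiDPI screens)
    """
    if device_pixel_ratio <= 0:
        raise ValueError("device_pixel_ratio must be positive")
    # Screenshots are taken in device pixels, so scale down to CSS pixels first
    css_x = ocr_x / device_pixel_ratio
    css_y = ocr_y / device_pixel_ratio
    # Then shift by the element's own position on the page
    return element_left + css_x, element_top + css_y


# Hypothetical numbers: an OCR box centered at (300, 80) in a 2x screenshot
page_x, page_y = image_to_page_coords(300, 80, element_left=100, element_top=50,
                                      device_pixel_ratio=2.0)
print(page_x, page_y)
```

The same scaling applies to distances: a horizontal gap measured in the screenshot should be divided by the pixel ratio before it is used as a mouse offset.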

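(3) Slider Dragging

Slider dragging has to imitate a human motion trajectory so that anti-crawling checks do not flag it as machine behavior. The core idea is to generate a non-linear track (accelerate, hold a steady speed, decelerate) and to replay it with Selenium's ActionChains class as one continuous press, move, and release sequence, while controlling the timing and step size of each move to resemble real user interaction.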
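
As a rough illustration of such a track, the sketch below samples an ease-in-out curve (smoothstep) so that steps are small at both ends and largest in the middle. It is a swapped-in technique for illustration, not the article's own method; `smoothstep`, `build_track`, and the default of 25 steps are assumptions, while the jitter range (±2 px) and pause range (0.01 to 0.03 s) mirror the values used elsewhere in this article.

```python
import random


def smoothstep(t):
    """Ease-in-out curve: 0 at t=0, 1 at t=1, with zero slope at both ends."""
    return t * t * (3 - 2 * t)


def build_track(distance, steps=25, seed=None):
    """Return a list of (dx, dy, pause) tuples whose dx values sum to `distance`.

    distance -- total horizontal travel in whole pixels
    steps    -- number of mouse moves (hypothetical default)
    seed     -- optional seed so a track can be reproduced while debugging
    """
    distance = int(distance)
    if distance <= 0 or steps <= 0:
        raise ValueError("distance and steps must be positive")
    rng = random.Random(seed)
    track = []
    previous = 0
    for i in range(1, steps + 1):
        # Absolute position along the curve, rounded to whole pixels
        position = round(distance * smoothstep(i / steps))
        dx = position - previous
        previous = position
        dy = rng.randint(-2, 2)          # small vertical jitter
        pause = rng.uniform(0.01, 0.03)  # random delay before the next move
        track.append((dx, dy, pause))
    return track


track = build_track(200, seed=1)
print([dx for dx, _, _ in track])
```

Because each step is the difference between consecutive rounded positions, the horizontal offsets always add up to the requested distance without a final correction step.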

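II. Preparing the Development Environment

(1) Installing Dependencies

Install the following Python libraries, used for browser automation, image processing, OCR, and track generation respectively:

bash

# Browser automation
pip install selenium
# Image processing
pip install opencv-python
# OCR wrapper
pip install pytesseract
# Image handling helper
pip install pillow
# Numerical computing (used for track generation)
pip install numpy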

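(2) Environment Configuration

Installing Tesseract-OCR: download the installer from the official site and add the install directory to the system PATH (on Windows, for example, C:\Program Files\Tesseract-OCR), then confirm that running tesseract --version on the command line prints the version. The code below recognizes Simplified Chinese, so the chi_sim language data must be installed as well.
Configuring the browser driver: download the ChromeDriver release that matches the local Chrome version and place it in the script directory or in a directory on the system PATH. Selenium 4.6 and later ship with Selenium Manager, which can usually fetch a matching driver automatically.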
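
Before running the crawler, the setup above can be checked with a few lines of standard-library code. This is a minimal sketch: `find_tools` is a hypothetical helper, and it only confirms that the executables are discoverable on PATH, not that their versions match. A missing chromedriver is not necessarily fatal with recent Selenium releases, which can download a driver themselves.

```python
import shutil


def find_tools(names=("tesseract", "chromedriver")):
    """Return {tool name: absolute path, or None if not found on PATH}."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```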

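III. Complete Implementation

(1) Breaking Down the Steps

Page access and popup trigger: open the target page with Selenium and trigger the popup verification (for example by clicking a button or scrolling the page).
Popup image capture: locate the popup image element, take a screenshot, and save it as a local file.
Image preprocessing and text detection: improve image quality with OpenCV, then extract text and coordinates with Tesseract-OCR.
Target region locating: use the recognized key text (such as "缺口" for "gap" or "验证" for "verify") to determine the slider's start position and target position.
Simulated slider drag: generate a human-like track and execute the drag with ActionChains.
Result check: determine whether verification succeeded, and retry on failure.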

(2) Core Code

1. Initializing the browser and basic configuration

python

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import cv2
import pytesseract
from PIL import Image
import numpy as np
import time

# Path to the Tesseract executable (needed on Windows; on Linux/macOS remove
# this line if tesseract is already on PATH)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

class SlideVerificationCrawler:
    def __init__(self, url):
        # Initialize Chrome (headless mode is optional)
        self.options = webdriver.ChromeOptions()
        # Drop the switches that advertise browser automation
        self.options.add_experimental_option("excludeSwitches", ["enable-automation"])
        self.options.add_experimental_option('useAutomationExtension', False)
        self.driver = webdriver.Chrome(options=self.options)
        # Override navigator.webdriver before any page script runs
        self.driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
            Object.defineProperty(navigator, 'webdriver', {
              get: () => undefined
            })
          """
        })
        self.url = url
        self.wait = WebDriverWait(self.driver, 10)

    def open_page(self):
        """Open the target page."""
        self.driver.get(self.url)
        self.driver.maximize_window()
        # Wait for the page to load (adjust the condition to the actual page)
        self.wait.until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
        print("Page loaded")

2. Capturing and preprocessing the popup image

python

    def capture_popup_image(self, popup_img_xpath):
        """Capture the popup image and save it to a local file."""
        # Wait until the popup image is visible, not merely present in the DOM
        popup_img = self.wait.until(EC.visibility_of_element_located((By.XPATH, popup_img_xpath)))
        popup_img_path = 'popup_image.png'
        # Screenshot the element directly. Cropping a full-window screenshot by
        # element.location/size goes wrong when the page is scrolled or when the
        # display scale (devicePixelRatio) is not 1.
        popup_img.screenshot(popup_img_path)
        print(f"Popup image saved to {popup_img_path}")
        return popup_img_path

    def preprocess_image(self, img_path):
        """Preprocess the image (denoise, binarize) to improve OCR accuracy."""
        img = cv2.imread(img_path)
        if img is None:
            # cv2.imread returns None instead of raising when the file cannot be read
            raise FileNotFoundError(f"Cannot read image: {img_path}")
        # Convert to grayscale
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Gaussian blur to reduce noise
        blur = cv2.GaussianBlur(gray, (3, 3), 0)
        # Adaptive binarization. THRESH_BINARY_INV inverts the image, which suits
        # light text on a dark background; use cv2.THRESH_BINARY if the text is
        # already dark on a light background.
        thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                       cv2.THRESH_BINARY_INV, 11, 2)
        # Save the preprocessed image
        preprocessed_img_path = 'preprocessed_image.png'
        cv2.imwrite(preprocessed_img_path, thresh)
        return preprocessed_img_path

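3. Detecting text and locating the target position

python

    def detect_text(self, img_path):
        """Run OCR and return each recognized text fragment with its bounding box."""
        img = Image.open(img_path)
        # --oem 3: default engine mode; --psm 6: treat the image as one uniform text block
        custom_config = r'--oem 3 --psm 6'
        # image_to_data already returns bounding boxes. The language is passed via
        # `lang` (requires the chi_sim and eng language data to be installed).
        result = pytesseract.image_to_data(img, lang='chi_sim+eng', config=custom_config,
                                           output_type=pytesseract.Output.DICT)
        # Keep only non-empty text entries
        text_info = []
        for i in range(len(result['text'])):
            text = result['text'][i].strip()
            if text:
                # Bounding box of the text region (left, top, width, height)
                left = result['left'][i]
                top = result['top'][i]
                width = result['width'][i]
                height = result['height'][i]
                text_info.append({
                    'text': text,
                    'left': left,
                    'top': top,
                    'width': width,
                    'height': height,
                    'center_x': left + width // 2,
                    'center_y': top + height // 2
                })
        print(f"Recognized text: {[info['text'] for info in text_info]}")
        return text_info

    def locate_target_area(self, text_info):
        """Locate the slider start position and target position from the OCR text."""
        start_pos = None
        target_pos = None
        # Entries are word-level, so a phrase may be split across several entries;
        # if a keyword is never matched, merge the entries of each line first.
        for info in text_info:
            # Assumes "拖动滑块" ("drag the slider") marks the slider start position
            # (adjust the keyword to the actual page)
            if '拖动滑块' in info['text']:
                start_pos = (info['center_x'], info['center_y'])
            # Assumes "缺口" ("gap") marks the target position
            # (adjust the keyword to the actual page)
            elif '缺口' in info['text']:
                target_pos = (info['center_x'], info['center_y'])
        if start_pos is None or target_pos is None:
            raise RuntimeError("Key text not recognized; cannot locate the target region")
        print(f"Slider start position: {start_pos}, target position: {target_pos}")
        return start_pos, target_pos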
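
One practical wrinkle: `pytesseract.image_to_data` reports one entry per recognized word, so a phrase such as "拖动滑块" may arrive as several separate entries and never match as a whole. A possible workaround, sketched below under that assumption, is to merge entries that share the same `block_num`, `par_num`, and `line_num` (keys present in the dictionary output) before matching keywords. The helper name `group_words_into_lines` and the sample data are hypothetical.

```python
def group_words_into_lines(data, joiner=""):
    """Merge word-level OCR entries into one entry per text line.

    data   -- dict in the shape returned by image_to_data(..., output_type=Output.DICT)
    joiner -- "" suits Chinese; use " " for space-separated languages
    """
    lines = {}
    for i, raw in enumerate(data["text"]):
        text = raw.strip()
        if not text:
            continue  # skip structural entries (page, block, line) with no text
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        left, top = data["left"][i], data["top"][i]
        right, bottom = left + data["width"][i], top + data["height"][i]
        if key not in lines:
            lines[key] = {"text": text, "left": left, "top": top,
                          "right": right, "bottom": bottom}
        else:
            line = lines[key]
            line["text"] += joiner + text
            # Grow the bounding box so it covers every word on the line
            line["left"] = min(line["left"], left)
            line["top"] = min(line["top"], top)
            line["right"] = max(line["right"], right)
            line["bottom"] = max(line["bottom"], bottom)
    merged = []
    for line in lines.values():
        merged.append({
            "text": line["text"],
            "left": line["left"],
            "top": line["top"],
            "width": line["right"] - line["left"],
            "height": line["bottom"] - line["top"],
            "center_x": (line["left"] + line["right"]) // 2,
            "center_y": (line["top"] + line["bottom"]) // 2,
        })
    return merged


# Made-up sample: the first entry is a structural row without text
sample = {
    "text": ["", "拖动", "滑块", "缺口"],
    "block_num": [1, 1, 1, 1],
    "par_num": [1, 1, 1, 1],
    "line_num": [1, 1, 1, 2],
    "left": [0, 10, 50, 200],
    "top": [0, 20, 20, 60],
    "width": [300, 40, 40, 40],
    "height": [100, 20, 20, 20],
}
for line in group_words_into_lines(sample):
    print(line["text"], line["center_x"], line["center_y"])
```

The merged entries use the same keys as the per-word entries, so they can be passed to a keyword-matching routine unchanged.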

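4. Generating and executing the drag track

python

    def generate_track(self, distance):
        """Generate a human-like, non-linear drag track (integer steps summing to distance)."""
        distance = int(round(distance))
        track = []
        current = 0
        v = 0.0  # current step size in pixels
        accelerate_end = distance * 0.3  # acceleration phase: first 30% of the distance
        uniform_end = distance * 0.8     # steady phase: next 50% of the distance
        while current < distance:
            if current < accelerate_end:
                # Speed up by a random amount each step
                v += np.random.uniform(1.5, 2.5)
            elif current < uniform_end:
                # Hold roughly the same speed with slight variation
                v = max(2.0, v + np.random.uniform(-0.5, 0.5))
            else:
                # Deceleration phase: last 20% of the distance
                v = max(1.0, v - np.random.uniform(0.5, 1.0))
            # At least 1 px, never past the target, so the steps sum exactly to distance
            step = max(1, min(int(round(v)), distance - current))
            track.append(step)
            current += step
        return track

    def drag_slider(self, slider_xpath, start_pos, target_pos):
        """Drag the slider along a generated track."""
        # Locate the slider element and wait until it can be interacted with
        slider = self.wait.until(EC.element_to_be_clickable((By.XPATH, slider_xpath)))
        # Horizontal drag distance (vertical movement is assumed unnecessary).
        # Both positions are in screenshot pixels; divide by devicePixelRatio
        # if the display scale is not 1.
        distance = target_pos[0] - start_pos[0]
        if distance <= 0:
            raise ValueError("Target position is not to the right of the start position; distance is invalid")
        # Generate the drag track
        track = self.generate_track(distance)
        print(f"Drag track: {track}, total distance: {sum(track)}")
        # Press and hold on the slider element itself; start_pos is an image
        # coordinate and is not a valid offset relative to the slider element
        ActionChains(self.driver).click_and_hold(slider).perform()
        time.sleep(0.2)  # pause 0.2 s after pressing, as a human would
        # Move along the track; a fresh chain per step ensures that no
        # previously queued action is replayed
        for step in track:
            dy = int(np.random.randint(-2, 3))  # small random vertical jitter
            ActionChains(self.driver).move_by_offset(step, dy).perform()
            time.sleep(np.random.uniform(0.01, 0.03))  # random pause per step
        # Release the mouse button
        ActionChains(self.driver).release().perform()
        time.sleep(1)  # wait for the verification result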

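5. Main routine and verification flow

python

    def run(self, popup_img_xpath, slider_xpath, max_retries=3):
        """Run the full crawl-and-verify flow; return True on success."""
        try:
            # 1. Open the page
            self.open_page()
            # A bounded loop replaces unbounded recursion, so the browser is
            # closed exactly once, in the finally block
            for attempt in range(1, max_retries + 1):
                # 2. Trigger the popup (adjust to the actual page, e.g. click a button)
                # Example: self.driver.find_element(By.XPATH, '//button[@id="verify-btn"]').click()
                time.sleep(2)  # assumes the popup appears on its own; wait 2 seconds
                # 3. Capture the popup image
                popup_img_path = self.capture_popup_image(popup_img_xpath)
                # 4. Preprocess the image
                preprocessed_img_path = self.preprocess_image(popup_img_path)
                # 5. Detect text
                text_info = self.detect_text(preprocessed_img_path)
                # 6. Locate the target region
                start_pos, target_pos = self.locate_target_area(text_info)
                # 7. Drag the slider
                self.drag_slider(slider_xpath, start_pos, target_pos)
                # 8. Check the result (adjust the condition to the actual page);
                #    "验证成功" is the page's "verification succeeded" message
                if "验证成功" in self.driver.page_source:
                    print("Slider verification succeeded!")
                    return True
                print(f"Verification failed (attempt {attempt}/{max_retries}), retrying...")
            return False
        except Exception as e:
            print(f"Execution failed: {e}")
            return False
        finally:
            # Keep the browser open briefly for debugging (remove in production)
            time.sleep(10)
            self.driver.quit()

# Example: run the crawler (adjust the XPaths to the actual page)
if __name__ == "__main__":
    target_url = "https://example.com/verify-page"  # target page URL
    # XPath of the popup image (inspect the actual page with F12)
    popup_img_xpath = '//div[@class="popup-img"]/img'
    # XPath of the slider element (inspect the actual page with F12)
    slider_xpath = '//div[@class="slider-block"]'
    crawler = SlideVerificationCrawler(target_url)
    crawler.run(popup_img_xpath, slider_xpath)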

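IV. Key Optimizations and Caveats

(1) Improving OCR Accuracy

Tune the OpenCV preprocessing parameters (blur kernel size, binarization threshold) to the characteristics of the popup image in order to reduce background interference.
If recognition accuracy is low, use a Tesseract training tool (such as jTessBoxEditor) to train a dedicated character set for the text in the target page's popup.
Combine this with fuzzy keyword matching (for example the in operator or regular expressions) to make key-text recognition more tolerant of errors.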
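
The fuzzy matching mentioned in the last point can go one step beyond `in` and regular expressions by tolerating a misread character. The sketch below uses the standard library's `difflib.SequenceMatcher` over a sliding window; the helper names and the 0.7 threshold are assumptions chosen for illustration and should be tuned against real OCR output.

```python
import difflib
import re


def normalize(text):
    """Remove whitespace and punctuation that OCR tends to insert between characters."""
    return re.sub(r"[\s\W_]+", "", text)


def fuzzy_contains(ocr_text, keyword, threshold=0.7):
    """Return True if some stretch of ocr_text is similar enough to keyword."""
    haystack, needle = normalize(ocr_text), normalize(keyword)
    if not needle:
        return False
    if needle in haystack:
        return True  # exact match after normalization
    width = len(needle)
    # Compare the keyword against every window of the same length
    for start in range(max(1, len(haystack) - width + 1)):
        window = haystack[start:start + width]
        if difflib.SequenceMatcher(None, window, needle).ratio() >= threshold:
            return True
    return False


print(fuzzy_contains("请 拖动 滑块 完成验证", "拖动滑块"))  # stray spaces
print(fuzzy_contains("请拖动滑坎完成验证", "拖动滑块"))      # one misread character
print(fuzzy_contains("验证成功", "拖动滑块"))                # unrelated text
```

A lower threshold accepts more OCR noise but also raises the risk of matching the wrong phrase, so short keywords generally need a stricter value.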

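(2) Evading Anti-Crawling Mechanisms

Hide Selenium's webdriver flag by overriding the navigator.webdriver property through a CDP command, so that the page is less likely to detect the crawler.
Add random jitter (±2 pixels vertically) and random intervals (0.01 to 0.03 seconds) to the drag track to mimic the unpredictability of human movement.
Avoid moving too fast: add reasonable delays around pressing, moving, and releasing the mouse (for example a 0.2-second pause after pressing).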

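(3) Adapting to Different Pages

Popup text and element XPaths vary widely between pages, so inspect the actual DOM structure with the F12 developer tools and adjust the keywords (such as "拖动滑块" and "缺口") and the XPath expressions accordingly.
If the popup image is loaded dynamically (for example, it changes on every refresh), make sure it has finished loading before capturing it; one way is to use WebDriverWait to wait until the image's naturalWidth property is greater than 0.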
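
A minimal sketch of that wait follows. `image_is_loaded` and `wait_for_image` are hypothetical helper names; the only Selenium calls used are `driver.execute_script` and `WebDriverWait(...).until(...)`. Checking `complete` alongside `naturalWidth` is an added assumption that guards against an image that is still downloading.

```python
def image_is_loaded(driver, img_element):
    """Return True once the browser reports the <img> as fully decoded."""
    script = "return arguments[0].complete && arguments[0].naturalWidth > 0;"
    return bool(driver.execute_script(script, img_element))


def wait_for_image(driver, img_element, timeout=10):
    """Block until the image has loaded; WebDriverWait raises TimeoutException otherwise."""
    # Imported here so the predicate above can be exercised without Selenium installed
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, timeout).until(lambda d: image_is_loaded(d, img_element))
    return img_element
```

In the crawler above, a call such as `wait_for_image(self.driver, popup_img)` would fit just before the screenshot is taken.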