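A Complete Guide to Bypassing Bot Detection in Selenium Automated Testing: From Principles to Practice

In web automation we often run into a site's bot detection. These mechanisms are designed to tell human users apart from automated programs, which also makes them an obstacle for Selenium tasks. This article looks at how the detection works and how to configure Selenium so that your automation scripts are less likely to be flagged.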

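How Bot Detection Works

Before solving the problem, we need to understand how websites detect bots. Modern sites mainly rely on the following signals:

Browser fingerprinting: every browser exposes a characteristic fingerprint, including the User-Agent, WebGL rendering output, font list, and time zone. Automated programs tend to present the same fixed fingerprint, which makes them easy to identify.
WebDriver feature detection: Selenium and similar tools expose WebDriver-specific traits, such as the navigator.webdriver property, which is one of the main bot indicators.
Behavioral analysis: human browsing is irregular in scroll speed, click position, and dwell time. Automated programs tend to behave too regularly, which makes them easy to detect.
Environment anomaly detection: an automated environment may lack features or properties that a real browser has, such as media device access or particular browser extensions.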

With these detection mechanisms in mind, we can design targeted countermeasures.
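To make the signals above concrete, here is a minimal sketch of a probe that reads a few of them back from your own session. PROBE_JS and summarize_signals are hypothetical names introduced for illustration, and the checks are a small sample; real detectors combine many more signals.

```python
# JavaScript probe that collects a few of the properties discussed above.
# With a live session it could be run as: probe = driver.execute_script(PROBE_JS)
PROBE_JS = """
return {
    webdriver: navigator.webdriver,
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    pluginCount: navigator.plugins.length,
    hasChromeObject: typeof window.chrome !== 'undefined'
};
"""


def summarize_signals(probe):
    """Return a list of human-readable warnings for a probe result dict."""
    warnings = []
    if probe.get("webdriver"):
        warnings.append("navigator.webdriver is true")
    if "HeadlessChrome" in probe.get("userAgent", ""):
        warnings.append("User-Agent contains HeadlessChrome")
    if not probe.get("languages"):
        warnings.append("navigator.languages is empty")
    if probe.get("pluginCount", 0) == 0:
        warnings.append("no plugins reported")
    if not probe.get("hasChromeObject", False):
        warnings.append("window.chrome is missing")
    return warnings
```

An empty list only means that none of these particular checks fired; it does not prove that a site's own detection will pass.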

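A Complete Selenium Solution for Bypassing Bot Detection

Below is a complete Selenium script that combines several techniques to reduce the chance of being detected: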

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
import os


def open_website_with_anti_detection():
    try:
        # 1. Basic configuration: browser options
        chrome_options = Options()

        # Use a dedicated user data directory to keep the browser profile and login state
        user_data_dir = r"D:\python_project\anti_bot\UserData"
        os.makedirs(user_data_dir, exist_ok=True)
        chrome_options.add_argument(f"--user-data-dir={user_data_dir}")

        # 2. Core anti-detection settings: hide WebDriver traits
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")  # Hide the automation flag
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])  # Drop the automation switch
        chrome_options.add_experimental_option('useAutomationExtension', False)  # Disable the automation extension

        # 3. Simulate a real browser environment
        # Set a recent User-Agent (keep it in sync with the installed Chrome version)
        user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
        chrome_options.add_argument(f"--user-agent={user_agent}")

        # Path to the Chrome binary
        chrome_binary_path = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
        if os.path.exists(chrome_binary_path):
            chrome_options.binary_location = chrome_binary_path

        # 4. Browser environment tweaks
        chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration; the effect on detection varies by site
        chrome_options.add_argument("--disable-features=IsolateOrigins,site-per-process")  # Disable site isolation
        # Randomize the window size to mimic users on different devices
        chrome_options.add_argument(f"--window-size={random.randint(1366, 1920)},{random.randint(768, 1080)}")

        # 5. Driver configuration
        chrome_driver_path = r"D:\chromedriver\chromedriver.exe"
        service = Service(chrome_driver_path)

        # 6. Create the browser driver
        driver = webdriver.Chrome(service=service, options=chrome_options)

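        # 7. Inject JavaScript that runs before any page script to hide WebDriver traits; this is the key anti-detection step
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
                // Hide the WebDriver flag (a regular Chrome reports false)
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => false
                });
                // Make sure window.chrome exists, as it does in a regular Chrome
                window.chrome = window.chrome || { runtime: {} };
                // navigator.mediaDevices is read-only, so override enumerateDevices() on the existing object
                // and report an empty device list instead of failing without a camera or microphone
                if (navigator.mediaDevices) {
                    navigator.mediaDevices.enumerateDevices = () => Promise.resolve([]);
                }
            """
        })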

        # 8. Open the target site; a bot detection page is used as the example here
        driver.get("https://fingerprintjs.github.io/BotD/main/")
        print("Opened the bot detection page; check the result in the browser")

        # 9. Simulate human behavior: scrolling and pauses
        wait = WebDriverWait(driver, 10)
        for _ in range(3):
            scroll_height = driver.execute_script("return document.body.scrollHeight")
            # Scroll to a random position on the page
            driver.execute_script(f"window.scrollTo(0, {random.randint(0, scroll_height)})")
            # Random pause to mimic a human pace
            time.sleep(random.uniform(1, 3))

        # 10. Print the detection result shown on the page
        try:
            result_element = wait.until(EC.presence_of_element_located((By.ID, 'result')))
            print("Detection result:", result_element.text)
        except Exception:
            print("Could not find the detection result element")

        # Keep the window open so the result can be inspected manually
        input("Press Enter to close the browser...")

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if 'driver' in locals():
            driver.quit()
            print("Browser closed")


if __name__ == "__main__":
    open_website_with_anti_detection()

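Core Anti-Detection Techniques in Detail

1. Hiding WebDriver Traits

The WebDriver flag is the trait that most easily gives Selenium away. We hide it in the following ways:

--disable-blink-features=AutomationControlled: a key anti-detection argument. It is a Chrome (Blink) switch rather than a Selenium feature, and it turns off the AutomationControlled feature that makes navigator.webdriver report true.
Excluding the automation switch: chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"]) stops ChromeDriver from passing the --enable-automation switch when it launches the browser.
Injecting a JavaScript snippet: by overriding navigator.webdriver and adding the Chrome-specific objects that a regular browser exposes, the script imitates a real browser environment. This step matters because many sites check these properties directly to decide whether a visitor is automated.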
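The settings above can be grouped into one helper and then checked against a live session. This is a minimal sketch: apply_stealth_options and webdriver_flag_hidden are hypothetical helper names, and they accept any object that exposes the same methods as Selenium's Options and WebDriver, which also makes them easy to unit-test with stubs.

```python
def apply_stealth_options(options):
    """Apply the WebDriver-hiding settings to a ChromeOptions-like object."""
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    return options


def webdriver_flag_hidden(driver):
    """Return True if the current page does not report an automation flag.

    A regular Chrome reports False; an overridden getter may report None
    (JavaScript undefined). Both are treated as hidden here.
    """
    value = driver.execute_script("return navigator.webdriver")
    return value in (False, None)
```

A typical use would be to call apply_stealth_options(chrome_options) before creating the driver, load any page, and then assert webdriver_flag_hidden(driver) as a quick regression check after Chrome or ChromeDriver upgrades.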

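2. Simulating a Real Browser Fingerprint

The browser fingerprint is a major input to bot detection. We can make it look realistic in the following ways:

User-Agent: use a User-Agent from a current Chrome release, ideally one that matches the Chrome version you actually run, and avoid old versions (such as Chrome 91), since an outdated or mismatched UA is easy to flag as a bot.
Random window size: generate a different window size on each run instead of a fixed value. Real users visit from different devices, so their window sizes vary.
User data directory: use the --user-data-dir option to point the browser at a persistent profile directory. This keeps the browser fingerprint and login state across runs, which makes later visits look more like a returning user.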
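The first two points can be sketched as small helpers. This is an illustration under stated assumptions: build_user_agent, pick_window_size, and the COMMON_SIZES list are hypothetical, and the UA template assumes a 64-bit Windows desktop as in the script above.

```python
import random

# Common desktop resolutions; the list is illustrative, not exhaustive.
COMMON_SIZES = [(1366, 768), (1440, 900), (1536, 864), (1600, 900), (1920, 1080)]


def build_user_agent(browser_version):
    """Build a Windows Chrome User-Agent whose major version matches the browser.

    Recent Chrome versions report only the major version in the UA string
    (for example "Chrome/115.0.0.0"), so the remaining parts are zeroed.
    """
    major = browser_version.split(".")[0]
    return (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        f"(KHTML, like Gecko) Chrome/{major}.0.0.0 Safari/537.36"
    )


def pick_window_size(rng=random):
    """Pick one realistic resolution instead of two independent random numbers."""
    width, height = rng.choice(COMMON_SIZES)
    return f"--window-size={width},{height}"
```

With Selenium 4, the version of a running browser is available as driver.capabilities["browserVersion"], so one option is to start a throwaway session once, read the version, and pass build_user_agent(version) to later sessions. Picking from a list of real resolutions avoids unusual combinations such as 1367x1079 that two independent random numbers can produce.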

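3. Simulating Human Behavior

Behavior patterns are an important way to tell humans from bots. We can imitate a real user in the following ways:

Random scrolling: after the page has loaded, scroll to random positions to mimic a person reading the page.
Random delays: add random pauses between actions. Acting at a fixed frequency is a typical bot trait, so avoid it.
Explicit waits: use WebDriverWait to wait for elements to load. This keeps the script from acting before the page is ready, much as a person waits for a page to respond.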
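Random scrolling and random delays can be planned together so that the page moves downward in uneven steps, rather than jumping to unrelated positions. The sketch below is one possible approach; plan_scroll and run_scroll are hypothetical helpers, and the step fractions and pause range are arbitrary values chosen for illustration.

```python
import random


def plan_scroll(page_height, steps=5, rng=random):
    """Return a list of (y_offset, pause_seconds) pairs that scroll downward.

    Offsets never decrease and never exceed page_height; pauses are drawn
    uniformly from 0.5 to 2.0 seconds.
    """
    plan = []
    y = 0
    for _ in range(steps):
        # Advance by a random fraction of the remaining distance.
        remaining = page_height - y
        y += int(remaining * rng.uniform(0.2, 0.5))
        plan.append((y, rng.uniform(0.5, 2.0)))
    return plan


def run_scroll(driver, plan, sleep):
    """Execute a scroll plan; sleep is injected so tests can skip real waits."""
    for y, pause in plan:
        driver.execute_script("window.scrollTo(0, arguments[0]);", y)
        sleep(pause)
```

In a real run you would read the height with driver.execute_script("return document.body.scrollHeight") and call run_scroll(driver, plan_scroll(height), time.sleep).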

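4. Environment Tweaks

Disabling GPU acceleration: some anti-bot systems inspect GPU rendering characteristics, and the script passes --disable-gpu to keep them from being read. Whether this helps depends on the site, so compare detection results with and without it.
Disabling site isolation: --disable-features=IsolateOrigins,site-per-process turns off site isolation, with the aim of avoiding detection through environment anomalies. Note that this also relaxes a browser security feature.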

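Advanced Anti-Detection Techniques

1. Using Anti-Fingerprinting Browser Extensions

You can install anti-fingerprinting browser extensions to randomize the fingerprint further, for example:

Chameleon: randomizes the browser fingerprint, including the User-Agent, time zone, and language.
Random User-Agent: switches to a random User-Agent on each browsing session.

To install an extension in Selenium (check that the extension you pick is available for Chrome; the file name below is a placeholder):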

chrome_options.add_extension("chameleon.crx")

2. Configuring a Proxy IP

Using a proxy IP keeps frequent requests from a single IP from triggering anti-scraping measures:

chrome_options.add_argument("--proxy-server=http://127.0.0.1:8080")  # Replace with your actual proxy
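Because the proxy is passed as a launch argument, it is fixed for the lifetime of a browser session, so spreading requests across several proxies means starting a new session per proxy. A minimal sketch of round-robin selection follows; proxy_argument and proxy_cycle are hypothetical helpers and the addresses in the usage line are placeholders.

```python
import itertools


def proxy_argument(proxy_url):
    """Format a proxy URL as the Chrome command-line switch."""
    return f"--proxy-server={proxy_url}"


def proxy_cycle(proxy_urls):
    """Return an endless iterator of proxy switches in round-robin order."""
    if not proxy_urls:
        raise ValueError("proxy_urls must not be empty")
    return itertools.cycle([proxy_argument(url) for url in proxy_urls])
```

For example, with proxies = proxy_cycle(["http://10.0.0.1:8080", "http://10.0.0.2:8080"]), each new session would call chrome_options.add_argument(next(proxies)) before the driver is created.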

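3. Completing the Language Settings

A detection report that shows only a single language (such as en-US) can look suspicious. Add Chinese language support: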

chrome_options.add_argument("--lang=zh-CN")
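The --lang switch sets the browser's UI language. The sketch below additionally sets the preferred-language list, assuming that Chrome's intl.accept_languages preference is what feeds navigator.languages and the Accept-Language header; verify this on your Chrome version. apply_language is a hypothetical helper that works with any ChromeOptions-like object.

```python
def apply_language(options, languages):
    """Set the UI language and the preferred-language list.

    languages is ordered by preference, for example ["zh-CN", "zh", "en-US", "en"].
    Note that this call replaces any "prefs" option set earlier on the object.
    """
    if not languages:
        raise ValueError("languages must not be empty")
    options.add_argument(f"--lang={languages[0]}")
    options.add_experimental_option(
        "prefs", {"intl.accept_languages": ",".join(languages)}
    )
    return options
```

After starting the session, driver.execute_script("return navigator.languages") shows whether the list took effect.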

4. Simulating Media Devices

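# Inject JavaScript to simulate media devices
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        // navigator.mediaDevices is read-only, so override enumerateDevices() on the existing object
        if (navigator.mediaDevices) {
            navigator.mediaDevices.enumerateDevices = () => Promise.resolve([
                {kind: 'videoinput', label: 'Webcam', deviceId: '', groupId: ''},
                {kind: 'audioinput', label: 'Microphone', deviceId: '', groupId: ''}
            ]);
        }
    """
})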

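Verifying the Anti-Detection Setup

Several sites are useful for checking how well your anti-detection settings work:

FingerprintJS detection page: https://fingerprintjs.github.io/BotD/main/
- Ideally it should show: Bot: false, Detected bot kind: undefined
Sannysoft bot detection: https://bot.sannysoft.com/
- This site checks for bot traits along several dimensions and provides a detailed report.
AmIACrawler: https://www.amia-crawler.com/
- A site dedicated to detecting automated programs that provides a broad bot detection assessment.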
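Checking these pages by hand after every change gets tedious, so one option is to visit each page and save a screenshot for later comparison. The sketch below assumes a fixed settle time is enough for each page's checks to finish; capture_reports and the report_N.png naming are hypothetical.

```python
import os
import time


def capture_reports(driver, urls, out_dir, settle_seconds=5.0, sleep=time.sleep):
    """Visit each detection page and save a screenshot; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        driver.get(url)
        sleep(settle_seconds)  # give the page's checks time to finish
        path = os.path.join(out_dir, f"report_{index}.png")
        driver.save_screenshot(path)
        saved.append(path)
    return saved
```

For example, capture_reports(driver, ["https://fingerprintjs.github.io/BotD/main/", "https://bot.sannysoft.com/"], "reports") would leave two images in a reports directory that you can compare between runs.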

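Notes and Best Practices

Match the driver version: make sure the chromedriver version matches your Chrome browser version, otherwise the script may fail to run.
Update the User-Agent regularly: browsers are released frequently, so update the User-Agent periodically to stay in line with the current version.
Avoid excessive requests: even with anti-detection techniques in place, do not send an excessive number of requests to the target site, or you may trigger other anti-scraping measures.
Compliance first: when automating web interactions, make sure your actions comply with the target site's terms of use and with applicable laws and regulations.
Adjust your strategy over time: anti-bot techniques keep evolving, so test your automation scripts regularly and adjust the anti-detection strategy based on the results.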
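Two of the points above, avoiding excessive requests and respecting the site's rules, lend themselves to small guards in code. The following is a minimal sketch using only the standard library; allowed_by_robots and Throttle are hypothetical names, and a robots.txt check covers only part of what a site's terms of use may require.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to keep min_interval since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A script would create one Throttle per target site, call throttle.wait() before each driver.get(url), and skip any URL for which allowed_by_robots returns False.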

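Summary

Bypassing bot detection is a process of continuous tuning: as anti-bot techniques improve, anti-detection methods have to be updated as well. The solution presented here combines several techniques that together reduce the probability of being identified as a bot. The key point is to reproduce a real user's browser environment and behavior patterns, so that the automated program stays as close as possible to how a human operates.

With continued learning and practice, you can make your Selenium automation scripts harder to detect and more efficient, and be better prepared for the bot detection mechanisms you encounter.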