It's 2024 and you still don't know how to use Puppeteer?

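Preface

As everyone knows, data is a key link driving the whole business chain during development, and crawling and refreshing data with crawlers is a routine task. Many languages can be used for crawling, including Python, Java, Ruby, and Node.js, the home of today's protagonist: Puppeteer. It is an open-source tool built on Node.js and maintained by the Google Chrome team, mainly used to control and automate Google Chrome or other compatible browsers.

Without further ado, let's walk into the world of crawlers step by step, from the basics to the more advanced topics.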

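Introduction to Puppeteer

Puppeteer is a Node.js library developed by Google that provides an API for controlling headless Chrome or Chromium. It can simulate user actions in the browser, such as clicking, filling in forms, and taking screenshots, and it lets developers retrieve the HTML after the browser has rendered it. Its high-level API makes browser operations simple and reliable. The main capabilities are automated control, page manipulation, network request interception, page screenshots and PDF generation, and automated testing.

In short, Puppeteer is a powerful, easy-to-use, and flexible browser automation tool that helps developers carry out all kinds of browser operations and automation tasks.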

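Environment setup

Since v1.7.0, Puppeteer has shipped as two packages: puppeteer and puppeteer-core.

  • puppeteer: the full package, which downloads an executable Chromium browser on install. It is large (suited to local debugging).
  • puppeteer-core: does not download a browser, so it is small; you point it at a browser yourself and update that browser manually (suited to production deployment).

Supported Node version: >= v16.20.0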

npm i puppeteer      # local install; latest version at the time of writing: v21.7.0
npm i puppeteer -g   # or install globally

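Puppeteer's basic API

Using headless mode

Puppeteer launches in headless mode by default. You can turn this off with the headless option; for local debugging it is recommended to set headless: false so that you can watch the browser window.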

const browser = await puppeteer.launch(); // headless by default
// For local debugging, show the browser window instead:
// const browser = await puppeteer.launch({ headless: false });

Note that Chrome 112 introduced a new headless mode, which can be selected with a new value for the same option:

const browser = await puppeteer.launch({headless: 'new'});
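
One way to switch between a visible window and headless mode without editing the code is to derive the option from an environment variable. This is only a sketch: the variable name SHOW_BROWSER and the helper headlessOption are hypothetical and are not something Puppeteer itself reads.

```javascript
// Hypothetical helper: choose the `headless` launch option from the environment.
// SHOW_BROWSER is an assumed variable name, not a Puppeteer setting.
function headlessOption(env = process.env) {
  // "1" or "true" means: open a visible window for local debugging
  const showBrowser = env.SHOW_BROWSER === "1" || env.SHOW_BROWSER === "true";
  return showBrowser ? false : "new";
}

// Usage (inside an async function, with puppeteer installed):
//   const browser = await puppeteer.launch({ headless: headlessOption() });
```

Running the script with SHOW_BROWSER=1 would then open a window, while a plain run stays headless.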

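Using puppeteer-core

When deploying to production with puppeteer-core, pay attention to the version. In my experience v16.2.0 works without problems, while the latest v21.7.0 ran into some issues when deployed online.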

// const puppeteer = require("puppeteer");
const puppeteer = require("puppeteer-core");

// Wrapped in a function so that `await` and `return` are valid
async function createBrowser() {
  const browser = await puppeteer.launch({
    // executablePath: "/usr/bin/google-chrome", // production
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // local path
  });
  return browser;
}

To find the value for executablePath, open chrome://version/ in Chrome and look at the "Executable Path" entry.
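
Since the path differs between a local machine and the server, it can help to resolve it in one place. The sketch below reuses the two paths from this article; the CHROME_PATH override and the helper name resolveExecutablePath are assumptions made for illustration.

```javascript
// Sketch: pick a Chrome executable path per platform for puppeteer-core.
// The two defaults are the paths used in this article; CHROME_PATH is an assumed override.
const DEFAULT_CHROME_PATHS = {
  darwin: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  linux: "/usr/bin/google-chrome",
};

function resolveExecutablePath(platform = process.platform, env = process.env) {
  if (env.CHROME_PATH) return env.CHROME_PATH; // an explicit override wins
  const found = DEFAULT_CHROME_PATHS[platform];
  if (!found) {
    throw new Error(`No default Chrome path for "${platform}"; set CHROME_PATH`);
  }
  return found;
}

// Usage (inside an async function, with puppeteer-core installed):
//   const browser = await puppeteer.launch({ executablePath: resolveExecutablePath() });
```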

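Setting other command-line arguments for the browser instance

You can pass args to apply command-line switches to the browser instance being launched. See the list of Chromium command-line switches for details.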

const puppeteer = require("puppeteer-core");

async function createBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // local path
    args: [
      "--no-sandbox", // disable the sandbox
      "--disable-setuid-sandbox", // disable the setuid sandbox (Linux only)
      "--disable-extensions", // disable extensions
      "--incognito", // run in incognito mode
      "--disable-gpu", // disable GPU hardware acceleration
      "--no-zygote", // disable the zygote process used to fork child processes
    ],
  });
  return browser;
}
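
Because some switches only matter on certain platforms, the list can be assembled by a small function instead of being hard-coded. A minimal sketch, where buildChromeArgs is a hypothetical helper and the chosen defaults are just the flags discussed above. Keep in mind that disabling the sandbox weakens isolation, so it is best reserved for environments you control.

```javascript
// Sketch: build the `args` array per platform and drop duplicates.
function buildChromeArgs(platform = process.platform, extra = []) {
  const args = ["--no-sandbox", "--disable-extensions", "--disable-gpu"];
  if (platform === "linux") {
    args.push("--disable-setuid-sandbox"); // only meaningful on Linux
  }
  // A Set keeps the first occurrence of each flag and preserves order
  return [...new Set([...args, ...extra])];
}

// Usage: puppeteer.launch({ args: buildChromeArgs(process.platform, ["--incognito"]) })
```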

Setting the browser viewport resolution

Use defaultViewport to set the default viewport resolution for desktop pages.

const puppeteer = require("puppeteer-core");

async function createBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // local path
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
  });
  return browser;
}

Emulating a mobile device

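const puppeteer = require("puppeteer");
// Recent Puppeteer versions export the device list as KnownDevices;
// older releases exposed it as puppeteer.devices.
const { KnownDevices } = require("puppeteer");
const iPhone = KnownDevices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage(); // a page is needed before emulating a device
  await page.emulate(iPhone);
  await browser.close();
})();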
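
If the device you need is not in the built-in list, page.emulate also accepts a hand-written descriptor consisting of a userAgent and a viewport. The sketch below shows that shape; makeDevice is a hypothetical helper, and the user-agent string and dimensions are placeholder values.

```javascript
// Sketch: a hand-written device descriptor with the same shape as a built-in device entry.
function makeDevice({ userAgent, width, height, scale = 2 }) {
  return {
    userAgent,
    viewport: {
      width,
      height,
      deviceScaleFactor: scale,
      isMobile: true,
      hasTouch: true,
      isLandscape: width > height,
    },
  };
}

// Illustrative values only; the user-agent string is a placeholder.
const customPhone = makeDevice({
  userAgent:
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
  width: 375,
  height: 667,
});

// Usage (inside an async function): await page.emulate(customPhone);
```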

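Other

If you deploy with Docker, refer to the relevant resources.

Getting started with simple cases

With the basic API covered, here are three small examples that show how it works.

Screenshot with device emulation

Use Puppeteer to emulate an iPhone 6, visit Baidu, and take a screenshot of the page.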

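const puppeteer = require("puppeteer");
// Recent Puppeteer versions export the device list as KnownDevices;
// older releases exposed it as puppeteer.devices.
const { KnownDevices } = require("puppeteer");
const iPhone = KnownDevices["iPhone 6"];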

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto("https://baidu.com/");
  await page.screenshot({
    path: "full.png",
    fullPage: true,
  });
  console.log(await page.title());
  await browser.close();
})();

Screenshot of a user search

Use Puppeteer to search Baidu for "puppeteer" and save a screenshot locally.

// baidu search
const puppeteer = require("puppeteer");
const screenshot = "baidu.png";

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  // try/catch lives inside the async function so that rejections are actually caught
  try {
    const page = await browser.newPage();
    await page.goto("https://baidu.com");
    await page.type("#kw", "puppeteer");
    await page.click("#su");
    // page.waitForTimeout is deprecated; a plain timer gives the results time to render
    await new Promise((resolve) => setTimeout(resolve, 2000));
    await page.screenshot({ path: screenshot });
  } catch (err) {
    console.error(err);
  } finally {
    await browser.close();
  }
})();

Setting a cookie

Use Puppeteer to open PayPal with a cookie planted beforehand, so that the username is already rendered on the page.

// set cookie
const cookie = {
  name: "login_email",
  value: "set_by_cookie@domain.com",
  domain: ".paypal.com",
  url: "https://www.paypal.com/",
  path: "/",
  httpOnly: true,
  secure: true,
};

const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setCookie(cookie);
  await page.goto("https://www.paypal.com/signin");
  await page.screenshot({ path: "paypal_login.png" });
  await browser.close();

})();
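
When several cookies are needed, they often arrive as a single "name=value; name2=value2" string copied from a browser. The sketch below converts such a string into objects that can be spread into page.setCookie; cookiesFromHeader is a hypothetical helper, and it only fills in name, value, and url.

```javascript
// Sketch: turn a "name=value; name2=value2" string into cookie objects for page.setCookie.
function cookiesFromHeader(header, url) {
  return header
    .split(";")
    .map((pair) => pair.trim())
    .filter(Boolean)
    .map((pair) => {
      const eq = pair.indexOf("=");
      // A pair without "=" is treated as a name with an empty value
      const name = eq === -1 ? pair : pair.slice(0, eq);
      const value = eq === -1 ? "" : pair.slice(eq + 1);
      return { name, value, url };
    });
}

// Usage (inside an async function):
//   await page.setCookie(...cookiesFromHeader("a=1; b=2", "https://www.paypal.com/"));
```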

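Practical cases

After the simple examples above, you should have a first picture of the overall flow. Next, let's work through the mainstream crawling approaches one by one and see how each of them works.

Parsing HTML

Taking codashop as an example, we parse the HTML, filter the relevant DOM nodes, and extract SKU (product) data such as the price and the product name.

Sample code: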

// codashop
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.codashop.com/en-my/pubg-mobile-uc-redeem-code",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

// Delay helper
const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};

try {
  const run = async () => {
    const roots = ".form-section__denom-group";
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    // Set the page's default timeout
    page.setDefaultTimeout(100000);

    // Set the page's default navigation timeout
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    const section__denom = await page.waitForSelector(roots);

    if (!section__denom) return [];

    const params = { ...config, platform: "codashop" };
    const _waitFor = waitFor.toString();

    // Run the DOM extraction inside the page
    const jsons = await page.evaluate(
      async (args, _waitFor, _roots) => {
        const _wait = eval("(" + _waitFor + ")");
        await _wait(1000);

        const _lis = Array.from(
          document.querySelectorAll(".form-section__denom-group li")
        );

        if (_lis.length === 0) return [];

        const games = _lis.map((item) => {
          // Declared per item so a missing node cannot reuse the previous item's value
          let price, sku_name;

          const sku_name_dom =
            item.querySelector(".form-section__denom-data-section") || null;
          const sku_price_dom =
            item.querySelector(".starting-price-value") || null;

          if (sku_name_dom) {
            sku_name = sku_name_dom.innerText || "SKU_NAME";
          }

          if (sku_price_dom) {
            price = sku_price_dom.innerText;
          }

          return {
            price,
            sku_name,
            currency: args.currency,
            platform: args.platform,
            game: args.gameName,
            country: args.country,
          };
        });

        return !!(games && games.length) ? games : [];
      },
      params,
      _waitFor,
      roots
    );
    console.log(jsons);
    /**
     * [
          {
            price: 'RM4.50',
            sku_name: '60 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM22.50',
            sku_name: '325 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM45.00',
            sku_name: '660 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM112.50',
            sku_name: '1800 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM225.00',
            sku_name: '3850 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM450.00',
            sku_name: '8100 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          }
        ]
     */
    await browser.close();
  };
  run().catch((err) => console.error(err)); // the outer try/catch cannot catch async rejections
} catch (err) {
  console.error(err);
}
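
The config object carries thous_separator and decimal_point_separator, but the scraped prices stay as text such as 'RM112.50'. A possible next step is to turn them into numbers with those two fields. This is a sketch under that assumption; normalizePrice is a hypothetical helper and not part of the original crawler.

```javascript
// Sketch: convert scraped price text such as "RM1,112.50" into a number,
// using the separator fields that the config object already carries.
function normalizePrice(
  text,
  { currency = "", thous_separator = ",", decimal_point_separator = "." } = {}
) {
  let s = String(text).replace(currency, "").trim();
  if (thous_separator) s = s.split(thous_separator).join("");
  if (decimal_point_separator !== ".") {
    s = s.split(decimal_point_separator).join(".");
  }
  if (s === "") return null; // nothing left to parse
  const value = Number(s);
  return Number.isFinite(value) ? value : null;
}

// Usage: jsons.map((row) => ({ ...row, amount: normalizePrice(row.price, config) }))
```

Returning null for unparseable text keeps bad rows visible instead of silently turning them into 0.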

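Parsing SSR-rendered data

Take jollymax as an example. Viewing the page source tells us two things: which framework the site uses and whether the page is server-side rendered (SSR), which in turn tells us where the data lives. This site uses Nuxt.js with SSR, as shown in the figure:

Sample code: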

// jollymax
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.jollymax.com/ru/PUBG",
  gameName: "pubgm",
  currency: "RUB",
  country: "ru",
  thous_separator: "", // thousands separator
  decimal_point_separator: ".", // decimal separator
};

try {
  const run = async () => {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    // Create a page
    const page = await browser.newPage();

    // Set the page's default timeout
    page.setDefaultTimeout(100000);

    // Set the page's default navigation timeout
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort unnecessary resources to speed up loading
      if (
        request.resourceType() == "image" ||
        request.resourceType() == "font" ||
        request.resourceType() == "stylesheet"
      ) {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    // Wait until the target container has appeared in the DOM
    await page.waitForSelector(".content-right-part");

    const params = { ...config, platform: "jollymax" };

    const result = await page.evaluate(async (args) => {
      let filterResults = [];

      if (
        window &&
        window.__NUXT__ &&
        window.__NUXT__.data &&
        window.__NUXT__.data.length
      ) {
        const _serverData = window.__NUXT__.data[0]?.serverData;

        if (_serverData && "pageData" in _serverData) {
          const glist = _serverData.pageData.pageInfo.goodsList;
          if (!(glist && glist.length)) return filterResults;

          const getPrice = (item) => {
            let result = "0";
            if (item.payTypeList.length) {
              result = item.payTypeList[0].amount;
            }
            return result.toString();
          };

          // Use the price of the first payment channel by default
          return glist.map((item) => {
            const price = getPrice(item);
            return {
              currency: item.currency || args.currency,
              platform: args.platform,
              game: args.gameName,
              country: args.country,
              price,
              sku_name: item?.goodsName || "SKU_NAME",
            };
          });
        }
      }

      return [];
    }, params);

    console.log(result);
    /**
         * [
         {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '91',
            sku_name: '60 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '440',
            sku_name: '325 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '910',
            sku_name: '660 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '2248',
            sku_name: '1800 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '4500',
            sku_name: '3850 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '9100',
            sku_name: '8100 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '1442',
            sku_name: 'RP Upgrade Pack-A3'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '3608',
            sku_name: 'Elite RP Upgrade Pack-A3'
        }
        ]
     */

    await browser.close();
  };
  run().catch((err) => console.error(err)); // the outer try/catch cannot catch async rejections
} catch (err) {
  console.error(err);
}
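
The extraction logic can also be written as a pure function over the Nuxt state, which makes it testable without a browser. The sketch below mirrors the field names used in the code above (serverData, pageData, pageInfo, goodsList, payTypeList); extractGoods is a hypothetical helper, and the usage line assumes the state is JSON-serializable.

```javascript
// Sketch: the same extraction as above, written as a pure function over the Nuxt state.
function extractGoods(nuxtState, args) {
  const goodsList =
    nuxtState?.data?.[0]?.serverData?.pageData?.pageInfo?.goodsList;
  if (!Array.isArray(goodsList)) return [];

  return goodsList.map((item) => ({
    currency: item.currency || args.currency,
    platform: args.platform,
    game: args.gameName,
    country: args.country,
    // Use the first payment channel's amount, falling back to "0"
    price: String(item.payTypeList?.[0]?.amount ?? "0"),
    sku_name: item.goodsName || "SKU_NAME",
  }));
}

// Usage (inside an async function, assuming the state survives JSON serialization):
//   const state = await page.evaluate(() => JSON.parse(JSON.stringify(window.__NUXT__)));
//   const result = extractGoods(state, params);
```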

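HTTP hijacking / request interception

Take razer as an example. Find the request that supplies the data rendered on the current page, intercept the response for the current game name, and parse it to get the product names, prices, and other information.

Sample code: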

// razer
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

try {
  const run = async () => {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    // Set the page's default timeout
    page.setDefaultTimeout(100000);

    // Set the page's default navigation timeout
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort unnecessary resources to speed up loading
      if (
        request.resourceType() == "image" ||
        request.resourceType() == "font"
      ) {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    function getResValue() {
      return new Promise((resolve) => {
        let result = [];

        page.on("response", async (response) => {
          const url = response.url();
          const headers = response.headers();
          const contentType = headers["content-type"];
          const _url =
            url && url.indexOf("/") !== -1 ? url.split("/").pop() : "";

          if (_url && contentType && contentType.includes("application/json")) {
            const jsons = await response.json();

            if (jsons && jsons.gameSkus && jsons.gameSkus.length) {
              const _gameSkus = jsons.gameSkus || [];
              result = _gameSkus.map((item) => {
                const price = item.unitGold || item.unitBaseGold || 0;
                const sku_name =
                  item.productName || item.vanityName || "SKU_NAME";

                return {
                  currency: config.currency,
                  country: config.country,
                  platform: "razer",
                  game: _url,
                  price: price.toString(),
                  sku_name,
                };
              });
              resolve(result);
            }
          }
        });
      });
    }
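    // Attach the response listener before navigating so an early response is not missed
    const pending = getResValue();
    await page.goto(config.url);
    const result = await pending;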
    console.log(result);
    /**
     * [
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '5',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '10',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '20',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '30',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '40',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '50',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '100',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '200',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '300',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)'
        }
    ]
     */
    await browser.close();
  };
  run().catch((err) => console.error(err)); // the outer try/catch cannot catch async rejections
} catch (err) {
  console.error(err);
}
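
An alternative to a hand-rolled "response" listener is Puppeteer's page.waitForResponse, which takes a predicate. The sketch below keeps the predicate as a pure function; isSkuResponse is a hypothetical helper, and it assumes the SKU endpoint's last path segment equals the game name, as the sample output above suggests.

```javascript
// Sketch: a pure predicate deciding whether a response is the SKU payload we want.
function isSkuResponse(url, headers, gameName) {
  const contentType = headers["content-type"] || "";
  if (!contentType.includes("application/json")) return false;
  // Compare the last path segment (query string stripped) with the game name
  const lastSegment = url.split("?")[0].split("/").pop();
  return lastSegment === gameName;
}

// Usage with page.waitForResponse (inside an async function):
//   const [response] = await Promise.all([
//     page.waitForResponse((r) => isSkuResponse(r.url(), r.headers(), config.gameName)),
//     page.goto(config.url),
//   ]);
//   const { gameSkus = [] } = await response.json();
```

Starting the wait and the navigation together avoids missing a response that arrives before the listener is attached.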

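Simulating user clicks

Take razer as an example again. Locate the DOM elements that act as anchors for the products, simulate a click on each one, and read the price from the payment-channel data that loads for that product.

Sample code: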

// razer: simulate user clicks
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};
const gameSkuList = [];

try {
  const run = async () => {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    // Set the page's default timeout
    page.setDefaultTimeout(100000);

    // Set the page's default navigation timeout
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url);

    await waitFor(3000);

    const webshopStepSku = await page.waitForSelector("#webshop_step_sku");
    if (!webshopStepSku) {
      throw new Error("The current IP has been blocked!");
    }

    const skuItem = await page.$$("#webshop_step_sku .sku-list__item");
    const darkFilter = await page.$(".onetrust-pc-dark-filter");

    // Hide the popup overlay by default so it does not block clicks
    await page.evaluateHandle((element) => {
      element && (element.style.display = "none");
    }, darkFilter);

    const params = { ...config };
    const getCards = async (dList, args) => {
      for (let d of dList) {
        const sku_name = await page.evaluate((element) => {
          const res = element.querySelector(".selection-tile__text") || null;
          if (!res) return {};

          return res?.innerText || "";
        }, d);
        await d.click();
        await waitFor(1500);

        const price_text = await page.evaluate(() => {
          const channels =
            document.querySelector("#webshop_step_payment_channels") || null;
          if (!channels) return {};

          // Prefer the other payment channels
          let _details =
            channels.querySelectorAll(".selection-tile-promos__details")[1] ||
            null;

          // Fall back to the wallet channel
          if (!_details) {
            _details =
              channels.querySelectorAll(".selection-tile-promos__details")[0] ||
              null;

            if (!_details) return {};
          }

          const _card =
            _details.querySelector(".align-self-center.text-right") || null;

          if (!_card) return {};

          return _card?.innerText || "0";
        });

        const jons = {
          sku_name,
          price: price_text,
          currency: args.currency,
          platform: "pubgm",
          game: args.gameName,
          country: args.country,
        };

        gameSkuList.push(jons);
      }

      return gameSkuList;
    };
    const result = await getCards(skuItem, params);

    console.log(result);
    /**
     * [
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)',
            price: 'RM 5.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)',
            price: 'RM 10.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)',
            price: 'RM 20.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)',
            price: 'RM 30.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)',
            price: 'RM 40.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)',
            price: 'RM 50.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)',
            price: 'RM 100.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)',
            price: 'RM 200.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)',
            price: 'RM 300.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        }
    ]
     */
    await browser.close();
  };
  run().catch((err) => console.error(err)); // the outer try/catch cannot catch async rejections
} catch (err) {
  console.error(err);
}

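Advanced usage

In practice, different sites come with different or special scenarios. How do you crawl multiple pages? How do you get past a CAPTCHA? How do you defeat bot detection? Let's unlock some of Puppeteer's more powerful features.

Bypassing bot detection

We can test against a bot-detection site. On the left is a real user and on the right is a visit from Puppeteer; the WebDriver row on the right is clearly flagged in red. Tip: different browsers may behave differently.

We can use the puppeteer-extra-plugin-stealth plugin, which is part of the puppeteer-extra family. With it enabled, the page in the right-hand image clearly no longer reports the failure.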

// Bypass bot detection
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("https://bot.sannysoft.com/");
  await browser.close();
})();
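
The WebDriver row on such test pages typically reflects the standard navigator.webdriver flag. To check what a site would see from your own script, you can evaluate the same condition in the page. A minimal sketch; looksAutomated is a hypothetical helper that takes a navigator-like object so it can be tried outside a browser.

```javascript
// Sketch: the kind of check a detection page might run against the navigator object.
function looksAutomated(nav) {
  // navigator.webdriver is true when the browser is controlled by automation
  return nav.webdriver === true;
}

// Usage (inside an async function, before closing the browser):
//   const flagged = await page.evaluate(() => navigator.webdriver === true);
//   console.log("webdriver flag visible:", flagged);
```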

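Bypassing CAPTCHA checks

Some sites guard their pages with a CAPTCHA, such as Google's reCAPTCHA shown in the figure below. You can use puppeteer-extra-plugin-recaptcha to solve it and then carry on with the subsequent data operations.

Tip: this one costs money, since the solving service is paid.

Sample code: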

const puppeteer = require("puppeteer-extra");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
puppeteer.use(
  RecaptchaPlugin({
    provider: {
      id: "2captcha",
      token: "xxxxx", // your paid 2captcha API key
    },
    visualFeedback: true,
  })
);
const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};
puppeteer.launch({ headless: false }).then(async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://www.google.com/recaptcha/api2/demo");

  await page.solveRecaptchas();

  await Promise.all([
    page.waitForNavigation(),
    page.click(`#recaptcha-demo-submit`),
  ]);
  await page.screenshot({ path: "response.png", fullPage: true });
  await browser.close();
});

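Running multiple workers

In many scenarios we crawl several URLs at the same time. To keep performance under control, puppeteer-cluster can manage a pool of concurrent workers that handle the different sites, which reduces the overhead.

Sample code: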

const { Cluster } = require("puppeteer-cluster");

(async () => {
  // Create a cluster with up to 3 concurrent workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 3,
    puppeteerOptions: {
      headless: false,
    },
  });

  // Define a task (in this case: screenshot of page)
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);

    const path = url.replace(/[^a-zA-Z]/g, "_") + ".png";
    await page.screenshot({ path });
    console.log(`Screenshot of ${url} saved: ${path}`);
  });

  // Add some pages to queue
  cluster.queue("https://www.baidu.com");
  cluster.queue("https://www.bing.com/?mkt=zh-CN");
  cluster.queue("https://github.com/");

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
})();
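
One detail in the task above: the pattern /[^a-zA-Z]/g also replaces digits, so URLs that differ only by a number would map to the same file name. A possible variant that keeps digits is sketched below; screenshotName is a hypothetical helper, and it deliberately ignores the query string.

```javascript
// Sketch: derive a file name from a URL, keeping letters and digits.
function screenshotName(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9]+/g, "_") // collapse every other run of characters into "_"
    .replace(/^_+|_+$/g, ""); // trim leading and trailing underscores
  return `${slug}.png`;
}

// Usage inside the cluster task:
//   await page.screenshot({ path: screenshotName(url) });
```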

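Summary

Puppeteer helps us automate many tasks, but keep its trade-offs in mind. Memory-heavy tasks can drive memory usage very high, and because it has to launch a real Chrome instance, it can slow down applications that need to run quickly.

Overall, Puppeteer is a powerful, easy-to-use browser automation tool that fits many scenarios. When deciding whether to use it, however, weigh its two drawbacks: its consumption of system resources and its slow startup time.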
