It's 2024 and You Still Don't Know How to Use Puppeteer?

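Preface

As everyone knows, data is a key link that drives the whole business chain during development, and using crawlers to collect and refresh data is a routine task. Many languages support crawling today: Python, Java, Ruby, and Node.js, which brings us to today's protagonist, Puppeteer. It is an open-source, Node.js-based tool maintained by the Google Chrome team, mainly used to control and automate Google Chrome or other compatible browsers.

Without further ado, let's go step by step, from the basics to the deeper parts, into the world of crawlers.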

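Introduction to Puppeteer

Puppeteer is a Node.js library developed by Google that provides a high-level API for controlling headless Chrome or Chromium. It can simulate what a user does in the browser, such as clicking, filling in forms, and taking screenshots, and it lets developers read the HTML after the browser has rendered it. Its main capabilities include automation control, page manipulation, network request interception, page screenshots and PDF generation, and automated testing.

In short, Puppeteer is a powerful, easy-to-use, and flexible browser automation tool that helps developers carry out all kinds of browser operations and automation tasks.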

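Environment setup

Since v1.7.0, Puppeteer has shipped as two packages: puppeteer and puppeteer-core.

  • puppeteer: the full package, which downloads a runnable Chromium browser. It is large overall (suited to local debugging).
  • puppeteer-core: does not download a browser, so it is small; the browser you point it at must be updated manually (suited to production deployment).

Supported Node.js version: >= v16.20.0

npm i puppeteer      # local install; latest version at the time of writing: v21.7.0
npm i puppeteer -g   # or install globally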

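Puppeteer's basic API

Using headless mode

Puppeteer launches the browser in headless mode by default. Set the headless option to false to turn it off; a visible browser window is recommended for local debugging.

const browser = await puppeteer.launch();
// Equivalent to
const browser = await puppeteer.launch({ headless: true });

// For local debugging, launch with a visible window instead
const browser = await puppeteer.launch({ headless: false });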

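Note that Chrome 112 introduced a new headless mode, which can be enabled by passing a different value for the same option:

// Applies to the Puppeteer v21.x line covered here; later major versions
// changed what the headless values mean, so check the docs for your version
const browser = await puppeteer.launch({ headless: "new" });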

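Using puppeteer-core

When deploying to production with puppeteer-core, pay attention to the version. In my experience v16.2.0 works without problems, while the latest release (v21.7.0) ran into some issues once deployed.

// const puppeteer = require("puppeteer");
const puppeteer = require("puppeteer-core");

// await and return are only valid inside an async function
async function launchBrowser() {
  const browser = await puppeteer.launch({
    // executablePath: "/usr/bin/google-chrome", // production
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // local path
  });
  return browser;
}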

To find the value for executablePath, open chrome://version/ in the browser and look at the "Executable Path" field.
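Since the path differs between a local machine and a server, it can help to resolve it in one place. The following is a minimal sketch, not part of Puppeteer: resolveChromePath is a hypothetical helper, the two default paths are simply the ones used in this article, and the CHROME_PATH environment variable is an assumption made for illustration.

```javascript
// Hypothetical helper: choose a Chrome binary path for the current platform.
// The defaults are the macOS and Linux paths used in this article; CHROME_PATH
// is an assumed override variable, not a Puppeteer convention.
const DEFAULT_CHROME_PATHS = new Map([
  ["darwin", "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"],
  ["linux", "/usr/bin/google-chrome"],
]);

function resolveChromePath(platform = process.platform, env = process.env) {
  // An explicit override always wins.
  if (env.CHROME_PATH) return env.CHROME_PATH;

  const known = DEFAULT_CHROME_PATHS.get(platform);
  if (!known) {
    throw new Error(
      `No default Chrome path for platform "${platform}"; set CHROME_PATH`
    );
  }
  return known;
}
```

It would then be passed along as puppeteer.launch({ executablePath: resolveChromePath() }). Verify the actual location on each machine through chrome://version/ rather than trusting the defaults.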

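Setting additional command-line arguments for the browser instance

Use args to pass command-line switches that constrain the browser instance being launched. For details, see the list of Chromium command-line switches.

const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // local path
    args: [
      "--no-sandbox", // disable the sandbox
      "--disable-setuid-sandbox", // disable the setuid sandbox (Linux only)
      "--disable-extensions", // disable extensions
      "--incognito", // run in incognito mode
      "--disable-gpu", // disable GPU hardware acceleration
      "--no-zygote", // disable the zygote process model (no shared pre-forked child process at startup)
    ],
  });
  return browser;
}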

Setting the browser viewport resolution

Use defaultViewport to set the default viewport resolution for desktop (PC) pages.

const puppeteer = require("puppeteer-core");

async function launchBrowser() {
  const browser = await puppeteer.launch({
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", // local path
    defaultViewport: {
      height: 1080,
      width: 1920,
    },
  });
  return browser;
}

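Emulating a mobile device

const puppeteer = require("puppeteer");
// KnownDevices replaces the deprecated puppeteer.devices map of older releases
const { KnownDevices } = require("puppeteer");
const iPhone = KnownDevices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage(); // a page must exist before it can emulate a device
  await page.emulate(iPhone);
})();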

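Other

If you deploy with Docker, refer to the relevant resources.

Getting started with simple cases

With the basic API covered, here are three small examples that show how it works in practice.

Screenshot with device emulation

Use Puppeteer to emulate an iPhone 6, visit Baidu, and take a screenshot of the page.

const puppeteer = require("puppeteer");
// KnownDevices replaces the deprecated puppeteer.devices map of older releases
const { KnownDevices } = require("puppeteer");
const iPhone = KnownDevices["iPhone 6"];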

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto("https://baidu.com/");
  await page.screenshot({
    path: "full.png",
    fullPage: true,
  });
  console.log(await page.title());
  await browser.close();
})();

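Searching as a user and taking a screenshot

Use Puppeteer to search Baidu for "puppeteer" and save a screenshot locally.

// baidu search
const puppeteer = require("puppeteer");
const screenshot = "baidu.png";

// Plain delay helper (page.waitForTimeout is deprecated in recent Puppeteer releases)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  let browser;
  try {
    browser = await puppeteer.launch({
      headless: false,
    });
    const page = await browser.newPage();
    await page.goto("https://baidu.com");
    await page.type("#kw", "puppeteer");
    await page.click("#su");
    await sleep(2000); // give the results time to render
    await page.screenshot({ path: screenshot });
  } catch (err) {
    // The try/catch lives inside the async function so rejected awaits are caught
    console.error(err);
  } finally {
    if (browser) await browser.close();
  }
})();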

Setting a cookie

Open PayPal with Puppeteer and plant a cookie so that the username is pre-filled when the page renders.

// set cookie
const cookie = {
  name: "login_email",
  value: "set_by_cookie@domain.com",
  domain: ".paypal.com",
  url: "https://www.paypal.com/",
  path: "/",
  httpOnly: true,
  secure: true,
};

const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setCookie(cookie);
  await page.goto("https://www.paypal.com/signin");
  await page.screenshot({ path: "paypal_login.png" });
  await browser.close();

})();

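Real-world examples

After the simple examples above, you should have a basic picture of the overall flow. Next, let's work through the mainstream scraping approaches one by one and see how each works.

Parsing HTML

Taking codashop as an example, parse the HTML to select and filter the relevant DOM nodes and extract data such as the price and product name of each SKU (product).

Example code:
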
// codashop
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.codashop.com/en-my/pubg-mobile-uc-redeem-code",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

// Delay helper
const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};

try {
  const run = async () => {
    const roots = ".form-section__denom-group";
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    // Set the default timeout for the page
    page.setDefaultTimeout(100000);

    // Set the default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    const section__denom = await page.waitForSelector(roots);

    if (!section__denom) return [];

    const params = { ...config, platform: "codashop" };
    const _waitFor = waitFor.toString();

    // Run the DOM extraction inside the page
    const jsons = await page.evaluate(
      async (args, _waitFor, _roots) => {
        // Functions cannot be passed to page.evaluate as arguments, so the
        // helper arrives as source text and is rebuilt here
        const _wait = eval("(" + _waitFor + ")");
        await _wait(1000);

        let _lis =
          Array.from(
            document.querySelectorAll(".form-section__denom-group li")
          ) || [];

        if (_lis && _lis.length === 0) return [];

        const games = _lis.map((item) => {
          // Declared per item so a missing node never inherits the previous item's value
          let price, sku_name;

          const sku_name_dom =
            item.querySelector(".form-section__denom-data-section") || null;
          const sku_price_dom =
            item.querySelector(".starting-price-value") || null;

          if (sku_name_dom) {
            sku_name = sku_name_dom.innerText || "SKU_NAME";
          }

          if (sku_price_dom) {
            price = sku_price_dom.innerText;
          }

          return {
            price,
            sku_name,
            currency: args.currency,
            platform: args.platform,
            game: args.gameName,
            country: args.country,
          };
        });

        return !!(games && games.length) ? games : [];
      },
      params,
      _waitFor,
      roots
    );
    console.log(jsons);
    /**
     * [
          {
            price: 'RM4.50',
            sku_name: '60 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM22.50',
            sku_name: '325 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM45.00',
            sku_name: '660 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM112.50',
            sku_name: '1800 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM225.00',
            sku_name: '3850 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          },
          {
            price: 'RM450.00',
            sku_name: '8100 UC',
            currency: 'RM',
            platform: 'codashop',
            game: 'pubgm',
            country: 'my'
          }
        ]
     */
    await browser.close();
  };
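  run().catch((err) => console.error(err)); // run() is async, so its rejections bypass the surrounding try/catch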
} catch (err) {
  console.error(err);
}
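The config above declares thous_separator and decimal_point_separator, but the example never uses them, and the scraped prices stay as strings such as "RM4.50". One possible way to put those two fields to work is sketched below. parsePrice is a hypothetical helper introduced here for illustration, not something from the original scraper, and it assumes the first run of digits in the text is the price.

```javascript
// Hypothetical helper: turn a scraped price string into a number, using the
// separators declared in the per-site config object.
function parsePrice(
  text,
  { thous_separator = "", decimal_point_separator = "." } = {}
) {
  let s = String(text);

  // Drop thousands separators first ("1,234.50" -> "1234.50").
  if (thous_separator) s = s.split(thous_separator).join("");

  // Normalize the decimal separator to "." ("1234,50" -> "1234.50").
  if (decimal_point_separator !== ".") {
    s = s.split(decimal_point_separator).join(".");
  }

  // Take the first numeric run, ignoring currency symbols and spaces.
  const match = s.match(/\d+(?:\.\d+)?/);
  return match ? Number.parseFloat(match[0]) : null;
}
```

With the codashop config, parsePrice(item.price, config) would convert each scraped price before storage. Returning null for text without digits keeps bad rows easy to spot instead of silently storing NaN.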

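Parsing SSR-rendered data

Take jollymax as an example. Viewing the page source reveals two things: which framework the site uses and whether it is server-side rendered, which together tell us where the data lives. This site is server-side rendered with Nuxt.js, so the data is embedded in the page and exposed as window.__NUXT__.

Example code:
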
// jollymax
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://www.jollymax.com/ru/PUBG",
  gameName: "pubgm",
  currency: "RUB",
  country: "ru",
  thous_separator: "", // thousands separator
  decimal_point_separator: ".", // decimal separator
};

try {
  const run = async () => {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    // Create a page
    const page = await browser.newPage();

    // Set the default timeout for the page
    page.setDefaultTimeout(100000);

    // Set the default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort resources the scrape does not need, to speed up loading
      if (
        request.resourceType() == "image" ||
        request.resourceType() == "font" ||
        request.resourceType() == "stylesheet"
      ) {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    await page.goto(config.url, { waitUntil: "domcontentloaded" });

    // Wait for the content container to appear in the DOM
    await page.waitForSelector(".content-right-part");

    const params = { ...config, platform: "jollymax" };

    const result = await page.evaluate(async (args) => {
      let filterResults = [];

      if (
        window &&
        window.__NUXT__ &&
        window.__NUXT__.data &&
        window.__NUXT__.data.length
      ) {
        const _serverData = window.__NUXT__.data[0]?.serverData;

        // Guard against a missing serverData before using the `in` operator
        if (_serverData && "pageData" in _serverData) {
          const glist = _serverData.pageData.pageInfo.goodsList;
          if (!(glist && glist.length)) return filterResults;

          const getPrice = (item) => {
            let result = "0";
            if (item.payTypeList.length) {
              result = item.payTypeList[0].amount;
            }
            return result.toString();
          };

          // Use the price of the first payment channel by default
          return glist.map((item) => {
            const price = getPrice(item);
            return {
              currency: item.currency || args.currency,
              platform: args.platform,
              game: args.gameName,
              country: args.country,
              price,
              sku_name: item?.goodsName || "SKU_NAME",
            };
          });
        }
      }

      return [];
    }, params);

    console.log(result);
    /**
         * [
         {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '91',
            sku_name: '60 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '440',
            sku_name: '325 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '910',
            sku_name: '660 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '2248',
            sku_name: '1800 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '4500',
            sku_name: '3850 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '9100',
            sku_name: '8100 UC'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '1442',
            sku_name: 'RP Upgrade Pack-A3'
        },
        {
            currency: 'RUB',
            platform: 'jollymax',
            game: 'pubgm',
            country: 'ru',
            price: '3608',
            sku_name: 'Elite RP Upgrade Pack-A3'
        }
        ]
     */

    await browser.close();
  };
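  run().catch((err) => console.error(err)); // run() is async, so its rejections bypass the surrounding try/catch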
} catch (err) {
  console.error(err);
}
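The mapping from the Nuxt state to SKU rows does not depend on the browser, so it can be pulled out into a plain function and checked with mock data. The sketch below is one way to do that. extractGoods is a hypothetical name, and the field names (serverData, pageData, pageInfo, goodsList, payTypeList, amount, goodsName) are taken from the example above rather than from any documented jollymax API.

```javascript
// Hypothetical pure version of the in-page extraction: takes the Nuxt state
// object plus the scraper args and returns normalized SKU rows.
function extractGoods(nuxtState, args) {
  // Optional chaining returns undefined if any level of the state is missing.
  const goods =
    nuxtState?.data?.[0]?.serverData?.pageData?.pageInfo?.goodsList;
  if (!Array.isArray(goods)) return [];

  return goods.map((item) => ({
    currency: item.currency || args.currency,
    platform: args.platform,
    game: args.gameName,
    country: args.country,
    // Price of the first payment channel, defaulting to "0" when there is none.
    price: String(item.payTypeList?.[0]?.amount ?? "0"),
    sku_name: item.goodsName || "SKU_NAME",
  }));
}
```

Assuming the state is serializable, one way to use it would be to fetch the state with page.evaluate(() => window.__NUXT__) and call extractGoods on the Node.js side, which keeps the callback passed to page.evaluate trivial.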

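HTTP interception / requests

Take razer as an example. Find the network request that supplies the data rendered on the page, intercept the response for the current game name, and parse it to get the product names, prices, and other information.

Example code:
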
// razer
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

try {
  const run = async () => {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    // Set the default timeout for the page
    page.setDefaultTimeout(100000);

    // Set the default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    // Intercept requests
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
      // Abort resources the scrape does not need, to speed up loading
      if (
        request.resourceType() == "image" ||
        request.resourceType() == "font"
      ) {
        await request.abort();
      } else {
        await request.continue();
      }
    });

    function getResValue() {
      return new Promise((resolve) => {
        let result = [];

        page.on("response", async (response) => {
          const url = response.url();
          const headers = response.headers();
          // Some responses carry no content-type header, so default to ""
          const contentType = headers["content-type"] || "";
          const _url =
            url && url.indexOf("/") !== -1 ? url.split("/").pop() : "";

          if (_url && contentType.includes("application/json")) {
            const jsons = await response.json();

            if (jsons && jsons.gameSkus && jsons.gameSkus.length) {
              const _gameSkus = jsons.gameSkus || [];
              result = _gameSkus.map((item) => {
                const price = item.unitGold || item.unitBaseGold || 0;
                const sku_name =
                  item.productName || item.vanityName || "SKU_NAME";

                return {
                  currency: config.currency,
                  country: config.country,
                  platform: "razer",
                  game: _url,
                  price: price.toString(),
                  sku_name,
                };
              });
              resolve(result);
            }
          }
        });
      });
    }
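    // Attach the response listener before navigating so no response is missed
    const pending = getResValue();
    await page.goto(config.url);
    const result = await pending;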
    console.log(result);
    /**
     * [
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '5',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '10',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '20',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '30',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '40',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '50',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '100',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '200',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)'
        },
        {
            currency: 'RM',
            country: 'my',
            platform: 'razer',
            game: 'pubgm',
            price: '300',
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)'
        }
    ]
     */
    await browser.close();
  };
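  run().catch((err) => console.error(err)); // run() is async, so its rejections bypass the surrounding try/catch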
} catch (err) {
  console.error(err);
}
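One caveat with the promise returned by getResValue: it only resolves when a matching JSON response arrives, so if the site changes its API or blocks the request, the script waits forever. A small timeout wrapper is one way to bound that wait. The sketch below is a generic helper, and withTimeout and its label parameter are names invented here, not part of Puppeteer.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });

  // Whichever settles first wins; always clear the timer so Node.js can exit.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In the example above it could wrap the call as withTimeout(getResValue(), 30000, "gameSkus response"), where 30000 is an arbitrary value chosen for illustration.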

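Simulating user clicks

Taking razer again, locate the DOM elements of the product tiles on the page, simulate a click on each, and read the payment-channel price data that loads for each product.

Example code:

// razer: simulate user clicks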
const puppeteer = require("puppeteer");
const ua =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5410.0 Safari/537.36";

const config = {
  url: "https://gold.razer.com/my/en/gold/catalog/pubgm",
  gameName: "pubgm",
  currency: "RM",
  country: "my",
  thous_separator: ",",
  decimal_point_separator: ".",
};

const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};
const gameSkuList = [];

try {
  const run = async () => {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: {
        height: 1080,
        width: 1920,
      },
      args: ["--no-sandbox"],
    });
    const page = await browser.newPage();
    // Set the default timeout for the page
    page.setDefaultTimeout(100000);

    // Set the default navigation timeout for the page
    page.setDefaultNavigationTimeout(50000);

    // Set the user agent
    ua && (await page.setUserAgent(ua));

    await page.goto(config.url);

    await waitFor(3000);

    const webshopStepSku = await page.waitForSelector("#webshop_step_sku");
    if (!webshopStepSku) {
      throw new Error("The current IP has been blocked!");
    }

    const skuItem = await page.$$("#webshop_step_sku .sku-list__item");
    const darkFilter = await page.$(".onetrust-pc-dark-filter");

    // Hide the cookie-consent overlay by default
    await page.evaluateHandle((element) => {
      element && (element.style.display = "none");
    }, darkFilter);

    const params = { ...config };
    const getCards = async (dList, args) => {
      for (let d of dList) {
        // Both evaluations return a string in every branch (never an object),
        // so sku_name and price keep a consistent type
        const sku_name = await page.evaluate((element) => {
          const res = element.querySelector(".selection-tile__text") || null;
          if (!res) return "";

          return res?.innerText || "";
        }, d);
        await d.click();
        await waitFor(1500);

        const price_text = await page.evaluate(() => {
          const channels =
            document.querySelector("#webshop_step_payment_channels") || null;
          if (!channels) return "";

          // Prefer the other payment channels
          let _details =
            channels.querySelectorAll(".selection-tile-promos__details")[1] ||
            null;

          // Fall back to the wallet
          if (!_details) {
            _details =
              channels.querySelectorAll(".selection-tile-promos__details")[0] ||
              null;

            if (!_details) return "";
          }

          const _card =
            _details.querySelector(".align-self-center.text-right") || null;

          if (!_card) return "";

          return _card?.innerText || "0";
        });

        const jons = {
          sku_name,
          price: price_text,
          currency: args.currency,
          platform: "pubgm",
          game: args.gameName,
          country: args.country,
        };

        gameSkuList.push(jons);
      }

      return gameSkuList;
    };
    const result = await getCards(skuItem, params);

    console.log(result);
    /**
     * [
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM5)',
            price: 'RM 5.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM10)',
            price: 'RM 10.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM20)',
            price: 'RM 20.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM30)',
            price: 'RM 30.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM40)',
            price: 'RM 40.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM50)',
            price: 'RM 50.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM100)',
            price: 'RM 100.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM200)',
            price: 'RM 200.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        },
        {
            sku_name: 'Razer Gold Direct Top-Up PIN (MY) - (RM300)',
            price: 'RM 300.00',
            currency: 'RM',
            platform: 'pubgm',
            game: 'pubgm',
            country: 'my'
        }
    ]
     */
    await browser.close();
  };
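  run().catch((err) => console.error(err)); // run() is async, so its rejections bypass the surrounding try/catch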
} catch (err) {
  console.error(err);
}

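Advanced usage

In practice, different sites come with their own special scenarios: how do you crawl multiple pages? Get past CAPTCHA checks? Defeat bot detection? Let's unlock some of Puppeteer's more powerful capabilities.

Bypassing bot detection

We can test against a bot-detection site. In the comparison, the left side is a real user and the right side is a visit from Puppeteer; the WebDriver check on the right is clearly flagged in red. Tip: results may differ between browsers.

The puppeteer-extra-plugin-stealth plugin, part of the puppeteer-extra family, addresses this. With it enabled, the same page no longer reports the failure.

// Bypass bot detection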
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("https://bot.sannysoft.com/");
  await browser.close();
})();

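Bypassing CAPTCHA checks

Some sites require a CAPTCHA, such as Google's reCAPTCHA human verification. The puppeteer-extra-plugin-recaptcha plugin can solve it so that the subsequent data operations can proceed.

Tip: the solving service is paid.

Example code:
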
const puppeteer = require("puppeteer-extra");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
puppeteer.use(
  RecaptchaPlugin({
    provider: {
      id: "2captcha",
      token: "xxxxx", // API token for the paid solving service
    },
    visualFeedback: true,
  })
);
const waitFor = async (t) => {
  return new Promise((r) => setTimeout(r, t));
};
puppeteer.launch({ headless: false }).then(async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://www.google.com/recaptcha/api2/demo");

  await page.solveRecaptchas();

  await Promise.all([
    page.waitForNavigation(),
    page.click(`#recaptcha-demo-submit`),
  ]);
  await page.screenshot({ path: "response.png", fullPage: true });
  await browser.close();
});

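Running multiple workers in parallel

In many scenarios we need to scrape several URLs at once. To keep performance under control, puppeteer-cluster can manage a pool of concurrent workers that handle the different sites, reducing the overhead.

Example code:
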
const { Cluster } = require("puppeteer-cluster");

(async () => {
  // Create a cluster with up to 3 concurrent workers
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 3,
    puppeteerOptions: {
      headless: false,
    },
  });

  // Define a task (in this case: screenshot of page)
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);

    const path = url.replace(/[^a-zA-Z]/g, "_") + ".png";
    await page.screenshot({ path });
    console.log(`Screenshot of ${url} saved: ${path}`);
  });

  // Add some pages to queue
  cluster.queue("https://www.baidu.com");
  cluster.queue("https://www.bing.com/?mkt=zh-CN");
  cluster.queue("https://github.com/");

  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
})();
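To make the meaning of maxConcurrency more concrete, here is a minimal, library-free sketch of the same idea: a queue of tasks drained by a fixed number of workers. This is not how puppeteer-cluster is implemented. runWithConcurrency is a hypothetical function written only to illustrate the concept, and it assumes limit is at least 1.

```javascript
// Conceptual sketch: run async task functions with at most `limit` in flight.
// Results come back in the same order as `tasks`.
async function runWithConcurrency(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps claiming the next unclaimed task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++; // no race: nothing else runs between the check and the increment
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With three queued URLs and a limit of 2, two tasks start immediately and the third begins as soon as either finishes. puppeteer-cluster layers browser and page lifecycle management on top of this kind of scheduling.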

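Summary

While Puppeteer helps automate many tasks, keep its trade-offs in mind: memory-heavy tasks can drive memory usage very high, and because it has to start a real Chrome instance, it can hurt applications that need to run quickly.

Overall, Puppeteer is a powerful and easy-to-use browser automation tool suited to many scenarios. When deciding whether to use it, however, weigh its two drawbacks: heavy system resource usage and slow startup.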
