Is web scraping only for Python? Node.js can do it too!

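When we take on a freelance job, it often goes like this:

Client: I want a site modeled on XXX.

Me: (furious typing) Done, here you go.

Client: Where is the product data? Why is there no data? I'm docking your pay!

Me: You need it scraped? Give me two and a half minutes!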

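I run into this all the time on freelance jobs: many clients simply hand over a reference site to imitate and insist on having that site's data as well, so knowing how to write a simple crawler in Node.js comes in handy.

Note: if the target site uses anti-scraping measures or rate limiting, you should reach for more specialized tools. Most of the sites I have dealt with did not seem to have such measures in place.

Enough chit-chat, let's get started:

We will use Node's built-in http and fs modules, plus the third-party package cheerio.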

Cheerio

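Cheerio is a powerful tool that lets you parse and manipulate HTML in Node.js with a jQuery-like syntax, which is very useful for processing web page data and scraping information.

Here is the example from the official site: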

const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

console.log($.html())

As this shows, we can load an HTML string with Cheerio, select and manipulate elements with jQuery-style selectors, and then read or modify their text, attributes, and so on.

Setting up an empty project

Create a folder, cd into it, and run the following commands:

npm init   # answer the prompts as you see fit

npm i cheerio

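Let's first use the http module to see what the response actually contains:

const http = require('http')

http.get(
  'http://www.baidu.com/',
  (res) => {
    // Decode as UTF-8 so multi-byte characters split across chunks stay intact
    res.setEncoding('utf8')
    let html = ''

    res.on('data', (chunk) => {
      html += chunk
    })

    res.on('end', () => {
      console.log(html)
    })
  }
)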

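As you can see, the response is just a plain HTML string, which makes things easy: load it with Cheerio as shown above to get a DOM-like object, and working with it becomes as simple as manipulating the DOM in a browser.

We will use a certain video site as our example! Juejin would work too, of course, but shh!

Aside: HTTPS requests need the https module instead. The default port is 443 for HTTPS and 80 for HTTP.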
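As a rough sketch of that rule, the module and the default port can be derived from the URL itself. The `pickClient` helper below is hypothetical and is not used elsewhere in this post.

```javascript
const http = require('http')
const https = require('https')

// Hypothetical helper: choose the module and port from the URL scheme
const pickClient = (url) => {
  const { protocol, port } = new URL(url)
  if (protocol !== 'http:' && protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${protocol}`)
  }
  const isHttps = protocol === 'https:'
  return {
    mode: isHttps ? https : http,
    // URL.port is an empty string when the URL uses the scheme's default port
    port: port ? Number(port) : (isHttps ? 443 : 80)
  }
}

console.log(pickClient('https://www.iqiyi.com/').port) // 443
console.log(pickClient('http://www.baidu.com/').port)  // 80
```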

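Defining the originRequest function

It takes three parameters: the request module (http or https), the request options option, and a callback cb.
It sends a request with the given options and accumulates the response data into data as it arrives.
Once all the data has been received, it invokes the callback with the collected data as its argument.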

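const originRequest = (
  mode, // the http or https module
  option, // request options
  cb // called with the full response body
) => {
  const req = mode.request(
    option,
    res => {
      // Decode as UTF-8 so multi-byte characters split across chunks stay intact
      res.setEncoding('utf8')
      let data = ''

      // Listen for 'data' events to receive the body chunk by chunk
      res.on('data', (chunk) => {
        data += chunk
      })

      // The 'end' event fires once all the data has been received
      res.on('end', () => {
        // Hand the data over for processing, e.g. parsing the HTML with Cheerio
        cb(data)
      })
    })

  req.on('error', (error) => {
    console.error(`Request failed: ${error.message}`)
  })

  // Send the request
  req.end()
}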
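The callback style above only logs errors and ignores the status code. As one possible alternative, here is a sketch of a Promise-based variant that rejects on request errors and on non-2xx responses. The name `requestText` is hypothetical, and redirects are deliberately not followed.

```javascript
// Hypothetical Promise-based variant of originRequest
const requestText = (mode, option) =>
  new Promise((resolve, reject) => {
    const req = mode.request(option, (res) => {
      // Treat anything outside 2xx as a failure instead of parsing an error page
      if (res.statusCode < 200 || res.statusCode >= 300) {
        res.resume() // discard the body so the socket can be released
        reject(new Error(`Unexpected status code: ${res.statusCode}`))
        return
      }

      res.setEncoding('utf8')
      let data = ''
      res.on('data', (chunk) => { data += chunk })
      res.on('end', () => resolve(data))
    })

    // Network-level failures (DNS, refused connection, ...) reject the promise
    req.on('error', reject)
    req.end()
  })
```

It takes the same module and options as `originRequest`, for example `requestText(https, option).then(html => cheerio.load(html))`, and lets the caller decide how to handle failures with `.catch`.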

Defining the urlInfo function

It returns the basic information about the target URL.

const urlInfo = url => new URL(url)
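To show which fields of the `URL` object matter for a request, here is a small sketch. The `toRequestOptions` helper and the example.com address are assumptions for illustration. Note that `pathname` does not include the query string, so `search` has to be appended if the target page needs it.

```javascript
// Hypothetical helper: turn a URL string into options for http(s).request
const toRequestOptions = (url) => {
  const { hostname, port, pathname, search } = new URL(url)
  const options = {
    hostname,
    // pathname excludes the query string, so append search to keep it
    path: pathname + search,
    method: 'GET'
  }
  // Only set the port when the URL names one explicitly;
  // otherwise http(s).request falls back to its own default
  if (port) options.port = Number(port)
  return options
}

console.log(toRequestOptions('https://example.com/list?page=2'))
// { hostname: 'example.com', path: '/list?page=2', method: 'GET' }
```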

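Running the request and reading the HTML

Send the request with the originRequest function.
In the callback, load the HTML with Cheerio and pick out specific elements with a selector (here, .qy-mod-link-wrap).
For each selected element, extract its image URL, title, and link, and push them into an array.
Finally, call writeToJsFile to write the extracted data to a JS file; writing to a JSON file would work just as well.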

const {
  hostname,
  pathname
} = urlInfo('https://www.iqiyi.com/')

originRequest(
  https,
  {
    hostname,
    port: 443,
    path: pathname,
    method: 'GET',
    // Skips TLS certificate verification; only acceptable for quick experiments
    agent: new https.Agent({ rejectUnauthorized: false }),
    headers: {
      'User-Agent': 'node.js' // set a User-Agent header; some servers reject requests that lack one
    }
  },
  html => {
    const $ = cheerio.load(html)

    // Select elements with Cheerio selectors and extract the data
    const moves = []
    $('.qy-mod-link-wrap').each((index, element) => {
      const info = $(element).find('.video-item-preview-img > img')
      const url = $(element).find('a').attr('href')
      const img = info.attr('src')
      const title = info.attr('alt')
      moves.push({
        title,
        img,
        url
      })
    })

    // Process the scraped data, e.g. write it to a JS file
    writeToJsFile(moves)
  }
)
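Scraped `src` and `href` values are often relative (`/v_123.html`) or protocol-relative (`//pic.example.com/a.jpg`). As a small sketch, the hypothetical `toAbsolute` helper below resolves such values against the page URL with the built-in `URL` constructor. The sample values are invented and not taken from the real site.

```javascript
// Hypothetical helper: resolve a scraped link against the page it came from
const toAbsolute = (value, base) => {
  // attr() returns undefined when the attribute is missing
  if (!value) return undefined
  return new URL(value, base).href
}

const base = 'https://www.iqiyi.com/'
console.log(toAbsolute('//pic.example.com/a.jpg', base)) // https://pic.example.com/a.jpg
console.log(toAbsolute('/v_123.html', base))             // https://www.iqiyi.com/v_123.html
console.log(toAbsolute(undefined, base))                 // undefined
```

Applying it to `img` and `url` before pushing each item would make the saved data usable outside the original page.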

Defining the writeToJsFile function

It converts the given data to a JSON string.
It then uses fs.writeFile to write that string to a file named output.js.

function writeToJsFile(data) {
  // Convert the data to a JSON string
  const jsonData = JSON.stringify(data, null, 2)

  // Write the JSON string to a JS file as a CommonJS export
  fs.writeFile(
    'output.js',
    `module.exports = ${jsonData};`,
    err => {
      if (err) {
        throw err
      }
      console.log('Data written to output.js successfully!')
  })
}
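Since a JSON file is mentioned as an alternative, here is a minimal sketch of that variant. The `writeToJsonFile` helper, the sample record, and the temp-directory path are assumptions for the demo; it uses the synchronous `fs.writeFileSync` so the result can be read back immediately.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: save the scraped array as a plain JSON file
const writeToJsonFile = (data, file) => {
  fs.writeFileSync(file, JSON.stringify(data, null, 2), 'utf8')
  return file
}

// Sample data invented for the demo, written to the OS temp directory
const file = writeToJsonFile(
  [{ title: 'Demo', img: 'https://example.com/a.jpg', url: 'https://example.com/a' }],
  path.join(os.tmpdir(), 'output.json')
)

// Read the file back to confirm the round trip
const restored = JSON.parse(fs.readFileSync(file, 'utf8'))
console.log(restored[0].title) // Demo
```

A JSON file can be consumed by any language or tool, whereas the `module.exports` form is only convenient when the consumer is another Node.js script.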

Run result

Output file contents

Complete code

const http = require('http')
const https = require('https')
const cheerio = require('cheerio')
const fs = require('fs')


const originRequest = (
  mode,
  option,
  cb
) => {
  const req = mode.request(
    option,
    res => {
      // Decode as UTF-8 so multi-byte characters split across chunks stay intact
      res.setEncoding('utf8')
      let data = ''

      // Listen for 'data' events to receive the body chunk by chunk
      res.on('data', (chunk) => {
        data += chunk
      })

      // The 'end' event fires once all the data has been received
      res.on('end', () => {
        // Hand the data over for processing, e.g. parsing the HTML with Cheerio
        cb(data)
      })
    })

  req.on('error', (error) => {
    console.error(`Request failed: ${error.message}`)
  })

  // Send the request
  req.end()
}

const urlInfo = url => new URL(url)

const {
  hostname,
  pathname
} = urlInfo('https://www.iqiyi.com/')

originRequest(
  https,
  {
    hostname,
    port: 443,
    path: pathname,
    method: 'GET',
    // Accepts invalid SSL certificates; only acceptable for quick experiments
    agent: new https.Agent({ rejectUnauthorized: false }),
    headers: {
      'User-Agent': 'node.js' // set a User-Agent header; some servers reject requests that lack one
    }
  },
  html => {
    const $ = cheerio.load(html)

    // Select elements with Cheerio selectors and extract the data
    const moves = []
    $('.qy-mod-link-wrap').each((index, element) => {
      const info = $(element).find('.video-item-preview-img > img')
      const url = $(element).find('a').attr('href')
      const img = info.attr('src')
      const title = info.attr('alt')
      moves.push({
        title,
        img,
        url
      })
    })

    // Process the scraped data, e.g. write it to a JS file
    writeToJsFile(moves)
  }
)


function writeToJsFile(data) {
  // Convert the data to a JSON string
  const jsonData = JSON.stringify(data, null, 2)

  // Write the JSON string to a JS file as a CommonJS export
  fs.writeFile(
    'output.js',
    `module.exports = ${jsonData};`,
    err => {
      if (err) {
        throw err
      }
      console.log('Data written to output.js successfully!')
  })
}

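By combining Node.js's built-in http/https and fs modules with the Cheerio library, we can easily build a simple, practical crawler and write the scraped results to a JS or JSON file.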