Scraping the Weibo Hot Search List with the Scrapy Framework

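Note: Before using a crawler to collect data from a website, make sure you comply with the applicable laws and regulations and with the target site's terms of use.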

(Download links are provided at the very bottom.)

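Preparation:

Install dependencies:

Make sure a Python environment is installed.

Install scrapy with pip: pip install scrapy. The spider and pipeline below also import selenium, webdriver_manager, and pymongo, which can be installed the same way: pip install selenium webdriver-manager pymongo.

Create the Scrapy project:

Open a command-line tool and create a new Scrapy project in the desired location: scrapy startproject weiboHotSearch.

Enter the project directory: cd weiboHotSearch.

Set the User-Agent and other headers:

In settings.py, set USER_AGENT and any other headers that need customizing, so that requests resemble those of a real browser.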
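
As a rough illustration of the step above, a settings.py fragment might look like the following. USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings; the browser string and header values are placeholder assumptions, not values taken from the original project.

```python
# settings.py (fragment): illustrative values only, substitute your own.

# Assumed desktop Chrome User-Agent string; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Default headers for requests sent by Scrapy's own downloader.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
```

These settings affect only requests that Scrapy sends itself, and USER_AGENT is applied by the built-in UserAgentMiddleware, so it has an effect only while that middleware is enabled. Pages loaded through a Selenium-driven browser carry the browser's own headers instead.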

Writing the crawler

1. Create the Spider

2. Define the Item

In items.py, define the data fields you want to scrape. For the Weibo hot search list, we might need the following fields:

import scrapy

class WeiboHotsearchItem(scrapy.Item):
    rank = scrapy.Field()        # rank on the list
    keyword = scrapy.Field()     # hot search keyword
    url = scrapy.Field()         # link for the keyword
    hot_index = scrapy.Field()   # heat index
    category = scrapy.Field()    # category (e.g. pinned, rising in real time)

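3. Write the Spider

Generate a spider template with the genspider command (for example, scrapy genspider hot_search s.weibo.com) and edit it:

Import the required libraries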

import scrapy
from ..items import WeiboHotsearchItem
from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

Spider class definition

class HotSearchSpider(scrapy.Spider):
    name = 'hot_search'
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary']

Initialization method

def __init__(self, *args, **kwargs):
    super(HotSearchSpider, self).__init__(*args, **kwargs)
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # run in headless mode
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)

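Purpose: when the spider instance is initialized, configure and start a Chrome browser instance in headless mode so that no browser window pops up during execution.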

Parsing method

def parse(self, response):
    self.driver.get(response.url)

    # Explicit wait until the table rows (tr elements) are present in the DOM
    wait = WebDriverWait(self.driver, 20)
    wait.until(EC.presence_of_all_elements_located((By.XPATH, '//table/tbody/tr')))

    # Scroll to the bottom of the page to trigger loading of more content
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # wait for new content to load

        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    def first(row, xpath):
        # Return the first element matching xpath inside row, or None if there is none
        matches = row.find_elements(By.XPATH, xpath)
        return matches[0] if matches else None

    for sel in self.driver.find_elements(By.XPATH, '//table/tbody/tr'):
        rank = first(sel, './/td[@class="td-01"]')
        link = first(sel, './/td[@class="td-02"]/a')
        hot_index = first(sel, './/td[@class="td-02"]/span')
        category = first(sel, './/td[@class="td-03"]/i')
        href = link.get_attribute('href') if link is not None else None

        # Cells missing from a row are stored as None
        item = WeiboHotsearchItem()
        item['rank'] = rank.text if rank is not None else None
        item['keyword'] = link.text if link is not None else None
        item['url'] = urljoin('https://s.weibo.com', href) if href else None
        item['hot_index'] = hot_index.text if hot_index is not None else None
        item['category'] = category.text if category is not None else None
        yield item

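Purpose:

Load the page with Selenium and wait until the target rows are present.

Scroll the page to load dynamic content so that the complete data set is obtained.

Iterate over each row of the list, extract the rank, keyword, link, heat index, and category, store them in a WeiboHotsearchItem object, and yield it as output.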
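
The scrolling step can be isolated into a small helper, which also makes it easy to cap the number of scrolls in case the page keeps growing. The sketch below is only an illustration: scroll_until_stable, its parameters, and the max_rounds cap are hypothetical additions, and the browser is represented by two plain callables so the logic can be tried without Selenium.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    pause: float = 2.0,
    max_rounds: int = 10,
) -> int:
    """Scroll until the page height stops changing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_to_bottom()
        rounds += 1
        time.sleep(pause)  # give new content time to load
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

Inside the spider, get_height could be lambda: self.driver.execute_script("return document.body.scrollHeight") and scroll_to_bottom could be lambda: self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), which are the same two scripts the parsing method already runs.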

Closing method

def closed(self, reason):
    self.driver.quit()

Purpose: when the spider closes, release the browser resources created by Selenium, that is, shut down the browser instance.

4. Configure a Pipeline to save to MongoDB

import pymongo

class MongoDBPipeline:

    collection_name = 'weibo_hotsearch'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
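
Because process_item calls insert_one for every item, each crawl adds a fresh copy of every row. If that is not wanted, one possible variation is to upsert on the keyword instead. The sketch below only builds the arguments for such a call; build_upsert and the crawled_at field are hypothetical names introduced here, and keying on keyword is an assumption about what should count as a duplicate.

```python
from datetime import datetime, timezone
from typing import Tuple


def build_upsert(item: dict) -> Tuple[dict, dict]:
    """Return (filter, update) documents for a MongoDB upsert keyed on keyword."""
    doc = dict(item)
    # Hypothetical extra field recording when the row was last seen.
    doc["crawled_at"] = datetime.now(timezone.utc)
    return {"keyword": doc["keyword"]}, {"$set": doc}


# Inside the pipeline's process_item, the result could be used as:
#     flt, update = build_upsert(dict(item))
#     self.db[self.collection_name].update_one(flt, update, upsert=True)
```

With this variation the collection holds one document per keyword, updated on each crawl, rather than a growing history of snapshots; which behavior is preferable depends on how the data will be used.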

5. Update Settings

# Enable the pipeline (the module path starts with the project package, weiboHotSearch)
ITEM_PIPELINES = {
   'weiboHotSearch.pipelines.MongoDBPipeline': 300,
}

# MongoDB connection settings
MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'weibo'

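# Other optional settings
ROBOTSTXT_OBEY = False  # False makes Scrapy ignore robots.txt; use with caution and only where crawling is permitted
DOWNLOAD_DELAY = 1      # download delay in seconds, to avoid triggering anti-crawling measures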

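# Disable the built-in User-Agent downloader middleware
# (the USER_AGENT setting is then no longer applied, so do this only when replacing it with your own)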
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

6. View the results saved in MongoDB
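
One way to check the saved data, besides a GUI client or the MongoDB shell, is a short pymongo script run after the spider has finished (scrapy crawl hot_search). This is a sketch under assumptions: the URI, database, and collection names are the ones used in the settings and pipeline of this article, while format_row and print_saved are hypothetical helpers.

```python
def format_row(doc: dict) -> str:
    """Render one stored document as a single line of text."""
    rank = doc.get("rank") or "-"
    hot = doc.get("hot_index") or ""
    return f"{rank:>3}  {doc.get('keyword')}  {hot}".rstrip()


def print_saved(uri="mongodb://localhost:27017/", db="weibo",
                collection="weibo_hotsearch", limit=10):
    """Print up to `limit` documents from the collection the pipeline writes to."""
    import pymongo  # imported here so format_row is usable without pymongo

    client = pymongo.MongoClient(uri)
    try:
        for doc in client[db][collection].find().limit(limit):
            print(format_row(doc))
    finally:
        client.close()
```

Calling print_saved() with a local MongoDB instance running should list the first few stored rows; an empty result usually means the pipeline is not enabled or points at a different database.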

mongodb-windows-x86 download

Source code download