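Scraping Bilibili Comments: A Detailed Python Implementation

Introduction

Users generate enormous amounts of content online, much of it in the form of comments. Bilibili (B站), a popular video-sharing platform, has especially active comment sections, which makes them a useful source of data. This article shows how to write a Python crawler that collects the comments on Bilibili videos and saves them for later analysis.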

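What Is a Crawler?

Before starting, a quick definition. A crawler, also called a web crawler or web spider, is a program or script that automatically retrieves information from the World Wide Web according to a set of rules. In short, you write code that makes the computer fetch the information you need from web pages. Python is concise and easy to learn, which makes it well suited to writing crawlers.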

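Preparation

Before scraping Bilibili comments, prepare the following:

Python environment: make sure Python is installed on your computer and runs correctly.
Editor: an editor such as VS Code or PyCharm is recommended for writing Python code, since it makes debugging and project management easier.
Third-party libraries: we use the requests library to send HTTP requests and the beautifulsoup4 library to parse HTML pages. Both can be installed with the following command:

pip install requests beautifulsoup4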

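Writing the Crawler

Step 1: Get the comment page URL

First, find the video whose comments you want to scrape and obtain the URL of its comment page. A Bilibili video's comment page URL usually has the form https://www.bilibili.com/video/avXXXXXX/#reply, where avXXXXXX is the video's av number, so the comment page URL can be built by string concatenation.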
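
The URL construction described above can be sketched as follows. This is a minimal illustration that assumes the URL pattern given in the text; the function name build_comment_url and the av number 123456 are hypothetical and chosen only for the example.

```python
def build_comment_url(av_number: int) -> str:
    """Build the comment-page URL for a video from its numeric av number.

    Assumes the URL pattern quoted in the article; the pattern itself
    is not verified here.
    """
    if av_number <= 0:
        raise ValueError("av number must be a positive integer")
    return f"https://www.bilibili.com/video/av{av_number}/#reply"


# 123456 is an arbitrary example value, not a specific video.
print(build_comment_url(123456))
# prints: https://www.bilibili.com/video/av123456/#reply
```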

Step 2: Send an HTTP request to fetch the page content

With the comment page URL in hand, use the requests library to send an HTTP request and retrieve the page's HTML.
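
As a rough sketch of this step, the helper below fetches a page with requests and returns its HTML. The name fetch_html, the 10-second timeout, and the optional session parameter are assumptions made for illustration; the full script in Step 3 calls requests.get directly instead.

```python
import requests

# Browser-like User-Agent, the same idea as in the full script in Step 3.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36"
}


def fetch_html(url, session=None, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Calling fetch_html with a comment page URL returns the HTML as a string. Passing an explicit timeout keeps a stalled connection from hanging the crawler indefinitely, and raise_for_status makes failed requests visible instead of silently parsing an error page.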

Step 3: Complete implementation

import requests
import json
import os
import pickle
from bs4 import BeautifulSoup
import time

# Request headers that make the crawler look like a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Proxy settings (replace these values with your own proxy host and credentials)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"
proxyMeta = f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"

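# Pickle file holding the cookies of a logged-in Bilibili session
cookies_file = 'cookies.pkl'

# Name of the progress file (defined here, but not read or written by this script)
progress_file = 'progress.txt'

# Directory where the comment files are saved
comment_dir = 'comments'

# Create the comment directory if it does not exist yet
os.makedirs(comment_dir, exist_ok=True)

# Load the cookies if the file exists
if os.path.exists(cookies_file):
    with open(cookies_file, 'rb') as f:
        cookies = pickle.load(f)
else:
    cookies = None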


def login():
    """Prompt the user to log in manually and provide cookies."""
    print("Please log in to Bilibili manually and save your cookies as a pickled dict in cookies.pkl.")


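def get_video_id(url):
    """Extract the video ID (the last path segment) from a video URL."""
    # Drop any '#fragment', '?query' and trailing slash first, so that
    # 'https://www.bilibili.com/video/avXXXXXX/#reply' yields 'avXXXXXX'
    path = url.split('#')[0].split('?')[0].rstrip('/')
    return path.split('/')[-1]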


def get_comments(video_url):
    """Scrape the comments of one video and save them to a CSV file."""
    video_id = get_video_id(video_url)
    comment_file = os.path.join(comment_dir, f'{video_id}.csv')
    if os.path.exists(comment_file):
        print(f"Comment file {comment_file} already exists, skipping this video.")
        return

    proxies = {"http": proxyMeta, "https": proxyMeta}

    # Request the video page and look up the comment API URL.
    # This assumes the page embeds a JSON-LD script with a
    # ['comment']['embedUrl'] field; that structure is not guaranteed.
    response = requests.get(video_url, headers=headers, cookies=cookies, proxies=proxies)
    soup = BeautifulSoup(response.text, 'html.parser')
    script = soup.find('script', attrs={'type': 'application/ld+json'})
    if script is None:
        print(f"No JSON-LD data found for video {video_id}, skipping this video.")
        return
    video_data = json.loads(script.text)
    api_url = video_data['comment']['embedUrl']

    # Fetch comment pages until a page comes back without replies
    page = 1
    comments = []
    while True:
        api = f'{api_url}&pn={page}&type=1'
        response = requests.get(api, headers=headers, cookies=cookies, proxies=proxies)
        data = response.json()
        replies = (data.get('data') or {}).get('replies')
        if not replies:
            break
        comments.extend(replies)
        page += 1
        time.sleep(1)  # pause between requests to avoid getting the IP banned

    def quote(value):
        """Wrap a value in double quotes for CSV, escaping quotes and newlines."""
        text = str(value).replace('\n', ' ').replace('"', '""')
        return f'"{text}"'

    # Save the comments to a CSV file (level 1 = top-level comment, 2 = reply)
    with open(comment_file, 'w', encoding='utf-8') as f:
        f.write('Comment level,Parent comment user ID,Replied-to nickname,Replied-to ID,'
                'Commenter nickname,Commenter user ID,Comment content,Publish time,Likes\n')
        for comment in comments:
            author = comment['member']
            f.write(f'1,,,,{quote(author["uname"])},{author["mid"]},'
                    f'{quote(comment["content"]["message"])},{comment["ctime"]},{comment["like"]}\n')
            # 'replies' may be missing or None when a comment has no replies
            for reply in comment.get('replies') or []:
                replier = reply['member']
                f.write(f'2,{comment["mid"]},{quote(author["uname"])},{author["mid"]},'
                        f'{quote(replier["uname"])},{replier["mid"]},'
                        f'{quote(reply["content"]["message"])},{reply["ctime"]},{reply["like"]}\n')
    print(f"Scraped the comments of video {video_id} and saved them to {comment_file}.")


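def main():
    # Read the list of video URLs, one per line, ignoring blank lines
    with open('video_list.txt', 'r', encoding='utf-8') as f:
        video_urls = [line.strip() for line in f if line.strip()]

    # Scrape the comments of each video in turn
    for url in video_urls:
        get_comments(url)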


if __name__ == '__main__':
    if cookies is None:
        login()
    main()
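
The login() function above only prints a prompt, and the script expects cookies.pkl to contain a pickled object that requests accepts as cookies. As a minimal sketch, assuming you copy the Cookie request header from your browser's developer tools after logging in, the file could be created as shown below. The helper names parse_cookie_header and save_cookies are hypothetical and are not part of the original script.

```python
import pickle


def parse_cookie_header(cookie_header: str) -> dict:
    """Turn a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


def save_cookies(cookie_header: str, path: str = "cookies.pkl") -> dict:
    """Parse the cookie string and pickle the resulting dict to `path`."""
    cookies = parse_cookie_header(cookie_header)
    with open(path, "wb") as f:
        pickle.dump(cookies, f)
    return cookies
```

Running save_cookies once with the copied header string writes cookies.pkl in the current directory, after which the script's pickle.load call picks it up on each run. The cookies stop working once the session expires, at which point the file has to be regenerated.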

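Summary

Batch scraping of multiple videos: write the URLs of the videos to scrape into video_list.txt, one per line. The program iterates over the list, scrapes each video's comments, and saves them to a CSV file named after the video ID.
Log in once: after you log in to Bilibili manually and store your cookies in cookies.pkl, the program loads them on every run, so there is no need to log in again while the cookies remain valid. The script itself does not write cookies.pkl.
Resuming after interruption: if a run is interrupted, the next run skips every video whose CSV file already exists in the comments folder and continues with the rest. The progress.txt file name is defined in the script, but the code shown here does not read or write it, and a partially written CSV file is not continued.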
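
The script declares progress.txt but never reads or writes it. One possible way to use such a file for resuming, sketched here under the assumption that it stores one finished video ID per line, is shown below. The function names load_done, mark_done, and pending are hypothetical.

```python
import os


def load_done(progress_path: str = "progress.txt") -> set:
    """Return the set of video IDs already recorded as finished."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(video_id: str, progress_path: str = "progress.txt") -> None:
    """Append a finished video ID to the progress file."""
    with open(progress_path, "a", encoding="utf-8") as f:
        f.write(video_id + "\n")


def pending(video_ids, progress_path: str = "progress.txt") -> list:
    """Return the IDs that are not yet marked as done, in their original order."""
    done = load_done(progress_path)
    return [vid for vid in video_ids if vid not in done]
```

In main(), the loop could iterate over pending(...) and call mark_done only after a video's CSV file has been written completely. That way an interrupted run leaves the unfinished video out of progress.txt, and it is scraped again on the next run.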