【python】六个常见爬虫案例【附源码】

大家好，我是博主英杰，整理了几个常见的爬虫案例，分享给大家，适合小白学习

一、爬取豆瓣电影排行榜Top250存储到Excel文件

近年来，Python在数据爬取和处理方面的应用越来越广泛。本文将介绍一个基于Python的爬虫程序，用于抓取豆瓣电影Top250的相关信息，并将其保存为Excel文件。

获取网页数据的函数，包括以下步骤：

循环10次，依次爬取不同页面的信息；

使用`urllib`获取html页面；

使用`BeautifulSoup`解析页面；

遍历每个div标签，即每一部电影；

对每个电影信息进行匹配，使用正则表达式提取需要的信息并保存到一个列表中；

将每个电影信息的列表保存到总列表中。

效果展示：

源代码：

python 复制代码

from bs4 import BeautifulSoup
import  re  #正则表达式，进行文字匹配
import urllib.request,urllib.error #指定URL，获取网页数据
import xlwt  #进行excel操作
 
 
def main():
    baseurl = "https://movie.douban.com/top250?start="
    datalist= getdata(baseurl)
    savepath = ".\\豆瓣电影top250.xls"
    savedata(datalist,savepath)
 
#compile返回的是匹配到的模式对象
findLink = re.compile(r'<a href="(.*?)">')  # 正则表达式模式的匹配，影片详情
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)  # re.S让换行符包含在字符中,图片信息
findTitle = re.compile(r'<span class="title">(.*)</span>')  # 影片片名
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # 找到评分
findJudge = re.compile(r'<span>(\d*)人评价</span>')  # 找到评价人数 #\d表示数字
findInq = re.compile(r'<span class="inq">(.*)</span>')  # 找到概况
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # 找到影片的相关内容，如导演，演员等
 
 
 
##获取网页数据
def  getdata(baseurl):
    datalist=[]
    for i in range(0,10):
        url = baseurl+str(i*25)     ##豆瓣页面上一共有十页信息，一页爬取完成后继续下一页
        html = geturl(url)
        soup = BeautifulSoup(html,"html.parser") #构建了一个BeautifulSoup类型的对象soup，是解析html的
        for item in soup.find_all("div",class_='item'): ##find_all返回的是一个列表
            data=[]  #保存HTML中一部电影的所有信息
            item = str(item) ##需要先转换为字符串findall才能进行搜索
            link = re.findall(findLink,item)[0]  ##findall返回的是列表，索引只将值赋值
            data.append(link)
 
            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)
 
            titles=re.findall(findTitle,item)  ##有的影片只有一个中文名，有的有中文和英文
            if(len(titles)==2):
                onetitle = titles[0]
                data.append(onetitle)
                twotitle = titles[1].replace("/","")#去掉无关的符号
                data.append(twotitle)
            else:
                data.append(titles)
                data.append(" ")  ##将下一个值空出来
 
            rating = re.findall(findRating, item)[0]  # 添加评分
            data.append(rating)
 
            judgeNum = re.findall(findJudge, item)[0]  # 添加评价人数
            data.append(judgeNum)
 
            inq = re.findall(findInq, item)  # 添加概述
            if len(inq) != 0:
                inq = inq[0].replace("。", "")
                data.append(inq)
            else:
                data.append(" ")
 
            bd = re.findall(findBd, item)[0]
            bd = re.sub('<br(\s+)?/>(\s+)?', " ", bd)
            bd = re.sub('/', " ", bd)
            data.append(bd.strip())  # 去掉前后的空格
            datalist.append(data)
    return  datalist
 
##保存数据
def  savedata(datalist,savepath):
    workbook = xlwt.Workbook(encoding="utf-8",style_compression=0) ##style_compression=0不压缩
    worksheet = workbook.add_sheet("豆瓣电影top250",cell_overwrite_ok=True) #cell_overwrite_ok=True再次写入数据覆盖
    column = ("电影详情链接", "图片链接", "影片中文名", "影片外国名", "评分", "评价数", "概况", "相关信息")  ##execl项目栏
    for i in range(0,8):
        worksheet.write(0,i,column[i]) #将column[i]的内容保存在第0行，第i列
    for i in range(0,250):
        data = datalist[i]
        for j in range(0,8):
            worksheet.write(i+1,j,data[j])
    workbook.save(savepath)
 
 
##爬取网页
def geturl(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
    }
    req = urllib.request.Request(url,headers=head)
    try:   ##异常检测
     response = urllib.request.urlopen(req)
     html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e,"code"):    ##如果错误中有这个属性的话
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
    return html
 
if __name__ == '__main__':
    main()
    print("爬取成功！！！")

二、爬取百度热搜排行榜Top50+可视化

2.1 代码思路：

导入所需的库：

python 复制代码

import requests
from bs4 import BeautifulSoup
import openpyxl

requests 库用于发送HTTP请求获取网页内容。

BeautifulSoup 库用于解析HTML页面的内容。

openpyxl 库用于创建和操作Excel文件。

2.发起HTTP请求获取百度热搜页面内容：

python 复制代码

url = 'https://top.baidu.com/board?tab=realtime'
response = requests.get(url)
html = response.content

这里使用了 requests.get() 方法发送GET请求，并将响应的内容赋值给变量 html。

3.使用BeautifulSoup解析页面内容：

复制代码

soup = BeautifulSoup(html, 'html.parser')

创建一个 BeautifulSoup 对象，并传入要解析的HTML内容和解析器类型。

4.提取热搜数据：

python 复制代码

hot_searches = []
for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):
    hot_searches.append(item.text)

这段代码通过调用 soup.find_all() 方法找到所有 <div> 标签，并且指定 class 属性为 'c-single-text-ellipsis' 的元素。

然后，将每个元素的文本内容添加到 hot_searches 列表中。

5.保存热搜数据到Excel：

python 复制代码

workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = 'Baidu Hot Searches'

使用 openpyxl.Workbook() 创建一个新的工作簿对象。

调用 active 属性获取当前活动的工作表对象，并将其赋值给变量 sheet。

使用 title 属性给工作表命名为 'Baidu Hot Searches'。

6.设置标题：

python 复制代码

sheet.cell(row=1, column=1, value='百度热搜排行榜---博主:Yan-英杰')

使用 cell() 方法选择要操作的单元格，其中 row 和 column 参数分别表示行和列的索引。

将标题字符串 '百度热搜排行榜---博主:Yan-英杰' 写入选定的单元格。

7.写入热搜数据：

python 复制代码

for i in range(len(hot_searches)):
    sheet.cell(row=i+2, column=1, value=hot_searches[i])

使用 range() 函数生成一个包含索引的范围，循环遍历 hot_searches 列表。

对于每个索引 i，使用 cell() 方法将对应的热搜词写入Excel文件中。

8.保存Excel文件：

python 复制代码

workbook.save('百度热搜.xlsx')

使用 save() 方法将工作簿保存到指定的文件名 '百度热搜.xlsx'。

9.输出提示信息：

python 复制代码

print('热搜数据已保存到 百度热搜.xlsx')

在控制台输出保存成功的提示信息。

效果展示：

源代码：

python 复制代码

import requests
from bs4 import BeautifulSoup
import openpyxl

# 发起HTTP请求获取百度热搜页面内容
url = 'https://top.baidu.com/board?tab=realtime'
response = requests.get(url)
html = response.content

# 使用BeautifulSoup解析页面内容
soup = BeautifulSoup(html, 'html.parser')

# 提取热搜数据
hot_searches = []
for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):
    hot_searches.append(item.text)

# 保存热搜数据到Excel
workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = 'Baidu Hot Searches'

# 设置标题
sheet.cell(row=1, column=1, value='百度热搜排行榜---博主:Yan-英杰')

# 写入热搜数据
for i in range(len(hot_searches)):
    sheet.cell(row=i+2, column=1, value=hot_searches[i])

workbook.save('百度热搜.xlsx')
print('热搜数据已保存到 百度热搜.xlsx')

可视化代码：

python 复制代码

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
 
# 发起HTTP请求获取百度热搜页面内容
url = 'https://top.baidu.com/board?tab=realtime'
response = requests.get(url)
html = response.content
 
# 使用BeautifulSoup解析页面内容
soup = BeautifulSoup(html, 'html.parser')
 
# 提取热搜数据
hot_searches = []
for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):
    hot_searches.append(item.text)
 
# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
 
# 绘制条形图
plt.figure(figsize=(15, 10))
x = range(len(hot_searches))
y = list(reversed(range(1, len(hot_searches)+1)))
plt.barh(x, y, tick_label=hot_searches, height=0.8)  # 调整条形图的高度
 
# 添加标题和标签
plt.title('百度热搜排行榜')
plt.xlabel('排名')
plt.ylabel('关键词')
 
# 调整坐标轴刻度
plt.xticks(range(1, len(hot_searches)+1))
 
# 调整条形图之间的间隔
plt.subplots_adjust(hspace=0.8, wspace=0.5)
 
# 显示图形
plt.tight_layout()
plt.show()

三、爬取斗鱼直播照片保存到本地目录

效果展示：

源代码：

python 复制代码

 
#导入了必要的模块requests和os
import requests
import os
 
 
# 定义了一个函数get_html(url)，
# 用于发送GET请求获取指定URL的响应数据。函数中设置了请求头部信息，
# 以模拟浏览器的请求。函数返回响应数据的JSON格式内容
def get_html(url):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    response = requests.get(url=url, headers=header)
    # print(response.json())
    html = response.json()
    return html
 
 
# 定义了一个函数parse_html(html)，
# 用于解析响应数据中的图片信息。通过分析响应数据的结构，
# 提取出每个图片的URL和标题，并将其存储在一个字典中，然后将所有字典组成的列表返回
def parse_html(html):
    rl_list = html['data']['rl']
    # print(rl_list)
    img_info_list = []
    for rl in rl_list:
        img_info = {}
        img_info['img_url'] = rl['rs1']
        img_info['title'] = rl['nn']
        # print(img_url)
        # exit()
        img_info_list.append(img_info)
    # print(img_info_list)
    return img_info_list
 
 
# 定义了一个函数save_to_images(img_info_list)，用于保存图片到本地。
# 首先创建一个目录"directory"，如果目录不存在的话。然后遍历图片信息列表，
# 依次下载每个图片并保存到目录中，图片的文件名为标题加上".jpg"后缀。
def save_to_images(img_info_list):
    dir_path = 'directory'
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    for img_info in img_info_list:
        img_path = os.path.join(dir_path, img_info['title'] + '.jpg')
        res = requests.get(img_info['img_url'])
        res_img = res.content
        with open(img_path, 'wb') as f:
            f.write(res_img)
        # exit()
 
#在主程序中，设置了要爬取的URL，并调用前面定义的函数来执行爬取、解析和保存操作。
if __name__ == '__main__':
    url = 'https://www.douyu.com/gapi/rknc/directory/yzRec/1'
    html = get_html(url)
    img_info_list = parse_html(html)
    save_to_images(img_info_list)

四、爬取酷狗音乐Top500排行榜

从酷狗音乐排行榜中提取歌曲的排名、歌名、歌手和时长等信息

代码思路：

效果展示：

源码：

python 复制代码

import requests  # 发送网络请求，获取 HTML 等信息
from bs4 import BeautifulSoup  # 解析 HTML 信息，提取需要的信息
import time  # 控制爬虫速度，防止过快被封IP
 
 
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    # 添加浏览器头部信息，模拟请求
}
 
def get_info(url):
    # 参数 url ：要爬取的网页地址
    web_data = requests.get(url, headers=headers)  # 发送网络请求，获取 HTML 等信息
    soup = BeautifulSoup(web_data.text, 'lxml')  # 解析 HTML 信息，提取需要的信息
 
    # 通过 CSS 选择器定位到需要的信息
    ranks = soup.select('span.pc_temp_num')
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    times = soup.select('span.pc_temp_tips_r > span')
    
    # for 循环遍历每个信息，并将其存储到字典中
    for rank, title, time in zip(ranks, titles, times):
        data = {
            "rank": rank.get_text().strip(),  # 歌曲排名
            "singer": title.get_text().replace("\n", "").replace("\t", "").split('-')[1],  # 歌手名
            "song": title.get_text().replace("\n", "").replace("\t", "").split('-')[0],  # 歌曲名
            "time": time.get_text().strip()  # 歌曲时长
        }
        print(data)  # 打印获取到的信息
 
if __name__ == '__main__':
    urls = ["https://www.kugou.com/yy/rank/home/{}-8888.html".format(str(i)) for i in range(1, 24)]
    # 构造要爬取的页面地址列表
    for url in urls:
        get_info(url)  # 调用函数，获取页面信息
        time.sleep(1)  # 控制爬虫速度，防止过快被封IP

五、爬取链家二手房数据做数据分析

在数据分析和挖掘领域中，网络爬虫是一种常见的工具，用于从网页上收集数据。介绍如何使用 Python 编写简单的网络爬虫程序，从链家网上海二手房页面获取房屋信息，并将数据保存到 Excel 文件中。

效果图：

代码思路：

首先，我们定义了一个函数 fetch_data(page_number)，用于获取指定页面的房屋信息数据。这个函数会构建对应页数的 URL，并发送 GET 请求获取页面内容。然后，使用 BeautifulSoup 解析页面内容，并提取每个房屋信息的相关数据，如区域、房型、关注人数、单价和总价。最终将提取的数据以字典形式存储在列表中，并返回该列表。

接下来，我们定义了主函数 main()，该函数控制整个爬取和保存数据的流程。在主函数中，我们循环爬取前 10 页的数据，调用 fetch_data(page_number) 函数获取每一页的数据，并将数据追加到列表中。然后，将所有爬取的数据存储在 DataFrame 中，并使用 df.to_excel('lianjia_data.xlsx', index=False) 将数据保存到 Excel 文件中。

最后，在程序的入口处，通过 if name == "main": 来执行主函数 main()。

源代码：

python 复制代码

import requests
 
from bs4 import BeautifulSoup
 
import pandas as pd
 
 
# 收集单页数据 xpanx.com
 
def fetch_data(page_number):
    url = f"https://sh.lianjia.com/ershoufang/pg{page_number}/"
 
    response = requests.get(url)
 
    if response.status_code != 200:
        print("请求失败")
 
        return []
 
    soup = BeautifulSoup(response.text, 'html.parser')
 
    rows = []
 
    for house_info in soup.find_all("li", {"class": "clear LOGVIEWDATA LOGCLICKDATA"}):
        row = {}
 
        # 使用您提供的类名来获取数据 xpanx.com
 
        row['区域'] = house_info.find("div", {"class": "positionInfo"}).get_text() if house_info.find("div", {
            "class": "positionInfo"}) else None
 
        row['房型'] = house_info.find("div", {"class": "houseInfo"}).get_text() if house_info.find("div", {
            "class": "houseInfo"}) else None
 
        row['关注'] = house_info.find("div", {"class": "followInfo"}).get_text() if house_info.find("div", {
            "class": "followInfo"}) else None
 
        row['单价'] = house_info.find("div", {"class": "unitPrice"}).get_text() if house_info.find("div", {
            "class": "unitPrice"}) else None
 
        row['总价'] = house_info.find("div", {"class": "priceInfo"}).get_text() if house_info.find("div", {
            "class": "priceInfo"}) else None
 
        rows.append(row)
 
    return rows
 
 
# 主函数
 
def main():
    all_data = []
 
    for i in range(1, 11):  # 爬取前10页数据作为示例
 
        print(f"正在爬取第{i}页...")
 
        all_data += fetch_data(i)
 
    # 保存数据到Excel xpanx.com
 
    df = pd.DataFrame(all_data)
 
    df.to_excel('lianjia_data.xlsx', index=False)
 
    print("数据已保存到 'lianjia_data.xlsx'")
 
 
if __name__ == "__main__":
    main()

六、爬取豆瓣电影排行榜TOP250存储到CSV文件中

代码思路：

首先，我们导入了需要用到的三个Python模块：requests、lxml和csv。

然后，我们定义了豆瓣电影TOP250页面的URL地址，并使用getSource(url)函数获取网页源码。

接着，我们定义了一个getEveryItem(source)函数，它使用XPath表达式从HTML源码中提取出每部电影的标题、URL、评分和引言，并将这些信息存储到一个字典中，最后将所有电影的字典存储到一个列表中并返回。

然后，我们定义了一个writeData(movieList)函数，它使用csv库的DictWriter类创建一个CSV写入对象，然后将电影信息列表逐行写入CSV文件。

最后，在if name == 'main'语句块中，我们定义了一个空的电影信息列表movieList，然后循环遍历前10页豆瓣电影TOP250页面，分别抓取每一页的网页源码，并使用getEveryItem()函数解析出电影信息并存储到movieList中，最后使用writeData()函数将电影信息写入CSV文件。

效果图：

源代码：

python 复制代码

私信博主进入交流群，一起学习探讨，如果对CSDN周边以及有偿返现活动感兴趣：
可添加博主：Yan--yingjie
如果想免费获取图书，也可添加博主微信，每周免费送数十本
 
 
#代码首先导入了需要使用的模块：requests、lxml和csv。
import requests
from lxml import etree
import csv
 
#
doubanUrl = 'https://movie.douban.com/top250?start={}&filter='
 
 
# 然后定义了豆瓣电影TOP250页面的URL地址，并实现了一个函数getSource(url)来获取网页的源码。该函数发送HTTP请求，添加了请求头信息以防止被网站识别为爬虫，并通过requests.get()方法获取网页源码。
def getSource(url):
    # 反爬 填写headers请求头
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
 
    response = requests.get(url, headers=headers)
    # 防止出现乱码
    response.encoding = 'utf-8'
    # print(response.text)
    return response.text
 
 
# 定义了一个函数getEveryItem(source)来解析每个电影的信息。首先，使用lxml库的etree模块将源码转换为HTML元素对象。然后，使用XPath表达式定位到包含电影信息的每个HTML元素。通过对每个元素进行XPath查询，提取出电影的标题、副标题、URL、评分和引言等信息。最后，将这些信息存储在一个字典中，并将所有电影的字典存储在一个列表中。
def getEveryItem(source):
    html_element = etree.HTML(source)
 
    movieItemList = html_element.xpath('//div[@class="info"]')
 
    # 定义一个空的列表
    movieList = []
 
    for eachMoive in movieItemList:
 
        # 创建一个字典 像列表中存储数据[{电影一},{电影二}......]
        movieDict = {}
 
        title = eachMoive.xpath('div[@class="hd"]/a/span[@class="title"]/text()')  # 标题
        otherTitle = eachMoive.xpath('div[@class="hd"]/a/span[@class="other"]/text()')  # 副标题
        link = eachMoive.xpath('div[@class="hd"]/a/@href')[0]  # url
        star = eachMoive.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]  # 评分
        quote = eachMoive.xpath('div[@class="bd"]/p[@class="quote"]/span/text()')  # 引言（名句）
 
        if quote:
            quote = quote[0]
        else:
            quote = ''
        # 保存数据
        movieDict['title'] = ''.join(title + otherTitle)
        movieDict['url'] = link
        movieDict['star'] = star
        movieDict['quote'] = quote
 
        movieList.append(movieDict)
 
        print(movieList)
    return movieList
 
 
# 保存数据
def writeData(movieList):
    with open('douban.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'star', 'quote', 'url'])
 
        writer.writeheader()  # 写入表头
 
        for each in movieList:
            writer.writerow(each)
 
 
if __name__ == '__main__':
    movieList = []
 
    # 一共有10页
 
    for i in range(10):
        pageLink = doubanUrl.format(i * 25)
 
        source = getSource(pageLink)
 
        movieList += getEveryItem(source)
 
    writeData(movieList)