This tutorial uses the Requests library to send HTTP requests directly, combined with lxml to parse the HTML, for efficient data scraping.
1 Environment Setup and Basics
Before starting, import the required libraries and set up the basic environment:
import fake_useragent
import os

import requests
from lxml import etree

n = 0

def count():
    # Increment the global counter and return the new value
    global n
    n += 1
    return n

# Create the output directory for downloaded images if it does not exist
if not os.path.exists(r"./Picture"):
    os.mkdir(r"./Picture")
The count() function defined here generates sequential numbers used to name the downloaded images, and a Picture directory is created to store them.
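The global-counter pattern works, but a more idiomatic sketch of the same idea (assuming the same "./Picture" layout) avoids global state with itertools.count and replaces the existence check with os.makedirs:

```python
import itertools
import os

# itertools.count yields 1, 2, 3, ... with no global variable needed
counter = itertools.count(start=1)

# exist_ok=True makes the explicit os.path.exists check unnecessary
os.makedirs("./Picture", exist_ok=True)

first = next(counter)
second = next(counter)
print(first, second)  # → 1 2
```

Each call to next(counter) plays the role of count() in the original code.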
2 Fetching Pages and Parsing the Data
The following code shows how to scrape image data from a wallpaper site:
head = {
    "User-Agent": fake_useragent.UserAgent().random
}

# Crawl the first two pages of the wallpaper list
for i in range(1, 3):
    url = f"https://10wallpaper.com/List_wallpapers/page/{i}"
    resp = requests.get(url, headers=head)
    tree = etree.HTML(resp.text)
    # Each <p> under the pics-list container holds one thumbnail
    p_list = tree.xpath("//div[@id='pics-list']/p")
    for p in p_list:
        ima_url = p.xpath("./a/img/@src")[0]
        # The src attribute is a relative path, so prepend the site root
        ima_url1 = "https://10wallpaper.com" + ima_url
        print(ima_url1)
        img_name = count()
        img_resp = requests.get(ima_url1, headers=head)
        # Write the raw image bytes to disk
        with open(f"./Picture/{img_name}.jpg", "wb") as f:
            f.write(img_resp.content)
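The XPath logic above can be verified offline against a small inline HTML snippet. The markup below is a hypothetical stand-in that mirrors the structure the selectors assume (a div with id "pics-list" whose p elements each wrap a linked img), not the site's actual page:

```python
from lxml import etree

# Minimal HTML mimicking the assumed structure of the wallpaper list page
html = """
<div id="pics-list">
  <p><a href="/x"><img src="/wallpaper/a.jpg"/></a></p>
  <p><a href="/y"><img src="/wallpaper/b.jpg"/></a></p>
</div>
"""

tree = etree.HTML(html)
# Same two-step extraction as the crawler: select each <p>, then its img src
srcs = [p.xpath("./a/img/@src")[0] for p in tree.xpath("//div[@id='pics-list']/p")]
full_urls = ["https://10wallpaper.com" + s for s in srcs]
print(full_urls)
```

Running this prints the two absolute URLs, confirming the relative-to-absolute join before any real requests are sent.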