Python Web Scraping Basics

1.1 Theory

In a browser, append /robots.txt to a site's root URL to see which paths crawlers are asked not to fetch.

For example, https://www.csdn.net/robots.txt returns:

User-agent: *
Disallow: /scripts
Disallow: /public
Disallow: /css/
Disallow: /images/
Disallow: /content/
Disallow: /ui/
Disallow: /js/
Disallow: /scripts/
Disallow: /article_preview.html*
Disallow: /tag/
Disallow: /?
Disallow: /link/
Disallow: /tags/
Disallow: /news/
Disallow: /xuexi/
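
The same check can be scripted. Here is a minimal sketch using the standard library's `urllib.robotparser`, fed a few of the rules listed above as an inline string so that it runs offline. The helper name `is_allowed` and the two test URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the listing above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /tags/
Disallow: /news/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, used only to exercise the rules.
print(is_allowed("https://www.csdn.net/tags/python"))   # matches "Disallow: /tags/"
print(is_allowed("https://www.csdn.net/some/article"))  # no rule matches this path
```

To check a live site instead, call `parser.set_url(...)` followed by `parser.read()` in place of `parser.parse(...)`.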

Use the Python Requests library to send HTTP (Hypertext Transfer Protocol) requests.

Use the Python Beautiful Soup library to parse the HTML that comes back.

HTTP request

HTTP response
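
An HTTP/1.1 request and response are both plain-text messages: a start line, header lines, a blank line, then an optional body. The sketch below writes one exchange out by hand and splits it into those parts. The header values and HTML body are invented for illustration, and `split_message` is a hypothetical helper, not part of Requests.

```python
# A hand-written HTTP/1.1 exchange (illustrative values only).
raw_request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: books.toscrape.com\r\n"
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n"
    "\r\n"
)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)


def split_message(raw):
    """Split a raw HTTP message into (start_line, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers, body


for message in (raw_request, raw_response):
    start_line, headers, body = split_message(message)
    print(start_line, headers, repr(body))
```

Requests exposes the same pieces on a response object as `status_code`, `headers`, and `text`.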

1.2 Practice Code [Prices and Book Titles]

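import requests
from bs4 import BeautifulSoup  # HTML parsing

# Send a browser-like User-Agent so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
# Use a separate name for the response so the requests module is not shadowed
response = requests.get("http://books.toscrape.com/", headers=headers, timeout=10)
# Set the encoding explicitly if the page is decoded incorrectly
# response.encoding = "gbk"
if response.ok:
    # Optionally save the raw HTML for inspection
    # with open(r"C:\Users\root\Desktop\Bug.html", "w", encoding="utf-8") as file:
    #     file.write(response.text)
    content = response.text
    # "html.parser" selects Python's built-in HTML parser
    soup = BeautifulSoup(content, "html.parser")

    # Prices
    all_prices = soup.find_all("p", attrs={"class": "price_color"})
    for price in all_prices:
        # Drop the currency symbol, plus the stray "Â" that appears
        # when the page is decoded as ISO-8859-1
        print(price.get_text().lstrip("Â£"))

    # Titles
    all_titles = soup.find_all("h3")
    for title in all_titles:
        # The first <a> element inside each <h3>
        print(title.find("a").string)
else:
    print(response.status_code)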
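
Slicing or stripping a fixed prefix only works while the price text keeps the same shape. As an alternative sketch, a regular expression can pull out the number and convert it to a float, regardless of what precedes it. `parse_price` is a hypothetical helper and the sample strings are illustrative.

```python
import re


def parse_price(text):
    """Extract the first decimal number in a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


# Works whether or not the currency symbol was decoded correctly.
print(parse_price("£51.77"))
print(parse_price("Â£51.77"))
```

Inside the loop above this would be called as `parse_price(price.get_text())`.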

1.3 Practice Code [Top 250 Movie Titles]

import requests
from bs4 import BeautifulSoup  # HTML parsing

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
# Fetch the Top 250 movie titles, 25 per page
for start in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start}", headers=headers, timeout=10)
    if response.ok:
        soup = BeautifulSoup(response.text, "html.parser")
        all_titles = soup.find_all("span", attrs={"class": "title"})
        for title in all_titles:
            # Skip the alternate-title spans, whose text contains "/"
            if "/" not in title.get_text():
                print(title.get_text())
    else:
        print(response.status_code)
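
The loop above requests ten pages back to back. One possible refinement, sketched here, is to build the page URLs up front and pause between requests. The one-second default delay is an arbitrary assumption, and `page_urls` and `crawl` are hypothetical helpers. The fetch function is passed in, so the sketch itself makes no network calls.

```python
import time

BASE_URL = "https://movie.douban.com/top250?start={}"


def page_urls(total=250, page_size=25):
    """Build one URL per results page: start=0, 25, ..., 225."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


print(len(page_urls()))
print(page_urls()[-1])
```

With Requests this could be used as `crawl(page_urls(), lambda url: requests.get(url, headers=headers, timeout=10))`.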

1.4 Practice Code [Downloading Images]

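import os

import requests
from bs4 import BeautifulSoup  # HTML parsing

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get("https://www.maoyan.com/", headers=headers, timeout=10)
if response.ok:
    soup = BeautifulSoup(response.text, "html.parser")
    # Create the output directory first; open() fails if it does not exist
    os.makedirs("img", exist_ok=True)
    for img in soup.find_all("img", attrs={"class": "movie-poster-img"}):
        img_url = img.get("data-src")
        alt = img.get("alt")
        # Skip posters that lack an image URL or a name
        if not img_url or not alt:
            continue
        # Replace "/" so the name cannot point into another directory
        path = os.path.join("img", alt.replace("/", "_") + ".jpg")
        res = requests.get(img_url, headers=headers, timeout=10)
        if res.ok:
            with open(path, "wb") as f:
                f.write(res.content)
else:
    print(response.status_code)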
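
Building a file name straight from an `alt` attribute breaks when the attribute is missing or contains characters that are not valid in file names. A minimal sketch of a more defensive approach follows. `image_path` is a hypothetical helper, and the fallback name and the set of replaced characters are assumptions.

```python
import re
from pathlib import Path


def image_path(alt, directory="img", fallback="untitled", suffix=".jpg"):
    """Turn an <img> alt text into a safe file path inside directory."""
    # Collapse path separators, reserved characters and whitespace into "_".
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", alt or "").strip("_")
    return Path(directory) / f"{name or fallback}{suffix}"


print(image_path("a/b: c"))
print(image_path(None))
```

Before writing, the caller would create the folder with `path.parent.mkdir(parents=True, exist_ok=True)`.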

1.5 Practice Code [58pic.com: Scraping and Downloading Images]

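import os

import requests
from bs4 import BeautifulSoup  # HTML parsing

# Scrape images from 58pic.com
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
# Other category pages that can be used instead:
# response = requests.get("https://www.58pic.com/piccate/53-0-0.html", headers=headers, timeout=10)
# response = requests.get("https://www.58pic.com/piccate/53-598-2544.html", headers=headers, timeout=10)
response = requests.get("https://www.58pic.com/piccate/53-527-1825.html", headers=headers, timeout=10)
if response.ok:
    soup = BeautifulSoup(response.text, "html.parser")
    # Create the output directory first; open() fails if it does not exist
    os.makedirs("imgqiantuwang", exist_ok=True)
    for img in soup.find_all("img", attrs={"class": "lazy"}):
        src = img.get("data-original")
        # Skip lazy-loaded images that have no source URL
        if not src:
            continue
        # data-original is protocol-relative ("//host/path"), so add the scheme
        img_url = "https:" + src
        alt = img.get("alt")
        path = os.path.join("imgqiantuwang", str(alt).replace("/", "_") + ".jpg")
        res = requests.get(img_url, headers=headers, timeout=10)
        if res.ok:
            with open(path, "wb") as f:
                f.write(res.content)
else:
    print(response.status_code)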
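
Prepending "https:" assumes every image URL is protocol-relative. A sketch of a more general approach is to resolve each URL against the page URL with the standard library's `urljoin`, which also handles root-relative and already-absolute URLs. `absolute_url` is a hypothetical helper and the image hosts below are placeholders.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.58pic.com/piccate/53-527-1825.html"


def absolute_url(src, page_url=PAGE_URL):
    """Resolve an image src against the URL of the page it was found on."""
    return urljoin(page_url, src)


# Placeholder hosts, used only to show the three cases.
print(absolute_url("//img.example.com/a.jpg"))        # protocol-relative
print(absolute_url("/static/a.jpg"))                  # root-relative
print(absolute_url("https://cdn.example.com/a.jpg"))  # already absolute
```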