Python Web Scraping: bs4
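Overview
BeautifulSoup is an HTML/XML parser; its main job is to parse HTML/XML documents and extract data from them.
Scraping projects often run into malformed and extremely complex HTML.
BeautifulSoup4 provides methods for traversing the document tree and for searching and filtering elements by a variety of conditions. You can locate and extract the data you need with CSS selectors, regular expressions, and other flexible approaches.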
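As a rough illustration of the two lookup styles mentioned above, the sketch below runs a CSS selector and a regular expression against a small, made-up HTML snippet. The snippet, the variable names, and the choice of the built-in html.parser are assumptions for this example only.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = """
<div class="article">
  <h2>Article Title</h2>
  <a href="https://www.example.com/a">First link</a>
  <a href="/relative/b">Second link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <a> inside a <div> with class "article"
css_links = [a.get_text() for a in soup.select("div.article a")]

# Regular expression: only <a> tags whose href starts with "https://"
absolute_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https://"))]

print(css_links)       # ['First link', 'Second link']
print(absolute_links)  # ['https://www.example.com/a']
```

select returns every match as a list, much like find_all, so both results can be processed with an ordinary loop or list comprehension.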
Installation
shell
pip install beautifulsoup4
Import
python
from bs4 import BeautifulSoup
Basic usage
Creating the parse object
python
soup = BeautifulSoup('target data', 'parser')
Three parsers are in common use:
html.parser
lxml (recommended)
html5lib
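To see why a forgiving parser matters, here is a minimal sketch that feeds deliberately broken HTML to BeautifulSoup. It assumes the built-in html.parser so that no extra package is needed; lxml and html5lib have to be installed separately and may repair the same input slightly differently.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: neither <div> nor <p> is ever closed
broken = '<div class="article"><p>unclosed paragraph'

soup = BeautifulSoup(broken, "html.parser")

# The parser still builds a tree, so normal navigation works
paragraph_text = soup.div.p.get_text()
print(paragraph_text)

# prettify() shows the repaired structure with the missing end tags added
print(soup.prettify())
```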
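Getting text
Two attributes give access to a document's content: text, which returns the text with all tags stripped, and contents, which returns the list of child nodes.
contents: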
python
from bs4 import BeautifulSoup
data = """
<h1>Welcome to BeautifulSoup Practice</h1>
<div class="article">
<h2>Article Title</h2>
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
<a href="https://www.example.com">Link to Example Website</a>
"""
soup = BeautifulSoup(data, 'lxml')
print(soup.contents)
# Output:
"""
[<html><body><h1>Welcome to BeautifulSoup Practice</h1>
<div class="article">
<h2>Article Title</h2>
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
<a href="https://www.example.com">Link to Example Website</a>
</div></body></html>]
"""
text:
python
print(soup.text)
"""
Welcome to BeautifulSoup Practice
Article Title
This is a paragraph of text for practicing BeautifulSoup.
Link to Example Website
"""
Tag objects
A Tag object gives access to an HTML tag and its content, for example <p> or <div>.
Example:
python
print(soup.h2)
# <h2>Article Title</h2>
print(soup.h2.text)
# Article Title
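find parameters
To filter by class, write it with a trailing underscore (class_), because class is a reserved keyword in Python. Besides class_, any other attribute name can be passed as a keyword argument.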
python
data = """
<h1>Welcome to BeautifulSoup Practice</h1>
<div class="article">
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
</div>
<div class="ex2">
<p>This is a abcd.</p>
</div>
"""
soup = BeautifulSoup(data, 'lxml')
print(soup.find('div', class_='article'))
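A short sketch of the other filter forms may help here: filtering by a different attribute, passing an attrs dictionary, and handling the case where nothing matches. The HTML, the id value, and the variable names are made up for illustration, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = """
<div class="article"><p>First paragraph.</p></div>
<div class="ex2" id="second"><p>Second paragraph.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Any attribute can be used as a keyword filter, not only class_
by_id = soup.find("div", id="second")

# attrs takes a dict, which helps with attribute names that are not
# valid Python identifiers (for example data-* attributes)
by_attrs = soup.find("div", attrs={"class": "ex2"})

# find returns None when nothing matches, so check before using the result
missing = soup.find("div", class_="no-such-class")
missing_text = missing.get_text() if missing is not None else ""

print(by_id is by_attrs)  # True: both lookups reach the same Tag object
print(missing)            # None
```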
Getting tag attributes
python
data = ' <p id = "apple">This is a paragraph of text for practicing BeautifulSoup.</p>'
soup = BeautifulSoup(data, 'lxml')
tag = soup.find('p')
print(tag.get('id'))
# apple
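Besides get, a tag's attributes can also be read with square brackets or through the attrs dictionary. The following sketch compares them on a made-up <a> tag; the markup and variable names are assumptions, and html.parser is used to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = '<a id="home" class="nav link" href="/index.html">Home</a>'
tag = BeautifulSoup(html, "html.parser").find("a")

# Square brackets raise KeyError if the attribute is missing
href = tag["href"]

# get returns None (or the given default) instead of raising
title = tag.get("title")
title_or_default = tag.get("title", "untitled")

# class is a multi-valued attribute, so it comes back as a list
classes = tag["class"]

print(href, title, title_or_default, classes)
print(tag.attrs)  # all attributes as a dictionary
```

Using get is usually the safer choice when scraping pages where an attribute may be absent on some elements.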
Getting all matching tags
python
# data here is the two-<div> HTML string from the find example
soup = BeautifulSoup(data, 'lxml')
print(soup.find_all('p'))
# [<p>This is a paragraph of text for practicing BeautifulSoup.</p>, <p>This is a abcd.</p>]
print(len(soup.find_all('p')))
# 2
Calling find_all() with no arguments returns every tag in the document.
Getting the tag name
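A few variations of find_all are sketched below: no arguments, a list of tag names, and the limit parameter. The HTML string and variable names are made up for the example, and html.parser is assumed so that no wrapper <html> or <body> tags are added.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = "<div><h2>Title</h2><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

# No arguments: every tag, in document order
all_names = [tag.name for tag in soup.find_all()]

# A list of names matches any of them
headings_and_paragraphs = [tag.name for tag in soup.find_all(["h2", "p"])]

# limit stops the search after the given number of matches
first_only = soup.find_all("p", limit=1)

print(all_names)                # ['div', 'h2', 'p', 'p']
print(headings_and_paragraphs)  # ['h2', 'p', 'p']
print(first_only)               # [<p>one</p>]
```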
python
print(soup.div.name)
# div
Nested lookups
Sample HTML:
python
html = '''
<div class="article">
<h2>Article Title</h2>
<p>This is a paragraph of text for practicing BeautifulSoup.</p>
<p>This is a abcd.</p>
<a href="https://www.example.com">Link to Example Website</a>
</div>
'''
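Goal: get all the p tags inside the div
python
# Parse the sample HTML defined above before searching it
soup = BeautifulSoup(html, 'lxml')
print(soup.find('div', class_='article').find_all('p'))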
Child and parent nodes
python
soup = BeautifulSoup(data, 'lxml')
# Iterate over all ancestors of the first <p>
for item in soup.p.parents:
    print(item)
# Iterate over all direct children of the first <p>
for i in soup.p.children:
    print(i)
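Printing whole nodes can get noisy, so a compact way to inspect the tree is to collect only the names of the ancestors and the string form of the children. This is a minimal sketch on a made-up snippet; html.parser is assumed, which is why no <html> or <body> ancestors appear.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = '<div class="article"><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, "html.parser")

# parents walks upward until it reaches the BeautifulSoup object itself,
# whose name is the special string "[document]"
parent_names = [parent.name for parent in soup.p.parents]

# children yields direct children only: here a text node and a <b> tag
child_nodes = [str(child) for child in soup.p.children]

print(parent_names)  # ['div', '[document]']
print(child_nodes)   # ['Hello ', '<b>world</b>']
```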