Contents
- I. Introduction
- II. Usage
- III. Bs4 Object Types
  - 1. Tag
  - 2. NavigableString
  - 3. BeautifulSoup
  - 4. Comment
- IV. Traversing the Document Tree
- V. Common Methods
- VI. CSS Selectors
- VII. Case Study
I. Introduction
Bs4 (beautifulsoup4) is a library for extracting data from HTML or XML documents.
Differences between Bs4 and XPath:
XPath: locates data by describing a path to it
Bs4: retrieves data through ready-made, encapsulated methods
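To make the contrast concrete, here is a minimal sketch (using made-up sample markup) that extracts the same title both ways:
python
# XPath describes a path to the data; Bs4 calls an encapsulated method
from lxml import etree
from bs4 import BeautifulSoup

html = "<html><head><title>demo</title></head><body></body></html>"
print(etree.HTML(html).xpath('//title/text()'))  # ['demo']
print(BeautifulSoup(html, 'lxml').title.string)  # demo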
II. Usage
Install the third-party libraries:
pip install beautifulsoup4
pip install lxml
python
# Import
from bs4 import BeautifulSoup
# Create the object from the page source, specifying the parser
soup = BeautifulSoup(html_doc, 'lxml')
III. Bs4 Object Types
python
# Import the library
from bs4 import BeautifulSoup
# Sample data
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')
1. Tag
python
# soup.title finds the title tag
print(soup.title) # <title>The Dormouse's story</title>
print(type(soup.title)) # <class 'bs4.element.Tag'>
print(soup.p) # <p class="title"><b>The Dormouse's story</b></p>
print(soup.a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
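A Tag object also exposes its name and attributes, which is often what you actually need; a short sketch against the same soup:
python
# .name is the tag name; .attrs is a dict of the tag's attributes
a_tag = soup.a
print(a_tag.name)         # a
print(a_tag.attrs)        # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(a_tag.get('href'))  # http://example.com/elsie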
2. NavigableString
python
# The text content inside a tag
title_tag = soup.title
print(title_tag.string) # The Dormouse's story
print(type(title_tag.string)) # <class 'bs4.element.NavigableString'>
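When a tag contains nested markup rather than plain text, get_text() gathers the text of all descendants into an ordinary str; a quick sketch:
python
# get_text() collects the text of a tag and all of its descendants
print(soup.p.get_text())        # The Dormouse's story
print(type(soup.p.get_text()))  # <class 'str'>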
3. BeautifulSoup
python
print(type(soup)) # <class 'bs4.BeautifulSoup'>
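The BeautifulSoup object represents the document as a whole; since it has no real tag name, it reports a special name:
python
# The document object carries the special name '[document]'
print(soup.name)  # [document]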
4. Comment
python
html = '<b><!--好好坚持学习python--></b>'
soup2 = BeautifulSoup(html, "lxml")
print(soup2.b.string) # 好好坚持学习python
print(type(soup2.b.string)) # <class 'bs4.element.Comment'>
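Because Comment is a subclass of NavigableString, .string returns comment text without any warning; isinstance() can tell the two apart, as in this sketch:
python
from bs4 import Comment

# Distinguish real text from comment text before using it
s = soup2.b.string
if isinstance(s, Comment):
    print('comment:', s)  # comment: 好好坚持学习python
else:
    print('text:', s)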
IV. Traversing the Document Tree
python
# Import the library
from bs4 import BeautifulSoup
# Sample data
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and the
ir names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')
1. Traversing child nodes
1) contents: returns a list of all direct children
python
print(soup.head.contents) # [<title>The Dormouse's story</title>]
2) children: returns an iterator over the direct children
python
# Loop over the iterator to get its contents
for i in soup.head.children:
    print(i)  # <title>The Dormouse's story</title>
3) descendants: returns a generator that walks all descendants recursively
python
for i in soup.head.descendants:
    print(i)
# <title>The Dormouse's story</title>
# The Dormouse's story
2. Getting node content
1) string: gets the text inside a tag
python
# Only returns text when the tag has a single child; otherwise it returns None
print(soup.title.string) # The Dormouse's story
print(soup.head.string) # The Dormouse's story
print(soup.html.string) # None
2) strings: returns a generator used to get the text of multiple tags
python
# Returns a generator
print(soup.html.strings)  # <generator object _all_strings at 0x0000020F4F3EB5C8>
# Loop over the generator; with strings, a lot of extra newlines show up
for i in soup.html.strings:
    print(i)
# The Dormouse's story
#
#
#
#
# The Dormouse's story
#
#
# Once upon a time there were three little sisters; and their names were
#
# Elsie
# ,
#
# Lacie
# and
#
# Tillie
# ;
# and they lived at the bottom of a well.
#
#
# ...
#
#
3) stripped_strings: like strings, but strips the extra whitespace
python
print(soup.html.stripped_strings)  # <generator object stripped_strings at 0x000001FD0822B5C8>
# Loop over the generator to get its contents
for i in soup.html.stripped_strings:
    print(i)
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie
# ,
# Lacie
# and
# Tillie
# ;
# and they lived at the bottom of a well.
# ...
3. Traversing parent nodes
1) parent: gets the direct parent
python
print(soup.title.parent) # <head><title>The Dormouse's story</title></head>
# In Bs4, the parent of the html tag is the BeautifulSoup object itself
print(soup.html.parent)
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and the
# ir names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>
print(type(soup.html.parent)) # <class 'bs4.BeautifulSoup'>
2) parents: gets all ancestors
python
# parents: a generator over all ancestors
print(soup.a.parents)  # <generator object parents at 0x000001E7984EA5C8>
for i in soup.a.parents:
    print(i.name, soup.name)
# p [document]
# body [document]
# html [document]
# [document] [document]
4. Traversing sibling nodes
python
# Import the library
from bs4 import BeautifulSoup
# Sample data
html_doc = '''<a>
<b>bbb</b><c>ccc</c><d>ddd</d>
</a>'''
# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')
1) next_sibling: the next sibling node
python
# The immediately adjacent one
print(soup.b.next_sibling) # <c>ccc</c>
2) previous_sibling: the previous sibling node
python
print(soup.c.previous_sibling) # <b>bbb</b>
3) next_siblings: all following siblings
python
for i in soup.b.next_siblings:
    print(i)
# <c>ccc</c>
# <d>ddd</d>
4) previous_siblings: all preceding siblings
python
for i in soup.d.previous_siblings:
    print(i)
# <c>ccc</c>
# <b>bbb</b>
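Note that the sample markup above deliberately keeps b, c, and d on a single line. When tags are separated by newlines, the first sibling is usually a whitespace NavigableString rather than the next tag; a short sketch of that gotcha:
python
# With newlines between tags, next_sibling hits the whitespace text node first
soup3 = BeautifulSoup('<a>\n<b>bbb</b>\n<c>ccc</c>\n</a>', 'lxml')
print(repr(soup3.b.next_sibling))         # '\n'
print(soup3.b.next_sibling.next_sibling)  # <c>ccc</c>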
V. Common Methods
python
# Import the library
from bs4 import BeautifulSoup
# Sample data
html_doc = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')
1. find_all()
Returns all matching tags as a list.
2. find()
Returns the first matching tag.
python
# 1. Get the first tr tag
tr = soup.find('tr')  # finds a single tag by default
print(tr)
# 2. Get all tr tags
trs = soup.find_all('tr')  # returns a list, one entry per tr
print(trs)
# 3. Get the second tr tag
tr2 = soup.find_all('tr')[1]  # returns a list, so just index into it
print(tr2)
# 4. Get the tags with class="odd"
# Option 1
odd = soup.find_all('tr', class_='odd')
for tr in odd:
    print(tr)
# Option 2
odd2 = soup.find_all('tr', attrs={'class': 'odd'})
for tr in odd2:
    print(tr)
# 5. Get the href attribute of every a tag
lst = soup.find_all('a')
for a in lst:
    print(a['href'])
# 6. Get all the job titles
lst_data = soup.find_all('a')
for a in lst_data:
    print(a.string)
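Beyond the basics, find_all() also accepts a result limit, a list of tag names, and compiled regular expressions as attribute filters; a brief sketch against the same soup:
python
import re

# limit caps the number of results returned
print(soup.find_all('tr', limit=2))
# a list matches any of the given tag names
print(soup.find_all(['td', 'a'])[:3])
# keyword arguments filter on attributes; regexes work as values
print(soup.find_all('a', id='test'))
print(soup.find_all('a', href=re.compile(r'id=3\d+'))[:2])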
VI. CSS Selectors
python
# Import the library
from bs4 import BeautifulSoup
# Sample data
html_doc = """
<html><head><title>睡鼠的故事</title></head>
<body>
<p class="title"><b>睡鼠的故事</b></p>
<p class="story">从前有三个小姐妹;他们的名字是
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>、
<a href="http://example.com/lacie " class="sister" id="link2">Lacie</a> 和
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
他们住在井底。</p>
<p class="story">...</p>
"""
# features specifies the parser
soup = BeautifulSoup(html_doc, features='lxml')
Syntax reference: http://www.w3cmap.com/cssref/css-selectors.html
python
# Get the elements whose class is sister
print(soup.select('.sister'))
# Get the element whose id is link1
print(soup.select('#link1'))
# Get the text inside the title tag
print(soup.select('title')[0].string)
# Get the a tags nested under p tags
print(soup.select('p a'))
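select() always returns a list; select_one() returns just the first match, and attribute selectors and child combinators work as well:
python
# select_one() returns the first match instead of a list
print(soup.select_one('.sister'))
# attribute selector
print(soup.select('a[href="http://example.com/tillie"]'))
# child combinator with class and id
print(soup.select('p.story > a#link2'))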
VII. Case Study
Target site: http://www.weather.com.cn/textFC/hb.shtml
Goal: scrape the nationwide weather, including each city name and minimum temperature, and save the data to a CSV file.
python
# Imports
import requests
from bs4 import BeautifulSoup
import csv

# Collected rows
lst = []

# Fetch the page source
def get_html(url):
    # Send the request
    html = requests.get(url)
    # The response comes back garbled, so set the encoding explicitly
    html.encoding = 'utf-8'
    # Get the page source
    html = html.text
    # Return it to the caller
    return html

# Parse the page data
def parse_html(html):
    # Create the object
    soup = BeautifulSoup(html, 'html5lib')
    # Parse
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)
    tables = conMidtab.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):
            dic = {}
            # Pick the right td: in the first row of each table the first td
            # holds the province name, so the city name sits in the second td
            if index == 0:  # is this the first city of the province?
                # First city
                city_td = tr.find_all('td')[1]
            else:
                # Other cities
                city_td = tr.find_all('td')[0]
            temp_td = tr.find_all('td')[-2]
            # print(city_td, temp_td)
            # Pull the text content out of the matched tags
            dic['city'] = list(city_td.stripped_strings)[0]
            dic['temp'] = temp_td.string
            lst.append(dic)

# Save the data
def save_data():
    # Define the header
    head = ('city', 'temp')
    # Write the csv file
    with open('weather.csv', 'w', encoding='utf-8-sig', newline='') as f:
        # Create the csv writer
        writer = csv.DictWriter(f, fieldnames=head)
        # Write the header row
        writer.writeheader()
        # Write the data rows
        writer.writerows(lst)

# Get the urls for the different regions
def area(link):
    # Fetch the page source
    link = get_html(link)
    # Create the object
    soup = BeautifulSoup(link, 'html5lib')
    # Parse
    conMidtab = soup.find('ul', class_='lq_contentboxTab2')
    # Find the a links
    tags = conMidtab.find_all('a')
    # List of urls
    hrefs = []
    # Build the url list
    for i in tags:
        hrefs.append('http://www.weather.com.cn' + i.get('href'))
    # Print the url list
    # print(hrefs)
    # Return the list
    return hrefs

# Main logic
def main():
    # Starting url
    link = 'http://www.weather.com.cn/textFC/hb.shtml'
    # Urls for the different regions
    region_urls = area(link)
    # print(region_urls)
    for url in region_urls:
        # Fetch the page source
        html = get_html(url)
        # Parse the data
        parse_html(html)
    # Save the results
    save_data()

# Run the program
main()
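One note on the parser choice: html5lib (installed separately with pip install html5lib) is slower than lxml but more forgiving of malformed markup, which helps on pages like this whose tables are not always well-formed. If the site's layout has changed since this was written, the class names passed to find() may need updating.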
These are my study notes; discussion and exchange are welcome. Please respect the original work and credit the source when reposting~