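Introduction to bs4 and traversing the document tree, searching the document tree, example: scraping pictures of beautiful women, other bs4 features, CSS selectors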

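Introduction to bs4 and traversing the document tree

BeautifulSoup is a Python parsing library that extracts data from HTML or XML files.

Install the module first: pip install beautifulsoup4

Usage

For the parser you can use lxml, which is fast but must be installed separately (pip install lxml), or Python's built-in html.parser, which needs no extra installation.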


from bs4 import BeautifulSoup

# html_doc is the text of the scraped web page
soup = BeautifulSoup(html_doc, 'html.parser')
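As a minimal sketch of the setup described here, the snippet below builds a soup from a small inline page rather than a scraped one. The html_doc content is made up for illustration; any HTML string should behave the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the text of a scraped page
html_doc = """<html><head><title>Demo page</title></head>
<body><p class="title"><b>Hello</b></p></body></html>"""

# html.parser ships with Python; pass "lxml" instead if lxml is installed
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title)           # <title>Demo page</title>
```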

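Key point: traversing the document tree

Traversing the document tree means selecting a tag directly by its name. Selection is fast, but if several tags share the same name, only the first one is returned.

Usage: traverse with the dot (.) operator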


# Get the first title under html > head
res=soup.html.head.title

# Get the first p
res=soup.p
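To make the "only the first one is returned" behaviour concrete, here is a small sketch on a made-up two-paragraph page. The page content is an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two p tags
html_doc = "<html><head><title>Demo</title></head><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.html.head.title)  # <title>Demo</title>
print(soup.p)                # <p>first</p>  (the second p is not reachable this way)
print(soup.span)             # None, because the page has no span tag
```

When every matching tag is needed, find_all is the usual tool instead of dot traversal.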

Getting the tag name


res=soup.html.head.title.name
res=soup.p.name

Getting a tag's attributes


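# All attributes of the tag
res=soup.body.a.attrs  # returned as a dict: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

# Get a single attribute value (three ways; only .get() returns None when the attribute is missing)
res=soup.body.a.attrs.get('href')
res=soup.body.a.attrs['href']
res=soup.body.a['href']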
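The difference between the three access styles shows up when an attribute is missing. The sketch below uses a single made-up a tag (the second class value "big" is an assumption added to show the list form).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one link
html_doc = '<body><a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a></body>'
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.body.a

print(a.name)          # a
print(a["href"])       # http://example.com/elsie
print(a["class"])      # ['sister', 'big']  (class is multi-valued, so it is a list)
print(a.get("title"))  # None, the attribute is missing

try:
    a["title"]         # indexing a missing attribute raises KeyError
except KeyError:
    print("no title attribute")
```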

Getting a tag's content


res=soup.body.a.text
res=soup.p.text

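# .string returns the text only when the tag has exactly one child (a string, or a single tag that itself has one string); otherwise it is None
res=soup.a.string
# .strings is a generator over every string inside the tag
res=soup.p.strings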
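The three accessors are easy to confuse, so here is a short comparison on a made-up fragment. It is only a sketch; the markup is an assumption chosen to hit each case.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p><div><i>only</i></div>", "html.parser")

print(soup.p.text)           # Hello world!
print(soup.p.string)         # None, because p has three children
print(soup.b.string)         # world
print(soup.div.string)       # only  (div has a single child tag, so that tag's string is used)
print(list(soup.p.strings))  # ['Hello ', 'world', '!']
```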

Nested selection
Simply chain the dot operator, for example soup.body.a

Child and descendant nodes


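# All direct children of p, as a list
print(soup.p.contents)

# .children is an iterator over the same direct children of p
print(list(soup.p.children))

# All descendants of p: every nested tag and text node, at any depth
print(list(soup.p.descendants))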
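A rough illustration of how children and descendants differ, using a made-up nested paragraph (the markup is an assumption for demonstration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>Hello <b>big <i>world</i></b></p>", "html.parser")
p = soup.p

print(len(p.contents))        # 2: the text 'Hello ' and the b tag
print(len(list(p.children)))  # 2: the same nodes, produced lazily

descendants = list(p.descendants)
print(len(descendants))       # 5: 'Hello ', b, 'big ', i, 'world'

# Keep only the tags, skipping text nodes
print([d.name for d in descendants if isinstance(d, Tag)])  # ['b', 'i']
```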

Parent and ancestor nodes


# Get the parent of the a tag
print(soup.a.parent)

# All ancestors of the a tag: its parent, its parent's parent, and so on
print(list(soup.a.parents))

Sibling nodes


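print(soup.a.next_sibling)  # the next sibling (may be a text node)
print(soup.a.previous_sibling)  # the previous sibling
print(list(soup.a.next_siblings))  # all following siblings; .next_siblings is a generator
print(list(soup.a.previous_siblings))  # all preceding siblings; .previous_siblings is a generator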
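One thing that often surprises people is that a sibling can be a piece of text rather than a tag. The sketch below shows ancestors and siblings on a made-up fragment; find_next_sibling is used as one way to skip text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links separated by a comma
html_doc = '<html><body><p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p></body></html>'
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.a

print([parent.name for parent in a.parents])  # ['p', 'body', 'html', '[document]']
print(repr(a.next_sibling))                   # ', '  (a text node, not the next a tag)
print(a.previous_sibling)                     # None, link1 is the first child of p
print(a.find_next_sibling("a"))               # <a id="link2">Lacie</a>
print(len(list(a.next_siblings)))             # 2: the text ', ' and the second a tag
```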

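Searching the document tree

find_all: returns every match, as a list
find: returns the first match, as a Tag object (or None if nothing matches)

find and find_all

Five kinds of filters: string, regular expression, list, True, function

String

You can search by tag name, by attribute, or by text content.

In all three cases the value is matched as an exact string.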


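# Find the first p tag
p=soup.find('p')

# Search by tag name, by attribute, or by text content
p=soup.find(name='p',class_='story')  # the p tag whose class is story
obj=soup.find(name='span',string='lqz')  # string= replaces the older text= argument
obj=soup.find(href='http://example.com/tillie')

# Attributes can also be passed as a dict
obj=soup.find(attrs={'class':'title'})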
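The string filters can be tried out on a small page modeled on the attribute values used in these notes. The exact page below is an assumption, written inline so the sketch runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page
html_doc = (
    '<html><body><p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time there were three little sisters: '
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find("p")["class"])                           # ['title']
print(soup.find(name="p", class_="story").a["id"])       # link1
print(soup.find(href="http://example.com/tillie").text)  # Tillie
print(soup.find(name="a", string="Lacie")["id"])         # link2
print(soup.find(attrs={"class": "title"}).text)          # The Dormouse's story
print(soup.find("span"))                                 # None, nothing matches
```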

Regular expression

Tag names, attributes, and text content can all be matched against a compiled regular expression.


import re

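# Find all tags whose name starts with b
obj=soup.find_all(name=re.compile('^b'))

# Tags whose name ends with y
obj=soup.find_all(name=re.compile('y$'))

# Attribute values and text can be matched the same way
obj=soup.find_all(href=re.compile('^http:'))
obj=soup.find_all(string=re.compile('i'))  # string= replaces the older text= argument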
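Here is a small sketch of the same regular-expression filters with their results on a made-up page. The page is an assumption; one link deliberately uses https so the '^http:' pattern skips it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page
html_doc = (
    '<html><body><p class="title"><b>Story</b></p>'
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a href="https://example.com/lacie">Lacie</a></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

print([t.name for t in soup.find_all(name=re.compile("^b"))])      # ['body', 'b']
print([t.name for t in soup.find_all(name=re.compile("y$"))])      # ['body']
print([t.text for t in soup.find_all(href=re.compile("^http:"))])  # ['Elsie']
print(soup.find_all(string=re.compile("i")))                       # ['Elsie', 'Lacie']
```

Note that searching by string alone returns the matching text nodes rather than tags.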

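List

Tag names, attributes, and text content can all be matched against a list; an element matches if it matches any item in the list.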


# All p tags and a tags, collected in one list
obj=soup.find_all(name=['p','a'])
obj = soup.find_all(class_=['sister', 'title'])
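A quick sketch of the list filter on a made-up fragment, showing that results come back in document order regardless of the order of items in the list:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment
html_doc = '<div><p class="title">T</p><a class="sister">A</a><span class="other">S</span></div>'
soup = BeautifulSoup(html_doc, "html.parser")

print([t.name for t in soup.find_all(name=["a", "p"])])             # ['p', 'a']
print([t.name for t in soup.find_all(class_=["sister", "title"])])  # ['p', 'a']
```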

True

Passing True as the value matches every tag that has the given attribute, whatever the attribute's value.


obj=soup.find_all(id=True)
obj=soup.find_all(href=True)
obj=soup.find_all(name='img',src=True)
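To see what True filters out, the sketch below mixes tags with and without the attribute in question. The fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link and one image lack attributes
html_doc = '<div><a id="x" href="/a">A</a><a>B</a><img src="1.png"><img></div>'
soup = BeautifulSoup(html_doc, "html.parser")

print(len(soup.find_all("a")))                   # 2
print(len(soup.find_all(href=True)))             # 1, only the a tag that has href
print(len(soup.find_all(id=True)))               # 1
print(len(soup.find_all(name="img", src=True)))  # 1, the img without src is skipped
```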

Function

A function can also be used as the filter: it receives each tag and returns True for the tags to keep.


# Has a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(name=has_class_but_no_id))
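The function filter can be checked on a tiny made-up fragment. The second call is an additional sketch of a function used as an attribute filter, in which case it receives the attribute value (None when the attribute is absent) rather than the tag.

```python
from bs4 import BeautifulSoup


def has_class_but_no_id(tag):
    # Called once per tag; return True to keep the tag
    return tag.has_attr("class") and not tag.has_attr("id")


# Hypothetical fragment
html_doc = '<div><p class="title">T</p><a class="sister" id="link1">A</a><b>B</b></div>'
soup = BeautifulSoup(html_doc, "html.parser")

matches = soup.find_all(has_class_but_no_id)
print(matches)  # [<p class="title">T</p>]

# Attribute filter: the lambda sees the id value, or None if there is no id
links = soup.find_all(id=lambda value: value is not None and value.startswith("link"))
print(links)    # [<a class="sister" id="link1">A</a>]
```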

Example: scraping pictures of beautiful women


import os

import requests
from bs4 import BeautifulSoup

res = requests.get('https://pic.netbian.com/tupian/32518.html')
res.encoding = 'gbk'  # decode the page as GBK

soup = BeautifulSoup(res.text, 'html.parser')

ul = soup.find('ul', class_='clearfix')
img_list = ul.find_all(name='img', src=True)

os.makedirs('./img', exist_ok=True)  # make sure the output directory exists

for img in img_list:
    try:
        url = img.attrs.get('src')
        # Relative paths need the site prefix
        if not url.startswith('http'):
            url = 'https://pic.netbian.com' + url
        res1 = requests.get(url)
        name = url.split('-')[-1]
        with open('./img/%s' % name, 'wb') as f:
            for chunk in res1.iter_content(chunk_size=8192):
                f.write(chunk)
    except Exception as e:
        # Report the failure instead of silently hiding it
        print('skipped %s: %s' % (img, e))
        continue
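The URL handling in the loop can be tested without touching the network. The sketch below is an alternative to the string concatenation and split('-') used in the example, based on the standard library's urljoin and urlparse. The helper names absolute_url and file_name, and the sample src values, are hypothetical and not taken from the real page.

```python
import os
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://pic.netbian.com/tupian/32518.html"


def absolute_url(src, page_url=PAGE_URL):
    """Resolve a src attribute against the page it came from."""
    return urljoin(page_url, src)


def file_name(url):
    """Use the last path segment as the local file name."""
    return os.path.basename(urlparse(url).path)


# Hypothetical src values for illustration
print(absolute_url("/uploads/demo/example-01.jpg"))   # https://pic.netbian.com/uploads/demo/example-01.jpg
print(absolute_url("https://cdn.example.com/a.jpg"))  # https://cdn.example.com/a.jpg
print(file_name("https://pic.netbian.com/uploads/demo/example-01.jpg"))  # example-01.jpg
```

Unlike split('-')[-1], taking the last path segment keeps the whole file name, which may or may not be what you want for this site.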

Other bs4 features

Besides traversing and searching the document tree, bs4 can also modify XML.
- Java configuration files are often written in XML
- Common configuration file formats: .conf, .ini, .yaml, .xml
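As a rough sketch of what modifying a document looks like, the snippet below edits a made-up configuration fragment. It uses html.parser so that nothing extra needs to be installed; for real XML files the "xml" parser (which requires lxml) is the more usual choice, since html.parser lowercases tag names. The config content is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical configuration snippet
xml_doc = "<config><port>8080</port><host>localhost</host></config>"
soup = BeautifulSoup(xml_doc, "html.parser")

soup.port.string = "9090"      # replace the text of a tag
soup.host["scheme"] = "https"  # add or change an attribute

print(soup)  # <config><port>9090</port><host scheme="https">localhost</host></config>
```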
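Other find_all parameters
- limit=n: return at most n results; with limit=1 you get a single result (still inside a list)
- recursive: defaults to True; when set to False, only direct children are searched, not deeper descendants
Searching and traversing can be mixed freely; attributes and text are read from the results the same way as before.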
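The effect of limit and recursive is easiest to see side by side. This is a minimal sketch on a made-up fragment where one p tag is nested one level deeper than the others.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the third p is nested inside a section
html_doc = "<div><p>1</p><p>2</p><section><p>3</p></section></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

print([p.text for p in div.find_all("p")])                   # ['1', '2', '3']
print([p.text for p in div.find_all("p", limit=2)])          # ['1', '2']
print([p.text for p in div.find_all("p", recursive=False)])  # ['1', '2']

# Mixing traversal and search: dot access, then find, then dot access again
print(soup.div.find("section").p.text)                       # 3
```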

CSS selectors

ID selector: #id
Tag selector: tag name
Class selector: .class-name
Attribute selector: [attribute="value"]

Ones worth remembering

#id
.sister
head
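div>a: a tags that are direct children of a div
div a: a tags at any depth inside a div

Once you know how CSS selectors work, you can use them with most other parsing libraries as well.

Search: p=soup.select('css selector'), which returns a list of all matching tags

Reference: https://www.runoob.com/cssref/css-selectors.html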
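The selectors listed here can be tried on a small inline page. The page below is made up for illustration; select_one, which returns the first match or None, is shown alongside select.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page
html_doc = (
    '<html><head><title>Demo</title></head><body>'
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.select("#link2")[0].text)      # Lacie   (id selector)
print(len(soup.select(".sister")))        # 3       (class selector)
print(soup.select("head title")[0].text)  # Demo    (tag selectors, descendant)
print(len(soup.select("p > a")))          # 3       (a is a direct child of p)
print(len(soup.select("body > a")))       # 0       (a is not a direct child of body)
print(len(soup.select("body a")))         # 3       (a at any depth inside body)
print(soup.select_one('a[href$="tillie"]')["id"])  # link3  (attribute selector)
```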

Example


import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')
soup = BeautifulSoup(res.text, 'html.parser')

# From now on you can simply copy the selector (for example from the browser's developer tools)
p = soup.select('a[title="下载哔哩哔哩视频"]')[0].attrs.get('href')
print(p)