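Common BeautifulSoup operations

The BeautifulSoup library

0. Shared setup for the examples

from bs4 import BeautifulSoup
# demo01 to demo04 all take this document as their argument, so it is defined once here (the later demos define their own HTML instead)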
html_doc = """
            <html><head><title>The Dormouse's story</title></head>
            <body>
            <p class="title"><b>The Dormouse's story</b></p>

            <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.</p>

            <p class="story">...</p>
            """

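1. Basic usage

'''
Basic usage: demo01
'''
def demo01(html_doc):

    # Parse html_doc with the lxml parser, which also fills in the missing closing tags (here </body> and </html>)
    soup = BeautifulSoup(html_doc, "lxml")
    # prettify() returns the repaired tree as an indented string, one tag per line
    print(soup.prettify())
    # Print the text inside the <title> tag
    # Result: The Dormouse's story
    print(soup.title.string)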
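
As a small illustration of the tag repair described above, the sketch below parses a deliberately truncated fragment. The fragment and the variable names are made up for illustration. It uses the built-in html.parser instead of lxml so that it runs without extra packages; html.parser closes open tags but, unlike lxml, does not wrap the result in <html> and <body>.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <b> nor <p> is ever closed.
broken = "<p class='title'><b>Story"

soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)          # the parser appends the missing </b> and </p>
title_text = soup.b.string    # text of the first <b> tag

print(repaired)
print(title_text)
```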

2. Node selectors

'''
Node selectors: demo02
'''
def demo02(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    # Select the <title> tag in html_doc
    # Result: <title>The Dormouse's story</title>
    print(soup.title)
    # Check its type
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.title))
    # Result: The Dormouse's story
    print(soup.title.string)
    # Result: <head><title>The Dormouse's story</title></head>
    print(soup.head)
    # Result: <p class="title"><b>The Dormouse's story</b></p>
    print(soup.p)
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.p))
    # Result: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> (when several tags match, only the first is returned)
    print(soup.a)
3. Extracting node information

'''
Extracting node information: demo03
'''
def demo03(html_doc):
    soup = BeautifulSoup(html_doc, "lxml")
    # soup.a is the first <a> tag:
    # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    tag = soup.a
    # 1. Get the tag name
    # Result: a
    print(tag.name)

    # 2. Get attribute values (class can hold several values, so it comes back as a list)
    # Result:
    #     class value:  ['sister']
    #     href value:  http://example.com/elsie
    print("class value: ", tag.attrs["class"])
    print("href value: ", tag.attrs["href"])

    # 3. Get the text content
    # Result: Elsie
    print(tag.string)

4. Getting child node information

'''
Getting child node information: demo04
'''
def demo04(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    # 1. First get the <head> tag
    # Result: <head><title>The Dormouse's story</title></head>
    print(soup.head)
    # 2. Then get the <title> tag inside <head>
    # Result: <title>The Dormouse's story</title>
    print(soup.head.title)
    # 3. Get the text of the <title> inside <head>
    # Result: The Dormouse's story
    print(soup.head.title.string)

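5. Navigating between related nodes

1. Getting child nodes: contents

'''
Related-node selection, demo05 (downward, part 1)
Use the contents attribute to get a node's direct children.
Background:
    A single step often cannot reach the node you want. In that case, select some node first,
    then use it as the starting point to reach its children, parent, siblings, and so on.
'''
def demo05():
    # Note: the first <p> tag is written on a single line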
    html_doc01 = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>


    <p class="story">...</p>
    """
    # Unlike html_doc01, the closing </p> of the first <p> is on its own line
    html_doc02 = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b>
    </p>


    <p class="story">...</p>
    """

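    # 1. Get a node's direct children (not deeper descendants) with the contents attribute
    soup01 = BeautifulSoup(html_doc01, "lxml")
    # Result: [<b>The Dormouse's story</b>]
    print(soup01.p.contents)

    soup02 = BeautifulSoup(html_doc02, "lxml")
    # The line break before </p> shows up as an extra text node (the newline plus the indentation)
    # Result: [<b>The Dormouse's story</b>, '\n    ']
    print(soup02.p.contents)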

2. Getting child nodes: children

'''
Related-node selection, demo06 (downward, part 2)
Use the children attribute to iterate over a node's direct children.
'''
def demo06():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
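    # children is a one-shot iterator, not a list; the exact type and address vary by version and run
    # Result: <list_iterator object at 0x000002B35915BFA0>
    print(soup.p.children)
    # Result: [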
    #           '\n        Once upon a time there were three little sisters; and their names were\n        ',
    #           <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #           ',\n        ',
    #           <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #           ' and\n        ',
    #           <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
    #           ';\n        and they lived at the bottom of a well.\n    '
    #       ]
    print(list(soup.p.children))
    for item in soup.p.children:
        print(item)
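
Because children is an iterator that mixes text nodes with tags, two things are easy to trip over: it can only be consumed once, and most code wants the tags only. A minimal sketch of both points follows, assuming a shortened version of the story paragraph and the built-in html.parser (chosen here only so the example has no lxml dependency).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="story">Once <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>; end.</p>'
)
soup = BeautifulSoup(html, "html.parser")

children = soup.p.children
first_pass = list(children)    # 4 text nodes and 3 <a> tags, in document order
second_pass = list(children)   # empty: the iterator is already exhausted

# Keep only real tags and skip the text nodes between them.
link_ids = [child["id"] for child in soup.p.children if isinstance(child, Tag)]

print(len(first_pass), second_pass, link_ids)
```

Accessing soup.p.children again, as the last statement does, produces a fresh iterator, so storing the iterator in a variable is what causes the "empty second loop" surprise.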

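3. Getting descendant nodes: descendants

'''
Related-node selection, demo07 (downward, part 3)
Use the descendants attribute to walk every node below the current one (children, grandchildren, and so on).
'''
def demo07():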
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span>Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Result: <generator object Tag.descendants at 0x000001C0E79DCC10>
    print(soup.p.descendants)
    # Result: [
    #           'Once upon a time there were three little sisters; and their names were\n    ',
    #           <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span>Elsie</a>,
    #           <span>Elsie</span>,
    #           'Elsie',
    #           'Elsie',
    #           ',\n    ',
    #           <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #           'Lacie',
    #           ' and\n    ',
    #           <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
    #           'Tillie',
    #           ';\n    and they lived at the bottom of a well.'
    #       ]
    print(list(soup.p.descendants))
    # for item in soup.p.descendants:
    #     print(item)
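
To make the difference between children and descendants concrete, here is a small sketch on a reduced paragraph (the HTML and variable names are illustrative, and html.parser is used only to avoid the lxml dependency). It counts direct children, then splits the descendants into tags and text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p>intro <a id="link1"><span>Elsie</span>Elsie</a> outro</p>'
soup = BeautifulSoup(html, "html.parser")

# Direct children only: 'intro ', the <a> tag, ' outro'.
child_count = len(list(soup.p.children))

# descendants walks depth-first, so <a> is followed by its own contents.
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]
descendant_texts = [str(d) for d in soup.p.descendants if isinstance(d, NavigableString)]

print(child_count, descendant_tags, descendant_texts)
```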

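4. Getting the parent node (parent) and ancestor nodes (parents)

'''
Related-node selection, demo08 (upward)
Use the parent attribute to get the direct parent.
Use the parents attribute to iterate over all ancestors.
'''
def demo08():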
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <p>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    </p>
    </p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
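    # Prints the whole <body> element: the parent of the first <p>, with all of its children and descendants
    print(soup.p.parent)
    # Prints the <p> that contains the first <a>, including that <a> and its text
    print(soup.a.parent)

    print("=======================")
    # Result: <generator object PageElement.parents at 0x000001403E6ECC10>
    print(soup.a.parents)
    # Walk upward from the first <a>: its parent first, then each further ancestor
    for i, parent in enumerate(soup.a.parents):
        print(i, parent)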
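
Printing each ancestor in full produces a lot of repeated output, so it can be easier to look at the ancestor names only. The sketch below does that on a minimal document (illustrative HTML, parsed with html.parser to avoid the lxml dependency). The last entry is the BeautifulSoup object itself, whose name is the special string "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

direct_parent = soup.a.parent.name                       # nearest enclosing tag
ancestor_names = [node.name for node in soup.a.parents]  # nearest first

print(direct_parent)
print(ancestor_names)
```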

5. Getting sibling nodes

'''
Related-node selection, demo09 (siblings)
Available attributes:
    1. next_sibling
    2. previous_sibling
    3. next_siblings
    4. previous_siblings
'''
def demo09():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    <a href="http://example.com/a" class="sister" id="link3">a</a>
    <a href="http://example.com/b" class="sister" id="link3">b</a>
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
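    # 1. next_sibling: the node right after the first <a>, which here is a text node
    # Result: hello (followed by the newline and indentation that belong to the same text node)
    print(soup.a.next_sibling)

    # 2. next_siblings: a generator over every later sibling
    # Result: <generator object PageElement.next_siblings at 0x00000241CA26CC10>
    print(soup.a.next_siblings)
    # print(list(soup.a.next_siblings))

    # 3. previous_sibling: the node right before the first <a>, again a text node
    # Result: Once upon a time there were three little sisters; and their names were
    print(soup.a.previous_sibling)

    # 4. previous_siblings: a generator over every earlier sibling
    # Result: <generator object PageElement.previous_siblings at 0x000001F4E6E6CBA0>
    print(soup.a.previous_siblings)
    # print(list(soup.a.previous_siblings))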
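
Sibling attributes return whatever node comes next, and in real HTML that is usually a text node rather than a tag. A minimal sketch of how to skip the text nodes, assuming a one-line version of the paragraph above and html.parser (used only so the example needs no lxml):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p>Once <a id="link1">Elsie</a>hello '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling   # the text node 'hello ', not the next <a>

# Filter the generator down to tags to get the following <a> elements.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# The last <a> has nothing after it inside <p>, so next_sibling is None.
last = soup.find_all("a")[-1]
after_last = last.next_sibling

print(repr(raw_next), following_ids, after_last)
```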

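6. Method selectors

1. find_all()

'''
Method selector: find_all() returns every match as a list
find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
# 1. name: tag name to look for
# 2. attrs: dictionary of attribute filters
# 3. recursive: search all descendants (True, the default) or only direct children (False)
# 4. string: match on text content
# 5. limit: maximum number of results to return
'''
def demo10():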
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. [Basic use] Find all <a> tags
    # Result: [
    #           <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #           <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #           <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    #      ]
    print(soup.find_all("a"))
    # for item in soup.find_all("a"):
    #     print(item.string)

    # 2. [Attribute search] Filter by an attribute dictionary; here, elements whose class is sister
    print(soup.find_all(attrs={"class": "sister"}))
    # Same result, using the class_ keyword (class is a reserved word in Python)
    print(soup.find_all(class_ = "sister"))
    # Result: [] because no element in this html_doc has the class "hi";
    # it would match the first <a> if that tag were written with class="sister hi"
    print(soup.find_all(class_ = "hi"))

    # 3. [Text search] Find text nodes that are exactly "Elsie"
    # Result: ['Elsie']
    print(soup.find_all(string="Elsie"))
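
The remaining find_all() parameters are easier to follow with a runnable example. The sketch below is illustrative only: it assumes a variant of the paragraph in which the first link carries two classes ("sister hi"), and it uses html.parser so that it runs without lxml. It shows class matching on one of several classes, limit, recursive=False, and a regular expression passed as string.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister hi" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class of a multi-class element.
hi_ids = [a["id"] for a in soup.find_all(class_="hi")]
sister_ids = [a["id"] for a in soup.find_all(class_="sister")]

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# recursive=False looks at direct children only: the <a> tags are
# children of <p>, not of the document itself.
top_level_links = soup.find_all("a", recursive=False)
direct_links = soup.p.find_all("a", recursive=False)

# string also accepts a compiled regular expression.
l_names = soup.find_all(string=re.compile("^L"))

print(hi_ids, sister_ids, len(first_two), len(top_level_links), len(direct_links), l_names)
```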

2. find()

'''
Method selector: find() returns a single element, the first match, or None when nothing matches
'''
def demo11():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>,
    <a href="http://example.com/lacie" class="sister" id="link2"><span>Lacie</span></a> and
    <a href="http://example.com/tillie" class="sister" id="link3"><span>Tillie</span></a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Result: <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
    print(soup.find("a"))
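
Since find() returns None when nothing matches, chaining an attribute access onto its result can raise AttributeError. A minimal sketch of a guarded lookup follows; the HTML is illustrative, the <table> lookup is a deliberately missing tag, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a id="link1"><span>Elsie</span></a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")
same_link = soup.find_all("a", limit=1)[0]   # find() behaves like find_all(limit=1)[0]

missing = soup.find("table")                 # no <table> in this document

# Guard before using the result, instead of calling missing.get_text() directly.
missing_text = missing.get_text() if missing is not None else ""

print(first_link["id"], missing, repr(missing_text))
```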

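3. Other method selectors

'''
Other method selectors
find_parents(): returns all matching ancestor nodes
find_parent(): returns the closest matching ancestor (the direct parent, when called without a filter)
find_next_siblings(): returns all matching siblings after the current node
find_previous_siblings(): returns all matching siblings before the current node

find_next_sibling(): returns the first matching sibling after the current node
find_previous_sibling(): returns the first matching sibling before the current node
'''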
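
The methods listed above accept the same filters as find_all(). As a rough sketch of how they behave, the example below starts from the third of four links; the HTML, ids, and variable names are made up for illustration, and html.parser is used so that no lxml install is needed.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="box"><p id="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a>, '
    '<a id="link3">Tillie</a> and <a id="link4">Ann</a>'
    '</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
current = soup.find("a", id="link3")

parent_id = current.find_parent("p")["id"]
div_ancestor_ids = [t["id"] for t in current.find_parents("div")]

next_id = current.find_next_sibling("a")["id"]
prev_id = current.find_previous_sibling("a")["id"]

# The plural forms return every match; previous siblings come nearest first.
all_next_ids = [t["id"] for t in current.find_next_siblings("a")]
all_prev_ids = [t["id"] for t in current.find_previous_siblings("a")]

print(parent_id, div_ancestor_ids, next_id, prev_id, all_next_ids, all_prev_ids)
```

Unlike the plain next_sibling and previous_sibling attributes, these methods skip the text nodes between the links because the "a" filter only matches tags.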

7. CSS selectors: select()

'''
CSS selectors: the select() method
'''
def demo12():
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>   
        </div>

        <div class="panel-body">
            <ul class="list" id="list-1">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>

            <ul class="list list-samll" id="list-2">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>
        </div>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. Get the nodes whose class is panel-heading
    # Result: [<div class="panel-heading">
    # <h4>Hello World</h4>
    # </div>]
    print(soup.select(".panel-heading"))

    # 2. Get the <li> nodes under any <ul>
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    print(soup.select("ul li"))

    # 3. Get the <li> nodes under the element whose id is list-2
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    print(soup.select("#list-2 li"))

    # 4. Get all <ul> nodes
    # Result: [<ul class="list" id="list-1">
    # <li class="element">Foo</li>
    # <li class="element">Bar</li>
    # <li class="element">Jay</li>
    # </ul>, <ul class="list list-samll" id="list-2">
    # <li class="element">Foo</li>
    # <li class="element">Bar</li>
    # <li class="element">Jay</li>
    # </ul>]
    print(soup.select("ul"))
    # Each item in the returned list is a Tag
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.select('ul')[0]))

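Notes:

# 1. Select all descendants
When the selector passed to select() names several nodes separated by spaces, it matches descendants.
For example, soup.select("div p") finds every <p> at any depth inside a <div>.

# 2. Select direct children only, not deeper descendants
Separate the nodes with ">" to match direct children only. Spaces around ">" are optional in standard CSS; older Beautiful Soup releases that predate the soupsieve backend required them, so " > " is the safe form.
For example, soup.select("div > p") finds every <p> that is a direct child of a <div>, and skips <p> nodes nested deeper.

# 3. Select all later siblings of a given kind
Joining two nodes with "~" matches every sibling of the second kind that comes after the first node at the same level (the same remark about spaces applies).
For example, soup.select("div ~ p") finds all <p> siblings that come after a <div>.

# 4. Select the sibling immediately after a node
Joining two nodes with "+" matches the second node only when it is the very next sibling element of the first (the same remark about spaces applies).
For example, soup.select("div + p") finds each <p> that directly follows a <div> at the same level; a <p> separated from the <div> by another element is not matched.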
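
The four combinators can be compared side by side on one small document. This is a sketch under stated assumptions: the HTML, the ids, and the ids() helper are invented for illustration, html.parser is used to avoid the lxml dependency, and a Beautiful Soup release with the soupsieve backend is assumed (needed for the "div>p" form without spaces).

```python
from bs4 import BeautifulSoup

html = """
<body>
<div id="box">
  <p id="child">direct child</p>
  <section><p id="grandchild">grandchild</p></section>
</div>
<span>separator</span>
<p id="later-1">first later sibling</p>
<p id="later-2">second later sibling</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Hypothetical helper: return the id of every element the selector matches."""
    return [tag["id"] for tag in soup.select(selector)]


descendants = ids("div p")          # any depth inside <div>
children = ids("div > p")           # direct children of <div> only
children_no_space = ids("div>p")    # same selector written without spaces
later_siblings = ids("div ~ p")     # every later <p> sibling of <div>
adjacent_to_div = ids("div + p")    # empty: <span> sits between <div> and <p>
adjacent_to_span = ids("span + p")  # the <p> right after <span>

print(descendants, children, later_siblings, adjacent_to_div, adjacent_to_span)
```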

8. Nested selection: select()

'''
Nested selection: calling select() on a tag returned by an earlier select()
'''
def demo13():
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>   
        </div>

        <div class="panel-body">
            <ul class="list" id="list-1">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>

            <ul class="list list-samll" id="list-2">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>
        </div>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, 'lxml')
    # Output, one list per <ul>:
    # [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    # [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    for ul in soup.select('ul'):
        print(ul.select('li'))
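
Nested selection is handy for grouping results by their container. The sketch below builds a dictionary from each list's id to its item texts, and also shows select_one(), which returns the first match or None. The shortened HTML and the variable names are illustrative; html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Outer select() picks the containers; inner select() searches inside each one.
items_by_list = {
    ul["id"]: [li.get_text() for li in ul.select("li")]
    for ul in soup.select("ul")
}

first_in_second = soup.select_one("#list-2 li")   # first match only
no_match = soup.select_one("ol")                  # None: there is no <ol>

print(items_by_list, first_in_second.get_text(), no_match)
```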

9. Getting attributes

'''
Getting attributes (two equivalent ways: tag['name'] and tag.attrs['name'])
'''
def demo14():
    html_doc = """
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello World</h4>   
            </div>

            <div class="panel-body">
                <ul class="list" id="list-1">
                   <li class="element">Foo</li>
                   <li class="element">Bar</li>
                   <li class="element">Jay</li>
                </ul>

                <ul class="list list-samll" id="list-2">
                   <li class="element">Foo</li>
                   <li class="element">Bar</li>
                   <li class="element">Jay</li>
                </ul>
            </div>
            </div>
        </div>
        """
    soup = BeautifulSoup(html_doc, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])
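
Both forms above raise KeyError for an attribute the tag does not have, and class behaves differently from id because it can hold several values. A minimal sketch of these cases, using an illustrative one-line <ul> and html.parser (chosen only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

html = '<ul class="list list-small" id="list-2"><li>Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

list_id = ul["id"]                        # single-valued attribute: a string
classes = ul["class"]                     # multi-valued attribute: a list
class_text = " ".join(classes)            # back to the original attribute text
id_as_list = ul.get_attribute_list("id")  # always a list, whatever the attribute

title = ul.get("title", "n/a")            # default instead of KeyError

try:
    ul["title"]
    raised = False
except KeyError:
    raised = True

print(list_id, classes, class_text, id_as_list, title, raised)
```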

10. Getting text

'''
Getting text (two ways: the string attribute and the get_text() method)
'''
def demo15():
    html_doc = """
           <div class="panel">
               <div class="panel-heading">
                   <h4>Hello World</h4>   
               </div>

               <div class="panel-body">
                   <ul class="list" id="list-1">
                      <li class="element">Foo</li>
                      <li class="element">Bar</li>
                      <li class="element">Jay</li>
                   </ul>

                   <ul class="list list-samll" id="list-2">
                      <li class="element">Foo</li>
                      <li class="element">Bar</li>
                      <li class="element">Jay</li>
                   </ul>
               </div>
               </div>
           </div>
           """
    soup = BeautifulSoup(html_doc, 'lxml')
    for li in soup.select('li'):
        print('String:', li.string)
        print('get text:', li.get_text())
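
The two ways agree for the simple <li> tags above, but they differ once a tag has more than one child: string then returns None, while get_text() concatenates all nested text. A short sketch of that difference, with illustrative HTML and html.parser (used only so the example needs no lxml):

```python
from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element"><b>Bar</b> and <i>Baz</i></li>'
    '<li class="element">  Jay  </li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
items = soup.select("li")

simple = items[0].string                      # one text child: its text
mixed_string = items[1].string                # several children: None
mixed_text = items[1].get_text()              # all nested text joined together
stripped = items[2].get_text(strip=True)      # surrounding whitespace removed

# separator and strip together give one delimiter per text node.
joined = soup.ul.get_text(separator="|", strip=True)

print(simple, mixed_string, mixed_text, stripped, joined)
```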
