Common Operations of the BeautifulSoup Library

0. The sample HTML document shared by the examples

python
from bs4 import BeautifulSoup
# The first few demos all take this document as their argument, so it is defined once here (the later demos define their own HTML text and do not use it)
html_doc = """
            <html><head><title>The Dormouse's story</title></head>
            <body>
            <p class="title"><b>The Dormouse's story</b></p>

            <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.</p>

            <p class="story">...</p>
            """

1. Basic usage

python
'''
Basic usage -- demo01
'''
def demo01(html_doc):

    # Parse html_doc with the lxml parser, which also completes any missing tags
    soup = BeautifulSoup(html_doc, "lxml")
    # prettify() returns the parsed document as a neatly indented string
    print(soup.prettify())
    # Inspect part of the parsed result: the text inside the <title> tag
    print(soup.title.string)

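A note on the parser argument: "lxml" requires the third-party lxml package. If it is not installed, the standard-library "html.parser" can be used instead; the sketch below (the function name demo01_fallback is only illustrative) shows one way to fall back, assuming html_doc is the shared document from section 0.

python
def demo01_fallback(html_doc):
    # Prefer lxml, but fall back to Python's built-in parser if lxml is missing
    try:
        soup = BeautifulSoup(html_doc, "lxml")
    except Exception:
        # html.parser ships with Python and needs no extra installation
        soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.title.string)
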
2. Node selectors

python
'''
Node selectors -- demo02
'''
def demo02(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    # Select the <title> tag from html_doc
    # Result: <title>The Dormouse's story</title>
    print(soup.title)
    # Check the corresponding type
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.title))
    # Result: The Dormouse's story
    print(soup.title.string)
    # Result: <head><title>The Dormouse's story</title></head>
    print(soup.head)
    # Result: <p class="title"><b>The Dormouse's story</b></p>
    print(soup.p)
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.p))
    # Result: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> [by default the first match is returned]
    print(soup.a)

3. Extracting node information

python
'''
Extracting node information -- demo03
'''
def demo03(html_doc):
    soup = BeautifulSoup(html_doc, "lxml")
    # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    tag = soup.a
    # 1. Get the tag name
    # Result: a
    print(tag.name)

    # 2. Get attribute values
    # Result:
    #     class value:  ['sister']
    #     href value:  http://example.com/elsie
    print("class value: ", tag.attrs["class"])
    print("href value: ", tag.attrs["href"])

    # 3. Get the text content
    # Result: Elsie
    print(tag.string)

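Besides tag.attrs, attribute values can also be read with dictionary-style indexing or with get(), which returns None instead of raising KeyError when the attribute is missing. A small sketch (the function name demo03_attrs is only illustrative), assuming the same shared html_doc:

python
def demo03_attrs(html_doc):
    soup = BeautifulSoup(html_doc, "lxml")
    tag = soup.a
    # Dictionary-style indexing is equivalent to tag.attrs[...]
    print(tag["href"])       # http://example.com/elsie
    # get() returns None (or a supplied default) if the attribute does not exist
    print(tag.get("id"))     # link1
    print(tag.get("title"))  # None
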
4. Getting child node information

python
'''
Getting child node information -- demo04
'''
def demo04(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    # 1. First get the <head> tag and its contents
    # Result: <head><title>The Dormouse's story</title></head>
    print(soup.head)
    # 2. Then get the <title> tag inside <head>
    # Result: <title>The Dormouse's story</title>
    print(soup.head.title)
    # 3. Get the text under <title> inside <head>
    # Result: The Dormouse's story
    print(soup.head.title.string)

5. Relational selection

1. Getting child nodes -- contents

python
'''
Relational selection demo05 -- 01 -- descendant nodes
Use the contents attribute -- gets the direct child nodes
Note:
    A single selection step often cannot reach the node you want. You first select
    some node, and then, starting from it, select its child, parent or sibling nodes.
'''
def demo05():
    # Note: the first <p> tag below is closed on the same line (no line break before </p>)
    html_doc01 = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>


    <p class="story">...</p>
    """
    # Note: the only difference from html_doc01 is that the first <p> tag is closed on a new line
    html_doc02 = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b>
    </p>


    <p class="story">...</p>
    """

    # 1. Get the direct child nodes of a node -- the contents attribute
    soup01 = BeautifulSoup(html_doc01, "lxml")
    # Result: [<b>The Dormouse's story</b>]
    print(soup01.p.contents)

    soup02 = BeautifulSoup(html_doc02, "lxml")
    # Note: this result contains one extra newline child
    # Result: [<b>The Dormouse's story</b>, '\n']
    print(soup02.p.contents)

2. Getting child nodes -- children

python
'''
Relational selection demo06 -- 02 -- descendant nodes
Use the children attribute -- gets the direct child nodes (as an iterator)
'''
def demo06():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Result: <list_iterator object at 0x000002B35915BFA0>  (children is an iterator, not a list)
    print(soup.p.children)
    # Result: [
    #           '\n        Once upon a time there were three little sisters; and their names were\n        ',
    #           <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #           ',\n        ',
    #           <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #           ' and\n        ',
    #           <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
    #           ';\n        and they lived at the bottom of a well.\n    '
    #       ]
    print(list(soup.p.children))
    for item in soup.p.children:
        print(item)

3. Getting descendant nodes -- descendants

python
'''
Relational selection demo07 -- 03 -- descendant nodes
Use the descendants attribute -- gets all descendant nodes (children, grandchildren, and so on)
'''
def demo07():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span>Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Result: <generator object Tag.descendants at 0x000001C0E79DCC10>
    print(soup.p.descendants)
    # Result: [
    #           'Once upon a time there were three little sisters; and their names were\n    ',
    #           <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span>Elsie</a>,
    #           <span>Elsie</span>,
    #           'Elsie',
    #           'Elsie',
    #           ',\n    ',
    #           <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #           'Lacie',
    #           ' and\n    ',
    #           <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
    #           'Tillie',
    #           ';\n    and they lived at the bottom of a well.'
    #       ]
    print(list(soup.p.descendants))
    # for item in soup.p.descendants:
    #     print(item)

4. Getting the parent node -- parent, and ancestor nodes -- parents

python
'''
Relational selection demo08 -- 01 -- ancestor nodes
Use the parent attribute -- gets the parent node
Use the parents attribute -- gets all ancestor nodes
'''
def demo08():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <p>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    </p>
    </p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Prints the parent of the first <p>: the <body> tag with everything inside it,
    # including the child <p> tags and the grandchild <a> tags
    print(soup.p.parent)
    # Prints the parent <p> of the first <a> tag, including that <a> tag and its text
    print(soup.a.parent)

    print("=======================")
    # Result: <generator object PageElement.parents at 0x000001403E6ECC10>
    print(soup.a.parents)
    for i, parent in enumerate(soup.a.parents):
        print(i, parent)

5. Getting sibling nodes

python
'''
Relational selection demo09 -- sibling nodes
# Available attributes:
    1. next_sibling
    2. previous_sibling
    3. next_siblings
    4. previous_siblings
'''
def demo09():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>


    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    <a href="http://example.com/a" class="sister" id="link3">a</a>
    <a href="http://example.com/b" class="sister" id="link3">b</a>
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. Using next_sibling
    # Result: hello
    print(soup.a.next_sibling)

    # 2. Using next_siblings
    # Result: <generator object PageElement.next_siblings at 0x00000241CA26CC10>
    print(soup.a.next_siblings)
    # print(list(soup.a.next_siblings))

    # 3. Using previous_sibling
    # Result: Once upon a time there were three little sisters; and their names were
    print(soup.a.previous_sibling)

    # 4. Using previous_siblings
    # Result: <generator object PageElement.previous_siblings at 0x000001F4E6E6CBA0>
    print(soup.a.previous_siblings)
    # print(list(soup.a.previous_siblings))

6. Method selectors

1. find_all()

python
'''
Method selectors -- find_all() -- returns all matching elements as a list
find_all(name, attrs={}, recursive=True, string, limit)
# 1. name: tag name to search for
# 2. attrs: dictionary of attribute filters
# 3. recursive: whether to search the element's descendants recursively, True by default
# 4. string: search by text content
# 5. limit: maximum number of results to return
'''
def demo10():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. [Basic usage] find all <a> tags
    # Result: [
    #           <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #           <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #           <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    #      ]
    print(soup.find_all("a"))
    # for item in soup.find_all("a"):
    #     print(item.string)

    # 2. [Attribute search] find elements matching an attribute dictionary; here, class equal to "sister"
    print(soup.find_all(attrs={"class": "sister"}))
    # Same effect as above
    print(soup.find_all(class_="sister"))
    # Returns [] here: no element in this html_doc has "hi" among its class values.
    # It would match if, for example, link1 were written with class="sister hi",
    # because class is a multi-valued attribute and class_ matches any one of its values.
    print(soup.find_all(class_="hi"))

    # 3. [Text search] find text nodes whose content is exactly "Elsie"
    print(soup.find_all(string="Elsie"))

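The recursive and limit parameters listed above are not exercised in demo10; the sketch below (demo10_extra is an illustrative name, reusing the same html_doc) shows roughly how they behave.

python
def demo10_extra(html_doc):
    soup = BeautifulSoup(html_doc, "lxml")
    # limit=2: stop after the first two matching <a> tags
    print(soup.find_all("a", limit=2))
    # recursive=False: only the direct children of <body> are considered,
    # so the <a> tags nested inside <p> are not returned
    print(soup.body.find_all("a", recursive=False))  # []
    # attribute keyword filters can be combined with the tag name, e.g. by id
    print(soup.find_all("a", id="link3"))
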
2. find()

python
'''
Method selectors -- find() -- returns a single element [the first match]
'''
def demo11():
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="Dormouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>,
    <a href="http://example.com/lacie" class="sister" id="link2"><span>Lacie</span></a> and
    <a href="http://example.com/tillie" class="sister" id="link3"><span>Tillie</span></a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # Result: <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
    print(soup.find("a"))

3. Other method selectors

python
'''
Other method selectors
find_parents(): returns all ancestor nodes
find_parent(): returns the parent node of the current node
find_next_siblings(): returns all siblings that follow the current node
find_next_sibling(): returns the first sibling that follows the current node

find_previous_siblings(): returns all siblings that precede the current node
find_previous_sibling(): returns the first sibling that precedes the current node
'''

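None of these methods is demonstrated above; the following sketch (demo_other_finders is an illustrative name) shows one way they might be called against the shared html_doc from section 0.

python
def demo_other_finders(html_doc):
    soup = BeautifulSoup(html_doc, "lxml")
    first_a = soup.find("a", id="link1")
    # The nearest <p> ancestor of the first <a> is <p class="story">
    print(first_a.find_parent("p"))
    # All <a> siblings that follow link1: link2 and link3
    print(first_a.find_next_siblings("a"))
    # The first <a> sibling after link2 is link3
    print(soup.find("a", id="link2").find_next_sibling("a"))
    # The first <a> sibling before link2 is link1
    print(soup.find("a", id="link2").find_previous_sibling("a"))
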
7. CSS selectors -- select()

python
'''
CSS selectors -- the select() method
'''
def demo12():
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>   
        </div>

        <div class="panel-body">
            <ul class="list" id="list-1">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>

            <ul class="list list-samll" id="list-2">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>
        </div>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, "lxml")
    # 1. Get the nodes whose class is panel-heading
    # Result: [<div class="panel-heading">
    # <h4>Hello World</h4>
    # </div>]
    print(soup.select(".panel-heading"))

    # 2. Get the <li> nodes under <ul> tags
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    print(soup.select("ul li"))

    # 3. Get the <li> nodes under the element whose id is list-2
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    print(soup.select("#list-2 li"))

    # 4. Get all <ul> nodes
    # Result: [<ul class="list" id="list-1">
    # <li class="element">Foo</li>
    # <li class="element">Bar</li>
    # <li class="element">Jay</li>
    # </ul>, <ul class="list list-samll" id="list-2">
    # <li class="element">Foo</li>
    # <li class="element">Bar</li>
    # <li class="element">Jay</li>
    # </ul>]
    print(soup.select("ul"))
    # Result: <class 'bs4.element.Tag'>
    print(type(soup.select('ul')[0]))

Notes:

bash
# 1. Selecting all descendant nodes
When the css argument of select(css) contains several node names separated by spaces,
descendant nodes are selected. For example, soup.select("div p") finds all <p> nodes
that are descendants of any <div> node.

# 2. Selecting only direct children, not grandchildren
Separating node names with " > " (usually written with a space on each side of >) selects direct children only.
For example, soup.select("div > p") finds all <p> nodes that are direct children of a <div>, excluding grandchildren.

# 3. Selecting same-level sibling nodes that follow a node
Connecting two node names with " ~ " (usually with a space on each side of ~) selects all following siblings at the same level.
For example, soup.select("div ~ p") finds all <p> siblings that come after a <div> at the same level.

# 4. Selecting the first sibling of a certain type after a node
Connecting two node names with " + " (usually with a space on each side of +) selects the sibling that immediately follows.
For example, soup.select("div + p") finds the <p> sibling immediately following a <div>.

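A small sketch of the four combinators against the panel document defined in demo12 above (demo_css_combinators is an illustrative name; select() supports these combinators in BeautifulSoup 4.7+, which delegates CSS matching to soupsieve):

python
def demo_css_combinators(html_doc):
    # html_doc is assumed to be the same panel document defined in demo12
    soup = BeautifulSoup(html_doc, "lxml")
    # Descendant: every <li> anywhere under .panel-body
    print(len(soup.select(".panel-body li")))    # 6
    # Direct child: <ul> elements that are direct children of .panel-body
    print(len(soup.select(".panel-body > ul")))  # 2
    # General sibling: every <ul> after #list-1 at the same level
    print(soup.select("#list-1 ~ ul"))           # [<ul ... id="list-2">...]
    # Adjacent sibling: the <ul> immediately following #list-1
    print(soup.select("#list-1 + ul"))           # [<ul ... id="list-2">...]
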
8. Nested selection -- select()

python
'''
Nested selection -- the select() method
'''
def demo13():
    html_doc = """
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello World</h4>   
        </div>

        <div class="panel-body">
            <ul class="list" id="list-1">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>

            <ul class="list list-samll" id="list-2">
               <li class="element">Foo</li>
               <li class="element">Bar</li>
               <li class="element">Jay</li>
            </ul>
        </div>
        </div>
    </div>
    """
    soup = BeautifulSoup(html_doc, 'lxml')
    # Result: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    # [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    for ul in soup.select('ul'):
        print(ul.select('li'))

9. Getting attributes

python
'''
Getting attributes (two ways)
'''
def demo14():
    html_doc = """
        <div class="panel">
            <div class="panel-heading">
                <h4>Hello World</h4>   
            </div>

            <div class="panel-body">
                <ul class="list" id="list-1">
                   <li class="element">Foo</li>
                   <li class="element">Bar</li>
                   <li class="element">Jay</li>
                </ul>

                <ul class="list list-samll" id="list-2">
                   <li class="element">Foo</li>
                   <li class="element">Bar</li>
                   <li class="element">Jay</li>
                </ul>
            </div>
            </div>
        </div>
        """
    soup = BeautifulSoup(html_doc, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])

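One detail worth noting when reading attributes this way: class is a multi-valued attribute, so indexing it returns a list rather than a string, while attributes such as id return plain strings. A small sketch (demo14_class is an illustrative name) against the same document:

python
def demo14_class(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    for ul in soup.select('ul'):
        # id is a plain string, e.g. 'list-1'
        print(ul['id'])
        # class is multi-valued, so this is a list,
        # e.g. ['list'] for #list-1 and ['list', 'list-samll'] for #list-2
        print(ul['class'])
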
10. Getting text

python
'''
Getting text (two ways)
'''
def demo15():
    html_doc = """
           <div class="panel">
               <div class="panel-heading">
                   <h4>Hello World</h4>   
               </div>

               <div class="panel-body">
                   <ul class="list" id="list-1">
                      <li class="element">Foo</li>
                      <li class="element">Bar</li>
                      <li class="element">Jay</li>
                   </ul>

                   <ul class="list list-samll" id="list-2">
                      <li class="element">Foo</li>
                      <li class="element">Bar</li>
                      <li class="element">Jay</li>
                   </ul>
               </div>
               </div>
           </div>
           """
    soup = BeautifulSoup(html_doc, 'lxml')
    for li in soup.select('li'):
        print('String:', li.string)
        print('get text:', li.get_text())

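The two calls are not always interchangeable: string returns None as soon as a tag has more than one child, while get_text() concatenates the text of all descendants. A small sketch of the difference (demo15_diff is an illustrative name), assuming the shared html_doc from section 0:

python
def demo15_diff(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    p = soup.find('p', class_='story')
    # The <p> has several children (text nodes and <a> tags), so .string is None
    print(p.string)                    # None
    # get_text() joins the text of all descendants into one string
    print(p.get_text())
    # An optional separator and strip=True make the output tidier
    print(p.get_text(' ', strip=True))
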
