Python - A Handy Web Scraping Tool - Common BeautifulSoup4 APIs

Table of Contents

- Preface
- BeautifulSoup4 Overview
  - Key features
  - Installation
- Common APIs
  - 1. Creating a BeautifulSoup object
  - 2. Finding tags
    - find(): returns the first matching element
    - find_all(): returns a list of all matching elements
    - select_one() & select(): CSS selectors
  - 3. Accessing tag content
    - text attribute: get the plain text inside a tag
    - get_text(): also retrieves the text
    - attrs attribute: get all attributes of a tag
    - [attribute]: access a single attribute value directly
  - 4. Modifying the document
  - 5. Navigating the tree
    - parent: the parent node
    - children: iterator over child nodes
    - siblings: sibling nodes
- Practical tips (key points)
- Closing remarks

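Preface

In the long river of time, every drop of water is one of yesterday's stars, reflecting a today that never repeats.

BeautifulSoup4 Overview

BeautifulSoup4 (often abbreviated BS4) is a Python library for parsing HTML and XML documents. It is designed to simplify extracting data from complex web pages. It copes with imperfectly formed markup and provides a simple interface for navigating, searching, and modifying the parsed document.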

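Key features:

- Cross-platform: Beautiful Soup runs on Windows, Linux, macOS, and other operating systems.
- Broad parser support: works with several parsers, including the parser in Python's standard library (html.parser) and the third-party parsers lxml and html5lib.
- Easy to learn: offers a simple, intuitive API that suits beginners.
- Feature-rich: includes a broad set of functions and methods for searching, navigating, and modifying documents.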
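
Because lxml and html5lib are optional third-party packages, a script cannot always assume they are present. The following is a minimal sketch of one way to handle that, assuming you prefer lxml when available; the helper name `make_soup` is hypothetical and not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_doc = "<html><head><title>Example Page</title></head><body></body></html>"


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html_doc)
print(soup.title.string)
```

Note that different parsers can build different trees from invalid HTML, so results on messy pages may vary with the parser that ends up being used.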

Installation:

You can install BeautifulSoup4 with pip:

bash

pip install beautifulsoup4
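
To confirm that the install worked, a quick check along these lines can help. It only assumes that the package was installed into the interpreter you are running.

```python
import bs4

# The package is installed as "beautifulsoup4" but imported as "bs4"
print(bs4.__version__)
```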

Common APIs

The following are some commonly used BeautifulSoup4 API methods and features.

1. Creating a BeautifulSoup object

First, create a BeautifulSoup object to parse an HTML or XML document.

python

from bs4 import BeautifulSoup

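# Use Python's built-in html.parser
# Note: html.parser keeps the invalid <h1>-inside-<p> nesting as written;
# lxml or html5lib would restructure it and change the results shown later.
html_doc = "<html><head><title>Example Page</title></head><body id='id'><a href='123'></a><p class='my-class child-class'><i>444</i><h1>Hello World</h1></p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the parsed document, pretty-printed
print(soup.prettify())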
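
Once the object exists, the first tag with a given name can also be reached as an attribute of the soup. The sketch below illustrates this on a shortened version of the sample document; the shortened markup is an assumption made for brevity.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>Example Page</title></head><body id='id'><a href='123'></a></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title          # the first <title> tag
title_text = soup.title.string  # the text inside it
body_id = soup.body["id"]       # an attribute of the first <body> tag
missing = soup.h2               # no <h2> in this document, so this is None

print(title_tag)
print(title_text)
print(body_id)
print(missing)
```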

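2. Finding tags

You can look up specific elements by tag name or by other attributes.

find(): returns the first matching element

python

first_paragraph = soup.find('p')
print(first_paragraph)  # Output: <p class="my-class child-class"><i>444</i><h1>Hello World</h1></p>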
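
find() also accepts attribute filters and returns None when nothing matches, which is worth checking before using the result. A small sketch, using made-up sample markup:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html_doc = """
<div id="main">
  <p class="intro">First</p>
  <p class="note">Second</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

note = soup.find("p", class_="note")  # class is a Python keyword, hence class_
main = soup.find(id="main")           # filter by attribute only
missing = soup.find("table")          # nothing matches, so this is None

print(note.text)
print(main.name)
print(missing)
```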

find_all(): returns a list of all matching elements

python

all_headings = soup.find_all(['h1', 'h2'])
for heading in all_headings:
    print(heading.text)
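
find_all() takes the same filters as find(), plus a limit argument. The following sketch, on invented sample markup, collects only the links that actually carry an href and shows that a search with no matches yields an empty list rather than None.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html_doc = """
<ul>
  <li class="item"><a href="/a">A</a></li>
  <li class="item"><a href="/b">B</a></li>
  <li class="item"><a>C (no link)</a></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a", href=True)                    # only <a> tags that have an href
first_two = soup.find_all("li", class_="item", limit=2)  # stop after two matches
no_tables = soup.find_all("table")                       # empty result, not None

hrefs = [a["href"] for a in links]
print(hrefs)
print(len(first_two))
print(len(no_tables))
```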

select_one() & select(): CSS selectors

select_one() returns the first match; select() returns a list of all matches.

python

css_selector_example = soup.select_one('.my-class')
print(css_selector_example)

css_selectors_examples = soup.select('#id > .child-class')
for element in css_selectors_examples:
    print(element.text)
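
A few more selector forms, shown as a sketch on invented markup: a child combinator with a class, an attribute-prefix selector, and the results when nothing matches.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html_doc = """
<div id="menu">
  <a class="nav" href="/home">Home</a>
  <a class="nav external" href="https://example.com">Docs</a>
  <span class="nav">Not a link</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

nav_links = soup.select("#menu > a.nav")             # <a class="nav"> directly under #menu
external = soup.select_one('a[href^="https://"]')    # href starts with https://
nothing = soup.select_one("table.data")              # no match, so None

nav_texts = [a.text for a in nav_links]
print(nav_texts)
print(external.text)
print(nothing)
```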

3. Accessing tag content

Read the text inside a tag and the tag's attributes.

text attribute: get the plain text inside a tag

python

text_content = first_paragraph.text
print(text_content)  # Output: 444Hello World

get_text(): also retrieves the text

python

get_text_content = first_paragraph.get_text()
print(get_text_content)  # Output: 444Hello World
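
Because the text of nested tags is concatenated without any separator, get_text() is often called with a separator and strip=True. A minimal sketch using the same paragraph as the sample document:

```python
from bs4 import BeautifulSoup

html_doc = "<p class='my-class child-class'><i>444</i><h1>Hello World</h1></p>"
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find("p")

joined = p.get_text()                    # strings run together
spaced = p.get_text(" ", strip=True)     # strip each piece, join with a space
pieces = list(p.stripped_strings)        # the individual stripped strings

print(joined)
print(spaced)
print(pieces)
```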

attrs attribute: get all attributes of a tag

python

attributes = first_paragraph.attrs
print(attributes)  # Output: {'class': ['my-class', 'child-class']}; a tag with no attributes gives {}
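
Note that class is treated as a multi-valued attribute and comes back as a list, while most other attributes are plain strings. The sketch below shows this on an invented tag with a few extra attributes.

```python
from bs4 import BeautifulSoup

# Sample tag invented for this illustration
html_doc = "<p class='my-class child-class' id='intro' data-level='2'>Text</p>"
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p

attributes = p.attrs
classes = p["class"]          # a list, one entry per class name
has_id = p.has_attr("id")
has_title = p.has_attr("title")

print(attributes)
print(classes)
print(has_id, has_title)
```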

[attribute]: access a single attribute value directly

python

link_tag = soup.a
href_value = link_tag['href']
print(href_value)
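
Square-bracket access raises KeyError when the attribute is absent, so on real pages .get() is often the safer choice. A small sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html_doc = "<a href='123'>with href</a><a>without href</a>"
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("a")

present = first["href"]              # attribute exists
absent = second.get("href")          # None instead of an exception
fallback = second.get("href", "#")   # explicit default value

try:
    second["href"]
    raised = False
except KeyError:
    # Bracket access on a missing attribute behaves like a dict lookup
    raised = True

print(present, absent, fallback, raised)
```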

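4. Modifying the document

Besides querying, you can also add, remove, or modify nodes in the document.

Adding a new tag

python

new_tag = soup.new_tag("b")
new_tag.string = "Bold Text"
first_paragraph.append(new_tag)
print(first_paragraph)  # Output: <p class="my-class child-class"><i>444</i><h1>Hello World</h1><b>Bold Text</b></p>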

Removing a tag

python

tag_to_remove = soup.b
tag_to_remove.decompose()
print(first_paragraph)  # Output: <p class="my-class child-class"><i>444</i><h1>Hello World</h1></p>

Replacing a tag

python

replacement_tag = soup.new_tag("i")
replacement_tag.string = "Italic Text"
first_paragraph.i.replace_with(replacement_tag)
print(first_paragraph)  # Output: <p class="my-class child-class"><i>Italic Text</i><h1>Hello World</h1></p>
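
Attributes and text can be edited in place as well. The following sketch, on an invented paragraph, sets and deletes attributes, changes a tag's string, and then removes a tag while keeping its contents with unwrap().

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html_doc = "<p id='old'>Hello <b>World</b></p>"
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p

p["class"] = "greeting"   # add a new attribute
del p["id"]               # remove an existing attribute
p.b.string = "Soup"       # replace the text inside <b>
after_edit = str(p)

p.b.unwrap()              # drop the <b> tag but keep its text
after_unwrap = str(p)

print(after_edit)
print(after_unwrap)
```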

5. Navigating the tree

BeautifulSoup also provides several ways to traverse and manipulate the parse tree.

parent: the parent node

python

parent_node = first_paragraph.parent
print(parent_node.name)  # Output: body

children: iterator over child nodes

python

children_nodes = list(first_paragraph.children)
for child in children_nodes:
    print(child)
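
children yields only direct children, whereas descendants walks the whole subtree, including text nodes. A minimal sketch on the sample paragraph, assuming html.parser as in the rest of the article:

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = "<p class='my-class child-class'><i>444</i><h1>Hello World</h1></p>"
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p

# Direct children only: the <i> and <h1> tags
child_names = [child.name for child in p.children]

# All descendants: both tags plus the text nodes inside them
descendants = list(p.descendants)
strings = [str(d) for d in descendants if isinstance(d, NavigableString)]

print(child_names)
print(len(descendants))
print(strings)
```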

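siblings: sibling nodes (next_sibling and previous_sibling)

python

next_sibling = first_paragraph.next_sibling
previous_sibling = first_paragraph.previous_sibling
print(next_sibling)      # Output: None (the <p> is the last child of <body>)
print(previous_sibling)  # Output: <a href="123"></a>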
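
On indented, real-world HTML the immediate sibling is often a whitespace text node rather than a tag. The sketch below, on invented markup, shows that case and uses find_next_sibling() and find_previous_sibling() to skip to the nearest tag.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this illustration
html_doc = """<ul>
  <li id="one">One</li>
  <li id="two">Two</li>
  <li id="three">Three</li>
</ul>"""
soup = BeautifulSoup(html_doc, "html.parser")
two = soup.find("li", id="two")

raw_next = two.next_sibling                  # whitespace between the <li> tags
next_li = two.find_next_sibling("li")        # nearest following <li>
prev_li = two.find_previous_sibling("li")    # nearest preceding <li>
after_last = soup.find(id="three").find_next_sibling("li")  # None, nothing follows

print(repr(raw_next))
print(next_li["id"], prev_li["id"])
print(after_last)
```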

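Practical tips (key points)

On real pages, many nodes are hard to locate by hand. The browser's developer tools can copy a CSS selector for an element directly.

Press F12 to open the developer tools.

Copy the CSS selector of the target element.

Use the copied selector directly in your code.

python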

from bs4 import BeautifulSoup

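# Use Python's built-in html.parser
html_doc = "<html></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
# Illustration only: the selector below was copied from a browser; against this placeholder document it matches nothing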
soup.select('#ice-container > div.tbpc-layout > div.screen-outer.clearfix > div.main > div.core.J_Core > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div:nth-child(3) > div > div > a')
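
Copied selectors tend to be long and brittle, and the browser's live DOM can differ from the HTML your script downloads (for example when content is rendered by JavaScript). A sketch of how one might test a copied selector against a saved sample and then try a shorter one; the markup and both selectors here are invented for illustration, reusing only the `#ice-container` id from above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for a saved copy of the real page
html_doc = """
<div id="ice-container">
  <div class="main">
    <div><span>ad</span></div>
    <div><a href="/item/1">Item 1</a></div>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A selector in the style the browser copies: exact path with nth-child steps
copied = "#ice-container > div.main > div:nth-child(2) > a"
# A shorter selector that relies on a stable id and class only
shorter = "#ice-container div.main a"

copied_matches = soup.select(copied)
shorter_match = soup.select_one(shorter)

print(len(copied_matches))
print(shorter_match["href"])
```

If the copied selector returns an empty list on the downloaded HTML, checking each step of the path from the left usually shows where the fetched markup and the browser's DOM diverge.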

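Closing remarks

All the APIs in this article have been verified and can be run directly 👽👽👽

If you hit a problem running them, leave a comment to reach the author 🤭🤭🤭

The wind is free, and so are you 🤠🤠🤠

You are welcome to join in and learn together ☠️☠️☠️

If this helped, please leave a like, a bookmark, and a follow 🥰🥰🥰

Scraping experts, please go easy on me; corrections are welcome 😈😈😈

A series of web scraping articles will follow, so stay tuned 🤡🤡🤡.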