[Python] HTMLParser: Parsing HTML

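html.parser is a module in the Python standard library for parsing HTML. Its core class, HTMLParser, exposes a set of handler methods that are called for each part of an HTML document. The sections below walk through what the module offers and how to use it.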

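The HTMLParser class

HTMLParser is the core class of the html.parser module. By subclassing it and overriding its handler methods, you decide how tags, attributes, and text content are processed.

Initialization and basic usage

To use HTMLParser, you normally subclass it and override some of its handler methods. The parser calls these methods automatically as it works through the document, giving you a hook for each part of the HTML.

Example: a custom parser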

from html.parser import HTMLParser

# Create a custom parser class that inherits from HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Start tag: {tag}")
        if attrs:
            for attr in attrs:
                print(f"  Attribute: {attr}")

    def handle_endtag(self, tag):
        print(f"End tag: {tag}")

    def handle_data(self, data):
        print(f"Data: {data}")

# Create a parser instance
parser = MyHTMLParser()

# Parse an HTML string
html_string = "<html><head><title>Test</title></head><body><h1>Title</h1><p>Hello, World!</p></body></html>"
parser.feed(html_string)

In this example, MyHTMLParser inherits from HTMLParser and overrides three methods: handle_starttag, handle_endtag, and handle_data. They handle start tags, end tags, and text data respectively.

Output:

Start tag: html
Start tag: head
Start tag: title
Data: Test
End tag: title
End tag: head
Start tag: body
Start tag: h1
Data: Title
End tag: h1
Start tag: p
Data: Hello, World!
End tag: p
End tag: body
End tag: html
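
Printing is useful for exploring, but in practice a subclass usually stores what it finds. The following is a minimal sketch of that pattern; the class name LinkCollector and its links attribute are made up for illustration and are not part of html.parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; dict() makes lookup easy
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="https://example.com">One</a> <a href="/two">Two</a></p>')
print(collector.links)  # ['https://example.com', '/two']
```

Keeping results on the instance lets the calling code decide what to do with them after feed() returns.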

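Key methods in detail

HTMLParser provides several handler methods that the parser calls automatically while processing a document. Their default implementations do nothing, except handle_startendtag, which calls handle_starttag and then handle_endtag.

`handle_starttag(self, tag, attrs)`

Purpose: called when the parser encounters a start tag (such as <div>).
Parameters:
- tag is the tag name, converted to lowercase, for example 'div'.
- attrs is a list of (name, value) tuples, for example [('class', 'header')]. Attribute names are lowercased, quotes around values are removed, and an attribute with no value has the value None.
Usage example: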

def handle_starttag(self, tag, attrs):
    print(f"Start tag: {tag}")
    if attrs:
        for attr, value in attrs:
            print(f" - Attribute: {attr} = {value}")
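
To see how tag and attrs are normalized, the sketch below records what handle_starttag receives for an uppercase tag with a valueless attribute. AttrRecorder and its seen list are illustrative names only.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper that records every (tag, attrs) pair it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('<INPUT TYPE="checkbox" checked>')
# Tag and attribute names arrive lowercased; "checked" has no value, so it is None
print(recorder.seen)  # [('input', [('type', 'checkbox'), ('checked', None)])]
```

Because a value can be None, code that formats or compares attribute values should handle that case explicitly.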

`handle_endtag(self, tag)`

Purpose: called when the parser encounters an end tag (such as </div>).
Parameters:
- tag is the tag name, converted to lowercase, for example 'div'.
Usage example:

def handle_endtag(self, tag):
    print(f"End tag: {tag}")

`handle_startendtag(self, tag, attrs)`

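Purpose: called when the parser encounters an XHTML-style empty tag (such as <br />). A void element written without the trailing slash, such as <br>, triggers only handle_starttag.
Parameters:
- tag is the tag name, for example 'br'.
- attrs is a list of (name, value) tuples.
Usage example: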

def handle_startendtag(self, tag, attrs):
    print(f"Self-closing tag: {tag}")
    if attrs:
        for attr, value in attrs:
            print(f" - Attribute: {attr} = {value}")
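
The difference between <br> and <br /> is easy to miss, so here is a small sketch that records which handler fires for each form. VoidTagRecorder and its events list are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class VoidTagRecorder(HTMLParser):
    """Hypothetical helper that logs which handler was called for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


recorder = VoidTagRecorder()
recorder.feed('<br><br /><img src="x.png" />')
# Only the forms ending in "/>" reach handle_startendtag
print(recorder.events)  # [('start', 'br'), ('startend', 'br'), ('startend', 'img')]
```

If handle_startendtag is not overridden, its default implementation calls handle_starttag followed by handle_endtag, so a parser that only overrides those two still sees self-closing tags.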

`handle_data(self, data)`

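Purpose: called to process text data between tags.
Parameters:
- data is the text content, for example 'Hello, World!'. A single run of text may be delivered in more than one call, for example when the input is fed in pieces.
Usage example: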

def handle_data(self, data):
    print(f"Data: {data}")
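
Since text can arrive in pieces, a handler that needs the full text usually accumulates chunks and joins them afterwards. The sketch below assumes that approach; TextExtractor, chunks, and get_text are illustrative names, not part of the library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that gathers all text chunks from the input."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        # Join the chunks, then collapse runs of whitespace
        return " ".join("".join(self.chunks).split())


extractor = TextExtractor()
extractor.feed("<p>Hello, ")   # the text of one paragraph is split across two feeds
extractor.feed("World!</p>")
extractor.close()
print(extractor.chunks)      # ['Hello, ', 'World!']
print(extractor.get_text())  # Hello, World!
```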

`handle_comment(self, data)`

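Purpose: called when the parser encounters an HTML comment.
Parameters:
- data is the text between <!-- and -->, including any surrounding whitespace; for <!-- This is a comment --> it is ' This is a comment '.
Usage example: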

def handle_comment(self, data):
    print(f"Comment: {data}")

`handle_entityref(self, name)`

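Purpose: called for a named character reference (such as &amp;). It is only called when the parser is created with convert_charrefs=False; with the default, convert_charrefs=True, references in text are converted to the corresponding characters and passed to handle_data instead.
Parameters:
- name is the name of the reference, for example 'amp'.
Usage example: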

def handle_entityref(self, name):
    print(f"Named entity: &{name};")

`handle_charref(self, name)`

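Purpose: called for a numeric character reference (such as &#123;). Like handle_entityref, it is only called when convert_charrefs is False.
Parameters:
- name is the character number as a string, either decimal or hexadecimal, for example '123' for &#123; or 'x7B' for &#x7B;.
Usage example:

def handle_charref(self, name):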
    print(f"Numeric entity: &#{name};")
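
The name passed to these two handlers is only the reference name, not the character itself. One possible way to turn it into a character is sketched below; entityref_to_char and charref_to_char are hypothetical helpers, while name2codepoint and unescape come from the standard library.

```python
from html import unescape
from html.entities import name2codepoint


def entityref_to_char(name):
    """Map a name such as 'amp' to its character; keep unknown names as-is."""
    codepoint = name2codepoint.get(name)
    return chr(codepoint) if codepoint is not None else f"&{name};"


def charref_to_char(name):
    """Map '123' (decimal) or 'x7B' (hexadecimal) to its character."""
    if name[:1] in ("x", "X"):
        return chr(int(name[1:], 16))
    return chr(int(name))


print(entityref_to_char("amp"))  # &
print(charref_to_char("123"))    # {
print(charref_to_char("x7B"))    # {
print(unescape("&#123; &amp;"))  # { &
```

When the whole string is available, html.unescape is usually the simpler choice; the per-reference helpers are only needed inside the handlers.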

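How a document is parsed

When you pass HTML to an HTMLParser object through feed(), the parser scans the input and calls the matching handler as soon as it recognizes a complete piece of markup. feed() can be called repeatedly; incomplete data at the end of a chunk is buffered until more input arrives or close() is called. For example:

- When the parser reaches an <html> tag, handle_starttag is called.
- When it reaches </html>, handle_endtag is called.
- Text between tags (such as Hello, World!) triggers handle_data.
- When it reaches a comment such as <!-- comment -->, handle_comment is called.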
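
The flow described above can be traced by logging every callback in order. In this sketch the input is deliberately cut in the middle of the closing </html> tag to show buffering; EventLogger and its events list are illustrative names.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records each callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


logger = EventLogger()
logger.feed("<html><!-- note --><p>Hi</p></ht")  # "</ht" is incomplete and gets buffered
events_after_first_feed = list(logger.events)
logger.feed("ml>")                                # completes the buffered end tag
logger.close()

for event in logger.events:
    print(event)
```

After the first feed() the log ends with ('end', 'p'); the ('end', 'html') event appears only once the second chunk completes the tag.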

Common usage examples

Here are some common HTMLParser operations.

Parsing tags and attributes

class TagParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Tag: {tag}")
        for attr, value in attrs:
            print(f" - Attribute: {attr} = {value}")

html_code = '<a href="https://example.com" title="Example">Click here</a>'
parser = TagParser()
parser.feed(html_code)

Output:

Tag: a
 - Attribute: href = https://example.com
 - Attribute: title = Example

Parsing text and comments

class ContentParser(HTMLParser):
    def handle_data(self, data):
        print(f"Data: {data}")

    def handle_comment(self, data):
        # data keeps the spaces inside <!-- ... -->, so strip them for display
        print(f"Comment: {data.strip()}")

html_code = '<p>This is a paragraph.<!-- This is a comment --></p>'
parser = ContentParser()
parser.feed(html_code)

Output:

Data: This is a paragraph.
Comment: This is a comment

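Advanced features

Handling special characters

HTMLParser can handle HTML entities and character references such as &amp; and &#123;. By default (convert_charrefs=True) it converts them to the corresponding characters and passes the result to handle_data. To receive them through handle_entityref and handle_charref, create the parser with convert_charrefs=False.

Named character references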

class EntityParser(HTMLParser):
    def handle_entityref(self, name):
        print(f"Named entity: &{name};")

html_code = 'AT&amp;T'
# handle_entityref is only called when convert_charrefs is False
parser = EntityParser(convert_charrefs=False)
parser.feed(html_code)

Output:

Named entity: &amp;

Numeric character references

class CharRefParser(HTMLParser):
    def handle_charref(self, name):
        print(f"Numeric entity: &#{name};")

html_code = '&#169; 2023'
# handle_charref is only called when convert_charrefs is False
parser = CharRefParser(convert_charrefs=False)
parser.feed(html_code)

Output:

Numeric entity: &#169;
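
The effect of the convert_charrefs flag is easiest to see by running the same input through both settings. This is a small comparison sketch; RefRecorder and its data and refs attributes are made-up names.

```python
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical helper that records text chunks and character references."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.data = []
        self.refs = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_entityref(self, name):
        self.refs.append(name)

    def handle_charref(self, name):
        self.refs.append("#" + name)


source = "AT&amp;T &#169; 2023"

default = RefRecorder()                      # convert_charrefs=True by default
default.feed(source)
default.close()
print("".join(default.data), default.refs)  # references already decoded, refs stays empty

raw = RefRecorder(convert_charrefs=False)
raw.feed(source)
raw.close()
print(raw.data, raw.refs)                    # refs is ['amp', '#169']
```

With the default setting the reference handlers are never called, which is why a parser that overrides only handle_entityref or handle_charref appears to print nothing.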

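Handling HTML fragments

HTMLParser can also parse HTML fragments. The input does not need to be a complete document: tags are reported as they appear, and text outside any tag is passed to handle_data.

Parsing an HTML fragment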

class FragmentParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Start tag: {tag}")

    def handle_endtag(self, tag):
        print(f"End tag: {tag}")

html_code = 'This is a <b>bold</b> statement.'
parser = FragmentParser()
parser.feed(html_code)

Output:

Start tag: b
End tag: b
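
HTMLParser reports tags in the order they appear and does not check that they are properly nested, so fragments with mismatched tags parse without error. If nesting matters, the subclass has to track it. A minimal sketch using a stack follows; OpenTagTracker, stack, and unmatched are hypothetical names.

```python
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Hypothetical helper that tracks open tags and unmatched end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags opened and not yet closed
        self.unmatched = []  # end tags that did not match the innermost open tag

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.unmatched.append(tag)


tracker = OpenTagTracker()
tracker.feed("This is a <b>bold</i> statement.")  # </i> does not match <b>
tracker.close()
print(tracker.stack)      # ['b']
print(tracker.unmatched)  # ['i']
```

This simple check treats void elements such as <br> as never closed, so a real implementation would need to skip them.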

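Extensions and enhancements

Stopping the parser

Sometimes you want to stop before the whole input has been processed. HTMLParser has no method for this: close() forces the remaining buffered data to be processed as if the input had ended, so calling it from inside a handler does not stop parsing. One common approach is to raise an exception from the handler and catch it around feed().

class StopParsing(Exception):
    """Raised from a handler to abort parsing early."""

class StopParser(HTMLParser):
    def handle_data(self, data):
        print(f"Data: {data}")
        # Abort parsing when specific data is encountered
        if data == "stop":
            raise StopParsing

html_code = '<p>start</p><p>stop</p><p>end</p>'
parser = StopParser()
try:
    parser.feed(html_code)
except StopParsing:
    pass  # stopped on purpose; call parser.reset() before reusing the parser

Output: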

Data: start
Data: stop

Resetting the parser

You can call reset() to reset the parser so that it can start parsing new HTML content. Any unprocessed buffered data is discarded.

class ResetParser(HTMLParser):
    def handle_data(self, data):
        print(f"Data: {data}")

html_code_1 = '<p>first part</p>'
html_code_2 = '<p>second part</p>'

parser = ResetParser()
parser.feed(html_code_1)

parser.reset()  # Reset the parser

parser.feed(html_code_2)

Output:

Data: first part
Data: second part
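
reset() only clears the parser's own internal state. If a subclass keeps results of its own, one option is to override reset() so that both are cleared together. HTMLParser.__init__ calls reset(), so the attribute can be created there. CountingParser and tag_count below are illustrative names.

```python
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Hypothetical parser that counts start tags and clears the count on reset."""

    def reset(self):
        super().reset()
        # Custom state is (re)initialized together with the parser's own state
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1


counter = CountingParser()
counter.feed("<p><b>one</b></p>")
first = counter.tag_count   # 2 start tags: <p> and <b>

counter.reset()
counter.feed("<p>two</p>")
second = counter.tag_count  # 1 start tag, the earlier count was cleared

print(first, second)  # 2 1
```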