Python XML Operations

XPath

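XPath is a language for locating nodes in XML or HTML documents. Its path expressions select nodes, predicates filter them, and built-in functions handle string, numeric, and boolean computation.

XPath Expression Syntax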

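Syntax	Description	Example
/	Selects from the root node.	/root # selects the root element root
//	Selects matching nodes anywhere below the current node, at any depth.	//child # selects all child elements, wherever they appear
.	Selects the current node.	./child # selects the child elements directly under the current node
..	Selects the parent of the current node.	../title # selects the title children of the current node's parent
*	Matches any element node.	//*[contains(text(), "下一页")] # finds any element whose text contains "下一页" ("next page")
@	Selects attributes.	/bookstore/book/@id # selects the id attribute of every book element
child::	Selects the children of the current node.	/bookstore/child::book # selects the book children of bookstore
descendant::	Selects all descendants of the current node.	/bookstore/descendant::title # selects every title descendant of bookstore
parent::	Selects the parent of the current node.	/bookstore/book/title/parent::book # selects the book parent of each title
ancestor::	Selects all ancestors of the current node.	/bookstore/book/title/ancestor::bookstore # selects the bookstore ancestor of title
following-sibling::	Selects all siblings after the current node.	/bookstore/book/following-sibling::book # selects the book siblings that follow a book
preceding-sibling::	Selects all siblings before the current node.	/bookstore/book/preceding-sibling::book # selects the book siblings that precede a book
following::	Selects every node after the current node in document order, excluding its descendants.	/bookstore/book/following::title # selects the title elements after a book
preceding::	Selects every node before the current node in document order, excluding its ancestors.	/bookstore/book/preceding::title # selects the title elements before a book
node()	Matches nodes of any type.	/bookstore/node() # selects all child nodes of bookstore
text()	Matches text nodes.	/bookstore/book/text() # selects the text nodes directly under book
comment()	Matches comment nodes.	/bookstore/comment() # selects the comment nodes directly under bookstore
processing-instruction()	Matches processing-instruction nodes.	/bookstore/processing-instruction() # selects the processing instructions directly under bookstore
[n]	Selects the `n`-th node (counting from 1).	/bookstore/book[1] # selects the first book
[last()]	Selects the last node.	/bookstore/book[last()] # selects the last book
[position() > n]	Selects nodes whose position is greater than `n`.	/bookstore/book[position() > 2] # selects the books after the second one
[contains()]	Selects nodes that contain a given substring.	/bookstore/book[title[contains(., 'aaa')]] # selects books whose title contains 'aaa'
[starts-with()]	Selects nodes that start with a given substring.	/bookstore/book[title[starts-with(., 'aaa')]] # selects books whose title starts with 'aaa'
[string-length()]	Selects nodes whose string length satisfies a condition.	/bookstore/book[not(string-length(title) < 5)] # selects books whose title is at least 5 characters long
and	Logical AND.	/bookstore/book[price < 20 and category = 'Science'] # selects books priced under 20 whose category is 'Science'
or	Logical OR.	/bookstore/book[price < 20 or category = 'Science'] # selects books priced under 20 or whose category is 'Science'
not()	Logical NOT.	/bookstore/book[not(price < 20)] # selects books whose price is not under 20
contains()	Tests whether a value contains a given substring.	//title[contains(text(), 'aaa')] # selects title elements whose text contains 'aaa'
starts-with()	Tests whether a value starts with a given substring.	//title[starts-with(text(), 'aaa')] # selects title elements whose text starts with 'aaa'
substring-before()	Returns the part of a string before the first occurrence of a substring.	substring-before('XPath Tutorial', ' ') # returns 'XPath'
substring-after()	Returns the part of a string after the first occurrence of a substring.	substring-after('XPath Tutorial', ' ') # returns 'Tutorial'
normalize-space()	Strips leading and trailing whitespace and collapses internal runs of whitespace into a single space.	normalize-space(//title) # returns the whitespace-normalized text of the first title
|	Union operator; merges the result sets of several XPath expressions.	/bookstore/book/title | /bookstore/book/author # selects all title and author elements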
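Several rows of the table can be tried directly with lxml, which is introduced in the next section. The sketch below is only an illustration: the bookstore document, its ids, and its prices are invented for this example and are not part of the original article.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
xml_data = """
<bookstore>
  <book id="b1">
    <title>  XPath   Basics </title>
    <price>15</price>
    <category>Science</category>
  </book>
  <book id="b2">
    <title>Learning lxml</title>
    <price>30</price>
    <category>Programming</category>
  </book>
  <book id="b3">
    <title>XML in Depth</title>
    <price>25</price>
    <category>Science</category>
  </book>
</bookstore>
"""
root = etree.fromstring(xml_data)

# Attribute selection returns the attribute values as strings.
ids = root.xpath("/bookstore/book/@id")

# Positional predicates count from 1.
last_id = root.xpath("/bookstore/book[last()]/@id")

# Logical operators inside a predicate.
cheap_science = root.xpath("/bookstore/book[price < 20 and category = 'Science']/@id")

# Axes: the siblings after the first book, and the parent of a matching title.
later_ids = root.xpath("/bookstore/book[1]/following-sibling::book/@id")
parent_ids = root.xpath("//title[starts-with(., 'Learning')]/parent::book/@id")

# String functions return a plain string rather than a list of nodes.
normalized = root.xpath("normalize-space(/bookstore/book[1]/title)")
before = root.xpath("substring-before('XPath Tutorial', ' ')")

# The union operator merges two node sets.
tags = [el.tag for el in root.xpath("/bookstore/book[1]/title | /bookstore/book[1]/price")]

print(ids, last_id, cheap_science, later_ids, parent_ids, normalized, before, tags)
```

Note that node-selecting expressions return lists, while function calls such as `normalize-space(...)` return a single value.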

lxml

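lxml is a Python library for processing XML and HTML documents. It wraps the C libraries libxml2 and libxslt behind an efficient, flexible Python interface. lxml supports XPath, so XPath expressions can be evaluated against XML or HTML documents.

Installing lxml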


pip install lxml

The lxml.etree Module

Methods for parsing and creating documents

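etree.parse(source, parser=None): Parses a document from a file name, URL, or file-like object and returns an ElementTree object. To parse markup held in a string, use fromstring, XML, or HTML instead.

etree.fromstring(text, parser=None): Parses a document from a string and returns the root Element. The XML parser is used unless another parser is passed.

etree.XML(text, parser=None): Parses an XML string and returns the root Element.

etree.HTML(text, parser=None): Parses an HTML string with the lenient HTML parser and returns the root Element.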

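etree.Element(tag, attrib={}, nsmap=None, **extra): Creates a new XML element.

etree.ElementTree(element=None, file=None, parser=None): Creates a new element tree, either wrapping an existing root element or parsing a file.

etree.XSLT(xslt_input): Creates an XSLT transformer from a parsed stylesheet. Calling the transformer on a document returns the transformed result.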
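The functions above can be sketched together as follows. This is a minimal illustration under stated assumptions: the element names and the stylesheet are invented, an in-memory `io.BytesIO` stands in for a real file, and `etree.SubElement` (an lxml helper not listed above) is used to attach a child.

```python
import io
from lxml import etree

# etree.parse reads from a file name, URL, or file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(b"<root><child>text</child></root>"))
root_from_file = tree.getroot()

# etree.fromstring (like etree.XML) takes the markup itself and returns the root Element.
root_from_string = etree.fromstring("<root><child>text</child></root>")

# etree.HTML tolerates incomplete markup and wraps it in <html><body>.
html_root = etree.HTML("<p>hello")

# Build a small document programmatically.
new_root = etree.Element("bookstore", attrib={"lang": "en"})
book = etree.SubElement(new_root, "book", id="b1")
book.text = "XPath Basics"
new_tree = etree.ElementTree(new_root)

# etree.XSLT compiles a parsed stylesheet into a callable transformer.
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <titles><xsl:value-of select="/root/child"/></titles>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)
result = transform(tree)

print(root_from_file.tag, root_from_string[0].text, html_root.tag)
print(new_tree.getroot().get("lang"), result.getroot().text)
```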

Methods for finding and manipulating elements

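element.xpath(path): Evaluates a full XPath 1.0 expression with the element as the context node. Depending on the expression, it returns a list of nodes or a string, number, or boolean.

element.find(path): Returns the first child element that matches the path. The path uses ElementPath, a limited subset of XPath, so most XPath functions are not available here.

element.findall(path): Returns all elements that match the ElementPath expression.

element.findtext(path, default=None): Returns the text of the first element that matches the ElementPath expression, or default if nothing matches.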

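element.get(key, default=None): Returns the value of the given attribute.

element.set(key, value): Sets the value of the given attribute.

element.text: Gets or sets the element's text content.

element.attrib: The element's attributes as a dictionary-like object.

element.append(element): Appends a child element.

element.remove(element): Removes a child element.

element.insert(index, element): Inserts a child element at the given position.

element.clear(): Resets the element by removing its children, attributes, and text.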
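A short sketch of these methods in use, assuming a made-up two-book document. It also shows the practical difference between `find`/`findall`, which accept ElementPath, and `xpath`, which accepts functions such as `contains()`.

```python
from lxml import etree

# Hypothetical document used only for this illustration.
root = etree.fromstring(
    "<bookstore>"
    "<book id='b1'><title>XPath Basics</title></book>"
    "<book id='b2'><title>Learning lxml</title></book>"
    "</bookstore>"
)

# find/findall/findtext take ElementPath expressions relative to the element.
first_book = root.find("book")
all_titles = [t.text for t in root.findall("book/title")]
missing = root.findtext("book/price", default="n/a")

# xpath() accepts full XPath 1.0, including functions that ElementPath lacks.
lxml_books = root.xpath("book[contains(title, 'lxml')]/@id")

# Read and write attributes and text.
first_book.set("lang", "en")
lang = first_book.get("lang")
first_book.find("title").text = "XPath Basics, 2nd ed."
new_title = root.findtext("book/title")

# Append, insert, and remove children.
new_book = etree.Element("book", id="b3")
root.append(new_book)
root.insert(0, etree.Element("book", id="b0"))
root.remove(new_book)
order = [b.get("id") for b in root]

# clear() drops the children, attributes, and text of the element itself.
first_book.clear()

print(all_titles, missing, lxml_books, lang, new_title, order, len(first_book))
```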

Serialization methods

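etree.tostring(element, encoding=None, method='xml', with_comments=True, pretty_print=False): Serializes an element. The result is bytes by default; pass encoding='unicode' to get a str.

tree.write(file, encoding=None, xml_declaration=None, method='xml'): Writes the element tree to a file. This is a method of an ElementTree instance, called here tree.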
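Both serialization paths can be sketched as follows. The file name `books.xml` and the temporary directory are assumptions made for the example so that nothing is left on disk.

```python
import os
import tempfile
from lxml import etree

root = etree.Element("bookstore")
etree.SubElement(root, "book", id="b1").text = "XPath Basics"

# tostring returns bytes by default ...
as_bytes = etree.tostring(root)
# ... and str when encoding="unicode" is requested.
as_text = etree.tostring(root, encoding="unicode", pretty_print=True)

# write() is called on an ElementTree instance; the target is a temporary file.
tree = etree.ElementTree(root)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books.xml")
    tree.write(path, encoding="utf-8", xml_declaration=True, pretty_print=True)
    # Read the file back to confirm the round trip.
    reloaded = etree.parse(path).getroot()

print(as_text)
```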

Example:


from lxml import etree

# HTML string
html_data = '''
<div> 
    <ul> 
        <li class="item-1">
            <a href="link1.html">first item</a>
        </li> 
        <li class="item-1">
            <a href="link2.html">second item</a>
        </li> 
        <li class="item-inactive">
            <a href="link3.html">third item</a>
        </li> 
        <li class="item-1">
            <a href="link4.html">fourth item</a>
        </li> 
        <li class="item-0">
            <a href="link5.html">fifth item</a> </li>
    </ul> 
</div>
'''
# Parse the string into an Element object
html = etree.HTML(html_data)

# Get the text of every <a> tag
print(html.xpath('//a/text()'))  # ['first item', 'second item', 'third item', 'fourth item', 'fifth item']

# Get the href of every <a> tag
print(html.xpath('//a/@href'))  # ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

# Get the text of any element whose text contains 'third'
print(html.xpath('//*[contains(text(), "third")]/text()'))  # ['third item']

# Iterate over the <a> elements and print each one's text and link
el_list = html.xpath('//a')
for el in el_list:
    print(f"{el.text} --- {el.attrib['href']}")  # first item --- link1.html
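Because `xpath()` can be called on any element, a query that starts with `.` stays relative to that element. The following sketch uses a shortened copy of the list above and filters on the `class` attribute; treat it as one possible way to structure such a loop rather than the only one.

```python
from lxml import etree

html = etree.HTML('''
<ul>
  <li class="item-1"><a href="link1.html">first item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
  <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
''')

# Select only the <li> elements whose class is not "item-inactive".
active_items = html.xpath('//li[not(@class="item-inactive")]')

rows = []
for li in active_items:
    # A leading "." keeps the query relative to this <li>, not the whole document.
    text = li.xpath('string(./a)')
    href = li.xpath('./a/@href')[0]
    rows.append((li.get('class'), text, href))

print(rows)
```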