Table of Contents
- 1. Using XPath
  - 1.1 Common XPath rules
  - 1.2 Installation
  - 1.3 A first example
  - 1.4 All nodes
  - 1.5 Child nodes
  - 1.6 Parent nodes
  - 1.7 Attribute matching
  - 1.8 Getting text
  - 1.9 Getting attributes
  - 1.10 Matching multi-valued attributes
  - 1.11 Matching multiple attributes
  - 1.12 Selecting by position
  - 1.13 Node axes
- 2. Using Beautiful Soup
  - 2.1 Parsers
  - 2.2 Installation
  - 2.3 Basic usage
  - 2.4 Node selectors
  - 2.5 Extracting information
  - 2.6 Associated selection
  - 2.7 Method selectors
  - 2.8 CSS selectors
- 3. Using pyquery
  - 3.1 Installation
  - 3.2 Initialization
  - 3.3 Basic CSS selectors
  - 3.4 Finding nodes
  - 3.5 Traversing nodes
  - 3.6 Manipulating nodes
    - addClass and removeClass
    - attr, text and html
    - remove
  - 3.7 Pseudo-class selectors
- 4. Using parsel
  - 4.1 Installation
  - 4.2 Initialization
  - 4.3 Extracting text
  - 4.4 Extracting attributes
  - 4.5 Extracting with regular expressions
1. Using XPath
- XPath (XML Path Language)
- The XML Path Language
- Used to locate information in XML documents
- Also works for HTML documents
1.1 Common XPath rules
Expression | Description |
---|---|
nodename | Selects all child nodes of this node |
/ | Selects direct children from the current node |
// | Selects descendants from the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
- Example
python
//title[@lang="eng"]
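- A minimal sketch (my own, not from the original notes) that exercises a few rules from the table with lxml; the tiny HTML snippet here is made up for illustration.
python
from lxml import etree

doc = etree.HTML('<div><p class="intro"><a href="a.html">A</a></p></div>')
print(doc.xpath('//a'))          # // : all <a> descendants anywhere in the document
print(doc.xpath('//p/a/@href'))  # /  selects a direct child, @ selects an attribute -> ['a.html']
print(doc.xpath('//a/..'))       # .. : the parent of <a>, i.e. the <p> node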
1.2 Installation
python
pip3 install lxml
1.3 A first example
python
from lxml import etree
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
- The etree module automatically fixes up the HTML text
- You can also parse a file directly
- test.html
html
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
  </ul>
</div>
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
1.4 All nodes
- Get all nodes (//*)
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//*")
print(result)
- Get all li nodes (//li)
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li")
print(result)
- Use // followed by a node name to get every node with that name
1.5 Child nodes
- Get the direct a children of the li nodes
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a")
print(result)
- Get all a descendants of the ul nodes
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//ul//a")
print(result)
1.6 Parent nodes
- Get the a node whose href attribute is link4.html, then move to its parent, then read that parent's class attribute
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//a[@href=\"link4.html\"]/../@class")
print(result)
- Get the parent via the parent:: axis
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//a[@href=\"link4.html\"]/parent::*/@class")
print(result)
1.7 Attribute matching
- Filter with @
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li[@class=\"item-0\"]")
print(result)
1.8 Getting text
- Get the li nodes whose class attribute is item-0 and retrieve the text of their direct children
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li[@class=\"item-0\"]/text()")
print(result)
- Get the li nodes whose class attribute is item-0 and retrieve all the text inside them
- Option 1: select the a node first, then get its text
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li[@class=\"item-0\"]/a/text()")
print(result)
- Option 2: use //
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li[@class=\"item-0\"]//text()")
print(result)
- Summary
- To get all text under the descendants: //text()
- To get the text under one specific child node: /node/text()
1.9 Getting attributes
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a/@href")
print(result)
1.10 Matching multi-valued attributes
- The li node's class attribute has two values
- li
- li-first
- The exact attribute match used earlier no longer works
python
from lxml import etree
text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath("//li[@class=\"li\"]/a/text()")
print(result)
- Use contains() to match against one value of a multi-valued attribute
python
from lxml import etree
text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath("//li[contains(@class, \"li\")]/a/text()")
print(result)
1.11 Matching multiple attributes
python
from lxml import etree
text = """
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath(
"//li[contains(@class, \"li\") and @name=\"item\"]/a/text()")
print(result)
- Operators
Operator | Description |
---|---|
or | logical or |
and | logical and |
mod | modulo |
\| | union of two node sets |
+ | addition |
- | subtraction |
* | multiplication |
div | division |
= | equal to |
!= | not equal to |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
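- A small sketch (my own, not from the original notes) showing two of these operators inside XPath predicates, reusing test.html from section 1.3.
python
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
# "or": li nodes whose class is either item-0 or item-inactive
print(html.xpath('//li[@class="item-0" or @class="item-inactive"]/a/text()'))
# "<=" combined with position(): the first two li nodes
print(html.xpath('//li[position() <= 2]/a/text()'))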
1.12 Selecting by position
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
# select the first li node
result = html.xpath("//li[1]/a/text()")
print(result)
# select the last li node
result = html.xpath("//li[last()]/a/text()")
print(result)
# select nodes at restricted positions (here, the first two)
result = html.xpath("//li[position() < 3]/a/text()")
print(result)
# select the third li node from the end
result = html.xpath("//li[last() - 2]/a/text()")
print(result)
1.13 Node axes
python
from lxml import etree
html = etree.parse("./test.html", etree.HTMLParser())
# ancestor axis: all ancestor nodes
result = html.xpath("//li[1]/ancestor::*")
print(result)
result = html.xpath("//li[1]/ancestor::div")
print(result)
# attribute axis: all attribute values
result = html.xpath("//li[1]/attribute::*")
print(result)
# child axis: all direct children
result = html.xpath("//li[1]/child::a[@href=\"link1.html\"]")
print(result)
# descendant axis: all descendant nodes
result = html.xpath("//li[1]/descendant::span")
print(result)
# following axis: all nodes after the current node
result = html.xpath("//li[1]/following::*[2]")
print(result)
# following-sibling axis: all following siblings of the current node
result = html.xpath("//li[1]/following-sibling::*")
print(result)
2. Using Beautiful Soup
- Parses web pages by relying on their structure and attributes
2.1 Parsers
- Beautiful Soup relies on a parser to do the actual parsing
- Parsers supported by Beautiful Soup
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, reasonable speed, tolerant of malformed documents | Poor fault tolerance in versions before Python 2.7.3 / Python 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, tolerant of malformed documents | Requires the C library to be installed |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast, the only parser that supports XML | Requires the C library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance, parses documents the way a browser does, produces valid HTML5 | Very slow, depends on an external Python package |
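- To illustrate the Usage column, a small sketch (my own, not from the original notes) feeding the same deliberately broken markup to two of the parsers listed above.
python
from bs4 import BeautifulSoup

markup = "<p>Hello<li>item"  # deliberately malformed
# Python's built-in parser
print(BeautifulSoup(markup, "html.parser").prettify())
# lxml's HTML parser (requires lxml to be installed)
print(BeautifulSoup(markup, "lxml").prettify())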
2.2 Installation
python
pip3 install beautifulsoup4
pip3 install lxml
2.3 Basic usage
python
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>Test</title>
</head>
<body>
<p class="title" name="name"><b>Test</b></p>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
# output with standard indentation
print(soup.prettify())
print(soup.title.string)
- Beautiful Soup automatically fixes up the markup when it is initialized
2.4 Node selectors
python
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test</title></head>
<body>
<p class="title" name="name"><b>Test</b></p>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)
- This way of selecting only returns the first matching node; any later matches are ignored
2.5 Extracting information
Getting the node name (name)
python
print(soup.title.name)
Getting attributes (attrs)
python
print(soup.p.attrs)
print(soup.p.attrs["name"])
Getting the content (string)
python
print(soup.p.string)
Nested selection
python
print(soup.head.title)
print(soup.head.title.string)
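- The snippets above reuse the soup object built in section 2.4; below is a self-contained sketch (my own arrangement) tying name, attrs, string and nested selection together.
python
from bs4 import BeautifulSoup

html = '<html><head><title>Test</title></head><body><p class="title" name="name"><b>Test</b></p></body></html>'
soup = BeautifulSoup(html, "lxml")
print(soup.title.name)         # node name: 'title'
print(soup.p.attrs)            # all attributes: {'class': ['title'], 'name': 'name'}
print(soup.p.attrs["name"])    # a single attribute: 'name'
print(soup.p.string)           # content: 'Test' (the text of the single <b> child)
print(soup.head.title.string)  # nested selection: 'Test'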
2.6 Associated selection
Children and descendants
- Direct children
- contents
python
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test</title></head>
<body>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.p.contents)
- children
python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
- Descendants (descendants)
python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
print(i, child)
Parents and ancestors
- Parent node (parent)
python
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.a.parent)
- Ancestor nodes (parents)
python
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
Sibling nodes
python
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
Test
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>
ABC
<a href="http://example.com/B" class="abc" id="link2">B</a>
abc
<a href="http://example.com/C" class="abc" id="link3">C </a>
123
</p>
"""
soup = BeautifulSoup(html, "lxml")
print("next sibling", soup.a.next_sibling)
print("prev sibling", soup.a.previous_sibling)
print("next siblings", list(enumerate(soup.a.next_siblings)))
print("prev siblings", list(enumerate(soup.a.previous_siblings)))
Extracting information
python
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p class="story">
Test
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a><a href="http://example.com/B" class="abc" id="link2">B</a><a href="http://example.com/C" class="abc" id="link3">C </a>
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print(soup.a.parents)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs["class"])
2.7 Method selectors
find_all
python
find_all(name, attrs, recursive, text, **kwargs)
name
python
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>
Hello
</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="ul"))
print(type(soup.find_all(name="ul")[0]))
for ul in soup.find_all(name="ul"):
print(ul.find_all(name="li"))
for li in ul.find_all(name="li"):
print(li.string)
attrs
python
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>
Hello
</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(attrs={"id": "list-1"}))
print(soup.find_all(attrs={"class": "element"}))
# equivalent to
print(soup.find_all(id="list-1"))
print(soup.find_all(class_="element"))
text
- This parameter can be passed a regular expression object
- The result is a list of all node texts that match the regular expression
python
import re
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-body">
<a>link1</a>
<a>link2</a>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(text=re.compile("link")))
find
- Returns only the first matching element
python
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>
Hello
</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find(name="ul"))
print(type(soup.find(name="ul")))
print(soup.find(class_="list"))
find_parents
- Gets all ancestor nodes
find_parent
- Gets the direct parent node
find_next_siblings
- Gets all following siblings
find_next_sibling
- Gets the first following sibling
find_previous_siblings
- Gets all preceding siblings
find_previous_sibling
- Gets the first preceding sibling
find_all_next
- Gets all matching nodes after the current node
find_next
- Gets the first matching node after the current node
find_all_previous
- Gets all matching nodes before the current node
find_previous
- Gets the first matching node before the current node (a usage sketch for a few of these methods follows below)
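- A minimal sketch (my own, not from the original notes) using find_parent, find_next_sibling and find_all_next on a trimmed version of the panel HTML from the find_all examples.
python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
first_li = soup.find(name="li")
print(first_li.find_parent(name="ul"))        # the nearest ul ancestor
print(first_li.find_next_sibling(name="li"))  # the <li>B</li> node
print(first_li.find_all_next(name="li"))      # every li after the current node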
2.8 CSS selectors
Example
- select("node1 node2 node3 ···")
- Selects every node3 under every node2 under every node1
- A node name can be replaced with
- .classname
- #idname
python
from bs4 import BeautifulSoup
html = """
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">a</li>
<li class="element">b</li>
<li class="element">c</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.select(".panel .panel-body .list"))
print(soup.select("ul li"))
print(soup.select("#list-2 .element"))
print(type(soup.select("ul")[0]))
Nested selection
python
soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
print(ul.select("li"))
Getting attributes
- Square-bracket indexing
- .attrs[]
python
soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
print(ul["id"])
# equivalent to
print(ul.attrs["id"])
Getting text
- .get_text()
- .string
python
soup = BeautifulSoup(html, "lxml")
for li in soup.select("li"):
print(f"text: {li.get_text()}")
# equivalent to
print(f"string: {li.string}")
3. Using pyquery
3.1 Installation
python
pip3 install pyquery
3.2 Initialization
Initializing from a string
python
from pyquery import PyQuery
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = PyQuery(html)
print(doc("li"))
Initializing from a URL
python
import requests
from pyquery import PyQuery
url = "https://www.bilibili.com/"
doc = PyQuery(url=url)
# equivalent to
doc = PyQuery(requests.get(url).text)
print(doc("title"))
Initializing from a file
python
from pyquery import PyQuery
doc = PyQuery(filename="test.html")
print(doc("li"))
3.3 Basic CSS selectors
python
from pyquery import PyQuery
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = PyQuery(html)
print(doc("#container .list li"))
print(type(doc("#container .list li")))
for item in doc("#container .list li").items():
print(item.text())
3.4 Finding nodes
Direct children
python
from pyquery import PyQuery
doc = PyQuery(html)
items = doc(".list")
lis = items.children()
print(type(lis))
print(lis)
Descendants
python
from pyquery import PyQuery
doc = PyQuery(html)
items = doc(".list")
print(type(items))
print(items)
lis = items.find("li")
print(type(lis))
print(lis)
Direct parent
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
items = doc(".list")
container = items.parent()
print(type(container))
print(container)
Ancestors
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
items = doc(".list")
parents = items.parents()
print(type(parents))
print(parents)
Sibling nodes
- Caveat: the order in which siblings() outputs the nodes can be unexpected
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
lis = doc(".list .item-0.active")
for item in lis.siblings().items():
print(item.text())
3.5 Traversing nodes
Getting a single node
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
print(str(li))
Getting multiple nodes
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
lis = doc("li").items()
print(type(lis))
for li in lis:
print(li, type(li))
Getting information
- The information of interest
- Attributes
- Text
Getting attributes (attr())
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
a = doc(".item-0.active a")
print(a, type(a))
print(a.attr("href"))
# equivalent to
print(a.attr.href)
a = doc("a")
print(a, type(a))
print(a.attr("href"))
for item in a.items():
print(item.attr("href"))
Getting text
Getting the inner text (.text())
- Greedy
- If the selection contains multiple pieces of text, all of them are output, separated by spaces
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
a = doc(".item-0.active a")
print(a)
print(a.text())
Getting the inner HTML (.html())
- Non-greedy
- Only the HTML of the first matched node is output
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
print(li.html())
A common pitfall of .html()
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
a = doc(".item-0.active a")
# the printed output looks the same, but the two objects are not the same thing
print(li.html())
print(a)
# type is str
print(type(li.html()))
# type is pyquery.pyquery.PyQuery
print(type(a))
3.6 Manipulating nodes
addClass and removeClass
- Dynamically change a node's class attribute
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
li.remove_class("active")
print(li)
li.add_class("active")
print(li)
attr, text and html
python
from pyquery import PyQuery
html = """
<ul class="warp">
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
li.attr("name", "link")
print(li)
li.text("changed item")
print(li)
li.html("<span>changed item</span>")
print(li)
remove
python
from pyquery import PyQuery
html = """
<ul class="warp">
ABC
<p>abc</p>
</ul>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
warp = doc(".warp")
warp.find("p").remove()
print(warp.text())
3.7 Pseudo-class selectors
python
from pyquery import PyQuery
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = PyQuery(html)
# the first li node
li = doc("li:first-child")
print(li)
# the last li node
li = doc("li:last-child")
print(li)
# the second li node
li = doc("li:nth-child(2)")
print(li)
# li nodes whose index is greater than 2
li = doc("li:gt(2)")
print(li)
# li nodes at positions that are multiples of 2
li = doc("li:nth-child(2n)")
print(li)
# li nodes containing the text "second"
li = doc("li:contains(second)")
print(li)
4. Using parsel
- Can parse HTML and XML
- Supports extracting and modifying content with XPath and CSS selectors
- Also integrates regular-expression-based extraction
4.1 Installation
python
pip3 install parsel
4.2 Initialization
python
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
items1 = selector.css(".item-0")
print(len(items1), type(items1), items1)
items2 = selector.xpath("//li[contains(@class, 'item-0')]")
print(len(items2), type(items2), items2)
- Selectors returned by the .css() method show an xpath expression in their repr, not a css one
- CSS selectors are translated into XPath (sketched below)
- The translation is done by the underlying cssselect library
- The actual node extraction is always performed with XPath
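- A small sketch (my own, assuming the cssselect package that parsel builds on) of how a CSS selector maps to the XPath that actually runs.
python
from cssselect import GenericTranslator
from parsel import Selector

# cssselect turns the CSS selector into an equivalent XPath expression
print(GenericTranslator().css_to_xpath(".item-0"))

# the repr of a parsel Selector produced by .css() shows the XPath it will evaluate
selector = Selector(text='<li class="item-0">first item</li>')
print(selector.css(".item-0"))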
4.3 Extracting text
Extracting a single text (get())
- Only returns the first matching result
python
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
items = selector.css(".item-0")
print(f"type of items: {type(items)}")
for item in items:
print(f"type of item: {type(item)}")
text = item.xpath(".//text()").get()
print(text)
Extracting multiple texts (getall())
python
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
result = selector.xpath("//li[contains(@class, \"item-0\")]//text()").getall()
print(result)
result = selector.css(".item-0 *::text").getall()
print(result)
4.4 Extracting attributes
python
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
result = selector.css(".item-0.active a::attr(href)").get()
print(result)
result = selector.xpath("//li[contains(@class, \"item-0\") and contains(@class, \"active\")]/a/@href").get()
print(result)
4.5 Extracting with regular expressions
python
from parsel import Selector
html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
selector = Selector(text=html)
result = selector.css(".item-0").re("link(.*?)\"")
print(result)
result = selector.css(".item-0 *::text").re(".*item")
print(result)
# get the first match
result = selector.css(".item-0").re_first(">(.*?item)")
print(result)