第3章 网页数据的解析提取

目录

1. XPath 的使用

  • XPath(XML Path Language)
    • XML 路径语言
    • 用于在 XML 文档中查找信息
      • html 文档也适用

1.1 XPath 常用规则

表达式 描述
nodename 选取此节点的所有子节点
/ 从当前节点选取直接子节点
// 从当前节点选取子孙节点
. 选取当前节点
... 选取当前节点的父节点
@ 选取属性
  • 示例
python 复制代码
//title[@lang="eng"]

1.2 安装

python 复制代码
pip3 install lxml

1.3 实例引入

python 复制代码
from lxml import etree

text = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
"""

html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
  • etree 模块可以自动修正 HTML 文本

  • 直接读取文本

    • test.html

      html 复制代码
      <div>
          <ul>
              <li class="item-0"><a href="link1.html">first item</a></li>
              <li class="item-1"><a href="link2.html">second item</a></li>
              <li class="item-inactive"><a href="link3.html">third item</a></li>
              <li class="item-1"><a href="link4.html">fourth item</a></li>
              <li class="item-0"><a href="link5.html">fifth item</a>
          </ul>
      </div>
    • test.py

      python 复制代码
      from lxml import etree
      
      html = etree.parse("./test.html", etree.HTMLParser())
      result = etree.tostring(html)
      print(result.decode('utf-8'))

1.4 所有节点

  • 获取所有节点(//*)
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//*")
print(result)
  • 获取所有 li 节点(//li)
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li")
print(result)
  • 使用 // + name 来获取所有名称为 name 的节点

1.5 子节点

  • 获取 li 节点的直接子节点 a
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a")
print(result)
  • 获取 ul 节点的子孙节点 a
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//ul//a")
print(result)

1.6 父节点

  • 获取 href 属性为 link4.html 的 a 节点,然后获取其父节点,再获取父节点的 class 属性
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//a[@href=\"link4.html\"]/../@class")
print(result)
  • 通过 parent:: 获取父节点
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//a[@href=\"link4.html\"]/parent::*/@class")
print(result)

1.7 属性匹配

  • 使用 @ 过滤
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li[@class=\"item-0\"]")
print(result)

1.8 文本获取

  • 获取 class 属性为 item-0 的 li 节点,并获取其直接子节点的文本
python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li[@class=\"item-0\"]/text()")
print(result)
  • 获取 class 属性为 item-0 的 li 节点,并获取其内部的文本

    • 先选取 a 节点再获取文本

      python 复制代码
      from lxml import etree
      
      html = etree.parse("./test.html", etree.HTMLParser())
      result = html.xpath("//li[@class=\"item-0\"]/a/text()")
      print(result)
    • 使用 //

      python 复制代码
      from lxml import etree
      
      html = etree.parse("./test.html", etree.HTMLParser())
      result = html.xpath("//li[@class=\"item-0\"]//text()")
      print(result)
  • 获取子孙节点下的所有文本://

  • 获取特定子孙节点下的所有文本:/特定节点/

1.9 属性获取

python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a/@href")
print(result)

1.10 属性多值匹配

  • li 属性有两个值
    • li
    • li-first
  • 使用之前的属性匹配将无法获取
python 复制代码
from lxml import etree

text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath("//li[@class=\"li\"]/a/text()")
print(result)
  • 使用 contains 进行属性的多值匹配
python 复制代码
from lxml import etree

text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath("//li[contains(@class, \"li\")]/a/text()")
print(result)

1.11 多属性匹配

python 复制代码
from lxml import etree

text = """
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath(
    "//li[contains(@class, \"li\") and @name=\"item\"]/a/text()")
print(result)
  • 运算符
运算符 描述
or
and
mod 求余
| 求两个节点集
+
-
*
div
= 等于
!= 不等于
< 小于
<= 小于等于
> 大于
>= 大于等于

1.12 按序选择

python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())

# 选取第一个节点
result = html.xpath("//li[1]/a/text()")
print(result)

# 选取最后一个节点
result = html.xpath("//li[last()]/a/text()")
print(result)

# 获取限定位置的节点
result = html.xpath("//li[position() < 3]/a/text()")
print(result)

# 获取倒数第三个节点
result = html.xpath("//li[last() - 2]/a/text()")
print(result)

1.13 节点轴选择

python 复制代码
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())

# ancestor轴: 获取所有祖先节点
result = html.xpath("//li[1]/ancestor::*")
print(result)
result = html.xpath("//li[1]/ancestor::div")
print(result)

# attribute轴: 获取所有属性值
result = html.xpath("//li[1]/attribute::*")
print(result)

# child轴: 获取所有直接子节点
result = html.xpath("//li[1]/child::a[@href=\"link1.html\"]")
print(result)

# descendant轴: 获取所有子孙节点
result = html.xpath("//li[1]/descendant::span")
print(result)

# following轴: 获取当前节点之后的所有节点
result = html.xpath("//li[1]/following::*[2]")
print(result)

# following-sibling轴: 获取当前节点之后的所有同级节点
result = html.xpath("//li[1]/following-sibling::*")
print(result)

2. Beautiful Soup 的使用

  • 借助网页的结构和属性等特性来解析网页

2.1 解析器

  • Beautiful Soup 在解析时是依赖解析器的
  • Beautiful Soup 支持的解析器
解析器 使用方法 优势 劣势
Python 标准库 BeautifulSoup(markup, "html.parser") Python 的内置标准库、执行速度适中、文档容错能力强 Python 2.7.3 或 Python 3.2.2 前的版本中文容错能力差
LXML HTML 解析器 BeautifulSoup(markup, "lxml") 速度快、文档容错能力强 需要安装 C 语言库
LXML XML 解析器 BeautifulSoup(markup, "xml") 速度快、唯一支持 XML 的解析器 需要安装 C 语言库
html5lib BeautifulSoup(markup, "html5lib") 提供最好的容错性、以浏览器的方式解析文档、生成HTML5格式的文档 速度慢、不依赖外部扩展

2.2 安装

python 复制代码
pip3 install beautifulsoup4
pip3 install lxml

2.3 基本使用

python 复制代码
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <p class="title" name="name"><b>Test</b></p>
        <p class="story">A B C
        <a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
        <a href="http://example.com/B" class="abc" id="link2">B</a> and
        <a href="http://example.com/C" class="abc" id="link3">C </a>;
        </p>
        <p class="story">···</p>
"""

soup = BeautifulSoup(html, "lxml")

# 以标准的缩进格式输出
print(soup.prettify())
print(soup.title.string)
  • Beautiful Soup 在初始化时会自动修正格式

2.4 节点选择器

python 复制代码
from bs4 import BeautifulSoup

html = """
<html>
    <head><title>Test</title></head>
    <body>
        <p class="title" name="name"><b>Test</b></p>
        <p class="story">A B C
        <a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
        <a href="http://example.com/B" class="abc" id="link2">B</a> and
        <a href="http://example.com/C" class="abc" id="link3">C </a>;
        </p>
        <p class="story">···</p>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)
  • 此种选择方式只能选择到第一个匹配的节点,后面的其他节点都会忽略

2.5 提取信息

获取节点名称(name)

python 复制代码
print(soup.title.name)

获取属性(attrs)

python 复制代码
print(soup.p.attrs)
print(soup.p.attrs["name"])

获取内容(string)

python 复制代码
print(soup.p.string)

嵌套选择

python 复制代码
print(soup.head.title)
print(soup.head.title.string)

2.6 关联选择

子节点和子孙节点

  • 直接子节点

    • contents

      python 复制代码
      from bs4 import BeautifulSoup
      
      html = """
      <html>
          <head><title>Test</title></head>
          <body>
              <p class="story">A B C
              <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
              <a href="http://example.com/B" class="abc" id="link2">B</a> and
              <a href="http://example.com/C" class="abc" id="link3">C </a>;
              </p>
              <p class="story">···</p>
      """
      
      soup = BeautifulSoup(html, "lxml")
      
      print(soup.p.contents)
    • children

      python 复制代码
      from bs4 import BeautifulSoup
      
      soup = BeautifulSoup(html, "lxml")
      
      print(soup.p.children)
      for i, child in enumerate(soup.p.children):
          print(i, child)
  • 子孙节点(descendants)

python 复制代码
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

父节点和祖先节点

  • 父节点(parent)
python 复制代码
from bs4 import BeautifulSoup

html = """
<html>
    <body>
    	<p class="story">
        <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
        <a href="http://example.com/B" class="abc" id="link2">B</a> and
        <a href="http://example.com/C" class="abc" id="link3">C </a>;
        </p>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.a.parent)
  • 祖先节点(parents)
python 复制代码
from bs4 import BeautifulSoup

html = """
<html>
    <body>
    	<p class="story">
        <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
        <a href="http://example.com/B" class="abc" id="link2">B</a> and
        <a href="http://example.com/C" class="abc" id="link3">C </a>;
        </p>
"""

soup = BeautifulSoup(html, "lxml")

print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

兄弟节点

python 复制代码
from bs4 import BeautifulSoup

html = """
<html>
    <body>
    	<p class="story">
        Test
        <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>
        ABC
        <a href="http://example.com/B" class="abc" id="link2">B</a>
        abc
        <a href="http://example.com/C" class="abc" id="link3">C </a>
        123
        </p>
"""

soup = BeautifulSoup(html, "lxml")

print("next sibling", soup.a.next_sibling)
print("prev sibling", soup.a.previous_sibling)
print("next siblings", list(enumerate(soup.a.next_siblings)))
print("prev siblings", list(enumerate(soup.a.previous_siblings)))

提取信息

python 复制代码
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
        Test
        <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a><a href="http://example.com/B" class="abc" id="link2">B</a><a href="http://example.com/C" class="abc" id="link3">C </a>
        </p>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print(soup.a.parents)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs["class"])

2.7 方法选择器

find_all

python 复制代码
findall(name, attrs, recursive, text, **kwargs)

name

python 复制代码
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>
            Hello
        </h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.find_all(name="ul"))
print(type(soup.find_all(name="ul")[0]))
for ul in soup.find_all(name="ul"):
    print(ul.find_all(name="li"))
    for li in ul.find_all(name="li"):
        print(li.string)

attrs

python 复制代码
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>
            Hello
        </h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.find_all(attrs={"id": "list-1"}))
print(soup.find_all(attrs={"class": "element"}))
# 等效于
print(soup.find_all(id="list-1"))
print(soup.find_all(class_="element"))

text

  • 该参数应为正则表达式对象
  • 返回结果是由所有与该正则表达式相匹配的节点文本组成的列表
python 复制代码
import re
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-body">
        <a>link1</a>
        <a>link2</a>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.find_all(text=re.compile("link")))

find

  • 只能获取第一个匹配的节点元素
python 复制代码
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>
            Hello
        </h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.find(name="ul"))
print(type(soup.find(name="ul")))
print(soup.find(class_="list"))

find_parents

  • 获取所有的祖先节点

find_parent

  • 获取直接父节点

find_next_siblings

  • 获取后面的所有兄弟节点

find_next_sibling

  • 获取后面的第一个兄弟节点

find_previous_siblings

  • 获取前面的所有兄弟节点

find_previous_sibling

  • 获取前面第一个兄弟节点

find_all_next

  • 获取节点后面所有符合条件的节点

find_next

  • 获取节点后面第一个符合条件的节点

find_all_previous

  • 获取节点前面所有符合条件的节点

find_previous

  • 获取节点前面第一个符合条件的节点

2.8 CSS 选择器

实例

  • select(节点1,节点2,节点3 ···)
    • 获取所有 节点1 下的所有 节点2 下的所有 节点3 ···
  • 节点可以换为
    • .class名
    • #id名
python 复制代码
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.select(".panel .panel-body .list"))
print(soup.select("ul li"))
print(soup.select("#list-2 .element"))
print(type(soup.select("ul")[0]))

嵌套选择

python 复制代码
soup = BeautifulSoup(html, "lxml")

for ul in soup.select("ul"):
    print(ul.select("li"))

获取属性

  • 方括号获取
  • .attrs[]
python 复制代码
soup = BeautifulSoup(html, "lxml")

for ul in soup.select("ul"):
    print(ul["id"])
    # 等效于
    print(ul.attrs["id"])

获取文本

  • .get_text()
  • .string
python 复制代码
soup = BeautifulSoup(html, "lxml")

for li in soup.select("li"):
    print(f"text: {li.get_text()}")
    # 等效于
    print(f"string: {li.string}")

3. pyquery 的使用

3.1 安装

python 复制代码
pip3 install pyquery

3.2 初始化

字符串初始化

python 复制代码
from pyquery import PyQuery

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

doc = PyQuery(html)
print(doc("li"))

URL初始化

python 复制代码
from pyquery import PyQuery

doc = PyQuery(url="https://www.bilibili.com/")
# 等效于
doc = PyQuery(requests.get(url).text)
print(doc("title"))

文件初始化

python 复制代码
from pyquery import PyQuery

doc = PyQuery(filename="test.html")
print(doc("li"))

3.3 基本CSS选择器

python 复制代码
from pyquery import PyQuery

html = """
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

doc = PyQuery(html)

print(doc("#container .list li"))
print(type(doc("#container .list li")))

for item in doc("#container .list li").items():
    print(item.text())

3.4 查找节点

直接子节点

python 复制代码
from pyquery import PyQuery

doc = PyQuery(html)
items = doc(".list")
lis = items.children()

print(type(lis))
print(lis)

子孙节点

python 复制代码
from pyquery import PyQuery

doc = PyQuery(html)

items = doc(".list")
print(type(items))
print(items)

lis = items.find("li")
print(type(lis))
print(lis)

直接父节点

python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
items = doc(".list")
container = items.parent()

print(type(container))
print(container)

祖先节点

python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
items = doc(".list")
parents = items.parents()

print(type(parents))
print(parents)

兄弟节点(有问题)

  • 输出优先级
python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
lis = doc(".list .item-0.active")

for item in lis.siblings().items():
    print(item.text())

3.5 遍历节点

获取单个节点

python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
li = doc(".item-0.active")

print(li)
print(str(li))

获取多个节点

python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
lis = doc("li").items()

print(type(lis))

for li in lis:
    print(li, type(li))

获取信息

  • 信息
    • 属性
    • 文本

获取属性(attr())

python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
a = doc(".item-0.active a")

print(a, type(a))
print(a.attr("href"))
# 等效于
print(a.attr.href)

a = doc("a")

print(a, type(a))
print(a.attr("href"))
for item in a.items():
    print(item.attr("href"))

获取文本

获取内部文本(.text())
  • 贪婪
  • 如果拥有多个文本内容均会输出,且以空格分隔
python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
a = doc(".item-0.active a")

print(a)
print(a.text())
获取内部的HTML文本(.html())
  • 非贪婪
  • 只会输出第一个匹配的文本
python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
li = doc(".item-0.active")

print(li)
print(li.html())
.html()的相关误区
python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
li = doc(".item-0.active")
a = doc(".item-0.active a")

# 输出内容相同,但内容并不相等
print(li.html())
print(a)

# 类型为str
print(type(li.html()))

# 类型为pyquery.pyquery.PyQuery
print(type(a))

3.6 节点操作

addClass和removeClass

  • 动态改变节点的Class属性
python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)
li = doc(".item-0.active")
print(li)

li.remove_class("active")
print(li)

li.add_class("active")
print(li)

attr、text和html

python 复制代码
from pyquery import PyQuery

html = """
<ul class="warp">
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
"""

doc = PyQuery(html)

li = doc(".item-0.active")
print(li)

li.attr("name", "link")
print(li)

li.text("changed item")
print(li)

li.html("<span>changed item</span>")
print(li)

remove

python 复制代码
from pyquery import PyQuery

html = """
<ul class="warp">
    ABC
    <p>abc<p>
</ul>
"""

doc = PyQuery(html)

li = doc(".item-0.active")
warp = doc(".warp")
warp.find("p").remove()
print(warp.text())

3.7 伪类选择器

python 复制代码
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""

doc = PyQuery(html)

# 第一个li节点
li = doc("li:first-child")
print(li)

# 最后一个li节点
li = doc("li:last-child")
print(li)


# 第二个li节点
li = doc("li:nth-child(2)")
print(li)

# 下标大于2的节点
li = doc("li:gt(2)")
print(li)

# 以2为倍数位置的节点
li = doc("li:nth-child(2n)")
print(li)

# 包含second文本的节点
li = doc("li:contains(second)")
print(li)

4. parsel的使用

  • 可以接卸HTML和XML
  • 支持使用XPath和CSS选择器对内容进行提取和修改
  • 融合了正则表达式的提取功能

4.1 安装

python 复制代码
pip3 install parsel

4.2 初始化

python 复制代码
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

selector = Selector(text=html)

items1 = selector.css(".item-0")
print(len(items1), type(items1), items1)

items2 = selector.xpath("//li[contains(@class, 'item-0')]")
print(len(items2), type(items2), items2)
  • .css()方法提取的结果是xpath属性而不是css属性
    • CSS选择器被转化成XPath
      • 由底层cssselect库实现
    • 真正用于节点提取的是XPath

4.3 提取文本

提取单个文本(get())

  • 只能获取第一个匹配的对象
python 复制代码
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

selector = Selector(text=html)

items = selector.css(".item-0")
print(f"type of items: {type(items)}")
for item in items:
    print(f"type of item: {type(item)}")
    text = item.xpath(".//text()").get()
    print(text)

提取多个文本(getall())

python 复制代码
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

selector = Selector(text=html)

result = selector.xpath("//li[contains(@class, \"item-0\")]//text()").getall()
print(result)

result = selector.css(".item-0 *::text").getall()
print(result)

4.4 提取属性

python 复制代码
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

selector = Selector(text=html)

result = selector.css(".item-0.active a::attr(href)").get()
print(result)

result = selector.xpath("//li[contains(@class, \"item-0\") and contains(@class, \"active\")]/a/@href").get()
print(result)

4.5 正则提取

python 复制代码
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

selector = Selector(text=html)

result = selector.css(".item-0").re("link(.*?)\"")
print(result)

result = selector.css(".item-0 *::text").re(".*item")
print(result)

# 获取第一个符合
result = selector.css(".item-0").re_first(">(.*?item)")
print(result)
相关推荐
思则变1 小时前
[Pytest] [Part 2]增加 log功能
开发语言·python·pytest
漫谈网络1 小时前
WebSocket 在前后端的完整使用流程
javascript·python·websocket
try2find3 小时前
安装llama-cpp-python踩坑记
开发语言·python·llama
泡泡以安4 小时前
安卓高版本HTTPS抓包:终极解决方案
爬虫·https·安卓逆向·安卓抓包
博观而约取4 小时前
Django ORM 1. 创建模型(Model)
数据库·python·django
精灵vector5 小时前
构建专家级SQL Agent交互
python·aigc·ai编程
q567315235 小时前
Java Selenium反爬虫技术方案
java·爬虫·selenium
Zonda要好好学习5 小时前
Python入门Day2
开发语言·python
Vertira6 小时前
pdf 合并 python实现(已解决)
前端·python·pdf
太凉6 小时前
Python之 sorted() 函数的基本语法
python