Python Web Crawler Design (Part 2)

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <h1>This is a web page</h1>
    <a href="https://www.baidu.com">Click to go to Baidu</a>
    <a href="https://www.google.com">Click to go to Google</a>
</body>
</html>

Next, create a .py file in the same directory as the HTML file:

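import bs4

# Open the local HTML file and parse it. "html.parser" is the name of the
# parser used to parse the HTML document (Python's built-in one).
with open(r"D:\学习\Python\001.html", "r", encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

tags = soup.find_all("a")  # to get only the first match, use find() instead

for tag in tags:
    print(tag.text)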

Output: the text of the two a tags, one per line.

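Of course, instead of an opened file, the first argument can be a string that contains an HTML document. We can also specify a URL and fetch the page with getHTML().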
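As a minimal sketch of the string case, the snippet below passes an inline HTML string directly to BeautifulSoup. The `html_doc` string and its link text are made up for illustration. The URL case is not shown because getHTML() is not defined in this part; presumably its returned HTML text would be passed in the same way.

```python
import bs4

# Hypothetical inline HTML, loosely mirroring the two-link page above.
html_doc = """
<html><body>
    <h1>This is a web page</h1>
    <a href="https://www.baidu.com">Baidu</a>
    <a href="https://www.google.com">Google</a>
</body></html>
"""

# BeautifulSoup accepts a string of markup as well as an open file object.
soup = bs4.BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching tag; find() returns only the first one.
link_texts = [tag.text for tag in soup.find_all("a")]
first_link = soup.find("a")

print(link_texts)
print(first_link.text)
```

Parsing from a string is also convenient for quick experiments, since it avoids depending on a file path that only exists on one machine.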

6. Advanced BeautifulSoup

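The code above only finds the content of a single tag, or prints every tag with a given name. As mentioned earlier, tags can be nested, and a tag can carry many attributes (such as class and id). So how do we find the result we want among all those attributes and levels of nesting?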

HTML code:

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>test</title>
</head>
<body>
    <span id="css">
        <p>This is a p tag</p>
    </span>
    <span id="html">
        <div class="p1">This is div tag 1</div>
        <div class="p2">This is div tag 2</div>
        <div class="p3">This is div tag 3</div>
        <div class="p4">
            <scy class="scy" id="hello">
                <a class="one" href="https://www.baidu.com">Click to go to Baidu</a>
                <a class="two" href="https://www.google.com">Click to go to Google</a>
            </scy>
        </div>


    </span>
</body>
</html>

Python code:

import bs4

# Open the file and parse its contents
with open(r"D:\学习\Python\001.html", encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# First, find a span tag whose id is "html"
diva = soup.find("span", attrs={"id": "html"})

if diva is not None:  # find() returns None when nothing matches
    # Inside that span, look for div tags whose class is "p4"
    for x in diva.find_all("div", attrs={"class": "p4"}):
        print(x.text)

        # x is the div with class "p4". find_all() never yields None,
        # so no extra check is needed before searching inside x
        # for a tags whose class is "one".
        for y in x.find_all("a", attrs={"class": "one"}):
            print(y.text)
            print(y["href"])  # print the value of the matching tag's href attribute

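Output: the full text of the div with class p4 (the text of both links, including the surrounding whitespace), followed by the text of the a tag with class one and its href, https://www.baidu.com.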
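As an alternative to the nested find() and find_all() calls in the Python code above, the same lookup can probably be written more compactly with a CSS selector through BeautifulSoup's select() method. This is only a sketch: the inline `html_doc` is a shortened, hypothetical version of the page, and the names `matches` and `results` are introduced here for illustration.

```python
import bs4

# Shortened, hypothetical copy of the nested structure from the HTML above.
html_doc = """
<span id="css"><p>This is a p tag</p></span>
<span id="html">
    <div class="p1">div 1</div>
    <div class="p4">
        <scy class="scy" id="hello">
            <a class="one" href="https://www.baidu.com">Baidu</a>
            <a class="two" href="https://www.google.com">Google</a>
        </scy>
    </div>
</span>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")

# "span#html div.p4 a.one" reads: an a tag with class "one", somewhere inside
# a div with class "p4", somewhere inside a span with id "html".
matches = soup.select("span#html div.p4 a.one")

# Collect (link text, href) pairs; select() returns an empty list if nothing matches.
results = [(a.text, a["href"]) for a in matches]
print(results)
```

Because select() returns an empty list rather than None when nothing matches, the loop-free version needs no explicit None check. The step-by-step find() approach is still useful when each intermediate tag has to be inspected on its own.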

That's all for Python Web Crawler Design (Part 2) :)