Python Web Scraping with Selenium: Automating Baidu Image Search (百度识图) and Batch-Saving Excel Book Information


1. Automatically uploading an image to Baidu image search

python

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By
edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://graph.baidu.com/pcpage/index?tpl_from=pc')
driver.find_element(by=By.NAME, value='file').send_keys(r"D:\7.18\图1.jpg")
input('')

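Code walkthrough: automatically uploading an image to Baidu image search

This code uses the Selenium library to open the Baidu image search page and upload a local image automatically. A line-by-line explanation follows: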

python

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

Import the required Selenium modules: webdriver controls the browser, Options configures browser options, and By specifies how page elements are located.

python

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"

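Create an options object for the Edge browser and set the path to the Edge executable. This step is optional: when Edge is installed in its default location, Selenium can usually find it without binary_location being set.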

python

driver = webdriver.Edge(options=edge_options)

Initialize the Edge driver, which creates the driver object used to control the browser.

python

driver.get('https://graph.baidu.com/pcpage/index?tpl_from=pc')

Open the Baidu image search page in the browser.

python

driver.find_element(by=By.NAME, value='file').send_keys(r"D:\7.18\图1.jpg")

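Locate the file upload element on the page (the element whose name attribute is file).
Use send_keys() to send the path of the local image to the upload element. For a file input, this selects the file directly, so the image is uploaded automatically.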
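Because send_keys() receives the path as plain text, a mistyped or relative path typically only fails once the browser tries to use it. The following is a minimal sketch, assuming a hypothetical helper named resolve_upload_path (it is not part of Selenium), that checks the path before it is handed to the upload element:

```python
from pathlib import Path


def resolve_upload_path(image_path: str) -> str:
    """Return an absolute path to an existing file, suitable for send_keys()."""
    path = Path(image_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    return str(path)


# Hypothetical usage with the driver created earlier:
# upload = driver.find_element(By.NAME, "file")
# upload.send_keys(resolve_upload_path(r"D:\7.18\图1.jpg"))
```

With this check in place, a wrong path fails immediately with a clear message.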

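python

input('')

The program pauses here until the user presses Enter; it then continues and the browser closes. This is typically used while debugging to pause the program so the result can be inspected.

Fetching page resources without launching a browser

The code above uses Selenium WebDriver, which has to launch a real browser to perform its operations. If only a page's static resources are needed (HTML content, JSON data, and so on), a lighter-weight library such as requests can be used:

python

import requests

url = 'https://graph.baidu.com/pcpage/index?tpl_from=pc'
# A timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Get the HTML content of the page
    html_content = response.text
    print(html_content)
else:
    print(f"Request failed, status code: {response.status_code}")

More Selenium WebDriver operations
Element interaction methods:
- click(): simulates a mouse click on the element; commonly used for buttons, links, and other clickable elements.
- send_keys(text): simulates typing text into an input box or similar element.
- clear(): clears the content of an input box.
- submit(): submits a form; usually called on the form's submit button.
Browser navigation methods:
- back(): simulates the browser's Back button and returns to the previous page.
- forward(): simulates the browser's Forward button and moves to the next page.
- refresh(): reloads the current page.
- get(url): opens the page at the given URL.
- current_url: a property that returns the URL of the current page.
Browser control methods:
- close(): closes the current browser window.
- quit(): quits the whole browser session and closes all windows.
- maximize_window(): maximizes the browser window.
- set_window_size(width, height): sets the size of the browser window.
Element location methods:
- find_element(By.ID, value): locates an element by its ID.
- find_element(By.NAME, value): locates an element by its name attribute.
- find_element(By.CSS_SELECTOR, value): locates an element by a CSS selector.
- find_element(By.XPATH, value): locates an element by an XPath expression.
- find_elements(): takes the same arguments and returns a list of all matching elements.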
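As an illustration of the WebDriver navigation methods listed above (get(), back(), forward(), refresh(), and the current_url property), the following sketch records the URL after each step. visit_and_return is a hypothetical helper name. It only calls methods that a WebDriver instance provides, so the driver is passed in rather than created here:

```python
def visit_and_return(driver, first_url, second_url):
    """Visit two pages, then go back, forward, and refresh, recording each URL."""
    visited = []
    driver.get(first_url)
    visited.append(driver.current_url)
    driver.get(second_url)
    visited.append(driver.current_url)
    driver.back()       # same effect as the browser's Back button
    visited.append(driver.current_url)
    driver.forward()    # same effect as the browser's Forward button
    visited.append(driver.current_url)
    driver.refresh()    # reload the current page
    visited.append(driver.current_url)
    return visited
```

With a real browser the recorded URLs may differ from the ones requested, for example when a site redirects.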

2. Batch-collecting information on Excel-related books

python

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time


def get_info(driver):
    time.sleep(5)
    eles_p = driver.find_elements(By.CLASS_NAME, 'book_item')
    print(f"Found {len(eles_p)} book items")  # debug output
    for ele_p in eles_p:
        ele_p.click()
        handles = driver.window_handles
        driver.switch_to.window(handles[-1])
        time.sleep(5)
        name = driver.find_element(By.CLASS_NAME, 'book-name').text
        price = driver.find_element(By.CLASS_NAME, 'price').text
        author = driver.find_element(By.CLASS_NAME, 'book-author').text
        file.write(f'Title: {name}\tPrice: {price}\tAuthor: {author}\n')
        print(f"Saved: {name}")  # debug output
        driver.close()
        driver.switch_to.window(handles[-2])  # back to the previous tab (index -2)


file = open('excel图书汇总.txt', 'w', encoding='utf-8')
edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
elements = driver.find_elements(By.TAG_NAME, "input")
elements[0].send_keys("excel" + Keys.RETURN)
handles = driver.window_handles
driver.switch_to.window(handles[1])
driver.find_element(By.ID, "booksMore").click()
handles = driver.window_handles
driver.switch_to.window(handles[-1])
get_info(driver)
page_num = 1  # current page number
while True:
    try:
        # Try to find the next-page button
        next_button = driver.find_element(By.CLASS_NAME, 'ivu-page-next')

        # Check whether the button is disabled (the disabled-state class differs between sites; adjust as needed)
        if 'ivu-page-disabled' in next_button.get_attribute('class'):
            print(f"Reached the last page (page {page_num}); stopping")
            break

        next_button.click()
        page_num += 1
        print(f"Moved to page {page_num}")
        time.sleep(3)  # wait for the page to load
        get_info(driver)

    except Exception as e:
        print(f"Error while scraping: {e}")
        print(f"The last page scraped successfully was page {page_num}")
        break
file.close()
driver.quit()  # close the browser

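Code walkthrough: batch-collecting information on Excel-related books

This code uses Selenium to automate the browser, collects information on Excel-related books from the Posts & Telecom Press (人民邮电出版社) website, and saves it to a text file. A detailed explanation follows:

Overall flow

The program consists of the following parts:

- Browser initialization and search
- Information extraction function
- Pagination and loop handling
- Exception handling and resource cleanup

Detailed code explanation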

python

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time

Import the required libraries: the Selenium modules control the browser, and the time module adds waits.

python

def get_info(driver):
    time.sleep(5)
    eles_p = driver.find_elements(By.CLASS_NAME, 'book_item')
    print(f"Found {len(eles_p)} book items")  # debug output
    for ele_p in eles_p:
        ele_p.click()
        handles = driver.window_handles
        driver.switch_to.window(handles[-1])
        time.sleep(5)
        name = driver.find_element(By.CLASS_NAME, 'book-name').text
        price = driver.find_element(By.CLASS_NAME, 'price').text
        author = driver.find_element(By.CLASS_NAME, 'book-author').text
        file.write(f'Title: {name}\tPrice: {price}\tAuthor: {author}\n')
        print(f"Saved: {name}")  # debug output
        driver.close()
        driver.switch_to.window(handles[-2])  # back to the previous tab (index -2)

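The get_info function extracts book information from the current page:
- Wait 5 seconds for the page to finish loading
- Find all book item elements
- Iterate over the book items, clicking each one to open its detail page
- Switch to the newly opened tab
- Extract the title, price, and author, and write them to the file
- Close the current tab and return to the book list page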
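The function relies on the new tab being the last entry in window_handles and the list page being the second-to-last, an ordering that WebDriver does not promise. A minimal sketch of an alternative, assuming a hypothetical helper named find_new_handle, compares the handles before and after the click and remembers the list page's handle explicitly:

```python
def find_new_handle(handles_before, handles_after):
    """Return the single window handle that appears only in handles_after."""
    new_handles = [h for h in handles_after if h not in handles_before]
    if len(new_handles) != 1:
        raise RuntimeError(
            f"Expected exactly one new window, found {len(new_handles)}"
        )
    return new_handles[0]


# Hypothetical usage inside the loop of get_info:
# list_handle = driver.current_window_handle
# before = driver.window_handles
# ele_p.click()
# driver.switch_to.window(find_new_handle(before, driver.window_handles))
# ... extract the fields ...
# driver.close()
# driver.switch_to.window(list_handle)
```

This keeps the tab switching independent of the position of a handle in the list.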

python

file = open('excel图书汇总.txt', 'w', encoding='utf-8')
edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')

- Open the file used to save the data
- Configure and launch the Edge browser
- Visit the Posts & Telecom Press website

python

elements = driver.find_elements(By.TAG_NAME, "input")
elements[0].send_keys("excel" + Keys.RETURN)
handles = driver.window_handles
driver.switch_to.window(handles[1])
driver.find_element(By.ID, "booksMore").click()
handles = driver.window_handles
driver.switch_to.window(handles[-1])
get_info(driver)

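- Find the search box, type "excel", and submit the search
- Switch to the search results page
- Click the "more books" button (element ID booksMore)
- Switch to the newly opened book list page
- Call get_info to extract the book information on the first page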

python

page_num = 1  # current page number
while True:
    try:
        # Try to find the next-page button
        next_button = driver.find_element(By.CLASS_NAME, 'ivu-page-next')

        # Check whether the button is disabled
        if 'ivu-page-disabled' in next_button.get_attribute('class'):
            print(f"Reached the last page (page {page_num}); stopping")
            break

        next_button.click()
        page_num += 1
        print(f"Moved to page {page_num}")
        time.sleep(3)  # wait for the page to load
        get_info(driver)

    except Exception as e:
        print(f"Error while scraping: {e}")
        print(f"The last page scraped successfully was page {page_num}")
        break

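This implements automatic pagination:
- Find the next-page button
- Check whether the button is disabled (meaning the last page has been reached)
- Click the next-page button and update the page number
- Wait for the page to load, then continue extracting information
- Use try-except to catch exceptions, so that an error ends the loop with a message and the cleanup code after the loop still runs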
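The disabled check in the loop is a substring test on the class attribute, so it would also match a longer class name that merely contains the same text, and it fails if the attribute is missing and get_attribute returns None. A small sketch, assuming a hypothetical helper named is_disabled, compares whole class names instead:

```python
def is_disabled(class_attr, disabled_class="ivu-page-disabled"):
    """Return True if disabled_class is one of the space-separated class names."""
    return disabled_class in (class_attr or "").split()


# Hypothetical usage in the pagination loop:
# if is_disabled(next_button.get_attribute('class')):
#     break
```

The default class name is the one used in the original loop. As the code comments note, other sites may mark a disabled button differently.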

python

file.close()
driver.quit()  # close the browser

Close the file and the browser to release resources.
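These two calls are only reached if nothing earlier in the script raises outside the try block, for example during the search steps. One way to make the cleanup unconditional is a with statement for the file and try/finally for the browser. The sketch below assumes hypothetical names: run_scrape, plus the callables create_driver and scrape supplied by the caller:

```python
def run_scrape(create_driver, scrape, output_path):
    """Run scrape(driver, file) and always release the file and the browser."""
    with open(output_path, "w", encoding="utf-8") as file:
        driver = create_driver()
        try:
            scrape(driver, file)
        finally:
            driver.quit()  # runs even if scrape() raises


# Hypothetical usage:
# run_scrape(lambda: webdriver.Edge(options=edge_options),
#            my_scrape_function, 'excel图书汇总.txt')
```

Passing the file into scrape also removes the need for get_info to depend on a module-level file variable.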

Optimization suggestions

Improve the waiting mechanism:
- Use explicit waits instead of fixed time.sleep() calls to make the code more stable
- Example:

python

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait (up to 10 seconds) until the book item elements are present in the page
eles_p = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'book_item'))
)
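Conceptually, an explicit wait polls a condition until it succeeds or a timeout expires, so it returns as soon as the condition holds instead of always sleeping for the full duration. The following is a simplified illustration of that idea in plain Python, not Selenium's actual implementation; wait_until and its parameters are hypothetical names:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

When the page loads quickly, such a wait finishes after the first successful check, whereas time.sleep(5) always costs five seconds.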

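Strengthen exception handling:

- Add exception handling around the element lookups in get_info
- Example:

python

from selenium.common.exceptions import NoSuchElementException

try:
    name = driver.find_element(By.CLASS_NAME, 'book-name').text
except NoSuchElementException:
    # Catch only the "element not found" error instead of using a bare except
    name = "Title not found"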

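Improve data storage:

- Consider saving the data in CSV or Excel format to make later processing easier
- Example:

python

import csv

# name, price, and author are the values extracted in get_info
with open('books.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Price', 'Author'])
    writer.writerow([name, price, author])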
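The example above writes a single row, while the scraper collects many books across pages. A minimal sketch, assuming get_info were adapted to return a list of dictionaries, writes all rows at once; save_books_csv and the keys title, price, and author are hypothetical names:

```python
import csv


def save_books_csv(books, path):
    """Write a list of book dicts to a CSV file that Excel can open directly."""
    fieldnames = ["title", "price", "author"]
    # utf-8-sig adds a byte order mark, which helps Excel detect UTF-8 text
    with open(path, "w", newline="", encoding="utf-8-sig") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(books)
```

The utf-8-sig encoding is chosen here because book titles on the site contain Chinese characters, which Excel may otherwise display incorrectly when opening a plain UTF-8 CSV file.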

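Add logging:

- Use the logging module instead of plain print statements to make debugging and tracing easier
- Example:

python

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Found %d book items", len(eles_p))

By automating browser operations, this program collects book information in batches. With the optimizations above, its stability and maintainability can be improved further.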
