Task and goal: automatically fetch the titles and hot-search indices from Baidu's real-time hot search board.

Title element on the page: <div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 </div>
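To make the target concrete, here is a minimal sketch of pulling the text out of an element like the one above using only Python's standard library. The class name comes from the page; the helper names `ClassTextExtractor` and `extract_titles` and the sample headline are hypothetical, chosen for illustration. The script generated later in this article uses selenium for the same job.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # > 0 while inside a matching <div>
        self.buffer = []    # text fragments of the current match
        self.results = []   # finished, whitespace-stripped texts

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target in classes:
            self.depth += 1  # track nesting so the matching </div> is found

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_titles(html, target="c-single-text-ellipsis"):
    """Return the stripped text of each <div class="...target...">."""
    parser = ClassTextExtractor(target)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample markup with a placeholder headline
sample = '<div class="c-single-text-ellipsis"> Example headline </div>'
print(extract_titles(sample))
```

The surrounding whitespace inside the div is stripped, which is why the later steps can store the text directly as a title.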

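Step 1: enter the following prompt in deepseek:

You are a Python web-scraping expert. Write a Python script that completes the following scraping task:

In the folder F:\aivideo, create a new Excel file: topbaidu.xlsx

Set the chromedriver path to: "D:\Program Files\chromedriver125\chromedriver.exe"

Use selenium to open the page: https://top.baidu.com/board?tab=realtime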

The request headers are:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Host: top.baidu.com
Referer: https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all
Sec-Ch-Ua: "Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

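Parse the page source and print it.

Locate the div tags with class="c-single-text-ellipsis", extract their text as the hot-search titles, and save them to column 1 of topbaidu.xlsx.

Locate the div tags with class="hot-index_1Bl1a", extract their text as the hot-search indices, and save them to column 2 of topbaidu.xlsx.

Notes:

Print progress information to the screen at every step.

After parsing each page, pause for a random 1 to 10 seconds.

Set request headers to cope with anti-scraping measures.

DataFrame.append was deprecated in pandas 1.4.0 and removed in pandas 2.0. Use the pd.concat function in place of append.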
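A note on the request headers in this prompt: selenium's WebDriver API has no dedicated call for custom request headers, so a script can end up applying only the User-Agent through a Chrome option. One possible way to send the others with Chrome is the DevTools command Network.setExtraHTTPHeaders, reached through the driver's execute_cdp_cmd method. The sketch below is an illustration under assumptions, not part of the generated script: `parse_headers`, `apply_extra_headers`, and the `BROWSER_MANAGED` exclusion list are hypothetical names and choices.

```python
# Headers the browser manages itself; overriding them is rarely useful
# (this exclusion list is an assumption, not something from the prompt).
BROWSER_MANAGED = {"host", "connection", "accept-encoding"}


def parse_headers(text):
    """Turn 'Name: value' lines into a dict, skipping blanks and empty values."""
    headers = {}
    for line in text.splitlines():
        # Split at the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name and value and name.lower() not in BROWSER_MANAGED:
            headers[name] = value
    return headers


def apply_extra_headers(driver, headers):
    """Ask Chrome (via the DevTools protocol) to send `headers` on every request.

    `driver` is expected to be a selenium Chrome/Chromium WebDriver, which
    provides execute_cdp_cmd(); call this before driver.get(url).
    """
    driver.execute_cdp_cmd("Network.enable", {})
    driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})


raw = """
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Host: top.baidu.com
Referer: https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all
"""
print(parse_headers(raw))
```

With a real driver, the call would be `apply_extra_headers(driver, parse_headers(raw))` placed before `driver.get(...)`. Whether the extra headers help against a given site's anti-scraping checks is not something this sketch can promise.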

Step 2: review the Python code generated by deepseek:

import os
import time
import random
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Set the path to chromedriver
chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"

# Set the request headers
headers = {
    'authority': 'top.baidu.com',
    'method': 'GET',
    'path': '/board?tab=realtime',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': '__root_domain_v=.baidu.com; _qddaz=QD.484716194472545; _ntes_origin_from=sogou; _ntes_nuid=af525edefbd3d363cd5876dbe902d85b; s_n_f_l_n3=01cb466d244b7db51716517702302; _antanalysis_s_id=1716517702995; UserProvince=%u5168%u56FD; ne_analysis_trace_id=1716518441938; vinfo_n_f_l_n3=01cb466d244b7db5.1.0.1716517702302.0.1716518672850',
    'if-modified-since': 'Fri, 24 May 2024 02:40:05 GMT',
    'priority': 'u=0, i',
    'referer': 'https://www.sogou.com/link?url=hedJjaC291M4oaTBlXc5yCiioa5eDVgw',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

# Set Chrome options (only the User-Agent from the headers dict is applied here)
chrome_options = Options()
chrome_options.add_argument(f'user-agent={headers["user-agent"]}')

# Set up the Chrome service
service = Service(chromedriver_path)

# Launch the Chrome browser
driver = webdriver.Chrome(service=service, options=chrome_options)

# Open the page
url = 'https://top.baidu.com/board?tab=realtime'
driver.get(url)

# Wait for the page to finish loading
time.sleep(5)

# Get the page source
html = driver.page_source
print("Page source:")
print(html)

# Create the Excel file
excel_path = os.path.join('F:\\aivideo', 'topbaidu.xlsx')
df = pd.DataFrame(columns=['Title', 'Hot index'])
df.to_excel(excel_path, index=False)

# Locate the hot-search titles and indices
titles = driver.find_elements(By.CLASS_NAME, 'c-single-text-ellipsis')
indices = driver.find_elements(By.CLASS_NAME, 'hot-index_1Bl1a')

# Collect the titles and indices to save to the Excel file
data = []
for title, index in zip(titles, indices):
    title_text = title.text
    index_text = index.text

    # Print progress information
    print(f"Title: {title_text}")
    print(f"Hot index: {index_text}")

    # Add the row to the list
    data.append({'Title': title_text, 'Hot index': index_text})

# Random pause of 1 to 10 seconds after the page has been parsed
time.sleep(random.randint(1, 10))

# Use concat to add the data to the DataFrame
df = pd.concat([df, pd.DataFrame(data)], ignore_index=True)
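The loop above pairs titles with indices purely by position, and zip() stops at the shorter list without any warning. As a minimal sketch of a more defensive pairing step, the helper below works on plain strings (for example, the `.text` values already read from the elements). It assumes the index text is a run of digits, possibly with surrounding whitespace or thousands separators; the function name `build_rows` and the keys `title` and `hot_index` are hypothetical.

```python
def build_rows(title_texts, index_texts):
    """Pair titles with indices by position and convert each index to an int."""
    if len(title_texts) != len(index_texts):
        # zip() would silently drop the unmatched tail, so report it instead
        print(f"Warning: {len(title_texts)} titles vs {len(index_texts)} indices")
    rows = []
    for title, raw_index in zip(title_texts, index_texts):
        digits = raw_index.replace(",", "").strip()
        # Keep the raw text when the value is not a plain number
        index = int(digits) if digits.isdecimal() else raw_index.strip()
        rows.append({"title": title.strip(), "hot_index": index})
    return rows


# Placeholder values, not real data from the board
rows = build_rows(
    [" Example headline A ", "Example headline B"],
    [" 4950000 ", "4,811,234"],
)
print(rows)
```

Storing the index as an integer lets the column sort numerically once the rows are loaded into a DataFrame, for instance with `pd.DataFrame(rows)`.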