
1. Introduction
In travel-industry data analysis, public-opinion monitoring, and competitor research, travel-note data from platforms such as Ctrip is highly valuable. However, Ctrip's travel-note pages are usually loaded dynamically (Ajax, JavaScript rendering), so the traditional **Requests** + **BeautifulSoup** approach struggles to retrieve the complete data.
Solution: use **Selenium** to simulate browser behavior and parse the dynamically loaded travel-note content with **BeautifulSoup** or **lxml**. This article walks through how to crawl Ctrip's dynamically loaded travel notes with **Python + Selenium** and save the results to a **CSV** file.
2. Technology Choices and Preparation
2.1 Technology Stack
- Python 3.8+ (the latest stable release is recommended)
- Selenium (browser automation)
- BeautifulSoup4 (HTML parsing)
- Pandas (data storage and processing)
- ChromeDriver (paired with the Chrome browser)
2.2 Environment Setup
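Assuming a working Python 3.8+ interpreter, the libraries listed in 2.1 can be installed with pip:

```bash
pip install selenium beautifulsoup4 pandas lxml
```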
2.3 Downloading the Browser Driver
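ChromeDriver must match the locally installed Chrome version and can be downloaded from the official ChromeDriver site. As an optional alternative (not used in the scripts below), the webdriver-manager package can fetch a matching driver automatically; a minimal sketch:

```python
# pip install webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Downloads a ChromeDriver matching the installed Chrome and starts the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```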
3. Steps for Crawling Ctrip's Dynamically Loaded Travel Notes
3.1 Analyzing the Structure of the Travel-Note Page
Example target URL (using "Beijing" as an example):
https://you.ctrip.com/travels/beijing1/t3.html
Key observations:
- Dynamic loading: the travel-note list is fetched dynamically by scrolling or clicking "load more".
- Ajax requests: the data endpoints can be inspected with the browser developer tools (F12 → Network → XHR).
- Anti-crawling mechanisms:
  - User-Agent checks
  - IP limits (use proxies or throttle the request frequency; a minimal throttling sketch follows this list)
  - Login checks (some content is only visible after logging in)
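For the rate-limit point above, the simplest mitigation is a randomized delay between page actions; a minimal sketch:

```python
import random
import time

# Sleep a random 1.5-4 seconds between requests/scrolls to reduce the chance of being blocked
time.sleep(random.uniform(1.5, 4.0))
```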
3.2 Simulating the Browser with Selenium
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from bs4 import BeautifulSoup

# Path to ChromeDriver
driver_path = "chromedriver.exe"  # replace with your own driver path
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # headless mode (optional)
options.add_argument("--disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Selenium 4 takes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service=Service(driver_path), options=options)
```
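WebDriverWait and expected_conditions are imported above; a typical use is to wait explicitly for the travel-note list container (class name taken from section 3.4) before parsing, rather than relying only on fixed sleeps. This would run after driver.get(url) in section 3.3:

```python
# Wait up to 10 seconds for the note-list container to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "journalslist"))
)
```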
3.3 Opening the Target Page and Scrolling to Load Data
```python
def scroll_to_bottom(driver, max_scroll=5):
    """Simulate scrolling to trigger lazy loading."""
    for _ in range(max_scroll):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # wait for new data to load

url = "https://you.ctrip.com/travels/beijing1/t3.html"
driver.get(url)
scroll_to_bottom(driver)  # scroll to load more travel notes
```
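Some list pages expose a "load more" button instead of (or in addition to) infinite scroll. The sketch below shows one way to click it repeatedly with an explicit wait; the locator (`a.load-more`) is a placeholder and must be replaced with the selector found on the actual page:

```python
from selenium.common.exceptions import TimeoutException

def click_load_more(driver, max_clicks=5, locator=(By.CSS_SELECTOR, "a.load-more")):
    """Click a 'load more' button up to max_clicks times, stopping once it disappears.
    The default locator is a placeholder, not Ctrip's real selector."""
    for _ in range(max_clicks):
        try:
            button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable(locator)
            )
        except TimeoutException:
            break  # no button left -> everything is loaded
        button.click()
        time.sleep(2)  # give the new batch time to render
```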
3.4 Parsing the Travel-Note Data
```python
def parse_travel_notes(driver):
    """Parse the loaded travel notes from the rendered page source."""
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    notes = soup.find_all('div', class_='journalslist')  # travel-note list containers
    data = []
    for note in notes:
        title = note.find('a', class_='journal-title').get_text(strip=True)
        author = note.find('a', class_='nickname').get_text(strip=True)
        date = note.find('span', class_='time').get_text(strip=True)
        views = note.find('span', class_='num').get_text(strip=True)
        content = note.find('p', class_='journal-content').get_text(strip=True)
        data.append({
            "标题": title,
            "作者": author,
            "发布时间": date,
            "阅读量": views,
            "内容摘要": content
        })
    return data

travel_data = parse_travel_notes(driver)
```
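The class names above reflect the page structure at the time of writing; if Ctrip changes its markup, `note.find(...)` returns `None` and the chained `.get_text()` raises an `AttributeError`. A small defensive helper (an illustrative addition, not part of the original script) avoids that:

```python
def safe_text(node, tag, class_name, default=""):
    """Return the stripped text of a child element, or a default if it is missing."""
    found = node.find(tag, class_=class_name)
    return found.get_text(strip=True) if found else default

# Usage inside the loop, e.g.: title = safe_text(note, 'a', 'journal-title')
```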
3.5 Saving the Data to CSV
```python
df = pd.DataFrame(travel_data)
df.to_csv("ctrip_travel_notes.csv", index=False, encoding="utf_8_sig")  # utf_8_sig avoids garbled Chinese in Excel
```
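If notes are collected in several batches (for example, one destination at a time), the same call can append to an existing file; a sketch that writes the header only when the file does not exist yet:

```python
import os

csv_path = "ctrip_travel_notes.csv"
df.to_csv(csv_path, mode="a", index=False,
          header=not os.path.exists(csv_path), encoding="utf_8_sig")
```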
4. Complete Implementation
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time
import os
import shutil
import tempfile
import zipfile


def scroll_to_bottom(driver, max_scroll=5):
    """Simulate scrolling to trigger lazy loading."""
    for _ in range(max_scroll):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)


def parse_travel_notes(driver):
    """Parse the travel-note data from the rendered page."""
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    notes = soup.find_all('div', class_='journalslist')
    data = []
    for note in notes:
        title = note.find('a', class_='journal-title').get_text(strip=True)
        author = note.find('a', class_='nickname').get_text(strip=True)
        date = note.find('span', class_='time').get_text(strip=True)
        views = note.find('span', class_='num').get_text(strip=True)
        content = note.find('p', class_='journal-content').get_text(strip=True)
        data.append({
            "标题": title,
            "作者": author,
            "发布时间": date,
            "阅读量": views,
            "内容摘要": content
        })
    return data


def main():
    # Proxy configuration
    proxyHost = "www.16yun.cn"
    proxyPort = "5445"
    proxyUser = "16QMSOML"
    proxyPass = "280651"

    # Browser options
    options = Options()
    # Set the User-Agent
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

    # Proxy settings in the dict shape expected by selenium-wire; plain Selenium does not
    # read this dict, so the authenticated proxy is actually applied via the extension below.
    proxy_options = {
        'proxy': {
            'http': f'http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}',
            'https': f'https://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}',
            'no_proxy': 'localhost,127.0.0.1'
        }
    }

    # Method 1: pass the proxy via ChromeOptions (basic; does not support authentication)
    # options.add_argument(f'--proxy-server=http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}')

    # Method 2: load an extension that handles proxy authentication (recommended).
    # The extension is generated on the fly from a manifest and a background script.
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version": "22.0.0"
    }
    """
    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: parseInt(%s)
            },
            bypassList: ["localhost"]
        }
    };
    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }
    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        {urls: ["<all_urls>"]},
        ['blocking']
    );
    """ % (proxyHost, proxyPort, proxyUser, proxyPass)

    # Build the extension in a temporary directory
    plugin_dir = tempfile.mkdtemp()
    with open(os.path.join(plugin_dir, "manifest.json"), 'w') as f:
        f.write(manifest_json)
    with open(os.path.join(plugin_dir, "background.js"), 'w') as f:
        f.write(background_js)

    # Package the extension
    proxy_plugin_path = os.path.join(plugin_dir, "proxy_auth_plugin.zip")
    with zipfile.ZipFile(proxy_plugin_path, 'w') as zp:
        zp.write(os.path.join(plugin_dir, "manifest.json"), "manifest.json")
        zp.write(os.path.join(plugin_dir, "background.js"), "background.js")

    # Attach the extension to ChromeOptions (note: extensions are not loaded in headless mode)
    options.add_extension(proxy_plugin_path)

    driver = None
    try:
        # Launch the browser with the proxy; Selenium 4 takes the driver path via a Service object
        driver = webdriver.Chrome(service=Service("chromedriver.exe"), options=options)

        # Open the page and scroll to load more notes
        url = "https://you.ctrip.com/travels/beijing1/t3.html"
        driver.get(url)
        scroll_to_bottom(driver)

        # Parse the data
        travel_data = parse_travel_notes(driver)

        # Save the data
        df = pd.DataFrame(travel_data)
        df.to_csv("ctrip_travel_notes.csv", index=False, encoding="utf_8_sig")
        print("Crawl finished; data saved to ctrip_travel_notes.csv")
    finally:
        if driver is not None:
            driver.quit()
        # Remove the temporary extension files
        shutil.rmtree(plugin_dir)


if __name__ == "__main__":
    main()
```
5. Further Optimization
5.1 Countering Anti-Crawling Measures
- Random User-Agent: use the **fake_useragent** library to rotate UA strings dynamically (see the sketch after this list).
- IP proxy pool: combine **requests** with **proxies** to get around IP rate limits.
- Simulated login: handle travel notes that are only visible after logging in.
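A minimal sketch of User-Agent rotation with **fake_useragent**; in practice this would be folded into the option setup from section 3.2:

```python
from fake_useragent import UserAgent
from selenium import webdriver

ua = UserAgent()
options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={ua.random}")  # a different UA string on each run
driver = webdriver.Chrome(options=options)
```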
5.2 Data Enrichment
- Sentiment analysis: use **SnowNLP** or **TextBlob** to score the sentiment of each travel note (see the sketch after this list).
- Keyword extraction: use **jieba** word segmentation to extract popular attraction keywords.
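A sketch of both ideas, assuming the CSV produced above with its 内容摘要 column; the output column names (情感得分, 关键词) are illustrative:

```python
import pandas as pd
import jieba.analyse
from snownlp import SnowNLP

df = pd.read_csv("ctrip_travel_notes.csv")

# Sentiment score in [0, 1]: closer to 1 means more positive
df["情感得分"] = df["内容摘要"].astype(str).apply(lambda t: SnowNLP(t).sentiments)

# Top keywords per note summary (TF-IDF based)
df["关键词"] = df["内容摘要"].astype(str).apply(
    lambda t: ",".join(jieba.analyse.extract_tags(t, topK=5))
)

df.to_csv("ctrip_travel_notes_enriched.csv", index=False, encoding="utf_8_sig")
```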
5.3 Distributed Crawling
- Scrapy + Redis: scale out and speed up the crawl.
- Multithreading / asynchronous crawling: use **asyncio** or **aiohttp** to speed up requests (see the sketch after this list).
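A minimal asyncio + aiohttp sketch for fetching several pages concurrently; this suits plain HTTP endpoints (for example, the Ajax interfaces identified in section 3.1) rather than JavaScript-rendered pages, and the URL list is a placeholder:

```python
import asyncio
import aiohttp

async def fetch(session, url):
    """Fetch one page; the header and URLs are placeholders for real endpoints."""
    async with session.get(url, headers={"User-Agent": "Mozilla/5.0"}) as resp:
        return await resp.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, u) for u in urls]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://you.ctrip.com/travels/beijing1/t3.html"]))
    print(f"Fetched {len(pages)} pages")
```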