
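In modern web automation, Playwright offers Python, Java, and .NET bindings that share the same underlying implementation, which simplifies cross-team collaboration and lowers maintenance costs. However, omitting necessary configuration such as a proxy IP can easily cause failures or get you restricted by the target site. This article works through a cautionary example in four steps (faulty code → problem analysis → fix → lessons learned) to show how to use a crawler proxy (sample domain, port, username, and password) together with other settings to search https://www.dongchedi.com for owner-community Q&A posts by car-model keyword, then store and analyze the data.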
Faulty code examples
1. Python version (faulty example)
python
from playwright.sync_api import sync_playwright

def scrape(keyword):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # ❌ no proxy configured
        page = browser.new_page()  # ❌ no Cookie or User-Agent set
        page.goto(f"https://www.dongchedi.com/all-posts?search={keyword}")  # ❌ timing ignored: no wait for rendered content
        content = page.content()
        # ... parsing logic follows
        browser.close()
2. Java version (faulty example)
java
import com.microsoft.playwright.*;

public class DongchediScraper {
    public static void main(String[] args) {
        try (Playwright pw = Playwright.create()) {
            Browser browser = pw.chromium().launch(); // ❌ no proxy configured
            Page page = browser.newPage(); // ❌ no headers set
            page.navigate("https://www.dongchedi.com/all-posts?search=" + args[0]);
            String html = page.content();
            // ... further processing
            browser.close();
        }
    }
}
3. .NET version (faulty example)
csharp
using Microsoft.Playwright;
using System.Threading.Tasks;

class Program {
    public static async Task Main(string[] args) {
        using var playwright = await Playwright.CreateAsync();
        var browser = await playwright.Chromium.LaunchAsync(); // ❌ proxy omitted
        var page = await browser.NewPageAsync(); // ❌ UA and Cookie omitted
        await page.GotoAsync($"https://www.dongchedi.com/all-posts?search={args[0]}");
        var html = await page.ContentAsync();
        // ... further processing
        await browser.CloseAsync();
    }
}
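Resulting problems
- No proxy: the target site detects frequent requests from the same IP and blocks it, so requests fail or return HTTP 429 (Too Many Requests).
- Missing Cookie/UA: without a realistic browser environment, requests often trigger anti-scraping checks and come back as a CAPTCHA or a redirect to the login page.
- Timing errors (Python): a missing await in the async API, or, in the sync API, reading the page before dynamically rendered content has finished loading, yields incomplete data or timeout exceptions.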
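The three symptoms above can be told apart programmatically before any parsing happens. The following is a minimal sketch, not the site's actual behavior: the marker strings in `LOGIN_HINTS` and `CAPTCHA_HINTS` and the function name `classify_response` are assumptions made for illustration.

```python
# Hypothetical markers: the real site's login path and CAPTCHA markup may differ.
LOGIN_HINTS = ("/login", "passport")
CAPTCHA_HINTS = ("captcha", "验证码")


def classify_response(status: int, final_url: str, html: str) -> str:
    """Return a coarse label for the blocking symptoms described above."""
    # HTTP 429 means the server is rate limiting this client.
    if status == 429:
        return "rate_limited"
    # A final URL that looks like a login page suggests a redirect.
    lowered_url = final_url.lower()
    if any(hint in lowered_url for hint in LOGIN_HINTS):
        return "login_redirect"
    # CAPTCHA pages usually mention the challenge somewhere in the markup.
    lowered_html = html.lower()
    if any(hint in lowered_html for hint in CAPTCHA_HINTS):
        return "captcha"
    return "ok"
```

With the sync API, `page.goto(...)` returns a response object whose `status` can be passed in together with `page.url` and `page.content()`; anything other than `"ok"` is a reason to stop and revisit the proxy or header setup.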
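Fix process
1. Core idea
- Use Playwright's browser.new_context (Python; browser.newContext in JavaScript) or the corresponding options (Java/.NET) to configure the proxy, Cookie, and User-Agent.
- Inject the UA and Cookie in one go through extraHTTPHeaders, or through the addCookies / setExtraHTTPHeaders methods, so that the session stays valid.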
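As a complement to the header-based approach, the sketch below shows how a raw Cookie string could be converted into the list-of-dicts shape that `context.add_cookies` accepts in Playwright for Python, and how the context options could be assembled in one place. The helper names `parse_cookie_header` and `build_context_kwargs`, and the `.dongchedi.com` cookie domain, are assumptions for illustration.

```python
from typing import Dict, List


def parse_cookie_header(cookie_header: str, domain: str) -> List[Dict[str, str]]:
    """Split 'a=1; b=2' into cookie dicts (name, value, domain, path)."""
    cookies = []
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if not sep or not name:
            continue  # skip empty or malformed fragments
        cookies.append({"name": name, "value": value, "domain": domain, "path": "/"})
    return cookies


def build_context_kwargs(server: str, username: str, password: str, user_agent: str) -> dict:
    """Collect the keyword arguments for browser.new_context in one place."""
    return {
        "proxy": {"server": server, "username": username, "password": password},
        "user_agent": user_agent,
    }
```

A possible usage is `context = browser.new_context(**build_context_kwargs(...))` followed by `context.add_cookies(parse_cookie_header("sessionid=abcdef12345; locale=zh-CN", ".dongchedi.com"))`. Setting cookies on the context lets the browser manage them per domain instead of sending one fixed Cookie header with every request.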
2. Python fix
python
from playwright.sync_api import sync_playwright

def scrape(keyword):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # Create a browser context with proxy and headers
        # See the 16yun (亿牛云) crawler proxy settings: www.16yun.cn
        context = browser.new_context(
            proxy={
                "server": "proxy.16yun.cn:12345",  # proxy domain and port
                "username": "16YUN",  # proxy username
                "password": "16IP",  # proxy password
            },
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",  # custom UA
            extra_http_headers={
                "Cookie": "sessionid=abcdef12345; locale=zh-CN"  # sample Cookie
            },
        )
        page = context.new_page()  # create the page
        page.goto(f"https://www.dongchedi.com/all-posts?search={keyword}")  # sync API blocks until the load event
        # Data extraction example
        posts = page.query_selector_all(".post-card")
        for post in posts:
            title = post.query_selector("h3").inner_text()
            answer = post.query_selector(".answer").inner_text()
            print(title, answer)
        context.close()
        browser.close()
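Two details of the Python version are worth hardening: a keyword containing spaces or Chinese characters should be URL-encoded, and the scraper should wait explicitly for the post cards rather than rely on the load event alone. The sketch below assumes the same `.post-card`, `h3`, and `.answer` selectors used in the article's example; `build_search_url` and `fetch_posts` are illustrative names, and `page` is expected to be a Playwright sync `Page`.

```python
from urllib.parse import quote

BASE_URL = "https://www.dongchedi.com/all-posts"  # path taken from the article's example


def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword so spaces and non-ASCII text survive the query string."""
    return f"{BASE_URL}?search={quote(keyword, safe='')}"


def fetch_posts(page, keyword: str, timeout_ms: int = 15000):
    """Navigate, wait for the post cards to render, then extract (title, answer) pairs."""
    page.goto(build_search_url(keyword), wait_until="domcontentloaded")
    # Block until at least one card exists; raises a timeout error otherwise.
    page.wait_for_selector(".post-card", timeout=timeout_ms)
    results = []
    for post in page.query_selector_all(".post-card"):
        title = post.query_selector("h3")
        answer = post.query_selector(".answer")
        # A card may lack one of the elements, so guard against None.
        results.append((
            title.inner_text() if title else "",
            answer.inner_text() if answer else "",
        ))
    return results
```

Inside `scrape`, the extraction loop could then be replaced by `for title, answer in fetch_posts(page, keyword): ...`.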
3. Java fix
java
import com.microsoft.playwright.*;
import com.microsoft.playwright.options.Proxy;
import java.util.List;
import java.util.Map;

public class DongchediScraper {
    public static void main(String[] args) {
        try (Playwright pw = Playwright.create()) {
            Browser browser = pw.chromium().launch();
            // See the 16yun (亿牛云) crawler proxy settings: www.16yun.cn
            BrowserContext context = browser.newContext(new Browser.NewContextOptions()
                // Proxy takes the server in its constructor; credentials use setters
                .setProxy(new Proxy("proxy.16yun.cn:12345")
                    .setUsername("16YUN")
                    .setPassword("16IP"))
                .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                    + "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36")
                .setExtraHTTPHeaders(Map.of("Cookie", "sessionid=abcdef12345; locale=zh-CN"))
            );
            Page page = context.newPage();
            page.navigate("https://www.dongchedi.com/all-posts?search=" + args[0]);
            List<ElementHandle> posts = page.querySelectorAll(".post-card");
            for (ElementHandle post : posts) {
                System.out.println(post.querySelector("h3").innerText() + " -> " +
                    post.querySelector(".answer").innerText());
            }
            context.close();
            browser.close();
        }
    }
}
4. .NET fix
csharp
using Microsoft.Playwright;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program {
    public static async Task Main(string[] args) {
        using var playwright = await Playwright.CreateAsync();
        var browser = await playwright.Chromium.LaunchAsync();
        // See the 16yun (亿牛云) crawler proxy settings: www.16yun.cn
        var context = await browser.NewContextAsync(new BrowserNewContextOptions {
            Proxy = new Proxy { Server = "proxy.16yun.cn:12345", Username = "16YUN", Password = "16IP" },
            UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
            ExtraHTTPHeaders = new Dictionary<string, string> {
                ["Cookie"] = "sessionid=abcdef12345; locale=zh-CN"
            }
        });
        var page = await context.NewPageAsync();
        await page.GotoAsync($"https://www.dongchedi.com/all-posts?search={args[0]}");
        var posts = await page.QuerySelectorAllAsync(".post-card");
        foreach (var post in posts) {
            var title = await post.QuerySelectorAsync("h3");
            var answer = await post.QuerySelectorAsync(".answer");
            Console.WriteLine(await title.InnerTextAsync() + " -> " + await answer.InnerTextAsync());
        }
        await context.CloseAsync();
        await browser.CloseAsync();
    }
}
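Lessons learned
- Always configure a proxy: this keeps a single machine's IP from being restricted, especially in high-frequency collection; dynamic switching or domain-based access is recommended for better stability.
- Inject Cookie/UA up front: realistic request headers get past simple anti-scraping checks.
- Respect timing: Playwright's sync and async APIs each have their place. Use await consistently with the async API, and in either mode make sure the page has finished loading before you scrape it.
- Keep multi-language examples aligned: the bindings share one underlying API, which reduces team maintenance costs and lets the code samples serve as references for one another.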
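The "dynamic switching" recommendation can be sketched as a round-robin over several proxy endpoints, with a fresh browser context created for each one. This is only an illustration: `make_proxy_rotation` is a hypothetical helper, and any endpoint other than the article's sample `proxy.16yun.cn:12345` is a placeholder.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def make_proxy_rotation(servers: List[str], username: str, password: str) -> Iterator[Dict[str, str]]:
    """Yield Playwright-style proxy dicts in round-robin order, indefinitely."""
    for server in cycle(servers):
        yield {"server": server, "username": username, "password": password}
```

Each call to `next(rotation)` returns the settings for the next endpoint, which can be passed as `browser.new_context(proxy=next(rotation))` whenever a request comes back rate limited.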
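With this cautionary walkthrough, readers can quickly grasp the key configuration points for Playwright across languages, scrape Q&A data from the Dongchedi (懂车帝) owner community in practice, and go on to store and analyze it.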
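The storage and analysis step is not shown in the examples above, so here is one minimal way it could look, using SQLite from the Python standard library. The table layout and the function names `save_posts` and `count_by_keyword` are assumptions chosen for illustration, with `posts` being a list of (title, answer) pairs such as those printed by the scraper.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def save_posts(conn: sqlite3.Connection, keyword: str, posts: Iterable[Tuple[str, str]]) -> None:
    """Append scraped (title, answer) pairs under the given search keyword."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (keyword TEXT, title TEXT, answer TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (keyword, title, answer) VALUES (?, ?, ?)",
        [(keyword, title, answer) for title, answer in posts],
    )
    conn.commit()


def count_by_keyword(conn: sqlite3.Connection) -> Dict[str, int]:
    """A simple analysis: how many posts were collected per keyword."""
    rows = conn.execute(
        "SELECT keyword, COUNT(*) FROM posts GROUP BY keyword ORDER BY keyword"
    ).fetchall()
    return dict(rows)
```

Opening the database with `sqlite3.connect("dongchedi.db")` (a placeholder file name) keeps results across runs, while `sqlite3.connect(":memory:")` is convenient for quick experiments.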