🎯 Audience: Java developers with basic crawler experience
⏱️ Reading time: 35 minutes
📌 What you will learn: dynamic page scraping, multithreaded crawlers, and the WebMagic framework
📖 Table of Contents
1. Dynamic Web Scraping
1.1 Static vs. Dynamic Pages
Static pages:
- The HTML source already contains all of the data
- Jsoup can parse it directly
Dynamic pages:
- Data is loaded dynamically by JavaScript (Ajax requests)
- The data is not in the initial HTML source
1.2 Approaches to Scraping Dynamic Pages
Approach 1: Analyze the Ajax requests (recommended)
```java
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

public class AjaxSpider {
    public static void main(String[] args) throws Exception {
        // The Ajax endpoint discovered in the browser's developer tools
        String apiUrl = "https://example.com/api/products?page=1";
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet request = new HttpGet(apiUrl);
        // Set request headers (mimic a browser)
        request.setHeader("User-Agent", "Mozilla/5.0");
        request.setHeader("Referer", "https://example.com");
        request.setHeader("X-Requested-With", "XMLHttpRequest");
        String json = EntityUtils.toString(
            client.execute(request).getEntity()
        );
        // Parse the JSON response
        JSONObject result = JSON.parseObject(json);
        JSONArray products = result.getJSONArray("data");
        for (int i = 0; i < products.size(); i++) {
            JSONObject product = products.getJSONObject(i);
            String name = product.getString("name");
            Double price = product.getDouble("price");
            System.out.println(name + ": ¥" + price);
        }
        client.close();
    }
}
```
Approach 2: Use Selenium (drive a real browser)
Add the dependency:
```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.16.1</version>
</dependency>
```
Basic usage:
```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import java.util.List;

public class SeleniumSpider {
    public static void main(String[] args) throws InterruptedException {
        // Point to the ChromeDriver binary
        // (optional since Selenium 4.6, which ships Selenium Manager)
        System.setProperty("webdriver.chrome.driver",
                "D:/chromedriver.exe");
        // Configure Chrome options
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");    // headless mode
        options.addArguments("--disable-gpu");
        // Start the browser
        WebDriver driver = new ChromeDriver(options);
        try {
            // Open the page
            driver.get("https://example.com");
            // Crude wait for the dynamic content to load
            // (explicit waits are the better option; see 1.3)
            Thread.sleep(3000);
            // Locate elements
            List<WebElement> products = driver.findElements(
                By.className("product-item")
            );
            for (WebElement product : products) {
                String name = product.findElement(By.className("name"))
                        .getText();
                String price = product.findElement(By.className("price"))
                        .getText();
                System.out.println(name + ": " + price);
            }
        } finally {
            driver.quit();
        }
    }
}
```
1.3 Wait Strategies
```java
import java.time.Duration;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;

// Explicit wait (block until a specific element appears;
// Selenium 4 takes a Duration instead of a raw seconds value)
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
WebElement element = wait.until(
    ExpectedConditions.presenceOfElementLocated(
        By.id("product-list")
    )
);
// Implicit wait (a global setting applied to every findElement call)
driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(10));
```
2. Multithreaded Crawlers
2.1 Why Multithreading?
The single-threaded problem:
- Crawling 1,000 pages at 2 seconds each = 2,000 seconds (about 33 minutes)
The multithreaded gain:
- 10 concurrent threads ≈ 200 seconds (about 3 minutes)
2.2 Using a Thread Pool
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.*;
import java.util.*;

public class MultiThreadSpider {
    public static void main(String[] args) {
        // Create a fixed-size thread pool
        ExecutorService executor = Executors.newFixedThreadPool(10);
        // URLs to crawl
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            urls.add("https://example.com/page/" + i);
        }
        // Submit one task per URL
        for (String url : urls) {
            executor.submit(() -> {
                try {
                    crawl(url);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        // Shut down and wait for the queued tasks to finish
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    private static void crawl(String url) throws Exception {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(5000)
                .get();
        // Parse the data
        String title = doc.title();
        System.out.println(Thread.currentThread().getName()
                + " - " + title);
        // Sleep to avoid hammering the server
        Thread.sleep(1000);
    }
}
```
2.3 Using CompletableFuture
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncSpider {
    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "https://example.com/1",
            "https://example.com/2",
            "https://example.com/3"
        );
        // Kick off one asynchronous crawl per URL
        List<CompletableFuture<String>> futures = urls.stream()
            .map(url -> CompletableFuture.supplyAsync(() -> {
                try {
                    return crawl(url);
                } catch (Exception e) {
                    return "Error: " + url;
                }
            }))
            .collect(Collectors.toList());
        // Wait for every task to complete
        CompletableFuture.allOf(
            futures.toArray(new CompletableFuture[0])
        ).join();
        // Collect the results
        futures.forEach(f -> {
            try {
                System.out.println(f.get());
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }

    private static String crawl(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        return doc.title();
    }
}
```
3. The WebMagic Framework
3.1 Introduction to WebMagic
WebMagic is an open-source Java crawler framework developed in China.
Features:
- ✅ Simple and easy to use
- ✅ Modular design
- ✅ Multithreading support
- ✅ Distributed crawling support
3.2 Adding the Dependencies
```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.6</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.6</version>
</dependency>
```
3.3 Quick Start
```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {
    // Site-level configuration
    private Site site = Site.me()
            .setRetryTimes(3)
            .setSleepTime(1000)
            .setUserAgent("Mozilla/5.0");

    @Override
    public void process(Page page) {
        // Extract links and add them to the crawl queue
        page.addTargetRequests(
            page.getHtml().links().regex("https://github.com/\\w+/\\w+").all()
        );
        // Extract data
        page.putField("author",
            page.getUrl().regex("https://github.com/(\\w+)/.*").toString());
        page.putField("name",
            page.getHtml().xpath("//h1[@class='entry-title']/strong/a/text()").toString());
        page.putField("readme",
            page.getHtml().xpath("//div[@id='readme']/tidyText()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                .thread(5) // 5 threads
                .run();
    }
}
```
3.4 Data Pipelines (Pipeline)
```java
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.util.Map;

public class ConsolePipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        System.out.println("Results for: " + resultItems.getRequest().getUrl());
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}

// Usage
Spider.create(new GithubRepoPageProcessor())
        .addUrl("https://github.com/code4craft")
        .addPipeline(new ConsolePipeline())
        .thread(5)
        .run();
```
3.5 Worked Example: Crawling Zhihu Questions
```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;
import java.util.List;

public class ZhihuPageProcessor implements PageProcessor {
    private Site site = Site.me()
            .setRetryTimes(3)
            .setSleepTime(1000)
            .setUserAgent("Mozilla/5.0");

    @Override
    public void process(Page page) {
        // Question list page: collect links to question pages
        List<String> questionLinks = page.getHtml()
                .css("h2.ContentItem-title a", "href")
                .all();
        page.addTargetRequests(questionLinks);
        // Question detail page
        if (page.getUrl().regex("https://www.zhihu.com/question/\\d+").match()) {
            // Question title
            String title = page.getHtml()
                    .xpath("//h1[@class='QuestionHeader-title']/text()")
                    .toString();
            // Question description
            String description = page.getHtml()
                    .xpath("//div[@class='QuestionRichText']//text()")
                    .toString();
            // Answer count
            String answerCount = page.getHtml()
                    .xpath("//h4[@class='List-headerText']/span/text()")
                    .toString();
            page.putField("title", title);
            page.putField("description", description);
            page.putField("answerCount", answerCount);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new ZhihuPageProcessor())
                .addUrl("https://www.zhihu.com/hot")
                .thread(3)
                .run();
    }
}
```
4. Data Storage
4.1 Storing to a File
```java
import us.codecraft.webmagic.pipeline.FilePipeline;

Spider.create(new MyPageProcessor())
        .addPipeline(new FilePipeline("D:/data/"))
        .run();
```
4.2 Storing to MySQL
```java
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MysqlPipeline implements Pipeline {
    private String url = "jdbc:mysql://localhost:3306/spider";
    private String user = "root";
    private String password = "root";

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String content = resultItems.get("content");
        String sql = "INSERT INTO articles (title, content) VALUES (?, ?)";
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, title);
            ps.setString(2, content);
            ps.executeUpdate();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
4.3 Storing to MongoDB
```xml
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>4.11.1</version>
</dependency>
```
```java
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoPipeline implements Pipeline {
    private MongoClient mongoClient;
    private MongoCollection<Document> collection;

    public MongoPipeline() {
        mongoClient = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase database = mongoClient.getDatabase("spider");
        collection = database.getCollection("articles");
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        Document doc = new Document();
        doc.append("title", resultItems.get("title"));
        doc.append("content", resultItems.get("content"));
        doc.append("url", resultItems.getRequest().getUrl());
        collection.insertOne(doc);
    }
}
```
5. Worked Examples
5.1 Crawling Job Listings
```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobSpider implements PageProcessor {
    private Site site = Site.me()
            .setRetryTimes(3)
            .setSleepTime(2000)
            .setUserAgent("Mozilla/5.0");

    @Override
    public void process(Page page) {
        // Job list: collect every job into one field
        // (calling putField with the same key per item would
        // overwrite it, keeping only the last job)
        List<Selectable> jobs = page.getHtml()
                .css("div.job-item")
                .nodes();
        List<Map<String, String>> jobList = new ArrayList<>();
        for (Selectable job : jobs) {
            Map<String, String> item = new HashMap<>();
            item.put("title", job.css(".job-title", "text").toString());
            item.put("company", job.css(".company-name", "text").toString());
            item.put("salary", job.css(".salary", "text").toString());
            item.put("location", job.css(".location", "text").toString());
            jobList.add(item);
        }
        page.putField("jobs", jobList);
        // Next page
        String nextPage = page.getHtml()
                .css(".next-page", "href")
                .toString();
        if (nextPage != null) {
            page.addTargetRequest(nextPage);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
```
5.2 Crawling E-commerce Product Prices
```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class ProductPriceSpider implements PageProcessor {
    private Site site = Site.me()
            .setRetryTimes(3)
            .setSleepTime(1000)
            .setUserAgent("Mozilla/5.0");

    @Override
    public void process(Page page) {
        // Product detail page
        if (page.getUrl().regex(".*item\\.jd\\.com/\\d+\\.html").match()) {
            String title = page.getHtml()
                    .xpath("//div[@class='sku-name']/text()")
                    .toString();
            // The price is loaded via Ajax and requires a separate request
            String productId = page.getUrl()
                    .regex("item\\.jd\\.com/(\\d+)\\.html")
                    .toString();
            String priceUrl = "https://p.3.cn/prices/mgets?skuIds=J_" + productId;
            // Request the price endpoint...
            page.putField("title", title);
            page.putField("productId", productId);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
```
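The example above leaves the follow-up price request as a comment. Below is a minimal sketch of that step, assuming the endpoint returns a JSON array shaped like `[{"id":"J_123","p":"59.90"}]`; the endpoint and field names come from the historical format of this price API and may have changed, so treat them as assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceFetcher {
    // Build the price-API URL for a product id (endpoint assumed, as in 5.2)
    public static String priceUrl(String productId) {
        return "https://p.3.cn/prices/mgets?skuIds=J_" + productId;
    }

    // Pull the "p" field out of a response like [{"id":"J_123","p":"59.90"}]
    // (a quick regex extraction; real code would parse the body with fastjson)
    public static String extractPrice(String json) {
        Matcher m = Pattern.compile("\"p\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }
}
```

In the crawler you would fetch `priceUrl(productId)` with HttpClient (as in section 1.2) and put the extracted price into the page's fields alongside the title.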
6. Interview Questions
Question 1: What is the difference between static and dynamic pages?
Reference answer:
| Aspect | Static page | Dynamic page |
|---|---|---|
| Data source | HTML source | Ajax requests |
| Scraping approach | Parse directly with Jsoup | Analyze the API / Selenium |
| Difficulty | Simple | Complex |
Question 2: How do you scrape a dynamic page?
Reference answer:
Approach 1: Analyze the Ajax requests (recommended)
- Open the browser's developer tools
- Find the API that returns the data
- Request that API directly
Approach 2: Drive a browser with Selenium
- Load the full page
- Wait for the JavaScript to execute
- Extract the data
Question 3: What are the pros and cons of multithreaded crawlers?
Reference answer:
Pros:
- ✅ Higher throughput
- ✅ Better CPU utilization
Cons:
- ❌ Heavier load on the target server
- ❌ Easier to get your IP banned
- ❌ Potential data races
Recommendations:
- Limit the thread count (5-10)
- Add delays between requests
- Use a proxy pool
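The proxy-pool advice can be wired into WebMagic through its `HttpClientDownloader` and `SimpleProxyProvider`. A configuration sketch, reusing the `GithubRepoPageProcessor` from 3.3; the proxy addresses are placeholders you would replace with your own pool:

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class ProxySpiderDemo {
    public static void main(String[] args) {
        // Rotate requests across a small pool of proxies
        HttpClientDownloader downloader = new HttpClientDownloader();
        downloader.setProxyProvider(SimpleProxyProvider.from(
                new Proxy("127.0.0.1", 8888),   // placeholder proxy
                new Proxy("127.0.0.1", 8889))); // placeholder proxy

        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                .setDownloader(downloader)
                .thread(5)
                .run();
    }
}
```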
Question 4: What are WebMagic's core components?
Reference answer:
- PageProcessor: page processing (parsing the data)
- Pipeline: data pipeline (persisting the results)
- Scheduler: URL manager
- Downloader: page downloader
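All four components hang off the `Spider` object. A configuration sketch that spells out the defaults explicitly, reusing the `GithubRepoPageProcessor` from 3.3 and WebMagic's built-in ConsolePipeline:

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.scheduler.QueueScheduler;

Spider.create(new GithubRepoPageProcessor())         // PageProcessor: parses each page
        .setScheduler(new QueueScheduler())          // Scheduler: in-memory URL queue (default)
        .setDownloader(new HttpClientDownloader())   // Downloader: fetches pages (default)
        .addPipeline(new ConsolePipeline())          // Pipeline: prints the results
        .thread(5)
        .run();
```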
Question 5: How do you make a crawler faster?
Reference answer:
- Multithreading/async: crawl concurrently
- Ajax analysis: request the data API directly
- Incremental crawling: only fetch new data
- Distributed crawling: coordinate multiple machines
- Caching: avoid repeated requests
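Incremental crawling boils down to remembering which URLs have already been fetched and skipping them. A minimal in-memory sketch; in practice the set would be persisted to Redis or a database, which is the role WebMagic's Scheduler and its DuplicateRemover play:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IncrementalFilter {
    // Thread-safe set of URLs seen in this or previous runs
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a URL is offered,
    // so already-crawled pages are skipped
    public boolean shouldCrawl(String url) {
        return seen.add(url);
    }
}
```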
Summary
✅ Dynamic pages: Ajax analysis, Selenium browser automation
✅ Multithreading: thread pools, CompletableFuture
✅ WebMagic: PageProcessor + Pipeline
✅ Data storage: files, MySQL, MongoDB
Coming up next: advanced crawling and anti-anti-scraping (proxy pools, CAPTCHA recognition, distributed crawlers)