🎯 Audience: developers with hands-on crawling experience
⏱️ Reading time: 35 minutes
📌 What you will learn: anti-anti-crawling strategies, proxy pools, CAPTCHA recognition, distributed crawling
1. Common Anti-Crawling Strategies
1.1 Anti-Crawling Techniques
| Technique | Description | Countermeasure |
|---|---|---|
| User-Agent detection | Inspects request headers | Set a realistic User-Agent |
| IP rate limiting | Too many requests from one IP | Use a proxy pool |
| Cookie validation | Requires a logged-in session | Simulate login |
| CAPTCHA | Image or slider challenges | OCR / solving services |
| JavaScript obfuscation | Dynamically encrypted parameters | Reverse engineering / Selenium |
| Referer check | Validates the referring page | Set the Referer header |
| Request-rate detection | Requests arriving too fast | Add delays |
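The last row of the table, rate throttling, is the cheapest countermeasure to implement. A minimal sketch of a randomized inter-request delay (the 1-3 second range is an assumption; tune it per target site):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Throttle {
    // Returns a random delay in milliseconds within [minMs, maxMs).
    public static long randomDelayMs(long minMs, long maxMs) {
        return ThreadLocalRandom.current().nextLong(minMs, maxMs);
    }

    // Sleep 1-3 s between requests; randomizing makes the traffic
    // pattern less machine-like than a fixed interval.
    public static void pause() throws InterruptedException {
        Thread.sleep(randomDelayMs(1000, 3000));
    }
}
```

Call `Throttle.pause()` after each request in the crawl loop.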
1.2 A User-Agent Pool

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class UserAgentPool {
    private static final List<String> USER_AGENTS = new ArrayList<>();

    static {
        USER_AGENTS.add("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        USER_AGENTS.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36");
        USER_AGENTS.add("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36");
        USER_AGENTS.add("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)");
        USER_AGENTS.add("Mozilla/5.0 (iPad; CPU OS 14_0 like Mac OS X)");
    }

    /**
     * Pick a random User-Agent (ThreadLocalRandom avoids allocating a new Random per call).
     */
    public static String random() {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return USER_AGENTS.get(index);
    }
}

// Usage
HttpGet request = new HttpGet(url);
request.setHeader("User-Agent", UserAgentPool.random());
```
1.3 Request-Header Spoofing

```java
public class HeaderUtils {
    /**
     * Set a full set of request headers to mimic a real browser.
     */
    public static void setHeaders(HttpGet request, String referer) {
        request.setHeader("User-Agent", UserAgentPool.random());
        request.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        request.setHeader("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8");
        // Only advertise encodings your client can decode:
        // stock HttpClient handles gzip/deflate, but not Brotli (br).
        request.setHeader("Accept-Encoding", "gzip, deflate");
        request.setHeader("Connection", "keep-alive");
        request.setHeader("Referer", referer);
        request.setHeader("Upgrade-Insecure-Requests", "1");
    }
}
```
2. Proxy IP Pools
2.1 Why a Proxy Pool?
Problem: a single IP gets banned after sending too many requests.
Solution: rotate requests through a pool of proxy IPs.
2.2 Scraping Free Proxies

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

public class ProxySpider {
    /**
     * Scrape a free-proxy listing site.
     */
    public static List<Proxy> getProxies() throws Exception {
        List<Proxy> proxies = new ArrayList<>();
        // A free-proxy site (example; its markup may change at any time)
        String url = "https://www.kuaidaili.com/free/";
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .get();
        Elements rows = doc.select("#list tbody tr");
        for (Element row : rows) {
            String ip = row.select("td[data-title='IP']").text();
            int port = Integer.parseInt(row.select("td[data-title='PORT']").text());
            String type = row.select("td[data-title='类型']").text(); // "类型" = the site's proxy-type column
            proxies.add(new Proxy(ip, port, type));
        }
        return proxies;
    }
}

class Proxy {
    private String ip;
    private int port;
    private String type; // HTTP / HTTPS

    // constructor and getters/setters omitted
}
```
2.3 Proxy Validation

```java
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ProxyValidator {
    /**
     * Check whether a proxy is usable.
     */
    public static boolean isValid(Proxy proxy) {
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Route the request through the proxy
            HttpHost proxyHost = new HttpHost(proxy.getIp(), proxy.getPort());
            RequestConfig config = RequestConfig.custom()
                    .setProxy(proxyHost)
                    .setConnectTimeout(3000)
                    .setSocketTimeout(3000)
                    .build();
            // Test request (hit a fast, reliable page)
            HttpGet request = new HttpGet("https://www.baidu.com");
            request.setConfig(config);
            int statusCode = httpClient.execute(request).getStatusLine().getStatusCode();
            return statusCode == 200;
        } catch (Exception e) {
            return false;
        }
    }
}
```
2.4 Managing the Proxy Pool

```java
import java.util.*;
import java.util.concurrent.*;

public class ProxyPool {
    private Queue<Proxy> availableProxies = new ConcurrentLinkedQueue<>();
    private Set<Proxy> unavailableProxies = ConcurrentHashMap.newKeySet();

    /**
     * Initialize the pool.
     */
    public void init() throws Exception {
        // Fetch the candidate list
        List<Proxy> proxies = ProxySpider.getProxies();
        // Keep only proxies that pass validation
        for (Proxy proxy : proxies) {
            if (ProxyValidator.isValid(proxy)) {
                availableProxies.offer(proxy);
            } else {
                unavailableProxies.add(proxy);
            }
        }
        System.out.println("Available proxies: " + availableProxies.size());
    }

    /**
     * Take an available proxy.
     */
    public Proxy getProxy() {
        Proxy proxy = availableProxies.poll();
        if (proxy == null) {
            throw new RuntimeException("Proxy pool is empty");
        }
        return proxy;
    }

    /**
     * Return a proxy that is still working.
     */
    public void returnProxy(Proxy proxy) {
        availableProxies.offer(proxy);
    }

    /**
     * Mark a proxy as dead.
     */
    public void markUnavailable(Proxy proxy) {
        unavailableProxies.add(proxy);
    }

    /**
     * Refresh the pool on a schedule.
     */
    public void refresh() {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        scheduler.scheduleAtFixedRate(() -> {
            try {
                init();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 30, TimeUnit.MINUTES); // refresh every 30 minutes
    }
}
```
2.5 Crawling Through the Pool

```java
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SpiderWithProxy {
    private ProxyPool proxyPool = new ProxyPool();

    public String crawl(String url) throws Exception {
        Proxy proxy = proxyPool.getProxy();
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Route through the proxy
            HttpHost proxyHost = new HttpHost(proxy.getIp(), proxy.getPort());
            RequestConfig config = RequestConfig.custom()
                    .setProxy(proxyHost)
                    .setConnectTimeout(5000)
                    .setSocketTimeout(10000)
                    .build();
            HttpGet request = new HttpGet(url);
            request.setConfig(config);
            request.setHeader("User-Agent", UserAgentPool.random());
            String html = EntityUtils.toString(
                    httpClient.execute(request).getEntity()
            );
            // The proxy worked; put it back
            proxyPool.returnProxy(proxy);
            return html;
        } catch (Exception e) {
            // The request failed; retire this proxy
            proxyPool.markUnavailable(proxy);
            throw e;
        }
    }
}
```
3. CAPTCHA Recognition
3.1 CAPTCHA Types
| Type | Difficulty | Recognition approach |
|---|---|---|
| Digits/letters | ⭐ | OCR (Tesseract) |
| Image CAPTCHA | ⭐⭐ | Solving service |
| Slider CAPTCHA | ⭐⭐⭐ | Selenium simulation |
| Click-to-select | ⭐⭐⭐⭐ | Image recognition / human solvers |
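For slider CAPTCHAs, a Selenium script typically drags the slider along a "human-like" track rather than jumping straight to the target. A hedged sketch of generating such a track (the 60% split point and step sizes are arbitrary choices, not values from any particular CAPTCHA vendor):

```java
import java.util.ArrayList;
import java.util.List;

public class SliderTrack {
    // Split `distance` px into small steps: bigger steps for the first 60%,
    // then 1-px steps, so the drag eases in like a human hand.
    public static List<Integer> generate(int distance) {
        List<Integer> track = new ArrayList<>();
        int moved = 0;
        while (moved < distance) {
            int step = moved < distance * 0.6 ? 4 : 1; // fast, then slow
            step = Math.min(step, distance - moved);   // don't overshoot
            track.add(step);
            moved += step;
        }
        return track;
    }
}
```

Each step would then be replayed with Selenium's `Actions.moveByOffset(step, 0)`, ideally with small random sleeps between steps.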
3.2 OCR with Tesseract
Install Tesseract:
- Download: https://github.com/tesseract-ocr/tesseract
- Set the environment variables after installing

Add the dependency:

```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.9.0</version>
</dependency>
```
Recognize a CAPTCHA:

```java
import net.sourceforge.tess4j.Tesseract;
import java.io.File;

public class CaptchaRecognizer {
    public static String recognize(String imagePath) {
        try {
            Tesseract tesseract = new Tesseract();
            // Point this at your local tessdata directory
            tesseract.setDatapath("D:/tessdata");
            tesseract.setLanguage("eng");
            // Run OCR
            String result = tesseract.doOCR(new File(imagePath));
            // Strip whitespace and newlines
            return result.replaceAll("\\s+", "");
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        String code = recognize("captcha.png");
        System.out.println("CAPTCHA: " + code);
    }
}
```
3.3 Image Preprocessing (to Improve Accuracy)

```java
import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.File;

public class ImagePreprocessor {
    /**
     * Preprocess: grayscale conversion followed by binarization.
     */
    public static BufferedImage preprocess(BufferedImage image) {
        // ① Convert to grayscale
        BufferedImage grayImage = new BufferedImage(
                image.getWidth(), image.getHeight(),
                BufferedImage.TYPE_BYTE_GRAY
        );
        Graphics g = grayImage.getGraphics();
        g.drawImage(image, 0, 0, null);
        g.dispose();
        // ② Binarize: pixels above the threshold become white, the rest black
        int threshold = 128;
        for (int y = 0; y < grayImage.getHeight(); y++) {
            for (int x = 0; x < grayImage.getWidth(); x++) {
                int gray = grayImage.getRGB(x, y) & 0xff;
                // Include the alpha bits so setRGB stores pure white/black
                int newPixel = gray > threshold ? 0xFFFFFFFF : 0xFF000000;
                grayImage.setRGB(x, y, newPixel);
            }
        }
        return grayImage;
    }

    public static void main(String[] args) throws Exception {
        BufferedImage image = ImageIO.read(new File("captcha.png"));
        BufferedImage processed = preprocess(image);
        ImageIO.write(processed, "png", new File("captcha_processed.png"));
    }
}
```
3.4 CAPTCHA-Solving Services (Recommended)
Example (the ChaoJiYing service):

```java
import com.alibaba.fastjson.JSON;     // fastjson, for parsing the response
import com.alibaba.fastjson.JSONObject;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.File;

public class ChaoJiYing {
    private String username;
    private String password;
    private String softId;

    /**
     * Submit a CAPTCHA for recognition.
     *
     * @param imagePath path to the CAPTCHA image
     * @param codeType  CAPTCHA type code (e.g. 1004 = 4 digits)
     * @return the recognized text
     */
    public String recognize(String imagePath, String codeType) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpPost post = new HttpPost("http://upload.chaojiying.net/Upload/Processing.php");
        // Build the multipart form
        MultipartEntityBuilder builder = MultipartEntityBuilder.create();
        builder.addTextBody("user", username);
        builder.addTextBody("pass", password);
        builder.addTextBody("softid", softId);
        builder.addTextBody("codetype", codeType);
        builder.addBinaryBody("userfile", new File(imagePath));
        post.setEntity(builder.build());
        String response = EntityUtils.toString(
                httpClient.execute(post).getEntity()
        );
        httpClient.close();
        // Parse the JSON response
        JSONObject json = JSON.parseObject(response);
        return json.getString("pic_str");
    }
}
```
4. Cookies and Login
4.1 Cookie Management

```java
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class CookieExample {
    public static void main(String[] args) throws Exception {
        // Create a cookie store
        CookieStore cookieStore = new BasicCookieStore();
        // Build an HttpClient that shares it across requests
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();
        // First request (the server sets cookies)
        HttpGet request1 = new HttpGet("https://example.com/login");
        httpClient.execute(request1);
        // Second request (cookies are sent back automatically)
        HttpGet request2 = new HttpGet("https://example.com/user/profile");
        String html = EntityUtils.toString(
                httpClient.execute(request2).getEntity()
        );
        httpClient.close();
    }
}
```
4.2 Simulated Login

```java
import org.apache.http.client.CookieStore;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import java.util.ArrayList;
import java.util.List;

public class LoginSpider {
    /**
     * Log in and return an HttpClient that carries the session cookies.
     */
    public static CloseableHttpClient login(String username, String password) throws Exception {
        CookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();
        // Build the login request
        HttpPost loginPost = new HttpPost("https://example.com/api/login");
        // Login form parameters
        List<BasicNameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("username", username));
        params.add(new BasicNameValuePair("password", password));
        loginPost.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
        loginPost.setHeader("User-Agent", "Mozilla/5.0");
        // Execute the login
        String response = EntityUtils.toString(
                httpClient.execute(loginPost).getEntity()
        );
        System.out.println("Login response: " + response);
        // Return the logged-in client
        return httpClient;
    }

    public static void main(String[] args) throws Exception {
        // Log in
        CloseableHttpClient httpClient = login("user", "pass");
        // Use the logged-in client to fetch a page that requires authentication
        HttpGet request = new HttpGet("https://example.com/user/info");
        String html = EntityUtils.toString(
                httpClient.execute(request).getEntity()
        );
        System.out.println(html);
        httpClient.close();
    }
}
```
5. Distributed Crawling
5.1 Why Distribute?
Problems:
- very large workloads (tens of millions of URLs)
- single-machine performance bottlenecks
Solutions:
- multiple machines working together
- a shared URL queue
- unified data storage
5.2 A Distributed Queue on Redis
Add the dependency:

```xml
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>5.1.0</version>
</dependency>
```
A distributed URL queue:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class RedisUrlQueue {
    private JedisPool jedisPool;

    public RedisUrlQueue(String host, int port) {
        this.jedisPool = new JedisPool(host, port);
    }

    /**
     * Push a URL onto the queue.
     */
    public void push(String url) {
        try (Jedis jedis = jedisPool.getResource()) {
            jedis.lpush("spider:urls", url);
        }
    }

    /**
     * Pop a URL from the queue (null when empty).
     */
    public String pop() {
        try (Jedis jedis = jedisPool.getResource()) {
            return jedis.rpop("spider:urls");
        }
    }

    /**
     * Has this URL been crawled already?
     */
    public boolean isCrawled(String url) {
        try (Jedis jedis = jedisPool.getResource()) {
            return jedis.sismember("spider:crawled", url);
        }
    }

    /**
     * Mark a URL as crawled.
     */
    public void markCrawled(String url) {
        try (Jedis jedis = jedisPool.getResource()) {
            jedis.sadd("spider:crawled", url);
        }
    }
}
```
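Deduplication is only as good as the keys you store: `http://example.com/a` and `http://example.com/a#section` would otherwise count as two different URLs. A minimal normalization sketch to apply before `push`/`isCrawled` (stripping fragments and trailing slashes is an assumption about what the target site treats as equivalent; adjust for your site):

```java
public class UrlNormalizer {
    // Canonicalize a URL before using it as a dedup key:
    // drop the #fragment (it is never sent to the server) and any trailing slash.
    public static String normalize(String url) {
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        if (url.endsWith("/") && url.length() > "https://".length()) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }
}
```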
5.3 A Distributed Crawler Worker

```java
import org.jsoup.Jsoup;
import java.util.ArrayList;
import java.util.List;

public class DistributedSpider {
    private RedisUrlQueue urlQueue;

    public DistributedSpider(String redisHost, int redisPort) {
        this.urlQueue = new RedisUrlQueue(redisHost, redisPort);
    }

    /**
     * Run the worker loop.
     */
    public void start() {
        while (true) {
            // Take a URL from Redis
            String url = urlQueue.pop();
            if (url == null) {
                System.out.println("Queue empty, waiting...");
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    break;
                }
                continue;
            }
            // Skip URLs already crawled. Note: because the URL is only marked
            // after success, two workers can still race on the same URL; for
            // strict dedup, mark it (via sadd) before crawling instead.
            if (urlQueue.isCrawled(url)) {
                continue;
            }
            try {
                // Fetch the page
                String html = crawl(url);
                // Parse and store the data
                parseAndSave(html);
                // Discover new URLs and enqueue them
                List<String> newUrls = extractUrls(html);
                for (String newUrl : newUrls) {
                    urlQueue.push(newUrl);
                }
                // Mark as crawled
                urlQueue.markCrawled(url);
                System.out.println("Crawled: " + url);
            } catch (Exception e) {
                System.err.println("Failed: " + url);
                e.printStackTrace();
            }
            // Throttle
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                break;
            }
        }
    }

    private String crawl(String url) throws Exception {
        // Fetch logic
        return Jsoup.connect(url).get().html();
    }

    private void parseAndSave(String html) {
        // Parse and persist the data
    }

    private List<String> extractUrls(String html) {
        // Extract new URLs
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // Start a worker (run one per machine)
        DistributedSpider spider = new DistributedSpider("localhost", 6379);
        spider.start();
    }
}
```
6. Crawler Monitoring and Management
6.1 Monitoring Metrics

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpiderMonitor {
    private AtomicInteger totalRequests = new AtomicInteger(0);
    private AtomicInteger successRequests = new AtomicInteger(0);
    private AtomicInteger failedRequests = new AtomicInteger(0);
    private long startTime;

    public SpiderMonitor() {
        this.startTime = System.currentTimeMillis();
    }

    public void recordSuccess() {
        totalRequests.incrementAndGet();
        successRequests.incrementAndGet();
    }

    public void recordFailure() {
        totalRequests.incrementAndGet();
        failedRequests.incrementAndGet();
    }

    /**
     * Print the running statistics.
     */
    public void printStats() {
        // Guard both divisions against zero
        long duration = Math.max(1, (System.currentTimeMillis() - startTime) / 1000);
        int total = Math.max(1, totalRequests.get());
        double qps = totalRequests.get() / (double) duration;
        double successRate = successRequests.get() / (double) total * 100;
        System.out.println("========== Crawler Stats ==========");
        System.out.println("Uptime: " + duration + " s");
        System.out.println("Total requests: " + totalRequests.get());
        System.out.println("Successes: " + successRequests.get());
        System.out.println("Failures: " + failedRequests.get());
        System.out.println("Success rate: " + String.format("%.2f", successRate) + "%");
        System.out.println("QPS: " + String.format("%.2f", qps));
        System.out.println("===================================");
    }
}
```
7. Interview Questions
Question 1: How do you deal with IP bans?
Answer:
1. Use a proxy IP pool
   - free or paid proxies
   - validate and refresh the pool periodically
2. Throttle the request rate
   - add delays (1-3 seconds)
   - randomize the delay
3. Distribute the crawler
   - spread requests across multiple machines
Question 2: How do you recognize CAPTCHAs?
Answer:
| Approach | Best for | Cost |
|---|---|---|
| OCR (Tesseract) | simple CAPTCHAs | free |
| Solving service | complex CAPTCHAs | paid (roughly ¥0.01 per solve) |
| Deep learning | high volume | high (training cost) |
| Selenium simulation | slider CAPTCHAs | free but slow |
Question 3: How does a distributed crawler deduplicate URLs?
Answer:
Option 1: a Redis set

```java
jedis.sadd("crawled_urls", url);      // add
jedis.sismember("crawled_urls", url); // check
```

Option 2: a Bloom filter (saves memory at the cost of a small false-positive rate)

```java
// Guava's BloomFilter
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;

BloomFilter<String> bloomFilter = BloomFilter.create(
        Funnels.stringFunnel(Charset.defaultCharset()),
        10_000_000, // expected number of elements
        0.01        // false-positive rate
);
bloomFilter.put(url);
bloomFilter.mightContain(url);
```
Question 4: How do you make a crawler more stable?
Answer:
1. Exception handling
   - catch every exception
   - retry failures (at most 3 times)
2. Timeouts
   - connect timeout: 5 seconds
   - read timeout: 10 seconds
3. Monitoring and alerting
   - watch the failure rate
   - notify promptly on anomalies
4. Data safety
   - back up crawled data regularly
   - support resuming from a checkpoint
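The "retry at most 3 times" point above can be captured in a small generic helper. A sketch with a linearly growing back-off (the linear schedule and delay values are illustrative choices; exponential back-off is another common option):

```java
import java.util.concurrent.Callable;

public class Retry {
    // Run `task`, retrying up to `maxAttempts` times with a linearly growing
    // delay between attempts; rethrow the last failure if all attempts fail.
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMs * attempt); // back off a little more each time
                }
            }
        }
        throw last;
    }
}
```

Usage: `String html = Retry.withRetry(() -> crawl(url), 3, 1000);`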
Question 5: What are the legal risks of crawling?
Answer:
Generally acceptable:
- ✅ crawling public data
- ✅ respecting robots.txt
- ✅ personal study and research
Illegal:
- ❌ scraping personal or private data
- ❌ commercial infringement
- ❌ disrupting the target's servers
- ❌ reselling data
Recommendations:
- throttle requests (don't impair the site's normal operation)
- identify your crawler (User-Agent)
- keep the data for internal use only
Summary
✅ Anti-anti-crawling: User-Agent pools, proxy pools, request-header spoofing
✅ Proxy IPs: scraping free proxies, validation, pool management
✅ CAPTCHAs: OCR, solving services
✅ Login: cookie management, simulated login
✅ Distributed: Redis queues, multi-machine cooperation
🎉 This concludes the Java crawler series
📚 Series Recap
| # | Title | Core content |
|---|---|---|
| ① | Beginner's guide | HTTP requests, Jsoup parsing, hands-on projects |
| ② | Intermediate techniques | dynamic pages, multithreading, the WebMagic framework |
| ③ | Advanced techniques | anti-anti-crawling, proxy pools, CAPTCHAs, distributed crawling |
💡 Learning Path
Days 1-2: basics (HTTP + Jsoup)
Days 3-5: intermediate (dynamic pages + multithreading)
Days 6-7: advanced (anti-anti-crawling + distributed)
⚠️ Important Reminders
- follow laws and regulations
- respect robots.txt
- throttle your request rate
- use the data for learning only
Good luck on your crawling journey! 🚀