When you need an enterprise-grade, scalable multi-threaded crawler, Java is a strong choice. Java's threading model gives such a crawler distinct advantages: a thread pool (ExecutorService) manages the worker threads, and since we need to adjust the thread count at runtime, we can use a configurable pool (ThreadPoolExecutor) that allows its core and maximum pool sizes to be changed dynamically.

Below is an implementation I put together of an enterprise-oriented Java multi-threaded crawler with a dynamically adjustable thread count, covering thread management, a task queue, graceful shutdown, and error handling:
```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class DynamicThreadedCrawler {

    // Thread pool configuration
    private volatile ThreadPoolExecutor executor;
    private final BlockingQueue<Runnable> taskQueue = new LinkedBlockingQueue<>();
    private final AtomicBoolean isRunning = new AtomicBoolean(false);
    private final AtomicInteger activeThreads = new AtomicInteger(0);
    private final Object threadLock = new Object();

    // Crawler configuration
    private final int maxThreads;
    private final int minThreads;
    private final long keepAliveTime;
    private final Consumer<String> pageProcessor;

    public DynamicThreadedCrawler(int minThreads, int maxThreads, long keepAliveSeconds,
                                  Consumer<String> pageProcessor) {
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
        this.keepAliveTime = keepAliveSeconds;
        this.pageProcessor = pageProcessor;
    }

    public void start() {
        if (isRunning.compareAndSet(false, true)) {
            executor = new ThreadPoolExecutor(
                    minThreads,
                    maxThreads,
                    keepAliveTime,
                    TimeUnit.SECONDS,
                    new LinkedBlockingQueue<>(),
                    new CrawlerThreadFactory()
            );
            // Without this, core threads never time out and keepAliveTime has no effect
            executor.allowCoreThreadTimeOut(true);
            executor.prestartAllCoreThreads();
            System.out.println("Crawler started with " + minThreads + " threads");
        }
    }

    public void stop() {
        if (isRunning.compareAndSet(true, false)) {
            executor.shutdown();
            try {
                if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                    executor.shutdownNow();
                }
            } catch (InterruptedException e) {
                executor.shutdownNow();
                Thread.currentThread().interrupt();
            }
            System.out.println("Crawler stopped");
        }
    }

    public void addTask(String url) {
        if (isRunning.get()) {
            taskQueue.offer(() -> processUrl(url));
            executor.execute(this::processTasks);
        }
    }

    public void increaseThreads(int delta) {
        synchronized (threadLock) {
            int newCore = Math.min(executor.getCorePoolSize() + delta, maxThreads);
            // Raise the maximum first: setCorePoolSize rejects a core size
            // larger than the current maximum pool size
            executor.setMaximumPoolSize(newCore);
            executor.setCorePoolSize(newCore);
            System.out.println("Increased threads to: " + newCore);
        }
    }

    public void decreaseThreads(int delta) {
        synchronized (threadLock) {
            int newCore = Math.max(executor.getCorePoolSize() - delta, minThreads);
            // Lower the core size first so it never exceeds the maximum
            executor.setCorePoolSize(newCore);
            executor.setMaximumPoolSize(newCore);
            System.out.println("Decreased threads to: " + newCore);
        }
    }

    private void processTasks() {
        while (isRunning.get() && !Thread.currentThread().isInterrupted()) {
            try {
                Runnable task = taskQueue.poll(500, TimeUnit.MILLISECONDS);
                if (task != null) {
                    task.run();
                } else if (activeThreads.get() > minThreads) {
                    return; // idle worker exits on its own
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void processUrl(String url) {
        activeThreads.incrementAndGet();
        try {
            // Simulated page fetch and processing
            pageProcessor.accept(url);
            TimeUnit.MILLISECONDS.sleep(200); // simulated network latency
            // Link extraction could be added here:
            // List<String> newUrls = extractLinks(htmlContent);
            // newUrls.forEach(this::addTask);
        } catch (InterruptedException e) {
            // Don't re-queue on shutdown interrupt; just restore the flag
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            System.err.println("Error processing URL " + url + ": " + e.getMessage());
            // Retry on failure (unbounded here; cap retries in production)
            taskQueue.offer(() -> processUrl(url));
        } finally {
            activeThreads.decrementAndGet();
        }
    }

    public int getActiveThreadCount() {
        return activeThreads.get();
    }

    public int getQueuedTaskCount() {
        return taskQueue.size();
    }

    private static class CrawlerThreadFactory implements ThreadFactory {
        private final AtomicInteger threadCount = new AtomicInteger(0);

        @Override
        public Thread newThread(Runnable r) {
            Thread thread = new Thread(r);
            thread.setName("CrawlerThread-" + threadCount.incrementAndGet());
            thread.setDaemon(false);
            thread.setUncaughtExceptionHandler((t, e) ->
                    System.err.println("Uncaught exception in thread " + t.getName() + ": " + e));
            return thread;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Usage example
        DynamicThreadedCrawler crawler = new DynamicThreadedCrawler(
                2,   // minimum threads
                10,  // maximum threads
                30,  // idle-thread timeout (seconds)
                url -> System.out.println("Processing: " + url + " on " + Thread.currentThread().getName())
        );
        crawler.start();

        // Seed the queue with initial tasks
        for (int i = 0; i < 20; i++) {
            crawler.addTask("https://example.com/page" + i);
        }

        // Adjust threads at runtime
        TimeUnit.SECONDS.sleep(3);
        crawler.increaseThreads(3); // add 3 threads
        TimeUnit.SECONDS.sleep(2);
        crawler.decreaseThreads(2); // remove 2 threads

        // Monitor status
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(() ->
                System.out.printf("[Monitor] Active: %d, Queued: %d, Pool: %d/%d%n",
                        crawler.getActiveThreadCount(),
                        crawler.getQueuedTaskCount(),
                        crawler.executor.getPoolSize(),
                        crawler.executor.getCorePoolSize()),
                0, 1, TimeUnit.SECONDS);

        // Run for a while, then stop
        TimeUnit.SECONDS.sleep(10);
        monitor.shutdown();
        crawler.stop();
    }
}
```
Core design notes:

1. Dynamic thread management:
   - `increaseThreads()` / `decreaseThreads()` adjust the thread count at runtime
   - Implemented via `ThreadPoolExecutor`'s `setCorePoolSize()` / `setMaximumPoolSize()`
   - Idle threads are reclaimed after a timeout (controlled by the `keepAliveTime` parameter)
2. Task processing flow:
   - Dual-queue design: an external task queue plus the pool's internal work queue
   - Worker threads pull URLs from the task queue and process them
   - Failed tasks are automatically re-queued for retry
3. Resource control:
   - Minimum/maximum thread count limits
   - Idle-thread timeout and reclamation
   - A thread factory centralizes thread creation
4. Error handling:
   - Unified exception catching
   - An uncaught-exception handler on every worker thread
   - Per-task error isolation and retry
5. Monitoring:
   - Live active-thread count
   - Queued task count
   - Thread pool state inspection
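One subtlety in the dynamic resizing above: on modern JDKs, `ThreadPoolExecutor.setCorePoolSize` rejects a value larger than the current maximum pool size, so when growing the pool the maximum must be raised first, and when shrinking, the core size must be lowered first. A minimal standalone sketch (the class name `ResizeOrderDemo` is mine):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ResizeOrderDemo {

    // Grow: raise the maximum first so the new core size never
    // exceeds the current maximum (setCorePoolSize would reject it).
    static void grow(ThreadPoolExecutor pool, int newSize) {
        pool.setMaximumPoolSize(newSize);
        pool.setCorePoolSize(newSize);
    }

    // Shrink: lower the core size first for the same reason.
    static void shrink(ThreadPoolExecutor pool, int newSize) {
        pool.setCorePoolSize(newSize);
        pool.setMaximumPoolSize(newSize);
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        grow(pool, 8);
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // 8/8
        shrink(pool, 4);
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // 4/4
        pool.shutdown();
    }
}
```

Doing the calls in the opposite order works only by accident while the pool is small; the safe ordering costs nothing.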
Usage example:

```java
import java.util.List;

public class CrawlerExample {
    // A field rather than a local variable, so the page-processor
    // lambda can refer back to the crawler to enqueue new links
    private static DynamicThreadedCrawler crawler;

    public static void main(String[] args) {
        crawler = new DynamicThreadedCrawler(
                4,   // minimum threads
                20,  // maximum threads
                60,  // idle-thread timeout (seconds)
                url -> {
                    // Real page-processing logic goes here
                    String content = downloadPage(url);
                    processContent(content);
                    extractLinks(content).forEach(crawler::addTask);
                }
        );
        crawler.start();
        crawler.addTask("https://example.com/seed");

        // Adjust at runtime
        crawler.increaseThreads(5); // add 5 threads
        crawler.decreaseThreads(3); // remove 3 threads
    }

    // Placeholder stubs -- replace with real implementations
    private static String downloadPage(String url) { return ""; }
    private static void processContent(String content) { }
    private static List<String> extractLinks(String content) { return List.of(); }
}
```
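The `extractLinks` stub above could be filled in with a naive regex-based extractor like the sketch below; a production crawler should use a real HTML parser such as jsoup instead. The class name and regex here are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Matches absolute http(s) URLs inside double-quoted href attributes.
    // Deliberately naive: no relative URLs, single quotes, or entity decoding.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the URL without the surrounding quotes
        }
        return urls;
    }
}
```

For example, `LinkExtractor.extractLinks("<a href=\"https://example.com/x\">x</a>")` yields a one-element list containing `https://example.com/x`.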
Enterprise-grade enhancements to consider:

1. Distributed scaling:
   - Replace the local task queue with a message queue (e.g. RabbitMQ/Kafka)
   - Add Redis-based URL deduplication
2. Traffic control:
   - Per-domain request rate limiting
   - Adaptive throttling
3. Fault tolerance:
   - Heartbeat checks and automatic thread recovery
   - Circuit breakers to prevent cascading failures
4. Monitoring integration:
   - Expose JMX metrics via Micrometer
   - Add a Prometheus endpoint
5. Configuration management:
   - Hot-reloadable settings (e.g. thread counts, timeouts)
   - Dynamic config-file loading
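As a single-process stand-in for the Redis-based deduplication suggested above, a thread-safe in-memory set is enough to stop the crawler re-enqueuing URLs it has already seen (the class name `SeenUrls` is mine):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SeenUrls {
    // Thread-safe set of already-enqueued URLs. In a distributed crawler
    // this would become a shared store, e.g. a Redis SET with SADD.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per URL: the first caller wins,
    // later calls with the same URL return false.
    public boolean markIfNew(String url) {
        return seen.add(url);
    }
}
```

Gating submission with `if (seenUrls.markIfNew(url)) crawler.addTask(url);` makes the check-and-enqueue atomic per URL, since `Set.add` on a concurrent set reports whether the element was newly inserted.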
This implementation provides the core framework of an enterprise-grade crawler; depending on your needs it can be extended with distributed processing, more advanced deduplication strategies, and anti-bot countermeasures. Feel free to leave a comment if anything above is unclear.