Introduction: Modern Challenges and Opportunities in .NET Crawler Development
In today's data-driven era, web crawlers have become essential tools for gathering and analyzing information on the web. As the core development framework of the Microsoft ecosystem, .NET provides a powerful set of tools and libraries for building efficient, reliable crawlers. This article surveys practical approaches to crawling on the .NET platform, from the basics to advanced topics: HTTP request handling, HTML parsing, countering anti-bot measures, distributed crawling, and performance optimization.
Compared with dynamic languages such as Python, .NET brings type safety, strong runtime performance, and first-class concurrency support to crawler development. We will look at how to exploit these strengths to build industrial-grade crawler systems, with working code examples throughout.
1. Crawler Architecture Fundamentals in .NET
1.1 Core Components and Workflow
A complete .NET crawler system typically contains these core components:
- Scheduler: manages the queue of URLs waiting to be crawled
- Downloader: performs HTTP requests and fetches page content
- Parser: extracts the target data and discovers new URLs
- Storage: persists the crawl results
- Monitor: tracks the crawler's runtime state
A basic crawler workflow example (link extraction uses HtmlAgilityPack's HtmlDocument):
```csharp
public class BasicCrawler
{
private readonly HttpClient _httpClient;
private readonly ConcurrentQueue<Uri> _urlQueue = new();
private readonly HashSet<Uri> _visitedUrls = new();
public BasicCrawler()
{
_httpClient = new HttpClient(new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.All
});
_httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; MyCrawler/1.0)");
}
public async Task CrawlAsync(Uri startUrl, int maxPages = 100)
{
_urlQueue.Enqueue(startUrl);
while (_urlQueue.TryDequeue(out var currentUrl) && _visitedUrls.Count < maxPages)
{
if (_visitedUrls.Contains(currentUrl)) continue;
try
{
Console.WriteLine($"Crawling: {currentUrl}");
var html = await DownloadAsync(currentUrl);
_visitedUrls.Add(currentUrl);
var links = ParseLinks(html, currentUrl);
foreach (var link in links)
{
if (!_visitedUrls.Contains(link))
_urlQueue.Enqueue(link);
}
// Process the page content...
await ProcessPageAsync(html, currentUrl);
}
catch (Exception ex)
{
Console.WriteLine($"Error crawling {currentUrl}: {ex.Message}");
}
}
}
private async Task<string> DownloadAsync(Uri url)
{
var response = await _httpClient.GetAsync(url);
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
private IEnumerable<Uri> ParseLinks(string html, Uri baseUri)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
return doc.DocumentNode.SelectNodes("//a[@href]")
?.Select(a => a.GetAttributeValue("href", null))
.Where(href => !string.IsNullOrEmpty(href))
.Select(href => new Uri(baseUri, href))
.Where(uri => uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
.Distinct() ?? Enumerable.Empty<Uri>();
}
private Task ProcessPageAsync(string html, Uri url)
{
// Implement page-specific processing logic here
return Task.CompletedTask;
}
}
```
1.2 HttpClient Best Practices
In .NET, HttpClient is the core class for HTTP communication, and using it correctly is essential.
Optimized HttpClient usage:
```csharp
public static class HttpClientFactory
{
private static readonly Lazy<HttpClient> _sharedClient = new(() =>
{
var handler = new SocketsHttpHandler
{
PooledConnectionLifetime = TimeSpan.FromMinutes(5),
PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1),
MaxConnectionsPerServer = 50,
UseCookies = false
};
var client = new HttpClient(handler)
{
Timeout = TimeSpan.FromSeconds(30)
};
client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.5");
client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
return client;
});
public static HttpClient CreateClient()
{
return _sharedClient.Value;
}
public static HttpClient CreateRotatingClient(IEnumerable<WebProxy> proxies)
{
var proxy = proxies.OrderBy(_ => Guid.NewGuid()).First();
var handler = new HttpClientHandler
{
Proxy = proxy,
UseProxy = true,
AutomaticDecompression = DecompressionMethods.All
};
return new HttpClient(handler)
{
Timeout = TimeSpan.FromSeconds(30)
};
}
}
```
Key optimization points:
- Use SocketsHttpHandler instead of the default handler for better performance
- Configure connection-pool settings to avoid port exhaustion
- Implement a proxy rotation mechanism
- Set sensible timeouts and retry policies (see the Polly sketch below)
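For the last point, the Polly library is the de-facto resilience toolkit in .NET. The following is a minimal sketch, not a drop-in implementation: the policy shape and the GetWithRetryAsync helper are illustrative. It retries transient failures and HTTP 429/5xx responses with exponential backoff.
```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public static class ResilientHttp
{
    // Retry up to 3 times on exceptions, 429s, and 5xx responses,
    // backing off exponentially: 2s, 4s, 8s.
    private static readonly AsyncRetryPolicy<HttpResponseMessage> RetryPolicy =
        Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .OrResult(r => (int)r.StatusCode == 429 || (int)r.StatusCode >= 500)
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public static Task<HttpResponseMessage> GetWithRetryAsync(HttpClient client, string url) =>
        RetryPolicy.ExecuteAsync(() => client.GetAsync(url));
}
```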
2. Advanced HTML Parsing Techniques
2.1 AngleSharp vs HtmlAgilityPack
The .NET ecosystem has two mainstream HTML parsing libraries, each with trade-offs:
AngleSharp advantages:
- Parsing that closely follows modern browser standards
- CSS selector support
- A built-in DOM manipulation API
HtmlAgilityPack advantages:
- More tolerant of malformed HTML
- Lighter weight
- More mature XPath support
AngleSharp parsing example:
```csharp
public async Task<Dictionary<string, string>> ExtractProductDataAsync(Uri url)
{
var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(url.ToString());
var products = document.QuerySelectorAll(".product-item")
.Select(product => new
{
Name = product.QuerySelector(".product-name")?.TextContent.Trim(),
Price = product.QuerySelector(".price")?.TextContent.Trim(),
Description = product.QuerySelector(".description")?.TextContent.Trim(),
ImageUrl = product.QuerySelector("img.product-image")?.GetAttribute("src")
})
.Where(p => !string.IsNullOrEmpty(p.Name))
.DistinctBy(p => p.Name) // guard against duplicate-key exceptions (.NET 6+)
.ToDictionary(
p => p.Name!,
p => $"{p.Price}|{p.Description}|{p.ImageUrl}");
return products;
}
```
HtmlAgilityPack parsing example:
```csharp
public Dictionary<string, string> ExtractDataWithXPath(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = new Dictionary<string, string>();
var nodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'news-item')]");
if (nodes == null) return result;
foreach (var node in nodes)
{
var titleNode = node.SelectSingleNode(".//h3/a");
if (titleNode == null) continue;
var title = titleNode.InnerText.Trim();
var url = titleNode.GetAttributeValue("href", "");
var date = node.SelectSingleNode(".//span[@class='date']")?.InnerText.Trim();
if (!string.IsNullOrEmpty(title) && !string.IsNullOrEmpty(url))
result[title] = $"{date}|{url}";
}
return result;
}
```
2.2 Handling Dynamic Content
Modern sites load much of their content with JavaScript, so a plain HTTP request never sees it. The main options are:
- Headless browsers: e.g. PuppeteerSharp
- Analyzing API requests: calling the site's data endpoints directly (sketched below)
- JavaScript engines: executing simple scripts in-process
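The second option is often the cheapest when it applies: the page's own fetch/XHR calls, visible in the browser DevTools network tab, frequently return clean JSON you can request directly. A minimal sketch, assuming a hypothetical /api/products endpoint and DTO shape:
```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

public record ProductDto(string Name, decimal Price);

public static class ApiScraper
{
    public static async Task<List<ProductDto>?> FetchProductsAsync(HttpClient client)
    {
        // Mirror the headers the page's own JavaScript sends; some APIs check them
        using var request = new HttpRequestMessage(HttpMethod.Get,
            "https://example.com/api/products?page=1"); // hypothetical endpoint
        request.Headers.Add("X-Requested-With", "XMLHttpRequest");

        using var response = await client.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadFromJsonAsync<List<ProductDto>>();
    }
}
```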
A PuppeteerSharp example:
```csharp
public async Task<List<string>> CrawlDynamicPageAsync(string url)
{
var options = new LaunchOptions
{
Headless = true,
// A verbatim string is required here: backslashes in a normal literal are escape sequences
ExecutablePath = @"C:\Program Files\Google\Chrome\Application\chrome.exe"
};
await using var browser = await Puppeteer.LaunchAsync(options);
await using var page = await browser.NewPageAsync();
await page.SetUserAgentAsync("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
await page.SetJavaScriptEnabledAsync(true);
Console.WriteLine($"Navigating to {url}");
await page.GoToAsync(url, WaitUntilNavigation.Networkidle2);
// Wait for a specific element to load
await page.WaitForSelectorAsync(".dynamic-content", new WaitForSelectorOptions { Timeout = 5000 });
// Scroll the page to trigger lazy loading
await AutoScrollPageAsync(page);
// Grab the fully rendered HTML
var content = await page.GetContentAsync();
// Parse the rendered content with AngleSharp or HtmlAgilityPack
var doc = new HtmlDocument();
doc.LoadHtml(content);
return doc.DocumentNode.SelectNodes("//div[@class='item']")
?.Select(node => node.InnerText.Trim())
.Where(text => !string.IsNullOrEmpty(text))
.ToList() ?? new List<string>();
}
private static async Task AutoScrollPageAsync(IPage page)
{
    // Wrap the script in an async IIFE: a bare block containing await is not a valid expression
    await page.EvaluateExpressionAsync(@"(async () => {
        await new Promise((resolve) => {
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    })()");
}
```
3. Anti-Scraping Mechanisms and Countermeasures
3.1 Common Anti-Bot Mechanisms
- User-Agent checks: verifying that request headers look like a real browser
- Request-rate limits: too many requests per unit of time gets you banned
- IP bans: identifying and blocking crawler IPs
- CAPTCHAs: demanding human verification
- Behavioral analysis: detecting non-human interaction patterns
- Web application firewalls (WAFs): e.g. Cloudflare, Akamai
3.2 Advanced Evasion Techniques
A combined implementation of common countermeasures:
```csharp
public class AntiAntiCrawler
{
private readonly List<string> _userAgents = new()
{
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15"
};
private readonly List<WebProxy> _proxies;
private readonly Random _random = new();
private readonly ConcurrentDictionary<string, DateTime> _domainDelay = new();
public AntiAntiCrawler(IEnumerable<string> proxyStrings)
{
_proxies = proxyStrings
.Select(p => new WebProxy(p))
.ToList();
}
public async Task<string> SmartGetAsync(string url, int retryCount = 3)
{
for (int i = 0; i < retryCount; i++)
{
try
{
var domain = new Uri(url).Host;
// Per-domain delay control
if (_domainDelay.TryGetValue(domain, out var lastAccess))
{
var delay = (int)(lastAccess.AddSeconds(5) - DateTime.Now).TotalMilliseconds;
if (delay > 0)
await Task.Delay(delay);
}
using var client = CreateDisposableClient();
var response = await client.GetAsync(url);
if ((int)response.StatusCode == 429) // Too Many Requests
{
await ApplyBackoffStrategyAsync();
continue;
}
response.EnsureSuccessStatusCode();
_domainDelay[domain] = DateTime.Now;
return await response.Content.ReadAsStringAsync();
}
catch (HttpRequestException ex) when (ex.StatusCode == System.Net.HttpStatusCode.Forbidden)
{
await ApplyBackoffStrategyAsync();
}
catch (Exception ex)
{
Console.WriteLine($"Attempt {i + 1} failed: {ex.Message}");
if (i == retryCount - 1)
throw;
}
}
throw new InvalidOperationException("All retry attempts failed");
}
private HttpClient CreateDisposableClient()
{
var handler = new HttpClientHandler
{
Proxy = _proxies.Count > 0 ? _proxies[_random.Next(_proxies.Count)] : null,
UseProxy = _proxies.Count > 0,
AutomaticDecompression = DecompressionMethods.All,
UseCookies = false
};
var client = new HttpClient(handler)
{
Timeout = TimeSpan.FromSeconds(30)
};
client.DefaultRequestHeaders.UserAgent.ParseAdd(_userAgents[_random.Next(_userAgents.Count)]);
client.DefaultRequestHeaders.Accept.ParseAdd("text/html,application/xhtml+xml,application/xml;q=0.9");
client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US,en;q=0.5");
client.DefaultRequestHeaders.Referrer = new Uri("https://www.google.com/");
return client;
}
private async Task ApplyBackoffStrategyAsync()
{
var delay = _random.Next(5000, 15000); // random 5-15 second backoff
Console.WriteLine($"Applying backoff delay: {delay}ms");
await Task.Delay(delay);
}
}
```
3.3 CAPTCHA Handling
A comparison of automated CAPTCHA-solving approaches:

| Approach | Pros | Cons | Best suited for |
| --- | --- | --- | --- |
| Third-party service (2Captcha) | High solve rate | Paid | Complex CAPTCHAs |
| OCR library (Tesseract) | Free | Limited accuracy | Simple text CAPTCHAs |
| Machine-learning model | Customizable | High development cost | Specific CAPTCHA types |
| Manual solving | 100% accurate | Slow | Mission-critical tasks |
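The free OCR route from the table can be tried first with the Tesseract NuGet package. A sketch under the assumption that a ./tessdata folder holds the trained language data; expect it to work only on simple, low-distortion text CAPTCHAs.
```csharp
using Tesseract;

public static class OcrCaptchaSolver
{
    public static string? TrySolve(string imagePath)
    {
        using var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default);
        // Most text CAPTCHAs are alphanumeric; constraining the charset helps accuracy
        engine.SetVariable("tessedit_char_whitelist",
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
        using var image = Pix.LoadFromFile(imagePath);
        using var page = engine.Process(image);
        var text = page.GetText()?.Trim();
        // Treat low-confidence reads as a miss rather than returning garbage
        return page.GetMeanConfidence() > 0.6f ? text : null;
    }
}
```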
An example integrating the 2Captcha service:
```csharp
public class CaptchaSolver
{
private readonly string _apiKey;
private readonly HttpClient _httpClient;
public CaptchaSolver(string apiKey)
{
_apiKey = apiKey;
_httpClient = new HttpClient { BaseAddress = new Uri("https://2captcha.com/") };
}
public async Task<string> SolveReCaptchaV2Async(string siteKey, string pageUrl)
{
var parameters = new Dictionary<string, string>
{
["key"] = _apiKey,
["method"] = "userrecaptcha",
["googlekey"] = siteKey,
["pageurl"] = pageUrl,
["json"] = "1"
};
// Submit the CAPTCHA solving request
var response = await _httpClient.PostAsync("in.php", new FormUrlEncodedContent(parameters));
var result = await response.Content.ReadFromJsonAsync<CaptchaResponse>();
if (result?.Status != 1 || string.IsNullOrEmpty(result.Request))
throw new Exception($"Failed to submit captcha: {result?.ErrorText}");
string captchaId = result.Request;
// Poll for the result
for (int i = 0; i < 30; i++)
{
await Task.Delay(5000); // poll every 5 seconds
var checkResponse = await _httpClient.GetAsync($"res.php?key={_apiKey}&action=get&id={captchaId}&json=1");
var checkResult = await checkResponse.Content.ReadFromJsonAsync<CaptchaResponse>();
if (checkResult?.Status == 1)
return checkResult.Request;
if (checkResult?.Request == "CAPCHA_NOT_READY") // the API reports "not ready" in the request field, with this spelling
continue;
throw new Exception($"Failed to solve captcha: {checkResult?.ErrorText}");
}
throw new Exception("Captcha solving timeout");
}
private class CaptchaResponse
{
public int Status { get; set; }
public string? Request { get; set; }
public string? ErrorText { get; set; }
}
}
```
4. Distributed Crawler System Design
4.1 Architectural Considerations
Large-scale crawler systems must account for:
- URL deduplication: Bloom filters or a distributed cache (see the sketch below)
- Task scheduling: message queues or a dedicated scheduling service
- State persistence: storing crawl state in a database
- Failure recovery: checkpoints and retry mechanisms
- Monitoring and alerting: performance metrics and anomaly detection
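To make the deduplication point concrete, here is a minimal in-process Bloom filter sketch. In a real cluster the same structure usually lives in Redis (the RedisBloom module) so every worker shares it; the sizes and hash count below are illustrative.
```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class UrlBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public UrlBloomFilter(int capacityBits = 1 << 24, int hashCount = 4)
    {
        _bits = new BitArray(capacityBits);
        _hashCount = hashCount;
    }

    // False means "definitely never seen"; true may be a false positive.
    public bool MightContain(string url)
    {
        foreach (var i in Indexes(url))
            if (!_bits[i]) return false;
        return true;
    }

    public void Add(string url)
    {
        foreach (var i in Indexes(url))
            _bits[i] = true;
    }

    // Kirsch-Mitzenmacher double hashing: derive k indexes from two base hashes.
    // Note: string.GetHashCode is randomized per process in .NET, so this
    // filter is only meaningful within a single process lifetime.
    private IEnumerable<int> Indexes(string url)
    {
        int h1 = url.GetHashCode();
        int h2 = StringComparer.OrdinalIgnoreCase.GetHashCode(url);
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((uint)(h1 + i * h2) % (uint)_bits.Length);
    }
}
```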
A Redis-backed distributed URL manager:
```csharp
public class RedisUrlManager
{
private readonly IDatabase _db;
private readonly string _queueKey = "url:queue";
private readonly string _visitedKey = "url:visited";
private readonly string _errorKey = "url:error";
public RedisUrlManager(IConnectionMultiplexer redis)
{
_db = redis.GetDatabase();
}
public async Task<bool> AddUrlAsync(string url)
{
// Deduplicate with a Redis set; at larger scale, a Bloom filter (RedisBloom module) saves memory
if (await _db.SetContainsAsync(_visitedKey, url))
return false;
// Push onto the pending-crawl queue
await _db.ListLeftPushAsync(_queueKey, url);
return true;
}
public async Task<string?> GetNextUrlAsync()
{
return await _db.ListRightPopAsync(_queueKey);
}
public async Task MarkAsCompletedAsync(string url, bool success = true)
{
await _db.SetAddAsync(_visitedKey, url);
if (!success)
await _db.SetAddAsync(_errorKey, url);
}
public async Task<long> GetQueueLengthAsync()
{
return await _db.ListLengthAsync(_queueKey);
}
public async Task<IEnumerable<string>> GetErrorUrlsAsync()
{
var values = await _db.SetMembersAsync(_errorKey);
return values.Select(v => v.ToString());
}
}
```
4.2 A Crawler Built on the Actor Model
A distributed crawler using the Orleans framework (the code targets Orleans 3.x, where OnActivateAsync takes no parameters):
```csharp
// Grain interface definition
public interface ICrawlerGrain : IGrainWithStringKey
{
Task StartCrawlingAsync(string startUrl);
Task<int> GetCrawledCountAsync();
}
// Grain implementation
public class CrawlerGrain : Grain, ICrawlerGrain
{
private readonly IHttpClientFactory _httpClientFactory;
private readonly ILogger<CrawlerGrain> _logger;
private readonly IGrainFactory _grainFactory;
private int _crawledCount;
private readonly HashSet<string> _visitedUrls = new();
public CrawlerGrain(
IHttpClientFactory httpClientFactory,
ILogger<CrawlerGrain> logger,
IGrainFactory grainFactory)
{
_httpClientFactory = httpClientFactory;
_logger = logger;
_grainFactory = grainFactory;
}
public override Task OnActivateAsync()
{
_logger.LogInformation($"Crawler {this.GetPrimaryKeyString()} activated");
return base.OnActivateAsync();
}
public async Task StartCrawlingAsync(string startUrl)
{
if (_visitedUrls.Contains(startUrl))
return;
_visitedUrls.Add(startUrl);
try
{
var client = _httpClientFactory.CreateClient();
var response = await client.GetStringAsync(startUrl);
// Parse the page content
var doc = new HtmlDocument();
doc.LoadHtml(response);
var links = doc.DocumentNode.SelectNodes("//a[@href]")
?.Select(a => a.GetAttributeValue("href", null))
.Where(href => !string.IsNullOrEmpty(href))
.Select(href => new Uri(new Uri(startUrl), href).AbsoluteUri)
.Where(url => url.StartsWith("http"))
.Distinct()
.ToList();
_crawledCount++;
_logger.LogInformation($"Crawled {startUrl}, found {links?.Count ?? 0} links");
// Fan newly found links out across the cluster. Caution: awaiting
// recursive grain calls can deadlock on link cycles with non-reentrant grains
if (links != null)
{
var tasks = new List<Task>();
foreach (var link in links)
{
// Keying grains by URL lets Orleans spread them across silos (consistent hashing)
var grain = _grainFactory.GetGrain<ICrawlerGrain>(link);
tasks.Add(grain.StartCrawlingAsync(link));
}
await Task.WhenAll(tasks);
}
}
catch (Exception ex)
{
_logger.LogError(ex, $"Error crawling {startUrl}");
}
}
public Task<int> GetCrawledCountAsync()
{
return Task.FromResult(_crawledCount);
}
}
```
5. Performance Optimization and Resource Management
5.1 Efficient Concurrency Control
Using System.Threading.Channels to implement a producer-consumer pattern. One pitfall: the channel writer must not be completed while consumers can still discover new links, so completion is tied to a pending-work counter here:
```csharp
public class ChannelBasedCrawler
{
    private readonly Channel<string> _urlChannel;
    private readonly IHttpClientFactory _httpClientFactory;
    private readonly ConcurrentDictionary<string, byte> _visitedUrls = new();
    private readonly int _maxConcurrency;
    private int _pendingCount; // URLs enqueued but not yet fully processed

    public ChannelBasedCrawler(IHttpClientFactory httpClientFactory, int maxConcurrency = 10)
    {
        _httpClientFactory = httpClientFactory;
        _maxConcurrency = maxConcurrency;
        _urlChannel = Channel.CreateUnbounded<string>();
    }

    public async Task StartAsync(string startUrl, CancellationToken cancellationToken = default)
    {
        // Seed the channel. Do NOT complete the writer here: consumers still
        // need to enqueue the links they discover.
        Interlocked.Increment(ref _pendingCount);
        await _urlChannel.Writer.WriteAsync(startUrl, cancellationToken);

        // Consumer tasks
        var consumerTasks = Enumerable.Range(0, _maxConcurrency)
            .Select(_ => ProcessUrlsAsync(cancellationToken))
            .ToArray();
        await Task.WhenAll(consumerTasks);
    }

    private async Task ProcessUrlsAsync(CancellationToken cancellationToken)
    {
        await foreach (var url in _urlChannel.Reader.ReadAllAsync(cancellationToken))
        {
            try
            {
                // TryAdd is an atomic check-and-mark, avoiding the race in a
                // separate ContainsKey-then-assign sequence
                if (!_visitedUrls.TryAdd(url, 0))
                    continue;

                var client = _httpClientFactory.CreateClient();
                var response = await client.GetStringAsync(url, cancellationToken);
                var doc = new HtmlDocument();
                doc.LoadHtml(response);
                var links = doc.DocumentNode.SelectNodes("//a[@href]")
                    ?.Select(a => a.GetAttributeValue("href", null))
                    .Where(href => !string.IsNullOrEmpty(href))
                    .Select(href => new Uri(new Uri(url), href).AbsoluteUri)
                    .Where(newUrl => newUrl.StartsWith("http"))
                    .Distinct()
                    .ToList();
                if (links != null)
                {
                    foreach (var link in links)
                    {
                        if (!_visitedUrls.ContainsKey(link))
                        {
                            Interlocked.Increment(ref _pendingCount);
                            await _urlChannel.Writer.WriteAsync(link, cancellationToken);
                        }
                    }
                }
                Console.WriteLine($"Processed {url}, found {links?.Count ?? 0} links");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error processing {url}: {ex.Message}");
            }
            finally
            {
                // Once every enqueued URL has been handled, close the channel
                // so the consumers' ReadAllAsync loops terminate
                if (Interlocked.Decrement(ref _pendingCount) == 0)
                    _urlChannel.Writer.TryComplete();
            }
        }
    }
}
```
5.2 Memory Optimization Techniques
The object-pool pattern reduces GC pressure (a hand-rolled pool for illustration; Microsoft.Extensions.ObjectPool provides a framework implementation):
```csharp
public class HtmlParserPool
{
private readonly ConcurrentBag<HtmlDocument> _pool = new();
private int _count;
private readonly int _maxSize;
public HtmlParserPool(int maxSize = 20)
{
_maxSize = maxSize;
}
public HtmlDocument Get()
{
if (_pool.TryTake(out var document))
return document;
if (_count < _maxSize)
{
Interlocked.Increment(ref _count);
return new HtmlDocument();
}
throw new InvalidOperationException("Pool exhausted");
}
public void Return(HtmlDocument document)
{
document.DocumentNode.RemoveAll();
_pool.Add(document);
}
}
// Usage example
var pool = new HtmlParserPool();
var document = pool.Get();
try
{
document.LoadHtml(htmlContent);
// parsing work...
}
finally
{
pool.Return(document);
}
```
6. Data Storage and Processing
6.1 Choosing a Storage Backend
Choose storage to match your data volume and access patterns:

| Storage type | Best suited for | .NET integration |
| --- | --- | --- |
| Relational database (SQL Server) | Structured data, complex queries | Entity Framework Core |
| NoSQL (MongoDB) | Semi-structured data, flexible schema | MongoDB.Driver |
| Search engine (Elasticsearch) | Full-text search, log analytics | NEST |
| File storage (Parquet) | Big-data analytics, data lakes | Parquet.NET |
| Distributed cache (Redis) | High-speed access, transient data | StackExchange.Redis |
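As a counterpart to the search-engine example that follows, here is a brief MongoDB.Driver sketch for semi-structured results. It assumes a local mongod and reuses the WebPage document type defined in the Elasticsearch example below.
```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Driver;

public class MongoPageStore
{
    private readonly IMongoCollection<WebPage> _pages;

    public MongoPageStore(string connectionString = "mongodb://localhost:27017")
    {
        var client = new MongoClient(connectionString);
        _pages = client.GetDatabase("crawler").GetCollection<WebPage>("pages");
    }

    // Upsert on URL so re-crawled pages overwrite rather than duplicate
    public Task SaveAsync(WebPage page) =>
        _pages.ReplaceOneAsync(p => p.Url == page.Url, page,
            new ReplaceOptions { IsUpsert = true });

    public Task<List<WebPage>> FindByTitleAsync(string keyword) =>
        _pages.Find(p => p.Title.Contains(keyword)).ToListAsync();
}
```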
Elasticsearch integration example (using the NEST client):
```csharp
public class ElasticSearchService
{
private readonly ElasticClient _client;
public ElasticSearchService(string url = "http://localhost:9200")
{
var settings = new ConnectionSettings(new Uri(url))
.DefaultIndex("webpages")
.EnableDebugMode()
.PrettyJson();
_client = new ElasticClient(settings);
}
public async Task IndexPageAsync(WebPage page)
{
var response = await _client.IndexDocumentAsync(page);
if (!response.IsValid)
throw new Exception($"Failed to index document: {response.DebugInformation}");
}
public async Task<IReadOnlyCollection<WebPage>> SearchAsync(string query)
{
var response = await _client.SearchAsync<WebPage>(s => s
.Query(q => q
.MultiMatch(m => m
.Query(query)
.Fields(f => f
.Field(p => p.Title)
.Field(p => p.Content)
)
)
)
.Highlight(h => h
.Fields(f => f
.Field(p => p.Content)
)
)
);
return response.Documents;
}
}
public class WebPage
{
public string Id { get; set; } = Guid.NewGuid().ToString();
public string Url { get; set; } = string.Empty;
public string Title { get; set; } = string.Empty;
public string Content { get; set; } = string.Empty;
public DateTime CrawledTime { get; set; } = DateTime.UtcNow;
}
```
6.2 Data Cleaning and Transformation
Data cleaning with LINQ (RawProduct and CleanProduct are assumed to be simple DTO classes):
```csharp
public class DataCleaner
{
public IEnumerable<CleanProduct> CleanProducts(IEnumerable<RawProduct> rawProducts)
{
return rawProducts
.Where(p => !string.IsNullOrWhiteSpace(p.Name))
.Select(p => new CleanProduct
{
Id = GenerateProductId(p.Name),
Name = NormalizeName(p.Name),
Price = ParsePrice(p.PriceText),
Category = InferCategory(p.Name, p.Description),
Features = ExtractFeatures(p.Description).ToList(),
LastUpdated = DateTime.UtcNow
})
.Where(p => p.Price > 0)
.GroupBy(p => p.Id)
.Select(g => g.OrderByDescending(p => p.LastUpdated).First());
}
private string GenerateProductId(string name)
{
var normalized = name.ToLowerInvariant().Trim();
using var sha256 = SHA256.Create();
var hashBytes = sha256.ComputeHash(Encoding.UTF8.GetBytes(normalized));
return BitConverter.ToString(hashBytes).Replace("-", "").Substring(0, 12);
}
private decimal ParsePrice(string priceText)
{
if (string.IsNullOrWhiteSpace(priceText))
return 0;
var cleanText = new string(priceText
.Where(c => char.IsDigit(c) || c == '.' || c == ',')
.ToArray());
if (decimal.TryParse(cleanText, NumberStyles.Currency, CultureInfo.InvariantCulture, out var price))
return price;
return 0;
}
private string NormalizeName(string name)
{
if (string.IsNullOrWhiteSpace(name))
return string.Empty;
return Regex.Replace(name.Trim(), @"\s+", " ");
}
private IEnumerable<string> ExtractFeatures(string description)
{
if (string.IsNullOrWhiteSpace(description))
yield break;
var sentences = description.Split(new[] { '.', ';', '\n' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var sentence in sentences)
{
var cleanSentence = sentence.Trim();
if (cleanSentence.Length > 10 && cleanSentence.Length < 150)
yield return cleanSentence;
}
}
private string InferCategory(string name, string description)
{
    // Stub so the snippet compiles; plug in real categorization logic here
    return "uncategorized";
}
}
```
7. Monitoring and Operations
7.1 Health Monitoring and Metrics Collection
System monitoring with App.Metrics:
```csharp
public class CrawlerMetrics
{
private readonly IMetricsRoot _metrics;
public CrawlerMetrics()
{
_metrics = new MetricsBuilder()
.OutputMetrics.AsPrometheusPlainText()
.Configuration.Configure(options =>
{
options.DefaultContextLabel = "crawler";
options.GlobalTags.Add("host", Environment.MachineName);
})
.Build();
}
public void RecordRequest(string domain, int statusCode, long elapsedMs)
{
_metrics.Measure.Timer.Time(MetricsOptions.RequestTimer, elapsedMs, TimeUnit.Milliseconds, tags: new MetricTags("domain", domain));
_metrics.Measure.Meter.Mark(MetricsOptions.RequestMeter, tags: new MetricTags(new[] { "domain", "status" }, new[] { domain, statusCode.ToString() }));
}
public void RecordError(string domain, string errorType)
{
_metrics.Measure.Counter.Increment(MetricsOptions.ErrorCounter, tags: new MetricTags(new[] { "domain", "type" }, new[] { domain, errorType }));
}
public async Task<string> GetMetricsAsPrometheusAsync()
{
var snapshot = _metrics.Snapshot.Get();
var formatter = new MetricsPrometheusTextOutputFormatter();
using var stream = new MemoryStream();
await formatter.WriteAsync(stream, snapshot);
stream.Position = 0;
using var reader = new StreamReader(stream);
return await reader.ReadToEndAsync();
}
public static class MetricsOptions
{
public static readonly TimerOptions RequestTimer = new()
{
Name = "request_duration",
MeasurementUnit = Unit.Requests,
DurationUnit = TimeUnit.Milliseconds,
RateUnit = TimeUnit.Minutes
};
public static readonly MeterOptions RequestMeter = new()
{
Name = "request_rate",
MeasurementUnit = Unit.Requests,
RateUnit = TimeUnit.Minutes
};
public static readonly CounterOptions ErrorCounter = new()
{
Name = "error_count",
MeasurementUnit = Unit.Errors
};
}
}
```
7.2 Logging Best Practices
Structured logging configuration (AddFile here comes from a third-party file-logging provider such as NReco.Logging.File):
```csharp
public static class LoggingConfiguration
{
public static ILoggerFactory CreateLoggerFactory(string serviceName)
{
return LoggerFactory.Create(builder =>
{
builder.AddConfiguration(LoadLoggingConfiguration());
// Console output (development)
builder.AddConsole(options =>
{
options.FormatterName = "json";
}).AddJsonConsole(options =>
{
options.IncludeScopes = true;
options.TimestampFormat = "yyyy-MM-ddTHH:mm:ss.fffZ";
options.JsonWriterOptions = new JsonWriterOptions { Indented = false };
});
// File output
builder.AddFile("logs/crawler-{Date}.log", fileOptions =>
{
fileOptions.FormatLogFileName = fName => string.Format(fName, DateTime.Now);
fileOptions.FileSizeLimitBytes = 50 * 1024 * 1024; // 50MB
fileOptions.RetainedFileCountLimit = 30; // keep the 30 most recent files
});
// Application Insights (production)
if (!string.IsNullOrEmpty(Environment.GetEnvironmentVariable("APPINSIGHTS_INSTRUMENTATIONKEY")))
{
builder.AddApplicationInsights();
}
// Global filter: drop noisy Microsoft.* logs below Warning
builder.AddFilter((category, level) =>
{
if (category.StartsWith("Microsoft") && level < LogLevel.Warning)
return false;
return level >= LogLevel.Information;
});
// Global minimum level (note: the serviceName parameter is not otherwise used here)
builder.SetMinimumLevel(LogLevel.Information);
});
}
private static IConfiguration LoadLoggingConfiguration()
{
return new ConfigurationBuilder()
.SetBasePath(Directory.GetCurrentDirectory())
.AddJsonFile("appsettings.json")
.AddJsonFile($"appsettings.{Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT")}.json", optional: true)
.AddEnvironmentVariables()
.Build();
}
}
// Usage example
var loggerFactory = LoggingConfiguration.CreateLoggerFactory("WebCrawler");
var logger = loggerFactory.CreateLogger<Program>();
try
{
logger.LogInformation("Starting crawler at {StartTime}", DateTime.UtcNow);
// crawling logic...
logger.LogInformation("Completed crawling {PageCount} pages", pageCount);
}
catch (Exception ex)
{
logger.LogError(ex, "Unexpected error occurred during crawling");
throw;
}
```
8. Legal and Ethical Considerations
8.1 Principles of Lawful Crawling
- Respect robots.txt: parse and honor it automatically
- Throttle request rates: avoid putting load on the target server (see the rate-limiter sketch below)
- Follow data-use terms: check each site's terms of service
- Protect privacy: do not scrape personal information
- Mind copyright: make only fair use of scraped content
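For the throttling point, .NET 7's built-in System.Threading.RateLimiting package provides token-bucket limiters. A sketch that allows roughly one request per second per host (the PoliteFetcher class and its numbers are illustrative):
```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

public class PoliteFetcher
{
    private readonly ConcurrentDictionary<string, RateLimiter> _limiters = new();
    private readonly HttpClient _client = new();

    public async Task<string> GetAsync(Uri url)
    {
        // One token bucket per host: roughly one request per second
        var limiter = _limiters.GetOrAdd(url.Host, _ => new TokenBucketRateLimiter(
            new TokenBucketRateLimiterOptions
            {
                TokenLimit = 1,
                TokensPerPeriod = 1,
                ReplenishmentPeriod = TimeSpan.FromSeconds(1),
                QueueLimit = int.MaxValue,
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst
            }));

        // Waits until this host's bucket has a token available
        using RateLimitLease lease = await limiter.AcquireAsync();
        return await _client.GetStringAsync(url);
    }
}
```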
A robots.txt parser implementation:
```csharp
public class RobotsTxtParser
{
private static readonly Regex _directiveRegex = new(@"^(?<directive>[A-Za-z-]+):\s*(?<value>.+)$", RegexOptions.Compiled);
private static readonly Regex _pathRegex = new(@"^/(.*)", RegexOptions.Compiled);
private readonly Dictionary<string, List<string>> _rules = new(StringComparer.OrdinalIgnoreCase);
private TimeSpan _crawlDelay = TimeSpan.Zero;
public async Task LoadFromUrlAsync(string baseUrl)
{
var uri = new Uri(new Uri(baseUrl), "/robots.txt");
using var client = new HttpClient();
client.DefaultRequestHeaders.UserAgent.ParseAdd("MyCrawler/1.0");
try
{
var response = await client.GetStringAsync(uri);
ParseContent(response);
}
catch (HttpRequestException)
{
// If robots.txt is missing or unreachable, default to allowing everything
}
}
public void ParseContent(string content)
{
string? currentUserAgent = null;
foreach (var line in content.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries))
{
var match = _directiveRegex.Match(line.Trim());
if (!match.Success) continue;
var directive = match.Groups["directive"].Value.Trim();
var value = match.Groups["value"].Value.Trim();
switch (directive.ToLower())
{
case "user-agent":
currentUserAgent = value;
break;
case "disallow":
if (currentUserAgent != null && !string.IsNullOrWhiteSpace(value))
{
if (!_rules.TryGetValue(currentUserAgent, out var disallows))
{
disallows = new List<string>();
_rules[currentUserAgent] = disallows;
}
disallows.Add(value);
}
break;
case "allow":
    // Allow rules could be collected the same way if needed
    break;
case "crawl-delay":
    if (double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var seconds))
        _crawlDelay = TimeSpan.FromSeconds(seconds);
    break;
}
}
}

public TimeSpan CrawlDelay => _crawlDelay;
}
```
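The parser above only collects rules; a query helper like the following could be added to it. This is a simplified sketch: real robots.txt matching also involves wildcards and longest-match Allow precedence, which are omitted here.
```csharp
// Inside RobotsTxtParser: checks a path against the Disallow prefixes
// recorded for the given user-agent and for the wildcard agent "*"
public bool IsAllowed(string userAgent, string path)
{
    foreach (var agent in new[] { userAgent, "*" })
    {
        if (_rules.TryGetValue(agent, out var disallows) &&
            disallows.Any(prefix => path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)))
        {
            return false;
        }
    }
    return true;
}
```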