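How to set up long-term monitoring of the thread pool in a .NET Core web application

Long-term monitoring of the thread pool in a .NET Core web application is a systematic effort that spans code integration, metric collection, storage, visualization, and alerting.

Core idea

Thread pool monitoring does not stand alone. It is only meaningful when analyzed together with the application's overall performance metrics (request volume, response time, error rate). The goal: when performance degrades, be able to tell quickly whether the thread pool is the cause, and then locate the root cause.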

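Step 1: Define the core metrics to monitor

.NET's System.Threading.ThreadPool class exposes the following key values (the three properties below are available from .NET Core 3.0 onward):

Thread Count
- ThreadPool.ThreadCount: the total number of threads currently in the pool, busy and idle. This is the most fundamental metric.
Work Item Queue Length
- ThreadPool.PendingWorkItemCount: the number of work items queued but not yet started. A growing backlog is the most direct signal of a performance problem. On older runtimes, ThreadPool.GetMaxThreads minus ThreadPool.GetAvailableThreads(out workerThreads, out ioThreads) gives the number of threads in use, which is only an indirect proxy and not the queue length itself.
Completed Work Items
- ThreadPool.CompletedWorkItemCount: the total number of work items completed since startup. Use it to compute throughput (a rate).
Thread Injection Rate
- Track how fast ThreadCount grows. Slow growth is normal, but a steep rise over a short period (a "thread storm") indicates a large number of blocking tasks.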
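
As a rough illustration of how the values above can be read in one place, here is a minimal sketch. The ThreadPoolSnapshot and ThreadPoolProbe names are hypothetical helpers introduced for this example, and the code assumes C# 10 on .NET 6 or later (for record struct).

```csharp
using System;
using System.Threading;

// Hypothetical container for one sample of the thread pool state.
public readonly record struct ThreadPoolSnapshot(
    int ThreadCount,
    long PendingWorkItems,
    long CompletedWorkItems,
    int BusyWorkerThreads,
    int MaxWorkerThreads);

public static class ThreadPoolProbe
{
    // Reads the current thread pool values. Each value is read separately,
    // so the snapshot is approximate rather than atomic.
    public static ThreadPoolSnapshot Capture()
    {
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);
        ThreadPool.GetAvailableThreads(out int availableWorkers, out _);

        return new ThreadPoolSnapshot(
            ThreadPool.ThreadCount,
            ThreadPool.PendingWorkItemCount,
            ThreadPool.CompletedWorkItemCount,
            maxWorkers - availableWorkers, // worker threads currently in use
            maxWorkers);
    }
}
```

Calling ThreadPoolProbe.Capture() on a timer and logging the result is enough for a first look before any monitoring backend is wired in.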

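In addition, these application metrics must be correlated with the thread pool data:

Application layer: requests per second (RPS), 95th/99th percentile response time, and error rate (5xx in particular).
System layer: CPU usage and memory usage.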

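Step 2: Choose a technology stack and implement monitoring (three mainstream options)

Choose one option, or combine several, depending on your company's existing stack and how much complexity you can accept.

Option 1: ASP.NET Core metrics with Prometheus + Grafana (first choice for cloud native)

This is currently one of the most popular and modern setups.

Expose a metrics endpoint:

Install the prometheus-net.AspNetCore NuGet package.
In Program.cs, enable metric collection and expose the endpoint: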

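```csharp
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
// ... other service registrations

var app = builder.Build();

// Collect ASP.NET Core HTTP request metrics
app.UseHttpMetrics();
// Expose the scrape endpoint for Prometheus. It is served on the
// application's own listening port(s), not on a separate port.
app.MapMetrics("/metrics");

// ... other middleware
app.Run();
```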

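Add custom thread pool metrics:

prometheus-net collects many metrics automatically. Depending on the package version, some runtime counters may already be exported, so check the /metrics output first to avoid duplicates. Defining the thread pool metrics yourself gives you explicit control over their names and sampling interval.
Create a background service, ThreadPoolMetricsService.cs: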

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Prometheus;

public class ThreadPoolMetricsService : BackgroundService
{
    private readonly Gauge _threadCountGauge;
    private readonly Gauge _pendingWorkItemGauge;
    private readonly Counter _completedWorkItemCounter;

    public ThreadPoolMetricsService()
    {
        _threadCountGauge = Metrics.CreateGauge(
            "dotnet_threadpool_thread_count",
            "Number of thread pool threads"
        );

        _pendingWorkItemGauge = Metrics.CreateGauge(
            "dotnet_threadpool_pending_work_item_count",
            "Number of pending work items"
        );

        _completedWorkItemCounter = Metrics.CreateCounter(
            "dotnet_threadpool_completed_work_item_total",
            "Total number of work items completed"
        );
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Refresh the metric values every 5 seconds.
            // ThreadCount, PendingWorkItemCount and CompletedWorkItemCount
            // are all available from .NET Core 3.0, so no version guard is needed.
            _threadCountGauge.Set(ThreadPool.ThreadCount);
            _pendingWorkItemGauge.Set(ThreadPool.PendingWorkItemCount);
            // IncTo raises the counter to the cumulative total and never lowers it.
            _completedWorkItemCounter.IncTo(ThreadPool.CompletedWorkItemCount);

            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }
}
```

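Register the service in Program.cs: builder.Services.AddHostedService<ThreadPoolMetricsService>();

Deploy and configure the infrastructure:
- Deploy Prometheus: configure scrape_configs to scrape your web application's /metrics endpoint.
- Deploy Grafana: add Prometheus as a data source.
Create a Grafana dashboard:
- Write PromQL queries to visualize the metrics, for example:
  - dotnet_threadpool_thread_count
  - rate(dotnet_threadpool_completed_work_item_total[5m]) (throughput)
  - dotnet_threadpool_pending_work_item_count
- Put the thread pool metrics on the same dashboard as http_requests_received_total (request volume), http_request_duration_seconds (latency), and similar metrics so they can be analyzed together.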
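
For the scrape_configs step above, a Prometheus configuration fragment might look like the following. The job name, target host, port, and scrape interval are placeholders chosen for illustration, not values from this setup.

```yaml
scrape_configs:
  - job_name: "dotnet-webapp"          # hypothetical job name
    scrape_interval: 15s               # assumed interval; tune to your needs
    metrics_path: /metrics             # matches app.MapMetrics("/metrics")
    static_configs:
      - targets: ["webapp.internal:8080"]   # hypothetical host:port of the app
```

The target must point at the port the application itself listens on, since the metrics endpoint is served by the same Kestrel listener as the rest of the app.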

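Option 2: Application Insights (first choice in the Azure ecosystem)

If you are already on Azure, Application Insights offers out-of-the-box integration, although custom thread pool metrics need some extra setup.

Integrate the Application Insights SDK:
- Install the Microsoft.ApplicationInsights.AspNetCore NuGet package.
- Enable it in Program.cs: builder.Services.AddApplicationInsightsTelemetry();

Send custom metrics:

Use the GetMetric method of TelemetryClient to send custom metrics. Create a hosted service similar to the one in Option 1: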

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.Extensions.Hosting;

public class ThreadPoolMetricsService : BackgroundService
{
    private readonly TelemetryClient _telemetryClient;

    public ThreadPoolMetricsService(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            _telemetryClient.GetMetric("ThreadPool.ThreadCount").TrackValue(ThreadPool.ThreadCount);
            _telemetryClient.GetMetric("ThreadPool.PendingWorkItemCount").TrackValue(ThreadPool.PendingWorkItemCount);
            // CompletedWorkItemCount is a cumulative counter, while these metrics
            // are aggregated snapshot values. Consider sending the difference
            // between samples as a rate instead of the raw total.

            // GetMetric pre-aggregates locally before sending, so frequent sampling is unnecessary.
            await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
        }
    }
}
```

Register the service: builder.Services.AddHostedService<ThreadPoolMetricsService>();
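
The code comment above suggests sending a rate rather than the cumulative CompletedWorkItemCount. One way to do that is sketched below. CounterRate is a hypothetical helper introduced for this example and is not part of the Application Insights SDK.

```csharp
using System;

// Hypothetical helper that turns a cumulative counter into a per-second rate.
public sealed class CounterRate
{
    private long? _previous;

    // Returns items per second since the previous sample.
    // Returns null for the first sample, for a non-positive interval,
    // or when the counter went backwards (for example after a restart).
    public double? Sample(long cumulativeTotal, TimeSpan elapsed)
    {
        long? previous = _previous;
        _previous = cumulativeTotal;

        if (previous is null || elapsed <= TimeSpan.Zero)
        {
            return null;
        }

        long delta = cumulativeTotal - previous.Value;
        if (delta < 0)
        {
            return null;
        }

        return delta / elapsed.TotalSeconds;
    }
}
```

Inside the sampling loop, the result of Sample(ThreadPool.CompletedWorkItemCount, TimeSpan.FromSeconds(30)) could be passed to TrackValue whenever it is not null, under a metric name of your choosing.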

View the metrics in the Azure portal:
- On the Metrics page of the Application Insights resource, you can find your custom metrics and build charts from them.

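Option 3: Diagnostic tools and EventSource (for deep diagnosis)

The options above are better suited to long-term monitoring. When you need to dig into a complex thread pool problem, however, .NET EventCounters and the dotnet-counters tool are invaluable.

Ad hoc diagnosis:
- On the production server, use the dotnet-counters command to monitor in real time. The thread pool counters belong to the System.Runtime provider (counter names may differ on newer runtime and tool versions):
```
dotnet-counters monitor --name <your-process-name> --counters System.Runtime[threadpool-thread-count,threadpool-queue-length,threadpool-completed-items-count]
```
  This shows the thread pool changing in real time, which is ideal during a load test or while an incident is in progress.
Listening to the EventSource in code:
- The runtime publishes its thread pool counters through the System.Runtime EventSource. You can write an EventListener that collects these events over the long term and forwards them to your monitoring system (such as Elasticsearch), but this is usually more complex and only worth it for very specific needs.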
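
A minimal sketch of such an EventListener follows. It assumes the runtime's System.Runtime event source and its threadpool-* counter names, and it writes to the console where a real implementation would forward to a monitoring backend. ThreadPoolCounterListener is a hypothetical class name.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Hypothetical listener that prints the runtime's thread pool counters.
public sealed class ThreadPoolCounterListener : EventListener
{
    public static bool IsThreadPoolCounter(string name) =>
        name.StartsWith("threadpool-", StringComparison.Ordinal);

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the runtime to publish counter values every 5 seconds.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "5" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName != "EventCounters" || e.Payload is null)
        {
            return;
        }

        foreach (var item in e.Payload)
        {
            if (item is IDictionary<string, object> counter &&
                counter.TryGetValue("Name", out var nameObj) &&
                nameObj is string name &&
                IsThreadPoolCounter(name))
            {
                // Polling counters report "Mean"; incrementing counters report "Increment".
                object? value = counter.TryGetValue("Mean", out var mean) ? mean
                    : counter.TryGetValue("Increment", out var inc) ? inc
                    : null;
                Console.WriteLine($"{name} = {value}");
            }
        }
    }
}
```

Create a single instance at startup and keep a reference to it for the lifetime of the process, because a listener that gets disposed or collected stops receiving events.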

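Step 3: Set up alerting

The point of monitoring is to detect problems in time. Set alerts on the key metrics:

Queue backlog alert:
- Rule: dotnet_threadpool_pending_work_item_count > X for more than Y minutes.
- Note: choose the threshold X from your application's baseline. Even a sustained backlog of 5 to 10 items can indicate a problem. This is the alert that deserves the most attention.
Abnormal thread growth alert:
- Rule: delta(dotnet_threadpool_thread_count[5m]) > Z.
- Note: if the thread count grows by more than Z (for example 10) within 5 minutes, a "thread storm" may be under way.
Thread count ceiling alert:
- Rule: dotnet_threadpool_thread_count approaches your theoretical or configured maximum.
- Note: this guards against thread exhaustion bringing the application to a complete standstill.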
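
The three rules above could be written as a Prometheus rule file along these lines. All thresholds, durations, and the ceiling of 200 threads are placeholders to be replaced with values from your own baseline, and growth over the window is expressed with delta().

```yaml
groups:
  - name: dotnet-threadpool
    rules:
      - alert: ThreadPoolQueueBacklog
        expr: dotnet_threadpool_pending_work_item_count > 10   # X, placeholder
        for: 5m                                                # Y, placeholder
        labels:
          severity: warning
        annotations:
          summary: "Thread pool queue backlog on {{ $labels.instance }}"

      - alert: ThreadPoolRapidGrowth
        expr: delta(dotnet_threadpool_thread_count[5m]) > 10   # Z, placeholder
        labels:
          severity: warning
        annotations:
          summary: "Thread count growing quickly on {{ $labels.instance }}"

      - alert: ThreadPoolNearCeiling
        expr: dotnet_threadpool_thread_count > 0.9 * 200       # 200 = assumed configured maximum
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Thread pool close to its limit on {{ $labels.instance }}"
```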

Alerting tools:

Prometheus-based: use Alertmanager.
Azure: use Application Insights alerts or Azure Monitor alerts.
Other: use Grafana's built-in alerting, or integrate PagerDuty, OpsGenie, and similar services.

Summary and best practices

Step	Recommended approach	Tools
1. Code integration	Custom BackgroundService	`prometheus-net` or `Application Insights SDK`
2. Data collection	Pull model	Prometheus
	Push model	Application Insights
3. Visualization	Custom dashboards	Grafana (with Prometheus) or the Azure portal
4. Alerting	Threshold-based	Prometheus Alertmanager or Azure Alerts
5. Deep diagnosis	Ad hoc tools	`dotnet-counters`, `dotnet-dump`

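Best practices:

Establish a baseline: run your application under normal load and record the normal ranges for thread count, queue length, and so on. Base alert thresholds on these baselines.
Correlate: never look at thread pool metrics in isolation. When the queue backs up, always check CPU usage, response time, and error logs at the same time. A backlog under high CPU and a backlog under low CPU have very different causes (likely compute-bound work versus waiting on I/O).
Watch long-term trends: a healthy application handling a steady load should have a relatively stable thread count. Slow, continuous growth may point to a small resource leak (such as unclosed database connections) or to poor task scheduling.
Log context: when an alert fires, make sure your application logs already capture enough context (which requests were being processed, whether many exceptions were thrown). This helps troubleshooting considerably.

With the approaches above, you can build a robust, production-grade system for long-term thread pool monitoring and keep your .NET Core web application running reliably.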