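WebView: Building a Production Performance Monitoring System from Scratch

Preface

After years of building WebView-embedded pages, I have watched too many teams get stuck in the same loop: they ship a pile of performance optimizations without knowing what effect they actually had, and when a later release regresses, they only find out from user complaints.

In part 3 of this series we covered hands-on WebView performance optimization techniques. But what comes after the optimization? How do you verify that it worked? Where does real-user performance data come from? Who is the first to notice a regression?

A monitoring system answers these questions. It does not rehash optimization techniques; it is about building, from scratch, a production performance safeguard that keeps running on its own.

In this article I walk through the monitoring system that has been running in my current project for three years: metric design, data collection, reporting strategy, dashboards, alerting, and a real closed-loop case. The code samples are taken from that production setup, with nothing held back.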

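1. Why You Need a Monitoring System

1.1 The hard questions after an optimization

Once a performance optimization ships, you have certainly been asked:

"How much did this improve things for real users?"
"Is there data to back that up?"
"Will the next release regress again?"

If your answer is "it should be better" or "it looked fine in the test environment", your optimization work lacks a data feedback loop.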

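1.2 What a monitoring system solves

Scenario	Without monitoring	With monitoring
When a regression is discovered	After users complain	Within 30 minutes of the alert firing
Verifying an optimization	Gut feeling	Precise down to percentiles
Locating the problem	Hard to reproduce, needle in a haystack	Data-backed, fast to pinpoint
Confidence at release time	Nervous	Evidence-based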

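1.3 How it relates to Inspector

Part 8 covered production troubleshooting and part 6 mentioned WebView Inspector, so people ask how those relate to a monitoring system.

A simple analogy: the monitoring system is the "health check-up center" and Inspector is the "specialist clinic".

The monitoring system watches 24/7 and raises an alert automatically when a metric goes abnormal; Inspector is for deep investigation once a problem has been found. They complement each other rather than replace each other. Monitoring tells you that a metric is off; Inspector helps you find out why.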

2. Designing the Metrics

2.1 Core metric set

I group the metrics I have used over the years into three categories:

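WebView performance metric set
├── Loading performance
│   ├── White-screen time (First Blank Paint)
│   ├── First Contentful Paint (FCP)
│   ├── Largest Contentful Paint (LCP)
│   ├── Full page load time
│   └── Critical resource load time
├── Interaction performance
│   ├── First Input Delay (FID)
│   ├── Interaction to Next Paint (INP)
│   └── JS main-thread blocking time
└── Stability
    ├── JS error rate
    ├── Peak memory usage
    ├── Memory leak trend
    ├── WebView crash rate
    └── Bridge communication failure rate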

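2.2 Metric definitions and how to collect them

My rule when designing metrics: a metric you can collect is not necessarily a metric you should collect. Some teams try to ship every number the browser APIs expose, and end up with exploding data volume, overloaded servers, and the useful signal buried in noise. In my experience it pays to focus on the core metrics first and extend only when there is capacity left.

White-screen time (First Blank Paint)

The time from the user's tap until anything at all appears on the page; this is the first threshold of perceived performance. It is easy to confuse with FCP. White-screen time is measured from the user's tap, so it includes everything that happens before navigation even starts (WebView creation, for instance), whereas FCP marks when the first content pixel is painted, relative to navigation start. In many scenarios white-screen time reflects the user's waiting anxiety better than FCP does.

First-screen time (Above the Fold)

This is the metric most likely to start an argument. Product says the first screen takes 1 second, development says 3, QA says 2, and the reason is that nobody agrees on what "first screen" means. I suggest defining it via LSO (Last Significant Object): the time at which the last above-the-fold element finishes rendering. Which element that is has to be agreed with the UI designers up front.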
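To make the LSO definition concrete, here is a minimal sketch. It assumes your own instrumentation has already produced, for each significant element, its top offset and the time it finished rendering; the `firstScreenTime` helper and the record shape are hypothetical and not part of the original setup.

```javascript
// Hypothetical helper: `elements` is a list of { top, renderTime } records
// gathered by your own instrumentation (for example image onload timestamps).
function firstScreenTime(elements, viewportHeight) {
  // Only elements that start inside the first viewport count as above the fold
  const aboveFold = elements.filter((el) => el.top < viewportHeight);
  if (aboveFold.length === 0) return null;
  // LSO: the last significant above-the-fold element to finish rendering
  return Math.max(...aboveFold.map((el) => el.renderTime));
}

const sample = [
  { top: 0, renderTime: 420 },     // header
  { top: 300, renderTime: 1350 },  // banner image
  { top: 1200, renderTime: 2600 }  // below the fold, ignored
];
console.log(firstScreenTime(sample, 800)); // 1350
```

The element list is the part that needs agreement with design; the calculation itself stays trivial once that list is fixed.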

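Collecting the key metrics

Metric	Definition	Collection API	Target
FCP	First Contentful Paint	`PerformanceObserver` (type `paint`)	< 1.8s
LCP	Largest Contentful Paint	`PerformanceObserver` (type `largest-contentful-paint`)	< 2.5s
FID	First Input Delay	`PerformanceObserver` (type `first-input`)	< 100ms
CLS	Cumulative Layout Shift	`PerformanceObserver` (type `layout-shift`)	< 0.1
TTFB	Time to First Byte	`responseStart` of the navigation entry (`performance.timing.responseStart` on older WebViews)	< 600ms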
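The targets in the table can be turned into a single rating function so every collector labels values the same way. This is a sketch only: the `good` limits come from the table above, while the `poor` limits are an assumption based on the commonly published Web Vitals bands (the TTFB one in particular is a guess you should tune).

```javascript
// `good` limits are from the table; `poor` limits are assumptions.
const THRESHOLDS = {
  fcp: { good: 1800, poor: 3000 },   // ms
  lcp: { good: 2500, poor: 4000 },   // ms
  fid: { good: 100, poor: 300 },     // ms
  cls: { good: 0.1, poor: 0.25 },    // unitless
  ttfb: { good: 600, poor: 1800 }    // ms
};

function rate(metric, value) {
  const t = THRESHOLDS[metric];
  if (!t) return 'unknown';
  if (value < t.good) return 'good';
  return value <= t.poor ? 'needs-improvement' : 'poor';
}

console.log(rate('lcp', 2000)); // good
console.log(rate('lcp', 3000)); // needs-improvement
console.log(rate('lcp', 5000)); // poor
```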

Collecting white-screen time

White-screen time has no standard browser API, so it is approximated by watching for the first DOM change.

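// Approach: use a MutationObserver to detect the first DOM change.
// This marks when nodes are first inserted, not when pixels are painted,
// so treat the value as an approximation of white-screen time.
const firstBlankPaint = {
  startTime: 0,
  recorded: false,
  observer: null,

  start() {
    this.startTime = performance.now();

    this.observer = new MutationObserver(() => {
      if (!this.recorded) {
        // First DOM change seen; the content may not be painted yet
        this.record();
      }
    });

    this.observer.observe(document.documentElement, {
      childList: true,
      subtree: true
    });
  },

  record() {
    this.recorded = true;
    // Stop observing, otherwise the callback keeps firing on every DOM change
    this.observer.disconnect();
    const duration = performance.now() - this.startTime;

    // Report to the monitoring system
    monitor.report('first_blank_paint', duration);
  }
};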

First Contentful Paint (FCP) & Largest Contentful Paint (LCP)

Both are experience metrics backed by the Chrome team; LCP is one of the Core Web Vitals.

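// Collect with the PerformanceObserver API
const perfMetrics = {
  init() {
    // FCP: the 'paint' type also delivers 'first-paint', so select by name
    const fcpObserver = new PerformanceObserver((list) => {
      const fcp = list.getEntriesByName('first-contentful-paint')[0];
      if (!fcp) return;

      monitor.report('fcp', {
        value: fcp.startTime,
        rating: fcp.startTime < 1800 ? 'good' : 'needs-improvement'
      });
      fcpObserver.disconnect();
    });
    fcpObserver.observe({ type: 'paint', buffered: true });

    // LCP: the browser emits a new candidate whenever a larger element renders,
    // so keep the latest one and report it once when the page is hidden
    let lcp = null;
    const lcpObserver = new PerformanceObserver((list) => {
      const entries = list.getEntries();
      lcp = entries[entries.length - 1];
    });
    lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });

    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState !== 'hidden' || !lcp) return;
      monitor.report('lcp', {
        value: lcp.startTime,
        element: lcp.element?.tagName,  // which element was the LCP element
        rating: lcp.startTime < 2500 ? 'good' : 'needs-improvement'
      });
      lcp = null;  // report only once
      lcpObserver.disconnect();
    });
  }
};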

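First Input Delay (FID) & Interaction to Next Paint (INP)

These reflect how quickly the page responds to interaction, and directly determine whether the page feels responsive to the user's touch.
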
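// FID: responsiveness to the user's first interaction during page load
const fidCollector = {
  init() {
    const observer = new PerformanceObserver((list) => {
      const firstInput = list.getEntries()[0];

      // FID = time the event handler starts running - time of the input event
      const fid = firstInput.processingStart - firstInput.startTime;

      monitor.report('fid', {
        value: fid,
        eventType: firstInput.name,
        rating: fid < 100 ? 'good' : fid < 300 ? 'needs-improvement' : 'poor'
      });
    });

    observer.observe({ type: 'first-input', buffered: true });
  }
};

// INP: responsiveness across the whole page lifetime.
// Simplification: track the slowest interaction and report it once on page hide,
// instead of reporting every single event entry.
let worstInteraction = null;
const inpObserver = new PerformanceObserver((list) => {
  list.getEntries().forEach(entry => {
    // Only entries with an interactionId belong to a discrete user interaction
    if (entry.interactionId &&
        (!worstInteraction || entry.duration > worstInteraction.duration)) {
      worstInteraction = entry;
    }
  });
});
inpObserver.observe({ type: 'event', buffered: true, durationThreshold: 16 });

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState !== 'hidden' || !worstInteraction) return;
  const value = worstInteraction.duration;
  monitor.report('inp', {
    value,
    eventType: worstInteraction.name,
    rating: value < 200 ? 'good' : value < 500 ? 'needs-improvement' : 'poor'
  });
  worstInteraction = null;
});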
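A page-level INP value is not simply the slowest event. As a rough sketch of how a single number can be derived from the collected interactions, the helper below takes one latency per interaction and skips the single worst value for every 50 interactions, which approximates a high percentile. The `estimateInp` name is hypothetical, and the skip rule is my understanding of how common Web Vitals tooling behaves rather than something taken from this project.

```javascript
// durations: one latency (ms) per user interaction, i.e. the longest event
// that shares a given interactionId.
function estimateInp(durations) {
  if (durations.length === 0) return null;
  const sorted = [...durations].sort((a, b) => b - a); // slowest first
  // Ignore one outlier per 50 interactions so a single hiccup on a
  // long-lived page does not define the whole session.
  const skip = Math.min(sorted.length - 1, Math.floor(sorted.length / 50));
  return sorted[skip];
}

console.log(estimateInp([40, 320, 96])); // 320
```

With fewer than 50 interactions the result is just the slowest one, which is the common case for short WebView sessions.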

JS error rate

Many teams overlook this metric, yet it has a huge impact on user experience.

class JSErrorCollector {
  init() {
    // Uncaught JS errors
    window.addEventListener('error', (e) => {
      this.collectError({
        message: e.message,
        stack: e.error?.stack,
        filename: e.filename,
        lineno: e.lineno,
        colno: e.colno
      }, 'uncaught');
    });

    // Unhandled Promise rejections: the details live on e.reason, not on the event
    window.addEventListener('unhandledrejection', (e) => {
      const reason = e.reason;
      this.collectError({
        message: reason?.message || String(reason),
        stack: reason?.stack
      }, 'unhandledrejection');
    });

    // Global error hook of the framework, if present (Vue 2 style shown here)
    if (window.Vue) {
      window.Vue.config.errorHandler = (err, vm, info) => {
        // Report the component name only; the instance itself cannot be serialized
        this.collectError(
          { message: err.message, stack: err.stack },
          'vue',
          { component: vm?.$options?.name, info }
        );
      };
    }
  }

  collectError(e, type, extra = {}) {
    const errorInfo = {
      type,
      message: e.message || '',
      stack: e.stack || '',
      filename: e.filename || '',
      lineno: e.lineno || 0,
      colno: e.colno || 0,
      timestamp: Date.now(),
      url: location.href,
      userAgent: navigator.userAgent,
      ...extra
    };

    // Drop known-noisy errors
    if (this.shouldIgnore(errorInfo)) return;

    monitor.report('js_error', errorInfo);
  }

  shouldIgnore(error) {
    const ignorePatterns = [
      /ResizeObserver/i,
      /Failed to load resource/i,
      /favicon\.ico/i
    ];
    return ignorePatterns.some(p => p.test(error.message));
  }
}
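Raw error events are hard to read on a dashboard because the same bug shows up with different IDs, timings and cache-busting query strings. One way to group them, sketched here under the assumption that error records have the `type`, `message`, `filename` and `lineno` fields used above, is a normalized fingerprint. `errorFingerprint` and `groupErrors` are hypothetical helpers.

```javascript
// Build a stable key for "the same error" by removing the volatile parts.
function errorFingerprint(err) {
  const message = String(err.message || '')
    .replace(/https?:\/\/\S+/g, '<url>') // URLs vary per page or user
    .replace(/\d+/g, '<n>');             // IDs, timings, counters
  const file = String(err.filename || '').split('?')[0]; // drop cache-busting query
  return `${err.type}|${message}|${file}|${err.lineno || 0}`;
}

// Count occurrences per fingerprint.
function groupErrors(errors) {
  const counts = new Map();
  for (const e of errors) {
    const key = errorFingerprint(e);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const grouped = groupErrors([
  { type: 'uncaught', message: 'Timeout after 3000ms', filename: 'https://cdn.example.com/app.js?v=1', lineno: 10 },
  { type: 'uncaught', message: 'Timeout after 5000ms', filename: 'https://cdn.example.com/app.js?v=2', lineno: 10 }
]);
console.log(grouped.size); // 1
```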

Memory usage monitoring

This extends the memory management topic from part 5; tracking the memory trend in production matters just as much.

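// Memory monitoring. performance.memory is a non-standard, Chromium-only API,
// so it is available in Android WebView but not in iOS WKWebView.
class MemoryMonitor {
  constructor() {
    this.checkInterval = null;
    this.peak = 0;  // highest usedJSHeapSize seen so far
  }

  start() {
    if (!performance.memory) {
      console.warn('Memory API not supported');
      return;
    }

    // Sample periodically
    this.checkInterval = setInterval(() => {
      const memory = performance.memory;
      this.peak = Math.max(this.peak, memory.usedJSHeapSize);

      monitor.report('memory', {
        usedJSHeapSize: memory.usedJSHeapSize,      // used heap, bytes
        totalJSHeapSize: memory.totalJSHeapSize,    // allocated heap, bytes
        jsHeapSizeLimit: memory.jsHeapSizeLimit,    // heap limit, bytes
        // toFixed returns a string, so convert back to a number
        usagePercent: Number((memory.usedJSHeapSize / memory.jsHeapSizeLimit * 100).toFixed(2)),
        timestamp: Date.now()
      });
    }, 30000); // every 30 seconds

    // Report the peak when the page goes away; pagehide is generally more
    // reliable than beforeunload on mobile
    window.addEventListener('pagehide', () => {
      this.recordPeak();
    });
  }

  stop() {
    if (this.checkInterval) {
      clearInterval(this.checkInterval);
      this.checkInterval = null;
    }
  }

  recordPeak() {
    // Include the current reading in case the page closes between two samples
    this.peak = Math.max(this.peak, performance.memory.usedJSHeapSize);
    monitor.report('memory_peak', {
      usedJSHeapSize: this.peak,
      timestamp: Date.now()
    });
  }
}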
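The metric list includes a "memory leak trend", which the periodic samples make possible. A minimal sketch, assuming samples shaped like the `memory` events above: fit a least-squares line through heap size over time and flag sessions whose heap keeps growing. The helper names and the 1 MB per minute limit are illustrative assumptions, not values from the original system.

```javascript
// samples: [{ timestamp (ms), usedJSHeapSize (bytes) }], oldest first.
// Returns the least-squares slope in bytes per minute.
function heapGrowthPerMinute(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const t0 = samples[0].timestamp;
  const xs = samples.map((s) => (s.timestamp - t0) / 60000);
  const ys = samples.map((s) => s.usedJSHeapSize);
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Require a few samples before judging, so one GC pause does not trigger it.
function looksLikeLeak(samples, limitBytesPerMinute = 1024 * 1024) {
  return samples.length >= 5 && heapGrowthPerMinute(samples) > limitBytesPerMinute;
}
```

A rising slope is only a hint: caches that fill up legitimately look the same, so this is better suited to a trend panel than to a hard alert.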

3. Data Collection

3.1 Three-layer collection architecture

┌─────────────────────────────────────────────────────────┐
│                 Native collection layer                 │
│  • WebView creation / destruction time                  │
│  • Page load time (native network)                      │
│  • Peak memory (more accurate when read natively)       │
│  • Crash logs                                           │
└─────────────────────────────────────────────────────────┘
                            ↕ bridge
┌─────────────────────────────────────────────────────────┐
│                   H5 collection layer                   │
│  • Performance API (standard metrics)                   │
│  • Custom instrumentation (business metrics)            │
│  • Resource loading details                             │
│  • User behavior trail                                  │
└─────────────────────────────────────────────────────────┘
                            ↕ bridge
┌─────────────────────────────────────────────────────────┐
│             Correlation and analysis layer              │
│  • Joining Native + H5 data                             │
│  • Cleaning anomalous data                              │
│  • Attaching device info                                │
└─────────────────────────────────────────────────────────┘
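The correlation layer boils down to joining the two data streams on a shared key. A minimal sketch, assuming both sides tag their records with the same `pageId` (a hypothetical identifier the native side would hand to H5 through the bridge) and that native reports `webviewInitMs` and `deviceModel`:

```javascript
// Join H5 records with the native record of the same page view.
function mergeMetrics(nativeRecords, h5Records) {
  const nativeById = new Map(nativeRecords.map((r) => [r.pageId, r]));
  return h5Records.map((h5) => {
    const native = nativeById.get(h5.pageId);
    if (!native) return { ...h5, nativeMatched: false };
    return {
      ...h5,
      nativeMatched: true,
      webviewInitMs: native.webviewInitMs,
      deviceModel: native.deviceModel,
      // What the user actually waited for: WebView creation plus H5-measured FCP
      perceivedFcpMs: native.webviewInitMs + h5.fcp
    };
  });
}

const merged = mergeMetrics(
  [{ pageId: 'p1', webviewInitMs: 180, deviceModel: 'Pixel 6' }],
  [{ pageId: 'p1', fcp: 900 }, { pageId: 'p2', fcp: 700 }]
);
console.log(merged[0].perceivedFcpMs); // 1080
```

Unmatched records are kept and flagged rather than dropped, so a broken bridge shows up as a falling match rate instead of silently missing data.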

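3.2 Using the Performance API properly

Many people know performance.timing but not how to get the most out of it. One pitfall to watch for: Safari's support for the Performance API differs, and some fields may be unavailable on iOS, so compatibility handling is required.
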
// Full page-load timeline, based on the Navigation Timing Level 2 entry.
// performance.timing is deprecated; the navigation entry has the same field
// names, with times relative to navigation start. Call this after the load
// event has finished, otherwise loadEventEnd is still 0.
const getPageTimingDetails = () => {
  const nav = performance.getEntriesByType('navigation')[0];
  if (!nav) return null;  // entry type not supported in this WebView
  const paints = performance.getEntriesByType('paint');

  return {
    // DNS lookup
    dnsTime: nav.domainLookupEnd - nav.domainLookupStart,

    // TCP connect
    tcpTime: nav.connectEnd - nav.connectStart,

    // SSL/TLS (HTTPS only)
    sslTime: nav.secureConnectionStart > 0
      ? nav.connectEnd - nav.secureConnectionStart
      : 0,

    // Time to first byte, measured from navigation start
    ttfb: nav.responseStart,

    // Content download
    downloadTime: nav.responseEnd - nav.responseStart,

    // DOM parsing
    domParseTime: nav.domInteractive - nav.responseEnd,

    // DOM ready
    domReadyTime: nav.domContentLoadedEventEnd,

    // Full page load
    loadTime: nav.loadEventEnd,

    // First paint
    firstPaint: paints.find(e => e.name === 'first-paint')?.startTime || 0,

    // First contentful paint
    fcp: paints.find(e => e.name === 'first-contentful-paint')?.startTime || 0
  };
};

// Resource loading details
const getResourceTiming = () => {
  const resources = performance.getEntriesByType('resource');

  return resources
    .filter(r => r.initiatorType !== 'navigation')
    .map(r => ({
      name: r.name,
      type: r.initiatorType,
      duration: r.responseEnd - r.startTime,
      // size, dns, tcp and ttfb read 0 for cross-origin resources
      // served without a Timing-Allow-Origin header
      size: r.transferSize,
      dns: r.domainLookupEnd - r.domainLookupStart,
      tcp: r.connectEnd - r.connectStart,
      ttfb: r.responseStart - r.requestStart
    }))
    .sort((a, b) => b.duration - a.duration)  // slowest first
    .slice(0, 10);  // top 10 slowest resources
};

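3.3 Collecting through the native bridge

Some data can only be read accurately on the native side, such as WKWebView loading progress or iOS memory-pressure notifications. The notes below cover a few iOS pitfalls; the code sample that follows shows the Android side.

Notes on collecting from iOS WKWebView:

WebView creation timing: initializing a WKWebView takes time in itself. If you only create the WebView after the tap, that time has to be counted too.
WKURLSchemeHandler: iOS 11+ supports custom URL scheme handlers. Requests served this way no longer go through the system cache and need separate handling.
Prewarming: a WebView that was created ahead of time (prewarmed) starts from a different state than a cold one, so the two cases need to be reported separately.
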
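// Android native collection example
import android.graphics.Bitmap
import android.os.Debug
import android.util.Log
import android.webkit.WebView
import android.webkit.WebViewClient
import com.google.gson.Gson

class WebViewPerformanceMonitor(private val webView: WebView) {

    // From monitor setup to page finished. Call on the UI thread.
    fun monitorFirstFrame(pageUrl: String) {
        val startTime = System.currentTimeMillis()

        webView.webViewClient = object : WebViewClient() {
            override fun onPageStarted(view: WebView?, url: String?, favicon: Bitmap?) {
                super.onPageStarted(view, url, favicon)
                // Time between setup and the start of the page load
                val elapsed = System.currentTimeMillis() - startTime
                Log.d("Perf", "Page started: $url (requested $pageUrl) after ${elapsed}ms")
            }

            override fun onPageFinished(view: WebView?, url: String?) {
                super.onPageFinished(view, url)
                // Tell the H5 side when native saw the load finish
                view?.evaluateJavascript(
                    "window.__NATIVE_PAGE_FINISH__ = ${System.currentTimeMillis()}",
                    null
                )
            }
        }
    }

    // Memory usage of the app process, in bytes.
    // WebView exposes no public memory API, so this reads the process PSS instead
    // (Debug.getPss() returns kB). With a multi-process WebView the renderer
    // process is not included in this figure.
    fun getMemoryUsage(): Long = Debug.getPss() * 1024

    // Push natively collected metrics to the H5 side
    fun notifyH5(metrics: Map<String, Any>) {
        val json = Gson().toJson(metrics)
        webView.evaluateJavascript(
            "window.__NATIVE_METRICS__ = $json",
            null
        )
    }
}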

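4. Reporting Strategy

4.1 Designing the reporting mechanism

The reporting strategy makes or breaks a monitoring system. Designed well, it preserves data completeness while keeping server load and battery drain under control.
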
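class MonitorReporter {
  constructor() {
    this.queue = [];           // events waiting to be sent
    this.maxQueueSize = 100;   // flush when the queue reaches this size
    this.flushInterval = 5000; // flush at least every 5 seconds
    this.isReporting = false;
    this.endpoint = '/api/monitor/batch';

    this.init();
  }

  init() {
    // Flush immediately when the page is hidden (the user may leave at any time)
    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState === 'hidden') {
        this.flushOnExit();
      }
    });

    // Periodic flush, plus retry of anything cached offline
    setInterval(() => this.checkQueue(), this.flushInterval);

    // Last attempt when the page is being unloaded
    window.addEventListener('pagehide', () => this.flushOnExit());
  }

  report(key, value) {
    const data = {
      key,
      value,
      timestamp: Date.now(),
      url: location.href,
      // getSessionId / getUserId / getDeviceInfo are app-specific and not shown here
      sessionId: this.getSessionId(),
      userId: this.getUserId(),
      deviceInfo: this.getDeviceInfo()
    };

    this.queue.push(data);

    // Critical metrics go out immediately; otherwise flush when the queue is full
    if (this.isCriticalMetric(key) || this.queue.length >= this.maxQueueSize) {
      this.flush();
    }
  }

  isCriticalMetric(key) {
    // JS errors and crashes must be reported right away
    return ['js_error', 'webview_crash', 'memory_critical'].includes(key);
  }

  async flush() {
    if (this.isReporting || this.queue.length === 0) return;

    this.isReporting = true;
    const batch = this.queue.splice(0, this.maxQueueSize);

    try {
      await this.sendToServer(batch);
    } catch (error) {
      // Reporting failed: fall back to the offline cache
      this.handleReportFailure(batch, error);
    } finally {
      this.isReporting = false;
    }
  }

  // Exit path: sendBeacon is queued by the browser and survives page teardown.
  // A string body is sent as text/plain, so the server must parse it as JSON.
  flushOnExit() {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    const sent = navigator.sendBeacon &&
      navigator.sendBeacon(this.endpoint, JSON.stringify(batch));
    if (!sent) this.handleReportFailure(batch);
  }

  async sendToServer(data) {
    const res = await fetch(this.endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(data),
      // Key point: do not block when the user leaves the page.
      // Browsers cap keepalive request bodies at 64 KB.
      keepalive: true
    });
    // fetch only rejects on network errors, so treat HTTP errors as failures too
    if (!res.ok) throw new Error(`Report failed: HTTP ${res.status}`);
  }

  handleReportFailure(batch, error) {
    // Offline cache: fall back to localStorage, keeping the newest 500 events
    try {
      const failed = JSON.parse(localStorage.getItem('monitor_failed') || '[]');
      failed.push(...batch);
      localStorage.setItem('monitor_failed', JSON.stringify(failed.slice(-500)));
    } catch (e) {
      // localStorage is full or unavailable; the data has to be dropped
    }
  }

  checkQueue() {
    // Re-queue anything left in the offline cache once we are back online
    try {
      const failed = JSON.parse(localStorage.getItem('monitor_failed') || '[]');
      if (failed.length > 0 && navigator.onLine) {
        this.queue.push(...failed);
        localStorage.removeItem('monitor_failed');
      }
    } catch (e) {
      // Ignore a corrupt or unavailable cache
    }
    // Timed flush of whatever is queued
    this.flush();
  }
}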
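Because `keepalive` requests are limited to roughly 64 KB of body, a batch of 100 events with long stack traces can be rejected outright. One way around this, sketched below, is to split a batch by serialized size before sending. `chunkBySize` and the 60 KB default (leaving some headroom under the limit) are assumptions for illustration.

```javascript
const MAX_BYTES = 60 * 1024; // headroom under the 64 KB keepalive limit

// Split `items` into arrays whose JSON encoding stays under `maxBytes`.
function chunkBySize(items, maxBytes = MAX_BYTES) {
  const encoder = new TextEncoder();
  const chunks = [];
  let current = [];
  let currentBytes = 2; // the surrounding "[]"
  for (const item of items) {
    // +1 accounts for the comma separating array elements
    const size = encoder.encode(JSON.stringify(item)).length + 1;
    if (current.length > 0 && currentBytes + size > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 2;
    }
    // An item larger than maxBytes still gets a chunk of its own
    current.push(item);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent as its own request, for example by looping over `chunkBySize(batch)` inside the reporter's send step.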

4.2 Degradation and fallback strategy

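// Degradation and fallback strategy
const MonitorStrategy = {
  // Degrade by network condition
  networkCondition: {
    wifi: { batchSize: 100, flushInterval: 5000 },
    cellular: { batchSize: 50, flushInterval: 15000 },
    offline: { batchSize: 0, useLocal: true }  // cache only while offline
  },

  // Degrade by battery level (mobile)
  batteryCondition: {
    low: { disableNonCritical: true },  // stop collecting non-critical metrics on low battery
    normal: { disableNonCritical: false }
  },

  // Sampling rates
  sampling: {
    normal: 1.0,      // collect 100%
    highLoad: 0.5,    // sample 50% during peak hours
    release: 0.1      // after a major release: full collection for 3 days, then sample 10%
  },

  // navigator.connection.type ('wifi' / 'cellular') is only exposed by some
  // Chromium WebViews; effectiveType never returns those values.
  // Default to the wifi profile when the information is missing.
  getNetworkKey() {
    if (!navigator.onLine) return 'offline';
    return navigator.connection?.type === 'cellular' ? 'cellular' : 'wifi';
  },

  // getBattery() returns a Promise, so the strategy is resolved asynchronously
  async getStrategy() {
    const battery = navigator.getBattery
      ? await navigator.getBattery().catch(() => ({ level: 1 }))
      : { level: 1 };

    return {
      network: this.networkCondition[this.getNetworkKey()],
      battery: battery.level < 0.2 ? this.batteryCondition.low : this.batteryCondition.normal,
      // getSamplingRate is not shown here; fall back to the normal rate if it is absent
      sampling: this.getSamplingRate ? this.getSamplingRate() : this.sampling.normal
    };
  }
};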
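Sampling works best when the decision is made once per session rather than per event, so a sampled session keeps all of its metrics and the data stays joinable. A minimal sketch, assuming a string `sessionId` is available: hash the ID to a number in [0, 1) and compare it with the sampling rate. The function names are hypothetical; the hash is 32-bit FNV-1a, chosen only because it is short.

```javascript
// Map a string to a stable number in [0, 1) using 32-bit FNV-1a.
function hashToUnit(str) {
  let h = 2166136261;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) / 4294967296;
}

// The same session always gets the same answer for a given rate.
function isSampled(sessionId, rate) {
  return hashToUnit(sessionId) < rate;
}
```

A reporter could call `isSampled(sessionId, strategy.sampling)` once at startup and skip non-critical collection for sessions that fall outside the sample.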

5. Building the Dashboard

5.1 Core dashboard design

The dashboard does not need to be fancy. What matters is that these core numbers are visible:

┌─────────────────────────────────────────────────────────────────┐
│            WebView Performance Monitoring Dashboard             │
├────────────────┬────────────────┬────────────────┬──────────────┤
│ Anomalies today│ Online now     │ Avg load time  │ Error rate   │
│ 127            │ 3,421          │ 1.8s           │ 0.23%        │
│ ↑ 15%          │ ↓ 8%           │ ↓ 12%          │ ↑ 0.05%      │
├────────────────┴────────────────┴────────────────┴──────────────┤
│                                                                 │
│   FCP distribution           LCP trend                          │
│   ┌─────────────┐            ┌─────────────────────────────┐    │
│   │ Good  65%   │            │    ~~line chart~~           │    │
│   │ Fair  28%   │            │                             │    │
│   │ Poor  7%    │            └─────────────────────────────┘    │
│   └─────────────┘                                               │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│   Device distribution        OS distribution                    │
│   ┌─────────────┐            ┌─────────────────────────────┐    │
│   │ iOS  55%    │            │  iOS 12+    : ████████████  │    │
│   │ Android 45% │            │  Android 10+: ██████████    │    │
│   └─────────────┘            └─────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

5.2 Grafana + Prometheus

If the team has the resources, I recommend this open-source stack:

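# Prometheus config fragment (prometheus.yml)
scrape_configs:
  - job_name: 'webview-monitor'
    metrics_path: '/api/monitor/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['your-monitor-server:8080']

# Key PromQL queries (run these in Grafana or the Prometheus UI; they are not part of the YAML file)
# 1. P95 load time (aggregate the buckets by `le` before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(webview_load_duration_seconds_bucket[5m])))

# 2. Error rate
sum(rate(js_errors_total[5m])) / sum(rate(page_views_total[5m]))

# 3. Share of page views with a good LCP
sum(rate(lcp_good_total[5m])) / sum(rate(lcp_total[5m]))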

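5.3 Lightweight option

Without ops resources, I have also run a minimal setup. Its advantages: very few dependencies, up and running in about five minutes, and the data stays under your own control.
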
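// Backend: Node.js + SQLite; frontend: ECharts.
// Assumes `app` is an Express app and `db.query` resolves to an array of rows.
// Stock SQLite has no percentile() function, so P95 is computed in JS.
const p95 = (sorted) =>
  sorted[Math.max(0, Math.ceil(sorted.length * 95 / 100) - 1)];
const avg = (rows, field) =>
  rows.reduce((sum, r) => sum + r[field], 0) / rows.length;

// Summary endpoint
app.get('/api/monitor/summary', async (req, res) => {
  const rows = await db.query(`
    SELECT DATE(timestamp) as date, load_time, fcp, error_count
    FROM metrics
    WHERE timestamp > date('now', '-7 days')
    ORDER BY date, load_time
  `);

  // Group rows by day; each group stays sorted by load_time
  const byDate = new Map();
  for (const row of rows) {
    if (!byDate.has(row.date)) byDate.set(row.date, []);
    byDate.get(row.date).push(row);
  }

  res.json([...byDate].map(([date, list]) => ({
    date,
    avg_load: avg(list, 'load_time'),
    p95_load: p95(list.map(r => r.load_time)),
    avg_fcp: avg(list, 'fcp'),
    error_rate: avg(list, 'error_count')  // errors per record, as a float
  })));
});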

6. Alerting

6.1 Tiered alerting

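Alert levels
├── P0 Critical - core functionality unavailable
│   ├── WebView crash rate > 1%
│   ├── JS error rate > 5%
│   └── Page completely unusable
│
├── P1 Severe - serious performance degradation
│   ├── FCP P95 > 5s
│   ├── LCP P95 > 8s
│   └── Peak memory > 500MB
│
├── P2 Warning - needs attention
│   ├── FCP P95 > 3s
│   ├── LCP P95 > 5s
│   └── Error rate up 50% versus yesterday
│
└── P3 Notice - trend reminder
    ├── A performance metric has worsened 3 days in a row
    └── Performance metrics dropped in a new release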

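A few lessons learned about setting alert thresholds (treat the fixed numbers above as starting points only):

Do not rely on fixed thresholds: results vary enormously across devices and networks. An LCP of 5 seconds may be normal on a low-end Android phone, while an LCP above 2 seconds on a high-end iPhone already deserves an alert.
Use percentiles, not averages: averages are easily skewed by outliers; P95/P99 reflect real user experience better.
Dynamic baseline: use the same time window over the past 7 days as the baseline, and alert only when the current value exceeds it by a set percentage.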
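The percentile and dynamic-baseline advice can be sketched in a few lines. This assumes you can fetch raw samples for the current window and for the same window over the past 7 days; `percentile` uses the nearest-rank method, and the 20% tolerance in `exceedsBaseline` is an illustrative default rather than a recommended value.

```javascript
// Nearest-rank percentile; p is in (0, 100].
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Alert only when the current P95 is more than `tolerance` above the baseline P95.
function exceedsBaseline(currentSamples, baselineSamples, tolerance = 0.2) {
  const current = percentile(currentSamples, 95);
  const baseline = percentile(baselineSamples, 95);
  if (current === null || baseline === null) return false;
  return current > baseline * (1 + tolerance);
}

// One slow outlier: the mean is 1200, but P95 surfaces the 5000ms experience.
console.log(percentile([100, 200, 300, 400, 5000], 95)); // 5000
```

Because the baseline is built from the same device and network mix as the current window, this avoids hard-coding one number for both low-end Android phones and high-end iPhones.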

6.2 Implementing alerts

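// The data-access helpers used below (getMetricRate, getMetricP95, getMetricMax,
// getMetricTrend, getMetricHistory) and notifyChannel depend on your backend
// and are not shown here.
class AlertManager {
  constructor() {
    this.rules = this.initRules();
    this.cooldown = new Map();  // last alert time per rule, to prevent alert storms
  }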
  
  initRules() {
    return [
      {
        name: 'js_error_rate',
        condition: (current, history) => current > 0.05,
        severity: 'P0',
        window: '5m',
        check: async () => {
          const rate = await this.getMetricRate('js_error', '5m');
          return rate;
        }
      },
      {
        name: 'fcp_p95',
        condition: (current) => current > 5000,
        severity: 'P1',
        window: '5m',
        check: async () => {
          const p95 = await this.getMetricP95('fcp', '5m');
          return p95;
        }
      },
      {
        name: 'memory_peak',
        condition: (current) => current > 500 * 1024 * 1024,
        severity: 'P1',
        window: '5m',
        check: async () => {
          const peak = await this.getMetricMax('memory_peak', '5m');
          return peak;
        }
      },
      {
        name: 'lcp_trend',
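        condition: (current, history) => {
          // Each of the last 3 daily values is more than 20% above the previous one
          if (history.length < 3) return false;
          return history.slice(-3).every((v, i, arr) => 
            i === 0 || v > arr[i-1] * 1.2  // the first value has nothing to compare against
          );
        },
        severity: 'P3',  // trend alerts are P3 in the level table above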
        window: '3d',
        check: async () => {
          const trend = await this.getMetricTrend('lcp', '3d');
          return trend;
        }
      }
    ];
  }
  
  async checkAll() {
    for (const rule of this.rules) {
      await this.checkRule(rule);
    }
  }
  
  async checkRule(rule) {
    // Cooldown check, to prevent alert storms
    const cooldownKey = `${rule.name}_${rule.severity}`;
    if (this.cooldown.has(cooldownKey)) {
      const lastAlert = this.cooldown.get(cooldownKey);
      if (Date.now() - lastAlert < this.getCooldown(rule.severity)) {
        return;
      }
    }
    
    const current = await rule.check();
    const history = await this.getMetricHistory(rule.name, rule.window);
    
    if (rule.condition(current, history)) {
      await this.sendAlert(rule, current);
      this.cooldown.set(cooldownKey, Date.now());
    }
  }
  
  getCooldown(severity) {
    const map = { P0: 300000, P1: 1800000, P2: 3600000, P3: 86400000 };
    return map[severity] || 3600000;
  }
  
  async sendAlert(rule, value) {
    const alertData = {
      name: rule.name,
      severity: rule.severity,
      value,
      timestamp: Date.now(),
      message: this.formatMessage(rule, value)
    };
    
    // Hook into Feishu / DingTalk / WeCom
    await this.notifyChannel(alertData);
  }
  
  formatMessage(rule, value) {
    const templates = {
      'js_error_rate': `🚨 [P0] JS error rate alert\nCurrent error rate: ${(value * 100).toFixed(2)}%\nCheck immediately!`,
      'fcp_p95': `⚠️ [P1] First Contentful Paint alert\nP95: ${(value / 1000).toFixed(2)}s\nAbove the 5s threshold!`,
      'memory_peak': `⚠️ [P1] Peak memory alert\nPeak memory: ${(value / 1024 / 1024).toFixed(2)}MB\nPossible memory leak!`
    };
    return templates[rule.name] || `Alert: ${rule.name} = ${value}`;
  }
}

7. Case Study: Closing the Loop on an LCP Regression

7.1 Discovering the problem

One morning this alert came in:

⚠️ [P2] Largest Contentful Paint alert
P95: 6.8s
Above the 5s threshold!

7.2 Analyzing the data

Open the monitoring dashboard and pull the detailed LCP data:

-- Pull the details of abnormal LCP samples
SELECT 
  lcp_value,
  lcp_element,  -- key: see which element is the LCP element
  url,
  device_model,
  os_version,
  network_type,
  timestamp
FROM webview_metrics
WHERE lcp_value > 5000
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 100;

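Findings:

LCP element	Share	Device distribution
`#banner-img`	67%	Low-end Android
`.product-title`	21%	All devices
`img.hero`	12%	iOS

The problem: the large banner image on the product detail page loads extremely slowly on low-end Android devices.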

7.3 Finding the root cause

Next, analyze the resource loading path:

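-- Compare resource loading for URLs with LCP problems against normal URLs
SELECT 
  url,
  AVG(cdn_download_time) as avg_cdn,
  AVG(origin_download_time) as avg_origin,
  AVG(cdn_hit_rate) as cdn_hit_rate  -- must be aggregated because of GROUP BY url
FROM resource_timing
GROUP BY url
HAVING AVG(cdn_download_time) > 2000;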

At the same time, pull the network monitoring data and look at what the slow resources have in common:

const slowResources = performance.getEntriesByType('resource')
  .filter(r => r.duration > 3000)  // resources that took longer than 3 seconds
  .map(r => ({
    name: r.name,
    type: r.initiatorType,
    duration: r.duration,
    dns: r.domainLookupEnd - r.domainLookupStart,
    tcp: r.connectEnd - r.connectStart,
    ttfb: r.responseStart - r.requestStart,
    transferSize: r.transferSize  // used to tell whether the cache was hit
  }));

// Findings:
// 1. The banner image has transferSize > 500KB and missed the browser cache
// 2. The image URL is not on the CDN domain, so it goes straight to the origin
// 3. Decoding a large image takes 1-2 seconds of CPU on low-end devices by itself

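Root causes:

A CDN misconfiguration: the banner image was not served through the CDN and went straight to the origin server
The image was too large (1.2MB original), so decoding was slow on low-end devices
No preloading and no tuned caching strategy

7.4 Planning the fix

Emergency fix: correct the CDN configuration so the banner image is served through the CDN
Short-term optimization: add <link rel="preload"> to preload the banner
Long-term plan: mark LCP elements with fetchpriority="high"
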
<!-- Preload the critical image -->
<link rel="preload" as="image" href="https://cdn.example.com/banner.jpg" fetchpriority="high">

<!-- Or preload it from JS -->
<script>
  const preloadImage = new Image();
  preloadImage.fetchPriority = 'high';
  preloadImage.src = 'https://cdn.example.com/banner.jpg';
</script>

7.5 Verifying the result

After the fix went live, watch the monitoring data:

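Before: LCP P95 = 6.8s, pass rate = 62%
After: LCP P95 = 2.1s, pass rate = 94%

The dashboard shows a clear improvement in the LCP trend, and the whole cycle from detection to verification took less than 24 hours.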

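8. Summary

To recap the key points of building this monitoring system:

Metric design: cover loading performance, interaction performance and stability, with core metrics such as FCP, LCP, FID, JS error rate and memory
Data collection: two channels, Native + H5, combining the Performance API with custom instrumentation
Reporting strategy: batched reporting + offline cache + network/battery degradation, preserving data completeness while keeping the overhead under control
Dashboards: Node.js + SQLite + ECharts for a lightweight setup, Grafana + Prometheus if you have the resources
Alerting: tiered alerts + cooldown + trend alerts, to avoid alert storms
Closed-loop verification: detect → analyze → find the root cause → ship the fix → verify the result

A monitoring system is never finished; it needs continuous iteration. I suggest a metric review every quarter: check whether new business scenarios need monitoring and whether any metric has lost its meaning and can be retired.

Some lessons learned

Start small: do not aim for full coverage at the outset. Monitor the 2-3 metrics that affect the business most, get the loop working, then expand.
Data quality over quantity: better to collect 100 high-quality data points than 10,000 noisy ones.
Keep alerts converged: receiving 100 alerts is the same as receiving none. Be willing to delete rules that do not produce action.
Tie it to business metrics: what ultimately matters is the relationship between page load time and user retention, not technical numbers on their own.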