Chaos Engineering with Chaos Mesh / LitmusChaos: Validating ABP's Resilience Strategy

📚 Contents

- 0. TL;DR 🚀
- 1. Overall Architecture 🧱
- 2. Tool Selection and Principles 🧭
- 3. Environment and Installation (pinned versions for reproducibility) 🔧
- 4. Chaos Scenarios and YAML 🧪
  - 4.1 Network Jitter (delay + loss, Chaos Mesh / netem)
  - 4.1-b Bandwidth Limiting (separate experiment, Chaos Mesh / bandwidth)
  - 4.2 Dependency Timeout (Chaos Mesh / HTTPChaos)
  - 4.3 Disk Full (LitmusChaos / `disk-fill`)
- 5. Automatic Recovery and Fault Tolerance (defense in depth) 🛡️
  - 5.1 Kubernetes Probes (the foundation of self-healing)
  - 5.2 Istio (centralized timeouts/retries/circuit breaking)
  - 5.3 Resilience on the .NET / ABP Side (Polly v8 pipelines)
- 6. Observability and SLOs (Prometheus + Grafana + Sloth) 📊
  - 6.1 Exposing and Scraping Metrics
  - 6.2 Common SLI Queries in PromQL (using OTel naming)
  - 6.3 Generating SLOs with Sloth (multi-window, multi-burn-rate)
- 7. Experiment Flow (sequence & timeline) 🕒
- 8. Running the Experiment and SLO Acceptance ✅
- 9. Performance and High-Availability Checklist 🧰
- 10. Post-Incident Review Template 📝

0. TL;DR 🚀

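Injection:
- Chaos Mesh: NetworkChaos (netem) combines delay, loss, reordering, and corruption; bandwidth limiting uses the separate bandwidth action. HTTPChaos supports delay, abort, replace, and patch, which can be combined. ⚠️ HTTPS is not supported by default and requires TLS mode; requests sent over already-established keep-alive connections are not affected.
- LitmusChaos: disk-fill fills a Pod's ephemeral-storage to exercise the eviction and self-healing path.
Recovery: Kubernetes probes (liveness/readiness/startup) + Istio (timeouts/retries/circuit breaking) + .NET Resilience (Polly v8 pipelines).
Measurement: Prometheus + Grafana; Sloth generates multi-window, multi-burn-rate SLO rules for SRE-style acceptance.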

1. Overall Architecture 🧱

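[Diagram: architecture overview. Traffic path: users/load-testing tool, Internet, Ingress / Gateway, Istio, ABP API, ABP application/domain services, PostgreSQL, Redis, RabbitMQ. Chaos group: Chaos Mesh (CRD injection), LitmusChaos (disk-fill). Observability group: Prometheus (/metrics), Grafana, Alertmanager.]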

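Goals

SLO validation: for example Error Ratio ≤ 1% and P95 ≤ 200 ms.
Recovery tuning: probes, retries, circuit breaking, and rate limiting to reduce MTTD/MTTR.
Reproducibility: scripts, pinned versions, and dashboards so a run can be repeated and compared.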

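2. Tool Selection and Principles 🧭

Chaos Mesh
- NetworkChaos (netem): delay/loss/reorder/corrupt can be combined; for rate limiting use action: bandwidth (mutually exclusive with netem, so two separate experiments are clearer).
- HTTPChaos: supports abort/delay/replace/patch. HTTPS is not supported by default; to affect TLS connections, enable TLS mode and configure certificates. Requests on already-established (keep-alive) TCP connections are not affected.
LitmusChaos
- disk-fill fills a Pod's ephemeral-storage by percentage or by MiB and can trigger eviction, which makes it the closest match to a "disk full" scenario; a minimal RBAC manifest is available for it.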

3. Environment and Installation (pinned versions for reproducibility) 🔧

# (A) Prometheus / Grafana / Alertmanager (kube-prometheus-stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade -i mon prometheus-community/kube-prometheus-stack \
  --version <chart_version> -n monitoring --create-namespace

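# (B) Chaos Mesh (set the runtime and socket path to match your container runtime;
#     the socket path below is the common containerd location and may differ on your nodes)
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm upgrade -i chaos-mesh chaos-mesh/chaos-mesh \
  --version <chart_version> -n chaos-mesh --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock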

# (C) LitmusChaos Operator (pin a specific version rather than latest)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-<app_version>.yaml

ABP/.NET side: HttpClientFactory named clients + health check endpoints /health/ready, /health/live, /health/startup (for probes and monitoring) ✅

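4. Chaos Scenarios and YAML 🧪

⚠️ Safety boundary: for a first drill, use mode: fixed or fixed-percent and limit the blast radius to non-critical replicas; give every experiment an explicit duration; keep kubectl delete -f and a rollback runbook ready. For stateful components (databases, message queues), reproduce the experiment in a shadow environment first.

4.1 Network Jitter (delay + loss, Chaos Mesh / netem)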

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: abp-api-netem
  namespace: default
spec:
  action: netem
  mode: fixed-percent
  value: "50"                       # affects only 50% of the replicas
  selector:
    namespaces: ["default"]
    labelSelectors:
      app: abp-api
  delay:
    latency: "100ms"
    correlation: "0"
    jitter: "10ms"
  loss:
    loss: "10"
    correlation: "0"
  duration: "5m"

4.1-b Bandwidth Limiting (separate experiment, Chaos Mesh / bandwidth)

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: abp-api-bandwidth
  namespace: default
spec:
  action: bandwidth
  mode: fixed-percent
  value: "50"
  selector:
    namespaces: ["default"]
    labelSelectors:
      app: abp-api
  bandwidth:
    rate: "10mbps"                  # tc units: bps means bytes/s, so this is 10 MB/s; use "10mbit" for 10 Mbit/s
    limit: 10000
    buffer: 10000
  duration: "5m"

4.2 Dependency Timeout (Chaos Mesh / HTTPChaos)

apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: abp-http-delay
  namespace: default
spec:
  mode: fixed
  value: "1"
  selector:
    labelSelectors:
      app: abp-api
  target: Request
  port: 80
  method: GET
  path: /api/external/*    # wildcards are supported
  delay: "500ms"           # quoted to avoid parsing ambiguity
  duration: "5m"

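⚠️ Notes:

HTTPS injection is not supported by default; to affect TLS traffic, enable spec.tls.* and configure certificates.

Requests over reused keep-alive connections are not affected; on the load-testing side, temporarily disable keep-alive or force new connections so that the injection is visible.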
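One way to confirm that the 500 ms delay is actually visible from the client side is to time requests that each open a fresh connection and count how many were slowed down. The following is a minimal sketch, assuming `curl` and `awk` are available; the URL in the usage comment and the 0.5 s threshold are placeholders chosen for illustration.

```bash
# Time one GET and print the total seconds taken.
# Each curl process opens its own TCP connection, so no connection
# is reused between calls.
probe_once() {   # usage: probe_once <url>
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Read one latency (in seconds) per line on stdin and print how many
# are at or above the threshold given as $1.
count_slow() {   # usage: ... | count_slow <threshold_seconds>
  awk -v t="$1" '$1 + 0 >= t + 0 { n++ } END { print n + 0 }'
}

# Usage (hypothetical URL): 20 probes, count those that took >= 0.5 s
# for i in $(seq 20); do
#   probe_once "http://abp-api.default.svc.cluster.local/api/external/ping"
# done | count_slow 0.5
```

If the count stays at zero while the experiment is running, the selector, port, or path in the HTTPChaos spec is a likely place to look.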

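4.3 Disk Full (LitmusChaos / `disk-fill`)

Prerequisite: set ephemeral-storage requests/limits on the target workload (to verify the "limit exceeded → eviction → self-healing" path).

# Workload fragment: set an ephemeral-storage limit on the container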
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
    ephemeral-storage: "1Gi"
  limits:
    cpu: "500m"
    memory: "512Mi"
    ephemeral-storage: "2Gi"

Minimal RBAC (recommended):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: disk-fill-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: disk-fill-sa
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods","events","configmaps","pods/log","pods/exec"]
    verbs: ["create","delete","get","list","patch","update","deletecollection","watch"]
  - apiGroups: ["apps"]
    resources: ["deployments","statefulsets","replicasets","daemonsets"]
    verbs: ["list","get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines","chaosexperiments","chaosresults"]
    verbs: ["create","list","get","patch","update","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: disk-fill-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: disk-fill-sa
subjects:
- kind: ServiceAccount
  name: disk-fill-sa
  namespace: default

ChaosEngine (percentage variant; choose either percentage or MiB, and do not configure both):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: abp-diskfill
  namespace: default
spec:
  engineState: "active"     # run the experiment as soon as the engine is applied
  appinfo:
    appns: default
    applabel: "app=abp-background"
    appkind: deployment
  chaosServiceAccount: disk-fill-sa
  experiments:
  - name: disk-fill
    spec:
      components:
        env:
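        - name: FILL_PERCENTAGE
          value: "85"    # percent of the ephemeral-storage limit; per the Litmus docs, a value above 100 is what forces eviction
        - name: TOTAL_CHAOS_DURATION
          value: "600"   # seconds
        # Optional: name the target container explicitly; by default the Pod's first container is used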
        # - name: TARGET_CONTAINER
        #   value: abp-background

5. Automatic Recovery and Fault Tolerance (defense in depth) 🛡️

5.1 Kubernetes Probes (the foundation of self-healing)

livenessProbe:
  httpGet: { path: /health/live, port: 80 }
  initialDelaySeconds: 30
  timeoutSeconds: 3
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/ready, port: 80 }
  initialDelaySeconds: 10
  timeoutSeconds: 2
  periodSeconds: 5
startupProbe:
  httpGet: { path: /health/startup, port: 80 }
  failureThreshold: 30
  periodSeconds: 5

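✅ Slow-starting services should use a startupProbe: liveness and readiness checks are held off until it succeeds, which avoids the container being killed right after it starts.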
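The probe settings above translate into time budgets that are worth writing down before an experiment. A rough sketch, assuming the budget is simply failureThreshold × periodSeconds (it ignores initialDelaySeconds and timeoutSeconds) and that livenessProbe uses the Kubernetes default failureThreshold of 3, since the YAML does not set one:

```bash
# Seconds of consecutive probe failures tolerated before Kubernetes
# acts: failureThreshold × periodSeconds.
probe_budget_seconds() {   # usage: probe_budget_seconds <failureThreshold> <periodSeconds>
  echo $(( $1 * $2 ))
}

# startupProbe above:  probe_budget_seconds 30 5   -> 150
# livenessProbe above: probe_budget_seconds 3 10   -> 30
```

So the application gets about 150 s to start, and once it is running, roughly 30 s of failed liveness checks lead to a restart.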

5.2 Istio (centralized timeouts/retries/circuit breaking)

VirtualService (timeout + retries + perTryTimeout):

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: abp-api
  namespace: default
spec:
  hosts: ["abp-api.default.svc.cluster.local"]
  http:
  - route:
    - destination:
        host: abp-api.default.svc.cluster.local
        subset: v1
    timeout: "2s"
    retries:
      attempts: 3
      perTryTimeout: "600ms"
      retryOn: gateway-error,connect-failure,refused-stream,reset,5xx

DestinationRule (subsets + outlier detection):

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: abp-api
  namespace: default
spec:
  host: abp-api.default.svc.cluster.local
  subsets:
  - name: v1
    labels: { version: v1 }
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: "5s"
      baseEjectionTime: "30s"
      maxEjectionPercent: 50

💡 Avoid stacked retries: retry either in the mesh or in the client, or coordinate a shared budget (perTryTimeout × attempts).
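To make the cost of stacked retries concrete, the budget can be worked out with a little shell arithmetic. This sketch assumes that Istio's attempts counts retries on top of the initial try, and that the client also retries 3 times; both numbers are inputs to adjust, and backoff between retries is ignored.

```bash
# Upstream attempts that one user request can fan out to when the mesh
# and the client both retry: each layer makes 1 initial try plus retries.
total_attempts() {      # usage: total_attempts <mesh_retries> <client_retries>
  echo $(( (1 + $1) * (1 + $2) ))
}

# Worst-case time (ms) the mesh spends on one request: every try runs
# into perTryTimeout, capped by the route-level timeout.
mesh_worst_case_ms() {  # usage: mesh_worst_case_ms <retries> <per_try_ms> <timeout_ms>
  local uncapped=$(( (1 + $1) * $2 ))
  if [ "$uncapped" -gt "$3" ]; then echo "$3"; else echo "$uncapped"; fi
}

# With the VirtualService above and an assumed 3 client retries:
# total_attempts 3 3              -> 16
# mesh_worst_case_ms 3 600 2000   -> 2000
```

Under these assumptions a single failing call can hit the dependency up to 16 times, and the 2 s route timeout cuts in before the last 600 ms try can finish, which is why the two layers need a shared budget.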

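5.3 Resilience on the .NET / ABP Side (Polly v8 pipelines)

// Program.cs
// dotnet add package Microsoft.Extensions.Http.Resilience
// `services` and `configuration` are the host's IServiceCollection and IConfiguration.
services.AddHttpClient("ExternalApi", c =>
{
    // Fail with a clear message if the base URL is missing from configuration.
    var baseUrl = configuration["ExternalApi:BaseUrl"]
        ?? throw new InvalidOperationException("ExternalApi:BaseUrl is not configured.");
    c.BaseAddress = new Uri(baseUrl);
})
.AddStandardResilienceHandler(); // built-in timeout/retry/circuit-breaker/rate-limiter strategies (based on Polly v8)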

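6. Observability and SLOs (Prometheus + Grafana + Sloth) 📊

6.1 Exposing and Scraping Metrics

ASP.NET Core (OTel Prometheus Exporter): UseOpenTelemetryPrometheusScrapingEndpoint() exposes /metrics on the application port, and Prometheus scrapes it directly.
If the exporter listens on a separate port such as 9464, point the Service/ServiceMonitor at that port instead (otherwise scraping the application port is the recommended setup).

Service + ServiceMonitor (scraping /metrics on the application port):

# Service (port matches the container port)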
apiVersion: v1
kind: Service
metadata:
  name: abp-api
  namespace: default
  labels:
    app: abp-api
    release: mon
spec:
  selector: { app: abp-api }
  ports:
  - name: http
    port: 80
    targetPort: 80
---
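# ServiceMonitor (Prometheus Operator CRD)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: abp-api
  namespace: monitoring
  labels:
    release: mon            # kube-prometheus-stack selects ServiceMonitors by its release label by default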
spec:
  selector:
    matchLabels:
      app: abp-api
      release: mon
  namespaceSelector:
    matchNames: ["default"]
  endpoints:
  - port: http
    path: /metrics
    interval: "15s"

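6.2 Common SLI Queries in PromQL (using OTel naming)

http.server.request.duration → http_server_request_duration_seconds_{bucket,count,sum} (unit: seconds). Label names can differ between exporters, so check the actual label dimensions in Prometheus before going live.

# (1) Success rate (5m)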
1 - (
  sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m])) /
  sum(rate(http_server_request_duration_seconds_count[5m]))
)

# (2) P95 latency (5m, histogram)
histogram_quantile(
  0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[5m]))
)
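For scripted acceptance it can help to run the same queries from the command line instead of reading them off a dashboard. A minimal sketch against the Prometheus HTTP API follows, assuming `curl` and `jq` are installed and that Prometheus is reachable at the address in PROM_URL (for example through a port-forward); the function names are illustrative.

```bash
PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumed address, override as needed

# Build the P95 query from section (2) for a given rate window.
p95_query() {      # usage: p95_query <window>, e.g. 5m
  printf 'histogram_quantile(0.95, sum by (le) (rate(http_server_request_duration_seconds_bucket[%s])))\n' "$1"
}

# Run an instant query and print the value of the first result series.
# Prints "null" when the query returns no series.
prom_instant() {   # usage: prom_instant '<promql>'
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1" |
    jq -r '.data.result[0].value[1]'
}

# Usage: prom_instant "$(p95_query 5m)"
```

The result is in seconds, so a P95 target of 200 ms corresponds to a printed value of at most 0.2.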

6.3 Generating SLOs with Sloth (multi-window, multi-burn-rate)

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: abp-api-slo
  namespace: monitoring
spec:
  service: "abp-api"
  slos:
    - name: "availability"
      objective: 99
      sli:
        events:
          errorQuery: |
            sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[{{.window}}]))
          totalQuery: |
            sum(rate(http_server_request_duration_seconds_count[{{.window}}]))
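    - name: "latency-200ms"       # "<=" dropped so the name stays a plain identifier
      objective: 95               # 95% of requests within 200 ms, i.e. P95 ≤ 200 ms
      sli:
        events:
          # le="0.2" only matches if the histogram has an explicit 0.2 s bucket boundary;
          # check the exported buckets and use an existing boundary if it does not.
          errorQuery: |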
            sum(rate(http_server_request_duration_seconds_count[{{.window}}]))
            - sum(rate(http_server_request_duration_seconds_bucket{le="0.2"}[{{.window}}]))
          totalQuery: |
            sum(rate(http_server_request_duration_seconds_count[{{.window}}]))
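The "multi-window, multi-burn-rate" idea behind the generated rules comes down to simple arithmetic: the burn rate is the observed error ratio divided by the error budget (1 minus the objective), and an alert fires only when both a short and a long window exceed the same threshold. The sketch below works through that calculation, assuming `awk` is available; the 14.4 threshold for the 5m/1h pair is the value commonly used for a 30-day SLO period and is an assumption here, so check it against the rules Sloth actually generates.

```bash
# Burn rate = observed error ratio / error budget, where the error
# budget is 1 - objective. A burn rate of 1 spends the budget exactly
# over the SLO period.
burn_rate() {            # usage: burn_rate <error_ratio> <objective_percent>
  awk -v e="$1" -v o="$2" 'BEGIN { printf "%.2f\n", e / (1 - o / 100) }'
}

# Succeed (exit 0) only when BOTH windows burn faster than the
# threshold, which is the multi-window condition for firing.
both_windows_exceed() {  # usage: both_windows_exceed <short_ratio> <long_ratio> <objective_percent> <threshold>
  awk -v s="$1" -v l="$2" -v o="$3" -v t="$4" 'BEGIN {
    budget = 1 - o / 100
    exit !((s / budget > t) && (l / budget > t))
  }'
}

# burn_rate 0.05 99                      -> 5.00
# both_windows_exceed 0.20 0.16 99 14.4  -> exit 0 (both windows too hot)
# both_windows_exceed 0.20 0.05 99 14.4  -> exit 1 (long window still fine)
```

Requiring both windows keeps a short spike during injection from failing the run by itself, while a sustained burn still does.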

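7. Experiment Flow (sequence & timeline) 🕒

Sequence diagram:
[Sequence diagram. Participants: Engineer/SRE, Chaos Mesh/Litmus, Kubernetes, ABP application, Monitoring/SLO. Messages: select an experiment and apply the YAML; create the CRD / Job; inject the fault (netem / bandwidth / http-delay / disk-fill); expose metrics/logs/traces; dashboards & alerts (burn rate/thresholds); roll back/end the experiment; self-healing (probes/retries/circuit breaking). Note: HTTPChaos does not inject into HTTPS by default and TLS mode is configured separately; reused keep-alive connections are not affected.]

Timeline:
[Gantt chart: "Chaos experiment timeline (example)", 10:00 to 10:45 in 5-minute steps. Phases: prepare, inject, observe, recover, review. Tasks: baseline load test and dashboards; inject network netem; SLO acceptance/burn-rate check; roll back & verify self-healing; export charts & record.]

Resilience layers:
[Layer diagram: application layer (.NET Resilience/Polly); service mesh (Istio timeout/retry/CB); Kubernetes self-healing (liveness/readiness/startup); infrastructure (resource quotas/limits/node stability).]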

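8. Running the Experiment and SLO Acceptance ✅

Baseline load test: establish a baseline of 300 to 500 RPS and record the success rate and P95.
Inject: kubectl apply -f chaos-mesh/abp-api-netem.yaml (duration "5m").
Observe: check whether Error Ratio and P95 on the SLO dashboard stay within target; check Istio retries and ejections.
Automatic recovery: when the experiment ends, readiness returns to ready; slow starts are covered by the startupProbe.
Acceptance: if the Sloth multi-window burn rate (short window 5m / long window 1h) exceeds its threshold → fail; otherwise pass.
Rollback: kubectl delete -f chaos-mesh/abp-api-netem.yaml

Reuse the same acceptance playbook for the HTTP delay and disk-fill experiments so the results can be compared side by side.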
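The inject and rollback steps can be wrapped in a small helper so that the delete is not forgotten. This is a sketch rather than a full runner: the function name and the KUBECTL override are made up for this example, and it only relies on `kubectl apply -f` and `kubectl delete -f --ignore-not-found`.

```bash
KUBECTL="${KUBECTL:-kubectl}"   # override for dry runs, e.g. KUBECTL=echo

# Apply a chaos manifest, hold for the given number of seconds, then
# delete it. The delete is attempted even when the wait fails; signals
# such as Ctrl-C are not handled in this sketch.
run_experiment() {   # usage: run_experiment <manifest.yaml> <hold_seconds>
  local manifest="$1" hold="$2" status=0
  "$KUBECTL" apply -f "$manifest" || return 1
  sleep "$hold" || status=$?
  "$KUBECTL" delete -f "$manifest" --ignore-not-found || status=$?
  return "$status"
}

# Usage: run_experiment chaos-mesh/abp-api-netem.yaml 300
```

Setting KUBECTL=echo prints the commands instead of running them, which is a cheap way to rehearse the playbook before touching a cluster.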

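9. Performance and High-Availability Checklist 🧰

Avoid stacked retries: retry either in the mesh or in the client, or coordinate a shared budget (perTryTimeout × attempts).
Circuit-breaking thresholds: set consecutive5xxErrors/interval/baseEjectionTime/maxEjectionPercent so that healthy instances are not ejected too aggressively.
Probe gates: use a startupProbe for slow starts; the readinessProbe decides whether a Pod receives traffic.
Disk experiments: always set ephemeral-storage limits; record eviction counts and MTTR on the SLO dashboard.
HTTPChaos details: HTTPS is not supported by default; enable TLS mode and configure certificates when needed; make sure keep-alive connection reuse does not mask the injection.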

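10. Post-Incident Review Template 📝

A. Basic information: experiment name / time window / affected services / change ID
B. Objectives and SLOs: target thresholds; record of multi-window, multi-burn-rate alerts
C. Observations: throughput, latency distribution, error codes, retry/circuit-breaker counts, slowest call path
D. Root cause: code / configuration / infrastructure / dependencies
E. Action items (owner/deadline): retry/timeout/circuit-breaker thresholds, probe thresholds, rate limiting/caching, scaling up/out, slow-query optimization, adding regression chaos tests to CI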
