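A Complete Approach to Full-Stack Grafana IaC with the Terraform Grafana Provider

This guide walks through managing an entire Grafana stack as code with the Terraform Grafana Provider, covering the implementation details from architecture design to production rollout.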


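1. Architecture Overview and Core Design Principles

1.1 Why Terraform

Grafana officially supports several as-code tools (Terraform, Ansible, the Grafana Operator, Crossplane). Of these, the Terraform provider has the broadest resource coverage: it supports dashboards, data sources, alerting, SLOs, Synthetic Monitoring, IAM, and nearly every other Grafana resource.

When it fits

  • You already manage cloud resources (AWS/GCP/Azure/K8s) through a Terraform workflow
  • You need to manage dashboards, alerts, SLOs, data sources, and permissions together
  • You have strict consistency requirements across environments (dev/staging/prod)
  • The team already knows HCL

1.2 Architecture Layers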

┌─────────────────────────────────────────────────────────────┐
│                      Git Repository                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────────┐  │
│  │dashboards│ │ datasources│ │ alerting │ │ iam/teams    │  │
│  │  (.json) │ │  (.tf)   │ │  (.tf)   │ │  (.tf)       │  │
│  └──────────┘ └──────────┘ └──────────┘ └────────────────┘  │
└────────────────────┬──────────────────────────────────────────┘
                     │ PR Review / CI Validation
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              CI/CD Pipeline (GitHub Actions/GitLab CI)      │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐ │
│  │ terraform fmt│  │ terraform plan│  │ terraform apply   │ │
│  │  validate    │  │  (review req) │  │  (auto/staging)   │ │
│  └──────────────┘  └──────────────┘  └────────────────────┘ │
└────────────────────┬──────────────────────────────────────────┘
                     │ State Backend (S3 + DynamoDB / Terraform Cloud)
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              Grafana Instance(s)                              │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐  │
│  │  OSS    │ │  Cloud  │ │  AWS    │ │  Multi-tenant   │  │
│  │  Self   │ │  Stack  │ │  Managed│ │  (prod/staging) │  │
│  │  Hosted │ │         │ │  Grafana│ │                 │  │
│  └─────────┘ └─────────┘ └─────────┘ └─────────────────┘  │
└─────────────────────────────────────────────────────────────┘

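2. Provider Configuration and Authentication

2.1 Basic Provider Configuration

The examples in this guide pin the Terraform Grafana Provider to the 2.x series (~> 2.0), which supports both Grafana OSS and Grafana Cloud. Newer major versions have since been released, so check the Terraform Registry and pin to the major version you have actually tested.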

# versions.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 2.0"  # pin to one major version; check the Registry for newer majors
    }
  }
}

# provider.tf
provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_auth  # a Service Account Token is recommended
}

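Authentication options, in order of preference

  1. Service Account Token (recommended for production): create a Service Account in Grafana, assign it the Viewer, Editor, or Admin role, then generate a token
  2. API Key (being phased out in favor of Service Accounts)
  3. Basic Auth (admin:password), for initial bootstrap or local testing only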
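
Whichever option is used, the credential should not appear in the repository. A minimal sketch of the two input variables that the provider block above reads, assuming they live in a variables.tf file (the file name and descriptions are illustrative):

```hcl
# variables.tf (illustrative)
variable "grafana_url" {
  type        = string
  description = "Base URL of the Grafana instance, e.g. https://grafana.example.com/"
}

variable "grafana_auth" {
  type        = string
  sensitive   = true # redacts the value in plan and apply output
  description = "Service Account token, or user:password when using basic auth"
}
```

The token can then be supplied at run time through the TF_VAR_grafana_auth environment variable instead of a committed .tfvars file.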

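2.2 Managing Multiple Instances (Provider Aliases)

To manage several Grafana environments from one configuration (for example, a production Grafana Cloud stack plus a dev OSS instance):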

provider "grafana" {
  alias = "production"
  url   = "https://my-stack.grafana.net/"
  auth  = var.grafana_prod_token
}

provider "grafana" {
  alias = "staging"
  url   = "https://staging.grafana.local/"
  auth  = var.grafana_staging_token
}

# Usage example
resource "grafana_folder" "prod_infra" {
  provider = grafana.production
  title    = "Infrastructure"
}

resource "grafana_folder" "staging_infra" {
  provider = grafana.staging
  title    = "Infrastructure"
}

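2.3 Grafana Cloud Configuration

Managing Grafana Cloud resources such as stacks and Synthetic Monitoring requires credentials beyond the instance token: a Cloud Access Policy Token for the Cloud API, and a separate access token for Synthetic Monitoring.

provider "grafana" {
  alias = "cloud"

  # Cloud API (stacks, access policies): Cloud Access Policy Token
  cloud_access_policy_token = var.grafana_cloud_api_key

  # Synthetic Monitoring only
  sm_access_token = var.grafana_sm_token
}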

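3. Managing Dashboards in Depth

Dashboards are the most complex resource type in Grafana. Terraform takes the complete dashboard model JSON through the config_json attribute.

3.1 Directory Layout and File Organization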

grafana-terraform/
├── modules/
│   └── dashboard-stack/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── dashboards/
│   ├── platform/
│   │   ├── cluster-overview.json
│   │   └── node-exporter.json
│   ├── application/
│   │   ├── api-gateway.json
│   │   └── payment-service.json
│   └── templates/
│       └── service-overview.json.tpl
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       └── terraform.tfvars
└── global/
    ├── folders.tf
    ├── datasources.tf
    └── permissions.tf

3.2 Bulk-Loading Dashboard JSON

Combine for_each with fileset to manage dashboards in bulk instead of writing one resource block per dashboard:

# dashboards.tf
locals {
  dashboard_folders = {
    "platform"    = grafana_folder.platform.id
    "application" = grafana_folder.application.id
  }
}

resource "grafana_dashboard" "all" {
  # One instance per JSON file; the first path segment selects the folder.
  # Files in directories without a folder mapping (e.g. templates/) are skipped.
  for_each = {
    for rel in fileset("${path.module}/dashboards", "*/*.json") :
    "${dirname(rel)}-${trimsuffix(basename(rel), ".json")}" => {
      folder = local.dashboard_folders[dirname(rel)]
      path   = "${path.module}/dashboards/${rel}"
    }
    if contains(keys(local.dashboard_folders), dirname(rel))
  }

  folder      = each.value.folder
  config_json = file(each.value.path)
  overwrite   = true
}
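
With many dashboards behind a single for_each, it can help to expose where each one ended up. A small sketch, assuming the grafana_dashboard.all resource above and the url attribute that the resource exports (the output name is illustrative):

```hcl
output "dashboard_urls" {
  description = "Map of for_each key to the deployed dashboard URL"
  value       = { for key, dashboard in grafana_dashboard.all : key => dashboard.url }
}
```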

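3.3 Preparing Dashboard JSON

JSON exported from the Grafana UI needs to be cleaned up before Terraform can manage it:

# Cleanup: drop id and version; uid is left untouched
jq 'del(.id, .version)' exported.json > clean.json

Key fields

  • id: must be removed; Grafana assigns it automatically
  • version: must be removed to avoid version conflicts
  • uid: must be kept and held stable; it uniquely identifies the dashboard and is what updates are matched on
  • datasource.uid: prefer referencing the Terraform data source resource over hard-coding the value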
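
The same cleanup can also be expressed inside Terraform, so that a raw export never reaches Grafana with a stale id or version. This is a sketch under assumptions, not a required step: the file path and resource name are hypothetical, and it relies only on the built-in jsondecode, jsonencode, and contains functions.

```hcl
# Hypothetical file path and resource names, for illustration only.
locals {
  exported = jsondecode(file("${path.module}/dashboards/platform/cluster-overview.json"))

  # Drop the instance-specific fields; everything else (including uid) is kept.
  cleaned = {
    for key, value in local.exported : key => value
    if !contains(["id", "version"], key)
  }
}

resource "grafana_dashboard" "cluster_overview" {
  folder      = grafana_folder.platform.id
  config_json = jsonencode(local.cleaned)
}
```

Cleaning the files once with jq keeps the repository tidy; the in-Terraform variant is mainly a safety net for exports that slip through review.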

3.4 Parameterizing with templatefile

For dashboards that share a structure but differ in metrics (such as a uniform view per microservice), use a Terraform template:

# dashboards/templates/service-overview.json.tpl
{
  "title": "${service_name} Overview",
  "uid": "svc-${service_name}",
  "panels": [
    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "rate(http_requests_total{service=\"${service_name}\"}[$__rate_interval])"
        }
      ]
    }
  ]
}

# main.tf
resource "grafana_dashboard" "services" {
  for_each = toset(["api-gateway", "web-frontend", "worker", "billing"])

  folder      = grafana_folder.application.id
  config_json = templatefile("${path.module}/dashboards/templates/service-overview.json.tpl", {
    service_name = each.key
  })
}

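3.5 A Hybrid Grafonnet + Terraform Workflow

Hand-written JSON is hard to maintain for complex dashboards. A better split is to generate the JSON with Grafonnet (Jsonnet) and let Terraform deploy it:

# Workflow
dashboards/*.jsonnet --[jsonnet]--> output/*.json --[terraform]--> Grafana

Jsonnet example

// dashboards/cluster-overview.jsonnet
// Written against the current Grafonnet library (vendored by jsonnet-bundler).
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

g.dashboard.new('Kubernetes Cluster Overview')
+ g.dashboard.withUid('k8s-cluster-overview')
+ g.dashboard.withTimezone('utc')
+ g.dashboard.withPanels([
  g.panel.timeSeries.new('CPU Usage')
  + g.panel.timeSeries.queryOptions.withTargets([
    g.query.prometheus.new(
      'prometheus-main',  // data source UID
      'sum(rate(container_cpu_usage_seconds_total[$__rate_interval])) by (namespace)',
    ),
  ])
  + g.panel.timeSeries.gridPos.withX(0)
  + g.panel.timeSeries.gridPos.withY(0)
  + g.panel.timeSeries.gridPos.withW(12)
  + g.panel.timeSeries.gridPos.withH(8),
])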

CI integration

# .github/workflows/dashboards.yml
- name: Generate Dashboards
  run: |
    jb install  # install dependencies with jsonnet-bundler
    mkdir -p output
    for f in dashboards/*.jsonnet; do
      jsonnet -J vendor "$f" > "output/$(basename "$f" .jsonnet).json"
    done

- name: Validate & Deploy
  run: |
    terraform init
    terraform plan
    terraform apply -auto-approve

4. Data Sources and Folders

4.1 Configuring Data Sources

The grafana_data_source resource covers Prometheus, Elasticsearch, CloudWatch, Jaeger, Loki, Tempo, and dozens of other data source types.

# datasources.tf
resource "grafana_data_source" "prometheus" {
  type       = "prometheus"
  name       = "Prometheus"
  uid        = "prometheus-main"  # fixed UID, referenced from dashboards
  url        = "http://prometheus.monitoring.svc:9090"
  is_default = true

  json_data_encoded = jsonencode({
    httpMethod    = "POST"
    manageAlerts  = true
    prometheusType = "Prometheus"
    prometheusVersion = "2.40.0"
  })
}

resource "grafana_data_source" "cloudwatch" {
  type = "cloudwatch"
  name = "AWS CloudWatch"
  uid  = "cloudwatch-main"

  json_data_encoded = jsonencode({
    defaultRegion = "us-east-1"
    authType      = "default"  # AWS SDK default credential chain (e.g. an EC2 IAM role)
  })
}

resource "grafana_data_source" "elasticsearch" {
  type          = "elasticsearch"
  name          = "Application Logs"
  uid           = "es-logs"
  url           = "https://es.example.com:9200"
  database_name = "[logs-]YYYY.MM.DD"

  json_data_encoded = jsonencode({
    esVersion                  = "8.0.0"
    timeField                  = "@timestamp"
    maxConcurrentShardRequests = 256
    logMessageField            = "message"
    logLevelField              = "level"
  })
}

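Key points

  • Always set uid explicitly, and reference it from dashboards as ${grafana_data_source.prometheus.uid}
  • Use json_data_encoded rather than the legacy json_data block to avoid provider version compatibility problems
  • Data sources that authenticate with AWS SigV4 (common on Amazon Managed Grafana) also need the SigV4 settings, such as sigV4Auth, in json_data_encoded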
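
As an illustration of the SigV4 point, here is a sketch of a Prometheus-type data source pointed at an Amazon Managed Service for Prometheus workspace. The workspace URL, region, resource name, and UID are placeholders, and the sigV4* keys are the jsonData names used in Grafana's data source provisioning; verify them against the Grafana version in use.

```hcl
# Sketch only: the workspace URL and region are placeholders.
resource "grafana_data_source" "amp" {
  type = "prometheus"
  name = "Amazon Managed Prometheus"
  uid  = "amp-main"
  url  = "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/"

  json_data_encoded = jsonencode({
    httpMethod    = "POST"
    sigV4Auth     = true
    sigV4AuthType = "default" # use the default AWS credential chain
    sigV4Region   = "us-east-1"
  })
}
```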

4.2 Folders and Permissions

# folders.tf
resource "grafana_folder" "platform" {
  title = "Platform Engineering"
  uid   = "platform"
}

resource "grafana_folder" "application" {
  title = "Application Teams"
  uid   = "application"
}

# permissions.tf: folder-level permissions
resource "grafana_folder_permission" "platform" {
  folder_uid = grafana_folder.platform.uid

  permissions {
    role       = "Viewer"
    permission = "View"
  }
  permissions {
    team_id    = grafana_team.sre.id
    permission = "Edit"
  }
  permissions {
    team_id    = grafana_team.platform.id
    permission = "Admin"
  }
}

# Fine-grained, dashboard-level permissions
resource "grafana_dashboard_permission" "sensitive" {
  dashboard_uid = grafana_dashboard.security_overview.uid

  permissions {
    team_id    = grafana_team.security.id
    permission = "View"
  }
}

5. Alerting as Code

Grafana Alerting is the most involved part of a Terraform-managed setup. It is built from five resource types: contact points, notification policies, alert rules, mute timings, and message templates.

5.1 Contact Points

# alerting/contact-points.tf
resource "grafana_contact_point" "email_ops" {
  name = "Operations Email"

  email {
    addresses    = ["ops@company.com", "sre@company.com"]
    single_email = true
    message      = "{{ template \"default.message\" . }}"
  }
}

resource "grafana_contact_point" "slack_alerts" {
  name = "Slack Alerts"

  slack {
    url       = var.slack_webhook_url
    recipient = "#alerts"
    title     = "{{ template \"default.title\" . }}"
    text      = "{{ template \"default.message\" . }}"
  }
}

resource "grafana_contact_point" "pagerduty_critical" {
  name = "PagerDuty Critical"

  pagerduty {
    integration_key = var.pagerduty_key
    severity        = "critical"
  }
}

5.2 Message Templates

resource "grafana_message_template" "custom" {
  name = "custom_alerts"

  template = <<EOT
{{ define "custom_email.message" }}
Alert: {{ .CommonLabels.alertname }}
Severity: {{ .CommonLabels.severity }}
Summary: {{ .CommonAnnotations.summary }}
Runbook: {{ .CommonAnnotations.runbook_url }}
{{ end }}
EOT
}

# Reference the template from a contact point
resource "grafana_contact_point" "email_custom" {
  name = "Custom Email"

  email {
    addresses = ["oncall@company.com"]
    message   = "{{ template \"custom_email.message\" . }}"
  }
}

5.3 Mute Timings

resource "grafana_mute_timing" "weekends" {
  name = "No Weekends"

  intervals {
    weekdays = ["saturday", "sunday"]
  }
}

resource "grafana_mute_timing" "maintenance" {
  name = "Maintenance Window"

  intervals {
    weekdays = ["monday"]
    times {
      start = "02:00"
      end   = "04:00"
    }
  }
}

5.4 Notification Policy Tree

⚠️ Important: grafana_notification_policy is a singleton resource. Applying it overwrites the entire notification policy tree, so every policy must be defined in code.

resource "grafana_notification_policy" "main" {
  group_by      = ["alertname", "grafana_folder", "severity"]
  contact_point = grafana_contact_point.email_ops.name

  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "4h"

  # Critical alerts -> PagerDuty
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point = grafana_contact_point.pagerduty_critical.name
    group_wait    = "10s"
    continue      = true  # keep evaluating the sibling policies below
  }

  # Warnings -> Slack
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "warning"
    }
    contact_point = grafana_contact_point.slack_alerts.name
  }

  # Development alerts -> muted on weekends
  policy {
    matcher {
      label = "environment"
      match = "="
      value = "development"
    }
    contact_point = grafana_contact_point.slack_alerts.name
    mute_timings  = [grafana_mute_timing.weekends.name]
  }
}

5.5 Alert Rule Groups

resource "grafana_rule_group" "platform" {
  name             = "platform_alerts"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60 # evaluate every 60s

  rule {
    name      = "High CPU Usage"
    condition = "B"

    # A: instant query returning one CPU usage value per instance
    data {
      ref_id = "A"
      relative_time_range {
        from = 300
        to   = 0
      }
      datasource_uid = grafana_data_source.prometheus.uid
      model = jsonencode({
        expr    = "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
        instant = true
        refId   = "A"
      })
    }

    # B: threshold expression; the rule fires when A > 80
    data {
      ref_id = "B"
      relative_time_range {
        from = 0
        to   = 0
      }
      datasource_uid = "__expr__"
      model = jsonencode({
        type       = "threshold"
        expression = "A"
        refId      = "B"
        conditions = [{
          evaluator = {
            type   = "gt"
            params = [80]
          }
        }]
      })
    }

    annotations = {
      summary     = "CPU usage above 80% on {{ $labels.instance }}"
      description = "Instance {{ $labels.instance }} has CPU usage of {{ $values.A.Value }}%"
      runbook_url = "https://wiki.internal/runbooks/high-cpu"
    }

    labels = {
      severity = "critical"
      team     = "sre"
    }
  }
}

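Alert rule design notes

  • All rules in a rule_group share one evaluation interval, and Terraform manages the group as a single unit
  • Use for_each to create similar alerts in bulk
  • Point datasource_uid at the Terraform data source resource rather than hard-coding it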
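
Within a single rule group, the batching idea can be sketched with a dynamic block, which iterates the same way for_each does on a resource. Everything specific here is an assumption for illustration: the disk_alerts map, the group name, and the node_exporter filesystem metrics. Each rule uses an instant query feeding a threshold expression.

```hcl
# Hypothetical thresholds map; metric names and values are illustrative.
locals {
  disk_alerts = {
    warning  = 80
    critical = 90
  }
}

resource "grafana_rule_group" "disk" {
  name             = "disk_alerts"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60

  # One rule per map entry: rule.key is the severity, rule.value the threshold.
  dynamic "rule" {
    for_each = local.disk_alerts

    content {
      name      = "Disk usage ${rule.key}"
      condition = "B"

      data {
        ref_id         = "A"
        datasource_uid = grafana_data_source.prometheus.uid
        relative_time_range {
          from = 300
          to   = 0
        }
        model = jsonencode({
          expr    = "100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)"
          instant = true
          refId   = "A"
        })
      }

      data {
        ref_id         = "B"
        datasource_uid = "__expr__"
        relative_time_range {
          from = 0
          to   = 0
        }
        model = jsonencode({
          type       = "threshold"
          expression = "A"
          refId      = "B"
          conditions = [{ evaluator = { type = "gt", params = [rule.value] } }]
        })
      }

      labels = {
        severity = rule.key
      }
    }
  }
}
```

Adding a new severity level then means adding one map entry rather than copying a whole rule block.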

6. SLOs and Synthetic Monitoring

6.1 SLOs as Code

Grafana Cloud SLOs can be managed through Terraform. Once an SLO is created, Grafana generates the associated recording rules, dashboard, and alerts automatically.

resource "grafana_slo" "api_availability" {
  name        = "API Availability"
  description = "99.9% availability target for API gateway"

  query {
    type = "ratio"
    ratio {
      # Raw counter selectors; Grafana SLO applies the rate calculation itself
      success_metric = "http_requests_total{status!~\"5..\"}"
      total_metric   = "http_requests_total"
    }
  }

  objectives {
    value  = 0.999
    window = "30d"
  }

  alerting {
    fastburn {
      annotation {
        key   = "severity"
        value = "critical"
      }
      label {
        key   = "team"
        value = "sre"
      }
    }
    slowburn {
      annotation {
        key   = "severity"
        value = "warning"
      }
    }
  }

  # Optional: place the SLO in a specific folder
  folder_uid = grafana_folder.slo.uid
}

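⚠️ Initialization pitfall: on a newly created Grafana Cloud stack, the SLO feature has to be initialized once by hand (a single click in the UI), otherwise the first terraform apply fails. As a workaround, delay SLO creation with a time_sleep resource (from the hashicorp/time provider) or run an initialization script first.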

resource "time_sleep" "wait_for_slo_init" {
  create_duration = "60s"
  depends_on      = [grafana_cloud_stack.main]
}

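6.2 Synthetic Monitoring

# Look up the probes available to this stack (map of probe name => probe ID)
data "grafana_synthetic_monitoring_probes" "main" {}

resource "grafana_synthetic_monitoring_check" "homepage" {
  job       = "homepage"
  target    = "https://example.com"
  enabled   = true
  frequency = 60000 # milliseconds (60s)
  timeout   = 5000  # milliseconds

  # Uses the first probe by name for brevity; select specific probes by name in practice
  probes = [
    values(data.grafana_synthetic_monitoring_probes.main.probes)[0]
  ]

  settings {
    http {
      method              = "GET"
      valid_status_codes  = [200]
      valid_http_versions = ["HTTP/1.1", "HTTP/2.0"]
    }
  }
}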

7. IAM and Organization Structure

7.1 Users and Teams

# iam.tf
# Note: grafana_user calls the Grafana admin API, which requires basic auth
# (a server admin account) rather than a Service Account token.
locals {
  developer_emails = toset([
    "alice@company.com",
    "bob@company.com",
    "charlie@company.com",
  ])
}

resource "random_password" "user_passwords" {
  for_each = local.developer_emails
  length   = 16
  special  = true
}

resource "grafana_user" "developers" {
  for_each = local.developer_emails

  email    = each.value
  login    = split("@", each.value)[0]
  password = random_password.user_passwords[each.value].result
  is_admin = false
}

resource "grafana_team" "sre" {
  name  = "SRE Team"
  email = "sre@company.com"
  members = [
    grafana_user.developers["alice@company.com"].email,
    grafana_user.developers["bob@company.com"].email,
  ]
}

resource "grafana_team" "platform" {
  name  = "Platform Team"
  email = "platform@company.com"
  members = [
    grafana_user.developers["charlie@company.com"].email,
  ]
}

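7.2 Organizations and Multi-Tenancy

# Like grafana_user, grafana_organization needs basic auth with a server admin account.
resource "grafana_organization" "engineering" {
  name = "Engineering"
}

# Scope resources to the organization with the resource-level org_id attribute.
# (A provider block that reads org_id from a resource in the same configuration
# cannot be resolved reliably at plan time.)
resource "grafana_folder" "eng_infra" {
  org_id = grafana_organization.engineering.org_id
  title  = "Infrastructure"
}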

8. Multi-Environment Strategy

8.1 Terraform Workspaces

Use Terraform workspaces to isolate state per environment:

terraform workspace new production
terraform workspace new staging
terraform workspace new development

# Select per-environment settings by workspace name. Every workspace that is
# actually used needs an entry in grafana_configs.
locals {
  env = terraform.workspace

  grafana_configs = {
    production = {
      url   = "https://my-stack.grafana.net/"
      token = var.grafana_prod_token
    }
    staging = {
      url   = "https://staging.grafana.local/"
      token = var.grafana_staging_token
    }
  }
}

provider "grafana" {
  url  = local.grafana_configs[local.env].url
  auth = local.grafana_configs[local.env].token
}

8.2 Per-Environment Differences

locals {
  environment_tags = {
    production = ["prod", "critical"]
    staging    = ["staging", "non-critical"]
  }
}

resource "grafana_dashboard" "overview" {
  folder = grafana_folder.main.id
  config_json = templatefile("${path.module}/dashboards/overview.json.tpl", {
    environment = local.env
    tags        = local.environment_tags[local.env]
    datasource  = grafana_data_source.prometheus.uid
  })
}

8.3 Reusable Modules

# modules/monitoring-stack/main.tf
variable "environment" { type = string }
variable "prometheus_url" { type = string }

resource "grafana_folder" "main" {
  title = "${var.environment} Monitoring"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus ${var.environment}"
  url  = var.prometheus_url
}

resource "grafana_dashboard" "overview" {
  folder      = grafana_folder.main.id
  config_json = file("${path.module}/dashboards/overview.json")
}

output "folder_id" {
  value = grafana_folder.main.id
}

# environments/production/main.tf
module "prod_monitoring" {
  source = "../../modules/monitoring-stack"

  environment    = "Production"
  prometheus_url = "http://prometheus-prod.monitoring.svc:9090"
}
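
When the root configuration defines aliased providers, as in section 2.2, the module has to be told which Grafana instance to target. A sketch of that wiring, assuming a grafana.staging alias exists in the calling configuration; the staging Prometheus URL is a placeholder:

```hcl
# modules/monitoring-stack/versions.tf
# The module declares the provider it needs; the caller decides which
# configured instance is passed in.
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

# environments/staging/main.tf
module "staging_monitoring" {
  source = "../../modules/monitoring-stack"

  providers = {
    grafana = grafana.staging
  }

  environment    = "Staging"
  prometheus_url = "http://prometheus-staging.monitoring.svc:9090" # placeholder URL
}
```

Without the providers map, a module uses the default (unaliased) grafana provider of the calling configuration.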

9. State Management and Collaboration

9.1 Remote Backend Configuration

# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "grafana/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

9.2 Importing Existing Resources

To migrate dashboards that were created in the Grafana UI into Terraform:

# Step 1 (shell): export the dashboard JSON and clean it
curl -s -H "Authorization: Bearer $TOKEN" \
  "$URL/api/dashboards/uid/my-dashboard" | \
  jq '.dashboard | del(.id, .version)' > dashboards/my-dashboard.json

# Step 2 (HCL): declare the matching resource
resource "grafana_dashboard" "my_dashboard" {
  folder      = grafana_folder.main.id
  config_json = file("${path.module}/dashboards/my-dashboard.json")
}

# Step 3 (shell): import into Terraform state, then reconcile
terraform import grafana_dashboard.my_dashboard <uid>
terraform plan  # compare against the live dashboard and close any remaining diff

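Bulk import script

#!/bin/bash
# import-all.sh
# Assumes a resource block named grafana_dashboard.<uid> already exists for each
# dashboard; UIDs that are not valid Terraform identifiers must be mapped to a
# valid resource name first.
uids=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "$URL/api/search?type=dash-db&limit=1000" | jq -r '.[].uid')

for uid in $uids; do
  echo "Importing dashboard: $uid"
  terraform import "grafana_dashboard.$uid" "$uid" || echo "Skipped $uid"
done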
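
On Terraform 1.5 or later, the same migration can be declared in configuration with an import block, which lets the import go through the usual plan and review cycle. A minimal sketch, where the dashboard UID and resource name are placeholders; depending on the provider version, the import ID may need an organization prefix, so check the resource's import documentation.

```hcl
# Requires Terraform 1.5 or later. The UID and resource name are placeholders.
import {
  to = grafana_dashboard.my_dashboard
  id = "my-dashboard" # UID of the existing dashboard
}

resource "grafana_dashboard" "my_dashboard" {
  folder      = grafana_folder.main.id
  config_json = file("${path.module}/dashboards/my-dashboard.json")
}
```

Once the import has been applied, the import block can be removed from the configuration.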

10. The Complete CI/CD Pipeline

10.1 GitHub Actions Workflow

# .github/workflows/grafana-terraform.yml
name: Grafana Infrastructure as Code

on:
  push:
    branches: [main]
    paths:
      - 'terraform/grafana/**'
      - 'dashboards/**'
  pull_request:
    paths:
      - 'terraform/grafana/**'
      - 'dashboards/**'

env:
  TF_VAR_grafana_auth: ${{ secrets.GRAFANA_SERVICE_ACCOUNT_TOKEN }}

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"

      - name: Terraform Format Check
        working-directory: terraform/grafana
        run: terraform fmt -check -recursive

      - name: Terraform Init
        working-directory: terraform/grafana
        run: terraform init

      - name: Terraform Validate
        working-directory: terraform/grafana
        run: terraform validate

      - name: Generate Dashboards (Jsonnet)
        if: hashFiles('dashboards/**/*.jsonnet') != ''
        run: |
          go install github.com/google/go-jsonnet/cmd/jsonnet@latest
          go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
          jb install
          mkdir -p output
          for f in dashboards/*.jsonnet; do
            jsonnet -J vendor "$f" > "output/$(basename $f .jsonnet).json"
          done

      - name: Validate Dashboard JSON
        run: |
          # globstar enables **; nullglob drops patterns that match nothing
          shopt -s globstar nullglob
          for f in output/*.json dashboards/**/*.json; do
            jq empty "$f"
          done

  plan:
    needs: validate
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

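      - name: Terraform Init & Plan
        working-directory: terraform/grafana
        run: |
          set -o pipefail
          terraform init
          # Save the human-readable plan so the next step can post it to the PR
          terraform plan -no-color -out=tfplan | tee tfplan.stdout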

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/grafana/tfplan.stdout', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Terraform Plan\n\`\`\`\n${plan}\n\`\`\``
            });

  deploy:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # requires approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init & Apply
        working-directory: terraform/grafana
        run: |
          terraform init
          terraform apply -auto-approve

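10.2 Approval and Rollback

  • Plan: runs automatically on every pull request, and the result is posted as a PR comment
  • Apply: triggered after merge to main; manual approval is enforced through GitHub Environment protection rules
  • Rollback: restore a previous state version, or git revert and re-apply
  • Scoped triggers: the paths filter limits runs to changes under terraform/grafana/** and dashboards/**, which avoids unrelated builds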

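11. Best Practices and Common Pitfalls

11.1 Core Best Practices

| Practice | Description |
|---|---|
| Fixed UIDs | Set uid explicitly on dashboards, folders, and data sources to avoid duplicate creation |
| Remove id/version | Strip the id and version fields when importing dashboard JSON |
| Lock UI editing | In production, keep disable_provenance = false (the default) on alerting resources so they stay read-only in the UI and Terraform remains the single source of truth |
| Isolate secrets | Declare webhook URLs, PagerDuty keys, and passwords as sensitive = true variables and inject them through environment variables |
| Branch protection | Block direct pushes to main; require a PR plus code review |
| State locking | Use DynamoDB or Terraform Cloud to prevent concurrent operations |
| Module reuse | Package the common monitoring stack as a module and reuse it across environments |
| UTC time zone | Set timezone: "utc" on all dashboards to avoid time zone confusion |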

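11.2 Common Pitfalls and Fixes

| Problem | Cause | Fix |
|---|---|---|
| Contact point deletion fails with 409 | Still referenced by the notification policy | Update the policy to remove the reference first, then delete the contact point; better, avoid circular dependencies in the design |
| Data source reference breaks | Hard-coded UID does not match the environment | Reference grafana_data_source.xxx.uid dynamically |
| First SLO creation fails | SLO feature not initialized on the Grafana Cloud stack | Initialize it once in the UI, or delay with time_sleep |
| Duplicate dashboards | UID conflicts or missing UIDs | Give every dashboard a fixed UID |
| Alert rule evaluates incorrectly | Misconfigured __expr__ data source | Keep each ref_id consistent and set datasource_uid = "__expr__" on expression steps |
| Frequent drift in terraform plan | Manual edits in the UI | Keep disable_provenance = false on alerting resources so provisioned resources cannot be edited in the UI |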

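11.3 Auditing Terraform Itself

It is worth recording Terraform-driven changes so they can be audited as well:

# Record each deployment as an annotation on the overview dashboard.
# timestamp() changes on every run, so this resource shows a change in every plan.
resource "grafana_annotation" "deployment" {
  text          = "Terraform apply: ${timestamp()}"
  dashboard_uid = grafana_dashboard.overview.uid
  tags          = ["terraform", "deployment"]
}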

12. Example Project Layout

grafana-infrastructure/
├── README.md
├── .github/
│   └── workflows/
│       └── grafana-terraform.yml
├── modules/
│   ├── monitoring-stack/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── dashboards/
│   │       └── overview.json
│   └── alerting-policy/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       ├── backend.tf
│       └── terraform.tfvars
├── global/
│   ├── providers.tf
│   ├── versions.tf
│   ├── variables.tf
│   ├── folders.tf
│   ├── datasources.tf
│   ├── permissions.tf
│   ├── iam.tf
│   └── alerting/
│       ├── contact-points.tf
│       ├── notification-policy.tf
│       ├── mute-timings.tf
│       ├── templates.tf
│       └── rule-groups.tf
├── dashboards/
│   ├── jsonnet/
│   │   ├── lib/
│   │   ├── cluster-overview.jsonnet
│   │   └── service-detail.jsonnet
│   └── json/          # final JSON, generated by CI or written by hand
│       ├── platform/
│       └── application/
└── scripts/
    ├── import-dashboards.sh
    └── validate-json.sh

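13. Summary

For teams that already run a Terraform workflow, managing Grafana through Terraform is the best fit. Its core value comes from:

  1. Full resource coverage: dashboards, data sources, alerting, SLOs, Synthetic Monitoring, and IAM are managed in one place
  2. Environment consistency: workspaces plus modules reproduce the same setup across environments
  3. Auditable changes: Git history plus terraform plan output give a complete review trail
  4. Disaster recovery: the entire Grafana configuration can be rebuilt from Git and state

Suggested rollout

  • Week 1: set up the provider and import existing data sources and folders
  • Weeks 2-3: bulk-import dashboards and establish the hybrid Jsonnet/Terraform workflow
  • Week 4: migrate alerting (contact points → policy → rule groups)
  • Week 5: bring in SLOs, Synthetic Monitoring, and IAM
  • Week 6: finish CI/CD, state locking, approval flow, and documentation

This approach turns Grafana from a hand-configured UI tool into an infrastructure component that is version-controlled, reviewable, and automated, closing the GitOps loop for the monitoring stack.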
