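This guide describes how to manage a full Grafana stack as code with the Terraform Grafana Provider, from architecture design through production rollout.
1. Architecture Overview and Core Design Principles
1.1 Why Choose Terraform
Grafana officially supports several as-code tools (Terraform, Ansible, Operator, Crossplane). Of these, the Terraform provider has the broadest resource coverage: it supports Dashboard, Datasource, Alert, SLO, Synthetic Monitoring, IAM, and nearly every other Grafana resource.
When it fits
- Terraform workflows already manage your cloud resources (AWS/GCP/Azure/K8s)
- Dashboards, alerts, SLOs, datasources, and permissions need to be managed together
- Multiple environments (dev/staging/prod) must stay strictly consistent
- The team already knows HCL
1.2 Architecture Layers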
┌─────────────────────────────────────────────────────────────┐
│ Git Repository │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │dashboards│ │ datasources│ │ alerting │ │ iam/teams │ │
│ │ (.json) │ │ (.tf) │ │ (.tf) │ │ (.tf) │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────────────┘ │
└────────────────────┬──────────────────────────────────────────┘
│ PR Review / CI Validation
▼
┌─────────────────────────────────────────────────────────────┐
│ CI/CD Pipeline (GitHub Actions/GitLab CI) │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ terraform fmt│ │ terraform plan│ │ terraform apply │ │
│ │ validate │ │ (review req) │ │ (auto/staging) │ │
│ └──────────────┘ └──────────────┘ └────────────────────┘ │
└────────────────────┬──────────────────────────────────────────┘
│ State Backend (S3 + DynamoDB / Terraform Cloud)
▼
┌─────────────────────────────────────────────────────────────┐
│ Grafana Instance(s) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │ OSS │ │ Cloud │ │ AWS │ │ Multi-tenant │ │
│ │ Self │ │ Stack │ │ Managed│ │ (prod/staging) │ │
│ │ Hosted │ │ │ │ Grafana│ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
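2. Provider Configuration and Authentication
2.1 Basic Provider Configuration
The examples below pin the Terraform Grafana Provider to ~> 2.0, which supports both Grafana OSS and Grafana Cloud. Newer major versions have been released since, so check the Terraform Registry and the upgrade guide before choosing a constraint.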
hcl
# versions.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 2.0" # pin one major version; read the upgrade guide before moving to a newer major
    }
  }
}

# provider.tf
provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_auth # a Service Account Token is recommended
}
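Authentication methods, in order of preference
- Service Account Token (recommended for production): create a Service Account in Grafana, assign it the Viewer, Editor, or Admin role, then generate a token
- API Key (being phased out in favor of Service Accounts)
- Basic Auth: admin:password (only for initial setup or local testing)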
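The provider block above reads var.grafana_url and var.grafana_auth, which this guide does not declare. A minimal sketch of those declarations might look like the following; the descriptions and the validation rule are assumptions added for illustration, not part of the original setup.

```hcl
# variables.tf (illustrative sketch)
variable "grafana_url" {
  description = "Base URL of the Grafana instance, for example https://grafana.example.com/"
  type        = string
}

variable "grafana_auth" {
  description = "Service account token, or user:password for basic auth"
  type        = string
  sensitive   = true # keeps the value out of plan and apply output

  validation {
    condition     = length(var.grafana_auth) > 0
    error_message = "grafana_auth must not be empty; set it through TF_VAR_grafana_auth."
  }
}
```

Marking the token as sensitive keeps it out of CLI output, but the value can still end up in the state file, so the state backend needs access controls of its own.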
2.2 Managing Multiple Instances (Provider Alias)
To manage several Grafana environments (for example a production Grafana Cloud stack plus a dev OSS instance):
hcl
provider "grafana" {
alias = "production"
url = "https://my-stack.grafana.net/"
auth = var.grafana_prod_token
}
provider "grafana" {
alias = "staging"
url = "https://staging.grafana.local/"
auth = var.grafana_staging_token
}
# Usage example
resource "grafana_folder" "prod_infra" {
provider = grafana.production
title = "Infrastructure"
}
resource "grafana_folder" "staging_infra" {
provider = grafana.staging
title = "Infrastructure"
}
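Aliased providers are not inherited by child modules automatically, so a module has to receive them explicitly. The sketch below assumes a module such as the monitoring-stack module described in section 8.3; the versions.tf file inside the module is an assumption added here.

```hcl
# modules/monitoring-stack/versions.tf (assumed file)
# The module declares which provider it needs, without configuring it.
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

# environments/production/main.tf
# Map the module's default "grafana" provider to the production alias.
module "prod_monitoring" {
  source = "../../modules/monitoring-stack"

  providers = {
    grafana = grafana.production
  }

  environment    = "Production"
  prometheus_url = "http://prometheus-prod.monitoring.svc:9090"
}
```

With this mapping, every grafana_* resource inside the module targets the production instance, and a second module call can pass grafana.staging in the same way.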
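2.3 Grafana Cloud-Specific Configuration
Managing Grafana Cloud resources such as Stacks and Synthetic Monitoring requires a Cloud Access Policy Token in addition to the instance credentials. The url and auth arguments always point at a Grafana instance, so the cloud-level tokens go in their own arguments:
hcl
provider "grafana" {
  alias = "cloud"
  # Cloud Access Policy Token for the Grafana Cloud API (stacks, access policies).
  # Older provider releases used cloud_api_key for this purpose.
  cloud_access_policy_token = var.grafana_cloud_api_key
  # Dedicated token for Synthetic Monitoring
  sm_access_token = var.grafana_sm_token
}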
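3. Managing Dashboard Resources in Depth
Dashboards are the most complex resource type in Grafana. Terraform takes the complete Dashboard Model JSON through the config_json attribute.
3.1 Directory Layout and File Organization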
bash
grafana-terraform/
├── modules/
│ └── dashboard-stack/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── dashboards/
│ ├── platform/
│ │ ├── cluster-overview.json
│ │ └── node-exporter.json
│ ├── application/
│ │ ├── api-gateway.json
│ │ └── payment-service.json
│ └── templates/
│ └── service-overview.json.tpl
├── environments/
│ ├── production/
│ │ ├── main.tf
│ │ └── terraform.tfvars
│ └── staging/
│ ├── main.tf
│ └── terraform.tfvars
└── global/
├── folders.tf
├── datasources.tf
└── permissions.tf
3.2 Bulk-Loading Dashboard JSON
Combine for_each with fileset to manage dashboards in bulk instead of writing a resource per dashboard:
hcl
# dashboards.tf
locals {
  dashboard_folders = {
    "platform"    = grafana_folder.platform.id
    "application" = grafana_folder.application.id
  }
}

resource "grafana_dashboard" "all" {
  # One entry per JSON file; the parent directory name selects the folder.
  # Directories without an entry in local.dashboard_folders are skipped.
  for_each = {
    for rel in fileset("${path.module}/dashboards", "*/*.json") :
    "${dirname(rel)}-${trimsuffix(basename(rel), ".json")}" => {
      folder = local.dashboard_folders[dirname(rel)]
      path   = "${path.module}/dashboards/${rel}"
    }
    if contains(keys(local.dashboard_folders), dirname(rel))
  }

  folder      = each.value.folder
  config_json = file(each.value.path)
  overwrite   = true
}
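3.3 Preparing Exported Dashboard JSON
JSON exported from the Grafana UI must be cleaned before Terraform can use it:
bash
# Cleanup: drop id and version; uid is left untouched
jq 'del(.id, .version)' exported.json > clean.json
Key fields
- id: must be removed; Grafana assigns it automatically
- version: must be removed to avoid version conflicts
- uid: must be kept and stay fixed; it identifies the dashboard across updates
- datasource.uid: should reference a Terraform data source resource rather than a hardcoded value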
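The same field rules can also be enforced inside Terraform, so that an uncleaned export is caught at plan time. The following is only a sketch: the resource name cluster_overview and the precondition are assumptions, and it is an alternative to loading the same file through a bulk resource, not an addition to it.

```hcl
locals {
  # Decode the exported JSON so individual fields can be inspected.
  exported_dashboard = jsondecode(file("${path.module}/dashboards/platform/cluster-overview.json"))

  # Drop the server-assigned fields; every other key is kept as-is.
  cleaned_dashboard = {
    for key, value in local.exported_dashboard : key => value
    if !contains(["id", "version"], key)
  }
}

resource "grafana_dashboard" "cluster_overview" {
  folder      = grafana_folder.platform.id
  config_json = jsonencode(local.cleaned_dashboard)

  lifecycle {
    # Fail the plan when the export is missing a uid or carries an empty one.
    precondition {
      condition     = try(length(local.exported_dashboard.uid) > 0, false)
      error_message = "Dashboard JSON must carry a fixed, non-empty uid."
    }
  }
}
```

Filtering in Terraform makes the jq step optional, at the cost of re-encoding the JSON, which can change key order in the stored model.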
3.4 Parameterizing with templatefile
For dashboards that share a structure but differ in metrics (for example a uniform view per microservice), use a Terraform template:
hcl
# templates/service-overview.json.tpl
{
"title": "${service_name} Overview",
"uid": "svc-${service_name}",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_requests_total{service=\"${service_name}\"}[$__rate_interval])"
}
]
}
]
}
# main.tf
resource "grafana_dashboard" "services" {
for_each = toset(["api-gateway", "web-frontend", "worker", "billing"])
folder = grafana_folder.application.id
config_json = templatefile("${path.module}/templates/service-overview.json.tpl", {
service_name = each.key
})
}
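3.5 Hybrid Grafonnet + Terraform Workflow
Hand-written JSON is hard to maintain for complex dashboards. A practical approach is to generate the JSON with Grafonnet (Jsonnet) and let Terraform handle deployment:
bash
# Workflow
dashboards/*.jsonnet --[jsonnet]--> output/*.json --[terraform]--> Grafana
Jsonnet example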
jsonnet
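// dashboards/cluster-overview.jsonnet
// Uses the current Grafonnet library (github.com/grafana/grafonnet);
// the older grafonnet-lib exposes a different API.
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

g.dashboard.new('Kubernetes Cluster Overview')
+ g.dashboard.withUid('k8s-cluster-overview')
+ g.dashboard.withTimezone('utc')
+ g.dashboard.withPanels([
  g.panel.timeSeries.new('CPU Usage')
  + g.panel.timeSeries.queryOptions.withTargets([
    g.query.prometheus.new('prometheus', 'sum(rate(container_cpu_usage_seconds_total[$__rate_interval])) by (namespace)'),
  ])
  + g.panel.timeSeries.gridPos.withX(0)
  + g.panel.timeSeries.gridPos.withY(0)
  + g.panel.timeSeries.gridPos.withW(12)
  + g.panel.timeSeries.gridPos.withH(8),
])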
CI integration
yaml
# .github/workflows/dashboards.yml
- name: Generate Dashboards
  run: |
    jb install  # jsonnet-bundler installs dependencies into vendor/
    mkdir -p output
    for f in dashboards/*.jsonnet; do
      jsonnet -J vendor "$f" > "output/$(basename "$f" .jsonnet).json"
    done
- name: Validate & Deploy
  run: |
    terraform init
    terraform plan
    terraform apply -auto-approve
4. Datasource and Folder Management
4.1 Configuring Data Source Types
Terraform supports dozens of data source types, including Prometheus, Elasticsearch, CloudWatch, Jaeger, Loki, and Tempo.
hcl
# datasources.tf
resource "grafana_data_source" "prometheus" {
type = "prometheus"
name = "Prometheus"
uid = "prometheus-main" # fixed UID referenced by dashboards
url = "http://prometheus.monitoring.svc:9090"
is_default = true
json_data_encoded = jsonencode({
httpMethod = "POST"
manageAlerts = true
prometheusType = "Prometheus"
prometheusVersion = "2.40.0"
})
}
resource "grafana_data_source" "cloudwatch" {
type = "cloudwatch"
name = "AWS CloudWatch"
uid = "cloudwatch-main"
json_data_encoded = jsonencode({
defaultRegion = "us-east-1"
authType = "default" # use the EC2 IAM role
})
}
resource "grafana_data_source" "elasticsearch" {
type = "elasticsearch"
name = "Application Logs"
uid = "es-logs"
url = "https://es.example.com:9200"
database_name = "[logs-]YYYY.MM.DD"
json_data_encoded = jsonencode({
esVersion = "8.0.0"
timeField = "@timestamp"
maxConcurrentShardRequests = 256
logMessageField = "message"
logLevelField = "level"
})
}
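Key points
- Always set uid explicitly; dashboards reference it through ${grafana_data_source.prometheus.uid}
- Use json_data_encoded rather than the legacy json_data block to avoid provider version compatibility problems
- AWS Managed Grafana data sources need SigV4 settings (for example sigV4Auth inside json_data_encoded)
4.2 Folders and Permissions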
hcl
# folders.tf
resource "grafana_folder" "platform" {
title = "Platform Engineering"
uid = "platform"
}
resource "grafana_folder" "application" {
title = "Application Teams"
uid = "application"
}
# permissions.tf - folder-level permissions
resource "grafana_folder_permission" "platform" {
folder_uid = grafana_folder.platform.uid
permissions {
role = "Viewer"
permission = "View"
}
permissions {
team_id = grafana_team.sre.id
permission = "Edit"
}
permissions {
team_id = grafana_team.platform.id
permission = "Admin"
}
}
# Fine-grained dashboard-level permissions
resource "grafana_dashboard_permission" "sensitive" {
dashboard_uid = grafana_dashboard.security_overview.uid
permissions {
team_id = grafana_team.security.id
permission = "View"
}
}
5. Alerting as Code
Grafana Alerting is the most complex area to manage with Terraform. It spans five resource types: Contact Point, Notification Policy, Alert Rule, Mute Timing, and Message Template.
5.1 Contact Points
hcl
# alerting/contact-points.tf
resource "grafana_contact_point" "email_ops" {
name = "Operations Email"
email {
addresses = ["ops@company.com", "sre@company.com"]
single_email = true
message = "{{ template \"default.message\" . }}"
}
}
resource "grafana_contact_point" "slack_alerts" {
name = "Slack Alerts"
slack {
url = var.slack_webhook_url
recipient = "#alerts"
title = "{{ template \"default.title\" . }}"
text = "{{ template \"default.message\" . }}"
}
}
resource "grafana_contact_point" "pagerduty_critical" {
name = "PagerDuty Critical"
pagerduty {
integration_key = var.pagerduty_key
severity = "critical"
}
}
5.2 Message Templates
hcl
resource "grafana_message_template" "custom" {
name = "custom_alerts"
template = <<EOT
{{ define "custom_email.message" }}
Alert: {{ .CommonLabels.alertname }}
Severity: {{ .CommonLabels.severity }}
Summary: {{ .CommonAnnotations.summary }}
Runbook: {{ .CommonAnnotations.runbook_url }}
{{ end }}
EOT
}
# Reference the template from a contact point
resource "grafana_contact_point" "email_custom" {
name = "Custom Email"
email {
addresses = ["oncall@company.com"]
message = "{{ template \"custom_email.message\" . }}"
}
}
5.3 Mute Timings
hcl
resource "grafana_mute_timing" "weekends" {
name = "No Weekends"
intervals {
weekdays = ["saturday", "sunday"]
}
}
resource "grafana_mute_timing" "maintenance" {
name = "Maintenance Window"
intervals {
weekdays = ["monday"]
times {
start = "02:00"
end = "04:00"
}
}
}
5.4 Notification Policy Tree
⚠️ Key warning: grafana_notification_policy is a singleton resource, and applying it overwrites the entire notification policy tree. Every policy must be defined in code.
hcl
resource "grafana_notification_policy" "main" {
group_by = ["alertname", "grafana_folder", "severity"]
contact_point = grafana_contact_point.email_ops.name
group_wait = "30s"
group_interval = "5m"
repeat_interval = "4h"
# Critical alerts -> PagerDuty
policy {
matcher {
label = "severity"
match = "="
value = "critical"
}
contact_point = grafana_contact_point.pagerduty_critical.name
group_wait = "10s"
continue = true # keep matching the sibling policies that follow
}
# Warnings -> Slack
policy {
matcher {
label = "severity"
match = "="
value = "warning"
}
contact_point = grafana_contact_point.slack_alerts.name
}
# Development alerts -> muted on weekends
policy {
matcher {
label = "environment"
match = "="
value = "development"
}
contact_point = grafana_contact_point.slack_alerts.name
mute_timings = [grafana_mute_timing.weekends.name]
}
}
5.5 Alert Rule Groups
hcl
resource "grafana_rule_group" "platform" {
name = "platform_alerts"
folder_uid = grafana_folder.platform.uid
interval_seconds = 60 # evaluate every 60s
rule {
name = "High CPU Usage"
condition = "B"
data {
ref_id = "A"
relative_time_range {
from = 300
to = 0
}
datasource_uid = grafana_data_source.prometheus.uid
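model = jsonencode({
# The comparison against 80 happens in the threshold expression B, not in the query
expr = "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
instant = true # a threshold expression needs one value per series, not a range
refId = "A"
})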
}
data {
ref_id = "B"
relative_time_range {
from = 0
to = 0
}
datasource_uid = "__expr__"
model = jsonencode({
type = "threshold"
expression = "A"
conditions = [{
evaluator = {
type = "gt"
params = [80]
}
}]
})
}
annotations = {
summary = "CPU usage above 80% on {{ $labels.instance }}"
description = "Instance {{ $labels.instance }} has CPU usage of {{ $value }}%"
runbook_url = "https://wiki.internal/runbooks/high-cpu"
}
labels = {
severity = "critical"
team = "sre"
}
}
}
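Alert rule design notes
- All rules in a rule_group share the group's evaluation interval and are evaluated together
- Use for_each to create similar alerts in bulk
- Reference Terraform data source resources in datasource_uid instead of hardcoding UIDs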
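The bulk-creation point above can be sketched with a dynamic rule block that iterates over a map, which keeps all similar rules in one group. The alert names, expressions, and thresholds in local.node_alerts are hypothetical; the folder and data source references are the ones defined earlier in this guide.

```hcl
# Hypothetical alert definitions; adjust expressions and thresholds to your metrics.
locals {
  node_alerts = {
    "High CPU Usage" = {
      expr      = "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
      threshold = 80
      severity  = "critical"
    }
    "High Memory Usage" = {
      expr      = "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      threshold = 90
      severity  = "warning"
    }
  }
}

resource "grafana_rule_group" "node" {
  name             = "node_alerts"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60

  # One rule per map entry; rule.key is the alert name.
  dynamic "rule" {
    for_each = local.node_alerts
    content {
      name      = rule.key
      condition = "B"
      for       = "5m"

      # A: instant Prometheus query returning one value per series
      data {
        ref_id         = "A"
        datasource_uid = grafana_data_source.prometheus.uid
        relative_time_range {
          from = 300
          to   = 0
        }
        model = jsonencode({
          refId   = "A"
          expr    = rule.value.expr
          instant = true
        })
      }

      # B: server-side threshold expression applied to A
      data {
        ref_id         = "B"
        datasource_uid = "__expr__"
        relative_time_range {
          from = 0
          to   = 0
        }
        model = jsonencode({
          refId      = "B"
          type       = "threshold"
          expression = "A"
          conditions = [{
            evaluator = { type = "gt", params = [rule.value.threshold] }
          }]
        })
      }

      labels = {
        severity = rule.value.severity
      }
    }
  }
}
```

A dynamic block is used here rather than for_each on the resource because the rules belong to a single group; for_each on grafana_rule_group itself would instead create one group per entry.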
6. SLO and Synthetic Monitoring
6.1 SLO as Code
The Grafana Cloud SLO feature can be managed through Terraform. After an SLO is created, the system automatically generates the associated Recording Rules, Dashboard, and Alerts.
hcl
resource "grafana_slo" "api_availability" {
name = "API Availability"
description = "99.9% availability target for API gateway"
query {
type = "ratio"
ratio {
# Ratio queries take counter metrics; the rate calculation is applied by the SLO feature
success_metric = "http_requests_total{status!~\"5..\"}"
total_metric = "http_requests_total"
}
}
objectives {
value = 0.999
window = "30d"
}
alerting {
fastburn {
annotation {
key = "severity"
value = "critical"
}
label {
key = "team"
value = "sre"
}
}
slowburn {
annotation {
key = "severity"
value = "warning"
}
}
}
# Optional: place the generated artifacts in a specific folder
folder_uid = grafana_folder.slo.uid
}
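⚠️ Initialization pitfall: a newly created Grafana Cloud Stack needs its SLO feature initialized manually first (open it once in the UI); otherwise the first Terraform apply fails. As a workaround, delay creation with a time_sleep resource (from the hashicorp/time provider) or run an initialization script beforehand. The delay only takes effect if the SLO resource declares depends_on = [time_sleep.wait_for_slo_init].
hcl
resource "time_sleep" "wait_for_slo_init" {
  create_duration = "60s"
  depends_on      = [grafana_cloud_stack.main]
}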
6.2 Synthetic Monitoring
hcl
resource "grafana_synthetic_monitoring_check" "homepage" {
job = "homepage"
target = "https://example.com"
enabled = true
frequency = 60000 # 60s
timeout = 5000
probes = [
# probes is a map of probe name to ID; this picks the first ID
values(data.grafana_synthetic_monitoring_probes.main.probes)[0]
]
settings {
http {
method = "GET"
valid_status_codes = [200]
valid_http_versions = ["HTTP/1.1", "HTTP/2.0"]
}
}
}
7. IAM and Organization Structure
7.1 Managing Users and Teams
hcl
# iam.tf
resource "grafana_user" "developers" {
for_each = toset([
"alice@company.com",
"bob@company.com",
"charlie@company.com"
])
email = each.value
login = split("@", each.value)[0]
password = random_password.user_passwords[each.value].result
is_admin = false
}
resource "random_password" "user_passwords" {
for_each = toset(["alice@company.com", "bob@company.com", "charlie@company.com"])
length = 16
special = true
}
resource "grafana_team" "sre" {
name = "SRE Team"
email = "sre@company.com"
members = [
grafana_user.developers["alice@company.com"].email,
grafana_user.developers["bob@company.com"].email,
]
}
resource "grafana_team" "platform" {
name = "Platform Team"
email = "platform@company.com"
members = [
grafana_user.developers["charlie@company.com"].email,
]
}
7.2 Organizations and Multi-Tenancy
hcl
# Managing organizations requires basic auth with a Grafana server admin account.
resource "grafana_organization" "engineering" {
  name = "Engineering"
}

# Scope the resource to the new organization with the resource-level org_id,
# which avoids configuring a provider from a resource attribute.
resource "grafana_folder" "eng_infra" {
  org_id = grafana_organization.engineering.org_id
  title  = "Infrastructure"
}
8. Multi-Environment Strategy
8.1 Terraform Workspaces
Use Terraform Workspaces to isolate state per environment:
bash
terraform workspace new production
terraform workspace new staging
terraform workspace new development
hcl
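# Settings are selected by workspace name. Every workspace in use needs an entry in
# grafana_configs; one without an entry (such as development above) fails with an invalid index error.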
locals {
env = terraform.workspace
grafana_configs = {
production = {
url = "https://my-stack.grafana.net/"
token = var.grafana_prod_token
}
staging = {
url = "https://staging.grafana.local/"
token = var.grafana_staging_token
}
}
}
provider "grafana" {
url = local.grafana_configs[local.env].url
auth = local.grafana_configs[local.env].token
}
8.2 Environment-Specific Configuration
hcl
locals {
environment_tags = {
production = ["prod", "critical"]
staging = ["staging", "non-critical"]
}
}
resource "grafana_dashboard" "overview" {
folder = grafana_folder.main.id
config_json = templatefile("${path.module}/dashboards/overview.json.tpl", {
environment = local.env
tags = local.environment_tags[local.env]
datasource = grafana_data_source.prometheus.uid
})
}
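When a template receives a list such as tags, plain interpolation into a JSON text template does not produce valid JSON on its own. One alternative, sketched here under the assumption that the dashboard is small enough to express in HCL, is to build the model with jsonencode so lists and strings are serialized correctly. The resource name overview_inline, the panel, and the query are hypothetical.

```hcl
resource "grafana_dashboard" "overview_inline" {
  folder = grafana_folder.main.id

  # jsonencode handles quoting and list serialization, so tags needs no manual escaping.
  config_json = jsonencode({
    title    = "Overview (${local.env})"
    uid      = "overview-${local.env}"
    timezone = "utc"
    tags     = local.environment_tags[local.env]
    panels = [{
      type    = "timeseries"
      title   = "Request Rate"
      gridPos = { x = 0, y = 0, w = 12, h = 8 }
      datasource = {
        type = "prometheus"
        uid  = grafana_data_source.prometheus.uid
      }
      targets = [{
        refId = "A"
        expr  = "sum(rate(http_requests_total[$__rate_interval]))"
      }]
    }]
  })
}
```

If the text template is kept instead, the list can be emitted inside the template with ${jsonencode(tags)} rather than ${tags}.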
8.3 Module Reuse Pattern
hcl
# modules/monitoring-stack/main.tf
variable "environment" { type = string }
variable "prometheus_url" { type = string }
resource "grafana_folder" "main" {
title = "${var.environment} Monitoring"
}
resource "grafana_data_source" "prometheus" {
type = "prometheus"
name = "Prometheus ${var.environment}"
url = var.prometheus_url
}
resource "grafana_dashboard" "overview" {
folder = grafana_folder.main.id
config_json = file("${path.module}/dashboards/overview.json")
}
output "folder_id" {
value = grafana_folder.main.id
}
# environments/production/main.tf
module "prod_monitoring" {
source = "../../modules/monitoring-stack"
environment = "Production"
prometheus_url = "http://prometheus-prod.monitoring.svc:9090"
}
9. State Management and Collaboration
9.1 Remote Backend Configuration
hcl
# backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "grafana/production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
9.2 Importing Existing Resources
To migrate dashboards created in the Grafana UI into Terraform:
bash
# 1. Export the dashboard JSON and clean it
curl -H "Authorization: Bearer $TOKEN" \
  "$URL/api/dashboards/uid/my-dashboard" | \
  jq '.dashboard | del(.id, .version)' > dashboards/my-dashboard.json
# 2. Write the Terraform resource
resource "grafana_dashboard" "my_dashboard" {
  folder      = grafana_folder.main.id
  config_json = file("${path.module}/dashboards/my-dashboard.json")
}
# 3. Import it into the Terraform state
terraform import grafana_dashboard.my_dashboard <uid>
terraform plan # compare the differences and fill in the code
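Since Terraform 1.5 (the minimum version this guide requires), the import step can also be declared in configuration, which makes it reviewable in a PR like any other change. A minimal sketch, assuming the dashboard UID is my-dashboard as in the curl call above:

```hcl
# imports.tf (illustrative)
# Declarative equivalent of `terraform import`; applied during the next plan/apply.
import {
  to = grafana_dashboard.my_dashboard
  id = "my-dashboard" # the dashboard UID; the exact ID format depends on the provider version
}
```

The import block can be deleted once the apply has succeeded. If the resource block has not been written yet, terraform plan -generate-config-out=generated.tf can draft one, though the generated config_json usually still needs the cleanup described in section 3.3.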
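Bulk import script
bash
#!/bin/bash
# import-all.sh
# Each dashboard needs a matching resource block in the configuration before it can be imported.
set -euo pipefail
uids=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "$URL/api/search?type=dash-db&limit=1000" | jq -r '.[].uid')
for uid in $uids; do
  # Resource names cannot start with a digit or contain most punctuation, so normalize the UID.
  name="d_$(printf '%s' "$uid" | tr -c 'A-Za-z0-9_' '_')"
  echo "Importing dashboard: $uid as grafana_dashboard.$name"
  terraform import "grafana_dashboard.$name" "$uid" || echo "Skipped $uid"
done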
10. Complete CI/CD Pipeline
10.1 GitHub Actions Workflow
yaml
# .github/workflows/grafana-terraform.yml
name: Grafana Infrastructure as Code
on:
  push:
    branches: [main]
    paths:
      - 'terraform/grafana/**'
      - 'dashboards/**'
  pull_request:
    paths:
      - 'terraform/grafana/**'
      - 'dashboards/**'
env:
  TF_VAR_grafana_auth: ${{ secrets.GRAFANA_SERVICE_ACCOUNT_TOKEN }}
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"
      - name: Terraform Format Check
        working-directory: terraform/grafana
        run: terraform fmt -check -recursive
      - name: Terraform Init
        working-directory: terraform/grafana
        run: terraform init
      - name: Terraform Validate
        working-directory: terraform/grafana
        run: terraform validate
      - name: Generate Dashboards (Jsonnet)
        if: hashFiles('dashboards/**/*.jsonnet') != ''
        run: |
          go install github.com/google/go-jsonnet/cmd/jsonnet@latest
          go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
          jb install
          mkdir -p output
          for f in dashboards/*.jsonnet; do
            jsonnet -J vendor "$f" > "output/$(basename "$f" .jsonnet).json"
          done
      - name: Validate Dashboard JSON
        run: |
          # globstar makes ** recurse; nullglob drops patterns that match nothing
          shopt -s globstar nullglob
          for f in output/*.json dashboards/**/*.json; do
            jq empty "$f"
          done
  plan:
    needs: validate
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write # needed to comment on the PR
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init & Plan
        working-directory: terraform/grafana
        run: |
          set -o pipefail
          terraform init
          # Save the human-readable plan so the next step can post it
          terraform plan -no-color -out=tfplan | tee tfplan.stdout
      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/grafana/tfplan.stdout', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Terraform Plan\n\`\`\`\n${plan}\n\`\`\``
            });
  deploy:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production # requires approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init & Apply
        working-directory: terraform/grafana
        run: |
          terraform init
          terraform apply -auto-approve
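10.2 Approval and Rollback
- Plan stage: runs automatically on every PR and posts the result as a PR comment
- Apply stage: triggered after merging to main; manual approval is enforced through GitHub Environment Protection Rules
- Rollback: restore a previous Terraform State version, or use Git Revert followed by a re-apply
- Dashboard-only changes: trigger only when paths under dashboards/** change, which cuts down on unrelated builds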
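11. Best Practices and Common Pitfalls
11.1 Core Best Practices
| Practice | Description |
|---|---|
| Fixed UIDs | Set uid explicitly on Dashboards, Folders, and Datasources to avoid duplicate creation |
| Remove id/version | Delete the id and version fields when importing JSON |
| Disable UI editing | In production keep disable_provenance = false (the default) on alerting resources so that Terraform remains the single source of truth |
| Isolate secrets | Declare Webhook URLs, PagerDuty keys, and passwords as sensitive = true variables and inject them through environment variables |
| Branch protection | Forbid direct pushes to main; require PR + Code Review |
| State locking | Use DynamoDB or Terraform Cloud to prevent concurrent operations |
| Module reuse | Package the common monitoring stack as a module and reuse it across environments |
| UTC timezone | Set timezone: "utc" on every Dashboard to avoid timezone confusion |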
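11.2 Common Pitfalls and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Contact Point deletion fails with 409 | Still referenced by the Notification Policy | Update the Policy to remove the reference first, then delete the Contact Point; or design to avoid circular dependencies |
| Datasource reference breaks | Hardcoded UID does not match the environment | Reference grafana_data_source.xxx.uid dynamically |
| First SLO creation fails | Grafana Cloud SLO feature not initialized | Initialize it once in the UI, or delay with time_sleep |
| Duplicate Dashboards | UID conflict or UID not set | Make sure every Dashboard has a fixed UID |
| Alert Rule evaluates incorrectly | Misconfigured __expr__ data source | Follow the ref_id and datasource_uid = "__expr__" conventions strictly |
| Frequent drift in Terraform Plan | Manual changes in the UI | Keep disable_provenance = false so provisioned resources cannot be edited in the UI |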
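11.3 Monitoring Terraform Itself
Terraform deployments are worth recording for audit purposes as well:
hcl
# Record deployment information from Terraform.
# timestamp() changes on every run, so this resource shows a diff on every plan.
resource "grafana_annotation" "deployment" {
  text          = "Terraform apply: ${timestamp()}"
  dashboard_uid = grafana_dashboard.overview.uid
  tags          = ["terraform", "deployment"]
}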
12. Complete Project Structure Example
bash
grafana-infrastructure/
├── README.md
├── .github/
│ └── workflows/
│ └── grafana-terraform.yml
├── modules/
│ ├── monitoring-stack/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── dashboards/
│ │ └── overview.json
│ └── alerting-policy/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── environments/
│ ├── production/
│ │ ├── main.tf
│ │ ├── backend.tf
│ │ └── terraform.tfvars
│ └── staging/
│ ├── main.tf
│ ├── backend.tf
│ └── terraform.tfvars
├── global/
│ ├── providers.tf
│ ├── versions.tf
│ ├── variables.tf
│ ├── folders.tf
│ ├── datasources.tf
│ ├── permissions.tf
│ ├── iam.tf
│ └── alerting/
│ ├── contact-points.tf
│ ├── notification-policy.tf
│ ├── mute-timings.tf
│ ├── templates.tf
│ └── rule-groups.tf
├── dashboards/
│ ├── jsonnet/
│ │ ├── lib/
│ │ ├── cluster-overview.jsonnet
│ │ └── service-detail.jsonnet
│ └── json/ # final JSON, generated by CI or written by hand
│ ├── platform/
│ └── application/
└── scripts/
├── import-dashboards.sh
└── validate-json.sh
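13. Summary
For teams that already run Terraform workflows, the Terraform route is a strong fit for Grafana IaC. Its core value lies in:
- Full resource coverage: Dashboards, Datasources, Alerts, SLOs, Synthetic Monitoring, and IAM are managed in one place
- Environment consistency: Workspaces and Modules reproduce the same setup across environments
- Auditable changes: Git history plus Terraform Plan give a complete review trail
- Disaster recovery: the entire Grafana configuration can be rebuilt from Git and State
Suggested rollout
- Week 1: set up the Provider and import existing Datasources and Folders
- Weeks 2-3: import Dashboards in bulk and establish the hybrid Jsonnet/Terraform workflow
- Week 4: migrate Alerting (Contact Point → Policy → Rule Group)
- Week 5: bring in SLO, Synthetic Monitoring, and IAM
- Week 6: finish CI/CD, state locking, approval flow, and documentation
This approach turns Grafana from a manually configured UI tool into an infrastructure component that is version-controlled, reviewable, and automatable, closing the GitOps loop for the monitoring stack.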