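Fully Automated Recovery of Spot Instances with GCP Eventarc and Terraform

1. Background and Challenges

On Google Cloud Platform (GCP), Spot VMs (the successor to preemptible instances) are widely used because of their significant cost advantage: Google documents discounts of roughly 60-91% compared with standard instances, although the exact figure varies by machine type and region. The trade-off is that Google can reclaim these instances at any time when it needs the capacity back (preemption), which makes high availability harder to guarantee.

For background services that tolerate interruption but must come back up automatically after being reclaimed, relying on monitoring alerts plus manual intervention is slow and runs counter to automated-operations (AIOps) best practice. This article describes how to combine GCP Eventarc, Cloud Functions, and Terraform to build a fully serverless mechanism that restarts Spot instances automatically.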

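2. Architecture and Event-Driven Model

The core idea is to turn lifecycle state changes of the underlying resource into standardized events that code can consume. The end-to-end flow is as follows:

Event generation: When a Spot instance is reclaimed, GCP automatically writes an audit log entry to Cloud Audit Logs with the method name v1.compute.instances.preempted.
Event routing: Eventarc acts as the event bus. It picks up these log entries and matches them against the configured event filters, so unrelated logs are never forwarded.
Event consumption: When a filter matches, Eventarc invokes the Cloud Functions v2 function and passes it a standard CloudEvent payload.
State repair: The function parses the target resource path, checks the instance against an allowlist, and calls the Google Cloud Compute API with a Start request to bring the instance back up.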
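
The routing and parsing steps above can be pictured as an exact-match comparison on audit-log attributes followed by a split of the resource path. The following is a minimal local simulation under stated assumptions, not Eventarc's actual implementation: the sample payload is hand-written and abbreviated, the project and zone are made up, and `matches_filters` and `split_resource_name` are hypothetical helpers.

```python
# Hypothetical, abbreviated audit-log payload as it might appear in cloud_event.data.
SAMPLE_EVENT = {
    "protoPayload": {
        "serviceName": "compute.googleapis.com",
        "methodName": "v1.compute.instances.preempted",
        "resourceName": (
            "projects/demo-project/zones/us-central1-a/"
            "instances/tf-vpc0-subnet0-openclaw"
        ),
    }
}

# Filters mirroring the attribute/value pairs used in this article.
FILTERS = {
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.preempted",
}


def matches_filters(event: dict, filters: dict) -> bool:
    """Return True only if every filter attribute equals the payload value."""
    payload = event.get("protoPayload", {})
    return all(payload.get(attr) == value for attr, value in filters.items())


def split_resource_name(resource_name: str) -> dict:
    """Turn 'projects/p/zones/z/instances/i' into {'projects': 'p', ...}."""
    parts = resource_name.split("/")
    return dict(zip(parts[0::2], parts[1::2]))


if __name__ == "__main__":
    print(matches_filters(SAMPLE_EVENT, FILTERS))
    print(split_resource_name(SAMPLE_EVENT["protoPayload"]["resourceName"]))
```

Both filters must match for an event to be forwarded, which is why an unrelated Compute Engine operation such as an instance insert would never reach the function.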

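3. Core Implementation

3.1 Application Logic (Python)

At the application layer, the function uses Google's functions_framework package to handle the CloudEvent payload. It parses the event fields to locate the target resource, then uses the google-cloud-compute library to start the instance.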

import functions_framework
from google.cloud import compute_v1

# Allowlist: only this instance is restarted automatically.
TARGET_VM = "tf-vpc0-subnet0-openclaw"


@functions_framework.cloud_event
def restart_vm(cloud_event):
    """Handle a Compute Engine preemption event routed by Eventarc."""
    data = cloud_event.data or {}
    proto_payload = data.get("protoPayload", {})
    resource_name = proto_payload.get("resourceName", "")

    # Expected format: projects/<project>/zones/<zone>/instances/<name>
    parts = resource_name.split("/")
    if len(parts) < 6:
        print(f"Resource path parsing failed. Invalid format: {resource_name}")
        return

    project_id, zone, instance_name = parts[1], parts[3], parts[5]

    # Allowlist check on the exact instance name. A substring test against the
    # whole path would also match similarly named instances such as "<name>-2".
    if instance_name != TARGET_VM:
        print(f"Ignored event for resource: {resource_name}")
        return

    print(f"Initiating start sequence for instance: {instance_name} in zone: {zone}")

    # Create the Compute client and build the start request
    client = compute_v1.InstancesClient()
    request = compute_v1.StartInstanceRequest(
        project=project_id,
        zone=zone,
        instance=instance_name,
    )

    # start() returns a long-running operation; this function does not wait for it
    client.start(request=request)
    print(f"Start command dispatched for instance: {instance_name}")

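Note: Dependencies are declared separately from the code. Place a requirements.txt (listing functions-framework and google-cloud-compute) in the same directory as the function source; the GCP build environment installs these dependencies when the function is built and deployed, not at invocation time.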
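
To exercise the decision logic locally without GCP credentials or the client library, one option is to separate it from the API call. The sketch below is one possible arrangement, not the article's deployed code: `parse_instance_path`, `handle_preemption`, and the injected `start_fn` callable are hypothetical names, and only the standard library is used.

```python
from typing import Callable, Optional, Tuple

# Allowlist of instance names; the single entry is the name used in this article.
ALLOWED_INSTANCES = {"tf-vpc0-subnet0-openclaw"}


def parse_instance_path(resource_name: str) -> Optional[Tuple[str, str, str]]:
    """Return (project, zone, instance), or None if the path has another shape."""
    parts = resource_name.split("/")
    if len(parts) != 6 or parts[0::2] != ["projects", "zones", "instances"]:
        return None
    return parts[1], parts[3], parts[5]


def handle_preemption(
    resource_name: str,
    start_fn: Callable[[str, str, str], None],
) -> str:
    """Decide what to do with one event; start_fn performs the real API call."""
    parsed = parse_instance_path(resource_name)
    if parsed is None:
        return "invalid"
    project, zone, instance = parsed
    if instance not in ALLOWED_INSTANCES:
        return "ignored"
    start_fn(project, zone, instance)
    return "started"


if __name__ == "__main__":
    calls = []
    # A fake start function that only records its arguments.
    fake_start = lambda p, z, i: calls.append((p, z, i))
    path = "projects/demo/zones/us-central1-a/instances/tf-vpc0-subnet0-openclaw"
    print(handle_preemption(path, fake_start))
    print(calls)
```

In the deployed function, `start_fn` could wrap the `compute_v1.InstancesClient().start(...)` call, while tests pass a recording fake as shown in the `__main__` block.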

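3.2 Infrastructure Orchestration (Terraform)

All cloud resources are declared as Infrastructure as Code (IaC) so that the environment is reproducible and every change is traceable. The configuration aims to follow the IAM principle of least privilege (PoLP).

IAM and Storage Declarations

Define a dedicated service account and grant it the permissions needed to start instances and receive events, and create a GCS bucket to hold the dynamically packaged function source. Note that roles/compute.instanceAdmin.v1 covers considerably more than starting an instance; a custom role limited to compute.instances.start (plus compute.instances.get if needed) would be a tighter fit for least privilege. Depending on the project setup, the trigger's service account may also need roles/run.invoker on the function's underlying Cloud Run service.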

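variable "project_id" {
  type        = string
  description = "GCP project ID"
}

variable "region_id" {
  type        = string
  description = "Region for the bucket and the function"
}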

# Dedicated runtime identity for the function
resource "google_service_account" "autorestart_sa" {
  account_id   = "autorestart-sa"
  display_name = "SA for Auto-Restart Cloud Function"
}

# Grant permission to manage Compute Engine instances
resource "google_project_iam_member" "compute_admin" {
  project = var.project_id
  role    = "roles/compute.instanceAdmin.v1"
  member  = "serviceAccount:${google_service_account.autorestart_sa.email}"
}

# Grant permission to receive Eventarc events
resource "google_project_iam_member" "event_receiver" {
  project = var.project_id
  role    = "roles/eventarc.eventReceiver"
  member  = "serviceAccount:${google_service_account.autorestart_sa.email}"
}

# Bucket for the function source, plus dynamic packaging of the source directory
resource "google_storage_bucket" "function_source_bucket" {
  name                        = "${var.project_id}-gcf-source"
  location                    = var.region_id
  uniform_bucket_level_access = true
  force_destroy               = true
}

data "archive_file" "function_zip" {
  type        = "zip"
  source_dir  = "${path.module}/src"
  output_path = "${path.module}/function-source.zip"
}

resource "google_storage_bucket_object" "function_zip_obj" {
  name   = "function-source-${data.archive_file.function_zip.output_md5}.zip"
  bucket = google_storage_bucket.function_source_bucket.name
  source = data.archive_file.function_zip.output_path
}
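
Embedding the archive's MD5 in the object name means the name changes whenever the source changes, which in turn changes the function's storage source and prompts a redeploy. The sketch below illustrates that idea only; it is not how Terraform's archive provider computes its hash byte for byte, and `build_zip` and `object_name` are hypothetical helpers.

```python
import hashlib
import io
import zipfile


def build_zip(files: dict) -> bytes:
    """Build a zip in memory with fixed timestamps so equal inputs give equal bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):
            # A fixed date keeps the archive bytes independent of build time.
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            zf.writestr(info, files[name])
    return buf.getvalue()


def object_name(zip_bytes: bytes) -> str:
    """Derive a content-addressed object name from the archive bytes."""
    digest = hashlib.md5(zip_bytes).hexdigest()
    return f"function-source-{digest}.zip"


if __name__ == "__main__":
    v1 = build_zip({"main.py": "print(1)\n"})
    v2 = build_zip({"main.py": "print(2)\n"})
    print(object_name(v1))
    print(object_name(v1) == object_name(v2))
```

Unchanged source yields the same name, so repeated runs leave the uploaded object and the function untouched.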

Function Resource and Eventarc Trigger

Declare the Cloud Functions v2 function and define the Eventarc routing rules in its event_trigger block.

resource "google_cloudfunctions2_function" "autorestart_fn" {
  name        = "autorestart-spot-vm"
  location    = var.region_id
  description = "Auto restart Spot VM when preempted"

  build_config {
    runtime     = "python312"
    entry_point = "restart_vm"
    source {
      storage_source {
        bucket = google_storage_bucket.function_source_bucket.name
        object = google_storage_bucket_object.function_zip_obj.name
      }
    }
  }

  service_config {
    max_instance_count    = 1
    available_memory      = "256M"
    timeout_seconds       = 60
    service_account_email = google_service_account.autorestart_sa.email
  }

  # Eventarc event subscription and filter rules
  event_trigger {
    trigger_region        = "global"
    event_type            = "google.cloud.audit.log.v1.written"
    service_account_email = google_service_account.autorestart_sa.email
    
    # Filter 1: restrict to Compute Engine service logs
    event_filters {
      attribute = "serviceName"
      value     = "compute.googleapis.com"
    }
    
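    # Filter 2: match the method name recorded for instance preemption.
    # Confirm the exact value in Logs Explorer; Compute Engine's documentation
    # shows preemption system events as compute.instances.preempted.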
    event_filters {
      attribute = "methodName"
      value     = "v1.compute.instances.preempted"
    }
  }
}
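
Two practical concerns follow from this trigger configuration. Eventarc delivery is at-least-once, so the same preemption may invoke the function more than once, and the event may arrive while the instance is still in the STOPPING state, when a start request may be rejected. A minimal sketch of a status-aware start is shown below, assuming the caller supplies `get_status` and `start` callables (hypothetical names); the 45-second default deadline is an arbitrary choice that stays under the 60-second function timeout.

```python
import time
from typing import Callable

# Compute Engine instance statuses in which a start call is unnecessary.
ALREADY_UP = {"PROVISIONING", "STAGING", "RUNNING"}


def ensure_started(
    get_status: Callable[[], str],
    start: Callable[[], None],
    deadline_s: float = 45.0,
    poll_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Start the instance once it is TERMINATED; skip if it is already coming up."""
    end = clock() + deadline_s
    while True:
        status = get_status()
        if status in ALREADY_UP:
            # A duplicate event or an earlier invocation already handled this.
            return "skipped"
        if status == "TERMINATED":
            start()
            return "started"
        if clock() >= end:
            # Still STOPPING (or another transitional state) at the deadline.
            return "timed_out"
        sleep(poll_s)


if __name__ == "__main__":
    statuses = iter(["STOPPING", "TERMINATED"])
    result = ensure_started(
        get_status=lambda: next(statuses),
        start=lambda: print("start requested"),
        sleep=lambda s: None,  # skip real waiting in this demo
    )
    print(result)
```

In the deployed function, `get_status` could read the `status` field returned by `InstancesClient().get(...)`, and `start` could wrap the existing `client.start(...)` call.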

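4. Summary of Advantages

No polling cost: Instead of running a long-lived process that checks instance state on a schedule, this design relies on events pushed from GCP's audit logs. Filtering happens in Eventarc's routing layer, so non-matching log entries never invoke the function and incur no function compute charges; the function is billed only when a real preemption event occurs.
Low maintenance overhead: With Terraform, permissions, event routing, and application logic are declared as clearly separated resources. Packaging and uploading the source and updating the runtime configuration are all handled within the same IaC workflow.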