Distributed Transactions in Practice: Saga Pattern + Compensation + Local Message Table + Eventual Consistency

Preface

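💡 Pain points: How do microservices keep their data consistent? How is a Saga implemented? How do you write a compensating transaction? Is the local message table reliable? How do you choose between TCC and Saga?

🎯 Solution: This article walks through the core distributed transaction patterns: Saga implementation (choreography and orchestration), compensating-transaction design and recovery strategies, reliable delivery with a local message table, TCC (Try-Confirm-Cancel), hands-on Seata, and a decision tree for eventual versus strong consistency.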

[Diagram, three groups. Saga orchestration: a Saga Orchestrator drives Step 1 (create order), Step 2 (deduct stock), Step 3 (charge payment), and Step 4 (send notification). Compensation: Compensable Action, Recovery (recover/retry), and Compensate Backward (reverse compensation), connected by edges labeled "failure" and "compensate". Message-driven: a Local Message Table, a reliable message queue (RocketMQ/Kafka), and scheduled polling with dead-letter retry.]


1. The Nature of the Distributed Transaction Problem

1.1 Why CAP and BASE Drive the Choice

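go
// ======== Strong consistency vs. eventual consistency ========

/*
The CAP theorem: during a network partition a distributed system cannot keep all three of
- Consistency: every node sees the same data at the same time
- Availability: every request receives a response
- Partition tolerance: the system keeps running when the network partitions

Typical choices:
- CP systems (strong consistency): ZooKeeper, etcd, HBase
- AP systems (eventual consistency): Cassandra, DynamoDB

BASE (the basis of eventual consistency):
- Basically Available: some functionality may be temporarily unavailable
- Soft state: data may be temporarily inconsistent
- Eventually consistent: the system converges to a consistent state after some time
*/

// ======== Strong consistency option: two-phase commit (2PC) ========
/*
Problems with 2PC (rarely used across services in production):
1. Single coordinator: if the coordinator crashes, participants that already voted stay blocked
2. Synchronous blocking: every participant holds its locks from the prepare phase until the final decision
3. Inconsistent data: when the commit phase fails on only some participants

Conclusion: 2PC only fits tightly controlled setups, such as XA within one database deployment or across a very small number of shards
*/

// ======== Eventual consistency options: Saga / local message table / TCC ========
/*
Saga: long-running transactions (a business flow that spans several services)
TCC: short transactions (that need strong isolation)
Local message table: asynchronous decoupling (message reliability comes first)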
*/

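1.2 Decision Tree for Business Scenarios

go
// ======== Which pattern, and when? ========
/*
Decision tree:

1. Is eventual consistency acceptable?
   ├─ Yes → continue
   └─ No  → strong consistency → keep the data in one database, or 2PC (not recommended)

2. How long does the transaction run?
   ├─ < 1 second (short transaction) → TCC
   └─ > 1 second (long-running flow) → Saga or local message table

3. Idempotency?
   └─ Every option needs it: Saga steps and compensations, TCC Try/Confirm/Cancel,
      and message consumers can all be retried, and no pattern provides it for free

4. How complex is the compensation logic?
   ├─ Simple (reversible operations) → Saga
   └─ Complex (needs business judgment) → local message table

Typical scenarios:
- E-commerce order (order → stock → payment → logistics): Saga
- Money transfer (debit → credit): TCC or Saga
- Asynchronous messaging (order → notification → log): local message table
- Flash-sale stock deduction: TCC (strong isolation) or Redis + MQ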
*/
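
The questions in the decision tree can be written down as a small function. This is only a sketch under stated assumptions: `Requirements`, `Choose`, and the `Pattern` constants are hypothetical names introduced here, and the one-second threshold is the rule of thumb from the tree, not a hard limit.

```go
package main

// Pattern names one of the approaches discussed in this article.
type Pattern string

const (
	SingleDatabase    Pattern = "single database (or 2PC)"
	TCC               Pattern = "TCC"
	Saga              Pattern = "Saga"
	LocalMessageTable Pattern = "local message table"
)

// Requirements is a hypothetical summary of one business flow.
type Requirements struct {
	EventualConsistencyOK bool // question 1
	ExpectedMillis        int  // question 2: typical end-to-end duration
	ComplexCompensation   bool // question 4: undoing needs business judgment
}

// Choose applies the questions in order. Question 3 (idempotency) does not
// select a pattern, because every pattern needs it.
func Choose(r Requirements) Pattern {
	if !r.EventualConsistencyOK {
		return SingleDatabase
	}
	if r.ExpectedMillis < 1000 {
		return TCC
	}
	if r.ComplexCompensation {
		return LocalMessageTable
	}
	return Saga
}
```

For example, a five-second order flow with simple, reversible steps maps to Saga, while a 200 ms balance reservation maps to TCC. Real selection usually weighs more factors (team experience, existing infrastructure), so treat the function as a way to make the tree explicit rather than as a rule engine.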

2. The Saga Pattern

2.1 Implementing a Saga Orchestrator

go
// ======== Saga Orchestrator (orchestration-based Saga) ========
package saga

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/google/uuid"
)

// SagaStep defines one step of a Saga.
type SagaStep struct {
	Name        string
	Execute     func(ctx context.Context, payload []byte) (result []byte, err error)
	Compensate  func(ctx context.Context, payload []byte) error // undoes Execute; receives the step's own output
	RetryPolicy RetryPolicy
}

// SagaResult is the outcome of one Saga run.
type SagaResult struct {
	SagaID       string
	Completed    bool
	CompletedSteps []string
	FailedStep   string
	Error        error
}

// SagaOrchestrator runs the steps in order and compensates on failure.
type SagaOrchestrator struct {
	sagaID  string
	steps   []SagaStep
	results map[int][]byte // output of each step, used as input to its compensation
	mu      sync.Mutex
}

func NewSaga(steps []SagaStep) *SagaOrchestrator {
	return &SagaOrchestrator{
		sagaID:  uuid.New().String(),
		steps:   steps,
		results: make(map[int][]byte),
	}
}

// Execute runs the Saga.
func (s *SagaOrchestrator) Execute(ctx context.Context, initialPayload []byte) (*SagaResult, error) {
	result := &SagaResult{
		SagaID: s.sagaID,
	}

	executed := []string{}
	payload := initialPayload

	for i, step := range s.steps {
		log.Printf("[Saga %s] Executing step %d: %s", s.sagaID, i, step.Name)

		// Run the step, retrying according to its policy.
		var (
			out []byte
			err error
		)
		for attempt := 0; attempt <= step.RetryPolicy.MaxRetries; attempt++ {
			// Keep the input payload intact so that a retry sees the same input.
			out, err = step.Execute(ctx, payload)

			if err == nil {
				break
			}

			if attempt < step.RetryPolicy.MaxRetries {
				wait := step.RetryPolicy.Backoff.Duration(attempt)
				log.Printf("[Saga %s] Step %s failed (attempt %d), retrying in %v: %v",
					s.sagaID, step.Name, attempt+1, wait, err)
				time.Sleep(wait)
			}
		}

		if err != nil {
			result.FailedStep = step.Name
			result.Error = err
			result.CompletedSteps = executed

			// Compensate the steps that already succeeded, in reverse order.
			log.Printf("[Saga %s] Step %s failed, starting compensation", s.sagaID, step.Name)
			s.compensate(ctx, executed)

			// The business failure is reported through result.Error.
			return result, nil
		}
		payload = out

		// Save this step's output.
		s.mu.Lock()
		s.results[i] = payload
		s.mu.Unlock()
		executed = append(executed, step.Name)

		log.Printf("[Saga %s] Step %s completed successfully", s.sagaID, step.Name)
	}

	result.Completed = true
	result.CompletedSteps = executed
	return result, nil
}

// compensate undoes the executed steps in reverse order.
func (s *SagaOrchestrator) compensate(ctx context.Context, executed []string) {
	for i := len(executed) - 1; i >= 0; i-- {
		stepIdx := -1
		for j, step := range s.steps {
			if step.Name == executed[i] {
				stepIdx = j
				break
			}
		}

		if stepIdx == -1 {
			continue
		}

		step := s.steps[stepIdx]
		s.mu.Lock()
		payload := s.results[stepIdx]
		s.mu.Unlock()

		log.Printf("[Saga %s] Compensating step %d: %s", s.sagaID, stepIdx, step.Name)

		if err := step.Compensate(ctx, payload); err != nil {
			// Compensation failed: log it and escalate to manual intervention.
			log.Printf("[Saga %s] CRITICAL: Compensation failed for %s: %v",
				s.sagaID, step.Name, err)
			// Raise an alert so the Saga enters the manual-handling flow.
			s.sendCompensationAlert(step.Name, err)
		} else {
			log.Printf("[Saga %s] Compensation completed for %s", s.sagaID, step.Name)
		}
	}
}

func (s *SagaOrchestrator) sendCompensationAlert(stepName string, err error) {
	// Send to the alerting system to trigger manual intervention (stubbed with a print here).
	fmt.Printf("ALERT: Saga %s compensation failed at step %s: %v\n", s.sagaID, stepName, err)
}

// RetryPolicy describes how a step is retried.
type RetryPolicy struct {
	MaxRetries int
	Backoff    BackoffStrategy
}

type BackoffStrategy interface {
	Duration(attempt int) time.Duration
}

type ExponentialBackoff struct {
	Initial time.Duration
	Max     time.Duration
	Factor  float64
}

func (b ExponentialBackoff) Duration(attempt int) time.Duration {
	d := time.Duration(float64(b.Initial) * pow(b.Factor, attempt))
	// A zero Max means "no cap"; otherwise clamp the delay to Max.
	if b.Max > 0 && d > b.Max {
		return b.Max
	}
	return d
}

func pow(base float64, exp int) float64 {
	result := 1.0
	for i := 0; i < exp; i++ {
		result *= base
	}
	return result
}
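
To make the reverse-order compensation easier to see, here is a stripped-down sketch of the same control flow with no payloads, retries, or logging. The `step` type, `runSaga`, and `demoTrace` are hypothetical names for this illustration and are not part of the orchestrator above.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a simplified stand-in for SagaStep: no payloads, no retries.
type step struct {
	name       string
	execute    func() error
	compensate func() error
}

// runSaga executes steps in order. On the first failure it compensates the
// steps that already succeeded, newest first, and returns a trace of what ran.
func runSaga(steps []step) (trace []string, err error) {
	done := make([]step, 0, len(steps))
	for _, s := range steps {
		if err = s.execute(); err != nil {
			trace = append(trace, "fail:"+s.name)
			for i := len(done) - 1; i >= 0; i-- {
				if cerr := done[i].compensate(); cerr != nil {
					// A real orchestrator would alert and hand over to manual handling.
					trace = append(trace, "compensate-failed:"+done[i].name)
					continue
				}
				trace = append(trace, "undo:"+done[i].name)
			}
			return trace, fmt.Errorf("saga aborted at %s: %w", s.name, err)
		}
		done = append(done, s)
		trace = append(trace, "ok:"+s.name)
	}
	return trace, nil
}

// demoTrace runs a four-step order flow in which the payment step fails.
func demoTrace() []string {
	ok := func() error { return nil }
	steps := []step{
		{"createOrder", ok, ok},
		{"reserveStock", ok, ok},
		{"chargePayment", func() error { return errors.New("card declined") }, ok},
		{"sendNotification", ok, ok},
	}
	trace, _ := runSaga(steps)
	return trace
}
```

In `demoTrace` the failed step itself is not compensated, and the step after it never runs; only the two steps that completed are undone, stock first and then the order.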

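2.2 Complete Saga Example: Placing an E-commerce Order

go
// ======== Complete Saga for placing an e-commerce order ========
// Declared in package saga so that SagaOrchestrator, SagaStep, and RetryPolicy from 2.1 resolve.
package saga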

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
	"github.com/jackc/pgx/v5/pgxpool"
)

// OrderService is the order service.
type OrderService struct {
	db    *pgxpool.Pool
	redis *redis.Client
	saga  *SagaOrchestrator
}

// OrderPayload is the payload passed between the order Saga steps.
type OrderPayload struct {
	OrderID    string  `json:"order_id"`
	PaymentID  string  `json:"payment_id,omitempty"` // set by ChargePayment, read by its compensation
	UserID     string  `json:"user_id"`
	ProductID  string  `json:"product_id"`
	Quantity   int     `json:"quantity"`
	TotalPrice float64 `json:"total_price"`
}

// Step 1: create the order
func (s *OrderService) CreateOrder(ctx context.Context, payload []byte) ([]byte, error) {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return nil, err
	}

	p.OrderID = uuid.New().String()

	query := `
		INSERT INTO orders (order_id, user_id, product_id, quantity, total_price, status, created_at)
		VALUES ($1, $2, $3, $4, $5, 'pending', NOW())
		RETURNING order_id
	`
	var orderID string
	err := s.db.QueryRow(ctx, query,
		p.OrderID, p.UserID, p.ProductID, p.Quantity, p.TotalPrice,
	).Scan(&orderID)

	if err != nil {
		return nil, fmt.Errorf("failed to create order: %w", err)
	}

	p.OrderID = orderID
	result, _ := json.Marshal(p)
	return result, nil
}

// Step 1 compensation: cancel the order
func (s *OrderService) CompensateCreateOrder(ctx context.Context, payload []byte) error {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("invalid payload: %w", err)
	}

	query := `UPDATE orders SET status = 'cancelled' WHERE order_id = $1`
	_, err := s.db.Exec(ctx, query, p.OrderID)
	return err
}

// Step 2: reserve stock
func (s *OrderService) ReserveStock(ctx context.Context, payload []byte) ([]byte, error) {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return nil, fmt.Errorf("invalid payload: %w", err)
	}

	// Per-product Redis lock. If this call outlives the 10-second TTL, the
	// deferred Del can remove a lock that another request has since acquired.
	lockKey := fmt.Sprintf("stock:lock:%s", p.ProductID)
	locked, err := s.redis.SetNX(ctx, lockKey, "1", 10*time.Second).Result()
	if err != nil {
		return nil, fmt.Errorf("failed to acquire stock lock: %w", err)
	}
	if !locked {
		return nil, fmt.Errorf("stock lock for product %s is held by another request", p.ProductID)
	}
	defer s.redis.Del(ctx, lockKey)

	// Check, then deduct, while holding the lock.
	stockKey := fmt.Sprintf("product:stock:%s", p.ProductID)
	stock, err := s.redis.Get(ctx, stockKey).Int()
	if err != nil {
		return nil, fmt.Errorf("failed to read stock: %w", err)
	}
	if stock < p.Quantity {
		return nil, fmt.Errorf("insufficient stock: available=%d, requested=%d", stock, p.Quantity)
	}

	if err := s.redis.DecrBy(ctx, stockKey, int64(p.Quantity)).Err(); err != nil {
		return nil, fmt.Errorf("failed to reserve stock: %w", err)
	}

	// The payload already carries product_id and quantity, which is all the
	// compensation needs, so it is passed on unchanged.
	return payload, nil
}

// Step 2 compensation: restore the stock
func (s *OrderService) CompensateReserveStock(ctx context.Context, payload []byte) error {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("invalid payload: %w", err)
	}

	stockKey := fmt.Sprintf("product:stock:%s", p.ProductID)
	return s.redis.IncrBy(ctx, stockKey, int64(p.Quantity)).Err()
}

// Step 3: charge the payment (simulated)
func (s *OrderService) ChargePayment(ctx context.Context, payload []byte) ([]byte, error) {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return nil, fmt.Errorf("invalid payload: %w", err)
	}

	// Simulated payment gateway call.
	paymentID := fmt.Sprintf("PAY_%s", uuid.New().String()[:8])

	// Record the payment (a real implementation would call the payment gateway here).
	query := `
		INSERT INTO payments (payment_id, order_id, user_id, amount, status, created_at)
		VALUES ($1, $2, $3, $4, 'completed', NOW())
	`
	_, err := s.db.Exec(ctx, query, paymentID, p.OrderID, p.UserID, p.TotalPrice)
	if err != nil {
		return nil, fmt.Errorf("payment failed: %w", err)
	}

	// Carry the payment ID forward (and keep every other field) so that the
	// compensation can refund this payment.
	p.PaymentID = paymentID
	return json.Marshal(p)
}

// Step 3 compensation: refund (simulated)
func (s *OrderService) CompensateChargePayment(ctx context.Context, payload []byte) error {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("invalid payload: %w", err)
	}

	// Simulated refund.
	query := `UPDATE payments SET status = 'refunded' WHERE payment_id = $1`
	_, err := s.db.Exec(ctx, query, p.PaymentID)
	return err
}

// Step 4: send the notification
func (s *OrderService) SendNotification(ctx context.Context, payload []byte) ([]byte, error) {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return nil, fmt.Errorf("invalid payload: %w", err)
	}

	log.Printf("[Notification] Order %s confirmed for user %s, amount: %.2f",
		p.OrderID, p.UserID, p.TotalPrice)

	// Email, SMS, or push delivery would go here.
	// A delivery failure should only be logged, so that it does not fail the Saga.
	return payload, nil
}

// Step 4 compensation: send a cancellation and refund notice
func (s *OrderService) CompensateSendNotification(ctx context.Context, payload []byte) error {
	var p OrderPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("invalid payload: %w", err)
	}

	log.Printf("[Notification] Order %s cancelled, refund initiated for user %s",
		p.OrderID, p.UserID)
	return nil
}

// NewOrderSaga defines the steps of the order Saga for the given service.
func NewOrderSaga(orderService *OrderService) *SagaOrchestrator {
	steps := []SagaStep{
		{
			Name:       "createOrder",
			Execute:    orderService.CreateOrder,
			Compensate: orderService.CompensateCreateOrder,
			RetryPolicy: RetryPolicy{
				MaxRetries: 2,
				Backoff:    ExponentialBackoff{Initial: 100 * time.Millisecond, Max: 2 * time.Second, Factor: 2},
			},
		},
		{
			Name:       "reserveStock",
			Execute:    orderService.ReserveStock,
			Compensate: orderService.CompensateReserveStock,
			RetryPolicy: RetryPolicy{
				MaxRetries: 1,
				Backoff:    ExponentialBackoff{Initial: 200 * time.Millisecond, Max: 1 * time.Second, Factor: 2},
			},
		},
		{
			Name:       "chargePayment",
			Execute:    orderService.ChargePayment,
			Compensate: orderService.CompensateChargePayment,
			RetryPolicy: RetryPolicy{
				MaxRetries: 3,
				Backoff:    ExponentialBackoff{Initial: 500 * time.Millisecond, Max: 5 * time.Second, Factor: 2},
			},
		},
		{
			Name:       "sendNotification",
			Execute:    orderService.SendNotification,
			Compensate: orderService.CompensateSendNotification,
			RetryPolicy: RetryPolicy{
				MaxRetries: 0, // notification failures are not retried
			},
		},
	}

	return NewSaga(steps)
}
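
One property the compensations above need in production is idempotency: if `CompensateReserveStock` is retried after a timeout, a plain increment returns the stock twice. The following sketch shows the idea with an in-memory store. `stockStore` and `Restore` are hypothetical names, and the map stands in for Redis; in Redis the check and the increment would have to be made atomic (for example with a Lua script), which is not shown here.

```go
package main

import "sync"

// stockStore is an in-memory stand-in for the Redis stock counters.
type stockStore struct {
	mu       sync.Mutex
	stock    map[string]int
	restored map[string]bool // compensation keys that were already applied
}

func newStockStore() *stockStore {
	return &stockStore{stock: map[string]int{}, restored: map[string]bool{}}
}

// Restore gives quantity back to productID at most once per order.
// It reports whether this call actually changed the stock.
func (s *stockStore) Restore(orderID, productID string, quantity int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	key := orderID + "/" + productID
	if s.restored[key] {
		return false // retry or duplicate delivery: nothing left to do
	}
	s.restored[key] = true
	s.stock[productID] += quantity
	return true
}

// Stock returns the current counter for productID.
func (s *stockStore) Stock(productID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.stock[productID]
}
```

Keying the guard on the order ID means that a second compensation for the same order is a no-op, while a different order that touches the same product is still compensated normally.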

3. Local Message Table

3.1 Core Implementation of the Local Message Table

sql
-- ======== Local message table (outbox) schema ========
CREATE TABLE outbox (
    outbox_id      BIGSERIAL PRIMARY KEY,
    aggregate_type VARCHAR(100) NOT NULL,   -- 'order', 'payment', etc.
    aggregate_id   VARCHAR(100) NOT NULL,   -- business ID
    event_type     VARCHAR(100) NOT NULL,   -- 'order.created', 'payment.completed'
    payload        JSONB NOT NULL,          -- message body
    status         VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- pending: waiting to be sent, sent: delivered, failed: retries exhausted
    retry_count    INTEGER NOT NULL DEFAULT 0,
    max_retries    INTEGER NOT NULL DEFAULT 3,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    processed_at   TIMESTAMPTZ
);

-- Indexes
CREATE INDEX idx_outbox_status ON outbox (status, created_at);
CREATE INDEX idx_outbox_aggregate ON outbox (aggregate_type, aggregate_id);
CREATE INDEX idx_outbox_created ON outbox (created_at);

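-- Event log (idempotency guard). The unique key allows one row per
-- (event_type, aggregate_id), so it assumes each event type occurs at most once per aggregate.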
CREATE TABLE event_log (
    event_id       BIGSERIAL PRIMARY KEY,
    event_type     VARCHAR(100) NOT NULL,
    aggregate_id   VARCHAR(100) NOT NULL,
    event_data     JSONB NOT NULL,
    occurred_at    TIMESTAMPTZ NOT NULL,
    processed_at   TIMESTAMPTZ,
    UNIQUE(event_type, aggregate_id)
);
go
// ======== Local message table sender ========
package outbox

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// OutboxMessage is one row read from the outbox.
type OutboxMessage struct {
	OutboxID      int64
	AggregateType string
	AggregateID   string
	EventType     string
	Payload       []byte
}

// OutboxRepository wraps the outbox table operations.
type OutboxRepository struct {
	db *pgxpool.Pool
}

func NewOutboxRepository(db *pgxpool.Pool) *OutboxRepository {
	return &OutboxRepository{db: db}
}

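// Publish writes a message inside the caller's transaction, so that it commits or rolls back together with the business data.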
func (r *OutboxRepository) Publish(
	ctx context.Context,
	tx pgx.Tx,
	aggregateType, aggregateID, eventType string,
	payload interface{},
) error {
	payloadJSON, err := json.Marshal(payload)
	if err != nil {
		return fmt.Errorf("failed to marshal payload: %w", err)
	}

	query := `
		INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload, status)
		VALUES ($1, $2, $3, $4, 'pending')
	`
	_, err = tx.Exec(ctx, query, aggregateType, aggregateID, eventType, payloadJSON)
	return err
}

// GetPending fetches messages that are waiting to be sent.
func (r *OutboxRepository) GetPending(ctx context.Context, limit int) ([]OutboxMessage, error) {
	query := `
		SELECT outbox_id, aggregate_type, aggregate_id, event_type, payload
		FROM outbox
		WHERE status = 'pending'
		  AND retry_count < max_retries
		ORDER BY created_at ASC
		LIMIT $1
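		FOR UPDATE SKIP LOCKED  -- skips rows locked by other relays; the locks last only for the surrounding transaction, so with autocommit they are released as soon as this query returns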
	`

	rows, err := r.db.Query(ctx, query, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var messages []OutboxMessage
	for rows.Next() {
		var msg OutboxMessage
		if err := rows.Scan(&msg.OutboxID, &msg.AggregateType, &msg.AggregateID, &msg.EventType, &msg.Payload); err != nil {
			return nil, err
		}
		messages = append(messages, msg)
	}

	return messages, nil
}

// MarkAsSent marks a message as sent.
func (r *OutboxRepository) MarkAsSent(ctx context.Context, outboxID int64) error {
	query := `
		UPDATE outbox
		SET status = 'sent', processed_at = NOW()
		WHERE outbox_id = $1
	`
	_, err := r.db.Exec(ctx, query, outboxID)
	return err
}

// MarkAsFailed increments the retry count and marks the message as failed once the retries are exhausted.
func (r *OutboxRepository) MarkAsFailed(ctx context.Context, outboxID int64) error {
	query := `
		UPDATE outbox
		SET status = CASE
			WHEN retry_count + 1 >= max_retries THEN 'failed'
			ELSE 'pending'
		END,
		retry_count = retry_count + 1,
		updated_at = NOW()
		WHERE outbox_id = $1
	`
	_, err := r.db.Exec(ctx, query, outboxID)
	return err
}

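// Cleanup deletes messages that finished more than 7 days ago.
func (r *OutboxRepository) Cleanup(ctx context.Context) (int64, error) {
	// Failed rows never get processed_at, so fall back to updated_at for them.
	query := `
		DELETE FROM outbox
		WHERE status IN ('sent', 'failed')
		  AND COALESCE(processed_at, updated_at) < NOW() - INTERVAL '7 days'
	`
	result, err := r.db.Exec(ctx, query)
	if err != nil {
		return 0, err
	}
	return result.RowsAffected(), nil
}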

// ======== Outbox Relay (polling task) ========
type OutboxRelay struct {
	repo       *OutboxRepository
	publisher  MessagePublisher
	pollInterval time.Duration
	batchSize    int
}

type MessagePublisher interface {
	Publish(ctx context.Context, topic, key string, payload []byte) error
}

func (r *OutboxRelay) Start(ctx context.Context) {
	ticker := time.NewTicker(r.pollInterval)
	defer ticker.Stop()

	log.Printf("[OutboxRelay] Started, polling every %v", r.pollInterval)

	for {
		select {
		case <-ctx.Done():
			log.Println("[OutboxRelay] Stopping...")
			return
		case <-ticker.C:
			r.processBatch(ctx)
		}
	}
}

func (r *OutboxRelay) processBatch(ctx context.Context) {
	messages, err := r.repo.GetPending(ctx, r.batchSize)
	if err != nil {
		log.Printf("[OutboxRelay] Failed to get pending messages: %v", err)
		return
	}

	if len(messages) == 0 {
		return
	}

	log.Printf("[OutboxRelay] Processing %d messages", len(messages))

	for _, msg := range messages {
		err := r.publisher.Publish(ctx, msg.AggregateType, msg.AggregateID, msg.Payload)
		if err != nil {
			log.Printf("[OutboxRelay] Failed to publish outbox_id=%d: %v", msg.OutboxID, err)
			if err := r.repo.MarkAsFailed(ctx, msg.OutboxID); err != nil {
				log.Printf("[OutboxRelay] Failed to mark as failed: %v", err)
			}
			continue
		}

		if err := r.repo.MarkAsSent(ctx, msg.OutboxID); err != nil {
			log.Printf("[OutboxRelay] Failed to mark as sent: %v", err)
		}
	}
}
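
The retry behaviour of the relay is easier to check without a database. Below is a minimal in-memory sketch of the same state machine (pending → sent, or pending → failed once the retries run out). `message`, `relayOnce`, and `brokerDown` are hypothetical names for this illustration; the slice stands in for the outbox table and `publishFunc` for `MessagePublisher`.

```go
package main

import "errors"

type status string

const (
	pending status = "pending"
	sent    status = "sent"
	failed  status = "failed"
)

// message mirrors the outbox columns that drive the state machine.
type message struct {
	id         int
	payload    string
	status     status
	retryCount int
	maxRetries int
}

// publishFunc is a stand-in for MessagePublisher.Publish.
type publishFunc func(payload string) error

// relayOnce performs one polling pass over the outbox.
func relayOnce(outbox []message, publish publishFunc) {
	for i := range outbox {
		m := &outbox[i]
		if m.status != pending || m.retryCount >= m.maxRetries {
			continue // same filter as GetPending
		}
		if err := publish(m.payload); err != nil {
			// Same transition as MarkAsFailed.
			m.retryCount++
			if m.retryCount >= m.maxRetries {
				m.status = failed
			}
			continue
		}
		// If the process crashed right here, the message would be published
		// again on the next pass: delivery is at-least-once, not exactly-once.
		m.status = sent
	}
}

// brokerDown always fails, like an unreachable broker.
func brokerDown(string) error { return errors.New("broker unavailable") }
```

The comment before `m.status = sent` is the reason the consumer side has to be idempotent: publishing and marking are two separate steps, so duplicates are possible by design.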

3.2 Idempotent Consumer

go
// ======== Event consumer (idempotent processing) ========
package consumer

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

type EventConsumer struct {
	db *pgxpool.Pool
}

func (c *EventConsumer) Process(ctx context.Context, eventType, aggregateID string, payload []byte) error {
	// Idempotency check: has this event already been processed?
	exists, err := c.checkEventProcessed(ctx, eventType, aggregateID)
	if err != nil {
		return fmt.Errorf("failed to check event: %w", err)
	}

	if exists {
		log.Printf("[Consumer] Event %s:%s already processed, skipping", eventType, aggregateID)
		return nil
	}

	// Handle the event inside a transaction.
	tx, err := c.db.BeginTx(ctx, pgx.TxOptions{})
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx)

	// Business handling
	switch eventType {
	case "order.created":
		err = c.handleOrderCreated(ctx, tx, payload)
	case "payment.completed":
		err = c.handlePaymentCompleted(ctx, tx, payload)
	default:
		log.Printf("[Consumer] Unknown event type: %s", eventType)
		return nil
	}

	if err != nil {
		return err
	}

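	// Record the event as processed (idempotency marker). With ON CONFLICT DO NOTHING
	// a concurrent duplicate still commits its business changes, so the handlers
	// themselves should be idempotent as well.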
	if err := c.recordEvent(ctx, tx, eventType, aggregateID, payload); err != nil {
		return err
	}

	return tx.Commit(ctx)
}

func (c *EventConsumer) checkEventProcessed(ctx context.Context, eventType, aggregateID string) (bool, error) {
	query := `SELECT 1 FROM event_log WHERE event_type = $1 AND aggregate_id = $2 LIMIT 1`
	var exists int
	err := c.db.QueryRow(ctx, query, eventType, aggregateID).Scan(&exists)
	if err == pgx.ErrNoRows {
		return false, nil
	}
	return err == nil, err
}

func (c *EventConsumer) recordEvent(ctx context.Context, tx pgx.Tx, eventType, aggregateID string, payload []byte) error {
	query := `
		INSERT INTO event_log (event_type, aggregate_id, event_data, occurred_at, processed_at)
		VALUES ($1, $2, $3, NOW(), NOW())
		ON CONFLICT (event_type, aggregate_id) DO NOTHING
	`
	_, err := tx.Exec(ctx, query, eventType, aggregateID, payload)
	return err
}

func (c *EventConsumer) handleOrderCreated(ctx context.Context, tx pgx.Tx, payload []byte) error {
	var data struct {
		OrderID    string  `json:"order_id"`
		UserID     string  `json:"user_id"`
		TotalPrice float64 `json:"total_price"`
	}
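	if err := json.Unmarshal(payload, &data); err != nil {
		return fmt.Errorf("invalid order.created payload: %w", err)
	}

	log.Printf("[Consumer] Processing order.created: %s", data.OrderID)

	// Business logic: send a welcome email, update reports, and so on.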
	return nil
}

func (c *EventConsumer) handlePaymentCompleted(ctx context.Context, tx pgx.Tx, payload []byte) error {
	var data struct {
		OrderID   string  `json:"order_id"`
		PaymentID string  `json:"payment_id"`
		Amount    float64 `json:"amount"`
	}
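	if err := json.Unmarshal(payload, &data); err != nil {
		return fmt.Errorf("invalid payment.completed payload: %w", err)
	}

	log.Printf("[Consumer] Processing payment.completed: %s", data.PaymentID)

	// Update the order status.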
	query := `UPDATE orders SET status = 'paid' WHERE order_id = $1`
	_, err := tx.Exec(ctx, query, data.OrderID)
	return err
}
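
The deduplication rule can be isolated from the database code. The sketch below keys the check on a per-message ID instead of (event_type, aggregate_id); it assumes every message carries a unique ID (for example the `outbox_id`), which the `Process` signature above does not pass. `dedupConsumer` is a hypothetical name, and the map stands in for a table with a unique key written in the same transaction as the business change.

```go
package main

import "sync"

// dedupConsumer remembers the IDs of messages it has handled.
type dedupConsumer struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newDedupConsumer() *dedupConsumer {
	return &dedupConsumer{seen: map[string]bool{}}
}

// Process runs handle at most once per messageID and reports whether it ran.
// The lock is held across the check, the handler, and the marker, which is the
// in-memory equivalent of doing all three in one database transaction.
func (c *dedupConsumer) Process(messageID string, handle func() error) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.seen[messageID] {
		return false, nil // duplicate delivery: skip silently
	}
	if err := handle(); err != nil {
		return false, err // not marked as seen, so a redelivery can retry
	}
	c.seen[messageID] = true
	return true, nil
}
```

A message that is redelivered after a successful run is skipped, while a message whose handler failed is left unmarked so that the broker's retry gets another chance.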

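4. The TCC Pattern

4.1 Implementing TCC

go
// ======== TCC (Try-Confirm-Cancel) implementation ========
// RetryPolicy and ExponentialBackoff are the types from section 2.1.
package tcc

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/google/uuid"
	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// Try/Confirm/Cancel function types
type TryFunc func(ctx context.Context) (interface{}, error)             // reserve resources
type ConfirmFunc func(ctx context.Context, tryResult interface{}) error // commit the reservation
type CancelFunc func(ctx context.Context, tryResult interface{}) error  // release the reservation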

type TCCOption struct {
	TryTimeout    time.Duration
	ConfirmTimeout time.Duration
	RetryPolicy   RetryPolicy
}

type TxnContext struct {
	TxnID      string
	TryResult  interface{}
	Status     string
	CreatedAt  time.Time
}

// TCC Service
type TCCService struct {
	store map[string]*TxnContext
	mu    sync.RWMutex
}

func NewTCCService() *TCCService {
	return &TCCService{
		store: make(map[string]*TxnContext),
	}
}

// Execute runs one TCC transaction.
func (s *TCCService) Execute(
	ctx context.Context,
	opts TCCOption,
	try TryFunc,
	confirm ConfirmFunc,
	cancel CancelFunc,
) error {
	txnID := uuid.New().String()

	// ======== 1. Try phase: reserve resources ========
	// The context cancel functions get their own names so that they do not
	// shadow the cancel parameter.
	tryCtx, tryCancel := context.WithTimeout(ctx, opts.TryTimeout)
	defer tryCancel()

	log.Printf("[TCC %s] Try phase starting", txnID)

	tryResult, err := try(tryCtx)
	if err != nil {
		log.Printf("[TCC %s] Try failed: %v, executing cancel", txnID, err)
		// Try failed: run Cancel with a nil result.
		cancelCtx, stop := context.WithTimeout(ctx, 5*time.Second)
		defer stop()
		_ = cancel(cancelCtx, nil)
		return fmt.Errorf("try failed: %w", err)
	}

	// Save the Try result.
	s.mu.Lock()
	s.store[txnID] = &TxnContext{
		TxnID:     txnID,
		TryResult: tryResult,
		Status:    "try_completed",
		CreatedAt: time.Now(),
	}
	s.mu.Unlock()

	log.Printf("[TCC %s] Try completed, executing confirm", txnID)

	// ======== 2. Confirm phase: commit the reservation ========
	confirmCtx, confirmCancel := context.WithTimeout(ctx, opts.ConfirmTimeout)
	defer confirmCancel()

	for attempt := 0; attempt <= opts.RetryPolicy.MaxRetries; attempt++ {
		err = confirm(confirmCtx, tryResult)
		if err == nil {
			break
		}

		if attempt < opts.RetryPolicy.MaxRetries {
			wait := opts.RetryPolicy.Backoff.Duration(attempt)
			log.Printf("[TCC %s] Confirm failed (attempt %d), retrying in %v: %v",
				txnID, attempt+1, wait, err)
			time.Sleep(wait)
		}
	}

	if err != nil {
		log.Printf("[TCC %s] Confirm failed: %v, executing cancel", txnID, err)
		// Confirm failed after all retries: run Cancel to release the reservation.
		cancelCtx, stop := context.WithTimeout(ctx, 5*time.Second)
		defer stop()
		if cancelErr := cancel(cancelCtx, tryResult); cancelErr != nil {
			log.Printf("[TCC %s] Cancel also failed: %v", txnID, cancelErr)
		}
		return fmt.Errorf("confirm failed: %w", err)
	}

	log.Printf("[TCC %s] Confirm completed successfully", txnID)

	// Clean up.
	s.mu.Lock()
	delete(s.store, txnID)
	s.mu.Unlock()

	return nil
}

// ======== TCC transfer example ========
type AccountService struct {
	db *pgxpool.Pool
}

func (a *AccountService) TransferTCC(ctx context.Context, from, to string, amount float64) error {
	tcc := NewTCCService()

	return tcc.Execute(ctx, TCCOption{
		TryTimeout:     10 * time.Second,
		ConfirmTimeout: 5 * time.Second,
		RetryPolicy: RetryPolicy{
			MaxRetries: 2,
			Backoff:    ExponentialBackoff{Initial: 100 * time.Millisecond, Max: time.Second, Factor: 2},
		},
	}, func(ctx context.Context) (interface{}, error) {
		// ======== Try: freeze the amount ========
		// Move the amount from the available balance into frozen in a single
		// statement; the balance check in WHERE makes check-and-freeze atomic.
		tag, err := a.db.Exec(ctx,
			`UPDATE accounts
			    SET balance = balance - $1, frozen = frozen + $1
			  WHERE user_id = $2 AND balance >= $1`,
			amount, from)
		if err != nil {
			return nil, fmt.Errorf("failed to freeze amount: %w", err)
		}
		if tag.RowsAffected() == 0 {
			return nil, fmt.Errorf("insufficient balance or unknown account %q", from)
		}

		return map[string]interface{}{"from": from, "to": to, "amount": amount}, nil

	}, func(ctx context.Context, tryResult interface{}) error {
		// ======== Confirm: complete the transfer ========
		data := tryResult.(map[string]interface{})
		fromID := data["from"].(string)
		toID := data["to"].(string)
		amt := data["amount"].(float64)

		tx, err := a.db.BeginTx(ctx, pgx.TxOptions{})
		if err != nil {
			return err
		}
		defer tx.Rollback(ctx)

		// Consume the frozen amount on the source account.
		_, err = tx.Exec(ctx,
			`UPDATE accounts SET frozen = frozen - $1 WHERE user_id = $2`,
			amt, fromID)
		if err != nil {
			return err
		}

		// Credit the target account.
		_, err = tx.Exec(ctx,
			`UPDATE accounts SET balance = balance + $1 WHERE user_id = $2`,
			amt, toID)
		if err != nil {
			return err
		}

		return tx.Commit(ctx)

	}, func(ctx context.Context, tryResult interface{}) error {
		// ======== Cancel: unfreeze the amount ========
		if tryResult == nil {
			return nil // Try never froze anything
		}

		data := tryResult.(map[string]interface{})
		fromID := data["from"].(string)
		amt := data["amount"].(float64)

		// Return the frozen amount to the available balance.
		_, err := a.db.Exec(ctx,
			`UPDATE accounts SET balance = balance + $1, frozen = frozen - $1 WHERE user_id = $2`,
			amt, fromID)
		return err
	})
}
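
The freeze/confirm/cancel bookkeeping, including idempotent repeats and a Cancel that arrives before its Try, can be exercised in memory. The sketch below is an illustration under stated assumptions: `bank`, `account`, and the per-transaction `state` map are hypothetical names, amounts are integer cents rather than `float64`, and the maps stand in for the accounts table plus a transaction-log table.

```go
package main

import (
	"errors"
	"sync"
)

// account keeps amounts in cents to avoid floating-point rounding.
type account struct{ balance, frozen int64 }

// bank is an in-memory stand-in for the accounts table plus a transaction log.
type bank struct {
	mu       sync.Mutex
	accounts map[string]*account
	state    map[string]string // txnID -> "tried", "confirmed", or "cancelled"
}

func newBank(balances map[string]int64) *bank {
	b := &bank{accounts: map[string]*account{}, state: map[string]string{}}
	for id, cents := range balances {
		b.accounts[id] = &account{balance: cents}
	}
	return b
}

// Try freezes amount: it leaves the available balance but is not paid out yet.
func (b *bank) Try(txnID, from string, amount int64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state[txnID] {
	case "tried", "confirmed":
		return nil // duplicate Try
	case "cancelled":
		return errors.New("transaction was already cancelled")
	}
	acc := b.accounts[from]
	if acc == nil || acc.balance < amount {
		return errors.New("insufficient balance")
	}
	acc.balance -= amount
	acc.frozen += amount
	b.state[txnID] = "tried"
	return nil
}

// Confirm pays the frozen amount to the target account.
func (b *bank) Confirm(txnID, from, to string, amount int64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state[txnID] {
	case "confirmed":
		return nil // duplicate Confirm
	case "tried":
	default:
		return errors.New("confirm requires a successful try")
	}
	target := b.accounts[to]
	if target == nil {
		return errors.New("unknown target account")
	}
	b.accounts[from].frozen -= amount
	target.balance += amount
	b.state[txnID] = "confirmed"
	return nil
}

// Cancel returns the frozen amount to the available balance.
func (b *bank) Cancel(txnID, from string, amount int64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state[txnID] {
	case "confirmed":
		return errors.New("cannot cancel a confirmed transaction")
	case "tried":
		acc := b.accounts[from]
		acc.frozen -= amount
		acc.balance += amount
	}
	// Also reached when Try never ran or Cancel is repeated: record the
	// outcome without touching any balance.
	b.state[txnID] = "cancelled"
	return nil
}
```

Recording "cancelled" even when no Try was seen lets a late Try for the same transaction be rejected, so a delayed request cannot freeze money that nobody will ever release. Across all three operations the sum of balance and frozen over all accounts stays constant.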

5. The Seata Framework

5.1 AT Mode (Automatic Compensation)

yaml
# ======== Deploying Seata Server (the TC, transaction coordinator) ========
# docker-compose.yml
version: '3.8'
services:
  seata-server:
    image: seataio/seata-server:1.7.0
    container_name: seata-server
    ports:
      - "8091:8091"
      - "7091:7091"  # web console
    environment:
      - STORE_MODE=db
      - SEATA_CONFIG_NAME=file:/root/seata-config/registry
      - SPRING_DATASOURCE_DRIVER-CLASS-NAME=com.mysql.cj.jdbc.Driver
      - SPRING_DATASOURCE_URL=jdbc:mysql://mysql:3306/seata?useUnicode=true&rewriteBatchedStatements=true
      - SPRING_DATASOURCE_USER=seata
      - SPRING_DATASOURCE_PASSWORD=seata_password
    volumes:
      - ./seata-config/registry.conf:/root/seata-config/registry:ro
    depends_on:
      - mysql
    networks:
      - seata-network

  mysql:
    image: mysql:8.0
    ports:
      - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: root_password
      MYSQL_DATABASE: seata
    volumes:
      - mysql-data:/var/lib/mysql
    networks:
      - seata-network

networks:
  seata-network:
    driver: bridge

volumes:
  mysql-data:
conf
# ======== Seata registry configuration ========
# seata-config/registry.conf
registry {
  type = "nacos"
  nacos {
    application = "seata-server"
    serverAddr = "nacos:8848"
    namespace = ""
    group = "SEATA_GROUP"
    cluster = "default"
  }
}

config {
  type = "nacos"
  nacos {
    serverAddr = "nacos:8848"
    namespace = ""
    group = "SEATA_GROUP"
    dataId = "seataConfig"
  }
}

yaml
# ======== Application configuration ========
# application.yml
seata:
  enabled: true
  application-id: ${spring.application.name}
  tx-service-group: my-tx-group
  enable-auto-data-source-proxy: true
  config:
    type: nacos
    nacos:
      server-addr: ${NACOS_HOST:localhost}:${NACOS_PORT:8848}
      group: SEATA_GROUP
  registry:
    type: nacos
    nacos:
      server-addr: ${NACOS_HOST:localhost}:${NACOS_PORT:8848}
      group: SEATA_GROUP
  service:
    vgroup-mapping:
      my-tx-group: default
    enable-degrade: false
    disable-global-transaction: false
java
// ======== Using Seata AT mode from a Java application ========
// Spring Boot application

// 1. Dependencies
// implementation 'io.seata:seata-spring-boot-starter:1.7.0'
// implementation 'io.seata:seata-dubbo-alibaba:1.7.0'  // if you use Dubbo

// 2. Distributed transaction annotation
@Service
public class OrderService {

    @GlobalTransactional(name = "create-order", timeoutMills = 30000, rollbackFor = Exception.class)
    public Order createOrder(OrderDTO orderDTO) {
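        // Seata AT mode coordinates the database changes below as one global
        // transaction: each branch records an undo log and is rolled back from it on failure.
        // 1. Create the order
        Order order = orderMapper.create(orderDTO);

        // 2. Deduct stock (remote call)
        inventoryClient.deductStock(orderDTO.getProductId(), orderDTO.getQuantity());

        // 3. Deduct balance (remote call)
        accountClient.deductBalance(orderDTO.getUserId(), orderDTO.getTotalPrice());

        // 4. Send a message (not a database change, so a rollback cannot undo it)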
        messageClient.sendOrderCreated(order);

        return order;
    }
}

// 3. Global transaction rollback
@GlobalTransactional(name = "transfer", rollbackFor = Exception.class)
public void transfer(String fromAccount, String toAccount, BigDecimal amount) {
    accountClient.debit(fromAccount, amount);
    // Simulated failure
    throw new RuntimeException("Transfer failed");
    // Seata rolls back the debit on fromAccount automatically
}

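6. Checklist

□ Choosing a distributed transaction pattern
  □ Analyze the business scenario: strong vs. eventual consistency
  □ Judge the transaction duration: short transaction vs. long-running flow
  □ Assess the complexity of the compensation logic
  □ Decide: Saga / TCC / local message table

□ Saga pattern
  □ Saga Orchestrator implementation
  □ Forward execution functions
  □ Compensation functions (idempotent by design)
  □ Retry policy (exponential backoff)
  □ Alerting on compensation failure
  □ Manual intervention process
  □ Full order → stock → payment → notification flow

□ Local message table
  □ Outbox table design (status/retry_count/processed_at)
  □ Write business data and the outbox row in the same transaction
  □ Outbox Relay polling task (FOR UPDATE SKIP LOCKED)
  □ Retry on send failure
  □ Idempotent consumer (UNIQUE constraint on event_log)
  □ Cleanup of old messages

□ TCC pattern
  □ Try: reserve resources / freeze the amount
  □ Confirm: commit the reservation
  □ Cancel: release the frozen resources
  □ Idempotent Try/Confirm/Cancel
  □ Global transaction timeout control

□ Seata framework
  □ Seata Server (TC) deployment
  □ Registry/Config (Nacos/etcd)
  □ @GlobalTransactional annotation
  □ Choosing between AT mode and TCC mode
  □ Transaction groups and high availability

□ Production-grade reliability
  □ Idempotent design (every operation must be idempotent)
  □ Message reliability (at-least-once delivery + idempotent consumption)
  □ Timeout handling for the compensation chain
  □ Dead-letter queue handling
  □ Monitoring and alerting (Saga failure rate / funds frozen by TCC)
  □ Regular analysis of failed compensations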

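Summary

In one sentence: there is no silver bullet for distributed transactions. Saga fits long-running flows with compensation, TCC fits resource reservation, and the local message table fits reliable messaging; combine the three according to the scenario.

Comparison of distributed transaction patterns: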

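| Dimension | Saga | TCC | Local message table | 2PC |
|---|---|---|---|---|
| Consistency | Eventual | Eventual | Eventual | Strong |
| Best suited for | Long-running flows | Short transactions | Asynchronous decoupling | Not recommended |
| Resource locking | | Locked during Try | | Locked throughout |
| Compensation complexity | High (compensation logic must be written) | Medium (Cancel must be written) | Low (retrying is enough) | |
| Throughput | | | | |
| Implementation complexity | | | | |
| Failure recovery | Compensation chain | Automatic Cancel | Message retry | Manual handling |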

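Recommended next steps:

  • Deep integration of Seata with Spring Cloud / Dubbo (AT/TCC/Saga modes)
  • RocketMQ transactional messages in practice (half messages + check-back)
  • Visual monitoring of distributed transactions (SkyWalking / Pinpoint integration)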