Old-School CDC: Building a Data Pipeline from AWS Aurora MySQL to a Data Lake (Apache Iceberg) with AWS DMS

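Problem

We need to run CRUD operations on data in an AWS data lake, which is why Apache Iceberg is used to manage the tables.

Here we use AWS DMS to build the CDC pipeline and move the data from MySQL to S3, then use Lambda to trigger the job that applies the CDC changes to the Apache Iceberg tables queried through Athena.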

A picture is worth a thousand words

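Modify the MySQL parameters

  • binlog_format: ROW
  • binlog_row_image: FULL

With these settings the binlog records every row change together with the full row image, which is the most complete change log MySQL can produce.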
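
Before pointing DMS at the cluster, it can help to confirm that the two parameters actually took effect. The sketch below assumes the server variables have already been fetched (for example with SHOW VARIABLES LIKE 'binlog%') into a plain dict. The function name and the dict shape are illustrative and not part of any AWS or MySQL API.

```python
# Hypothetical helper: compare fetched MySQL variables with what CDC needs.
REQUIRED_BINLOG_SETTINGS = {
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
}

def find_binlog_problems(variables):
    """Return a list of readable problems; an empty list means the settings match."""
    problems = []
    for name, expected in REQUIRED_BINLOG_SETTINGS.items():
        actual = variables.get(name)
        if actual is None:
            problems.append(f"{name} is missing")
        elif str(actual).upper() != expected:
            problems.append(f"{name} is {actual}, expected {expected}")
    return problems

# Made-up values, shaped like the result of SHOW VARIABLES.
print(find_binlog_problems({"binlog_format": "MIXED", "binlog_row_image": "full"}))
```

The comparison is case-insensitive because MySQL may report the value in a different case than the one typed into the parameter group.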

Create the MySQL user

sql
# CDC privileges
CREATE USER 'dms_user'@'%' IDENTIFIED BY 'dms_User349';
GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'dms_user'@'%';

# Business database and tables
grant select on mydb.* to dms_user;

# For the MySQL-specific premigration assessments
grant select on mysql.user to dms_user;
grant select on mysql.db to dms_user;
grant select on mysql.tables_priv to dms_user;
grant select on mysql.role_edges to dms_user; #only for MySQL version 8.0.11 and higher
grant select on performance_schema.replication_connection_status to dms_user;  #Required for primary instance validation - MySQL version 5.7 and higher only

# For the premigration assessments specific to MySQL on RDS
grant select on mysql.rds_configuration to dms_user;  #Required for binary log retention check

# Grant this if the BatchEnable parameter needs to be true
grant create temporary tables on *.* to dms_user;

FLUSH PRIVILEGES;

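AWS Secrets Manager

My MySQL instance uses AWS Secrets Manager to store and rotate the username and password. I will not cover that in detail here; the focus is on how to create the AWS DMS migration task.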

AWS DMS

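Create the source endpoint

Alternatively, create it as follows:

Note that this MySQL instance only accepts SSL connections, so the Aurora MySQL certificate has to be downloaded manually and added to DMS. The certificates for AWS Aurora MySQL are listed on the following page: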

https://docs.amazonaws.cn/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.SSL.html#UsingWithRDS.SSL.RegionCertificates-BJS

Run the following command to verify the downloaded SSL certificate:

bash
keytool -printcert -v -file global-bundle.pem
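
If keytool is not at hand, a rough check of the bundle can also be done in Python. This is only a sketch: it counts the PEM blocks and confirms that each one decodes as base64, and it does not validate the certificates themselves. The helper name is illustrative.

```python
import base64
import re

PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)

def count_pem_certificates(pem_text):
    """Count certificate blocks whose body is valid base64."""
    count = 0
    for body in PEM_BLOCK.findall(pem_text):
        # Remove line breaks and spaces before decoding.
        compact = "".join(body.split())
        base64.b64decode(compact, validate=True)  # raises on corrupt data
        count += 1
    return count
```

For the file downloaded above, the call would be count_pem_certificates(open("global-bundle.pem").read()). A result of zero suggests the download is not a PEM file at all.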

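On the AWS DMS certificate management page, import the MySQL certificate above into DMS, as shown in the figure below:

Once the certificate is configured, create the source endpoint again: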

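Create the target endpoint

Create the S3 VPC endpoint

If you are not sure which route tables to associate here, or DMS later fails to write data to S3, you can simply select all of the route tables.

Create the Athena VPC endpoint

Create the security group for the Athena VPC endpoint

Both the inbound and outbound rules of this security group reference the DMS security group. With that in place, create the Athena VPC endpoint, as shown in the figure below:

Next, create the S3 target endpoint. The S3 bucket name used here should preferably start with aws-glue-, which makes it easier for Glue jobs to read from it later. See the figure below:

The endpoint settings are:

json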
{
  "DataFormat": "parquet",
  "ParquetVersion": "PARQUET_2_0",
  "CompressionType": "GZIP",
  "IncludeOpForFullLoad": true,
  "GlueCatalogGeneration": true,
  "TimestampColumnName": "last_updated_ts",
  "UseTaskStartTimeForFullLoadTimestamp": true,
  "CdcMaxBatchInterval": 3600,
  "CdcMinFileSize": 64000,
  "DatePartitionEnabled": true,
  "DatePartitionSequence": "YYYYMMDD",
  "DatePartitionDelimiter": "SLASH",
  "DatePartitionTimezone": "Asia/Shanghai"
}
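
Two of these settings are easy to misread. As far as I understand the DMS S3 endpoint settings, CdcMaxBatchInterval is given in seconds and CdcMinFileSize in kilobytes, and a CDC file is written as soon as either threshold is reached. The sketch below only models that rule with the values above so the units are explicit. should_flush is an illustrative name, not a DMS API.

```python
CDC_MAX_BATCH_INTERVAL_S = 3600   # CdcMaxBatchInterval: one hour
CDC_MIN_FILE_SIZE_KB = 64000      # CdcMinFileSize: roughly 64 MB

def should_flush(elapsed_seconds, buffered_kb):
    """True when either threshold is met, whichever comes first."""
    return (
        elapsed_seconds >= CDC_MAX_BATCH_INTERVAL_S
        or buffered_kb >= CDC_MIN_FILE_SIZE_KB
    )

print(should_flush(120, 500))     # small, recent batch
print(should_flush(3600, 500))    # time threshold reached
print(should_flush(120, 64000))   # size threshold reached
```

Under this reading, changes to a quiet table can take up to an hour to appear in S3, so a lower CdcMaxBatchInterval may be worth considering when the silver layer needs fresher data.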

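After both the source and target endpoints are created, they look like the figure below:

Creating the S3 target endpoint also creates an IAM role, but that role lacks the "s3:DeleteObject" permission, so we have to add it manually. The role also needs KMS permissions for decrypting S3 objects, plus the Glue and Athena permissions:

json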
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllObjectActions",
			"Effect": "Allow",
			"Action": [
				"s3:PutObject",
				"s3:GetObject",
				"s3:DeleteObject",
				"s3:GetObjectVersion"
			],
			"Resource": [
				"arn:aws-cn:s3:::aws-glue-dev-bronze/*"
			],
			"Condition": {
				"StringEquals": {
					"aws:ResourceAccount": "111112234434"
				}
			}
		},
		{
			"Sid": "ListBucketActions",
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket"
			],
			"Resource": "*",
			"Condition": {
				"StringEquals": {
					"aws:ResourceAccount": "111112234434"
				}
			}
		},
		{
			"Sid": "GetBucketActions",
			"Effect": "Allow",
			"Action": [
				"s3:GetBucketLocation",
				"s3:GetBucketVersioning"
			],
			"Resource": [
				"arn:aws-cn:s3:::aws-glue-dev-bronze"
			],
			"Condition": {
				"StringEquals": {
					"aws:ResourceAccount": "111112234434"
				}
			}
		},
		{
			"Sid": "EnableBucketHTTPSOnly",
			"Action": "s3:*",
			"Effect": "Deny",
			"Resource": [
				"arn:aws-cn:s3:::aws-glue-dev-bronze/*",
				"arn:aws-cn:s3:::aws-glue-dev-bronze"
			],
			"Condition": {
				"Bool": {
					"aws:SecureTransport": false
				}
			}
		},
		{
			"Sid": "AllowUseOfTheKey",
			"Effect": "Allow",
			"Action": [
				"kms:Encrypt",
				"kms:Decrypt",
				"kms:ReEncrypt*",
				"kms:GenerateDataKey*",
				"kms:DescribeKey"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"glue:CreateDatabase",
				"glue:GetDatabase",
				"glue:GetDatabases",
				"glue:CreateTable",
				"glue:DeleteTable",
				"glue:UpdateTable",
				"glue:GetTable",
				"glue:GetTables",
				"glue:BatchCreatePartition",
				"glue:CreatePartition",
				"glue:UpdatePartition",
				"glue:GetPartition",
				"glue:GetPartitions",
				"glue:BatchGetPartition"
			],
			"Resource": [
				"arn:aws-cn:glue:*:111112234434:catalog",
				"arn:aws-cn:glue:*:111112234434:database/*",
				"arn:aws-cn:glue:*:111112234434:table/*"
			]
		},
		{
			"Effect": "Allow",
			"Action": [
				"athena:StartQueryExecution",
				"athena:GetQueryExecution",
				"athena:CreateWorkGroup"
			],
			"Resource": "arn:aws-cn:athena:*:111112234434:workgroup/*"
		}
	]
}
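
Since the missing s3:DeleteObject is easy to overlook, a quick check of the policy document can catch it before the task runs. The sketch below works on a policy that has already been loaded as a dict (for example with json.load). It only matches action names literally and does not expand wildcards such as s3:*, and both function names are illustrative.

```python
def allowed_actions(policy):
    """Collect every action named in an Allow statement of an IAM policy dict."""
    actions = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        listed = statement.get("Action", [])
        # IAM allows Action to be a single string or a list of strings.
        if isinstance(listed, str):
            listed = [listed]
        actions.update(listed)
    return actions

def missing_actions(policy, required):
    """Return the required actions that no Allow statement names literally."""
    return sorted(set(required) - allowed_actions(policy))

# A cut-down policy for the demo; the real one is the document above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:PutObject", "s3:GetObject"], "Resource": "*"}
    ],
}
print(missing_actions(policy, ["s3:PutObject", "s3:DeleteObject"]))
```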

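Now we can create the CDC task.

Start CDC

Create a security group

Create a security group for the CDC task to use:

Then create the replication, as shown in the figure below:

Note: do not use the - character in the prefix here. AWS DMS creates a database in Athena from it, and a hyphen makes that fail, so use an underscore _ instead.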
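
If the prefix is derived from an existing name such as a bucket or environment name, a small helper can normalize it first. The sketch below is deliberately conservative: it lowercases the name and replaces every character other than a-z, 0-9 and _ with an underscore. to_glue_safe_name is an illustrative name, not an AWS API.

```python
import re

def to_glue_safe_name(name):
    """Lowercase a name and replace anything except a-z, 0-9 and _ with _."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

print(to_glue_safe_name("dev-bronze"))
```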

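Once it is created, start it manually, as shown in the figure below:

Before starting, let AWS run a premigration assessment first; if it reports no problems, start the CDC task for real, as shown in the figure below:

When the AWS DMS task looks like the figure below, the CDC task is running successfully:

At this point the CDC task is up and running. So far it has only moved the MySQL binlog data into the bronze layer of the data lake; the data still has to be cleaned further into the silver layer.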

Silver layer

Create the Silver S3 bucket

Create the Iceberg database

sql
CREATE DATABASE dev_silver_db;

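This creates the silver-layer database directly in Glue.

Create the Iceberg table

Write the DDL below according to the schema of the source table you need to process:

sql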
CREATE TABLE my_iceberg (
   last_updated_ts TIMESTAMP,
   id BIGINT,
   name STRING
)
LOCATION 's3://aws-glue-dev-silver/dbname/my_iceberg/'
TBLPROPERTIES ('table_type'='ICEBERG', 'format'='parquet')

AWS Glue merge job

Create the AWSGlueServiceRoleDev role

The role name must start with AWSGlueServiceRole; otherwise the AWS Glue job cannot select it. Create the Glue job as shown in the figure below:

The job parameters are:

bash
--datalake-formats iceberg
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog.warehouse=s3://aws-glue-dev-silver/dbname/my_iceberg/
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true
--conf spark.sql.iceberg.handle-timestamp-without-timezone=true
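
One detail that is easy to trip over: as far as I know, a Glue job takes a single --conf job parameter, so several Spark settings are usually chained inside one value, with each later setting prefixed by --conf. The sketch below builds that value from a dict. build_glue_arguments is an illustrative helper, and the demo uses three of the settings listed above for brevity.

```python
def build_glue_arguments(spark_conf):
    """Return Glue job parameters with all Spark settings chained in one --conf."""
    pairs = [f"{key}={value}" for key, value in spark_conf.items()]
    return {
        "--datalake-formats": "iceberg",
        # The first pair is the parameter value itself; every further pair
        # is appended with an explicit " --conf " separator.
        "--conf": " --conf ".join(pairs),
    }

spark_conf = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
}
for key, value in build_glue_arguments(spark_conf).items():
    print(key, value)
```

The resulting dict has the shape of the job parameters entered in the console, so it could also serve as the DefaultArguments of a job created through the API.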

The job source code is:

python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import col, row_number
from pyspark.sql.types import TimestampType
from pyspark.sql.window import Window

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ============================================
# Step 1: read the incremental data from the Bronze table
# ============================================
# Note: transformation_ctx is essential to Glue job bookmarks; keep it unique and unchanged
myIcebergIncrementalInputDF = glueContext.create_dynamic_frame.from_catalog(
    database="dev_bronze",                           # change to your Glue database name
    table_name="my_iceberg",                   # change to your Bronze table name
    transformation_ctx="myIcebergIncrementalInputDF"
).toDF()

# ============================================
# Step 2: deduplicate - per id, keep the record with the latest last_updated_ts
# ============================================
windowSpec = (
    Window
    .partitionBy(col("id"))
    .orderBy(col("last_updated_ts").desc())
)

rankedDF = myIcebergIncrementalInputDF.withColumn(
    "row_num", row_number().over(windowSpec)
)

deduplicatedDF = rankedDF.filter(col("row_num") == 1).drop("row_num")

# ============================================
# Step 3: clean - data type conversion, field splitting, and so on
# ============================================
deduplicatedDF = deduplicatedDF.withColumn(
    "last_updated_ts", col("last_updated_ts").cast(TimestampType())
)

# Register a temporary view for the MERGE statement to use
deduplicatedDF.createOrReplaceTempView("deduplicated_view")

# ============================================
# Step 4: MERGE into the Silver-layer Iceberg table
# ============================================
# The INSERT branch skips rows whose latest change is a delete, so that a
# row deleted before it ever reached Silver is not inserted.
merge_sql = """
MERGE INTO glue_catalog.dev_silver_db.my_iceberg AS target
USING deduplicated_view AS source
ON target.id = source.id

WHEN MATCHED AND source.op = 'U' THEN
    UPDATE SET
        target.last_updated_ts = source.last_updated_ts,
        target.name = source.name

WHEN MATCHED AND source.op = 'D' THEN
    DELETE

WHEN NOT MATCHED AND source.op <> 'D' THEN
    INSERT (
        last_updated_ts, id, name
    ) VALUES (
        source.last_updated_ts, source.id, source.name
    )
"""

spark.sql(merge_sql)

job.commit()
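
To make the merge semantics concrete, here is a small pure-Python model of steps two and four: keep the newest change per id, then apply it to a target keyed by id. It is only an illustration with made-up rows and runs without Spark. It assumes the op values I, U and D, and it assumes that a delete for an id that is not in the target should simply be ignored.

```python
def latest_per_id(changes):
    """Keep, for each id, the change with the greatest last_updated_ts."""
    latest = {}
    for change in changes:
        current = latest.get(change["id"])
        # ISO-style timestamp strings compare correctly as plain strings.
        if current is None or change["last_updated_ts"] > current["last_updated_ts"]:
            latest[change["id"]] = change
    return latest

def apply_changes(target, changes):
    """Apply deduplicated changes to a dict keyed by id and return a new dict."""
    result = dict(target)
    for row_id, change in latest_per_id(changes).items():
        row = {"last_updated_ts": change["last_updated_ts"], "name": change["name"]}
        matched = row_id in result
        if matched and change["op"] == "D":
            del result[row_id]            # WHEN MATCHED AND op = 'D'
        elif matched and change["op"] == "U":
            result[row_id] = row          # WHEN MATCHED AND op = 'U'
        elif not matched and change["op"] != "D":
            result[row_id] = row          # WHEN NOT MATCHED, deletes skipped
    return result

target = {1: {"last_updated_ts": "2024-01-01 00:00:00", "name": "old"}}
changes = [
    {"op": "U", "id": 1, "last_updated_ts": "2024-01-02 00:00:00", "name": "new"},
    {"op": "I", "id": 2, "last_updated_ts": "2024-01-02 00:00:00", "name": "temp"},
    {"op": "D", "id": 2, "last_updated_ts": "2024-01-03 00:00:00", "name": "temp"},
]
print(apply_changes(target, changes))
```

In this demo id 2 is inserted and deleted within the same batch, so after deduplication only its delete remains and the row never reaches the target.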

Lambda trigger function

Create an AWS Lambda function with the following source code:

python
import json
import boto3

def lambda_handler(event, context):
    glue_client = boto3.client('glue')
    glue_job_name = 'dev-b2s-xxxx'

    try:
        # Start the Glue job
        response = glue_client.start_job_run(JobName=glue_job_name)
        print(f"Glue job started: {response['JobRunId']}")
    except Exception as e:
        print(f"Error starting Glue job: {str(e)}")
        raise e
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

Note that the Lambda function's role needs permission to start the Glue job, along these lines:

json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun"
            ],
            "Resource": "arn:aws-cn:glue:cn-north-1:111112234434:job/*"
        }
    ]
}

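Then go back to the bronze-layer S3 bucket and create an event notification for object creation that invokes this Lambda function, as shown in the figure below: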
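
The Lambda function shown earlier starts the Glue job on every invocation. If the notification also fires for objects that are not CDC data, the handler could look at the event first. The sketch below pulls the object keys out of an S3 event notification and keeps only Parquet files. The event layout (Records, s3, object, key, with URL-encoded keys) is the usual S3 notification shape, while parquet_keys and the sample keys are made up for illustration.

```python
from urllib.parse import unquote_plus

def parquet_keys(event):
    """Return the decoded keys of all .parquet objects named in an S3 event."""
    keys = []
    for record in event.get("Records", []):
        raw_key = record.get("s3", {}).get("object", {}).get("key", "")
        key = unquote_plus(raw_key)  # keys arrive URL-encoded in notifications
        if key.endswith(".parquet"):
            keys.append(key)
    return keys

# Hypothetical event with one data file and one unrelated object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "aws-glue-dev-bronze"},
                "object": {"key": "mydb/my_iceberg/2024/01/15/part+1.parquet"}}},
        {"s3": {"bucket": {"name": "aws-glue-dev-bronze"},
                "object": {"key": "mydb/my_iceberg/_tmp/marker.txt"}}},
    ]
}
print(parquet_keys(sample_event))
```

A handler could return early when this list is empty and only call start_job_run otherwise.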

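Summary

If you have not worked with big data before, keep these two concepts apart:

  • Iceberg is a table format
  • Parquet is a file format

That is how data is moved from MySQL into a data lake on AWS: AWS DMS + AWS Lambda + AWS Glue. The key piece is the Iceberg table, which is what makes the incremental updates possible.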

References