ETL in the AWS China Cloud: Moving Data from Aurora to S3 (Glue Edition, Fixed)

Problem

ETL in the AWS China Cloud: Moving Data from Aurora to S3 (Glue Edition)

The approach in that post triggered a full scan of the MySQL table once the data volume grew large.

Approach

Use JDBC query pushdown so the filtering runs inside MySQL, avoiding a full-table scan during extraction.
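
For intuition, the same pushdown outside Glue looks like this in plain Spark: the JDBC reader's query option hands the statement to MySQL, so only matching rows ever leave the database. This is a minimal sketch; the endpoint and credentials in angle brackets are placeholders, not values from this job.

python
# Minimal JDBC pushdown sketch in plain Spark; angle brackets are placeholders
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/<db>")
      .option("user", "<user>")
      .option("password", "<password>")
      # MySQL executes this query itself; Spark never pulls the full table
      .option("query", "select * from xxxx where date = CURDATE()")
      .load())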

Solution

Convert the visual ETL job into a script. The full code is as follows:

python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsgluedq.transforms import EvaluateDataQuality
from awsglue import DynamicFrame

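# Helper: register each incoming DynamicFrame as a Spark temp view, run the
# SQL query against those views, and wrap the result back into a DynamicFrame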
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Default ruleset used by all target nodes with data quality enabled
DEFAULT_DATA_QUALITY_RULESET = """
    Rules = [
        ColumnCount > 0
    ]
"""

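# With enablePartitioningForSampleQuery enabled below, Glue appends its own
# predicate on the hashfield to this query, which is why the WHERE clause
# ends in a dangling "and"; rstrip('\n') keeps that "and" on the final line
# so the appended predicate forms a single valid statement.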
sampleQuery = '''
select track_id,
    distinct_id,
    lib,
    event,
    type,
    all_json,
    host,
    user_agent,
    ua_platform,
    ua_browser,
    ua_version,
    ua_language,
    connection,
    pragma,
    cache_control,
    accept,
    accept_encoding,
    accept_language,
    ip,
    ip_city,
    ip_asn,
    url,
    referrer,
    remark,
    created_at,
    date,
    hour
from xxxx
where date = CURDATE() and
'''.rstrip('\n')

# Script generated for node prod-mysql
prodmysql_node202512354346 = glueContext.create_dynamic_frame.from_options(
    connection_type = "mysql",
    connection_options = {
        "useConnectionProperties": "true",
        "dbtable": "xxxx",
        "connectionName": "prod Aurora connection",
        "sampleQuery": sampleQuery,
        # Enable JDBC pushdown for sampleQuery
        "enablePartitioningForSampleQuery": True,
        # Partition the read on the hour column
        "hashfield": "hour",
        "hashpartitions": "24"
    },
    transformation_ctx = "prodmysql_node202512354346"
)
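# Net effect: Glue fans the extract out into 24 parallel JDBC queries, each
# sampleQuery plus an appended hour-slice predicate, so no single query has
# to scan the whole table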

# Script generated for node SQL Query xxxx
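# Same projection as sampleQuery, plus derived year/month/day columns used as
# the S3 partition keys below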
SqlQuery68 = '''
select track_id,
    distinct_id,
    lib,
    event,
    type,
    all_json,
    host,
    user_agent,
    ua_platform,
    ua_browser,
    ua_version,
    ua_language,
    connection,
    pragma,
    cache_control,
    accept,
    accept_encoding,
    accept_language,
    ip,
    ip_city,
    ip_asn,
    url,
    referrer,
    remark,
    created_at,
    date,
    hour,
    YEAR(date) AS year,
    MONTH(date) AS month,
    DAY(date) AS day
from xxxx
-- this query runs in Spark SQL over the temp view, not on MySQL,
-- so use Spark's current_date() rather than MySQL's CURDATE()
where date = current_date()
'''
SQLQueryxxxx_node20251236978987 = sparkSqlQuery(
    glueContext,
    query = SqlQuery68,
    mapping = {"xxxx": prodmysql_node202512354346},
    transformation_ctx = "SQLQueryxxxx_node20251236978987"
)

# Script generated for node Amazon S3 xxxx
EvaluateDataQuality().process_rows(
    frame=SQLQueryxxxx_node20251236978987,
    ruleset=DEFAULT_DATA_QUALITY_RULESET,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node1758699684078",
        "enableDataQualityResultsPublishing": True
    },
    additional_options={
        "dataQualityResultsPublishing.strategy": "BEST_EFFORT",
        "observations.scope": "ALL"
    }
)
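# Alternative sink, left commented out: write Parquet straight to S3 without
# registering the partitions in the Glue Data Catalog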
# AmazonS3xxxx_node1758699703229 = glueContext.write_dynamic_frame.from_options(frame=SQLQueryxxxx_node20251236978987, connection_type="s3", format="glueparquet", connection_options={"path": "s3://aws-glue-prod-xxxx", "partitionKeys": ["year", "month", "day"]}, format_options={"compression": "snappy"}, transformation_ctx="AmazonS3xxxx_node1758699703229")

# Write through the Glue Data Catalog so new year/month/day partitions are
# registered automatically; this requires the table property
# useGlueParquetWriter=true (see the note after the script)
additionalOptions = {
    "enableUpdateCatalog": True,
    "partitionKeys": ["year", "month", "day"]
}

write_sink = glueContext.write_dynamic_frame_from_catalog(
    frame=SQLQueryxxxx_node20251236978987, 
    database="prod", 
    table_name="aws_glue_prod_xxxx", 
    transformation_ctx="write_sink",
    additional_options=additionalOptions
)

job.commit()

Note: before running the job, remember to set the table property useGlueParquetWriter to true on the target catalog table.
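
If you prefer to set that property from code rather than in the console, a minimal boto3 sketch could look like the following. The region, database, and table names are the placeholders used earlier in this post; adapt them to your environment.

python
import boto3

# Sketch: flip useGlueParquetWriter on the catalog table via the Glue API
glue = boto3.client("glue", region_name="cn-north-1")  # China (Beijing) region

table = glue.get_table(DatabaseName="prod", Name="aws_glue_prod_xxxx")["Table"]
params = table.get("Parameters", {})
params["useGlueParquetWriter"] = "true"

# update_table only accepts TableInput fields, so copy across the relevant ones
table_input = {k: v for k, v in table.items()
               if k in ("Name", "Description", "Retention", "StorageDescriptor",
                        "PartitionKeys", "TableType", "Parameters")}
table_input["Parameters"] = params
glue.update_table(DatabaseName="prod", TableInput=table_input)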
