Transform Spark

rm -r dp-203 -f

git clone https://github.com/MicrosoftLearning/dp-203-azure-data-engineer dp-203

cd dp-203/Allfiles/labs/06

./setup.ps1

https://github.com/MicrosoftLearning/dp-203-azure-data-engineer/tree/master/Allfiles/labs/06/notebooks

python 复制代码
order_details = spark.read.csv('/data/*.csv', header=True, inferSchema=True)
display(order_details.limit(5))
python 复制代码
from pyspark.sql.functions import split, col

# Create the new FirstName and LastName fields
transformed_df = order_details.withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)).withColumn("LastName", split(col("CustomerName"), " ").getItem(1))

# Remove the CustomerName field
transformed_df = transformed_df.drop("CustomerName")

display(transformed_df.limit(5))
python 复制代码
transformed_df.write.mode("overwrite").parquet('/transformed_data/orders.parquet')
print ("Transformed data saved!")
python 复制代码
from pyspark.sql.functions import year, month, col

dated_df = transformed_df.withColumn("Year", year(col("OrderDate"))).withColumn("Month", month(col("OrderDate")))
display(dated_df.limit(5))
dated_df.write.partitionBy("Year","Month").mode("overwrite").parquet("/partitioned_data")
print ("Transformed data saved!")
python 复制代码
orders_2020 = spark.read.parquet('/partitioned_data/Year=2020/Month=*')
display(orders_2020.limit(5))
python 复制代码
order_details.write.saveAsTable('sales_orders', format='parquet', mode='overwrite', path='/sales_orders_table')
python 复制代码
sql_transform = spark.sql("SELECT *, YEAR(OrderDate) AS Year, MONTH(OrderDate) AS Month FROM sales_orders")
display(sql_transform.limit(5))
sql_transform.write.partitionBy("Year","Month").saveAsTable('transformed_orders', format='parquet', mode='overwrite', path='/transformed_orders_table')
sql 复制代码
%%sql

SELECT * FROM transformed_orders
WHERE Year = 2021
    AND Month = 1
sql 复制代码
%%sql

DROP TABLE transformed_orders;
DROP TABLE sales_orders;
相关推荐
Bode_20021 小时前
基于大数据分析的全生命周期质量追溯质量评估体系落地方案
大数据·人工智能
serve the people1 小时前
Elasticsearch(1) could you tell me how to use es if i am a beginner
大数据·elasticsearch·jenkins
一个儒雅随和的男子2 小时前
Elasticsearch出现深度分页问题怎么解决?
大数据·elasticsearch·搜索引擎
AI智图坊3 小时前
多件装组合SKU图的批量生产效率分析:从PS手工到AI自动化的工作流改造
大数据·运维·人工智能·gpt·ai作画·自动化·aigc
jerryinwuhan4 小时前
面向产业带与中小企业数字化转型的电商运营人才培养模式
大数据·人工智能
giaz14n9X5 小时前
Redis 分布式锁进阶第六十三篇
分布式
Fnetlink16 小时前
企业SDWAN供应商
大数据
galaxylove6 小时前
Gartner发布创新洞察:AI SOC智能体加速通信运营商安全运营转型
大数据·人工智能·安全
甩手网软件6 小时前
Shopee2026新规:费率重构与履约收紧下,卖家如何破局?
大数据·人工智能
lizhihai_996 小时前
股市学习心得-AI 产业链核心标的梳理清单
大数据·服务器·人工智能·科技·学习