Spark Delta Lake

rm -r dp-203 -f

git clone https://github.com/MicrosoftLearning/dp-203-azure-data-engineer dp-203

cd dp-203/Allfiles/labs/07

./setup.ps1

python 复制代码
%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/products/products.csv', format='csv'
## If header exists uncomment line below
##, header=True
)
display(df.limit(10))
python 复制代码
%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/products/products.csv', format='csv'
## If header exists uncomment line below
, header=True
)
display(df.limit(10))
python 复制代码
 delta_table_path = "/delta/products-delta"
 df.write.format("delta").save(delta_table_path)
  1. On the files tab, use the icon in the toolbar to return to the root of the files container, and note that a new folder named delta has been created. Open this folder and the products-delta table it contains, where you should see the parquet format file(s) containing the data.
python 复制代码
from delta.tables import *
from pyspark.sql.functions import *

 # Create a deltaTable object
deltaTable = DeltaTable.forPath(spark, delta_table_path)

 # Update the table (reduce price of product 771 by 10%)
deltaTable.update(
     condition = "ProductID == 771",
     set = { "ListPrice": "ListPrice * 0.9" })

 # View the updated data as a dataframe
deltaTable.toDF().show(10)
python 复制代码
 new_df = spark.read.format("delta").load(delta_table_path)
 new_df.show(10)
python 复制代码
 new_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
 new_df.show(10)
python 复制代码
deltaTable.history(10).show(20, False, True)
python 复制代码
 spark.sql("CREATE DATABASE AdventureWorks")
 spark.sql("CREATE TABLE AdventureWorks.ProductsExternal USING DELTA LOCATION '{0}'".format(delta_table_path))
 spark.sql("DESCRIBE EXTENDED AdventureWorks.ProductsExternal").show(truncate=False)

This code creates a new database named AdventureWorks and then creates an external tabled named ProductsExternal in that database based on the path to the parquet files you defined previously. It then displays a description of the table's properties. Note that the Location property is the path you specified.

sql 复制代码
%%sql

 USE AdventureWorks;

 SELECT * FROM ProductsExternal;
python 复制代码
 df.write.format("delta").saveAsTable("AdventureWorks.ProductsManaged")
 spark.sql("DESCRIBE EXTENDED AdventureWorks.ProductsManaged").show(truncate=False)

This code creates a managed tabled named ProductsManaged based on the DataFrame you originally loaded from the products.csv file (before you updated the price of product 771). You do not specify a path for the parquet files used by the table - this is managed for you in the Hive metastore, and shown in the Location property in the table description (in the files/synapse/workspaces/synapsexxxxxxx/warehouse path).

sql 复制代码
%%sql

 USE AdventureWorks;

 SELECT * FROM ProductsManaged;
sql 复制代码
%%sql

 USE AdventureWorks;

 SHOW TABLES;
sql 复制代码
%%sql

 USE AdventureWorks;

 DROP TABLE IF EXISTS ProductsExternal;
 DROP TABLE IF EXISTS ProductsManaged;
  1. Return to the files tab and view the files/delta/products-delta folder. Note that the data files still exist in this location. Dropping the external table has removed the table from the metastore, but left the data files intact.
  2. View the files/synapse/workspaces/synapsexxxxxxx/warehouse folder, and note that there is no folder for the ProductsManaged table data. Dropping a managed table removes the table from the metastore and also deletes the table's data files.
sql 复制代码
%%sql

 USE AdventureWorks;

 CREATE TABLE Products
 USING DELTA
 LOCATION '/delta/products-delta';
sql 复制代码
%%sql

 USE AdventureWorks;

 SELECT * FROM Products;
python 复制代码
 from notebookutils import mssparkutils
 from pyspark.sql.types import *
 from pyspark.sql.functions import *

 # Create a folder
 inputPath = '/data/'
 mssparkutils.fs.mkdirs(inputPath)

 # Create a stream that reads data from the folder, using a JSON schema
 jsonSchema = StructType([
 StructField("device", StringType(), False),
 StructField("status", StringType(), False)
 ])
 iotstream = spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger", 1).json(inputPath)

 # Write some event data to the folder
 device_data = '''{"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"ok"}
 {"device":"Dev2","status":"error"}
 {"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"error"}
 {"device":"Dev2","status":"ok"}
 {"device":"Dev2","status":"error"}
 {"device":"Dev1","status":"ok"}'''
 mssparkutils.fs.put(inputPath + "data.txt", device_data, True)
 print("Source stream created...")

Ensure the message Source stream created... is printed. The code you just ran has created a streaming data source based on a folder to which some data has been saved, representing readings from hypothetical IoT devices.

python 复制代码
 # Write the stream to a delta table
 delta_stream_table_path = '/delta/iotdevicedata'
 checkpointpath = '/delta/checkpoint'
 deltastream = iotstream.writeStream.format("delta").option("checkpointLocation", checkpointpath).start(delta_stream_table_path)
 print("Streaming to delta sink...")
python 复制代码
 # Read the data in delta format into a dataframe
 df = spark.read.format("delta").load(delta_stream_table_path)
 display(df)
python 复制代码
 # create a catalog table based on the streaming sink
 spark.sql("CREATE TABLE IotDeviceData USING DELTA LOCATION '{0}'".format(delta_stream_table_path))
sql 复制代码
 %%sql

 SELECT * FROM IotDeviceData;
python 复制代码
 # Add more data to the source stream
 more_data = '''{"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"ok"}
 {"device":"Dev1","status":"error"}
 {"device":"Dev2","status":"error"}
 {"device":"Dev1","status":"ok"}'''

 mssparkutils.fs.put(inputPath + "more-data.txt", more_data, True)
sql 复制代码
%%sql

 SELECT * FROM IotDeviceData;
python 复制代码
 deltastream.stop()
sql 复制代码
 -- This is auto-generated code
 SELECT
     TOP 100 *
 FROM
     OPENROWSET(
         BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/delta/products-delta/',
         FORMAT = 'DELTA'
     ) AS [result]
sql 复制代码
 USE AdventureWorks;

 SELECT * FROM Products;

Run the code and observe that you can also use the serverless SQL pool to query Delta Lake data in catalog tables that are defined the Spark metastore.

相关推荐
科研前沿12 小时前
镜像视界 CameraGraph™+多智能体:构建自感知自决策的全域空间认知网络技术方案
大数据·运维·人工智能·数码相机·计算机视觉
发哥来了12 小时前
AI视频生成模型选型指南:五大核心维度对比评测
大数据·人工智能·机器学习·ai·aigc
发哥来了12 小时前
AI驱动生产线的实际落地:一个东莞厂商的技术选型实录
大数据·人工智能·机器学习·ai·aigc
历程里程碑13 小时前
4 Git远程协作:从零开始,玩转仓库关联与代码同步(带实操代码讲解)
大数据·c++·git·elasticsearch·搜索引擎·gitee·github
AI周红伟14 小时前
周红伟:运营商一季度净利集体下滑 Token运营提速
大数据·网络·人工智能
无忧智库14 小时前
研发管理的下一个十年:当多Agent协同遇上知识图谱,传统项目管理体系正在被颠覆(WORD)
大数据·人工智能·知识图谱
汽车仪器仪表相关领域16 小时前
Kvaser Memorator Professional 5xHS CB:五通道CAN FD裸板记录仪,赋能多总线系统集成测试的旗舰级核心装备
大数据·网络·人工智能·单元测试·汽车·集成测试
gQ85v10Db17 小时前
Redis分布式锁进阶第十七篇:微服务分布式锁全局治理 + 跨团队统一规范落地 + 全链路稳定性提升方案
redis·分布式·微服务
头条快讯17 小时前
中国非遗美食文化的跨国传承:鲁味居在北美市场的标准化实践与布局
大数据·人工智能
我是发哥哈18 小时前
深度评测:五款主流AI培训平台的课程交付能力对比
大数据·人工智能·学习·机器学习·ai·chatgpt