```powershell
rm -r dp-203 -f
git clone https://github.com/MicrosoftLearning/dp-203-azure-data-engineer dp-203
cd dp-203/Allfiles/labs/07
./setup.ps1
```
```python
%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/products/products.csv', format='csv'
## If header exists uncomment line below
##, header=True
)
display(df.limit(10))
```

```python
%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/products/products.csv', format='csv'
## If header exists uncomment line below
, header=True
)
display(df.limit(10))
```
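If you know the file's structure in advance, you can also supply an explicit schema instead of relying on the header row. The column names and types below are assumptions for illustration (they match the product attributes used later in this exercise); adjust them to your actual file:

```python
%%pyspark
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Hypothetical schema - column names and types are assumptions, adjust to your file
product_schema = StructType([
    StructField("ProductID", IntegerType(), False),
    StructField("ProductName", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("ListPrice", DoubleType(), True)
])

df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/products/products.csv',
                     format='csv', schema=product_schema, header=True)
display(df.limit(10))
```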

```python
# Save the dataframe as a delta table
delta_table_path = "/delta/products-delta"
df.write.format("delta").save(delta_table_path)
```
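Note that if you re-run the cell above, the save fails because the target path already contains a table. One way to handle this (a sketch, not a lab step) is to specify an overwrite mode:

```python
# Overwrite any existing delta table data at the target path
df.write.format("delta").mode("overwrite").save(delta_table_path)
```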

- On the files tab, use the ↑ icon in the toolbar to return to the root of the files container, and note that a new folder named delta has been created. Open this folder and the products-delta table it contains, where you should see the parquet format file(s) containing the data.
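
If you prefer to verify this from the notebook rather than the files tab, here's a minimal sketch using mssparkutils (available in Synapse notebooks):

```python
from notebookutils import mssparkutils

# Expect one or more .parquet data files plus a _delta_log folder
# containing the table's transaction log
for f in mssparkutils.fs.ls('/delta/products-delta'):
    print(f.name, f.size)
```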

```python
from delta.tables import *
from pyspark.sql.functions import *

# Create a DeltaTable object
deltaTable = DeltaTable.forPath(spark, delta_table_path)

# Update the table (reduce price of product 771 by 10%)
deltaTable.update(
    condition = "ProductID == 771",
    set = { "ListPrice": "ListPrice * 0.9" })

# View the updated data as a dataframe
deltaTable.toDF().show(10)
```

```python
# Read the delta table data into a new dataframe
new_df = spark.read.format("delta").load(delta_table_path)
new_df.show(10)
```

```python
# Use time travel to read version 0 of the table (the original, pre-update data)
new_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
new_df.show(10)
```
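Delta Lake also supports time travel by timestamp rather than version number. A sketch (the timestamp shown is hypothetical; pick one from the table's history, which you'll view in the next cell):

```python
# Read the table as it existed at a point in time (timestamp is illustrative)
historical_df = spark.read.format("delta").option("timestampAsOf", "2024-01-01 00:00:00").load(delta_table_path)
historical_df.show(10)
```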

```python
# Show the last 10 entries in the table's transaction history
# (up to 20 rows, untruncated, in vertical layout)
deltaTable.history(10).show(20, False, True)
```

```python
spark.sql("CREATE DATABASE AdventureWorks")
spark.sql("CREATE TABLE AdventureWorks.ProductsExternal USING DELTA LOCATION '{0}'".format(delta_table_path))
spark.sql("DESCRIBE EXTENDED AdventureWorks.ProductsExternal").show(truncate=False)
```

This code creates a new database named AdventureWorks and then creates an external table named ProductsExternal in that database, based on the path to the parquet files you defined previously. It then displays a description of the table's properties. Note that the Location property is the path you specified.
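If you want just the Location value rather than the full description, one way (a sketch) is to filter the DESCRIBE EXTENDED output:

```python
desc = spark.sql("DESCRIBE EXTENDED AdventureWorks.ProductsExternal")
# DESCRIBE EXTENDED returns col_name/data_type rows; keep only the Location row
desc.filter(desc.col_name == "Location").show(truncate=False)
```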

```sql
%%sql
USE AdventureWorks;
SELECT * FROM ProductsExternal;
```

```python
df.write.format("delta").saveAsTable("AdventureWorks.ProductsManaged")
spark.sql("DESCRIBE EXTENDED AdventureWorks.ProductsManaged").show(truncate=False)
```

This code creates a managed table named ProductsManaged, based on the DataFrame you originally loaded from the products.csv file (before you updated the price of product 771). You do not specify a path for the parquet files used by the table - this is managed for you in the Hive metastore and shown in the Location property of the table description (in the files/synapse/workspaces/synapsexxxxxxx/warehouse path).
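You can also confirm the distinction between the two tables programmatically; spark.catalog.listTables reports a tableType of EXTERNAL or MANAGED for each (a sketch):

```python
# List the tables in the AdventureWorks database with their types
for t in spark.catalog.listTables("AdventureWorks"):
    print(t.name, t.tableType)
```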
```sql
%%sql
USE AdventureWorks;
SELECT * FROM ProductsManaged;
```

```sql
%%sql
USE AdventureWorks;
SHOW TABLES;
```

```sql
%%sql
USE AdventureWorks;
DROP TABLE IF EXISTS ProductsExternal;
DROP TABLE IF EXISTS ProductsManaged;
```

- Return to the files tab and view the files/delta/products-delta folder. Note that the data files still exist in this location. Dropping the external table has removed the table from the metastore, but left the data files intact.
- View the files/synapse/workspaces/synapsexxxxxxx/warehouse folder, and note that there is no folder for the ProductsManaged table data. Dropping a managed table removes the table from the metastore and also deletes the table's data files.
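
Both observations can also be verified from code; a minimal sketch (the warehouse path is illustrative - substitute your workspace name):

```python
from notebookutils import mssparkutils

# Files for the dropped EXTERNAL table should still be present
print([f.name for f in mssparkutils.fs.ls('/delta/products-delta')])

# The dropped MANAGED table's folder should no longer appear here
# (the listing may be empty, or fail if the warehouse folder itself was removed)
print([f.name for f in mssparkutils.fs.ls('/synapse/workspaces/synapsexxxxxxx/warehouse')])
```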

```sql
%%sql
USE AdventureWorks;
CREATE TABLE Products
USING DELTA
LOCATION '/delta/products-delta';
```

```sql
%%sql
USE AdventureWorks;
SELECT * FROM Products;
```

```python
from notebookutils import mssparkutils
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Create a folder
inputPath = '/data/'
mssparkutils.fs.mkdirs(inputPath)

# Create a stream that reads data from the folder, using a JSON schema
jsonSchema = StructType([
    StructField("device", StringType(), False),
    StructField("status", StringType(), False)
])
iotstream = spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger", 1).json(inputPath)

# Write some event data to the folder
device_data = '''{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev2","status":"error"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"error"}
{"device":"Dev2","status":"ok"}
{"device":"Dev2","status":"error"}
{"device":"Dev1","status":"ok"}'''
mssparkutils.fs.put(inputPath + "data.txt", device_data, True)
print("Source stream created...")
```

Ensure the message Source stream created... is printed. The code you just ran has created a streaming data source based on a folder to which some data has been saved, representing readings from hypothetical IoT devices.
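
Before wiring up the sink, you can sanity-check that Spark treats this DataFrame as a stream (a sketch):

```python
# A streaming DataFrame reports isStreaming = True
print(iotstream.isStreaming)
iotstream.printSchema()
```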
```python
# Write the stream to a delta table
delta_stream_table_path = '/delta/iotdevicedata'
checkpointpath = '/delta/checkpoint'
deltastream = iotstream.writeStream.format("delta").option("checkpointLocation", checkpointpath).start(delta_stream_table_path)
print("Streaming to delta sink...")
```

```python
# Read the data in delta format into a dataframe
df = spark.read.format("delta").load(delta_stream_table_path)
display(df)
```
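Because the sink is just a delta table, you can run ordinary batch queries over it, for example a quick aggregation (a sketch):

```python
# Count readings per device and status
df.groupBy("device", "status").count().show()
```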

```python
# Create a catalog table based on the streaming sink
spark.sql("CREATE TABLE IotDeviceData USING DELTA LOCATION '{0}'".format(delta_stream_table_path))
```

```sql
%%sql
SELECT * FROM IotDeviceData;
```

```python
# Add more data to the source stream
more_data = '''{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"ok"}
{"device":"Dev1","status":"error"}
{"device":"Dev2","status":"error"}
{"device":"Dev1","status":"ok"}'''
mssparkutils.fs.put(inputPath + "more-data.txt", more_data, True)
```

```sql
%%sql
SELECT * FROM IotDeviceData;
```

```python
# Stop the streaming query
deltastream.stop()
```
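If you have several queries running, or lose the reference to the query object, you can stop every active stream in the session instead (a sketch):

```python
# Stop all active streaming queries in this Spark session
for query in spark.streams.active:
    query.stop()
```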


```sql
-- This is auto-generated code
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/delta/products-delta/',
        FORMAT = 'DELTA'
    ) AS [result]
```


```sql
USE AdventureWorks;
SELECT * FROM Products;
```

Run the code and observe that you can also use the serverless SQL pool to query Delta Lake data in catalog tables that are defined in the Spark metastore.