## Introduction
There are few tools on the market for configuring PySpark, and since Spark 3.4.0 introduced the server-client mode there has been no good solution for that either. So I open-sourced this simple module, which supports the following features:

- Configure Spark via environment variables, see config spark (a minimal sketch follows this list)
- `%SQL` and `%%SQL` magic for executing Spark SQL in IPython/Jupyter
  - SQL statements can be written across multiple lines, with support for separating statements with `;`
  - Supports configuring the connect client
- `sparglim-server` for creating a daemon Spark Connect Server, with support for deployment on K8S
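A minimal sketch of environment-variable configuration, assuming sparglim is installed locally. The variable names `SPARGLIM_MASTER` and `SPARGLIM_APP_NAME` are illustrative assumptions; check the config spark documentation for the exact names sparglim reads:

```python
import os

# Hypothetical SPARGLIM_* variable names, for illustration only;
# see the config spark docs for the real ones.
os.environ["SPARGLIM_MASTER"] = "local[*]"
os.environ["SPARGLIM_APP_NAME"] = "sparglim-demo"

from sparglim.config.builder import ConfigBuilder

# ConfigBuilder reads the environment and builds (or reuses) a SparkSession
spark = ConfigBuilder().get_or_create()
print(spark.conf.get("spark.app.name"))
```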
## Quick Start
Run JupyterLab with the sparglim docker image:

```bash
docker run \
-it \
-p 8888:8888 \
wh1isper/jupyterlab-sparglim
```

Visit http://localhost:8888 to use JupyterLab, then try out the SQL Magic feature.
Run and daemonize a Spark Connect Server:

```bash
docker run \
-it \
-p 15002:15002 \
-p 4040:4040 \
wh1isper/sparglim-server
```

Visit http://localhost:4040 for the Spark UI, and connect to the Spark Connect Server at sc://localhost:15002. Use sparglim to set up a SparkSession that connects to the Spark Connect Server.
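For a quick connectivity check, plain PySpark (3.4+, installed with the connect extra) can also reach the server directly; sparglim's own client configuration is shown in the "Connect to an existing Spark Connect Server" section below:

```python
from pyspark.sql import SparkSession

# Standard PySpark Connect client; requires `pip install "pyspark[connect]"`
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.range(3).count())  # a simple round trip through the server
```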
## Use Cases
### Direct use: Basic

Quickly configure a SparkSession directly in code:

```python
from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row
# Create a local[*] spark session with S3 & Kerberos config (picked up from environment variables)
spark = ConfigBuilder().get_or_create()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
df.show()
```

### Building a PySpark App

- Configure PySpark on K8S to support data exploration in JupyterLab: examples/jupyter-sparglim-on-k8s
- Configure PySpark to develop ELT services: pyspark-sampling

### Deploy Spark Connect Server on K8S (And Connect to it)

- Deploy a Spark Connect Server in Spark on K8S mode: examples/sparglim-server
- Deploy a Spark Connect Server in Spark on K8S mode and connect to it from JupyterLab: examples/jupyter-sparglim-sc
### Connect to an existing Spark Connect Server

Just set the environment variable SPARGLIM_REMOTE in the format `sc://host:port`.

Example code:

```python
import os
os.environ["SPARGLIM_REMOTE"] = "sc://localhost:15002" # or export SPARGLIM_REMOTE=sc://localhost:15002 before running python
from sparglim.config.builder import ConfigBuilder
from datetime import datetime, date
from pyspark.sql import Row
c = ConfigBuilder().config_connect_client()
spark = c.get_or_create()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
df.show()
```

## SQL Magic

Install sparglim with:

```bash
pip install sparglim["magic"]
```

Load the magic in IPython/Jupyter:

```ipython
%load_ext sparglim.sql
spark # show SparkSession brief info
```

Create a view:

```python
from datetime import datetime, date
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
df.createOrReplaceTempView("tb")
```

Query the view by `%SQL`:

```sql
%sql SELECT * FROM tb
```

The `%SQL` result dataframe can be assigned to a variable:

```python
df = %sql SELECT * FROM tb
df
```

or `%%SQL` can be used to execute multiple statements:

```sql
%%sql SELECT
*
FROM
tb;
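```

And several statements can go in one cell, separated by `;` (a sketch reusing the `tb` view created above; the statements run in order):

```sql
%%sql SELECT a, b FROM tb;
SELECT c, d FROM tb;
```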
You can also use Spark SQL to load data from an external data source, for example:

```sql
%%sql CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;
```
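The new table can then be queried like any other view (assuming `/path/to/file.json` points at an actual JSON file):

```sql
%sql SELECT * FROM tb_people
```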