基于Docker部署测试PySpark

基于Docker运行PySpark

pyspark的部署比较麻烦,利用docker可以快速实现pyspark环境准备

运行pyspark

创建相关目录

shell 复制代码
cd resources/
[john@localhost resources]$ pwd
/home/john/Projects/2025_2026/spark/spark-test/resources
[john@localhost resources]$ tree
.
├── dir1
│?? ├── dir2
│?? │?? └── file2.parquet
│?? ├── file1.parquet
│?? └── file3.json
├── employees.json
├── full_user.avsc
├── kv1.txt
├── META-INF
│?? └── services
│??     ├── org.apache.spark.sql.jdbc.JdbcConnectionProvider
│??     └── org.apache.spark.sql.SparkSessionExtensionsProvider
├── people.csv
├── people.json
├── people.txt
├── README.md
├── user.avsc
├── users.avro
├── users.orc
└── users.parquet

4 directories, 16 files
[john@localhost resources]$

启动运行pyspark

shell 复制代码
docker run -it --rm \
-v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
-v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
-e HOME=/home/spark \
--privileged=true \
--user root \
spark:python3 /opt/spark/bin/pyspark

涉及到的宿主机目录需要提前自己创建

运行效果如下:

shell 复制代码
[john@localhost spark-test]$ docker run -it --rm \
> -v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
> -v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
> -e HOME=/home/spark \
> --privileged=true \
> --user root \
> spark:python3 /opt/spark/bin/pyspark
Python 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/15 16:31:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.10.12 (main, Aug 15 2025 14:32:43)
Spark context Web UI available at http://bccc4588f3c1:4040
Spark context available as 'sc' (master = local[*], app id = local-1765816297828).
SparkSession available as 'spark'.
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
174
>>> textFile.first()
Row(value='# Apache Spark')
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
>>> linesWithSpark.count()
20
>>> textFile.filter(textFile.value.contains("Spark")).count()
20
>>> from pyspark.sql import functions as sf
>>> textFile.select(sf.size(sf.split(textFile.value, "\s+")).name("numWords")).agg(sf.max(sf.col("numWords"))).collect()
[Row(max(numWords)=16)]
>>> wordCounts = textFile.select(sf.explode(sf.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts
DataFrame[word: string, count: bigint]
>>> wordCounts.collect()
[Row(word='[![PySpark', count=1), Row(word='online', count=1), Row(word='graphs', count=1), Row(word='Build](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml/badge.svg)](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml)', count=1), Row(word='spark.range(1000', count=2), Row(word='And', count=1), Row(word='distribution', count=1)]
>>> exit()
相关推荐
SkyWalking中文站1 天前
认识 Horizon UI · 6/17:Trace 探索器
运维·监控·自动化运维
程序员老赵1 天前
Docker 部署 Redmine:老牌开源项目管理部署实测记录
docker·开源·团队管理
程序员老赵1 天前
服务器文件不想 SFTP 上传?Docker 跑个 File Browser,浏览器就能管理
服务器·docker·开源
火车叼位1 天前
写给初级开发者:SSL、SSH、HTTPS 与证书体系全解析
运维
小猿姐2 天前
唯品会大规模数据库云原生实践:基于 KubeBlocks 管理数千实例的统一运维之路
运维·elasticsearch·云原生
SkyWalking中文站2 天前
认识 Horizon UI · 5/17:3D 基础设施地图
运维·监控·自动化运维
SkyWalking中文站3 天前
认识 Horizon UI · 1/17:SkyWalking 新一代可观测性控制台
运维·前端·监控
雪梨酱QAQ3 天前
Kubeneters HA Cluster部署
运维
lichenyang4533 天前
Docker 学习笔记(五):Docker Compose,用一个 YAML 启动前端、后端和 MongoDB
docker
lichenyang4533 天前
Docker 学习笔记(四):Dockerfile,把项目打成自己的镜像
docker·容器