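Deploying and Testing PySpark with Docker

Running PySpark with Docker

Setting up PySpark by hand is tedious; Docker gets a working PySpark environment ready quickly.

Running PySpark

Create the required directories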

cd resources/
[john@localhost resources]$ pwd
/home/john/Projects/2025_2026/spark/spark-test/resources
[john@localhost resources]$ tree
.
├── dir1
│   ├── dir2
│   │   └── file2.parquet
│   ├── file1.parquet
│   └── file3.json
├── employees.json
├── full_user.avsc
├── kv1.txt
├── META-INF
│   └── services
│       ├── org.apache.spark.sql.jdbc.JdbcConnectionProvider
│       └── org.apache.spark.sql.SparkSessionExtensionsProvider
├── people.csv
├── people.json
├── people.txt
├── README.md
├── user.avsc
├── users.avro
├── users.orc
└── users.parquet

4 directories, 16 files
[john@localhost resources]$

Start PySpark

docker run -it --rm \
-v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
-v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
-e HOME=/home/spark \
--privileged=true \
--user root \
spark:python3 /opt/spark/bin/pyspark

The host directories referenced by the two -v mounts must be created in advance.
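As a way to prepare both host directories before starting the container, here is a minimal Python sketch. It is not part of the original setup: the helper names `ensure_host_dirs` and `build_pyspark_command` are hypothetical, and the project base path is passed in as a parameter. The sketch only creates the directories and assembles the same `docker run` arguments shown above; it does not launch anything.

```python
from pathlib import Path


def ensure_host_dirs(base: Path) -> tuple[Path, Path]:
    # Create the two host directories used as bind mounts, if they are missing.
    resources = base / "resources"
    history = base / "spark_history"
    for directory in (resources, history):
        directory.mkdir(parents=True, exist_ok=True)
    return resources, history


def build_pyspark_command(base: Path) -> list[str]:
    # Assemble the docker run arguments from the command above as an argv list.
    resources, history = ensure_host_dirs(base)
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{resources}:/opt/spark/work-dir",
        "-v", f"{history}:/home/spark",
        "-e", "HOME=/home/spark",
        "--privileged=true",
        "--user", "root",
        "spark:python3", "/opt/spark/bin/pyspark",
    ]
```

Passing the result to `shlex.join` gives a command line that can be pasted into a terminal, and passing it to `subprocess.run` would start the container, assuming Docker and the `spark:python3` image are available on the host.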

Running the docker command produces the following session:

[john@localhost spark-test]$ docker run -it --rm \
> -v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
> -v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
> -e HOME=/home/spark \
> --privileged=true \
> --user root \
> spark:python3 /opt/spark/bin/pyspark
Python 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/15 16:31:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.10.12 (main, Aug 15 2025 14:32:43)
Spark context Web UI available at http://bccc4588f3c1:4040
Spark context available as 'sc' (master = local[*], app id = local-1765816297828).
SparkSession available as 'spark'.
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
174
>>> textFile.first()
Row(value='# Apache Spark')
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
>>> linesWithSpark.count()
20
>>> textFile.filter(textFile.value.contains("Spark")).count()
20
>>> from pyspark.sql import functions as sf
>>> textFile.select(sf.size(sf.split(textFile.value, "\s+")).name("numWords")).agg(sf.max(sf.col("numWords"))).collect()
[Row(max(numWords)=16)]
>>> wordCounts = textFile.select(sf.explode(sf.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts
DataFrame[word: string, count: bigint]
>>> wordCounts.collect()
[Row(word='[![PySpark', count=1), Row(word='online', count=1), Row(word='graphs', count=1), Row(word='Build](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml/badge.svg)](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml)', count=1), Row(word='spark.range(1000', count=2), Row(word='And', count=1), Row(word='distribution', count=1)]
>>> exit()
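
To make the last two DataFrame expressions in the session easier to follow, here is a plain-Python sketch of what they compute. Splitting on `\s+` turns each line into a list of tokens; `size` with `max` finds the largest token count of any line; `explode` with `groupBy("word").count()` tallies every token. This is an illustration only, not how Spark executes the query. The helper names `max_words_per_line` and `word_counts` and the sample lines are hypothetical, and Spark's regex `split` may treat edge cases such as empty tokens slightly differently.

```python
import re
from collections import Counter

# Same pattern as the one passed to sf.split in the session.
WHITESPACE = re.compile(r"\s+")


def max_words_per_line(lines):
    # Analogue of select(size(split(value))).agg(max(...)): largest token count per line.
    return max(len(WHITESPACE.split(line)) for line in lines)


def word_counts(lines):
    # Analogue of explode(split(value)).groupBy("word").count(): occurrences of each token.
    counts = Counter()
    for line in lines:
        counts.update(WHITESPACE.split(line))
    return counts


# Made-up sample lines, used only to exercise the two helpers.
sample = ["# Apache Spark", "Spark is a unified analytics engine", ""]
print(max_words_per_line(sample))    # 6
print(word_counts(sample)["Spark"])  # 2
```

Note that the empty line still contributes one empty-string token, which is why tokens that are not real words can show up in such counts.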