基于Docker部署测试PySpark

基于Docker运行PySpark

pyspark的部署比较麻烦,利用docker可以快速实现pyspark环境准备

运行pyspark

创建相关目录

shell 复制代码
cd resources/
[john@localhost resources]$ pwd
/home/john/Projects/2025_2026/spark/spark-test/resources
[john@localhost resources]$ tree
.
├── dir1
│?? ├── dir2
│?? │?? └── file2.parquet
│?? ├── file1.parquet
│?? └── file3.json
├── employees.json
├── full_user.avsc
├── kv1.txt
├── META-INF
│?? └── services
│??     ├── org.apache.spark.sql.jdbc.JdbcConnectionProvider
│??     └── org.apache.spark.sql.SparkSessionExtensionsProvider
├── people.csv
├── people.json
├── people.txt
├── README.md
├── user.avsc
├── users.avro
├── users.orc
└── users.parquet

4 directories, 16 files
[john@localhost resources]$

启动运行pyspark

shell 复制代码
docker run -it --rm \
-v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
-v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
-e HOME=/home/spark \
--privileged=true \
--user root \
spark:python3 /opt/spark/bin/pyspark

涉及到的宿主机目录需要提前自己创建

运行效果如下:

shell 复制代码
[john@localhost spark-test]$ docker run -it --rm \
> -v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
> -v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
> -e HOME=/home/spark \
> --privileged=true \
> --user root \
> spark:python3 /opt/spark/bin/pyspark
Python 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/15 16:31:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.10.12 (main, Aug 15 2025 14:32:43)
Spark context Web UI available at http://bccc4588f3c1:4040
Spark context available as 'sc' (master = local[*], app id = local-1765816297828).
SparkSession available as 'spark'.
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
174
>>> textFile.first()
Row(value='# Apache Spark')
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
>>> linesWithSpark.count()
20
>>> textFile.filter(textFile.value.contains("Spark")).count()
20
>>> from pyspark.sql import functions as sf
>>> textFile.select(sf.size(sf.split(textFile.value, "\s+")).name("numWords")).agg(sf.max(sf.col("numWords"))).collect()
[Row(max(numWords)=16)]
>>> wordCounts = textFile.select(sf.explode(sf.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts
DataFrame[word: string, count: bigint]
>>> wordCounts.collect()
[Row(word='[![PySpark', count=1), Row(word='online', count=1), Row(word='graphs', count=1), Row(word='Build](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml/badge.svg)](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml)', count=1), Row(word='spark.range(1000', count=2), Row(word='And', count=1), Row(word='distribution', count=1)]
>>> exit()
相关推荐
志栋智能22 分钟前
从云端到边缘:无处不在的超自动化巡检需求
运维·自动化
容器魔方25 分钟前
Karmada v1.18 版本发布!新增混合云溢出式调度能力
人工智能·云原生·容器·华为云·云计算
是一个Bug43 分钟前
AI Agent 的沙箱是什么?它和 Docker / 虚拟机有什么区别?
人工智能·docker·容器
BJ_Bonree1 小时前
聊点技术 | 从“统一接入“到“统一调度“:重塑可观测平台的数据底座
运维·人工智能·可观测性
AOwhisky1 小时前
学习自测与解析:Redis系列第一期与第二期核心知识点详解
运维·数据库·redis·学习·云计算
流浪0011 小时前
Linux系统篇(五):Linux 进程控制全解:fork、exec、wait 核心原理与实战
linux·运维·服务器
从入门到放弃-咖啡豆1 小时前
记录一次docker部署过程和一些常用的docker指令
运维·docker·容器
DianSan_ERP1 小时前
架构师视角:电商大促高并发下的订单API限流与防漏单架构演进
java·运维·网络·安全·微服务·架构·自动化
不会就选b1 小时前
Linux之make,makefile
linux·运维·服务器
腾讯蓝鲸智云1 小时前
【运维自动化-监控平台】初识蓝鲸监控
运维·自动化·云计算·sass·paas