Deploying and Testing PySpark with Docker

Running PySpark with Docker

Setting up PySpark by hand is tedious. With Docker, a working PySpark environment can be prepared quickly.

Running PySpark

Create the required directories

cd resources/
[john@localhost resources]$ pwd
/home/john/Projects/2025_2026/spark/spark-test/resources
[john@localhost resources]$ tree
.
├── dir1
│   ├── dir2
│   │   └── file2.parquet
│   ├── file1.parquet
│   └── file3.json
├── employees.json
├── full_user.avsc
├── kv1.txt
├── META-INF
│   └── services
│       ├── org.apache.spark.sql.jdbc.JdbcConnectionProvider
│       └── org.apache.spark.sql.SparkSessionExtensionsProvider
├── people.csv
├── people.json
├── people.txt
├── README.md
├── user.avsc
├── users.avro
├── users.orc
└── users.parquet

4 directories, 16 files
[john@localhost resources]$

Start PySpark

docker run -it --rm \
-v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
-v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
-e HOME=/home/spark \
--privileged=true \
--user root \
spark:python3 /opt/spark/bin/pyspark

The host directories used in the two volume mounts (resources and spark_history) must be created beforehand.
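
The directory preparation and the command above can be scripted. The following is a minimal sketch, assuming Python 3.9+ on the host. The helper names `prepare_host_dirs` and `build_docker_command` are hypothetical, not part of Docker or Spark. The script only creates the two host directories and prints the resulting command; it does not invoke Docker itself.

```python
import shlex
import tempfile
from pathlib import Path

IMAGE = "spark:python3"  # image tag taken from the command above


def prepare_host_dirs(base: Path) -> dict[Path, str]:
    """Create the two host directories and return a host -> container mount map."""
    base = base.resolve()  # bind mounts need absolute host paths
    mounts = {
        base / "resources": "/opt/spark/work-dir",
        base / "spark_history": "/home/spark",
    }
    for host_dir in mounts:
        host_dir.mkdir(parents=True, exist_ok=True)
    return mounts


def build_docker_command(base: Path) -> list[str]:
    """Assemble the same `docker run` invocation as an argument list."""
    cmd = ["docker", "run", "-it", "--rm"]
    for host_dir, container_dir in prepare_host_dirs(base).items():
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd += [
        "-e", "HOME=/home/spark",
        "--privileged=true",
        "--user", "root",
        IMAGE, "/opt/spark/bin/pyspark",
    ]
    return cmd


if __name__ == "__main__":
    # Demonstrated against a throwaway directory; pass your own project
    # root (the parent of resources/ and spark_history/) instead.
    with tempfile.TemporaryDirectory() as tmp:
        print(shlex.join(build_docker_command(Path(tmp))))
```

Building the command as a list rather than a single string keeps paths containing spaces intact if the list is later handed to `subprocess.run`.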

The output looks like this:

[john@localhost spark-test]$ docker run -it --rm \
> -v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
> -v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
> -e HOME=/home/spark \
> --privileged=true \
> --user root \
> spark:python3 /opt/spark/bin/pyspark
Python 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/15 16:31:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.10.12 (main, Aug 15 2025 14:32:43)
Spark context Web UI available at http://bccc4588f3c1:4040
Spark context available as 'sc' (master = local[*], app id = local-1765816297828).
SparkSession available as 'spark'.
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
174
>>> textFile.first()
Row(value='# Apache Spark')
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
>>> linesWithSpark.count()
20
>>> textFile.filter(textFile.value.contains("Spark")).count()
20
>>> from pyspark.sql import functions as sf
>>> textFile.select(sf.size(sf.split(textFile.value, "\s+")).name("numWords")).agg(sf.max(sf.col("numWords"))).collect()
[Row(max(numWords)=16)]
>>> wordCounts = textFile.select(sf.explode(sf.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts
DataFrame[word: string, count: bigint]
>>> wordCounts.collect()
[Row(word='[![PySpark', count=1), Row(word='online', count=1), Row(word='graphs', count=1), Row(word='Build](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml/badge.svg)](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml)', count=1), Row(word='spark.range(1000', count=2), Row(word='And', count=1), Row(word='distribution', count=1)]
>>> exit()
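
To make the two DataFrame queries in the session easier to follow, here is a rough plain-Python sketch of the same steps (split every line on whitespace, take the largest word count per line, then count each word). It is an illustration only, not how Spark executes the queries. The function names and the sample lines are hypothetical, and the pattern is written as a raw string (`r"\s+"`) to avoid Python's invalid-escape warning.

```python
import re
from collections import Counter


def max_words_per_line(lines):
    """Mirror select(size(split(value, "\\s+"))).agg(max(...)): largest token count of any line."""
    return max(len(re.split(r"\s+", line)) for line in lines)


def word_counts(lines):
    """Mirror explode(split(...)).groupBy("word").count(): one tally per distinct token."""
    counts = Counter()
    for line in lines:
        counts.update(re.split(r"\s+", line))
    return counts


# Hypothetical sample lines, not the actual README.md contents.
sample = [
    "# Apache Spark",
    "",
    "Spark is a unified analytics engine",
    "for large-scale data processing",
]

print(max_words_per_line(sample))      # 6
print(word_counts(sample)["Spark"])    # 2
```

An empty line splits into a single empty string, so it contributes one empty "word" to the tally.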