Deploying and Testing PySpark with Docker

Running PySpark with Docker

Deploying PySpark by hand is fairly tedious; with Docker, a working PySpark environment can be prepared quickly.
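
If Docker is already installed, the only other prerequisite is the Spark image with Python support that the commands below use (spark:python3). Pulling it ahead of time is optional, since docker run will fetch it automatically on first use:

shell
docker pull spark:python3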

Run pyspark

Create the relevant directories

shell
cd resources/
[john@localhost resources]$ pwd
/home/john/Projects/2025_2026/spark/spark-test/resources
[john@localhost resources]$ tree
.
├── dir1
│   ├── dir2
│   │   └── file2.parquet
│   ├── file1.parquet
│   └── file3.json
├── employees.json
├── full_user.avsc
├── kv1.txt
├── META-INF
│   └── services
│       ├── org.apache.spark.sql.jdbc.JdbcConnectionProvider
│       └── org.apache.spark.sql.SparkSessionExtensionsProvider
├── people.csv
├── people.json
├── people.txt
├── README.md
├── user.avsc
├── users.avro
├── users.orc
└── users.parquet

4 directories, 16 files
[john@localhost resources]$

Start pyspark

shell
docker run -it --rm \
-v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
-v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
-e HOME=/home/spark \
--privileged=true \
--user root \
spark:python3 /opt/spark/bin/pyspark

The host directories used in the volume mounts need to be created in advance.
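
A minimal sketch of preparing them (the paths match the -v mounts above; spark_history/ is mounted as the container user's home, so anything written under $HOME persists on the host):

shell
# host-side directories referenced by the -v options of docker run
mkdir -p /home/john/Projects/2025_2026/spark/spark-test/resources
mkdir -p /home/john/Projects/2025_2026/spark/spark-test/spark_history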

The run looks like this:

shell
[john@localhost spark-test]$ docker run -it --rm \
> -v /home/john/Projects/2025_2026/spark/spark-test/resources:/opt/spark/work-dir \
> -v /home/john/Projects/2025_2026/spark/spark-test/spark_history:/home/spark \
> -e HOME=/home/spark \
> --privileged=true \
> --user root \
> spark:python3 /opt/spark/bin/pyspark
Python 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/15 16:31:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.10.12 (main, Aug 15 2025 14:32:43)
Spark context Web UI available at http://bccc4588f3c1:4040
Spark context available as 'sc' (master = local[*], app id = local-1765816297828).
SparkSession available as 'spark'.
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
174
>>> textFile.first()
Row(value='# Apache Spark')
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
>>> linesWithSpark.count()
20
>>> textFile.filter(textFile.value.contains("Spark")).count()
20
>>> from pyspark.sql import functions as sf
>>> textFile.select(sf.size(sf.split(textFile.value, "\s+")).name("numWords")).agg(sf.max(sf.col("numWords"))).collect()
[Row(max(numWords)=16)]
>>> wordCounts = textFile.select(sf.explode(sf.split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
>>> wordCounts
DataFrame[word: string, count: bigint]
>>> wordCounts.collect()
[Row(word='[![PySpark', count=1), Row(word='online', count=1), Row(word='graphs', count=1), Row(word='Build](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml/badge.svg)](https://github.com/apache/spark/actions/workflows/build_branch40_maven.yml)', count=1), Row(word='spark.range(1000', count=2), Row(word='And', count=1), Row(word='distribution', count=1)]
>>> exit()
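
Because the container's working directory (/opt/spark/work-dir) is the mounted resources directory, the other sample files listed earlier can be read the same way. A minimal sketch for the same kind of pyspark session (the actual schemas and rows depend on what your copies of these files contain):

python
# Relative paths resolve against /opt/spark/work-dir, i.e. the mounted resources/ directory.
people = spark.read.json("people.json")
people.printSchema()

users = spark.read.parquet("users.parquet")
users.show()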