1. Download

Download spark-3.5.4-bin-without-hadoop.tgz
URL: https://downloads.apache.org/spark/spark-3.5.4/
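If the VM has direct network access, the archive can instead be downloaded and verified on the node itself. A minimal sketch, assuming the checksum file follows the standard Apache layout and the sha512sum-compatible format:

```bash
# Fetch the archive and its published SHA-512 checksum (paths follow the standard Apache layout)
wget https://downloads.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-without-hadoop.tgz
wget https://downloads.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-without-hadoop.tgz.sha512

# Verify the download before installing
sha512sum -c spark-3.5.4-bin-without-hadoop.tgz.sha512
```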
2. Install

Set up a shared folder in the VM settings and copy the installation package into the Linux VM localhost1. The shared folder is mounted at /mnt/hgfs/. Copy the package from the shared folder to the target directory /opt/software/.

Extract the archive:

```bash
cd /opt/software/
tar -zxvf spark-3.5.4-bin-without-hadoop.tgz -C /usr/local/applications/
```
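Optionally, a version-free symlink keeps later paths stable across upgrades; the link name spark below is my own choice, and the rest of these notes keeps the full versioned path:

```bash
# Optional: stable path for configs and scripts (link name is hypothetical)
ln -s /usr/local/applications/spark-3.5.4-bin-without-hadoop /usr/local/applications/spark
```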
3. Configure environment variables

Configure all three Linux nodes:

```bash
vi /etc/profile

# append the following lines:
SPARK_HOME=/usr/local/applications/spark-3.5.4-bin-without-hadoop
PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export SPARK_HOME PATH
```

Reload the profile so the changes take effect:

```bash
source /etc/profile
```
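A quick sanity check that the variables took effect:

```bash
# SPARK_HOME should print the install path; spark-submit should resolve under $SPARK_HOME/bin
echo $SPARK_HOME
which spark-submit
```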
4. Modify the Spark configuration

```bash
cd $SPARK_HOME/conf
```
workers

```bash
cp workers.template workers
vi workers
```

File contents:

```
localhost1
localhost2
localhost3
```
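The worker names must resolve on every node. If DNS is not available, /etc/hosts entries along these lines are needed on all three machines (the IP addresses below are placeholders, not from the original setup):

```bash
# /etc/hosts on every node -- replace the placeholder IPs with the real ones
192.168.0.101 localhost1
192.168.0.102 localhost2
192.168.0.103 localhost3
```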
spark-defaults.conf

```bash
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
```

Add:

```
spark.master            spark://localhost1:7077
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://localhost1:9000/spark-eventlog
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.driver.memory     512m
```
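spark.eventLog.dir must point at the same NameNode address as fs.defaultFS in Hadoop's core-site.xml (hdfs://localhost1:9000 here). A quick cross-check, assuming the Hadoop path used later in this guide and the usual one-property-per-line XML formatting:

```bash
# The <value> printed should match the host:port in spark.eventLog.dir
grep -A1 'fs.defaultFS' /usr/local/applications/hadoop-3.3.6/etc/hadoop/core-site.xml
```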
Start HDFS:

```bash
start-dfs.sh
```

Create the HDFS event log directory:

```bash
hdfs dfs -mkdir /spark-eventlog
```
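Confirm the directory exists; applications fail to start with event logging enabled if it is missing:

```bash
# /spark-eventlog should appear in the listing
hdfs dfs -ls /
```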
spark-env.sh

```bash
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
```

Add:

```bash
export JAVA_HOME=/usr/local/java/jdk1.8.0_431
export HADOOP_HOME=/usr/local/applications/hadoop-3.3.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/usr/local/applications/hadoop-3.3.6/bin/hadoop classpath)
export SPARK_MASTER_HOST=localhost1
export SPARK_MASTER_PORT=7077
```
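For a "without-hadoop" build, SPARK_DIST_CLASSPATH is what lets Spark find the Hadoop jars; without it, Spark commands fail with missing-class errors. It can be sanity-checked directly:

```bash
# Should print a long list of Hadoop jar and config paths; an empty result means a broken setup
/usr/local/applications/hadoop-3.3.6/bin/hadoop classpath
```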
5. Distribute Spark to the cluster

First, stop and disable the firewall:

```bash
systemctl stop firewalld
systemctl disable firewalld
```
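start-all.sh and the scp commands below also assume passwordless SSH from localhost1 to all nodes. If that is not configured yet, a typical setup (a standard step, not part of the original notes) looks like this:

```bash
# On localhost1: generate a key (skip if ~/.ssh/id_rsa already exists) and copy it to every node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in localhost1 localhost2 localhost3; do
  ssh-copy-id root@$host
done
```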
Distribute Spark to localhost2 and localhost3:

```bash
cd /usr/local/applications
scp -r spark-3.5.4-bin-without-hadoop root@localhost2:/usr/local/applications/spark-3.5.4-bin-without-hadoop
scp -r spark-3.5.4-bin-without-hadoop root@localhost3:/usr/local/applications/spark-3.5.4-bin-without-hadoop
```
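A quick check that the copies landed on the workers:

```bash
# Each command should list the Spark launcher scripts on the remote node
ssh root@localhost2 "ls /usr/local/applications/spark-3.5.4-bin-without-hadoop/bin"
ssh root@localhost3 "ls /usr/local/applications/spark-3.5.4-bin-without-hadoop/bin"
```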
6. Start the cluster

```bash
cd $SPARK_HOME/sbin
./start-all.sh
```
After starting, check the processes on the three nodes:

```bash
[root@localhost1 sbin]# jps
3397 Jps
3190 Master
3336 Worker
[root@localhost2 ~]# jps
2966 Worker
3030 Jps
[root@localhost3 ~]# jps
2972 Worker
3037 Jps
```
The Spark Web UI is now available at http://localhost1:8080 (the standalone master's default port).
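A headless way to confirm the UI is up (expecting HTTP 200):

```bash
# Prints the HTTP status code of the master UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost1:8080
```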
7. Cluster test

The tests use HDFS, so start HDFS first:

```bash
start-dfs.sh
```
1. Compute Pi

```bash
run-example SparkPi 10
```
Output:

```bash
[root@localhost1 conf]# run-example SparkPi 10
Pi is roughly 3.141343141343141
```
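run-example is a thin wrapper around spark-submit; the same job can be submitted explicitly against the standalone master. The examples jar path below follows the standard layout of the Spark distribution, so the exact filename may differ on your build:

```bash
# Explicit equivalent of `run-example SparkPi 10`
spark-submit --master spark://localhost1:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.4.jar 10
```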
2. Start spark-shell

```bash
[root@localhost1 conf]# spark-shell
Spark context Web UI available at http://localhost1:4040
Spark context available as 'sc' (master = spark://localhost1:7077, app id = app-20250128143941-0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.4
      /_/

Using Scala version 2.12.18 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_431)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
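The word-count example below reads /wcinput/wc.txt from HDFS (presumably created in an earlier Hadoop exercise). If it does not exist, sample input can be uploaded from another terminal first; the contents here are illustrative, so the counts printed below reflect the original file, not this one:

```bash
# Upload a small whitespace-separated text file for the word count to read
hdfs dfs -mkdir -p /wcinput
echo "hadoop hdfs yarn mapreduce" > /tmp/wc.txt
hdfs dfs -put /tmp/wc.txt /wcinput/wc.txt
```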
Run the following in the spark shell:

```scala
scala> val lines = sc.textFile("/wcinput/wc.txt")
lines: org.apache.spark.rdd.RDD[String] = /wcinput/wc.txt MapPartitionsRDD[1] at textFile at <console>:23

scala> lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)
(mapreduce,3)
(yarn,2)
(neil,3)
(hadoop,2)
(jack,3)
(hdfs,1)
```
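Because event logging was enabled in step 4, each completed application also leaves a log behind, which makes for a final end-to-end check of the configuration:

```bash
# One entry should appear per finished application (e.g. the SparkPi run)
hdfs dfs -ls /spark-eventlog
```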