1. Environment Preparation
- Every node in the cluster must have Java installed (Java 8 or later is recommended).
- Make sure all nodes can log in to one another over passwordless SSH (a quick setup sketch follows this list).
- Install and configure a Hadoop cluster, since YARN is Hadoop's resource management system.
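A minimal sketch for checking these prerequisites, assuming the nodes are named node1, node2, and node3 as in the rest of this guide:
java -version                      # run on every node; expect 1.8 or later
# Generate a key pair once (accept the defaults), then push the public key
# to every node so logins stop prompting for a password
ssh-keygen -t rsa
for host in node1 node2 node3; do
  ssh-copy-id "$host"
done
ssh node2 hostname                 # confirm passwordless login works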
2. Configure Hadoop
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/datanode</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
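After editing these files (and copying them to every node), a quick sanity check is to query the effective configuration, for example:
hdfs getconf -confKey fs.defaultFS       # expect hdfs://node1:9000
hdfs getconf -confKey dfs.replication    # expect 3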
On the NameNode (node1 in this guide), format HDFS. Do this only once on a fresh cluster; reformatting wipes existing HDFS metadata:
hdfs namenode -format
Start HDFS and YARN with the standard scripts from Hadoop's sbin directory:
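start-dfs.sh
start-yarn.sh
# Verify the daemons: node1 should show NameNode and ResourceManager,
# while worker nodes should show DataNode and NodeManager
jps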
3. Download Spark
Download the Spark distribution and extract it under /opt:
wget https://downloads.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -zxvf spark-3.3.2-bin-hadoop3.tgz -C /opt
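Older Spark releases are eventually moved off the main download site; if the URL above returns a 404, the same tarball should be available from the Apache archive:
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz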
3.1 Configure environment variables
Edit the /etc/profile file and append the following:
export SPARK_HOME=/opt/spark-3.3.2-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Apply the changes:
source /etc/profile
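To confirm the variables took effect:
echo $SPARK_HOME          # expect /opt/spark-3.3.2-bin-hadoop3
spark-submit --version    # should report Spark version 3.3.2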
3.2 Configure the Spark core files
On the node1 node, edit the following files:
spark-env.sh
cp /opt/spark-3.3.2-bin-hadoop3/conf/spark-env.sh.template /opt/spark-3.3.2-bin-hadoop3/conf/spark-env.sh
vim /opt/spark-3.3.2-bin-hadoop3/conf/spark-env.sh
Add the following:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CONF_DIR=/opt/hadoop-3.3.4/etc/hadoop
export SPARK_MASTER_HOST=node1
export SPARK_LOCAL_IP=node1
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:9000/spark-logs -Dspark.history.ui.port=18080"
spark-defaults.conf
cp /opt/spark-3.3.2-bin-hadoop3/conf/spark-defaults.conf.template /opt/spark-3.3.2-bin-hadoop3/conf/spark-defaults.conf
vim /opt/spark-3.3.2-bin-hadoop3/conf/spark-defaults.conf
Add the following:
spark.master                 yarn
spark.submit.deployMode      cluster
spark.yarn.am.memory         1024m
spark.executor.memory        1024m
spark.eventLog.enabled       true
spark.eventLog.dir           hdfs://node1:9000/spark-logs
3.3 Sync to the other nodes
scp -r /opt/spark-3.3.2-bin-hadoop3/conf node2:/opt/spark-3.3.2-bin-hadoop3/
scp -r /opt/spark-3.3.2-bin-hadoop3/conf node3:/opt/spark-3.3.2-bin-hadoop3/
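These commands copy only the conf directory, so they assume the Spark distribution has already been extracted to the same path on node2 and node3. If it has not, ship the whole installation first, for example:
scp -r /opt/spark-3.3.2-bin-hadoop3 node2:/opt/
scp -r /opt/spark-3.3.2-bin-hadoop3 node3:/opt/
# remember to apply the /etc/profile changes (SPARK_HOME, PATH) on each node as well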
3.4 Create the Spark log directory on HDFS
hdfs dfs -mkdir -p /spark-logs
3.5 Start the Spark History Server
On the node1 node, execute:
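$SPARK_HOME/sbin/start-history-server.sh
Once started, the web UI should be reachable at http://node1:18080, the port configured via spark.history.ui.port above.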
4. Test the cluster
Submit the bundled SparkPi example to YARN:
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  /opt/spark-3.3.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.3.2.jar \
  10
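In cluster deploy mode the driver runs inside the YARN application master, so the "Pi is roughly ..." result ends up in the application logs rather than in the submitting shell. One way to retrieve it (the application ID is printed by spark-submit during submission):
yarn application -list -appStates FINISHED   # find the application ID
# fetching logs this way requires YARN log aggregation to be enabled
yarn logs -applicationId <application_id> | grep "Pi is roughly"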