Spark Standalone Mode: HA Deployment and Task Failure Resubmission Testing

1. Standalone HA

In a Standalone cluster, Workers report their resources, applications are submitted, and Drivers request resources all by communicating with the Master. If the cluster has only one Master, all of that communication breaks when the Master goes down, so configuring Master HA in a Standalone cluster is necessary.

Master HA runs two Masters: an Active Master and a Standby Master. When the Active Master dies, the Standby Master switches to Active and automatically takes over the Spark cluster. ZooKeeper can drive the active/standby switchover: it provides both leader election and storage, and holds the cluster's Worker and Application metadata.

1.1 Node Planning

The Master and standby Master of the Standalone cluster are placed on the nodes as follows:

Node Name    Active Master    Standby Master    Worker
hadoop101                     ✓
hadoop102                                       ✓
hadoop103                                       ✓
hadoop106    ✓

Set up Master HA with the following steps.

1.2 Deployment

1.2.1 Configure spark-env.sh

On node hadoop106, edit $SPARK_HOME/conf/spark-env.sh and add the following:

export SPARK_MASTER_HOST=hadoop106
export SPARK_MASTER_PORT=30277
export SPARK_MASTER_WEBUI_PORT=30280
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=3g
export SPARK_WORKER_WEBUI_PORT=30281
export SPARK_WORKER_DIR=/data/spark
export SPARK_PID_DIR=/opt/module/spark-3.5.5
export SPARK_DAEMON_JAVA_OPTS=" -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop101:2181,hadoop102:2181,hadoop103:2181 -Dspark.deploy.zookeeper.dir=/sparkmasterha"

The configuration options above are explained as follows:

  • spark.deploy.recoveryMode: sets the cluster recovery mode to ZOOKEEPER.
  • spark.deploy.zookeeper.url: the address list of the ZooKeeper ensemble.
  • spark.deploy.zookeeper.dir: the ZooKeeper path where Spark stores cluster state (inspectable with zkCli.sh, as sketched below).
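
Once the cluster is up, the state Spark keeps in ZooKeeper can be inspected directly. A minimal sketch using the ZooKeeper CLI (the child node names reflect what Spark typically creates under the configured directory and may vary by version):

# Connect to the ZooKeeper ensemble from any node that has the ZK client
zkCli.sh -server hadoop101:2181,hadoop102:2181,hadoop103:2181

# Inside the shell, list the directory set via spark.deploy.zookeeper.dir
ls /sparkmasterha
# Expected children (version-dependent): leader_election, master_status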

1.2.2 Distribute spark-env.sh to the Other Standalone Nodes

[root@hadoop106 ~]# scp -r /opt/module/spark-3.5.5 hadoop101:/opt/module/
[root@hadoop106 ~]# scp -r /opt/module/spark-3.5.5 hadoop102:/opt/module/
[root@hadoop106 ~]# scp -r /opt/module/spark-3.5.5 hadoop103:/opt/module/

1.2.3 Configure the Standby Master

Here hadoop101 is chosen as the Standby Master. On that node, edit $SPARK_HOME/conf/spark-env.sh and change SPARK_MASTER_HOST to hadoop101.
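
This edit can be scripted (a sketch, assuming the same install path as on hadoop106):

[root@hadoop101 ~]# sed -i 's/^export SPARK_MASTER_HOST=hadoop106/export SPARK_MASTER_HOST=hadoop101/' /opt/module/spark-3.5.5/conf/spark-env.sh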

1.2.4 Start the Standalone Cluster

# Start the Standalone cluster from hadoop106
[root@hadoop106 ~]# cd /opt/module/spark-3.5.5/sbin
[root@hadoop106 sbin]# ./start-all.sh

# Start the Standby Master on hadoop101
[root@hadoop101 ~]# cd /opt/module/spark-3.5.5/sbin
[root@hadoop101 sbin]# ./start-master.sh
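
Once both scripts return, jps on each node should show the expected daemons (a sketch; the PIDs below are illustrative):

[root@hadoop106 sbin]# jps | grep Master
21530 Master

[root@hadoop101 sbin]# jps | grep Master
18012 Master

[root@hadoop102 ~]# jps | grep Worker
8968 Worker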

1.3 Verification

1.3.1 Access the Active Master and Standby Master WebUIs

Active Master WebUI: http://hadoop106:30280 (Status: ALIVE)

Standby Master WebUI: http://hadoop101:30280 (Status: STANDBY)

ZooKeeper WebUI: (screenshot omitted)
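
The active/standby roles can also be confirmed from a shell. This sketch scrapes the Master WebUI pages; the exact HTML around the status text may differ between Spark versions:

# The Master UI reports "Status: ALIVE" (active) or "Status: STANDBY"
curl -s http://hadoop106:30280 | grep -o 'Status:</strong>[^<]*'
curl -s http://hadoop101:30280 | grep -o 'Status:</strong>[^<]*'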

1.3.2 Job Submission Test

Submit a job with:

spark-submit \
  --master spark://hadoop106:30277,hadoop101:30277 \
  --deploy-mode cluster \
  --conf spark.task.maxFailures=3 \
  --class org.apache.spark.examples.SparkPi \
  /opt/module/spark-3.5.5/examples/jars/spark-examples_2.13-3.5.5.jar 20000
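
In cluster deploy mode the submission output includes a driver ID, which can later be polled through the Masters (a sketch; the driver ID below is illustrative):

spark-submit --master spark://hadoop106:30277,hadoop101:30277 \
  --status driver-20251201160000-0000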

On hadoop106 (the Active Master), the submitted application is visible.

Kill the Master process on hadoop106; its ephemeral node in ZooKeeper is removed.

hadoop101 takes over the application, which then runs to completion.
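
A sketch of the failover sequence (the Master PID is illustrative):

# Kill the active Master on hadoop106
[root@hadoop106 ~]# jps | grep Master
21530 Master
[root@hadoop106 ~]# kill -9 21530

# After the ZooKeeper session expires, hadoop101 switches to ALIVE
curl -s http://hadoop101:30280 | grep -o 'Status:</strong>[^<]*'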

2. Spark Task Resubmission Tests

2.1 Standalone Mode

2.1.1 The CoarseGrainedExecutorBackend Process Dies

The Master relaunches the executor (note the new CoarseGrainedExecutorBackend PID below) and the job keeps running:

[root@hadoop103 ~]# jps
19024 Worker
23380 QuorumPeerMain
19770 CoarseGrainedExecutorBackend

[root@hadoop103 ~]# kill -9 19770

[root@hadoop103 ~]# jps
19024 Worker
23380 QuorumPeerMain
19882 CoarseGrainedExecutorBackend

2.1.2 The DriverWrapper Process Dies

The job fails:

[root@hadoop102 conf]# jps
9938 CoarseGrainedExecutorBackend
30837 QuorumPeerMain
9862 DriverWrapper
8968 Worker

[root@hadoop102 conf]# kill -9 9862

[root@hadoop102 conf]# jps
30837 QuorumPeerMain
8968 Worker
31049 DFSZKFailoverController

Change the submission parameters by adding --supervise:

spark-submit \
  --master spark://hadoop106:30277,hadoop101:30277 \
  --deploy-mode cluster \
  --supervise \
  --class org.apache.spark.examples.SparkPi \
  /opt/module/spark-3.5.5/examples/jars/spark-examples_2.13-3.5.5.jar 20000

After the driver fails, the Master keeps relaunching it until it runs to completion:

[root@hadoop102 conf]# jps
30837 QuorumPeerMain
10599 CoarseGrainedExecutorBackend
8968 Worker
10523 DriverWrapper

[root@hadoop102 conf]# kill -9 10523

[root@hadoop102 conf]# jps
30837 QuorumPeerMain
8968 Worker
10697 DriverWrapper

[root@hadoop102 conf]# kill -9 10697

[root@hadoop102 conf]# jps
30837 QuorumPeerMain
8968 Worker
10856 DriverWrapper
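
Note that a supervised driver that keeps crashing will be restarted indefinitely. To stop it for good, kill it through the Master (a sketch; the driver ID below is illustrative):

spark-submit --master spark://hadoop106:30277,hadoop101:30277 \
  --kill driver-20251201160000-0000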

2.2 Spark on YARN Mode

2.2.1 The YarnCoarseGrainedExecutorBackend Process Dies

Kill executor processes on different NodeManagers and watch where the replacements come up:

[root@hadoop103 ~]# jps
21488 YarnCoarseGrainedExecutorBackend
23918 NodeManager

[root@hadoop103 ~]# kill -9 21488

[root@hadoop103 ~]# jps
23918 NodeManager

[root@hadoop105 ~]# jps
19408 NodeManager
6594 ApplicationMaster
6441 SparkSubmit
6698 YarnCoarseGrainedExecutorBackend

[root@hadoop105 ~]# jps
19408 NodeManager
6594 ApplicationMaster
6441 SparkSubmit
6698 YarnCoarseGrainedExecutorBackend
6813 YarnCoarseGrainedExecutorBackend

[root@hadoop105 ~]# kill -9 6813
[root@hadoop105 ~]# jps
19408 NodeManager
6594 ApplicationMaster
6441 SparkSubmit
6698 YarnCoarseGrainedExecutorBackend

[root@hadoop104 ~]# jps
14939 NodeManager

[root@hadoop104 ~]# jps
14939 NodeManager
25085 YarnCoarseGrainedExecutorBackend

The ApplicationMaster requests a replacement container from the ResourceManager and relaunches the executor on a suitable NodeManager (hadoop104 above).
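
The relaunch can also be confirmed from the YARN side with the standard CLI (a sketch; the application ID below is illustrative):

# List the attempts of the application, then the containers of an attempt
yarn applicationattempt -list application_1764227554448_0009
yarn container -list appattempt_1764227554448_0009_000001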

2.2.2 The ApplicationMaster Process Dies

[root@hadoop103 ~]# jps
22265 YarnCoarseGrainedExecutorBackend
23918 NodeManager
22174 ApplicationMaster

[root@hadoop103 ~]# kill -9 22174
[root@hadoop103 ~]# jps
23918 NodeManager
22383 ApplicationMaster

YARN restarts the AM once. In the client log below, the application drops back to ACCEPTED while the new AM comes up, returns to RUNNING with a new AM RPC port, and finally reports SUCCEEDED:

root; groups with modify permissions: EMPTY
25/12/01 16:18:34 INFO Client: Submitting application application_1764227554448_0009 to ResourceManager
25/12/01 16:18:34 INFO YarnClientImpl: Submitted application application_1764227554448_0009
25/12/01 16:18:35 INFO Client: Application report for application_1764227554448_0009 (state: ACCEPTED)
25/12/01 16:18:35 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1764577114256
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0009/
         user: root
25/12/01 16:18:45 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:18:45 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: hadoop103
         ApplicationMaster RPC port: 44489
         queue: default
         start time: 1764577114256
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0009/
         user: root
25/12/01 16:19:10 INFO Client: Application report for application_1764227554448_0009 (state: ACCEPTED)
25/12/01 16:19:10 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1764577114256
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0009/
         user: root
25/12/01 16:19:16 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:19:16 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: hadoop103
         ApplicationMaster RPC port: 41209
         queue: default
         start time: 1764577114256
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0009/
         user: root
25/12/01 16:19:46 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:20:16 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:20:46 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:21:16 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:21:46 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:22:17 INFO Client: Application report for application_1764227554448_0009 (state: RUNNING)
25/12/01 16:22:24 INFO Client: Application report for application_1764227554448_0009 (state: FINISHED)
25/12/01 16:22:24 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: hadoop103
         ApplicationMaster RPC port: 41209
         queue: default
         start time: 1764577114256
         final status: SUCCEEDED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0009/
         user: root
25/12/01 16:22:24 INFO ShutdownHookManager: Shutdown hook called
25/12/01 16:22:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-64fa15fc-483a-4708-ac5e-4c9d6b17757e
25/12/01 16:22:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-0e326f3d-314b-4eb2-ba7f-110daaf9ba9e

Kill the AM process twice and the application fails:

root; groups with modify permissions: EMPTY
25/12/01 16:26:37 INFO Client: Submitting application application_1764227554448_0010 to ResourceManager
25/12/01 16:26:37 INFO YarnClientImpl: Submitted application application_1764227554448_0010
25/12/01 16:26:38 INFO Client: Application report for application_1764227554448_0010 (state: ACCEPTED)
25/12/01 16:26:38 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1764577597406
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0010/
         user: root
25/12/01 16:26:50 INFO Client: Application report for application_1764227554448_0010 (state: RUNNING)
25/12/01 16:26:50 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: hadoop104
         ApplicationMaster RPC port: 43071
         queue: default
         start time: 1764577597406
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0010/
         user: root
25/12/01 16:27:03 INFO Client: Application report for application_1764227554448_0010 (state: ACCEPTED)
25/12/01 16:27:03 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1764577597406
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0010/
         user: root
25/12/01 16:27:09 INFO Client: Application report for application_1764227554448_0010 (state: RUNNING)
25/12/01 16:27:09 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: hadoop103
         ApplicationMaster RPC port: 40025
         queue: default
         start time: 1764577597406
         final status: UNDEFINED
         tracking URL: http://hadoop101:8088/proxy/application_1764227554448_0010/
         user: root
25/12/01 16:27:17 INFO Client: Application report for application_1764227554448_0010 (state: FAILED)
25/12/01 16:27:17 INFO Client:
         client token: N/A
         diagnostics: Application application_1764227554448_0010 failed 2 times due to AM Container for appattempt_1764227554448_0010_000002 exited with  exitCode: 137
Failing this attempt.Diagnostics: [2025-12-01 16:27:16.923]Container killed on request. Exit code is 137
[2025-12-01 16:27:16.925]Container exited with a non-zero exit code 137.
[2025-12-01 16:27:16.925]Killed by external signal
For more detailed output, check the application tracking page: http://hadoop101:8088/cluster/app/application_1764227554448_0010 Then click on links to logs of each attempt.
. Failing the application.
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1764577597406
         final status: FAILED
         tracking URL: http://hadoop101:8088/cluster/app/application_1764227554448_0010
         user: root
25/12/01 16:27:17 INFO Client: Deleted staging directory hdfs://mycluster/user/root/.sparkStaging/application_1764227554448_0010
25/12/01 16:27:17 ERROR Client: Application diagnostics message: Application application_1764227554448_0010 failed 2 times due to AM Container for appattempt_1764227554448_0010_000002 exited with  exitCode: 137
Failing this attempt.Diagnostics: [2025-12-01 16:27:16.923]Container killed on request. Exit code is 137
[2025-12-01 16:27:16.925]Container exited with a non-zero exit code 137.
[2025-12-01 16:27:16.925]Killed by external signal
For more detailed output, check the application tracking page: http://hadoop101:8088/cluster/app/application_1764227554448_0010 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1764227554448_0010 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1312)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1745)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1034)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:199)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:222)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1125)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1134)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
25/12/01 16:27:17 INFO ShutdownHookManager: Shutdown hook called
25/12/01 16:27:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-39839360-c274-42ed-bc1c-9a1ad6112c92
25/12/01 16:27:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-e2efae43-bbe4-409d-92a7-fb7c856281e0

This is because yarn.resourcemanager.am.max-attempts defaults to 2. The per-application limit can also be set with spark.yarn.maxAppAttempts, which defaults to the YARN value and must not exceed yarn.resourcemanager.am.max-attempts.
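
To tolerate more AM failures, both settings must allow it; the YARN global cap is the upper bound. A sketch (assuming yarn.resourcemanager.am.max-attempts has been raised to at least 4 in yarn-site.xml):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --class org.apache.spark.examples.SparkPi \
  /opt/module/spark-3.5.5/examples/jars/spark-examples_2.13-3.5.5.jar 20000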
