YARN 2.9.1 Rolling Upgrade to 3.4.1: Investigation and Practical Verification
- ResourceManager is abbreviated as RM
- MapReduce History Server is abbreviated as MRHS
- NodeManager is abbreviated as NM
Installation packages
The packages involved in this installation are:
- jdk1.8.0_161.tgz
- zookeeper-3.4.6.tar.gz
- hadoop-2.9.1.61.tar.gz
- hadoop-3.4.1.0.SNAPSHOT.tar.gz
- apache-zookeeper-3.8.4-bin.tar.gz
The source environment runs ZK 3.4.6 with hadoop-2.9.1; the upgrade target is ZK 3.8.4 with hadoop-3.4.1.0.SNAPSHOT.
Deployment architecture
[Deployment diagram: a YARN cluster and a ZK cluster spread over ServerA–ServerE, ServerS, ServerF–ServerH, and ServerJ, hosting RM1/RM2, NM1–NM3, the MRHS, ZK1–ZK3, a YarnClient 2.9.1, and the HDFS cluster.]
1. RM and NM must be deployed on separate hosts. If an NM is running on an RM host, migrate that NM away first.
2. Tasks need to be distributed across all NMs; in yarn-site.xml, set yarn.scheduler.fair.max.assign to 1 (a snippet follows below).
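When the Fair Scheduler's batch assignment is enabled, yarn.scheduler.fair.max.assign caps how many containers it hands out per NodeManager heartbeat, so a value of 1 helps spread tasks across all NMs. A minimal yarn-site.xml sketch of this setting:

```xml
<!-- yarn-site.xml: limit the Fair Scheduler to one container assignment per NM heartbeat -->
<property>
  <name>yarn.scheduler.fair.max.assign</name>
  <value>1</value>
</property>
```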
Rolling upgrade
Components to upgrade
- RM
- NM
- MRHS
The newer YARN release is not compatible with ZK 3.4.6, so ZK must be upgraded to 3.8.4 before starting the YARN upgrade.
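A quick way to confirm the ensemble version before and after the ZK upgrade is the four-letter `srvr` command; this is only a sketch, and the hostnames zk1/zk2/zk3, the client port 2181, and the availability of `nc` are assumptions:

```shell
# Ask each ZK server for its version via the "srvr" four-letter command
# (zk1/zk2/zk3 and port 2181 are placeholders for the actual ensemble)
for host in zk1 zk2 zk3; do
  echo srvr | nc "$host" 2181 | grep "Zookeeper version"
done
```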
To stay compatible with jobs submitted by lower-version clients, the new version's configuration needs to enable the NM env whitelist and add the following entry.
```xml
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>
```
Upgrade steps
Upgrade the standby RM first. Use `yarn rmadmin -getServiceState rm1` and `yarn rmadmin -getServiceState rm2` to confirm the state of both RMs. Tasks A, B, C, D, and E submitted below are all long-running jobs that span the entire upgrade, and jobs keep being submitted throughout the process.
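For example, the state check might look like this (the RM ids rm1/rm2 follow the cluster's HA configuration; the outputs shown are illustrative):

```shell
# Confirm which RM is active/standby before touching anything
yarn rmadmin -getServiceState rm1   # e.g. prints: active
yarn rmadmin -getServiceState rm2   # e.g. prints: standby
```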
- Submit task A
- Rolling-upgrade the RMs (RM2 is the standby); the per-host stop/relink/start pattern is summarized in the sketch after this list
  - Stop the low-version RM2: `yarn-daemon.sh stop resourcemanager`
  - Point the symlink at the high version: `ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop`
  - Start the high-version RM2: `yarn --daemon start resourcemanager`
  - Submit task B
  - Stop the low-version RM1: `yarn-daemon.sh stop resourcemanager`
  - Submit task C
  - Point the symlink at the high version: `ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop`
  - Start the high-version RM1: `yarn --daemon start resourcemanager`
- Rolling-upgrade the NMs
  - Stop the NM: `yarn-daemon.sh stop nodemanager`
  - Point the symlink at the high version: `ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop`
  - Start the NM: `yarn --daemon start nodemanager`
  - Submit task D
  - Split the NMs into batches and repeat the steps above until all NMs are upgraded
  - Submit task E
- Rolling-upgrade the JHS (the MRHS)
  - Stop the JHS: `mr-jobhistory-daemon.sh stop historyserver`
  - Point the symlink at the high version: `ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop`
  - Start the JHS: `mapred --daemon start historyserver`
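Every daemon above follows the same stop / relink / start pattern. A minimal shell sketch of one host's upgrade, assuming the releases are unpacked under /home/yarn/software with `hadoop` as the symlink to the active release (the JHS uses `mr-jobhistory-daemon.sh` and `mapred` instead of `yarn`):

```shell
#!/usr/bin/env bash
# Sketch of the per-host rolling-upgrade pattern used above.
# Assumptions: releases live under /home/yarn/software, the active release is the
# "hadoop" symlink, and $1 is the YARN daemon to bounce (resourcemanager|nodemanager).
set -euo pipefail
DAEMON="$1"
cd /home/yarn/software

# 1. Stop the low-version daemon with the 2.9.1 scripts
./hadoop/sbin/yarn-daemon.sh stop "${DAEMON}"

# 2. Re-point the symlink at the new release
ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop

# 3. Start the daemon with the 3.4.1 CLI
./hadoop/bin/yarn --daemon start "${DAEMON}"
```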
Verification
Operations during the RM rolling upgrade
- Stop RM2 and check RM1, the NMs, and the JHS for abnormal log messages (with many NMs, spot-check a sample of them).
  No obvious abnormal messages were found.
- Start the high-version RM2.
  No obvious abnormal messages were found.
- Stop RM1 and check RM2, the NMs, and the JHS for abnormal log messages.
  The active RM failed over to RM2 as expected; no other errors were observed.
- Start the high-version RM1.
  No obvious error logs on RM1, RM2, the NMs, or the JHS.
  The application log, however, showed an RPC protocol version mismatch that caused a job to fail: from this point on, that job was using the NM on the same machine as the RM, and it failed.
- Upgrade the NMs.
  When an application is submitted, the high-version NMs fail to launch containers, with this error:
```shell
2025-07-14 16:34:05,998 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1752479714585_0004_m_000013_2: [2025-07-14 16:34:05.704]Exception from container-launch.
Container id: container_e01_1752479714585_0004_01_000060
Exit code: 1
[2025-07-14 16:34:05.706]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild
[2025-07-14 16:34:05.706]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild
```
Fix: add the following configuration to mapred-site.xml. (If all NMs are already on the high version and these settings are missing, jobs submitted by the low-version client fail with errors complaining about exactly these missing settings; see YARN-6999.)

```xml
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/home/yarn/software/hadoop</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/home/yarn/software/hadoop</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/home/yarn/software/hadoop</value>
</property>
```
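These settings can also be passed per job instead of cluster-wide when testing; a hedged sketch using the stock MapReduce pi example (the example jar path and job arguments are illustrative):

```shell
# Submit a test job with HADOOP_MAPRED_HOME injected only for this run
hadoop jar ./hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi \
  -Dyarn.app.mapreduce.am.env="HADOOP_MAPRED_HOME=/home/yarn/software/hadoop" \
  -Dmapreduce.map.env="HADOOP_MAPRED_HOME=/home/yarn/software/hadoop" \
  -Dmapreduce.reduce.env="HADOOP_MAPRED_HOME=/home/yarn/software/hadoop" \
  10 100
```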
After adding the configuration above, a job succeeds as long as it runs only on high-version NMs. A job that spans NMs of both versions cannot launch containers on the low-version NMs, with this error:
```shell
2025-07-14 16:12:40,030 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e01_1752479714585_0001_01_000051 is : 1
2025-07-14 16:12:40,030 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e01_1752479714585_0001_01_000051 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:998)
        at org.apache.hadoop.util.Shell.run(Shell.java:884)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1216)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:294)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:437)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:288)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:92)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```
The application reports this error:
```shell
2025-07-14 16:12:29,440 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1752479714585_0001_m_000009_0: [2025-07-14 16:12:28.631]Exception from container-launch.
Container id: container_e01_1752479714585_0001_01_000012
Exit code: 1
[2025-07-14 16:12:28.634]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: ADD_OPENS: No such file or directory
[2025-07-14 16:12:28.635]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: ADD_OPENS: No such file or directory
```
Fix: change the following setting in mapred-site.xml:

```xml
<property>
  <name>mapreduce.jvm.add-opens-as-default</name>
  <value>false</value>
</property>
```
The application then still reports errors, caused mainly by the incompatibility between the high- and low-version TaskUmbilicalProtocol:
```shell
2025-07-14 15:40:58,183 INFO [IPC Server handler 14 on 35531] org.apache.hadoop.ipc.Server: IPC Server handler 14 on 35531, call Call#0 Retry#0 getTask(org.apache.hadoop.mapred.JvmContext@65c73391), rpc version=2, client version=21, methodsFingerPrint=-410993661 from 10.37.74.28:57168
org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.mapred.TaskUmbilicalProtocol version mismatch. (client = 21, server = 19)
        at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:502)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2025-07-14 15:40:58,195 INFO [IPC Server handler 12 on 35531] org.apache.hadoop.ipc.Server: IPC Server handler 12 on 35531, call Call#0 Retry#0 getTask(org.apache.hadoop.mapred.JvmContext@2858bcf1), rpc version=2, client version=21, methodsFingerPrint=-410993661 from 10.37.74.28:57170
org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.mapred.TaskUmbilicalProtocol version mismatch. (client = 21, server = 19)
        at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:502)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
```
- Stop the MRHS.
  No abnormal log messages were found.
- Upgrade the MRHS.
  No abnormal log messages were found.
  Log lookup behaves a bit differently, though: the log locations were adjusted internally between versions.