[YARN in Practice] Rolling Upgrade from YARN 2.9.1 to 3.4.1: Research and Hands-on Verification

  • ResourceManager is abbreviated below as RM
  • MapReduce History Server is abbreviated below as MRHS
  • NodeManager is abbreviated below as NM

Installation Packages

The packages involved in this installation are:

  • jdk1.8.0_161.tgz
  • zookeeper-3.4.6.tar.gz
  • hadoop-2.9.1.61.tar.gz
  • hadoop-3.4.1.0.SNAPSHOT.tar.gz
  • apache-zookeeper-3.8.4-bin.tar.gz

The source environment runs ZK 3.4.6 + hadoop-2.9.1; the upgrade target is ZK 3.8.4 + hadoop-3.4.1.0.SNAPSHOT.

Deployment Architecture

[Deployment diagram: a YARN cluster on ServerA-ServerE and ServerS running RM1, RM2, NM1-NM3, and MRHS; a ZK cluster on ServerF, ServerG, ServerH, and ServerJ running ZK1-ZK3; a YARN 2.9.1 client; and a backing HDFS cluster.]

1. RM and NM must be deployed on separate hosts. If an NM runs on an RM host, migrate that NM away first.

2. Tasks need to be spread across all NMs: set yarn.scheduler.fair.max.assign to 1 in yarn-site.xml (a quick check is sketched below).
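A minimal sketch for checking both prerequisites; the config path is an assumption about the local install layout:

    # Confirm the fair scheduler hands out at most one container per heartbeat
    grep -A1 'yarn.scheduler.fair.max.assign' /home/yarn/software/hadoop/etc/hadoop/yarn-site.xml

    # List all registered NodeManagers and confirm none is co-located with an RM host
    yarn node -list -all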

Rolling Upgrade

Components to upgrade:
  • RM
  • NM
  • MRHS

The higher YARN version is not compatible with ZK 3.4.6, so ZK must be upgraded to 3.8.4 before upgrading YARN.
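A minimal way to confirm the ensemble is actually serving 3.8.4 after the ZK upgrade, assuming the default four-letter-word whitelist (srvr is enabled out of the box) and using the ZK hosts from the deployment diagram as placeholders:

    for zk in ServerF ServerG ServerH; do
        echo "== $zk =="
        echo srvr | nc "$zk" 2181 | grep -i 'zookeeper version'
    done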

To stay compatible with jobs submitted by low-version clients, the new version's configuration (yarn-site.xml) must enable the NM environment-variable whitelist and add HADOOP_MAPRED_HOME to it:

<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>
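To make sure every NM picks up the same whitelist before the upgrade, a rough propagation check might look like the sketch below (passwordless ssh, the NM host names, and the config path are all assumptions):

    for nm in nm1 nm2 nm3; do
        echo "== $nm =="
        ssh "$nm" "grep -A1 yarn.nodemanager.env-whitelist /home/yarn/software/hadoop/etc/hadoop/yarn-site.xml"
    done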

Upgrade Steps

Upgrade the standby RM first. Confirm the state of each RM with yarn rmadmin -getServiceState rm1 and yarn rmadmin -getServiceState rm2. Tasks A, B, C, D, and E referenced below are long-running jobs submitted at different points so that they span the whole upgrade window.
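One way to launch such a long-running test job (tasks A through E below) is the MapReduce sleep job shipped in the jobclient tests jar; the map/reduce counts and sleep times here are purely illustrative:

    hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        sleep -m 20 -r 10 -mt 600000 -rt 600000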

  1. Submit and run task A

  2. Rolling upgrade of the RMs (RM2 is the standby; the stop / relink / start cycle shared by all components is consolidated in a sketch after this list)

    1. Stop the low-version RM2: yarn-daemon.sh stop resourcemanager
    2. Point the symlink at the high version: ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop
    3. Start the high-version RM2: yarn --daemon start resourcemanager
    4. Submit and run task B
    5. Stop the low-version RM1: yarn-daemon.sh stop resourcemanager
    6. Submit and run task C
    7. Point the symlink at the high version: ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop
    8. Start the high-version RM1: yarn --daemon start resourcemanager
  3. Rolling upgrade of the NMs

    1. Stop the NM: yarn-daemon.sh stop nodemanager
    2. Point the symlink at the high version: ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop
    3. Start the NM: yarn --daemon start nodemanager
    4. Submit and run task D
    5. Split the NMs into batches and repeat the steps above until all NMs are upgraded
    6. Submit and run task E
  4. Rolling upgrade of the MRHS

    1. Stop the MRHS: mr-jobhistory-daemon.sh stop historyserver
    2. Point the symlink at the high version: ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop
    3. Start the MRHS: mapred --daemon start historyserver
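Every component above follows the same stop / relink / start cycle on its host. The helper below is a minimal consolidation of those commands, assuming the install root /home/yarn/software and the symlink name hadoop used elsewhere in this document:

    upgrade_daemon() {
        local daemon="$1"        # resourcemanager | nodemanager | historyserver
        local stop_script="$2"   # low-version stop script: yarn-daemon.sh or mr-jobhistory-daemon.sh
        local start_cmd="$3"     # high-version start command: "yarn --daemon start" or "mapred --daemon start"
        "$stop_script" stop "$daemon"
        # Flip the version symlink to the high version (install root is an assumed layout)
        (cd /home/yarn/software && ln -snf hadoop-3.4.1.0-SNAPSHOT hadoop)
        $start_cmd "$daemon"
    }

    # Example: upgrade the NodeManager on the current host
    upgrade_daemon nodemanager yarn-daemon.sh "yarn --daemon start"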
Verification

Observations at each stage of the rolling upgrade

  1. Stop RM2 and check RM1, the NMs, and the MRHS for abnormal log messages. With a large number of NMs, is spot-checking a random sample sufficient? (A sampling sketch follows after this list.)

    No obvious abnormal messages were found.

  2. Upgrade RM2 to the high version

    No obvious abnormal messages were found.

  3. Stop RM1 and check RM2, the NMs, and the MRHS for abnormal log messages

    The active role failed over to RM2 as expected; no other errors were found.

  4. Upgrade RM1 to the high version

    No obvious errors appeared in the RM1, RM2, NM, or MRHS logs.

    The application logs, however, showed an RPC protocol version mismatch that caused a job failure: from this point on, tasks that landed on the NM co-located with an RM host failed.

  5. Upgrade the NMs

    After submitting an application, the high-version NMs could not launch containers. Error message:
    2025-07-14 16:34:05,998 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1752479714585_0004_m_000013_2: [2025-07-14 16:34:05.704]Exception from container-launch.
    Container id: container_e01_1752479714585_0004_01_000060
    Exit code: 1
    
    [2025-07-14 16:34:05.706]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
    Last 4096 bytes of prelaunch.err :
    Last 4096 bytes of stderr :
    Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
    Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild
    
    
    [2025-07-14 16:34:05.706]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
    Last 4096 bytes of prelaunch.err :
    Last 4096 bytes of stderr :
    Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
    Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild

    Adjust mapred-site.xml and add the configuration below. (If all the NMs are already on the high version and a low-version client submits a job without these settings, the job fails with errors complaining about the missing configuration; see YARN-6999.)
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/home/yarn/software/hadoop</value>
    </property>

    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/home/yarn/software/hadoop</value>
    </property>

    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/home/yarn/software/hadoop</value>
    </property>

    With the configuration above in place, jobs that run only on high-version NMs succeed. Jobs that span NMs of both versions, however, fail to launch containers on the low-version NMs. Error message:
    2025-07-14 16:12:40,030 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e01_1752479714585_0001_01_000051 is : 1
    2025-07-14 16:12:40,030 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e01_1752479714585_0001_01_000051 and exit code: 1
    ExitCodeException exitCode=1: 
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:998)
            at org.apache.hadoop.util.Shell.run(Shell.java:884)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1216)
            at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:294)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:437)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:288)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:92)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)

    The application reported the following error:
    2025-07-14 16:12:29,440 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1752479714585_0001_m_000009_0: [2025-07-14 16:12:28.631]Exception from container-launch.
    Container id: container_e01_1752479714585_0001_01_000012
    Exit code: 1
    
    [2025-07-14 16:12:28.634]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
    Last 4096 bytes of prelaunch.err :
    /bin/bash: ADD_OPENS: No such file or directory
    
    [2025-07-14 16:12:28.635]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
    Last 4096 bytes of prelaunch.err :
    /bin/bash: ADD_OPENS: No such file or directory

    Modify the following setting in mapred-site.xml:

    <property>
        <name>mapreduce.jvm.add-opens-as-default</name>
        <value>false</value>
    </property>

    The application still reported errors, mainly caused by the incompatibility between the high- and low-version TaskUmbilicalProtocol:
    2025-07-14 15:40:58,183 INFO [IPC Server handler 14 on 35531] org.apache.hadoop.ipc.Server: IPC Server handler 14 on 35531, call Call#0 Retry#0 getTask(org.apache.hadoop.mapred.JvmContext@65c73391), rpc version=2, client version=21, methodsFingerPrint=-410993661 from 10.37.74.28:57168
    org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.mapred.TaskUmbilicalProtocol version mismatch. (client = 21, server = 19)
    	at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:502)
    	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
    	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at javax.security.auth.Subject.doAs(Subject.java:422)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
    	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
    2025-07-14 15:40:58,195 INFO [IPC Server handler 12 on 35531] org.apache.hadoop.ipc.Server: IPC Server handler 12 on 35531, call Call#0 Retry#0 getTask(org.apache.hadoop.mapred.JvmContext@2858bcf1), rpc version=2, client version=21, methodsFingerPrint=-410993661 from 10.37.74.28:57170
    org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.mapred.TaskUmbilicalProtocol version mismatch. (client = 21, server = 19)
    	at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:502)
    	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
    	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at javax.security.auth.Subject.doAs(Subject.java:422)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
    	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
  6. Stop the MRHS

    No abnormal log messages were found.

  7. Upgrade the MRHS

    No abnormal log messages were found.

    Log lookup behaves slightly differently, though: the location of the logs was adjusted internally between versions.
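Regarding the open question in step 1 (how to check a large fleet of NMs), a sampling approach is sketched below; the host list file, log path, and sample size are assumptions:

    # Pick 5 random NodeManager hosts and scan the tail of their NM logs for errors
    shuf -n 5 nm_hosts.txt | while read -r nm; do
        echo "== $nm =="
        ssh "$nm" "grep -iE 'error|exception' /home/yarn/software/hadoop/logs/*nodemanager*.log | tail -n 20"
    done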
