Troubleshooting Hadoop configuration issues when running Flink on YARN

The following are commonly used yarn-session parameters and when to use them:

Basic parameters

  • -n, --container <arg>: Number of containers (i.e. TaskManagers) to start. Example: -n 4 starts 4 TaskManagers.
  • -jm, --jobManagerMemory <arg>: JobManager memory size, in MB by default. Example: -jm 1024
  • -tm, --taskManagerMemory <arg>: Memory size of each TaskManager, in MB by default. Example: -tm 4096
  • -s, --slots <arg>: Number of slots per TaskManager. Slots determine how many tasks a TaskManager can run in parallel. Example: -s 8
  • -d, --detached: Run the Flink YARN session in detached (background) mode. Useful for long-running clusters.
  • -nm, --name <arg>: Name for the application, shown in the YARN ResourceManager UI. Example: -nm myFlinkApp
  • -yD <property=value>: Set a dynamic property, overriding defaults or supplying extra configuration. Example: -yD yarn.applicationMaster.vcores=2

Resource-related parameters

  • -yjm, --yarnjobManagerMemory <arg>: JobManager memory size when running on YARN. Example: -yjm 1024m
  • -ytm, --yarntaskManagerMemory <arg>: Memory size of each TaskManager when running on YARN. Example: -ytm 4096m
  • -yqu, --yarnqueue <arg>: YARN queue to submit the application to. Example: -yqu default
  • -yst, --yarnstreaming: Start the Flink YARN session in streaming mode.
  • -yD <property=value>: Set a dynamic property, e.g. to adjust vCores (yarn.containers.vcores) or other YARN-specific settings.

Other parameters

  • -z, --zookeeperNamespace <arg>: ZooKeeper namespace to use for high-availability mode.
  • -yd, --yarndetached: Start the Flink cluster on YARN in detached mode.
  • -q, --query: Display the available YARN resources (memory, cores).

Here is an example of starting a Flink YARN session with some of the basic and resource-related parameters.
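A minimal sketch of such a command (flag availability and syntax vary across Flink versions; the application name and queue here are examples, check yarn-session.sh --help for your installation):

```shell
# Start a detached YARN session named "myFlinkApp" in the "default" queue,
# with a 1 GB JobManager and 4 GB TaskManagers offering 8 slots each.
./bin/yarn-session.sh -d -nm myFlinkApp -jm 1024m -tm 4096m -s 8 -qu default
```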

Problem: running the ZooKeeper CLI fails with

    Error: Could not find or load main class org.apache.zookeeper.ZooKeeperMain

Cause: ZooKeeper's lib directory was empty (the distribution had not been fully copied or extracted).

Problem: starting the Flink YARN session fails with

    Caused by: org.apache.flink.configuration.IllegalConfigurationException: The number of requested virtual cores for application master 1 exceeds the maximum number of virtual cores 0 available in the Yarn Cluster.

Cause: YARN is reporting 0 available vCores, typically because no healthy NodeManager is registered or yarn.nodemanager.resource.cpu-vcores is not set; see the fixes below.

1. Common configuration files and whether a restart is required

| Config file | Example settings | Restart required? | Notes |
|---|---|---|---|
| core-site.xml | fs.defaultFS, hadoop.tmp.dir | ✅ Yes | Affects global behavior such as the default filesystem and temp directory |
| hdfs-site.xml | dfs.replication, dfs.block.size | ❌ No (partially effective) | New files use the new values; existing files are unchanged |
| hdfs-site.xml | dfs.namenode.name.dir, dfs.datanode.data.dir | ✅ Yes | Changing data storage paths requires a restart |
| hdfs-site.xml | dfs.ha.automatic-failover.enabled | ✅ Yes | Core HA configuration change |
| yarn-site.xml | yarn.nodemanager.resource.memory-mb | ✅ Yes | NodeManager resource changes require a restart |
| yarn-site.xml | yarn.resourcemanager.scheduler.class | ✅ Yes | Changing the scheduler type requires an RM restart |
| mapred-site.xml | mapreduce.task.timeout | ❌ No | Job-level parameter; can be set per job |
| hadoop-env.sh / yarn-env.sh | memory and JVM options | ✅ Yes | JVM options are read at process startup |
| workers / slaves file | add/remove DataNode/NodeManager nodes | ❌ No (new nodes must be started manually) | Existing nodes unaffected |
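For the settings that do not require a full restart, running daemons can often pick up changes through refresh commands (exact support depends on your Hadoop version):

```shell
# Re-read queue/ACL configuration (capacity-scheduler.xml) on the ResourceManager
yarn rmadmin -refreshQueues

# Re-read the include/exclude host lists after editing the workers file
yarn rmadmin -refreshNodes
hdfs dfsadmin -refreshNodes
```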
  1. Set the vCores (yarn-site.xml):

        <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>4</value>
        </property>
        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>8192</value>
        </property>
        <property>
          <name>yarn.scheduler.maximum-allocation-vcores</name>
          <value>4</value>
        </property>
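One way to sanity-check that these values actually landed in yarn-site.xml is a small script; the file path below is an assumption, adjust it to your installation:

```python
import os
import xml.etree.ElementTree as ET

def get_property(conf_path, name):
    """Return the <value> of a named Hadoop <property>, or None if absent."""
    root = ET.parse(conf_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

if __name__ == "__main__":
    path = "/etc/hadoop/conf/yarn-site.xml"  # hypothetical path; adjust
    if os.path.exists(path):
        print(get_property(path, "yarn.nodemanager.resource.cpu-vcores"))
```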

Cause 1: the ResourceManager is not running, or is in standby state

Check the ResourceManager state:

    yarn rmadmin -getServiceState rm1

Force it to become active:

    yarn rmadmin -transitionToActive --forcemanual rm1

Check that YARN_CONF_DIR is set and points at a directory containing the cluster configuration:

    echo $YARN_CONF_DIR
    ls $YARN_CONF_DIR

It should contain, among others:

    core-site.xml
    hdfs-site.xml
    yarn-site.xml
    capacity-scheduler.xml
  1. The NodeManager has a problem, e.g. its local directories do not exist.

Restart the NodeManager:

    yarn-daemon.sh stop nodemanager
    yarn-daemon.sh start nodemanager

2. Confirm the ResourceManager is running and is not in standby

Note: the hdfs haadmin commands below manage NameNode HA; the equivalent yarn rmadmin commands for ResourceManager HA follow at the end.

    # Force nn2 to become active (only when automatic failover is disabled/broken)
    hdfs haadmin -transitionToActive --forcemanual nn2

    # Check the current NameNode states
    hdfs haadmin -getServiceState nn1   # prints standby
    hdfs haadmin -getServiceState nn2   # prints active

    # Switch nn1 to active
    hdfs haadmin -transitionToActive --forcemanual nn1
    # Re-initialize a standby NameNode's metadata from the active one
    hdfs namenode -bootstrapStandby

    # Confirm the states again
    hdfs haadmin -getServiceState nn1   # should be active
    hdfs haadmin -getServiceState nn2   # should be standby

    # Checkpoint the namespace (NameNode must be in safe mode)
    hdfs dfsadmin -saveNamespace

Daemon management commands used throughout this troubleshooting:

    hadoop-daemon.sh start zkfc
    hadoop-daemon.sh start namenode
    hadoop-daemon.sh stop namenode
    hadoop-daemon.sh start datanode
    hadoop-daemon.sh stop datanode
    hadoop-daemon.sh restart datanode
    hadoop-daemon.sh start journalnode
    start-dfs.sh
    stop-dfs.sh
    start-yarn.sh
    stop-yarn.sh

    yarn-daemon.sh restart nodemanager
    yarn-daemon.sh stop nodemanager
    yarn-daemon.sh start nodemanager
    yarn-daemon.sh stop resourcemanager
    yarn-daemon.sh start resourcemanager

    # ResourceManager HA status and failover
    yarn rmadmin -getServiceState rm1
    yarn rmadmin -getServiceState rm2
    yarn rmadmin -transitionToActive --forcemanual rm1
    # Reload queue configuration without a restart
    yarn rmadmin -refreshQueues
    # List NodeManagers and their resources
    yarn node -list -showDetails

Flink HA and checkpoint paths on HDFS (flink-conf.yaml):

    state.checkpoints.dir: hdfs:///flink/checkpoints
    state.savepoints.dir: hdfs:///flink/savepoints
    high-availability.storageDir: hdfs:///flink/ha/
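Before pointing Flink at these paths it helps to create them up front with the right owner; the paths come from the config above, while the hadoop:supergroup owner is an assumption about the user running the Flink cluster:

```shell
# Create the Flink state directories on HDFS and make them writable
# by the user that runs the Flink cluster (assumed here to be "hadoop").
hdfs dfs -mkdir -p /flink/checkpoints /flink/savepoints /flink/ha
hdfs dfs -chown -R hadoop:supergroup /flink
```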
  1. Confirm the Capacity Scheduler is in use. (The yarn.scheduler.capacity.* queue properties normally belong in capacity-scheduler.xml; the remaining properties go in yarn-site.xml.)

        <configuration>
          <!-- ResourceManager HA -->
          <property>
            <name>yarn.resourcemanager.ha.enabled</name>
            <value>true</value>
          </property>
          <property>
            <name>yarn.resourcemanager.cluster-id</name>
            <value>yarn-cluster</value>
          </property>
          <property>
            <name>yarn.resourcemanager.ha.rm-ids</name>
            <value>rm1,rm2</value>
          </property>
          <property>
            <name>yarn.resourcemanager.hostname.rm1</name>
            <value>hadoop-001</value>
          </property>
          <property>
            <name>yarn.resourcemanager.hostname.rm2</name>
            <value>hadoop-002</value>
          </property>
          <property>
            <name>yarn.resourcemanager.zk-address</name>
            <value>hadoop-001:2181,hadoop-002:2181,hadoop-003:2181</value>
          </property>
          <!-- NodeManager -->
          <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
          </property>
          <property>
            <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
          </property>
          <property>
            <name>yarn.nodemanager.log-dirs</name>
            <value>/var/log/hadoop-yarn/nodemanager</value>
          </property>
          <property>
            <name>yarn.nodemanager.local-dirs</name>
            <value>/var/hadoop/yarn/local</value>
          </property>
          <property>
            <name>yarn.nodemanager.resource.cpu-vcores</name>
            <value>4</value>
          </property>
          <!-- Scheduler type and allocation limits -->
          <property>
            <name>yarn.resourcemanager.scheduler.class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
          </property>
          <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>1024</value>
          </property>
          <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>8192</value>
          </property>
          <property>
            <name>yarn.scheduler.minimum-allocation-vcores</name>
            <value>1</value>
          </property>
          <property>
            <name>yarn.scheduler.maximum-allocation-vcores</name>
            <value>4</value>
          </property>
          <!-- Capacity Scheduler queues (capacity-scheduler.xml) -->
          <property>
            <name>yarn.scheduler.capacity.root.queues</name>
            <value>default,flink</value>
            <description>Child queues under root</description>
          </property>
          <property>
            <name>yarn.scheduler.capacity.root.flink.capacity</name>
            <value>30</value>
          </property>
          <property>
            <name>yarn.scheduler.capacity.root.flink.maximum-capacity</name>
            <value>60</value>
          </property>
          <property>
            <name>yarn.scheduler.capacity.root.default.capacity</name>
            <value>30</value>
          </property>
          <property>
            <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
            <value>60</value>
          </property>
        </configuration>
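Once the ResourceManager has been restarted (or yarn rmadmin -refreshQueues run), the queue setup can be checked from the CLI; the queue name flink matches the configuration above:

```shell
# Show the configured state, capacity and current usage of the flink queue
yarn queue -status flink
```

A Flink session can then target the queue with -qu flink (or -yqu flink when submitting a job).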

    Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user":hadoop:supergroup:drwx------
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:422)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:333)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:713)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1892)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1910)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:727)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3350)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1208)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1042)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
        at org.apache.hadoop.ipc.Client.call(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1455)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
        at com.sun.proxy.$Proxy29.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:965)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy30.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1739)

Fix: open up permissions on /user (quick and dirty; chmod 777 is acceptable on a test cluster, but on anything shared prefer giving each user a directory they own via hdfs dfs -chown, or submit as the hadoop user):

    hadoop fs -chmod -R 777 /user