Below are some commonly used yarn-session parameters and their typical use cases.

Basic parameters

- -n,--container <arg>: number of containers (i.e. TaskManagers) to start. Example: -n 4 starts 4 TaskManagers. (Deprecated in newer Flink versions, which allocate TaskManagers on demand.)
- -jm,--jobManagerMemory <arg>: memory for the JobManager; a plain number is interpreted as MB, or append a unit. Example: -jm 1024 or -jm 1024m.
- -tm,--taskManagerMemory <arg>: memory per TaskManager, in MB (or with a unit). Example: -tm 4096.
- -s,--slots <arg>: number of slots per TaskManager; slots determine how many tasks a TaskManager can execute in parallel. Example: -s 8.
- -d,--detached: run the Flink YARN session in detached (background) mode. Useful for long-running clusters.
- -nm,--name <arg>: a name for the application, shown in the YARN ResourceManager UI. Example: -nm myFlinkApp.
- -yD <property=value>: set a dynamic property, overriding defaults or supplying extra configuration. Example: -yD yarn.applicationMaster.vcores=2.
Resource-related parameters

The -y-prefixed forms below are used when submitting with `flink run` against YARN; each maps to the corresponding yarn-session option.

- -yjm,--yarnjobManagerMemory <arg>: JobManager memory in the YARN setup. Example: -yjm 1024m.
- -ytm,--yarntaskManagerMemory <arg>: memory per TaskManager in the YARN setup. Example: -ytm 4096m.
- -yqu,--yarnqueue <arg>: YARN queue to submit the application to. Example: -yqu default.
- -yst,--yarnstreaming: start the Flink YARN session in streaming mode.
- -yD <property=value>: set dynamic properties, e.g. to adjust vCores (yarn.containers.vcores) or other YARN-specific settings.
Other parameters

- -z,--zookeeperNamespace <arg>: ZooKeeper namespace to use for the high-availability setup.
- -yd,--yarndetached: start the Flink cluster on YARN in detached mode.
- -q,--query: query the resources (memory, cores) available in the YARN cluster.
Here is an example of starting a Flink YARN session that combines some of the basic and resource-related parameters:
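A sketch of such a launch command (the application name `myFlinkApp` and queue `flink` are illustrative; adjust the path to your Flink installation):

```shell
# Start a detached Flink YARN session: 1024 MB for the JobManager,
# 4096 MB per TaskManager, 8 slots each, submitted to the "flink" queue.
./bin/yarn-session.sh -d -nm myFlinkApp -jm 1024m -tm 4096m -s 8 -qu flink
```

In detached mode the client exits after submission; the session keeps running on YARN until you stop the application.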
Common errors

```
Error: Could not find or load main class org.apache.zookeeper.ZooKeeperMain
```

Cause: ZooKeeper's lib directory is empty, so the client classes are missing.

```
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The number of requested virtual cores for application master 1 exceeds the maximum number of virtual cores 0 available in the Yarn Cluster.
```
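"Maximum number of virtual cores 0 available" usually means no NodeManager has successfully registered with the ResourceManager, or the NodeManagers advertise zero vcores. A quick way to check what each node actually offers (requires a running cluster):

```shell
# List registered NodeManagers with their memory/vcore capacities
yarn node -list -showDetails

# Check the effective limits on the target queue
yarn queue -status default
```

If the node list is empty, fix the NodeManagers first (see the NodeManager notes below) before touching vcore settings.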
1. Common configuration files and whether a restart is required

| Config file | Example settings | Restart required? | Notes |
|---|---|---|---|
| core-site.xml | fs.defaultFS, hadoop.tmp.dir | ✅ Yes | Affects global behavior, e.g. the default filesystem and temp directory |
| hdfs-site.xml | dfs.replication, dfs.block.size | ❌ No (partially effective) | New files use the new values; existing files are unchanged |
| hdfs-site.xml | dfs.namenode.name.dir, dfs.datanode.data.dir | ✅ Yes | Changing storage paths requires a restart |
| hdfs-site.xml | dfs.ha.automatic-failover.enabled | ✅ Yes | Core HA configuration change |
| yarn-site.xml | yarn.nodemanager.resource.memory-mb | ✅ Yes | NodeManager resource changes require a restart |
| yarn-site.xml | yarn.resourcemanager.scheduler.class | ✅ Yes | Changing the scheduler type requires an RM restart |
| mapred-site.xml | mapreduce.task.timeout | ❌ No | Job-level parameter; can be set dynamically per job |
| hadoop-env.sh / yarn-env.sh | Memory and JVM options | ✅ Yes | JVM options are loaded at process start, so a restart is needed |
| workers / slaves file | Adding/removing DataNode/NodeManager nodes | ❌ No (new nodes must be started manually) | Existing nodes are unaffected |
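One notable case that does not require a restart: queue definitions in capacity-scheduler.xml can be reloaded on a live ResourceManager.

```shell
# Reload capacity-scheduler.xml without restarting the ResourceManager
yarn rmadmin -refreshQueues

# Reload the node include/exclude lists, likewise without a restart
yarn rmadmin -refreshNodes
```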
- Setting vCores (yarn-site.xml)

```xml
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>
```
Cause 1: the ResourceManager is not running, or is stuck in standby state.

```shell
# Check the ResourceManager state
yarn rmadmin -getServiceState rm1

# Force it to become active
yarn rmadmin -transitionToActive --forcemanual rm1

# Verify which config directory YARN is reading
echo $YARN_CONF_DIR
ls $YARN_CONF_DIR
```

The directory should contain, among others:

- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- capacity-scheduler.xml
- The NodeManager itself has a problem, e.g. its configured local or log directories do not exist.

Restart the NodeManager:

```shell
yarn-daemon.sh stop nodemanager
yarn-daemon.sh start nodemanager
```
2. Confirm the ResourceManager is running and is not in standby. The NameNode HA state can be checked and switched the same way:

```shell
# Force nn2 to active
hdfs haadmin -transitionToActive --forcemanual nn2

# Check the current state
hdfs haadmin -getServiceState nn1   # prints standby
hdfs haadmin -getServiceState nn2   # prints active

# Switch nn1 back to active
hdfs haadmin -transitionToActive --forcemanual nn1

# Re-sync a stale standby's metadata if needed
hdfs namenode -bootstrapStandby

# Confirm the states again
hdfs haadmin -getServiceState nn1   # should be active
hdfs haadmin -getServiceState nn2   # should be standby
```
Commonly used daemon and HA commands:

```shell
# Save NameNode metadata (the NameNode must be in safe mode)
hdfs dfsadmin -saveNamespace

# ZKFC / NameNode / DataNode / JournalNode daemons
hadoop-daemon.sh start zkfc
hadoop-daemon.sh start namenode
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start datanode
hadoop-daemon.sh stop datanode
hadoop-daemon.sh restart datanode
hadoop-daemon.sh start journalnode

# Cluster-wide scripts
start-dfs.sh
stop-dfs.sh
start-yarn.sh
stop-yarn.sh

# YARN daemons
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh start resourcemanager
yarn-daemon.sh stop nodemanager
yarn-daemon.sh start nodemanager
yarn-daemon.sh restart nodemanager

# Inspect nodes and RM HA state
yarn node -list -showDetails
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
yarn rmadmin -transitionToActive --forcemanual rm1

# Reload queue configuration without restarting the RM
yarn rmadmin -refreshQueues
```
Flink state and HA paths (flink-conf.yaml):

```yaml
state.checkpoints.dir: hdfs:///flink/checkpoints
state.savepoints.dir: hdfs:///flink/savepoints
high-availability.storageDir: hdfs:///flink/ha/
```
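These HDFS paths must exist and be writable by the user running Flink. A sketch of pre-creating them (the user name `hadoop` is an assumption; substitute whatever user runs the Flink processes):

```shell
# Pre-create the Flink state directories on HDFS (requires a running cluster)
hdfs dfs -mkdir -p /flink/checkpoints /flink/savepoints /flink/ha

# Hand ownership to the Flink runtime user ("hadoop" here is illustrative)
hdfs dfs -chown -R hadoop /flink
```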
- Confirm the capacity scheduler is in use

yarn-site.xml (RM HA, NodeManager resources, and the scheduler class):

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop-001</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop-002</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop-001:2181,hadoop-002:2181,hadoop-003:2181</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/var/log/hadoop-yarn/nodemanager</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/var/hadoop/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
  </property>
</configuration>
```

capacity-scheduler.xml (queue definitions; note these property names are rooted at `yarn.scheduler.capacity.root.*`, and the capacities of sibling queues must sum to 100, so the 30/30 split below needs adjusting, e.g. to 70/30, before deploying):

```xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,flink</value>
  <description>Child queues under the root queue</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.flink.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.flink.maximum-capacity</name>
  <value>60</value>
</property>
```
- HDFS permission error when the client runs as root:

```
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user":hadoop:supergroup:drwx------
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:422)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:333)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:713)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1892)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1910)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:727)
    at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3350)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1208)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1042)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
    at org.apache.hadoop.ipc.Client.call(Client.java:1558)
    at org.apache.hadoop.ipc.Client.call(Client.java:1455)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
    at com.sun.proxy.$Proxy29.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:965)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy30.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1739)
```
Quick fix: relax the permissions on /user:

```shell
hadoop fs -chmod -R 777 /user
```
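A less drastic alternative: the error above is an EXECUTE (traverse) denial on /user itself, so granting world read/execute on that one directory is enough, without making everything world-writable. A sketch, to be run as the HDFS superuser (`hadoop` in the error above):

```shell
# Allow other users to traverse /user without opening it for writes
hadoop fs -chmod 755 /user

# Optionally give root its own writable home directory
hadoop fs -mkdir -p /user/root
hadoop fs -chown root /user/root
```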