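The previous post briefly showed how to install a multi-node Hadoop cluster, but it did not look closely at the Hadoop configuration or at how to adapt it to a particular environment. This post walks through those settings.
I. Configuration reference
1. Environment configuration for the Hadoop daemons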
Daemon | Environment Variable |
---|---|
NameNode | HDFS_NAMENODE_OPTS |
DataNode | HDFS_DATANODE_OPTS |
Secondary NameNode | HDFS_SECONDARYNAMENODE_OPTS |
ResourceManager | YARN_RESOURCEMANAGER_OPTS |
NodeManager | YARN_NODEMANAGER_OPTS |
WebAppProxy | YARN_PROXYSERVER_OPTS |
Job History Server | MAPRED_HISTORYSERVER_OPTS |
For example, to run the NameNode with ParallelGC and a 4 GB Java heap, add the following line to hadoop-env.sh:
shell
export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"
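Other useful settings:
- HADOOP_PID_DIR: directory where each daemon's PID file is stored.
- HADOOP_LOG_DIR: directory where each daemon's log files are stored. Log files are created automatically if they do not exist.
- HADOOP_HEAPSIZE_MAX: maximum Java heap size for Hadoop processes. By default Hadoop lets the JVM decide how much to use. The value can be overridden per daemon through the matching _OPTS variable; for example, setting HADOOP_HEAPSIZE_MAX=1g and HDFS_NAMENODE_OPTS="-Xmx5g" gives the NameNode a 5 GB heap.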
Note:
In most cases, you should specify the HADOOP_PID_DIR and HADOOP_LOG_DIR directories such that they can only be written to by the users that are going to run the hadoop daemons. Otherwise there is the potential for a symlink attack.
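
One way to apply the note above is to create the two directories with owner-only write permission before any daemon starts. The sketch below uses a throwaway directory from mktemp so that it can be run anywhere; the directory names and the 750 mode are assumptions for illustration, and on a real node the paths would be fixed locations owned by the account that starts the daemons.

```shell
#!/usr/bin/env bash
# Sketch: PID and log directories that only the daemon user can write to.
# base_dir is a throwaway location (assumption, for illustration only).
base_dir="$(mktemp -d)"
pid_dir="$base_dir/pid"
log_dir="$base_dir/logs"

mkdir -p "$pid_dir" "$log_dir"
# 750: owner rwx, group r-x, others nothing, so other users cannot
# plant files or symlinks inside these directories.
chmod 750 "$pid_dir" "$log_dir"

# These two exports are what would go into hadoop-env.sh.
export HADOOP_PID_DIR="$pid_dir"
export HADOOP_LOG_DIR="$log_dir"

ls -ld "$HADOOP_PID_DIR" "$HADOOP_LOG_DIR"
```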
It is also traditional to set HADOOP_HOME in the system-wide shell environment, for example with a simple script in /etc/profile.d:
shell
HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
2. Configuration of the Hadoop daemons
2.1. etc/hadoop/core-site.xml
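Parameter | Value | Notes |
---|---|---|
fs.defaultFS | NameNode URI: hdfs://host:port/ | URI of the default file system that Hadoop client applications connect to and operate on. |
io.file.buffer.size | 131072 | Size, in bytes, of the read/write buffer used in SequenceFiles. |
fs.defaultFS
For example, if fs.defaultFS is set to hdfs://namenode:9000, Hadoop client applications connect by default to the HDFS NameNode on the host named namenode and talk to it on port 9000.
A default file system keeps application code simpler, supports multi-cluster setups, and makes it easy to switch file systems.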
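
The effect of fs.defaultFS can be pictured as prefixing every scheme-less path with the configured URI. The function below only illustrates that idea and is not Hadoop's own resolution code; qualify_path and the sample paths are hypothetical names.

```shell
#!/usr/bin/env bash
# Illustration only: how a path without a scheme maps onto fs.defaultFS.
default_fs="hdfs://namenode:9000"   # value from the example above

# qualify_path <path>: hypothetical helper, not part of Hadoop.
qualify_path() {
  case "$1" in
    *://*) printf '%s\n' "$1" ;;                     # already has a scheme
    /*)    printf '%s%s\n' "${default_fs%/}" "$1" ;; # absolute path
    *)     printf 'unsupported path: %s\n' "$1" >&2; return 1 ;;
  esac
}

qualify_path /data/input            # prints hdfs://namenode:9000/data/input
qualify_path file:///tmp/local.txt  # prints file:///tmp/local.txt
```

Under this setting, `hdfs dfs -ls /data/input` and `hdfs dfs -ls hdfs://namenode:9000/data/input` refer to the same directory.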
2.2. etc/hadoop/hdfs-site.xml
2.2.1. NameNode
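Parameter | Value | Notes |
---|---|---|
dfs.namenode.name.dir | xxx/namenode,xxx/namenode | Local paths where the NameNode persists the namespace and transaction logs. With a comma-separated list, the name table is replicated in every directory for redundancy. |
dfs.blocksize | 134217728 | Block size in bytes. Large file systems often use 256 MB; the value here is 128 MB. |
dfs.namenode.handler.count | 4096 | Number of NameNode server threads; more threads handle RPC requests from a large number of DataNodes. |
dfs.hosts.exclude | xxx/decommission | File listing DataNodes that are not allowed to join; used when removing DataNodes dynamically. |
dfs.webhdfs.enabled | true | Enables or disables the WebHDFS service. |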
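
dfs.blocksize is given in bytes, so 134217728 is 128 × 1024 × 1024. As a rough sketch of what the setting means for storage, the helpers below compute how many blocks a file of a given size occupies. The function names are hypothetical, and the calculation ignores replication.

```shell
#!/usr/bin/env bash
# mib_to_bytes <MiB>: convert mebibytes to the byte value dfs.blocksize expects.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# blocks_for <file_bytes> <block_bytes>: number of blocks, rounded up,
# because the last block of a file may be only partly filled.
blocks_for() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

block_size="$(mib_to_bytes 128)"
echo "dfs.blocksize = $block_size"
echo "1 GiB file   -> $(blocks_for "$(mib_to_bytes 1024)" "$block_size") blocks"
echo "200 MiB file -> $(blocks_for "$(mib_to_bytes 200)" "$block_size") blocks"
```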
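2.2.2. DataNode
Parameter | Value | Notes |
---|---|---|
dfs.datanode.data.dir | xxx/data,xxx/data | Local directories where the DataNode stores block data, that is, the data users upload. |
dfs.datanode.data.dir.perm | 750 | Permissions applied to the dfs.datanode.data.dir directories. |
dfs.block.local-path-access.user | user1,user2 | Comma-separated list of users allowed to open a DataNode's local block files directly. When the client runs on the same host as the DataNode, it can read block files through the local file system instead of over the network, which improves read performance. The list is empty by default. |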
2.3. etc/hadoop/yarn-site.xml
2.3.1. ResourceManager and NodeManager
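Parameter | Value | Notes |
---|---|---|
yarn.acl.enable | true / false | true enables access control lists (ACLs), so access checks apply to YARN operations; false disables ACLs and lets every user perform YARN operations. |
yarn.admin.acl | Admin ACL | ACL that names the cluster administrators, written as comma-separated users, a space, then comma-separated groups. The default is *, which means anyone; a value of a single space means no one has access. |
yarn.log-aggregation-enable | true / false | Enables or disables log aggregation. When true, application logs are collected from the individual nodes into a central location, which makes them easier to manage and search, especially on large clusters, with many applications, or when tracking down a failure. Aggregated logs can be stored in HDFS for later analysis, monitoring, or auditing. |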
2.3.2. ResourceManager
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | Default: ${yarn.resourcemanager.hostname}:8032 | Address the ResourceManager exposes to clients, which use it to submit applications, kill applications, and so on. |
yarn.resourcemanager.scheduler.address | Default: ${yarn.resourcemanager.hostname}:8030 | Address the ResourceManager exposes to ApplicationMasters, which use it to request and release resources. |
yarn.resourcemanager.resource-tracker.address | Default: ${yarn.resourcemanager.hostname}:8031 | Address the ResourceManager exposes to NodeManagers, which use it to send heartbeats and pick up work. |
yarn.resourcemanager.admin.address | Default: ${yarn.resourcemanager.hostname}:8033 | Address the ResourceManager exposes to administrators for administrative commands. |
yarn.resourcemanager.webapp.address | Default: ${yarn.resourcemanager.hostname}:8088 | Address of the ResourceManager web UI, where cluster information can be viewed in a browser. |
yarn.resourcemanager.scheduler.class | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler | Main class of the resource scheduler. Available schedulers are FIFO, Capacity Scheduler, and Fair Scheduler. |
yarn.scheduler.minimum-allocation-mb / yarn.scheduler.maximum-allocation-mb | Default: 1024 / 8192 | Minimum / maximum memory a single container request can be granted. With 1024 and 3072, for example, each task of a MapReduce job can request at least 1024 MB and at most 3072 MB. |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | Default: "" | Files listing permitted / excluded NodeManagers. If some NodeManagers misbehave, for example with a high failure rate or many failed tasks, add them to the exclude list. |
yarn.resourcemanager.hostname | hostname | Single hostname that can be set in place of all the yarn.resourcemanager*address properties; the components then use their default ports. |
yarn.scheduler.maximum-allocation-vcores / yarn.scheduler.minimum-allocation-vcores | 8 / 1 | Maximum / minimum number of virtual cores a container can be started with. |
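
yarn.resourcemanager.hostname works as shorthand: each ResourceManager address falls back to that hostname plus the default port listed in the table. A small sketch of that expansion follows; rm_address, the component labels, and the host name rm-host are hypothetical, while the port numbers are the defaults from the table.

```shell
#!/usr/bin/env bash
# rm_address <hostname> <component>: hypothetical helper showing which
# host:port each ResourceManager endpoint gets when only
# yarn.resourcemanager.hostname is set.
rm_address() {
  local host="$1" port
  case "$2" in
    client)           port=8032 ;;  # yarn.resourcemanager.address
    scheduler)        port=8030 ;;  # yarn.resourcemanager.scheduler.address
    resource-tracker) port=8031 ;;  # yarn.resourcemanager.resource-tracker.address
    admin)            port=8033 ;;  # yarn.resourcemanager.admin.address
    webapp)           port=8088 ;;  # yarn.resourcemanager.webapp.address
    *) echo "unknown component: $2" >&2; return 1 ;;
  esac
  echo "${host}:${port}"
}

for component in client scheduler resource-tracker admin webapp; do
  echo "$component -> $(rm_address rm-host "$component")"
done
```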
2.3.3. NodeManager
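Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Physical memory, in MB, that the NodeManager can hand out; size it from the machine's actual memory (check with free -h) | Total physical memory available to containers on this NodeManager. Once set, the value cannot be changed while the NodeManager is running. The default is 8192 MB, and YARN assumes that much is available even if the machine has less. |
yarn.nodemanager.vmem-pmem-ratio | 2.1 | Maximum virtual memory a task may use per 1 MB of physical memory. The default is 2.1. |
yarn.nodemanager.local-dirs | Default: ${hadoop.tmp.dir}/nm-local-dir | Where intermediate results are stored. Several directories are usually configured to spread disk I/O. |
yarn.nodemanager.log-dirs | /logs | Log directories. Several directories can likewise spread disk I/O. |
yarn.nodemanager.log.retain-seconds | 10800 (3 hours) | Maximum time logs are kept on the NodeManager. Applies only when log aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /log | HDFS directory to which application logs are moved when an application completes. Appropriate permissions must be set. Applies only when log aggregation is enabled. |
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log directory. |
yarn.nodemanager.aux-services | mapreduce_shuffle,spark_shuffle | Auxiliary services run on the NodeManager. mapreduce_shuffle must be listed for MapReduce programs to run. |
yarn.nodemanager.aux-services.spark_shuffle.class | org.apache.spark.network.yarn.YarnShuffleService | Class of the Spark shuffle service, needed for Spark jobs that use it. |
yarn.nodemanager.env-whitelist | Default: JAVA_HOME, HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_CONF_DIR, HADOOP_YARN_HOME | Whitelist of environment variables that may be passed to application containers; variables not on the list are filtered out. The NodeManagers must be restarted for a change to take effect. |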
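
The memory settings interact: yarn.nodemanager.resource.memory-mb caps the total handed out on a node, while yarn.nodemanager.vmem-pmem-ratio sets the virtual-memory ceiling for each container. The arithmetic below is a back-of-the-envelope sketch that uses 61440 MB per node and 2048 MB containers as example figures (assumptions, not recommendations); the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# max_containers <node_mb> <container_mb>: how many containers of a given size
# fit into yarn.nodemanager.resource.memory-mb (memory only, vcores ignored).
max_containers() {
  echo $(( $1 / $2 ))
}

# vmem_limit_mb <container_mb> <ratio>: virtual memory a container may use
# when the NodeManager's virtual-memory check is enabled.
# awk is used because the ratio can be a decimal such as 2.1.
vmem_limit_mb() {
  awk -v pmem="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", pmem * ratio }'
}

echo "containers per node: $(max_containers 61440 2048)"
echo "vmem limit at 2.1:   $(vmem_limit_mb 2048 2.1) MB"
echo "vmem limit at 4:     $(vmem_limit_mb 2048 4) MB"
```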
2.3.4. History Server
Parameter | Value | Notes |
---|---|---|
yarn.log-aggregation.retain-seconds | 2592000 (30 days) | How long aggregated logs are kept, in seconds. |
yarn.log-aggregation.retain-check-interval-seconds | 3600 (1 hour) | Interval, in seconds, between checks of aggregated-log retention. |
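
Both values are plain second counts. The sketch below shows where 2592000 comes from and how the check interval behaves when it is set to 0 or a negative number, in which case it is taken as one-tenth of the retention time according to the property's own description. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# days_to_seconds <days>: convert whole days to seconds.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# effective_check_interval <retain_seconds> <interval_seconds>:
# a non-positive interval falls back to one-tenth of the retention time.
effective_check_interval() {
  if [ "$2" -le 0 ]; then
    echo $(( $1 / 10 ))
  else
    echo "$2"
  fi
}

retain="$(days_to_seconds 30)"
echo "retain-seconds: $retain"
echo "interval (configured 3600): $(effective_check_interval "$retain" 3600)"
echo "interval (configured 0):    $(effective_check_interval "$retain" 0)"
```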
2.4. etc/hadoop/mapred-site.xml
2.4.1. MapReduce Applications
Parameter | Value | Notes |
---|---|---|
mapreduce.framework.name | yarn | Name of the framework that executes MapReduce jobs. With yarn, jobs are managed by the YARN ResourceManager and run on the cluster's NodeManagers. With local, jobs run in local mode on a single node, without cluster resource management. With classic, jobs use the old Hadoop 1.x runtime. The stock default in mapred-default.xml is local, so yarn must be set explicitly. |
mapreduce.map.memory.mb | Default: 1 GB | Memory available to each map task. It bounds the physical memory a map task may use, including data buffers and heap. |
mapreduce.map.java.opts | -Xmx2048m | JVM options for each map task. |
mapreduce.task.io.sort.mb | 512 | Memory, in MB, used for sorting map output. During this phase the data is sorted and written to disk for the reduce tasks. The default is 100 MB per map task. Larger values can speed up sorting but consume more memory. |
mapreduce.task.io.sort.factor | 100 | Number of files merged at once during the sort phase. Intermediate data is spilled to temporary files on disk, which are then merged into the final sorted output. Higher values can speed up the merge but use more memory. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Number of parallel copies a reduce task runs during the shuffle phase. After the map phase, each reduce task fetches intermediate output from the map tasks, copying from several of them in parallel. More parallel copies speed up the transfer but use more network bandwidth and system resources. |
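
mapreduce.map.memory.mb sizes the YARN container, while the -Xmx value in mapreduce.map.java.opts sizes the JVM heap inside it, so the heap is usually set somewhat below the container size to leave room for non-heap memory. The sketch below derives a heap flag from a container size using a ratio of 0.8, which is a common rule of thumb and an assumption here rather than a value taken from this article; heap_opts_for is a hypothetical helper.

```shell
#!/usr/bin/env bash
# heap_opts_for <container_mb> [heap_percent]: -Xmx flag that leaves headroom
# inside the container. heap_percent defaults to 80 (assumed rule of thumb).
heap_opts_for() {
  local container_mb="$1" heap_percent="${2:-80}"
  # Integer arithmetic, so the result is rounded down to whole megabytes.
  echo "-Xmx$(( container_mb * heap_percent / 100 ))m"
}

echo "mapreduce.map.memory.mb=2048 -> mapreduce.map.java.opts=$(heap_opts_for 2048)"
echo "mapreduce.map.memory.mb=4096 -> mapreduce.map.java.opts=$(heap_opts_for 4096)"
```

By this rule of thumb, pairing a 2048 MB container with -Xmx2048m leaves no headroom, so it may be worth either raising the container size or lowering the heap.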
2.4.2. MapReduce JobHistory Server
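Parameter | Value | Notes |
---|---|---|
mapreduce.jobhistory.address | {historynode}:10020 | Address of the JobHistory Server. |
mapreduce.jobhistory.webapp.address | {historynode}:19888 | Web UI address of the JobHistory Server. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where MapReduce jobs write their history files before they are finalized. It does not directly affect job execution or performance. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where the JobHistory Server manages finished history files. It does not directly affect job execution or performance. |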
3. Monitoring Health of NodeManagers
Hadoop provides a mechanism by which administrators can configure the NodeManager to run an administrator supplied script periodically to determine if a node is healthy or not.
How it works:
Administrators can determine if the node is in a healthy state by performing any checks of their choice in the script. If the script detects the node to be in an unhealthy state, it must print a line to standard output beginning with the string ERROR. The NodeManager spawns the script periodically and checks its output. If the script's output contains the string ERROR, as described above, the node's status is reported as unhealthy and the node is black-listed by the ResourceManager. No further tasks will be assigned to this node. However, the NodeManager continues to run the script, so that if the node becomes healthy again, it will be removed from the blacklisted nodes on the ResourceManager automatically. The node's health along with the output of the script, if it is unhealthy, is available to the administrator in the ResourceManager web interface. The time since the node was healthy is also displayed on the web interface.
Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.health-checker.script.path | /path/to/health-check-script.sh | Path to the script. |
yarn.nodemanager.health-checker.script.opts | -option1 value1 -option2 value2 | Options passed to the script. |
yarn.nodemanager.health-checker.interval-ms | ms | How often the script runs, in milliseconds. |
yarn.nodemanager.health-checker.script.timeout-ms | ms | Timeout for one run of the script, in milliseconds. |
The health checker script is not supposed to give ERROR if only some of the local disks become bad. The NodeManager periodically checks the health of the local disks itself (specifically the directories in yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs); once the number of bad directories reaches the threshold set by yarn.nodemanager.disk-health-checker.min-healthy-disks, the whole node is marked unhealthy and the ResourceManager is informed. The boot disk is either in a RAID array, or a failure of the boot disk is detected by the health checker script.
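
A minimal health-check script might look like the sketch below. It checks only the root (boot) file system, since the NodeManager already monitors its own local and log directories; the 95% threshold and the function names are assumptions for illustration. The one hard requirement taken from the text above is that an unhealthy result is reported as a line on standard output beginning with ERROR.

```shell
#!/usr/bin/env bash
# Hypothetical NodeManager health-check script (sketch).
MAX_ROOT_USAGE_PCT=95   # assumed threshold, tune per cluster

# check_usage <label> <used_percent> <limit_percent>:
# prints an ERROR line when usage reaches the limit, nothing otherwise.
check_usage() {
  if [ "$2" -ge "$3" ]; then
    echo "ERROR $1 usage ${2}% is at or above ${3}%"
  fi
}

# root_usage_pct: percentage of the root file system in use, digits only.
root_usage_pct() {
  df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_usage "root filesystem" "$(root_usage_pct)" "$MAX_ROOT_USAGE_PCT"
```

Such a script would be referenced from yarn.nodemanager.health-checker.script.path and must be executable by the user that runs the NodeManager.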
II. Configuration examples
1. core-site.xml
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenodeip:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
2. hdfs-site.xml
xml
<configuration>
<!-- ===========namenode=========== -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/data/hdfs/namenode,/opt/data02/hdfs/namenode</value>
<description>If this is a comma-delimited list of directories then the name table is replicated in all of the
directories, for redundancy.
Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.
</description>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
<description>HDFS block size in bytes. 256 MB is common for large file systems; this example uses 128 MB.
</description>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>4096</value>
<description>
More NameNode server threads to handle RPCs from large number of DataNodes.
</description>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/home/${user.name}/software/hadoop/etc/hadoop/decommission</value>
<description>File listing excluded DataNodes. If necessary, use these files to control the list of allowable datanodes.
</description>
</property>
<!-- ===========namenode=========== -->
<!-- ===========datanode=========== -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/data/hdfs/data,/opt/data02/hdfs/data</value>
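<description>Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.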
</description>
</property>
<!-- ===========datanode=========== -->
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>750</value>
</property>
<property>
<name>dfs.block.local-path-access.user</name>
<value>taiyi,hbase</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
3. yarn-site.xml
xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- ResourceManager and NodeManager -->
<property>
<name>yarn.acl.enable</name>
<value>false</value>
<description>
Enable ACLs? Defaults to false.
</description>
</property>
<property>
<name>yarn.admin.acl</name>
<value>yarn</value>
<description>
ACL to set admins on the cluster.
ACLs are of the form comma-separated-users, a space, then comma-separated-groups.
Defaults to special value of * which means anyone.
Special value of just space means no one has access.
</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>
Configuration to enable or disable log aggregation.
</description>
</property>
<!-- ResourceManager and NodeManager -->
<!-- Configurations for ResourceManager: -->
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanagerIp:8832</value>
<description>
ResourceManager host:port for clients to submit jobs.
If set, overrides the hostname set in yarn.resourcemanager.hostname.
</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resourcemanagerIp:8830</value>
<description>
ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources.
</description>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanagerIp:8831</value>
<description>
ResourceManager host:port for NodeManagers.
host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.
</description>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resourcemanagerIp:8833</value>
<description>
ResourceManager host:port for administrative commands.
</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>resourcemanagerIp:8888</value>
<description>
ResourceManager web-ui host:port.
</description>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanagerHostname</value>
<description>
ResourceManager host.
Single hostname that can be set in place of setting all yarn.resourcemanager*address resources.
Results in default ports for ResourceManager components.
</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
<description>
ResourceManager Scheduler class.
CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler.
Use a fully qualified class name, e.g.,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
<description>
Minimum limit of memory to allocate to each container request at the Resource Manager.
In MBs.
</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
<description>
Maximum limit of memory to allocate to each container request at the Resource Manager.
</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Configurations for ResourceManager: -->
<!-- Configurations for NodeManager: -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>61440</value>
<description>
Resource i.e. available physical memory, in MB, for given NodeManager
Defines total available resources on the NodeManager to be made available to running containers
</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>
Maximum ratio by which virtual memory usage of tasks may exceed physical memory
The virtual memory usage of each task may exceed its physical memory limit by this ratio.
The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by
this ratio.
</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/yarn/nm-local-dir,/data02/yarn/nm-local-dir</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/home/taiyi/hadoop/yarn/userlogs</value>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
<description>
Default time (in seconds) to retain log files on the NodeManager
Only applicable if log-aggregation is disabled.
</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/home/taiyi/hadoop/yarn/containerlogs</value>
<description>
HDFS directory where the application logs are moved on application completion.
Need to set appropriate permissions. Only applicable if log-aggregation is enabled.
</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
<description>
Suffix appended to the remote log dir.
Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}
Only applicable if log-aggregation is enabled.
</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
<description>
Shuffle service that needs to be set for Map Reduce applications.
</description>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<!-- Configurations for NodeManager: -->
<!--Configurations for History Server (Needs to be moved elsewhere): -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>2592000</value>
<description>
How long to keep aggregation logs before deleting them. -1 disables.
Be careful, set this too small and you will spam the name node.
</description>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>3600</value>
<description>
Time between checks for aggregated log retention.
If set to 0 or a negative value then the value is computed
as one-tenth of the aggregated log retention time.
Be careful, set this too small and you will spam the name node.
</description>
</property>
</configuration>
4. mapred-site.xml
xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Configurations for MapReduce Applications: -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>
Execution framework set to Hadoop YARN.
</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
<description>
Larger resource limit for maps.
</description>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx2048m</value>
<description>
Larger heap-size for child jvms of maps.
</description>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value>
<description>
Higher memory-limit while sorting data for efficiency.
</description>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
<description>More streams merged at once while sorting files.
</description>
</property>
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>50</value>
<description>Higher number of parallel copies run
by reduces to fetch outputs from very large number of maps.
</description>
</property>
<!-- Configurations for MapReduce Applications: -->
<!--Configurations for MapReduce JobHistory Server:-->
<property>
<name>mapreduce.jobhistory.address</name>
<value>jobhistorynodeIp:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>jobhistorynodeIp:19888</value>
</property>
<!-- <property>-->
<!-- <name>mapreduce.jobhistory.intermediate-done-dir</name>-->
<!-- <value>/home/taiyi/yarn/mrhistory/tmp</value>-->
<!-- <description>-->
<!-- Directory where history files are written by MapReduce jobs-->
<!-- </description>-->
<!-- </property>-->
<!-- <property>-->
<!-- <name>mapreduce.jobhistory.done-dir</name>-->
<!-- <value>/home/taiyi/yarn/mrhistory/done</value>-->
<!-- <description>-->
<!-- Directory where history files are managed by the MR JobHistory Server.-->
<!-- </description>-->
<!-- </property>-->
<!--Configurations for MapReduce JobHistory Server:-->
<!-- Lets running tasks locate the Hadoop installation through environment variables -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>