Hadoop 3.3.5 + Flink 1.15.3 Cluster Deployment Manual (3-Node Standard Edition)

📋 Document Overview

This manual is compiled from every issue encountered during hands-on deployment and includes detailed pitfall-avoidance notes. The environment is three VMs (2 CPU / 4 GB RAM each) with the following IP plan:

  • 192.168.171.129 worker01 (master: NameNode / ResourceManager / JobManager)
  • 192.168.171.130 worker02 (worker: standby NameNode / DataNode / NodeManager)
  • 192.168.171.131 worker03 (additional worker: DataNode / NodeManager)

Chapter 1: Base Environment Setup

1.1 Hostname and Network Configuration

Run on all nodes:

bash

# Set hostnames (optional)
hostnamectl set-hostname worker01  # run on worker01
hostnamectl set-hostname worker02  # run on worker02
hostnamectl set-hostname worker03  # run on worker03

# Configure /etc/hosts
cat >> /etc/hosts << EOF
192.168.171.129 worker01
192.168.171.130 worker02
192.168.171.131 worker03
EOF

# Disable the firewall (test environment only; in production, open the required ports instead)
systemctl stop firewalld
systemctl disable firewalld  # keep it off across reboots
systemctl status firewalld

1.2 Passwordless SSH Login

Run on worker01:

bash

# Generate a key pair
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Copy the public key to every node (including worker01 itself)
ssh-copy-id worker01
ssh-copy-id worker02
ssh-copy-id worker03

# Verify
ssh worker02 "hostname"  # should print worker02 without prompting for a password
ssh worker03 "hostname"  # should print worker03 without prompting for a password

1.3 JDK Installation and Configuration

⚠️ Key pitfall: JAVA_HOME must point at the JDK root (not the JRE), and PATH must include its bin directory

bash

# Install JDK 1.8 (all nodes)
yum install -y java-1.8.0-openjdk-devel

# Find the real JDK path
readlink -f /usr/bin/java
# /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin/java
# Keep everything before /jre: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64

# Configure environment variables (all nodes)
cat >> ~/.bashrc << EOF
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
export PATH=\$PATH:\$JAVA_HOME/bin
EOF
source ~/.bashrc

# Verify
java -version
javac -version  # javac must exist; that proves this is a JDK, not a JRE
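The JDK root can also be derived from the resolved path instead of being copy-pasted by hand. A minimal sketch of the string manipulation, using the sample path from the `readlink` output above (on your machine, substitute the real output):

```shell
# Sample resolved path from `readlink -f /usr/bin/java` (adjust to your output)
JAVA_BIN=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin/java

# Strip the trailing jre/bin/java suffix to get the JDK root for JAVA_HOME
JAVA_HOME_CANDIDATE="${JAVA_BIN%/jre/bin/java}"
echo "$JAVA_HOME_CANDIDATE"
```

This avoids typos when the OpenJDK package version changes after an update.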

1.4 Directory Layout (all nodes)

bash

# Create the deployment directory (adjust to your environment)
mkdir -p /root/engine

# Create a unified data directory tree
mkdir -p /data/zookeeper
mkdir -p /data/hadoop/{logs,pids,tmp,namenode,datanode,journal,yarn}
mkdir -p /data/flink/logs

# Inspect the layout
tree /data -L 2
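The brace expansion above creates all seven Hadoop subdirectories in one call. A quick way to see the effect without touching `/data` (a throwaway scratch directory is assumed writable):

```shell
# Recreate the hadoop layout under a scratch directory to inspect the result
base=$(mktemp -d)
mkdir -p "$base"/hadoop/{logs,pids,tmp,namenode,datanode,journal,yarn}
layout=$(ls "$base"/hadoop | sort | tr '\n' ' ')
echo "$layout"
rm -rf "$base"
```

The listing shows all seven directories created by the single `mkdir -p` call.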

Chapter 2: ZooKeeper Cluster Deployment

2.1 Download and Install

Run on every node:

bash
cd ~/engine
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.1/apache-zookeeper-3.8.1-bin.tar.gz
tar -zxvf apache-zookeeper-3.8.1-bin.tar.gz
mv apache-zookeeper-3.8.1-bin zookeeper

2.2 ZooKeeper Configuration

Create the config file (all nodes):

bash

cd /root/engine/zookeeper/conf
cp zoo_sample.cfg zoo.cfg

# Write zoo.cfg
cat > zoo.cfg << EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
server.1=worker01:2888:3888
server.2=worker02:2888:3888
server.3=worker03:2888:3888
EOF
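ZooKeeper stays available only while a strict majority of the ensemble is up, which is why three servers (tolerating one failure) is the smallest sensible cluster. The arithmetic behind that:

```shell
# Majority quorum for an N-server ensemble: floor(N/2) + 1
N=3
majority=$(( N / 2 + 1 ))
tolerated=$(( N - majority ))
echo "3-node ensemble: needs $majority servers up, tolerates $tolerated failure(s)"
```

The same formula shows why 4 servers still tolerate only 1 failure, so odd ensemble sizes are the norm.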

2.3 Create the myid File

bash

# Run on worker01
echo "1" > /data/zookeeper/myid

# Run on worker02
echo "2" > /data/zookeeper/myid

# Run on worker03
echo "3" > /data/zookeeper/myid
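Since the hostnames here follow a worker0N pattern, the myid value can also be derived instead of typed per node. A sketch under that naming assumption (only valid for worker01 through worker09; the `host` value below is a hard-coded sample, on a real node use `host=$(hostname)`):

```shell
# Derive the ZooKeeper myid from a worker0N hostname.
# Sample value for illustration; on a real node: host=$(hostname)
host=worker02
id="${host#worker0}"   # strip the worker0 prefix, leaving the digit
echo "$id"             # this is what you would write to /data/zookeeper/myid
```

This makes the myid step copy-paste identical on every node.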

2.4 Environment Variables

bash
cat >> ~/.bashrc << EOF
export ZOOKEEPER_HOME=/root/engine/zookeeper
export PATH=\$PATH:\$ZOOKEEPER_HOME/bin
EOF
source ~/.bashrc

2.5 Distribute to worker02 and worker03

bash

# Distribute ZooKeeper
scp -r /root/engine/zookeeper worker02:/root/engine/
scp -r /root/engine/zookeeper worker03:/root/engine/

# Distribute the environment variables
scp ~/.bashrc worker02:~/
scp ~/.bashrc worker03:~/

# .bashrc takes effect on each node's next login; a bare `ssh node "source ~/.bashrc"`
# only affects that single ssh session. Verify inside a fresh session instead:
ssh worker02 "source ~/.bashrc && which zkServer.sh"
ssh worker03 "source ~/.bashrc && which zkServer.sh"

2.6 Start ZooKeeper

⚠️ Note: ZooKeeper must be started by hand on every machine

bash

# Run on every node
zkServer.sh start

# Check status (run on every node)
zkServer.sh status
# Expected: one node shows Mode: leader, the other two show Mode: follower
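The interesting part of the status output is the `Mode:` line. The parsing is shown below against a captured sample (in real use, the input would come from `ssh workerN 'zkServer.sh status'`; the sample text here is illustrative):

```shell
# Sample `zkServer.sh status` output (captured text; real runs go over ssh)
status_output="ZooKeeper JMX enabled by default
Using config: /root/engine/zookeeper/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower"

# Extract just the mode for scripted checks
mode=$(printf '%s\n' "$status_output" | awk -F': ' '/^Mode:/{print $2}')
echo "$mode"
```

Looping this over the three nodes gives a one-terminal leader/follower check.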

Chapter 3: Hadoop 3.3.5 HA Cluster Deployment

3.1 Download and Install

Run on worker01:

bash

cd /root/engine
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
tar -zxvf hadoop-3.3.5.tar.gz

3.2 Environment Variables

Run on all nodes:

bash
cat >> ~/.bashrc << EOF
export HADOOP_HOME=/root/engine/hadoop-3.3.5
export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=/data/hadoop/logs
export HADOOP_PID_DIR=/data/hadoop/pids
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
export HADOOP_CLASSPATH=\`hadoop classpath\`
EOF
source ~/.bashrc

3.3 Edit hadoop-env.sh

⚠️ Critical: the *_USER variables must be set, or the daemons refuse to start as root

bash

vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Add the following (adjust paths to your environment)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
export HADOOP_LOG_DIR=/data/hadoop/logs
export HADOOP_PID_DIR=/data/hadoop/pids

# Key: set the startup users (fixes "ERROR: Attempting to operate as root")
# Not needed when starting as a non-root user
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# The local VMs are short on memory, so shrink the daemon heaps
export HDFS_NAMENODE_OPTS="-Xms512m -Xmx512m"
export HDFS_DATANODE_OPTS="-Xms256m -Xmx256m"
export YARN_RESOURCEMANAGER_OPTS="-Xms512m -Xmx512m"
export YARN_NODEMANAGER_OPTS="-Xms256m -Xmx256m"
export HADOOP_JOB_HISTORYSERVER_OPTS="-Xms128m -Xmx128m"

3.4 core-site.xml Configuration

xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop/tmp</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>worker01:2181,worker02:2181,worker03:2181</value>
    </property>
    <property>
        <name>ha.zookeeper.session-timeout.ms</name>
        <value>5000</value>
    </property>
</configuration>

3.5 hdfs-site.xml Configuration (HA Core)

xml
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- NameNode HA settings -->
    <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>worker01:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>worker02:8020</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.nn1</name>
        <value>worker01:9870</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.nn2</name>
        <value>worker02:9870</value>
    </property>
    
    <!-- JournalNode settings (standard 3-node layout) -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://worker01:8485;worker02:8485;worker03:8485/mycluster</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/data/hadoop/journal</value>
    </property>
    
    <!-- Failover settings -->
    <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <!-- Adjust to your actual private key path -->
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
    </property>
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    
    <!-- Data directories -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/datanode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>  <!-- 2 on a 3-node cluster balances safety and disk space -->
    </property>
</configuration>

3.6 yarn-site.xml Configuration

xml
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- ResourceManager HA settings -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yrc</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>worker01</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>worker02</value>
    </property>
    
    <!-- ZooKeeper quorum -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>worker01:2181,worker02:2181,worker03:2181</value>
    </property>
    
    <!-- NodeManager settings -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    
    <!-- Memory settings (tuned for 4 GB VMs) -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>3072</value>  <!-- leave 1 GB for the OS -->
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>     <!-- 2 cores -->
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
    </property>
    
    <!-- Disable memory checks (prevents containers from being killed by mistake) -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    
    <!-- Log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://worker01:19888/jobhistory/logs</value>
    </property>
</configuration>
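With the values above, each NodeManager advertises 3072 MB, container requests are rounded up to multiples of the 512 MB minimum allocation, and no single container may exceed 2048 MB. A quick sanity check of what fits on one node (plain arithmetic on the configured values):

```shell
# Values from the yarn-site.xml above
NM_MEM_MB=3072     # yarn.nodemanager.resource.memory-mb
MIN_ALLOC_MB=512   # yarn.scheduler.minimum-allocation-mb
MAX_ALLOC_MB=2048  # yarn.scheduler.maximum-allocation-mb

# How many minimum-size containers fit per node, and whether the max fits at all
small_containers=$(( NM_MEM_MB / MIN_ALLOC_MB ))
echo "up to $small_containers x ${MIN_ALLOC_MB}MB containers per node"
[ "$MAX_ALLOC_MB" -le "$NM_MEM_MB" ] && echo "max allocation fits on a single node"
```

If `maximum-allocation-mb` exceeded the NodeManager memory, any request at that size could never be scheduled, which is one cause of the "stuck in ACCEPTED" symptom covered in Chapter 6.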

3.7 mapred-site.xml Configuration

xml
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>worker01:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>worker01:19888</value>
    </property>
</configuration>

3.8 capacity-scheduler.xml Configuration

xml

vim $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml

<!-- Change DefaultResourceCalculator to DominantResourceCalculator -->
<property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    <description>
        The ResourceCalculator implementation to be used to compare
        Resources in the scheduler.
        The default i.e. DefaultResourceCalculator only uses Memory while
        DominantResourceCalculator uses dominant-resource to compare
        multi-dimensional resources such as Memory, CPU etc.
    </description>
</property>

3.9 The workers File

bash

vim $HADOOP_HOME/etc/hadoop/workers
# Add the following lines
worker01
worker02
worker03

3.10 Distribute to worker02 and worker03

bash

# Distribute Hadoop
scp -r /root/engine/hadoop-3.3.5 worker02:/root/engine/
scp -r /root/engine/hadoop-3.3.5 worker03:/root/engine/

# Distribute the environment variables
scp ~/.bashrc worker02:~/
scp ~/.bashrc worker03:~/

# .bashrc takes effect on each node's next login; verify inside a fresh session:
ssh worker02 "source ~/.bashrc && hadoop version"
ssh worker03 "source ~/.bashrc && hadoop version"

3.11 Format and Start (⚠️ strict order)

Step 1: Start the JournalNodes (all nodes)

bash

# Run on worker01, worker02, and worker03
$HADOOP_HOME/bin/hdfs --daemon start journalnode
jps | grep JournalNode  # confirm the process is running

Step 2: Format the NameNode (worker01 only)

bash
hdfs namenode -format

Step 3: Start the NameNode on worker01

bash
$HADOOP_HOME/bin/hdfs --daemon start namenode

Step 4: Sync the standby NameNode (run on worker02)

bash
hdfs namenode -bootstrapStandby

Step 5: Start the NameNode on worker02

bash
$HADOOP_HOME/bin/hdfs --daemon start namenode

Step 6: Format ZKFC (on either NameNode host)

bash

hdfs zkfc -formatZK  # ⚠️ easy to forget; skipping it leaves both NameNodes in standby

Step 7: Start all DataNodes

bash

$HADOOP_HOME/sbin/start-dfs.sh  # starts every DataNode automatically
# Or start them by hand on each machine:
# $HADOOP_HOME/bin/hdfs --daemon start datanode

Step 8: Start ZKFC (both NameNode hosts)

bash

# Run on worker01 and worker02
$HADOOP_HOME/bin/hdfs --daemon start zkfc

Step 9: Start YARN

bash
$HADOOP_HOME/sbin/start-yarn.sh

Step 10: Start the JobHistoryServer

bash

# Run on worker01
$HADOOP_HOME/bin/mapred --daemon start historyserver

3.12 Verify the Cluster

bash

# Check the processes on every node
echo "=== worker01 processes ==="
ssh worker01 'jps | grep -E "NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer"'
echo "=== worker02 processes ==="
ssh worker02 'jps | grep -E "NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain"'
echo "=== worker03 processes ==="
ssh worker03 'jps | grep -E "DataNode|JournalNode|NodeManager|QuorumPeerMain"'

# Check the NameNode states
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Expected: one active, one standby

# Check the ResourceManager states
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# Expected: one active, one standby

# Check the YARN nodes
yarn node -list
# Expected: all three nodes RUNNING

# Check HDFS health
hdfs dfsadmin -report

Chapter 4: Flink 1.15.3 Deployment

4.1 Download and Install

Run on worker01:

bash
cd /root/engine
wget https://archive.apache.org/dist/flink/flink-1.15.3/flink-1.15.3-bin-scala_2.12.tgz
tar -zxvf flink-1.15.3-bin-scala_2.12.tgz

4.2 Environment Variables

Run on all nodes:

bash
cat >> ~/.bashrc << EOF
export FLINK_HOME=/root/engine/flink-1.15.3
export PATH=\$PATH:\$FLINK_HOME/bin
EOF
source ~/.bashrc

4.3 flink-conf.yaml Configuration

⚠️ Key pitfall: keep the memory settings minimal, or the derived values will conflict

yaml
# Basic communication
jobmanager.rpc.address: worker01
jobmanager.bind-host: 0.0.0.0
taskmanager.bind-host: 0.0.0.0

# Simplified memory settings (set only process.size; everything else is derived)
jobmanager.memory.process.size: 1024m
taskmanager.memory.process.size: 1024m

# Slots (tuned for 4 GB VMs)
taskmanager.numberOfTaskSlots: 1
parallelism.default: 2

# Web UI
rest.port: 8081
rest.address: worker01
rest.bind-address: 0.0.0.0
web.submit.enable: true

# History Server
historyserver.web.enabled: true
historyserver.web.address: worker01
historyserver.web.port: 8082
jobmanager.archive.fs.dir: hdfs://mycluster/flink/completed-jobs/
historyserver.archive.fs.dir: hdfs://mycluster/flink/completed-jobs/

# Miscellaneous
classloader.check-leaked-classloader: false
env.java.opts: -Dfile.encoding=UTF-8

#==============================================================================
# High Availability (HA mode)
#==============================================================================

# Enable ZooKeeper-based HA
high-availability: zookeeper

# ZooKeeper quorum (all three nodes)
high-availability.zookeeper.quorum: worker01:2181,worker02:2181,worker03:2181

# Metadata storage path (must use the correct HDFS HA nameservice)
high-availability.storageDir: hdfs://mycluster/flink/ha/

# Root znode for Flink metadata in ZooKeeper
high-availability.zookeeper.path.root: /flink

# Cluster ID (free-form; distinguishes multiple Flink clusters)
high-availability.cluster-id: /flink_cluster_one

#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================

# State backend (hashmap is lighter; recommended on 4 GB VMs)
state.backend: hashmap

# Checkpoint directory
state.checkpoints.dir: hdfs://mycluster/flink/checkpoints

# Savepoint directory
state.savepoints.dir: hdfs://mycluster/flink/savepoints

# Checkpoint interval (leave commented out to disable checkpointing)
# execution.checkpointing.interval: 60000

# Failure recovery strategy
jobmanager.execution.failover-strategy: region

4.4 Configure masters and workers

bash

# masters file (on worker01)
echo "worker01:8081" > $FLINK_HOME/conf/masters

# workers file
cat > $FLINK_HOME/conf/workers << EOF
worker01
worker02
worker03
EOF

4.5 Distribute to worker02 and worker03

bash

scp -r /root/engine/flink-1.15.3 worker02:/root/engine/
scp -r /root/engine/flink-1.15.3 worker03:/root/engine/

# Distribute the environment variables
scp ~/.bashrc worker02:~/
scp ~/.bashrc worker03:~/

# .bashrc takes effect on each node's next login; verify inside a fresh session:
ssh worker02 "source ~/.bashrc && which flink"
ssh worker03 "source ~/.bashrc && which flink"

4.6 Start the History Server

bash
$FLINK_HOME/bin/historyserver.sh start

Chapter 5: Verification Tests

5.1 Submit a WordCount Job

bash

cd $FLINK_HOME/bin

# Simplest test (no memory flags; uses the defaults from flink-conf.yaml)
./flink run -m yarn-cluster ../examples/batch/WordCount.jar

# On success you should see output like:
# YARN application has been deployed successfully.
# Job has been submitted with JobID xxxxx

5.2 Submit with Parameters

bash

# Specify memory and parallelism
./flink run -m yarn-cluster \
  -yjm 1024 \
  -ytm 1024 \
  -ys 1 \
  -p 2 \
  ../examples/batch/WordCount.jar

5.3 Verify Cluster Status

bash

# 1. List YARN applications
yarn application -list | grep FLINK

# 2. Open the Web UIs
# YARN: http://worker01:8088
# Flink: http://worker01:8081
# HistoryServer: http://worker01:8082
# HDFS: http://worker01:9870

# 3. Inspect the job output
yarn logs -applicationId application_xxx | grep -A 20 "WordCount"

# 4. Smoke-test HDFS directory creation
hdfs dfs -mkdir -p /flink/test
hdfs dfs -ls /

Chapter 6: Common Problems and Fixes (Pitfall Compendium)

6.1 SLF4J Multiple-Bindings Warning

Symptom:

text

SLF4J: Class path contains multiple SLF4J bindings.

Cause: Hadoop and Flink each ship their own SLF4J binding.

Fix: Harmless; it can be ignored. Alternatively, filter the extra binding out of HADOOP_CLASSPATH:

bash
export HADOOP_CLASSPATH=$(hadoop classpath | tr ':' '\n' | grep -v 'slf4j-reload4j' | tr '\n' ':')
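What that pipeline actually does is easiest to see on a toy classpath (the jar paths below are made up for illustration; the real input comes from `hadoop classpath`):

```shell
# Illustrative classpath containing an slf4j-reload4j binding
CP="/opt/a/slf4j-reload4j-1.7.36.jar:/opt/b/hadoop-common-3.3.5.jar:/opt/c/log4j-1.2.17.jar"

# Same tr/grep/tr pipeline as above: split on ':', drop the offending jar, rejoin
FILTERED=$(echo "$CP" | tr ':' '\n' | grep -v 'slf4j-reload4j' | tr '\n' ':')
echo "$FILTERED"
```

Note the rebuilt list keeps a trailing colon, which the JVM classpath parser tolerates.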

6.2 JAVA_HOME Misconfigured

Symptom:

text

ERROR: JAVA_HOME /usr/bin/java does not exist.

Cause: JAVA_HOME points at the java executable instead of the JDK directory.

Fix:

bash

# Correct setting
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64

6.3 Running as root Requires the *_USER Variables

Symptom:

text

ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined.

Fix: Add to hadoop-env.sh:

bash
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

6.4 JournalNode Connection Refused

Symptom:

text

Retrying connect to server: worker01:8485 failed

Cause: The JournalNodes are not running, or were started in the wrong order.

Fix: Follow the order strictly: start the JournalNodes first, then format/start the NameNodes.

6.5 Both NameNodes Stuck in Standby

Symptom:

text

hdfs haadmin -getServiceState nn1 = standby
hdfs haadmin -getServiceState nn2 = standby

Cause: hdfs zkfc -formatZK was never run.

Fix:

bash

hdfs zkfc -formatZK
# Then restart ZKFC
$HADOOP_HOME/bin/hdfs --daemon stop zkfc
$HADOOP_HOME/bin/hdfs --daemon start zkfc
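A small guard that detects the double-standby state from the two reported values can make this check scriptable. The states below are hard-coded samples; in real use they would come from the two `hdfs haadmin -getServiceState` calls:

```shell
# Sample states as hdfs haadmin would report them. On a live cluster:
#   s1=$(hdfs haadmin -getServiceState nn1); s2=$(hdfs haadmin -getServiceState nn2)
s1=standby
s2=standby

if [ "$s1" = "standby" ] && [ "$s2" = "standby" ]; then
  msg="double standby detected: run 'hdfs zkfc -formatZK' and restart ZKFC"
else
  msg="HA state looks normal ($s1/$s2)"
fi
echo "$msg"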

6.6 Flink Memory Configuration Conflict

Symptom:

text

The configured Total Flink Memory (73.000mb) is less than the configured Off-heap Memory (128.000mb)

Cause: Several memory options were set at once, and the derived values contradict each other.

Fix: Simplify the configuration; keep only jobmanager.memory.process.size (and its taskmanager counterpart).
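The error arises because Flink derives Total Flink Memory as the process size minus JVM metaspace and JVM overhead; if the leftover is smaller than the configured off-heap memory, startup fails. A rough back-of-the-envelope with commonly assumed defaults (256m metaspace, 192m minimum JVM overhead, 128m off-heap; check your Flink version's actual defaults):

```shell
PROCESS_MB=1024   # jobmanager.memory.process.size from the config above
METASPACE_MB=256  # assumed default jvm-metaspace
OVERHEAD_MB=192   # assumed minimum jvm-overhead
OFFHEAP_MB=128    # assumed default off-heap

# Total Flink Memory left after JVM metaspace and overhead are carved out
TOTAL_FLINK_MB=$(( PROCESS_MB - METASPACE_MB - OVERHEAD_MB ))
echo "total flink memory: ${TOTAL_FLINK_MB}m"
[ "$TOTAL_FLINK_MB" -ge "$OFFHEAP_MB" ] && echo "off-heap fits; no conflict"
```

With 1024m process size the leftover comfortably exceeds 128m, which is why setting only process.size avoids the conflict.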

6.7 NodeManager Out of Sync

Symptom:

text

Node is out of sync with ResourceManager, hence resyncing.
Active Nodes: 2

Fix: Restart every NodeManager:

bash

# Run on all nodes
$HADOOP_HOME/bin/yarn --daemon stop nodemanager
$HADOOP_HOME/bin/yarn --daemon start nodemanager

6.8 Containers Short on Memory

Symptom:

The job sits waiting after submission; YARN shows the application stuck in the ACCEPTED state.

Fix: Adjust the memory settings in yarn-site.xml:

xml
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
</property>

6.9 DataNode on worker03 Not Running

Symptom:

text

hdfs dfsadmin -report shows only 2 DataNodes

Fix: Start the DataNode on worker03 by hand:

bash
ssh worker03 "$HADOOP_HOME/bin/hdfs --daemon start datanode"

Chapter 7: Day-to-Day Operations

7.1 Start Commands (strict order)

bash

# 1. Start ZooKeeper (all nodes)
zkServer.sh start

# 2. Start the JournalNodes (all nodes)
$HADOOP_HOME/bin/hdfs --daemon start journalnode

# 3. Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh

# 4. Start ZKFC (NameNode hosts)
$HADOOP_HOME/bin/hdfs --daemon start zkfc

# 5. Start YARN
$HADOOP_HOME/sbin/start-yarn.sh

# 6. Start the JobHistoryServer
$HADOOP_HOME/bin/mapred --daemon start historyserver

# 7. Start the Flink HistoryServer
$FLINK_HOME/bin/historyserver.sh start

7.2 Stop Commands (reverse order)

bash

$FLINK_HOME/bin/historyserver.sh stop
$HADOOP_HOME/bin/mapred --daemon stop historyserver
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
# JournalNode and ZooKeeper must be stopped separately
$HADOOP_HOME/bin/hdfs --daemon stop journalnode
zkServer.sh stop

7.3 One-Click Start Script

Create start-all-custom.sh:

bash
#!/bin/bash

echo "=================================================="
echo "== Starting Hadoop + Flink cluster services (3-node HA) =="
echo "=================================================="
echo "Start time: $(date +'%Y-%m-%d %H:%M:%S')"
echo ""

# Step 1: start ZooKeeper
echo "[Step 1/8] Starting the ZooKeeper ensemble..."
ssh worker01 "zkServer.sh start"
ssh worker02 "zkServer.sh start"
ssh worker03 "zkServer.sh start"
sleep 5

# Step 2: start the JournalNodes
echo "[Step 2/8] Starting the JournalNodes..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
ssh worker03 "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
sleep 3

# Step 3: start HDFS
echo "[Step 3/8] Starting HDFS..."
$HADOOP_HOME/sbin/start-dfs.sh
sleep 5

# Step 4: start ZKFC
echo "[Step 4/8] Starting ZKFC..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon start zkfc"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon start zkfc"
sleep 3

# Step 5: start YARN
echo "[Step 5/8] Starting YARN..."
$HADOOP_HOME/sbin/start-yarn.sh
sleep 3

# Step 6: start the JobHistoryServer
echo "[Step 6/8] Starting the MapReduce JobHistoryServer..."
$HADOOP_HOME/bin/mapred --daemon start historyserver
sleep 2

# Step 7: start the Flink HistoryServer
echo "[Step 7/8] Starting the Flink HistoryServer..."
$FLINK_HOME/bin/historyserver.sh start
sleep 2

# Step 8: check the processes
echo ""
echo "[Step 8/8] Checking service processes..."

echo "--- worker01 processes ---"
ssh worker01 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer|JobHistoryServer'"

echo "--- worker02 processes ---"
ssh worker02 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain'"

echo "--- worker03 processes ---"
ssh worker03 "jps | grep -E 'DataNode|JournalNode|NodeManager|QuorumPeerMain'"

echo ""
echo "=================================================="
echo "== Cluster started! =="
echo "== YARN WebUI:  http://worker01:8088 =="
echo "== Flink WebUI: http://worker01:8081 =="
echo "== HDFS WebUI:  http://worker01:9870 =="
echo "=================================================="
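The fixed `sleep`s in the script are only a guess at daemon startup time. A small polling helper (a generic sketch, not part of Hadoop or Flink) can wait on a real readiness check instead, e.g. `wait_for 30 hdfs haadmin -getServiceState nn1`:

```shell
# Retry a command up to <tries> times, one second apart; succeed as soon as it does
wait_for() {
  tries=$1; shift
  i=0
  until "$@"; do
    i=$(( i + 1 ))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
}

wait_for 3 true  && echo "ready"
wait_for 2 false || echo "gave up after retries"
```

Replacing `sleep 5` with a check like this makes the script both faster on a warm cluster and more reliable on a slow one.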

7.4 One-Click Stop Script

Create stop-all-custom.sh:

bash
#!/bin/bash

echo "=================================================="
echo "== Stopping Hadoop + Flink cluster services (3-node HA) =="
echo "=================================================="

# Record the time
echo "Stop time: $(date +'%Y-%m-%d %H:%M:%S')"
echo ""

# Step 1: stop the Flink HistoryServer
echo "[Step 1/8] Stopping the Flink HistoryServer..."
$FLINK_HOME/bin/historyserver.sh stop
sleep 2

# Step 2: stop the JobHistoryServer
echo "[Step 2/8] Stopping the MapReduce JobHistoryServer..."
$HADOOP_HOME/bin/mapred --daemon stop historyserver
sleep 2

# Step 3: stop YARN
echo "[Step 3/8] Stopping YARN..."
$HADOOP_HOME/sbin/stop-yarn.sh
sleep 3

# Step 4: stop ZKFC (both NameNode hosts)
echo "[Step 4/8] Stopping ZKFC (worker01, worker02)..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon stop zkfc"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon stop zkfc"
sleep 2

# Step 5: stop HDFS
echo "[Step 5/8] Stopping HDFS..."
$HADOOP_HOME/sbin/stop-dfs.sh
sleep 5

# Step 6: stop the JournalNodes (all nodes)
echo "[Step 6/8] Stopping the JournalNodes (all nodes)..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon stop journalnode"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon stop journalnode"
ssh worker03 "$HADOOP_HOME/bin/hdfs --daemon stop journalnode"
sleep 3

# Step 7: stop ZooKeeper (all nodes)
echo "[Step 7/8] Stopping the ZooKeeper ensemble (all nodes)..."
ssh worker01 "zkServer.sh stop"
ssh worker02 "zkServer.sh stop"
ssh worker03 "zkServer.sh stop"
sleep 3

# Step 8: check for leftover processes
echo ""
echo "[Step 8/8] Checking for leftover Java processes..."

echo "--- worker01 leftovers ---"
ssh worker01 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer|JobHistoryServer' || echo 'no matching processes'"

echo "--- worker02 leftovers ---"
ssh worker02 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain' || echo 'no matching processes'"

echo "--- worker03 leftovers ---"
ssh worker03 "jps | grep -E 'DataNode|JournalNode|NodeManager|QuorumPeerMain' || echo 'no matching processes'"

echo ""
echo "=================================================="
echo "== Cluster stopped! =="
echo "=================================================="

7.5 Common Check Commands

bash

# Process check (all nodes)
jps | grep -E "NameNode|DataNode|JournalNode|DFSZKFC|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer"

# HDFS checks
hdfs dfsadmin -report
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# YARN checks
yarn node -list
yarn application -list

# Filesystem operations
hdfs dfs -ls /
hdfs dfs -put /etc/hosts /tmp/

🎯 Final Verification

Run the following to verify the whole cluster end to end:

bash

# 1. Create a test file
echo "Hello Hadoop and Flink" > /tmp/test.txt
hdfs dfs -mkdir -p /flink/input
hdfs dfs -put /tmp/test.txt /flink/input/

# 2. Submit the WordCount job
cd $FLINK_HOME/bin
./flink run -m yarn-cluster \
  -yjm 1024 \
  -ytm 1024 \
  -ys 1 \
  -p 3 \
  ../examples/batch/WordCount.jar \
  --input hdfs://mycluster/flink/input/test.txt \
  --output hdfs://mycluster/flink/output/result

# 3. Inspect the result
hdfs dfs -cat /flink/output/result/*

# 4. Check every Web UI
# YARN: http://worker01:8088 (all 3 NodeManagers should be Active)
# Flink: http://worker01:8081 (there should be 3 TaskManagers)
# HistoryServer: http://worker01:8082
# HDFS: http://worker01:9870 (all 3 DataNodes should be Live)
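As an offline cross-check of what a word count over the test line should produce, the classic shell pipeline gives the expected per-word counts (lowercasing roughly mirrors how the Flink example normalizes tokens; this is an approximation, not the job itself):

```shell
# Count words in the sample line the way a word-count job would (approximately)
counts=$(echo "Hello Hadoop and Flink" \
  | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c \
  | awk '{print $2":"$1}' | tr '\n' ' ')
echo "$counts"
```

Each of the four words appears once, which is what `hdfs dfs -cat` on the result should reflect.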

Congratulations! Your 3-node Hadoop 3.3.5 + Flink 1.15.3 HA cluster is up and running! 🎉
