📋 About This Manual
This manual was compiled from the issues actually encountered during deployment and includes a detailed pitfall-avoidance guide. The environment is three virtual machines (2 vCPU / 4 GB RAM each), with the following IP plan:
192.168.171.129 worker01 (primary node: NameNode / ResourceManager / JobManager)
192.168.171.130 worker02 (secondary node: NameNode (standby) / DataNode / NodeManager)
192.168.171.131 worker03 (additional worker node: DataNode / NodeManager)
Chapter 1: Basic Environment Setup
1.1 Hostname and Network Configuration
Run on all nodes:
bash
# Set the hostname (optional)
hostnamectl set-hostname worker01 # run on worker01
hostnamectl set-hostname worker02 # run on worker02
hostnamectl set-hostname worker03 # run on worker03
# Configure the hosts file
cat >> /etc/hosts << EOF
192.168.171.129 worker01
192.168.171.130 worker02
192.168.171.131 worker03
EOF
# Disable the firewall (test environment only; in production, open the required ports instead)
systemctl stop firewalld
systemctl disable firewalld   # optional: keep it off across reboots
systemctl status firewalld
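A quick reachability check after editing /etc/hosts (a minimal sketch; it assumes the three entries above are already in place on this node):
bash
# Ping each hostname once to confirm name resolution and basic connectivity
for host in worker01 worker02 worker03; do
  ping -c 1 -W 1 "$host" > /dev/null && echo "$host reachable" || echo "$host NOT reachable"
done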
1.2 Passwordless SSH Login
Run on worker01:
bash
# Generate a key pair
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Copy the public key to all nodes (including worker01 itself)
ssh-copy-id worker01
ssh-copy-id worker02
ssh-copy-id worker03
# Verify
ssh worker02 "hostname" # should print worker02 without prompting for a password
ssh worker03 "hostname" # should print worker03 without prompting for a password
1.3 JDK Installation and Configuration
⚠️ Key pitfall: JAVA_HOME must point to the JDK root directory (not the JRE), and that directory must contain a bin subdirectory.
bash
# Install JDK 1.8 (all nodes)
yum install -y java-1.8.0-openjdk-devel
# Find the actual JDK path
readlink -f /usr/bin/java
# /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin/java
# Take everything before /jre: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
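# Optional sketch: derive the JDK root automatically instead of trimming by hand
# (javac resolves to <JDK root>/bin/javac, so stripping two path levels yields the JDK root)
dirname "$(dirname "$(readlink -f /usr/bin/javac)")"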
# Configure environment variables (all nodes)
cat >> ~/.bashrc << EOF
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
export PATH=\$PATH:\$JAVA_HOME/bin
EOF
source ~/.bashrc
# Verify
java -version
javac -version # javac must exist, which proves this is a JDK rather than a JRE
1.4 Directory Layout (All Nodes)
bash
# Create the deployment directory (adjust to your environment)
mkdir -p /root/engine
# Create unified data directories
mkdir -p /data/zookeeper
mkdir -p /data/hadoop/{logs,pids,tmp,namenode,datanode,journal,yarn}
mkdir -p /data/flink/logs
# Inspect the directory tree
tree /data -L 2
Chapter 2: ZooKeeper Cluster Deployment
2.1 Download and Install
Run on worker01 (the installation is copied to worker02/worker03 in section 2.5):
bash
cd ~/engine
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.1/apache-zookeeper-3.8.1-bin.tar.gz
tar -zxvf apache-zookeeper-3.8.1-bin.tar.gz
mv apache-zookeeper-3.8.1-bin zookeeper
2.2 ZooKeeper Configuration
Create the configuration file (on worker01):
bash
cd /root/engine/zookeeper/conf
cp zoo_sample.cfg zoo.cfg
# Write zoo.cfg
cat > zoo.cfg << EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
server.1=worker01:2888:3888
server.2=worker02:2888:3888
server.3=worker03:2888:3888
EOF
2.3 Create the myid File
bash
# Run on worker01
echo "1" > /data/zookeeper/myid
# Run on worker02
echo "2" > /data/zookeeper/myid
# Run on worker03
echo "3" > /data/zookeeper/myid
2.4 Environment Variables
bash
cat >> ~/.bashrc << EOF
export ZOOKEEPER_HOME=/root/engine/zookeeper
export PATH=\$PATH:\$ZOOKEEPER_HOME/bin
EOF
source ~/.bashrc
2.5 Distribute to worker02 and worker03
bash
# Distribute ZooKeeper
scp -r /root/engine/zookeeper worker02:/root/engine/
scp -r /root/engine/zookeeper worker03:/root/engine/
# Distribute the environment variables
scp ~/.bashrc worker02:~/
scp ~/.bashrc worker03:~/
# The new variables take effect automatically in fresh login sessions on worker02/worker03
# (running "source ~/.bashrc" over ssh only affects that one remote session, so it is not needed)
2.6 Start ZooKeeper
⚠️ Note: ZooKeeper must be started manually on each node (there is no cluster-wide start script); a per-node loop sketch follows the status check below.
bash
# Run on every node
zkServer.sh start
# Check status (run on every node)
zkServer.sh status
# Expected result: one node reports Mode: leader, the other two report Mode: follower
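To save typing, the start and status checks can also be driven from worker01 (a sketch; it assumes passwordless SSH and that ZOOKEEPER_HOME/bin is exported in ~/.bashrc on every node):
bash
# Start ZooKeeper on all three nodes and print each node's role
for host in worker01 worker02 worker03; do
  echo "--- $host ---"
  ssh "$host" "source ~/.bashrc && zkServer.sh start && zkServer.sh status"
done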
Chapter 3: Hadoop 3.3.5 HA Cluster Deployment
3.1 Download and Install
Run on worker01:
bash
cd /root/engine
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
tar -zxvf hadoop-3.3.5.tar.gz
3.2 Environment Variables
Run on all nodes:
bash
cat >> ~/.bashrc << EOF
export HADOOP_HOME=/root/engine/hadoop-3.3.5
export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=/data/hadoop/logs
export HADOOP_PID_DIR=/data/hadoop/pids
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
export HADOOP_CLASSPATH=\`hadoop classpath\`
EOF
source ~/.bashrc
3.3 Edit hadoop-env.sh
⚠️ Critical: the *_USER variables must be set, otherwise the daemons cannot be started as root.
bash
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Add the following (adjust to your environment)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
export HADOOP_LOG_DIR=/data/hadoop/logs
export HADOOP_PID_DIR=/data/hadoop/pids
# Critical: set the startup users (fixes "ERROR: Attempting to operate on ... as root")
# Not needed when starting as a non-root user
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# The local VMs are short on memory, so shrink the daemon heap sizes
export HDFS_NAMENODE_OPTS="-Xms512m -Xmx512m"
export HDFS_DATANODE_OPTS="-Xms256m -Xmx256m"
export YARN_RESOURCEMANAGER_OPTS="-Xms512m -Xmx512m"
export YARN_NODEMANAGER_OPTS="-Xms256m -Xmx256m"
export HADOOP_JOB_HISTORYSERVER_OPTS="-Xms128m -Xmx128m"
3.4 core-site.xml
xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>worker01:2181,worker02:2181,worker03:2181</value>
</property>
<property>
<name>ha.zookeeper.session-timeout.ms</name>
<value>5000</value>
</property>
</configuration>
3.5 hdfs-site.xml (HA Core)
xml
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<!-- NameNode HA -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>worker01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>worker02:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>worker01:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>worker02:9870</value>
</property>
<!-- JournalNode (standard 3-node setup) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://worker01:8485;worker02:8485;worker03:8485/mycluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/hadoop/journal</value>
</property>
<!-- Failover -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- Adjust to the actual key path -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Data directories -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/data/hadoop/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/data/hadoop/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value> <!-- 2 on a 3-node cluster balances safety and disk usage -->
</property>
</configuration>
3.6 yarn-site.xml
xml
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<!-- ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yrc</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>worker01</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>worker02</value>
</property>
<!-- ZooKeeper -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>worker01:2181,worker02:2181,worker03:2181</value>
</property>
<!-- NodeManager -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Memory settings (tuned for 4 GB VMs) -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value> <!-- leave 1 GB for the OS -->
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value> <!-- 2 cores -->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Disable memory checks (avoid containers being killed by mistake) -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://worker01:19888/jobhistory/logs</value>
</property>
</configuration>
3.7 mapred-site.xml
xml
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>worker01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>worker01:19888</value>
</property>
</configuration>
3.8 capacity-scheduler.xml
xml
vim $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
<!-- Change DefaultResourceCalculator to DominantResourceCalculator -->
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
<description>
The ResourceCalculator implementation to be used to compare
Resources in the scheduler.
The default i.e. DefaultResourceCalculator only uses Memory while
DominantResourceCalculator uses dominant-resource to compare
multi-dimensional resources such as Memory, CPU etc.
</description>
</property>
3.9 workers File
bash
vim $HADOOP_HOME/etc/hadoop/workers
# Add the following
worker01
worker02
worker03
3.10 Distribute to worker02 and worker03
bash
# Distribute Hadoop
scp -r /root/engine/hadoop-3.3.5 worker02:/root/engine/
scp -r /root/engine/hadoop-3.3.5 worker03:/root/engine/
# Distribute the environment variables
scp ~/.bashrc worker02:~/
scp ~/.bashrc worker03:~/
# The new variables take effect automatically in fresh login sessions on worker02/worker03
# (running "source ~/.bashrc" over ssh only affects that one remote session, so it is not needed)
3.11 Format and Start (⚠️ Follow This Order Strictly)
Step 1: Start the JournalNodes (all nodes)
bash
# Run on worker01, worker02, and worker03
$HADOOP_HOME/bin/hdfs --daemon start journalnode
jps | grep JournalNode # confirm the process is running
Step 2: Format the NameNode (worker01 only)
bash
hdfs namenode -format
Step 3: Start the NameNode on worker01
bash
$HADOOP_HOME/bin/hdfs --daemon start namenode
Step 4: Bootstrap the standby NameNode (on worker02)
bash
hdfs namenode -bootstrapStandby
Step 5: Start the NameNode on worker02
bash
$HADOOP_HOME/bin/hdfs --daemon start namenode
Step 6: Format ZKFC (on either NameNode host)
bash
hdfs zkfc -formatZK # ⚠️ easy to miss; skipping it leaves both NameNodes in standby
Step 7: Start all DataNodes
bash
$HADOOP_HOME/sbin/start-dfs.sh # starts all DataNodes automatically
# or start them manually on each machine
# $HADOOP_HOME/bin/hdfs --daemon start datanode
Step 8: Start ZKFC (both NameNode hosts)
bash
# run on worker01 and worker02
$HADOOP_HOME/bin/hdfs --daemon start zkfc
Step 9: Start YARN
bash
$HADOOP_HOME/sbin/start-yarn.sh
Step 10: Start the JobHistoryServer
bash
# run on worker01
$HADOOP_HOME/bin/mapred --daemon start historyserver
3.12 Verify the Cluster
bash
# Check processes on every node
echo "=== worker01 processes ==="
ssh worker01 'jps | grep -E "NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer"'
echo "=== worker02 processes ==="
ssh worker02 'jps | grep -E "NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain"'
echo "=== worker03 processes ==="
ssh worker03 'jps | grep -E "DataNode|JournalNode|NodeManager|QuorumPeerMain"'
# Check NameNode state
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# expected: one active, one standby
# Check ResourceManager state
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# expected: one active, one standby
# Check YARN nodes
yarn node -list
# expected: all three nodes in RUNNING state
# Check HDFS health
hdfs dfsadmin -report
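Optionally, automatic failover can be smoke-tested on this test cluster (a sketch; it assumes nn1 on worker01 is currently active and temporarily stops it):
bash
# Stop the active NameNode, confirm nn2 takes over, then bring nn1 back as standby
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon stop namenode"
sleep 10
hdfs haadmin -getServiceState nn2 # expected: active
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon start namenode"
sleep 10
hdfs haadmin -getServiceState nn1 # expected: standby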
Chapter 4: Flink 1.15.3 on YARN Deployment
4.1 Download and Install
Run on worker01:
bash
cd /root/engine
wget https://archive.apache.org/dist/flink/flink-1.15.3/flink-1.15.3-bin-scala_2.12.tgz
tar -zxvf flink-1.15.3-bin-scala_2.12.tgz
4.2 Environment Variables
Run on all nodes:
bash
cat >> ~/.bashrc << EOF
export FLINK_HOME=/root/engine/flink-1.15.3
export PATH=\$PATH:\$FLINK_HOME/bin
EOF
source ~/.bashrc
4.3 Configure flink-conf.yaml
⚠️ Key pitfall: keep the memory configuration minimal, otherwise the derived settings conflict.
yaml
# Basic communication
jobmanager.rpc.address: worker01
jobmanager.bind-host: 0.0.0.0
taskmanager.bind-host: 0.0.0.0
# Simplified memory configuration (set only process.size; the rest is derived)
jobmanager.memory.process.size: 1024m
taskmanager.memory.process.size: 1024m
# Slot configuration (tuned for 4 GB VMs)
taskmanager.numberOfTaskSlots: 1
parallelism.default: 2
# Web UI
rest.port: 8081
rest.address: worker01
rest.bind-address: 0.0.0.0
web.submit.enable: true
# History Server
historyserver.web.enabled: true
historyserver.web.address: worker01
historyserver.web.port: 8082
jobmanager.archive.fs.dir: hdfs://mycluster/flink/completed-jobs/
historyserver.archive.fs.dir: hdfs://mycluster/flink/completed-jobs/
# Misc
classloader.check-leaked-classloader: false
env.java.opts: -Dfile.encoding=UTF-8
#==============================================================================
# High Availability (HA mode)
#==============================================================================
# Enable ZooKeeper-based HA
high-availability: zookeeper
# ZooKeeper quorum (all three nodes)
high-availability.zookeeper.quorum: worker01:2181,worker02:2181,worker03:2181
# Metadata storage path (use the correct HDFS HA nameservice)
high-availability.storageDir: hdfs://mycluster/flink/ha/
# Root path for Flink metadata in ZooKeeper
high-availability.zookeeper.path.root: /flink
# Cluster ID (arbitrary; used to distinguish different Flink clusters)
high-availability.cluster-id: /flink_cluster_one
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
# State backend (hashmap is lighter and recommended for 4 GB VMs)
state.backend: hashmap
# Checkpoint directory
state.checkpoints.dir: hdfs://mycluster/flink/checkpoints
# Savepoint directory
state.savepoints.dir: hdfs://mycluster/flink/savepoints
# Checkpoint interval (leave commented out to disable)
# execution.checkpointing.interval: 60000
# Failure recovery strategy
jobmanager.execution.failover-strategy: region
4.4 Configure masters and workers
bash
# masters file (on worker01)
echo "worker01:8081" > $FLINK_HOME/conf/masters
# workers file
cat > $FLINK_HOME/conf/workers << EOF
worker01
worker02
worker03
EOF
4.5 Distribute to worker02 and worker03
bash
scp -r /root/engine/flink-1.15.3 worker02:/root/engine/
scp -r /root/engine/flink-1.15.3 worker03:/root/engine/
# Distribute the environment variables
scp ~/.bashrc worker02:~/
scp ~/.bashrc worker03:~/
# The new variables take effect automatically in fresh login sessions on worker02/worker03
# (running "source ~/.bashrc" over ssh only affects that one remote session, so it is not needed)
4.6 Start the History Server
bash
$FLINK_HOME/bin/historyserver.sh start
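If the archive directory configured in section 4.3 does not exist yet in HDFS, it can be created beforehand (a hedged preparation step; the path matches jobmanager.archive.fs.dir / historyserver.archive.fs.dir above):
bash
# Create the completed-jobs archive directory so the History Server has something to scan
hdfs dfs -mkdir -p /flink/completed-jobs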
Chapter 5: Verification Tests
5.1 Submit a WordCount Job
bash
cd $FLINK_HOME/bin
# Simplest test (no memory flags; uses the defaults from flink-conf.yaml)
./flink run -m yarn-cluster ../examples/batch/WordCount.jar
# On success you should see output similar to:
# YARN application has been deployed successfully.
# Job has been submitted with JobID xxxxx
5.2 Submit with Parameters
bash
# Specify memory and parallelism
./flink run -m yarn-cluster \
-yjm 1024 \
-ytm 1024 \
-ys 1 \
-p 2 \
../examples/batch/WordCount.jar
5.3 Verify Cluster Status
bash
# 1. List YARN applications
yarn application -list | grep -i flink
# 2. Open the Web UIs
# YARN: http://worker01:8088
# Flink: http://worker01:8081
# HistoryServer: http://worker01:8082
# HDFS: http://worker01:9870
# 3. View the job output
yarn logs -applicationId application_xxx | grep -A 20 "WordCount"
# 4. Test creating an HDFS directory
hdfs dfs -mkdir -p /flink/test
hdfs dfs -ls /
Chapter 6: Common Problems and Solutions (Pitfall Guide)
6.1 Multiple SLF4J Bindings Warning
Symptom:
text
SLF4J: Class path contains multiple SLF4J bindings.
Cause: Hadoop and Flink each ship their own SLF4J binding.
Solution: It does not affect operation and can be ignored. Alternatively, filter it out of HADOOP_CLASSPATH:
bash
export HADOOP_CLASSPATH=$(hadoop classpath | tr ':' '\n' | grep -v 'slf4j-reload4j' | tr '\n' ':')
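To see which bindings are actually on the Hadoop classpath before and after filtering, a quick check (sketch):
bash
# List SLF4J-related jars currently on the Hadoop classpath
hadoop classpath | tr ':' '\n' | grep -i slf4j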
6.2 Incorrect JAVA_HOME
Symptom:
text
ERROR: JAVA_HOME /usr/bin/java does not exist.
Cause: JAVA_HOME points at the java executable instead of the JDK directory.
Solution:
bash
# Correct setting
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
6.3 Starting as root Requires the *_USER Variables
Symptom:
text
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined.
Solution: Add the following to hadoop-env.sh:
bash
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
6.4 JournalNode Connection Refused
Symptom:
text
Retrying connect to server: worker01:8485 failed
Cause: The JournalNodes are not running, or were started in the wrong order.
Solution: Follow the order strictly: start the JournalNodes first, then format/start the NameNodes. A quick check is sketched below.
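Before formatting, it is worth confirming that all three JournalNodes are up and listening (a sketch; port 8485 matches dfs.namenode.shared.edits.dir above):
bash
# Check the JournalNode process and its RPC port on every node
for host in worker01 worker02 worker03; do
  echo "--- $host ---"
  ssh "$host" "jps | grep JournalNode; ss -lntp | grep 8485"
done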
6.5 Both NameNodes Stuck in Standby
Symptom:
text
hdfs haadmin -getServiceState nn1 = standby
hdfs haadmin -getServiceState nn2 = standby
Cause: hdfs zkfc -formatZK was never run.
Solution:
bash
hdfs zkfc -formatZK
# then restart ZKFC
$HADOOP_HOME/bin/hdfs --daemon stop zkfc
$HADOOP_HOME/bin/hdfs --daemon start zkfc
6.6 Flink Memory Configuration Conflict
Symptom:
text
The configured Total Flink Memory (73.000mb) is less than the configured Off-heap Memory (128.000mb)
Cause: Several memory options were set at the same time, so the derived values conflict.
Solution: Simplify the configuration and keep only jobmanager.memory.process.size / taskmanager.memory.process.size, as sketched below.
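A minimal memory section that avoids the conflict (a sketch matching the 4 GB-VM sizing used in section 4.3):
yaml
# Set only the total process sizes and let Flink derive heap, managed and off-heap memory
jobmanager.memory.process.size: 1024m
taskmanager.memory.process.size: 1024m
# Do not additionally set taskmanager.memory.flink.size, taskmanager.memory.task.heap.size, etc.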
6.7 NodeManager Out of Sync
Symptom:
text
Node is out of sync with ResourceManager, hence resyncing.
Active Nodes: 2
Solution: Restart all NodeManagers:
bash
# run on all nodes
$HADOOP_HOME/bin/yarn --daemon stop nodemanager
$HADOOP_HOME/bin/yarn --daemon start nodemanager
6.8 Insufficient Container Memory
Symptom:
The job waits indefinitely after submission; YARN shows the application stuck in the ACCEPTED state.
Solution: Adjust the memory settings in yarn-site.xml (a resource check is sketched after the snippet):
xml
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
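Before changing the configuration, it can help to check what resources the NodeManagers actually advertise and which applications are stuck (a sketch; the -showDetails flag is assumed to be available in this Hadoop version):
bash
# Per-node resources and applications still waiting in the ACCEPTED state
yarn node -list -showDetails
yarn application -list -appStates ACCEPTED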
6.9 DataNode on worker03 Not Started
Symptom:
text
hdfs dfsadmin -report shows only 2 DataNodes
Solution: Start the DataNode on worker03 manually:
bash
ssh worker03 "$HADOOP_HOME/bin/hdfs --daemon start datanode"
Chapter 7: Day-to-Day Operations
7.1 Start Commands (Strict Order)
bash
# 1. Start ZooKeeper (all nodes)
zkServer.sh start
# 2. Start the JournalNodes (all nodes)
$HADOOP_HOME/bin/hdfs --daemon start journalnode
# 3. Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh
# 4. Start ZKFC (NameNode hosts)
$HADOOP_HOME/bin/hdfs --daemon start zkfc
# 5. Start YARN
$HADOOP_HOME/sbin/start-yarn.sh
# 6. Start the MapReduce JobHistoryServer
$HADOOP_HOME/bin/mapred --daemon start historyserver
# 7. Start the Flink HistoryServer
$FLINK_HOME/bin/historyserver.sh start
7.2 Stop Commands (Reverse Order)
bash
$FLINK_HOME/bin/historyserver.sh stop
$HADOOP_HOME/bin/mapred --daemon stop historyserver
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
# JournalNode and ZooKeeper must be stopped separately
$HADOOP_HOME/bin/hdfs --daemon stop journalnode
zkServer.sh stop
7.3 One-Click Start Script
Create start-all-custom.sh:
bash
#!/bin/bash
echo "=================================================="
echo "== 开始启动Hadoop + Flink集群服务(3节点HA版)=="
echo "=================================================="
echo "启动时间: $(date +'%Y-%m-%d %H:%M:%S')"
echo ""
# 步骤1:启动ZooKeeper
echo "【步骤1/8】启动ZooKeeper集群..."
ssh worker01 "zkServer.sh start"
ssh worker02 "zkServer.sh start"
ssh worker03 "zkServer.sh start"
sleep 5
# Step 2: start the JournalNodes
echo "[Step 2/8] Starting JournalNode services..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
ssh worker03 "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
sleep 3
# Step 3: start HDFS
echo "[Step 3/8] Starting the HDFS cluster..."
$HADOOP_HOME/sbin/start-dfs.sh
sleep 5
# Step 4: start ZKFC
echo "[Step 4/8] Starting ZKFC services..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon start zkfc"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon start zkfc"
sleep 3
# Step 5: start YARN
echo "[Step 5/8] Starting the YARN cluster..."
$HADOOP_HOME/sbin/start-yarn.sh
sleep 3
# Step 6: start the MapReduce JobHistoryServer
echo "[Step 6/8] Starting the MapReduce JobHistoryServer..."
$HADOOP_HOME/bin/mapred --daemon start historyserver
sleep 2
# Step 7: start the Flink HistoryServer
echo "[Step 7/8] Starting the Flink HistoryServer..."
$FLINK_HOME/bin/historyserver.sh start
sleep 2
# Step 8: check processes
echo ""
echo "[Step 8/8] Checking service processes..."
echo "--- worker01 processes ---"
ssh worker01 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer|JobHistoryServer'"
echo "--- worker02 processes ---"
ssh worker02 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain'"
echo "--- worker03 processes ---"
ssh worker03 "jps | grep -E 'DataNode|JournalNode|NodeManager|QuorumPeerMain'"
echo ""
echo "=================================================="
echo "== 集群启动完成!=="
echo "== YARN WebUI: http://worker01:8088 =="
echo "== Flink WebUI: http://worker01:8081 =="
echo "== HDFS WebUI: http://worker01:9870 =="
echo "=================================================="
7.4 One-Click Stop Script
Create stop-all-custom.sh:
bash
#!/bin/bash
echo "=================================================="
echo "== 开始停止Hadoop + Flink集群服务(3节点HA版)=="
echo "=================================================="
# 获取当前时间
echo "停止时间: $(date +'%Y-%m-%d %H:%M:%S')"
echo ""
# 步骤1:停止Flink HistoryServer
echo "【步骤1/8】停止Flink HistoryServer..."
$FLINK_HOME/bin/historyserver.sh stop
sleep 2
# Step 2: stop the MapReduce JobHistoryServer
echo "[Step 2/8] Stopping the MapReduce JobHistoryServer..."
$HADOOP_HOME/bin/mapred --daemon stop historyserver
sleep 2
# Step 3: stop YARN
echo "[Step 3/8] Stopping the YARN cluster..."
$HADOOP_HOME/sbin/stop-yarn.sh
sleep 3
# Step 4: stop ZKFC (both NameNode hosts)
echo "[Step 4/8] Stopping ZKFC services (worker01, worker02)..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon stop zkfc"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon stop zkfc"
sleep 2
# Step 5: stop HDFS
echo "[Step 5/8] Stopping the HDFS cluster..."
$HADOOP_HOME/sbin/stop-dfs.sh
sleep 5
# Step 6: stop the JournalNodes (all nodes)
echo "[Step 6/8] Stopping JournalNode services (all nodes)..."
ssh worker01 "$HADOOP_HOME/bin/hdfs --daemon stop journalnode"
ssh worker02 "$HADOOP_HOME/bin/hdfs --daemon stop journalnode"
ssh worker03 "$HADOOP_HOME/bin/hdfs --daemon stop journalnode"
sleep 3
# Step 7: stop ZooKeeper (all nodes)
echo "[Step 7/8] Stopping the ZooKeeper cluster (all nodes)..."
ssh worker01 "zkServer.sh stop"
ssh worker02 "zkServer.sh stop"
ssh worker03 "zkServer.sh stop"
sleep 3
# Step 8: check remaining processes
echo ""
echo "[Step 8/8] Checking for remaining Java processes..."
echo "--- worker01 remaining processes ---"
ssh worker01 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer|JobHistoryServer' || echo 'no matching processes'"
echo "--- worker02 remaining processes ---"
ssh worker02 "jps | grep -E 'NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain' || echo 'no matching processes'"
echo "--- worker03 remaining processes ---"
ssh worker03 "jps | grep -E 'DataNode|JournalNode|NodeManager|QuorumPeerMain' || echo 'no matching processes'"
echo ""
echo "=================================================="
echo "== 集群停止完成!=="
echo "=================================================="
7.5 Common Check Commands
bash
# Process check (all nodes)
jps | grep -E "NameNode|DataNode|JournalNode|DFSZKF|ResourceManager|NodeManager|QuorumPeerMain|HistoryServer"
# HDFS checks
hdfs dfsadmin -report
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# YARN checks
yarn node -list
yarn application -list
# Filesystem operations
hdfs dfs -ls /
hdfs dfs -put /etc/hosts /tmp/
🎯 Final Verification
Run the following commands to verify the whole cluster:
bash
# 1. Create a test file
echo "Hello Hadoop and Flink" > /tmp/test.txt
hdfs dfs -mkdir -p /flink/input
hdfs dfs -put /tmp/test.txt /flink/input/
# 2. Submit the WordCount job
cd $FLINK_HOME/bin
./flink run -m yarn-cluster \
-yjm 1024 \
-ytm 1024 \
-ys 1 \
-p 3 \
../examples/batch/WordCount.jar \
--input hdfs://mycluster/flink/input/test.txt \
--output hdfs://mycluster/flink/output/result
# 3. View the result
hdfs dfs -cat /flink/output/result/*
# 4. Check all Web UIs
# YARN: http://worker01:8088 (all 3 NodeManagers should be active)
# Flink: http://worker01:8081 for a standalone session; for this YARN job, open the Flink UI via the ApplicationMaster link in the YARN UI (with -p 3 and -ys 1, expect 3 TaskManagers)
# HistoryServer: http://worker01:8082
# HDFS: http://worker01:9870 (all 3 DataNodes should be listed as Live)
Congratulations! Your 3-node Hadoop 3.3.5 + Flink 1.15.3 HA cluster has been deployed successfully! 🎉