Building a Hadoop Development Cluster Step by Step, Starting from Podman


Summary

My work regularly involves developing big-data programs touching Hive, Spark, Flink, and so on, so to make it easy to debug programs at any time I decided to deploy a Hadoop cluster. Deploying a cluster on a single host inevitably means using container technology. I have used both Docker and Podman; on a laptop with fairly limited resources and performance, Podman performs quite well overall, and it can also pull images from Docker registries. Installing Podman itself is straightforward and is not covered here; the steps below walk through installing a Hadoop cluster on Podman.

Software Versions

System/Software    Version        Notes
Podman Desktop     1.20.2
Debian             9 (stretch)    Debian 10 (buster) only ships JDK 11+ by default; JDK 8 would have to be installed some other way
Hadoop             3.3.6

Hadoop Role Layout

Note: the resourcemanager seems to have to run on the same machine as the namenode (see the yarn-site.xml configuration below).
flowchart BT
    A[Hadoop Network]
    B["node01 - 10.88.0.101, Roles: namenode (HDFS), datanode (HDFS), nodemanager (YARN)"] --> A
    C["node02 - 10.88.0.102, Roles: secondarynamenode (HDFS), datanode (HDFS), nodemanager (YARN)"] --> A
    D["node03 - 10.88.0.103, Roles: resourcemanager (YARN), datanode (HDFS), nodemanager (YARN)"] --> A

Podman Configuration

To avoid permission problems after mounting host disks, log in to the Podman machine default and make the following changes.

Edit /etc/wsl.conf:

[automount]
enabled=true
mountFsTab=false
options="metadata"

Disable automatic generation of hosts entries

Edit /etc/wsl.conf:

[network]
generateHosts = false
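
Changes to /etc/wsl.conf only take effect after the WSL distribution restarts, so restart the Podman machine afterwards (run on the host):

powershell
podman machine stop
podman machine start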

Network Connectivity Between Containers and the Host

No port mappings are configured for the containers; instead, a route is added so the host can reach the container IPs directly. Add a route on the host with the gateway set to the IP address of the Podman machine default system.

route ADD 10.88.0.0 MASK 255.255.0.0 172.24.137.159 IF 37
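
The gateway above (172.24.137.159) is the IP of the Podman machine default system and IF 37 is the index of the matching host interface; both are machine-specific. A rough sketch for looking them up, assuming the machine's main interface is eth0:

powershell
# Inside the Podman machine: its IPv4 address is the gateway for the route
podman machine ssh "ip -4 addr show eth0"
# On the Windows host: note the Idx of the vEthernet (WSL) adapter and use it as IF
netsh interface ipv4 show interfaces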

0. Creating the Containers

Create Debian 9 containers with fixed IPs and hostnames.

powershell
podman run -dt --name node01 --hostname node01.data.org --ip 10.88.0.101 --add-host "node02;node02.data.org:10.88.0.102" --add-host "node03;node03.data.org:10.88.0.103" --cap-add IPC_LOCK --cap-add NET_RAW docker.io/library/debian:9.9 bash
podman run -dt --name node02 --hostname node02.data.org --ip 10.88.0.102 --add-host "node01;node01.data.org:10.88.0.101" --add-host "node03;node03.data.org:10.88.0.103" --cap-add IPC_LOCK --cap-add NET_RAW docker.io/library/debian:9.9 bash
podman run -dt --name node03 --hostname node03.data.org --ip 10.88.0.103 --add-host "node01;node01.data.org:10.88.0.101" --add-host "node02;node02.data.org:10.88.0.102" --cap-add IPC_LOCK --cap-add NET_RAW docker.io/library/debian:9.9 bash

Since 2023 the official Debian repository URLs for stretch have changed, so the sources that ship with the image return 404 errors. Switch to the official archive source or to one of the domestic public mirrors (examples are listed in the last section of this post).

shell
# Set the system time zone
rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
# Update the package sources
mv /etc/apt/sources.list /etc/apt/sources.list.old
cat > /etc/apt/sources.list << EOF
deb http://mirrors.cloud.tencent.com/debian-archive//debian/ stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian-archive//debian/ stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian-archive//debian/ stretch-backports main contrib non-free
deb http://mirrors.cloud.tencent.com/debian-archive//debian-security/ stretch/updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian-archive//debian-security/ stretch/updates main contrib non-free
EOF
apt update

If apt update complains about a missing PUBLIC KEY, run the following command:

shell
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 467B942D3A79BD29   # replace the key with the one shown in the error message
shell
# Install required packages
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        openjdk-8-jdk \
        net-tools \
        curl \
        netcat \
        gnupg \
        libsnappy-dev \
        openssh-server \
        openssh-client \
        sudo \
    && rm -rf /var/lib/apt/lists/*

# Create the hadoop user and grant it passwordless sudo
chmod 640 /etc/sudoers
groupadd --system --gid=10000 hadoop && useradd --system --home-dir /home/hadoop --uid=10000 --gid=hadoop hadoop && echo "hadoop ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

# Generate an SSH key pair; node01's public key has to be imported on node02 and node03 for passwordless login (see the sketch after this block)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Allow root login, start sshd and enable it at boot
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
/etc/init.d/ssh start
update-rc.d ssh enable
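
The public key still has to be copied to node02 and node03. A minimal sketch, run inside node01, assuming a root password has been set (with passwd) on the other two nodes so that ssh-copy-id can authenticate:

shell
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node02
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node03
# verify passwordless login
ssh root@node02 hostname
ssh root@node03 hostname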

1. Hadoop Base Environment Installation

Hadoop download and installation

Set environment variables

shell
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export HADOOP_VERSION=3.3.6
export HADOOP_HOME=/opt/hadoop-$HADOOP_VERSION
echo "export LANG=zh_CN.UTF-8" >> /etc/profile
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> /etc/profile.d/hadoopbase.sh
echo "export JRE_HOME=${JAVA_HOME}/jre" >> /etc/profile.d/hadoopbase.sh
echo "export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib" >> /etc/profile.d/hadoopbase.sh
echo "export PATH=${JAVA_HOME}/bin:\$PATH" >> /etc/profile.d/hadoopbase.sh
echo "export HADOOP_VERSION=3.3.6" >> /etc/profile.d/hadoopbase.sh
echo "export HADOOP_HOME=/opt/hadoop-$HADOOP_VERSION" >> /etc/profile.d/hadoopbase.sh
echo "export HADOOP_CONF_DIR=/etc/hadoop" >> /etc/profile.d/hadoopbase.sh
echo "export PATH=$HADOOP_HOME/bin/:\$PATH" >> /etc/profile.d/hadoopbase.sh
source /etc/profile
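# Optional sanity check: the JDK and Hadoop variables should now resolve (a rough sketch)
java -version
echo $JAVA_HOME $HADOOP_HOME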

Create data directories

shell
mkdir -p /data/hadoop-data
mkdir -p /data/hadoop/tmp
mkdir -p /data/hadoop/nameNode
mkdir -p /data/hadoop/dataNode

Install Hadoop

shell
# Import the GPG KEYS used to verify the Hadoop release
curl -O https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
gpg --import KEYS
# Download and verify the selected Hadoop version
set -x \
    && curl -fSL "https://www.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" -o /tmp/hadoop.tar.gz \
    && curl -fSL "https://www.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz.asc" -o /tmp/hadoop.tar.gz.asc \
    && gpg --verify /tmp/hadoop.tar.gz.asc \
    && tar -xf /tmp/hadoop.tar.gz -C /opt/ \
    && rm /tmp/hadoop.tar.gz*
ln -s $HADOOP_HOME/etc/hadoop /etc/hadoop
mkdir $HADOOP_HOME/logs
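
A quick check that the extracted release and the symlinked configuration directory are in place (a sketch):

shell
$HADOOP_HOME/bin/hadoop version
ls /etc/hadoop/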

Edit the configuration on node01

  1. /etc/hadoop/hadoop-env.sh
sh
# Append at the end of the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
  2. /etc/hadoop/core-site.xml
xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node01:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop/tmp</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>
</configuration>
  3. /etc/hadoop/hdfs-site.xml
xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node02:50070</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
    </property>
    <property>    
        <name>dfs.namenode.name.dir</name>    
        <value>/data/hadoop/nameNode</value>    
    </property>    
    <property>    
        <name>dfs.datanode.data.dir</name>    
        <value>/data/hadoop/dataNode</value>  
    </property>
    <property>
      <name>dfs.permissions</name>
      <value>false</value>
    </property>
</configuration>
  4. /etc/hadoop/mapred-site.xml
xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node01:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node01:19888</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
</configuration>
  5. /etc/hadoop/yarn-site.xml (the resourcemanager seems to have to run on the same machine as the namenode, so it is set to node01 here)
xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node01</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://node01:19888/jobhistory/logs</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>
  6. /etc/hadoop/workers
text
node01
node02
node03
  7. Copy Hadoop to the other nodes
shell
scp -r /opt/hadoop-3.3.6 root@node02:/opt/
scp -r /opt/hadoop-3.3.6 root@node03:/opt/
# Create the symlink on node02 and node03 respectively
ln -s $HADOOP_HOME/etc/hadoop /etc/hadoop
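# If the environment script from the base setup only exists on node01, copy it over as well (a sketch):
scp /etc/profile.d/hadoopbase.sh root@node02:/etc/profile.d/
scp /etc/profile.d/hadoopbase.sh root@node03:/etc/profile.d/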
  8. Before the first start
shell
# Format the namenode on node01
hdfs namenode -format
  9. Start the services
shell
# Run on node01
bash $HADOOP_HOME/sbin/start-all.sh
bash $HADOOP_HOME/sbin/stop-all.sh

Check that the services started correctly

shell
jps
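# Roughly expected per node (given the role layout above and yarn.resourcemanager.hostname=node01):
#   node01: NameNode, DataNode, NodeManager, ResourceManager
#   node02: SecondaryNameNode, DataNode, NodeManager
#   node03: DataNode, NodeManager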

Verify in the browser

Uploading a file to HDFS

Open http://node01:50070/ and verify that a file can be uploaded through the HDFS web UI (the host needs hosts entries for the cluster nodes; see FAQ 2).

Verify MapReduce with a wordcount job

Run the wordcount example to count the words in a text file; the text has to be uploaded to HDFS first (see the comment in the block below).

shell
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /tmp/harrypotter.txt /output/harr
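# The input file must already exist in HDFS; one way to put it there beforehand (a sketch,
# assuming harrypotter.txt sits in the current directory):
#   ./bin/hdfs dfs -put harrypotter.txt /tmp/harrypotter.txt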

Download and inspect the results under the output directory.
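
A rough sketch of inspecting the result (the /output/harr path comes from the wordcount command above; the part file name can differ):

shell
./bin/hdfs dfs -ls /output/harr
./bin/hdfs dfs -cat /output/harr/part-r-00000 | head
./bin/hdfs dfs -get /output/harr ./wordcount-output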

At this point the Hadoop platform is up and running: basic YARN, HDFS, and MapReduce functionality is all in place. The Hive, Spark, and Flink environments will be deployed on top of it later.

2. Hive Installation

To be added

3. Spark

To be added

4. Flink

To be added

5. FAQ

1. Errors when installing or using ping inside a container

The container needs the IPC_LOCK and NET_RAW capabilities added at creation time. In Podman Desktop this is under Security -> Capabilities; when creating containers on the command line, add the flags below.

--cap-add=IPC_LOCK --cap-add=NET_RAW (or --cap-add=ALL)

2. The HDFS web UI shows "Couldn't upload the file" when uploading

  1. Check that dfs.permissions=false is set in hdfs-site.xml.
  2. Check that the hosts file of the machine doing the upload contains the IP mappings for the Hadoop cluster nodes; the upload is addressed by hostname.
  3. Check that the target HDFS directory is writable (see the sketch below).
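
For point 3, a minimal sketch of opening up a test directory (the /tmp path is only an example):

shell
hdfs dfs -mkdir -p /tmp
hdfs dfs -chmod -R 777 /tmp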

Configuration reference

ini
CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_mapreduce_map_output_compress=true
YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle

MAPRED_CONF_mapreduce_framework_name=yarn
MAPRED_CONF_mapred_child_java_opts=-Xmx4096m
MAPRED_CONF_mapreduce_map_memory_mb=4096
MAPRED_CONF_mapreduce_reduce_memory_mb=8192
MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m
MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m
MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/

Debian 9 (stretch) package mirrors

Official

deb http://archive.debian.org/debian/ stretch main contrib non-free
deb-src http://archive.debian.org/debian/ stretch main contrib non-free
deb http://archive.debian.org/debian/ stretch-backports main contrib non-free
deb http://archive.debian.org/debian-security/ stretch/updates main contrib non-free
deb-src http://archive.debian.org/debian-security/ stretch/updates main contrib non-free

Tencent Cloud

deb http://mirrors.cloud.tencent.com/debian-archive//debian/ stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian-archive//debian/ stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian-archive//debian/ stretch-backports main contrib non-free
deb http://mirrors.cloud.tencent.com/debian-archive//debian-security/ stretch/updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian-archive//debian-security/ stretch/updates main contrib non-free

Alibaba Cloud

deb http://mirrors.aliyun.com/debian-archive/debian/ stretch main contrib non-free
deb-src http://mirrors.aliyun.com/debian-archive/debian/ stretch main contrib non-free
deb http://mirrors.aliyun.com/debian-archive/debian/ stretch-backports main contrib non-free
deb http://mirrors.aliyun.com/debian-archive/debian-security/ stretch/updates main contrib non-free
deb-src http://mirrors.aliyun.com/debian-archive/debian-security/ stretch/updates main contrib non-free