Hadoop Study Notes

Hadoop Cluster Setup

Hadoop Overview

Hadoop is an open-source framework for distributed storage and computation, designed to process large-scale datasets with high reliability and high performance. Its main core components are:

  • Hadoop Distributed File System (HDFS): HDFS is Hadoop's distributed file storage system. It stores large-scale data and provides high reliability and fault tolerance through block replication and automatic failure recovery.

  • YARN (Yet Another Resource Negotiator): YARN is Hadoop's resource management platform. It schedules and manages the compute resources of the cluster and supports parallel computation by multiple processing frameworks, such as MapReduce and Spark.

  • MapReduce: MapReduce is the distributed programming model originally introduced with Hadoop for processing large datasets in parallel. It splits a computation into a Map phase and a Reduce phase, making full use of the cluster's compute resources.

  • Hadoop ecosystem: beyond the core components above, Hadoop is surrounded by a family of related projects and tools, such as HBase (a distributed column-oriented database), Hive (a data warehouse infrastructure), and Spark (a fast, general-purpose cluster computing engine), which together form a complete big-data processing ecosystem.

Overall, Hadoop provides a powerful set of tools and frameworks for storing, processing, and analyzing large-scale data efficiently in a distributed environment, and it is one of the foundational pieces of infrastructure in the big-data field.

Base Environment

Planning

        hadoop001                      hadoop002                      hadoop003
HDFS    NameNode, DataNode             SecondaryNameNode, DataNode    DataNode
YARN    ResourceManager, NodeManager   NodeManager                    NodeManager

Disable the firewall and SELinux

shell
systemctl disable --now firewalld 
# stop the firewall and keep it from starting at boot
setenforce 0
# put SELinux into permissive mode for the current session
sed -i 's/SELINUX=enforcing/SELINUX=permissive/g'  /etc/selinux/config
# keep SELinux permissive after reboots
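
These commands need to run on all three nodes. A quick way to confirm the result (a minimal check, assuming systemd and the standard SELinux tools are installed):

shell
systemctl is-active firewalld            # expect "inactive"
getenforce                               # expect "Permissive"
grep '^SELINUX=' /etc/selinux/config     # expect "SELINUX=permissive"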

Add host name mappings

shell
cat >> /etc/hosts << lxf
192.168.200.41 hadoop001
192.168.200.42 hadoop002
192.168.200.43 hadoop003
lxf
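
The same mappings must be added on all three nodes. A small loop to confirm that each name resolves and responds (run from any node):

shell
for host in hadoop001 hadoop002 hadoop003; do
    ping -c 1 "$host" > /dev/null && echo "$host is reachable"
done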

Transfer the software packages

shell
# it is best to create dedicated directories for the packages
[root@hadoop001 ~]# mkdir -p /export/{data,servers,software}
[root@hadoop001 ~]# tree /export/
/export/
├── data     # data files
├── servers  # installed services
└── software # installation packages for the services

3 directories, 0 files
[root@hadoop001 ~]# 
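
The archives themselves also have to reach the other two nodes. A sketch of one way to do this from hadoop001, assuming the JDK and Hadoop tarballs listed later in this note are already under /export/software (scp will prompt for a password until the SSH keys below are set up):

shell
for host in hadoop002 hadoop003; do
    ssh "$host" "mkdir -p /export/{data,servers,software}"
    scp /export/software/jdk-8u221-linux-x64.tar.gz \
        /export/software/hadoop-3.3.6-aarch64.tar.gz \
        "$host":/export/software/
done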

Configure passwordless SSH

Passwordless SSH has to be configured on the master and on both slave nodes.

shell
# on the hadoop001 node
[root@hadoop001 ~]#
[root@hadoop001 ~]# ssh-keygen -t rsa 
[root@hadoop001 ~]#
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:2AY6p3kFWziXFmXH7e/PHgrAVZ9H7hkCWAtCc19UwZw root@hadoop001
The key's randomart image is:
+---[RSA 2048]----+
|      .+.+=+o=+o+|
|       .+=o.* oE.|
|      = =  + o +o|
|     . @. .   o.+|
|    o + So     o.|
|     = o  .     .|
|    o .    .   o |
|     .      . ..o|
|             . .=|
+----[SHA256]-----+
[root@hadoop001 ~]#
[root@hadoop001 ~]# hosts=("hadoop001" "hadoop002" "hadoop003"); for host in "${hosts[@]}"; do echo "将公钥分发到 $host"; ssh-copy-id  $host; done
将公钥分发到 hadoop001
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@hadoop001's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop001'"
and check to make sure that only the key(s) you wanted were added.

将公钥分发到 hadoop002
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'hadoop002 (192.168.200.42)' can't be established.
ECDSA key fingerprint is SHA256:ADFjDGD2MxgCqL5fQuWhn+0T5drPiTXERvlMiu/QXjA.
ECDSA key fingerprint is MD5:d2:2b:06:cb:13:48:e0:87:d7:f3:87:8b:2c:56:e4:da.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@hadoop002's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop002'"
and check to make sure that only the key(s) you wanted were added.

将公钥分发到 hadoop003
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@hadoop003's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop003'"
and check to make sure that only the key(s) you wanted were added.
[root@hadoop001 ~]#
[root@hadoop001 ~]# 
# repeat the same steps on hadoop002 and hadoop003 so that the three nodes can log in to each other without a password
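
Once the keys have been exchanged in every direction, a short loop verifies that no password prompt remains (run it on each of the three nodes):

shell
for host in hadoop001 hadoop002 hadoop003; do
    ssh -o BatchMode=yes "$host" hostname
done
# each iteration should print the remote host name without asking for a password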

Install the Services

Install Java and Hadoop

shell
[root@hadoop001]# tree -L 2 /export/
/export/
├── data      # empty for now: no service has been started yet, so there are no data files
├── servers
│   ├── hadoop-3.3.6
│   └── jdk1.8.0_221
└── software
    ├── hadoop-3.3.6-aarch64.tar.gz
    └── jdk-8u221-linux-x64.tar.gz

5 directories, 2 files
[root@hadoop001 export]# 
# extract the archives into the same directories on the other two nodes as well
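
The tree above assumes the two archives have already been unpacked into /export/servers. A minimal sketch of that extraction step (same archive names as shown under /export/software):

shell
cd /export/software
tar -zxvf jdk-8u221-linux-x64.tar.gz  -C /export/servers/
tar -zxvf hadoop-3.3.6-aarch64.tar.gz -C /export/servers/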

Configure Environment Variables

shell
# We put the variables in a dedicated file, /etc/profile.d/my_env.sh, instead of editing /etc/profile directly; files under /etc/profile.d/ are loaded automatically at system startup and at user login.
# Note: the heredoc delimiter below is quoted ('lxf') so that $PATH, $JAVA_HOME and $HADOOP_HOME are written literally into the file instead of being expanded while it is created.

[root@hadoop001 ~]# cat >> /etc/profile.d/my_env.sh << 'lxf' 
> #JAVA_HOME
> export JAVA_HOME=/export/servers/jdk1.8.0_221
> export PATH=$PATH:$JAVA_HOME/bin
> #HADOOP_HOME
> export HADOOP_HOME=/export/servers/hadoop-3.3.6
> export PATH=$PATH:$HADOOP_HOME/bin
> export PATH=$PATH:$HADOOP_HOME/sbin
> lxf
[root@hadoop001 ~]# source /etc/profile
[root@hadoop001 ~]# echo $HADOOP_HOME
/export/servers/hadoop-3.3.6
[root@hadoop001 ~]# echo $JAVA_HOME
/export/servers/jdk1.8.0_221
[root@hadoop001 ~]# 

#测试环境变量
[root@hadoop001 ~]# hadoop version 
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c
Compiled by ubuntu on 2023-06-18T23:15Z
Compiled on platform linux-aarch_64
Compiled with protoc 3.7.1
From source with checksum 5652179ad55f76cb287d9c633bb53bbd
This command was run using /export/servers/hadoop-3.3.6/share/hadoop/common/hadoop-common-3.3.6.jar
[root@hadoop001 ~]# java -version 
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
[root@hadoop001 ~]# 
# distribute the environment file to the other nodes
scp /etc/profile.d/my_env.sh hadoop002:/etc/profile.d/my_env.sh
scp /etc/profile.d/my_env.sh hadoop003:/etc/profile.d/my_env.sh
# after distributing, verify on each node that the environment variables are set correctly
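
A quick way to confirm the variables are effective on the other nodes (a non-interactive ssh session does not read /etc/profile by itself, so it is sourced explicitly here):

shell
for host in hadoop002 hadoop003; do
    echo "== $host =="
    ssh "$host" 'source /etc/profile && hadoop version | head -n 1'
done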

Edit the Hadoop Configuration Files

shell
[root@hadoop001 hadoop]# pwd
/export/servers/hadoop-3.3.6/etc/hadoop
[root@hadoop001 hadoop]# grep  '^export' hadoop-env.sh
export  JAVA_HOME=/export/servers/jdk1.8.0_221
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
[root@hadoop001 hadoop]# 
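
For reference, one way to get those lines into hadoop-env.sh is simply to append them; this is a sketch, and editing the file by hand works just as well:

shell
cd /export/servers/hadoop-3.3.6/etc/hadoop
cat >> hadoop-env.sh << 'EOF'
export JAVA_HOME=/export/servers/jdk1.8.0_221
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
EOF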

Edit core-site.xml

xml
<configuration>
    <!-- default file system: the NameNode RPC address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9000</value>
    </property>
    <!-- base directory for Hadoop's temporary and runtime data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/export/data/hadoop/tmp</value>
    </property>
</configuration>

Edit hdfs-site.xml

xml
<configuration>
    <!-- number of block replicas -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <!-- local directory for NameNode metadata -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/export/data/hadoop/name</value>
    </property>
    <!-- SecondaryNameNode HTTP address (runs on hadoop002) -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop002:50090</value>
    </property>
    <!-- local directory for DataNode block storage -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/export/data/hadoop/data</value>
    </property>
</configuration>

Edit mapred-site.xml

xml
<configuration>
    <!-- run MapReduce jobs on YARN instead of the local runner -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Edit yarn-site.xml

xml
<configuration>
<!-- Site specific YARN configuration properties -->
    <!-- auxiliary shuffle service required by MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- host that runs the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop001</value>
    </property>
</configuration>

Edit the workers file

shell
# overwrite the default workers file (which contains only "localhost") with the three node names
[root@hadoop001 hadoop]# cat > workers << lxf 
> hadoop001
> hadoop002
> hadoop003
> lxf
[root@hadoop001 hadoop]# 

Distribute the configuration files to the cluster

shell
# distribute the configuration directory
# to hadoop002
scp -r /export/servers/hadoop-3.3.6/etc/hadoop hadoop002:/export/servers/hadoop-3.3.6/etc/
# to hadoop003
scp -r /export/servers/hadoop-3.3.6/etc/hadoop hadoop003:/export/servers/hadoop-3.3.6/etc/

Hadoop Cluster

Initialize the cluster file system

Initializing the Hadoop cluster properly matters: it makes sure the components start from a correct state and can coordinate with each other. The first time the cluster is started, the NameNode must be formatted, for the following reasons:

  • File system initialization: HDFS needs its metadata initialized before first use; formatting creates the initial directory structure and fsimage in the NameNode metadata directory.
  • Metadata preparation: formatting assigns a new cluster ID and namespace ID and prepares the empty edit log that the NameNode and DataNodes rely on afterwards.
  • Configuration check: the format step reads the configuration (for example dfs.namenode.name.dir), so configuration mistakes surface early instead of at runtime.
  • Preparing the services: only after this one-time initialization can NameNode, DataNode, ResourceManager, and NodeManager be started and work together as a cluster.
shell
# before starting the cluster for the first time, format the NameNode on the master node hadoop001
# ("hdfs namenode -format" is the current, non-deprecated form of the same command)
[root@hadoop001 ~]# hadoop namenode -format
.......
......
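
If the format succeeded, the NameNode metadata directory configured in hdfs-site.xml now contains the initial metadata; a quick look confirms it:

shell
ls /export/data/hadoop/name/current/
# expect VERSION, seen_txid, and an initial fsimage_0000000000000000000 plus its .md5 file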

Start

Start the services step by step

Start HDFS
shell
[root@hadoop001 ~]# start-dfs.sh 
Start YARN
shell
[root@hadoop001 ~]# start-yarn.sh 

Start the whole cluster with one command

shell
[root@hadoop001 ~]# start-all.sh 
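
After start-all.sh returns, check which daemons are actually running on each node. With the role plan above, hadoop001 should show NameNode, DataNode, ResourceManager, and NodeManager; hadoop002 adds SecondaryNameNode; hadoop003 runs DataNode and NodeManager:

shell
for host in hadoop001 hadoop002 hadoop003; do
    echo "== $host =="
    ssh "$host" 'source /etc/profile && jps'
done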

Shutdown

Stop the services step by step

Stop HDFS
shell
[root@hadoop001 ~]# stop-dfs.sh 
Stop YARN
shell
[root@hadoop001 ~]# stop-yarn.sh 

Stop the whole cluster with one command

shell
[root@hadoop001 ~]# stop-all.sh 

Test

View the Hadoop web UIs

Access successful.

You can add the same host name mappings to C:\Windows\System32\drivers\etc\hosts on the Windows host; the web UIs can then be opened in the host's browser as hostname:port (with Hadoop 3.x defaults, the HDFS NameNode UI is at hadoop001:9870 and the YARN ResourceManager UI at hadoop001:8088).

Without that mapping, only ip:port can be used.
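
Beyond the web UIs, a small command-line smoke test confirms that HDFS accepts reads and writes; the file and directory names here are arbitrary examples:

shell
echo "hello hadoop" > /tmp/hello.txt
hdfs dfs -mkdir -p /test
hdfs dfs -put /tmp/hello.txt /test/
hdfs dfs -ls /test
hdfs dfs -cat /test/hello.txt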
