Big Data Development: Using HDFS in Hadoop
Introduction to HDFS
HDFS (Hadoop Distributed File System) is Hadoop's distributed file system and a core part of the Apache Hadoop project. It stores and manages massive amounts of data and is best suited to large files. HDFS is designed to run over large data sets in a distributed environment, to provide high-throughput data access, and to be deployed on low-cost commodity hardware. Its high fault tolerance and availability make it a good fit for batch processing and big-data workloads.
The main components of HDFS are:
- NameNode: the master node; manages the file system namespace and client access requests, and maintains the directory tree and the mapping from files to data blocks.
- DataNode: a worker node; stores and retrieves file data, serves client read and write requests, and reads or writes data blocks from local disk or from other DataNodes over the network.
Design of a typical file system:
(figure: general file-system design)
Design of the HDFS file system:
(figure: HDFS file-system design)
The HDFS Shell
Command format
```shell
bin/hdfs dfs -xxx scheme://authority/path
```
Here `scheme` is `hdfs`, `authority` is the master node's `ip:port`, and `path` is the path to the file. Together, `scheme://authority` is exactly the value of `fs.defaultFS` that we set in `core-site.xml` when configuring the cluster.
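For reference, the corresponding `core-site.xml` entry looks like the sketch below; the host name `hadoop01` and port `9000` follow the examples later in this article, so adjust them to your own cluster:

```xml
<!-- core-site.xml: default file system used when a bare path is given -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop01:9000</value>
</property>
```

With this set, `hdfs dfs -ls /` is equivalent to `hdfs dfs -ls hdfs://hadoop01:9000/`.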

Common commands
```shell
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
```
```shell
# -ls: list the contents of the given path
[root@hadoop01 bin]# hdfs dfs -ls hdfs://hadoop01:9000/
# -put: upload a local file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -put README.txt hdfs://hadoop01:9000/
# verify the upload
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 1 items
-rw-r--r-- 2 root supergroup 1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
# -cat: print the contents of an HDFS file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://hadoop01:9000/README.txt
For the latest information about Hadoop, please visit our website at:
http://hadoop.apache.org/
and our wiki, at:
http://wiki.apache.org/hadoop/
This distribution includes cryptographic software. The country in
which you currently reside may have restrictions on the import,
possession, use, and/or re-export to another country, of
encryption software. BEFORE using any encryption software, please
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to
see if this is permitted. See <http://www.wassenaar.org/> for more
information.
The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms. The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS
Export Administration Regulations, Section 740.13) for both object
code and source code.
The following provides more details on the included cryptographic
software:
Hadoop Core uses the SSL libraries from the Jetty project written
by mortbay.org.
# -get: download a file to the local file system
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -get hdfs://hadoop01:9000/README.txt readme.txt
[root@hadoop01 hadoop-3.2.0]# ls
bin etc include lib libexec LICENSE.txt NOTICE.txt readme.txt README.txt sbin share
# -mkdir <path>: create a directory
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -mkdir hdfs://hadoop01:9000/test
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 2 items
-rw-r--r-- 2 root supergroup 1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x - root supergroup 0 2024-03-05 12:49 hdfs://hadoop01:9000/test
# -mkdir -p: create directories recursively
bin/hdfs dfs -mkdir -p hdfs://hadoop01:9000/abc/bcd
# -ls -R: list all files recursively
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls -R hdfs://hadoop01:9000/
-rw-r--r-- 2 root supergroup 1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x - root supergroup 0 2024-03-05 12:52 hdfs://hadoop01:9000/abc
drwxr-xr-x - root supergroup 0 2024-03-05 12:52 hdfs://hadoop01:9000/abc/bcd
drwxr-xr-x - root supergroup 0 2024-03-05 12:49 hdfs://hadoop01:9000/test
# -rm: delete a file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm hdfs://hadoop01:9000/README.txt
Deleted hdfs://hadoop01:9000/README.txt
# -rm -r: delete a directory
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm -r hdfs://hadoop01:9000/test
Deleted hdfs://hadoop01:9000/test
# count the entries under /
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | wc -l
3
# list each file's path and size in bytes
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | awk '{print $8,$5}'
/LICENSE.txt 150569
/NOTICE.txt 22125
/README.txt 1361
Operating HDFS from Java
First disable HDFS permission checking; otherwise the Java client, running as an unauthorized Windows user, is rejected. Add the following to `hdfs-site.xml` and restart HDFS for the change to take effect:
```shell
[root@hadoop01 hadoop-3.2.0]# vim etc/hadoop/hdfs-site.xml
```
```xml
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
```
Configuration and code
Add the Hadoop client dependency to `pom.xml`:
```xml
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>
```
```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {

    public static void main(String[] args) throws IOException {
        // Create the configuration object
        Configuration conf = new Configuration();
        // Point the client at the HDFS NameNode
        conf.set("fs.defaultFS", "hdfs://192.168.52.100:9000");
        FileSystem fileSystem = FileSystem.get(conf);
        // Upload a file
        //put(fileSystem);
        // Download a file
        //get(fileSystem);
        // Delete a file
        delete(fileSystem);
    }

    /**
     * Delete a file; the second argument is ignored for files.
     * For a directory, pass true to delete it recursively.
     */
    private static void delete(FileSystem fileSystem) throws IOException {
        fileSystem.delete(new Path("/LICENSE.txt"), true);
    }

    /**
     * Download a file.
     */
    private static void get(FileSystem fileSystem) throws IOException {
        // Open an HDFS input stream
        FSDataInputStream open = fileSystem.open(new Path("/NOTICE.txt"));
        // Open a local output stream
        FileOutputStream fileOutputStream = new FileOutputStream("D:\\dailyProject\\NOTICE.txt");
        IOUtils.copyBytes(open, fileOutputStream, 1024, true);
    }

    /**
     * Upload a file.
     */
    private static void put(FileSystem fileSystem) throws IOException {
        // Open a local input stream for the file to upload
        FileInputStream inputStream = new FileInputStream("D:\\工作文件\\大数据学习笔记\\大数据开发-Hadoop分布式集群搭建.md");
        // Create an HDFS output stream
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/test.md"));
        IOUtils.copyBytes(inputStream, fsDataOutputStream, 1024, true);
    }
}
```
Test results
After running `put`, the uploaded file appears as `/test.md`:
```shell
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 4 items
-rw-r--r-- 2 root supergroup 150569 2024-03-05 12:59 /LICENSE.txt
-rw-r--r-- 2 root supergroup 22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r-- 2 root supergroup 1361 2024-03-05 13:00 /README.txt
-rw-r--r-- 3 1111612 supergroup 7877 2024-03-05 14:01 /test.md
```

After running `delete`, `/LICENSE.txt` is gone:
```shell
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 3 items
-rw-r--r-- 2 root supergroup 22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r-- 2 root supergroup 1361 2024-03-05 13:00 /README.txt
-rw-r--r-- 3 1111612 supergroup 7877 2024-03-05 14:01 /test.md
```