Big Data Development: Using HDFS in Hadoop


Introduction to HDFS

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It is part of the Apache Hadoop Core project and is used to store and manage massive amounts of data; it is best suited to large files. HDFS is designed to run in distributed environments over large data sets, to provide high-throughput data access, and to be deployable on low-cost hardware. It is highly fault-tolerant and highly available, which makes it a good fit for batch processing and big-data workloads.

The main components of HDFS are:

  • NameNode: the master node. It manages the file system namespace and client access requests, and maintains the directory tree and the mapping from files to data blocks.
  • DataNode: the worker nodes. They store and retrieve file data, serve client read and write requests, and read or write data blocks from local disk or from other DataNodes over the network.

Design of a typical file system:

Design of the HDFS file system:

The HDFS Shell

Command format

shell:
bin/hdfs dfs -xxx scheme://authority/path

Here, scheme is simply hdfs for HDFS; authority is the master node's ip:port; and path is the path of the file. Taken together, the scheme://authority part is exactly the fs.defaultFS value we set in core-site.xml when configuring the cluster.
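As a quick sanity check, the way scheme, authority, and path combine into a full HDFS URI can be sketched in plain shell. This is only an illustration of the URI structure: hadoop01:9000 is the NameNode address used throughout this article, so substitute your own fs.defaultFS value.

```shell
# Assemble a full HDFS URI from its three parts.
# hadoop01:9000 is this article's NameNode address; replace it with your own.
SCHEME="hdfs"
AUTHORITY="hadoop01:9000"
FILE_PATH="/README.txt"
echo "${SCHEME}://${AUTHORITY}${FILE_PATH}"
```

When fs.defaultFS is configured, the scheme://authority prefix can usually be dropped, so `hdfs dfs -ls /` resolves against the configured NameNode automatically.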

Common commands

shell:
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]
shell:
# -ls: list the given path (an empty path means the root directory)
[root@hadoop01 bin]# hdfs dfs -ls hdfs://hadoop01:9000/

# -put: upload a local file to HDFS
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -put README.txt hdfs://hadoop01:9000/
# Verify the upload
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 1 items
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt

# -cat: print the contents of an HDFS file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://hadoop01:9000/README.txt
For the latest information about Hadoop, please visit our website at:

   http://hadoop.apache.org/

and our wiki, at:

   http://wiki.apache.org/hadoop/

This distribution includes cryptographic software.  The country in 
which you currently reside may have restrictions on the import, 
possession, use, and/or re-export to another country, of 
encryption software.  BEFORE using any encryption software, please 
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to 
see if this is permitted.  See <http://www.wassenaar.org/> for more
information.

The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity 
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms.  The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS 
Export Administration Regulations, Section 740.13) for both object 
code and source code.

The following provides more details on the included cryptographic
software:
  Hadoop Core uses the SSL libraries from the Jetty project written 
by mortbay.org.


# -get: download a file to the local file system
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -get hdfs://hadoop01:9000/README.txt readme.txt
[root@hadoop01 hadoop-3.2.0]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  readme.txt  README.txt  sbin  share


# -mkdir <path>: create a directory
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -mkdir hdfs://hadoop01:9000/test
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 2 items
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x   - root supergroup          0 2024-03-05 12:49 hdfs://hadoop01:9000/test

# -mkdir -p: create directories recursively
bin/hdfs dfs -mkdir -p hdfs://hadoop01:9000/abc/bcd

# -ls -R: list all files recursively
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls -R hdfs://hadoop01:9000/
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x   - root supergroup          0 2024-03-05 12:52 hdfs://hadoop01:9000/abc
drwxr-xr-x   - root supergroup          0 2024-03-05 12:52 hdfs://hadoop01:9000/abc/bcd
drwxr-xr-x   - root supergroup          0 2024-03-05 12:49 hdfs://hadoop01:9000/test

# -rm: delete a file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm hdfs://hadoop01:9000/README.txt
Deleted hdfs://hadoop01:9000/README.txt


# -rm -r: delete a directory recursively
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm -r hdfs://hadoop01:9000/test
Deleted hdfs://hadoop01:9000/test


# Count the number of entries in the root directory
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | wc -l
3

# List each file's path and size in bytes
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | awk '{print $8,$5}'
/LICENSE.txt 150569
/NOTICE.txt 22125
/README.txt 1361
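The two pipelines above only operate on the text of the listing, so they can be tried offline against a saved copy of the `-ls` output. Below is a sketch using the sample listing from this article; matching on the leading mode string (`-` or `d`) is a slightly more robust filter than `grep /`, since it also skips the "Found N items" header. (Note that HDFS has built-in equivalents: `hdfs dfs -count` and `hdfs dfs -du -s` report counts and total sizes directly.)

```shell
# Sample `hdfs dfs -ls /` output, copied from the listing above.
cat > /tmp/ls_sample.txt <<'EOF'
Found 3 items
-rw-r--r--   2 root supergroup     150569 2024-03-05 12:59 /LICENSE.txt
-rw-r--r--   2 root supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root supergroup       1361 2024-03-05 13:00 /README.txt
EOF
# Entry count: match lines that start with a mode string, skipping "Found N items".
grep -c '^[-d]' /tmp/ls_sample.txt
# Total size in bytes: sum column 5 (the size field).
awk '/^[-d]/ {sum += $5} END {print sum}' /tmp/ls_sample.txt
```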

Operating HDFS from Java

First disable HDFS permission checking, because the local Windows user running the Java code is not an authorized HDFS user. (Restart HDFS after editing hdfs-site.xml for the change to take effect.)

shell:
# Add the following property
[root@hadoop01 hadoop-3.2.0]# vim etc/hadoop/hdfs-site.xml 
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>

Configuration and code

xml:
<!-- Add the hadoop-client dependency to pom.xml -->
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.2.0</version>
        </dependency>
java:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {

    public static void main(String[] args) throws IOException {
        // Create the configuration object
        Configuration conf = new Configuration();
        // Point it at the HDFS NameNode
        conf.set("fs.defaultFS", "hdfs://192.168.52.100:9000");
        FileSystem fileSystem = FileSystem.get(conf);
        // Upload a file
        //put(fileSystem);
        // Download a file
        //get(fileSystem);
        // Delete a file
        delete(fileSystem);
    }

    /**
     * Delete a file or directory. For a plain file the second argument is
     * ignored; for a directory, pass true to delete it recursively.
     * @param fileSystem
     * @throws IOException
     */
    private static void delete(FileSystem fileSystem) throws IOException {
        fileSystem.delete(new Path("/LICENSE.txt"), true);
    }

    /**
     * Download a file
     * @param fileSystem
     * @throws IOException
     */
    private static void get(FileSystem fileSystem) throws IOException {
        // Open an input stream on the HDFS file
        FSDataInputStream open = fileSystem.open(new Path("/NOTICE.txt"));
        // Open an output stream on the local target file
        FileOutputStream fileOutputStream = new FileOutputStream("D:\\dailyProject\\NOTICE.txt");
        IOUtils.copyBytes(open, fileOutputStream, 1024, true);
    }

    /**
     * Upload a file
     * @param fileSystem
     * @throws IOException
     */
    private static void put(FileSystem fileSystem) throws IOException {
        // Open an input stream on the local source file
        FileInputStream inputStream = new FileInputStream("D:\\工作文件\\大数据学习笔记\\大数据开发-Hadoop分布式集群搭建.md");
        // Create an output stream to the HDFS target file
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/test.md"));
        IOUtils.copyBytes(inputStream, fsDataOutputStream, 1024, true);
    }
}

Test results

shell:
# After running put(): /test.md has been uploaded
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 4 items
-rw-r--r--   2 root    supergroup     150569 2024-03-05 12:59 /LICENSE.txt
-rw-r--r--   2 root    supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root    supergroup       1361 2024-03-05 13:00 /README.txt
-rw-r--r--   3 1111612 supergroup       7877 2024-03-05 14:01 /test.md
shell:
# After running delete(): /LICENSE.txt is gone
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   2 root    supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root    supergroup       1361 2024-03-05 13:00 /README.txt
-rw-r--r--   3 1111612 supergroup       7877 2024-03-05 14:01 /test.md
