Big Data Development: Using HDFS in Hadoop


Introduction to HDFS

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system and part of Apache Hadoop Core. It is used to store and manage massive data sets and is best suited to large files. HDFS is designed to run in a distributed environment over large data sets, to provide high-throughput data access, and to be deployable on low-cost commodity hardware. It is highly fault-tolerant and highly available, which makes it a good fit for batch processing and big-data workloads.

The main components of HDFS:

  • NameNode: the master node. It manages the file system namespace and client access requests, maintaining the directory tree and the mapping from files to data blocks.
  • DataNode: the slave nodes. They store and retrieve file data, serve client read and write requests, and read or write data blocks from local disk or from other DataNodes over the network.

Design of a typical file system: (diagram not reproduced)

Design of the HDFS file system: (diagram not reproduced)

The HDFS Shell

Command format

```shell
bin/hdfs dfs -xxx scheme://authority/path
```

Here, scheme for HDFS is simply hdfs, authority is the master node's ip:port, and path is the path of the file. Together, scheme://authority is exactly the value we set for fs.defaultFS in core-site.xml when configuring the cluster.
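For reference, a minimal core-site.xml carrying this setting might look as follows. The host name hadoop01 and port 9000 are taken from the command examples below; adjust them to your own master node.

```xml
<configuration>
    <!-- Default file system URI: scheme://authority -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop01:9000</value>
    </property>
</configuration>
```

With fs.defaultFS set, the scheme and authority can be omitted on the command line: `hdfs dfs -ls /` is equivalent to `hdfs dfs -ls hdfs://hadoop01:9000/`.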

Common commands

Running `bin/hdfs dfs` with no arguments prints the full usage:

```shell
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]
```
Examples of the most common operations:

```shell
# -ls: list the given path (an empty path means the root directory)
[root@hadoop01 bin]# hdfs dfs -ls hdfs://hadoop01:9000/

# -put: upload a local file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -put README.txt hdfs://hadoop01:9000/
# verify the upload
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 1 items
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
```

```shell
# -cat: print the contents of an HDFS file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://hadoop01:9000/README.txt
For the latest information about Hadoop, please visit our website at:

   http://hadoop.apache.org/

and our wiki, at:

   http://wiki.apache.org/hadoop/

This distribution includes cryptographic software.  The country in
which you currently reside may have restrictions on the import,
possession, use, and/or re-export to another country, of
encryption software.  BEFORE using any encryption software, please
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to
see if this is permitted.  See <http://www.wassenaar.org/> for more
information.

The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms.  The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS
Export Administration Regulations, Section 740.13) for both object
code and source code.

The following provides more details on the included cryptographic
software:
  Hadoop Core uses the SSL libraries from the Jetty project written
by mortbay.org.
```


```shell
# -get: download a file to the local file system
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -get hdfs://hadoop01:9000/README.txt readme.txt
[root@hadoop01 hadoop-3.2.0]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  readme.txt  README.txt  sbin  share
```


```shell
# -mkdir <path>: create a directory
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -mkdir hdfs://hadoop01:9000/test
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 2 items
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x   - root supergroup          0 2024-03-05 12:49 hdfs://hadoop01:9000/test

# -mkdir -p: create parent directories recursively
bin/hdfs dfs -mkdir -p hdfs://hadoop01:9000/abc/bcd

# -ls -R: list all files recursively
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls -R hdfs://hadoop01:9000/
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x   - root supergroup          0 2024-03-05 12:52 hdfs://hadoop01:9000/abc
drwxr-xr-x   - root supergroup          0 2024-03-05 12:52 hdfs://hadoop01:9000/abc/bcd
drwxr-xr-x   - root supergroup          0 2024-03-05 12:49 hdfs://hadoop01:9000/test
```

```shell
# -rm: delete a file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm hdfs://hadoop01:9000/README.txt
Deleted hdfs://hadoop01:9000/README.txt

# -rm -r: delete a directory recursively
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm -r hdfs://hadoop01:9000/test
Deleted hdfs://hadoop01:9000/test
```


```shell
# count the number of entries in the root directory
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | wc -l
3
```

```shell
# print each file's path and size in bytes ($8 is the path, $5 is the size)
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | awk '{print $8,$5}'
/LICENSE.txt 150569
/NOTICE.txt 22125
/README.txt 1361
```
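The awk one-liner above simply picks out whitespace-separated fields from each line of ls output. For readers following along in Java (the language used in the next section), the same field extraction can be sketched as follows; the sample line is taken from the listing above:

```java
public class LsParse {
    public static void main(String[] args) {
        // One line of `hdfs dfs -ls` output, as shown above.
        String line = "-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 /README.txt";
        // Split on runs of whitespace; awk fields are 1-indexed,
        // so $5 is index 4 (size) and $8 is index 7 (path).
        String[] f = line.trim().split("\\s+");
        String size = f[4];
        String path = f[7];
        System.out.println(path + " " + size);  // prints "/README.txt 1361"
    }
}
```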

Operating HDFS from Java

First, disable HDFS permission checking, because the Windows user running the client is not an authorized HDFS user. (Turning off permission checks like this is only appropriate in a development environment.)

Add the following property to hdfs-site.xml, then restart HDFS for it to take effect:

```shell
[root@hadoop01 hadoop-3.2.0]# vim etc/hadoop/hdfs-site.xml
```

```xml
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
```
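An alternative that avoids turning off permission checks cluster-wide is to have the client identify itself as an authorized user via the standard HADOOP_USER_NAME setting, which the Hadoop client library reads from the environment or from a Java system property. A minimal sketch (the user name root matches the cluster examples above):

```java
public class HdfsClientUser {
    public static void main(String[] args) {
        // Tell the HDFS client library which user name to present to the NameNode.
        // This must be set before the FileSystem instance is created.
        System.setProperty("HADOOP_USER_NAME", "root");
        System.out.println(System.getProperty("HADOOP_USER_NAME"));  // prints "root"
    }
}
```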

Dependencies and code

shell 复制代码
   # pom文件中添加hadoop客户端
   <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.2.0</version>
        </dependency>
```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {

    public static void main(String[] args) throws IOException {
        // Create the configuration object
        Configuration conf = new Configuration();
        // Point the client at the HDFS NameNode
        conf.set("fs.defaultFS", "hdfs://192.168.52.100:9000");
        FileSystem fileSystem = FileSystem.get(conf);
        // Upload a file
        //put(fileSystem);
        // Download a file
        //get(fileSystem);
        // Delete a file
        delete(fileSystem);
    }

    /**
     * Delete a file or directory.
     * For a file, the second argument is ignored; for a directory,
     * passing true enables recursive deletion.
     * @param fileSystem
     * @throws IOException
     */
    private static void delete(FileSystem fileSystem) throws IOException {
        fileSystem.delete(new Path("/LICENSE.txt"), true);
    }

    /**
     * Download a file.
     * @param fileSystem
     * @throws IOException
     */
    private static void get(FileSystem fileSystem) throws IOException {
        // Open an input stream on the HDFS file
        FSDataInputStream open = fileSystem.open(new Path("/NOTICE.txt"));
        // Open an output stream on the local destination file
        FileOutputStream fileOutputStream = new FileOutputStream("D:\\dailyProject\\NOTICE.txt");
        IOUtils.copyBytes(open, fileOutputStream, 1024, true);
    }

    /**
     * Upload a file.
     * @param fileSystem
     * @throws IOException
     */
    private static void put(FileSystem fileSystem) throws IOException {
        // Open an input stream on the local file to upload
        FileInputStream inputStream = new FileInputStream("D:\\工作文件\\大数据学习笔记\\大数据开发-Hadoop分布式集群搭建.md");
        // Create an output stream on the target HDFS file
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/test.md"));
        IOUtils.copyBytes(inputStream, fsDataOutputStream, 1024, true);
    }
}
```

Test results

After running put, the uploaded test.md appears in the root directory:

```shell
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 4 items
-rw-r--r--   2 root    supergroup     150569 2024-03-05 12:59 /LICENSE.txt
-rw-r--r--   2 root    supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root    supergroup       1361 2024-03-05 13:00 /README.txt
-rw-r--r--   3 1111612 supergroup       7877 2024-03-05 14:01 /test.md
```
After running delete, /LICENSE.txt is gone:

```shell
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   2 root    supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root    supergroup       1361 2024-03-05 13:00 /README.txt
-rw-r--r--   3 1111612 supergroup       7877 2024-03-05 14:01 /test.md
```
