Big Data Development: Using HDFS in Hadoop



Introduction to HDFS

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It is part of the Apache Hadoop Core project and is used to store and manage massive amounts of data; it is well suited to storing large files. HDFS is designed to run in distributed environments over very large data sets, to provide high-throughput data access, and to be deployable on low-cost commodity hardware. It is highly fault-tolerant and highly available, which makes it a good fit for batch processing and big-data workloads.

The main components of HDFS are:

  • NameNode: the master node; it manages the file system namespace and client access requests, and maintains the directory tree and the mapping from files to data blocks.
  • DataNode: a worker node; it stores and retrieves file data, serving client read and write requests by reading or writing data blocks on its local disks, or from other DataNodes over the network.
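HDFS stores each file as a sequence of fixed-size blocks (128 MB by default in Hadoop 3.x) replicated across DataNodes, while the NameNode keeps only the file-to-block mapping. As a rough local analogy of that splitting step (tiny 4-byte "blocks" purely for illustration; this is not how HDFS is implemented):

```shell
# Rough analogy of HDFS block splitting: chop a file into fixed-size chunks,
# the way HDFS chops a file into 128 MB blocks spread over DataNodes.
printf 'abcdefghij' > /tmp/hdfs_demo_file
rm -f /tmp/hdfs_demo_block_*
split -b 4 -d /tmp/hdfs_demo_file /tmp/hdfs_demo_block_
ls /tmp/hdfs_demo_block_*   # three "blocks": _00 (abcd), _01 (efgh), _02 (ij)
```

The NameNode's job corresponds to remembering which "chunk files" make up the original file and where each replica lives.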

Design of a typical file system:

Design of the HDFS file system:

Introduction to the HDFS shell

Command format

bin/hdfs dfs -xxx scheme://authority/path

Here scheme is hdfs for HDFS, authority is the NameNode's ip:port, and path is the path of the file. The scheme://authority part is exactly the fs.defaultFS value we set in core-site.xml when configuring the cluster.
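For reference, a minimal core-site.xml entry matching the cluster used below (hadoop01:9000 is this cluster's NameNode address; substitute your own):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop01:9000</value>
  </property>
</configuration>
```

When fs.defaultFS is set, the scheme://authority prefix can be dropped and paths like /README.txt resolve against it automatically.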

Common commands

[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]
# -ls: list the contents of the given path (the root directory here)
[root@hadoop01 bin]# hdfs dfs -ls hdfs://hadoop01:9000/

# -put: upload a local file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -put README.txt hdfs://hadoop01:9000/
# verify the upload
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 1 items
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt

# -cat: print the contents of an HDFS file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -cat hdfs://hadoop01:9000/README.txt
For the latest information about Hadoop, please visit our website at:

   http://hadoop.apache.org/

and our wiki, at:

   http://wiki.apache.org/hadoop/

This distribution includes cryptographic software.  The country in 
which you currently reside may have restrictions on the import, 
possession, use, and/or re-export to another country, of 
encryption software.  BEFORE using any encryption software, please 
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to 
see if this is permitted.  See <http://www.wassenaar.org/> for more
information.

The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity 
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms.  The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS 
Export Administration Regulations, Section 740.13) for both object 
code and source code.

The following provides more details on the included cryptographic
software:
  Hadoop Core uses the SSL libraries from the Jetty project written 
by mortbay.org.


# -get: download a file to the local filesystem
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -get hdfs://hadoop01:9000/README.txt readme.txt
[root@hadoop01 hadoop-3.2.0]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  readme.txt  README.txt  sbin  share


# -mkdir <path>: create a directory
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -mkdir hdfs://hadoop01:9000/test
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls hdfs://hadoop01:9000/
Found 2 items
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x   - root supergroup          0 2024-03-05 12:49 hdfs://hadoop01:9000/test

# -mkdir -p: create directories recursively
bin/hdfs dfs -mkdir -p hdfs://hadoop01:9000/abc/bcd

# -ls -R: list all files recursively
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls -R hdfs://hadoop01:9000/
-rw-r--r--   2 root supergroup       1361 2024-03-05 12:42 hdfs://hadoop01:9000/README.txt
drwxr-xr-x   - root supergroup          0 2024-03-05 12:52 hdfs://hadoop01:9000/abc
drwxr-xr-x   - root supergroup          0 2024-03-05 12:52 hdfs://hadoop01:9000/abc/bcd
drwxr-xr-x   - root supergroup          0 2024-03-05 12:49 hdfs://hadoop01:9000/test

# -rm: delete a file
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm hdfs://hadoop01:9000/README.txt
Deleted hdfs://hadoop01:9000/README.txt


# -rm -r: delete a directory
[root@hadoop01 hadoop-3.2.0]# bin/hdfs dfs -rm -r hdfs://hadoop01:9000/test
Deleted hdfs://hadoop01:9000/test


# count the entries in the root directory
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | wc -l
3

# list file sizes (path and size in bytes)
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls / | grep / | awk '{print $8,$5}'
/LICENSE.txt 150569
/NOTICE.txt 22125
/README.txt 1361
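The same awk pipeline can also total the sizes. A minimal sketch, run here against sample -ls output instead of a live cluster (the final commented line is the form you would run on the cluster):

```shell
# Count entries and sum the size column (field 5) of an `hdfs dfs -ls` listing.
# Sample listing text stands in for a live cluster here.
listing='-rw-r--r--   2 root supergroup     150569 2024-03-05 12:59 /LICENSE.txt
-rw-r--r--   2 root supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root supergroup       1361 2024-03-05 13:00 /README.txt'
printf '%s\n' "$listing" | wc -l                               # file count: 3
printf '%s\n' "$listing" | awk '{sum += $5} END {print sum}'   # total bytes: 174055
# On the cluster: hdfs dfs -ls / | grep ^- | awk '{sum += $5} END {print sum}'
```

Filtering on `^-` (rather than `/`) restricts the sum to plain files, since directory entries start with `d` and always report size 0.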

Operating HDFS from Java

First, disable HDFS permission checking; otherwise the Windows user running the client is not authorized to write to HDFS.

Add the following property to hdfs-site.xml (restart HDFS for the change to take effect):
[root@hadoop01 hadoop-3.2.0]# vim etc/hadoop/hdfs-site.xml 
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
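As an alternative to disabling permission checking cluster-wide, you can tell the client which user to act as: when Kerberos security is off, the Hadoop client takes its identity from the HADOOP_USER_NAME environment variable if it is set. A sketch, assuming root is the HDFS superuser on this cluster:

```shell
# Run the Java client as "root" without turning off HDFS permissions.
# With security disabled, the Hadoop client reads its identity from
# HADOOP_USER_NAME instead of the local OS username.
export HADOOP_USER_NAME=root
```

Set this in the environment (or the IDE run configuration) of the machine running the Java client before it connects.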

Configuration and code

Add the hadoop-client dependency to the project's pom.xml:
   <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.2.0</version>
        </dependency>
Client code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {

    public static void main(String[] args) throws IOException {
        // Create the configuration object
        Configuration conf = new Configuration();
        // Point the client at the HDFS NameNode
        conf.set("fs.defaultFS", "hdfs://192.168.52.100:9000");
        FileSystem fileSystem = FileSystem.get(conf);
        // Upload a file
        //put(fileSystem);
        // Download a file
        //get(fileSystem);
        // Delete a file
        delete(fileSystem);
    }

    /**
     * Delete a file; the second argument is ignored for plain files.
     * To delete a directory, pass true as the second argument for a recursive delete.
     * @param fileSystem
     * @throws IOException
     */
    private static void delete(FileSystem fileSystem) throws IOException {
        fileSystem.delete(new Path("/LICENSE.txt"), true);
    }

    /**
     * Download a file
     * @param fileSystem
     * @throws IOException
     */
    private static void get(FileSystem fileSystem) throws IOException {
        // Open an input stream on the HDFS file
        FSDataInputStream open = fileSystem.open(new Path("/NOTICE.txt"));
        // Open an output stream on the local destination file
        FileOutputStream fileOutputStream = new FileOutputStream("D:\\dailyProject\\NOTICE.txt");
        // Copy with a 1 KB buffer and close both streams when done
        IOUtils.copyBytes(open, fileOutputStream, 1024, true);
    }

    /**
     * Upload a file
     * @param fileSystem
     * @throws IOException
     */
    private static void put(FileSystem fileSystem) throws IOException {
        // Open an input stream on the local file to upload
        FileInputStream inputStream = new FileInputStream("D:\\工作文件\\大数据学习笔记\\大数据开发-Hadoop分布式集群搭建.md");
        // Create the HDFS file and get an output stream to it
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/test.md"));
        IOUtils.copyBytes(inputStream, fsDataOutputStream, 1024, true);
    }
}

Test results

# After uploading test.md with put(fileSystem):
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 4 items
-rw-r--r--   2 root    supergroup     150569 2024-03-05 12:59 /LICENSE.txt
-rw-r--r--   2 root    supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root    supergroup       1361 2024-03-05 13:00 /README.txt
-rw-r--r--   3 1111612 supergroup       7877 2024-03-05 14:01 /test.md
# After delete(fileSystem) removes /LICENSE.txt:
[root@hadoop01 hadoop-3.2.0]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   2 root    supergroup      22125 2024-03-05 12:59 /NOTICE.txt
-rw-r--r--   2 root    supergroup       1361 2024-03-05 13:00 /README.txt
-rw-r--r--   3 1111612 supergroup       7877 2024-03-05 14:01 /test.md
