Hadoop: HDFS File System Shell Commands

Reference: Apache Hadoop 3.3.4 -- Overview (official documentation)

01.appendToFile

sh 复制代码
hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile   # reads the input from stdin
hdfs dfs -appendToFile /root/tmp/202302/02/1.txt hdfs://192.168.88.161:8020/tmp/test20230202/1.txt

02.cat

-ignoreCrc Disable checksum verification
sh 复制代码
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4

03.checksum

-v Display the block size of the file
sh 复制代码
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///etc/hosts

04.chgrp

Change the group association of files. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.

-R Change the group association of files recursively
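
A minimal usage sketch; the group name hadoop and the path /user/hadoop/dir1 are illustrative placeholders, not values from the original post.
sh 复制代码
# recursively set the group of a directory and everything under it
hadoop fs -chgrp -R hadoop /user/hadoop/dir1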

05.chmod

-R Change the permissions of files recursively
sh 复制代码
hdfs dfs -chmod -R 777 /tmp/tmp

06.chown

-R Change the owner of files recursively
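
A minimal usage sketch; the owner:group pair hadoop:hadoop and the path /user/hadoop/dir1 are illustrative placeholders.
sh 复制代码
# recursively change both the owner and the group of a directory tree
hadoop fs -chown -R hadoop:hadoop /user/hadoop/dir1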

07.copyFromLocal

Upload files from the local file system to HDFS; same as -put, except that the source is restricted to a local file reference.
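
A minimal usage sketch; localfile and /user/hadoop/hadoopfile follow the placeholder names used in the put examples below.
sh 复制代码
# upload a local file, overwriting the destination if it already exists
hadoop fs -copyFromLocal -f localfile /user/hadoop/hadoopfile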

08.copyToLocal

Download files from HDFS to the local file system; same as -get, except that the destination is restricted to a local file reference.
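
A minimal usage sketch; the HDFS path and the local target directory are illustrative placeholders.
sh 复制代码
# download an HDFS file into the local directory /root/tmp
hadoop fs -copyToLocal /user/hadoop/hadoopfile /root/tmp/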

09.count

Count the number of directories, files and bytes under the paths that match the specified file pattern, and report quota and usage. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.

-q The -q and -u options control which columns appear in the output. -q means show quotas, -u limits the output to show only quotas and usage.
-u The -q and -u options control which columns appear in the output. -q means show quotas, -u limits the output to show only quotas and usage.
-v Display a header line.
-x The -x option excludes snapshots from the result calculation. Without -x (the default), the result is always calculated from all INodes, including all snapshots under the given path. -x is ignored if -u or -q is given.
-h Show sizes in a human-readable format (B, K, M, G).
-e Display the erasure coding policy.
-s The -s option shows the snapshot count for each directory.
sh 复制代码
hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hadoop fs -count -q -h -v hdfs://nn1.example.com/file1
hadoop fs -count -u hdfs://nn1.example.com/file1
hadoop fs -count -u -h hdfs://nn1.example.com/file1
hadoop fs -count -u -h -v hdfs://nn1.example.com/file1
hadoop fs -count -e hdfs://nn1.example.com/file1
hadoop fs -count -s hdfs://nn1.example.com/file1

10.test

Check whether a file or directory exists in HDFS. See the usage sketch after the option table.

Option Description
-d Return 0 if the path is a directory, otherwise return 1.
-e Return 0 if the path exists, otherwise return 1.
-f Return 0 if the path is a file, otherwise return 1.
-s Return 0 if the file at the path is larger than zero bytes, otherwise return 1.
-z Return 0 if the file at the path is zero bytes in size, otherwise return 1.
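
A hedged shell sketch showing how the exit status is typically consumed; the path reuses /tmp/test20230202/1.txt from the appendToFile example and is only illustrative.
sh 复制代码
# branch on whether the file exists in HDFS
if hdfs dfs -test -e /tmp/test20230202/1.txt; then
  echo "file exists"
else
  echo "file does not exist"
fi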

11.getmerge

sh 复制代码
# merge the files under an HDFS directory and download them as a single local file
hdfs dfs -getmerge hdfs://ip:port/tmp/tmp ./value.txt

12.expunge

Empty the trash.
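
A minimal sketch of the call; it removes trash checkpoints older than the retention threshold (fs.trash.interval) and creates a new checkpoint.
sh 复制代码
# purge expired trash checkpoints and create a new one
hadoop fs -expunge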

13.skipTrash

Not a standalone command but an option of -rm: delete directly, without moving the file to the trash.
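
A minimal usage sketch; the path /tmp/tmp reuses the placeholder from the chmod example.
sh 复制代码
# delete a directory recursively, bypassing the trash
hdfs dfs -rm -r -skipTrash /tmp/tmp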

14.report

Show the total capacity and usage of HDFS.
sh 复制代码
hdfs dfsadmin -report

15.distcp

Option Description
-append Reuse existing data in the target files and append new data to them where possible, instead of overwriting.
-async Run DistCp without blocking; the job is launched and the command returns immediately.
-atomic Commit all changes or none.
-bandwidth Specify the bandwidth of each map, in MB/second.
-delete Delete files that exist in the destination but not in the source; the deleted files go through the HDFS trash.
-diff Use a snapshot diff report to identify the differences between source and target.
-f Use a file containing the list of paths to copy.
-filelimit (Deprecated!) Limit the number of files copied to <= n.
-filters Exclude paths from the list of files being copied.
-i Ignore failures during the copy.
-log Folder on HDFS where DistCp execution logs are saved.
-m Limit the number of maps launched; by default one map is used per file, with at most 20 maps in total.
-mapredSslConf Configuration file for SSL, used with hftps://.
-numListstatusThreads Number of threads used to build the file listing (at most 40); increase this value when the directory structure is complex.
-overwrite Unconditionally overwrite target files, even if they already exist.
-p Preserve source file status (rbugpcaxt): replication, block size, user, group, permissions, checksum type, ACL, XATTR, timestamps.
-sizelimit (Deprecated!) Limit the number of bytes copied to <= n bytes.
-skipcrccheck Skip CRC checks between source and target paths.
-strategy Choose the copy strategy. The default, uniformsize, balances the total file size copied by each map; dynamic lets faster maps copy more files, which can improve performance.
-tmp Intermediate working path to be used for the atomic commit.
-update Skip files whose name and size already match the target; overwrite target files whose size differs from the source.
sh 复制代码
hadoop distcp -i  -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/

hadoop distcp -i -update -delete -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp

16.find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

The following primary expressions are recognised:

  • -name pattern

    -iname pattern

    Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.

  • -print

    -print0

    Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.

The following operators are recognised:

  • expression -a expression

    expression -and expression

    expression expression

    Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.

Example:

sh 复制代码
hadoop fs -find / -name test -print

17.ls

Usage: hadoop fs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] <args>

Options:

  • -C: Display the paths of files and directories only.
  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
  • -q: Print ? instead of non-printable characters.
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
  • -e: Display the erasure coding policy of files and directories only.

For a file ls returns stat on the file with the following format:

sh 复制代码
permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

sh 复制代码
permissions userid groupid modification_date modification_time dirname

Files within a directory are ordered by filename by default.

Example:

sh 复制代码
hadoop fs -ls /user/hadoop/file1
hadoop fs -ls -e /ecdir

18.mkdir

Usage: hadoop fs -mkdir [-p] <paths>

Takes path uri's as argument and creates directories.

Options:

  • The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

sh 复制代码
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

19.mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

sh 复制代码
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

20.put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [-t <thread count>] [-q <thread pool queue size>] [ - | <localsrc> ...] <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to "-"

Copying fails if the file already exists, unless the -f flag is given.

Options:

  • -p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
  • -f : Overwrites the destination if it already exists.
  • -l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
  • -d : Skip creation of temporary file with the suffix ._COPYING_.
  • -t <thread count> : Number of threads to be used, default is 1. Useful when uploading directories containing more than 1 file.
  • -q <thread pool queue size> : Thread pool queue size to be used, default is 1024. It takes effect only when thread count greater than 1.

Examples:

sh 复制代码
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile   # reads the input from stdin
hadoop fs -put -t 5 localdir hdfs://nn.example.com/hadoop/hadoopdir
hadoop fs -put -t 10 -q 2048 localdir1 localdir2 hdfs://nn.example.com/hadoop/hadoopdir

21.rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI ...]

Delete files specified as args.

If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).

Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).

See expunge about deletion of files in trash.

Options:

  • The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
  • The -R option deletes the directory and any content under it recursively.
  • The -r option is equivalent to -R.
  • The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
  • The -safely option will require safety confirmation before deleting directory with total number of files greater than hadoop.shell.delete.limit.num.files (in core-site.xml, default: 100). It can be used with -skipTrash to prevent accidental deletion of large directories. Delay is expected when walking over large directory recursively to count the number of files to be deleted before the confirmation.

Example:

sh 复制代码
hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

22.rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

  • --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

sh 复制代码
hadoop fs -rmdir /user/hadoop/emptydir

23.tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

  • The -f option will output appended data as the file grows, as in Unix.

Example:

sh 复制代码
hadoop fs -tail pathname

24.touch

Usage: hadoop fs -touch [-a] [-m] [-t TIMESTAMP] [-c] URI [URI ...]

Updates the access and modification times of the file specified by the URI to the current time. If the file does not exist, then a zero length file is created at URI with current time as the timestamp of that URI.

  • Use -a option to change only the access time
  • Use -m option to change only the modification time
  • Use -t option to specify timestamp (in format yyyyMMdd:HHmmss) instead of current time
  • Use -c option to not create file if it does not exist

The timestamp format is as follows:

  • yyyy Four digit year (e.g. 2018)
  • MM Two digit month of the year (e.g. 08 for month of August)
  • dd Two digit day of the month (e.g. 01 for first day of the month)
  • HH Two digit hour of the day using 24 hour notation (e.g. 23 stands for 11 pm, 11 stands for 11 am)
  • mm Two digit minutes of the hour
  • ss Two digit seconds of the minute

e.g. 20180809:230000 represents August 9th 2018, 11pm

Example:

sh 复制代码
hadoop fs -touch pathname
hadoop fs -touch -m -t 20180809:230000 pathname
hadoop fs -touch -t 20180809:230000 pathname
hadoop fs -touch -a pathname

25.touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length. An error is returned if the file exists with non-zero length.

Example:

sh 复制代码
hadoop fs -touchz pathname

26.help

sh 复制代码
# show the help text for the ls command
hadoop fs -help ls

27.Convert an fsimage file to XML

sh 复制代码
hdfs oiv -p <output type, e.g. XML> -i <fsimage file> -o <output file path>
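
A concrete sketch of the template above; the fsimage file name and the output path are illustrative placeholders.
sh 复制代码
# dump an fsimage file to XML in the current directory
hdfs oiv -p XML -i fsimage_0000000000000000025 -o ./fsimage.xml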

28.Convert an edits file to XML

sh 复制代码
hdfs oev -p <output type, e.g. XML> -i <edits file> -o <output file path>
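
A concrete sketch of the template above; the edits file name and the output path are illustrative placeholders.
sh 复制代码
# dump an edit log segment to XML in the current directory
hdfs oev -p XML -i edits_0000000000000000001-0000000000000000025 -o ./edits.xml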

29.Check the supported compression codecs (native libraries)

sh 复制代码
hadoop checknative

Working with object stores

The Hadoop FileSystem shell works with Object Stores such as Amazon S3, Azure ABFS and Google GCS.

sh 复制代码
# Create a directory
hadoop fs -mkdir s3a://bucket/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket/datasets/

# touch a file
hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched

Unlike a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. As many of the filesystem shell operations use renaming as the final stage in operations, skipping that stage can avoid long delays.

In particular, the put and copyFromLocal commands should both have the -d option set for a direct upload.

sh 复制代码
# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/

# Upload a file from under the user's home directory in the local filesystem.
# Note it is the shell expanding the "~", not the hadoop fs command
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/

# create a file from stdin
# the special "-" source means "use stdin"
echo "hello" | hadoop fs -put -d -f - wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

Objects can be downloaded and viewed:

sh 复制代码
# copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket/datasets/

# copy a file from the object store to the cluster filesystem.
hadoop fs -get wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt /examples

# print the object
hadoop fs -cat wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

# print the object, unzipping it if necessary
hadoop fs -text wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

## download log files into a local file
hadoop fs -getmerge wasb://yourcontainer@youraccount.blob.core.windows.net/logs\* log.txt

Commands which list many files tend to be significantly slower than when working with HDFS or other filesystems

sh 复制代码
hadoop fs -count s3a://bucket/
hadoop fs -du s3a://bucket/

Other slow commands include find, mv, cp and rm.

Find

This can be very slow on a large store with many directories under the path supplied.

sh 复制代码
# enumerate all files in the object store's container.
hadoop fs -find s3a://bucket/ -print

# remember to escape the wildcards to stop the shell trying to expand them first
hadoop fs -find s3a://bucket/datasets/ -name \*.txt -print

Rename

The time to rename a file depends on its size.

The time to rename a directory depends on the number and size of all files beneath that directory.

sh 复制代码
hadoop fs -mv s3a://bucket/datasets s3a://bucket/historical

If the operation is interrupted, the object store will be in an undefined state.

Copy

sh 复制代码
hadoop fs -cp s3a://bucket/datasets s3a://bucket/historical

The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and the bandwidth in both directions between the local computer and the object store.

The further the computer is from the object store, the longer the copy takes

Deleting objects

The rm command will delete objects and directories full of objects. If the object store is eventually consistent, fs ls commands and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.

If the filesystem client is configured to copy files to a trash directory, this will be in the bucket; the rm operation will then take time proportional to the size of the data. Furthermore, the deleted files will continue to incur storage costs.

To avoid this, use the -skipTrash option.

sh 复制代码
hadoop fs -rm -skipTrash s3a://bucket/dataset

Data moved to the .Trash directory can be purged using the expunge command. As this command only works with the default filesystem, it must be configured to make the default filesystem the target object store.

sh 复制代码
hadoop fs -expunge -D fs.defaultFS=s3a://bucket/

Note: these command operations are excerpted from Hadoop 3.3.0; some may not exist in earlier Hadoop versions.
