A First Look at Apache Hadoop: File Upload, Download, and a Distributed Computing Example

Previous post: A Pitfall-Free Guide to Setting Up a Fully Distributed Apache Hadoop Cluster (CSDN blog)

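In the previous post we set up a complete Hadoop cluster. In this post we use that cluster to upload and download a file, and then run the distributed WordCount example. Later posts will look more closely at distributed computing and distributed storage.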

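Upload and download test

Upload a file from the local Linux file system to HDFS and download it again to verify that the HDFS cluster is working properly.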

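# Create the target directory in HDFS
hdfs dfs -mkdir -p /test/input

# Create a file in the local home directory and write some content into it
cd /root
vim test.txt

# Upload the local Linux file to HDFS
hdfs dfs -put /root/test.txt /test/input

# Download the file from HDFS to the local Linux file system (you can run this on a different node to test)
hdfs dfs -get /test/input/test.txt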
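
The download step only shows that the command ran. To confirm that the file came back unchanged, the downloaded copy can be compared byte for byte with the original. The following is a minimal sketch under a few assumptions: the helper name `same_content` and the temporary download directory are illustrative and not part of the original walkthrough, and the HDFS part runs only when the `hdfs` command is on the `PATH`.

```shell
#!/bin/sh
# same_content A B: succeeds (exit status 0) when both files hold identical bytes.
# The name is hypothetical; it simply wraps the standard cmp utility.
same_content() {
    cmp -s "$1" "$2"
}

# Only talk to HDFS when the hdfs command is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
    # Download into a fresh temporary directory so nothing is overwritten.
    tmp_dir=$(mktemp -d)
    hdfs dfs -get /test/input/test.txt "$tmp_dir/test.txt"

    # Compare the original local file with the copy fetched from HDFS.
    if same_content /root/test.txt "$tmp_dir/test.txt"; then
        echo "round trip OK"
    else
        echo "round trip MISMATCH"
    fi
fi
```

Running the comparison on a node other than the one that uploaded the file requires a copy of the original `test.txt` on that node.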

Distributed computing test

Create a wcinput directory under the root of the HDFS file system:

[root@hadoop01 hadoop-2.9.2]# hdfs dfs -mkdir /wcinput

Create a wc.txt file with the following content:

hadoop mapreduce yarn
hdfs hadoop mapreduce
mapreduce yarn kmning
kmning
kmning

Upload wc.txt to the /wcinput directory in HDFS:

hdfs dfs -put wc.txt /wcinput

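Run the MapReduce job from the Hadoop installation directory (hadoop-2.9.2), since the path to the examples jar is relative: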

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /wcinput/ /wcoutput

The job prints the following output:

24/07/03 20:44:26 INFO client.RMProxy: Connecting to ResourceManager at hadoop03/192.168.43.103:8032
24/07/03 20:44:28 INFO input.FileInputFormat: Total input files to process : 1
24/07/03 20:44:28 INFO mapreduce.JobSubmitter: number of splits:1
24/07/03 20:44:28 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
24/07/03 20:44:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1720006717389_0001
24/07/03 20:44:29 INFO impl.YarnClientImpl: Submitted application application_1720006717389_0001
24/07/03 20:44:29 INFO mapreduce.Job: The url to track the job: http://hadoop03:8088/proxy/application_1720006717389_0001/
24/07/03 20:44:29 INFO mapreduce.Job: Running job: job_1720006717389_0001
24/07/03 20:44:45 INFO mapreduce.Job: Job job_1720006717389_0001 running in uber mode : false
24/07/03 20:44:45 INFO mapreduce.Job:  map 0% reduce 0%
24/07/03 20:44:57 INFO mapreduce.Job:  map 100% reduce 0%
24/07/03 20:45:13 INFO mapreduce.Job:  map 100% reduce 100%
24/07/03 20:45:14 INFO mapreduce.Job: Job job_1720006717389_0001 completed successfully
24/07/03 20:45:14 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=70
                FILE: Number of bytes written=396911
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=180
                HDFS: Number of bytes written=44
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=9440
                Total time spent by all reduces in occupied slots (ms)=11870
                Total time spent by all map tasks (ms)=9440
                Total time spent by all reduce tasks (ms)=11870
                Total vcore-milliseconds taken by all map tasks=9440
                Total vcore-milliseconds taken by all reduce tasks=11870
                Total megabyte-milliseconds taken by all map tasks=9666560
                Total megabyte-milliseconds taken by all reduce tasks=12154880
        Map-Reduce Framework
                Map input records=5
                Map output records=11
                Map output bytes=124
                Map output materialized bytes=70
                Input split bytes=100
                Combine input records=11
                Combine output records=5
                Reduce input groups=5
                Reduce shuffle bytes=70
                Reduce input records=5
                Reduce output records=5
                Spilled Records=10
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=498
                CPU time spent (ms)=3050
                Physical memory (bytes) snapshot=374968320
                Virtual memory (bytes) snapshot=4262629376
                Total committed heap usage (bytes)=219676672
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=80
        File Output Format Counters
                Bytes Written=44
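
Some of the counters in this log can be checked by hand against the input file. As a rough sketch, assuming the default text input format (one record per line) and a WordCount mapper that emits one pair per whitespace-separated token, the line, word, and byte counts of wc.txt should correspond to `Map input records`, `Map output records`, and `Bytes Read`. The temporary file below is only a local stand-in for wc.txt.

```shell
#!/bin/sh
# Recreate the wc.txt content from this post in a temporary file.
wc_file=$(mktemp)
cat > "$wc_file" <<'EOF'
hadoop mapreduce yarn
hdfs hadoop mapreduce
mapreduce yarn kmning
kmning
kmning
EOF

# Count lines, words, and bytes; tr strips the padding some wc versions add.
input_lines=$(wc -l < "$wc_file" | tr -d ' ')
input_words=$(wc -w < "$wc_file" | tr -d ' ')
input_bytes=$(wc -c < "$wc_file" | tr -d ' ')

# Compare with: Map input records=5, Map output records=11, Bytes Read=80
echo "lines=$input_lines words=$input_words bytes=$input_bytes"
```

The byte count matches only if the file on the cluster ends every line with a single newline character, as the temporary file here does.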

View the result:

[root@hadoop01 hadoop-2.9.2]# hdfs dfs -cat /wcoutput/part-r-00000
hadoop  2
hdfs    1
kmning  3
mapreduce       3
yarn    2

As the output shows, the MapReduce job counted how many times each word occurs in the input.
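
For an input this small, the result can be cross-checked locally with standard Unix tools. The sketch below is not how MapReduce computes the answer; it is a single-machine pipeline that should produce the same word and count pairs. The function name `local_wordcount` and the temporary file are assumptions made for illustration.

```shell
#!/bin/sh
# Recreate the wc.txt content from this post in a temporary file.
wc_file=$(mktemp)
cat > "$wc_file" <<'EOF'
hadoop mapreduce yarn
hdfs hadoop mapreduce
mapreduce yarn kmning
kmning
kmning
EOF

# local_wordcount FILE: print "word<TAB>count" lines, sorted by word.
# The name is hypothetical; the pipeline uses only tr, grep, sort, uniq, and awk.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |   # put one token on each line
        grep -v '^$' |                # drop any empty lines
        LC_ALL=C sort |               # group identical words together
        uniq -c |                     # count each group
        awk '{ printf "%s\t%s\n", $2, $1 }'   # reorder to word, then count
}

local_wordcount "$wc_file"
```

The sort step plays a role loosely similar to the shuffle phase, bringing identical keys together, and `uniq -c` stands in for the reducer that sums the occurrences of each key.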
