A Simple MapReduce Application (Part 1): WordCount

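[1. Execution Process](#1. Execution Process)
- [1.1 Splitting](#1.1 Splitting)
- [1.2 Map](#1.2 Map)
- [1.3 Combine](#1.3 Combine)
- [1.4 Reduce](#1.4 Reduce)
[2. Code and Results](#2. Code and Results)
- [2.1 Dependency configuration in pom.xml](#2.1 Dependency configuration in pom.xml)
- [2.2 Utility class util](#2.2 Utility class util)
- [2.3 WordCount](#2.3 WordCount)
- [2.4 Results](#2.4 Results)
References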

1. Execution Process

Suppose WordCount has two input files, text1.txt and text2.txt, with the contents shown below.

text1.txt:

Hello World
Bye World

text2.txt:

Hello Hadoop
Bye Hadoop

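1.1 Splitting

Each input file is divided into splits. Because the test files are small, each file forms a single split. Each split is then broken into lines to form <key, value> pairs, as shown in the figure below. MapReduce performs this step automatically: the key is the byte offset at which the line starts in the file, which MapReduce computes and which includes the line-break characters of the preceding lines, and the value is the text of the line.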
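The offset keys can be reproduced outside Hadoop with a few lines of plain Java. This is only a sketch: SplitSketch and toRecords are hypothetical names, not Hadoop classes, and the code assumes UTF-8 text with Unix '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: mimics how line-oriented input
// turns a file's content into <byte offset, line> pairs.
final class SplitSketch {
    static List<Map.Entry<Long, String>> toRecords(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new SimpleEntry<>(offset, line));
            // Advance by the line's byte length plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Contents of text1.txt, assuming Unix line endings.
        for (Map.Entry<Long, String> r : toRecords("Hello World\nBye World\n")) {
            System.out.println("<" + r.getKey() + ", " + r.getValue() + ">");
        }
        // prints:
        // <0, Hello World>
        // <12, Bye World>
    }
}
```

The second key is 12 because "Hello World" occupies 11 bytes and its line break occupies one more.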

1.2 Map

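The <key, value> pairs are passed to the user-defined Map method, which produces new <key, value> pairs. The method splits each line on spaces into words and emits one pair per word, with the word as the key and an initial count of 1 as the value, as shown in the figure below.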
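The Map step described above can be sketched in plain Java without the Hadoop API. MapSketch and its map method are hypothetical names used for illustration; a real job would extend Hadoop's Mapper class and write Text and IntWritable pairs to the context instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a WordCount Map method (not the Hadoop Mapper API).
final class MapSketch {
    // Emits <word, 1> for every whitespace-separated word in the line.
    // The offset key is accepted but unused, as is typical for WordCount.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "Hello World"));  // prints [Hello=1, World=1]
        System.out.println(map(12L, "Bye World"));   // prints [Bye=1, World=1]
    }
}
```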

1.3 Combine

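After the Map method has emitted its <key, value> pairs, the Mapper sorts them by key in ascending order and runs the Combine step, which sums the values that share a key. (Combine runs only if the job configures a Combiner.) The result is the Mapper's final output, which is written to disk, as shown in the figure below.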
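A minimal plain-Java sketch of sorting and combining one Mapper's output follows. CombineSketch is a hypothetical name, and a TreeMap stands in for Hadoop's sort so that keys come out in ascending order; in a real job the Combiner is usually a Reducer class registered on the job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the sort + Combine step on a single Mapper's output.
final class CombineSketch {
    // Sums the values of equal keys; TreeMap keeps the keys in ascending order.
    static TreeMap<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Map output for text1.txt: "Hello World" and "Bye World".
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("Hello", 1));
        mapOutput.add(new SimpleEntry<>("World", 1));
        mapOutput.add(new SimpleEntry<>("Bye", 1));
        mapOutput.add(new SimpleEntry<>("World", 1));
        System.out.println(combine(mapOutput));  // prints {Bye=1, Hello=1, World=2}
    }
}
```

Combining on the Mapper side reduces the number of pairs sent to the Reducer: four pairs become three here.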

1.4 Reduce

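The Reducer first sorts the data received from the Mappers and merges the values that share a key into a list. It then passes each key and its list to the user-defined Reduce method, which aggregates them into new <key, value> pairs. These pairs are the output of WordCount and are stored in HDFS, as shown in the figure below.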
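The grouping and Reduce steps can be sketched in plain Java as follows. ReduceSketch, group, and reduce are hypothetical names, and the two input maps are the combined Mapper outputs one would expect for text1.txt and text2.txt under the assumptions above; a real job would extend Hadoop's Reducer class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Reducer-side sort/group step and the Reduce method.
final class ReduceSketch {
    // Gathers the values of each key from all Mapper outputs into one list, sorted by key.
    static TreeMap<String, List<Integer>> group(List<Map<String, Integer>> mapperOutputs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapperOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce for WordCount: the total count of a word is the sum of its values.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapper1 = new TreeMap<>();  // combined output for text1.txt
        mapper1.put("Bye", 1);
        mapper1.put("Hello", 1);
        mapper1.put("World", 2);
        Map<String, Integer> mapper2 = new TreeMap<>();  // combined output for text2.txt
        mapper2.put("Bye", 1);
        mapper2.put("Hadoop", 2);
        mapper2.put("Hello", 1);

        TreeMap<String, List<Integer>> grouped = group(Arrays.asList(mapper1, mapper2));
        System.out.println(grouped);  // prints {Bye=[1, 1], Hadoop=[2], Hello=[1, 1], World=[2]}
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println("<" + e.getKey() + ", " + reduce(e.getKey(), e.getValue()) + ">");
        }
    }
}
```

For the two sample files, every word ends up with a count of 2.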

2. Code and Results

2.1 Dependency configuration in pom.xml

	<dependencies>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.11</version>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-common</artifactId>
			<version>3.3.6</version>
			<exclusions>
				<exclusion>
					<groupId>org.slf4j</groupId>
					<artifactId>slf4j-log4j12</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-mapreduce-client-core</artifactId>
			<version>3.3.6</version>
			<type>pom</type>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
			<version>3.3.6</version>
		</dependency>
	</dependencies>

2.2 Utility class util

util.removeALL deletes the specified output path on HDFS if it exists, and util.showResult prints the WordCount result.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;


public class util {
    // Returns the FileSystem for the given URI (for example, an HDFS NameNode address).
    public static FileSystem getFileSystem(String uri, Configuration conf) throws Exception {
        return FileSystem.get(new URI(uri), conf);
    }

    // Recursively deletes the given path if it exists, so a job can reuse its output directory.
    public static void removeALL(String uri, Configuration conf, String path) throws Exception {
        FileSystem fs = getFileSystem(uri, conf);
        Path target = new Path(path);
        if (fs.exists(target)) {
            boolean isDeleted = fs.delete(target, true);
            System.out.println("Delete Output Folder? " + isDeleted);
        }
    }

    // Prints every reducer output file (part-r-*) under the given path to standard output.
    public static void showResult(String uri, Configuration conf, String path) throws Exception {
        FileSystem fs = getFileSystem(uri, conf);
        Path dir = new Path(path);
        if (!fs.exists(dir)) {
            return;
        }
        for (FileStatus file : fs.listStatus(dir)) {
            // Match on the file name only, so a parent directory containing "part-r-" cannot match.
            if (file.getPath().getName().startsWith("part-r-")) {
                // try-with-resources closes the stream even if copying fails.
                try (FSDataInputStream in = fs.open(file.getPath())) {
                    IOUtils.copyBytes(in, System.out, 1024);
                }
            }
        }
    }
}

2.3 WordCount

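Normally, a MapReduce program is packaged as a jar file and run with hadoop jar <jar file> <main class> <input path> <output path>. The code below specifies the input and output paths directly, so it can be run as is.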