MapReduce简单应用(二)——去重、排序和平均

目录

  • [1. 数据去重](#1. 数据去重)
    • [1.1 原理](#1.1 原理)
    • [1.2 pom.xml中依赖配置](#1.2 pom.xml中依赖配置)
    • [1.3 工具类util](#1.3 工具类util)
    • [1.4 去重代码](#1.4 去重代码)
    • [1.5 结果](#1.5 结果)
  • [2. 数据排序](#2. 数据排序)
    • [2.1 原理](#2.1 原理)
    • [2.2 排序代码](#2.2 排序代码)
    • [2.3 结果](#2.3 结果)
  • [3. 计算均值](#3. 计算均值)
    • [3.1 原理](#3.1 原理)
    • [3.2 自定义序列化数据类型DecimalWritable](#3.2 自定义序列化数据类型DecimalWritable)
    • [3.3 计算平均值](#3.3 计算平均值)
    • [3.4 结果](#3.4 结果)
  • 参考

1. 数据去重

待去重的两个文本内容如下。

txt 复制代码
2012-3-1 a

2012-3-2 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-7 c

2012-3-3 c
txt 复制代码
2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c

1.1 原理

利用MapReduce处理过程中键值唯一的特性,即可完成数据去重任务,只需把Map读入的<LongWritable, Text>键值对中的值作为Map输出键值对中的键,而输出键值对中的值设置为NullWritable类型,Reduce只需把Map输出的键值对直接原封不动输出即可。

1.2 pom.xml中依赖配置

xml 复制代码
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.3.6</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>3.3.6</version>
      <type>pom</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <version>3.3.6</version>
    </dependency>
  </dependencies>

1.3 工具类util

java 复制代码
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;


public class util {
    public static FileSystem getFileSystem(String uri, Configuration conf) throws Exception {
        URI add = new URI(uri);
        return FileSystem.get(add, conf);
    }

    public static void removeALL(String uri, Configuration conf, String path) throws Exception {
        FileSystem fs = getFileSystem(uri, conf);
        if (fs.exists(new Path(path))) {
            boolean isDeleted = fs.delete(new Path(path), true);
            System.out.println("Delete Output Folder? " + isDeleted);
        }
    }

    public static void  showResult(String uri, Configuration conf, String path) throws Exception {
        FileSystem fs = getFileSystem(uri, conf);
        String regex = "part-r-";
        Pattern pattern = Pattern.compile(regex);

        if (fs.exists(new Path(path))) {
            FileStatus[] files = fs.listStatus(new Path(path));
            for (FileStatus file : files) {
                Matcher matcher = pattern.matcher(file.getPath().toString());
                if (matcher.find()) {
                    System.out.println(file.getPath() + ":");
                    FSDataInputStream openStream = fs.open(file.getPath());
                    IOUtils.copyBytes(openStream, System.out, 1024);
                    openStream.close();
                }
            }
        }
    }
}

1.4 去重代码

java 复制代码
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class App {
    public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] myArgs = {
            "file:///home/developer/CodeArtsProjects/data-deduplication/dedup1.txt",
            "file:///home/developer/CodeArtsProjects/data-deduplication/dedup2.txt",
            "hdfs://localhost:9000/user/developer/data-deduplication/output"
        };
        util.removeALL("hdfs://localhost:9000", conf, myArgs[myArgs.length - 1]);
        Job job = Job.getInstance(conf, "DataDeduplication");
        job.setJarByClass(App.class);
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        for (int i = 0; i < myArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(myArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(myArgs[myArgs.length - 1]));
        int res = job.waitForCompletion(true) ? 0 : 1;
        if (res == 0) {
            System.out.println("数据去重的结果为:");
            util.showResult("hdfs://localhost:9000", conf, myArgs[myArgs.length - 1]);
        }
        System.exit(res);
    }
}

1.5 结果

2. 数据排序

pom.xml中依赖配置、工具类util代码同1中。三个排序文本的内容如下。

txt 复制代码
2
32
654
32
15
756
65223
6
txt 复制代码
5956
22
650
92
54
txt 复制代码
26
54
6
32

2.1 原理

利用MapReduce过程中Shuffle会对键值对排序的功能,只需要设置一个Reduce,把Map读入的<LongWritable, Text>键值对中的值处理后变为IntWritable类型,并作Map输出的键,同时Map输出的值设置为new IntWritable(1),这样便可以处理多个重复值的排序。然后设置一个静态计数器,作为Reduce输出的键,Reduce输出的值为Map输出的键。

2.2 排序代码

java 复制代码
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class App {
    static int count = 0;

    public static class MyMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int val = Integer.parseInt(value.toString());
            context.write(new IntWritable(val), new IntWritable(1));
        };
    }

    public static class MyReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                context.write(new IntWritable(++count) , key);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] myArgs = {
            "file:///home/developer/CodeArtsProjects/data-sort/sort1.txt",
            "file:///home/developer/CodeArtsProjects/data-sort/sort2.txt",
            "file:///home/developer/CodeArtsProjects/data-sort/sort3.txt",
            "hdfs://localhost:9000/user/developer/DataSort/output"
        };
        util.removeALL("hdfs://localhost:9000", conf, myArgs[myArgs.length - 1]);
        Job job = Job.getInstance(conf, "DataSort");
        job.setJarByClass(App.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < myArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(myArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(myArgs[myArgs.length - 1]));
        int res = job.waitForCompletion(true) ? 0 : 1;
        if (res == 0) {
            System.out.println("数据排序的结果为:");
            util.showResult("hdfs://localhost:9000", conf, myArgs[myArgs.length - 1]);
        }
        System.exit(res);
    }
}

2.3 结果

3. 计算均值

pom.xml中依赖配置、工具类util代码同1中。三门课的成绩文本如下。

txt 复制代码
张三 78
李四 89
王五 96
赵六 67
txt 复制代码
张三 88
李四 99
王五 66
赵六 77
txt 复制代码
张三 80
李四 82
王五 84
赵六 86

3.1 原理

为了实现计算过程中的精度,利用java.math.BigDecimal实现了一个自定义序列化数据类型DecimalWritable,方便在结果中保留指定小数位数。计算平均值首先把Map读入的<LongWritable, Text>键值对拆分成<Text, DecimalWritable>键值对,人名作为键,成绩作为值。Reduce将输入键值对中值列表进行累加再求平均,人名作为输出的键,平均值作为输出的值。

3.2 自定义序列化数据类型DecimalWritable

java 复制代码
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;

import org.apache.hadoop.io.WritableComparable;

public class DecimalWritable implements WritableComparable<DecimalWritable> {
    private BigDecimal value;
    private int bit;

    public void setValue(double value) {
        this.value = new BigDecimal(value);
    }

    public BigDecimal getValue() {
        return value;
    }

    public void setBit(int bit) {
        this.bit = bit;
    }

    public int getBit() {
        return bit;
    }

    DecimalWritable() {
        super();
    }

    DecimalWritable(double value, int bit) {
        super();
        setValue(value);
        setBit(bit);
    }

    DecimalWritable(BigDecimal value, int bit) {
        super();
        this.value = value;
        setBit(bit);
    }

    @Override
    public String toString() {
        return value.setScale(bit, RoundingMode.HALF_UP).toString();
    }
    
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(value.setScale(bit, RoundingMode.HALF_UP).toString());
        out.writeInt(bit);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (!(obj instanceof DecimalWritable))
            return false;
        DecimalWritable o = (DecimalWritable) obj;
        return getValue().compareTo(o.getValue()) == 0;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = new BigDecimal(in.readUTF());
        bit = in.readInt();
    }

    @Override
    public int compareTo(DecimalWritable o) {
        int res = getValue().compareTo(o.getValue());
        if (res == 0)
            return 0;
        else if (res > 0)
            return 1;
        else
            return -1;
    }
}

3.3 计算平均值

java 复制代码
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class App {
    public static class MyMapper extends Mapper<LongWritable, Text, Text, DecimalWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] splitStr = value.toString().split(" ");
            DecimalWritable grade = new DecimalWritable(Double.parseDouble(splitStr[1]), 2);
            Text name = new Text(splitStr[0]);
            context.write(name, grade);
        }
    }

    public static class MyReducer extends Reducer<Text, DecimalWritable, Text, DecimalWritable> {
        @Override
        protected void reduce(Text key, Iterable<DecimalWritable> values, Context context)
                throws IOException, InterruptedException {
            BigDecimal sum = BigDecimal.ZERO;
            for (DecimalWritable val : values) {
                sum = sum.add(val.getValue());
            }
            sum = sum.divide(new BigDecimal(3), RoundingMode.HALF_UP);
            context.write(key, new DecimalWritable(sum, 2));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] myArgs = {
            "file:///home/developer/CodeArtsProjects/cal-average-grade/chinese.txt",
            "file:///home/developer/CodeArtsProjects/cal-average-grade/english.txt",
            "file:///home/developer/CodeArtsProjects/cal-average-grade/math.txt",
            "hdfs://localhost:9000/user/developer/CalAverageGrade/output"
        };
        util.removeALL("hdfs://localhost:9000", conf, myArgs[myArgs.length - 1]);
        Job job = Job.getInstance(conf, "CalAverageGrade");
        job.setJarByClass(App.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DecimalWritable.class);
        for (int i = 0; i < myArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(myArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(myArgs[myArgs.length - 1]));
        int res = job.waitForCompletion(true) ? 0 : 1;
        if (res == 0) {
            System.out.println("平均成绩结果为:");
            util.showResult("hdfs://localhost:9000", conf, myArgs[myArgs.length - 1]);
        }
        System.exit(res);
    }
}

3.4 结果

参考

吴章勇 杨强著 大数据Hadoop3.X分布式处理实战

相关推荐
GIS数据转换器10 分钟前
构建智能汽车地图标准体系:自动驾驶技术的基石
大数据·人工智能·科技·安全·机器学习·自动驾驶·汽车
下雨天u19 分钟前
windows安装Elasticsearch
大数据·elasticsearch·搜索引擎
树莓集团12 小时前
树莓科技集团董事长:第五代产业园运营模式的深度剖析与展望
大数据·人工智能·科技·物联网·百度
viperrrrrrrrrr712 小时前
大数据学习(62)- Hadoop-yarn
大数据·yarn
大湾区经济门户网13 小时前
科技工作者之家建设扬帆起航,为科技人才提供更多优质服务
大数据·人工智能·科技·媒体
cherry523014 小时前
【第4章】项目实战-亿级电商系统需求分析
大数据·数据库·架构·需求分析
闯闯桑15 小时前
Spark 中explode用法
大数据·spark·scala
TDengine (老段)17 小时前
TDengine 特色查询
android·大数据·数据库·物联网·时序数据库·tdengine·iotdb
AntBlack17 小时前
DataWorks 体验笔记 :一切的基础都是数据的读和写
大数据·后端·创业
砸吧砸吧19 小时前
#mapreduce打包#maven:could not resolve dependencies for project
java·大数据·maven·mapreduce