Partitioning
1. The default Partitioner
java
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
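numReduceTasks defaults to 1
// so there is a single output file
The default partitioner takes the key's hashCode modulo the number of ReduceTasks.
With the default partitioner, the user cannot control which key is stored in which partition.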
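The formula can be tried out in plain Java without a cluster. The sketch below is only an illustration: the class name `DefaultPartitionDemo` and the method `partitionFor` are made up for this example, and `String.hashCode()` stands in for the key's hash (Hadoop's `Text` computes its hash differently, so the actual partition numbers in a job will differ).

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the default hash partitioning formula.
// Assumption: String.hashCode() stands in for the key's hashCode().
class DefaultPartitionDemo {

    // Clear the sign bit so the value is never negative, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("13736230513", "13846544121", "18271575951");
        for (int numReduceTasks : new int[] {1, 2, 5}) {
            for (String key : keys) {
                System.out.println(key + " with " + numReduceTasks
                        + " ReduceTasks -> partition " + partitionFor(key, numReduceTasks));
            }
        }
    }
}
```

With `numReduceTasks` equal to 1 every key lands in partition 0, which is why the default configuration writes a single output file. The `& Integer.MAX_VALUE` mask matters because `hashCode()` may be negative, and a negative remainder would not be a valid partition number.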
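2. Setting the number of partitions manually
java
// Set the number of ReduceTasks, which is also the number of partitions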
job.setNumReduceTasks(2);
3. Steps for custom partitioning
(1) Three steps
a. Define a class that extends Partitioner and override its getPartition() method
java
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // Placeholder: return the partition number computed for this key/value
        return 0;
    }
}
b. Register the custom Partitioner in the Job driver
java
job.setPartitionerClass(ProvincePartitioner.class);
c. After defining the custom Partitioner, set the number of ReduceTasks to match its logic
java
job.setNumReduceTasks(5);
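Partitioning: a hands-on example
1. Requirement
Write the statistics to different files (partitions) according to the province each phone number belongs to.
(1) Input data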
markdown
1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.atguigu.com 1527 2106 200
6 84188413 192.168.100.3 www.atguigu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.atguigu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
15 13682846555 192.168.100.12 www.qq.com 1938 2910 200
16 13992314666 192.168.100.13 www.gaga.com 3008 3720 200
17 13509468723 192.168.100.14 www.qinghua.com 7335 110349 404
18 18390173782 192.168.100.15 www.sogou.com 9531 2412 200
19 13975057813 192.168.100.16 www.baidu.com 11058 48243 200
20 13768778790 192.168.100.17 120 120 200
21 13568436656 192.168.100.18 www.alibaba.com 2481 24681 200
22 13568436656 192.168.100.19 1116 954 200
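(2) Expected output
Phone numbers starting with 136, 137, 138, and 139 each go into their own file (4 files in total); numbers with any other prefix all go into one more file.
2. Implementation
Built on top of the Flow example
Partitioner class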
java
package com.saddam.bigdata.ShangGuiGu.Shuffle.Partition;

import com.saddam.bigdata.ShangGuiGu.Writable.FlowBean;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // key is the phone number, value is the FlowBean holding the traffic data
        // Step 1: take the first three digits of the phone number
        String prePhoneNum = key.toString().substring(0, 3);
        // Any prefix not matched below falls into partition 4
        int partition = 4;
        // Step 2: compare the prefix (not numPartitions) against each known prefix
        if ("136".equals(prePhoneNum)) {
            partition = 0;
        } else if ("137".equals(prePhoneNum)) {
            partition = 1;
        } else if ("138".equals(prePhoneNum)) {
            partition = 2;
        } else if ("139".equals(prePhoneNum)) {
            partition = 3;
        }
        return partition;
    }
}
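The prefix rule from the requirement (136 -> 0, 137 -> 1, 138 -> 2, 139 -> 3, everything else -> 4) can be checked locally before running the job. The following is a minimal sketch with no Hadoop dependency; the class name `PrefixPartitionCheck` and its `partitionFor` helper are hypothetical and exist only for this check, and the length guard for short numbers is an addition that the job code does not have.

```java
// Standalone check of the intended prefix-to-partition mapping (no Hadoop needed).
class PrefixPartitionCheck {

    // 136 -> 0, 137 -> 1, 138 -> 2, 139 -> 3, any other prefix -> 4.
    static int partitionFor(String phone) {
        // Guard against numbers shorter than three characters (assumption, not in the job code).
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;
        }
    }

    public static void main(String[] args) {
        // A few phone numbers taken from the input data above.
        String[] phones = {"13630577991", "13736230513", "13846544121",
                           "13956435636", "18271575951", "84188413"};
        for (String phone : phones) {
            System.out.println(phone + " -> partition " + partitionFor(phone));
        }
    }
}
```

If every number ends up in partition 4, the comparison is most likely being made against the wrong variable rather than against the three-digit prefix.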
Driver class
java
package com.saddam.bigdata.ShangGuiGu.Shuffle.Partition;

import com.saddam.bigdata.ShangGuiGu.Writable.FlowBean;
import com.saddam.bigdata.ShangGuiGu.Writable.FlowMapper;
import com.saddam.bigdata.ShangGuiGu.Writable.FlowReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.BasicConfigurator;

public class ProvinceDriver {
    public static void main(String[] args) throws Exception {
        BasicConfigurator.configure();
        // 1. Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar
        job.setJarByClass(ProvinceDriver.class);
        // 3. Attach the Mapper and Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);
        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        // Use the custom partitioner, with one ReduceTask per partition (0 to 4)
        job.setPartitionerClass(ProvincePartitioner.class);
        job.setNumReduceTasks(5);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\MR\\MapReduce\\InputDatas\\phone.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\MR\\MapReduce\\OutputDatas\\output_partition\\output_Flow"));
        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
Summary
markdown
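The Partitioner class starts from int partition=4; so it produces 5 partitions (0 to 4)
However:
job.setNumReduceTasks(5); -> job.setNumReduceTasks(1);
Runs successfully, but writes only one output file, which is equivalent to no partitioning
job.setNumReduceTasks(5); -> job.setNumReduceTasks(2);
Fails with an IOException, because some partition numbers have no matching ReduceTask
job.setNumReduceTasks(5); -> job.setNumReduceTasks(6);
A value larger than the number of partitions still runs, and produces one extra empty file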
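The three outcomes can be reproduced with a small model of how a partition number is checked against the ReduceTask count. This is a simplified sketch, not Hadoop code: the class `ReduceTaskCountDemo` and the method `route` are invented for illustration, and the model assumes that with a single ReduceTask the custom partitioner is bypassed, and that a partition number outside `0 .. numReduceTasks - 1` is rejected with an IOException.

```java
import java.io.IOException;

// Simplified model of the partition check (assumption: not actual Hadoop source).
class ReduceTaskCountDemo {

    // Returns the ReduceTask index a record would be sent to.
    static int route(int partitionFromPartitioner, int numReduceTasks) throws IOException {
        if (numReduceTasks == 1) {
            // With one ReduceTask everything goes to partition 0.
            return 0;
        }
        if (partitionFromPartitioner < 0 || partitionFromPartitioner >= numReduceTasks) {
            throw new IOException("Illegal partition (" + partitionFromPartitioner + ")");
        }
        return partitionFromPartitioner;
    }

    public static void main(String[] args) {
        int partition = 4; // the "other prefixes" partition from the example
        for (int numReduceTasks : new int[] {1, 2, 5, 6}) {
            try {
                System.out.println(numReduceTasks + " ReduceTasks -> task "
                        + route(partition, numReduceTasks));
            } catch (IOException e) {
                System.out.println(numReduceTasks + " ReduceTasks -> " + e.getMessage());
            }
        }
    }
}
```

In this model, 6 ReduceTasks work because partition 4 is still in range; the sixth task simply receives no records, which matches the extra empty output file.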