| Transformation | Description |
|----------------|-------------|
| groupByKey([numPartitions]) / groupByKey(partitioner) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks, or pass a custom partitioner (a subclass of Partitioner). |
| reduceByKey(func, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like groupByKey, the number of reduce tasks is configurable through an optional second argument. |
| aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type U that differs from the input value type V, while avoiding unnecessary allocations. Like groupByKey, the number of reduce tasks is configurable through an optional argument. |
| sortByKey([ascending], [numPartitions]) | Sorts a key-value RDD by key; requires that the key type implements the Ordered trait. true: ascending; false: descending. |
| cogroup(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith. |
| pipe(command) | Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin, and lines written to its stdout are returned as an RDD of strings. See the sketch after the Java examples below. |
| coalesce(numPartitions) | Decreases the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. By default this does not involve a shuffle, which can lead to data skew. |
| repartition(numPartitions) | Reshuffles the data in the RDD randomly to create either more or fewer partitions and balance the data across them. This always shuffles all data over the network. |
| repartitionAndSortWithinPartitions(partitioner) | Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. This is more efficient than calling repartition and then sorting within each partition because it pushes the sort down into the shuffle machinery. See the sketch after the Java examples below. |
```java
// groupByKey
// Example: given user_id/product pairs [["user001","basketball"],["user001","football"]],
// return ["user001",["basketball","football"]]
JavaPairRDD<String,String> rddPair= sc.parallelizePairs(Arrays.asList(
new Tuple2<String,String>("user001","basketball"),
new Tuple2<String,String>("user001","football"),
new Tuple2<String,String>("user002","football")
));
JavaPairRDD<String,Iterable<String>> result = rddPair.groupByKey();
result.foreach(x-> System.out.println(x));
// Output:
// (user001,[basketball, football])
// (user002,[football])
// groupByKey with a custom Partitioner
@Data
@AllArgsConstructor
class MyPartitioner extends Partitioner {
    private int nums; // number of partitions

    @Override
    public int numPartitions() {
        return nums;
    }

    @Override
    public int getPartition(Object key) {
        int hash = key.hashCode();
        return Math.abs(hash) % nums;
    }
}
JavaPairRDD<String,Iterable<String>> result = rddPair.groupByKey(new MyPartitioner(3));
result.foreach(x-> System.out.println(x));
```
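To check how the custom partitioner spreads the keys, here is a minimal sketch (reusing `rddPair` and `MyPartitioner` from above) that tags each grouped element with the index of the partition it ended up in:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal verification sketch: assumes rddPair and MyPartitioner as defined above.
// mapPartitionsWithIndex exposes the partition index alongside that partition's elements.
rddPair.groupByKey(new MyPartitioner(3))
       .mapPartitionsWithIndex((idx, it) -> {
           List<String> out = new ArrayList<>();
           while (it.hasNext()) {
               out.add("partition " + idx + " -> " + it.next()._1());
           }
           return out.iterator();
       }, false)
       .foreach(s -> System.out.println(s));
```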
```java
// reduceByKey(func, [numPartitions])
// Example: word count, case-insensitive
JavaRDD<String> lines = sc.parallelize(Arrays.asList(
"Hello","World","World","Hello","hello"));
JavaPairRDD<String,Integer> word = lines.mapToPair(s->new Tuple2<String,Integer>(s.toUpperCase(),1));
JavaPairRDD<String,Integer> wordcount = word.reduceByKey((a,b)->a+b);
wordcount.foreach(x->System.out.print(x));
// Output: (HELLO,3)(WORLD,2)
```
```java
// sortByKey()
// Sort by the string key (true = ascending)
JavaPairRDD<String,Tuple2<String,Integer>> stu = sc.parallelizePairs(Arrays.asList(
new Tuple2<String,Tuple2<String,Integer>>("tom",new Tuple2<>("english",90)),
new Tuple2<String,Tuple2<String,Integer>>("black",new Tuple2<>("english",95)),
new Tuple2<String,Tuple2<String,Integer>>("tom",new Tuple2<>("math",80)),
new Tuple2<String,Tuple2<String,Integer>>("tom",new Tuple2<>("chinese",90)),
new Tuple2<String,Tuple2<String,Integer>>("black",new Tuple2<>("python",90))
));
stu.sortByKey(true).collect().forEach(x-> System.out.println(x));
```
```java
// aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
JavaPairRDD<String,Integer> stu = sc.parallelizePairs(Arrays.asList(
new Tuple2<String,Integer>("tom",90),
new Tuple2<String,Integer>("tom",80),
new Tuple2<String,Integer>("tom",70),
new Tuple2<String,Integer>("jack",90),
new Tuple2<String,Integer>("black",90)
));
// Accumulate a (sum, count) pair per key: seqOp folds one score into the
// accumulator, combOp merges two partial accumulators from different partitions.
JavaPairRDD<String,Tuple2<Double,Integer>> result = stu.aggregateByKey(
        new Tuple2<Double, Integer>(0.0, 0),
        (acc, v) -> new Tuple2<Double, Integer>(acc._1() + v, acc._2() + 1),
        (acc1, acc2) -> new Tuple2<>(acc1._1() + acc2._1(), acc1._2() + acc2._2()));
result.foreach(x-> System.out.println(x));
// Variant: track the max and min score per key instead.
JavaPairRDD<String,Tuple2<Integer,Integer>> minMax = stu.aggregateByKey(
        new Tuple2<>(Integer.MIN_VALUE, Integer.MAX_VALUE),
        (acc, v) -> new Tuple2<>(Math.max(acc._1(), v), Math.min(acc._2(), v)),
        (acc1, acc2) -> new Tuple2<>(Math.max(acc1._1(), acc2._1()),
                                     Math.min(acc1._2(), acc2._2())));
minMax.foreach(x-> System.out.println(x));
```
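Since the first aggregateByKey call above accumulates a (sum, count) pair per key, the per-key average follows with one mapValues call; a small follow-up sketch:

```java
// Follow-up sketch: derive each key's average score from the (sum, count) accumulator.
JavaPairRDD<String, Double> avg = result.mapValues(t -> t._1() / t._2());
avg.foreach(x -> System.out.println(x)); // e.g. (tom,80.0) (jack,90.0) (black,90.0), order may vary
```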
```java
// cogroup: for each key, pair the Iterable of values from each side
JavaPairRDD<String,Tuple2<String,Integer>> stu = sc.parallelizePairs(Arrays.asList(
new Tuple2<String,Tuple2<String,Integer>>("tom",new Tuple2<>("english",90)),
new Tuple2<String,Tuple2<String,Integer>>("black",new Tuple2<>("english",95)),
new Tuple2<String,Tuple2<String,Integer>>("tom",new Tuple2<>("math",80)),
new Tuple2<String,Tuple2<String,Integer>>("tom",new Tuple2<>("chinese",90)),
new Tuple2<String,Tuple2<String,Integer>>("black",new Tuple2<>("python",90))
));
JavaPairRDD<String,Integer> stu1 = sc.parallelizePairs(Arrays.asList(
new Tuple2<String,Integer>("tom",12),
new Tuple2<String,Integer>("tom",13)
));
JavaPairRDD<String, Tuple2<Iterable<Tuple2<String, Integer>>, Iterable<Integer>>> data = stu.cogroup(stu1);
```
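The Java example above builds `data` but never shows it; mirroring the Python loop further down, a minimal sketch that prints each key with both grouped sides:

```java
// Minimal sketch: print each key alongside both cogrouped iterables.
data.foreach(t -> {
    System.out.println(t._1());                                      // the key
    t._2()._1().forEach(score -> System.out.println("  " + score));  // (subject, score) tuples from stu
    t._2()._2().forEach(age -> System.out.println("  " + age));      // ints from stu1
});
```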
```java
// coalesce
JavaRDD<String> coalesceRdd = rdd.coalesce(3);
// repartition
JavaRDD<String> repartitionRdd = rdd.repartition(3);
```
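The pipe operator from the table has no example above; here is a minimal sketch, assuming a Unix-like environment where grep is available on every executor (the sample data is made up for illustration):

```java
// Minimal pipe sketch (assumes grep is on the PATH of every executor).
// Each partition's elements go to grep's stdin; matching stdout lines come back as strings.
JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "spark sql"));
JavaRDD<String> piped = words.pipe("grep spark");
piped.foreach(s -> System.out.println(s)); // spark, spark sql (order may vary)
```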
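repartitionAndSortWithinPartitions likewise has no example above; a minimal sketch reusing the MyPartitioner class from the groupByKey section (the scores data is made up for illustration):

```java
// Minimal sketch: repartition with MyPartitioner (defined above) and sort keys inside each partition.
JavaPairRDD<String, Integer> scores = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("tom", 90),
        new Tuple2<>("black", 95),
        new Tuple2<>("jack", 80)
));
JavaPairRDD<String, Integer> sorted = scores.repartitionAndSortWithinPartitions(new MyPartitioner(2));
sorted.foreach(x -> System.out.println(x));
```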
```python
# groupByKey
rdd = sc.parallelize([("user001","basketball"),("user002","basketball"),
("user001","football")])
result = rdd.groupByKey()
result.foreach(print)
# groupByKey(numPartitions,partitionFunc)
numPartitions = 3
def partitionFunc(key):
    return hash(key) % numPartitions
result = rdd.groupByKey(numPartitions,partitionFunc)
# reduceByKey
rdd = sc.parallelize(["Hello World","Hello Spark","Work Spark"])
rdd.flatMap(lambda x:x.split(" ")).map(lambda x:(x.upper(),1)).reduceByKey(lambda a,b:a+b).foreach(print)
# Output (order may vary):
# ('WORK', 1)
# ('WORLD', 1)
# ('SPARK', 2)
# ('HELLO', 2)
# aggregateByKey
# U is the per-partition accumulator (score_sum, count);
# V is one input value such as ("math", 90), so V[1] is the score.
def seqFunc(U, V):
    return (U[0] + V[1], U[1] + 1)
# Merge two partial (score_sum, count) accumulators from different partitions.
def combFunc(U1, U2):
    return (U1[0] + U2[0], U1[1] + U2[1])
rdd = sc.parallelize([("tom",("math",90)),
("tom", ("english", 90)),
("tom", ("python", 90)),
("black", ("math", 90)),
("black", ("java", 90))])
rdd.aggregateByKey((0.0, 0), seqFunc=seqFunc, combFunc=combFunc).foreach(print)
# sortByKey
rdd.sortByKey(True).collect()
# cogroup
rdd1 = sc.parallelize([("tom",("math",90)),
("tom", ("english", 90)),
("tom", ("python", 90)),
("black", ("math", 90)),
("black", ("java", 90))])
rdd2 = sc.parallelize([("tom", 12),
("black", 13),
])
result = rdd1.cogroup(rdd2).collect()
for item in result:
    print(item[0])
    rdd1_item = item[1][0]
    for t1 in rdd1_item:
        print(t1)
    rdd2_item = item[1][1]
    for t2 in rdd2_item:
        print(t2)
# coalesce
rdd1.coalesce(2).collect()
# repartition
rdd1.repartition(2).collect()
```