
Transformation Operators: groupByKey and filter
1. groupByKey
groupByKey operates on an RDD in (K, V) format, groups the records by key, and returns (K, Iterable&lt;V&gt;). When the goal is simply to aggregate the values of each key, reduceByKey is more efficient, because it pre-aggregates locally within each partition and thus reduces the amount of data shuffled across the network (see the reduceByKey sketch after the Scala example below).
Java code:
```java
SparkConf conf = new SparkConf().setMaster("local").setAppName("GroupByKeyTest");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<String, Integer> pairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1),
        new Tuple2<>("b", 2),
        new Tuple2<>("c", 3),
        new Tuple2<>("a", 4),
        new Tuple2<>("b", 5),
        new Tuple2<>("c", 6),
        new Tuple2<>("a", 7),
        new Tuple2<>("b", 8),
        new Tuple2<>("c", 9)
));
// groupByKey: group the values in the source data by key, producing a new iterable of values for each key
JavaPairRDD<String, Iterable<Integer>> result = pairRDD.groupByKey();
result.foreach(new VoidFunction<Tuple2<String, Iterable<Integer>>>() {
    @Override
    public void call(Tuple2<String, Iterable<Integer>> tp) throws Exception {
        // Tuple2 fields must be read via accessor methods from Java
        String key = tp._1();
        Iterable<Integer> values = tp._2();
        int sum = 0;
        for (Integer value : values) {
            sum += value;
        }
        System.out.println(key + ":" + sum);
    }
});
sc.stop();
```
Scala code:
```scala
val conf: SparkConf = new SparkConf().setMaster("local").setAppName("GroupByKeyTest")
val sc = new SparkContext(conf)
// groupByKey: group the elements of the RDD by key
val result: RDD[(String, Iterable[Int])] = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5), ("b", 6), ("c", 7), ("d", 8)))
  .groupByKey()
result.foreach(tp => {
  val key: String = tp._1
  val values: Iterable[Int] = tp._2.toList
  var sum = 0
  for (value <- values) {
    sum += value
  }
  println(s"key:${key},sum:${sum}")
})
sc.stop()
```
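
As noted above, when all you need per key is an aggregate such as a sum, reduceByKey is the more efficient choice, because it combines values locally within each partition before the shuffle. Below is a minimal Scala sketch (not part of the original examples) computing the same per-key sums on the same data:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("ReduceByKeyTest")
val sc = new SparkContext(conf)
// reduceByKey pre-aggregates values of the same key inside each partition
// (map-side combine), so less data is shuffled than with groupByKey + sum.
val sums: RDD[(String, Int)] = sc
  .parallelize(List(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5), ("b", 6), ("c", 7), ("d", 8)))
  .reduceByKey(_ + _)
sums.foreach { case (key, sum) => println(s"key:${key},sum:${sum}") }
sc.stop()
```

The output matches the groupByKey version above (a:6, b:8, c:10, d:12), but each partition computes its partial sums before any data crosses the network.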
2. filter
filter keeps the records that satisfy a condition: records for which the supplied function returns true are retained, and records for which it returns false are dropped.
Example: keep the strings whose length is greater than 5.
Java code:
```java
SparkConf conf = new SparkConf();
conf.setMaster("local");
conf.setAppName("filter");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("zhangsan", "lisi", "wangwu", "maliu"));
// filter: keep the strings whose length is greater than 5
JavaRDD<String> rdd2 = rdd1.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        return s.length() > 5;
    }
});
rdd2.foreach(s -> System.out.println(s));
sc.stop();
```
Scala code:
```scala
val conf = new SparkConf().setMaster("local").setAppName("filter")
val sc = new SparkContext(conf)
// filter: keep the strings whose length is greater than 5
val rdd: RDD[String] = sc.parallelize(Array("zhangsan", "lisi", "wangwu", "maliu"))
rdd.filter(str => str.length > 5)
  .foreach(println)
sc.stop()
```
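
The two operators also compose naturally. Purely as an illustration (the data, the threshold of 10, and the name bigKeys are hypothetical, not from the original post), here is a minimal sketch of groupByKey followed by filter that keeps only the keys whose values sum past a threshold:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("GroupThenFilter")
val sc = new SparkContext(conf)
// Group values by key, then keep only the keys whose values sum past the
// threshold. Data and the threshold of 10 are made up for illustration.
val bigKeys: RDD[(String, Iterable[Int])] = sc
  .parallelize(List(("a", 4), ("b", 2), ("a", 7), ("b", 3), ("c", 1)))
  .groupByKey()
  .filter { case (_, values) => values.sum > 10 }
bigKeys.foreach(println) // only the "a" entry survives: 4 + 7 = 11 > 10
sc.stop()
```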