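mapGroups is a grouping operation in Spark's typed Dataset API. You call it on the KeyValueGroupedDataset returned by groupByKey, and it applies a custom function to each group and returns exactly one result per group. The examples below show several concrete uses on a small sample dataset.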
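Sample data
Suppose we have a small dataset of student scores:
Scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it
val spark = SparkSession.builder().appName("mapGroups-examples").master("local[*]").getOrCreate()
import spark.implicits._ // enables toDF and supplies the encoders used by groupByKey and mapGroups

// Create the sample DataFrame
val studentScores = Seq(
  ("Math", "Alice", 85),
  ("Math", "Bob", 92),
  ("Math", "Charlie", 78),
  ("Science", "Alice", 88),
  ("Science", "Bob", 95),
  ("Science", "Charlie", 82),
  ("English", "Alice", 90),
  ("English", "Bob", 87),
  ("English", "Charlie", 91)
).toDF("subject", "name", "score")

// Group by subject: the key of each group is the value of the "subject" column
val grouped = studentScores.groupByKey(row => row.getAs[String]("subject"))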
Example 1: Average score per subject
Scala
val subjectAverages = grouped.mapGroups { (subject, iterator) =>
  // Accumulate the sum and the row count in a single pass over the group
  var total = 0
  var count = 0
  while (iterator.hasNext) {
    val row = iterator.next()
    total += row.getAs[Int]("score")
    count += 1
  }
  (subject, if (count > 0) total.toDouble / count else 0.0)
}.toDF("subject", "average_score")
subjectAverages.show()
Output (row order is not guaranteed; averages are shown rounded to two decimal places, while show() prints the full double):
Scala
+-------+-------------+
|subject|average_score|
+-------+-------------+
|   Math|         85.0|
|Science|        88.33|
|English|        89.33|
+-------+-------------+
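For a plain average like this one, the same numbers can also be obtained with Spark's built-in aggregate functions, which support partial aggregation. The sketch below is an illustration and not part of the original example. It rebuilds the Math and Science rows of the sample data so that it runs on its own, and the names `scoresDf` and `builtInStats` are made up for this sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

// Assumes a local Spark installation; in spark-shell the existing session is reused
val spark = SparkSession.builder().appName("built-in-aggregates").master("local[*]").getOrCreate()
import spark.implicits._

// Math and Science rows from the sample data
val scoresDf = Seq(
  ("Math", "Alice", 85), ("Math", "Bob", 92), ("Math", "Charlie", 78),
  ("Science", "Alice", 88), ("Science", "Bob", 95), ("Science", "Charlie", 82)
).toDF("subject", "name", "score")

// groupBy + agg uses built-in aggregates, which Spark can partially
// aggregate on each partition before shuffling
val builtInStats = scoresDf
  .groupBy("subject")
  .agg(avg("score").as("average_score"), max("score").as("top_score"))

builtInStats.show()
```

mapGroups remains the better fit when the per-group logic cannot be expressed with built-in aggregates.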
Example 2: Top score and top student per subject
Scala
val topScores = grouped.mapGroups { (subject, iterator) =>
  // Track the best score seen so far and the student who earned it
  var maxScore = Int.MinValue
  var topStudent = ""
  while (iterator.hasNext) {
    val row = iterator.next()
    val score = row.getAs[Int]("score")
    val name = row.getAs[String]("name")
    if (score > maxScore) {
      maxScore = score
      topStudent = name
    }
  }
  (subject, topStudent, maxScore)
}.toDF("subject", "top_student", "top_score")
topScores.show()
Output (row order is not guaranteed):
Scala
+-------+-----------+---------+
|subject|top_student|top_score|
+-------+-----------+---------+
|   Math|        Bob|       92|
|Science|        Bob|       95|
|English|    Charlie|       91|
+-------+-----------+---------+
Example 3: Score distribution per subject (number of students in each score band)
Scala
val scoreDistribution = grouped.mapGroups { (subject, iterator) =>
  var excellent = 0 // 90-100
  var good = 0      // 80-89
  var average = 0   // 70-79
  var below = 0     // <70
  while (iterator.hasNext) {
    val row = iterator.next()
    val score = row.getAs[Int]("score")
    if (score >= 90) excellent += 1
    else if (score >= 80) good += 1
    else if (score >= 70) average += 1
    else below += 1
  }
  (subject, excellent, good, average, below)
}.toDF("subject", "excellent", "good", "average", "below_70")
scoreDistribution.show()
Output (row order is not guaranteed). English has two scores of 90 or above (90 and 91) and one in the 80-89 band (87):
Scala
+-------+---------+----+-------+--------+
|subject|excellent|good|average|below_70|
+-------+---------+----+-------+--------+
|   Math|        1|   1|      1|       0|
|Science|        1|   2|      0|       0|
|English|        2|   1|      0|       0|
+-------+---------+----+-------+--------+
Example 4: Generating a score report per subject
Scala
val subjectReports = grouped.mapGroups { (subject, iterator) =>
  var students = List[String]()
  var scores = List[Int]()
  var total = 0
  var count = 0
  while (iterator.hasNext) {
    val row = iterator.next()
    val name = row.getAs[String]("name")
    val score = row.getAs[Int]("score")
    // Prepending is O(1) but leaves both lists in reverse arrival order
    students = name :: students
    scores = score :: scores
    total += score
    count += 1
  }
  val average = if (count > 0) total.toDouble / count else 0.0
  val maxScore = if (scores.nonEmpty) scores.max else 0
  val minScore = if (scores.nonEmpty) scores.min else 0
  s"Subject: $subject | Students: ${students.mkString(", ")} | " +
    s"Average: $average | Max: $maxScore | Min: $minScore"
}.toDF("report")
subjectReports.show(false)
Output (row order and the order of students within a group are not guaranteed; averages are shown rounded to two decimal places, while the real strings contain the full double):
Scala
+-------------------------------------------------------------------------------------+
|report                                                                               |
+-------------------------------------------------------------------------------------+
|Subject: Math | Students: Charlie, Bob, Alice | Average: 85.0 | Max: 92 | Min: 78    |
|Subject: Science | Students: Charlie, Bob, Alice | Average: 88.33 | Max: 95 | Min: 82|
|Subject: English | Students: Charlie, Bob, Alice | Average: 89.33 | Max: 91 | Min: 87|
+-------------------------------------------------------------------------------------+
Example 5: Standard deviation of scores per subject
Scala
val subjectStdDev = grouped.mapGroups { (subject, iterator) =>
  var scores = List[Double]()
  var sum = 0.0
  var count = 0
  // First pass: consume the iterator, keeping the scores and accumulating the sum
  while (iterator.hasNext) {
    val row = iterator.next()
    val score = row.getAs[Int]("score").toDouble
    scores = score :: scores
    sum += score
    count += 1
  }
  if (count == 0) {
    (subject, 0.0)
  } else {
    val mean = sum / count
    // Second pass, over the saved list: sum of squared deviations from the mean
    var variance = 0.0
    scores.foreach(score => {
      variance += Math.pow(score - mean, 2)
    })
    variance /= count // population variance: divide by count, not count - 1
    // Standard deviation is the square root of the variance
    val stdDev = Math.sqrt(variance)
    (subject, stdDev)
  }
}.toDF("subject", "std_dev")
subjectStdDev.show()
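Output (row order is not guaranteed; values are shown rounded to four decimal places). These are population standard deviations. For Math, for example, the mean is 85 and the variance is (0 + 49 + 49) / 3 = 32.67, so the standard deviation is about 5.7155:
Scala
+-------+-------+
|subject|std_dev|
+-------+-------+
|   Math| 5.7155|
|Science| 5.3125|
|English| 1.6997|
+-------+-------+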
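Example 5 keeps every score in a list so that it can make a second pass. When groups may be large, the same population standard deviation can be computed in one pass without storing the scores. The sketch below uses Welford's online algorithm, which is a substitute technique and not the method used in the example above. The helper name `populationStdDev` is hypothetical.

```scala
// Population standard deviation in a single pass (Welford's online algorithm).
// Returns 0.0 for an empty iterator, mirroring the count == 0 branch in Example 5.
def populationStdDev(values: Iterator[Double]): Double = {
  var count = 0L
  var mean = 0.0
  var m2 = 0.0 // running sum of squared deviations from the current mean
  while (values.hasNext) {
    val x = values.next()
    count += 1
    val delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
  }
  if (count == 0) 0.0 else math.sqrt(m2 / count)
}

// Math scores from the sample data: mean 85, variance 98 / 3
val mathStdDev = populationStdDev(Iterator(85.0, 92.0, 78.0))
println(mathStdDev)
```

Inside mapGroups it could be called as `populationStdDev(iterator.map(_.getAs[Int]("score").toDouble))`, which streams over the group instead of building a list.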
Example 6: Building a custom summary per subject
Scala
val customSummaries = grouped.mapGroups { (subject, iterator) =>
  // Collect the whole group; acceptable only when each group fits in memory
  val data = iterator.toList.map(row =>
    (row.getAs[String]("name"), row.getAs[Int]("score"))
  )
  // Sort by score, highest first
  val sorted = data.sortBy(-_._2)
  // Compute statistics
  val scores = sorted.map(_._2)
  val average = scores.sum.toDouble / scores.length
  val median = if (scores.length % 2 == 1) {
    scores(scores.length / 2).toDouble
  } else {
    (scores(scores.length / 2 - 1) + scores(scores.length / 2)) / 2.0
  }
  // Build the summary. Every value is converted to String because mixed value
  // types would infer Map[String, Any], for which Spark has no Encoder
  Map(
    "subject" -> subject,
    "top_student" -> sorted.head._1,
    "top_score" -> sorted.head._2.toString,
    "average" -> average.toString,
    "median" -> median.toString,
    "student_count" -> scores.length.toString
  )
}.toDF("summary")
customSummaries.show(false)
Output (row order and the order of keys within each map are not guaranteed; averages are shown rounded to two decimal places; the brace style is what Spark 3.x prints for map columns, and older versions format maps differently):
Scala
+-------------------------------------------------------------------------------------------------------------------+
|summary                                                                                                            |
+-------------------------------------------------------------------------------------------------------------------+
|{subject -> Math, top_student -> Bob, top_score -> 92, average -> 85.0, median -> 85.0, student_count -> 3}        |
|{subject -> Science, top_student -> Bob, top_score -> 95, average -> 88.33, median -> 88.0, student_count -> 3}    |
|{subject -> English, top_student -> Charlie, top_score -> 91, average -> 89.33, median -> 90.0, student_count -> 3}|
+-------------------------------------------------------------------------------------------------------------------+
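A Map with string values loses the numeric types. A case class is one alternative that keeps them and gives each field its own column. The sketch below is only an illustration under stated assumptions: the case classes `Score` and `SubjectSummary` and all of their field names are invented here, and the code rebuilds part of the sample data so that it runs independently (for example in spark-shell).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row and result types, introduced only for this sketch
case class Score(subject: String, name: String, score: Int)
case class SubjectSummary(subject: String, topStudent: String, topScore: Int,
                          average: Double, studentCount: Int)

val spark = SparkSession.builder().appName("typed-summary").master("local[*]").getOrCreate()
import spark.implicits._

val typedScores = Seq(
  Score("Math", "Alice", 85), Score("Math", "Bob", 92), Score("Math", "Charlie", 78),
  Score("Science", "Alice", 88), Score("Science", "Bob", 95), Score("Science", "Charlie", 82)
).toDS()

val typedSummaries = typedScores
  .groupByKey(_.subject)
  .mapGroups { (subject, rows) =>
    // Materializes the group: fine for three rows, risky for very large groups
    val all = rows.toList
    val top = all.maxBy(_.score)
    SubjectSummary(subject, top.name, top.score,
      all.map(_.score).sum.toDouble / all.size, all.size)
  }

typedSummaries.show()
```

With typed rows the function body reads fields directly (`_.score`) instead of calling `getAs` with a column name, so a misspelled field is caught by the compiler.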
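Notes
- Memory usage: mapGroups passes your function an iterator over the group, and Spark can spill a large group to disk internally. Memory becomes a problem when the function materializes the whole group, for example with iterator.toList, so stream over the iterator when groups may be very large.
- Performance: mapGroups does not support partial aggregation, so all of the data is shuffled. For simple aggregations such as sums and counts, Spark's built-in aggregate functions (or reduceGroups, or an Aggregator) are usually more efficient.
- Data skew: if some groups are much larger than others, the tasks that process them can run far longer than the rest.
- Iterator usage: the iterator passed to mapGroups can be traversed only once. To access the data more than once, convert it to a list or array first, keeping the memory cost described above in mind.
- Type safety: the function's result type must be one that Spark has an Encoder for, such as primitives, strings, tuples, and case classes. A result type without an encoder, such as Map[String, Any], makes the query fail.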
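The single-pass behavior of the iterator can be seen with a plain Scala Iterator, without Spark. This small sketch reuses the Math scores as input and introduces a hypothetical helper, `minMaxMean`. It shows that a second traversal finds nothing left, and one way to gather several statistics in a single pass.

```scala
// A plain Scala iterator behaves like the one mapGroups passes in: one traversal only
val values = Iterator(85, 92, 78)
var firstPassCount = 0
while (values.hasNext) { values.next(); firstPassCount += 1 }
val leftForSecondPass = values.hasNext // false: the iterator is exhausted

// Gather min, max, and mean in one traversal instead of calling toList
def minMaxMean(input: Iterator[Int]): Option[(Int, Int, Double)] = {
  if (!input.hasNext) None
  else {
    val first = input.next()
    var lowest = first
    var highest = first
    var sum = first.toLong
    var n = 1
    while (input.hasNext) {
      val v = input.next()
      if (v < lowest) lowest = v
      if (v > highest) highest = v
      sum += v
      n += 1
    }
    Some((lowest, highest, sum.toDouble / n))
  }
}

val stats = minMaxMean(Iterator(85, 92, 78))
println(stats)
```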
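mapGroups is a very flexible function, suited to cases that need custom, complex per-group logic. As the examples above show, it can be used for statistical calculations, data transformation, and report generation.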