Spark mapGroups Explained, with Several Usage Examples

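mapGroups is a method on KeyValueGroupedDataset, the type returned by Dataset.groupByKey. It passes each group's key and an iterator over that group's rows to a function you supply, and that function returns exactly one result per group. The examples below all use the same small sample dataset.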

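Basic sample data

Assume a small dataset of student scores:

Scala
// Assumes a local SparkSession; in spark-shell, `spark` already exists
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").appName("mapGroupsExamples").getOrCreate()
import spark.implicits._  // needed for toDF and for the encoders used by groupByKey/mapGroups

// Build the sample DataFrame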
val studentScores = Seq(
  ("Math", "Alice", 85),
  ("Math", "Bob", 92),
  ("Math", "Charlie", 78),
  ("Science", "Alice", 88),
  ("Science", "Bob", 95),
  ("Science", "Charlie", 82),
  ("English", "Alice", 90),
  ("English", "Bob", 87),
  ("English", "Charlie", 91)
).toDF("subject", "name", "score")

// Group by subject; the result is a KeyValueGroupedDataset[String, Row]
val grouped = studentScores.groupByKey(row => row.getAs[String]("subject"))
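
The examples in this article read fields with Row.getAs, which is only checked at runtime. As a rough sketch of a typed alternative, the same grouping can be done on a Dataset of case class instances. StudentScore, typedScores, and typedAverages are hypothetical names introduced here for illustration, and only the Math rows are included to keep the sketch short.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class for this sketch; its fields mirror the sample columns
case class StudentScore(subject: String, name: String, score: Int)

val spark = SparkSession.builder().master("local[*]").appName("typedMapGroups").getOrCreate()
import spark.implicits._

val typedScores: Dataset[StudentScore] = Seq(
  StudentScore("Math", "Alice", 85),
  StudentScore("Math", "Bob", 92),
  StudentScore("Math", "Charlie", 78)
).toDS()

// Both the key function and the group iterator are statically typed here
val typedAverages: Dataset[(String, Double)] =
  typedScores.groupByKey(_.subject).mapGroups { (subject, rows) =>
    val scores = rows.map(_.score).toList  // fine for a tiny group
    (subject, scores.sum.toDouble / scores.size)
  }

typedAverages.show()
```

With this shape, a misspelled field name is a compile error rather than a runtime failure.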

Example 1: Average score per subject

Scala
val subjectAverages = grouped.mapGroups { (subject, iterator) =>
  var total = 0
  var count = 0
  
  while (iterator.hasNext) {
    val row = iterator.next()
    total += row.getAs[Int]("score")
    count += 1
  }
  
  (subject, if (count > 0) total.toDouble / count else 0.0)
}.toDF("subject", "average_score")

subjectAverages.show()

Output (row order is not guaranteed):

+-------+-----------------+
|subject|    average_score|
+-------+-----------------+
|   Math|             85.0|
|Science|88.33333333333333|
|English|89.33333333333333|
+-------+-----------------+
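
For comparison, a per-subject average like this one is usually written with Spark's built-in aggregation rather than mapGroups. The following is a minimal sketch under the assumption of a local SparkSession; the names scores and builtinAverages are introduced here for illustration, and only two subjects are included for brevity.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("builtinAvg").getOrCreate()
import spark.implicits._

val scores = Seq(
  ("Math", "Alice", 85), ("Math", "Bob", 92), ("Math", "Charlie", 78),
  ("Science", "Alice", 88), ("Science", "Bob", 95), ("Science", "Charlie", 82)
).toDF("subject", "name", "score")

// groupBy + avg can pre-aggregate on each partition before the shuffle,
// which mapGroups cannot do
val builtinAverages = scores
  .groupBy("subject")
  .agg(avg("score").as("average_score"))

builtinAverages.show()
```

The mapGroups version remains useful as a template for logic that has no built-in equivalent.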

Example 2: Top score and top student per subject

Scala
val topScores = grouped.mapGroups { (subject, iterator) =>
  var maxScore = Int.MinValue
  var topStudent = ""
  
  while (iterator.hasNext) {
    val row = iterator.next()
    val score = row.getAs[Int]("score")
    val name = row.getAs[String]("name")
    
    if (score > maxScore) {
      maxScore = score
      topStudent = name
    }
  }
  
  (subject, topStudent, maxScore)
}.toDF("subject", "top_student", "top_score")

topScores.show()

Output:

+-------+-----------+---------+
|subject|top_student|top_score|
+-------+-----------+---------+
|   Math|        Bob|       92|
|Science|        Bob|       95|
|English|    Charlie|       91|
+-------+-----------+---------+
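
A "keep the best record" computation like this can also be expressed with reduceGroups, which combines two records at a time instead of iterating over the whole group. The sketch below assumes a typed Dataset; the Score case class and the names ds and topByReduce are hypothetical, and ties are resolved arbitrarily.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper type for this sketch
case class Score(subject: String, name: String, score: Int)

val spark = SparkSession.builder().master("local[*]").appName("reduceGroupsTop").getOrCreate()
import spark.implicits._

val ds = Seq(
  Score("Math", "Alice", 85), Score("Math", "Bob", 92), Score("Math", "Charlie", 78),
  Score("English", "Alice", 90), Score("English", "Bob", 87), Score("English", "Charlie", 91)
).toDS()

// reduceGroups returns Dataset[(key, reducedValue)]; the function must be
// associative and commutative because Spark may apply it in any order
val topByReduce = ds
  .groupByKey(_.subject)
  .reduceGroups((a: Score, b: Score) => if (a.score >= b.score) a else b)
  .map { case (subject, best) => (subject, best.name, best.score) }
  .toDF("subject", "top_student", "top_score")

topByReduce.show()
```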

Example 3: Score distribution per subject (number of students in each score band)

Scala
val scoreDistribution = grouped.mapGroups { (subject, iterator) =>
  var excellent = 0  // 90-100
  var good = 0       // 80-89
  var average = 0    // 70-79
  var below = 0      // <70
  
  while (iterator.hasNext) {
    val row = iterator.next()
    val score = row.getAs[Int]("score")
    
    if (score >= 90) excellent += 1
    else if (score >= 80) good += 1
    else if (score >= 70) average += 1
    else below += 1
  }
  
  (subject, excellent, good, average, below)
}.toDF("subject", "excellent", "good", "average", "below_70")

scoreDistribution.show()

Output (English has two scores of 90 or above, 90 and 91, and one in the 80s):

+-------+---------+----+-------+--------+
|subject|excellent|good|average|below_70|
+-------+---------+----+-------+--------+
|   Math|        1|   1|      1|       0|
|Science|        1|   2|      0|       0|
|English|        2|   1|      0|       0|
+-------+---------+----+-------+--------+
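
The band counting above can also be written without mutable counters by folding over the iterator. This is a plain-Scala sketch that runs without Spark; Bands and countBands are hypothetical names introduced for illustration.

```scala
// Hypothetical accumulator: one counter per score band
case class Bands(excellent: Int = 0, good: Int = 0, average: Int = 0, below: Int = 0)

// Counts scores per band in a single pass, without materializing the iterator
def countBands(scores: Iterator[Int]): Bands =
  scores.foldLeft(Bands()) { (acc, s) =>
    if (s >= 90) acc.copy(excellent = acc.excellent + 1)
    else if (s >= 80) acc.copy(good = acc.good + 1)
    else if (s >= 70) acc.copy(average = acc.average + 1)
    else acc.copy(below = acc.below + 1)
  }

// Inside mapGroups this could be used as:
//   grouped.mapGroups { (subject, rows) =>
//     val b = countBands(rows.map(_.getAs[Int]("score")))
//     (subject, b.excellent, b.good, b.average, b.below)
//   }
val english = countBands(Iterator(90, 87, 91))
println(english)  // prints Bands(2,1,0,0)
```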

Example 4: Generate a score report for each subject

Scala
val subjectReports = grouped.mapGroups { (subject, iterator) =>
  var students = List[String]()
  var scores = List[Int]()
  var total = 0
  var count = 0
  
  while (iterator.hasNext) {
    val row = iterator.next()
    val name = row.getAs[String]("name")
    val score = row.getAs[Int]("score")
    
    students = name :: students
    scores = score :: scores
    total += score
    count += 1
  }
  
  val average = if (count > 0) total.toDouble / count else 0.0
  val maxScore = if (scores.nonEmpty) scores.max else 0
  val minScore = if (scores.nonEmpty) scores.min else 0
  
  s"Subject: $subject | Students: ${students.mkString(", ")} | " +
  s"Average: $average | Max: $maxScore | Min: $minScore"
}.toDF("report")

subjectReports.show(false)

Output (the student order depends on the order in which rows reach each group):

+-------------------------------------------------------------------------------------------------+
|report                                                                                           |
+-------------------------------------------------------------------------------------------------+
|Subject: Math | Students: Charlie, Bob, Alice | Average: 85.0 | Max: 92 | Min: 78                |
|Subject: Science | Students: Charlie, Bob, Alice | Average: 88.33333333333333 | Max: 95 | Min: 82|
|Subject: English | Students: Charlie, Bob, Alice | Average: 89.33333333333333 | Max: 91 | Min: 87|
+-------------------------------------------------------------------------------------------------+

Example 5: Standard deviation of scores per subject

Scala
val subjectStdDev = grouped.mapGroups { (subject, iterator) =>
  var scores = List[Double]()
  var sum = 0.0
  var count = 0
  
  // First pass (over the iterator): collect the scores and accumulate their sum
  while (iterator.hasNext) {
    val row = iterator.next()
    val score = row.getAs[Int]("score").toDouble
    scores = score :: scores
    sum += score
    count += 1
  }
  
  if (count == 0) {
    (subject, 0.0)
  } else {
    val mean = sum / count
    
    // Second pass (over the collected list): population variance
    var variance = 0.0
    scores.foreach(score => {
      variance += Math.pow(score - mean, 2)
    })
    variance /= count
    
    // Standard deviation
    val stdDev = Math.sqrt(variance)
    (subject, stdDev)
  }
}.toDF("subject", "std_dev")

subjectStdDev.show()

Output, with values rounded here to four decimal places (show() prints full double precision). The code computes the population standard deviation, dividing by the count:

+-------+-------+
|subject|std_dev|
+-------+-------+
|   Math| 5.7155|
|Science| 5.3125|
|English| 1.6997|
+-------+-------+
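
The version above copies every score into a list before computing the variance. As a sketch of an alternative, the population standard deviation can be accumulated in a single pass with Welford's method, so the iterator never needs to be materialized. This is plain Scala that runs without Spark; populationStdDev is a hypothetical helper, not part of any Spark API.

```scala
// Hypothetical helper: population standard deviation in one pass (Welford's method)
def populationStdDev(values: Iterator[Double]): Double = {
  var n = 0L
  var mean = 0.0
  var m2 = 0.0  // running sum of squared deviations from the current mean
  values.foreach { x =>
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
  }
  if (n == 0) 0.0 else math.sqrt(m2 / n)
}

// Inside mapGroups this could be used as:
//   grouped.mapGroups { (subject, rows) =>
//     (subject, populationStdDev(rows.map(_.getAs[Int]("score").toDouble)))
//   }
val mathStdDev = populationStdDev(Iterator(85.0, 92.0, 78.0))
println(mathStdDev)
```

For the Math scores the squared deviations sum to 98, so the result is the square root of 98/3, roughly 5.7155.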

Example 6: Build a custom summary for each subject

Scala
val customSummaries = grouped.mapGroups { (subject, iterator) =>
  // Materialize the group; acceptable here because each group is tiny
  val data = iterator.toList.map(row =>
    (row.getAs[String]("name"), row.getAs[Int]("score"))
  )

  // Sort by score, highest first
  val sorted = data.sortBy(-_._2)

  // Compute statistics
  val scores = sorted.map(_._2)
  val average = scores.sum.toDouble / scores.length
  val median = if (scores.length % 2 == 1) {
    scores(scores.length / 2).toDouble
  } else {
    (scores(scores.length / 2 - 1) + scores(scores.length / 2)) / 2.0
  }

  // Build the summary. Values are converted to String because Spark has no
  // encoder for Map[String, Any]; a mixed-type map does not compile here.
  Map(
    "subject" -> subject,
    "top_student" -> sorted.head._1,
    "top_score" -> sorted.head._2.toString,
    "average" -> average.toString,
    "median" -> median.toString,
    "student_count" -> scores.length.toString
  )
}.toDF("summary")

customSummaries.show(false)

Output, assuming Spark 3.x, which prints maps in braces. The key order inside each map is not guaranteed and may differ from what is shown:

+-------------------------------------------------------------------------------------------------------------------------------+
|summary                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------+
|{subject -> Math, top_student -> Bob, top_score -> 92, average -> 85.0, median -> 85.0, student_count -> 3}                    |
|{subject -> Science, top_student -> Bob, top_score -> 95, average -> 88.33333333333333, median -> 88.0, student_count -> 3}    |
|{subject -> English, top_student -> Charlie, top_score -> 91, average -> 89.33333333333333, median -> 90.0, student_count -> 3}|
+-------------------------------------------------------------------------------------------------------------------------------+
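
A map column loses the individual value types. One possible alternative, sketched here under the assumption of a local SparkSession, is to return a case class so that each field becomes its own typed column. SubjectSummary, scoresDf, and typedSummaries are hypothetical names, and only the Math rows are included for brevity.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical result type for this sketch; each field becomes a typed column
case class SubjectSummary(subject: String, topStudent: String, topScore: Int,
                          average: Double, median: Double, studentCount: Int)

val spark = SparkSession.builder().master("local[*]").appName("typedSummary").getOrCreate()
import spark.implicits._

val scoresDf = Seq(
  ("Math", "Alice", 85), ("Math", "Bob", 92), ("Math", "Charlie", 78)
).toDF("subject", "name", "score")

val typedSummaries = scoresDf
  .groupByKey(row => row.getAs[String]("subject"))
  .mapGroups { (subject, rows) =>
    // Materialize and sort by score, highest first (small groups only)
    val sorted = rows.map(r => (r.getAs[String]("name"), r.getAs[Int]("score")))
      .toVector.sortBy(-_._2)
    val scores = sorted.map(_._2)
    val mid = scores.length / 2
    val median =
      if (scores.length % 2 == 1) scores(mid).toDouble
      else (scores(mid - 1) + scores(mid)) / 2.0
    SubjectSummary(subject, sorted.head._1, sorted.head._2,
      scores.sum.toDouble / scores.length, median, scores.length)
  }

typedSummaries.show()
```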

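Notes

  1. Memory use: mapGroups hands your function an iterator, and Spark can spill a large group to disk during the shuffle. Materializing that iterator (for example with toList, as Examples 4 to 6 effectively do) holds the whole group in memory at once, which can cause out-of-memory errors for very large groups.

  2. Performance: mapGroups does not support partial aggregation, so every row is shuffled. For simple aggregations such as sums and counts, Spark's built-in aggregate functions are usually more efficient.

  3. Data skew: if a few groups are much larger than the rest, the tasks that process them can run far longer than the others.

  4. Iterator use: the iterator passed to mapGroups can be traversed only once. To read the data more than once, first copy it into a list or an array.

  5. Type safety: the function's return type needs an implicit Encoder (import spark.implicits._ covers primitives, tuples, and case classes), and a return type with no encoder is rejected at compile time. A wrong type argument to row.getAs[T], by contrast, only fails at runtime.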
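
The single-traversal rule for iterators can be seen in plain Scala, without Spark. In this small sketch, group is a stand-in for the iterator that mapGroups passes in, and all names are illustrative.

```scala
val group = Iterator(85, 92, 78)   // stand-in for the iterator passed to mapGroups

val firstPass = group.sum          // consumes the iterator
val exhausted = !group.hasNext     // true: nothing is left for a second pass

// Materialize once when several passes are needed (only for groups that fit in memory)
val buffered = Iterator(85, 92, 78).toVector
val total = buffered.sum
val maxScore = buffered.max

println(s"firstPass=$firstPass exhausted=$exhausted total=$total max=$maxScore")
```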

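mapGroups is a flexible tool for per-group logic that built-in aggregations cannot express. As the examples above show, it can handle statistical calculations, data transformation, and report generation.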
