Built-in Methods of the GroupedData Type in PySpark

Table of Contents

    • Built-in methods of pyspark.sql.group.GroupedData
    • agg (aggregation)
    • apply
    • applyInPandas
    • avg (alias: mean)
    • count
    • max
    • min
    • sum
    • pivot

Built-in methods of pyspark.sql.group.GroupedData

agg (aggregation)

df = spark.createDataFrame(
     [(2, "Alice"), (3, "Alice"), (5, "Bob"), (10, "Bob")], ["age", "name"])
df.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  3|Alice|
|  5|  Bob|
| 10|  Bob|
+---+-----+
df.groupBy(df.name)
<pyspark.sql.group.GroupedData object at 0x7f74be2e4240>
df.groupBy(df.name).agg({'*':'count'}).show()
+-----+--------+
| name|count(1)|
+-----+--------+
|Alice|       2|
|  Bob|       2|
+-----+--------+

df.groupBy(df.name).agg({'age':'min'}).sort("name").show()
+-----+--------+
| name|min(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+

from pyspark.sql import functions as F
# Same aggregate as a column expression; F.min avoids shadowing Python's built-in min
df.groupBy(df.name).agg(F.min(df.age)).sort("name").show()
+-----+--------+
| name|min(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+


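from pyspark.sql.functions import pandas_udf, PandasUDFType
# Grouped-aggregate pandas UDF: receives one pandas Series per group and returns a scalar
@pandas_udf('int', PandasUDFType.GROUPED_AGG)
def min_udf(v):
    print(type(v))  # debug output from the Python worker
    print(v)
    return v.min()
df.groupBy(df.name).agg(min_udf(df.age)).sort("name").show()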

<class 'pandas.core.series.Series'> 
0    2
1    3
+-----+------------+
| name|min_udf(age)|
+-----+------------+
|Alice|           2|
|  Bob|           5|
+-----+------------+
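
The dict form of agg maps each column name to a single aggregate name. Because of how Python dicts work (this is not specific to Spark), it cannot ask for two aggregates of the same column. A minimal plain-Python sketch, where `spec` is only an illustrative name:

```python
# A dict literal keeps only the last value for a repeated key,
# so this spec asks for max(age) only; the "min" entry is silently lost.
spec = {"age": "min", "age": "max"}
print(spec)  # {'age': 'max'}
```

To request both aggregates, pass column expressions instead, for example `df.groupBy("name").agg(F.min("age"), F.max("age"))` after `from pyspark.sql import functions as F`.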

apply

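Parameter: a function decorated with pyspark.sql.functions.pandas_udf().

The Spark documentation describes apply as an alias of applyInPandas that takes a pandas_udf instead of a plain Python function, and it recommends applyInPandas for new code.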

from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
    
df.show()
+---+----+
| id|   v|
+---+----+
|  1| 1.0|
|  1| 2.0|
|  2| 3.0|
|  2| 5.0|
|  2|10.0|
+---+----+

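# Grouped-map pandas UDF: receives each group as a pandas DataFrame
# and returns a DataFrame that matches the declared schema
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    print(pdf)  # debug: the input group
    v = pdf.v
    d1 = pdf.assign(v=(v - v.mean()) / v.std())  # standardize v within the group
    print(d1, type(d1))  # debug: the transformed group
    return d1
df.groupby("id").apply(normalize).show()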
   id    v
0   1  1.0
1   1  2.0
   id         v
0   1 -0.707107
1   1  0.707107 <class 'pandas.core.frame.DataFrame'>
   id     v
0   2   3.0
1   2   5.0
2   2  10.0
   id        v
0   2 -0.83205
1   2 -0.27735
2   2  1.10940 <class 'pandas.core.frame.DataFrame'>
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+

applyInPandas

Parameters: a plain Python function, and the schema of the DataFrame it returns.

import pandas as pd  
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  
    
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())


df.groupby("id").applyInPandas(
    normalize, schema="id long, v double").show()  
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+

# Returning an empty DataFrame for every group yields an empty result
# that still has the declared schema
def normalize(pdf):
    return pd.DataFrame()
df.groupby("id").applyInPandas(
    normalize, schema="id long, v double").show()
+---+---+
| id|  v|
+---+---+
+---+---+
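
The split-apply-combine behaviour shown above can be sketched without Spark or pandas. This is only an illustration of the semantics, not how Spark implements it: Spark distributes the groups across workers and passes each one to the function as a pandas DataFrame. The helper names `normalize_group` and `apply_per_group` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# (id, v) pairs, the same data as the DataFrame above
rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]

def normalize_group(values):
    """Standardize one group's values: (v - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)  # stdev divides by n-1, like pandas Series.std()
    return [(v - m) / s for v in values]

def apply_per_group(rows, func):
    """Split rows by id, apply func to each group's values, then concatenate."""
    groups = defaultdict(list)
    for key, v in rows:
        groups[key].append(v)
    out = []
    for key in sorted(groups):
        out.extend((key, v) for v in func(groups[key]))
    return out

result = apply_per_group(rows, normalize_group)
for row in result:
    print(row)
```

The values agree with the applyInPandas output above up to floating-point rounding, since both use the sample standard deviation.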

avg (alias: mean)

df = spark.createDataFrame([
    (2, "Alice", 80), (3, "Alice", 100),
    (5, "Bob", 120), (10, "Bob", 140)], ["age", "name", "height"])
df.show()
+---+-----+------+
|age| name|height|
+---+-----+------+
|  2|Alice|    80|
|  3|Alice|   100|
|  5|  Bob|   120|
| 10|  Bob|   140|
+---+-----+------+

# Call the avg method directly
df.groupBy("name").avg('age').sort("name").show()
+-----+--------+
| name|avg(age)|
+-----+--------+
|Alice|     2.5|
|  Bob|     7.5|
+-----+--------+

# The same result through agg
df.groupBy("name").agg({"age":'avg'}).sort("name").show()
+-----+--------+
| name|avg(age)|
+-----+--------+
|Alice|     2.5|
|  Bob|     7.5|
+-----+--------+

count

df = spark.createDataFrame(
     [(2, "Alice"), (3, "Alice"), (5, "Bob"), (10, "Bob")], ["age", "name"])
df.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  3|Alice|
|  5|  Bob|
| 10|  Bob|
+---+-----+

df.groupBy(df.name).count().sort("name").show()
+-----+-----+
| name|count|
+-----+-----+
|Alice|    2|
|  Bob|    2|
+-----+-----+

df.groupBy("name").agg({"name":'count'}).sort("name").show()
+-----+-----------+
| name|count(name)|
+-----+-----------+
|Alice|          2|
|  Bob|          2|
+-----+-----------+

max

df = spark.createDataFrame([
    (2, "Alice", 80), (3, "Alice", 100),
    (5, "Bob", 120), (10, "Bob", 140)], ["age", "name", "height"])
df.show()
+---+-----+------+
|age| name|height|
+---+-----+------+
|  2|Alice|    80|
|  3|Alice|   100|
|  5|  Bob|   120|
| 10|  Bob|   140|
+---+-----+------+

df.groupBy("name").max("age").sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       3|
|  Bob|      10|
+-----+--------+


df.groupBy("name").agg({"age":'max','height':"min"}).sort("name").show()
+-----+--------+-----------+
| name|max(age)|min(height)|
+-----+--------+-----------+
|Alice|       3|         80|
|  Bob|      10|        120|
+-----+--------+-----------+

min

df = spark.createDataFrame([
    (2, "Alice", 80), (3, "Alice", 100),
    (5, "Bob", 120), (10, "Bob", 140)], ["age", "name", "height"])
df.show()
+---+-----+------+
|age| name|height|
+---+-----+------+
|  2|Alice|    80|
|  3|Alice|   100|
|  5|  Bob|   120|
| 10|  Bob|   140|
+---+-----+------+

df.groupBy("name").min("age").sort("name").show()
+-----+--------+
| name|min(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+


df.groupBy("name").agg({"age":'max','height':"min"}).sort("name").show()
+-----+--------+-----------+
| name|max(age)|min(height)|
+-----+--------+-----------+
|Alice|       3|         80|
|  Bob|      10|        120|
+-----+--------+-----------+

sum

df = spark.createDataFrame([
    (2, "Alice", 80), (3, "Alice", 100),
    (5, "Bob", 120), (10, "Bob", 140)], ["age", "name", "height"])
df.show()
+---+-----+------+
|age| name|height|
+---+-----+------+
|  2|Alice|    80|
|  3|Alice|   100|
|  5|  Bob|   120|
| 10|  Bob|   140|
+---+-----+------+

df.groupBy("name").sum("age").sort("name").show()
+-----+--------+
| name|sum(age)|
+-----+--------+
|Alice|       5|
|  Bob|      15|
+-----+--------+

pivot

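After grouping, pivot takes one column and turns the specified values of that column into output columns; the aggregate that follows is computed separately for each (group, value) cell.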

from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(course="dotNET", year=2012, earnings=10000),
    Row(course="Java", year=2012, earnings=20000),
    Row(course="dotNET", year=2012, earnings=5000),
    Row(course="dotNET", year=2013, earnings=48000),
    Row(course="Java", year=2013, earnings=30000),
])
df1.show()
+------+----+--------+
|course|year|earnings|
+------+----+--------+
|dotNET|2012|   10000|
|  Java|2012|   20000|
|dotNET|2012|    5000|
|dotNET|2013|   48000|
|  Java|2013|   30000|
+------+----+--------+

df1.groupBy("year").pivot("course", ["dotNET", "Java"]).max("earnings").show()
+----+------+-----+
|year|dotNET| Java|
+----+------+-----+
|2012| 10000|20000|
|2013| 48000|30000|
+----+------+-----+


df1.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").show()
+----+------+-----+
|year|dotNET| Java|
+----+------+-----+
|2012| 15000|20000|
|2013| 48000|30000|
+----+------+-----+
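
What pivot followed by sum computes can be sketched in plain Python, with no Spark required. This is a rough model of the result only, under the assumption that a cell with no matching rows stays empty (None here); `pivot_sum` is a hypothetical helper, not a PySpark function.

```python
from collections import defaultdict

# (course, year, earnings) tuples, the same data as df1
rows = [
    ("dotNET", 2012, 10000),
    ("Java", 2012, 20000),
    ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000),
    ("Java", 2013, 30000),
]

def pivot_sum(rows, pivot_values):
    """Group by year and sum earnings into one cell per listed course value."""
    table = defaultdict(lambda: {value: None for value in pivot_values})
    for course, year, earnings in rows:
        cells = table[year]  # every year gets a row, even if no listed course occurs
        if course in pivot_values:
            current = cells[course]
            cells[course] = earnings if current is None else current + earnings
    return dict(table)

pivoted = pivot_sum(rows, ["dotNET", "Java"])
for year in sorted(pivoted):
    print(year, pivoted[year])
```

The sums match the pivot(...).sum("earnings") table above: 2012 combines the two dotNET rows into 15000, while every other cell holds a single row's value.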