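Pandas groupby: Grouping Operations Explained
In data analysis, you often need to split data into groups based on the labels in one or more columns and then analyze each group. For example, a website might group its registered users by gender or age to build a profile of its user base. In Pandas, grouping is done with the groupby() method, which works much like the GROUP BY clause in SQL.
Statistical functions are then applied to each group, for example to aggregate, transform, or filter the grouped data. The process consists of three steps:
- Splitting: divide the data into groups;
- Applying: apply a function (such as an aggregation) to each group and compute the result;
- Combining: merge the per-group results into a single output.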
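To make the three steps concrete, the sketch below performs them by hand on the DEPTNO and SAL columns of the sample data used later in this article, and then compares the result with a single groupby() call. The variable names (pieces, means, manual) are illustrative only, and the data is rebuilt inline so the example runs without the CSV file.

```python
import pandas as pd

# DEPTNO and SAL copied from empdata.csv, in row order
df = pd.DataFrame({
    "DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10],
    "SAL": [800, 1600, 1250, 2975, 1250, 2850, 2450,
            3000, 5000, 1500, 1100, 950, 3000, 1300],
})

# Split: one sub-frame per distinct DEPTNO value
pieces = {int(dept): df[df["DEPTNO"] == dept]
          for dept in sorted(df["DEPTNO"].unique())}

# Apply: compute a statistic on each piece independently
means = {dept: piece["SAL"].mean() for dept, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(means, name="SAL")
manual.index.name = "DEPTNO"

# groupby() performs the same three steps in one call
grouped = df.groupby("DEPTNO")["SAL"].mean()

print(manual)
print(grouped)
```

Both printed Series should contain the same per-department means; the manual version only exists to show what groupby() does internally at a conceptual level.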
The sections below walk through the use of groupby() step by step.
Creating the DataFrame object
Sample data, empdata.csv:
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10
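First, create a DataFrame object from this file. The data describes the employees of a company:
python
import pandas as pd
# Raw string, so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print(df)
Output: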
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
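Creating a groupby object
groupby() groups the data by the values of one or more keys, and each distinct key value becomes the name of a group. Typical call forms are:
- df.groupby("key")
- df.groupby("key",axis=1) (groups columns instead of rows; the axis argument is deprecated as of pandas 2.1)
- df.groupby(["key1","key2"])
The following applies the first form to the DataFrame above: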
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Grouped by department number:\n",df.groupby("DEPTNO"))
Output:
Grouped by department number:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000020F2B536070>
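Inspecting the groups
1) Viewing groups with the groups attribute
The groups attribute returns a dict that maps each group name to the row index labels belonging to that group:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Grouped by department number:\n",df.groupby("DEPTNO").groups)
Output:
Grouped by department number: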
{10: [6, 8, 13], 20: [0, 3, 7, 10, 12], 30: [1, 2, 4, 5, 9, 11]}
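As a complement to groups, a groupby object also exposes ngroups and size(), which summarize the grouping without listing every row label. A minimal sketch, rebuilding only the DEPTNO column inline so that it runs without the CSV file (the names grouped, sizes and lengths are illustrative):

```python
import pandas as pd

# DEPTNO values copied from empdata.csv, in row order
df = pd.DataFrame({"DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10]})

grouped = df.groupby("DEPTNO")

# Number of distinct groups
print(grouped.ngroups)

# Series mapping each group name to its number of rows
sizes = grouped.size()
print(sizes)

# groups maps each name to row labels, so its lengths agree with size()
lengths = {int(name): len(labels) for name, labels in grouped.groups.items()}
print(lengths)
```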
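2) Grouping by multiple column labels
You can also group by several column labels at once, in which case each group name is a tuple of key values:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Grouped by department number and manager:\n",df.groupby(["DEPTNO","MGR"]).groups)
Output:
Grouped by department number and manager: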
{(10, 7839.0): [6], (10, nan): [8], (10, 7782.0): [13], (20, 7566.0): [7, 12], (20, 7788.0): [10], (20, 7839.0): [3], (20, 7902.0): [0], (30, 7698.0): [1, 2, 4, 9, 11], (30, 7839.0): [5]}
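KING has no manager, so one of the keys above contains NaN. By default, groupby() leaves rows with a NaN key out of aggregated results; since pandas 1.1 the dropna=False argument keeps them. The sketch below uses only the three department-10 rows, rebuilt inline, to show the difference (default_sizes and with_nan are illustrative names):

```python
import numpy as np
import pandas as pd

# The three department-10 rows from empdata.csv; KING has no MGR value
df = pd.DataFrame({
    "ENAME": ["CLARK", "KING", "MILLER"],
    "DEPTNO": [10, 10, 10],
    "MGR": [7839.0, np.nan, 7782.0],
})

# Default (dropna=True): the row with a NaN key is excluded
default_sizes = df.groupby(["DEPTNO", "MGR"]).size()

# dropna=False: NaN is treated as a key value of its own
with_nan = df.groupby(["DEPTNO", "MGR"], dropna=False).size()

print(default_sizes)
print(with_nan)
```

If the row counts of an aggregation do not add up to the number of rows in the DataFrame, missing key values are a likely cause.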
The get_group() method selects the rows of a single group:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Grouped by department number:\n",df.groupby("DEPTNO").groups)
print("Rows for department 10:\n",df.groupby("DEPTNO").get_group(10))
Output:
Grouped by department number:
{10: [6, 8, 13], 20: [0, 3, 7, 10, 12], 30: [1, 2, 4, 5, 9, 11]}
Rows for department 10:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
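When grouping by several keys, get_group() takes a tuple with one value per key. A short sketch on an inline copy of the ENAME, JOB and DEPTNO columns (the variable names grouped and salesmen are illustrative):

```python
import pandas as pd

# ENAME, JOB and DEPTNO copied from empdata.csv, in row order
df = pd.DataFrame({
    "ENAME": ["SMITH", "ALLEN", "WARD", "JONES", "MARTIN", "BLAKE", "CLARK",
              "SCOTT", "KING", "TURNER", "ADAMS", "JAMES", "FORD", "MILLER"],
    "JOB": ["CLERK", "SALESMAN", "SALESMAN", "MANAGER", "SALESMAN", "MANAGER",
            "MANAGER", "ANALYST", "PRESIDENT", "SALESMAN", "CLERK", "CLERK",
            "ANALYST", "CLERK"],
    "DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10],
})

grouped = df.groupby(["DEPTNO", "JOB"])

# One value per grouping key, in the same order as the keys
salesmen = grouped.get_group((30, "SALESMAN"))
print(salesmen)
```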
Iterating over groups
A groupby object can be iterated directly. Each iteration yields the group name and the rows of that group:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Iterating over the groups:")
for label,value in df.groupby('DEPTNO'):
    print(f"Department {label} after grouping:\n{value}")
Output:
Iterating over the groups:
Department 10 after grouping:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
Department 20 after grouping:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
Department 30 after grouping:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
2 7521 WARD SALESMAN 7698.0 1981-02-22 1250 500.0 30
4 7654 MARTIN SALESMAN 7698.0 1981-09-28 1250 1400.0 30
5 7698 BLAKE MANAGER 7839.0 1981-05-01 2850 NaN 30
9 7844 TURNER SALESMAN 7698.0 1981-09-08 1500 0.0 30
11 7900 JAMES CLERK 7698.0 1981-12-03 950 NaN 30
As shown above, the group names of the groupby object correspond one-to-one to the distinct values in DEPTNO.
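The same loop works with several grouping keys; the group name is then a tuple that can be unpacked in the for statement. The sketch below counts employees per (DEPTNO, JOB) pair on an inline copy of those two columns. The headcount dict is an illustrative name, not part of pandas.

```python
import pandas as pd

# JOB and DEPTNO copied from empdata.csv, in row order
df = pd.DataFrame({
    "JOB": ["CLERK", "SALESMAN", "SALESMAN", "MANAGER", "SALESMAN", "MANAGER",
            "MANAGER", "ANALYST", "PRESIDENT", "SALESMAN", "CLERK", "CLERK",
            "ANALYST", "CLERK"],
    "DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10],
})

headcount = {}
# With two keys, each group name is a (DEPTNO, JOB) tuple
for (dept, job), group in df.groupby(["DEPTNO", "JOB"]):
    headcount[(int(dept), job)] = len(group)

for key, n in headcount.items():
    print(key, n)
```

For a plain count like this, df.groupby(["DEPTNO", "JOB"]).size() gives the same numbers without an explicit loop; iteration is mostly useful when each group needs custom handling.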
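Applying aggregation functions
Once a groupby object has been created, the agg() method applies one or more aggregation functions to the groups:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
# Pass the function by name; recent pandas versions emit a warning when given np.sum
print("Aggregation after grouping:\n",df.groupby("DEPTNO")[["SAL","COMM"]].agg("sum"))
# Equivalent shorthand:
#print("Aggregation after grouping:\n",df.groupby("DEPTNO")[["SAL","COMM"]].sum())
Output:
Aggregation after grouping: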
SAL COMM
DEPTNO
10 8750 0.0
20 10875 0.0
30 9400 2200.0
You can also apply several aggregation functions in a single call:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Aggregation after grouping:\n",df.groupby("DEPTNO")[["SAL","COMM"]].agg(["sum","size","mean","std"]))
Output:
Aggregation after grouping:
          SAL                                    COMM
          sum  size         mean          std     sum  size   mean         std
DEPTNO
10       8750     3  2916.666667  1893.629672     0.0     3    NaN         NaN
20      10875     5  2175.000000  1123.332097     0.0     5    NaN         NaN
30       9400     6  1566.666667   668.331255  2200.0     6  550.0  602.771377
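If the two-level column header in the output above is awkward to work with, named aggregation (available since pandas 0.25) produces flat, self-describing column names instead. A minimal sketch on an inline copy of the DEPTNO, SAL and COMM columns; the output column names (total_sal, headcount, mean_sal, comm_paid) are chosen here for illustration:

```python
import numpy as np
import pandas as pd

nan = np.nan
# DEPTNO, SAL and COMM copied from empdata.csv, in row order
df = pd.DataFrame({
    "DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10],
    "SAL": [800, 1600, 1250, 2975, 1250, 2850, 2450,
            3000, 5000, 1500, 1100, 950, 3000, 1300],
    "COMM": [nan, 300.0, 500.0, nan, 1400.0, nan, nan,
             nan, nan, 0.0, nan, nan, nan, nan],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = df.groupby("DEPTNO").agg(
    total_sal=("SAL", "sum"),
    headcount=("SAL", "count"),
    mean_sal=("SAL", "mean"),
    comm_paid=("COMM", "count"),  # count ignores NaN
)
print(summary)
```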
Transforming groups
A transformation is computed within each group but returns an object with the same index as the original data, so the result lines up row by row with the input. For example:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Department mean of SAL and COMM:\n",df.groupby("DEPTNO")[["SAL","COMM"]].transform("mean"))
demean = lambda arr:arr - arr.mean()
print("Deviation of SAL and COMM from the department mean:\n",df.groupby("DEPTNO")[["SAL","COMM"]].transform(demean))
def get_rows(df, n):
    # All columns of the first n rows
    return df.iloc[:n, :]
# apply() is more general than transform(); here the group name becomes the outer row index
print("First N rows of each group:\n",df.groupby('DEPTNO').apply(get_rows, n=1))
Output:
Department mean of SAL and COMM:
SAL COMM
0 2175.000000 NaN
1 1566.666667 550.0
2 1566.666667 550.0
3 2175.000000 NaN
4 1566.666667 550.0
5 1566.666667 550.0
6 2916.666667 NaN
7 2175.000000 NaN
8 2916.666667 NaN
9 1566.666667 550.0
10 2175.000000 NaN
11 1566.666667 550.0
12 2175.000000 NaN
13 2916.666667 NaN
Deviation of SAL and COMM from the department mean:
SAL COMM
0 -1375.000000 NaN
1 33.333333 -250.0
2 -316.666667 -50.0
3 800.000000 NaN
4 -316.666667 850.0
5 1283.333333 NaN
6 -466.666667 NaN
7 825.000000 NaN
8 2083.333333 NaN
9 -66.666667 -550.0
10 -1075.000000 NaN
11 -616.666667 NaN
12 825.000000 NaN
13 -1616.666667 NaN
First N rows of each group:
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
DEPTNO
10 6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
20 0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
30 1 7499 ALLEN SALESMAN 7698.0 1981-02-20 1600 300.0 30
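Because a transform result shares the index of the original data, it can be assigned straight back as a new column. The sketch below uses this to compute each employee's share of the department payroll on an inline copy of three columns; DEPT_TOTAL and SAL_SHARE are column names invented for this example.

```python
import pandas as pd

# ENAME, DEPTNO and SAL copied from empdata.csv, in row order
df = pd.DataFrame({
    "ENAME": ["SMITH", "ALLEN", "WARD", "JONES", "MARTIN", "BLAKE", "CLARK",
              "SCOTT", "KING", "TURNER", "ADAMS", "JAMES", "FORD", "MILLER"],
    "DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10],
    "SAL": [800, 1600, 1250, 2975, 1250, 2850, 2450,
            3000, 5000, 1500, 1100, 950, 3000, 1300],
})

# transform("sum") repeats the department total on every row of that department
df["DEPT_TOTAL"] = df.groupby("DEPTNO")["SAL"].transform("sum")

# Row-wise division works because both columns share the same index
df["SAL_SHARE"] = df["SAL"] / df["DEPT_TOTAL"]

print(df)
```

An aggregation such as sum() would return one row per department and would need a merge to get back to employee level; transform avoids that extra step.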
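Filtering groups
The filter() method selects data group by group: it evaluates a condition on each group and returns a new dataset containing only the rows of the groups that satisfy it.
The following example retrieves the employees of departments whose average salary is greater than 2000:
python
import pandas as pd
df = pd.read_csv(r'C:\Users\qwy\Desktop\data\empdata.csv')
print("Average salary of each department:\n",df.groupby("DEPTNO")[['DEPTNO','SAL']].aggregate("mean"))
print("Employees in departments with an average salary above 2000:\n",df.groupby("DEPTNO").filter(lambda x:x['SAL'].mean()>2000))
Output:
Average salary of each department:
DEPTNO SAL
DEPTNO
10 10.0 2916.666667
20 20.0 2175.000000
30 30.0 1566.666667
Employees in departments with an average salary above 2000: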
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
0 7369 SMITH CLERK 7902.0 1980-12-17 800 NaN 20
3 7566 JONES MANAGER 7839.0 1981-04-02 2975 NaN 20
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
7 7788 SCOTT ANALYST 7566.0 1987-04-19 3000 NaN 20
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
10 7876 ADAMS CLERK 7788.0 1987-05-23 1100 NaN 20
12 7902 FORD ANALYST 7566.0 1981-12-03 3000 NaN 20
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
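The same selection can also be written as a boolean mask built with transform(), which some readers find easier to combine with other row-level conditions. The sketch below, on an inline copy of three columns, runs both variants side by side; kept, mask and kept_by_mask are illustrative names.

```python
import pandas as pd

# ENAME, DEPTNO and SAL copied from empdata.csv, in row order
df = pd.DataFrame({
    "ENAME": ["SMITH", "ALLEN", "WARD", "JONES", "MARTIN", "BLAKE", "CLARK",
              "SCOTT", "KING", "TURNER", "ADAMS", "JAMES", "FORD", "MILLER"],
    "DEPTNO": [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10],
    "SAL": [800, 1600, 1250, 2975, 1250, 2850, 2450,
            3000, 5000, 1500, 1100, 950, 3000, 1300],
})

# Variant 1: filter() keeps or drops whole groups
kept = df.groupby("DEPTNO").filter(lambda g: g["SAL"].mean() > 2000)

# Variant 2: broadcast the group mean to every row, then mask
mask = df.groupby("DEPTNO")["SAL"].transform("mean") > 2000
kept_by_mask = df[mask]

print(kept)
print(kept_by_mask)
```

Note that the condition is evaluated per department, not per employee, which is why SMITH (salary 800) is kept: department 20 as a whole averages above 2000.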