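[Python] pandas: Sorting, Duplicates, Missing-Value Handling, Merging, and Grouping

pandas is a third-party extension library for Python that provides high-performance, easy-to-use data structures and data analysis tools.

Official pandas documentation: User Guide - pandas 2.2.2 documentation (pydata.org)

Help: use help(...) to read a function's documentation (for a third-party library, import the library first). For example: help(pd.DataFrame), help(pd.concat).

To import pandas in Python code: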

import pandas as pd

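1. Sorting

(1-1) Sort by index: sort_index

sort_index(self, axis: 'Axis' = 0, level: 'Level | None' = None, ascending: 'bool | int | Sequence[bool | int]' = True, inplace: 'bool' = False, kind: 'str' = 'quicksort', na_position: 'str' = 'last', sort_remaining: 'bool' = True, ignore_index: 'bool' = False, key: 'IndexKeyFunc' = None)

Note: by default, axis=0 sorts along the row axis (by row index labels), ascending=True sorts in ascending order, inplace=False returns a new DataFrame instead of modifying the original, and na_position='last' places NaN at the end.

DataFrame.sort_index(): sorts by index in ascending order; NaN goes last by default.
DataFrame.sort_index(ascending=False, na_position='first'): sorts by index in descending order, with NaN first.
DataFrame.sort_index(key=function): the index is first passed through the function, then rows are sorted ascending by the transformed index; NaN goes last by default.

DataFrame.sort_index(): with a MultiIndex, sorts ascending by the first index level (ties are then ordered by the remaining levels).
DataFrame.sort_index(ascending=False): with a MultiIndex, sorts descending by the first index level.
DataFrame.sort_index(level=level number or name): with a MultiIndex, sorts ascending by the given level; by default the other levels are sorted afterwards as well.
DataFrame.sort_index(level=level number or name, sort_remaining=False): with a MultiIndex, sorts ascending by the given level only; the other levels are not sorted.
DataFrame.sort_index(key=function): the index is first passed through the function, then rows are sorted ascending by the transformed index.
Note: the na_position parameter is not supported for a MultiIndex.

axis defaults to 0, which sorts by the row index. With axis=1, the column labels are sorted instead.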
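The sort_index variants above can be sketched as follows. This is a minimal illustration; the frames, labels, and variable names are made up for the example.

```python
import pandas as pd

# Single-level index with mixed-case string labels (illustrative data)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["b", "A", "c", "D"])

asc = df.sort_index()                   # uppercase sorts before lowercase
desc = df.sort_index(ascending=False)   # reverse of the above
# key receives the whole Index and must return an Index of the same length
folded = df.sort_index(key=lambda idx: idx.str.lower())

# MultiIndex with two named levels
mi = pd.MultiIndex.from_tuples(
    [("y", 2), ("x", 1), ("x", 2), ("y", 1)], names=["outer", "inner"]
)
df2 = pd.DataFrame({"value": [10, 20, 30, 40]}, index=mi)

by_outer = df2.sort_index()               # first level, then the remaining one
by_inner = df2.sort_index(level="inner")  # the chosen level drives the order

print(folded)
print(by_inner)
```

The key function is applied to the index as a whole (vectorized), not label by label, which is why the sketch uses the .str accessor.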

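(1-2) Sort by values: sort_values

sort_values(self, by, axis: 'Axis' = 0, ascending=True, inplace: 'bool' = False, kind: 'str' = 'quicksort', na_position: 'str' = 'last', ignore_index: 'bool' = False, key: 'ValueKeyFunc' = None)

Note: by default, axis=0 sorts along the row axis, ascending=True sorts in ascending order, inplace=False returns a new DataFrame instead of modifying the original, and na_position='last' places NaN at the end.

DataFrame.sort_values(by=column): sorts ascending by the given column (pass a list for several columns); NaN goes last by default.
DataFrame.sort_values(by=column, ascending=False, na_position='first'): sorts descending by the given column (pass a list for several columns), with NaN first.
DataFrame.sort_values(by=column, key=function): the column's data is first passed through the function, then rows are sorted ascending by the transformed data; NaN goes last by default.

With the default axis=0, rows are reordered by the values in the given column(s). With axis=1, by names row label(s), and the columns are reordered by the values in those rows.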
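A small sketch of the sort_values calls described above. The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["bob", "Amy", "cat", "Dan"],
        "group": ["a", "b", "a", "b"],
        "score": [2.0, np.nan, 1.0, 3.0],
        "age": [30, 25, 35, 20],
    }
)

by_score = df.sort_values(by="score")  # ascending, NaN row placed last
nan_first = df.sort_values(by="score", ascending=False, na_position="first")
# Several columns: one ascending flag per column
multi = df.sort_values(by=["group", "age"], ascending=[True, False])
# key is applied to each sort column (a Series) before comparing
by_name_ci = df.sort_values(by="name", key=lambda s: s.str.lower())

print(multi)
```

None of these calls modify df; each returns a new, reordered DataFrame that keeps the original index labels.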

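(1-3) Sort a column in ascending order and return the first n rows: nsmallest

nsmallest(self, n: 'int', columns: 'IndexLabel', keep: 'str' = 'first') -> 'DataFrame'

DataFrame.nsmallest(n, column): orders by the given column ascending and returns the first n rows.
DataFrame.nsmallest(n, column, keep=tie handling): orders by the given column ascending and returns the first n rows. When values in that column are tied, the default keep='first' prefers the first occurrences, keep='last' prefers the last occurrences, and keep='all' keeps every row tied with the last selected value, so more than n rows may be returned.
Note: DataFrame.nsmallest(n, column) is equivalent to DataFrame.sort_values(column, ascending=True).head(n), but nsmallest is more performant.

When several columns are given and the ordering columns contain NaN, the rows with NaN may also appear in the result.

The ordering columns cannot be object dtype (for example, strings); numeric columns work.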
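The effect of keep on ties can be seen in a short sketch. The data is made up; rows 2 and 3 are deliberately tied on price.

```python
import pandas as pd

df = pd.DataFrame(
    {"item": ["a", "b", "c", "d", "e"], "price": [3, 1, 2, 2, 5]}
)

first = df.nsmallest(2, "price")                # tie resolved toward row 2
last = df.nsmallest(2, "price", keep="last")    # tie resolved toward row 3
every = df.nsmallest(2, "price", keep="all")    # both tied rows are kept

print(first)
print(last)
print(every)
```

With keep="all" the result has three rows even though n is 2, because dropping either tied row would be arbitrary.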

(1-4) Sort a column in descending order and return the first n rows: nlargest

nlargest(self, n: 'int', columns: 'IndexLabel', keep: 'str' = 'first') -> 'DataFrame'

DataFrame.nlargest(...) is used the same way as DataFrame.nsmallest(...); the difference is that nsmallest orders ascending and nlargest orders descending.
Note: DataFrame.nlargest(n, column) is equivalent to DataFrame.sort_values(column, ascending=False).head(n), but nlargest is more performant.
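The equivalence stated in the note can be checked on a small frame. The city names and numbers are invented, and the values are distinct so that tie ordering does not matter.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["A", "B", "C", "D"], "population": [120, 450, 80, 300]}
)

top2 = df.nlargest(2, "population")
same = df.sort_values("population", ascending=False).head(2)

print(top2)
print(top2.equals(same))
```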

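2. Duplicates

(2-1) Detect duplicates: duplicated

duplicated(self, subset: 'Hashable | Sequence[Hashable] | None' = None, keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first') -> 'Series'

Note: by default, subset=None compares entire rows, and keep='first' marks the first occurrence as False and every later repeat as True.

DataFrame.duplicated(): compares entire rows; rows that are identical are duplicates. With the default keep='first', the first occurrence is False and the others are True.
DataFrame.duplicated(keep='last'): compares entire rows; the last occurrence is False and the others are True.
DataFrame.duplicated(keep=False): compares entire rows; every row that has a duplicate is True.
DataFrame.duplicated(subset=column): compares only the given column (pass a list for several columns); the first occurrence is False and the others are True.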
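A minimal sketch of the four duplicated calls above, on made-up data where rows 0 and 2 are identical.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a", "a"], "v": [1, 2, 1, 3]})

first = df.duplicated()               # later repeats are True
last = df.duplicated(keep="last")     # earlier repeats are True
every = df.duplicated(keep=False)     # all members of a duplicate set are True
by_k = df.duplicated(subset="k")      # compare column "k" only

print(pd.DataFrame({"first": first, "last": last, "every": every, "by_k": by_k}))
```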

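(2-2) Remove duplicates: drop_duplicates, ~

drop_duplicates(self, subset: 'Hashable | Sequence[Hashable] | None' = None, keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first', inplace: 'bool' = False, ignore_index: 'bool' = False) -> 'DataFrame | None'

Note: by default, subset=None compares entire rows, keep='first' keeps the first occurrence, inplace=False returns a new DataFrame instead of modifying the original, and ignore_index=False keeps the original index labels.

DataFrame.drop_duplicates(): compares entire rows; with the default keep='first', the first occurrence is kept and the others are removed.
DataFrame.drop_duplicates(keep='last'): compares entire rows; the last occurrence is kept and the others are removed.
DataFrame.drop_duplicates(keep=False): compares entire rows; every row that has a duplicate is removed.
DataFrame.drop_duplicates(subset=column): compares only the given column (pass a list for several columns); the first occurrence is kept and the others are removed.

DataFrame.duplicated(...) marks the first occurrence as False and later repeats as True by default.

A boolean array used as an indexer keeps the rows marked True, but here the True rows (the repeats) should be removed and the False rows (the first occurrences) kept. The mask therefore has to be negated with the ~ operator (the key above Tab), i.e. DataFrame[~DataFrame.duplicated(...)].

DataFrame[~DataFrame.duplicated(...)]: removes duplicates by negating the boolean result of duplicated.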
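The two routes to removing duplicates, drop_duplicates and a negated duplicated mask, can be compared side by side. The frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a", "a"], "v": [1, 2, 1, 3]})

deduped = df.drop_duplicates()        # keeps rows 0, 1, 3
masked = df[~df.duplicated()]         # same rows via a negated boolean mask
no_repeats = df.drop_duplicates(keep=False)   # drops both copies of (a, 1)
# Compare on "k" only and renumber the surviving rows from 0
by_k = df.drop_duplicates(subset="k", ignore_index=True)

print(deduped.equals(masked))
print(by_k)
```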

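(2-3) Count how often each distinct row occurs: value_counts

value_counts(self, subset: 'Sequence[Hashable] | None' = None, normalize: 'bool' = False, sort: 'bool' = True, ascending: 'bool' = False, dropna: 'bool' = True)

Note: by default, subset=None compares entire rows, normalize=False returns counts (not proportions), sort=True sorts by count, ascending=False sorts in descending order, and dropna=True ignores rows containing NaN.

DataFrame.value_counts(): compares entire rows and returns the count of each distinct row, sorted by count descending, ignoring NaN.
DataFrame.value_counts(sort=False): compares entire rows and returns counts without sorting by count (the result follows the order of the column values), ignoring NaN.
DataFrame.value_counts(ascending=True): compares entire rows and returns counts sorted ascending, ignoring NaN.
DataFrame.value_counts(normalize=True): compares entire rows and returns proportions instead of counts, sorted descending, ignoring NaN.
DataFrame.value_counts(subset=column, dropna=False): compares only the given column(s) and returns counts sorted descending, including NaN.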
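A short sketch of value_counts on a frame with one missing value. The column names and data are invented; the assumption here is a pandas version that supports the dropna parameter (1.3 or later).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "color": ["red", "blue", "red", None, "red"],
        "size": ["S", "M", "S", "M", "L"],
    }
)

counts = df.value_counts()                  # the row containing None is ignored
shares = df.value_counts(normalize=True)    # proportions of the counted rows
by_color = df.value_counts(subset=["color"], dropna=False)  # missing value counted too

print(counts)
print(by_color)
```

The result is a Series whose index holds the distinct row values, so counts.index[0] is the most frequent combination.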

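(2-4) Count the number of distinct values along an axis: nunique

nunique(self, axis: 'Axis' = 0, dropna: 'bool' = True) -> 'Series'

Note: by default, axis=0 works down the row axis (giving the number of distinct values in each column), and dropna=True ignores NaN.

DataFrame.nunique(): counts the distinct values in each column, ignoring NaN.
DataFrame.nunique(axis=1): counts the distinct values in each row, ignoring NaN.
DataFrame.nunique(dropna=False): counts the distinct values in each column, counting NaN as a value.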
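The three nunique calls above, sketched on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [np.nan, 5.0, 5.0]})

per_column = df.nunique()               # distinct non-NaN values per column
per_row = df.nunique(axis=1)            # distinct non-NaN values per row
with_nan = df.nunique(dropna=False)     # NaN counts as one distinct value

print(per_column)
print(per_row)
print(with_nan)
```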

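3. Missing-value handling (detect, fill, drop, replace)

Missing value: NaN (an empty value, "not a number"). Both None and np.nan are treated as missing. (np.NaN was an alias of np.nan and was removed in NumPy 2.0, so prefer np.nan. NumPy must be imported: import numpy as np.)

(3-1) Detect missing values: isna, isnull, notna, notnull

DataFrame.isna(): tests whether each value in the DataFrame is NaN; True if it is, otherwise False.
DataFrame.isnull(): isnull is an alias of isna.
DataFrame.notna(): tests whether each value in the DataFrame is not NaN; True if it is not NaN, otherwise False.
DataFrame.notnull(): notnull is an alias of notna.
Note: all four return a DataFrame of boolean values (True/False).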
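A minimal sketch of the detection methods, using both np.nan and None as missing values. The data is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [None, "x"]})

missing = df.isna()          # True where the value is missing
present = df.notna()         # exact opposite of isna
per_column = missing.sum()   # True counts as 1, giving missing values per column

print(missing)
print(per_column)
```

Summing the boolean frame is a common way to get a quick per-column count of missing values.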

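(3-2) Fill missing values:

(3-2-1) Fill in a specified way: fillna

fillna(self, value: 'object | ArrayLike | None' = None, method: 'FillnaOptions | None' = None, axis: 'Axis | None' = None, inplace: 'bool' = False, limit=None, downcast=None) -> 'DataFrame | None'

Note: by default, inplace=False returns a new DataFrame instead of modifying the original.

DataFrame.fillna(value): fills every NaN with the given value.
DataFrame.fillna(method=fill method): fills every NaN using the given method. "backfill" and "bfill" both use the next valid value below the NaN; "ffill" and "pad" both use the last valid value above it. The method parameter is deprecated since pandas 2.1; prefer DataFrame.ffill() and DataFrame.bfill().
DataFrame.fillna(method=fill method, limit=count, inplace=True): fills at most the given number of consecutive NaN values using the given method, modifying the original DataFrame.
DataFrame.fillna(method=fill method, axis=1): fills every NaN using the given method along the column axis. With axis=1, "backfill" and "bfill" use the value from the next column to the right, and "ffill" and "pad" use the value from the previous column to the left.

DataFrame.fillna(dict): the keys are column names and the values are the fill values, so NaN in each column is filled with the value stored under that column's name.

DataFrame.fillna(another DataFrame): each NaN is filled with the value at the same column name and row index in the other DataFrame.

Additional methods:

DataFrame.backfill(): fills NaN with the next valid value below. With axis=1, fills with the value from the next column. Deprecated since pandas 2.0; use bfill.
DataFrame.bfill(): fills NaN with the next valid value below. With axis=1, fills with the value from the next column.
DataFrame.ffill(): fills NaN with the last valid value above. With axis=1, fills with the value from the previous column.
DataFrame.pad(): fills NaN with the last valid value above. With axis=1, fills with the value from the previous column. Deprecated since pandas 2.0; use ffill.
Note: all of these default to inplace=False (the original DataFrame is not modified) and limit=None (every NaN is filled).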
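The filling strategies above can be sketched with ffill and bfill in place of the method parameter. The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [np.nan, 2.0, np.nan, np.nan],
    }
)

const = df.fillna(0)                      # one value for every NaN
per_col = df.fillna({"a": -1, "b": -2})   # a different value per column
forward = df.ffill()                      # last valid value above
backward = df.bfill()                     # next valid value below
limited = df.ffill(limit=1)               # at most one consecutive fill
across = df.ffill(axis=1)                 # fill from the column to the left

print(forward)
print(limited)
```

A NaN with no valid value on the relevant side stays NaN: the first row of "b" is unchanged by ffill, and the last two rows of "b" are unchanged by bfill.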

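(3-2-2) Fill by interpolation: interpolate

Interpolation: estimating new data points within a range from known discrete data points; commonly used in function fitting.

Linear relationship: a relationship between two variables whose graph is a straight line.

Linear interpolation: approximating unknown points from the straight line joining two known points.

interpolate(self: 'DataFrame', method: 'str' = 'linear', axis: 'Axis' = 0, limit: 'int | None' = None, inplace: 'bool' = False, limit_direction: 'str | None' = None, limit_area: 'str | None' = None, downcast: 'str | None' = None, **kwargs) -> 'DataFrame | None'

Note: by default, method='linear' interpolates linearly and inplace=False returns a new DataFrame instead of modifying the original.

DataFrame.interpolate(): fills NaN by linear interpolation.
DataFrame.interpolate(method='pad'): uses the 'pad' method, filling with the last valid value above the NaN. This method value is deprecated since pandas 2.1; use DataFrame.ffill() instead.
DataFrame.interpolate(axis=1): fills NaN by linear interpolation along the column axis, i.e. horizontally.
DataFrame.interpolate(limit=count, limit_direction=direction, limit_area=area): fills NaN by linear interpolation with a maximum number of consecutive fills (greater than 0), a fill direction ('forward' from front to back, 'backward' from back to front, 'both' in both directions), and a fill area ('inside' for NaN surrounded by valid values, 'outside' for NaN outside the valid values).
Note: with a MultiIndex, only the default method='linear' is supported. Some methods (for example 'krogh', 'barycentric') require scipy to be installed.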
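A sketch of linear interpolation and the limit parameters on a single made-up column with a leading gap, an interior gap, and a trailing gap.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan]})

linear = df.interpolate()                           # default: forward direction
inside_only = df.interpolate(limit_area="inside")   # only NaN between valid values
one_step = df.interpolate(limit=1, limit_area="inside")  # one fill per gap
both = df.interpolate(limit_direction="both")       # also fills the leading NaN

print(linear)
print(one_step)
```

With the default settings, the interior gap becomes 2.0 and 3.0, the trailing NaN takes the last valid value, and the leading NaN is left alone because filling runs forward only.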

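(3-3) Drop rows/columns containing missing values: dropna

dropna(self, axis: 'Axis' = 0, how: 'str' = 'any', thresh=None, subset: 'IndexLabel' = None, inplace: 'bool' = False)

Note: by default, axis=0 examines rows, how='any' drops a row if it contains any NaN (how cannot be combined with thresh), and inplace=False returns a new DataFrame instead of modifying the original.

DataFrame.dropna(): drops every row that contains at least one NaN.
DataFrame.dropna(axis=1): drops every column that contains at least one NaN.
DataFrame.dropna(how='all'): drops a row only if all of its values are NaN. how cannot be combined with thresh.
DataFrame.dropna(thresh=minimum non-NaN count): keeps a row if it has at least that many non-NaN values; rows with fewer are dropped. thresh cannot be combined with how.
DataFrame.dropna(subset=columns, inplace=True): looks only at the given columns, drops every row with NaN in them, and modifies the original DataFrame.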
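The dropna options can be compared on one small frame. The data is invented so that each row has a different number of missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1.0, 2.0, np.nan, np.nan],
        "c": [1.0, 2.0, np.nan, 4.0],
    }
)

any_rows = df.dropna()                # only the fully populated row survives
any_cols = df.dropna(axis=1)          # every column has a NaN, so none survive
all_rows = df.dropna(how="all")       # only the all-NaN row is dropped
thresh3 = df.dropna(thresh=3)         # keep rows with at least 3 non-NaN values
subset_a = df.dropna(subset=["a"])    # only column "a" is inspected

print(all_rows)
print(subset_a)
```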

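(3-4) Replace values: replace

replace(self, to_replace=None, value=<no_default>, inplace: 'bool' = False, limit=None, regex: 'bool' = False, method: 'str | lib.NoDefault' = <no_default>)

DataFrame.replace(np.nan, new value): replaces NaN with the new value. The new value can be a single value, or given as a list or dict.
DataFrame.replace(...) can replace other data besides NaN, and can also replace values by regular expression; that is not covered here.
Note: np.nan requires NumPy (import numpy as np). The np.NaN alias was removed in NumPy 2.0, so use np.nan.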
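A brief sketch of replace, first on NaN and then on an ordinary value through a nested dict. The column names and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "x"]})

no_nan = df.replace(np.nan, 0.0)           # NaN becomes 0.0 everywhere
# Nested dict: {column: {old value: new value}}
renamed = df.replace({"b": {"x": "X"}})

print(no_nan)
print(renamed)
```

The nested-dict form restricts the replacement to one column, which avoids touching equal values elsewhere in the frame.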

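4. Merging

(4-1) Join the columns of another DataFrame/Series by index: join

join(self, other: 'DataFrame | Series', on: 'IndexLabel | None' = None, how: 'str' = 'left', lsuffix: 'str' = '', rsuffix: 'str' = '', sort: 'bool' = False) -> 'DataFrame'

Note: the join is on the index by default, and how='left' (left join) is the default.

DataFrame.join(df2, lsuffix=left suffix, rsuffix=right suffix): joins the columns of two DataFrames by index. If both have a column with the same name, a left and/or right suffix must be set.
DataFrame.join(df2, how=join type): joins the columns of two DataFrames by index; if column names collide, set the left/right suffixes. The default is how='left' (left join); the other options are 'right', 'inner', 'outer', and 'cross' (Cartesian product).
DataFrame.join(df2, on=join column, how=join type): names the join column in the left DataFrame; the right DataFrame must have that key as its index. The join type can be specified (left join by default).

| how | Join type | Result |
|-------------|------|-------------------------------------------------------|
| how='left' | Left join | Uses the left DataFrame's index; where the right DataFrame has no matching index, the values are NaN |
| how='right' | Right join | Uses the right DataFrame's index; where the left DataFrame has no matching index, the values are NaN; if several rows match, all are shown |
| how='inner' | Inner join | Only the index labels present in both DataFrames |
| how='outer' | Outer join | All index labels from both DataFrames; missing values are NaN |
| how='cross' | Cartesian product | X*Y: every combination of rows from the two DataFrames |

DataFrame.join(Series): a DataFrame can join a Series by index. The Series must have a name, which becomes the column name.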
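The join variants above can be sketched as follows. All frames, labels, and column names are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"v": [10, 30, 40]}, index=["a", "c", "d"])

# Both frames have a column "v", so suffixes are required
left_join = left.join(right, lsuffix="_l", rsuffix="_r")
inner = left.join(right, how="inner", lsuffix="_l", rsuffix="_r")
outer = left.join(right, how="outer", lsuffix="_l", rsuffix="_r")

# on=: match a left column against the right frame's index
orders = pd.DataFrame({"cust": ["a", "d", "a"], "amt": [5, 6, 7]})
rates = right.rename(columns={"v": "w"})
with_rates = orders.join(rates, on="cust")

# A named Series joins like a one-column DataFrame
flag = pd.Series([True, False], index=["a", "b"], name="flag")
with_flag = left.join(flag)

print(left_join)
print(with_rates)
```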

(4-2) Join the columns of another DataFrame/Series on specified columns: merge

merge(self, right: 'DataFrame | Series', how: 'str' = 'inner', on: 'IndexLabel | None' = None, left_on: 'IndexLabel | None' = None, right_on: 'IndexLabel | None' = None, left_index: 'bool' = False, right_index: 'bool' = False, sort: 'bool' = False, suffixes: 'Suffixes' = ('_x', '_y'), copy: 'bool' = True, indicator: 'bool' = False, validate: 'str | None' = None) -> 'DataFrame'

Note: the default is how='inner' (inner join).

DataFrame.merge(df2, left_on=left column, right_on=right column): names the join column of each DataFrame separately (the column names differ); inner join by default (only keys present in both join columns).
DataFrame.merge(df2, left_on=left column, right_on=right column, how=join type): names the join column of each DataFrame separately and sets the join type. The default is how='inner'; the other options are 'left', 'right', 'outer', and 'cross' (Cartesian product).
DataFrame.merge(df2, left_on=left column, right_on=right column, how=join type, indicator=True): as above, and adds a _merge column showing which side each row came from.

DataFrame.merge(df2, on=join column): names the shared join column (same name in both DataFrames); inner join by default. If the join column is the only column name the two DataFrames have in common, on can be omitted.
DataFrame.merge(df2, on=join column, how=join type): names the shared join column and sets the join type. The default is how='inner'; the other options are 'left', 'right', 'outer', and 'cross' (Cartesian product).
DataFrame.merge(df2, on=join column, how=join type, indicator=True): as above, and adds a _merge column showing which side each row came from.
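A sketch of merge with differently named and identically named key columns. The employee and department data is invented for illustration.

```python
import pandas as pd

emp = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [10, 20, 99]})
dept = pd.DataFrame({"id": [10, 20, 30], "dept": ["HR", "IT", "Ops"]})

# Key columns have different names: left_on / right_on
inner = emp.merge(dept, left_on="dept_id", right_on="id")
outer = emp.merge(
    dept, left_on="dept_id", right_on="id", how="outer", indicator=True
)

# Key columns share a name: on= (or omit it when it is the only shared column)
dept2 = dept.rename(columns={"id": "dept_id"})
left = emp.merge(dept2, on="dept_id", how="left")
implicit = emp.merge(dept2)

print(outer)
print(left)
```

The _merge column added by indicator=True takes the values "both", "left_only", and "right_only", which makes unmatched keys easy to spot.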

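(4-3) Append the rows of another DataFrame at the end: append

append(self, other, ignore_index: 'bool' = False, verify_integrity: 'bool' = False, sort: 'bool' = False) -> 'DataFrame'

DataFrame.append(df2): appends another DataFrame at the end, matching by column name. The original index labels are kept by default.
DataFrame.append(df2, ignore_index=True): appends another DataFrame at the end, matching by column name. The index is ignored and a new index starting at 0 is used.
Note: append was deprecated in pandas 1.4 and removed in pandas 2.0, so it only works on older versions. Use pd.concat(...) instead.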

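(4-4) Concatenate the rows/columns of other DataFrames/Series along an axis: concat

concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis: 'Axis' = 0, join: 'str' = 'outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'DataFrame | Series'

Note: the default is join='outer' (outer join).

pd.concat(list): the objects to concatenate are passed as a list. By default each DataFrame is appended below the previous one, matching by column name, and the original index labels are kept.
pd.concat(list, ignore_index=True, join=join type): the objects to concatenate are passed as a list. DataFrames are appended row-wise, matching by column name, with the given join type (default join='outer'; 'inner' keeps only the shared columns). With ignore_index=True, a new index starting at 0 is used.
pd.concat(list, axis=1): the objects to concatenate are passed as a list. Along the column axis, DataFrames are placed side by side, aligned on the index, and the original column names are kept.
pd.concat(list, axis=1, ignore_index=True, join=join type): the objects to concatenate are passed as a list. DataFrames are placed side by side, aligned on the index, with the given join type (default join='outer'; 'inner' keeps only the shared index labels). With ignore_index=True, new column names starting at 0 are used.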
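The row-wise and column-wise forms of concat can be sketched on two small made-up frames. The row-wise calls also show what replaces the removed append method.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5], "z": [6]})

rows = pd.concat([a, b])                        # stack rows, union of columns
rows_reset = pd.concat([a, b], ignore_index=True)   # fresh 0..n-1 row index
rows_inner = pd.concat([a, b], join="inner", ignore_index=True)  # shared columns only

cols = pd.concat([a, b], axis=1)                # side by side, aligned on index
cols_renum = pd.concat([a, b], axis=1, ignore_index=True)   # columns renamed 0..n-1
cols_inner = pd.concat([a, b], axis=1, join="inner")        # shared index labels only

print(rows)
print(cols)
```

Without ignore_index=True the row-wise result keeps the duplicate label 0, which can surprise later label-based lookups.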

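5. Grouping: groupby

groupby(self, by=None, axis: 'Axis' = 0, level: 'Level | None' = None, as_index: 'bool' = True, sort: 'bool' = True, group_keys: 'bool' = True, squeeze: 'bool | lib.NoDefault' = <no_default>, observed: 'bool' = False, dropna: 'bool' = True) -> 'DataFrameGroupBy'

Note: by default, dropna=True ignores NaN group keys. The signature above was printed by an older pandas release; the squeeze parameter was removed in pandas 2.0.

DataFrame.groupby(column): groups by the given column, ignoring NaN keys by default. To group by several columns, pass a list.
When aggregating after grouping without selecting columns, older pandas versions aggregated only the numeric columns. Since pandas 2.0, numeric_only defaults to False, so either select the columns to aggregate or pass numeric_only=True.
The grouped object is of type DataFrameGroupBy. DataFrameGroupBy official documentation: GroupBy - pandas 2.2.2 documentation (pydata.org)

groupby is usually combined with agg, which aggregates using one or more operations over the specified axis.

agg(self, func=None, axis: 'Axis' = 0, *args, **kwargs)

groupby can also group by a condition (for example, a boolean Series).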
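A minimal sketch of grouping, aggregating with agg, and grouping by a condition. The team and score data is invented, and the result column names total and n are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["a", "a", "b", None],
        "player": ["p1", "p2", "p3", "p4"],
        "score": [10, 20, 30, 40],
    }
)

mean_score = df.groupby("team")["score"].mean()          # None key is ignored
with_nan = df.groupby("team", dropna=False)["score"].sum()   # None key kept
numeric_means = df.groupby("team").mean(numeric_only=True)   # skip text columns

# Several aggregations on one column
stats = df.groupby("team")["score"].agg(["min", "max", "mean"])
# Named aggregation: new_column=(source column, function)
named = df.groupby("team").agg(total=("score", "sum"), n=("player", "count"))

# Group by a condition: a boolean Series acts as the grouping key
by_cond = df.groupby(df["score"] >= 20)["score"].count()

print(stats)
print(named)
print(by_cond)
```

Selecting the column before aggregating, as in the first two lines, sidesteps the numeric_only question entirely.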

Official documentation for pandas functions: General functions - pandas 2.2.2 documentation (pydata.org)

Official documentation for DataFrame methods: DataFrame - pandas 2.2.2 documentation (pydata.org)

Official documentation for Series methods: Series - pandas 2.2.2 documentation (pydata.org)