Introduction
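In the previous installment we saw that Pandas makes it easy to read and write tabular files, which covers the exchange of data between local storage and Python.
In many situations, however, the data we need is spread across several files or DataFrames. To assemble them into one complete dataset, we need the combining operations named in this installment's title.
Pandas offers several ways to combine data, chiefly concat, merge, and join. Each has its own use cases, and we introduce them in turn below.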
concat
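The concat function joins multiple DataFrame or Series objects along a given axis (rows or columns). Its most commonly used parameters are:
axis: the axis to concatenate along; 0 stacks rows, 1 places columns side by side; defaults to 0;
join: how the other axis is handled; inner keeps only the row or column labels shared by all inputs, outer keeps every label; defaults to outer;
ignore_index: whether to discard the existing labels along the concatenation axis and renumber from 0; defaults to False.
Let's go straight to the examples: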
import numpy as np
import pandas as pd
# Create the datasets (on pandas 2.2+, freq='Y' is deprecated in favor of 'YE')
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'),columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2000-01-01', periods=5, freq='Y'),columns=['precipitation'])
print(data0)
print(data1)
# Concatenate the datasets column-wise
data = pd.concat([data0, data1], axis=1)
print(data)
temperature
2000-12-31 2.956534
2001-12-31 8.961623
2002-12-31 0.720613
2003-12-31 4.456005
2004-12-31 0.500873
precipitation
2000-12-31 44.627392
2001-12-31 804.956900
2002-12-31 582.525174
2003-12-31 628.544335
2004-12-31 100.806668
temperature precipitation
2000-12-31 2.956534 44.627392
2001-12-31 8.961623 804.956900
2002-12-31 0.720613 582.525174
2003-12-31 4.456005 628.544335
2004-12-31 0.500873 100.806668
# Create the datasets
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'),columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2005-01-01', periods=5, freq='Y'),columns=['temperature'])
print(data0)
print(data1)
# Concatenate the datasets row-wise
data = pd.concat([data0, data1])
print(data)
temperature
2000-12-31 9.371567
2001-12-31 9.577373
2002-12-31 9.100010
2003-12-31 1.377707
2004-12-31 2.171535
temperature
2005-12-31 182.571988
2006-12-31 623.641092
2007-12-31 865.181408
2008-12-31 333.438249
2009-12-31 780.453570
temperature
2000-12-31 9.371567
2001-12-31 9.577373
2002-12-31 9.100010
2003-12-31 1.377707
2004-12-31 2.171535
2005-12-31 182.571988
2006-12-31 623.641092
2007-12-31 865.181408
2008-12-31 333.438249
2009-12-31 780.453570
# Create the datasets
data0 = pd.DataFrame(np.random.rand(5, 3) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature', 'pressure', 'humidity'])
data1 = pd.DataFrame(np.random.rand(5, 4) * 10, index=pd.date_range('2001-01-01', periods=5, freq='Y'),columns=['temperature', 'precipitation', 'humidity', 'wind'])
print(data0)
print(data1)
# The default join is outer, which takes the union of the two datasets' row labels
data = pd.concat([data0, data1], axis=1)
print(data)
temperature pressure humidity
2000-12-31 1.001251 2.545684 9.666708
2001-12-31 2.902656 9.816877 7.251775
2002-12-31 3.361788 8.067900 3.038989
2003-12-31 9.847927 4.671510 9.594601
2004-12-31 5.341185 3.827158 4.114981
temperature precipitation humidity wind
2001-12-31 3.272974 4.891393 4.228730 1.531332
2002-12-31 4.177929 5.301486 5.563462 6.474081
2003-12-31 8.423616 6.799374 9.836578 9.643460
2004-12-31 7.839116 6.346595 2.614979 9.741067
2005-12-31 8.017644 7.170777 4.284372 0.935709
temperature pressure humidity temperature precipitation \
2000-12-31 1.001251 2.545684 9.666708 NaN NaN
2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
2005-12-31 NaN NaN NaN 8.017644 7.170777
humidity wind
2000-12-31 NaN NaN
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
2005-12-31 4.284372 0.935709
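Notice that with a more complex dataset we need more settings to get the result we want. concat defaults to join='outer', which takes the union of the two datasets.
In addition, along the axis that is not being concatenated, any position with no corresponding data is filled with NaN (here we concatenate column-wise, so wherever a row label is missing from one dataset, that dataset's columns are filled with NaN for that row).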
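The union-and-fill behaviour is easier to see on data small enough to check by eye. The sketch below uses two made-up frames (the names a and b and their values are illustrative, not from the dataset above) whose row labels only partly overlap.

```python
import pandas as pd

# Two small frames whose row labels only partly overlap (illustrative data).
a = pd.DataFrame({"temperature": [1.0, 2.0, 3.0]}, index=["2000", "2001", "2002"])
b = pd.DataFrame({"precipitation": [10.0, 20.0, 30.0]}, index=["2001", "2002", "2003"])

# Column-wise concat with the default join='outer': union of the row labels.
outer = pd.concat([a, b], axis=1)
# Column-wise concat with join='inner': intersection of the row labels.
inner = pd.concat([a, b], axis=1, join="inner")

print(outer)               # 4 rows; the non-shared labels carry one NaN each
print(inner)               # 2 rows; only the labels present in both frames
print(outer.isna().sum())  # NaN count per column
```

With the outer join, "2000" has no precipitation and "2003" has no temperature, so each of those cells is NaN; the inner join simply drops both rows.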
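We can also keep only the labels present in both datasets (the intersection); whether these are rows or columns depends on the concatenation axis:
# join='inner' takes the intersection of the two datasets
data = pd.concat([data0, data1], axis=1, join='inner') # keep only the shared rows
print(data)
# Note: because we concatenate column-wise, the intersection applies only to the row labels
# Try the other axis to see how it changes the result in this example
data = pd.concat([data0, data1], axis=0, join='inner') # keep only the shared columns
print(data)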
temperature pressure humidity temperature precipitation \
2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
humidity wind
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
temperature humidity
2000-12-31 1.001251 9.666708
2001-12-31 2.902656 7.251775
2002-12-31 3.361788 3.038989
2003-12-31 9.847927 9.594601
2004-12-31 5.341185 4.114981
2001-12-31 3.272974 4.228730
2002-12-31 4.177929 5.563462
2003-12-31 8.423616 9.836578
2004-12-31 7.839116 2.614979
2005-12-31 8.017644 4.284372
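Finally, the last output above contains duplicate row labels, which will cause trouble when we index into it (try running data.loc['2001-12-31', :]).
If the row labels carry little meaning, we can simply reset the index: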
data = pd.concat([data0, data1], axis=0, ignore_index=True)
print(data)
temperature pressure humidity precipitation wind
0 1.001251 2.545684 9.666708 NaN NaN
1 2.902656 9.816877 7.251775 NaN NaN
2 3.361788 8.067900 3.038989 NaN NaN
3 9.847927 4.671510 9.594601 NaN NaN
4 5.341185 3.827158 4.114981 NaN NaN
5 3.272974 NaN 4.228730 4.891393 1.531332
6 4.177929 NaN 5.563462 5.301486 6.474081
7 8.423616 NaN 9.836578 6.799374 9.643460
8 7.839116 NaN 2.614979 6.346595 9.741067
9 8.017644 NaN 4.284372 7.170777 0.935709
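When the row labels do matter, resetting the index throws away information. Two other options that concat offers are the keys and verify_integrity parameters; the sketch below shows them on small made-up frames (a, b, and the key names "a" and "b" are illustrative assumptions).

```python
import pandas as pd

# Two frames that share the row label "2002" (illustrative data).
a = pd.DataFrame({"temperature": [1.0, 2.0]}, index=["2001", "2002"])
b = pd.DataFrame({"temperature": [3.0, 4.0]}, index=["2002", "2003"])

# Plain row-wise concat keeps both "2002" rows, so the index is no longer unique.
stacked = pd.concat([a, b])
print(stacked.index.is_unique)
print(stacked.loc["2002"])  # selects two rows rather than one

# keys= adds an outer index level recording which input each row came from.
labelled = pd.concat([a, b], keys=["a", "b"])
print(labelled.loc[("b", "2002"), "temperature"])

# verify_integrity=True makes concat raise instead of silently duplicating labels.
try:
    pd.concat([a, b], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("concat refused:", err)
```

Which option fits depends on the data: keys keeps every row addressable, while verify_integrity is more of a guard for cases where duplicates would indicate a mistake upstream.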
merge
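The merge function combines DataFrames based on one or more keys.
Unlike concat, which joins along an axis, merge behaves like a join in a relational database.
It matches rows of the two DataFrames on the specified keys, much as a lookup such as VLOOKUP does in Excel.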
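The main parameters are:
left: the first DataFrame;
right: the second DataFrame;
on: the join key(s), i.e. column names present in both DataFrames;
left_on: the join key(s) in the first DataFrame;
right_on: the join key(s) in the second DataFrame;
left_index: whether to use the first DataFrame's index as the join key;
right_index: whether to use the second DataFrame's index as the join key;
how: the join type; one of inner (inner join), outer (outer join), left (left join), right (right join);
suffixes: the suffixes appended to non-key column names that appear in both DataFrames.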
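Before turning to random data, a minimal sketch on a hand-made example may help fix the meaning of on, how, and suffixes. The stations and readings tables, their column names, and the suffix strings are invented for illustration.

```python
import pandas as pd

# Two small tables sharing the key column "station" (illustrative data).
stations = pd.DataFrame({
    "station": ["A", "B", "C"],
    "elevation": [12, 340, 1500],
    "source": ["survey", "survey", "gps"],
})
readings = pd.DataFrame({
    "station": ["B", "C", "D"],
    "temperature": [8.5, 2.1, 15.0],
    "source": ["manual", "auto", "auto"],
})

# Compare which key values survive under each join type.
results = {}
for how in ("inner", "outer", "left", "right"):
    merged = pd.merge(stations, readings, on="station", how=how)
    results[how] = merged["station"].tolist()
    print(how, results[how])

# "source" exists in both tables but is not a key, so it receives the suffixes.
with_suffixes = pd.merge(stations, readings, on="station", suffixes=("_sta", "_obs"))
print(with_suffixes.columns.tolist())
```

The inner join keeps only the stations found in both tables, left and right keep every key from one side, and outer keeps all of them, filling the gaps with NaN.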
We again use the random data from above to demonstrate:
left = data0.copy()
right = data1.copy()
df_merge = pd.merge(left, right, on='temperature', how='inner')
print(df_merge)
Empty DataFrame
Columns: [temperature, pressure, humidity_x, precipitation, humidity_y, wind]
Index: []
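Unlike concat, which aligns data by row or column labels, merge pairs up rows only when their key values are equal. The random temperatures we chose as the key share no values between the two tables, so the result is empty.
We can modify the data slightly by copying the left table's temperatures into the right table (the assignment aligns on the index, so 2005-12-31, which is absent from left, becomes NaN):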
right['temperature'] = left['temperature']
df_merge = pd.merge(left, right, on='temperature', how='inner')
print(df_merge)
temperature pressure humidity_x precipitation humidity_y wind
0 2.902656 9.816877 7.251775 4.891393 4.228730 1.531332
1 3.361788 8.067900 3.038989 5.301486 5.563462 6.474081
2 9.847927 4.671510 9.594601 6.799374 9.836578 9.643460
3 5.341185 3.827158 4.114981 6.346595 2.614979 9.741067
Of course, we can also try the other join types:
# how='outer': take the union of the keys
df_merge = pd.merge(left, right, on='temperature', how='outer')
print(df_merge)
# how='left': keep only the rows whose key values appear in the left table
df_merge = pd.merge(left, right, on='temperature', how='left')
print(df_merge)
# how='right': keep only the rows whose key values appear in the right table
df_merge = pd.merge(left, right, on='temperature', how='right')
print(df_merge)
temperature pressure humidity_x precipitation humidity_y wind
0 1.001251 2.545684 9.666708 NaN NaN NaN
1 2.902656 9.816877 7.251775 4.891393 4.228730 1.531332
2 3.361788 8.067900 3.038989 5.301486 5.563462 6.474081
3 9.847927 4.671510 9.594601 6.799374 9.836578 9.643460
4 5.341185 3.827158 4.114981 6.346595 2.614979 9.741067
5 NaN NaN NaN 7.170777 4.284372 0.935709
temperature pressure humidity_x precipitation humidity_y wind
0 1.001251 2.545684 9.666708 NaN NaN NaN
1 2.902656 9.816877 7.251775 4.891393 4.228730 1.531332
2 3.361788 8.067900 3.038989 5.301486 5.563462 6.474081
3 9.847927 4.671510 9.594601 6.799374 9.836578 9.643460
4 5.341185 3.827158 4.114981 6.346595 2.614979 9.741067
temperature pressure humidity_x precipitation humidity_y wind
0 2.902656 9.816877 7.251775 4.891393 4.228730 1.531332
1 3.361788 8.067900 3.038989 5.301486 5.563462 6.474081
2 9.847927 4.671510 9.594601 6.799374 9.836578 9.643460
3 5.341185 3.827158 4.114981 6.346595 2.614979 9.741067
4 NaN NaN NaN 7.170777 4.284372 0.935709
Or merge directly on the index (the row labels):
# Merge the tables on their indexes
# Also try out the effect of suffixes
df_merge = pd.merge(left, right, left_index=True, right_index=True, how='outer', suffixes=('_L', '_R'))
print(df_merge)
temperature_L pressure humidity_L temperature_R precipitation \
2000-12-31 1.001251 2.545684 9.666708 NaN NaN
2001-12-31 2.902656 9.816877 7.251775 2.902656 4.891393
2002-12-31 3.361788 8.067900 3.038989 3.361788 5.301486
2003-12-31 9.847927 4.671510 9.594601 9.847927 6.799374
2004-12-31 5.341185 3.827158 4.114981 5.341185 6.346595
2005-12-31 NaN NaN NaN NaN 7.170777
humidity_R wind
2000-12-31 NaN NaN
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
2005-12-31 4.284372 0.935709
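In some cases the corresponding columns are named differently in the two tables, or we need to merge on several columns at once; we then specify the key columns for each side separately: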
right['tas'] = left['temperature']
right['rh'] = left['humidity']
df_merge = pd.merge(left, right, left_on=['temperature', 'humidity'], right_on=['tas', 'rh'], how='inner', suffixes=('_α', '_β'))
print(df_merge)
temperature_α pressure humidity_α temperature_β precipitation \
0 2.902656 9.816877 7.251775 2.902656 4.891393
1 3.361788 8.067900 3.038989 3.361788 5.301486
2 9.847927 4.671510 9.594601 9.847927 6.799374
3 5.341185 3.827158 4.114981 5.341185 6.346595
humidity_β wind tas rh
0 4.228730 1.531332 2.902656 7.251775
1 5.563462 6.474081 3.361788 3.038989
2 9.836578 9.643460 9.847927 9.594601
3 2.614979 9.741067 5.341185 4.114981
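As the output above shows, merging with left_on and right_on keeps the key columns from both sides, even though they hold the same values in every matched row. A small sketch of that behaviour, and of one way to tidy it up, follows; the tables and their values are made up for illustration.

```python
import pandas as pd

# The same quantity is stored under different column names (illustrative data).
left = pd.DataFrame({"temperature": [1.5, 2.5, 3.5], "pressure": [1000, 1005, 1010]})
right = pd.DataFrame({"tas": [2.5, 3.5, 4.5], "wind": [3.0, 4.0, 5.0]})

# Match rows where left.temperature equals right.tas.
merged = pd.merge(left, right, left_on="temperature", right_on="tas", how="inner")
print(merged.columns.tolist())  # both key columns are retained

# For an inner join the two key columns are identical, so one can be dropped.
tidy = merged.drop(columns="tas")
print(tidy)
```

Dropping the redundant key is only safe for an inner join; with an outer, left, or right join the two key columns can differ where one side has no match.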
join
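This method is something of a blend of the two functions above: it combines two DataFrame objects on their indexes (row labels).
Its usage is similar to pd.merge(left, right, left_index=True, right_index=True, suffixes=('_L', '_R')), except that join defaults to how='left' whereas merge defaults to how='inner'.
Let's simply look at what it produces: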
left = data0.copy()
right = data1.copy()
df_join = left.join(right, how='outer', lsuffix='_L', rsuffix='_R')
print(df_join)
df_join = left.join(right, how='inner', lsuffix='_L', rsuffix='_R')
print(df_join)
df_join = left.join(right, how='left', lsuffix='_L', rsuffix='_R')
print(df_join)
df_join = left.join(right, how='right', lsuffix='_L', rsuffix='_R')
print(df_join)
temperature_L pressure humidity_L temperature_R precipitation \
2000-12-31 1.001251 2.545684 9.666708 NaN NaN
2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
2005-12-31 NaN NaN NaN 8.017644 7.170777
humidity_R wind
2000-12-31 NaN NaN
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
2005-12-31 4.284372 0.935709
temperature_L pressure humidity_L temperature_R precipitation \
2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
humidity_R wind
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
temperature_L pressure humidity_L temperature_R precipitation \
2000-12-31 1.001251 2.545684 9.666708 NaN NaN
2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
humidity_R wind
2000-12-31 NaN NaN
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
temperature_L pressure humidity_L temperature_R precipitation \
2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
2005-12-31 NaN NaN NaN 8.017644 7.170777
humidity_R wind
2001-12-31 4.228730 1.531332
2002-12-31 5.563462 6.474081
2003-12-31 9.836578 9.643460
2004-12-31 2.614979 9.741067
2005-12-31 4.284372 0.935709
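The correspondence between join and an index-based merge can be checked directly. The sketch below builds its own small random frames (seeded, with daily dates; none of this is the data0/data1 from above) and compares the two calls for each join type.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable

# Two frames with partly overlapping dates and one shared column name.
left = pd.DataFrame(
    rng.random((5, 2)),
    index=pd.date_range("2000-01-01", periods=5, freq="D"),
    columns=["temperature", "humidity"],
)
right = pd.DataFrame(
    rng.random((5, 2)),
    index=pd.date_range("2000-01-02", periods=5, freq="D"),
    columns=["temperature", "wind"],
)

# For each join type, compare DataFrame.join with the index-based pd.merge.
same = {}
for how in ("left", "right", "inner", "outer"):
    via_join = left.join(right, how=how, lsuffix="_L", rsuffix="_R")
    via_merge = pd.merge(
        left, right,
        left_index=True, right_index=True,
        how=how, suffixes=("_L", "_R"),
    )
    same[how] = via_join.equals(via_merge)
    print(how, same[how])
```

In practice join is mostly a shorthand: it saves typing left_index=True and right_index=True when the index is already the key.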
Afterword
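That covers the basics of combining data with Pandas. As for practical use, the most obvious application is probably batch-reading meteorological data stored one year per file and merging it into a complete time series for later processing.
Once the Pandas series is finished, a later installment will present a worked case for this.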
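Until then, a rough sketch of the idea might look like the following. Everything specific here is an assumption made for illustration: the yearly_files dictionary stands in for real per-year CSV files, and the column names date and temperature are invented. Real code would pass file paths to pd.read_csv instead of in-memory text.

```python
import io

import pandas as pd

# Hypothetical per-year CSV contents standing in for files on disk.
yearly_files = {
    2001: "date,temperature\n2001-01-01,2.0\n2001-07-01,21.0\n",
    2000: "date,temperature\n2000-01-01,1.0\n2000-07-01,20.0\n",
    2002: "date,temperature\n2002-01-01,3.0\n2002-07-01,22.0\n",
}

# Read each "file" into a DataFrame indexed by its parsed date column.
frames = [
    pd.read_csv(io.StringIO(text), index_col="date", parse_dates=True)
    for _, text in sorted(yearly_files.items())
]

# Stack the yearly pieces row-wise and sort into one continuous time series.
series = pd.concat(frames).sort_index()
print(series)
```

Sorting after the concat keeps the result in chronological order even if the files are read out of order.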
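Enough of that for now. Today is 1024, and since most of us write at least a little code, we picked today on purpose to publish: happy Programmers' Day, everyone!
See you next time!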
Manuscript: RitasCake
Proof: Philero; RitasCake
For more, subscribe to the WeChat official account: Westerlies
Run the examples from this article in the cloud on the Heywhale (和鲸) community: https://www.heywhale.com/mw/project/66221ce2e584e69fbfef87ba