SC.Pandas 05 | How to Combine Data with Pandas?

Introduction

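In the previous issue, we saw that Pandas makes it easy to read and write tabular files, letting data move between local storage and Python.

In many situations, however, the data we need is spread across several files or DataFrames. To assemble them into one complete dataset, we need the operation named in this issue's title: combining data.

Pandas provides several ways to combine data, chiefly concat, merge, and join. Each suits different scenarios, and we introduce them in turn below.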

concat

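The concat function joins multiple DataFrame or Series objects along a given axis (rows or columns). Its most commonly customized parameters are:

  • axis: the axis to concatenate along; 0 means rows, 1 means columns; the default is 0;

  • join: how to handle labels on the other axis; inner keeps only the rows or columns shared by all inputs, outer keeps all rows or columns; the default is outer;

  • ignore_index: whether to reset the index along the concatenation axis; the default is False.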
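Before moving to the random data, the three parameters above can be seen side by side in a small deterministic sketch. The frames `a` and `b`, their labels, and their values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy frames; labels and values are illustrative only
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r0', 'r1'])
b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]}, index=['r1', 'r2'])

# axis=0 (default): stack rows; join='outer' keeps the union of columns
rows_outer = pd.concat([a, b])

# join='inner': keep only the columns shared by both frames
rows_inner = pd.concat([a, b], join='inner')

# ignore_index=True: discard the original row labels and renumber from 0
rows_reset = pd.concat([a, b], ignore_index=True)

# axis=1: place the frames side by side, aligning on row labels
cols_inner = pd.concat([a, b], axis=1, join='inner')

print(rows_outer)
print(rows_inner)
print(rows_reset)
print(cols_inner)
```

Note that join always acts on the axis that is not being concatenated: columns when axis=0, rows when axis=1.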

Let's go straight to the examples:

import numpy as np
import pandas as pd

# Create two datasets (freq='Y' gives year-end dates; pandas 2.2+ prefers the alias 'YE')
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'),columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2000-01-01', periods=5, freq='Y'),columns=['precipitation'])

print(data0)
print(data1)

# Concatenate the datasets column-wise
data = pd.concat([data0, data1], axis=1)
print(data)

              temperature
  2000-12-31     2.956534
  2001-12-31     8.961623
  2002-12-31     0.720613
  2003-12-31     4.456005
  2004-12-31     0.500873
              precipitation
  2000-12-31      44.627392
  2001-12-31     804.956900
  2002-12-31     582.525174
  2003-12-31     628.544335
  2004-12-31     100.806668
              temperature  precipitation
  2000-12-31     2.956534      44.627392
  2001-12-31     8.961623     804.956900
  2002-12-31     0.720613     582.525174
  2003-12-31     4.456005     628.544335
  2004-12-31     0.500873     100.806668

# Create two datasets
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'),columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2005-01-01', periods=5, freq='Y'),columns=['temperature'])

print(data0)
print(data1)

# Concatenate the datasets row-wise
data = pd.concat([data0, data1])
print(data)

              temperature
  2000-12-31     9.371567
  2001-12-31     9.577373
  2002-12-31     9.100010
  2003-12-31     1.377707
  2004-12-31     2.171535
              temperature
  2005-12-31   182.571988
  2006-12-31   623.641092
  2007-12-31   865.181408
  2008-12-31   333.438249
  2009-12-31   780.453570
              temperature
  2000-12-31     9.371567
  2001-12-31     9.577373
  2002-12-31     9.100010
  2003-12-31     1.377707
  2004-12-31     2.171535
  2005-12-31   182.571988
  2006-12-31   623.641092
  2007-12-31   865.181408
  2008-12-31   333.438249
  2009-12-31   780.453570

# Create two datasets
data0 = pd.DataFrame(np.random.rand(5, 3) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature', 'pressure', 'humidity'])
data1 = pd.DataFrame(np.random.rand(5, 4) * 10, index=pd.date_range('2001-01-01', periods=5, freq='Y'),columns=['temperature', 'precipitation', 'humidity', 'wind'])

print(data0)
print(data1)

# The default join is outer, which takes the union of the two datasets
data = pd.concat([data0, data1], axis=1)
print(data)

              temperature  pressure  humidity
  2000-12-31     1.001251  2.545684  9.666708
  2001-12-31     2.902656  9.816877  7.251775
  2002-12-31     3.361788  8.067900  3.038989
  2003-12-31     9.847927  4.671510  9.594601
  2004-12-31     5.341185  3.827158  4.114981
              temperature  precipitation  humidity      wind
  2001-12-31     3.272974       4.891393  4.228730  1.531332
  2002-12-31     4.177929       5.301486  5.563462  6.474081
  2003-12-31     8.423616       6.799374  9.836578  9.643460
  2004-12-31     7.839116       6.346595  2.614979  9.741067
  2005-12-31     8.017644       7.170777  4.284372  0.935709
              temperature  pressure  humidity  temperature  precipitation  \
  2000-12-31     1.001251  2.545684  9.666708          NaN            NaN   
  2001-12-31     2.902656  9.816877  7.251775     3.272974       4.891393   
  2002-12-31     3.361788  8.067900  3.038989     4.177929       5.301486   
  2003-12-31     9.847927  4.671510  9.594601     8.423616       6.799374   
  2004-12-31     5.341185  3.827158  4.114981     7.839116       6.346595   
  2005-12-31          NaN       NaN       NaN     8.017644       7.170777   

              humidity      wind  
  2000-12-31       NaN       NaN  
  2001-12-31  4.228730  1.531332  
  2002-12-31  5.563462  6.474081  
  2003-12-31  9.836578  9.643460  
  2004-12-31  2.614979  9.741067  
  2005-12-31  4.284372  0.935709  

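Notice that with a more complex dataset, we need more settings to get the result we want. concat defaults to join='outer', which takes the union of the two datasets.

In addition, along the axis that is not being concatenated, any position with no corresponding data is filled with NaN (here we concatenate column-wise, so row labels missing from one dataset are filled with NaN).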
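The NaN filling is easier to verify on a tiny example than on random numbers. The sketch below uses two hypothetical integer-valued frames, `t` and `p`, whose row labels only partly overlap, and counts the filled cells afterwards.

```python
import pandas as pd

# Hypothetical toy frames with partly overlapping row labels
t = pd.DataFrame({'temperature': [1, 2]}, index=['2000', '2001'])
p = pd.DataFrame({'precipitation': [10, 20]}, index=['2001', '2002'])

# Column-wise outer concat: the row labels are the union of both indexes
wide = pd.concat([t, p], axis=1)
print(wide)

# Count the cells that were filled with NaN in each column
print(wide.isna().sum())

# Integer columns are converted to float so that they can hold NaN
print(wide.dtypes)
```

Only the label '2001' exists in both frames, so it is the only row without a NaN.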

We can also keep only the labels present in both datasets (the intersection):

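# join='inner' takes the intersection of the two datasets
data = pd.concat([data0, data1], axis=1, join='inner')      # keep only shared rows
print(data)

# Note: because we concatenate column-wise, the intersection applies only to the row labels
# Try the other axis to see how it changes the result in this example
data = pd.concat([data0, data1], axis=0, join='inner')      # keep only shared columns
print(data)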
print(data)

              temperature  pressure  humidity  temperature  precipitation  \
  2001-12-31     2.902656  9.816877  7.251775     3.272974       4.891393   
  2002-12-31     3.361788  8.067900  3.038989     4.177929       5.301486   
  2003-12-31     9.847927  4.671510  9.594601     8.423616       6.799374   
  2004-12-31     5.341185  3.827158  4.114981     7.839116       6.346595   

              humidity      wind  
  2001-12-31  4.228730  1.531332  
  2002-12-31  5.563462  6.474081  
  2003-12-31  9.836578  9.643460  
  2004-12-31  2.614979  9.741067  
              temperature  humidity
  2000-12-31     1.001251  9.666708
  2001-12-31     2.902656  7.251775
  2002-12-31     3.361788  3.038989
  2003-12-31     9.847927  9.594601
  2004-12-31     5.341185  4.114981
  2001-12-31     3.272974  4.228730
  2002-12-31     4.177929  5.563462
  2003-12-31     8.423616  9.836578
  2004-12-31     7.839116  2.614979
  2005-12-31     8.017644  4.284372

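Finally, in the last output above, the row labels contain duplicates, which causes trouble when indexing: a single label now returns two rows instead of one (try running data.loc['2001-12-31', :]).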
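A minimal sketch of the problem, using two small hypothetical frames `d0` and `d1` that share one date. It also shows concat's verify_integrity parameter, which raises an error instead of silently producing duplicate labels.

```python
import pandas as pd

# Hypothetical toy frames that share the label 2001-12-31
d0 = pd.DataFrame({'temperature': [1.0, 2.0]},
                  index=pd.to_datetime(['2000-12-31', '2001-12-31']))
d1 = pd.DataFrame({'temperature': [3.0, 4.0]},
                  index=pd.to_datetime(['2001-12-31', '2002-12-31']))

stacked = pd.concat([d0, d1])
print(stacked.index.is_unique)   # the index now contains a repeated label

# Selecting the repeated label returns every matching row
dup_rows = stacked.loc[pd.Timestamp('2001-12-31')]
print(dup_rows)

# verify_integrity=True makes concat refuse overlapping labels
raised = False
try:
    pd.concat([d0, d1], verify_integrity=True)
except ValueError as err:
    raised = True
    print('duplicate labels detected:', err)
```

Checking index.is_unique right after a row-wise concat is a cheap way to catch this before it affects later indexing.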

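If the row labels carry little meaning, we can simply reset the index (this example also goes back to the default outer join, so all columns are kept):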

data = pd.concat([data0, data1], axis=0, ignore_index=True)
print(data)

     temperature  pressure  humidity  precipitation      wind
  0     1.001251  2.545684  9.666708            NaN       NaN
  1     2.902656  9.816877  7.251775            NaN       NaN
  2     3.361788  8.067900  3.038989            NaN       NaN
  3     9.847927  4.671510  9.594601            NaN       NaN
  4     5.341185  3.827158  4.114981            NaN       NaN
  5     3.272974       NaN  4.228730       4.891393  1.531332
  6     4.177929       NaN  5.563462       5.301486  6.474081
  7     8.423616       NaN  9.836578       6.799374  9.643460
  8     7.839116       NaN  2.614979       6.346595  9.741067
  9     8.017644       NaN  4.284372       7.170777  0.935709

merge

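The merge function combines DataFrames based on one or more keys.

Unlike concat, which stacks data along an axis, merge works like a join in a relational database.

It links the rows of two DataFrames by the specified keys, much like a VLOOKUP in Excel.

The main parameters are: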

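  • left: the first DataFrame;

  • right: the second DataFrame;

  • on: the join key(s), i.e. column names present in both DataFrames;

  • left_on: the join key(s) in the first DataFrame;

  • right_on: the join key(s) in the second DataFrame;

  • left_index: whether to use the index of the first DataFrame as its join key;

  • right_index: whether to use the index of the second DataFrame as its join key;

  • how: the join type; one of inner (inner join), outer (outer join), left (left join), or right (right join);

  • suffixes: suffixes appended to overlapping non-key column names.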
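To make these parameters concrete, here is a minimal sketch on a key that actually repeats across tables. The `stations` and `obs` tables, the column names, and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical tables that share the key column 'station'
stations = pd.DataFrame({'station': ['A', 'B', 'C'],
                         'elevation': [10, 250, 1200]})
obs = pd.DataFrame({'station': ['B', 'C', 'D'],
                    'temperature': [15.2, 8.4, 20.1]})

# inner: only keys present in both tables (B and C)
inner = pd.merge(stations, obs, on='station', how='inner')

# left: every key from the left table; unmatched rows get NaN
left_join = pd.merge(stations, obs, on='station', how='left')

# outer: the union of keys from both tables
outer = pd.merge(stations, obs, on='station', how='outer')

# left_on / right_on: the key column is named differently on each side
obs_renamed = obs.rename(columns={'station': 'site'})
by_names = pd.merge(stations, obs_renamed, left_on='station', right_on='site')

print(inner)
print(left_join)
print(outer)
print(by_names)
```

With left_on and right_on, both key columns ('station' and 'site') are kept in the result.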

We reuse the random data from above for the demonstration:

left = data0.copy()
right = data1.copy()

df_merge = pd.merge(left, right, on='temperature', how='inner')
print(df_merge)

  Empty DataFrame
  Columns: [temperature, pressure, humidity_x, precipitation, humidity_y, wind]
  Index: []

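Unlike concat, which aligns data by row and column labels, merge combines two rows only when their key values are equal. The random temperature values we chose as the key have no values in common, so the result is empty.

We can make a small modification: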

right['temperature'] = left['temperature']

df_merge = pd.merge(left, right, on='temperature', how='inner')
print(df_merge)

     temperature  pressure  humidity_x  precipitation  humidity_y      wind
  0     2.902656  9.816877    7.251775       4.891393    4.228730  1.531332
  1     3.361788  8.067900    3.038989       5.301486    5.563462  6.474081
  2     9.847927  4.671510    9.594601       6.799374    9.836578  9.643460
  3     5.341185  3.827158    4.114981       6.346595    2.614979  9.741067
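One detail in the modification above is worth spelling out: assigning right['temperature'] = left['temperature'] aligns the values by index label, not by position. That is why only the four shared dates match. The sketch below reproduces the effect with small hypothetical frames indexed by year.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({'temperature': [1.0, 2.0, 3.0]}, index=[2000, 2001, 2002])
right = pd.DataFrame({'wind': [7.0, 8.0, 9.0]}, index=[2001, 2002, 2003])

# Column assignment aligns on the index: 2003 has no partner and becomes NaN
right['temperature'] = left['temperature']
print(right)

# Only the rows whose key values are equal survive the inner merge
merged = pd.merge(left, right, on='temperature', how='inner')
print(merged)
```

Floating-point keys only match when the values are exactly equal, which holds here because they were copied rather than recomputed.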

Of course, we can also try the other join types:

# how='outer' takes the union of the keys from both tables
df_merge = pd.merge(left, right, on='temperature', how='outer')
print(df_merge)

# how='left' keeps every key found in the left table
df_merge = pd.merge(left, right, on='temperature', how='left')
print(df_merge)

# how='right' keeps every key found in the right table
df_merge = pd.merge(left, right, on='temperature', how='right')
print(df_merge)

     temperature  pressure  humidity_x  precipitation  humidity_y      wind
  0     1.001251  2.545684    9.666708            NaN         NaN       NaN
  1     2.902656  9.816877    7.251775       4.891393    4.228730  1.531332
  2     3.361788  8.067900    3.038989       5.301486    5.563462  6.474081
  3     9.847927  4.671510    9.594601       6.799374    9.836578  9.643460
  4     5.341185  3.827158    4.114981       6.346595    2.614979  9.741067
  5          NaN       NaN         NaN       7.170777    4.284372  0.935709
     temperature  pressure  humidity_x  precipitation  humidity_y      wind
  0     1.001251  2.545684    9.666708            NaN         NaN       NaN
  1     2.902656  9.816877    7.251775       4.891393    4.228730  1.531332
  2     3.361788  8.067900    3.038989       5.301486    5.563462  6.474081
  3     9.847927  4.671510    9.594601       6.799374    9.836578  9.643460
  4     5.341185  3.827158    4.114981       6.346595    2.614979  9.741067
     temperature  pressure  humidity_x  precipitation  humidity_y      wind
  0     2.902656  9.816877    7.251775       4.891393    4.228730  1.531332
  1     3.361788  8.067900    3.038989       5.301486    5.563462  6.474081
  2     9.847927  4.671510    9.594601       6.799374    9.836578  9.643460
  3     5.341185  3.827158    4.114981       6.346595    2.614979  9.741067
  4          NaN       NaN         NaN       7.170777    4.284372  0.935709
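When comparing these join types, it can help to see which table each row came from. The sketch below uses merge's indicator parameter, which is not covered in this article, on two hypothetical tables `a` and `b`.

```python
import pandas as pd

# Hypothetical tables with partly overlapping keys
a = pd.DataFrame({'key': [1, 2, 3], 'va': ['a1', 'a2', 'a3']})
b = pd.DataFrame({'key': [2, 3, 4], 'vb': ['b2', 'b3', 'b4']})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'
tagged = pd.merge(a, b, on='key', how='outer', indicator=True)
print(tagged)

n_both = int((tagged['_merge'] == 'both').sum())
n_left_only = int((tagged['_merge'] == 'left_only').sum())
n_right_only = int((tagged['_merge'] == 'right_only').sum())
print(n_both, n_left_only, n_right_only)
```

Filtering on the '_merge' column then recovers the inner, left, and right results from a single outer merge.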

Or merge directly on the index (row labels):

# Merge the tables on their index
# Also try out the effect of suffixes
df_merge = pd.merge(left, right, left_index=True, right_index=True, how='outer', suffixes=('_L', '_R'))
print(df_merge)

              temperature_L  pressure  humidity_L  temperature_R  precipitation  \
  2000-12-31       1.001251  2.545684    9.666708            NaN            NaN   
  2001-12-31       2.902656  9.816877    7.251775       2.902656       4.891393   
  2002-12-31       3.361788  8.067900    3.038989       3.361788       5.301486   
  2003-12-31       9.847927  4.671510    9.594601       9.847927       6.799374   
  2004-12-31       5.341185  3.827158    4.114981       5.341185       6.346595   
  2005-12-31            NaN       NaN         NaN            NaN       7.170777   

              humidity_R      wind  
  2000-12-31         NaN       NaN  
  2001-12-31    4.228730  1.531332  
  2002-12-31    5.563462  6.474081  
  2003-12-31    9.836578  9.643460  
  2004-12-31    2.614979  9.741067  
  2005-12-31    4.284372  0.935709  

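In some scenarios the corresponding columns in the two tables have different names, or we need to merge on several columns at once. In that case we specify the key columns for each side separately: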

right['tas'] = left['temperature']
right['rh'] = left['humidity']

df_merge = pd.merge(left, right, left_on=['temperature', 'humidity'], right_on=['tas', 'rh'], how='inner', suffixes=('_α', '_β'))
print(df_merge)

     temperature_α  pressure  humidity_α  temperature_β  precipitation  \
  0       2.902656  9.816877    7.251775       2.902656       4.891393   
  1       3.361788  8.067900    3.038989       3.361788       5.301486   
  2       9.847927  4.671510    9.594601       9.847927       6.799374   
  3       5.341185  3.827158    4.114981       5.341185       6.346595   

     humidity_β      wind       tas        rh  
  0    4.228730  1.531332  2.902656  7.251775  
  1    5.563462  6.474081  3.361788  3.038989  
  2    9.836578  9.643460  9.847927  9.594601  
  3    2.614979  9.741067  5.341185  4.114981  

join

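join is a DataFrame method that resembles a combination of the two functions above: it combines two DataFrame objects by their index (row labels).

Its usage is similar to pd.merge(left, right, left_index=True, right_index=True, suffixes=('_L', '_R')), except that join defaults to how='left' whereas merge defaults to how='inner'.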
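The equivalence can be checked directly. The sketch below compares the two calls on small hypothetical frames, passing how='left' to merge so that it matches join's default.

```python
import pandas as pd

# Hypothetical frames with one overlapping column name and one shared label
left = pd.DataFrame({'temperature': [1.0, 2.0], 'humidity': [50.0, 60.0]},
                    index=['2000', '2001'])
right = pd.DataFrame({'temperature': [3.0, 4.0], 'wind': [5.0, 6.0]},
                     index=['2001', '2002'])

# join defaults to how='left' and takes the suffixes as lsuffix / rsuffix
via_join = left.join(right, lsuffix='_L', rsuffix='_R')

# The same operation written with merge on both indexes
via_merge = pd.merge(left, right, left_index=True, right_index=True,
                     how='left', suffixes=('_L', '_R'))

print(via_join)
print(via_join.equals(via_merge))

# Without suffixes, join refuses overlapping column names
overlap_error = False
try:
    left.join(right)
except ValueError:
    overlap_error = True
print(overlap_error)
```

Unlike merge, which falls back to the suffixes '_x' and '_y', join has no default suffixes, so they must be given whenever column names overlap.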

Let's look directly at what the statement produces:

left = data0.copy()
right = data1.copy()

df_join = left.join(right,  how='outer', lsuffix='_L', rsuffix='_R')
print(df_join)

df_join = left.join(right,  how='inner', lsuffix='_L', rsuffix='_R')
print(df_join)

df_join = left.join(right,  how='left', lsuffix='_L', rsuffix='_R')
print(df_join)

df_join = left.join(right,  how='right', lsuffix='_L', rsuffix='_R')
print(df_join)

              temperature_L  pressure  humidity_L  temperature_R  precipitation  \
  2000-12-31       1.001251  2.545684    9.666708            NaN            NaN   
  2001-12-31       2.902656  9.816877    7.251775       3.272974       4.891393   
  2002-12-31       3.361788  8.067900    3.038989       4.177929       5.301486   
  2003-12-31       9.847927  4.671510    9.594601       8.423616       6.799374   
  2004-12-31       5.341185  3.827158    4.114981       7.839116       6.346595   
  2005-12-31            NaN       NaN         NaN       8.017644       7.170777   

              humidity_R      wind  
  2000-12-31         NaN       NaN  
  2001-12-31    4.228730  1.531332  
  2002-12-31    5.563462  6.474081  
  2003-12-31    9.836578  9.643460  
  2004-12-31    2.614979  9.741067  
  2005-12-31    4.284372  0.935709
  

              temperature_L  pressure  humidity_L  temperature_R  precipitation  \
  2001-12-31       2.902656  9.816877    7.251775       3.272974       4.891393   
  2002-12-31       3.361788  8.067900    3.038989       4.177929       5.301486   
  2003-12-31       9.847927  4.671510    9.594601       8.423616       6.799374   
  2004-12-31       5.341185  3.827158    4.114981       7.839116       6.346595   

              humidity_R      wind  
  2001-12-31    4.228730  1.531332  
  2002-12-31    5.563462  6.474081  
  2003-12-31    9.836578  9.643460  
  2004-12-31    2.614979  9.741067
  

              temperature_L  pressure  humidity_L  temperature_R  precipitation  \
  2000-12-31       1.001251  2.545684    9.666708            NaN            NaN   
  2001-12-31       2.902656  9.816877    7.251775       3.272974       4.891393   
  2002-12-31       3.361788  8.067900    3.038989       4.177929       5.301486   
  2003-12-31       9.847927  4.671510    9.594601       8.423616       6.799374   
  2004-12-31       5.341185  3.827158    4.114981       7.839116       6.346595   

              humidity_R      wind  
  2000-12-31         NaN       NaN  
  2001-12-31    4.228730  1.531332  
  2002-12-31    5.563462  6.474081  
  2003-12-31    9.836578  9.643460  
  2004-12-31    2.614979  9.741067

  
              temperature_L  pressure  humidity_L  temperature_R  precipitation  \
  2001-12-31       2.902656  9.816877    7.251775       3.272974       4.891393   
  2002-12-31       3.361788  8.067900    3.038989       4.177929       5.301486   
  2003-12-31       9.847927  4.671510    9.594601       8.423616       6.799374   
  2004-12-31       5.341185  3.827158    4.114981       7.839116       6.346595   
  2005-12-31            NaN       NaN         NaN       8.017644       7.170777   

              humidity_R      wind  
  2001-12-31    4.228730  1.531332  
  2002-12-31    5.563462  6.474081  
  2003-12-31    9.836578  9.643460  
  2004-12-31    2.614979  9.741067  
  2005-12-31    4.284372  0.935709  

Afterword

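That covers the basics of combining data with Pandas. As for practical use, the most obvious case is probably batch-reading meteorological data stored in yearly files and merging it into one complete time series for later processing.

Once the Pandas series is finished, we will devote an issue to a worked example of this case.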
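Until then, here is a minimal sketch of the idea. It assumes each yearly file is a CSV with a 'date' column and a 'temperature' column; the in-memory buffers below are hypothetical stand-ins for real files on disk.

```python
import io
import pandas as pd

# Hypothetical stand-ins for yearly CSV files; with real data these would be paths
yearly_files = {
    2001: io.StringIO("date,temperature\n2001-01-01,0.8\n2001-07-01,21.1\n"),
    2000: io.StringIO("date,temperature\n2000-01-01,1.5\n2000-07-01,20.3\n"),
}

# Read each file with the dates parsed into a DatetimeIndex
frames = [
    pd.read_csv(yearly_files[year], index_col='date', parse_dates=True)
    for year in sorted(yearly_files)
]

# Stack the yearly pieces row-wise and sort into one continuous time series
series = pd.concat(frames).sort_index()
print(series)
```

With files on disk, the dictionary could be replaced by a list of paths; the read-then-concat pattern stays the same.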

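That's all for now. Today is October 24 (1024), and since most of us use code and programming to some extent, we chose today to publish on purpose. Happy Programmers' Day to everyone!

See you in the next issue!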

Manuscript: RitasCake

Proof: Philero; RitasCake

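For more updates, subscribe to our WeChat official account: Westerlies

Run the examples from this article in the cloud on the Heywhale (和鲸) community: https://www.heywhale.com/mw/project/66221ce2e584e69fbfef87ba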
