pandas Tutorial: Time Series Basics

Contents

  • [11.2 Time Series Basics](#11.2 Time Series Basics(时间序列基础))
  • [1 Indexing, Selection, Subsetting](#1 Indexing, Selection, Subsetting(索引,选择,取子集))
  • [2 Time Series with Duplicate Indices](#2 Time Series with Duplicate Indices(重复索引的时间序列))

11.2 Time Series Basics

A basic kind of time series object in pandas is a Series indexed by timestamps. Outside pandas, those timestamps are usually represented as Python strings or datetime objects:

python
import pandas as pd
import numpy as np
from datetime import datetime
python
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8), 
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
python
ts = pd.Series(np.random.randn(6), index=dates)
ts
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
2011-01-12   -0.928942
dtype: float64

Under the hood, the datetime objects have been placed in a DatetimeIndex:

python
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
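
When the dates arrive as strings rather than datetime objects, one way to get the same kind of index is pd.to_datetime. The snippet below is a minimal sketch; the date strings, the values, and the variable names are illustrative and not part of the original example.

```python
import pandas as pd

# Illustrative date strings (assumed input; ISO format avoids day/month ambiguity).
date_strings = ['2011-01-02', '2011-01-05', '2011-01-07']

# to_datetime on a list of strings returns a DatetimeIndex.
index = pd.to_datetime(date_strings)
ts_from_strings = pd.Series([1.0, 2.0, 3.0], index=index)

print(type(ts_from_strings.index).__name__)
```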

As with any other Series, arithmetic between differently indexed time series automatically aligns on the dates:

python
ts[::2]
2011-01-02    0.384868
2011-01-07    2.553288
2011-01-10    1.180570
dtype: float64
python
ts + ts[::2]
2011-01-02    0.769735
2011-01-05         NaN
2011-01-07    5.106575
2011-01-08         NaN
2011-01-10    2.361140
2011-01-12         NaN
dtype: float64

ts[::2] selects every second element of ts (positions 0, 2, 4), so the dates missing from it produce NaN in the sum.
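
If NaN for the unmatched dates is not what you want, Series.add accepts a fill_value. The sketch below uses fixed values instead of the random ones above so that the result is reproducible; the names every_other, plain_sum, and filled_sum are only illustrative.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
# Fixed values (an assumption for reproducibility, not the article's data).
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

every_other = ts[::2]                            # positions 0, 2, 4
plain_sum = ts + every_other                     # unmatched dates become NaN
filled_sum = ts.add(every_other, fill_value=0)   # unmatched dates keep the value from ts

print(plain_sum)
print(filled_sum)
```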

pandas stores timestamps using NumPy's datetime64 data type, here at nanosecond resolution:

python
ts.index.dtype
dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

python
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object.
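
The substitution works because Timestamp is a subclass of datetime. A short check, with illustrative variable names:

```python
from datetime import datetime, timedelta

import pandas as pd

stamp = pd.Timestamp('2011-01-02')

is_datetime = isinstance(stamp, datetime)      # Timestamp subclasses datetime
same_moment = stamp == datetime(2011, 1, 2)    # compares equal to a plain datetime
next_day = stamp + timedelta(days=1)           # standard-library arithmetic works
back_to_datetime = stamp.to_pydatetime()       # explicit conversion when required

print(is_datetime, same_moment, next_day)
```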

1 Indexing, Selection, Subsetting

A time series behaves like any other pandas.Series when you index and select data based on labels:

python
ts
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
2011-01-12   -0.928942
dtype: float64
python
stamp = ts.index[2]
python
ts[stamp]
2.5532875030792592

As a convenience, you can also pass a string that can be interpreted as a date:

python
ts['1/10/2011']
1.1805698813038874
python
ts['20110110']
1.1805698813038874
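
Several spellings of the same day resolve to the same row. The sketch below repeats the lookup with fixed values (an assumption, so the result can be checked); note that a string such as '1/10/2011' is read month first, which is why it means 10 January.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
# Fixed values instead of random ones, for reproducibility.
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

by_us_style = ts['1/10/2011']   # month/day/year
by_compact = ts['20110110']     # compact YYYYMMDD
by_iso = ts['2011-01-10']       # ISO 8601

print(by_us_style, by_compact, by_iso)
```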

For longer time series, you can pass just a year, or a year and month, to select slices of data:

python
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts
2000-01-01   -0.801668
2000-01-02   -0.325797
2000-01-03    0.047318
2000-01-04    0.239576
2000-01-05   -0.467691
2000-01-06    1.394063
2000-01-07    0.416262
2000-01-08   -0.739839
2000-01-09   -1.504631
2000-01-10   -0.798753
2000-01-11    0.758856
2000-01-12    1.163517
2000-01-13    1.233826
2000-01-14    0.675056
2000-01-15   -1.079219
2000-01-16    0.212076
2000-01-17   -0.242134
2000-01-18   -0.318024
2000-01-19    0.040686
2000-01-20   -1.342025
2000-01-21   -0.130905
2000-01-22   -0.122308
2000-01-23   -0.924727
2000-01-24    0.071544
2000-01-25    0.483302
2000-01-26   -0.264231
2000-01-27    0.815791
2000-01-28    0.652885
2000-01-29    0.203818
2000-01-30    0.007890
                ...   
2002-08-28   -2.375283
2002-08-29    0.843647
2002-08-30    0.069483
2002-08-31   -1.151590
2002-09-01   -2.348154
2002-09-02   -0.309723
2002-09-03   -1.017466
2002-09-04   -2.078659
2002-09-05   -1.828568
2002-09-06    0.546299
2002-09-07    0.861304
2002-09-08   -0.823128
2002-09-09   -0.150047
2002-09-10   -1.984674
2002-09-11    0.468010
2002-09-12   -0.066440
2002-09-13   -1.629502
2002-09-14    0.044870
2002-09-15    0.007970
2002-09-16    0.812104
2002-09-17   -1.835575
2002-09-18   -0.218055
2002-09-19   -0.271351
2002-09-20   -1.852212
2002-09-21    0.546462
2002-09-22    0.776960
2002-09-23   -1.140997
2002-09-24   -2.213685
2002-09-25   -0.586588
2002-09-26   -1.472430
Freq: D, dtype: float64
python
longer_ts['2001']
2001-01-01    0.588405
2001-01-02   -3.027909
2001-01-03   -0.492280
2001-01-04   -0.807809
2001-01-05   -0.124139
2001-01-06   -0.198966
2001-01-07    2.015447
2001-01-08    1.454119
2001-01-09    0.157505
2001-01-10    1.077689
2001-01-11   -0.246538
2001-01-12   -0.865122
2001-01-13   -0.082186
2001-01-14    1.928050
2001-01-15    0.320741
2001-01-16    0.473770
2001-01-17    0.036649
2001-01-18    1.405034
2001-01-19    0.560502
2001-01-20   -0.695138
2001-01-21    3.318884
2001-01-22    1.704966
2001-01-23    0.145167
2001-01-24    0.366667
2001-01-25   -0.565675
2001-01-26    0.940406
2001-01-27   -1.468772
2001-01-28    0.098759
2001-01-29    0.267449
2001-01-30   -0.221643
                ...   
2001-12-02    0.002522
2001-12-03   -0.046712
2001-12-04    1.825249
2001-12-05   -1.000655
2001-12-06   -0.807582
2001-12-07    0.750439
2001-12-08    1.531707
2001-12-09   -0.195239
2001-12-10   -0.087465
2001-12-11   -0.041450
2001-12-12    1.992200
2001-12-13   -0.294916
2001-12-14    1.215363
2001-12-15    0.029039
2001-12-16   -0.165380
2001-12-17    1.192535
2001-12-18    0.737760
2001-12-19    0.044022
2001-12-20    0.582560
2001-12-21   -0.213569
2001-12-22   -0.024512
2001-12-23   -1.140873
2001-12-24   -1.351333
2001-12-25    0.725253
2001-12-26   -0.943740
2001-12-27   -2.134039
2001-12-28   -0.548597
2001-12-29    1.497907
2001-12-30   -0.594708
2001-12-31    0.068177
Freq: D, dtype: float64

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

python
longer_ts['2001-05']
2001-05-01   -0.560227
2001-05-02    2.160259
2001-05-03   -0.826092
2001-05-04   -0.183020
2001-05-05   -0.294708
2001-05-06   -1.210785
2001-05-07    0.609260
2001-05-08   -1.155377
2001-05-09   -0.127132
2001-05-10    0.576327
2001-05-11   -0.955206
2001-05-12   -2.002019
2001-05-13   -0.969865
2001-05-14    0.820993
2001-05-15    0.557336
2001-05-16   -0.262222
2001-05-17   -0.086760
2001-05-18    0.151608
2001-05-19    1.097604
2001-05-20    0.212148
2001-05-21   -1.164944
2001-05-22   -0.100020
2001-05-23    0.734738
2001-05-24    1.730438
2001-05-25    1.352858
2001-05-26    0.644984
2001-05-27    0.997554
2001-05-28    1.434452
2001-05-29    0.395946
2001-05-30   -0.142523
2001-05-31    1.205485
Freq: D, dtype: float64
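
Partial strings can also serve as slice bounds, in which case the end bound covers the whole period it names. The following sketch rebuilds a 1000-day series with deterministic values (an assumption, in place of the random data) and counts the selected rows:

```python
import numpy as np
import pandas as pd

index = pd.date_range('2000-01-01', periods=1000)
# Deterministic values so the selections are easy to verify.
longer_ts = pd.Series(np.arange(1000.0), index=index)

year_2001 = longer_ts['2001']                 # every day of 2001
may_2001 = longer_ts['2001-05']               # every day of May 2001
may_to_june = longer_ts['2001-05':'2001-06']  # 1 May through 30 June, inclusive

print(len(year_2001), len(may_2001), len(may_to_june))
```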

Indexing with a datetime object works as well:

python
ts[datetime(2011, 1, 7)]
2.5532875030792592
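
datetime objects can also act as slice bounds, not only as single keys. A small sketch with fixed values (assumed for reproducibility) and illustrative variable names:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Open-ended slice: everything from 7 January onward.
from_seventh = ts[datetime(2011, 1, 7):]

# Both bounds given; neither date is in the index, and label slicing is inclusive.
between = ts[datetime(2011, 1, 6):datetime(2011, 1, 11)]

print(from_seventh)
print(between)
```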

Because most time series data is ordered chronologically, you can slice with timestamps that are not contained in the index to perform a range query:

python
ts
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
2011-01-12   -0.928942
dtype: float64
python
ts['1/6/2011':'1/11/2011']
2011-01-07    2.553288
2011-01-08   -1.808783
2011-01-10    1.180570
dtype: float64

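Keep in mind that slicing this way has traditionally produced a view on the source time series: no data is copied, and modifications to the slice are reflected in the original. With Copy-on-Write enabled (the default from pandas 3.0), the slice behaves like a copy instead, so it is safer not to rely on this side effect.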
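
If you plan to modify the selected range, taking an explicit copy makes the intent clear and behaves the same way regardless of pandas version. A minimal sketch, assuming fixed values and the illustrative name window:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# .copy() detaches the slice from ts, so edits stay local to window.
window = ts['2011-01-06':'2011-01-11'].copy()
window.iloc[0] = 0.0

print(window.iloc[0], ts.iloc[2])
```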

An equivalent instance method, truncate, slices a Series between two dates:

python
ts.truncate(after='1/9/2011')
2011-01-02    0.384868
2011-01-05    0.669181
2011-01-07    2.553288
2011-01-08   -1.808783
dtype: float64
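
truncate also takes a before argument, so both ends can be cut in one call. The sketch below, using fixed values as an assumption, checks that the result matches the corresponding label slice:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Drop everything before 6 January and after 11 January.
trimmed = ts.truncate(before='2011-01-06', after='2011-01-11')
sliced = ts['2011-01-06':'2011-01-11']

print(trimmed.equals(sliced))
```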

All of this holds true for DataFrame as well, indexing on its rows:

python
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
python
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])
python
long_df.loc['5-2001']

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | -0.477517 | 0.722685 | 0.337141 | -0.345072 |
| 2001-05-09 | -0.401860 | -0.475821 | 0.685129 | -0.809288 |
| 2001-05-16 | 1.900541 | 0.348590 | -0.805042 | -0.410077 |
| 2001-05-23 | -0.220870 | 1.654666 | -0.846395 | -0.207802 |
| 2001-05-30 | 2.094319 | -0.972588 | 1.276059 | -1.056146 |
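
With .loc, a partial date string for the rows can be combined with a column selection. The sketch below rebuilds a frame of the same shape with a seeded generator (an assumption, so the numbers differ from the table above) and picks two columns for May 2001:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
long_df = pd.DataFrame(rng.standard_normal((100, 4)),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])

may_rows = long_df.loc['2001-05']                       # all Wednesdays in May 2001
may_subset = long_df.loc['2001-05', ['Texas', 'Ohio']]  # same rows, two columns

print(may_rows.shape, may_subset.shape)
```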

2 Time Series with Duplicate Indices

In some applications, there may be multiple data observations falling on the same timestamp:

python
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', 
                          '1/2/2000', '1/3/2000'])
python
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64

We can tell that the index is not unique by checking its is_unique property:

python
dup_ts.index.is_unique
False

Indexing into this time series now produces either a scalar value or a slice, depending on whether the timestamp is duplicated:

python
dup_ts['1/3/2000'] # not duplicated
4
python
dup_ts['1/2/2000'] # duplicated
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64
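
Because the return type depends on the data, it can help to locate the repeated timestamps first. One possible approach, sketched here with illustrative names, is Index.duplicated, which can also be used to keep only the first observation per timestamp:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(np.arange(5), index=dates)

# keep=False marks every occurrence of a repeated timestamp.
dup_mask = dup_ts.index.duplicated(keep=False)

# keep='first' marks only the later repeats, so negating it keeps the first one.
first_only = dup_ts[~dup_ts.index.duplicated(keep='first')]

print(dup_mask)
print(first_only)
```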

Suppose you want to aggregate the data that share a timestamp. One way to do this is to use groupby and pass level=0:

python
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int64
python
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64