Contents
- 11.2 Time Series Basics
- 1 Indexing, Selection, Subsetting
- 2 Time Series with Duplicate Indices
11.2 Time Series Basics
In pandas, a basic time series object is a `Series` indexed by timestamps. Outside pandas, those timestamps are usually represented as Python strings or `datetime` objects:
python
import pandas as pd
import numpy as np
from datetime import datetime
python
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
python
ts = pd.Series(np.random.randn(6), index=dates)
ts
2011-01-02 0.384868
2011-01-05 0.669181
2011-01-07 2.553288
2011-01-08 -1.808783
2011-01-10 1.180570
2011-01-12 -0.928942
dtype: float64
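Since dates often arrive as plain strings, here is a minimal sketch of building the same kind of object from strings instead of `datetime` objects. The names `raw_dates`, `readings` and `ts_from_strings` are made up for this illustration; `pd.to_datetime` does the parsing.

```python
import pandas as pd

# Hypothetical raw input: dates as plain strings, one reading per date.
raw_dates = ["2011-01-02", "2011-01-05", "2011-01-07"]
readings = [1.0, 2.0, 3.0]

# pd.to_datetime parses the list of strings into a DatetimeIndex.
index = pd.to_datetime(raw_dates)
ts_from_strings = pd.Series(readings, index=index)

print(type(ts_from_strings.index).__name__)
print(ts_from_strings)
```

Parsing once up front keeps the index as real timestamps, so the date-based selection described in the rest of this section applies to it.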
Under the hood, these `datetime` objects have been put into a `DatetimeIndex`:
python
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
As with any other `Series`, arithmetic between differently indexed time series automatically aligns on the dates:
python
ts[::2]
2011-01-02 0.384868
2011-01-07 2.553288
2011-01-10 1.180570
dtype: float64
python
ts + ts[::2]
2011-01-02 0.769735
2011-01-05 NaN
2011-01-07 5.106575
2011-01-08 NaN
2011-01-10 2.361140
2011-01-12 NaN
dtype: float64
Here `ts[::2]` selects every second element of `ts`, starting with the first.
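To make the alignment easier to follow than with random numbers, the sketch below repeats the idea with small fixed values (the values and the names `full`, `every_other` and `total` are chosen only for this illustration). Dates present in only one operand come out as `NaN`.

```python
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07", "2011-01-08"])
full = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# Positional slice: keeps the rows for 2011-01-02 and 2011-01-07.
every_other = full[::2]

# Addition aligns on the date labels, not on position.
total = full + every_other
print(total)
```

The two dates missing from `every_other` have no partner to add to, which is why they end up as missing values rather than being dropped.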
pandas stores timestamps using NumPy's `datetime64` data type, at nanosecond resolution in the output shown here:
python
ts.index.dtype
dtype('<M8[ns]')
Scalar values from a `DatetimeIndex` are pandas `Timestamp` objects:
python
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
A `Timestamp` can be used in most places where you would use a `datetime` object.
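A small check of the relationship between the two types, as a sketch: `Timestamp` is a subclass of the standard library's `datetime`, so it passes `isinstance` checks and compares equal to the matching `datetime` value. The variable names are illustrative only.

```python
from datetime import datetime

import pandas as pd

index = pd.DatetimeIndex([datetime(2011, 1, 2), datetime(2011, 1, 5)])
stamp = index[0]

# Timestamp subclasses datetime, so code expecting a datetime accepts it.
is_datetime = isinstance(stamp, datetime)

# Comparison against a plain datetime for the same moment.
same_moment = stamp == datetime(2011, 1, 2)

# 'M' is NumPy's kind code for datetime64 dtypes.
dtype_kind = index.dtype.kind

print(is_datetime, same_moment, dtype_kind)
```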
1 Indexing, Selection, Subsetting
When you index and select data by label, a time series behaves like any other `pandas.Series`:
python
ts
2011-01-02 0.384868
2011-01-05 0.669181
2011-01-07 2.553288
2011-01-08 -1.808783
2011-01-10 1.180570
2011-01-12 -0.928942
dtype: float64
python
stamp = ts.index[2]
python
ts[stamp]
2.5532875030792592
As a convenience, you can also pass a string that can be interpreted as a date:
python
ts['1/10/2011']
1.1805698813038874
python
ts['20110110']
1.1805698813038874
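The sketch below checks that the different spellings of one date all resolve to the same element. It uses fixed values and `.loc`, which is an assumption of this example rather than something the text above requires; note that `'1/10/2011'` is read month first, as January 10.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
s = pd.Series([10.0, 20.0, 30.0], index=dates)

by_slash = s.loc["1/10/2011"]             # month/day/year string
by_compact = s.loc["20110110"]            # compact YYYYMMDD string
by_datetime = s.loc[datetime(2011, 1, 10)]  # plain datetime object

print(by_slash, by_compact, by_datetime)
```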
For longer time series, you can pass just a year, or a year and month, to select a slice of the data:
python
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts
2000-01-01 -0.801668
2000-01-02 -0.325797
2000-01-03 0.047318
2000-01-04 0.239576
2000-01-05 -0.467691
2000-01-06 1.394063
2000-01-07 0.416262
2000-01-08 -0.739839
2000-01-09 -1.504631
2000-01-10 -0.798753
2000-01-11 0.758856
2000-01-12 1.163517
2000-01-13 1.233826
2000-01-14 0.675056
2000-01-15 -1.079219
2000-01-16 0.212076
2000-01-17 -0.242134
2000-01-18 -0.318024
2000-01-19 0.040686
2000-01-20 -1.342025
2000-01-21 -0.130905
2000-01-22 -0.122308
2000-01-23 -0.924727
2000-01-24 0.071544
2000-01-25 0.483302
2000-01-26 -0.264231
2000-01-27 0.815791
2000-01-28 0.652885
2000-01-29 0.203818
2000-01-30 0.007890
...
2002-08-28 -2.375283
2002-08-29 0.843647
2002-08-30 0.069483
2002-08-31 -1.151590
2002-09-01 -2.348154
2002-09-02 -0.309723
2002-09-03 -1.017466
2002-09-04 -2.078659
2002-09-05 -1.828568
2002-09-06 0.546299
2002-09-07 0.861304
2002-09-08 -0.823128
2002-09-09 -0.150047
2002-09-10 -1.984674
2002-09-11 0.468010
2002-09-12 -0.066440
2002-09-13 -1.629502
2002-09-14 0.044870
2002-09-15 0.007970
2002-09-16 0.812104
2002-09-17 -1.835575
2002-09-18 -0.218055
2002-09-19 -0.271351
2002-09-20 -1.852212
2002-09-21 0.546462
2002-09-22 0.776960
2002-09-23 -1.140997
2002-09-24 -2.213685
2002-09-25 -0.586588
2002-09-26 -1.472430
Freq: D, dtype: float64
python
longer_ts['2001']
2001-01-01 0.588405
2001-01-02 -3.027909
2001-01-03 -0.492280
2001-01-04 -0.807809
2001-01-05 -0.124139
2001-01-06 -0.198966
2001-01-07 2.015447
2001-01-08 1.454119
2001-01-09 0.157505
2001-01-10 1.077689
2001-01-11 -0.246538
2001-01-12 -0.865122
2001-01-13 -0.082186
2001-01-14 1.928050
2001-01-15 0.320741
2001-01-16 0.473770
2001-01-17 0.036649
2001-01-18 1.405034
2001-01-19 0.560502
2001-01-20 -0.695138
2001-01-21 3.318884
2001-01-22 1.704966
2001-01-23 0.145167
2001-01-24 0.366667
2001-01-25 -0.565675
2001-01-26 0.940406
2001-01-27 -1.468772
2001-01-28 0.098759
2001-01-29 0.267449
2001-01-30 -0.221643
...
2001-12-02 0.002522
2001-12-03 -0.046712
2001-12-04 1.825249
2001-12-05 -1.000655
2001-12-06 -0.807582
2001-12-07 0.750439
2001-12-08 1.531707
2001-12-09 -0.195239
2001-12-10 -0.087465
2001-12-11 -0.041450
2001-12-12 1.992200
2001-12-13 -0.294916
2001-12-14 1.215363
2001-12-15 0.029039
2001-12-16 -0.165380
2001-12-17 1.192535
2001-12-18 0.737760
2001-12-19 0.044022
2001-12-20 0.582560
2001-12-21 -0.213569
2001-12-22 -0.024512
2001-12-23 -1.140873
2001-12-24 -1.351333
2001-12-25 0.725253
2001-12-26 -0.943740
2001-12-27 -2.134039
2001-12-28 -0.548597
2001-12-29 1.497907
2001-12-30 -0.594708
2001-12-31 0.068177
Freq: D, dtype: float64
Here the string `'2001'` is interpreted as a year and selects that whole period. This also works if you specify the month:
python
longer_ts['2001-05']
2001-05-01 -0.560227
2001-05-02 2.160259
2001-05-03 -0.826092
2001-05-04 -0.183020
2001-05-05 -0.294708
2001-05-06 -1.210785
2001-05-07 0.609260
2001-05-08 -1.155377
2001-05-09 -0.127132
2001-05-10 0.576327
2001-05-11 -0.955206
2001-05-12 -2.002019
2001-05-13 -0.969865
2001-05-14 0.820993
2001-05-15 0.557336
2001-05-16 -0.262222
2001-05-17 -0.086760
2001-05-18 0.151608
2001-05-19 1.097604
2001-05-20 0.212148
2001-05-21 -1.164944
2001-05-22 -0.100020
2001-05-23 0.734738
2001-05-24 1.730438
2001-05-25 1.352858
2001-05-26 0.644984
2001-05-27 0.997554
2001-05-28 1.434452
2001-05-29 0.395946
2001-05-30 -0.142523
2001-05-31 1.205485
Freq: D, dtype: float64
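Because the outputs above are long, a compact way to confirm what a partial date string selects is to count the rows. This is a sketch on a series with the same daily index but with the position as the value, so the result is reproducible; `longer`, `year_2001` and `may_2001` are names invented for the example.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=1000)  # daily frequency
longer = pd.Series(np.arange(1000), index=idx)

year_2001 = longer.loc["2001"]     # every day in 2001
may_2001 = longer.loc["2001-05"]   # every day in May 2001

print(len(year_2001), len(may_2001))
print(year_2001.index[0], year_2001.index[-1])
```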
Indexing with a `datetime` object works as well:
python
ts[datetime(2011, 1, 7)]
2.5532875030792592
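Because most time series data is ordered chronologically, you can slice with timestamps to select a range of dates. The endpoints do not have to be present in the index: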
python
ts
2011-01-02 0.384868
2011-01-05 0.669181
2011-01-07 2.553288
2011-01-08 -1.808783
2011-01-10 1.180570
2011-01-12 -0.928942
dtype: float64
python
ts['1/6/2011':'1/11/2011']
2011-01-07 2.553288
2011-01-08 -1.808783
2011-01-10 1.180570
dtype: float64
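The same range query on fixed values, as a sketch: neither endpoint exists in the index, and any date that falls between them is returned. The series `s` and its values are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07",
                        "2011-01-08", "2011-01-10", "2011-01-12"])
s = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# 2011-01-06 and 2011-01-11 are not in the index; the slice returns
# every row whose date lies between them.
window = s.loc["2011-01-06":"2011-01-11"]
print(window)
```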
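Remember that in pandas versions without Copy-on-Write, slicing in this manner produces a view on the source time series: no data is copied, and modifications to the slice are reflected in the original data. With Copy-on-Write enabled (the default from pandas 3.0), modifying the slice no longer changes the original.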
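Whether a change to a slice reaches the original can depend on the pandas version and its settings, so one defensive option is to take an explicit copy before modifying a slice. The sketch below assumes nothing beyond `Series.copy`; the names are illustrative.

```python
import pandas as pd

ts_small = pd.Series([1.0, 2.0, 3.0, 4.0],
                     index=pd.date_range("2011-01-01", periods=4))

# .copy() gives the slice its own data, independent of the source.
window = ts_small.loc["2011-01-02":"2011-01-03"].copy()
window.iloc[0] = 99.0

print(window)
print(ts_small)  # unchanged
```

With the copy in place the original is left alone regardless of how the installed pandas version treats slices.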
There is an equivalent instance method, `truncate`, that slices a `Series` between two dates:
python
ts.truncate(after='1/9/2011')
2011-01-02 0.384868
2011-01-05 0.669181
2011-01-07 2.553288
2011-01-08 -1.808783
dtype: float64
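As a sketch of the equivalence, `truncate` with both `before` and `after` should give the same rows as the label slice over the same two dates. Fixed values are used so the comparison is reproducible; `trimmed` and `sliced` are names made up here.

```python
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07",
                        "2011-01-08", "2011-01-10", "2011-01-12"])
s = pd.Series([0, 1, 2, 3, 4, 5], index=dates)

# Drop everything before 2011-01-06 and after 2011-01-11.
trimmed = s.truncate(before="2011-01-06", after="2011-01-11")

# The same range expressed as a label slice.
sliced = s.loc["2011-01-06":"2011-01-11"]

print(trimmed.equals(sliced))
```

Note that `truncate` expects a sorted index, which is the case here.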
All of this holds true for a `DataFrame` as well, where the indexing applies to its rows:
python
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
python
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])
python
long_df.loc['5-2001']
|  | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | -0.477517 | 0.722685 | 0.337141 | -0.345072 |
| 2001-05-09 | -0.401860 | -0.475821 | 0.685129 | -0.809288 |
| 2001-05-16 | 1.900541 | 0.348590 | -0.805042 | -0.410077 |
| 2001-05-23 | -0.220870 | 1.654666 | -0.846395 | -0.207802 |
| 2001-05-30 | 2.094319 | -0.972588 | 1.276059 | -1.056146 |
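A reproducible version of the row selection, as a sketch: the frame below has the same weekly Wednesday index, but is filled with consecutive integers instead of random numbers, and the month is written as `'2001-05'`. The names `df` and `may` are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=100, freq="W-WED")
df = pd.DataFrame(np.arange(400).reshape(100, 4),
                  index=idx,
                  columns=["Colorado", "Texas", "New York", "Ohio"])

# A year-month string selects every row whose date falls in that month.
may = df.loc["2001-05"]
print(may)
```

May 2001 contains five Wednesdays, so five rows come back, with all four columns.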
2 Time Series with Duplicate Indices
In some data, several observations may fall on the same timestamp:
python
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
python
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int64
We can tell whether the index is unique by checking its `is_unique` property:
python
dup_ts.index.is_unique
False
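`is_unique` only says that duplicates exist. To see which rows are involved, one option is `Index.duplicated`, sketched below with `keep=False` so that every member of a repeated group is flagged; `mask` and `repeated` are names chosen for this example.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_ts = pd.Series(np.arange(5), index=dates)

# keep=False marks all occurrences of a repeated timestamp, not just
# the second and later ones.
mask = dup_ts.index.duplicated(keep=False)
repeated = dup_ts[mask]

print(mask)
print(repeated)
```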
Indexing into this time series now produces either a scalar value or a slice, depending on whether the timestamp is duplicated:
python
dup_ts['1/3/2000'] # not duplicated
4
python
dup_ts['1/2/2000'] # duplicated
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int64
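Since the return type changes with the data, code that consumes the result may want to check what it got back. The sketch below looks up one unique and one repeated timestamp with `.loc` and `pd.Timestamp` keys, which is a choice made for this example, and inspects the results.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_ts = pd.Series(np.arange(5), index=dates)

single = dup_ts.loc[pd.Timestamp("2000-01-03")]  # timestamp occurs once
several = dup_ts.loc[pd.Timestamp("2000-01-02")]  # timestamp occurs three times

print(type(single).__name__, single)
print(type(several).__name__, len(several))
```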
Suppose you want to aggregate the data that share a timestamp. One way to do this is to use `groupby` and pass `level=0`:
python
grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int64
python
grouped.count()
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
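After grouping on the index level, the result has one row per timestamp, so its index is unique again. A short sketch that checks this; `deduped` and `counts` are illustrative names, and the printed dtype of the mean may differ between pandas versions.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_ts = pd.Series(np.arange(5), index=dates)

grouped = dup_ts.groupby(level=0)  # level=0 groups on the index itself
deduped = grouped.mean()           # one averaged value per timestamp
counts = grouped.count()           # how many observations were combined

print(deduped)
print(counts)
print(deduped.index.is_unique)
```

Keeping the counts alongside the means makes it clear which timestamps were actually averaged.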