[Pandas] pandas Expanding window sem

Pandas 2.2 Window

Expanding window functions

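Method	Description
Expanding.count([numeric_only])	Counts the non-NaN values in the expanding window
Expanding.sum([numeric_only, engine, ...])	Computes the cumulative sum of the elements in the expanding window
Expanding.mean([numeric_only, engine, ...])	Computes the mean of the elements in the expanding window
Expanding.median([numeric_only, engine, ...])	Computes the median of the elements in the expanding window
Expanding.var([ddof, numeric_only, engine, ...])	Computes the variance of the elements in the expanding window
Expanding.std([ddof, numeric_only, engine, ...])	Computes the standard deviation of the elements in the expanding window
Expanding.min([numeric_only, engine, ...])	Computes the minimum of the elements in the expanding window
Expanding.max([numeric_only, engine, ...])	Computes the maximum of the elements in the expanding window
Expanding.corr([other, pairwise, ddof, ...])	Computes the correlation between two series over the expanding window
Expanding.cov([other, pairwise, ddof, ...])	Computes the covariance between two series over the expanding window
Expanding.skew([numeric_only])	Computes the skewness of the elements in the expanding window
Expanding.kurt([numeric_only])	Computes the kurtosis of the elements in the expanding window
Expanding.apply(func[, raw, engine, ...])	Applies a custom function to the expanding window
Expanding.aggregate(func, *args, **kwargs)	Applies one or more aggregation functions to the expanding window
Expanding.quantile(q[, interpolation, ...])	Computes a quantile of the elements in the expanding window
Expanding.sem([ddof, numeric_only])	Computes the standard error of the mean (SEM) of the elements in the expanding window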
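As a quick orientation for the table above, here is a minimal sketch that applies a few of the listed methods to a small Series. The data values and the `summary` name are made up for illustration.

```python
import pandas as pd

# Illustrative data; the values are arbitrary.
s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Each expanding window covers every row from the first one up to the current one.
summary = pd.DataFrame({
    "count": s.expanding().count(),
    "sum": s.expanding().sum(),
    "mean": s.expanding().mean(),
    "max": s.expanding().max(),
})
print(summary)
```

Row i of `summary` describes the first i + 1 values of `s`, so the last row matches what `s.sum()`, `s.mean()` and `s.max()` return for the whole Series.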

pandas.core.window.expanding.Expanding.sem()

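Method description

Expanding.sem() computes the standard error of the mean (SEM) over an expanding window. It returns an object with the same shape as the input, in which each element is computed from all non-NaN values between the start of the series and the current position.

The standard error of the mean is the standard deviation of the sample mean, and it measures how precisely the sample mean estimates the population mean. The textbook formula is:

$$\text{SEM} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the sample standard deviation and $n$ is the number of observations.

Because the window is expanding, each new value is added to the calculation, so the result at each position is the cumulative standard error from the start of the series up to that point.

Note that in pandas 2.2 the window implementation divides the sample standard deviation (always computed with ddof=1) by $\sqrt{n - \text{ddof}}$ rather than $\sqrt{n}$. With the default ddof=1 the result is therefore slightly larger than the textbook value, while ddof=0 reproduces the formula above. Other pandas versions may behave differently.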
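To make the textbook formula concrete, the following sketch computes the SEM of each expanding window by hand, once with `expanding().std()` and `expanding().count()` and once with an explicit loop over growing prefixes. It deliberately avoids calling `sem()` itself, so it shows the formula rather than any particular pandas version's implementation. The data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Textbook SEM per expanding window: sample standard deviation (ddof=1)
# divided by the square root of the number of observations so far.
textbook_sem = s.expanding().std(ddof=1) / np.sqrt(s.expanding().count())

# Cross-check with an explicit loop over the prefixes s[:1], s[:2], ...
values = s.to_numpy()
loop_sem = pd.Series(
    [
        np.std(values[: i + 1], ddof=1) / np.sqrt(i + 1) if i > 0 else np.nan
        for i in range(len(values))
    ],
    index=s.index,
)

print(textbook_sem)
print(np.allclose(textbook_sem, loop_sem, equal_nan=True))
```

The first entry is NaN because a single observation has no sample standard deviation. For the values 1 to 5 the remaining entries are 0.5, about 0.577, about 0.645 and about 0.707.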

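Parameters

ddof: int, default 1
- Delta Degrees of Freedom, the degrees-of-freedom adjustment
- In pandas 2.2 the result is the expanding sample standard deviation (computed with ddof=1) divided by sqrt(n - ddof), where n is the number of non-NaN values in the window
- With the default of 1 the divisor is sqrt(n - 1); with 0 it is sqrt(n), which matches the textbook SEM
numeric_only: bool, default False
- If True, only numeric columns are included in the calculation
- If False (the default), all columns are included, and non-numeric columns raise an error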
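The effect of `ddof` can be checked with a short sketch. It assumes that the installed pandas follows the pandas 2.2 behaviour, that is, the expanding sample standard deviation divided by `sqrt(count - ddof)`. The helper `sem_like_pandas_2_2` is hypothetical and exists only for this comparison.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

std = s.expanding().std()    # sample standard deviation (ddof=1)
n = s.expanding().count()    # number of non-NaN observations so far


def sem_like_pandas_2_2(ddof: int) -> pd.Series:
    """Hypothetical helper mirroring the assumed pandas 2.2 formula."""
    return std / np.sqrt(n - ddof)


for ddof in (0, 1):
    manual = sem_like_pandas_2_2(ddof)
    builtin = s.expanding().sem(ddof=ddof)
    same = np.allclose(manual, builtin, equal_nan=True)
    print(f"ddof={ddof}: manual={manual.round(6).tolist()}, matches built-in: {same}")
```

If the last field prints False, the installed pandas version computes the statistic differently from the assumption made here, and the manual values should not be read as its output.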

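Return value

An object with the same dimensions as the input (Series or DataFrame), in which each value is the standard error of all non-NaN elements from the start of the series up to the current position.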
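A small sketch of what "same dimensions" means in practice: the result keeps the shape, index and column labels of the input, so it can be joined back onto the original data directly. The DataFrame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 6.0, 9.0]})
result = df.expanding().sem()

# The result lines up with the input row by row and column by column.
print(result.shape == df.shape)
print(result.index.equals(df.index))
print(list(result.columns))

# The first row has only one observation per column, so it is all NaN.
print(result.iloc[0].isna().all())
```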

Usage example

import pandas as pd
import numpy as np

# Example 1: basic usage with a Series
print("=== Example 1: basic usage with a Series ===")
data = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("Original data:")
print(data)
print()

expanding_sem = data.expanding().sem()
print("Expanding window SEM:")
print(expanding_sem)
print()

# Example 2: data containing NaN values
print("=== Example 2: data containing NaN values ===")
data_with_nan = pd.Series([1, 2, np.nan, 4, 5, 6, 7])
print("Original data with NaN:")
print(data_with_nan)
print()

expanding_sem_nan = data_with_nan.expanding().sem()
print("Expanding window SEM:")
print(expanding_sem_nan)
print()

# Example 3: usage with a DataFrame
print("=== Example 3: usage with a DataFrame ===")
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})
print("Original DataFrame:")
print(df)
print()

expanding_sem_df = df.expanding().sem()
print("Expanding window SEM for all columns:")
print(expanding_sem_df)
print()

# Example 4: handling mixed-type data with numeric_only
print("=== Example 4: handling mixed-type data with numeric_only ===")
df_mixed = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['a', 'b', 'c', 'd', 'e']
})
print("DataFrame with numeric and non-numeric columns:")
print(df_mixed)
print()

try:
    # This raises an error because column C is not numeric
    expanding_sem_error = df_mixed.expanding().sem()
    print(expanding_sem_error)
except Exception as e:
    print(f"Error: {e}")
    print()

# Correct approach: pass numeric_only=True
expanding_sem_numeric = df_mixed.expanding().sem(numeric_only=True)
print("Expanding window SEM for numeric columns only:")
print(expanding_sem_numeric)
print()

# Example 5: effect of different ddof values
print("=== Example 5: effect of different ddof values ===")
simple_data = pd.Series([10, 20, 30, 40, 50])
print("Original data:")
print(simple_data)
print()

print("Expanding window SEM (ddof=0):")
print(simple_data.expanding().sem(ddof=0))
print()

print("Expanding window SEM (ddof=1):")
print(simple_data.expanding().sem(ddof=1))
print()

# Example 6: usage with time series data
print("=== Example 6: usage with time series data ===")
dates = pd.date_range('2023-01-01', periods=7, freq='D')
ts_data = pd.Series([100, 200, np.nan, 400, 500, 600, 700], index=dates)
print("Time series data:")
print(ts_data)
print()

expanding_ts = ts_data.expanding().sem()
print("Expanding window SEM:")
print(expanding_ts)
print()

# Example 7: standard error analysis of stock returns
print("=== Example 7: standard error analysis of stock returns ===")
stock_prices = pd.Series([100, 102, 98, 105, 110, 108, 112, 115, 113, 118])
returns = stock_prices.pct_change().dropna()  # simple period-over-period returns
print("Stock returns:")
print(returns.round(4))
print()

# Expanding window SEM of the returns
expanding_sem_returns = returns.expanding().sem()
print("Expanding window SEM of returns:")
print(expanding_sem_returns.round(6))
print()

# Combine returns, cumulative mean, and SEM in one table
cumulative_mean = returns.expanding().mean()
analysis_df = pd.DataFrame({
    'return': returns,
    'cumulative_mean': cumulative_mean,
    'sem': expanding_sem_returns
})
print("Return analysis:")
print(analysis_df.round(6))

Output

=== Example 1: basic usage with a Series ===
Original data:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

Expanding window SEM:
0         NaN
1    0.707107
2    0.707107
3    0.745356
4    0.790569
5    0.836660
6    0.881917
dtype: float64

=== Example 2: data containing NaN values ===
Original data with NaN:
0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

Expanding window SEM:
0         NaN
1    0.707107
2    0.707107
3    1.080123
4    1.054093
5    1.036822
6    1.036018
dtype: float64

=== Example 3: usage with a DataFrame ===
Original DataFrame:
   A   B    C
0  1  10  100
1  2  20  200
2  3  30  300
3  4  40  400
4  5  50  500

Expanding window SEM for all columns:
          A         B          C
0       NaN       NaN        NaN
1  0.707107  7.071068  70.710678
2  0.707107  7.071068  70.710678
3  0.745356  7.453560  74.535599
4  0.790569  7.905694  79.056942

=== Example 4: handling mixed-type data with numeric_only ===
DataFrame with numeric and non-numeric columns:
   A   B  C
0  1  10  a
1  2  20  b
2  3  30  c
3  4  40  d
4  5  50  e

Error: Cannot aggregate non-numeric type: object

Expanding window SEM for numeric columns only:
          A         B
0       NaN       NaN
1  0.707107  7.071068
2  0.707107  7.071068
3  0.745356  7.453560
4  0.790569  7.905694

=== Example 5: effect of different ddof values ===
Original data:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Expanding window SEM (ddof=0):
0         NaN
1    5.000000
2    5.773503
3    6.454972
4    7.071068
dtype: float64

Expanding window SEM (ddof=1):
0         NaN
1    7.071068
2    7.071068
3    7.453560
4    7.905694
dtype: float64

=== Example 6: usage with time series data ===
Time series data:
2023-01-01    100.0
2023-01-02    200.0
2023-01-03      NaN
2023-01-04    400.0
2023-01-05    500.0
2023-01-06    600.0
2023-01-07    700.0
Freq: D, dtype: float64

Expanding window SEM:
2023-01-01           NaN
2023-01-02     70.710678
2023-01-03     70.710678
2023-01-04    108.012345
2023-01-05    105.409255
2023-01-06    103.682207
2023-01-07    103.601802
Freq: D, dtype: float64

=== Example 7: standard error analysis of stock returns ===
Stock returns:
1    0.0200
2   -0.0392
3    0.0714
4    0.0476
5   -0.0182
6    0.0370
7    0.0268
8   -0.0174
9    0.0442
dtype: float64

Expanding window SEM of returns:
1         NaN
2    0.041872
3    0.039151
4    0.027519
5    0.022783
6    0.018614
7    0.015549
8    0.014271
9    0.012923
dtype: float64

Return analysis:
     return  cumulative_mean       sem
1  0.020000         0.020000       NaN
2 -0.039216        -0.009608  0.041872
3  0.071429         0.017404  0.039151
4  0.047619         0.024958  0.027519
5 -0.018182         0.016330  0.022783
6  0.037037         0.019781  0.018614
7  0.026786         0.020782  0.015549
8 -0.017391         0.016010  0.014271
9  0.044248         0.019148  0.012923

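Key points

sem() computes the standard error of the mean, that is, the standard deviation of the sample mean, from all non-NaN values in the expanding window
The expanding window starts at the first element of the series and grows by one element at each step up to the current position
The first value is always NaN, because a standard error cannot be computed from a single value
The ddof parameter controls the degrees-of-freedom adjustment in the calculation and defaults to 1
NaN values are skipped automatically and do not affect the standard error computed from the remaining values
The numeric_only parameter matters for data that contains non-numeric columns
In statistics, the standard error measures how precisely the sample mean estimates the population mean
In financial analysis, the standard error can be used to judge how reliable an estimate of the mean return is
The result keeps the shape of the original data, which makes follow-up analysis and plotting straightforward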
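The point about judging the reliability of a mean return can be sketched as an approximate 95% confidence band around the expanding mean. This is an illustration under stated assumptions, not a recommended procedure. It uses the textbook SEM (sample standard deviation divided by sqrt(n)) computed by hand, and the normal-approximation multiplier 1.96, which is optimistic for so few observations. The prices are the same made-up series as in Example 7.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100, 102, 98, 105, 110, 108, 112, 115, 113, 118])
returns = prices.pct_change().dropna()

exp = returns.expanding()
mean = exp.mean()

# Textbook SEM, computed by hand so it does not depend on how sem() is implemented.
sem = exp.std(ddof=1) / np.sqrt(exp.count())

# Assumption: 1.96 is the normal multiplier for a 95% interval; a t-based
# multiplier would give a wider band for small samples.
ci = pd.DataFrame({
    "mean": mean,
    "lower": mean - 1.96 * sem,
    "upper": mean + 1.96 * sem,
})
print(ci.round(4))
```

The band narrows as more returns enter the window, which is the practical meaning of a shrinking standard error.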