[AI Large Models -- NumPy-05] A Practical Guide to Statistical Analysis

05_statisticsAnalysis.py - A Practical Guide to Statistical Analysis

Learning path: step 5 of 10 | Difficulty: intermediate
Overview

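Use NumPy to walk through a complete statistical analysis workflow: descriptive statistics, correlation analysis, outlier detection, and grouped statistics, with an exam-score dataset as the worked example.
Learning objectives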

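Master NumPy's statistical functions: mean / median / std / var / percentile
Understand what the covariance matrix and the correlation coefficient mean and how they are computed
Learn two outlier-detection methods: IQR (interquartile range) and Z-Score
Implement groupby-style grouped statistics with NumPy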
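
As a small illustration of the first objective, the sketch below shows why the mean and the median are worth reporting together: one extreme value drags the mean but barely moves the median. The toy scores are made up for illustration and are not the dataset used in the script.

```python
import numpy as np

# Hypothetical toy scores; the appended 0 plays the role of an outlier
# (for example, an absent student recorded as 0).
clean = np.array([70, 72, 74, 76, 78])
with_outlier = np.append(clean, 0)

print(np.mean(clean), np.median(clean))                # 74.0 74.0
print(np.mean(with_outlier), np.median(with_outlier))  # mean falls to about 61.7, median is 73.0
```
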
Core content (6 modules)

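Module	Key points
1. Descriptive statistics	Central tendency, dispersion, distribution-shape measures
2. Covariance and correlation	Covariance matrix, Pearson correlation coefficient
3. Outlier detection	IQR method vs. Z-Score method
4. Grouped statistics	`np.unique` + boolean masks for groupby-style operations
5. Histogram analysis	Frequency distribution, binning strategies
6. Case study	End-to-end exam-score analysis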
code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
=====================================
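NumPy Statistical Analysis: A Practical Guide
=====================================

This walkthrough uses realistic (simulated) data-analysis scenarios to cover NumPy's statistical features:

1. Descriptive statistics (mean / median / variance / percentiles) on exam scores
2. Axis-wise statistics on a multi-dimensional score table (the axis parameter)
3. Covariance and correlation coefficients
4. Outlier detection (IQR / Z-Score)
5. Grouped statistics (groupby-style)
6. Histograms and frequency distributions

[Where this applies]
  Exploratory data analysis (EDA), data preprocessing, quality checks
  The foundation that higher-level libraries such as pandas / sklearn build on

Author: bloxed
Date: 2026-05-20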
"""

import numpy as np


def separator(title):
    print(f"\n{'='*60}")
    print(f"  {title}")
    print('='*60)


# ============================================================
# Part 1: Descriptive statistics
# ============================================================
separator("1. Descriptive statistics at a glance")

print("""
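[Measures of central tendency]
  mean       --- arithmetic mean (sensitive to extreme values)
  median     --- median (robust to extreme values)
  percentile --- percentile (the value at any position in the distribution)

[Measures of dispersion]
  var      --- variance
  std      --- standard deviation
  ptp      --- range (peak-to-peak = max - min)
  iqr      --- interquartile range (no NumPy function; computed as Q3 - Q1 from percentiles)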
""")

# Simulated data: exam scores for one class
np.random.seed(2024)
scores = np.round(np.random.normal(loc=72, scale=15, size=50), 1)
scores = np.clip(scores, 0, 100)  # keep scores within 0-100

print(f"Exam data ({len(scores)} students):")
print(f"  First 10: {scores[:10]} ...")

print(f"\n--- Central tendency ---")
print(f"  Mean:              {scores.mean():.1f}")
print(f"  Median:            {np.median(scores):.1f}")
print(f"  [!] A noticeable gap between mean and median suggests a skewed distribution")
print(f"  Gap: {abs(scores.mean() - np.median(scores)):.1f}")

print(f"\n--- Dispersion ---")
# .std() and .var() default to ddof=0 (population formulas); pass ddof=1 for sample estimates
print(f"  Std dev (std):     {scores.std():.1f}")
print(f"  Variance (var):    {scores.var():.1f}")
print(f"  Range (ptp):       {np.ptp(scores):.1f}")
print(f"  Minimum:           {scores.min()}")
print(f"  Maximum:           {scores.max()}")

print(f"\n--- Percentiles ---")
for q in [25, 50, 75, 90, 95]:
    val = np.percentile(scores, q)
    bar_len = int(val / 2)
    print(f"  P{q:02d} = {val:5.1f}  {'#' * bar_len}")

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
print(f"\n  IQR (Q3-Q1) = {q3:.1f} - {q1:.1f} = {iqr:.1f}")
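
# ------------------------------------------------------------
# Supplementary sketch (not part of the original script):
# population vs. sample standard deviation.
# scores.std() above uses NumPy's default ddof=0 (divide by N).
# When the data is a sample drawn from a larger population,
# ddof=1 (divide by N-1) is the usual choice.
# `describe_spread` and the toy values are illustrative only.
# ------------------------------------------------------------
def describe_spread(values):
    """Return (population std, sample std) for a 1-D sequence."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=0), values.std(ddof=1)


demo_pop_std, demo_sample_std = describe_spread([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(f"\n  ddof=0 std: {demo_pop_std:.4f}")
print(f"  ddof=1 std: {demo_sample_std:.4f}")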


# ============================================================
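# Part 2: Axis-wise statistics on multi-dimensional data
# ============================================================
separator("2. Axis-wise statistics on multi-dimensional data (the axis parameter)")

print("""
The axis parameter sets the direction of the reduction:

  axis=None (default): flatten and reduce over all elements
  axis=0:              reduce down the rows --- one result per column
  axis=1:              reduce across the columns --- one result per row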
""")

# Simulated scores for 5 students across 4 subjects
exam_data = np.array([
    [85, 92, 78, 88],  # Alice
    [76, 85, 90, 82],  # Bob
    [92, 78, 85, 95],  # Charlie
    [68, 72, 75, 70],  # Diana
    [88, 95, 82, 90],  # Eve
])

subjects = ['Math', 'English', 'Physics', 'Chemistry']
students = ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']

print(f"Score table ({len(students)} students x {len(subjects)} subjects):")
header = f"{'':>10}"
for s in subjects:
    header += f"{s:>10}"
print(header)
for i, name in enumerate(students):
    row = f"{name:>10}"
    for score in exam_data[i]:
        row += f"{score:>10}"
    print(row)

print(f"\n--- axis=0 (per-subject statistics) ---")
col_means = exam_data.mean(axis=0)
col_stds = exam_data.std(axis=0)
for subj, m, s in zip(subjects, col_means, col_stds):
    print(f"  {subj:10s}: mean={m:.1f}, std={s:.1f}")

print(f"\n--- axis=1 (per-student statistics) ---")
row_means = exam_data.mean(axis=1)
row_totals = exam_data.sum(axis=1)
for name, avg, total in zip(students, row_means, row_totals):
    print(f"  {name:10s}: mean={avg:.1f}, total={total}")

# cumsum --- cumulative statistics (cumprod works the same way for products)
print(f"\n--- cumsum (cumulative sum) ---")
print(f"  Alice's running total across subjects: {np.cumsum(exam_data[0])}")
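
# Supplementary sketch (not part of the original script): keepdims=True
# keeps the reduced axis as length 1, so the result broadcasts back
# against the original array. `demo_grades` is a made-up 2x3 table.
demo_grades = np.array([[80.0, 90.0, 100.0],
                        [60.0, 70.0, 80.0]])
demo_row_mean = demo_grades.mean(axis=1, keepdims=True)  # shape (2, 1)
demo_centered = demo_grades - demo_row_mean              # subtracts each row's own mean
print(f"\n  row means shape: {demo_row_mean.shape}")
print(f"  rows centered on their own mean:\n{demo_centered}")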


# ============================================================
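# Part 3: Covariance and correlation
# ============================================================
separator("3. Covariance matrix and correlation coefficient")

print("""
[Covariance] measures the direction in which two variables vary together
  cov(X,Y) > 0: positive relationship (Y tends to increase as X increases)
  cov(X,Y) < 0: negative relationship (Y tends to decrease as X increases)
  cov ≈ 0:      no linear relationship

[Pearson correlation coefficient] covariance normalized to the range [-1, 1]
  |r| = 1.0:        perfect linear correlation
  0.7 < |r| < 1.0:  strong correlation
  0.4 < |r| <= 0.7: moderate correlation
  |r| <= 0.4:       weak correlation
  (These cutoffs are rules of thumb and vary by field.)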
""")

# Two variables: study time vs. score
study_hours = np.array([2, 3, 5, 7, 8, 10, 12, 15, 18, 20])
test_scores = np.array([55, 58, 65, 72, 75, 82, 85, 90, 93, 96])

print(f"Study time (hours): {study_hours}")
print(f"Exam score (points): {test_scores}")

# Covariance matrix (np.cov defaults to ddof=1, i.e. the sample covariance)
cov_matrix = np.cov(study_hours, test_scores)
print(f"\nCovariance matrix:\n{cov_matrix}")
print(f"  var(hours) = {cov_matrix[0,0]:.1f}")
print(f"  var(scores)= {cov_matrix[1,1]:.1f}")
print(f"  cov(h,s)   = {cov_matrix[0,1]:.1f}")

# Correlation coefficient matrix
corr_matrix = np.corrcoef(study_hours, test_scores)
r_value = corr_matrix[0, 1]
print(f"\nCorrelation matrix:\n{corr_matrix.round(4)}")
print(f"  Correlation coefficient r = {r_value:.4f}")

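# Interpretation: strength comes from |r|, direction from the sign of r
abs_r = abs(r_value)
if abs_r > 0.9:
    strength = "very strong"
elif abs_r > 0.7:
    strength = "strong"
elif abs_r > 0.4:
    strength = "moderate"
else:
    strength = "weak"
direction = "positive" if r_value >= 0 else "negative"
print(f"\n  Interpretation: study time and score show a {strength} {direction} correlation (r={r_value:.3f})")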
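
# Supplementary sketch (not part of the original script): Pearson's r is
# the covariance divided by the product of the two standard deviations.
# All three terms must use the same ddof (ddof=1 here, matching np.cov's
# default). `pearson_r` and the toy inputs are illustrative only.
def pearson_r(x, y):
    """Compute Pearson's r from the sample covariance and sample stds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y)[0, 1]  # sample covariance (ddof=1)
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))


demo_x = [1, 2, 3, 4, 5]
demo_y = [2, 4, 5, 4, 5]
demo_r = pearson_r(demo_x, demo_y)
print(f"\n  manual r = {demo_r:.4f}, np.corrcoef = {np.corrcoef(demo_x, demo_y)[0, 1]:.4f}")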


# ============================================================
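# Part 4: Outlier detection
# ============================================================
separator("4. Outlier detection methods")

print("""
[Common outlier-detection methods]

Method 1: Z-Score
  |z| > 3 is commonly treated as an outlier (under a normal distribution, 99.7% of values lie within 3 standard deviations)
  Caveat: outliers inflate the mean and std themselves, so in small samples they can slip under the threshold

Method 2: IQR (more robust)
  Outlier range: < Q1 - 1.5*IQR  or  > Q3 + 1.5*IQR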
""")

# Sensor data containing outliers
sensor_readings = np.array([23.5, 24.1, 23.8, 24.0, 23.9,
                             999.9,  # obvious outlier (possibly a sensor fault)
                             24.2, 23.7, 24.3, -50.0,  # negative outlier
                             23.6, 24.1, 23.8, 1000.5,  # another outlier
                             23.9, 24.0, 23.7, 24.1])

print(f"Sensor data ({len(sensor_readings)} readings):")
print(f"  {sensor_readings}")

# --- Method 1: Z-Score ---
# Note: the extreme readings inflate both the mean and the std, so their
# own |z| can stay under the threshold on a sample this small (masking).
mean_val = sensor_readings.mean()
std_val = sensor_readings.std()
z_scores = np.abs((sensor_readings - mean_val) / std_val)
outlier_z = z_scores > 3

print(f"\n--- Method 1: Z-Score (threshold=3) ---")
print(f"  mean={mean_val:.1f}, std={std_val:.1f}")
print(f"  |Z-Scores|: {z_scores.round(1)}")
print(f"  Outliers: {sensor_readings[outlier_z]}")
print(f"  Outlier count: {outlier_z.sum()}")

# --- Method 2: IQR (preferred for data like this) ---
q1 = np.percentile(sensor_readings, 25)
q3 = np.percentile(sensor_readings, 75)
iqr_val = q3 - q1
lower_bound = q1 - 1.5 * iqr_val
upper_bound = q3 + 1.5 * iqr_val
outlier_iqr = (sensor_readings < lower_bound) | (sensor_readings > upper_bound)

print(f"\n--- Method 2: IQR (1.5*IQR rule) ---")
print(f"  Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr_val:.1f}")
print(f"  Normal range: [{lower_bound:.1f}, {upper_bound:.1f}]")
print(f"  Outliers: {sensor_readings[outlier_iqr]}")
print(f"  Outlier count: {outlier_iqr.sum()}")

# Statistics after cleaning
clean_data = sensor_readings[~outlier_iqr]
print(f"\n  Cleaned data: {clean_data}")
print(f"  Cleaned mean: {clean_data.mean():.2f} (vs. raw {mean_val:.2f})")
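
# Supplementary sketch (not part of the original script): a median/MAD-based
# "modified z-score". This swaps in a different technique from the plain
# Z-Score above: the median and the median absolute deviation (MAD) are
# barely moved by extreme readings, so gross outliers cannot mask themselves.
# The 0.6745 constant and the 3.5 cutoff are the commonly quoted rule of
# thumb; treat both as assumptions to tune. The helper name and the toy
# readings are illustrative only.
def modified_z_outliers(values, threshold=3.5):
    """Return a boolean mask marking values whose modified z-score exceeds threshold."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        # More than half the values are identical: there is no scale to
        # divide by, so this simplified helper flags nothing.
        return np.zeros(values.shape, dtype=bool)
    return np.abs(0.6745 * (values - med) / mad) > threshold


demo_readings = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 500.0, 9.8, -300.0])
demo_mask = modified_z_outliers(demo_readings)
print(f"\n  Modified z-score outliers: {demo_readings[demo_mask]}")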


# ============================================================
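# Part 5: Grouped statistics
# ============================================================
separator("5. Grouped statistics (emulating GroupBy)")

print("""
Getting a pandas-groupby-like result in pure NumPy:
  1. np.unique() to collect the group keys
  2. Loop over the keys and select each group with a boolean mask
  3. Compute the statistics for each group separately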
""")

# Sales data: region, amount
regions = np.random.choice(['North', 'South', 'East', 'West'], size=200)
amounts = np.random.normal(loc=500, scale=200, size=200).clip(50, 2000).round(2)

print(f"Sales records: {len(regions)}")
print(f"Records per region: North={np.sum(regions=='North')}, South={np.sum(regions=='South')}, "
      f"East={np.sum(regions=='East')}, West={np.sum(regions=='West')}")

print(f"\n{'Region':<8} {'Count':>6} {'Total':>12} {'Mean':>10} {'Median':>10} {'Max':>10}")
print("-" * 61)

for region in np.unique(regions):
    mask = regions == region
    region_amounts = amounts[mask]
    print(f"{region:<8} {mask.sum():>6} {region_amounts.sum():>12,.1f} "
          f"{region_amounts.mean():>10.1f} {np.median(region_amounts):>10.1f} "
          f"{region_amounts.max():>10.1f}")
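
# Supplementary sketch (not part of the original script): per-group sums
# and means without a Python loop. np.unique(..., return_inverse=True) maps
# every record to the index of its group key, and np.bincount with `weights`
# adds up the values per index. Toy data and names are illustrative only.
demo_keys = np.array(['B', 'A', 'B', 'C', 'A', 'B'])
demo_vals = np.array([10.0, 1.0, 20.0, 5.0, 3.0, 30.0])
demo_groups, demo_inverse = np.unique(demo_keys, return_inverse=True)
demo_counts = np.bincount(demo_inverse)                   # records per group
demo_sums = np.bincount(demo_inverse, weights=demo_vals)  # total per group
demo_means = demo_sums / demo_counts
print()
for g, c, s, m in zip(demo_groups, demo_counts, demo_sums, demo_means):
    print(f"  {g}: n={c}, sum={s:.1f}, mean={m:.1f}")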


# ============================================================
# Part 6: Frequency distribution and histograms
# ============================================================
separator("6. Frequency distribution and histogram analysis")

# Generate normally distributed data
data_normal = np.random.normal(loc=100, scale=20, size=1000)

# Compute the histogram by hand (no matplotlib)
hist_counts, bin_edges = np.histogram(data_normal, bins=10)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2  # bin midpoints (handy for plotting; unused below)

print("Text histogram (10 bins):")
print(f"{'Bin range':>16} {'Count':>6} {'Freq %':>8} {'Bar'}")
print("-" * 60)
total = len(data_normal)
# np.histogram bins are half-open [lo, hi), except the last one, which also includes hi
for i in range(len(hist_counts)):
    lo, hi = bin_edges[i], bin_edges[i+1]
    pct = hist_counts[i] / total * 100
    bar = '#' * int(pct / 2)
    print(f"[{lo:6.1f}, {hi:6.1f})  {hist_counts[i]:>5} {pct:>7.1f}%  {bar}")

# Other useful distribution statistics
print(f"\nOther distribution info:")
print(f"  Modal bin: bin[{hist_counts.argmax()}] ({bin_edges[hist_counts.argmax()]:.1f}-{bin_edges[hist_counts.argmax()+1]:.1f})")
print(f"  Rough skew estimate: (mean - median) / std = {(data_normal.mean()-np.median(data_normal))/data_normal.std():.3f}")
print(f"  (positive: right-skewed, negative: left-skewed)")
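
# Supplementary sketch (not part of the original script): np.digitize
# assigns each value to a bucket defined by explicit edges, and np.bincount
# then counts the bucket indices. The grade bands below are made up.
demo_marks = np.array([45, 59, 60, 72, 85, 90, 100])
demo_edges = np.array([60, 70, 80, 90])           # F: <60, D, C, B, A: >=90
demo_bands = np.digitize(demo_marks, demo_edges)  # bucket index 0..4 per mark
demo_band_counts = np.bincount(demo_bands, minlength=len(demo_edges) + 1)
print()
for label, n in zip(['F', 'D', 'C', 'B', 'A'], demo_band_counts):
    print(f"  {label}: {n}")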


# ============================================================
# Summary
# ============================================================
separator("Summary: NumPy statistics cheat sheet")

summary = """
+------------------------------------------------------------+
|  NumPy statistics cheat sheet                              |
+------------------------------------------------------------+
|                                                            |
|  [Central tendency]                                        |
|  np.mean()      --- arithmetic mean                        |
|  np.median()    --- median                                 |
|  np.percentile()--- any percentile                         |
|                                                            |
|  [Dispersion]                                              |
|  np.std()       --- std dev (ddof=0 pop., ddof=1 sample)   |
|  np.var()       --- variance                               |
|  np.ptp()       --- range (max - min)                      |
|                                                            |
|  [Correlation]                                             |
|  np.cov()       --- covariance matrix                      |
|  np.corrcoef()  --- Pearson correlation matrix             |
|                                                            |
|  [Cumulative]                                              |
|  np.cumsum()    --- cumulative sum                         |
|  np.cumprod()   --- cumulative product                     |
|                                                            |
|  [Distribution]                                            |
|  np.histogram() --- histogram (counts + bin edges)         |
|  np.bincount()  --- counts of non-negative ints            |
|  np.digitize()  --- assign values to bins                  |
|                                                            |
|  [Key parameters]                                          |
|  axis --- reduction axis (0: per column, 1: per row)       |
|  keepdims --- keep reduced dims (eases broadcasting)       |
+------------------------------------------------------------+
"""
print(summary)

print("\nDone!")