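[Python Course Series] Pandas (Part 4): Data Statistics and Sorting with describe, sort_values, and sample

📊 Reading time: 16 minutes | Keywords: Pandas, describe descriptive statistics, sort_values sorting, sample sampling, info summary

Introduction

In the previous article we covered concatenating, merging, and dropping data in a DataFrame, which amounts to "wrangling the data into the shape you want". Once the shape is right, the next step is to "understand the data": How wide is the range of this column? Are there outliers? What does the data look like when sorted by a given column? This article uses five common methods to help you quickly get to the bottom of your data.

1. info(): View a Data Summary

When you receive an unfamiliar DataFrame, the first thing to do is call .info(): in a few lines of output it shows you the skeleton of the data.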

import pandas as pd
import numpy as np

df = pd.DataFrame(data={'name': ['Tom', 'Bob', np.nan], 
                         'age': [18, 19, 17], 
                         'height': [167, 177, 178]}, 
                   index=['n1', 'n2', 'n3'])
print(df)
df.info()

Output:

   name  age  height
n1  Tom   18     167
n2  Bob   19     177
n3  NaN   17     178

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, n1 to n3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    2 non-null      object 
 1   age     3 non-null      int64  
 2   height  3 non-null      int64  
dtypes: int64(2), object(1)
memory usage: 96.0+ bytes

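At a glance you can see how many values are missing, the dtype of each column, and the memory footprint.

# verbose=False prints a short summary
df.info(verbose=False)

# show_counts=False hides the Non-Null Count column
df.info(show_counts=False)

Parameter	Description
`verbose`	`None` (default) follows the pandas display settings, which means the full per-column summary for a small frame like this one; `False` prints a short summary
`show_counts`	`None` (default) shows the non-null counts unless the frame exceeds the pandas display limits; `False` hides them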
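
Because info() prints its report and returns None, the summary cannot be stored with a plain assignment. The sketch below shows one way to capture it as a string, using the `buf` and `memory_usage` parameters of `DataFrame.info()`, which the table above does not cover. The variable names `buffer`, `summary`, and `missing_per_column` are illustrative only.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob', np.nan],
                   'age': [18, 19, 17],
                   'height': [167, 177, 178]},
                  index=['n1', 'n2', 'n3'])

# info() writes to stdout by default; passing buf redirects the text
buffer = io.StringIO()
# memory_usage='deep' asks pandas to measure string contents as well
df.info(buf=buffer, memory_usage='deep')
summary = buffer.getvalue()
print(summary)

# The missing-value count per column is also available directly
missing_per_column = df.isna().sum()
print(missing_per_column)
```

If all you need is the number of missing values, `df.isna().sum()` is usually easier to work with than parsing the info() text.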

2. describe(): Descriptive Statistics

If info() is the "skeleton", describe() is the "health check report": it lists the key metrics of every numeric column in one go.

df = pd.DataFrame(data={'name': ['Tom', 'Bob', 'Bob'], 
                         'age': [18, 19, 17], 
                         'height': [167, 177, 178]}, 
                   index=['n1', 'n2', 'n3'])
print(df.describe())

Output:

            age      height
count   3.00000    3.000000
mean   18.00000  174.000000
std     1.00000    6.082763
min    17.00000  167.000000
25%    17.50000  172.000000
50%    18.00000  177.000000
75%    18.50000  177.500000
max    19.00000  178.000000

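By default only numeric columns are summarized (name was skipped). To see every column:

# include='all' shows all columns (including string columns)
print(df.describe(include='all'))

# include='object' shows only string columns
print(df.describe(include='object'))

# Combine several dtypes
print(df.describe(include=['number', 'object']))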

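Metric	Meaning
`count`	Number of non-missing values
`mean`	Mean
`std`	Standard deviation (sample standard deviation)
`min`	Minimum
`25%`	25th percentile
`50%`	Median (50th percentile)
`75%`	75th percentile
`max`	Maximum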
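
The metrics above apply to numeric columns. For a text column, describe() reports a different set of rows: count, unique, top (the most frequent value), and freq (how often it appears). A minimal sketch that calls describe() on the name column alone; the variable name `text_summary` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob', 'Bob'],
                   'age': [18, 19, 17]})

# Describing a single text column yields count / unique / top / freq
text_summary = df['name'].describe()
print(text_summary)

print(text_summary['unique'])  # 2 distinct names
print(text_summary['top'])     # 'Bob' is the most frequent
print(text_summary['freq'])    # it appears 2 times
```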

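Parameter details:

Parameter	Description
`percentiles`	Defaults to `[.25, .5, .75]`; pass your own list to customize the percentiles
`include`	`None` (default) summarizes numeric columns only; `'all'` every column; `'number'` numeric columns; `'object'` string columns
`exclude`	Excludes the given dtypes (for example `'number'` or `'object'`)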
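
The parameters in this table can be tried out with a short sketch. It requests the 10th, 50th, and 90th percentiles instead of the default quartiles, then uses `exclude` to drop the numeric columns. The variable names `custom` and `non_numeric` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob', 'Bob'],
                   'age': [18, 19, 17],
                   'height': [167, 177, 178]},
                  index=['n1', 'n2', 'n3'])

# Custom percentiles replace the default 25% / 50% / 75% rows
custom = df.describe(percentiles=[0.1, 0.5, 0.9])
print(custom)

# exclude='number' leaves only the non-numeric columns
non_numeric = df.describe(exclude='number')
print(non_numeric)
```

Wide percentiles such as 10% and 90% are often more useful than quartiles when you are looking for values at the extremes of a column.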

3. Common Statistical Functions

Besides getting everything at once from describe(), you can also call these functions individually:

df = pd.DataFrame(data={'name': ['Tom', np.nan, 'Linda'], 
                         'age': [18, 19, 17]}, 
                   index=['n1', 'n2', 'n3'])

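print(df.count())          # number of non-missing values in each column
print(df.count(axis=1))    # number of non-missing values in each row

# Build random data (no seed is set, so the values differ on every run)
d = np.random.normal(size=(7, 2))
df = pd.DataFrame(data=d)

print(df.max())          # maximum of each column
print(df.max(axis=1))    # maximum of each row
print(df.min())          # minimum of each column
print(df.min(axis=1))    # minimum of each row
print(df.mean())         # mean of each column
print(df.mean(axis=1))   # mean of each row
print(df.var())          # variance of each column
print(df.var(axis=1))    # variance of each row
print(df.std())          # standard deviation of each column
print(df.std(axis=1))    # standard deviation of each row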

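Quick reference:

Function	Purpose	Default axis
`count()`	Number of non-missing values	0 (per column)
`max()`	Maximum	0 (per column)
`min()`	Minimum	0 (per column)
`mean()`	Mean	0 (per column)
`var()`	Variance	0 (per column)
`std()`	Standard deviation	0 (per column)

💡 axis=0 computes along the row direction, that is, down the rows, and yields one statistic per column; axis=1 computes along the column direction, that is, across the columns, and yields one statistic per row.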
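
The axis rule is easier to see on fixed numbers than on random data. The sketch below uses a small made-up score table (the column names and values are invented for illustration). It also shows one detail not mentioned above: pandas `var()` and `std()` default to `ddof=1` (sample statistics), whereas NumPy's `np.var` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three students, two subjects
scores = pd.DataFrame({'math': [90, 80, 70],
                       'english': [60, 80, 100]},
                      index=['s1', 's2', 's3'])

col_means = scores.mean()         # axis=0: one value per column
row_means = scores.mean(axis=1)   # axis=1: one value per row
print(col_means)
print(row_means)

# pandas divides by n - 1 by default (ddof=1)
sample_var = scores['math'].var()
# ddof=0 divides by n, matching NumPy's default
population_var = scores['math'].var(ddof=0)
print(sample_var)                          # 100.0
print(population_var)
print(np.var(scores['math'].to_numpy()))   # same as population_var
```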

4. sample(): Random Sampling

In data analysis you often want to "pull out a few rows and take a look" at a large dataset. That is exactly what sample() is for.

df = pd.DataFrame(data={'name': ['Tom', 'Bob', 'Jack', 'Linda'], 
                         'age': [18, 19, 17, 21], 
                         'height': [167, 177, 178, 188]}, 
                   index=['n1', 'n2', 'n3', 'n4'])
print(df)

     name  age  height
n1    Tom   18     167
n2    Bob   19     177
n3   Jack   17     178
n4  Linda   21     188

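# Default n=1: draw 1 random row
print(df.sample())

# Sample by fraction (75% of 4 rows = 3 rows)
print(df.sample(frac=0.75))

# Draw exactly 2 rows
print(df.sample(n=2))

# Sample with replacement (the same row may be drawn twice)
print(df.sample(n=2, replace=True))

# Sample 2 columns instead of rows
print(df.sample(n=2, axis=1))

# Set a random seed so the result is reproducible
print(df.sample(n=2, random_state=3))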

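Parameter	Description
`n`	Number of items to sample (default 1); cannot be combined with `frac`
`frac`	Fraction to sample (e.g. 0.5 = 50%); cannot be combined with `n`
`replace`	`True` samples with replacement, so the same row may be drawn more than once (default `False`)
`random_state`	Random seed; makes the result reproducible
`axis`	0 = sample rows (default), 1 = sample columns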
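
As a sketch of how these parameters combine in practice, the example below first checks that a fixed `random_state` returns the same rows on repeated calls, then uses `frac` together with `drop()` to split the data into two non-overlapping parts. The split is a common idiom rather than something sample() does by itself, and the names `train` and `holdout` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob', 'Jack', 'Linda'],
                   'age': [18, 19, 17, 21],
                   'height': [167, 177, 178, 188]},
                  index=['n1', 'n2', 'n3', 'n4'])

# The same seed yields the same rows every time
first = df.sample(n=2, random_state=3)
second = df.sample(n=2, random_state=3)
print(first.equals(second))  # True

# Take 75% of the rows, then keep the remainder as a holdout set
train = df.sample(frac=0.75, random_state=42)
holdout = df.drop(train.index)
print(train)
print(holdout)
```

The `drop(train.index)` step relies on the index labels being unique, which is the case here.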

5. sort_values(): Sorting

Sorting is one of the most frequent operations in data analysis. sort_values() can sort by one column or by several.

df = pd.DataFrame({'col1': [4, 1, 2, np.nan, 5, 2],
                   'col2': [2, 1, 9, 8, 7, 6],
                   'col3': [0, 1, 9, 4, 2, 3],
                   'col4': ['a', 'B', 'c', 'D', 'e', 1]})
print(df)


   col1  col2  col3 col4
0   4.0     2     0    a
1   1.0     1     1    B
2   2.0     9     9    c
3   NaN     8     4    D
4   5.0     7     2    e
5   2.0     6     3    1

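# Sort by a single column
print(df.sort_values(by='col1'))

# Sort by multiple columns (col1 first, then col2 to break ties)
print(df.sort_values(by=['col1', 'col2']))

# Sort in descending order
print(df.sort_values(by='col1', ascending=False))

# Mixed order across columns (col1 ascending, col2 descending)
print(df.sort_values(by=['col1', 'col2'], ascending=[True, False]))

# Put missing values first (by default they go last)
print(df.sort_values(by='col1', na_position='first'))

# Reorder the columns by the values in the row labeled 5 (axis=1, by takes a row label);
# this works here because every value in row 5 is numeric
print(df.sort_values(by=5, axis=1))

# Sort in place
df.sort_values(by='col1', inplace=True)
print(df)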

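Parameter	Description
`by`	Column name or list of column names to sort by (row labels when `axis=1`)
`axis`	0 sorts the rows by column values (default); 1 sorts the columns by row values
`ascending`	`True` ascending (default), `False` descending; a list sets the order for each sort key
`inplace`	`True` modifies the DataFrame in place and returns `None`
`na_position`	`'last'` puts NaN at the end (default), `'first'` puts NaN at the beginning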
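
Two further sort_values() parameters that the table does not list can be handy with text columns: `key`, which transforms the values before they are compared, and `ignore_index`, which renumbers the result from 0. The sketch below assumes a reasonably recent pandas (`key` was added in pandas 1.1) and uses a small made-up frame; the names `letters`, `default_order`, and `folded` are illustrative.

```python
import pandas as pd

# Hypothetical data with mixed-case labels
letters = pd.DataFrame({'label': ['a', 'B', 'c', 'D'],
                        'value': [1, 2, 3, 4]})

# Default string ordering puts uppercase letters before lowercase ones
default_order = letters.sort_values(by='label')
print(default_order['label'].tolist())   # ['B', 'D', 'a', 'c']

# key lowercases the values for comparison only; the data is unchanged.
# ignore_index=True relabels the sorted rows 0, 1, 2, 3.
folded = letters.sort_values(by='label',
                             key=lambda s: s.str.lower(),
                             ignore_index=True)
print(folded['label'].tolist())          # ['a', 'B', 'c', 'D']
print(folded)
```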

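Summary

These five methods take you from "not knowing the data" to "having a grasp of its outline":

Method	Purpose	Mnemonic
`info()`	Structural summary of the data	Run info first so you know the rows and columns you are dealing with
`describe()`	Descriptive statistics for numeric columns	A one-shot health check: count/mean/std/min/max all in one place
`count/max/min/mean...`	Single-metric statistics	Take whichever you need; axis controls the direction
`sample()`	Random sampling	Pull a few rows from a large dataset; a seed keeps it reproducible
`sort_values()`	Sorting	Sort by column, ascending or descending; NaN can go first or last

In the next article we will explore the real "productivity engine" of Pandas: groupby() for grouped aggregation and apply() for applying functions. These are among the most powerful tools in data analysis.

This article is the 16th in the "Python from Beginner to Data Analysis" series. Follow me so you don't miss future updates.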