Common Data Analysis Operations: From Basics to Practice

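I. NumPy Core Operations (ndarray)

1. Core Properties of ndarray

Multidimensionality

At the heart of NumPy is the ndarray (N-dimensional array), a multidimensional, homogeneous array object. Compared with native Python lists, an ndarray offers more efficient storage and faster operations.

import numpy as np

# 0-D array (scalar)
arr_0d = np.array(5)
print("0-D array:", arr_0d, "ndim:", arr_0d.ndim)    # Output: 0-D array: 5 ndim: 0

# 1-D array (vector)
arr_1d = np.array([1, 2, 3])
print("1-D array:", arr_1d, "ndim:", arr_1d.ndim)    # Output: 1-D array: [1 2 3] ndim: 1

# 2-D array (matrix)
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2-D array:\n", arr_2d, "\nndim:", arr_2d.ndim)    # ndim: 2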

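Homogeneity (a single element type)

All elements of an ndarray must share one data type, which is what makes its storage and operations efficient. When the input mixes types, NumPy automatically converts every element to a common type.

# Mixed types are converted to a common type
arr_str = np.array([1, 'hello'])    # everything becomes a string
print("String array:", arr_str)     # Output: String array: ['1' 'hello']

arr_float = np.array([1, 2.5])      # everything becomes a float
print("Float array:", arr_float)    # Output: Float array: [1.  2.5]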

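2. Core Attributes of ndarray

An ndarray exposes a set of attributes that describe the array:

arr = np.array([[1,2,3],[4,5,6]])
print("Shape:", arr.shape)      # (2, 3): rows × columns
print("Dimensions:", arr.ndim)  # 2: number of axes
print("Size:", arr.size)        # 6: total number of elements
print("Data type:", arr.dtype)  # int64 on most platforms (the default integer type is platform-dependent)
print("Transpose:\n", arr.T)    # rows and columns swapped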
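
Shape-related attributes are easiest to understand next to `reshape`. The sketch below is an illustration only (the variable names are assumptions); it also shows that the transpose is a view that shares memory with the original array.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# reshape rearranges the same 6 elements into a new shape.
reshaped = arr.reshape(3, 2)
print(reshaped.shape)                 # (3, 2)

# -1 lets NumPy infer one dimension from the total size.
flat = arr.reshape(-1)
print(flat.shape)                     # (6,)

# The transpose is a view: it shares memory with the original array.
transposed = arr.T
print(np.shares_memory(arr, transposed))   # True
transposed[0, 1] = 40                 # writes through to arr[1, 0]
print(arr[1, 0])                      # 40
```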

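3. Creating ndarrays

NumPy offers many ways to create arrays, each suited to a different scenario:

# Basic creation
list_data = [4, 5, 6]
arr = np.array(list_data, dtype=np.float64)

# Copying an array (independent copy of the data)
arr_copy = np.copy(arr)
arr_copy[0] = 8  # modifying the copy leaves the original unchanged

# Arrays with a predefined shape
arr_zero = np.zeros((2, 3), dtype=int)  # all zeros
arr_one = np.ones((5, 8), dtype=int)    # all ones
arr_empty = np.empty((2, 3))            # uninitialized (arbitrary values)
arr_full = np.full((3, 4), 2025)        # filled with a constant

# Sequences
arr_arange = np.arange(1, 51, 1)        # arithmetic sequence (start, stop, step); stop is excluded, so 1..50
arr_linspace = np.linspace(0, 100, 5)   # evenly spaced values (start, stop, num); stop is included
arr_logspace = np.logspace(0, 4, 3, base=2)  # log-spaced values: 2**0, 2**2, 2**4

# Special matrices
arr_eye = np.eye(3, 4, dtype=int)      # 3×4 matrix with ones on the main diagonal (an identity matrix when square)
arr_diag = np.diag([5, 1, 2, 3])        # diagonal matrix

# Random arrays
np.random.seed(20)  # set the random seed (reproducible results)
arr_rand = np.random.rand(2, 3)         # uniform floats in [0, 1)
arr_uniform = np.random.uniform(3, 6, (2, 3))  # uniform floats in [3, 6)
arr_randint = np.random.randint(3, 30, (2, 3)) # integers in [3, 30), upper bound excluded
arr_randn = np.random.randn(2, 3)       # standard normal distribution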
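
The random functions above rely on NumPy's global random state. As an alternative, newer NumPy code often uses a `Generator` created with `np.random.default_rng`. The sketch below mirrors the four calls above under that API; it is not the approach used in the original text, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(20)       # seeded generator, independent of the global state

uniform_01 = rng.random((2, 3))               # floats in [0, 1)
uniform_3_6 = rng.uniform(3, 6, size=(2, 3))  # floats in [3, 6)
integers = rng.integers(3, 30, size=(2, 3))   # integers in [3, 30)
normal = rng.standard_normal((2, 3))          # standard normal samples

# Re-creating the generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(20)
print(np.array_equal(uniform_01, rng_again.random((2, 3))))   # True
```

Passing the generator around explicitly keeps reproducibility local to the code that needs it.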

4. Indexing and Slicing

NumPy provides flexible indexing and slicing for reading and modifying array elements:

# 1-D array
arr_1d = np.random.randint(1, 100, 20)
print("Index:", arr_1d[10])            # 11th element
print("Slice:", arr_1d[2:5])           # 3rd through 5th elements
print("Boolean index:", arr_1d[(arr_1d > 10) & (arr_1d < 70)])   # filter by condition

# 2-D array
arr_2d = np.random.randint(1, 100, (4, 8))
print("2-D index:", arr_2d[1, 3])           # row 2, column 4
print("Row slice:", arr_2d[1, 2:5])         # row 2, columns 3 through 5
print("Column slice:", arr_2d[:, 3])        # column 4, all rows
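
One detail worth knowing when modifying elements: a basic slice is a view of the original data, while boolean indexing returns a copy. A minimal sketch with illustrative names:

```python
import numpy as np

arr = np.arange(10)                   # [0 1 2 3 4 5 6 7 8 9]

# A basic slice is a view: writing to it changes the original.
window = arr[2:5]
window[0] = 99
print(arr[2])                         # 99

# Boolean indexing returns a copy: writing to it leaves the original alone.
large = arr[arr > 5]
large[0] = -1
print(arr[2])                         # still 99

# Assigning through a boolean mask does modify the original in place.
arr[arr > 5] = 0
print(arr)                            # [0 1 0 3 4 5 0 0 0 0]
```

When an independent slice is needed, calling `.copy()` on it avoids accidental writes to the source array.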

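5. Arithmetic and Broadcasting

NumPy supports vectorized operations and broadcasting, which make array arithmetic concise and efficient:

# Basic arithmetic
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[4, 5, 6], [7, 8, 9]])
print("Element-wise sum:\n", a + b)
print("Array times scalar:\n", a * 3)

# Broadcasting (compatible shapes are expanded automatically)
a_1d = np.array([1, 2, 3])      # shape (3,), treated as 1×3
b_2d = np.array([[4], [5], [6]]) # shape (3, 1)
print("Broadcast result:\n", b_2d - a_1d)   # result has shape (3, 3)

# Matrix multiplication
print("Matrix product:\n", a @ b.T)    # (2, 3) @ (3, 2) -> (2, 2); inner dimensions must match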
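
To make the broadcasting rule concrete, the sketch below prints the shape NumPy infers and shows what happens when shapes cannot be aligned. It assumes NumPy 1.20 or later for `np.broadcast_shapes`; the variable names are illustrative.

```python
import numpy as np

row = np.array([1, 2, 3])             # shape (3,)
col = np.array([[4], [5], [6]])       # shape (3, 1)

# Shapes are compared from the right; a dimension of size 1 is stretched.
print(np.broadcast_shapes(col.shape, row.shape))   # (3, 3)

diff = col - row
print(diff)
# [[3 2 1]
#  [4 3 2]
#  [5 4 3]]

# Shapes that cannot be aligned raise a ValueError.
try:
    np.ones((2, 3)) + np.ones((2,))
    failed = False
except ValueError as exc:
    failed = True
    print("broadcast failed:", exc)
```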

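6. Common Functions

Mathematical functions

NumPy provides a rich set of mathematical functions that operate element-wise on arrays:

arr = np.array([1, 25, 81])
print("Square root:", np.sqrt(arr))         # element-wise square root
print("Exponential:", np.exp(1))            # e^x
print("Natural log:", np.log(2.71))         # ln(x)
print("Absolute value:", np.abs([-1, 1, -3]))  # absolute value
print("Rounding:", np.round([3.2, 4.5]))    # nearest integer, ties go to the even value: [3. 4.]
print("Ceiling:", np.ceil([1.6, 25.1]))     # round up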
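
Rounding deserves a closer look, because `np.round` sends exact halves to the nearest even integer rather than always rounding them up. The sketch below contrasts this with one common way to round halves up for non-negative values; the names are illustrative.

```python
import numpy as np

values = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

# np.round sends ties to the nearest even integer.
print(np.round(values))               # [0. 2. 2. 4. 4.]

# One common way to round halves up for non-negative inputs.
half_up = np.floor(values + 0.5)
print(half_up)                        # [1. 2. 3. 4. 5.]

# floor, ceil and trunc differ for negative numbers.
neg = np.array([-1.5])
print(np.floor(neg), np.ceil(neg), np.trunc(neg))   # [-2.] [-1.] [-1.]
```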

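Statistical functions

NumPy also provides statistical functions for summarizing an array:

arr = np.random.randint(1, 200, 8)
print("Sum:", np.sum(arr))              # total
print("Mean:", np.mean(arr))            # arithmetic mean
print("Median:", np.median(arr))        # median
print("Standard deviation:", np.std(arr))   # population standard deviation (ddof=0)
print("Max / position:", np.max(arr), np.argmax(arr))   # maximum and its index
print("Cumulative sum:", np.cumsum(arr))    # running total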
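
On multidimensional arrays these functions accept an `axis` argument, and `np.std` accepts `ddof` to switch between population and sample standard deviation. A small sketch with made-up values:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(np.sum(table))                  # 21, over all elements
print(np.sum(table, axis=0))          # [5 7 9], one value per column
print(np.sum(table, axis=1))          # [ 6 15], one value per row

sample = np.array([1, 2, 3, 4])
pop_std = np.std(sample)              # population std (ddof=0), about 1.118
sample_std = np.std(sample, ddof=1)   # sample std (ddof=1), about 1.291
print(pop_std, sample_std)
```

Note that pandas defaults to `ddof=1` in `Series.std`, so the two libraries give different results on the same data unless `ddof` is set explicitly.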

II. Pandas Core Operations

1. Creating and Working with Series

A Series is the one-dimensional array object in Pandas, similar to an array with labels:

import pandas as pd
import numpy as np

# Basic creation
s = pd.Series([10, 2, 3, 4, 5],
              index=['A', 'B', 'C', 'D', 'E'],
              name='month')

# From a dictionary
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# Core attributes
print("Index:", s.index)    # Index object
print("Values:", s.values)  # underlying NumPy array
print("Shape:", s.shape)    # shape
print("Size:", s.size)      # number of elements

# Accessing data
print("Label-based:", s.loc['A'])     # by label
print("Position-based:", s.iloc[0])   # by position
print("Filter:", s[s < 3])            # boolean filter

# Common methods
s = pd.Series([10, 2, np.nan, None, 3], index=['A', 'B', 'C', 'D', 'E'])  # None is stored as NaN
print("Summary statistics:\n", s.describe())
print("Missing values:", s.isna())
print("Mean:", s.mean())              # NaN values are skipped by default
print("Quantile:", s.quantile(0.8))
print("Mode:", s.mode())
print("Sorted:", s.sort_values())
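
Labels matter in arithmetic as well: operations between two Series align on the index rather than on position. The sketch below uses two made-up Series to show the effect.

```python
import pandas as pd

q1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
q2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels; labels present on one side only become NaN.
total = q1 + q2
print(total)

# fill_value treats a missing label as 0 instead of producing NaN.
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

Here `total` has NaN for "a" and "d", while `total_filled` keeps the single available value for those labels.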

2. Creating and Working with DataFrames

A DataFrame is the two-dimensional table object in Pandas, similar to a spreadsheet or an SQL table:

# Basic creation
df = pd.DataFrame({
    "name": ["tom", "jack", "alice", "bob", "allen"],
    "age": [15, 17, 20, 26, 30],
    "score": [60.5, 80, 30.6, 70, 83.5]
}, index=[1, 2, 3, 4, 5], columns=["name", "score", "age"])  # columns sets the column order

# Core attributes
print("Row index:", df.index)
print("Column labels:", df.columns)
print("Values:\n", df.values)
print("Dimensions:", df.ndim)
print("Shape:", df.shape)

# Accessing data
print("Single row:", df.loc[4])          # row by label
print("Single column:", df['name'])      # column by name
print("Multiple columns:", df[['name', 'score']])  # several columns
print("Single value:", df.at[3, 'score'])     # one value by label
print("Filter:", df[df.score > 70])           # boolean filter

# Common methods
print("Summary statistics:\n", df.describe())
print("Missing values:\n", df.isna())
print("Sum:", df['score'].sum())
print("Sorted:\n", df.sort_values(by='score'))
print("Deduplicated:\n", df.drop_duplicates())
print("Random sample:\n", df.sample(2))
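
Real filters usually combine several conditions. The sketch below rebuilds the same small table and shows three equivalent styles: boolean masks, `loc`, and `query`. The variable names `adults_passing` and `same` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["tom", "jack", "alice", "bob", "allen"],
    "age": [15, 17, 20, 26, 30],
    "score": [60.5, 80, 30.6, 70, 83.5],
}, index=[1, 2, 3, 4, 5])

# Combine conditions with & and |, wrapping each one in parentheses.
adults_passing = df[(df["age"] >= 18) & (df["score"] >= 70)]
print(adults_passing["name"].tolist())        # ['bob', 'allen']

# loc selects rows and columns together, by label.
top = df.loc[df["score"] > 80, ["name", "score"]]
print(top)

# iloc selects by integer position, regardless of the index labels.
print(df.iloc[0]["name"])                     # tom

# query expresses the same filter as a string.
same = df.query("age >= 18 and score >= 70")
```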

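3. Data Cleaning and Preprocessing

Data cleaning is a key step in any analysis. Pandas provides tools for handling missing values, duplicates, and type conversion:

# Missing values
df = pd.read_csv('data/weather_withna.csv')
print("Missing values per column:", df.isna().sum())

df_fill = df.fillna({'temp_max': 20, 'wind': 2.5})  # fill with fixed values
df_fill_mean = df.fillna(df[['temp_max', 'wind']].mean())  # fill with column means
df_drop = df.dropna()  # drop rows that contain missing values

# Duplicates
df = df.drop_duplicates()  # drop duplicate rows

# Type conversion (illustrative: assumes df has 'age' and 'gender' columns)
df['age'] = df['age'].astype('int16')  # numeric conversion; fails if the column contains NaN
df['gender'] = df['gender'].astype('category')  # categorical conversion

# Dates and times (assumes a 'date' column)
df['date'] = pd.to_datetime(df['date'])  # string to datetime
df['year'] = df['date'].dt.year         # extract the year
df['month'] = df['date'].dt.month       # extract the month

# Binning (assumes a 'price' column)
df['price_level'] = pd.cut(df['price'],
                          bins=[0, 100, 200, 300],
                          labels=['low', 'medium', 'high'])  # bins are (0, 100], (100, 200], (200, 300]

# Reshaping
df_wide = pd.DataFrame({
    'ID': [1, 2],
    'name': ['alice', 'bob'],
    'Math': [90, 85],
    'English': [88, 92]
})

df_long = pd.melt(df_wide, id_vars=['ID', 'name'],
                  var_name='subject', value_name='score')  # wide to long

df_wide2 = pd.pivot(df_long, index=['ID', 'name'],
                   columns='subject', values='score')  # long to wide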
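
Since the CSV file above may not be at hand, here is a self-contained sketch of the same cleaning steps on a small made-up table. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "date": ["2025-01-01", "2025-01-02", "2025-01-02", "2025-02-01"],
    "temp_max": [10.0, np.nan, np.nan, 14.0],
    "price": [80, 150, 150, 250],
})

# Rows 1 and 2 are identical, so one of them is dropped.
clean = raw.drop_duplicates().copy()

# Fill the remaining missing temperature with the column mean (12.0).
clean["temp_max"] = clean["temp_max"].fillna(clean["temp_max"].mean())

# Parse the date strings and derive a month column.
clean["date"] = pd.to_datetime(clean["date"])
clean["month"] = clean["date"].dt.month

# Bin prices into three labeled ranges.
clean["price_level"] = pd.cut(clean["price"], bins=[0, 100, 200, 300],
                              labels=["low", "medium", "high"])
print(clean)
```

Calling `.copy()` after `drop_duplicates()` keeps the later column assignments from triggering chained-assignment warnings.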

4. Grouping and Aggregation

Grouping and aggregation are among the core operations of data analysis, and Pandas supports them through groupby:

df = pd.read_csv('data/employees.csv')
df = df.dropna(subset=['department_id'])

# Group by one column
dept_salary = df.groupby('department_id')['salary'].mean()
print("Average salary per department:\n", dept_salary)

# Group by several columns
dept_job_salary = df.groupby(['department_id', 'job_id'])['salary'].mean()
print("Average salary per department and job:\n", dept_job_salary)

# Several aggregation functions at once
group_stats = df.groupby('department_id').agg({
    'salary': ['mean', 'max', 'min'],
    'employee_id': 'count'
})
print("Multiple aggregations:\n", group_stats)
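
The dictionary form of `agg` produces multi-level column names. As a variation, named aggregation gives flat, readable column names. The sketch below uses a small made-up employee table; the output column names `avg_salary`, `max_salary`, `headcount`, and `dept_avg` are illustrative.

```python
import pandas as pd

employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "department_id": [10, 10, 20, 20, 20],
    "salary": [5000, 6000, 4000, 6000, 8000],
})

# Named aggregation: output_column=(input_column, function).
summary = employees.groupby("department_id").agg(
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    headcount=("employee_id", "count"),
)
print(summary)

# transform returns one value per original row, aligned with the input.
employees["dept_avg"] = employees.groupby("department_id")["salary"].transform("mean")
print(employees)
```

`transform` is handy when each row should be compared against its group, for example salary relative to the department average.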

III. Data Visualization

1. Basic Plotting with Matplotlib

Matplotlib is the most widely used plotting library in Python and covers a broad range of chart types:

import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = ["STHeiti"]  # font for Chinese text (STHeiti ships with macOS); only needed when labels contain Chinese characters

# Line chart (trends)
plt.figure(figsize=(10, 5))
month = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [100, 150, 80, 130]
plt.plot(month, sales, color='orange', linewidth=2, marker='o', label='Product A')
plt.title('2025 Sales Trend', fontsize=20)
plt.xlabel('Month')
plt.ylabel('Sales (10,000 CNY)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Bar chart (comparison across categories)
plt.figure(figsize=(10, 5))
subjects = ['Chinese', 'Math', 'English', 'Science']
scores = [85, 92, 78, 88]
plt.bar(subjects, scores, color='orange', width=0.6)
plt.title('Scores by Subject', fontsize=20)
plt.xlabel('Subject')
plt.ylabel('Score')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Pie chart (proportions)
plt.figure(figsize=(8, 8))
things = ['Study', 'Entertainment', 'Exercise', 'Sleep', 'Other']
times = [6, 4, 1, 8, 5]
colors = ['#66b3ff', '#99ff99', '#ffcc99', '#ff9999', '#ff4499']
plt.pie(times, labels=things, autopct='%.1f%%', colors=colors, startangle=90)
plt.title('How a Day Is Spent')
plt.show()

# Scatter plot (correlation)
plt.figure(figsize=(10, 5))
scores = [50, 55, 60, 65, 70, 75, 80]
hours = [1, 2, 3, 4, 5, 6, 7]
plt.scatter(hours, scores)
plt.title('Study Time vs. Score')
plt.xlabel('Study time (hours)')
plt.ylabel('Score')
plt.show()

# Box plot (distribution and outliers)
plt.figure(figsize=(8, 6))
data = {
    'Chinese': [82, 85, 88, 70, 90, 76, 84, 83, 95],
    'Math': [75, 80, 79, 93, 88, 82, 87, 89, 92],
    'English': [70, 72, 68, 65, 78, 80, 85, 90, 95]
}
# tick_labels requires Matplotlib 3.9+; older versions use labels= instead
plt.boxplot(list(data.values()), tick_labels=list(data.keys()))
plt.title("Score Distribution by Subject")
plt.ylabel("Score")
plt.grid(True, axis='y', linestyle='--', alpha=0.5)
plt.show()
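
The examples above use the `pyplot` state-machine style, one figure at a time. When several charts belong together, Matplotlib's object-oriented interface places them side by side in one figure. The sketch below is an illustration under two assumptions: the non-interactive `Agg` backend is selected so it runs without a display, and the image is written to an in-memory buffer rather than a file.

```python
import io

import matplotlib
matplotlib.use("Agg")                 # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [100, 150, 80, 130]

# One figure with two axes, laid out in a single row.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

ax_line.plot(months, sales, marker="o", label="Product A")
ax_line.set_title("Sales trend")
ax_line.set_xlabel("Month")
ax_line.set_ylabel("Sales")
ax_line.legend()

ax_bar.bar(months, sales, width=0.6)
ax_bar.set_title("Sales by month")

fig.tight_layout()

# Render to PNG in memory; pass a file path instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving.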

2. Advanced Visualization with Seaborn

Seaborn is a statistical plotting library built on Matplotlib that offers more polished default styles:

import seaborn as sns

# Load the data
penguins = pd.read_csv("data/penguins.csv")
penguins.dropna(inplace=True)

# Histogram (distribution)
sns.histplot(data=penguins, x="bill_length_mm", kde=True)
plt.title('Distribution of penguin bill length')
plt.show()

# Count plot (category counts)
sns.countplot(data=penguins, x="island")
plt.title('Number of penguins per island')
plt.show()

# Scatter plot (correlation)
sns.scatterplot(data=penguins, x="body_mass_g", y="flipper_length_mm", hue="sex")
plt.title('Body mass vs. flipper length (by sex)')
plt.show()

# Box plot (distribution per group)
sns.boxplot(data=penguins, x="species", y="bill_length_mm")
plt.title('Bill length distribution by species')
plt.show()

# Heatmap (correlation)
corr = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr()
sns.heatmap(corr, cmap='coolwarm', annot=True)
plt.title('Feature correlation heatmap')
plt.show()

# Pair plot (relationships among several variables)
g = sns.pairplot(data=penguins, hue="species")
g.figure.suptitle('Pairwise relationships of penguin features', y=1.02)  # pairplot is figure-level, so title the whole figure
plt.show()

IV. Case Study: Housing Price Analysis

1. Loading and Preprocessing the Data

# Load the data
df = pd.read_csv('data/house_sales.csv')

# Cleaning
df.drop(columns='origin_url', inplace=True)  # drop an unused column
df.dropna(inplace=True)                      # drop rows with missing values
df.drop_duplicates(inplace=True)             # drop duplicate rows

# Type conversion: strip the unit suffix, then cast to a number
df['area'] = df['area'].str.replace('㎡', '').astype(float)
df['price'] = df['price'].str.replace('万', '').astype(float)    # price in units of 10,000 CNY
df['unit'] = df['unit'].str.replace('元/㎡', '').astype(float)   # unit price in CNY per square meter
df['year'] = df['year'].str.replace('年建', '').astype(int)      # year built

# Outlier handling (IQR method)
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['price'] < Q3 + 1.5 * IQR) & (df['price'] > Q1 - 1.5 * IQR)]

# Feature construction
df['district'] = df['address'].str.split('-').str[0]  # extract the district
df['building_age'] = 2025 - df['year']                # building age in years
df['price_level'] = pd.cut(df['price'], bins=4, labels=['low', 'medium', 'high', 'luxury'])  # 4 equal-width bins
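
To see the unit stripping and the IQR filter at work without the original CSV, the sketch below applies both to six made-up listings, one of which is an obvious price outlier. The table `listings` and its values are assumptions for illustration.

```python
import pandas as pd

listings = pd.DataFrame({
    "area": ["50㎡", "60㎡", "65㎡", "70㎡", "75㎡", "80㎡"],
    "price": ["100万", "120万", "130万", "140万", "150万", "1000万"],
})

# Strip the unit suffix literally (regex=False), then cast to float.
listings["area"] = listings["area"].str.replace("㎡", "", regex=False).astype(float)
listings["price"] = listings["price"].str.replace("万", "", regex=False).astype(float)

q1 = listings["price"].quantile(0.25)     # 122.5
q3 = listings["price"].quantile(0.75)     # 147.5
iqr = q3 - q1                             # 25.0
lower = q1 - 1.5 * iqr                    # 85.0
upper = q3 + 1.5 * iqr                    # 185.0

# Keep only rows strictly inside the fences.
kept = listings[(listings["price"] > lower) & (listings["price"] < upper)]
print(len(listings), "->", len(kept))     # 6 -> 5
```

The 1000 listing falls above the upper fence of 185 and is removed, while the other five rows are kept.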

2. Analysis and Visualization

# Feature correlation
corr = df[['price', 'area', 'unit', 'building_age']].corr()
sns.heatmap(corr, cmap='coolwarm', annot=True)
plt.title('Correlation of housing features')
plt.show()

# Price comparison across districts
district_price = df.groupby('district')['price'].median().sort_values(ascending=False)
sns.barplot(x=district_price.index[:10], y=district_price.values[:10])
plt.title('Top 10 districts by median price')
plt.xticks(rotation=45)
plt.show()

# Area vs. price
sns.scatterplot(x='area', y='price', data=df)
plt.title('Floor area vs. price')
plt.show()

# Price comparison by orientation
toward_price = df.groupby('toward')['price'].median().sort_values(ascending=False)
sns.barplot(x=toward_price.index, y=toward_price.values)
plt.title('Median price by orientation')
plt.xticks(rotation=45)
plt.show()

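V. Summary

1. NumPy essentials

The ndarray is a multidimensional, homogeneous array that supports efficient numerical computation, broadcasting, and a rich set of mathematical functions
It is the foundation of data processing and underpins higher-level libraries such as Pandas

2. Pandas essentials

Series (one-dimensional) and DataFrame (two-dimensional) support flexible data manipulation
This includes indexing, filtering, grouping and aggregation, and data cleaning, which makes Pandas the core tool for structured data

3. Visualization essentials

Matplotlib provides the basic plotting functions and is well suited to producing all kinds of charts quickly
Seaborn provides more polished statistical plots and is well suited to exploratory data analysis
Common charts include line charts (trends), bar charts (comparison), scatter plots (correlation), and heatmaps (strength of association)

4. The data analysis workflow

Loading: read data from files, databases, or APIs
Preprocessing: cleaning, conversion, feature engineering
Exploratory analysis: statistics and visualization
Modeling and prediction: choose a suitable model and make predictions
Conclusions: summarize the findings and make recommendations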
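
The five workflow steps can be sketched end to end on a tiny made-up dataset. This is only an illustration: the in-memory CSV stands in for a real data source, and the least-squares line from `np.polyfit` is a stand-in for whatever model a real project would choose.

```python
import io

import numpy as np
import pandas as pd

# 1. Load: an in-memory CSV stands in for a file, database, or API.
csv_text = """area,price
50,110
60,130
60,130
80,
100,210
"""
df = pd.read_csv(io.StringIO(csv_text))

# 2. Preprocess: drop rows with missing values, then duplicate rows.
df = df.dropna().drop_duplicates()

# 3. Explore: summary statistics and correlation.
print(df.describe())
correlation = df["area"].corr(df["price"])
print(correlation)

# 4. Model: fit a straight line, used here only as a placeholder model.
slope, intercept = np.polyfit(df["area"], df["price"], deg=1)

# 5. Conclude: estimate the price for an area of 70 from the fitted line.
estimate = slope * 70 + intercept
print(estimate)
```

Because the sample prices were constructed as 2 × area + 10, the fitted slope and intercept come out close to 2 and 10, which makes the sketch easy to check by hand.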