Getting Started with Machine Learning: Python's Three Core Libraries Explained (NumPy + Pandas + Matplotlib, with Code)

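Preface

In machine learning and data science, Python has become the dominant programming language thanks to its concise syntax and mature ecosystem. A widely cited rule of thumb says that roughly 80% of the work in a machine learning project goes into data preparation.

Python's data processing, analysis, and visualization workflow rests on five classic libraries: NumPy, Pandas, Matplotlib, Scikit-learn, and SciPy.

This article focuses on the three core libraries every beginner needs. It covers the concepts, the role each library plays, and practical usage from the ground up, with complete runnable code, so that readers with no prior experience can follow along and build a solid base for later machine learning modeling.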

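A quick summary of where each one fits:

NumPy: the foundation of numerical computing; efficient operations on multidimensional arrays
Pandas: the tool for structured data; spreadsheet-style table manipulation comparable to Excel
Matplotlib: the data visualization tool; draws a wide range of publication-quality charts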

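1. NumPy: The Foundation of Numerical Computing

1.1 Core Concepts and Role

NumPy (Numerical Python) is the low-level core of scientific computing in Python. Most data analysis and machine learning libraries, including Pandas and Scikit-learn, are built on it, and deep learning frameworks such as TensorFlow interoperate directly with NumPy arrays.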

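Key advantages:

A built-in ndarray (N-dimensional array) object whose numeric operations are often tens of times faster than the equivalent loops over native Python lists
Vectorized operations that process whole arrays without explicit loops, giving shorter and faster code
A rich set of math functions, matrix operations, and random number generators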
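To make the vectorization point concrete, the sketch below computes the same sum of squares twice: once with a Python loop over a list and once with a single NumPy expression. The array size and variable names are illustrative choices, and no timing figures are claimed because speedups depend on the machine and the workload.

```python
import numpy as np

n = 100_000                           # illustrative size, chosen arbitrarily
values = list(range(n))               # plain Python list
arr = np.arange(n, dtype=np.int64)    # the same numbers as an ndarray

# Loop version: one Python-level iteration per element
loop_total = 0
for v in values:
    loop_total += v * v

# Vectorized version: the loop runs inside NumPy's compiled code
vec_total = int(np.sum(arr * arr))

print(loop_total == vec_total)        # True: both approaches agree
```

Wrapping each version in timeit on your own machine is an easy way to see how large the gap is for your data sizes.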

1.2 Hands-On Code

1.2.1 Creating Arrays

import numpy as np

# 1. Basic array creation (converting from lists)
arr1 = np.array([1, 2, 3])          # 1-D array
arr2 = np.array([[1, 2], [3, 4]])   # 2-D array

# 2. Special arrays
zeros = np.zeros((2, 3))            # 2x3 array of zeros
ones = np.ones((2, 2))              # 2x2 array of ones
identity = np.eye(3)                # 3x3 identity matrix
random_arr = np.random.rand(2, 2)   # 2x2 array of uniform random values in [0, 1)

# 3. Sequences
seq = np.arange(0, 10, 2)           # fixed step: [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)     # evenly spaced: [0.  , 0.25, 0.5 , 0.75, 1.  ]
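np.random.rand draws different values on every run. When an experiment needs to be reproducible, one common option is NumPy's seeded Generator interface. The sketch below assumes a seed of 42, which is an arbitrary choice, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)        # seeded generator (seed chosen arbitrarily)
sample_a = rng.random((2, 2))          # 2x2 uniform values in [0, 1)

rng_again = np.random.default_rng(42)  # a second generator with the same seed
sample_b = rng_again.random((2, 2))    # reproduces the same values

print(np.array_equal(sample_a, sample_b))  # True
print(sample_a.shape, sample_a.dtype)      # (2, 2) float64
```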

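1.2.2 Common Array Attributes

arr2 = np.array([[1, 2], [3, 4]])
print(arr2.shape)   # (2, 2): size along each dimension
print(arr2.ndim)    # 2: number of dimensions
print(arr2.size)    # 4: total number of elements
print(arr2.dtype)   # int64 on most platforms (the default integer type is platform-dependent)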

1.2.3 Indexing and Slicing

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(matrix[0, 1])     # row 0, column 1 → 2
print(matrix[1, :])     # all columns of row 1 → [4 5 6]
print(matrix[:, -1])    # last column of every row → [3 6 9]
print(matrix[1:3, 0:2]) # rows 1-2, columns 0-1 → [[4 5], [7 8]]
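Two behaviors of indexing tend to surprise newcomers: a basic slice is a view onto the original data rather than a copy, and a boolean condition can be used directly as an index. The following is a small sketch of both, with illustrative variable names.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# A basic slice is a view: writing through it changes the original array
row_view = matrix[0, :]
row_view[0] = 100
print(matrix[0, 0])            # 100

# .copy() produces independent data, so the original is untouched
row_copy = matrix[1, :].copy()
row_copy[0] = -1
print(matrix[1, 0])            # 4

# Boolean indexing keeps only the elements where the condition holds
evens = matrix[matrix % 2 == 0]
print(evens)                   # [100   2   4   6   8]
```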

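1.2.4 Vectorized Operations (Core Feature)

Operate on whole arrays directly instead of writing for loops. The work runs in compiled code, which is usually much faster.

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise arithmetic
print(a + b)        # element-wise sum → [5 7 9]
print(a * 2)        # scalar multiplication → [2 4 6]
print(np.sqrt(a))   # element-wise square root → [1.         1.41421356 1.73205081]

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B))  # equivalent to A @ B
# Output:
# [[19 22]
#  [43 50]]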
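A frequent source of bugs is confusing element-wise multiplication with matrix multiplication. Using the same two matrices as the example above, this short sketch puts the two operators side by side.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B   # multiplies values at matching positions
matmul = A @ B        # rows of A combined with columns of B

print(elementwise)
# [[ 5 12]
#  [21 32]]
print(matmul)
# [[19 22]
#  [43 50]]
```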

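1.2.5 Broadcasting

A core NumPy feature: arrays with different but compatible shapes can be combined in a single operation, with the smaller array virtually repeated along the missing dimension.

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
vector = np.array([10, 20, 30])

# Add the 1-D vector to every row of the 2-D array
result = matrix + vector
print(result)
# Output:
# [[11 22 33]
#  [14 25 36]]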
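Broadcasting aligns shapes starting from the last dimension, so adding one value per row (instead of one per column) needs an explicit column shape. The sketch below shows that case, followed by column centering, a typical preprocessing step. The offsets are made-up numbers for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])             # shape (2, 3)
row_offsets = np.array([100, 200])         # shape (2,), one value per row

# (2, 3) + (2,) would raise an error, so reshape the offsets to a column (2, 1)
column = row_offsets[:, np.newaxis]
shifted = matrix + column
print(shifted)
# [[101 102 103]
#  [204 205 206]]

# Centering: subtract each column's mean, (2, 3) minus (3,)
centered = matrix - matrix.mean(axis=0)
print(centered)
# [[-1.5 -1.5 -1.5]
#  [ 1.5  1.5  1.5]]
```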

1.2.6 Common Statistical Functions

arr = np.array([1, 2, 3, 4])
print(np.sum(arr))      # sum → 10
print(np.mean(arr))     # mean → 2.5
print(np.std(arr))      # population standard deviation (ddof=0) → 1.118033988749895
print(np.max(arr))      # maximum → 4
print(np.argmax(arr))   # index of the maximum → 3
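On 2-D data these functions usually need an axis argument: axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row). The sketch below also shows ddof=1, which switches np.std to the sample standard deviation. The array contents are illustrative.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

total = np.sum(scores)                 # all elements → 21
column_sums = np.sum(scores, axis=0)   # one sum per column → [5 7 9]
row_means = np.mean(scores, axis=1)    # one mean per row → [2. 5.]
print(total, column_sums, row_means)

arr = np.array([1, 2, 3, 4])
sample_std = np.std(arr, ddof=1)       # divides by n - 1 instead of n
print(sample_std)
```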

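2. Pandas: The Swiss Army Knife of Data Processing

2.1 Core Concepts and Role

Pandas is a structured data analysis library built on NumPy. It is designed for tabular data, its operations map closely onto what you would do in Excel, and it is the workhorse for data cleaning and feature engineering in machine learning.

Core capabilities: reading data in many formats, handling missing values, filtering, grouping and aggregation, merging tables, and applying custom transformations.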

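Two core data structures:

Data structure	Description
Series	One-dimensional labeled array, similar to a single table column or a dictionary
DataFrame	Two-dimensional table with a row index and column names, comparable to an Excel sheet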
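The relationship between the two structures can be sketched in a few lines: a Series is a labeled 1-D array, and selecting one column of a DataFrame returns a Series. The names and values below are illustrative.

```python
import pandas as pd

# A Series: values plus an index of labels
ages = pd.Series([25, 30, 35], index=['user1', 'user2', 'user3'], name='age')
print(ages['user2'])        # lookup by label → 30
print(ages.iloc[0])         # lookup by position → 25

# A DataFrame: several aligned columns sharing one row index
people = pd.DataFrame({'name': ['Alice', 'Bob', 'Cathy'], 'age': [25, 30, 35]})
age_column = people['age']
print(type(age_column).__name__)   # Series
print(people.shape)                # (3, 2)
```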

2.2 Hands-On Code

2.2.1 Creating a DataFrame and Inspecting It

import pandas as pd

# 1. Build a table from a dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cathy'],
    'age': [25, 30, 35],
    'city': ['Beijing', 'Shanghai', 'Guangzhou']
})

# 2. Set a custom row index
df.index = ['user1', 'user2', 'user3']
print(df)

# 3. Basic inspection
print(df.head(3))      # first 3 rows
df.info()              # dtypes and non-null counts; info() prints directly and returns None
print(df.describe())   # summary statistics for numeric columns (mean, std, min/max, quartiles)

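2.2.2 Reading External Data

# 'data.csv' and 'data.xlsx' are placeholder paths; point them at your own files.
# The results get their own names so the example df from 2.2.1 stays available below.

# Read a CSV file
df_csv = pd.read_csv('data.csv')

# Read an Excel file (requires an Excel engine such as openpyxl)
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')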
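If no data file is at hand, the reading step can still be tried out by feeding read_csv an in-memory text buffer. This is a minimal sketch: the CSV content is made up for illustration, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """name,age,city
Alice,25,Beijing
Bob,30,Shanghai
Cathy,35,Guangzhou
"""

df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)     # (3, 3)
print(df_demo.dtypes)    # age is parsed as an integer column

# Writing works the same way; index=False omits the row index column
out = io.StringIO()
df_demo.to_csv(out, index=False)
print(out.getvalue().splitlines()[0])   # name,age,city
```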

2.2.3 Selecting and Filtering Data

# Select by column
names = df['name']                 # single column (Series)
subset = df[['name', 'age']]       # multiple columns (DataFrame)

# Select by row
first_row = df.iloc[0]             # positional indexing: row 0
beijing_users = df.loc[df['city'] == 'Beijing'] # boolean mask with .loc

# Conditional filter: users older than 25
adults = df[df['age'] > 25]
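Real filters usually combine several conditions. The sketch below rebuilds the example table so it runs on its own, then shows combined conditions, membership tests, and selecting rows and columns in one .loc call. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cathy'],
    'age': [25, 30, 35],
    'city': ['Beijing', 'Shanghai', 'Guangzhou']
}, index=['user1', 'user2', 'user3'])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_not_beijing = df[(df['age'] > 25) & (df['city'] != 'Beijing')]

# isin tests membership in a list of values
in_two_cities = df[df['city'].isin(['Shanghai', 'Guangzhou'])]

# .loc accepts a row label and a column label together
bob_age = df.loc['user2', 'age']

# A boolean mask for rows plus a column name in a single .loc call
names_over_25 = df.loc[df['age'] > 25, 'name']
print(names_over_25.tolist())   # ['Bob', 'Cathy']
```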

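2.2.4 Handling Missing Values (Central to Data Cleaning)

# Build a table with missing values (None becomes NaN in numeric columns)
df_nan = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})

print(df_nan.isnull())      # boolean mask marking every missing value
print(df_nan.dropna())      # drop rows that contain any missing value
print(df_nan.fillna(0))     # fill every missing value with 0
print(df_nan.fillna(df_nan.mean()))  # fill each column with its own mean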
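Before choosing a strategy it helps to count the missing values per column, and different columns often call for different fills. A minimal sketch on the same table follows; the median-versus-zero choice is only an example, not a recommendation.

```python
import pandas as pd

df_nan = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})

# Count missing values in each column
missing_per_column = df_nan.isnull().sum()
print(missing_per_column)

# A dict passed to fillna applies a separate fill value per column
filled = df_nan.fillna({'A': df_nan['A'].median(), 'B': 0})
print(filled)
#      A    B
# 0  1.0  4.0
# 1  2.0  0.0
# 2  1.5  6.0
```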

2.2.5 Grouping and Aggregation

# Simulated sales data
sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'B', 'A'],
    'region': ['North', 'South', 'North', 'South', 'East'],
    'amount': [100, 150, 200, 120, 180]
})

# Group by product and total the sales amount
total_by_product = sales.groupby('product')['amount'].sum()
print(total_by_product)
# Output:
# product
# A    480
# B    270
# Name: amount, dtype: int64
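The same groupby call extends naturally to several statistics at once and to more than one grouping key. The sketch below reuses the simulated sales table so that it runs on its own.

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'B', 'A'],
    'region': ['North', 'South', 'North', 'South', 'East'],
    'amount': [100, 150, 200, 120, 180]
})

# Several aggregations in one pass: one row per product, one column per statistic
summary = sales.groupby('product')['amount'].agg(['sum', 'mean', 'count'])
print(summary)
#          sum   mean  count
# product
# A        480  160.0      3
# B        270  135.0      2

# Grouping by two keys yields a result indexed by (product, region)
by_product_region = sales.groupby(['product', 'region'])['amount'].sum()
print(by_product_region[('A', 'North')])   # 300
```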

2.2.6 Merging Tables

# Build two related tables
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K1'], 'B': ['B0', 'B1']})

# Inner join on 'key' (how='inner' is the default)
merged = pd.merge(left, right, on='key')
print(merged)
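The example above has identical keys on both sides, so every join type gives the same result. The difference only shows once the keys do not fully match. The sketch below adds one unmatched key to each table (K2 and K3 are made up for illustration) and compares the how options.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# inner: only keys present in both tables (K0, K1)
inner = pd.merge(left, right, on='key', how='inner')

# left: every row of the left table; B is NaN where no match exists (K2)
left_join = pd.merge(left, right, on='key', how='left')

# outer: all keys from both sides; indicator=True adds a _merge column
# recording where each row came from
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)
```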

2.2.7 Applying Custom Functions in Bulk

# Option 1: a lambda (for simple arithmetic, df['age'] + 10 is the faster vectorized form)
df['age_new'] = df['age'].apply(lambda x: x + 10)

# Option 2: a named function
def classify_age(age):
    return 'Young' if age < 30 else 'Senior'

df['age_group'] = df['age'].apply(classify_age)
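apply calls a Python function once per row, which is flexible but slow on large tables. For simple rules there are vectorized alternatives. The sketch below reproduces the two columns above without apply and adds a three-way split with pd.cut; the bin edges and labels are hypothetical values picked for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]})

# Arithmetic on a whole column at once
df['age_new'] = df['age'] + 10

# np.where: a vectorized if/else
df['age_group'] = np.where(df['age'] < 30, 'Young', 'Senior')

# pd.cut: more than two buckets; intervals are (0, 29], (29, 34], (34, 120]
df['age_band'] = pd.cut(df['age'], bins=[0, 29, 34, 120],
                        labels=['<30', '30-34', '35+'])
print(df)
```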

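3. Matplotlib: The Starting Point for Data Visualization

3.1 Core Concepts and Role

Matplotlib is Python's classic 2D plotting library. With little configuration it produces high-resolution line charts, scatter plots, bar charts, histograms, and more. It is a staple for data exploration, model visualization, and presenting results, and it is the foundation that Seaborn and Pandas' built-in plotting are built on.

Typical uses in machine learning: exploring data distributions, analyzing feature correlations, monitoring training progress, and comparing prediction results.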

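3.2 Installation and Basic Plotting Workflow

Install command:

pip install matplotlib

Every plot follows the same three steps: prepare the data → draw the chart → show or save it

import matplotlib.pyplot as plt

# 1. Prepare the data
x = [1, 2, 3, 4]
y = [1, 4, 2, 3]

# 2. Draw the chart
plt.plot(x, y)

# 3. Show or save (call savefig before show, otherwise the saved image may be blank)
# plt.savefig('plot.png', dpi=300)  # save a high-resolution image
plt.show()          # open a window with the chart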
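The same three steps can also be written with Matplotlib's object-oriented interface, where the figure and axes are explicit objects. The sketch below makes two assumptions for the sake of a self-contained example: it selects the non-interactive Agg backend so it runs without a display, and it saves into an in-memory buffer instead of a file.

```python
import io
import matplotlib
matplotlib.use("Agg")   # assumption: headless run, so no window is opened
import matplotlib.pyplot as plt

# 1. Prepare the data
x = [1, 2, 3, 4]
y = [1, 4, 2, 3]

# 2. Draw on an explicit Axes object
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker='o')
ax.set_title('Basic line plot')
ax.set_xlabel('x')
ax.set_ylabel('y')

# 3. Save; the buffer stands in for a path such as 'plot.png'
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=100)
plt.close(fig)          # release the figure's memory

png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)   # True
```

Keeping a handle on fig and ax pays off once a script draws several figures, because each call then states which chart it modifies.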

3.3 Common Chart Types in Practice

3.3.1 Line Chart (Trend Analysis)

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Set the figure size
plt.figure(figsize=(8, 4))
# Draw two curves
plt.plot(x, y1, label='sin(x)', color='blue', linestyle='-', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='red', linestyle='--', linewidth=2)

# Styling
plt.title('Sine and Cosine Waves')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()   # show the legend
plt.grid(True) # show the grid
plt.show()

3.3.2 Scatter Plot (Correlation Analysis)

# Simulated height and weight data
np.random.seed(42)
height = np.random.normal(170, 10, 100)
weight = height * 0.6 + np.random.normal(0, 5, 100)

# Draw the scatter plot
plt.scatter(height, weight, alpha=0.6, c='green', s=30)
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
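A scatter plot shows a correlation visually; a single number can back up that impression. As a small companion sketch, the Pearson correlation coefficient of the same simulated data can be computed with np.corrcoef. The variable r is an illustrative name.

```python
import numpy as np

# Same simulated data as the scatter plot above
np.random.seed(42)
height = np.random.normal(170, 10, 100)
weight = height * 0.6 + np.random.normal(0, 5, 100)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two variables
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 2))
```

A value close to 1 indicates a strong positive linear relationship, which matches the upward-sloping cloud of points in the plot.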

3.3.3 Bar Chart (Comparing Categories)

categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]

plt.bar(categories, values, color=['skyblue', 'salmon', 'lightgreen', 'gold'])
plt.title('Sales by Category')
plt.ylabel('Units Sold')
plt.show()

3.3.4 Histogram (Data Distribution)

# Generate normally distributed data
data = np.random.normal(100, 15, 1000)

plt.hist(data, bins=30, color='purple', alpha=0.7, edgecolor='black')
plt.title('Distribution of Test Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

3.3.5 Multiple Subplots

# 2x2 grid of subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Subplot 1: line chart
axs[0, 0].plot([1, 2, 3], [1, 4, 9])
axs[0, 0].set_title('Line')

# Subplot 2: scatter plot
axs[0, 1].scatter([1, 2, 3], [2, 5, 3])
axs[0, 1].set_title('Scatter')

# Subplot 3: bar chart
axs[1, 0].bar(['X', 'Y'], [10, 20])
axs[1, 0].set_title('Bar')

# Subplot 4: histogram
axs[1, 1].hist(np.random.randn(1000), bins=20)
axs[1, 1].set_title('Histogram')

# Adjust spacing automatically
plt.tight_layout()
plt.show()

3.4 Pandas and Matplotlib Integration

A Pandas DataFrame can draw charts directly through its plot method, which uses Matplotlib under the hood and keeps the code short.

import pandas as pd

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'Sales': [200, 220, 250, 230],
    'Profit': [20, 25, 30, 28]
})

# Line chart
df.plot(x='Month', y=['Sales', 'Profit'], kind='line', marker='o')
plt.title('Monthly Sales & Profit')
plt.show()

# Bar chart
df.set_index('Month')[['Sales']].plot(kind='bar', color='teal')
plt.show()

Commonly used values for the kind argument: line, bar, barh, hist, box, scatter, and pie. Others, such as kde and area, are also available.
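Because DataFrame.plot returns a Matplotlib Axes object, a chart created through Pandas can be refined with the usual Axes methods. The sketch below assumes the non-interactive Agg backend so that it runs without a display; the title and labels are illustrative.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless run, so no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'Sales': [200, 220, 250, 230],
    'Profit': [20, 25, 30, 28]
})

# plot() hands back the Axes it drew on; rot=0 keeps the month labels horizontal
ax = df.plot(x='Month', y='Sales', kind='bar', color='teal', legend=False, rot=0)
ax.set_title('Monthly Sales')
ax.set_ylabel('Sales')

n_bars = len(ax.patches)   # one rectangle per month
print(n_bars)              # 4
plt.close(ax.figure)
```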

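3.5 Practical Tips for Avoiding Pitfalls

3.5.1 Fixing Garbled Chinese Text and Broken Minus Signs

plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei ships with Windows; use any installed CJK font on other systems
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly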
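SimHei is not installed on every system, and naming a missing font leaves the labels garbled. One way to make the setting more portable is to check which fonts Matplotlib can actually find and pick the first match. In the sketch below the candidate list is an illustrative assumption, not an exhaustive or authoritative list of CJK fonts.

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Illustrative candidates; adjust to the fonts available on your system
candidates = ['SimHei', 'Microsoft YaHei', 'PingFang SC',
              'Noto Sans CJK SC', 'WenQuanYi Micro Hei']

# Names of all fonts Matplotlib has discovered
installed = {font.name for font in font_manager.fontManager.ttflist}

# First candidate that is installed, or None if none of them are
chosen = next((name for name in candidates if name in installed), None)

if chosen is not None:
    # Put the chosen font first and keep the existing fallbacks after it
    plt.rcParams['font.sans-serif'] = [chosen] + list(plt.rcParams['font.sans-serif'])
    plt.rcParams['axes.unicode_minus'] = False

print(chosen)
```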

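3.5.2 Saving High-Resolution Images Without Extra Margins

# dpi=300 raises the resolution; bbox_inches='tight' trims the surrounding whitespace
plt.savefig('my_plot.png', dpi=300, bbox_inches='tight')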

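4. Key Takeaways

These three libraries are the essential foundation for getting started with machine learning. They build on one another and work together:

NumPy: the numerical layer underneath; fast array and matrix computation that the other two libraries rely on
Pandas: the data processing layer on top; cleaning, filtering, and summarizing structured tables of real-world data
Matplotlib: the visualization layer; turns abstract numbers into readable charts that support analysis and model debugging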
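The division of labor described above can be sketched as one short pipeline. Everything in it is made up for illustration: the synthetic data, the even/odd grouping, and the column names. It also assumes the non-interactive Agg backend and writes the chart to an in-memory buffer so that it runs anywhere.

```python
import io
import matplotlib
matplotlib.use("Agg")   # assumption: headless run, so no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: generate synthetic measurements and inject some missing readings
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)
values[::25] = np.nan

# Pandas: put the data in a table, fill the gaps, summarize by group
df = pd.DataFrame({
    'group': np.where(np.arange(200) % 2 == 0, 'even', 'odd'),
    'value': values,
})
df['value'] = df['value'].fillna(df['value'].mean())
group_means = df.groupby('group')['value'].mean()

# Matplotlib: turn the summary into a chart
fig, ax = plt.subplots()
ax.bar(list(group_means.index), group_means.values)
ax.set_ylabel('Mean value')
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
plt.close(fig)

print(group_means)
```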

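With these three libraries in hand, you can cover the data loading, cleaning, processing, and visualization stages of a machine learning workflow, which lays a solid foundation for modeling with Scikit-learn and for deep learning later on.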