MLP (Multilayer Perceptron): Concepts and Code Walkthrough (with TensorBoard Visualization)

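Origin and Working Principle of the Multilayer Perceptron

Why the multilayer perceptron was proposed: a single-layer perceptron cannot solve nonlinear problems such as XOR, so the multilayer perceptron was introduced to handle them.
XOR recap: as the figure shows, no single straight line can separate the two classes in this data distribution, so a single-layer perceptron cannot solve the problem. The multilayer perceptron was proposed to overcome this limitation.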

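How a multilayer perceptron solves XOR: as the figure below shows, four data points lie in the data space. We first compute two straight lines, each of which splits the data into two classes, and then combine the two lines' classifications with an XNOR (XOR works as well) to obtain the two classes of samples.
This computation can be represented by the following network: the input passes through the blue perceptron and the red perceptron, which produce different outputs, and both outputs are then fed into the green perceptron, which yields the final classification.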
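
The two-lines-then-combine idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights and thresholds are hand-picked (not learned), the helper names `step`, `perceptron` and `xor_mlp` are hypothetical, and the two hidden units play the roles of the blue and red perceptrons while the last one plays the green perceptron.

```python
def step(z):
    """Heaviside step activation of the classic perceptron."""
    return 1 if z > 0 else 0


def perceptron(x1, x2, w1, w2, b):
    """A single perceptron: weighted sum plus bias, then step."""
    return step(w1 * x1 + w2 * x2 + b)


def xor_mlp(x1, x2):
    # Hidden unit 1 (assumed weights): the line x1 + x2 = 0.5, i.e. logical OR
    h_or = perceptron(x1, x2, 1.0, 1.0, -0.5)
    # Hidden unit 2 (assumed weights): the line x1 + x2 = 1.5, i.e. logical AND
    h_and = perceptron(x1, x2, 1.0, 1.0, -1.5)
    # Output unit: fires only when OR is on and AND is off, which is XOR
    return perceptron(h_or, h_and, 1.0, -1.0, -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

Neither hidden unit alone separates the XOR classes; only the combination in the output unit does, which is why at least one hidden layer is needed.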

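Structure and Computation of the Multilayer Perceptron

A multilayer perceptron (MLP), also called a fully connected network, is a neural network made up of several layers of neurons. It mainly consists of the following layers:

Input layer: receives the raw data (for example image pixels or feature vectors).
Hidden layers: apply nonlinear transformations (each layer consists of multiple neurons); there are usually several hidden layers.
Output layer: produces the final prediction (for example class probabilities or regression values).

The defining feature of an MLP is that it is fully connected (dense): every neuron in one layer is connected to every neuron in the next, and each layer is usually followed by a nonlinear activation function.

An example multilayer perceptron is shown below: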

A single neuron computes:

$$y = f\left(\sum_i w_i x_i + b\right)$$

- $x_i$: inputs
- $w_i$: weights
- $b$: bias
- $f$: activation function (ReLU is a common choice)
Forward propagation works as follows: data flows from the input layer through the hidden layers to the output layer, and each layer computes:

$$h^l = f\left(W^l h^{l-1} + b^l\right)$$
- $h^l$: output of layer $l$
- $W^l$: weight matrix of layer $l$
- $b^l$: bias vector of layer $l$
- $h^{l-1}$: output of the previous layer
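
As a concrete illustration of the layer formula, here is a minimal pure-Python sketch of one fully connected layer. The function name `dense_forward` and the toy weights are assumptions made for this example only; a real framework would use matrix operations instead of Python loops.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def dense_forward(W, b, h_prev, activation=relu):
    """Compute h = f(W h_prev + b) for one fully connected layer.

    W is a list of rows (one row of weights per output neuron),
    b is a list of biases, h_prev is the previous layer's output.
    """
    out = []
    for row, bias in zip(W, b):
        z = sum(w * x for w, x in zip(row, h_prev)) + bias
        out.append(activation(z))
    return out


# Toy layer with 2 inputs and 3 neurons (values chosen for illustration)
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, -1.0]]
b = [0.0, 1.0, 0.5]
h_prev = [2.0, 1.0]
print(dense_forward(W, b, h_prev))
```

The third neuron has a negative pre-activation, so ReLU clips its output to zero.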
Activation functions
- Purpose: introduce nonlinearity so that the network can fit complex functions.
- Common choices:
  - ReLU: $f(x)=\max(0,x)$
  - Sigmoid: $f(x)=\frac{1}{1+e^{-x}}$
  - Tanh: $f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$
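
The three formulas above translate directly into code. The following is a small sketch using only the standard library; the function names are chosen for this example, and the naive `sigmoid` and `tanh` are written to mirror the formulas rather than to be numerically robust for very large inputs.

```python
import math


def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)), output in (0, 1)"""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), output in (-1, 1)"""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


for v in (-2.0, 0.0, 2.0):
    print(v, relu(v), sigmoid(v), tanh(v))
```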
Loss functions
- Classification: cross-entropy loss
- Regression: mean squared error (MSE)
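
To make the two losses concrete, here is a minimal sketch for a single sample. It assumes the classifier already outputs a probability vector and that the label is a class index; the names `mse` and `cross_entropy` are chosen for this example.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally long lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs, label):
    """Cross-entropy for one sample: negative log-probability of the true class."""
    return -math.log(probs[label])


print(mse([1.0, 2.0], [1.0, 4.0]))        # regression example
print(cross_entropy([0.25, 0.75], 1))     # confident and correct: small loss
print(cross_entropy([0.25, 0.75], 0))     # true class got low probability: larger loss
```

The cross-entropy grows as the probability assigned to the true class shrinks, which is what pushes the network toward confident correct predictions.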
Backpropagation and optimization: gradient descent is commonly used to adjust the weights so as to minimize the loss function:

$$W = W - \eta\frac{\partial L}{\partial W}$$
- $\eta$: learning rate
- $\frac{\partial L}{\partial W}$: gradient of the loss with respect to the weights
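
The update rule can be tried on a toy problem. The sketch below assumes a one-dimensional loss $L(w) = (w-3)^2$, whose gradient is $2(w-3)$; the loss, the starting point and the function names are all illustrative choices, not part of the network in this article.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, lr, steps):
    """Repeatedly apply w = w - lr * dL/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


print(gradient_descent(0.0, 0.1, 1))    # after one step
print(gradient_descent(0.0, 0.1, 100))  # approaches the minimizer w = 3
```

With a learning rate of 0.1 the distance to the minimum shrinks by a constant factor each step; a learning rate that is too large would make the iterates overshoot instead.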

Next, we walk through how such a network runs.

Suppose the network has one input layer, three hidden layers and one output layer. The computation can then be written as follows:
- Input layer -> first hidden layer
  
  $$h_1 = f(w_1 x + b_1)$$
  - $x$ is the input, $w_1$ is the weight from the input layer to the first hidden layer, $b_1$ is the bias from the input layer to the first hidden layer, $f()$ is the activation function, and $h_1$ is the output of this hidden layer.
- First hidden layer -> second hidden layer
  
  $$h_2 = f(w_2 h_1 + b_2)$$
  - $h_1$ is the output of the first hidden layer, $w_2$ is the weight from the first to the second hidden layer, $b_2$ is the bias from the first to the second hidden layer, $f()$ is the activation function, and $h_2$ is the output of this hidden layer.
- Second hidden layer -> third hidden layer
  
  $$h_3 = f(w_3 h_2 + b_3)$$
  - $h_2$ is the output of the second hidden layer, $w_3$ is the weight from the second to the third hidden layer, $b_3$ is the bias from the second to the third hidden layer, $f()$ is the activation function, and $h_3$ is the output of this hidden layer.
- Third hidden layer -> output layer
  
  $$o = f(w_4 h_3 + b_4)$$
  - $h_3$ is the output of the third hidden layer, $w_4$ is the weight from the third hidden layer to the output layer, $b_4$ is the bias from the third hidden layer to the output layer, $f()$ is the activation function, and $o$ is the output of the output layer. (In practice the output layer often uses no activation, as in the code later in this article.)
- Finally, computing $y = \mathrm{softmax}(o)$ gives the final output of the network.
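
The final softmax step turns the output scores into a probability distribution. A minimal sketch follows; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result, and the example scores are made up for illustration.

```python
import math


def softmax(o):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(o)                              # shift for numerical stability
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]                    # hypothetical output-layer values
probs = softmax(scores)
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, so taking the argmax of the scores or of the probabilities selects the same class.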

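Code Example for the Multilayer Perceptron

Choosing a dataset

FashionMNIST is a popular image classification dataset that is often used for introductory tasks in machine learning and computer vision. It was designed as a drop-in replacement for the classic MNIST dataset (handwritten digit recognition): it offers a more challenging classification task while keeping a similar format and size.
- Dataset overview
  - Commonly used for image classification; it contains grayscale images of fashion items in 10 categories.
  - It contains 60,000 training images and 10,000 test images, each 28x28 pixels.

The code for downloading the dataset and displaying sample images is shown below (the complete code is listed at the end of this article).

Downloading the dataset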

# 1. Prepare the dataset
import torchvision
from torch.utils.data import DataLoader

train_data = torchvision.datasets.FashionMNIST("./data",train=True,transform=torchvision.transforms.ToTensor(),download=True)
test_data = torchvision.datasets.FashionMNIST("./data",train=False,transform=torchvision.transforms.ToTensor(),download=True)

train_data_size = len(train_data)
test_data_size = len(test_data)
print("Size of the training set: {}".format(train_data_size))
print("Size of the test set: {}".format(test_data_size))

train_dataloader = DataLoader(train_data,batch_size=64,shuffle=True)
# Shuffling is not needed for evaluation; the metrics do not depend on batch order
test_dataloader = DataLoader(test_data,batch_size=64,shuffle=False)

Visualizing the dataset

# 2. Visualization
import matplotlib.pyplot as plt

# Class labels
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Take one batch of images from the (shuffled) training loader
images, labels = next(iter(train_dataloader))  # one batch (64 images)
# iter() turns the loader into an iterator; next() fetches the next batch
# images is a tensor of shape [64, 1, 28, 28]; labels is a tensor of shape [64]

# Display helper (ToTensor has already scaled the pixels to [0, 1])
def imshow(img):
    img = img.numpy().transpose((1, 2, 0))  # CxHxW -> HxWxC
    # PyTorch uses [channel, height, width]; Matplotlib expects [height, width, channel]
    plt.imshow(img, cmap='gray')
    plt.axis('off')

# Draw a 4x8 grid (32 images in total)
plt.figure(figsize=(12, 6))  # set the figure size
for i in range(32):  # show the first 32 images
    plt.subplot(4, 8, i+1)  # 4 rows, 8 columns
    imshow(images[i])
    plt.title(class_names[labels[i].item()], fontsize=8)  # look up the class name
plt.tight_layout()  # adjust spacing
plt.show()  # display the figure

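Sample images from the dataset:

Network Structure of the Multilayer Perceptron

The network in this article uses two hidden layers, configured as follows: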

from torch import nn

class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.model = nn.Sequential(
      nn.Flatten(),
      nn.Linear(784,512),
      nn.ReLU(),
      nn.Linear(512,256),
      nn.ReLU(),
      nn.Linear(256,10)
    )

  def forward(self,x):
    x = self.model(x)
    return x

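Flatten layer (nn.Flatten()):
- Flattens each input sample into a one-dimensional vector.
- For a 28×28 input image, the flattened vector has 784 dimensions (28×28=784).
First hidden layer (nn.Linear(784, 512)):
- Input dimension: 784
- Output dimension: 512
- Followed by a ReLU activation
Second hidden layer (nn.Linear(512, 256)):
- Input dimension: 512
- Output dimension: 256
- Followed by a ReLU activation
Output layer (nn.Linear(256, 10)):
- Input dimension: 256
- Output dimension: 10
- No activation function (the usual setup for classification when paired with the cross-entropy loss)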
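
The layer sizes above determine how many trainable parameters the network has: each linear layer holds one weight per input-output pair plus one bias per output. The short sketch below counts them in plain Python; the helper names are chosen for this example.

```python
layer_sizes = [784, 512, 256, 10]   # flattened input, two hidden layers, output


def count_linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one fully connected layer."""
    return n_in * n_out + n_out


def count_mlp_params(sizes):
    """Total trainable parameters of an MLP with the given layer sizes."""
    return sum(count_linear_params(a, b) for a, b in zip(sizes, sizes[1:]))


print(28 * 28)                        # input dimension after flattening
print(count_mlp_params(layer_sizes))  # total parameter count
```

Most of the parameters sit in the first layer, because it connects all 784 pixels to all 512 hidden neurons.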

Training the Multilayer Perceptron

The training code is shown below:

import torch

# Select the device to run on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Instantiate the model
net = Net().to(device)
# Loss function
loss_fun = nn.CrossEntropyLoss()

# Optimizer
lr = 0.01
# Use SGD for gradient descent
optimizer = torch.optim.SGD(net.parameters(),lr=lr)

# Bookkeeping for training
# Number of training steps (batches) so far
total_train_step = 0
# Number of evaluation runs so far
total_test_step = 0
# Number of epochs
epoch = 10

for i in range(epoch):
  print("----- Epoch {} training starts ----".format(i+1))

  # Training
  net.train()  # switch to training mode
  for data in train_dataloader:
    imgs, targets = data # imgs are the images, targets are their labels
    # imgs has shape [64 (batch size), 1 (channels), 28 (height), 28 (width)]
    # targets has shape [64 (batch size)]
    imgs = imgs.to(device)      # move the data to the selected device
    targets = targets.to(device) # move the labels to the selected device

    outputs = net(imgs) # forward pass; this calls forward() internally
    loss = loss_fun(outputs,targets)
    # Clear the old gradients
    optimizer.zero_grad()
    # Compute the current gradients
    loss.backward()
    # Update the weights
    optimizer.step()

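Device selection
```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = Net().to(device)
```
- Purpose: automatically detect and select the available compute device (the GPU is preferred when present).
- Key points:
  - The model and the data must be moved to the same device (to(device)).
  - If a GPU is available, all computation runs on the GPU, which usually speeds up training considerably.
Choosing the loss function and optimizer
```
loss_fun = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
```
- Loss function: CrossEntropyLoss
  - Suitable for multi-class classification.
  - It combines LogSoftmax and NLLLoss internally, so no explicit Softmax layer is needed.
- Optimizer: stochastic gradient descent (SGD)
  - A learning rate of lr=0.01 is a common starting value and can be tuned for the task.
  - The optimizer updates net.parameters(), that is, all trainable parameters of the model.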

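Running the training loop

Loading the data

imgs, targets = data # imgs are the images, targets are their labels
# imgs has shape [64 (batch size), 1 (channels), 28 (height), 28 (width)]
# targets has shape [64 (batch size)]
imgs = imgs.to(device)      # move the data to the selected device
targets = targets.to(device) # move the labels to the selected device

Forward pass and loss computation

outputs = net(imgs) # forward pass; this calls forward() internally
loss = loss_fun(outputs,targets)

Backpropagation and parameter update

# Clear the old gradients
optimizer.zero_grad()
# Compute the current gradients
loss.backward()
# Update the weights
optimizer.step()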

Testing the Multilayer Perceptron

The test procedure is as follows:

  # Evaluation (runs inside the epoch loop, after the training loop)
  net.eval()
  total_test_loss = 0
  total_accuracy = 0
  # No gradients are needed during evaluation
  with torch.no_grad():
    for data in test_dataloader:
      # Load the data
      imgs,targets = data
      imgs = imgs.to(device)  # move the data to the selected device
      targets = targets.to(device)  # move the labels to the selected device
      # Prediction and loss
      outputs = net(imgs)
      loss = loss_fun(outputs,targets)
      total_test_loss = total_test_loss + loss.item()
      # Count correct predictions
      accuracy = (outputs.argmax(1) == targets).sum()
      total_accuracy = total_accuracy + accuracy

  print(total_accuracy)
  print(test_data_size)
  print("Loss on the whole test set: {}".format(total_test_loss))
  print("Accuracy on the whole test set: {}".format(total_accuracy.item() / test_data_size))

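Iterating over the data
```
for data in test_dataloader:
    imgs, targets = data
    imgs = imgs.to(device)      # move the data to the selected device
    targets = targets.to(device)  # move the labels to the selected device
```
- Data loading:
  - test_dataloader provides the test set in batches (with the same shapes as during training).
  - imgs has shape [batch_size, 1, 28, 28] and targets has shape [batch_size].
- Device transfer: ensures the data is on the same device as the model (GPU or CPU).
Prediction and loss computation
```
      # Prediction and loss
      outputs = net(imgs)
      loss = loss_fun(outputs,targets)
      total_test_loss = total_test_loss + loss.item()
```
- outputs: the logits predicted by the model (unnormalized scores), with shape [batch_size, 10].
- loss.item(): returns the loss as a plain Python number (detached from the computation graph).
- Accumulated loss: total_test_loss is the sum of the per-batch losses; divide it by the number of batches to obtain the average loss (the code here prints the sum).
Accuracy computation
```
      # Count correct predictions
      accuracy = (outputs.argmax(1) == targets).sum()
      total_accuracy = total_accuracy + accuracy
```
- outputs.argmax(1):
  - Takes the index of the maximum along the class dimension (dimension 1), giving the predicted class labels with shape [batch_size].
- (pred == targets).sum():
  - Counts the correctly predicted samples in the current batch.
- Accumulated correct count: total_accuracy holds the number of correct predictions over all batches; dividing it by the total number of samples gives the overall accuracy.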
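
The bookkeeping described above can be reproduced without PyTorch on a tiny made-up example. The sketch below assumes each batch is given as a tuple of logits, integer targets and an already computed batch-mean loss; the names `argmax` and `evaluate` and all numbers are illustrative, not results from the model in this article.

```python
def argmax(row):
    """Index of the largest score in one row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def evaluate(batches):
    """Return (accuracy, average batch loss) over a list of batches.

    Each batch is a tuple (logits, targets, batch_mean_loss).
    """
    correct, total, loss_sum = 0, 0, 0.0
    for logits, targets, loss in batches:
        preds = [argmax(r) for r in logits]                  # like outputs.argmax(1)
        correct += sum(int(p == t) for p, t in zip(preds, targets))
        total += len(targets)
        loss_sum += loss                                     # like total_test_loss
    return correct / total, loss_sum / len(batches)


# Two toy batches with 2 samples and 3 classes each (made-up values)
toy_batches = [
    ([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]], [0, 2], 0.5),
    ([[0.1, 0.2, 3.0], [1.0, 0.5, 0.2]], [2, 0], 0.25),
]
acc, avg_loss = evaluate(toy_batches)
print(acc, avg_loss)
```

Accuracy is divided by the number of samples while the loss is divided by the number of batches, because each per-batch loss is already a mean over that batch.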

Printing the metrics

  print(total_accuracy)
  print(test_data_size)
  print("Loss on the whole test set: {}".format(total_test_loss))
  print("Accuracy on the whole test set: {}".format(total_accuracy.item() / test_data_size))

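TensorBoard Visualization for the Multilayer Perceptron

This article uses TensorBoard to inspect the experimental results. Since TensorBoard is not the focus here, only the code that needs to be added to the training script is given below. If this fragment is hard to follow, see the complete code listing at the end.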

from datetime import datetime
from torch.utils.tensorboard import SummaryWriter

# Training configuration
config = {
    "model_name": "FashionMNIST_MLP",
    "batch_size": 64,
    "learning_rate": 0.01,
    "epochs": 10,
    "hidden_layers": "512-256",
    "optimizer": "SGD"
}

# Build a log directory name that includes the timestamp and key parameters
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_dir = f"./logs/{config['model_name']}_lr{config['learning_rate']}_bs{config['batch_size']}_{timestamp}"

# Initialize the SummaryWriter
writer = SummaryWriter(log_dir=log_dir)

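# ... network definition, loss function and optimizer as shown earlier ...

for i in range(epoch):
    net.train()
    for data in train_dataloader:
        # ... training step as shown earlier (forward, backward, optimizer.step()) ...
        total_train_step += 1
        if total_train_step % 100 == 0:
            # Log the training loss every 100 steps
            writer.add_scalar("train_loss", loss.item(), total_train_step)

    # ... test loop as shown earlier (computes total_test_loss and total_accuracy) ...
    writer.add_scalar("test_loss", total_test_loss, total_test_step)
    writer.add_scalar("test_accuracy", total_accuracy.item() / test_data_size, total_test_step)
    total_test_step += 1

writer.close()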
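
The log directory naming scheme is what lets several runs sit side by side in TensorBoard. The sketch below isolates it in a small helper so that the resulting name can be checked; `make_log_dir` is a hypothetical function introduced here for illustration, and the fixed timestamp is only used to make the output predictable.

```python
from datetime import datetime


def make_log_dir(config, now=None):
    """Build a run directory name from the config and a timestamp."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return (f"./logs/{config['model_name']}_lr{config['learning_rate']}"
            f"_bs{config['batch_size']}_{timestamp}")


example_config = {"model_name": "FashionMNIST_MLP", "learning_rate": 0.01, "batch_size": 64}
print(make_log_dir(example_config, datetime(2024, 1, 2, 3, 4, 5)))
```

Because every run writes to its own subdirectory of ./logs, starting TensorBoard with `tensorboard --logdir ./logs` shows all runs together, labeled by learning rate, batch size and start time.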

Complete Code for the Multilayer Perceptron

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # allow the OpenMP library to be loaded more than once
import torch
import matplotlib.pyplot as plt
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from datetime import datetime


# 1. Prepare the dataset
train_data = torchvision.datasets.FashionMNIST("./data",train=True,transform=torchvision.transforms.ToTensor(),download=True)
test_data = torchvision.datasets.FashionMNIST("./data",train=False,transform=torchvision.transforms.ToTensor(),download=True)

train_data_size = len(train_data)
test_data_size = len(test_data)
print("Size of the training set: {}".format(train_data_size))
print("Size of the test set: {}".format(test_data_size))

train_dataloader = DataLoader(train_data,batch_size=64,shuffle=True)
# Shuffling is not needed for evaluation; the metrics do not depend on batch order
test_dataloader = DataLoader(test_data,batch_size=64,shuffle=False)

# 2. Visualization
# Class labels
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Take one batch of images from the (shuffled) training loader
images, labels = next(iter(train_dataloader))  # one batch (64 images)
# iter() turns the loader into an iterator; next() fetches the next batch
# images is a tensor of shape [64, 1, 28, 28]; labels is a tensor of shape [64]

# Display helper (ToTensor has already scaled the pixels to [0, 1])
def imshow(img):
    img = img.numpy().transpose((1, 2, 0))  # CxHxW -> HxWxC
    # PyTorch uses [channel, height, width]; Matplotlib expects [height, width, channel]
    plt.imshow(img, cmap='gray')
    plt.axis('off')

# Draw a 4x8 grid (32 images in total)
plt.figure(figsize=(12, 6))  # set the figure size
for i in range(32):  # show the first 32 images
    plt.subplot(4, 8, i+1)  # 4 rows, 8 columns
    imshow(images[i])
    plt.title(class_names[labels[i].item()], fontsize=8)  # look up the class name
plt.tight_layout()  # adjust spacing
plt.show()  # display the figure

# Select the device to run on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define the model
class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.model = nn.Sequential(
      nn.Flatten(),
      nn.Linear(784,512),
      nn.ReLU(),
      nn.Linear(512,256),
      nn.ReLU(),
      nn.Linear(256,10)
    )

  def forward(self,x):
    x = self.model(x)
    return x

# Instantiate the model
net = Net().to(device)

# Training configuration
config = {
    "model_name": "FashionMNIST_MLP",
    "batch_size": 64,
    "learning_rate": 0.01,
    "epochs": 10,
    "hidden_layers": "512-256",
    "optimizer": "SGD"
}

# Build a log directory name that includes the timestamp and key parameters
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_dir = f"./logs/{config['model_name']}_lr{config['learning_rate']}_bs{config['batch_size']}_{timestamp}"

# Initialize the SummaryWriter
writer = SummaryWriter(log_dir=log_dir)

# Loss function
loss_fun = nn.CrossEntropyLoss()

# Optimizer
lr = 0.01
# Use SGD for gradient descent
optimizer = torch.optim.SGD(net.parameters(),lr=lr)

# Bookkeeping for training
# Number of training steps (batches) so far
total_train_step = 0
# Number of evaluation runs so far
total_test_step = 0
# Number of epochs
epoch = 10

for i in range(epoch):
  print("----- Epoch {} training starts ----".format(i+1))

  # Training
  net.train()  # switch to training mode
  for data in train_dataloader:
    imgs, targets = data # imgs are the images, targets are their labels
    # imgs has shape [64 (batch size), 1 (channels), 28 (height), 28 (width)]
    # targets has shape [64 (batch size)]
    imgs = imgs.to(device)      # move the data to the selected device
    targets = targets.to(device) # move the labels to the selected device

    outputs = net(imgs) # forward pass; this calls forward() internally
    loss = loss_fun(outputs,targets)
    # Clear the old gradients
    optimizer.zero_grad()
    # Compute the current gradients
    loss.backward()
    # Update the weights
    optimizer.step()
    total_train_step += 1
    if total_train_step % 100 == 0:
      # .item() yields a plain Python number instead of a tensor
      print("Training step: {}, loss: {}".format(total_train_step, loss.item()))
      writer.add_scalar("train_loss",loss.item(),total_train_step)
  # Evaluation
  net.eval()
  total_test_loss = 0
  total_accuracy = 0
  # No gradients are needed during evaluation
  with torch.no_grad():
    for data in test_dataloader:
      # Load the data
      imgs,targets = data
      imgs = imgs.to(device)  # move the data to the selected device
      targets = targets.to(device)  # move the labels to the selected device
      # Prediction and loss
      outputs = net(imgs)
      loss = loss_fun(outputs,targets)
      total_test_loss = total_test_loss + loss.item()
      # Count correct predictions
      accuracy = (outputs.argmax(1) == targets).sum()
      total_accuracy = total_accuracy + accuracy

  print(total_accuracy)
  print(test_data_size)
  print("Loss on the whole test set: {}".format(total_test_loss))
  print("Accuracy on the whole test set: {}".format(total_accuracy.item() / test_data_size))
  writer.add_scalar("test_loss", total_test_loss, total_test_step)
  writer.add_scalar("test_accuracy", total_accuracy.item() / test_data_size, total_test_step)
  total_test_step += 1

writer.close()

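Results and Summary

Accuracy curve:
Loss curve:
In the end, the loss on the whole test set is 0.0075168761312961576 and the accuracy on the whole test set is 0.8255, which is within a reasonable range. However, the MLP accuracy reported in the official FashionMNIST paper is higher, so the model in this article still has considerable room for improvement.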