NeuralForecast TokenEmbedding: 1D Convolution (Conv1d) and Matrix Multiplication

flyfish

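TokenEmbedding uses a one-dimensional convolution (Conv1d).

TokenEmbedding source code analysis

A usage example is added on top of the source code.

The code below is analyzed in the following sections.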

import torch
import torch.nn as nn
class TokenEmbedding(nn.Module):
    def __init__(self, c_in, hidden_size):
        super(TokenEmbedding, self).__init__()
        padding = 1 if torch.__version__ >= "1.5.0" else 2
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=hidden_size,
            kernel_size=3,
            padding=padding,
            padding_mode="circular",
            bias=False,
        )
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(
                    m.weight, mode="fan_in", nonlinearity="leaky_relu"
                )

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x
    

# Create a TokenEmbedding instance
c_in = 10  # number of input channels (features per time step)
hidden_size = 20  # number of output channels (embedding size)
token_embedding = TokenEmbedding(c_in, hidden_size)

# Create the input data
batch_size = 32
sequence_length = 100
input_features = 10  # must equal c_in
x = torch.randn(batch_size, sequence_length, input_features)  # shape: (batch_size, sequence_length, input_features)

# Forward pass
output = token_embedding(x)

# Print the result
print("Output shape:", output.shape)
# Output shape: torch.Size([32, 100, 20])

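The TokenEmbedding class inherits from nn.Module. The call super(TokenEmbedding, self).__init__() runs nn.Module's __init__() method, so the base-class initialization (for example, parameter and submodule registration) is performed whenever a TokenEmbedding instance is created.

The __init__ method:

During initialization a one-dimensional convolution layer, self.tokenConv, is defined with c_in input channels, hidden_size output channels, a kernel size of 3, padding_mode="circular", and bias=False. The padding is 1 when the PyTorch version is at least 1.5.0 and 2 otherwise. Note that the check compares version strings lexicographically rather than numerically, so for example "1.10.0" >= "1.5.0" evaluates to False. The loop then iterates over all submodules and applies Kaiming initialization to every module of type nn.Conv1d.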
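The effect of the padding value can be checked directly. The following is a minimal sketch, assuming a recent PyTorch release in which circular padding adds `padding` elements on each side; the helper name `conv_out_len` and the sizes are made up for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 16)  # (batch, channels, length)

def conv_out_len(padding):
    """Hypothetical helper: output length of a kernel-3 circular Conv1d."""
    conv = nn.Conv1d(4, 8, kernel_size=3, padding=padding,
                     padding_mode="circular", bias=False)
    return conv(x).shape[-1]

len_pad1 = conv_out_len(1)  # 16 + 2*1 - 3 + 1 = 16, length preserved
len_pad2 = conv_out_len(2)  # 16 + 2*2 - 3 + 1 = 18, two extra steps
print(len_pad1, len_pad2)

# Version strings are compared character by character, not numerically.
string_check = "1.10.0" >= "1.5.0"
print(string_check)
```

With padding=1 the sequence length is preserved, which is what lets the embedding keep one output vector per time step.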

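The forward method:

The input x, of shape (batch_size, sequence_length, c_in), is permuted to (batch_size, c_in, sequence_length), which is the layout nn.Conv1d expects. self.tokenConv then convolves along the time axis, and the result is transposed back to (batch_size, sequence_length, hidden_size) and returned.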
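The shape changes in forward can be traced step by step. This is a small sketch with arbitrary sizes; the intermediate names `x_cf` and `y_cf` ("channels first") are introduced here only for illustration.

```python
import torch
import torch.nn as nn

batch_size, seq_len, c_in, hidden_size = 2, 10, 3, 8
conv = nn.Conv1d(c_in, hidden_size, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)

x = torch.randn(batch_size, seq_len, c_in)  # (batch, time, features)
x_cf = x.permute(0, 2, 1)                   # (batch, features, time)
y_cf = conv(x_cf)                           # (batch, hidden, time)
y = y_cf.transpose(1, 2)                    # (batch, time, hidden)

shapes = [tuple(t.shape) for t in (x, x_cf, y_cf, y)]
print(shapes)
```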

Comparing different padding_mode values

import torch
import torch.nn as nn

# Define the input sequence
input_seq = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32).view(1, 1, -1)

# Define the convolution layers
conv_zero_padding = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1, padding_mode='zeros', bias=False)
conv_circular_padding = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1, padding_mode='circular', bias=False)

# Set the kernels manually to a simple moving average
with torch.no_grad():
    conv_zero_padding.weight.fill_(1.0 / 3)
    conv_circular_padding.weight.fill_(1.0 / 3)

# Apply the convolutions
output_zero_padding = conv_zero_padding(input_seq)
output_circular_padding = conv_circular_padding(input_seq)

print("Input sequence:", input_seq)
print("Zero padding output:", output_zero_padding)
print("Circular padding output:", output_circular_padding)

Output:

Input sequence: tensor([[[1., 2., 3., 4., 5.]]])
Zero padding output: tensor([[[1., 2., 3., 4., 3.]]], grad_fn=<ConvolutionBackward0>)
Circular padding output: tensor([[[2.6667, 2.0000, 3.0000, 4.0000, 3.3333]]],
       grad_fn=<ConvolutionBackward0>)
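To see where the boundary values come from, the padding can be applied explicitly before an unpadded convolution. This is a sketch using torch.nn.functional; it is meant to mirror the two layers above, not to describe how PyTorch implements them internally.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1., 2., 3., 4., 5.]).view(1, 1, -1)
kernel = torch.full((1, 1, 3), 1.0 / 3)  # moving-average kernel

# Pad one element on each side of the last dimension.
zero_padded = F.pad(x, (1, 1), mode="constant", value=0.0)
circ_padded = F.pad(x, (1, 1), mode="circular")
print(zero_padded.flatten().tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
print(circ_padded.flatten().tolist())  # [5.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1.0]

# Convolve the padded sequences without any further padding.
out_zero = F.conv1d(zero_padded, kernel)
out_circ = F.conv1d(circ_padded, kernel)
print(out_zero, out_circ)
```

With circular padding the first window is (5, 1, 2) and the last is (4, 5, 1), which gives 8/3 ≈ 2.6667 and 10/3 ≈ 3.3333 at the two ends.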

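Embedding layers: how nn.Conv1d and nn.Embedding differ

nn.Conv1d embeds continuous-valued inputs through a learned linear map over a local window of time steps, whereas nn.Embedding looks up a learned vector for each discrete integer index.

TokenEmbedding using nn.Conv1d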

import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, c_in, hidden_size):
        super(TokenEmbedding, self).__init__()
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=hidden_size,
            kernel_size=3,
            padding=1,
            padding_mode="circular",
            bias=False,
        )

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x

# Example input
batch_size = 2
sequence_length = 10
feature_dim = 3

time_series = torch.randn(batch_size, sequence_length, feature_dim)
embedding = TokenEmbedding(c_in=feature_dim, hidden_size=8)
embedded_time_series = embedding(time_series)
print(embedded_time_series.shape)  # torch.Size([2, 10, 8])

Using nn.Embedding

import torch
import torch.nn as nn

class SimpleEmbedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbedding, self).__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)

    def forward(self, x):
        return self.embedding(x)

# Example input: a batch of discrete index sequences
batch_size = 2
sequence_length = 10
vocab_size = 20  # assume 20 distinct categories
embedding_dim = 8

indices = torch.randint(0, vocab_size, (batch_size, sequence_length))
embedding = SimpleEmbedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
embedded_indices = embedding(indices)
print(embedded_indices.shape)  # torch.Size([2, 10, 8])
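nn.Embedding can also be read as a matrix multiplication: looking up row i of the weight table gives the same result as multiplying a one-hot vector by that table. The sketch below checks this numerically; the sizes are arbitrary and the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 20, 8
emb = nn.Embedding(vocab_size, embedding_dim)
indices = torch.randint(0, vocab_size, (2, 10))

looked_up = emb(indices)                                      # (2, 10, 8)
one_hot = F.one_hot(indices, num_classes=vocab_size).float()  # (2, 10, 20)
via_matmul = one_hot @ emb.weight                             # (2, 10, 8)

same = torch.allclose(looked_up, via_matmul)
print(same)
```

The difference from TokenEmbedding is the input type: nn.Embedding needs integer indices, while the Conv1d version multiplies real-valued features directly.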

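Conv1d (1D convolution) is closely related to matrix multiplication: a 1D convolution can be written as a matrix multiplication.

One way to do this is to arrange the sliding windows of the input vector as the rows of a matrix and multiply that matrix by the kernel. Strictly speaking this window matrix has Hankel structure (constant along anti-diagonals); the Toeplitz "convolution matrix" is the equivalent form built from the kernel instead. Either view helps in understanding what convolution does and how it relates to linear algebra.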

1. The Conv1d operation

Suppose we have an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ and a convolution kernel (filter) $\mathbf{w} = [w_1, w_2, \ldots, w_k]$. The 1D convolution (stride 1, no padding) can be defined as:

$$y_i = \sum_{j=1}^{k} x_{i+j-1} \cdot w_j, \quad i = 1, \ldots, n-k+1$$

For each output position $i$, the kernel $\mathbf{w}$ takes a dot product with a window of $k$ consecutive elements of the input vector $\mathbf{x}$.

2. Matrix multiplication form

A 1D convolution can be carried out by converting the input vector into a particular matrix and then doing a matrix multiplication. Each row of this matrix is one sliding window of the input. The original text calls it a "Toeplitz matrix" or "convolution matrix"; since its entry in row $i$, column $j$ is $x_{i+j-1}$, it is constant along anti-diagonals, which is Hankel structure. For the input vector $\mathbf{x}$ and kernel $\mathbf{w}$, we build:

$$\mathbf{X} = \begin{bmatrix} x_1 & x_2 & x_3 & \ldots & x_k \\ x_2 & x_3 & x_4 & \ldots & x_{k+1} \\ x_3 & x_4 & x_5 & \ldots & x_{k+2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n-k+1} & x_{n-k+2} & x_{n-k+3} & \ldots & x_n \end{bmatrix}$$

Then treat the kernel $\mathbf{w}$ as a column vector:

$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_k \end{bmatrix}$$

The output of the 1D convolution can then be written as:

$$\mathbf{y} = \mathbf{X} \cdot \mathbf{w}$$

3. Example

Suppose the input vector is $\mathbf{x} = [1, 2, 3, 4, 5]$ and the kernel is $\mathbf{w} = [1, 0, -1]$. The window matrix is:

$$\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{bmatrix}$$

Then do the matrix multiplication:

$$\mathbf{y} = \mathbf{X} \cdot \mathbf{w} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 + 2 \cdot 0 + 3 \cdot (-1) \\ 2 \cdot 1 + 3 \cdot 0 + 4 \cdot (-1) \\ 3 \cdot 1 + 4 \cdot 0 + 5 \cdot (-1) \end{bmatrix} = \begin{bmatrix} -2 \\ -2 \\ -2 \end{bmatrix}$$

This is the output of the 1D convolution.
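An equivalent formulation puts the kernel, rather than the input, into the matrix. Each row holds the kernel shifted one position to the right, so the matrix is constant along its diagonals (Toeplitz structure), and it multiplies the input vector directly. The sketch below builds that matrix for the same example; the name `T` is chosen here for illustration.

```python
import torch

x = torch.tensor([1., 2., 3., 4., 5.])
w = torch.tensor([1., 0., -1.])
n, k = x.numel(), w.numel()

# Row i contains the kernel placed at columns i .. i+k-1.
T = torch.zeros(n - k + 1, n)
for i in range(n - k + 1):
    T[i, i:i + k] = w

y = T @ x
print(T)
print(y)  # tensor([-2., -2., -2.])
```

Both forms compute the same dot products; they differ only in which operand is expanded into a matrix.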

Code demonstrating that a 1D convolution (Conv1d) and a matrix multiplication give the same result

import torch
import torch.nn as nn

# Input sequence
x = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.float32)  # shape: [1, 5]
# Convolution kernel
w = torch.tensor([[1, 0, -1]], dtype=torch.float32).unsqueeze(0)  # shape: [1, 1, 3]

# Using nn.Conv1d
conv1d = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=0, bias=False)
with torch.no_grad():
    conv1d.weight.copy_(w)  # overwrite the random initial weights with the kernel

x_unsqueezed = x.unsqueeze(0)  # shape: [1, 1, 5]
output_conv1d = conv1d(x_unsqueezed).squeeze(0)  # shape: [1, 3]
print("Conv1d output:", output_conv1d)

# Using matrix multiplication (each row of X is one sliding window of x)
X = torch.tensor([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5]
], dtype=torch.float32)

W = torch.tensor([1, 0, -1], dtype=torch.float32).view(-1, 1)

output_matmul = X @ W
print("Matrix multiplication output:", output_matmul.squeeze())


# Conv1d output: tensor([[-2., -2., -2.]], grad_fn=<SqueezeBackward1>)
# Matrix multiplication output: tensor([-2., -2., -2.])
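The same idea carries over to the multi-channel, circularly padded convolution used by TokenEmbedding. The sketch below is one possible way to write it, assuming kernel size 3 and padding 1 as in the class above: it pads the input circularly, extracts the sliding windows with Tensor.unfold, flattens each window across channels, and multiplies by the reshaped kernel. The variable names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, L, C, H, K = 2, 10, 3, 8, 3  # batch, length, c_in, hidden_size, kernel
conv = nn.Conv1d(C, H, kernel_size=K, padding=1,
                 padding_mode="circular", bias=False)
x = torch.randn(B, L, C)

with torch.no_grad():
    # Reference: the TokenEmbedding forward pass.
    ref = conv(x.permute(0, 2, 1)).transpose(1, 2)         # (B, L, H)

    # Matrix form: pad, take windows, flatten (channel, offset) pairs.
    padded = F.pad(x.permute(0, 2, 1), (1, 1), mode="circular")  # (B, C, L+2)
    windows = padded.unfold(2, K, 1)                       # (B, C, L, K)
    X = windows.permute(0, 2, 1, 3).reshape(B, L, C * K)   # (B, L, C*K)
    W = conv.weight.reshape(H, C * K)                      # (H, C*K)
    out = X @ W.T                                          # (B, L, H)

max_diff = (ref - out).abs().max().item()
print(out.shape, max_diff)
```

Each time step is therefore embedded by a single linear map applied to the 3 × c_in values in its window, which is why the layer can stand in for a learned token embedding on continuous data.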

In the code, the following part initializes the weights of the convolution layer:

for m in self.modules():
    if isinstance(m, nn.Conv1d):
        nn.init.kaiming_normal_(
            m.weight, mode="fan_in", nonlinearity="leaky_relu"
        )

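This code uses Kaiming initialization (also called He initialization) to initialize the weights of the convolution layer. To understand why mode="fan_in" and nonlinearity="leaky_relu" are used, some background helps.

1. Kaiming initialization (He initialization)

Kaiming initialization is a weight initialization method designed to mitigate the vanishing or exploding gradients that can occur when training deep networks. It sets the scale of the initial weights from the shape of the weight tensor so that the output of each layer keeps a roughly constant variance.

2. mode="fan_in" and nonlinearity="leaky_relu"

mode="fan_in": the scale is computed from the number of inputs feeding each output unit; for nn.Conv1d this is in_channels × kernel_size. This mode keeps the variance of the signal stable in the forward pass, whereas mode="fan_out" does the same for gradients in the backward pass.
nonlinearity="leaky_relu": the activation function for which the scaling factor (gain) is computed. Different activation functions need different variance corrections. Leaky ReLU is a ReLU variant with a small non-zero slope for negative inputs, which helps avoid dead neurons.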

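Details

With Kaiming initialization the standard deviation of the weights depends on the activation function. PyTorch draws the weights from a normal distribution with standard deviation gain / sqrt(fan_in); for Leaky ReLU with negative slope $a$ the gain is $\sqrt{2 / (1 + a^2)}$. kaiming_normal_ uses $a = 0$ unless another value is passed, so the call above gives:

$$\text{std} = \sqrt{\frac{2}{\text{fan\_in}}}$$

where fan_in is the number of inputs per output unit (here c_in × 3).

The factor 2 is derived rather than tuned empirically: ReLU zeroes the negative half of its inputs, which halves the second moment of the signal, and the factor 2 compensates for that loss.

Both mode="fan_in" and nonlinearity="leaky_relu" are the default arguments of kaiming_normal_, so the call spells out the defaults, and with $a = 0$ the scaling is the same as for plain ReLU. The forward method of TokenEmbedding applies no activation after the convolution, so here the setting mainly fixes the scale of the initial weights, with the aim of keeping training stable.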
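The formula can be checked numerically. The sketch below initializes a Conv1d layer the same way and compares the empirical standard deviation of its weights with gain / sqrt(fan_in); a large hidden_size is assumed only so that the estimate is stable, and the sizes are otherwise arbitrary.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
c_in, hidden_size, kernel_size = 10, 512, 3
conv = nn.Conv1d(c_in, hidden_size, kernel_size, bias=False)
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="leaky_relu")

fan_in = c_in * kernel_size                      # inputs per output unit
gain = nn.init.calculate_gain("leaky_relu", 0)   # slope 0, as kaiming_normal_ uses by default
expected_std = gain / math.sqrt(fan_in)          # sqrt(2 / fan_in)
empirical_std = conv.weight.std().item()
print(f"gain={gain:.4f} expected={expected_std:.4f} empirical={empirical_std:.4f}")
```

The two values should agree to within sampling noise, and the gain equals sqrt(2), the same value used for ReLU.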

To summarize, the initialization loop shown earlier iterates over all submodules returned by self.modules() and applies Kaiming initialization to the weight of every nn.Conv1d layer; in TokenEmbedding that is only self.tokenConv. The mode="fan_in" and nonlinearity="leaky_relu" arguments select the scaling rule derived for ReLU-family activations.

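Leaky ReLU

ReLU maps all negative values to zero and leaves positive values unchanged.

Leaky ReLU has a small slope in the negative region (0.1 in the example below; PyTorch's default is 0.01), which helps avoid dead neurons.

PReLU is the parameterized version of Leaky ReLU: the slope of its negative region is learned.

ELU saturates smoothly toward a fixed negative value ($-\alpha$) for negative inputs and behaves like ReLU for positive inputs.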

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F

# Define the x-axis data
x = np.linspace(-10, 10, 400)
x_tensor = torch.tensor(x, dtype=torch.float32)

# Evaluate the different activation functions
relu = F.relu(x_tensor).numpy()
leaky_relu = F.leaky_relu(x_tensor, negative_slope=0.1).numpy()
prelu = torch.nn.PReLU(num_parameters=1, init=0.1)
prelu_output = prelu(x_tensor).detach().numpy()
elu = F.elu(x_tensor, alpha=1.0).numpy()

# Plot
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.plot(x, relu, label='ReLU', color='blue')
plt.title('ReLU')
plt.grid(True)

plt.subplot(2, 2, 2)
plt.plot(x, leaky_relu, label='Leaky ReLU (0.1)', color='red')
plt.title('Leaky ReLU')
plt.grid(True)

plt.subplot(2, 2, 3)
plt.plot(x, prelu_output, label='PReLU (0.1)', color='green')
plt.title('PReLU')
plt.grid(True)

plt.subplot(2, 2, 4)
plt.plot(x, elu, label='ELU', color='purple')
plt.title('ELU')
plt.grid(True)

plt.tight_layout()
plt.show()
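For readers who prefer numbers to plots, the same four functions can be evaluated at a few points. This is a small sketch using the same slope (0.1) and alpha (1.0) as the plotting code; the sample points are chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

relu_vals = F.relu(x)                              # negatives become 0
leaky_vals = F.leaky_relu(x, negative_slope=0.1)   # negatives scaled by 0.1
prelu_vals = torch.nn.PReLU(num_parameters=1, init=0.1)(x).detach()  # untrained, slope 0.1
elu_vals = F.elu(x, alpha=1.0)                     # exp(x) - 1 for x < 0

for name, vals in [("ReLU", relu_vals), ("Leaky ReLU", leaky_vals),
                   ("PReLU", prelu_vals), ("ELU", elu_vals)]:
    print(f"{name:>10}: {vals.tolist()}")
```

Before any training, PReLU with init=0.1 gives the same values as Leaky ReLU with slope 0.1; the two only diverge once the PReLU slope is updated by the optimizer.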