SELU Function - Derivatives and Gradients (导数和梯度)

SELU Function - Derivatives and Gradients {导数和梯度}

  • [1. SELU (Scaled Exponential Linear Unit) Function](#1. SELU (Scaled Exponential Linear Unit) Function)
    • [1.1. Parameters](#1.1. Parameters)
    • [1.2. Shape](#1.2. Shape)
  • [2. SELU Function - Derivatives and Gradients (导数和梯度)](#2. SELU Function - Derivatives and Gradients (导数和梯度))
    • [2.1. PyTorch `torch.nn.SELU(inplace=False)`](#2.1. PyTorch torch.nn.SELU(inplace=False))
    • [2.2. PyTorch `torch.nn.SELU(inplace=False)`](#2.2. PyTorch torch.nn.SELU(inplace=False))
    • [2.3. Python SELU Function](#2.3. Python SELU Function)
    • [2.4. Python SELU Function](#2.4. Python SELU Function)
  • References

1. SELU (Scaled Exponential Linear Unit) Function

class torch.nn.SELU(inplace=False)
https://docs.pytorch.org/docs/stable/generated/torch.nn.SELU.html

torch.nn.functional.selu(input, inplace=False) -> Tensor
https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.selu.html

https://github.com/pytorch/pytorch/blob/v2.9.1/torch/nn/modules/activation.py

class torch.nn.SELU(inplace=False)

Applies the SELU function element-wise.

Method described in the paper: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).

The definition of the ELU function:

SELU ( x ) = scale ∗ ( max ⁡ ( 0 , x ) + min ⁡ ( 0 , α ∗ ( exp ⁡ ( x ) − 1 ) ) ) = { scale ∗ x , x > 0 scale ∗ α ∗ ( exp ⁡ ( x ) − 1 ) , x ≤ 0 = { scale ∗ x , x > 0 scale ∗ α ∗ ( e x − 1 ) , x ≤ 0 \begin{aligned} \text{SELU}(x) &= \text{scale} * (\max(0,x) + \min(0, \alpha * (\exp(x) - 1))) \\ &= \begin{cases} \text{scale} * x, & x > 0\\ \text{scale} * \alpha * (\exp(x) - 1), & x \leq 0 \end{cases} \\ &= \begin{cases} \text{scale} * x, & x > 0\\ \text{scale} * \alpha * (e^{x} - 1), & x \leq 0 \end{cases} \end{aligned} SELU(x)=scale∗(max(0,x)+min(0,α∗(exp(x)−1)))={scale∗x,scale∗α∗(exp(x)−1),x>0x≤0={scale∗x,scale∗α∗(ex−1),x>0x≤0

with α = 1.6732632423543772848170429916717 \alpha = 1.6732632423543772848170429916717 α=1.6732632423543772848170429916717 and scale = 1.0507009873554804934193349852946 \text{scale} = 1.0507009873554804934193349852946 scale=1.0507009873554804934193349852946.

When using kaiming_normal or kaiming_normal_ for initialisation, nonlinearity='linear' should be used instead of nonlinearity='selu' in order to get Self-Normalizing Neural Networks.

See torch.nn.init.calculate_gain for more information.

More details can be found in the paper Self-Normalizing Neural Networks.

Basically, the SELU activation function multiplies scale ( > 1 > 1 >1) with the output of the elu function to ensure a slope larger than one for positive inputs.

The derivative of the ELU function:

d y d x = f ′ ( x ) = d ( { scale ∗ x , x > 0 scale ∗ α ∗ ( exp ⁡ ( x ) − 1 ) , x ≤ 0 ) d x = { scale , x > 0 scale ∗ α ∗ exp ⁡ ( x ) , x ≤ 0 = { scale , x > 0 scale ∗ α ∗ exp ⁡ ( x ) − scale ∗ α + scale ∗ α , x ≤ 0 = { scale , x > 0 scale ∗ α ∗ ( exp ⁡ ( x ) − 1 ) + scale ∗ α , x ≤ 0 = { scale , x > 0 scale ∗ α ∗ ( e x − 1 ) + scale ∗ α , x ≤ 0 \begin{aligned} \frac{dy}{dx} &= f'(x) \\ &= \frac{d \left( {\begin{cases} \text{scale} * x, & x > 0\\ \text{scale} * \alpha * (\exp(x) - 1), & x \leq 0 \end{cases}} \right) }{dx} \\ &= \begin{cases} \text{scale}, & x > 0 \\ \text{scale} * \alpha * \exp(x), & x \le 0 \\ \end{cases} \\ &= \begin{cases} \text{scale}, & x > 0 \\ \text{scale} * \alpha * \exp(x) - \text{scale} * \alpha + \text{scale} * \alpha, & x \le 0 \\ \end{cases} \\ &= \begin{cases} \text{scale}, & x > 0 \\ \text{scale} * \alpha * (\exp(x) - 1) + \text{scale} * \alpha, & x \le 0 \\ \end{cases} \\ &= \begin{cases} \text{scale}, & x > 0 \\ \text{scale} * \alpha * (e^{x} - 1) + \text{scale} * \alpha, & x \le 0 \\ \end{cases} \\ \end{aligned} dxdy=f′(x)=dxd({scale∗x,scale∗α∗(exp(x)−1),x>0x≤0)={scale,scale∗α∗exp(x),x>0x≤0={scale,scale∗α∗exp(x)−scale∗α+scale∗α,x>0x≤0={scale,scale∗α∗(exp(x)−1)+scale∗α,x>0x≤0={scale,scale∗α∗(ex−1)+scale∗α,x>0x≤0

1.1. Parameters

  • inplace (bool, optional): can optionally do the operation in-place. Default: False

1.2. Shape

  • Input : (*), where * means any number of dimensions.

  • Output : (*), same shape as the input.

We can see that when scale = 1, SELU is simply ELU.

复制代码
# !/usr/bin/env python
# coding=utf-8

import torch
from matplotlib import pyplot as plt


def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None, ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """
    https://github.com/d2l-ai/d2l-en/blob/master/d2l/torch.py
    """

    def has_one_axis(X):  # True if X (tensor or list) has 1 axis
        return ((hasattr(X, "ndim") and (X.ndim == 1)) or (isinstance(X, list) and (not hasattr(X[0], "__len__"))))

    if has_one_axis(X): X = [X]

    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]

    if len(X) != len(Y):
        X = X * len(Y)

    # Set the default width and height of figures globally, in inches.
    plt.rcParams['figure.figsize'] = figsize

    if axes is None:
        axes = plt.gca()  # Get the current Axes

    # Clear the Axes
    axes.cla()

    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)

    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)  # Set the label for the x/y-axis
    axes.set_xscale(xscale), axes.set_yscale(yscale)  # Set the x/y-axis scale
    axes.set_xlim(xlim), axes.set_ylim(ylim)  # Set the x/y-axis view limits

    if legend:
        axes.legend(legend)  # Place a legend on the Axes

    # Configure the grid lines
    axes.grid()

    plt.show()
    plt.savefig("yongqiang.png", transparent=True)  # Save the current figure


x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.nn.functional.selu(x)
plot(x.detach(), y.detach(), 'x', 'SELU(x)', figsize=(5, 2.5))

# Clear out previous gradients
# x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
plot(x.detach(), x.grad, 'x', 'gradient of SELU', figsize=(5, 2.5))

The SELU function:

The derivative of the SELU function:

2. SELU Function - Derivatives and Gradients (导数和梯度)

Notes

  • Element-wise Multiplication (Hadamard Product) (* operator or numpy.multiply()): Multiplies corresponding elements of two arrays that must have the same shape (or be broadcastable to a common shape).
  • Matrix Multiplication (Dot Product) (@ operator or numpy.matmul() or numpy.dot()): Performs the standard linear algebra operation that requires specific dimension compatibility rules. (e.g., the number of columns in the first array must match the number of rows in the second).

2.1. PyTorch torch.nn.SELU(inplace=False)

复制代码
# !/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

torch.set_printoptions(precision=6)

input = torch.tensor([[-1.5, 0.0, 1.5], [0.5, -2.0, 3.0]], dtype=torch.float, requires_grad=True)

print(f"input.requires_grad: {input.requires_grad}, input.shape: {input.shape}")

selu = nn.SELU()
forward_output = selu(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

forward_output.backward(torch.ones_like(input), retain_graph=True)

print(f"\nbackward_output.shape: {input.grad.shape}")
print(f"Backward Pass Output:\n{input.grad}")

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/selu.py 
input.requires_grad: True, input.shape: torch.Size([2, 3])

forward_output.shape: torch.Size([2, 3])
Forward Pass Output:
tensor([[-1.365814,  0.000000,  1.576051],
        [ 0.525351, -1.520167,  3.152103]], grad_fn=<EluBackward0>)

backward_output.shape: torch.Size([2, 3])
Backward Pass Output:
tensor([[0.392285, 1.758099, 1.050701],
        [1.050701, 0.237933, 1.050701]])

Process finished with exit code 0

2.2. PyTorch torch.nn.SELU(inplace=False)

复制代码
# !/usr/bin/env python
# coding=utf-8

import torch
import torch.nn as nn

torch.set_printoptions(precision=6)

input = torch.tensor([-1.5, 0.0, 1.5, 0.5, -2.0, 3.0], dtype=torch.float, requires_grad=True)

print(f"input.requires_grad: {input.requires_grad}, input.shape: {input.shape}")

selu = nn.SELU()
forward_output = selu(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

forward_output.backward(torch.ones_like(input), retain_graph=True)

print(f"\nbackward_output.shape: {input.grad.shape}")
print(f"Backward Pass Output:\n{input.grad}")

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/selu.py 
input.requires_grad: True, input.shape: torch.Size([6])

forward_output.shape: torch.Size([6])
Forward Pass Output:
tensor([-1.365814,  0.000000,  1.576051,  0.525351, -1.520167,  3.152103],
       grad_fn=<EluBackward0>)

backward_output.shape: torch.Size([6])
Backward Pass Output:
tensor([0.392285, 1.758099, 1.050701, 1.050701, 0.237933, 1.050701])

Process finished with exit code 0

2.3. Python SELU Function

复制代码
# !/usr/bin/env python
# coding=utf-8

import numpy as np


# numpy.multiply:
# Multiply arguments element-wise
# Equivalent to x1 * x2 in terms of array broadcasting

class SELULayer:
    """
    A class to represent an SELU activation layer for a neural network.
    """

    def __init__(self):
        self.alpha = 1.6732632423543772848170429916717
        self.scale = 1.0507009873554804934193349852946
        # Cache the input for the backward pass
        self.input = None

    def forward(self, input):
        """
        Calculates the forward pass:
        f(x) = scale * x if x > 0 else scale * alpha * (exp(x) - 1)
        """

        self.input = input
        output = self.scale * np.where(input > 0, input, self.alpha * (np.exp(input) - 1))
        return output

    def backward(self, upstream_gradient):
        """
        f'(x) = scale if x > 0 else scale * alpha * exp(x)
        The total gradient is the element-wise product of the upstream
        gradient and the derivative of the SELU.
        """

        selu_derivative = self.scale * np.where(self.input > 0, 1.0, self.alpha * np.exp(self.input))
        print(f"\nselu_derivative.shape: {selu_derivative.shape}")
        print(f"SELU Derivative:\n{selu_derivative}")

        # Computes the gradient of the loss with respect to the input (dL/dx)
        # Apply the chain rule: multiply the derivative by the upstream gradient
        # dL/dx = dL/dy * dy/dx = upstream_gradient * f'(x)
        downstream_gradient = upstream_gradient * selu_derivative
        return downstream_gradient


selu_layer = SELULayer()

input = np.array([-1.5, 0.0, 1.5, 0.5, -2.0, 3.0], dtype=np.float32)

# Forward pass
forward_output = selu_layer.forward(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

# Backward pass
upstream_gradient = np.ones(forward_output.shape) * 0.1
backward_output = selu_layer.backward(upstream_gradient)
print(f"\nbackward_output.shape: {backward_output.shape}")
print(f"Backward Pass Output:\n{backward_output}")

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/selu.py 

forward_output.shape: (6,)
Forward Pass Output:
[-1.3658143  0.         1.5760515  0.5253505 -1.5201665  3.152103 ]

selu_derivative.shape: (6,)
SELU Derivative:
[0.392285   1.7580993  1.050701   1.050701   0.23793288 1.050701  ]

backward_output.shape: (6,)
Backward Pass Output:
[0.0392285  0.17580993 0.1050701  0.1050701  0.02379329 0.1050701 ]

Process finished with exit code 0

2.4. Python SELU Function

复制代码
# !/usr/bin/env python
# coding=utf-8

import numpy as np


# numpy.multiply:
# Multiply arguments element-wise
# Equivalent to x1 * x2 in terms of array broadcasting

class SELULayer:
    """
    A class to represent an SELU activation layer for a neural network.
    """

    def __init__(self):
        self.alpha = 1.6732632423543772848170429916717
        self.scale = 1.0507009873554804934193349852946
        # Cache the input for the backward pass
        self.input = None

    def forward(self, input):
        """
        Calculates the forward pass:
        f(x) = scale * x if x > 0 else scale * alpha * (exp(x) - 1)
        """

        self.input = input
        output = self.scale * np.where(input > 0, input, self.alpha * (np.exp(input) - 1))
        return output

    def backward(self, upstream_gradient):
        """
        f'(x) = scale if x > 0 else scale * alpha * exp(x)
        The total gradient is the element-wise product of the upstream
        gradient and the derivative of the SELU.
        """

        selu_derivative = self.scale * np.where(self.input > 0, 1.0, self.alpha * np.exp(self.input))
        print(f"\nselu_derivative.shape: {selu_derivative.shape}")
        print(f"SELU Derivative:\n{selu_derivative}")

        # Computes the gradient of the loss with respect to the input (dL/dx)
        # Apply the chain rule: multiply the derivative by the upstream gradient
        # dL/dx = dL/dy * dy/dx = upstream_gradient * f'(x)
        downstream_gradient = upstream_gradient * selu_derivative
        return downstream_gradient


selu_layer = SELULayer()

input = np.array([[-1.5, 0.0, 1.5], [0.5, -2.0, 3.0]], dtype=np.float32)

# Forward pass
forward_output = selu_layer.forward(input)
print(f"\nforward_output.shape: {forward_output.shape}")
print(f"Forward Pass Output:\n{forward_output}")

# Backward pass
upstream_gradient = np.ones(forward_output.shape) * 0.1
backward_output = selu_layer.backward(upstream_gradient)
print(f"\nbackward_output.shape: {backward_output.shape}")
print(f"Backward Pass Output:\n{backward_output}")

/home/yongqiang/miniconda3/bin/python /home/yongqiang/quantitative_analysis/selu.py 

forward_output.shape: (2, 3)
Forward Pass Output:
[[-1.3658143  0.         1.5760515]
 [ 0.5253505 -1.5201665  3.152103 ]]

selu_derivative.shape: (2, 3)
SELU Derivative:
[[0.392285   1.7580993  1.050701  ]
 [1.050701   0.23793288 1.050701  ]]

backward_output.shape: (2, 3)
Backward Pass Output:
[[0.0392285  0.17580993 0.1050701 ]
 [0.1050701  0.02379329 0.1050701 ]]

Process finished with exit code 0

References

1\] Yongqiang Cheng (程永强), \[2\] 动手学深度学习, \[3\] Deep Learning Tutorials, \[4\] Gradient boosting performs gradient descent, \[5\] Matrix calculus, \[6\] Artificial Inteligence,

相关推荐
Yongqiang Cheng5 天前
Softmax Function - Derivatives and Gradients (导数和梯度)
梯度·softmax·导数·gradients·derivatives
charlie1145141919 天前
从0开始的机器学习(笔记系列)——导数 · 多元函数导数 · 梯度
人工智能·笔记·学习·数学·机器学习·导数
Yongqiang Cheng9 天前
Softsign Function - Derivatives and Gradients (导数和梯度)
梯度·导数·gradients·derivatives·softsign
Yongqiang Cheng13 天前
Abs Function - Derivatives and Gradients (导数和梯度)
梯度·导数·abs·gradients·derivatives
Yongqiang Cheng15 天前
Softplus Function - Derivatives and Gradients (导数和梯度)
梯度·导数·gradients·derivatives·softplus
課代表17 天前
从初等数学到高等数学
算法·微积分·函数·极限·导数·积分·方程
点云SLAM17 天前
Derivative 英文单词学习
导数·英文单词学习·雅思备考·derivative·派生物 / 衍生量·非原创的 / 模仿性的
Yongqiang Cheng1 个月前
ELU Function - Derivatives and Gradients (导数和梯度)
梯度·导数·gradients·derivatives·elu
Yongqiang Cheng1 个月前
Tanh Function - Derivatives and Gradients (导数和梯度)
梯度·导数·tanh·gradients·derivatives