8. Learning Machine Learning Together -- Stochastic Gradient Descent

Stochastic Gradient Descent

The purpose of this notebook is to practice implementing the stochastic gradient descent (SGD) optimisation algorithm from scratch.

python
import numpy as np
import matplotlib.pyplot as plt

# Imports used for testing.
import numpy.testing as npt

We consider a linear regression problem of the form
$$y = \beta_0 + x \beta_1 + \epsilon\,,\quad \epsilon \sim \mathcal N(0, \sigma^2)$$

where $x\in\mathbb{R}$ are inputs and $y\in\mathbb{R}$ are noisy observations. The bias $\beta_0\in\mathbb{R}$ and coefficient $\beta_1\in\mathbb{R}$ parametrize the function.

In this tutorial, we assume that we are able to sample data inputs and outputs $(x_n, y_n)$, $n=1,\ldots,K$, and we are interested in finding parameters $\beta_0$ and $\beta_1$ that map the inputs well to the outputs.

From our lectures, we know that the parameters $\beta_0$ and $\beta_1$ can be calculated analytically. However, here we are interested in computing a numerical solution using the stochastic gradient descent (SGD) algorithm.
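As a brief aside, the closed-form least-squares solution can be obtained from sampled data via the normal equations. The sketch below is only a sanity check and not part of the exercise; it draws its own standalone sample using the same settings as the synthetic-data generator defined next (true_beta0 = 3.7, true_beta1 = -1.8, sigma = 0.5).

python
# Aside (not part of the exercise): closed-form least-squares fit via the
# normal equations, on a standalone sample drawn with the same settings as
# the synthetic data defined below.
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=(1000, 1))
y_demo = 3.7 + x_demo * (-1.8) + rng.normal(scale=0.5, size=(1000, 1))

# Design matrix [1, x]; solve min_beta ||X beta - y||^2
X_demo = np.hstack([np.ones_like(x_demo), x_demo])
beta_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta_hat.ravel())  # close to [3.7, -1.8]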

We will start by setting up a generator of synthetic data inputs and outputs, see: https://realpython.com/introduction-to-python-generators/.

python
# define parameters for synthetic data
true_beta0 = 3.7
true_beta1 = -1.8
sigma = 0.5
xlim = [-3, 3]

def data_generator(batch_size, seed=0):
  """Generator function for synthetic data.

  Parameters:
    batch_size (int): Batch size for generated data
    seed (int): Seed for the random number generator

  Yields:
    x (np.array): Synthetic feature data
    y (np.array): Synthetic target data following y = true_beta0 + x * true_beta1 + epsilon
  """
  # fix seed for random numbers
  np.random.seed(seed)

  while True:
      # generate random input data
      x = np.random.uniform(*xlim, (batch_size, 1))
      # generate noise
      noise = np.random.randn(batch_size, 1) * sigma
      # compute noisy targets
      y = true_beta0 + x * true_beta1 + noise
      yield x, y
python
# Create generator for batch size 16
train_data = data_generator(batch_size=16)
print(train_data)
<generator object data_generator at 0x7f5599713530>

We can visualise the first batch of synthetic data along with the true underlying function.

python
# Pull a batch of training data
x_sample, y_sample = next(train_data)

# Plot training data along true underlying function
plt.figure(figsize=(8, 5))
plt.scatter(x_sample, y_sample, label='Data samples')
plt.plot(x_sample, true_beta0 + x_sample * true_beta1, color='C1', label='True fn')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.legend()
plt.show()


The loss that we wish to minimise is the expected mean squared error (MSE) loss computed on the training data:

$$\mathcal{L}(\beta_0, \beta_1) := \mathbb{E}_{(x, y)\sim p_{data}} \left[(y - \beta_0 - x\beta_1)^2\right]$$

We first compute the mean squared error loss on a single batch of input and output data.

python
## EDIT THIS FUNCTION
def mse_loss(x, y, beta0, beta1):
  """Computed expected MSE loss for a single batch.

  Parameters:
  x (np.array): K x 1 array of inputs
  y (np.array): K x 1 array of outputs
  beta0 (float): Bias parameter
  beta (float): Coefficient

  Returns:
  MSE (float): computed on this batch of inputs and outputs; K x 1 array
  """
  # compute expected MSE loss
  loss = np.mean(((y - beta0 - x * beta1)**2)) ## <-- SOLUTION
  return loss

To check your implementation you can run this test:

python
# This line verifies the correctness of the mse_loss implementation
npt.assert_allclose(mse_loss(x_sample,y_sample,1.0,-0.5), 9.311906)

Before we can minimise the MSE loss we need to initialise the parameters $\beta_0$ and $\beta_1$.

python
# Initialise the parameters
beta0 = 1.0
beta1 = -0.5

# Plot the initialised regression function
plt.figure(figsize=(8, 5))
plt.scatter(x_sample, y_sample, label='Data samples')
plt.plot(x_sample, true_beta0 + x_sample * true_beta1, color='C1', label='True function')
plt.plot(x_sample, beta0 + x_sample * beta1, color='C2', label='Initialised function')
plt.title(r'Initial parameters: $\beta_0=$ {:.2f}, $\beta_1=$ {:.2f}. MSE loss: {:.3f}'.format(
    np.squeeze(beta0), np.squeeze(beta1), mse_loss(x_sample, y_sample, beta0, beta1))
         )
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.legend()
plt.show()


Stochastic gradient descent samples a batch of $K$ input and output samples, and makes a parameter update by computing the gradient of the loss function

$$\nabla_{(\beta_0, \beta_1)}\mathcal{L}(\beta_0^{(i)}, \beta_1^{(i)} \mid \mathcal{X}^{(i)}, \mathcal{Y}^{(i)}),$$

where $\beta_0^{(i)}, \beta_1^{(i)}$ are the values of the parameters at the $i$-th iteration of the algorithm, and $\mathcal{X}^{(i)}, \mathcal{Y}^{(i)}$ are the $i$-th batch of inputs and outputs.
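For reference, differentiating the batch MSE loss with respect to each parameter gives the partial derivatives

$$\frac{\partial \mathcal{L}}{\partial \beta_0} = -\frac{2}{K}\sum_{n=1}^{K}\left(y_n - \beta_0 - x_n\beta_1\right)\,,\qquad \frac{\partial \mathcal{L}}{\partial \beta_1} = -\frac{2}{K}\sum_{n=1}^{K}x_n\left(y_n - \beta_0 - x_n\beta_1\right)\,,$$

which are exactly the batch averages computed in the implementation below.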

The following function should compute the gradient of the MSE loss for a given batch of data, and current parameter values.

python
## EDIT THIS FUNCTION
def mse_grad(x, y, beta0, beta1):
  """Compute gradient of MSE loss w.r.t. beta0 and beta1 averaged over batch.

  Parameters:
  x (np.array): K x 1 array of inputs
  y (np.array): K x 1 array of outputs
  beta0 (float): Bias parameter
  beta1 (float): Coefficient

  Returns:
  delta_beta0 (float): Partial derivative w.r.t. beta_0 averaged over batch
  delta_beta1 (float): Partial derivative w.r.t. beta_1 averaged over batch
  """

  # compute partial derivative w.r.t. beta_0
  delta_beta0 = - 2 * np.mean(y - beta0 - x * beta1) ## <-- SOLUTION

  # compute partial derivative w.r.t. beta_1
  delta_beta1 = - 2 * np.mean((y - beta0 - x * beta1) * x) ## <-- SOLUTION

  return delta_beta0, delta_beta1

To check your implementation you can run this cell:

python
# These lines verify that the derivatives delta_beta0 and delta_beta1 are computed correctly
delta_beta0, delta_beta1 = mse_grad(x_sample, y_sample, 1.0, -0.5)
npt.assert_allclose(delta_beta0, -4.51219)
npt.assert_allclose(delta_beta1, 4.047721)

We have now established all the ingredients needed to implement the SGD algorithm for our regression problem.

Recall that SGD makes the following parameter update at each iteration:

$$(\beta_0^{(i+1)}, \beta_1^{(i+1)}) = (\beta_0^{(i)}, \beta_1^{(i)}) - \eta \nabla_{(\beta_0, \beta_1)}\mathcal{L}(\beta_0^{(i)}, \beta_1^{(i)} \mid \mathcal{X}^{(i)}, \mathcal{Y}^{(i)}),$$

where $\eta>0$ is the learning rate.

Below, implement training of the parameters $\beta_0$ and $\beta_1$ using SGD over 2000 iterations with a learning rate $\eta=0.001$.

python
## EDIT THIS CELL

# parameters for SGD
iterations = 2000
losses = []
learning_rate = 0.001

## SOLUTION
for iteration in range(iterations):
  # get a new batch of training data at every iteration
  x_batch, y_batch = next(train_data)
  # compute MSE loss
  losses.append(mse_loss(x_batch, y_batch, beta0, beta1))
  # compute gradient
  delta_beta0, delta_beta1 = mse_grad(x_batch, y_batch, beta0, beta1)
  # update parameters
  beta0 -= learning_rate * delta_beta0
  beta1 -= learning_rate * delta_beta1

# report results
print('Learned parameters:')
print('beta0 =', np.around(beta0,2), "\nbeta1 =", np.around(beta1,2))
print('\nTrue parameters:')
print('beta0 =', true_beta0, "\nbeta1 =", true_beta1)
Learned parameters:
beta0 = 3.65 
beta1 = -1.8

True parameters:
beta0 = 3.7 
beta1 = -1.8

Finally, we plot the fitted curve and visualise the training loss over the iterations.

python
# Plot the learned regression function and loss values
fig = plt.figure(figsize=(16, 5))
fig.add_subplot(121)
plt.scatter(x_sample, y_sample, label='Data samples')
plt.plot(x_sample, true_beta0 + x_sample * true_beta1, color='C1', label='True fn')
plt.plot(x_sample, beta0 + x_sample * beta1, color='C2', label='Fitted fn')
plt.title(r'Trained parameters: $\beta_0=$ {:.2f}, $\beta_1=$ {:.2f}. MSE loss: {:.4f}'.format(
    np.squeeze(beta0), np.squeeze(beta1), mse_loss(x_sample, y_sample, beta0, beta1))
         )
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.legend()

fig.add_subplot(122)
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("MSE loss")
plt.title("Loss vs iteration")
plt.show()


Questions
  1. Does the solution above look reasonable?
  2. Play around with different values of the learning rate. How is the convergence of the algorithm affected?
  3. Try using different batch sizes and re-run the algorithm. What changes?
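For questions 2 and 3, one way to experiment is to wrap the training loop in a small helper and compare loss curves across settings. The run_sgd helper below is only a sketch and is not part of the original notebook; it reuses data_generator, mse_loss, and mse_grad defined above.

python
# Sketch only (run_sgd is not part of the original notebook): re-run SGD
# with a chosen batch size and learning rate, starting from the same
# initialisation as above, and return the loss curve for comparison.
def run_sgd(batch_size, learning_rate, iterations=2000, seed=0):
    data = data_generator(batch_size, seed=seed)
    b0, b1 = 1.0, -0.5
    losses = []
    for _ in range(iterations):
        x_batch, y_batch = next(data)
        losses.append(mse_loss(x_batch, y_batch, b0, b1))
        d0, d1 = mse_grad(x_batch, y_batch, b0, b1)
        b0 -= learning_rate * d0
        b1 -= learning_rate * d1
    return b0, b1, losses

# Compare a few learning rates at a fixed batch size
plt.figure(figsize=(8, 5))
for lr in [0.0001, 0.001, 0.01, 0.1]:
    _, _, lr_losses = run_sgd(batch_size=16, learning_rate=lr)
    plt.plot(lr_losses, label=f'lr={lr}')
plt.xlabel('Iteration')
plt.ylabel('MSE loss')
plt.legend()
plt.show()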