1. Understanding the Problem
First, what is data-driven programming? For the classic MNIST digit-recognition task, how does the traditional programming mindset differ from the data-driven one?
- Traditional programming mindset: usually starts from a clear problem definition and a concrete algorithm. For the MNIST recognition task, one might first consider specific image-processing algorithms such as edge detection or feature extraction, and then design a sequence of logical steps to process and classify the images.
- Data-driven programming mindset: focuses more on understanding the data itself. One first analyzes the characteristics of the MNIST dataset, including the pixel distribution of the images and the features of the digits. By observing and exploring the data, one discovers the patterns and regularities in it, and only then decides how to process and classify.
Yet even for a dataset as simple as MNIST, solving the task with the traditional mindset, by hand-designing a sequence of logical steps, is extremely difficult. So instead people try to have the computer automatically recognize the patterns in the data rather than hand-coding the decision logic, and how to make a computer learn becomes the essential question. Take MNIST as an example: I have a pile of image data together with the corresponding class labels,
$$f(\text{picture}_i)=\text{label}_i$$
The key task is to design a learning algorithm so that, as the data keeps growing, the computer can automatically move toward finding this so-called classification function $f$.

In statistics we call $f$ a model; the models we posit usually have parameters; and since there are parameters, one thing we inevitably have to do is estimate them, i.e., parameter estimation. To judge whether an estimate is good or bad we need a criterion, namely a loss function. Since we want the estimate to be as good as possible, we need to make the loss function as small as possible, which means doing optimization. That is the whole machine learning pipeline, and it is why machine learning is called data-driven programming: its underlying logic is statistics, and statistics is precisely the discipline of discovering regularities from data.
2. Softmax Regression
Softmax regression is one of the simplest machine learning algorithms because it rests on a linear hypothesis. The essence of the linear hypothesis is that the label is a weighted sum of the input features. Since this is a multi-class task, however, we do not map the input directly to a label; we first map it to a score (and, via softmax, a probability) for each class, and then wrap an argmax around the result:
$$f(x;\theta)=\underset{i=0,1,...,9}{\arg\max}\; h_{\theta}(x)_i=\underset{i}{\arg\max}\;(\theta^T x)_i$$
where $\theta \in R^{n\times k}$ and $x\in R^n$; here $x$ is the image flattened into a one-dimensional vector.
Introducing matrices lets us batch the computation, which greatly improves efficiency:
$$X\in R^{m\times n}=\begin{bmatrix}x_{(1)}^{T}\\ \vdots \\ x_{(m)}^{T}\end{bmatrix},\qquad h_{\theta}(X)=\begin{bmatrix} h_{\theta}(x_{(1)})^T\\ \vdots \\ h_{\theta}(x_{(m)})^T \end{bmatrix}=\begin{bmatrix} x_{(1)}^T\theta\\ \vdots \\ x_{(m)}^T\theta \end{bmatrix}=X\theta$$
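To make the shapes concrete, here is a minimal numpy sketch of the batched hypothesis $h_\theta(X)=X\theta$ and the argmax prediction (the array names and sizes are my own, purely for illustration):

```python
import numpy as np

# Minimal sketch: batched linear hypothesis and argmax prediction.
# Shapes follow the text: X is (m, n), theta is (n, k).
m, n, k = 5, 784, 10                       # 5 flattened 28x28 images, 10 classes
rng = np.random.default_rng(0)
X = rng.random((m, n), dtype=np.float32)   # stand-in for flattened MNIST images
theta = np.zeros((n, k), dtype=np.float32)

H = X @ theta               # h_theta(X) = X theta, one row of class scores per image
pred = H.argmax(axis=1)     # f(x; theta): pick the class with the largest score
print(H.shape, pred.shape)  # (5, 10) (5,)
```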
The loss function:
$$L_{err}(h(x),y)=\begin{cases}0&\text{if } \arg\max_i h_i(x)=y\\1&\text{otherwise}\end{cases}$$
where $y\in\{0,1,...,9\}$ is the label. This loss function is not suitable for picking the optimal parameters, because it changes discretely as $\theta$ varies: it has no derivative, so the only way to find the best $\theta$ would be to sweep the entire parameter space, which is an NP-hard problem in discrete optimization.
What makes NP-hard problems hard is that, standing at a particular $\theta$, we have no information at all about the surrounding function landscape. Why is ordinary function optimization comparatively easy? Because we have the gradient: it tells us, from the current point, which direction makes the function change fastest. Even though it carries the risk of getting trapped in a local optimum, at least we have a card to play.

So we need to choose a loss function whose gradient we can actually compute.
$$L_{err}(h(x),y)=-\log p(\text{label}=y)=-h_y(x)+\log \sum_{j=1}^k\exp (h_j(x))$$
This is the so-called cross-entropy loss, where
$$p(\text{label}=i)=\frac{\exp h_{i}(x)}{\sum_{j=1}^k \exp h_j(x)} =\text{softmax}(h(x))_i$$
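As a quick numerical check of the identity above, the sketch below evaluates $-\log\text{softmax}(h)_y$ both directly and in the $-h_y+\log\sum_j \exp h_j$ form on a made-up score vector (the numbers are purely illustrative):

```python
import numpy as np

# Verify that -log softmax(h)_y equals -h_y + log(sum_j exp(h_j)).
h = np.array([2.0, 1.0, 0.1])   # made-up class scores
y = 0                           # suppose the true label is class 0

p = np.exp(h) / np.exp(h).sum()                    # softmax probabilities
loss_direct = -np.log(p[y])                        # -log p(label = y)
loss_logsumexp = -h[y] + np.log(np.exp(h).sum())   # -h_y + log-sum-exp form
print(loss_direct, loss_logsumexp)                 # both come out to about 0.417
```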
The optimization problem we want to solve is then:
$$\underset{\theta}{\text{minimize}}\;\frac{1}{m}\sum_{i=1}^m L_{err}(h_{\theta}(x_{(i)}),y_{(i)})=\underset{\theta}{\text{minimize}}\;\frac{1}{m}\sum_{i=1}^m L_{err}(\theta^Tx_{(i)},y_{(i)})$$
Now that we hold the gradient card, how do we play it? Here $\theta$ is an $n\times k$ matrix, and $\nabla_{\theta}f(\theta)$ below is the matrix of gradient information:
$$\nabla_{\theta}f(\theta)=\begin{bmatrix}\frac{\partial f(\theta)}{\partial \theta_{11}}&\cdots&\frac{\partial f(\theta)}{\partial \theta_{1k}}\\ \vdots& &\vdots\\ \frac{\partial f(\theta)}{\partial \theta_{n1}}&\cdots&\frac{\partial f(\theta)}{\partial \theta_{nk}}\end{bmatrix}$$
We can use the gradient to update $\theta$ step by step, moving it in a direction that decreases the loss (the matrix subtraction is element-wise):
$$\theta = \theta-\alpha\nabla_{\theta}f(\theta)$$
where $\alpha$ is called the learning rate, and this method is called gradient descent.
Because the learning rate has to be set by hand, parameters of this kind are called hyperparameters. The difference from $\theta$ is that the latter only needs a random initial value, which has essentially no effect on the learning result, whereas the learning rate has to be chosen carefully and has a considerable impact on the outcome.
One more point is worth noting. In theory the best approach is to use the information from all samples every time we compute the gradient and update the parameters, but in practice this is far too expensive: a whole epoch would produce only a single update, which is a poor use of time. What is used in practice is stochastic gradient descent (SGD): after randomly shuffling the samples, each step takes only one batch of data to compute the gradient and updates the parameters immediately. By the time one epoch has passed, the parameters have already been updated many times, which is much more efficient.
$$\theta = \theta-\alpha\frac{1}{B}\sum_{i=1}^B\nabla_{\theta}L_{err}(h_{\theta}(x_{(i)}),y_{(i)})$$
Because the learning rate is set by hand, we can hardly expect the step-by-step updates within a single epoch to drive the loss all the way down; it usually takes several epochs for the loss to converge. And when the loss has converged, in parameter space we are either already at the highest summit or still lingering on a "plateau"; we have no way of knowing which.
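A minimal sketch of the shuffle-then-batch loop described above (the names `sgd_epoch` and `grad_fn` are mine; `grad_fn` stands in for whatever minibatch gradient we can compute, which for softmax regression will be the $X^T(Z-I_y)/B$ derived below):

```python
import numpy as np

def sgd_epoch(X, y, theta, grad_fn, lr=0.1, batch=100):
    """One epoch of minibatch SGD: shuffle once, then update once per batch."""
    idx = np.random.permutation(X.shape[0])            # shuffle the sample order
    for start in range(0, X.shape[0], batch):
        sel = idx[start:start + batch]                 # indices of the current minibatch
        theta -= lr * grad_fn(X[sel], y[sel], theta)   # immediate parameter update
```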
One last small dark cloud remains: how do we compute $\nabla_{\theta}L_{err}(h_{\theta}(x),y)\in R^{n\times k}$? Note that $L_{err}(h_{\theta}(x),y)$ is a composite function $L[h_{\theta}(x)]$, so we need the chain rule from calculus.
- First, differentiate the outer function with respect to $h$:
$$\frac{\partial{L_{err}(h,y)}}{\partial h_i}= \frac{\partial}{\partial h_i}\Big(-h_y+\log\sum_{j=1}^k\exp h_j\Big)= -I(i=y)+\frac{\exp h_i}{\sum_{j=1}^k\exp h_j}=\text{softmax}(h)_i-(e_y)_i$$
Hence $\nabla_{h}L_{err}(h,y)=z-e_y$, where $z=\text{softmax}(h)\in R^{k\times 1}$ and $e_y$ is the one-hot vector of the label $y$.
- Next, differentiate the inner function with respect to $\theta$:
$$\frac{\partial h_{\theta}(x)}{\partial \theta}=\frac{\partial}{\partial \theta}(\theta^Tx)=x\in R^{n\times 1}$$
- Finally, multiply the pieces in the order that makes the matrix dimensions match:
$$\nabla_{\theta}L_{err}(h_{\theta}(x),y)=x(z-e_y)^T$$
So, for a whole batch, $\nabla_{\theta}L_{err}(X\theta ,y)=X^T(Z-I_y)\in R^{n\times k}$ (dimensions: $R^{n\times m}\cdot R^{m\times k}=R^{n\times k}$), where $Z$ applies softmax to each row of $X\theta$ and $I_y$ stacks the one-hot label vectors.
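As a sanity check on this formula, here is a small sketch that compares the analytic gradient $X^T(Z-I_y)/m$ with a finite-difference estimate on a tiny random problem (all names and sizes are illustrative, not part of the homework):

```python
import numpy as np

def softmax_ce(theta, X, y):
    """Average cross-entropy loss of the logits X @ theta against labels y."""
    H = X @ theta
    H = H - H.max(axis=1, keepdims=True)            # shift for numerical stability
    logp = H - np.log(np.exp(H).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
m, n, k = 8, 5, 3
X = rng.normal(size=(m, n))
y = rng.integers(0, k, size=m)
theta = rng.normal(size=(n, k))

# Analytic gradient: X^T (Z - I_y) / m
Z = np.exp(X @ theta)
Z /= Z.sum(axis=1, keepdims=True)                   # row-wise softmax
I_y = np.zeros((m, k))
I_y[np.arange(m), y] = 1                            # one-hot labels
grad = X.T @ (Z - I_y) / m

# Finite-difference check on a single entry, theta[0, 0]
eps = 1e-6
d = np.zeros_like(theta)
d[0, 0] = eps
numeric = (softmax_ce(theta + d, X, y) - softmax_ce(theta - d, X, y)) / (2 * eps)
print(grad[0, 0], numeric)                          # the two values should agree closely
```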
3. Homework 0
simple_ml.py
```python
import struct
import numpy as np
import gzip

try:
    from simple_ml_ext import *
except:
    pass

def add(x, y):
    """ A trivial 'add' function you should implement to get used to the
    autograder and submission system. The solution to this problem is in
    the homework notebook.

    Args:
        x (Python number or numpy array)
        y (Python number or numpy array)

    Return:
        Sum of x + y
    """
    ### BEGIN YOUR CODE
    return x + y
    ### END YOUR CODE

def parse_mnist(image_filename, label_filename):
    """ Read an images and labels file in MNIST format. See this page:
    http://yann.lecun.com/exdb/mnist/ for a description of the file format.

    Args:
        image_filename (str): name of gzipped images file in MNIST format
        label_filename (str): name of gzipped labels file in MNIST format

    Returns:
        Tuple (X,y):
            X (numpy.ndarray[np.float32]): 2D numpy array containing the loaded
                data. The dimensionality of the data should be
                (num_examples x input_dim) where 'input_dim' is the full
                dimension of the data, e.g., since MNIST images are 28x28, it
                will be 784. Values should be of type np.float32, and the data
                should be normalized to have a minimum value of 0.0 and a
                maximum value of 1.0 (i.e., scale original values of 0 to 0.0
                and 255 to 1.0).

            y (numpy.ndarray[dtype=np.uint8]): 1D numpy array containing the
                labels of the examples. Values should be of type np.uint8 and
                for MNIST will contain the values 0-9.
    """
    ### BEGIN YOUR CODE
    # Read the image file
    with gzip.open(image_filename, 'rb') as img_file:
        img_file.read(16)  # Skip the header
        img_data = np.frombuffer(img_file.read(), dtype=np.uint8)  # Read the image data
        img_data = img_data.reshape(-1, 28*28).astype(np.float32)  # Reshape and convert to float32
        img_data /= 255.0  # Normalize to [0.0, 1.0]
    # Read the label file
    with gzip.open(label_filename, 'rb') as lbl_file:
        lbl_file.read(8)  # Skip the header
        lbl_data = np.frombuffer(lbl_file.read(), dtype=np.uint8)  # Read the label data
    return img_data, lbl_data
    ### END YOUR CODE

def softmax_loss(Z, y):
    """ Return softmax loss. Note that for the purposes of this assignment,
    you don't need to worry about "nicely" scaling the numerical properties
    of the log-sum-exp computation, but can just compute this directly.

    Args:
        Z (np.ndarray[np.float32]): 2D numpy array of shape
            (batch_size, num_classes), containing the logit predictions for
            each class.
        y (np.ndarray[np.uint8]): 1D numpy array of shape (batch_size, )
            containing the true label of each example.

    Returns:
        Average softmax loss over the sample.
    """
    ### BEGIN YOUR CODE
    # Subtract the maximum value in each row (for numerical stability)
    Z_stable = Z - np.max(Z, axis=1, keepdims=True)
    # Compute log(sum(exp(z_i))) for each row
    log_sum_exp = np.log(np.sum(np.exp(Z_stable), axis=1))
    # Subtract z_y (the logit corresponding to the correct class)
    correct_class_logits = Z_stable[np.arange(Z.shape[0]), y]
    # Compute the loss as the average of log(sum(exp(z_i))) - z_y
    loss = np.mean(log_sum_exp - correct_class_logits)
    return loss
    ### END YOUR CODE

def softmax_regression_epoch(X, y, theta, lr=0.1, batch=100):
    """ Run a single epoch of SGD for softmax regression on the data, using
    the step size lr and specified batch size. This function should modify the
    theta matrix in place, and you should iterate through batches in X _without_
    randomizing the order.

    Args:
        X (np.ndarray[np.float32]): 2D input array of size
            (num_examples x input_dim).
        y (np.ndarray[np.uint8]): 1D class label array of size (num_examples,)
        theta (np.ndarray[np.float32]): 2D array of softmax regression
            parameters, of shape (input_dim, num_classes)
        lr (float): step size (learning rate) for SGD
        batch (int): size of SGD minibatch

    Returns:
        None
    """
    ### BEGIN YOUR CODE
    num_examples, input_dim = X.shape
    num_classes = theta.shape[1]
    for i in range(0, num_examples, batch):
        # Extract the batch
        X_batch = X[i:i+batch]
        y_batch = y[i:i+batch]
        # Compute logits
        logits = np.dot(X_batch, theta)  # Shape: (batch_size, num_classes)
        # Apply softmax to logits
        logits_stable = logits - np.max(logits, axis=1, keepdims=True)
        exp_logits = np.exp(logits_stable)
        Z = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)  # Z = softmax probabilities
        # Create a one-hot encoded version of y_batch (I_y)
        I_y = np.zeros_like(Z)
        I_y[np.arange(X_batch.shape[0]), y_batch] = 1
        # Compute the gradient: X^T (Z - I_y) / batch_size
        grad = np.dot(X_batch.T, (Z - I_y)) / X_batch.shape[0]
        # Update the parameters
        theta -= lr * grad
    ### END YOUR CODE

def relu(x):
    return np.maximum(0, x)


def relu_derivative(x):
    return (x > 0).astype(np.float32)

def nn_epoch(X, y, W1, W2, lr=0.1, batch=100):
    """ Run a single epoch of SGD for a two-layer neural network defined by the
    weights W1 and W2 (with no bias terms):
        logits = ReLU(X * W1) * W2
    The function should use the step size lr, and the specified batch size (and
    again, without randomizing the order of X). It should modify the
    W1 and W2 matrices in place.

    Args:
        X (np.ndarray[np.float32]): 2D input array of size
            (num_examples x input_dim).
        y (np.ndarray[np.uint8]): 1D class label array of size (num_examples,)
        W1 (np.ndarray[np.float32]): 2D array of first layer weights, of shape
            (input_dim, hidden_dim)
        W2 (np.ndarray[np.float32]): 2D array of second layer weights, of shape
            (hidden_dim, num_classes)
        lr (float): step size (learning rate) for SGD
        batch (int): size of SGD minibatch

    Returns:
        None
    """
    ### BEGIN YOUR CODE
    num_examples, input_dim = X.shape
    _, hidden_dim = W1.shape
    num_classes = W2.shape[1]
    for i in range(0, num_examples, batch):
        # Extract the batch
        X_batch = X[i:i+batch]
        y_batch = y[i:i+batch]
        # Forward pass
        Z1 = np.dot(X_batch, W1)  # Shape: (batch_size, hidden_dim)
        A1 = relu(Z1)             # Shape: (batch_size, hidden_dim)
        Z2 = np.dot(A1, W2)       # Shape: (batch_size, num_classes)
        # Compute softmax probabilities
        logits_stable = Z2 - np.max(Z2, axis=1, keepdims=True)
        exp_logits = np.exp(logits_stable)
        softmax_probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
        # Create a one-hot encoded version of y_batch (I_y)
        y_one_hot = np.zeros_like(softmax_probs)
        y_one_hot[np.arange(X_batch.shape[0]), y_batch] = 1
        # Gradient w.r.t. the output logits (already averaged over the batch)
        dL_dZ2 = (softmax_probs - y_one_hot) / X_batch.shape[0]
        # Gradient w.r.t. W2
        dL_dW2 = np.dot(A1.T, dL_dZ2)
        # Backpropagate through the ReLU to get the gradient w.r.t. Z1
        dL_dZ1 = np.dot(dL_dZ2, W2.T) * relu_derivative(Z1)
        # Gradient w.r.t. W1
        dL_dW1 = np.dot(X_batch.T, dL_dZ1)
        # Update weights
        W2 -= lr * dL_dW2
        W1 -= lr * dL_dW1
    ### END YOUR CODE

### CODE BELOW IS FOR ILLUSTRATION, YOU DO NOT NEED TO EDIT

def loss_err(h, y):
    """ Helper function to compute both loss and error"""
    return softmax_loss(h, y), np.mean(h.argmax(axis=1) != y)

def train_softmax(X_tr, y_tr, X_te, y_te, epochs=10, lr=0.5, batch=100,
                  cpp=False):
    """ Example function to fully train a softmax regression classifier """
    theta = np.zeros((X_tr.shape[1], y_tr.max()+1), dtype=np.float32)
    print("| Epoch | Train Loss | Train Err | Test Loss | Test Err |")
    for epoch in range(epochs):
        if not cpp:
            softmax_regression_epoch(X_tr, y_tr, theta, lr=lr, batch=batch)
        else:
            softmax_regression_epoch_cpp(X_tr, y_tr, theta, lr=lr, batch=batch)
        train_loss, train_err = loss_err(X_tr @ theta, y_tr)
        test_loss, test_err = loss_err(X_te @ theta, y_te)
        print("| {:>4} | {:.5f} | {:.5f} | {:.5f} | {:.5f} |"\
              .format(epoch, train_loss, train_err, test_loss, test_err))

def train_nn(X_tr, y_tr, X_te, y_te, hidden_dim=500,
             epochs=10, lr=0.5, batch=100):
    """ Example function to train two layer neural network """
    n, k = X_tr.shape[1], y_tr.max() + 1
    np.random.seed(0)
    W1 = np.random.randn(n, hidden_dim).astype(np.float32) / np.sqrt(hidden_dim)
    W2 = np.random.randn(hidden_dim, k).astype(np.float32) / np.sqrt(k)
    print("| Epoch | Train Loss | Train Err | Test Loss | Test Err |")
    for epoch in range(epochs):
        nn_epoch(X_tr, y_tr, W1, W2, lr=lr, batch=batch)
        train_loss, train_err = loss_err(np.maximum(X_tr@W1, 0)@W2, y_tr)
        test_loss, test_err = loss_err(np.maximum(X_te@W1, 0)@W2, y_te)
        print("| {:>4} | {:.5f} | {:.5f} | {:.5f} | {:.5f} |"\
              .format(epoch, train_loss, train_err, test_loss, test_err))

if __name__ == "__main__":
    X_tr, y_tr = parse_mnist("data/train-images-idx3-ubyte.gz",
                             "data/train-labels-idx1-ubyte.gz")
    X_te, y_te = parse_mnist("data/t10k-images-idx3-ubyte.gz",
                             "data/t10k-labels-idx1-ubyte.gz")

    print("Training softmax regression")
    train_softmax(X_tr, y_tr, X_te, y_te, epochs=10, lr=0.1)

    print("\nTraining two layer neural network w/ 100 hidden units")
    train_nn(X_tr, y_tr, X_te, y_te, hidden_dim=100, epochs=20, lr=0.2)
```
simple_ml_ext.cpp
```cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cmath>
#include <vector>
#include <algorithm>
#include <iostream>

namespace py = pybind11;

void softmax_regression_epoch_cpp(const float *X, const unsigned char *y,
                                  float *theta, size_t m, size_t n, size_t k,
                                  float lr, size_t batch)
{
    std::vector<float> logits(batch * k);    // Stores the logits
    std::vector<float> softmax(batch * k);   // Stores softmax probabilities
    std::vector<float> gradient(n * k);      // Stores the gradient for theta

    for (size_t i = 0; i < m; i += batch) {
        size_t cur_batch_size = std::min(batch, m - i);

        // Compute logits: Z = X_batch * theta
        for (size_t b = 0; b < cur_batch_size; ++b) {
            for (size_t j = 0; j < k; ++j) {
                logits[b * k + j] = 0.0f;
                for (size_t d = 0; d < n; ++d) {
                    logits[b * k + j] += X[(i + b) * n + d] * theta[d * k + j];
                }
            }
        }

        // Apply softmax to compute the probabilities
        for (size_t b = 0; b < cur_batch_size; ++b) {
            float max_logit = logits[b * k];
            for (size_t j = 1; j < k; ++j) {
                if (logits[b * k + j] > max_logit) {
                    max_logit = logits[b * k + j];
                }
            }
            // Subtract max_logit for numerical stability
            float sum_exp = 0.0f;
            for (size_t j = 0; j < k; ++j) {
                softmax[b * k + j] = std::exp(logits[b * k + j] - max_logit);
                sum_exp += softmax[b * k + j];
            }
            for (size_t j = 0; j < k; ++j) {
                softmax[b * k + j] /= sum_exp;
            }
        }

        // Compute the gradients and update theta
        std::fill(gradient.begin(), gradient.end(), 0.0f);
        for (size_t b = 0; b < cur_batch_size; ++b) {
            for (size_t j = 0; j < k; ++j) {
                float indicator = (j == y[i + b]) ? 1.0f : 0.0f;
                float error = softmax[b * k + j] - indicator;
                for (size_t d = 0; d < n; ++d) {
                    gradient[d * k + j] += X[(i + b) * n + d] * error;
                }
            }
        }

        // Update theta using the gradients
        for (size_t j = 0; j < n * k; ++j) {
            theta[j] -= lr * gradient[j] / cur_batch_size;
        }
    }
}

/**
 * This is the pybind11 code that wraps the function above. Its only role is
 * to wrap the function above in a Python module, and you do not need to make
 * any edits to the code.
 */
PYBIND11_MODULE(simple_ml_ext, m) {
    m.def("softmax_regression_epoch_cpp",
          [](py::array_t<float, py::array::c_style> X,
             py::array_t<unsigned char, py::array::c_style> y,
             py::array_t<float, py::array::c_style> theta,
             float lr,
             int batch) {
              softmax_regression_epoch_cpp(
                  static_cast<const float*>(X.request().ptr),
                  static_cast<const unsigned char*>(y.request().ptr),
                  static_cast<float*>(theta.request().ptr),
                  X.request().shape[0],
                  X.request().shape[1],
                  theta.request().shape[1],
                  lr,
                  batch
              );
          },
          py::arg("X"), py::arg("y"), py::arg("theta"),
          py::arg("lr"), py::arg("batch"));
}
```
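Once the extension has been compiled into an importable `simple_ml_ext` module (building it is part of the homework setup and is not reproduced here), the C++ epoch can be exercised through the same training helper via the `cpp` flag:

```python
# Assumes parse_mnist/train_softmax from simple_ml.py and a built simple_ml_ext.
X_tr, y_tr = parse_mnist("data/train-images-idx3-ubyte.gz",
                         "data/train-labels-idx1-ubyte.gz")
X_te, y_te = parse_mnist("data/t10k-images-idx3-ubyte.gz",
                         "data/t10k-labels-idx1-ubyte.gz")

# cpp=True dispatches each epoch to softmax_regression_epoch_cpp.
train_softmax(X_tr, y_tr, X_te, y_te, epochs=10, lr=0.1, cpp=True)
```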