From Theory to Code: Implementing a Neural Network in 200 Lines of C
Introduction
Neural networks are computational models inspired by the human brain, capable of learning complex patterns from data. Today, high-level languages and libraries like TensorFlow, PyTorch, or Keras make building neural networks trivial. Implementing one in C, however, provides a deeper understanding of the underlying mechanics and a greater appreciation of what those higher-level abstractions do for us.
In this article, we will build our own neural network in roughly 200 lines of C, using only the standard library. The network will be trained to recognize handwritten digits from the MNIST dataset.
Note: This article is based on miniMNIST-C, my minimal neural network implementation in C that you can explore on GitHub.
Neural network structure
Our neural network will consist of three layers: input, hidden, and output. The number of neurons in each layer is determined by the characteristics of the dataset we are using. In this case, we'll be working with the MNIST dataset, which provides handwritten digits in a 28x28 pixel format. By flattening each image into a single vector, we obtain an array of 784 elements that will serve as our input layer. The output layer will comprise 10 neurons, each corresponding to one of the possible digits (0-9).
This can be summarised as:
- Input Layer: Accepts the 28×28 pixel images (flattened into a 784-dimensional vector).
- Hidden Layer: Contains 256 neurons
- Output Layer: Contains 10 neurons (one for each digit class)
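Alongside the standard headers the code relies on (<stdio.h>, <stdlib.h>, and <math.h>), these dimensions show up as a handful of size constants used throughout the snippets below. A minimal sketch (the exact definitions in the project may differ slightly):
#define IMAGE_SIZE  28                         // MNIST images are 28x28 pixels
#define INPUT_SIZE  (IMAGE_SIZE * IMAGE_SIZE)  // 784 inputs after flattening
#define HIDDEN_SIZE 256                        // neurons in the hidden layer
#define OUTPUT_SIZE 10                         // one neuron per digit class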
Processing Input Data
The MNIST dataset comprises 60,000 training images and 10,000 testing images of handwritten digits, ranging from 0 to 9. Each image is represented as a 28×28 pixel grayscale matrix. Before we can use this data in our neural network, we need to preprocess and format it in a way that aligns with the network's architecture.
We use two functions to read the images and labels from the IDX file format:
void read_mnist_images(const char *filename, unsigned char **images, int *nImages) {
    FILE *file = fopen(filename, "rb");
    if (!file) exit(1);

    int temp, rows, cols;
    fread(&temp, sizeof(int), 1, file);      // skip the magic number
    fread(nImages, sizeof(int), 1, file);
    *nImages = __builtin_bswap32(*nImages);  // IDX headers are big-endian

    fread(&rows, sizeof(int), 1, file);
    fread(&cols, sizeof(int), 1, file);
    rows = __builtin_bswap32(rows);
    cols = __builtin_bswap32(cols);

    *images = malloc((*nImages) * IMAGE_SIZE * IMAGE_SIZE);
    fread(*images, sizeof(unsigned char), (*nImages) * IMAGE_SIZE * IMAGE_SIZE, file);
    fclose(file);
}
void read_mnist_labels(const char *filename, unsigned char **labels, int *nLabels) {
    FILE *file = fopen(filename, "rb");
    if (!file) exit(1);

    int temp;
    fread(&temp, sizeof(int), 1, file);      // skip the magic number
    fread(nLabels, sizeof(int), 1, file);
    *nLabels = __builtin_bswap32(*nLabels);

    *labels = malloc(*nLabels);
    fread(*labels, sizeof(unsigned char), *nLabels, file);
    fclose(file);
}
The idea is to:
- Open the IDX files in binary mode
- Read and convert the header information (number of images, rows, columns)
- Allocate memory and read the image pixel data and labels
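As a rough usage sketch, loading the data might look like the following. The file paths are placeholders for wherever you unpack the MNIST IDX files, and the struct name is an assumption based on how `data` is accessed in the later snippets:
typedef struct {
    unsigned char *images, *labels;
    int nImages;
} InputData;   // hypothetical name; bundles the loaded images and labels

InputData data;
int nLabels;
// Placeholder paths for the unpacked MNIST IDX files
read_mnist_images("data/train-images.idx3-ubyte", &data.images, &data.nImages);
read_mnist_labels("data/train-labels.idx1-ubyte", &data.labels, &nLabels);
// Each image is IMAGE_SIZE * IMAGE_SIZE consecutive bytes in data.images,
// and data.labels[i] holds the digit (0-9) for the i-th image.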
Implementing the Neural Network Structure
We define a struct to represent each layer in the network:
typedef struct {
    float *weights, *biases;
    int input_size, output_size;
} Layer;
- weights: A flattened array representing the weight matrix
- biases: An array for the biases of each neuron
- input_size: Number of inputs to the layer
- output_size: Number of neurons in the layer
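The later snippets also pass the two layers around as a single network (net.hidden, net.output). The exact struct is not shown in the article, so treat this as an assumption consistent with that usage:
typedef struct {
    Layer hidden, output;   // 784 -> 256 and 256 -> 10, respectively
} Network;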
The weights and biases of each layer need to be initialized to a default value. They will later be updated by the neural network as it learns through the process of backpropagation.
We initialize the weights and biases in the `init_layer` function:
void init_layer(Layer *layer, int in_size, int out_size) {
    int n = in_size * out_size;
    float scale = sqrtf(2.0f / in_size);   // He-style scaling factor

    layer->input_size = in_size;
    layer->output_size = out_size;
    layer->weights = malloc(n * sizeof(float));
    layer->biases = calloc(out_size, sizeof(float));   // biases start at zero

    for (int i = 0; i < n; i++)
        layer->weights[i] = ((float)rand() / RAND_MAX - 0.5f) * 2 * scale;   // uniform in [-scale, scale]
}
We use a He-style initialization to set the weights (the code above scales a uniform distribution by $\sqrt{2/n_{in}}$):
$$W \sim \mathcal{U}\left(-\sqrt{\frac{2}{n_{in}}}, \sqrt{\frac{2}{n_{in}}}\right)$$ The weights $W$ are drawn from a uniform distribution $\mathcal{U}$ whose bounds are determined by the number of input units $n_{in}$, ensuring proper scaling and reducing the risk of vanishing or exploding gradients during training.
Biases are initialized to zero.
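With init_layer in place, setting up the whole network is just one call per layer. A short sketch, assuming the Network struct above (seeding rand() requires <time.h>):
Network net;

srand(time(NULL));                                   // seed the RNG used by init_layer
init_layer(&net.hidden, INPUT_SIZE, HIDDEN_SIZE);    // 784 -> 256
init_layer(&net.output, HIDDEN_SIZE, OUTPUT_SIZE);   // 256 -> 10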
Forward Propagation
Forward propagation, also known as the forward pass, is the process of computing the output of a neural network for a given input. It involves moving the input data through each layer of the network, applying linear transformations and activation functions (ReLU, softmax, etc.) to produce the final output.
Forward propagation is essential for several reasons. First, it generates predictions by moving the input through the network, which we need both for training the model and for making inferences with it later. During training, it produces the predicted output that we compare against the true labels to calculate the loss. Finally, it computes the activations and intermediate outputs that the backpropagation step uses to calculate gradients and update the model weights.
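The snippets that follow call a forward helper whose body is not shown in this article. Based on how it is invoked and on the weight indexing used later in backward (idx = j * output_size + i), a minimal sketch looks like this; the actual implementation in miniMNIST-C may differ in details such as loop order:
void forward(Layer *layer, float *input, float *output) {
    // Start each output neuron from its bias, then accumulate the weighted inputs.
    for (int i = 0; i < layer->output_size; i++)
        output[i] = layer->biases[i];
    for (int j = 0; j < layer->input_size; j++)
        for (int i = 0; i < layer->output_size; i++)
            output[i] += input[j] * layer->weights[j * layer->output_size + i];
}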
Activation Functions
An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. We will use two kinds of activation functions in our implementation:
ReLU
We apply the ReLU (Rectified Linear Unit) activation function in the hidden layer:
for (int i = 0; i < HIDDEN_SIZE; i++)
    hidden_output[i] = hidden_output[i] > 0 ? hidden_output[i] : 0;
$$\text{ReLU}(x) = \max(0, x)$$ The ReLU (Rectified Linear Unit) activation function returns the maximum value between 0 and the input $x$, effectively introducing non-linearity and allowing the network to learn complex patterns while avoiding negative outputs.
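Its derivative, which the backward pass will need when propagating gradients through the hidden layer, is just as simple:
$$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$ This is exactly what the training code later applies with `hidden_grad[i] *= hidden_output[i] > 0 ? 1 : 0`.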
Softmax
We apply the softmax function to the output layer to compute probabilities:
void softmax(float *input, int size) {
    float max = input[0], sum = 0;
    for (int i = 1; i < size; i++)
        if (input[i] > max) max = input[i];
    for (int i = 0; i < size; i++) {
        input[i] = expf(input[i] - max);   // subtract the max for numerical stability
        sum += input[i];
    }
    for (int i = 0; i < size; i++)
        input[i] /= sum;
}
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$ The softmax function $\sigma(\mathbf{z})_i$ computes the probability distribution over $K$ classes for the $i$-th element of the input vector $\mathbf{z}$ by exponentiating each element and normalizing by the sum of all exponentiated values.
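Note that the code subtracts the maximum input before exponentiating. This does not change the result, because softmax is invariant to shifting all inputs by the same constant; it only prevents expf from overflowing on large inputs:
$$\frac{e^{z_i - m}}{\sum_{j=1}^K e^{z_j - m}} = \frac{e^{-m}\, e^{z_i}}{e^{-m} \sum_{j=1}^K e^{z_j}} = \sigma(\mathbf{z})_i, \quad m = \max_k z_k$$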
Backpropagation
Backpropagation, also known as the backward pass, reverses the flow of the forward pass: we propagate the error from the output layer back toward the input layer, passing through the hidden layer(s). The backward function updates the weights and biases based on the gradients:
void backward(Layer *layer, float *input, float *output_grad, float *input_grad, float lr) {
    for (int i = 0; i < layer->output_size; i++) {
        for (int j = 0; j < layer->input_size; j++) {
            int idx = j * layer->output_size + i;
            float grad = output_grad[i] * input[j];   // dL/dw = dL/dy * x
            layer->weights[idx] -= lr * grad;         // gradient descent step
            if (input_grad)
                input_grad[j] += output_grad[i] * layer->weights[idx];   // accumulate dL/dx
        }
        layer->biases[i] -= lr * output_grad[i];      // dL/db = dL/dy
    }
}
Weight update
float grad = output_grad[i] * input[j];
layer->weights[idx] -= lr * grad;
$$w_{ij} = w_{ij} - \eta \cdot \frac{\partial L}{\partial w_{ij}}$$ The weight $w_{ij}$ is updated by subtracting the product of the learning rate $\eta$ and the gradient of the loss function with respect to the weight, effectively moving the weight in the direction that minimizes the loss.
Bias update
layer->biases[i] -= lr * output_grad[i];
$$b_i = b_i - \eta \cdot \frac{\partial L}{\partial b_i}$$ The bias term $b_i$ is updated by subtracting the product of the learning rate $\eta$ and the gradient of the loss function with respect to the bias, effectively adjusting the bias in the direction that minimizes the loss.
Input gradient calculation
if (input_grad)
    input_grad[j] += output_grad[i] * layer->weights[idx];
$$\frac{\partial L}{\partial x_j} = \sum_{i=1}^n \frac{\partial L}{\partial y_i} \cdot w_{ij}$$ The partial derivative of the loss function $L$ with respect to the input $x_j$ is computed as the sum of the products of the partial derivatives of the loss with respect to each output $y_i$ and their corresponding weights $w_{ij}$, effectively propagating the gradients backwards through the network.
Gradient calculation for weight update
float grad = output_grad[i] * input[j];
$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_i} \cdot x_j$$ The partial derivative of the loss function $L$ with respect to the weight $w_{ij}$ is calculated as the product of the partial derivative of the loss with respect to the output $y_i$ and the corresponding input $x_j$, which is used to update the weight during backpropagation.
Backpropagation is essential because it calculates how much each weight and bias affects the network's error, which allows it to adjust these parameters to reduce the error in future predictions. By sending the error information backward through the network, backpropagation enables all layers to learn and improve the network's performance on the given task through iterative updates. In short, backpropagation is the key mechanism that allows neural networks to learn and make better predictions over time.
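One step worth making explicit before the training code: the output gradient we feed into backward comes from pairing the softmax output with the cross-entropy loss $L = -\log \hat{y}_{label}$. For that combination, the gradient with respect to the pre-softmax outputs $z_i$ collapses to a simple difference, which is exactly the `final_output[i] - (i == label)` expression used in the train function below:
$$\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i$$ where $\hat{y}_i$ is the softmax probability for class $i$ and $y_i$ is 1 for the true label and 0 otherwise.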
Training
With forward and backward propagation now implemented, we are ready to train our neural network. Training involves calling these propagation functions in a loop to adjust the network's weights and biases, minimizing the loss and enhancing the model's ability to make accurate predictions with each pass.
The train function encapsulates the training logic for a single input-label pair:
void train(Network *net, float *input, int label, float lr) {
    float hidden_output[HIDDEN_SIZE], final_output[OUTPUT_SIZE];
    float output_grad[OUTPUT_SIZE] = {0}, hidden_grad[HIDDEN_SIZE] = {0};

    // Forward Pass: Input to Hidden Layer
    forward(&net->hidden, input, hidden_output);
    for (int i = 0; i < HIDDEN_SIZE; i++)
        hidden_output[i] = hidden_output[i] > 0 ? hidden_output[i] : 0;   // ReLU Activation

    // Forward Pass: Hidden to Output Layer
    forward(&net->output, hidden_output, final_output);
    softmax(final_output, OUTPUT_SIZE);

    // Compute Output Gradient
    for (int i = 0; i < OUTPUT_SIZE; i++)
        output_grad[i] = final_output[i] - (i == label);

    // Backward Pass: Output Layer to Hidden Layer
    backward(&net->output, hidden_output, output_grad, hidden_grad, lr);

    // Backpropagate Through ReLU Activation
    for (int i = 0; i < HIDDEN_SIZE; i++)
        hidden_grad[i] *= hidden_output[i] > 0 ? 1 : 0;   // ReLU Derivative

    // Backward Pass: Hidden Layer to Input Layer
    backward(&net->hidden, input, hidden_grad, NULL, lr);
}
This process represents one training iteration.
Training loop
Training is an iterative process spanning multiple epochs. Each epoch runs the entire training set through the network, performing forward and backward passes to adjust the weights and biases and, in doing so, improve the network's accuracy.
This is implemented with a training loop that iterates over a set number of epochs (we will use 20 here):
for (int epoch = 0; epoch < EPOCHS; epoch++) {
    float total_loss = 0;
    for (int i = 0; i < train_size; i += BATCH_SIZE) {
        for (int j = 0; j < BATCH_SIZE && i + j < train_size; j++) {
            int idx = i + j;
            for (int k = 0; k < INPUT_SIZE; k++)
                img[k] = data.images[idx * INPUT_SIZE + k] / 255.0f;   // normalize pixels to [0, 1]
            train(&net, img, data.labels[idx], learning_rate);
            // Compute loss for monitoring
            // ...
        }
    }
    // Evaluate on test set
    // ...
}
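The loop above relies on a few values defined elsewhere. EPOCHS is 20 as discussed; the batch size and learning rate shown here are placeholder values for illustration (the ones used in miniMNIST-C may differ), and img, train_size, and test_size follow from the 80/20 split described in the next section:
#define EPOCHS        20       // passes over the training data
#define BATCH_SIZE    64       // placeholder value
#define LEARNING_RATE 0.001f   // placeholder value

float learning_rate = LEARNING_RATE;
float img[INPUT_SIZE];                        // buffer for one normalized image
int train_size = (int)(data.nImages * 0.8f);  // 80% of the images for training
int test_size  = data.nImages - train_size;   // remaining 20% for evaluation
Note that even though the inner loop is chunked by BATCH_SIZE, train is still called once per image, so the weights are updated after every sample rather than once per batch.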
Evaluation
With the neural network trained, we can move on to evaluating its performance by running inference on an 'unseen' set of data. We split the loaded images into training and test sets (80% training, 20% test) to provide a benchmark for evaluation.
We assess accuracy using the predict function:
int predict(Network *net, float *input) {
    float hidden_output[HIDDEN_SIZE], final_output[OUTPUT_SIZE];

    forward(&net->hidden, input, hidden_output);
    for (int i = 0; i < HIDDEN_SIZE; i++)
        hidden_output[i] = hidden_output[i] > 0 ? hidden_output[i] : 0;   // ReLU
    forward(&net->output, hidden_output, final_output);
    softmax(final_output, OUTPUT_SIZE);

    int max_index = 0;
    for (int i = 1; i < OUTPUT_SIZE; i++)
        if (final_output[i] > final_output[max_index])
            max_index = i;
    return max_index;
}
The predict function implements the forward pass of our neural network for running inference. It uses the same forward propagation and activation functions (ReLU, softmax) covered earlier, but instead of computing gradients, it returns the most likely digit prediction.
We combine this with our main training loop to evaluate the model's performance:
for (int epoch = 0; epoch < EPOCHS; epoch++) {
    float total_loss = 0;
    for (int i = 0; i < train_size; i += BATCH_SIZE) {
        for (int j = 0; j < BATCH_SIZE && i + j < train_size; j++) {
            int idx = i + j;
            for (int k = 0; k < INPUT_SIZE; k++)
                img[k] = data.images[idx * INPUT_SIZE + k] / 255.0f;
            train(&net, img, data.labels[idx], learning_rate);

            // Re-run the forward pass to accumulate the cross-entropy loss
            float hidden_output[HIDDEN_SIZE], final_output[OUTPUT_SIZE];
            forward(&net.hidden, img, hidden_output);
            for (int k = 0; k < HIDDEN_SIZE; k++)
                hidden_output[k] = hidden_output[k] > 0 ? hidden_output[k] : 0;   // ReLU
            forward(&net.output, hidden_output, final_output);
            softmax(final_output, OUTPUT_SIZE);
            total_loss += -logf(final_output[data.labels[idx]] + 1e-10f);
        }
    }

    // Evaluate accuracy on the held-out 20% test split
    int correct = 0;
    for (int i = train_size; i < data.nImages; i++) {
        for (int k = 0; k < INPUT_SIZE; k++)
            img[k] = data.images[i * INPUT_SIZE + k] / 255.0f;
        if (predict(&net, img) == data.labels[i])
            correct++;
    }

    printf("Epoch %d, Accuracy: %.2f%%, Avg Loss: %.4f\n",
           epoch + 1, (float)correct / test_size * 100, total_loss / train_size);
}
Running the training loop gives us the following results:
Epoch 1, Accuracy: 96.12%, Avg Loss: 0.2188
Epoch 2, Accuracy: 96.98%, Avg Loss: 0.0875
Epoch 3, Accuracy: 97.41%, Avg Loss: 0.0561
Epoch 4, Accuracy: 97.63%, Avg Loss: 0.0383
Epoch 5, Accuracy: 97.63%, Avg Loss: 0.0270
Epoch 6, Accuracy: 97.69%, Avg Loss: 0.0193
Epoch 7, Accuracy: 97.98%, Avg Loss: 0.0143
Epoch 8, Accuracy: 98.03%, Avg Loss: 0.0117
Epoch 9, Accuracy: 98.03%, Avg Loss: 0.0103
Epoch 10, Accuracy: 98.06%, Avg Loss: 0.0094
Epoch 11, Accuracy: 98.06%, Avg Loss: 0.0087
Epoch 12, Accuracy: 98.16%, Avg Loss: 0.0081
Epoch 13, Accuracy: 98.16%, Avg Loss: 0.0078
Epoch 14, Accuracy: 98.18%, Avg Loss: 0.0075
Epoch 15, Accuracy: 98.19%, Avg Loss: 0.0074
Epoch 16, Accuracy: 98.20%, Avg Loss: 0.0072
Epoch 17, Accuracy: 98.24%, Avg Loss: 0.0070
Epoch 18, Accuracy: 98.23%, Avg Loss: 0.0069
Epoch 19, Accuracy: 98.23%, Avg Loss: 0.0069
Epoch 20, Accuracy: 98.22%, Avg Loss: 0.0068
The results demonstrate consistent improvement in accuracy and reduction in loss over the course of 20 epochs, with the model ultimately achieving an impressive 98.22% accuracy on the held-out test split. It's worth noting that state-of-the-art models typically achieve around 99% accuracy or higher on the same dataset.
Taking this into account, our small, custom-built model implemented in C performs remarkably well, especially considering its simplicity.
Conclusion
You should now be able to combine the key steps above to create your own neural network in C.
You can explore the project and its source code on GitHub.