An Abstract Mathematical Explanation of Neural Networks, with Examples
In this article, I will explain the core ideas of neural networks from an abstract mathematical perspective. By "abstract", I mean that I will explain the "why" behind the mathematical concepts without covering every mathematical detail. To simplify these concepts, I will use analogies from real-life situations, together with visualisations and examples. I will start by explaining why we need neural networks, and then discuss the role of optimisation and the backpropagation algorithm.
Why Do We Need Neural Networks?
Neural networks are tools that allow us to approximate complex multivariate functions representing the relationships between a dataset's inputs and outputs. Typically, it is not feasible to write down one explicit equation that reproduces these multivariate functions; the role of training is therefore to approximate them. Indeed, most neural network architectures are grounded in a mathematical result called the Universal Approximation Theorem. The approximation combines linear transformations (weighted sums of the inputs) with non-linear transformations (fixed activation functions).
Let's take one example to illustrate the role of approximation in neural networks. Assume that we have a neural network with three inputs, one hidden layer with three neurons, and one output y.
As per the Universal Approximation Theorem, the following equation can approximate any multivariate function of our three inputs:

$y = \sum_{j=1}^{3} v_j \, \sigma\!\left( w_{j1} x_1 + w_{j2} x_2 + w_{j3} x_3 + b_j \right) + c$

where $\sigma$ is a non-linear activation function, the $w_{ji}$ and $b_j$ are the weights and biases of the three hidden neurons, and the $v_j$ and $c$ are the weights and bias of the output.
The number of neurons and layers plays an important role in determining the accuracy of the approximation. Let's use a visualisation to understand this. Imagine we have the blue function below that we want to approximate. The accuracy of the approximation with three linear segments (dashed red lines) is lower than when we use six segments.
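To make this concrete, here is a minimal Python sketch (the target curve and the segment counts are illustrative, not taken from the figure) that measures how the approximation error shrinks when we move from three to six linear segments:

```python
import numpy as np

def piecewise_linear_error(f, a, b, n_segments, n_eval=1000):
    """Approximate f on [a, b] by connecting n_segments + 1 equally spaced
    points with straight lines, and return the mean absolute error."""
    knots = np.linspace(a, b, n_segments + 1)
    x = np.linspace(a, b, n_eval)
    approx = np.interp(x, knots, f(knots))  # piecewise-linear interpolation
    return np.mean(np.abs(f(x) - approx))

f = lambda x: np.sin(x)                              # illustrative target curve
print(piecewise_linear_error(f, 0, 2 * np.pi, 3))    # larger error with 3 segments
print(piecewise_linear_error(f, 0, 2 * np.pi, 6))    # smaller error with 6 segments
```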
Let's draw an analogy with real life. Imagine we give two people wooden sticks and ask them to emulate a curved shape that we have drawn in the sand. The first person receives 3 sticks, while the second person receives 6 sticks. We have one rule: each stick must be used as a single straight piece. Clearly, the second person will be able to approximate the curved shape in the sand more accurately, because he has more sticks.

The Universal Approximation Theorem states that a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a bounded domain. However, another paradigm of neural networks, with multiple hidden layers, was later shown in many scenarios to be more efficient than networks with a single hidden layer. Neural networks with multiple hidden layers are called deep neural networks. The neural network given below is an example of a deep neural network with two hidden layers.
This network has the following properties:
Inputs: $x = (x_1, x_2, x_3)$

Hidden Layer 1:

For neuron $j$ (with $j = 1, 2, 3$):

$a_j^{(1)} = \sigma\!\left( \sum_{i=1}^{3} w_{ji}^{(1)} x_i + b_j^{(1)} \right)$

where $\sigma$ is an activation function, $b^{(1)}$ is the bias vector, and $W^{(1)}$ is the weight matrix that links the input with the first hidden layer.

Hidden Layer 2:

For neuron $k$ (with $k = 1, 2, 3$):

$a_k^{(2)} = \sigma\!\left( \sum_{j=1}^{3} w_{kj}^{(2)} a_j^{(1)} + b_k^{(2)} \right)$

The output $y$ is a weighted sum of the activations of the second hidden layer:

$y = \sum_{k=1}^{3} w_k^{(3)} a_k^{(2)} + b^{(3)}$

Our neural network has the following parameters that affect its behaviour: the weight matrices $W^{(1)}, W^{(2)} \in \mathbb{R}^{3 \times 3}$, the output weights $w^{(3)} \in \mathbb{R}^{3}$, the bias vectors $b^{(1)}, b^{(2)} \in \mathbb{R}^{3}$, and the output bias $b^{(3)}$, for a total of 28 parameters.
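As a sketch of what the forward pass of this network looks like in code (the parameter shapes follow the equations above; the random values stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh  # any non-linear activation function works here

# 28 parameters in total: 9 + 3 + 9 + 3 + 3 + 1
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)   # input -> hidden layer 1
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)   # hidden layer 1 -> hidden layer 2
w3, b3 = rng.normal(size=3), rng.normal()              # hidden layer 2 -> output

def forward(x):
    a1 = sigma(W1 @ x + b1)    # activations of hidden layer 1
    a2 = sigma(W2 @ a1 + b2)   # activations of hidden layer 2
    return w3 @ a2 + b3        # scalar output y

print(forward(np.array([0.5, -1.0, 2.0])))
```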
Loss Functions and Optimisation
Loss functions are mathematical functions that take two main inputs: the predictions of the neural network and the true values. They output a value that represents the gap between these two inputs. One of the best-known loss functions is the Mean Squared Error (MSE). Sometimes, we don't have reference true values in our dataset (i.e., unsupervised learning). In this case, we build an artificial reference that we compare with the predictions. For example, generative models use a method called reconstruction loss, in which the model measures how well its generated outputs approximate the input data in the dataset.
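For instance, a minimal Mean Squared Error computation looks like this (the prediction and target values are made up for illustration):

```python
import numpy as np

def mse(predictions, targets):
    """Mean Squared Error: the average squared gap between the two inputs."""
    return np.mean((predictions - targets) ** 2)

y_pred = np.array([2.5, 0.0, 2.1])   # neural network predictions
y_true = np.array([3.0, -0.5, 2.0])  # reference (true) values
print(mse(y_pred, y_true))           # 0.17
```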
In our example of a deep neural network, our objective is to know how the loss function would change if any of the 28 parameters of our network received a new value. If the loss function is denoted by $L(\theta)$, where $\theta$ is the vector of all 28 parameters, this information is captured by the gradient vector:

$\nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots, \frac{\partial L}{\partial \theta_{28}} \right)$

The gradient vector is then used to update the parameters of the neural network:

$\theta \leftarrow \theta - \eta \, \nabla L(\theta)$

where $\eta$ is the learning rate.
Let's take a simple example of a neural network that has only two parameters, to show how the gradient vector is calculated and how it updates the parameters of the model. Consider the loss function:

$L(\theta_1, \theta_2) = \theta_1^2 + \theta_2^2$

The gradients are:

$\frac{\partial L}{\partial \theta_1} = 2\theta_1, \qquad \frac{\partial L}{\partial \theta_2} = 2\theta_2$

Now we use the gradient vector to update the parameters in an iterative way. Let's assume that we start at $(\theta_1, \theta_2) = (4, 3)$ with a learning rate $\eta = 0.1$:

1. Initial Position: $(\theta_1, \theta_2) = (4, 3)$, $L = 25$
2. First Iteration: gradient $= (8, 6)$, so $(\theta_1, \theta_2) = (4 - 0.8, \, 3 - 0.6) = (3.2, 2.4)$, $L = 16$
3. Second Iteration: gradient $= (6.4, 4.8)$, so $(\theta_1, \theta_2) = (2.56, 1.92)$, $L \approx 10.2$
4. Third Iteration: gradient $= (5.12, 3.84)$, so $(\theta_1, \theta_2) = (2.048, 1.536)$, $L \approx 6.6$

Repeating these iterations will allow us to reach the global minimum at the point $(\theta_1, \theta_2) = (0, 0)$.
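These iterations can be reproduced with a few lines of Python, using the same quadratic loss, starting point and learning rate as above:

```python
import numpy as np

def loss(theta):
    return theta[0] ** 2 + theta[1] ** 2            # L(theta1, theta2) = theta1^2 + theta2^2

def gradient(theta):
    return np.array([2 * theta[0], 2 * theta[1]])   # (dL/dtheta1, dL/dtheta2)

theta = np.array([4.0, 3.0])   # initial position
eta = 0.1                      # learning rate

for step in range(1, 4):
    theta = theta - eta * gradient(theta)           # gradient-descent update
    print(step, theta, loss(theta))
# step 1: (3.2, 2.4), step 2: (2.56, 1.92), step 3: (2.048, 1.536) -> heading towards (0, 0)
```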
The example I gave is quite straightforward because we used a convex function that has only one global minimum. Things become more complicated if the loss function has multiple minima. Gradient descent will allow us to reach a minimum, but this minimum is not necessarily the global one.
Let's consider a non-convex loss function that has multiple local minima, for example:

$L(\theta_1, \theta_2) = \sin(3\theta_1) + \theta_1^2 + \sin(3\theta_2) + \theta_2^2$

Compute the gradients:

$\frac{\partial L}{\partial \theta_i} = 3\cos(3\theta_i) + 2\theta_i, \qquad i = 1, 2$

Update the parameters:

$\theta_i \leftarrow \theta_i - \eta \left( 3\cos(3\theta_i) + 2\theta_i \right)$
Thus, if we use a fixed learning rate, we can miss the global minimum, either because we make updates that are too large and overshoot it, or because we converge towards it very slowly. The solution to this problem is to use an adaptive learning-rate method such as Adam (Adaptive Moment Estimation), which is one of the most popular optimisation algorithms used today for training neural networks. The Adam algorithm adapts the learning rate of each parameter based on the gradients calculated in previous iterations.
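Below is a minimal sketch of the Adam update rule (the hyperparameter defaults follow the original Adam paper; the gradient used in the loop is the one from the non-convex example above, and the step size is enlarged so this toy example converges quickly):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the effective step size adapts using running averages of past gradients."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the early iterations
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([4.0, 3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 3 * np.cos(3 * theta) + 2 * theta      # gradient of the non-convex loss above
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)
print(theta)                                      # settles near one of the minima
```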
Backpropagation Algorithm
Another issue we need to handle during training is the efficiency of the gradient calculation. In our example with 28 parameters, calculating the gradient with respect to each parameter is not an issue. However, if our neural network has billions of parameters, computing each partial derivative independently becomes computationally infeasible. This is where the backpropagation algorithm comes in: it provides an efficient method to calculate gradients for large neural networks.
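To see why, consider the naive alternative: estimating each partial derivative with a finite difference needs one extra evaluation of the loss per parameter, so the cost grows linearly with the number of parameters (a sketch with a placeholder loss):

```python
import numpy as np

def numerical_gradient(loss_fn, theta, h=1e-6):
    """Finite-difference gradient: one extra loss evaluation per parameter."""
    grad = np.zeros_like(theta)
    base = loss_fn(theta)
    for i in range(theta.size):                  # billions of iterations for billions of parameters
        theta_shifted = theta.copy()
        theta_shifted[i] += h
        grad[i] = (loss_fn(theta_shifted) - base) / h
    return grad

loss_fn = lambda theta: np.sum(theta ** 2)       # placeholder loss over 28 parameters
print(numerical_gradient(loss_fn, np.ones(28)))
```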
To explain backpropagation, we first need to distinguish between two types of parameters: bias vectors and weight matrices. Our objective is to know how much every change in these parameters would affect the loss function. Below is a general equation for the layers of any neural network built on the Universal Approximation Theorem:

$a^{(k)} = \sigma\!\left( W^{(k)} a^{(k-1)} + b^{(k)} \right), \qquad a^{(0)} = x$

where $W^{(k)}$ is the weight matrix of layer $k$, $b^{(k)}$ is its bias vector, $a^{(k)}$ is the vector of activations of layer $k$, and $\sigma$ is the activation function.

Our objective during training is to see how much each parameter of the model affects the loss $L$, i.e., to compute $\partial L / \partial W^{(k)}$ and $\partial L / \partial b^{(k)}$ for every layer $k$.

Before we see how to calculate these derivatives efficiently, let's see what is needed to calculate them. Writing the pre-activation of layer $k$ as $z^{(k)} = W^{(k)} a^{(k-1)} + b^{(k)}$, the chain rule gives:

$\frac{\partial L}{\partial W^{(k)}} = \frac{\partial L}{\partial z^{(k)}} \left( a^{(k-1)} \right)^{\top}$

and

$\frac{\partial L}{\partial b^{(k)}} = \frac{\partial L}{\partial z^{(k)}}$

So calculating the changes caused by both types of parameters requires one common quantity: $\partial L / \partial z^{(k)}$. The key observation of backpropagation is that this quantity can itself be computed from the layer that follows:

$\frac{\partial L}{\partial z^{(k)}} = \left( W^{(k+1)} \right)^{\top} \frac{\partial L}{\partial z^{(k+1)}} \odot \sigma'\!\left( z^{(k)} \right)$

This decomposition solves our problem because we already know $\partial L / \partial z$ at the output layer (it depends only on the loss function and the network's prediction), and each layer's gradient can then be obtained from the layer after it. We can continue moving back, layer by layer, until we reach the input layer ($k = 1$), obtaining all the gradients in a single backward pass instead of differentiating the whole network separately for each parameter.
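Here is a minimal sketch of these backward-pass equations for the two-hidden-layer example network, assuming a squared-error loss and a tanh activation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2     # derivative of the activation

# Same 3-3-3-1 architecture as before; random placeholders for trained parameters.
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)
w3, b3 = rng.normal(size=3), rng.normal()

def backprop(x, y_true):
    # Forward pass: keep the pre-activations z and activations a of each layer.
    z1 = W1 @ x + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    y = w3 @ a2 + b3                               # linear output layer

    # Backward pass for the squared-error loss L = (y - y_true)^2.
    delta3 = 2 * (y - y_true)                      # dL/dy at the output (scalar)
    dw3, db3 = delta3 * a2, delta3

    delta2 = (w3 * delta3) * sigma_prime(z2)       # dL/dz at hidden layer 2
    dW2, db2 = np.outer(delta2, a1), delta2

    delta1 = (W2.T @ delta2) * sigma_prime(z1)     # dL/dz at hidden layer 1
    dW1, db1 = np.outer(delta1, x), delta1

    return dW1, db1, dW2, db2, dw3, db3

grads = backprop(np.array([0.5, -1.0, 2.0]), y_true=1.0)
print(grads[0])                                    # gradient of the loss w.r.t. W1
```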
While backpropagation is needed for most neural network architectures (CNN, RNN, Transformer, MLP, etc.), it cannot be applied in the same way for Kolmogorov-Arnold Networks (KANs). KANs are another type of neural network that uses a different method of approximation: they decompose multivariate functions into sums of continuous functions of a single variable. This means that calculating the derivatives for every layer is a straightforward process, as we are calculating derivatives with respect to functions of one variable.