Abstract Mathematical Explanation with Examples for Neural Networks

In this article, I will explain the core ideas of neural networks from an abstract mathematical perspective. By "abstract", I mean that I will try to explain the "why" of mathematical concepts without covering all mathematical details. To simplify the mathematical concepts of neural networks, I will use some analogies from real life situations, with visualisations and examples.  I will start by explaining why we need neural networks, and then discuss the role of optimisation and backpropagation algorithms. 

Why Do We Need Neural Networks 

Neural networks are tools that allow us to approximate complex multivariate functions representing the relationships between dataset inputs and outputs. Typically, it is not feasible to define one explicit equation that can reproduce these multivariate functions. The role of training is thus to approximate them. Indeed, most neural networks architectures are based on a mathematical theorem called Universal Approximation Theorem. The Universal Approximation Theorem uses a combination of linear transformation (using weighted sums of linear functions) and non-linear transformations (using static activation functions).  

Let's take one example to illustrate the role of approximators for neural network. For example, let's assume that we have a neural network with three inputs, one layer with three neurons and one output y.


As per the Universal Approximation Theorem, the following equations can allow to approximate any multivariate function (we have three inputs in our case) using the following equations:

\[ \begin{aligned} n_1 &= w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + b_1 \\ a_1 &= a(n_1) \\ n_2 &= w_{21} x_1 + w_{22} x_2 + w_{23} x_3 + b_2 \\ a_2 &= a(n_2) \\ n_3 &= w_{31} x_1 + w_{32} x_2 + w_{33} x_3 + b_3 \\ a_3 &= a(n_3)\\y&= w_{o1} a_1 + w_{o2} a_2 + w_{o3} a_3 + b_{\text{out}} \end{aligned} \]

$a()$ is an activation function whose role is to introduce non-linearity to the neural network so that it can approximate complex functions. 

The number of neurons and layers play an important role to determine the accuracy of the approximation. Let's use visualisation to understand this. Imagine we have the below blue function that we want to approximate.  The accuracy of the approximation with three linear segments (dashed red lines) is less than the one when we use 6 segments. 

Approximation using three linear segments
Approximation using three linear segments 
Approximation using six linear segments 

Let's draw an analogy with the real life. Imagine we give two people wooden sticks and we ask them to emulate a curved shape that we have drawn in the sand. We gave the first person 3 sticks, while the second person receives 6 sticks. We have one rule: players can use only one part of each stick. Clearly, the second person will be able to approximate the curved shape in the sand more accurately than first person because he has more sticks than the first person. 

In real life, the curves representing datasets are often more complex and less smooth than those illustrated in the above example. Real data are often noisy, irregular, and have various patterns that aren't easily captured by simple functions. For instance, the red points in the figure below represent data points from a dataset that we want to approximate. The neural network generates different curves to fit these data points and adjusts its predictions by measuring the distance between its predicted output and the actual data points. This process involves using a loss function, whose objective is to minimize the difference between the predicted values and the true data points.

The Universal Approximation Theorem specifically states that neural networks with a single hidden layer containing a finite number of neurons can approximate any bounded continuous functions. However, later, another paradigm of neural networks that has multiple hidden layers was proven in many scenarios to be more efficient than the neural networks with one hidden layer. The neural networks with multiple layers are called deep neural networks.  The neural network given below is an example of a deep neural network with two hidden layers.  


This network has the following properties: 

Inputs

\( X=[x1, x2, x3] \), where \(X\) is a vector of three input values.

Hidden Layer 1:

The neuron \( N11 \) applies this equation: \[ N11 = a(w_{N11,X1} \cdot X1 + w_{N11,X2} \cdot X2 + w_{N11,X3} \cdot X3 + b_{N11}) \] 

For neuron \( N21 \): \[ N21 = a(w_{N21,X1} \cdot X1 + w_{N21,X2} \cdot X2 + w_{N21,X3} \cdot X3 + b_{N21}) \]

For neuron \( N31 \): \[ N31 = a(w_{N31,X1} \cdot X1 + w_{N31,X2} \cdot X2 + w_{N31,X3} \cdot X3 + b_{N31}) \]

where 

  • \(a \) is an activation function
  • \(B0 \)  is bias vector
  • \(W0 \)  is weight matrix that links the input with the the first hidden layer. 
It is not important in our context to give a specific type of activation function, so we will use \(a \). The values of \(B0 \) and \(W0 \) are given below. In similar way, the values of matrices of the other layers can be calculated.


\[ B0 = \begin{bmatrix} b_{N11} \\ b_{N21} \\ b_{N31} \end{bmatrix} \]

\[ W0 = \begin{bmatrix} w_{N11,X1} & w_{N11,X2} & w_{N11,X3} \\ w_{N21,X1} & w_{N21,X2} & w_{N21,X3} \\ w_{N31,X1} & w_{N31,X2} & w_{N31,X3} \end{bmatrix} \]

Hidden Layer 2:

For neuron \( N12 \): \[ N12 = a(w_{N12,N11} \cdot N11 + w_{N12,N21} \cdot N21 + w_{N12, N31} \cdot N31 + b_{N12}) \] 

For neuron \( N22 \): \[ N22 = a(w_{N22,N11} \cdot N11 + w_{N22, N21} \cdot N21 + w_{N22, N31} \cdot N31 + b_{N22}) \] 

For neuron \( N32 \): \[ N32 = a(w_{N32,N11} \cdot N11 + w_{N32, N21} \cdot N21 + w_{N32, N31} \cdot N31 + b_{N32}) \] 

The output  \( Y \) can be calculated using this equation:

\[ Y = a(w_{N12} \cdot N12 + w_{N22} \cdot N22 + w_{N33} \cdot N32 + b_{Y} )+ b_{3} \]

Our neural network has the following parameters that affect its behaviour:

\begin{align*} \phi = [ &\phi_0 = w_{N11, X1}, \phi_1 = w_{N11, X2}, \phi_2 = w_{N11, X3},  \phi_3 = b_{N11}, \\ &\phi_4 = w_{N21, X1}, \phi_5 = w_{N21, X2}, \phi_6 = w_{N21, X3}, \phi_7 = b_{N21}, \\ &\phi_8 = w_{N31, X1}, \phi_9 = w_{N31, X2}, \phi_{10} = w_{N31, X3}, \phi_{11} = b_{N31}, \ldots, b3=\phi_{27} ] \end{align*}

Loss Functions and Optimisations

Loss functions are mathematical functions that take two main inputs: the predictions of the neural network and the true values.  They output a value that represents the gap between their two inputs. One of the most known loss functions is the Mean Squared Error.  Sometimes, we don't have the reference true values in our datasets (i.e., unsupervised learning). In this case, we build artificial reference that we use it to compare with the predictions. For example, generative models use a method called reconstruction loss. In this method, the generative models calculate how well the generated outputs approximate the input data n the datasets.

In our example of deep neural network, our objective is to know how the loss function would change if any of the 28th parameters of our network gets a new value. If the loss function is denoted by \(L\), then our objective is to find the gradient vector (gradient of \(L\) with regards to the vector \(\phi\)) :

\[\frac{\partial L}{\partial \phi} = \begin{bmatrix} \frac{\partial L}{\partial \phi_0} \\ \frac{\partial L}{\partial \phi_1} \\ \vdots \\ \frac{\partial L}{\partial \phi_{27}} \end{bmatrix}\]

The gradient vector is then used to update the parameters of the neural network :

\[\phi \leftarrow \phi - \alpha \cdot \frac{\partial L}{\partial \phi}\]

 \(\alpha\) is constant that is defined before training as hyperparameter and it is called "learning rate" because it determines the amount of change on the parameters of the neural networks.  

Let's take a simple example for a neutral network that has only two parameters to show how the gradient vector can be calculated and how it updates the parameters of the model. Consider the loss function: \[ L(\phi_1, \phi_2) = (\phi_1 - 3)^2 + (\phi_2 - 2)^2 \] 

The gradients are: \[ \frac{\partial L}{\partial \phi_1} = 2(\phi_1 - 3) \] \[ \frac{\partial L}{\partial \phi_2} = 2(\phi_2 - 2) \] 

Now we have to use the gradient vector to update the parameters in iterative way. Let's assume that we start at \((\phi_1, \phi_2) = (0, 0)\) and using \(\alpha = 0.1\): 

1. Initial Position: \[ (\phi_1, \phi_2) = (0, 0) \] 

2. First Iteration: \[ \frac{\partial L}{\partial \phi_1} \bigg|_{\phi_1=0} = 2(0 - 3) = -6 \] \[ \frac{\partial L}{\partial \phi_2} \bigg|_{\phi_2=0} = 2(0 - 2) = -4 \] \[ \phi_1 \leftarrow 0 - 0.1 \cdot (-6) = 0.6 \] \[ \phi_2 \leftarrow 0 - 0.1 \cdot (-4) = 0.4 \] 

3. Second Iteration: \[ \frac{\partial L}{\partial \phi_1} \bigg|_{\phi_1=0.6} = 2(0.6 - 3) = -4.8 \] \[ \frac{\partial L}{\partial \phi_2} \bigg|_{\phi_2=0.4} = 2(0.4 - 2) = -3.2 \] \[ \phi_1 \leftarrow 0.6 - 0.1 \cdot (-4.8) = 1.08 \] \[ \phi_2 \leftarrow 0.4 - 0.1 \cdot (-3.2) = 0.72 \] 

4. Third Iteration: \[ \frac{\partial L}{\partial \phi_1} \bigg|_{\phi_1=1.08} = 2(1.08 - 3) = -3.84 \] \[ \frac{\partial L}{\partial \phi_2} \bigg|_{\phi_2=0.72} = 2(0.72 - 2) = -2.56 \] \[ \phi_1 \leftarrow 1.08 - 0.1 \cdot (-3.84) = 1.464 \] \[ \phi_2 \leftarrow 0.72 - 0.1 \cdot (-2.56) = 0.976 \] 

Repeating these iterations will allow us to reach to the global minimum at the point \((3, 2)\) as illustrated in the figure below. These steps illustrate the core idea of the "Gradient Decent Algorithm".


The example that I give is quite straightforward because we used a convex function that has only one global minimum. Things can be more complicated if we have a loss function that has multiple minimums. The gradient decent will allow us to reach one minimum but this minimum is not necessarily the ideal one. 

Let's consider this loss function which is non-convex function that has multiple local minimas, such as: 

\[L(\phi_1, \phi_2) = \frac{\sin(\phi_1) \cdot \cos(\phi_2) + \sin(\phi_2) \cdot \cos(\phi_1) + 2}{4}\]


With these types of loss functions, it is not convenient to use the traditional gradient decent algorithm but other variants are used such as Stochastic Gradient Descent (SGD) or the mini-batch Gradient Descent. The basic ideas of these approaches is to select randomly one datapoint or a set of datapoints which allows to explore different potential local minimas. 

Imagine you are exploring a mountainous area with multiple valleys. your objective is to find the deepest point in the area. Let's assume that you can't see the entire landscape at once, so finding the absolute deepest point is challenging. The SGD algorithm allows us to handle these types situations to eventually find the deepest valley. The randomness introduced by the SGD algorithm allows us to escape eventually from shallow valleys to find deeper ones.

However, SGD and its variants will not work perfectly for all training scenarios as they are using a fixed learning rate. To illustrate this idea, let's take an example. Let's assume that we have the following loss function: 

\[L(\phi_1, \phi_2) = 50 \phi_1^2 + \phi_2^2\]

The gradients with respect to $\phi_1$ and $\phi_2$ are:
\[\frac{\partial L}{\partial \phi_1} = 100 \phi_1\]
\[\frac{\partial L}{\partial \phi_2} = 2 \phi_2\]

If we select initial parameters as : $\phi_1 = 1$, $\phi_2 = 1$ with learning rate $\alpha = 0.1$, we can see clearly that the updates for $\phi_1$ are very large while they are very low for $\phi_2$.
Iteration1
Compute the gradients:
\[\frac{\partial L}{\partial \phi_1} = 100 \times 1 = 100\]
\[\frac{\partial L}{\partial \phi_2} = 2 \times 1 = 2\]
Update parameters:
\[\phi_1 = \phi_1 - \alpha \times \frac{\partial L}{\partial \phi_1} = 1 - 0.1 \times 100 = -9\]
\[\phi_2 = \phi_2 - \alpha \times \frac{\partial L}{\partial \phi_2} = 1 - 0.1 \times 2 = 0.8\]
Iteration2:

Compute the gradients:

\[\frac{\partial L}{\partial \phi_1} = 100 \times (-9) = -900\]

\[\frac{\partial L}{\partial \phi_2} = 2 \times 0.8 = 1.6\]

Update parameters:

\[\phi_1 = \phi_1 - \alpha \times \frac{\partial L}{\partial \phi_1} = -9 - 0.1 \times (-900) = 81\]

\[\phi_2 = \phi_2 - \alpha \times \frac{\partial L}{\partial \phi_2} = 0.8 - 0.1 \times 1.6 = 0.64\]

Thus, if we use fixed learning rate, we can miss the global minimum because we are making very big updates or converging very slowly to global minimum.  The solution to this problem is to use adaptive learning rate such as Adam (Adaptive Moment Estimation), which is one of the most popular optimisation algorithms used today for training neural networks. Adam algorithm adapts the learning rate based on the past gradients that are calculated in the previous iterations.

Backpropagation Algorithm

Another issue we need to handle during training is the efficiency of gradient calculation. In our example with 28 parameters, calculating the gradients with regards to each parameter is not an issue. However, if our neural network has billions of parameters, a direct calculation of gradients will not be possible. This is where the backpropagation algorithm comes in, as it proposes an efficient method to calculate gradient for large neural networks.  

To explain backpropagation, we need first to distinguish between two types of parameters: bias vectors and weight matrices. Our objective is to know how much every change in the parameters would affect the loss function. Below is a general equation for any neural network that uses the general approximation theorem. 

\[ f_k = \beta_k + \Omega_k h_k \]

\[h_k = a[\beta_{k-1} + \Omega_{k-1} h_{k-1}]\]

\[\phi = \{\beta_k, \Omega_k\}\]

Where \(a\) is an activation function. \(h_k\) is the hidden state at layer \(k\), which is a vector that contains the scalar values of activation functions for every neuron at layer  \(k\).  \(k \in \{0, 1, 2, \ldots, K\}\).

Our objective during training is to see how much each parameter of the model affects the individual loss $\ell_i$ which allows later to calculate the total loss $L[\phi]$ on a dataset $I$: $L[\phi] = \sum_{i=1}^{I} \ell_i$. Specifically, we need to know $ \frac{\partial \ell_i}{\partial \beta_k} $ and $\frac{\partial \ell_i}{\partial \Omega_k}$.

Before we mention how to calculate the derivative efficiently, let's see what is needed to calculate $ \frac{\partial \ell_i}{\partial \beta_k} $ and $\frac{\partial \ell_i}{\partial \Omega_k}$. We know that ${\ell_i}$ is a function of $f_K$ where $K$ is the output layer (for example,  $\ell_i = (f_K[\mathbf{x}_i, \phi] - y_i)^2$, where $y_i$ is the true value and $f_K[\mathbf{x}_i, \phi]$ is the prediction of the neural network), so we can start the calculation of all derivative from this point.

\[ \frac{\partial \ell_i}{\partial \beta_K}= \frac{\partial f_K}{\partial \beta_K} \frac{\partial \ell_i}{\partial f_K}= 1.\frac{\partial \ell_i}{\partial f_K}=\frac{\partial \ell_i}{\partial f_K}\] 

and 

\[\frac{\partial \ell_i}{\partial \Omega_K}=\frac{\partial f_K}{\partial \Omega_K} \frac{\partial \ell_i}{\partial f_K}=h_K.\frac{\partial \ell_i}{\partial f_K}\]

So calculating the changes ${\ell_i}$ with regards to the parameters of the last layer is quite easy. However, calculating efficiently $ \frac{\partial \ell_i}{\partial \beta_{0..K-1}} $ and $\frac{\partial \ell_i}{\partial \Omega_{0..K-1}}$ is more challenging. Here again,  we will calculate them starting from points that we know using the chain rule of derivatives. For example:

\[ \frac{\partial \ell_i}{\partial \beta_{K-1}}= \frac{\partial f_{K-1}}{\partial \beta_{K-1}} \frac{\partial \ell_i}{\partial f_{K-1}}=\frac{\partial \ell_i}{\partial f_{K-1}}\] 

and 

\[\frac{\partial \ell_i}{\partial \Omega_{K-1}}=\frac{\partial f_{K-1}}{\partial \Omega_{K-1}} \frac{\partial \ell_i}{\partial f_{K-1}}=h_{K-1}.\frac{\partial \ell_i}{\partial f_{K-1}}\]

For both types of parameters, we need to know $\frac{\partial \ell_i}{\partial f_{K-1}}$. We know that $\ell_i$ is composite function that we can write in term of $f_{K-1}$ as follows: $\ell_i (f_K(h_K(f_{K-1})))$. So We can apply the chain rule to calculate $\frac{\partial \ell_i}{\partial f_{K-1}}$:

\[\frac{\partial \ell_i}{\partial f_{K-1}}=\frac{\partial h_K}{\partial f_{K-1}} \frac{\partial f_K}{\partial h_K} \frac{\partial \ell_i}{\partial f_K}\] 

This decomposition solves our problem because we know already $\frac{\partial \ell_i}{\partial f_K}$. We can easily calculate  $\frac{\partial f_K}{\partial h_K}=\Omega_K$. Calculating $\frac{\partial h_K}{\partial f_{K-1}}$ is also easy because it is the derivative of the activation function with regards to the function $f_{K-1}$. 

We can continue moving back until we reach the input layer ($k=0$). The efficiency of the backpropagation algorithms comes from the fact that all intermediate values are pre-stored during the forward pass before calculating the loss. It means that most of the derivatives calculations require only invoking the intermediate values from memory.

While backpropagation is needed for most the neural network architectures (CNN, RNN, Transformer, MLP, etc..), it can not be applied in the same way for Kolmogorov-Arnold Networks (KANs). KANs are another types of neural network that uses another method of approximation. KANs transform multivariable functions into sum of continuous functions with one single variable. This means that calculating the derivatives for every layer is straightforward process as we are calculating derivates with regards to a function with one variable. 


References 

  1. Simon J.D. Prince. (2023). Understanding Deep Learning. The MIT Press. http://udlbook.com
  2. Danka, Tivadar (2024). Mathematics of Machine Learning.






Comments

Popular posts from this blog

ChatGPT or CheatGPT? The impact of ChatGPT on education

What does information security really mean?