## Introduction to Neural Networks - Part 4 (Backpropagation)

Last time we saw the neural network cost function that we'd like to minimize. Today we'll look at the algorithm for calculating the cost derivatives, which can then be fed into an optimization algorithm.

## Reminder: Forward propagation

As a reminder, in Part 2 of this series we calculated the activations of each layer of the `XOR` network with the feedforward algorithm (in the version for `k` samples):

```
Matrix<n, k> z2 = theta1.T * a1.Prepend(1);
Matrix<n, k> a2 = Sigmoid(z2);
```

and so on, up to `a4`, which is the output of the neural network:

```
Matrix<p, k> z3 = theta2.T * a2.Prepend(1);
Matrix<p, k> a3 = Sigmoid(z3);
Matrix<One, k> z4 = theta3.T * a3.Prepend(1);
Matrix<One, k> a4 = Sigmoid(z4);
```
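The same forward pass can be sketched in NumPy. The layer sizes here (2 inputs, two hidden layers of 3 neurons, 1 output) and the random weights are illustrative assumptions, not values from the series:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prepend_ones(a):
    # Add the bias row (all 1s) on top, one entry per sample.
    return np.vstack([np.ones((1, a.shape[1])), a])

rng = np.random.default_rng(0)
a1 = np.array([[0, 0, 1, 1],
               [0, 1, 0, 1]], dtype=float)    # 2 x k: the four XOR inputs

# Illustrative layer sizes: 2 inputs, two hidden layers of 3, 1 output.
theta1 = rng.standard_normal((3, 3))          # (2+1) x 3
theta2 = rng.standard_normal((4, 3))          # (3+1) x 3
theta3 = rng.standard_normal((4, 1))          # (3+1) x 1

z2 = theta1.T @ prepend_ones(a1); a2 = sigmoid(z2)   # 3 x k
z3 = theta2.T @ prepend_ones(a2); a3 = sigmoid(z3)   # 3 x k
z4 = theta3.T @ prepend_ones(a3); a4 = sigmoid(z4)   # 1 x k, network output
```

With random weights the output is of course meaningless; training comes later, once we have the derivatives.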

## Backpropagation

Backpropagation is used to calculate the error in each layer. It starts at the output layer and moves back towards the first hidden layer (there's no point in calculating errors for the input layer, since its values are the inputs themselves).

The error for the output layer is simple: it's the difference between the actual and the expected output. For the `XOR` network this is:

```
Matrix<One, k> delta4 = a4 - y;
```

Moving back through the hidden layers, we calculate the error as follows:

```
Matrix<p, k> delta3 = (theta3.RemoveFirstRow() * delta4).MultiplyEach(SigmoidDerivative(z3));
```

First of all, we must keep in mind that `theta3` includes the weights of the bias unit - we don't need them here, as we're only interested in the errors of the neuron activations. That's why we call `RemoveFirstRow()` on `theta3`. The remaining error, `delta2`, is calculated in exactly the same way, from `theta2`, `delta3` and `z2`.

`SigmoidDerivative` is defined as follows:

```
Matrix SigmoidDerivative(Matrix z) => Sigmoid(z).MultiplyEach(1 - Sigmoid(z));
```
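A quick numeric sanity check of this formula (a minimal NumPy sketch, independent of the `Matrix` type used above): the derivative peaks at `z = 0` with value `0.25`, and it agrees with a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# sigma(0) = 0.5, so sigma'(0) = 0.5 * 0.5 = 0.25 - the derivative's maximum.
print(sigmoid_derivative(0.0))  # 0.25

# Cross-check the formula against a central finite difference at z = 0.7.
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(numeric - sigmoid_derivative(z)) < 1e-9
```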

Also notice that `delta3` is a `p x k` matrix - which is what we expected to get, as it's the same size as this layer's activation matrix `a3`. Each column of `delta3` corresponds to the error in a single sample; each row corresponds to the error in a single neuron.
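The shapes can be verified with a small NumPy sketch of the same step (the sizes and random values are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
p, k = 3, 4                                  # 3 neurons in layer 3, 4 samples
theta3 = rng.standard_normal((p + 1, 1))     # weights, bias row included
z3 = rng.standard_normal((p, k))
delta4 = rng.standard_normal((1, k))         # output-layer errors

# Drop the bias row of theta3, then propagate the error backwards.
delta3 = (theta3[1:, :] @ delta4) * sigmoid_derivative(z3)

assert delta3.shape == (p, k)   # same shape as a3, as the text says
```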

## Derivatives

When `delta4`, `delta3` and `delta2` are calculated, we can use them to calculate the partial derivatives of the cost function:

```
Matrix<m, n> dJ1 = Zeros(theta1.Size);
Matrix<n, p> dJ2 = Zeros(theta2.Size);
Matrix<p, One> dJ3 = Zeros(theta3.Size);
for (var i = 0; i < k; ++i)
{
    dJ1 += a1.Prepend(1).GetColumn(i) * delta2.GetColumn(i).T;
    dJ2 += a2.Prepend(1).GetColumn(i) * delta3.GetColumn(i).T;
    dJ3 += a3.Prepend(1).GetColumn(i) * delta4.GetColumn(i).T;
}
```
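As an aside, the per-sample loop is equivalent to a single matrix product per layer. Here's a NumPy sketch for the `dJ3` term (the sizes and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 3, 4                               # 3 neurons in layer 3, 4 samples
a3 = rng.random((p, k))                   # layer-3 activations
delta4 = rng.standard_normal((1, k))      # output-layer errors

a3p = np.vstack([np.ones((1, k)), a3])    # prepend the bias row

# Sample-by-sample accumulation, mirroring the loop above.
dJ3_loop = np.zeros((p + 1, 1))
for i in range(k):
    dJ3_loop += np.outer(a3p[:, i], delta4[:, i])

# The whole loop collapses into one matrix product.
dJ3_vectorized = a3p @ delta4.T

assert np.allclose(dJ3_loop, dJ3_vectorized)
```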

`dJ1`, `dJ2` and `dJ3` now hold the partial derivatives of the cost function with respect to `theta1`, `theta2` and `theta3` (if your cost function averages over the samples, remember to divide each of them by `k`).

Remember that the sizes and the number of layers will change - this particular example is based on the `XOR` network.
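To tie everything together, here is a self-contained NumPy sketch of the whole procedure with a numerical gradient check. It assumes a cross-entropy cost (for which the output-layer error is exactly `a4 - y`, matching `delta4` above) and illustrative layer sizes; it is a sketch under those assumptions, not the series' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def prepend_ones(a):
    # Add the bias row (all 1s) on top of the activation matrix.
    return np.vstack([np.ones((1, a.shape[1])), a])

def forward(thetas, a1):
    zs, acts = [], [a1]
    for theta in thetas:
        zs.append(theta.T @ prepend_ones(acts[-1]))
        acts.append(sigmoid(zs[-1]))
    return zs, acts

def cost(thetas, a1, y):
    # Cross-entropy cost; its output-layer error is exactly a_out - y.
    a_out = forward(thetas, a1)[1][-1]
    return -np.sum(y * np.log(a_out) + (1 - y) * np.log(1 - a_out))

def gradients(thetas, a1, y):
    zs, acts = forward(thetas, a1)
    deltas = [acts[-1] - y]                    # output-layer error
    for l in range(len(thetas) - 1, 0, -1):    # back through hidden layers
        d = (thetas[l][1:, :] @ deltas[0]) * sigmoid_derivative(zs[l - 1])
        deltas.insert(0, d)
    return [prepend_ones(acts[l]) @ deltas[l].T for l in range(len(thetas))]

rng = np.random.default_rng(42)
a1 = np.array([[0, 0, 1, 1],
               [0, 1, 0, 1]], dtype=float)     # the four XOR inputs
y = np.array([[0, 1, 1, 0]], dtype=float)      # expected XOR outputs
thetas = [0.5 * rng.standard_normal((3, 3)),   # (2+1) x 3
          0.5 * rng.standard_normal((4, 3)),   # (3+1) x 3
          0.5 * rng.standard_normal((4, 1))]   # (3+1) x 1

grads = gradients(thetas, a1, y)

# Spot-check one entry of each gradient against a central finite difference.
eps = 1e-6
for theta, grad in zip(thetas, grads):
    theta[1, 0] += eps
    c_plus = cost(thetas, a1, y)
    theta[1, 0] -= 2 * eps
    c_minus = cost(thetas, a1, y)
    theta[1, 0] += eps                         # restore the weight
    assert abs((c_plus - c_minus) / (2 * eps) - grad[1, 0]) < 1e-4
```

A gradient check like this is a standard way to validate a backpropagation implementation before handing the derivatives to an optimizer.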