Marcin Drobik

software journeyman notes

Introduction to Neural Networks - Part 4 (Backpropagation)

Last time we saw the neural network cost function that we'd like to minimize. Today we'll look at the algorithm for calculating the derivatives of that cost function, which an optimization algorithm can then use.

Reminder: Forward propagation

As a reminder, in Part 2 of this series we calculated the activations of each layer of the XOR network with the feedforward algorithm (version for k samples):

Matrix<n, k> z2 = theta1.T * a1.Prepend(1);
Matrix<n, k> a2 = Sigmoid(z2);

and so on up to a4, which is the output of the neural network:

Matrix<p, k> z3 = theta2.T * a2.Prepend(1);
Matrix<p, k> a3 = Sigmoid(z3);

Matrix<One, k> z4 = theta3.T * a3.Prepend(1);
Matrix<One, k> a4 = Sigmoid(z4);
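To make the shapes concrete, here is a minimal sketch of the same forward pass in plain Python, with matrices stored as lists of rows. The helper names (prepend_ones, matmul, forward) and the weight values are made up for illustration; only the structure follows the snippets above.

```python
import math

def sigmoid(z):
    """Element-wise logistic function on a matrix stored as a list of rows."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in z]

def prepend_ones(a):
    """Add a row of 1s on top (the bias unit), like a.Prepend(1) above."""
    return [[1.0] * len(a[0])] + a

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def forward(thetas, a1):
    """Return the pre-activations (z) and activations (a) of every layer."""
    zs, activations = [], [a1]
    for theta in thetas:
        z = matmul(transpose(theta), prepend_ones(activations[-1]))
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

# Hypothetical weights for a 2 -> 2 -> 2 -> 1 network; row 0 of each theta is the bias row.
theta1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]   # (2+1) x 2
theta2 = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]   # (2+1) x 2
theta3 = [[0.05], [0.4], [-0.6]]                  # (2+1) x 1
a1 = [[0, 0, 1, 1], [0, 1, 0, 1]]                 # 2 x k: the four XOR inputs as columns

zs, activations = forward([theta1, theta2, theta3], a1)
a4 = activations[-1]                              # 1 x k network output
```

Keeping every z and a around matters, because the backward pass below needs them again.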


Backpropagation is used to calculate the errors in each layer. It starts at the output layer and moves back to the first hidden layer (there's no point in calculating errors for the input layer).

The error for the output layer is simple - it's the difference between the actual and the expected output. For the XOR network this is:

Matrix<One, k> delta4 = a4 - y;
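As a small illustration of the sign convention, the sketch below computes the output error for the four XOR samples in plain Python. The values in a4 are invented network outputs, not results from a trained network.

```python
def subtract(a, b):
    """Element-wise a - b for two matrices stored as lists of rows."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

a4 = [[0.6, 0.7, 0.4, 0.5]]  # hypothetical outputs for inputs (0,0), (0,1), (1,0), (1,1)
y = [[0, 1, 1, 0]]           # expected XOR outputs

delta4 = subtract(a4, y)     # 1 x k, one error per sample
```

A positive entry means the network's output was too high for that sample, a negative entry means it was too low.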

Moving back through hidden layers, we calculate the error as follows:

Matrix<p, k> delta3 = (theta3.RemoveFirstRow() * delta4).MultiplyEach(SigmoidDerivative(z3));

First of all, keep in mind that theta3 includes an extra row of bias weights (its first row, because Prepend(1) puts the bias unit first). We don't need it here, as we're only interested in the errors of the neuron activations. That's why we call RemoveFirstRow() on theta3.

SigmoidDerivative is defined as follows:

Matrix SigmoidDerivative(Matrix z) => Sigmoid(z).MultiplyEach(1 - Sigmoid(z));
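One way to convince ourselves that this formula is right is to compare it with a finite-difference estimate. The sketch below does that for scalars in plain Python; central_difference and the sample points are illustrative choices, not part of the original code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Analytic derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical estimate of f'(x) from two nearby function values."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_gap = max(abs(sigmoid_derivative(x) - central_difference(sigmoid, x)) for x in points)
```

The two should agree to many decimal places, so max_gap ends up tiny.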

Also notice that delta3 is a p x k matrix, which is what we expected to get, as it's the same size as this layer's activation matrix a3. Each column of delta3 corresponds to the error for a single sample. Each row corresponds to the error of a single neuron.
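The hidden-layer step can be sketched in plain Python as follows, assuming p = 2 and k = 4. The function hidden_delta and all numeric values are hypothetical; z3 is set to zeros only so that the sigmoid derivative is 0.25 everywhere and the result is easy to check by hand.

```python
import math

def sigmoid_derivative(z):
    """Element-wise sigmoid(z) * (1 - sigmoid(z)) on a list-of-rows matrix."""
    result = []
    for row in z:
        s = [1.0 / (1.0 + math.exp(-v)) for v in row]
        result.append([v * (1.0 - v) for v in s])
    return result

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def multiply_each(a, b):
    """Element-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def hidden_delta(theta_next, delta_next, z):
    """Error of a hidden layer, given the next layer's weights and error."""
    weights = theta_next[1:]  # drop the bias row, like RemoveFirstRow()
    return multiply_each(matmul(weights, delta_next), sigmoid_derivative(z))

theta3 = [[0.05], [0.4], [-0.6]]    # (p+1) x 1, row 0 is the bias row
delta4 = [[0.6, -0.3, -0.6, 0.5]]   # 1 x k
z3 = [[0.0] * 4, [0.0] * 4]         # p x k

delta3 = hidden_delta(theta3, delta4, z3)  # p x k, same shape as a3
```

For the first sample the first neuron's error is 0.4 * 0.6 * 0.25, and the same function should work for delta2 when called with theta2, delta3 and z2.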


Once delta4, delta3 and delta2 are calculated (delta2 is obtained the same way as delta3, from theta2, delta3 and z2), we can use them to calculate the partial derivatives of the cost function:

Matrix<m, n> dJ1 = Zeros(theta1.Size);
Matrix<n, p> dJ2 = Zeros(theta2.Size);
Matrix<p, One> dJ3 = Zeros(theta3.Size);

for (var i = 0; i < k; ++i)
{
    // Outer product of sample i's activations (with bias) and the next layer's error.
    dJ1 += a1.Prepend(1).GetColumn(i) * delta2.GetColumn(i).T;
    dJ2 += a2.Prepend(1).GetColumn(i) * delta3.GetColumn(i).T;
    dJ3 += a3.Prepend(1).GetColumn(i) * delta4.GetColumn(i).T;
}

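dJ1, dJ2 and dJ3 now hold the partial derivatives of the cost function with respect to theta1, theta2 and theta3, summed over all k samples (if the cost function is averaged over the samples, divide each matrix by k).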
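A common way to test a backpropagation implementation is to compare its output with numerical derivatives of the cost. The sketch below does this in plain Python, one sample at a time. It assumes an unregularized cross-entropy cost summed over the samples (the cost for which a4 - y is the output error); the weights and the function names are made up for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(thetas, x):
    """Activations of every layer for a single sample x (a plain list)."""
    activations = [x]
    for theta in thetas:
        with_bias = [1.0] + activations[-1]
        z = [sum(theta[r][c] * with_bias[r] for r in range(len(theta)))
             for c in range(len(theta[0]))]
        activations.append([sigmoid(v) for v in z])
    return activations

def cost(thetas, samples):
    """Assumed cost: cross-entropy summed over samples, no regularization."""
    total = 0.0
    for x, y in samples:
        out = forward(thetas, x)[-1]
        total -= sum(t * math.log(o) + (1 - t) * math.log(1 - o) for t, o in zip(y, out))
    return total

def gradients(thetas, samples):
    """Backpropagation: accumulate activation * error outer products per sample."""
    grads = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in samples:
        acts = forward(thetas, x)
        delta = [o - t for o, t in zip(acts[-1], y)]  # output layer error
        for layer in range(len(thetas) - 1, -1, -1):
            with_bias = [1.0] + acts[layer]
            for r, a in enumerate(with_bias):
                for c, d in enumerate(delta):
                    grads[layer][r][c] += a * d
            if layer > 0:
                # Skip the bias row (r + 1); sigmoid'(z) equals a * (1 - a).
                theta = thetas[layer]
                delta = [sum(theta[r + 1][c] * delta[c] for c in range(len(delta))) * a * (1.0 - a)
                         for r, a in enumerate(acts[layer])]
    return grads

def numerical_gradient(thetas, samples, layer, r, c, h=1e-5):
    """Central-difference estimate of one partial derivative of the cost."""
    original = thetas[layer][r][c]
    thetas[layer][r][c] = original + h
    plus = cost(thetas, samples)
    thetas[layer][r][c] = original - h
    minus = cost(thetas, samples)
    thetas[layer][r][c] = original
    return (plus - minus) / (2.0 * h)

# Hypothetical 2 -> 2 -> 2 -> 1 network and the four XOR samples.
thetas = [
    [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]],
    [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]],
    [[0.05], [0.4], [-0.6]],
]
samples = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]

grads = gradients(thetas, samples)
max_gap = max(abs(grads[l][r][c] - numerical_gradient(thetas, samples, l, r, c))
              for l in range(len(thetas))
              for r in range(len(thetas[l]))
              for c in range(len(thetas[l][r])))
```

If backpropagation is implemented correctly, max_gap should be very small. The numerical version is far too slow for training, so it is only useful as a check.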

Remember that the layer sizes and the number of layers depend on the network - this particular example is based on the XOR network.
