Marcin Drobik

software journeyman notes

Introduction to Neural Networks - Part 3 (Cost Function)

In the previous post we created a neural network by manually setting all its weights. In real-life situations this is not a feasible approach - not only will the neural network have a much larger number of connections, but it will also be simply impossible to say how the weights relate to the problem we want to solve.

Our goal is to implement an algorithm to find those weights.

Optimization again?

As you may suspect, this problem won't be much different from what we've already done in the Linear and Logistic Regression algorithms - we need to form an optimization problem which, when solved, will yield a matching set of weights for our neural network.

When dealing with an optimization problem, we first need to define what we want to optimize: given some training set, we want our neural network to output values as close to the training values as possible. In fact this problem is almost the same as in Logistic Regression, with the difference that a neural network can have multiple outputs, so we have to sum the errors of each of them.

The lower the Cost the better

In Logistic Regression the cost function was defined as (h(i) is the hypothesis output and y(i) the expected value for the i-th of m training examples):

    J = -(1/m) * sum(i = 1..m) [ y(i) * log(h(i)) + (1 - y(i)) * log(1 - h(i)) ]

In neural networks, the output for each training example is a vector, so we have to amend the equation so that it sums the errors of each value inside those vectors (assume the output has K elements and there are m training examples):

    J = -(1/m) * sum(i = 1..m) sum(k = 1..K) [ y_k(i) * log(h_k(i)) + (1 - y_k(i)) * log(1 - h_k(i)) ]

Notice that the only difference is the additional sum operator over each output element.
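To make the double sum concrete, here's a minimal sketch in NumPy (not the Stratosphere.NET API used below) that evaluates the formula directly, with one column per training example - the function name and the toy values are mine, for illustration only:

```python
import numpy as np

def cost(h, y):
    """Cross-entropy cost for a multi-output network.

    h and y are K x m matrices: column i holds the network output
    (respectively the expected output) for the i-th training example.
    """
    m = h.shape[1]
    # Element-wise products and the full sum implement the
    # double sum over outputs k and training examples i.
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m

# Two outputs (K = 2), three training examples (m = 3).
h = np.array([[0.9, 0.2, 0.7],
              [0.1, 0.8, 0.3]])
y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(cost(h, y))  # ≈ 0.4568
```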

Matrix please

It should be relatively easy to write this down with matrix operations in Stratosphere.NET. Let's start with the method declaration - we'll accept the network's output values and the training examples:

private static double Cost<K, M>(Matrix<K, M> networkOutputs, Matrix<K, M> trainingOutputs)

networkOutputs holds one column for every training sample - the i-th column holds the network output for the i-th training sample. Each column is a vector of length K, and there are M training samples. trainingOutputs has exactly the same structure.

Let's name the variables the same as in the cost formula:

    Matrix<K, M> h = networkOutputs;
    Matrix<K, M> y = trainingOutputs;
    int m = h.Width;

Since we operate on entire matrices at once, we don't have to loop over their individual elements:

    var t1 = y.MultiplyEach(Log(h));
    var t2 = (1 - y).MultiplyEach(Log(1 - h));

    var toSum = -t1 - t2;

Now we only need to sum everything and divide by m:

    return toSum.Sum() / m;
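A quick sanity check of the formula (again sketched in NumPy rather than Stratosphere.NET, with names of my own choosing): the closer the network's outputs are to the training values, the lower the cost, approaching 0 for a perfect match.

```python
import numpy as np

def cost(h, y):
    m = h.shape[1]
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m

y = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Clip to avoid log(0); these outputs almost exactly match y.
good = np.clip(y, 1e-9, 1 - 1e-9)
# Maximally uncertain outputs: 0.5 everywhere.
bad = np.full_like(y, 0.5)

print(cost(good, y))  # close to 0
print(cost(bad, y))   # 2 * log(2) ≈ 1.386
```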

Is that all?

The optimization algorithms we have (like Gradient Descent or Quasi-Newton) use gradient information to find the optimal solution - in the next post we'll take a look at how to calculate it.
