Marcin Drobik

software journeyman notes

Posts in Stratosphere.NET

Stratosphere.NET - 2 months later

On February 29th I posted about joining "Daj się poznać" competition.

Since then I added 21 post about Stratosphere.NET project:

  1. Matrices - Introduction
  2. Matrix transpose
  3. Matrix multiplication
  4. Matrix - Lazy operations
  5. Numerical Optimization - Steepest Gradient Descent
  6. Linear Regression
  7. Plotting data with OxyPlot
  8. Plotting data with OxyPlot - Part 2
  9. Optimizing with Newton's Method
  10. Optimizing optimization algorithms - Line Search
  11. Quasi-Newton Method
  12. Logistic Regression
  13. Playing with C# type system - Part 1
  14. Playing with C# type system - Part 2
  15. Playing with C# type system - Part 3
  16. Introduction to Neural Networks - Part 1
  17. Introduction to Neural Networks - Part 2
  18. Introduction to Neural Networks - Part 3 (Cost Function)
  19. Introduction to Neural Networks - Part 4 (Backpropagation)
  20. Introduction to Neural Networks - Part 5 (Learning)
  21. Generic Neural Network class

There was one additional post in the meantime - my short report from Wroc# conference.

I learned and coded almost everything as I created those posts. Some of the stuff went pretty deep into numerical methods, which I didn't anticipated when I started the project (but It was fun nonetheless!)

During this period my blog had 1334 visitor sessions and 2643 page views. Most of the traffic was generated by RSS readers, some of it came from twitter links.

From Feb 29 I made 71 commits to project's repository. My longest streak was 21 days in a row (almost entire March). I even got one star on GitHub ;-)


Even though I didn't get thousands of fans I'm extremely happy I managed to create so much. "Daj się poznać" helped me creating a habit of posting regularly and I plan to keep it that way (that is - after I return from my holidays!)

Thanks everyone!

Generic Neural Network class

In the "Introduction to Neural Networks" series we worked on the specific example of XOR network. Because of that, the network layout was hardcoded so that It'd be very hard to change the number of nodes or layers.

With some refactoring I managed to make the code more generic and moved it to separate class (TrainableNerualNetwork - you can look it up on GitHub).


The class is quite easy to work with:

public void XOR_TrainableNetwork()

First define the training data. As previously we define the set of input and outputs for the XOR function:

    var traingData = Vector(0, 0)
        .Concat(Vector(0, 1))
        .Concat(Vector(1, 0))
        .Concat(Vector(1, 1));
    Matrix traingOutput = "0, 1, 1, 0";

    var X = traingData.Evaluate();
    var Y = traingOutput.Evaluate();

Instantiate the TrainableNeuralNetwork object by passing the number of nodes in each layer:

    var xorNetwork = new TrainableNeuralNetwork(2, 3, 2, 1);

Train the network with prepared training data:

    xorNetwork.Train(X, Y);

Network should output the same values as training data:

    MatrixAssert.AreEqual("0, 1, 1, 0", xorNetwork.Output);

Smaller network?

Is it possible to make the XOR network with smaller number of nodes and layers? Certainly! You can try running code above with following object:

var xorNetwork = new TrainableNeuralNetwork(2, 2, 1);

And you'll find that it still outputs correct data.

Introduction to Neural Networks - Part 5 (Learning)

Until now we discussed a neural network for which both the structure - number and sizes of each layer - and the weights were known.

Today we'll see how we can find the parameters theta for given neural network (it's structure is still defined upfront).

Putting it all together

In Part 3 of these series we saw how to calculate the cost for specific choice of parameters theta. In Part 4 we used the Backpropagation algorithm to calculate errors for each layer. Having those error we could then calculate partial derivatives of the cost function.

Having the cost function and it's derivatives, we should have everything to use some optimization method and find best theta parameters, right?

The optimization algorithms we used until now expect the cost function to be of type Func<Matrix, double> f, where the input parameter of the function are the theta parameter. So how do we pass all our theta parameters as single matrix?

That's how I roll

We may use simple trick to pack theta for each layer into single Matrix by putting their raw data next to each other, and pretend they are 1D matrix. Let's call this operation unrolling. Then, when we need to calculate something we can convert the 1D matrix back to 2D matrices by splitting the data back to original form.


theta 1:
1 3
2 4

theta 2:
5 8
6 9
7 10

Then the unrolled version would be:

1 2 3 4 5 6 7 8 9 10

You can easily reverse the operation if you know the original matrix dimensions:

| theta 1             | theta 2             |
| column 1 | column 2 | column 1 | column 2 |
 1 2         3 4        5 6 7      8 9 10

I prepared a simple Matrix decorator class - UnrolledMatrix - that handles those operations.

Unrolled functions

Using this technique we can prepare a cost function that accepts the unrolled parameters and the training data:

double Cost(Matrix unrolledThetas, Matrix X, Matrix y)

For new we'll use the theta parameters for the Xor network example:

    var unrolledMatrix = UnrolledMatrix.Parse(unrolledThetas, new [] {3, 3}, new [] {4, 2}, new [] { 3, 1});

    var theta1 = unrolledMatrix.Matrices[0];
    var theta2 = unrolledMatrix.Matrices[1];
    var theta3 = unrolledMatrix.Matrices[2];

    Matrix<m, n> dJ1;
    Matrix<n, p> dJ2;
    Matrix<p, One> dJ3;

Notice that the backpropagation will calculate both the activations and the partial derivatives in one pass, but for cost we don't use the partial derivatives:

    var result = FeedforwardBackpropagate(X, y, theta1.As<m, n>(), theta2.As<n, p>(), theta3.As<p, One>(), out dJ1, out dJ2, out dJ3);

    return Cost(result, y.As<One, k>());

Function that calculates gradient information looks exactly the same, but instead of cost it returns the unrolled partial derivatives:

static Matrix Gradient(Matrix untolledThetas, Matrix X, Matrix y)
    // same code as in Cost function ...

    FeedforwardBackpropagate(X, y, theta1.As<m, n>(), theta2.As<n, p>(), theta3.As<p, One>(), out dJ1, out dJ2, out dJ3);

    return new UnrolledMatrix(dJ1, dJ2, dJ3);

Let's run it!

Finally we're now able to optimize the parameters of the neural network:

Matrix FindMinimum(Matrix X, Matrix Y)
    var fminunc = new BacktrackingSteepestDescentMethod(maxIterations: 2000);

    return fminunc.Find(
        f: unrolledThetas => Cost(unrolledThetas, X, Y),
        df: unrolledThetas => Gradient(unrolledThetas, X, Y),
        initial: InitialThetas());

(InitialThetas returns random, unrolled theta matrices)

Result? It will be different every time you run it because of the random initialization, here's one example (it's unrolled):

0,83  -1  1,2  0,48  4,43  -4,51  -3,29  5,58  -5,71  0,23  2,99  -9,44  11,6  -1,19  -0,18  5,24  -6,02  -6,8  22,75  -8,93 

They look very different from the original XOR parameters, however if we run the tests for XOR for them:

Works, yay!

Introduction to Neural Networks - Part 4 (Backpropagation)

Last time we saw the neural network cost function that we'd like to minimize. Today we'll look at the algorithm for calculating cost derivatives which can be used in optimization algorithm.

Reminder: Forward propagation

As a reminder, in Part 2 of this series we were calculating activations of each layer of XOR network with feedforward algorithm (version for k samples):

Matrix<n, k> z2 = theta1.T * a1.Prepend(1);
Matrix<n, k> a2 = Sigmoid(z2);

and so on up to a4 which is the output of neural network:

Matrix<p, k> z3 = theta2.T * a2.Prepend(1);
Matrix<p, k> a3 = Sigmoid(z3);

Matrix<One, k> z4 = theta3.T * a3.Prepend(1);
Matrix<One, k> a4 = Sigmoid(z4);


Backpropagation is used to calculate errors in each layer. It starts at the output layer and moves back to first hidden layer (there's no point in calculating errors for input layer).

Error for output layer is simple - it's the difference between actual and expected output. For XOR network this is:

Matrix<One, k> delta4 = a4 - y;

Moving back through hidden layers, we calculate the error as follows:

Matrix<p, k> delta3 = (theta3.RemoveFirstRow() * delta4).MultiplyEach(SigmoidDerivative(z3));

First of all we must keep in mind that theta3 had been calculated with additional bias value - we don't need it here, as we're only interested in errors of the neurons activations. That's why we call RemoveFirstRow() on theta3.

SigmoidDerivative is defined as follows:

Matrix SigmoidDerivative(Matrix z) => Sigmoid(z).MultiplyEach(1 - Sigmoid(z));

Also notice that the delta3 is p x k matrix - which is what we expected to get, as it's the same size as this layer activation matrix a3. Each column of delta3 corresponds to error in single sample. Each row corresponds to error in single neuron.


When delta4, delta3 and delta2 are calculated, we can use them to calculate the partial derivatives of the cost function:

Matrix<m, n> dJ1 = Zeros(theta1.Size);
Matrix<n, p> dJ2 = Zeros(theta2.Size);
Matrix<p, One> dJ3 = Zeros(theta3.Size);

for (var i = 0; i < k; ++i)
    dJ1 += a1.Prepend(1).GetColumn(i) * delta2.GetColumn(i).T;
    dJ2 += a2.Prepend(1).GetColumn(i) * delta3.GetColumn(i).T;
    dJ3 += a3.Prepend(1).GetColumn(i) * delta4.GetColumn(i).T;

dJ1, dJ2, dJ3 hold now the partial derivatives of the cost function.

Remember that the sizes and number of layers will change - this particular example is based on the XOR network.

Introduction to Neural Networks - Part 3 (Cost Function)

In previous post we created neural network by manually setting all weights. In real life situations this is not feasible solution - not only the neural network will have much larger number of connections but also it will be simply impossible to say how the weights relate to problem we want to solve.

Our goal is to implement an algorithm to find those weights.

Optimization again?

As you may suspect this problem won't be much different to what we've already done in Linear and Logistic Regression algorithms - we need to form an optimization problem, which when solved, will yield a matching set of weights for our neural network.

When dealing with optimization problem we first need to define what we want to optimize: having some training set we want our neural network to output values as closest to training values as possible. In fact this problem is almost the same as in Logistic Regression with the difference that neural network can have multiple outputs, and we have to sum the errors of each of them.

The lower the Cost the better

In Logistic Regression the cost function was defined as:

In neural networks, the output for each training example will be a vector, so we have to amend the equation so that it sums the errors of each value inside those vectors (assume that output has K elements and there're m training examples):

Notice that the only difference is additional sum operator over each output element.

Matrix please

It should be relatively easy to write this down with matrix operations in Stratosphere.NET. Let's start with method declaration - we'll accept the networks output values and the training examples:

private static double Cost<K, M>(Matrix<K, M> networkOutputs, Matrix<K, M> trainingOutputs)

networkOutputs will hold one column for every training sample - i-th column holds a network output for i-th training sample. Each column is vector of length K and there are M training samples. trainingOutputs has exactly the same structure.

Let's name the variables the same as in the cost formula:

    Matrix<K, M> h = networkOutputs;
    Matrix<K, M> y = trainingOutputs;
    int m = h.Width;

Since we operate on entire matrices at the same time, we don't have to loop over their individual elements:

    var t1 = y.MultiplyEach(Log(h));
    var t2 = (1 - y).MultiplyEach(Log(1 - h));

    var toSum = -t1 - t2;

Now we only need to sum everything and divide by m:

    return toSum.Sum() / m;

Is that all?

The optimization algorithms we have (like Gradient Descent or Quasi-newton) use gradient information to find the optimal solution - in next post we'll take a look how we can calculate it.

Introduction to Neural Networks - Part 2

In previous post we saw single neuron and learned how to represent and name it's components.

Today we'll see how to combine neurons together to form network!

NAND for the start

NAND operation is functionally complete which means that we can combine it into any logic function. Sounds like fun, so let's see if we can come up with neuron that represents it:

f = S(30 - 20a - 20b)

By looking at function values we can confirm it's indeed a NAND operation:

f(0, 0) = S(30)  = ~1  
f(0, 1) = S(10)  = ~1
f(1, 0) = S(10)  = ~1
f(1, 1) = S(-10) = ~0

Composite operations

By combining NAND operations, we can now create a more complex XOR gate:

I changed the notation a bit - instead of repeating the bias inputs for every neuron I just placed the bias weight inside the neuron circle.

Notice the XOR is composed of 4 NAND neurons.

Layers topology

The topology of XOR if very irregular. Neural networks should form a layers with each node in given layer connected to every node in next layer.

Lets amend our XOR function to look like that:

Notice that I added 2 nodes in second layer. Those nodes are designed to pass one of the arguments. If the argument is 0, then it will become S(-10) ~ 0. If argument is 1 it will become S(10) ~ 1. Rest of the system is unchanged and should function exactly the same as previous one.

Now we clearly see that XOR has 4 layers. The first layer is called input layer. Last layer is called output layer. The layers in between (second and third in our example) are called hidden layers.

Defining Neural Network with Matrices

Matrices can be used to represent our neural network. First, there's a input matrix x (nodes a and b on the diagram):

var x = Matrix.Vector(a, b);

For each neuron in second layer we define the weights of the inputs, taking into account additional bias input:

var theta1_1 = Matrix.Vector(-10, 20, 0);
var theta1_2 = Matrix.Vector(30, -20, -20);
var theta1_3 = Matrix.Vector(-10, 0, 20);

You can take a moment and verify that above values correspond to the weights on XOR diagram. Together the weights form a 2-dimensional matrix that represents entire second layer:

var theta1 = theta1_1.Concat(theta1_2).Concat(theta1_3).Evaluate().As<m, n>();

By convention those matrices are named with index of previous layer (theta1 instead of theta2 in this case) to indicate that they are used to calculate next layer activations.

Matrices for third and fourth layer are defined in the same way:

var theta2_1 = Matrix.Vector(30, -20, -20, 0);
var theta2_2 = Matrix.Vector(30, 0, -20, -20);    
var theta2 = theta2_1.Concat(theta2_2).Evaluate().As<n, p>();

var theta3_1 = Matrix.Vector(30, -20, -20);
var theta3 = theta3_1.As<p, One>();

Since the fourth layer is the output layer, there's no theta4.


Now, when we've defined neural network by its weights, we can derive the value of output layer by calculating individual activation of each neuron, layer by layer. This procedure is called feedforwarding.

Using the matrices, we can calculate activation of entire layer at once. First layer activation is equal to the input values:

var a1 = x.As<m>();

To calculate the activation of second layer, we multiply the weight matrix by the activation of its previous layer. Resulting values are piped through Sigmoid function. Notice that inputs in each layer have additional bias value (.Prepend(1)):

Matrix<n, One> a2 = Sigmoid(theta1.T * a1.Prepend(1));

We repeat the operation for each layer:

Matrix<p, One> a3 = Sigmoid(theta2.T * a2.Prepend(1));
Matrix<One, One> a4 = Sigmoid(theta3.T * a3.Prepend(1));

Last activation resulted in Matrix<One, One> which is what we expected (XOR has single output). It's helpful to round it to integer:

int result = a4.Inner > 0.5 ? 1 : 0;


Several simple test cases reassure us that the network works correctly:

[TestCase(0, 0, ExpectedResult = 0)]
[TestCase(1, 1, ExpectedResult = 0)]
[TestCase(0, 1, ExpectedResult = 1)]
[TestCase(1, 0, ExpectedResult = 1)]

Introduction to Neural Networks - Part 1

Before we start talking about Neural Networks, lets talk about specific implementation of simple logic functions. By using them we'll learn basic math and graphical representations of Neural Networks.


When we implemented the Logistic Regression we used the sigmoid function. This function is defined as:

... and looks like that:

The important properties for us is that for large values of t the S(t) approaches 1 and for small (negative) values of t the S(t) approaches 0.


Logic "and" is a function of two parameters - a and b. To use it in arithmetic calculations we'll assume that values close to 0 represents logical false, and values close to 1 represent logical true.

Let's look at following function (S is sigmoid):

f = S(-30 + 20a + 20b)

When looking at the various values of a and b we see that it indeed implements the logical And - it only outputs value close to 1 when both a and b are 1:

a b f
0 0 S(-30) ~ 0
1 0 S(-10) ~ 0
0 1 S(-10) ~ 0
1 1 S(10) ~ 1

Neuron representation

Our function f can be represented as a single neuron on following graph:

In neural networks terminology, the a and b are Inputs, the 1 is special input called Bias Input, the orange circle is a Neuron (with activation function S(t)) and the arrows in between represent Weights of the inputs for particular Neuron. The arrow on the right represents Output of the neuron - in this case, our function f = S(-30 + 20a + 20b)

Vectorized version

It's very easy to represent this single neuron using matrix operations in Stratosphere.NET. First let's define our inputs as vector x:

var x = Matrix.Vector(1, 0, 1); // a = 0; b = 1

... and the weights as vector theta:

var theta = Matrix.Vector(-30, 20, 20);

Now the neuron output is simply the multiplication:

var f = Sigmoid(x*theta.T);

In next post we'll see how to create simple neural networks.

Playing with C# type system - Part 3

Last two posts have introduced some basic operations on strong typed matrices. We need only last bit to be able to write down following formula with compiler checks:

var dH = (s * s.T) / (s.T * q) - (H * q) * ((H * q).T / (q.T * H * q));

Dot product

There a special case we need to handle: when the result of multiplying two matrices is a single number. This happens if A is 1 x n matrix and B is n x 1 matrix, then the result is by definition 1 x 1 (which is just a number). It should be simple to define this special case with overload:

public class One { }

public class Matrix<D1, D2>
    // ...

    public static Matrix<D1, One> operator *(Matrix<D1, D2> a, Matrix<D2, One> b) => a.Inner.Multiply(b.Inner).As<D1, One>();

    // ...

Having this we can define additional operator for division:

public static Matrix<D1, D2> operator / (Matrix<D1, D2> a, Matrix<One, One> scalar) => a.Inner.Map(v => v / scalar.Inner).As<D1, D2>();

Notice that the second operand is explicitly defined as Matrix<One, One>, so compiler won't let you divide by anything else.

Back to the formula

So now when you write down the formula again, this time using the strongly typed matrices:

Matrix<n, n> H = ...
Matrix<n, One> q = ...
Matrix<n, One> s = ...

var dH = (s*s.T)/(s.T*q) - (H*q)*((H*q).T/(q.T*H*q));

The compiler will tell you dH is of type Matrix<n, n>, which is exactly what we need:

H = H + dh;

Playing with C# type system - Part 2

Last time I introduced the strongly typed Matrix class with some simple operation. Let's try to leverage this so that compiler can help us construct more advanced formulas.

The formula

Imagine that writing down following formula (s, q and H are matrices):

var dH = (s * s.T) / (s.T * q) - (H * q) * ((H * q).T / (q.T * H * q));

To use strongly typed matrices in this formula we need several operations defined: + (defined in last post), - (same as +), Transposition, / and *.

Let's do Transposition first. It quite simple - the returned matrix will have swapped dimensions:

public class Matrix<D1, D2>
    // ...

    public Matrix<D2, D1> T => Inner.Transpose().As<D2, D1>();

    // ...


It's bit more complex for multiplication. From Wikipedia: if A is an n × m matrix and B is an m × p matrix, their matrix product AB is an n × p matrix. Let's give it a try:

// I changed type parameter names to match the definition above
public class Matrix<n, m>
    // Type p is not defined!
    public static Matrix<n, p> operator *(Matrix<n, m> a, Matrix<m, p> b) => a.Inner.Multiply(b.Inner).As<n, p>();

The problem is that type p is not defined - it cannot be a type parameter, so the only option is to actually have if defined as a class:

public class p { }

A bit dirty hack but it compiles and gives a necessary compiler check:

public class n { }
public class m { }

public void MultiplicationTest()
    Matrix<n, m> A = null; // some actual values go here
    Matrix<m, p> B = null;

    Matrix<n, p> AB = A*B; 

However if you try this:

public void NotValid()
    Matrix<m, m> A = null; // some actual values go here
    Matrix<m, p> B = null;

    // Does not compile - operator * is not defined for those matrices
    var AB = A*B; 

... you'll get a compiler error.

Playing with C# type system - Part 1

The Quasi-Newton method uses quite complex formula for calculating the approximation of second derivative. Implemented in Stratosphere.NET it looks like that:

H = H + (s * s.T) / (s.T * q) - (H * q) * ((H * q).T / (q.T * H * q));

All operands - H, s and q are matrices. However until we run the code we can't even say if the operation is valid, because matrix operations require specific matrix sizes (i.e. + and - requires matrices of the same size).

Generic Matrix

What if we'd make the compiler check if the matrix sizes are OK, and even give us information of what the resulting matrix size is? Let's try to do it with generics:

public class Matrix<D1, D2> 
    public Matrix Inner { get; }

    public Matrix(Matrix inner)
        Inner = inner;

    public static implicit operator Matrix(Matrix<D1, D2> a) => a.Inner;

Nothing fancy yet - just a generic class that can be constructed from matrix and converted easily back to it using implicit operator. In Matrix class, we can add a method to easily convert to this class:

public abstract class Matrix
    // ...

    public Matrix<D1, D2> As<D1, D2>() => new Matrix<D1, D2>(this);

    // ... 

Strongly typed addition

We can now define a first operation for strongly typed matrix:

public class Matrix<D1, D2> 
    // ...

    public static Matrix<D1, D2> operator +(Matrix<D1, D2> a, Matrix<D1, D2> b) => a.Inner.Add(b.Inner).As<D1, D2>();

    // ...

Addition requires both Matrices to be same size. Just bare in mind that this code won't enforce that the matrices are indeed of the same size - it only guarantees that we declared them as the same size.

Using new addition operator we can now say:

// Types to distinguish the dimensions
public class M { }    
public class N { }

// ... somewhere in code ...
public void SeriousCalculations(Matrix<M, N> m1, Matrix<M, N> m2)
    Matrix<M, N> result = m1 + m2;

That's simple, but in next post we'll try to write entire Quasi-Newton formula using strongly typed generics.

Logistic Regression

Using the Matrix class and optimization algorithms (like Quasi-Newton Method) we're now ready to build another Machine Learning tool - the Logistic Regression.

The Border

Let's say we have a data set like the following:

Notice that there are two subsets - the green diamonds and orange circles (i.e. rock planets and gas planets). With logistic regression, we can find such line that divides the two subsets in best possible way:

Optimization goal

Similar to Linear Regression, we need to calculate how well our solution fits the data.

To work with the math we need some notations:

  • X - training example
  • y - will hold 1 if corresponding example is our search goal (i.e. "Is that planet a gas planet?"). It will hold 0 otherwise.
  • m - size of training set (number of training examples).
  • <code>theta</code> - parameters for logistic regression, similar to linear regression parameters.
  • <code>sigma(t)</code> - logistic function (we'll define it later).

    To find logistic regression parameters <code>theta</code>, we will optimize following cost function:

    Logistic regression cost

    As you remember, optimization method uses function derivative, so need to define it as well:

Last thing we need is the logistic function <code>sigma(t)</code>:

Stratosphere.NET in action

I'm using C#'s using static to import Math and Matrix namespaces:

using static System.Math;
using static Stratosphere.Math.Matrix;

Using Matrix class we can write down the vectorized formulas. Lets start with cost function (J):

double Cost(Matrix X, Matrix y, Matrix theta)
    var m = X.Height;
    var h = Sigmoid(theta.T * X.T).T;

    return (
        -y.MultiplyEach(h.Map(v => Log(NonZero(v)))) 
        -(1 - y).MultiplyEach((1 - h).Map(v => Log(NonZero(v))))).Sum() / m;

Vectorized gradient function:

Matrix Gradient(Matrix X, Matrix y, Matrix theta)
    var m = X.Height;
    var h = Sigmoid(theta.T * X.T).T;

    var temp = (Ones(theta.Height, 1) * (h - y).T).T;
    var gradient = temp.MultiplyEach(X).Sum(0) / m;

    return gradient.T;

And last but not the least - logistic function (also called sigmoid function):

public static double Sigmoid(double x) => 1.0 / (1.0 + Exp(-x));

public static Matrix Sigmoid(Matrix x) => x.Map(Sigmoid);

Running Optimization

Having all formulas implemented, we're now ready to pass them into our optimization algorithm:

public static Matrix LogisticRegression(Matrix X, Matrix y)

Prepend the data with ones - this way the found model parameters will have constant term:

    X = Matrix.Ones(X.Height, 1)

Pass the functions to Quasi-Newton Method that will found the optimal regression model:

    return QuasiNewtonMethod.Find(
        f: theta => _Cost(X, y, theta),
        df: theta => Gradient(X, y, theta),
        x0: Matrix.Zeros(X.Width, 1),
        maxIterations: 1000);

Quasi-Newton Method

We've already seen the Newton's Method in action. If you recall, the Newton's Method requires computing inverse matrix of functions' second derivative.

This poses two problems: first, calculating the inverse matrix is computationally expensive (my implementation is hard-coded for 2x2 and 3x3 matrices). Second, calculating second derivative may be complex and also computationally expensive (generally we want call the optimized function - or its derivatives - as few times as possible).

Broyden–Fletcher–Goldfarb–Shanno algorithm

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is used to approximate the second derivative of the function. Moreover, with small change it can be used to directly calculate the inversed second derivative making it computationally very efficient.


The formula is bit complex, but can be easily - almost directly - implemented using our Matrix class:

var dfx1 = df(x1);
var q = df(x2) - dfx1;
var s = x2 - x1;

H = H + (s*s.T)/(q.T*s) - (H*q*q.T*H.T)/(q.T*H*q);

H is our inverted matrix of second derivatives (inverted Hessian). Having it, we can use it in Newton's Method. Recall, that original Newton Method used second derivative to calculate the search direction and then uses line search to find minimum in that direction:

var dfx = df(x1);
var p = -ddf(x1).Inverse()*dfx;

var x2 = BacktrackingLineSearch.Find(f, df, p, x1, dfx).Evaluate();

Now all we need to do is replace second derivative calculation (ddf(x)) with our inverted Hessian:

var dfx = df(x1);
var p = -H*dfx;

var x2 = BacktrackingLineSearch.Find(f, df, p, x1, dfx).Evaluate();  

And that's the Quasi Newton Method.

Initial value

The BFGS algorithm is recursive - it uses previous approximation to calculate next. So to run it, we need some initial value. If second derivative is not available, Identity matrix may be used instead.


Optimizing optimization algorithms - Line Search

Both Steepest Gradient Decent and Newton's Method used parameter alpha to control the length of algorithm step. This would vary for each optimized function and for now we could only set it manually.

Wolfie Conditions

We need some sort of mathematical constrains for the step length to be able to choose it automatically. Wolfie Conditions give us such constrains. The conditions are as follows:

Armijo rule: Armijo Rule

Curvature Condition: Curvature Condition


  • p_{k} is search direction (- \nabla f(x_{k}) for Steepest Descent).
  • f is function to optimize.
  • \nabla f is function derivative.
  • \alpha is step length.
  • 0 < c<em>{1} < c</em>{2} < 1 are two coefficients, usually 0.0001 and 0.9 respectively.

In essence, the Armijo rule is constraining the maximum step length and curvature condition is constraining the minimum step length.

Straightforward implementation

The algorithm starts with \alpha set to 1 and then makes it smaller until it fulfills the Armijo rule:

public static Matrix Find(Func<Matrix, double> f, Func<Matrix, Matrix> df, Matrix p, Matrix x_start, Matrix dfx_start)
    var x0 = x_start;
    var fx0 = f(x0);

... limit the number of iterations - after 16 divisions alpha will be so small that it probably doesn't make sense to continue:

    double alpha = 1;
    int i = 0;                              

    while (!Armijo(f, x0, fx0, dfx_start, p, alpha) && i < 16)
        alpha = K * alpha; // K == 0.5

.. and return next point in direction p using calculated alpha:

    return x0 + alpha * p; 

The Armijo function is direct implementation of the Armijo formula:

private static bool Armijo(Func<Matrix, double> f, Matrix x0, double fx0, Matrix dfx0, Matrix p, double alpha)
    return f(x0 + alpha * p) <= fx0 + C * alpha * (dfx0.T * p) + Epsilon;

Alpha-less Gradient Descent

Let's put the algorithm to use. In Steepest Gradient Descent we had following line to calculate next point:

var x2 = x - Alpha * df(x);

Instead, we can use algorithm we just implemented (let's call it Backtracking Line Serch):

var dfx = df(x);
var p = -dfx / dfx.Length; // Normalize the direction vector

var x2 = BacktrackingLineSearch.Find(f, df, p, x, dfx);

When run against Banana function, we get following result (2000 iterations):

(Notice that the Steepest Gradient Decent didn't reach the minimum).

Alpha-less Newton's Method

Similarly, Newton's method defines step as:

var dfx = df(x);
var x2 = x - Alpha * ddf(x).Inverse() * dfx;

With the Backtracking Line Search we can rewrite it as:

var dfx = df(x);
var p = -ddf(x).Inverse()*dfx;

var x2 = BacktrackingLineSearch.Find(f, df, p, x, dfx).Evaluate();

This will give following result for Banana function:

Notice that the Newton method approximated the minimum much better and with fewer steps (< 30 iterations).

Optimizing with Newton's Method

Newton's Method is used to find function's roots (zeros). The method is somehow similar to Steepest Descent - it uses derivative to perform steps. The goal is however different - Steepest Descent uses it to find local minimum.

Newton's Method uses following steps to approximate the roots:

Here's a good visualization of the method from Wikipedia (by Ralf Pfeifer):


What does it have to do with optimization?

Recall that at local optimum, the function's first derivate equals to zero. If so, then we can say that searching for minimum is searching for first derivative's zero - and for that we can use Newton's method.

To do so, when using Newton's Method for optimizing of function f (instead of finding it's roots) we'll need information about both first and second derivatives:

Implementation ...

Here's a simple implementation that assumes both derivatives can be calculated:

public Matrix Find(Func<Matrix, Matrix> df, Func<Matrix, Matrix> ddf, Matrix initial, double alpha)
    var x = initial;

MaxIterations is some constant to make sure the algorithm won't iterate forever (i.e. 1000)

    for (int i = 0; i < MaxIterations; ++i)

This is vectorized version for Newton's optimization - second derivative is inversed instead of being in denominator. alpha is used to control the step length - similar to Gradient Descent.

        var x2 = x - alpha * ddf(x).Inverse()*df(x);

Algorithm will terminate when the step made is very small (i.e. 0.0001):

        if ((x2 - x).Length < Epsilon)
            return x2;

        x = x2;

    return x;

Let's run it!

And here's the result for Banana function (with alpha set to 0.5):

Plotting data with OxyPlot - Part 2

In previous post we looked at using OxyPlot for drawing simple functions and datasets.

Today let's see how we can plot a function of 2 arguments.

Heat Map

As far as I know, OxyPlot doesn't have any support for plotting 3D objects, so we have to find different way for visualizing functions with 2 arguments.

One neat way for doing it is to plot a heat map of values. It works like in cartography - low values are deep blue. High values are red. Everything in between goes from blue, through green and orange to red. Just like following map of Poland:


OxyPlot has a data series called HeatMapSeries that let us do exactly that.

We'll work in context of following method:

public static HeatMapSeries HeatMap(this PlotModel model, double minX0, double maxX0, double minX1, double maxX1, Func<Matrix, double> f)

(plot function f, that accepts Matrix and computes single value, in given range of values)

Start by initializing the OxyPlot series:

var map = new HeatMapSeries
    X0 = minX0,
    X1 = maxX0,
    Y0 = minX1,
    Y1 = maxX1,
    Interpolate = false,
    RenderMethod = HeatMapRenderMethod.Rectangles

Each element of series will be an individual pixel. Let's define the resolution at which we'll evaluation our function, and allocate data for series accordingly:

var x0Resolution = 200;
var step = (maxX0 - minX0)/x0Resolution;
var x1Resolution = (int) ((map.Y1 - map.Y0)/step);

map.Data = new double[x0Resolution, x1Resolution];

With preallocated series data, now it's time to fill it in by iterating through each "pixel" on our map and evaluating function value:

int x0Index = 0;

for (double x0 = minX0; x0 < maxX0; x0 += step)
    var x1Index = 0;
    for (double x1 = minX1; x1 < maxX1; x1 += step)
        map.Data[x0Index, x1Index++] = f(Matrix.Vector(x0, x1));


Finally, let's define how the function value will be translated to color. For this, OxyPlot uses a notion of axises:

model.Axes.Add(new LinearColorAxis
    Position = AxisPosition.Right,
    Palette = OxyPalettes.HueDistinct(1500)

"Banana" function

Let's try our method on following function (logarithm of Rosenbrock function):

First, let's write down the Rosenbrock function itself:

double BananaFunction(Matrix x)
    var a = x[1] - (x[0] * x[0]);
    var b = 1 - x[0];

    return 100 * a * a + b * b;

... and display the Heatmap (using the PlotModel from previous post):

model.HeatMap(-2.0, 2.0, -1, 3, x => Log(BananaFunction(x)));

And here's your Banana:

(Notice a clearly visible minimum at point [1, 1])

Plotting data with OxyPlot

Startosphere.NET has a simple WinForms application for visualizing some of the algorithms. To create data charts I used OxyPlot library - it's open source and cross platform.

With OxyPlot, you start with creating an instance of PlotModel:

var model = new PlotModel { Title = "f(x) = x^2" };


Model can have multiple data series. To draw a line chart we'll use LineSeries class:

var lineSeries = new LineSeries ();

We can then add data:

for (double x = -2, fx = x*x; x <= 2; x += 0.01, fx = x* x)
    lineSeries.Points.Add(new DataPoint(x, fx));

When finished, register the series in the plot object:


And finally, display the plot - this platform specific. Following is the WinForms code:

var plotView = new PlotView ();
plotView.Dock = DockStyle.Fill;
plotView.Location = new Point(0, 0);
plotView.Margin = new Padding(0);
plotView.Name = "plotView";
plotView.Model = model;

Controls.Add (plotView);

When run, this gives a nice chart:


Scatter Series:

Scatter series let's you define a set o points to visualize. This very handy to visualize simple data sets:

public static ScatterSeries Scatter(this PlotModel plot, Matrix x, Matrix y, MarkerType marker = MarkerType.Cross)
    var scatterSeries = new ScatterSeries() { MarkerType = marker, MarkerStrokeThickness = 3};
    for (int row = 0; row < x.Height; ++row)
        scatterSeries.Points.Add(new ScatterPoint(x[row, 0], y[row, 0], 5, 1));


    return scatterSeries;


Multiple Series

Since you can add as many series as you want, it's easy to combine them into single chart. For Linear Regression this can be used to display the data set and corresponding linear regression model:

model.Scatter(diameters, y);

var theta = LinearRegression(diameters, y);
model.Function(diameters, x => theta[0] + theta[1] * x); // Creates LineSeries

Combined Series

Linear Regression

In previous post we designed some basic optimization algorithm - the Steepest Gradient Decent. Let's put it to use by implementing linear regression algorithm.

What's linear regression?

Linear regression works by finding linear function that fits given data set with smallest error.

For example, for given data:

Linear regression could find a line like this:

Linear Regression, can be used to model relationship between various features (height vs weight, distance vs time, star wars planet size vs rotational period etc) and use this model to predict future values.

Linear regression is an example of supervised learning algorithm - we need a training dataset that will have examples with corresponding "right" answers.

Optimizing line parameters

You may remember from school that line is often given by formula y = ax + b. Linear regression is an optimization problem for parameters a and b. We look for such a and b that gives best matching line.

How to define how line matches to our data set? The value of line function (y) for item from data set should be as close to the dataset value as possible, ideally pass through it.

Having some sample data (x_i, y_i), we can calculate the value of linear regression model for x_i: y_regression = a*x_i + b. Then, we calculate the error: error_i = (y_regression - y_i)^2. It's squared so it's always positive.

To calculate the error for entire data set, we simply calculate the mean of errors for all items:

Vectorizing line equation

Line equation y = ax + b can be written as multiplication of two Vectors (1D matrices):

If we now run our optimization algorithm for error function, we'll obtain a matrix |b a| that will contain our linear regression model.

The naming convention is such that the first matrix is called theta and second is called X. In this case the formula becomes: theta * X

There's just one more thing we need to run the optimization: the derivative. For linear regression error function, the derivative is given by:

Stratosphere is action

Time for implementation! First we need the error function (also called "Cost" function):

private static double ComputeCost(Matrix X, Matrix y, Matrix theta)
    var m = y.Height;
    var error = X * theta - y;
    return error.Map(v => v * v).Sum() / (2.0 * m);

Time for gradient function (vectorized version):

private static Matrix Gradient(Matrix X, Matrix y, Matrix theta)
    var error = (X * theta - y);
    var a = (Matrix.Ones(theta.Height, 1) * error.Transpose()).Transpose().MultiplyEach(X);
    return a.Sum(0).Transpose();

Ready for some linear regression?

private static Matrix LinearRegression(Matrix dataset, Matrix y)
    var X = Matrix.Ones(dataset.Height, 1).Concat(dataset).Evaluate();
    return SimpleSteepestDescentMethod.Find(
        f: theta => ComputeCost(X, y, theta),
        df: theta => Gradient(X, y, theta),
        x0: Matrix.Ones(X.Width, 1),
        alpha: 0.0001,
        maxIterations: 5000);

And voilà!

Numerical Optimization - Steepest Gradient Descent

Having built basic Matrix operations we're now ready to look into some useful algorithms that will form basis for our machine learning library.

Numerical optimization algorithms are used to find a minimum or maximum of a function. In machine learning problems, they form a base for building complex algorithms like neural networks, support vector machines etc.

Let's look into one of the simplest algorithms for finding local minimum - Steepest Gradient Descent.


The "gradient" in algorithm name refers to derivative of function we'll minimize. For those of you not feeling that well about calculus: the derivative of the function describes how the function is changing in a neighborhood of a given point.

Say we take some point x. If derivative (df(x)) is greater that zero, it means that the function value (f(x)) near this point is increasing. If it's less than zero, then the function is decreasing. If derivative is exactly 0 it means the function is constant at this point.

Function derivative


Our algorithm will start at some arbitrary point. "Being" at this point it has to choose where to go next. It will use the gradient information - it will move in the direction where the function is guaranteed to become smaller.

Assuming that df(x0) is gradient at point x0, if df(x0) > 0 then the function is increasing (right-hand side of plot above), so to find the minimum we must look at left-hand side of x0: x1 < x0

If the df(x0) < 0 then function is decreasing (left-hand side of plot) so we want to look to the right: x1 > x0.

Let's write it in more compact form:

  • if df(x0) > 0 then x1 < x0
  • if df(x0) < 0 then x1 > x0

Looking at this, we can generalize that we must search in the opposite direction of gradient. In fact this is exactly what the algorithm does, even for functions of multiple variables (then the gradient is a vector of all partial derivatives).


Now when we know in what direction to search, we have to choose how far we want to go in that direction. Since function derivate is zero at optimum (function doesn't change there) it's safe to assume that as it approaches the minimum it will become more flat.

If so, we can then make our step proportional to the actual value of derivative (but in opposite direction!):

x1 ~ x0 - df(x0)

This is fine, however let's just add final element to this:

x1 = x0 - alpha*df(x0), where 0 < alpha < 1

Since we don't know the what the slope is we'll need to control our step size - that's what the alpha parameter is for.

This algorithm will produce following steps (with alpha set to 0.1):

Gradient descent steps

Notice that steps become denser as we approach the minimum - that's because we made it the proportional to the derivative.

Step size

If the step is too small we'll end up doing very little progress towards minimum. If it's too big, we can end up with situation like this:

Too big step size

... which is also sub optimal.

There are algorithms for find right value, but that's topic for another time.


Wrapping it up, you could write down the algorithm as follows:

Matrix x = x0;
var fx = f(x0);

Terminate the algorithm if it runs for too many iterations:

for (var i = 0; i < MaxIterations; ++i)

Calculate next step:

    var x1 = (x - df(x) * alpha).Evaluate();

Terminate if value is not changed (Epsilon is very small number):

    var fx1 = f(x1);
    if (fx - fx1 < Epsilon)
        return x1;

    x = x1;
    fx = fx1;

Using the code

After putting above into single method, here's example of running the algorithm against f(x) = x^2:

var xmin = SimpleSteepestDescentMethod.Find(
    f: x => x[0]*x[0], 
    df: x => 2 * x, 
    x0: Matrix.Scalar(-2), 
    alpha: 0.1, 
    maxIterations: 1000);

In next post we'll see how we can put this algorithm to use in some basic machine learning problem.

Wroc# 2016

On last Thursday (10 March 2016) I went to Wrocław to attend Wroc# conference. Organized by Objectivity on the Wrocław City Stadium - a beautiful venue - was a full day packed with great talks by great speakers. I enjoyed all 6 sessions, but let just cut the description to the 3 I liked the most:

Ian Cooper - "From Monolith to Microservices"

The sessions were started by Ian Copper talking about microservices. I really liked Ian's view on the subject - he wasn't talking about some over-hyped approach, but instead he showed mature, "next SOA" architecture style that's not there to scale, increase profit etc, but to reduce costs when working in large teams.

That's the argument I totally buy. His presentation inspired me so much that I immediatly opened my laptop and started refactoring a project that I'm working at now. I didn't intent to refactor it to microservices - there's no need for that, at least now. Instead I made a quick and dirty proof of concept of a better structured monolith that would more resemble structure of teams working on the project, making it more aligned with Conway's Law.

There's a recording of this session on SkillsMatter.

Julie Lerman - "Entity Framework Core and OS X: WAT?"

I know Julie from her great Pluralsight trainings on Entity Framework and I was very interested on what she was gonna to show during her session. I was betting she'll write code in Visual Studio Code using Entity Framework's In Memory data store.

And I was wrong (mostly). She did code in Visual Studio Code, however against PostgresSQL database! And at the end she deployed the application on Linux using docker! That's amazing peek into what .NET development may look in near future, and it's so different to what we're used to now (especially in "enterprise" development I'm sitting in).

Glenn Condron - "ASP.NET Core"

Glenn is the first person I met that's actually a member of .NET team at Microsoft. Glenn's session was packed with loads of information of what .NET Core is. And there was lot of code and lot of new "dotnet" command. His talked complemented Julie's very nicely! And we had over an hour long discussion at the after party ;-)


Wroc# 2016 was definitely one of the best conferences I've ever been to. I really like their "single day - single track - high profile international speakers" policy. I talked with great people for hours. I returned with new knowledge and new inspirations (no drones this year). And it's all free, which is totally amazing! I hope that Objectivity will decide to organize it next year.


Matrix - Lazy operations

Matrix has some operations we didn't cover yet: multiply by number, add number to every element, compute squares etc. The result of all those operations is same size matrix with every element transformed by some function, i.e:

For example: multiplying matrix by 2 (A * 2) can be written as A.Map(value => value * 2), where Map works similar to LINQ Select function - applies transformation to every element.

Scalar transformation

Instead of performing the operation immediately (and copy the matrix memory in process) we can instead create an object that represents the transformation behavior (again this is similar to the way Select works). Such object will only evaluate the transformed value when accessing matrix data:

public class ScalarTransformedMatrix : Matrix

We store the original matrix and the transformation function that we'll apply to each element

    private readonly Matrix _matrix;
    private readonly Func<double, double> _transformation;

    public ScalarTransformedMatrix( Matrix matrix, Func< double, double> transformation ) : base(matrix.Size)
        _matrix = matrix;
        _transformation = transformation;

Then we override methods accessing the data so they transform returned value:

    public override double GetByColumnIndex( int columnIndex)
        return _transformation(_matrix.GetByColumnIndex(columnIndex));

    public override double GetByRowIndex( int rowIndex)
        return _transformation(_matrix.GetByRowIndex(rowIndex));

    public override double GetByCoordinates( int row, int column)
        return _transformation(_matrix.GetByCoordinates(row, column));

Now the scalar multiplication can be expressed as:

public static Matrix operator *(Matrix a, double scalar) => new ScalarTransformedMatrix(a, v => v * scalar);

And used in code:

Matrix b = a * 2.5;

Element wise operations

Another type of operation we'd like to handle is element-wise matrix operation. Imagine we have two matrices - A and B - and we want to multiply each element of A by corresponding element of B (known as Hadamard product). In Octave/MatLab this is done by using "." operator: A .* B.

To implement this, we can create another transformation class that represents such operation:

public class MatrixTransformedMatrix : Matrix

Main difference between this and ScalarTransofmedMatrix is that we have to store additional matrix and our transformation function now accepts two values (for both matrices):

    private readonly Matrix _matrix;
    private readonly Matrix _transformationValue;
    private readonly Func<double, double, double> _transformation;

    public MatrixTransformedMatrix( Matrix matrix, Matrix transformationValue, Func<double , double , double > transformation) : base(matrix.Size)
        _matrix = matrix;
        _transformationValue = transformationValue;
        _transformation = transformation;

The value access functions look almost exactly the same - the value returned from inner matrix is transformed using the transformation function and corresponding value from second matrix:

    public override double GetByColumnIndex( int columnIndex)
        return _transformation(_matrix.GetByColumnIndex(columnIndex), _transformationValue.GetByColumnIndex(columnIndex));

    public override double GetByRowIndex( int rowIndex)
        return _transformation(_matrix.GetByRowIndex(rowIndex), _transformationValue.GetByRowIndex(rowIndex));

    public override double GetByCoordinates( int row, int column)
        return _transformation(_matrix.GetByCoordinates(row, column), _transformationValue.GetByCoordinates(row, column));

In fact, this is very similar to approach taken when designing the TransposedMatrix class, which flips the coordinates when value is accessed. Sub-matrices - like ColumnFilteredMatrix - and composite matrices - like ColumnsConcatMatrix - use exactly the same technique.

Thanks to that, after adding few operator overloads to our class, we can write complex equations without copying matrix data multiple times, i.e:

var resultMatrix = (1 + Y).Transpose().MultiplyEach(X.GetColumn(1)) - 1;

The matrix class now has enough operations to start building basic algorithms for machine learning - but that's topic for next articles.

Matrix multiplication

Today let's look at one of the most important operations in matrix-based algorithms - matrix product (multiplication of two matrices).


Matrix Product

Wikipedia has a great article together with some examples describing the multiplication very well. Among tons of information it gives following formula for multiplication:

Definition of matrix product


Let's write a simple algorithm that implements the multiplication directly from above formula. For clarity we'll rename the indexes i and j to rowand column respectively.

public ColumnMajorMatrix Multiply(Matrix other)

First validate the input matrices dimensions:

    if (Width != other.Height)
        throw new InvalidOperationException("Matrix dimensions must agree");                

    var resultData = new double[Height * other.Width];

Start by iterating through each cell in result matrix...

    for (int column = 0; column < other.Width; ++column)
        for (int row = 0; row < Height; ++row)

... and compute its value by multiplying corresponding row and column from input matrices:

            double sum = 0;
            for (int k = 0; k < Width; ++k)
                sum += GetByCoordinates(row, k)*other.GetByCoordinates(k, column);

Save the value of computed cell to result array:

            resultData[Coordinate2ColumnIndex(row, column)] = sum;


Since the resultData is stored using column major indexing, we return it as ColumnMajorMatrix:

    return new ColumnMajorMatrix(resultData, new[] { Height, other.Width });

The Coordinate2ColumnIndex simply calculates column-first index:

private int Coordinate2ColumnIndex(int row, int column)
    return (column * Height) + row;

Is that all?

This algorithm's computational complexity is O(n3). There are algorithms that can do better than this, not only by reducing the complexity, but also by improving data locality and parallelism. Some implementations may even use your GPU to speed things up a little.

But that's a topic for another time.

Matrix transpose

Matrix transposition, for 2 dimensional matrix, is done by writing columns as rows and vice versa.

For example, given following matrix:

1 2 3
4 5 6

... it's transposition will look like that:

1 4
2 5
3 6

Switching column and rows - that sounds familiar! Different ways of indexing our matrix does exactly that! Let's give it a try and create a new class using decorator pattern, that will represent transposed matrix:

public class TransposedMatrix : Matrix
    private readonly Matrix _matrix;

    public TransposedMatrix(Matrix matrix) :base(matrix)
        Size = matrix.Size.Reverse().ToArray();
        _matrix = matrix;

    public override double GetByColumnIndex( int index)
        return _matrix.GetByRowIndex(index);

    public override double GetByRowIndex( int index)
        return _matrix.GetByColumnIndex(index);

You'll notice that GetByColumnIndex in fact returns GetByRowIndex of decorated matrix and GetByRowIndex returnes GetByColumnIndex of decorated matrix, thus swaping the columns and rows on the fly - no memory copying is needed.

The dimensions are also reversed - Transposing 3 x 2 matrix will result in 2 x 3 matrix.

In Matrix class, we simply return transposed (decorated) self:

public Matrix Transpose()
    return new TransposedMatrix( this);

Next, add little test to verify it works and Voilà!

You can find TransposedMatrix class on project's GitHub repo.

Matrices - Introduction

Before we get started with algorithms used in Machine Learning, we need to build some basic matrix-based math. We could use something that's already implemented, but hey - we'd miss all the fun of implementing it ourselves!

Data Structure

Let's start with some basic matrix structure. We'd like to build a matrix that isn't just a 2-dimensional table of numbers but instead could hold any number of dimensions. How we could store this in memory? Let's just use a simple array of values and append every next dimension at the end of array.

There're two way to do this. First approach, more compatible with indexing used in Octave/MatLab, is to use "column major" approach:

Column major matrix

public class ColumnMajorMatrix
    private readonly double[] _data;
    private readonly int[] _dimensions;

    public ColumnMajorMatrix( double[] data, int[] dimensions)
        _data = data;
        _dimensions = dimensions;

Alternative approach would be to store the data in row-first (or "row major") fashion:

Row major matrix

We'll get back to it on another occasion, for know let's stick to column major version.

Data Access

Like in storage, there are two main approaches to index and retrieve the data from the matrix:

1) Column-first indexing:

Column first index

For column major matrix, this reflects the order of data in memory, so the code for such index is straightforward:

public double GetByColumnIndex(int index) => _data[index];

2) Row-first indexing:

Row first index

For column major matrix, this indexing is bit more complex. Simplifying to 2D matrices only:

public double GetByRowIndex(int index)
    var row = index/_dimensions[1];
    var column = index % _dimensions[1];

    return _data[row + column*_dimensions[0]];


In next post we'll look into basic Matrix operations - like Transposing and Multiplication - and see how different indexing will help us designing efficient algorithms for those operations.


I joined the "Daj się poznać" competition. For next 10 weeks I'll be (hopefully) posting about a project I started 2 weeks ago.

I called it Stratosphere.NET (because SkyNet was already taken...)

The project is inspired by Stanford's "Machine Learning" (available on Coursera). The aim is to write math/machine learning library and recreate course exercises in C# by porting MatLab/Octave scripts.

Repository will be hosted on GitHub.

All blog entires will be posted under Stratosphere.NET category.