Optimizing MATLAB Code

This post on Stack Overflow by Samuel O'Malley is unusual, since there are very few occasions when a MATLAB programmer thinks about optimizing their code, and they rarely know how much they stand to gain if they do. Here Samuel O'Malley discusses the effect of pre-allocating arrays on execution time, and the effect of the "fread" function. He used tic & toc blocks to measure the execution time of the various code snippets. We don't know the configuration of the computer system he used (both hardware & software), and the results can vary significantly from system to system, or even with the MATLAB version you are using. But this article still provides an insight into MATLAB CODE OPTIMIZATION, which we generally avoid.

Guide to Optimizing MATLAB Code

I have noticed many individual questions on SO, but no single good guide to MATLAB optimization.
Common Questions:
  • Optimize this code for me
  • How do I vectorize this?
I don't think that these questions will stop, but I'm hoping that the ideas presented here will give them something centralized to refer to.
Optimizing MATLAB code is something of a black art; there is always a better way to do it. And sometimes it is straight-up impossible to vectorize your code.
So my question is: when vectorization is impossible or extremely complicated, what are some of your tips and tricks to optimize MATLAB code? Also if you have any common vectorization tricks I wouldn't mind seeing them either.

Preface

All of these tests are performed on a machine that is shared with others, so it is not a perfectly clean environment. Between each test I clear the workspace to free up memory.
Please don't pay attention to the individual numbers, just look at the differences between the before and after optimisation times.
Note: The tic and toc calls I have placed in the code are to show where I am measuring the time taken.

Pre-allocation

The simple act of pre-allocating arrays in Matlab can give a huge speed advantage.
 
tic;
for i = 1:100000
    my_array(i) = 5 * i;
end
toc;
 
This takes 47 seconds

tic;
length = 100000;
my_array = zeros(1, length);
for i = 1:length
    my_array(i) = 5 * i;
end
toc;
 
This takes 0.1018 seconds

47 seconds to 0.1 seconds for a single added line of code is an amazing improvement. Obviously in this simple example you could vectorize it to my_array = 5 * (1:100000) (which took 0.000423 seconds), but I am trying to represent the more complicated cases when vectorization isn't an option.
I recently found that the zeros function (and others of the same nature) is not as fast at pre-allocating as simply setting the last element to 0:


tic;
length = 100000;
my_array(length) = 0;
for i = 1:length
    my_array(i) = 5 * i;
end
toc;
 
This takes 0.0991 seconds

Now obviously this tiny difference doesn't prove much, but you'll have to believe me: over a large file with many of these optimisations, the difference becomes a lot more apparent.

Why does this work?
The pre-allocation methods allocate a chunk of memory for you to work with. This memory is contiguous and can be pre-fetched, just like an Array in C++ or Java. However if you do not pre-allocate then MATLAB will have to dynamically find more and more memory for you to use. As I understand it, this behaves differently to a Java ArrayList and is more like a LinkedList where different chunks of the array are split all over the place in memory.
Not only is this slower when you write data to it (47 seconds!) but it is also slower every time you access it from then on. In fact, if you absolutely CAN'T pre-allocate then it is still useful to copy your matrix to a new pre-allocated one before you start using it.
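As a rough sketch of that last point (grown_array here is just a made-up name for a matrix that was built without pre-allocation):

rescued = zeros(size(grown_array));  % one freshly allocated, contiguous block
rescued(:) = grown_array(:);         % bulk copy of the existing data
grown_array = rescued;               % work with the copy from here on
clear rescued;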

What if I don't know how much space to allocate?
This is a common question and there are a few different solutions:
  1. Overestimation - It is better to grossly overestimate the size of your matrix and allocate too much space than it is to under-allocate (see the sketch after this list).
  2. Deal with it and fix later - I have seen this a lot where the developer has put up with the slow population time, and then copied the matrix into a new pre-allocated space. Usually this is saved as a .mat file or similar so that it could be read quickly at a later date.
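As a rough sketch of the overestimation approach (more_data_available and read_next_value are made-up placeholders for whatever actually produces your data):

max_records = 100000;                      % deliberate overestimate
my_array = zeros(1, max_records);          % allocate the worst case up front
count = 0;
while more_data_available()                % hypothetical predicate
    count = count + 1;
    my_array(count) = read_next_value();   % hypothetical reader
end
my_array(count+1:end) = [];                % trim the unused tail once at the end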
How do I pre-allocate a complicated structure?
Pre-allocating space for simple data-types is easy, as we have already seen, but what if it is a very complex data type such as a struct of structs?
I could never work out how to explicitly pre-allocate these (I am hoping someone can suggest a better method) so I came up with this simple hack:
 
tic;
length = 100000;
% No pre-allocation: the struct array grows on every iteration
for i = 1:length
    complicated_structure(i) = read_from_file(i);
end
toc;
 
This takes 1.5 minutes

tic;
length = 100000;
% Reverse the for-loop to start from the last element, so that the very
% first assignment forces MATLAB to allocate the whole struct array at once
for i = length:-1:1
    complicated_structure(i) = read_from_file(i);
end
toc;
 
This takes 6 seconds

This is obviously not perfect pre-allocation, but the time improvements speak for themselves. I'm hoping someone has a better way to do this, but this is a pretty good hack in the meantime.
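For what it is worth, a plain struct array can often be pre-allocated explicitly by replicating a template struct, or by assigning to the last element first; the field names below are invented for illustration, and this is only a sketch rather than a guaranteed fit for every struct of structs:

n = 100000;
% Template with the same field layout as the real data (fields are made up here)
template = struct('header', struct('id', 0, 'timestamp', 0), 'payload', []);

% Option 1: replicate the template n times in one go
records = repmat(template, 1, n);

% Option 2: assign to the last element first; MATLAB creates elements 1..n-1
% with the same (empty) fields in a single allocation
records2(n) = template;

for i = 1:n
    records(i).payload = i;   % fill in place, no re-allocation
end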

Data Structures

In terms of memory usage, an Array of Structs is significantly worse than a Struct of Arrays, and the gap grows with the number of elements:

% Array of Structs
a(1).a = 1;
a(1).b = 2;
a(2).a = 3;
a(2).b = 4;
 
Uses 624 Bytes

% Struct of Arrays
a.a(1) = 1;
a.b(1) = 2;
a.a(2) = 3;
a.b(2) = 4;
 
Uses 384 Bytes

As you can see, even in this simple/small example the Array of Structs uses a lot more memory than the Struct of Arrays. Also the Struct of Arrays is in a more useful format if you want to plot the data.
Each struct carries a sizeable header, and an array of structs repeats this header for every element, whereas the struct of arrays has only the one header and therefore uses less space. The difference becomes more obvious with larger arrays.
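Byte counts like the ones above can be checked with whos; a minimal sketch (the exact numbers will vary with your MATLAB version):

% Array of Structs
aos(1).a = 1;  aos(1).b = 2;
aos(2).a = 3;  aos(2).b = 4;

% Struct of Arrays
soa.a = [1 3];
soa.b = [2 4];

info = whos('aos', 'soa');
for k = 1:numel(info)
    fprintf('%s uses %d bytes\n', info(k).name, info(k).bytes);
end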

File Reads

The fewer fread calls (or any system calls, for that matter) you have in your code, the better.

tic;    
for i = 1:100
    fread(fid, 1, '*int32');
end
toc;
 
The previous code is a lot slower than the following:

tic;
fread(fid, 100, '*int32');
toc;
 
You might think that's obvious, but the same principle can be applied to more complicated cases:
 
tic;
for i = 1:100
    val1(i) = fread(fid, 1, '*float32');
    val2(i) = fread(fid, 1, '*float32');
end
toc;
 
This problem is no longer simple, because in the file the floats are interleaved like this:
val1 val2 val1 val2 etc.
 
However you can use the skip value of fread to achieve the same optimizations as before:
 
tic;
% Get the current position in the file
initial_position = ftell(fid);
% Read 100 float32 values, and skip 4 bytes after each one
val1 = fread(fid, 100, '*float32', 4);
% Jump back to just after the first float32 (i.e. to the first val2)
fseek(fid, initial_position + 4, 'bof');
% Read 100 float32 values, and skip 4 bytes after each one
val2 = fread(fid, 100, '*float32', 4);
toc;
 
So this file read was accomplished using two freads instead of 200, a massive improvement.

Function Calls

I recently worked on some code that used many function calls, all of which were located in separate files. So let's say there were 100 separate files, all calling each other. By "inlining" this code into one function I saw a 20% improvement in execution speed over the original 9 seconds.
Obviously you would not do this at the expense of re-usability, but in my case the functions were automatically generated and not reused at all. But we can still learn from this and avoid excessive function calls where they are not really needed.
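A rough sketch of the kind of measurement behind that observation (the helper function is invented purely for the comparison, and the exact gap will depend on your machine and MATLAB version):

function call_overhead_demo()
    n = 1e6;
    x = zeros(1, n);

    % The work done through a tiny helper function on every iteration
    tic;
    for i = 1:n
        x(i) = scale_and_shift(i);
    end
    t_calls = toc;

    % The same work with the helper's body "inlined" into the loop
    tic;
    for i = 1:n
        x(i) = 5 * i + 3;
    end
    t_inlined = toc;

    fprintf('with function calls: %.4f s, inlined: %.4f s\n', t_calls, t_inlined);
end

function y = scale_and_shift(i)
    % Trivial helper; the call overhead dominates the arithmetic
    y = 5 * i + 3;
end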

This article is published under cc-wiki licensing, which requires attribution (Stack Overflow & Samuel O'Malley).
 
