I have noticed many individual questions on SO, but no single good guide to MATLAB optimization. I don't think these questions will stop, but I'm hoping that the ideas presented here will give them something centralized to refer to.
Optimizing MATLAB code is something of a black art: there is always a better way to do it, and sometimes it is straight-up impossible to vectorize your code.
So my question is: when vectorization is impossible or extremely complicated, what are some of your tips and tricks for optimizing MATLAB code? Also, if you have any common vectorization tricks, I wouldn't mind seeing them either.
Preface
All of these tests are performed on a machine that is shared with others, so it is not a perfectly clean environment. Between each test I clear the workspace to free up memory.
Please don't pay attention to the individual numbers; just look at the differences between the before and after optimization times.
Note: the tic and toc calls I have placed in the code show where I am measuring the time taken.
Pre-allocation
The simple act of pre-allocating arrays in MATLAB can give a huge speed advantage.
tic;
for i = 1:100000
my_array(i) = 5 * i;
end
toc;
This takes 47 seconds
tic;
n = 100000; % avoid shadowing the built-in length function
my_array = zeros(1, n);
for i = 1:n
my_array(i) = 5 * i;
end
toc;
This takes 0.1018 seconds
47 seconds down to 0.1 seconds for a single added line of code is an amazing improvement. Obviously in this simple example you could vectorize it as
my_array = 5 * (1:100000)
(which took 0.000423 seconds), but I am trying to represent the more complicated cases where vectorization isn't an option. Note the parentheses: without them, operator precedence makes 5 * 1:100000 evaluate to 5:100000.
I recently found that the zeros function (and others of the same nature) is not as fast at pre-allocating as simply setting the last value to 0:
tic;
n = 100000;
my_array(n) = 0; % implicitly allocates the whole array
for i = 1:n
my_array(i) = 5 * i;
end
toc;
This takes 0.0991 seconds
Now obviously this tiny difference doesn't prove much, but you'll have to believe me: over a large file with many of these optimizations, the difference becomes a lot more apparent.
Why does this work?
The pre-allocation methods reserve one contiguous chunk of memory for you to work with, which can be pre-fetched just like an array in C++ or Java. If you do not pre-allocate, then every time the array outgrows its current block MATLAB has to allocate a larger one and copy every existing element across, so the total cost of building the array grows roughly quadratically with its length.
Not only is this slow while you write the data (47 seconds!), the constant reallocation also churns through memory. So if you absolutely CAN'T pre-allocate, it can still be useful to copy your matrix into a new pre-allocated one before you start using it heavily.
What if I don't know how much space to allocate?
This is a common question and there are a few different solutions:
- Over-estimate - It is better to grossly over-estimate the size of your matrix and allocate too much space than it is to under-allocate.
- Deal with it and fix it later - I have seen this a lot: the developer puts up with the slow population time and then copies the matrix into a new pre-allocated space. Usually the result is saved as a .mat file or similar so that it can be read back quickly at a later date.
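The over-estimation approach can be sketched as follows; the capacity of 200000 and the loop body are placeholder values, not from the original:

```matlab
capacity = 200000;              % deliberate over-estimate of the final size
my_array = zeros(1, capacity);  % allocate once
count = 0;
while count < 100000            % stand-in for a loop whose length isn't known up front
    count = count + 1;
    my_array(count) = 5 * count;
end
my_array = my_array(1:count);   % trim the unused tail with a single copy
```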
How do I pre-allocate a complicated structure?
Pre-allocating space for simple data-types is easy, as we have
already seen, but what if it is a very complex data type such as a
struct of structs?
I could never work out how to explicitly pre-allocate these (I am hoping someone can suggest a better method), so I came up with this simple hack:
tic;
n = 100000;
for i = 1:n
complicated_structure(i) = read_from_file(i);
end
toc;
This takes 1.5 minutes
tic;
n = 100000;
% Reverse the for-loop: assigning into the last element first
% allocates the whole struct array in one go
for i = n:-1:1
complicated_structure(i) = read_from_file(i);
end
toc;
This takes 6 seconds
This is obviously not perfect pre-allocation, since the default-initialized elements are still overwritten one by one, but the time improvement speaks for itself. I'm hoping someone has a better way to do this, but it is a pretty good hack in the meantime.
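For what it's worth, when the field names are known up front a struct array can be pre-allocated explicitly. A sketch, with made-up field names a and b:

```matlab
n = 100000;
% Assigning to the last element first allocates the whole struct array,
% with earlier elements getting empty fields:
s(n) = struct('a', [], 'b', []);
% Or give every element the same initial value with repmat:
s = repmat(struct('a', 0, 'b', 0), 1, n);
```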
Data Structures
In terms of memory usage, an Array of Structs is far worse than a Struct of Arrays:
% Array of Structs
a(1).a = 1;
a(1).b = 2;
a(2).a = 3;
a(2).b = 4;
Uses 624 Bytes
% Struct of Arrays
a.a(1) = 1;
a.b(1) = 2;
a.a(2) = 3;
a.b(2) = 4;
Uses 384 Bytes
As you can see, even in this simple/small example the Array of Structs uses more memory than the Struct of Arrays. The Struct of Arrays is also in a more useful format if you want to plot the data.
Each struct field carries a large header; an array of structs repeats these headers once per element, while the struct of arrays stores only one header per field and therefore uses less space. The difference becomes far more pronounced with larger arrays.
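A quick way to check this yourself (a sketch; the variable names are made up) is to build both layouts at a larger size and compare the Bytes column reported by whos:

```matlab
n = 1000;
% Array of Structs: one set of field headers per element
aos = repmat(struct('a', 0, 'b', 0), 1, n);
% Struct of Arrays: two headers total
soa.a = zeros(1, n);
soa.b = zeros(1, n);
whos aos soa   % compare the Bytes column
```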
File Reads
The fewer fread calls (or any system calls, for that matter) you have in your code, the better.
tic;
for i = 1:100
fread(fid, 1, '*int32');
end
toc;
The previous code is a lot slower than the following:
tic;
fread(fid, 100, '*int32');
toc;
You might think that's obvious, but the same principle can be applied to more complicated cases:
tic;
for i = 1:100
val1(i) = fread(fid, 1, '*float32');
val2(i) = fread(fid, 1, '*float32');
end
toc;
This problem is no longer simple, because in the file the floats are interleaved like this:
val1 val2 val1 val2 etc.
However, you can use the skip argument of fread to achieve the same optimization as before:
tic;
% Get the current position in the file
initial_position = ftell(fid);
% Read 100 float32 values, and skip 4 bytes after each one
val1 = fread(fid, 100, '*float32', 4);
% Seek back to the start of the data, plus the size of one float32
fseek(fid, initial_position + 4, 'bof');
% Read 100 float32 values, and skip 4 bytes after each one
val2 = fread(fid, 100, '*float32', 4);
toc;
So this file read was accomplished using two freads instead of 200: a massive improvement.
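When the whole interleaved block fits in memory, an alternative (a sketch, not from the original post) is a single fread followed by de-interleaving with indexing:

```matlab
% Read all 200 interleaved float32 values in one call,
% then split them in memory:
raw = fread(fid, 200, '*float32');
val1 = raw(1:2:end);   % odd positions
val2 = raw(2:2:end);   % even positions
```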
Function Calls
I recently worked on some code that used many function calls, all of which were located in separate files. So let's say there were 100 separate files, all calling each other. By "inlining" this code into one function I saw a 20% improvement in execution speed, down from 9 seconds.
Obviously you would not do this at the expense of re-usability, but in my case the functions were automatically generated and not reused at all. We can still learn from this, though, and avoid excessive function calls where they are not really needed.