The Checkerboard Twist: Benchmarking Proved Dumb Me Was Better Than Smart Me
2023-06-24
Yesterday I almost finished a 2000-word blog post explaining how clever and efficient one of my recent pieces of code was, compared to what I would've written some years ago. It was a three-liner that filled a canvas with a checkerboard pattern, used to denote transparent areas in graphics applications. The post started with some naive algorithms and then moved on to the clever one. There were tables that broke down the bitwise operations, explanations of the equivalence of some expressions based on Boolean algebra, etc. At last, while including benchmark results that compared the clever methods, I measured the "dumb" code too, and the whole thing collapsed.
Turns out the dumb me from the past wrote code that ran more than ten times faster.
(Turns out that wasn't the real twist.)
What I Wrote Recently
The original write-up was about the code that I wrote as part of a humble digital painting app that I've been working on recently. I had to draw a canvas full of checker pattern, and almost like second nature, I wrote something which would look like this in C:
// Snippet 1 (Algorithm 1)
void image_fill_checkpattern(Image *img)
{
    for(int y = 0; y < img->height; ++y)
        for(int x = 0; x < img->width; ++x)
            nan_img_set_pixel(img, x, y,
                (((x >> 3) & 1) ^ ((y >> 3) & 1))? LIGHT: DARK);
}
It visits every pixel and sets it to either the light shade or the dark shade based on the calculation ((x >> 3) & 1) ^ ((y >> 3) & 1).
(Yes, it can be simplified to ((x >> 3) ^ (y >> 3)) & 1, maybe even better in ways that I have not thought of yet; but if you are thinking about the simpler expression (i + j) % 2 found on StackOverflow.com, no, it can't be used here as is, for we need to draw checkers of size 8x8 pixels; the StackOverflow answer switches the color every pixel.)
The original write-up explains the logic behind this, and even shows how it equates to (i + j) % 2 in a sense. I should rework and post it, but let's skip that part for now. For the time being, just appreciate how clever Snippet 1 looks.
What I Would've Written Years Ago
In the original write-up, I had included two algorithms as proof of the naivety of my younger self. Of those two, this is the one I called "terrible", and it's the only one I've tested against the clever ones:
Algorithm 2 ("The Naive Algorithm")
1. For y from 0 under (height / 8):
2.     For x from 0 under (width / 8):
3.         For j from 0 to 7:
4.             For i from 0 to 7:
5.                 Set the pixel, selecting the color based on x and y
This jumps to every eighth pixel in every eighth row and then draws an 8x8 square starting there. I called it terrible because:
- It has four loops in a nested manner, and anybody "knows" that nested loops slow down your programs exponentially.
- Loops, which disrupt the control flow, are surely slower than bitwise operations, which are among the cheapest operations on any CPU. Algorithm 2 uses loops to accomplish what the algorithm behind Snippet 1 accomplishes with bitwise operations, so Algorithm 2 must be slower.
What Benchmarking Revealed
Simply that the naive algorithm could run 10x faster than the genius one-liner. Let's get into the details.
Well, when tested inside my actual drawing application, the naive algorithm is still faster, but the difference is negligible regardless of the compiler optimization level (I tried -O0 and -O2). The application was using a dynamically allocated array of floating-point RGBA pixels, each component stored as a double (an overkill that I should fix). So the compiler has limitations anyway.
However, things become really interesting with the dedicated test program that I started for my original write-up, which uses a simple static structure that could represent a black-and-white PPM image, each pixel stored in a byte.
Here is the image structure:
typedef struct {
    int width;
    int height;
    char map[MAXDIM * MAXDIM];
} Bitmap;
Here are the functions being compared:
// a.k.a. "the naive algorithm"
void drawchecker_nested_hell(Bitmap *bm)
{
    // Disabled in favor of benchmarking
    // assert(bm->width % 8 == 0);
    // assert(bm->height % 8 == 0);
    for(int y = 0; y < bm->height / 8; ++y)
        for(int x = 0; x < bm->width / 8; ++x)
            for(int j = 0; j < 8; ++j)
                for(int i = 0; i < 8; ++i)
                    bm->map[bm->width * ((y * 8) + j) + ((x * 8) + i)] =
                        ((x + y) % 2)? DARK: LIGHT;
}

void drawchecker_bitwise_short(Bitmap *bm)
{
    for(int y = 0; y < bm->height; ++y)
        for(int x = 0; x < bm->width; ++x)
            bm->map[bm->width * y + x] =
                (((x >> 3) ^ (y >> 3)) & 1)? DARK: LIGHT;
}
Yes, I named the naive version "nested_hell" because I was 90% sure it would be slower. BTW, if you've noticed that the bitwise expression in drawchecker_bitwise_short() is slightly different from the one in Snippet 1 (the one that I started this post with), it's a simplification based on Boolean algebra, and I've made sure that it's faster than the unsimplified form.
This is how long they took to fill a 256x256 image 1000 times each (no disk I/O; all drawchecker_bitwise_short() iterations happen only after all drawchecker_nested_hell() iterations; the platform is x86_64):
$ gcc -O0 draw.c && ./a.out
time taken for drawchecker_nested_hell(): 0.296156s
...
time taken for drawchecker_bitwise_short(): 0.318273s
...
$ gcc -O2 draw.c && ./a.out
time taken for drawchecker_nested_hell(): 0.014055s
...
time taken for drawchecker_bitwise_short(): 0.110391s
...
(I didn't take the average of multiple runs, but I can assure you that the relative performance doesn't vary between runs.)
So the naive algorithm yielded 10x faster code at -O2, and even at -O0 (no optimization), it was faster.
What's Going On?
I don't want to act like this was a surprise to me. Even with a 10% chance, I was ready to expect something like this, given how "experienced" I am at premature optimization (thankfully, I always measured). But at the same time, I had no immediate idea why the naive version, the one with four loops, ran faster than the plain 2D scan.
First I tried to check the Assembly output produced by GCC. There was this strange number 72340172838076673 in the optimized version (compilers usually cost-reduce operations with the help of magic numbers). However, I soon left it alone, because the naive algorithm was faster even without optimization. But instead of going back to the Assembly output of the unoptimized compilation, I took a break and started thinking.
Instead of asking why the naive algorithm ran faster, I asked myself why I thought it would be slower in the first place. Because it had four loops, all nested? So what?
The problem is, my "trained eyes" had developed a habit of assuming the complexity of a piece of code by looking at the nesting of its loops. If it had one loop, the complexity was O(n). If it had one nested loop, it became O(n^2). Three loops? O(n^3). So the naive algorithm must be O(n^4) and the two-loop one must be O(n^2), right? Wrong! They are both O(n)!
Why? Because the whole image has n pixels, and all of these algorithms visit each pixel exactly once! (The outer loops in the naive algorithm visit only every eighth row and every eighth pixel, remember?)
Alright, now we know why the naive algorithm didn't have to be any slower. But the actual question still remains unanswered: why did it run faster?
The Potential of the Naive Algorithm
One property of the naive algorithm is that it doesn't do decision making on a per-pixel basis. Sure, there is ((x + y) % 2)? DARK: LIGHT inside the innermost loop, but that could be moved out to the x loop, since it doesn't depend on those inner loops. So an optimizing compiler can take advantage of this property. To test this theory, I hand-optimized drawchecker_nested_hell() by moving the color selection part out to the x loop (see drawchecker_nested_hell_handoptim_ifout() in draw.c). This is what I got at -O0:
$ gcc -O0 draw.c && ./a.out
time taken for drawchecker_nested_hell(): 0.297783s
time taken for drawchecker_nested_hell_handoptim_ifout(): 0.266290s
...
time taken for drawchecker_bitwise_short(): 0.314078s
...
Not much. But can we go further? Moving partial calculations out didn't help much. We could still do something like replacing the innermost loop with some sort of assignment that fills eight pixels in a single step (assign a long value, maybe?). Let's see what the compiler does at -O2.
Remember the magic number 72340172838076673? Turns out it is 0x0101010101010101 in hexadecimal. That is, it's a constant chunk of data made of eight bytes, each holding the value 1. In the context of my program, it means eight DARK pixels. Multiplied by zero, the same would become eight LIGHT pixels. Yes, the compiler is also throwing away an entire loop; instead, it places a simple assignment/multiplication (it cannot do the same to drawchecker_bitwise_short()).
Please remember that it's not just getting rid of a loop that helped here; it's the fact that the outer loops are sparse.
There could be more to the story at -O2, but we've already gained some interesting facts.
By the way, none of that explains why the naive code ran slightly faster in the unoptimized form. But that difference is negligible, and more importantly, I believe the reason lies at the machine level rather than at the algorithmic level. It could still be worth having a look at, but let me save it for another day.
The Real Twist
I had the idea of replacing the innermost loop with an assignment even before I understood the meaning of the magic number in GCC's -O2 Assembly output. At first I didn't bother writing it myself, thinking that what I had in mind was what the compiler was already doing. But since I still hadn't analyzed the Assembly output thoroughly, there was a chance that what I would write could differ, so I gave it a shot. It is present in draw.c as drawchecker_nested_hell_handoptim_noloop().
$ gcc -O0 draw.c && ./a.out
time taken for drawchecker_nested_hell(): 0.302826s
time taken for drawchecker_nested_hell_handoptim_ifout(): 0.278121s
time taken for drawchecker_nested_hell_handoptim_noloop(): 0.039804s
...
time taken for drawchecker_bitwise_short(): 0.322290s
...
See who's 10x now? Jokes aside, GCC made drawchecker_nested_hell_handoptim_noloop() even faster at -O2 (0.007s), while GCC-optimized drawchecker_nested_hell() was taking 0.014s (as we saw already). However, at -O3, GCC-optimized drawchecker_nested_hell() caught up.
I'm yet to dive into the Assembly output to see how GCC's optimization and my optimization differ. Maybe there isn't much of a difference (which makes sense when you realize that the above comparisons are unfair: I should be comparing the -O0 performance of drawchecker_nested_hell_handoptim_noloop() with the -O1 or -O2 performance of drawchecker_nested_hell()).
Anyway, I don't recommend hand-optimizing unless it's absolutely essential.
Takeaway
The last section was a deviation. Let me course-correct and present what I wanted to say all along:
- The clumsy-looking naive algorithm (Algorithm 2) was actually O(n), exactly as efficient as Algorithm 1 in asymptotic terms. I could've easily figured this out, but I didn't think about it until the benchmarks puzzled me.
- Compilers like GCC are very good at optimizing, and sometimes (most of the time?) hand optimizations can render the code harder for the compiler to optimize. But the opposite can also happen. I knew this, but needed examples, and now I have them.
Notes
- I'm sorry if this post sounded like I was comparing the performance of two pieces of code; I was actually comparing their optimizability, and that too with a particular compiler.
- Even at -O0, GCC uses LEA for addition and SAR for division (allowed at -O0 probably because they don't change the control flow and debugging isn't affected).
- draw.c includes some other drawchecker() variants.
Download
Download the tarball containing the test program and archived results: nandakumar.org/docs/2023-06-checkerbench.tar.gz
Discussion
2023-06-26: Rajesh Ramakrishnan, a colleague of mine, wasn't ready to leave the puzzle of the -O0 performance gain of the naive algorithm alone (he does performance engineering for a living, with passion; what else do you expect?). He asked if I had considered the possibility of CPU caching, and here are the facts I shared:
- Based on the image representation used, drawchecker_bitwise_short() accesses pixels strictly in order (every row in order, every pixel within it in order); on the other hand, the naive algorithm does considerable random access (it touches only eight pixels from a row before jumping to the next; having done this to eight rows, it jumps back to the first row to repeat).
- That was about the target operands. The cacheability of the input operands appears the same in both cases to me.
- I have tried this for 512x512, 1024x1024 and 32x32 images (enough to reveal some cache-related issues, I think); no difference in the ranking.
- I used callgrind, only to find that drawchecker_nested_hell() had 1.64M Ir per call, while drawchecker_bitwise_short() had only 1.44M. In other words, the naive routine is faster at -O0 only in terms of time.
- If I'm doing and interpreting the cachegrind analysis correctly, drawchecker_bitwise_short() has slightly fewer cache misses than drawchecker_nested_hell().
- I tried calling the routines in a different order. No, there is no difference in the ranking.
- The length and complexity of the instructions in the Assembly output seem comparable for both routines at a glance, but in the CPU, under the hood, it could be a totally different story (micro-operations, speculative execution, etc.).
Nandakumar Edamana
Tags: programming, experiment, vara
Read more from Nandakumar at nandakumar.org/blog/