The Checkerboard Twist: Benchmarking Proved Dumb Me Was Better Than Smart Me

Yesterday I almost finished a 2000 word blog post explaining how clever and efficient one of my recent pieces of code was, compared to what I would've written some years ago. It was a three-liner that filled a canvas with checkerboard pattern, used to denote transparent area in graphics applications. The post started with some naive algorithms and then moved on to the one-liner. There were some tables that broke down the bitwise operations, explanations for the equivalency of some expressions based on Boolean algebra, etc. At last, while including benchmark results that compared the clever methods, I measured the "dumb" code also, and the whole thing collapsed.

Turns out the dumb me from the past wrote code that ran more than ten times faster.

(Turns out that wasn't the real twist.)

What I Wrote Recently

The original write-up was about the code that I wrote as part of a humble digital painting app that I've been working on recently. I had to draw a canvas full of checker pattern, and almost like second nature, I wrote something which would look like this in C:

// Snippet 1 (Algorithm 1)

void image_fill_checkpattern(Image *img)
{
  for(int y = 0; y < img->height; ++y)
    for(int x = 0; x < img->width; ++x)
      nan_img_set_pixel(img, x, y,
        (((x >> 3) & 1) ^ ((y >> 3) & 1))? LIGHT: DARK);
}

It visits every pixel, and sets it with either the light shade or the dark shade based on the calculation ((x >> 3) & 1) ^ ((y >> 3) & 1). (Yes, it can be simplified as ((x >> 3) ^ (y >> 3)) & 1, maybe even better in ways that I have not thought yet, but if you are thinking about the simpler expression (i + j) % 2 found on StackOverflow.com, no, it can't be used here as is, for we need to draw checkers of size 8x8 pixels; the StackOverflow answer switches the color every pixel.)

Checkerboard pattern — Vara indicates transparent background with a checkerboard pattern, drawn using Snippet 1

The original write-up explains the logic behind this, and even shows how it equates to (i + j) % 2 in a sense. I should rework and post it, but let's skip that part now. For the time being, just appreciate how clever Snippet 1 looks.

What I Would've Written Years Ago

In the original write-up, I had included two algorithms as proof for the naivety of younger me. Of those two, this is the one I called "terrible" and it's the only one I've tested against the clever ones:

Algorithm 2 ("The Naive Algorithm")

1. For y from 0 under (height / 8):
2.     For x from 0 under (width / 8):
3.         For j from 0 to 7:
4.             For i from 0 to 7:
5.                 Set the pixel, selecting the color based on x and y

This jumps to every eighth pixel in every eighth row and then draws an 8x8 square starting there. I called it terrible because:

It has four loops in a nested manner, and anybody "knows" nested loops slow down your programs exponentially.
Loops, which disrupt the control flow, are certainly slower than bitwise operations, which are the cheapest in any CPU. Algorithm 2 uses loops to accomplish what the algorithm behind Snippet 1 accomplishes with bitwise operations, so Algorithm 2 must be slower.

What Benchmarking Revealed

Simply that the naive algorithm could run 10x faster than the genius one-liner. Let's get into the details.

Well, when tested inside my actual drawing application, the naive algorithm is still faster, but the difference is negligible regardless of the compiler optimization level (I tried -O0 and -O2). The application was using a dynamically allocated array of floating point RGBA pixels, each component stored as double (an overkill that I should fix). So the compiler has limitations anyway.

However, things become really interesting with the dedicated test program that I started for my original write-up, which uses a simple static structure that could represent a black-and-white PPM image, each pixel stored in a byte.

Here is the image structure:

typedef struct {
	int width;
	int height;
	char map[MAXDIM * MAXDIM];
} Bitmap;

Here are the functions being compared:

// a.k.a. "the naive algorithm"
void drawchecker_nested_hell(Bitmap *bm)
{
// Disabled in favor of benchmarking
//	assert(bm->width / 8 == 0);
//	assert(bm->height / 8 == 0);

	for(int y = 0; y < bm->height / 8; ++y)
		for(int x = 0; x < bm->width / 8; ++x)
			for(int j = 0; j < 8; ++j)
				for(int i = 0; i < 8; ++i)
					bm->map[bm->width * ((y * 8) + j) + ((x * 8) + i)] =
						((x + y) % 2)? DARK: LIGHT;
}

void drawchecker_bitwise_short(Bitmap *bm)
{
	for(int y = 0; y < bm->height; ++y)
		for(int x = 0; x < bm->width; ++x)
			bm->map[bm->width * y + x] =
				(((x >> 3) ^ (y >> 3)) & 1)? DARK: LIGHT;
}

Yes, I named the naive version "nested_hell" because I was 90% sure it would be slower. BTW, if you've noticed the bitwise expression in drawchecker_bitwise_short() is slightly different from the one in Snippet 1 (the one that I started this post with), it's a simplification based on Boolean Algebra, and I've made sure that it's faster than the unsimplified form.

This is how long they took to fill a 256x256 image 1000 times each (no disk I/O, all drawchecker_bitwise_short() iterations happen only after all drawchecker_nested_hell() iterations, the platform is x86_64):

$ gcc -O0 draw.c && ./a.out 
time taken for drawchecker_nested_hell(): 0.296156s
...
time taken for drawchecker_bitwise_short(): 0.318273s
...

$ gcc -O2 draw.c && ./a.out 
time taken for drawchecker_nested_hell(): 0.014055s
...
time taken for drawchecker_bitwise_short(): 0.110391s
...

(I didn't take the average of multiple runs, but I can assure you that the relative performance doesn't vary between runs.)

So the naive algorithm yielded a 10x faster code at -O2, and even at -O0 (no optimization), it was faster.

What's Going On?

I don't want to act like this was a surprise to me. Even with a 10% chance, I was ready to expect something like this given how "experienced" I'm at premature optimization (thankfully, I always measured). But at the same time, I had no immediate ideas as to why the naive version, the one with four loops, ran faster than the plain 2D scan.

First I tried to check the Assembly output produced by GCC. There was this strange number 72340172838076673 in the optimized version (compilers usually cost-reduce operations with the help of magic numbers). However, I soon left it alone because the naive algorithm was faster even without optimization. But instead of going back to the Assembly output of the unoptimized compilation, I took a break and started thinking.

Instead of asking why the naive algorithm ran faster, I asked myself why did I think it would be slower in the first place. Because it had four loops, all nested? So what?

The problem is, my "trained eyes" had developed a habit of assuming the complexity of a piece of code by looking at the nesting of loops. If it had one loop, the complexity was O(n). It if had one nested loop, it became O(n^2). Three loops? O(n^3). So the naive algorithm must be O(n^4) and the two-loops one must be O(n^2), right? Wrong! They are both O(n)! Why? Because the whole image has n pixels and all of these algorithms visit each pixel exactly once! (The outer loops in the naive algorithm skips rows and pixels, except for the eighth ones, remember?)

Alright, now we know why the naive algorithm didn't have to be any slower. But the actual question still remains unanswered: why did it run faster?

The Potential of the Naive Algorithm

One property of the naive algorithm is, it doesn't do decision making on a per-pixel basis. Sure, there is ((x + y) % 2)? DARK: LIGHT inside the innermost loop, but that could be moved out to the x loop since it doesn't depend on those inner loops. So an optimizing compiler can take advantage of this property. To test this theory, I hand-optimized drawchecker_nested_hell() by moving out the color selection part to the x loop (see drawchecker_nested_hell_handoptim_ifout() in draw.c).

This is what I got at -O0:

$ gcc -O0 draw.c && ./a.out 
time taken for drawchecker_nested_hell(): 0.297783s
time taken for drawchecker_nested_hell_handoptim(): 0.266290s
...
time taken for drawchecker_bitwise_short(): 0.314078s
...

Not much. But can we go further? Moving partial calculations out didn't help much. We could still do something like replacing the innermost loop with some sort of assignment that fills eight pixels in a single step (assign a long value, maybe?).

Let's see what the compiler does at -O2. Remember the magic number 72340172838076673? Turns out it is 101010101010101 in hexadecimal. That is, it's a constant chunk of data that has eight bytes with the value 1. In the context of my program, it means eight DARK pixels. Multiplied with zero, the same would become eight LIGHT pixels. Yes, the compiler is also throwing away an entire loop. Instead, it places a simple assignment/multiplication (it cannot do the same to drawchecker_bitwise_short()).

Please remember that it's not just about getting rid of a loop that helped here, but it's the fact that the outer loops are sparse.

There could be more to the story in -O2, but we've already gained some interesting facts.

By the way, none of that explains why the naive code ran slightly faster in the unoptimized form. But this difference is negligible, and more importantly, I believe the reason lies in the machine level rather than in the algorithmic level. It could still be worth having a look at, but let me save it for another day.

The Real Twist

I had the idea of replacing the innermost loop with an assignment, even before I understood the meaning of the magic number in GCC's -O2 Assembly output. But I didn't bother to write it myself first thinking what I had in mind was what the compiler was already doing. But I still hadn't analyzed the Assembly output thoroughly, so there was a chance what I would write could differ, so I gave it a shot. It is present in draw.c as drawchecker_nested_hell_handoptim_noloop().

$ gcc -O0 draw.c && ./a.out 
time taken for drawchecker_nested_hell(): 0.302826s
time taken for drawchecker_nested_hell_handoptim_ifout(): 0.278121s
time taken for drawchecker_nested_hell_handoptim_noloop(): 0.039804s
...
time taken for drawchecker_bitwise_short(): 0.322290s
...

See who's 10x now? Jokes aside, GCC made drawchecker_nested_hell_handoptim_noloop() even faster at -O2 (0.007s), while GCC-optimized drawchecker_nested_hell() was taking 0.014s (as we saw already). However, at -O3, GCC-optimized drawchecker_nested_hell() caught up.

I'm yet to dive into the Assembly output to see how GCC's optimization and my optimization differ. Maybe there isn't much of a difference (makes sense when you realize that the above comparisons are unfair and I should be comparing the -O0 performance of drawchecker_nested_hell_handoptim_noloop() with that of the -O1 or -O2 performance of drawchecker_nested_hell()).

Anyway, I don't recommend hand-optimizing unless it's absolutely essential.

Takeaway

The last section was a deviation. Let me course-correct and present what I wanted to say all along:

The clumsy-looking naive algorithm (Algorithm 2) was actually O(n), exactly as efficient as Algorithm 1 in asymptotic terms. I could've easily figured this out, but I didn't think about it until I got puzzled with the benchmarks.
Compilers like GCC are very good at optimizing, and sometimes (most of the time?) hand optimizations could render the code unable to machine-optimize. But the opposite could also happen. I knew this, but needed examples, and now I do.

Notes

I'm sorry if this post sounded like I was comparing the performance of two pieces of code; I was actually comparing the optimizability of them, that too based on a particular compiler.
Even at -O0, GCC uses LEA for addition and SAR for division (allowed in -O0 probably because they don't change the control flow and debugging isn't affected).
draw.c includes some other drawchecker() variants.

Download

Download the tarball containing the test program and archived results: nandakumar.org/docs/2023-06-checkerbench.tar.gz

Discussion

2023-06-26: Rajesh Ramakrishnan, a colleague of mine, wasn't ready to leave the puzzle of the -O0 performance gain of the naive algorithm alone (he does performance engineering for a living, with passion; what else do you expect?). He asked if I had considered the possibility of CPU caching, and here are the facts I shared:

Based on the image representation used, drawchecker_bitwise_short() accesses pixels in the right order only (every row in order, every pixel in it in order); on the other hand, the naive algorithm does considerable random access (it considers only eight pixels from a row before jumping to the next; having done this to eight rows, it jumps back to the first row, to repeat this).
That was about target operands. Cacheability of the input operands appears the same in both cases to me.
I have tried this for 512x512, 1024x1024 and 32x32 images (enough to reveal some cache-related issues, I think); no difference in the ranking.
I used callgrind, only to find that drawchecker_nested_hell() had 1.64M Ir per call, while drawchecker_bitwise_short() had 1.44M only. In other words, the naive routine is faster with -O0 only in terms of time.
If I'm doing and interpreting the cachegrind analysis correctly, drawchecker_bitwise_short() has slightly less cache misses compared to drawchecker_nested_hell()
I tried calling the routines in a different order. No, there is no difference in the ranking.
The length and complexity of instructions in the Assembly output seems to be comparable for both routines at a glance, but in the CPU, under the hood, it could be a totally different story (micro-operations, speculative execution, etc.).