14 Comments
Aug 4, 2023Liked by Casey Muratori

Really neat little exercise! I briefly got confused in the video though. By choosing 3 or 0b11 as the example color value to test, we get the input again after the XOR and NOT. It took me a sec to realize that that only happens for 0b11.

I'm not working with binary every day and watched the video out-of-order because it is on YouTube, so I didn't immediately catch it. For binary beginners choosing a different test color might have helped, but I guess if you follow the course in order that should be basically no one. Great explanation though!

Expand full comment
author

That's a good point - I hadn't considered that. I just kind of randomly picked a color on the lightboard without thinking about what the values would be.

- Casey

Expand full comment

Since you can preprocess the color input, you can actually avoid the NOT by inverting the color bits first since ~(A XOR B) == A XOR ~B. So, if the color you want to check for is 0b01, you pass 0b10101010 to ContainsColor.

Expand full comment
author

Yes! I have a listing for this in the Addendum, and estimate it saves 3 cycles on 8086.

- Casey

Expand full comment

Oh, I see! I only watched the video and skimmed the email so I wasn't aware of the addendum but now I know :)

Expand full comment

Always wondered what this technique is called, to do 4 pixels at once, as I had previously coding demoscene effects on 386. There in VGA 8bits per pixel, I realized I could use 32bit regs to read 4 pixels at once, add them together, shift them right but and them with a correction constant in some cases, I would write a typical blur/fire effect, by reading neighbour pixels, but doing it in almost 4 times faster than if byte per byte (with the small overhead from doing the extra corrections). I even limited my gradient of the blur/fire palette to be at indices 0-63, so that after four adds to not overflow on the left (but still do the AND when shift right by 2 to zero the overflow bits on the right on the packed bytes). At the time I couldn't find any docs/tutorials in the demoscene mentioning this trick. But pretty sure people in demoscene or the industry might have used things like this. I would call it poor man's SIMD as a joke :)

Come forward, in one of your videos you proposed the book Hacker's Delight. I started reading it, pretty interesting binary algebra stuff I didn't know, but found in two pages a mention on this kind of techniques. They had mention of ADD/SUB of 4 packed byte values in a 32bit reg, and even with correction to use it for full 0-255. Also an ABS (absolute) of 4 packed bytes that helped me expand the fire effect to a water ripple (how it's done in demoscene, I needed to clamp or abs byte values under). Gave also a boost on the water effect on my 386 at 3x faster.

Fun stuff of how people used to optimize things at the time several times faster, when your previous thought was "There is no way this can even go 10% faster". I love that stuff!!! Also curious to see your lectures on SIMD as I am aware of it but never worked it with on modern PCs (I am still stuck on writing regular C/C++ code the way I used to know :)

Expand full comment
Aug 4, 2023·edited Aug 4, 2023

I got a bit confused after the NOT so wanted to provide an alternative explanation for anyone that might be confused :)

What Casey is effectively doing is ANDing every 2 bit with each other, so if the result (after XOR and NOT) is:

[ab cd ef gh]

we want to AND a with b, c with d etc.

We do that by masking (AND with 10 10 10 10) and shifting to right, so at the end we end up ANDing

[ab cd ef gh] with

[0a 0c 0e 0g]

As a result we get every second bit as 1 if only both two bits are 1 in that group. (as in a=1, b=1)

Expand full comment
Apr 17·edited Apr 17

I paused the video and had my own attempt. To be fair it wasn't the first time I've played around with bitwise logic, so I had an advantage over teenage Casey. Still, I'm surprised how similar my answer was.

```

char unsigned c = 0b11;

char unsigned px = 0b01110110;

c |= c << 2;

c = ~(c | c << 4);

char unsigned n = px ^ c;

int answer = n & (n >> 1) & 0b01010101;

```

P.S. These comments are really terrible for writing code in... :/

Expand full comment

A key aspect of why we want to do it is that you can't AND bits to know if a two bits pattern like 10 == 10 directly. (0 AND 0 = 0).

I think bit puzzles key understanding is that any bit tricks will need to retain informations somewhat and never loose it. Like the XOR value will always retain the bits you TEST somewhat.

Expand full comment

I think I'm lucky that you're exposing me with these "bit puzzle" as early in programming as I am. I thank you for that. Furthermore, I love these interview questions. I don't know if it's just my brain, but, the transcript is way easier to follow than the video.

One of the reason (apart from the Substack's terrible video player) is because my brain seems to be distracted very easily by visuals than reading a text or just listening to a voice, inside transcripts it seems that you add extra information, delete some mistakes you've made and rectify some sentences for better understanding.

Conclusion: Keep doing these transcripts :)

Expand full comment
Aug 6, 2023·edited Aug 7, 2023

How would one replicate 2-bit color pattern to produce 01010101 out of 01 color now and back in the days?

- You can do ORs and shifts `u8 c = Color & 3; c |= c << 2; ... `;

(8086 clocks: 4(AND) + 4 * 3 (SHL) + 4 * 3 (OR) ≈ 28)

- You can do `0x5555 * (Color & 3)`;

(8086 clocks: 4(AND) + 77(MUL) ≈ 81

- You can do a table lookup `patterns[Color & 3]`;

(8086 clocks: 4(AND) + 8(MOV) + 5(EA) ≈ 17)

I guess table lookup wast the best for 8086, and you can even get inverted pattern saving NOT.

Does the reasoning change now, on modern architectures, where compute is in general cheaper than memory access?

Expand full comment

What about `c |= c << 2; c |= c << 4;`? Just two shifts and two ORs. (plus the AND, but I feel like that's a waste of an instruction if the assumption c < 4 is part of the problem statement and therefore a given)

The truth is it really doesn't matter much how efficient this part of the code is since we only run it once per flood fill. In comparison, the flood fill check will run thousands or even millions of times.

Expand full comment

That solution is literally the same as mine=)

```

bool flood_fill_detect(u8 pix4, u8 color) {

static const u8 t[4] = {

0b00000000,

0b01010101,

0b10101010,

0b11111111

};

static const u8 h = 0b10101010;

u8 w = ~(pix4 ^ t[color]);

u8 s = (w & h) >> (u8)1;

return w & s;

}

```

But I noticed in godbolt that generated code contains a constant 85=0x55=0x01010101 which I didn't use, I used 0xAA. It seems like compiler did some optimizations, becase there is only one AND!

```

flood_fill_detect(unsigned char, unsigned char):

movzx esi, sil

xor dil, BYTE PTR flood_fill_detect(unsigned char, unsigned char)::t[rsi]

not edi

mov eax, edi

shr al

and eax, edi

test al, 85

setne al

ret

```

Expand full comment
Aug 5, 2023·edited Aug 5, 2023

never mind, TEST actually does AND

Expand full comment