Paid episode

The full episode is only available to paid subscribers of Computer, Enhance!

Q&A #80 (2025-10-31)

Answers to questions from the last Q&A thread.

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.

The questions addressed in this video are:

  • [00:05] “Question regarding the ‘fat struct’ approach: do you ever find yourself thinking about excess memory consumption caused by some entities having unused fields?”

  • [04:29] “About fat structs: you said we would initialize the struct so that it can be A or B depending on the situation. If you imagine that fat struct to be of type Result, with cases Success or Error, how would you initialize the Error in the case where you have a Success (since there was no Error), and vice versa? This type exists in F#, for example, and there it makes sense to initialize only Success or only Error.”

  • [07:40] “Do you have any suggestions for an off-the-shelf tool to measure bandwidth and FLOPS for the machine you run it on?”

  • [13:10] “I sense a recurring theme of reliability and predictability – preferring simple control flow to early returns, preferring simple compilers to strict aliasing, preferring large blocks of memory to pointer festivals, etc. You’ve also spoken about improving _robustness_ by preferring zero/dummy values that just flow through code to null pointers, and preferring handles to pointers (from the iCloud example we just saw).

    What do you mean with “robustness”, and what other techniques can I use to make my code more robust in this way?”

  • [18:00] “Do you have any thoughts on Apple’s approach to SIMD?”

  • [20:10] “Hi Casey, on the recent podcast with Marco you said that if you could choose a single piece of software to be magically redesigned it would definitely be the browser because the software platform it defines is bad. Could you please elaborate on that? What are the main problems with today’s browsers from your perspective?”

  • [22:30] “I’m a bit confused when analyzing the bandwidth I get when reading directly from the volume (using ReadFile with the path ‘\\.\C:’ with an offset, for example). As far as I know, that’s the correct way to read system files like the $MFT (which I can currently read properly, by the way).

    When reading 20 GB of contiguous data from the beginning of the volume, I get 2.6 GB/s, and it doesn’t matter whether I use the FILE_FLAG_NO_BUFFERING flag or not—the result is the same. I’d expect something closer to a non-cached read (4.9 GB/s), but I’m getting the same throughput as a cold cached read. I’m not sure where this penalty comes from (assuming the read isn’t triggering extra cache operations since it’s non-buffered).

    Any idea what might be going on here? Do you think these read bandwidths make sense?”

  • [25:46] “Will we be able to reuse the coefficients we’re currently using for f64 sine for f32 sine or will we need new ones?”

  • [28:24] “I have a question about PC hardware components. I’m not clear on things like the motherboard and chipset. Do they play any role in performance? They vary a lot in price even with similar features, so I assume some aspects must affect overall system performance. Could you give a brief overview of these system parts, if possible?

    Or, to rephrase my question: When you’re building a PC, what do you specifically look at besides just the CPU, RAM, disk and GPU in terms of performance? How do you decide what’s suitable for your specific builds?”

  • [36:09] “In my code, I ended up having a single loop over the input that directly produced the haversine sum, rather than splitting parsing and math into two loops. But that means if I want to time parsing vs. math, I have to put timing blocks inside the loop, which (seemingly) inevitably introduces a lot of overhead.

    Is there a good way to handle this? The best way I can think of is to instead temporarily comment out parts and just time the rest, though while that’s easy to do with the math part, it seems harder to do for the parsing part, since you still have to somehow produce dummy data for the math while making sure this doesn’t lead to any compiler optimizations you wouldn’t otherwise get.”

  • [39:28] “Sorry for repeating the question from the last Q&A, but here it is: I have just gotten to it, did it, and looked up your solution in Q&A #47 to cross-reference. There is one thing that we got differently, and I don’t quite understand your reasoning about it:

    You said that shl rbx, 0 should be recognized by the frontend as a nop and not do anything with flags, but would produce rbx.

    1) If the frontend sees it as a nop, why would it RAW the value of rbx, and not just be a pure nop?

    2) I actually thought that it would not be recognized as a nop (I didn’t find anything about this kind of optimization; I presumed it would be somewhere near the zero-idiom stuff in the manual), and then it seems like shl will have not only a RAW on rbx, but also on all the flags, as it has to be ready for the case where the ALU says the shift was 0 and the previous value of the flags should be preserved (i.e., a RAW).

    So the question is, why is it that rbx is a RAW and flags are skipped, and do you know if there is any place in the docs where such a frontend optimization might be mentioned?”
