Q&A #39 (2024-01-09)

Paid episode

The full episode is only available to paid subscribers of Computer, Enhance!

Answers to questions from the last Q&A thread.

Jan 10, 2024 ∙ Paid

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course. Transcripts are not available for Q&A videos due to length. I do produce closed captions for them, but Substack still has not enabled closed captions on videos :(

Questions addressed in this video:

[0:00:15] “Hi Casey, thanks for this course. How would you approach transitioning from BS cloud programming career (7 years of Go) to a career where I can focus on stuff you teach in this course?”
[0:04:48] “How come our programming languages and programming approaches have not 'evolved with the times' to natively support SIMD programming in a nice way?”
[0:06:35] “I can't remember if it was here or in a Handmade Hero video, but you've mentioned that auto-vectorisation isn't very effective. Could you elaborate on what you mean?”
[0:12:11] “Hey Casey, what features are Linux debuggers missing in your opinion? I'm looking into building a new GUI debugger for Linux specifically, that isn't just a gdb frontend.”
[0:21:53] “This question might have been asked before and I might have missed it, but I would like to know how are debuggers supposed to work with out-of-order processors in the sense that the CPU could be executing an instruction far away from the current instruction the debugger is showing.”
[0:28:11] “Do I understand correctly that the slow file read in the haversine calculator is due to the initial malloc like we have in the repetition tester where tests allocate new buffer on each read?”
[0:28:40] “Perf monitor looks helpful, for what kind of scenarios are you using it? Do you know helpful and interesting resources to read that dig deeper into this program (for example, what those stats were all about)? Are there similar, perhaps even better, alternatives that you know of or used?”
[0:35:20] “If you compile the project on a specific x64 chip, will the specific chip type influence the kind of optimizations that the compiler is doing, or do they just output generic optimized code for a ‘typical x64’ chip?”
[0:42:00] “How useful are things like __builtin_expect() and C++20 [[likely]] / [[unlikely]] for optimising conditional statements?”
[0:44:52] “I've come across talks about ARM SVE, where the width of the vector register is not known at compile time, but you can query it from the CPU at runtime. My initial intuition is that while maybe that is useful for doing vectorized loops, for anything more complicated I'd really like to know the width of my SIMD registers ahead of time, so I can pack data accordingly. I'd be curious to hear your thoughts on vector length agnostic SIMD programming.”
[0:54:01] “Given that the frontend only knows whether an instruction is a jump when it is finished decoding but needs the branch predictor to know what to decode next, does the branch predictor act prior to decoding and can bypass it using the uop cache or something similar?”
[0:55:20] “I'm implementing a Bresenham's line algorithm, and I'm wondering if GPU allows you to actually work with divide and floating on each pixel without much cost?”
[0:59:18] “This one is kind of unrelated to the course but what do you think about code review? The idea that your co-workers must review every line of code you wrote? Do you practice it, would you practice it, what do you think about it?”
[1:03:19] “In the profiler, is it worth to try and pack multiple profiling anchors into the same cache line? My implementation could fit at least two (and maybe four) into 64 bytes. However, I have the uneasy feeling that this could actually be harmful, because this creates a dependency between a write to that cache line and a subsequent read while otherwise the anchors would be totally independent (if I space them out with padding), albeit with a larger footprint.”
[1:07:12] “I came across pretty weird behaviour. First time I wrote ConditionalNOP function in assembly from Branch Prediction lesson I used rdi register instead of r10 (the one you used) when moving pattern-bytes from memory. This caused page faults every ~4k bytes in measurement for all branch patterns. If I use r10 register instead of rdi I have no page faults. While page faults start happening with rdi register, runtime is pretty similar in both cases. Do you know what might cause those page faults?”
[1:11:29] “My question is regarding the introduction video about caches. In the sum loop of integer array, when we increase the array size to not fit in L1, we see a performance hit, due to data getting fetched from L2. How is the data available in L2 when we have cache miss in L1?”
[1:12:43] “What is the experience of porting a piece of software designed for a particular cpu, console, or whatever when you originally designed everything around it being performant on a different (and unforeseen) set of machines?”
[1:21:30] “I have made a loop code in which the total bytes fit exactly in two cache lines, I assumed by shifting by one byte the 64 aligned loop code I would have decreased the performance of the program. But it does not. Was my assumption good and my code not right?”
