Q&A #70 (2025-01-20)

The full episode is only available to paid subscribers of Computer, Enhance!

Answers to questions from the last Q&A thread.

Jan 21, 2025 ∙ Paid

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.

Questions addressed in this video:

[00:02] “Question about latency / throughput. Let's look at `SQRTPD XMM, XMM` on Skylake: it has latency 19, and throughput 4.5 (taken from uops.info). This means that we can dispatch the instruction every 4.5 cycles. But Skylake has only one port for `SQRTPD`, so the execution unit should stay busy for 19 cycles. Does this mean that the execution itself is also pipelined?”
[12:32] “was catching up to the course and have a question regarding hardware/software prefetchers: how "smart" are they when it comes to evicting cache lines?
Say we have an instruction stream with the `prefetcht0` instruction, but the instruction that accesses that memory has more dependencies, is it possible that future prefetches executed out of order would evict the earlier prefetched cache line (due to an aliasing issue, for example)? And similarly, how far ahead in a data stream the hardware prefetcher operates?”
[22:02] “For listing 0127 (testing with Large Pages), did you happen to have to enable the "Lock Pages in Memory" policy to enable the privilege for large pages or was that policy already enabled for your machine. Wasn't sure we knew if that policy is enabled by default or not.”
[23:48] “I often find myself debating the appropriate size of my variables in order to "optimize space/performance", should I use u16, u32, u64 etc. However it seems to me like this is a waste of mental energy for one off variables such as for loop counters that would easily fit in the cache, therefore it seems reasonable to default to u64? I'm guessing variable size would be more relevant in large collections so that they would fit in the cache? That said, I’ve noticed you use b32 for booleans instead of b64. How do you decide on the size of your variables?”
[28:54] “After watching quite some Handmade Hero episodes, I can observe that oftentimes you are able to explain and implement a feature, like a projection matrix or a sorting algorithm, from the ground up.
When I implement something once or twice, I tend to forget how it works and I have to go read about it again and again. Surely, if I made something many many times, I will know it by heart.
Given that you know *a lot* of things on a very high level, do you have any specific approach you use to get to that state, something like The Feynman Technique, or maybe you just have done those things so many times, that you have naturally absorbed them?”
