
Paid episode

The full episode is only available to paid subscribers of Computer, Enhance!

Q&A #47 (2024-03-18)

Answers to questions from the last Q&A thread.

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.

Questions addressed in this video:

  • [00:03] “In Q&A #45 at 22:59 you explain that cache lines do not work as a sliding window over contiguous memory, but can be taken really far from each other in memory. This makes me realize I don't actually know what a cache miss is, because my understanding was that cache misses happen when the cache has to jump all over main memory. This is why I thought people advocated laying things out in planar data structures rather than interleaved ones, and why they would say that linked lists are terrible, because they'd cause you to iterate over things that are non-contiguous in memory. By extension, I've been led to believe that the same is true for "heap" allocations in general. But based on what you said, it seems to me that it would only matter if the things in your linked list were literally so small that you could have had multiple ones in the same cache line? Thank you for your time.”

  • [23:35] “I ran listing 142 and 145 on my Alder Lake CPU (12700K) and got some strange results. When running listing 142, both RATAdd and RATMovAdd run at roughly the same speed … My CPU runs at ~4.4GHz so it appears both RATAdd and RATMovAdd do two adds per cycle. Is this just an indication that my CPU sees that the serial adds can be parallelized?”

  • [27:21] “When running listing 145 I see a 2x improvement between Read_x1 and Read_x2 but I see no improvement with Read_x3. So that indicates that my CPU is doing 2 loads/cycle. In the documentation (Intel, Chips and Cheese, and Agner) I read that my P-cores can do 3 loads/cycle. But even if I run the listing pinned to a P-core I still see only 2 loads/cycle.”

  • [29:02] “Considering we already have benchmarks for bandwidth, and the "bus width" is just the size of the cache line (?), does this mean we can also try to graph the buffer count *per memory hierarchy level* by measuring latency and then dividing the (bandwidth * latency) product by the bus width??”

  • [30:54] “I have a Skylake processor just like you and my code for the homework is almost exactly the same as yours (only the name of the registers is different, since I use the Linux System-V ABI). In the cache size testing using power of two, I find that the speed of my L1 cache is 250GB/s (just like in your video). However, when I run the same test size with the two-loops variant, I measure 200GB/s. If I move the "align 64" from the start of the outer loop to the start of the inner loop, I measure 230GB/s, so it's way better, but still not as fast as the 250GB/s I get with the one-loop solution. I'm curious if you have any idea what might cause this slowdown when I use the two-loop variant?”

  • [34:35] “In the "Cache Size and Bandwidth Testing" video, you were loading a 1GB contiguous memory block into registers. Does the prefetcher have any appreciable effect on the bandwidth? The address should be perfectly predictable.”

  • [36:50] “In many CS courses, they tell us that static (stack) allocation is for data whose size we know at compile time, and dynamic (heap) allocation is for data whose size we only know at run time. I recently learned about the `alloca` function, which allocates memory on the stack but at run time. I'm curious what the use case of this function would be. Is there any performance advantage in using it?”

  • [41:12] “Replicating the case in the RAT video, I was curious whether, if I unrolled the loop with independent cycles to do the work twice but with half the iterations, I would get double the throughput. But it only increased slightly; any idea what's going on?”

  • [45:12] “Regarding the challenge homework of the RAT video, is my understanding correct here?

    pop rcx      ; Updates the stack pointer (rsp) and rcx
    sub rsp, rdx ; Updates rsp and the flags
    mov rbx, rax ; Updates rbx
    shl rbx, 0   ; Updates rbx and the flags
    not rbx      ; Updates rbx (it doesn’t update the flags)
    loopne top   ; Depends only on rcx, not on the flags”

