In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.
Questions addressed in this video:
[00:02] “Lots of the techniques here seem to rely on knowing details about the specific CPU a user has. How do you structure a codebase that needs to ship on a lot of different platforms/CPUs? What if you don't know exactly where your software will be used? Can you detect whether, e.g., SIMD is available and fall back if it isn't? How do we stop that from becoming a huge mess in the code?”
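A minimal sketch of the runtime-dispatch pattern this question is asking about, assuming a GCC/Clang build on x64; the function and type names below are hypothetical, not from the course:

    #include <stddef.h>
    #include <stdint.h>

    // Plain scalar fallback that runs on any x64 CPU.
    static uint64_t SumScalar(uint8_t const *Data, size_t Count)
    {
        uint64_t Sum = 0;
        for(size_t I = 0; I < Count; ++I) Sum += Data[I];
        return Sum;
    }

    #if defined(__x86_64__)
    // Compiled with AVX2 enabled for this one function only; a real version
    // would use 256-bit intrinsics in the body.
    __attribute__((target("avx2")))
    static uint64_t SumAVX2(uint8_t const *Data, size_t Count)
    {
        uint64_t Sum = 0;
        for(size_t I = 0; I < Count; ++I) Sum += Data[I];
        return Sum;
    }
    #endif

    typedef uint64_t sum_function(uint8_t const *Data, size_t Count);

    // Queried once at startup, so the per-call cost is a single indirect call.
    static sum_function *SelectSum(void)
    {
    #if defined(__x86_64__)
        if(__builtin_cpu_supports("avx2")) return SumAVX2;
    #endif
        return SumScalar;
    }

One common way to keep this from becoming a mess is to confine the feature checks to a single dispatch point like SelectSum, cache the returned function pointer once at startup, and keep every implementation behind the same signature.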
[14:45] “Do you have any opinion on SIMD wrappers like xsimd? Did you ever use them? On one hand, they seem totally redundant for many use cases, since after doing these things once you will re-use your own code; on the other hand, they are available, and if you need ARM or other architectures later on, they already support them.”
[17:42] “You mentioned in a podcast a while ago that all your compile times are fast (a few seconds). Besides compiling with all cores, how do you achieve fast compile times, and what are things to look out for and avoid that would slow down or bloat compile times?”
[24:44] “Will you provide some homework at the end of the course to solidify all the knowledge you gave? I personally feel like I've learned a lot of fundamentals that I didn't know before, but I don't feel confident enough, or don't know how, to apply everything learned in the course yet. Creating practice work by myself/ourselves might be troublesome in terms of error-proneness, so it would probably be better if you (or someone in the community who is as knowledgeable) could create homework or tests. Thanks!”
[27:32] “My unaligned bandwidth test was just the aligned case, i.e., the Read8x32Bytes routine, with an offset start pointer.
An interesting result was that the aligned case had the lowest bandwidth from main memory by a significant amount. Do you know what could be the reason?”
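For context, a hedged C approximation of the kind of test being described. The course's actual Read8x32Bytes is an assembly routine; this AVX2-intrinsics sketch (compile with -mavx2 or equivalent) only shows the shape of the loop, and calling it with Data versus Data + Offset is the aligned/unaligned comparison in question:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // Reads 8 x 32 bytes per outer iteration.
    static uint64_t Read8x32(uint8_t const *Data, size_t TotalBytes)
    {
        __m256i Sink = _mm256_setzero_si256();
        for(size_t At = 0; At + 256 <= TotalBytes; At += 256)
        {
            for(int I = 0; I < 8; ++I)
            {
                // Unaligned 32-byte load; with a misaligned Data pointer some of
                // these straddle a 64-byte cache-line boundary.
                Sink = _mm256_or_si256(Sink,
                    _mm256_loadu_si256((__m256i const *)(Data + At + 32*I)));
            }
        }

        // Fold the accumulator so the loads can't be optimized away; a real
        // harness would wrap this loop in a repetition tester and report bytes/s.
        uint64_t Fold[4];
        _mm256_storeu_si256((__m256i *)Fold, Sink);
        return Fold[0] ^ Fold[1] ^ Fold[2] ^ Fold[3];
    }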
[29:14] “I'm getting different L1 bandwidths for different numbers of consecutive bytes read. When I read 8 * 256 bytes, the speed is ~610 GB/s; for 64 * 256 bytes it is ~575 GB/s. The L1 size is 48K, so both tests fit into L1. I wonder what can cause such a difference? Also, are there reliable benchmarks I can compare my results with to be sure I'm not completely wrong?”
[31:40] “I have a few questions from the lesson ‘The RAT and the Register File’. Couldn't the CPU just determine that it can skip _the execution_ of all but the last two instructions, since the intermediate rcx values are all overwritten before they are read anyway? And maybe this could be done via macro-op fusion before they get into the uop queue and hit the backend?”
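For readers who have not seen that lesson, a hedged illustration of the instruction pattern being asked about: a run of back-to-back writes to rcx where only the final value is ever read (GNU extended inline asm; this is not the exact listing from the lesson):

    #include <stdint.h>

    static uint64_t DeadRCXWrites(void)
    {
        uint64_t Result;
        __asm__ volatile(
            "mov $1, %%rcx\n\t"   // dead: overwritten before any read
            "mov $2, %%rcx\n\t"   // dead
            "mov $3, %%rcx\n\t"   // dead
            "mov $4, %%rcx\n\t"   // live: this is the value actually read below
            "mov %%rcx, %0\n\t"
            : "=r"(Result)
            :
            : "rcx");
        return Result;
    }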
[38:02] “What are your thoughts on the recent Intel APX (Advanced Performance Extensions) and X86S (from "Envisioning a Simplified Intel Architecture") efforts? What did Intel get wrong or right, and what should be addressed next in improving x64?”
[42:31] “Chips using the ARM ISA(s) have consistently been very competitive in terms of performance per watt, going all the way back to the first ARM chip in the '80s (where, as I understand it, lower power draw allowed them to package the die in a cheaper material due to thermal considerations) and continuing down to Apple's M series, which seem to consistently beat Intel and AMD in terms of battery life when performance is roughly even.
Do you think this has much to do with the design of the ISA itself, or is it more to do with Apple, ARM, Qualcomm et al. heavily targeting their engineering at this use case? Is ARM's pervasive use in low(er)-power applications mostly due to historical factors?”
[51:54] “What are some conventions you keep in mind at the beginning of a project to make multithreaded code easier to implement later?”
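A hedged sketch of one such convention (not necessarily the advice given in the video): express the work as a function over an explicit [Begin, End) range with no hidden shared state, so the single-threaded call today and a per-thread call later look identical. The type and function names are made up for illustration:

    #include <stddef.h>

    typedef struct
    {
        float const *Input;
        float *Output;
    } work_context;

    // Touches only its own slice of Output, so threads given disjoint ranges
    // never need a lock.
    static void ProcessRange(work_context *Context, size_t Begin, size_t End)
    {
        for(size_t I = Begin; I < End; ++I)
        {
            Context->Output[I] = 2.0f*Context->Input[I];
        }
    }

    // Single-threaded today:   ProcessRange(&Context, 0, Count);
    // Multi-threaded later:    give each thread a non-overlapping [Begin, End).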
[57:30] “I still don't quite understand the cache penalty when doing unaligned memory accesses.
I know that when the pointer is misaligned so that pulling our data touches two cache lines instead of one, this is more expensive for sure. But on the next loop iteration we are accessing the very next cache line, which was just pulled from memory on the previous iteration, so it's already in the cache and we are not paying any additional cache-miss cost. Why is the performance worse, then? Is it because we are paying the cost of just accessing the cache in the first place? Whether it is a hit or a miss, we do twice the number of cache accesses?”
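A hedged worked example to go with this answer: for 32-byte loads issued from a given starting offset, count how many of them straddle a 64-byte cache-line boundary. The point is that a split load costs two line accesses in the load hardware even when both lines hit in L1, so the penalty shows up independent of any cache misses:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t const LineSize = 64;
        uint64_t const LoadSize = 32;
        uint64_t const LoadCount = 16;

        for(uint64_t Offset = 0; Offset <= 8; ++Offset)
        {
            uint64_t Splits = 0;
            for(uint64_t I = 0; I < LoadCount; ++I)
            {
                uint64_t First = Offset + I*LoadSize;
                uint64_t Last = First + LoadSize - 1;
                if((First/LineSize) != (Last/LineSize))
                {
                    ++Splits; // this load touches two cache lines
                }
            }

            printf("offset %2llu: %llu of %llu loads split a cache line\n",
                   (unsigned long long)Offset, (unsigned long long)Splits,
                   (unsigned long long)LoadCount);
        }

        return 0;
    }

With offset 0 none of the loads split a line; with any of the nonzero offsets above, half of them do, and each split load occupies the load unit for two cache accesses instead of one.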