Monday Q&A #2 (2023-02-12)

Answers to a ton of questions from last week's videos!


Each Monday I answer questions from the comments on the prior week’s videos. Transcripts are not available for Q&A videos due to length. I do produce closed captions for them, but Substack still has not enabled closed captions on videos :(

Questions addressed in this video (timestamps generously provided by a community member!):


  • [00:01:01] “Regarding the register, I avoid making this question because I think it more ‘advanced’, but return to the loop, why the register renamer would not solved some of that data dependencies?”

  • “As an additional question (which could answer the question above if true): can the CPU emit micro-ops for a particular instruction on different frames? In other words, could the CPU emit the arithmetic and load micro-ops for a particular ADD in different frames? If so, then I would understand why listing 9 is faster, but I wouldn't think that's possible as my understanding is that the front-end would have to parse and emit the micro-ops at ‘once’.”

  • [00:10:55] “I also think it makes sense to aks the question why would somebody use python instead of cpp. It might showcase some problems with cpp or benefits of pyhon. With there learnings you might be able to make better choices.”

  • [00:13:44] ”In the back of my mind I have this gnawing question on how you would make things performant when you don't know the end-user's machine. You simply do some general research on what hardware is out there and configure your compile to match? I have only worked in an environment where the hardware is under our control, so haven't really had to deal with the jungle of hardware out there.” / "You've mentioned a few times now, about different CPUs, ‘you can look this up’. If you're writing something to run on some known server hardware (and you're never going to migrate it to new hardware), that seems useful. But what if I'm trying to ship software to a lot of different users on different systems? Some of these things you mention seem like pretty huge differences, like whether the L2 cache is shared or per-core. Are there strategies to help deal with this without trying to reason about performance on dozens of memorized CPU specs in your head at once?”

  • [00:19:10] “Why would compiler and CPU designers do this to me?”

  • [00:21:06] “Real world loops that have more complicated code inside, exercises with those would be very appreciated, since this is what we are looking to accomplish at the end of the day.”

  • [00:22:35] ”When I'm writing a program (or a function), how can I tell if I have created a wasteful implementation or not? Of course looking at the number of instructions for the simple add function in python made it obvious that it was very wasteful. But how do I do this for an entire program?”

  • [00:26:15] “How do you think about lanes and bottlenecks when you're designing and making software or when you're trying to optimize existing software? Does the distinction matter to you? Does this high-level understanding play a role at all for you, in practise?”

  • [00:28:50] “I wonder if it’s been a bad thing that we’ve focussed on asymptotic runtime performance in education instead of things like this?”

  • [00:30:19] “A bit of an off-topic question: you are always discussing performance as in ‘cpu time’ reduction. Does it make sense with modern desktop computers to talk about ‘power usage reduction’? Does that map directly to ‘instruction count reduction’?”


  • [00:33:08] “I was just a bit confused when you replaced ‘u32 Sum’ with an ‘__m128i Sum’, because I read the ‘i’ in ‘__m128i’ as ‘signed integer’ vs ‘u32 unsigned’. But I guess it doesn't imply signedness after all? A quick google search didn't make that aspect of SSE much clearer”

  • [00:44:31] “do compilers optimize code with SIMD instructions?” / “So how much of SIMD work does and can the compiler do? You mention you compile with specific flags to make the points stand out more. And also that compilers get stuff wrong as well. But would a compiler SIMD-fy your code when it can or can SIMD hardly never be generated from code because it is not as straightforward as the demo loop?”

  • [00:47:26] “Question: SIMD is different for intel and AMD?”

  • [00:50:20] “Is it correct to think that cpu have only wide registers but if we don't use them in the program then wideness is simply ignored and not used? Like using ax instead of rax.” / “Since SSE instructions have a 128-bit ‘capacity’, if we're dealing with additions involving 64-bit integers, would they still have value over the normal ADD instructions? (e.g. addq) Since both of them would still only be able to operate on 2 integers at a time (unless there is a broader context I'm missing out here)”

  • [00:56:40] “SIMD vs GPU. both brings parallelism. when to pick one rather than the other ?”

  • [01:04:01] “Are there microarchitectures that are deterministic? Where you get the same cycle count when you boot and run a single program on them directly (no OS, drivers, etc.)? If not, what do microarchitectures do to become non-deterministic and what do they gain?”

  • [01:09:06] “You showed SSE, AVX and AVX-512 Where does AVX2 fit into this, is it same size as AVX but with additional instructions?”

  • [01:11:00] “From the look of it, we are going to look mostly at the x86_64 platform. I suspect the process is more important than the platform, but how transferable is it to a platform like ARM?”

  • [01:15:16] “Did programmers before SIMD was a thing hardware-wise do ‘SIMD’ when working with smaller numbers? For example if they were working with 8 bit numbers, I imagine they could still pack them together in a 32 bit integer and do a normal add to add 4 at the same time. Was this a common thing before CPUs added wider registers and the special simd instructions?”

  • [01:19:38] “I have heard that sse instructions on the 256 and 512 size registers can cause the clock speed to decrease as they are quite power hungry. Would that significantly impact the theoretical 2000x speedup?”

  • [01:21:17] “A question I have and will most likely be answered as we continue is this: When programming / compiling an executable that will be distributed to a lot of different people running a lot of different hardware, whether it is a game or another program, how do you account for hardware that may or may not support SIMD, or may support one family type, but not the other? Do you try to insert different versions of optimizations with #ifdefs? Are there any general rules to follow to be able to think about this?”

  • [01:24:08] “When we are looking at disassembly, these instructions are still detached from what is actaully happening, right? the CPU is still deciding to parallelize. how does stepping through disassembly work with this, when an ADD instruction can still be parallelized with other instructions? are we waiting on the CPU to finish the instruction? will it start a different one before finishing this one? it seems very multi-threaddy and undeterministic to me (I know threading is very different) but the way I get it now is that those instructions we can look at (and measure?) are also a layer of abstraction of what is happening on the CPU itself. Could you clarify this, please?”

  • [01:35:09] “Does SIMD add requirements on the length of the dataset we're trying to add? What if the array we're trying to sum has an odd number of elements?”

  • [01:43:29] “Are there multiple SIMD circuits in the CPU as well? Especially considering the fact that we were able to unroll the AVX loop to the point of four instructions per iteration, does this mean there are 4 instances of the SIMD circuitry in the hardware?”

  • [01:45:07] “Would it be right to say that the special SIMD instructions have much higher ‘overhead’ when being processed by the CPU compared to normal ADD instructions? If so, would it then make sense to say that they're more suitable to be used when the input is of a certain size (and possibly of a power of 2)?”

  • [01:50:05] “As a beginner to SIMD intrinsics, I have a question about the load instruction. How does using _mm256_loadu_si256 impact the performance? According to the Intel Intrinsics Guide, this load intrinsic has a latency of 7, which is much higher than the latency of _mm256_add_epi32, which is 1. However, in the example you provided, loading from the input array did not seem to impede the program's performance. I was just wondering if it could be attributed to the optimization of instruction pipelining or memory caching, and if you could provide some further insights into this matter?”

  • [01:54:16] “How come if the CPU was capable of doing 16 adds per clock, why couldnt that be achieved with no dependency scalar adds? Meaning paddd vs 4 zero-dependency adds, I would think the compiler could look at the 4 independent adds and generate instructions with the assumption "these are independent and ok to execute in the same clock based on input args" Additionally, you mentioned with 8bit adds one could do 16 adds with SSE, which makes me think theres more than enough execution units in the cpu to do more adds per clock without SIMD, its confusing to me that we need to specify and align everything beforehand.”

On Caching

  • [02:01:53] “Is there a way to control the cache? My understanding is that we can control the size/organization of our structs to ensure we don't exceed a cache line, but that we also can't know when/what the cpu is going to pull into the cache, even with C/C++.” / “How can I inspect my program and see if it's using L1/L2/../RAM? If I understand properly that is completely transparent even in the assembly code. All I can see is a LEA instruction.” / “How can I ask the CPU to load data in L1/../LN? Or, I cannot? I have to chunk data in buffers of xKB, where x < sizeof L1?”

  • [02:09:27] “I heard that there're some instructons on x86 that instead of loading data into cache can operate straight from the memory omitting the cache, is there any benefit doing that for large worksets (I doubt there's streaming add instruction but I remember something for copying the data)?”

  • [02:10:43] “If I have a contiguous buffer of 5MB, why can't the compiler or the CPU chunk it for me in multiple arrays of 32KB that can be loaded into L1, instead of loading it into L3?” / “Curious about how data moves from the caches.. If something is not in L1, or L2 but is in L3, does it get copied 64 bytes per cycle from L3 to L2 and then 64 bytes per cycle from L2 to L1 and then 64 bytes per cycle from L1 to the CPU?” / “1. CPU says ‘Oh, you want the INPUT[0]? I'll go to main memory and get it! Oh, and I'll also bring along INPUT[1, 2, & 3] and store them in L1 because you'll probably need them soon!’ 2. Then loop iterations for INPUT[1, 2, &3] are fetched from L1 3. When the CPU hits INPUT[4] it repeats step one, but brings INPUT[5, 6 & 7] along and stores them in L1 So, in my simplistic example the majority of reads are from L1, but the CPU has to go to main memory at some regular interval? OR... does the CPU somehow know the size of the INPUT buffer is <= 32K and loads it all in at one time?” / “This is also why I don't understand how you forced the L2 cache reads - and subsequent caches up the chain: In the example, does it constantly read from L2 cache or does it copy chunks into L1? And why is it the case?” / “How are the caches populated? In the beginning we load 100 mb of ints in memory. As we iterate through the chunk, does the system store n consecutive ints in l1 so it can read them quickly? Does it also at the same time load to l2 and l3?”

  • [02:11:56] “Shouldn't there be a way to update L1, with the next 32k of L2?”

  • [02:19:37] “As far as I understood, CPUs notice when memory is linearly accessed and can fetch data for the program ahead of time. In this case, what is stopping the CPU from pre-emptively fetching data before it is actually needed? Is it that it favors not kicking out the data that was recently used (by us)?”

  • [02:24:37] “What is the operating system doing in all of this? Do we just assume that the OS is cached on one of the cores and all of our code we write (or software we run) is spread out across all the other cores? Also is the speed decrease less than half with each increase in size because there will always be some code in the lower caches so the decrease isn't linear?”

  • [02:28:59] “How does the naive version (scalar add?) behave when the data size is increased? I guess it's not affected that much because it wasn't operating at full speed to begin with?” / “Would love to see a graph of the cycles per int number, so we can clearly see the cliffs associated with the L1, L2, L3 cache and main memory!” / “I'm just curious, what is the drop if it can almost but not really fit in L1 like what is the slowdown if you have 5001(so assuming 5001*64 does not fit in L1 32K cache, not sure about the actual number for your CPU) so the same scenario that adds 1 extra property to a struct or something to just make it miss L1”

  • [02:34:00] “When discussing cache here, it's based around the full cache size of 32k, 256k, 8MB. What about the 64 byte cache line? Is there another level of optimization we're missing out on by not writing code that acknowledges that cache line limit?” / “Question: is data loaded from main memory in chunks of constant size (ex: 64 bytes) ? Or the CPU only retrieves what it needs for a specific operation ?” / “Does memory alignment also still play a role how data is fed into the registers from the cache?”

  • [02:40:07] “Will you cover cache line overlap and false sharing in the future?”

  • [02:40:13] “Does it take some cycles to check if each cache has the data in memory? I'd imagine there is some cost right? Just curious if there's a way to customize this, like if you know that the memory is not in any of the caches, can you run some operation which will skip the l1, l2 and l3 cache check and pull directly from memory to save a few cycles? Probably not very useful in practice either way...”

  • [02:41:37] “On shared caches I imagine cache eviction is a problem right? So someone could write some code running in parallel on two cores, and if they don't have a good understanding of where their data is residing in the cache, they could get in a loop where each core is evicting the shared cache every time it, which could slow things down considerably right?”

  • [02:43:56] “In the previous videos it was reasonably clear what to do to avoid the pitfalls associated with each multiplier. This time I'm either dumb or maybe it's less obvious: So what *do* I do if I have to work with 33 million integers and want to utilize the cache efficiently? Do I just slice the thing into 4k-sized smaller arrays and work on those? Wouldn't that incur its own performance cost?”

  • [02:46:05] “I have a dumb question. Why won't they just make the L1 cache bigger (like 8mb or even a gb)? Is it constrained by physics or would it become too expensive to make such a CPU?”

  • [02:48:02] “are the instructions of the program also stored in the caches or is that in a different place? I was trying to read some assembly primer and it had things like Data Segment and Code Segment. I have assumed that, that is because the program and data are just bits in the same memory and the segments are just offsets in it. If the program instructions are also in the cache then does that mean all of that 32k is not actually available for the integers?” / “Someone said that the program instructions are in fact in a separate cache. Then does that cache have similar levels? And wouldn't that mean if I jump to a point in my program that cannot fit inside however many caches are there, then I will get a slowdown like this just by jumping in the program?”

  • [02:51:46] “It's sort of implied, but as long as we stay in the same cache level, it it reasonable to expect similar timings? L2 fits 256K so a 64K or a 128K or a 196K array of ints should get the same adds/cycle?”

  • [02:53:20] “If going to the L1 cache takes about 3 cycles to fetch a value, wouldn't it mean that we would get about (1/3) adds per cycles in the naive C implementation? How did we get 0.8?”

  • [02:55:04] “Is there a way to make debug what is really going on? Right now it looks like you have to know how the CPU works and guess what it will do, and when it hits L1 or L2 or the main memory. Is there a way to actually see what is going on?” / “Is there any way to check/profile if a program or code section is causing data-cache issues? Can code instructions cause instructions-cache performance issues?”

  • [02:56:16] “You said by making it bigger than L1, now it's reading the values from L2 (and then L3 and main memory). But wouldn't it still mostly be reading from L1? I would expect that since L1 misses are so costly, and there's some bandwidth for L2, that the compiler (or CPU?) would be grabbing bandwidth-sized hunks of L2 into L1, so it can still operate on L1 as much as possible. Isn't that what it means for it to be a "cache"? Is your `cache` program there to prevent this behavior somehow? (How?)”

  • [03:00:19] “Is the data in L1/L2/... cache "persistent" between programs? That is, say I run a program A in which the data fits *entirely* in L1 cache. Upon completion, I immediately run it again, should it then be expected to run faster, since we don't have to perform the re-fetching of the data from main memory/disk?”

  • [03:01:20] “I have one question about how the CPU deals with caching data when all of it doesn't fit in the same cache level. Let's say I have an array that is larger than the 32k that could fit on L1 in the case of this CPU, but it is smaller than 256k. Will the CPU cache as much of this array to L1 then cache the rest to L2, or will it have to cache everything to the same level?”

  • “Do we have control over what gets cached to which level? Let's say I know that I will be processing two categories of data where one of them will be accessed frequently while the other not as much. Is there a way for me to ensure that the CPU keeps the data I need frequently in the L1 cache (assuming of course it fits there) and keep the one that isn't accessed a lot in the L2 or L3?”

  • [03:03:42] “Do I understand correctly that the QuadScalarPtr(128mb) version will have the same speed as the QuadAVXPtr(128mb)? But it is still can be usefull to do AVX in case some customer have much faster memory, I think.”

  • [03:05:00] “Are there computers with one to one relation of memory speed to execution speed? Or this wouldn't be very applicable to realworld tasks? (because many of the algorithms are slower than O(N) and you are doing arbitrary large work per data for large enough data)”

  • [03:06:42] “Perhaps a naive question: By increasing our integers-to-be-added to 32k or 256k and getting them out of L1 and L2 cache, aren't we also increasing the number of ADD (or PADDD) instructions we do per cycle? How much of the slowdown is attributed to this?”

  • [03:08:44] “In the grand scheme of things do you think that leveraging on chip memory in a cache architecture is preferable to directly being able to allocate in l1, l2? Given a blank slate do you think that chip designers could design the ISA and hardware differently to better support high performance programming?”

  • [03:10:10] “If you only ran the loop once, the difference in array size would not matter presumably, since the load delay from main memory would be uniform per load. But when you are running the loops repeatedly in your test code, then you are seeing the speedup since the first run loads everything to some cache level and later runs take advantage of it. Can you comment on that, assuming it’s correct?”

  • [03:11:45] “Follow-up question: once data is retrieved, how does the CPU determine where it gets cached (L1, L2 or L3) ? I'm going to guess recent loads always end up in L1 and everything else gets pushed "down" to L2 or L3 (until inevitably being removed from all caches)”

  • [03:14:14] “Why is there no L4, L5 caches”

This video is for paid subscribers