Paid episode

The full episode is only available to paid subscribers of Computer, Enhance!

Q&A #38 (2023-12-11)

Answers to questions from the last Q&A thread.

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course. Transcripts are not available for Q&A videos due to length. I do produce closed captions for them, but Substack still has not enabled closed captions on videos :(

Questions addressed in this video:

  • [00:03] “Maybe you can touch FPGAs and how would they, if at all, plug into performance aware programming?”

  • [02:30] “Evidently, I have a Coffee Lake processor, but the Intel Optimization Manual doesn't have a section for Coffee Lake. I'm guessing I just look at the section on Skylake, but *not* Ice Lake nor Alder Lake -- is that right? (Man, these naming conventions...)”

  • [08:01] “Is there a mechanism that we can use to track branch mispredictions like the one we were able to use for counting page misses? I'm thinking both within the codebase and in tools like perfmon.”
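
    As an editorial aside (not from the video, which covers Windows tooling like perfmon): on Linux, the closest in-codebase mechanism is the `perf_event_open` syscall, the same interface `perf stat` uses, which can read the CPU's branch-miss counter directly. A hedged sketch, assuming a Linux system where unprivileged perf access is allowed:

    ```c
    #define _GNU_SOURCE
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    // Count branch mispredictions across a small workload.
    // Returns -1 if the counter is unavailable (e.g. restrictive
    // perf_event_paranoid setting, or a VM without a PMU).
    static long long count_branch_misses(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_MISSES;
        attr.disabled = 1;          // start stopped; enable explicitly below
        attr.exclude_kernel = 1;    // count user-space branches only

        int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0)
            return -1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        // Workload with data-dependent, hard-to-predict branches.
        volatile unsigned sum = 0;
        for (unsigned i = 0; i < 100000; ++i)
            if ((i * 2654435761u) & 0x40000u)
                sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses = -1;
        if (read(fd, &misses, sizeof(misses)) != sizeof(misses))
            misses = -1;
        close(fd);
        return misses;
    }
    ```

    The counter is per-thread here (pid 0, cpu -1); `perf stat -e branch-misses ./program` gives the same number from outside the process.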

  • [15:26] “I wonder what the branch prediction overhead is. As it gets more and more sophisticated, I assume the branch predictor has more and more work to do and stuff to consider. I'm curious if even at times when it guesses correctly, if there are cases where it does a lot of work to get that guess and other cases when it takes less work. If that's the case, is there still space for optimization even for a predictable branch pattern, by changing the code to make the predictor do less work?”

  • [18:02] “With the branch predictor knowledge, how often are you thinking about writing branchless code? Are there code patterns that you always write as branchless vs the simple approach first?”
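
    For readers unfamiliar with the patterns this question refers to, here is a minimal sketch (not from the video) of the classic mask-arithmetic style of branchless code, which replaces a conditional with a blend:

    ```c
    #include <stdint.h>

    // Branchless minimum: build an all-ones or all-zeros mask from the
    // comparison, then blend the two values. (Illustrative only; modern
    // compilers often emit a cmov for the plain ternary anyway.)
    static int32_t min_branchless(int32_t a, int32_t b)
    {
        int32_t mask = -(int32_t)(a < b);   // -1 if a < b, else 0
        return (a & mask) | (b & ~mask);
    }

    // Branchless absolute value via the sign-extension trick.
    static int32_t abs_branchless(int32_t x)
    {
        int32_t sign = x >> 31;             // -1 if x is negative, else 0
        return (x ^ sign) - sign;
    }
    ```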

  • [20:08] “Branch Prediction Homework Question: I have an older iMac running on an Intel chip, Kaby Lake. When I run the tests after compiling with the Apple-provided clang, the results depend strongly on the optimization level. I looked at the assembly around my loop in GDB, and it looked similar no matter what optimization level I used.”

  • [23:12] “So from my understanding, normally when a branch mispredicts it incurs a massive misprediction penalty because it has to flush its pipeline and grab new instructions. Wouldn't a good solution be (and this might be more of a hardware question) if there were a separate cache specifically designed to hold some instructions of the "less likely branch" — at least enough instructions so that if a misprediction happens there are enough instructions of the other branch stored to give time for the pipeline to flush and pick right back up without any penalty? I could see the potential trade-offs might be increased CPU cost and complexity in how the prediction logic works, but obviously there are some other big downsides that I'm not seeing to this approach, otherwise we would be seeing more of this, right?”

  • [29:33] “When a program is memory bound, can you speed it up by using multiple cores? That is, if you could process, for example, 5 GB on one core, would you be able to double that with two cores?”

  • [31:52] “In the course "Branch Prediction" you said we no longer have an IP register like in 8086. Can you elaborate on it?”

  • [35:31] “Do you see a future where AI stuff will be integrated inside compilers like CPU designers use for IC creation (it becomes an abstraction layer)? Do you see darkness and suffering if it happens?”

  • [36:43] “How does running 32 bit applications happen when running on a 64 bit capable CPU? Does anything special need to happen on the CPU side or does it simply execute the instructions as they are produced by the compiler, i.e. not using any R* registers? Can the CPU internally "optimize" any 64 bit operations that might need to execute, i.e. if using types like UINT64 in the 32 bit compiled application?”

  • [38:34] “The feeling in terms of documentation seems like the following: Intel is more open about the internals of their CPU architecture than AMD, is that correct?”

  • [40:11] “In 'Software Optimization Guide for AMD Family 15h Processors' they suggest to prefer branches which are not taken. If I understand correctly it is similar on Intel where static branch predictor predicts that forward branches will not be taken… I wonder why they suggest this form? Is there a way to write assembly for it so that no branches are usually taken like in the first example?

    Also, does making forward branches not taken matter in your experience? Or is it mostly about avoiding branches completely or reducing their number?”
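
    As a side note (not from the video): in C with GCC or Clang, the usual way to act on this layout advice is `__builtin_expect`, which tells the compiler which outcome is likely so it can place the hot path as straight-line fall-through code, leaving the cold path as a rarely taken forward branch. A hedged sketch:

    ```c
    #include <stddef.h>

    // Compiler hints: which way a branch usually goes (GCC/Clang only).
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    // Sum a buffer, treating a NULL pointer as the unlikely error case.
    // The compiler can move the error handling out of the hot loop's way,
    // so the common case falls through without taking a branch.
    static long sum_buffer(const int *data, size_t count)
    {
        if (unlikely(data == NULL))
            return -1;                  // cold path, laid out out-of-line

        long total = 0;
        for (size_t i = 0; i < count; ++i)
            total += data[i];           // hot path: fall-through code
        return total;
    }
    ```

    Whether the hint helps in practice is exactly the question asked above; it mainly affects code layout and static prediction, not the dynamic predictor.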

  • [49:01] “Since floating point calculation is involved in CPU performance a lot, will you teach us how floats are represented? And when we should avoid them?”
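
    For reference (not from the video): the representation the question asks about is IEEE 754. A double is 1 sign bit, 11 exponent bits biased by 1023, and 52 fraction bits; a minimal C sketch that pulls those fields apart:

    ```c
    #include <stdint.h>
    #include <string.h>

    // Reinterpret a double's bits as an integer without undefined behavior.
    static uint64_t double_bits(double d)
    {
        uint64_t u;
        memcpy(&u, &d, sizeof(u));
        return u;
    }

    // IEEE 754 binary64 fields: sign (bit 63), exponent (bits 62..52,
    // biased by 1023), fraction (bits 51..0, implicit leading 1).
    static int double_sign(double d)     { return (int)(double_bits(d) >> 63); }
    static int double_exponent(double d) { return (int)((double_bits(d) >> 52) & 0x7FF) - 1023; }
    static uint64_t double_fraction(double d) { return double_bits(d) & 0xFFFFFFFFFFFFFULL; }
    ```

    So 1.5 decodes as sign 0, exponent 0, fraction 0x8000000000000 (the single bit for the 0.5).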

  • [50:18] “The CPU will not execute instructions in the order we wrote them. Is it not too complex for us humans to get an accurate intuition about the uops and the order in which they will be produced, today and in the future as things get more complex?”

