Each Monday I answer questions from the comments on the prior week’s videos. Transcripts are not available for Q&A videos due to length. I do produce closed captions for them, but Substack still has not enabled closed captions on videos :(
Questions addressed in this video (timestamps thoughtfully contributed by Fox Caminiti):
[00:01:12] Is there a way to change the playback speed of videos on Substack / you can't play the video in landscape mode
[00:02:20] As a programmer relying on a VM or an interpreter, do you really have a fighting chance to improve performance without rewriting in C?
[00:10:19] I am facing a small issue while watching the video. It keeps pausing frequently. I wonder if it's my network problem or if everyone is seeing the same issue.
[00:11:31] Outside of the gamedev industry, is there a place where performance-aware companies or startups hire employees?
[00:14:53] Is the waste that Python produces all really waste, or are there checks like overflow in there that might be useful in themselves / the C code only handles 32-bit integers while the Python interpreter handles arbitrary-sized integers / etc.
[00:18:19] Can you get Python itself to optimize the instructions generated?
[00:19:48] There was quite a bit of assembly beyond the add function in the C program, what is it doing?
[00:22:04] Is the JVM interpreter also similar, or is it much better than Python?
[00:24:11] Can you also share the whole performance measurement code, or is there a GitHub repo already set up? It would be nice for typing along, looking things up, and actually trying things out on our own.
[00:27:08] Why does the Visual Studio screenshot taken while debugging Python show huge numbers in the registers rcx and rax instead of 1235 and 5678?
[00:30:38] It would be cool to hear you being devil's advocate: why would anyone want to pay the price of using these languages?
[00:35:54] Are you interested in having the typos in the transcript reported back to you?
[00:36:04] Would it help to do a short video on how, at a super high-level a CPU works and interacts with the memory?
[00:38:56] I remember reading in some random Stack Overflow answer that you can "compute" the loop unrolling factor by multiplying the instruction pipeline depth by the number of execution units that can perform the op. For example, floating point add on Skylake has a latency of 4 and there are two adders. So this would mean there can be 8 adds in-flight at one time, so we should be issuing 8 adds per iteration?
[00:45:37] This might be a little out-of-scope (?), but it would be good to learn a bit about when a compiler may and may not do these things automatically, and to see some examples of the compiler being both helpful and unhelpful.
[00:49:29] I had to look up associative vs. commutative, which is something I haven't thought about since elementary school.
[00:54:52] Why pick the minimum adds-per-cycle value, and not something else like the mean, the median, or even the whole distribution of values?
[00:59:49] Is there any supplemental reading out there that covers this stuff?
[01:01:03] I'm guessing the QuadScalarPtr might be faster because it avoids the additional index sum (index + 1, index + 2), I'm looking forward to the explanation!
[01:04:58] It was surprising to me that the unrolling involved keeping separate sums and then adding them at the end. I would have expected to still keep one sum but adding multiple numbers at the same time.