In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.
The questions addressed in this video are:
[00:00:09] “How far can we take SIMD use? What if we put a way bigger onus on eliminating branching and developing parallelizable code? The language K, the simdjson library, the Co-dfns compiler, and the new Box2D engine all blast competitors out of the water by thinking in terms of data parallelism. SIMD use is arduous now - should we start designing our languages and APIs around it? What’s the path to broader cross-pollination as an industry? It really seems like untapped potential, right?”
[00:04:01] “Could you give examples of the kind of substrate work you’re hoping more people take seriously? Are you referring to teams like the WSL or Visual Studio performance teams, or something even deeper (or higher-level)? You’ve cited companies rewriting websites for performance and Microsoft’s console output claims - are these the kinds of things you mean? And if you mean something different, how do you think someone can get involved in that kind of work? I’m trying to understand the bigger vision. Is it that everything on the internet runs as fast as the McMaster-Carr Supply Store website?”
[00:07:45] “When you critique tools like Git, are you expressing a personal frustration, or do you think there’s an objectively better way these tools could work? It’s hard for me to imagine a world where I don’t need to memorize Git or AWS minutiae - are there products today that represent what you think ‘just working’ should look like?”
[00:13:00] “What do you think about coroutines?”
[00:15:50] “Will a VOD of the ‘Research Overview for “The Big OOPs”’ livestream be made available later?”
[00:16:10] “Do you find that the performance-aware development you are talking about also improves the quality of the software in general? Is there any connection? Further, once you decide that something needs its own test or tests, what techniques (test code organization, handling of test data, etc.) and tools do you find useful in creating those tests?”
[00:23:18] “I’m currently onboarding at my first ever job. It’s a giant legacy PHP codebase which primarily uses OOP. I don’t think OOP is great, but if you were forced to use classes for everything, what would be the best way to do it?”
[00:26:40] “It’s interesting how I get much better performance on Zen 3 on Linux than the reference haversine, even for the replacement case.”
[00:30:47] “I’ve been watching nearly all the BSC talks, and they’ve made me realize I might have a wrong understanding of what a type is. What would be the best definition?”
[00:35:19] “It seems like there has been a push in recent years towards languages with stronger type systems and static analysis (such as Rust’s borrow checker). Do you think that this trend meaningfully improves software quality, and if so what static analysis tools (both existing and hypothetical) do you think would be the most beneficial for a performance-minded programmer?”
[00:40:35] “I watched a YouTube video about the Montana mini-computer, and I understood how the concept of a function is implemented at the assembly level. I was wondering: what is a virtual function as defined in a high-level language, and how does that translate to a CPU? Along the same lines, I didn’t fully understand the concept of volatile. It seems to be related to the stack—could you explain how a volatile variable is represented at the assembly/CPU level?”
[01:04:10] “With Intel AMX becoming more mature and widely known, do you think it is or will be possible to start doing things typically done on GPUs (texturing, filtering, convolving, etc.) on CPUs in the future? Assuming it will be, do you think GPU vendors will finally start opening up à la the 30 million line problem in a fight to remain competitive with CPU vendors?”
[01:13:00] “Unlike games, a lot of the value in the Apple world comes from tight integration with their whole ecosystem and design language (ex: iPhone widgets, the watch now-playing view, Siri searchability, sharing to other apps or AirDrop, accessibility integration, etc...). But of course Apple has a lot of OOP style and ‘declarative frameworks,’ which means that I wouldn’t have control over the code that actually runs these features.
How can I still use a handmade philosophy of writing my own simpler, more focused code rather than depending on lots of slow and volatile libraries?”
[01:15:36] “Do you have any advice for gracefully avoiding, or recovering from, cases where Apple helpfully deletes your stuff in the background (kills your process when you switch apps, makes you redownload files from iCloud)?”
[01:19:21] “In your talk at the Better Software Conference you mention the “fat struct” as a good default option for programming in a systems level language. I think I know the gist of what you mean by this, but I am curious if you have a slightly more formal definition for the term and a general explanation for why it’s a good default approach.”
[01:27:32] “Most of the course so far talks about programs and assembly for actual chips. How do the concerns of performance-aware programming change, if at all, if the target is WASM?”
[01:30:13] “I perfectly understand why writing to al in a loop is slower than writing to rax, but I get different results for al, ax, eax, and rax. The loops writing to al and ax go at nearly 1/4 the speed of the loop writing to rax, but the eax loop goes at 1/2 that speed. Shouldn’t they all run the same, since writing to eax does not preserve the upper bits?”
[01:31:01] “If there’s really no single ‘rax’ in a CPU at any given point in time, why (and how) do debuggers show a single value when you stop them, and why does Linux write only a single value into the core dump file? Shouldn’t there be some kind of a tree? I just don’t know if I can trust this information for debugging... What if it shows a register value from one branch, but the bug was caused by the value from another, won’t I be misled?”
[01:40:21] “I was trying to reproduce your results from the RAT and register file lecture. I am running them on an Alder Lake chip (i7-12700H). My results were quite the opposite of yours: the add-only loop either had similar performance or ran much faster than the mov-and-add one. I found your article that hints that Alder Lake is able to decouple those chained adds.”