Paid episode

The full episode is only available to paid subscribers of Computer, Enhance!

Q&A #78 (2025-07-21)

Answers to questions from the last Q&A thread.

In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.

The questions addressed in this video are:

  • [00:03] “Hi, I've seen you talk before about the state of GPU APIs, and I am aware that you were talking about an SoC solution to this and many other problems with current computers. My question is this: how would you design a GPU API if you had a magic stick that you could shake to make it appear and be common on all computers today? If I understand your position, you'd have the programmer talk directly (or at least through a paper-thin layer) to the GPU, but how would that work in practice? If that's not too much to ask, I'd love to see a short snippet of pseudocode of how it would work on the consumer side, thanks!”

  • [04:10] “When dealing with async IO, do you think an await-style interface like in JavaScript/Go is a good model, or do you think a simple state machine or any other method is better? I find that writing state machines for this kind of stuff gets overly explicit in a lot of cases. Thanks!” (a state-machine sketch follows the question list)

  • [07:52] “Given the course, I was trying to apply some techniques to my own toy problem: given a list of words and an NxN grid, it tries to generate a word puzzle where each word is added horizontally, vertically, or diagonally.

    I tried to reduce pessimization as much as possible. The program performs a recursive search where each word added successfully to the grid increases the depth by one. Memory is pushed and popped with a single memory arena of at most 512 KiB. The hot code is the check that, given a word and the current board, determines whether the word will fit.

    I am having a hard time vectorizing this loop such that it's actually more performant. The single-byte checks seem to outperform vectorization by a factor of 10, perhaps because the data is too sparse? I also could not find an AVX "scatter" function that does the opposite of movemask_epi8. I was wondering if you have any thoughts on how one would optimize this further.” (an "inverse movemask" sketch follows the question list)

  • [13:03] “I was doing a homework assignment about asm volatile, and I noticed that all FMA instructions were using memory operands, and now I'm wondering why there are no FMA instructions with immediate operands. Surely it would be beneficial to bake values in some cases, right? Or maybe there are never immediate operands for floating-point instructions.” (a small inline-asm sketch follows the question list)

  • [17:54] “Will estimating the cost of more ‘branchy’ workloads, like lexing or JSON parsing, be covered?

    For context, I am trying to apply what I've learned from this course to optimize a lexer for a programming language and have gotten from ~0.8 GB/s to ~1.3 GB/s lexing the Linux source code. However, it seems impossible for me to get it to run any faster, even though 1.3 GB/s is nowhere near memory bandwidth and most of the work is just deciding what kind of token to spit out and how much to advance. It feels to me like ~2 GB/s or so could be the limit to how fast you could lex one token at a time, and going above that would require producing more than one token at once, as I believe simdjson does. However, I have no clue if this is remotely correct, since my intuition about the cost of branchy code is provably very bad.”

  • [20:12] “Do you have any good resources that outline how to improve a user's experience? The course emphasises performance, but I'm curious what other things you would consider to improve a user's experience.”
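
For the async IO question, here is a minimal C sketch (my own illustration, not taken from the video) of the explicit state-machine style the asker contrasts with await: each in-flight request carries a state tag plus its context, and one loop advances whichever requests are ready. The NonBlockingRead/NonBlockingWrite mentions in the comments are placeholders, not real APIs.

    /* Hypothetical sketch of an explicit state machine for async IO. */
    typedef enum
    {
        Request_WaitingForRead,
        Request_Processing,
        Request_WaitingForWrite,
        Request_Complete,
    } request_state;

    typedef struct
    {
        request_state State;
        int Handle;          /* whatever OS handle the IO is issued on */
        char Buffer[4096];
        int BytesRead;
    } request;

    static void AdvanceRequest(request *Request)
    {
        switch(Request->State)
        {
            case Request_WaitingForRead:
            {
                /* Poll or issue a non-blocking read; stay in this state
                   until data has actually arrived (placeholder call):
                   Request->BytesRead = NonBlockingRead(Request->Handle, ...); */
                Request->State = Request_Processing;
            } break;

            case Request_Processing:
            {
                /* Do the actual work on Request->Buffer, then queue a reply. */
                Request->State = Request_WaitingForWrite;
            } break;

            case Request_WaitingForWrite:
            {
                /* Poll or issue a non-blocking write of the result (placeholder). */
                Request->State = Request_Complete;
            } break;

            case Request_Complete:
            {
            } break;
        }
    }

The appeal of this form is that every resumption point is named and visible; the cost, as the question notes, is that it is more verbose than an await that captures the same state implicitly.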
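
For the word-puzzle question, there is indeed no single AVX instruction that inverts _mm256_movemask_epi8, but the expansion can be synthesized with a byte shuffle and a compare. The sketch below is a commonly used AVX2 trick (my illustration, not something taken from the video); it needs AVX2 enabled (e.g. -mavx2).

    #include <stdint.h>
    #include <immintrin.h>

    /* Expand a 32-bit mask (as produced by _mm256_movemask_epi8) back into
       32 bytes that are 0xFF where the corresponding mask bit is set and
       0x00 where it is clear, i.e. the inverse of movemask_epi8. */
    static __m256i InverseMovemaskEpi8(uint32_t Mask)
    {
        /* Broadcast the mask, then route byte N of the mask to the eight
           result bytes that depend on it. */
        __m256i Bytes = _mm256_shuffle_epi8(_mm256_set1_epi32((int)Mask),
                                            _mm256_setr_epi64x(0x0000000000000000ll,
                                                               0x0101010101010101ll,
                                                               0x0202020202020202ll,
                                                               0x0303030303030303ll));

        /* OR each byte with ~(1 << BitIndex) so it becomes 0xFF exactly when
           its bit was set, then compare against all-ones to get 0xFF/0x00. */
        __m256i BitSelect = _mm256_set1_epi64x(0x7fbfdfeff7fbfdfell);
        return _mm256_cmpeq_epi8(_mm256_or_si256(Bytes, BitSelect),
                                 _mm256_set1_epi8(-1));
    }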
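
For the FMA question, it is true that the x86 FMA encodings take only register or memory sources, never an immediate, so a "baked-in" constant still has to live somewhere addressable (typically .rodata). Below is a small GCC/Clang inline-asm sketch of my own (not from the video) showing the constant referenced through a memory operand; it needs an FMA-capable CPU to run.

    /* Computes X + 2.5f*Y with vfmadd231ss; the 2.5f cannot be encoded as
       an immediate, so it sits in static storage and is referenced via the
       "m" constraint (a RIP-relative memory operand in practice). */
    float FmaWithConstant(float X, float Y)
    {
        static const float Constant = 2.5f;
        float Result = X;
        __asm__ volatile("vfmadd231ss %[c], %[y], %[acc]"   /* acc = y*c + acc */
                         : [acc] "+x" (Result)
                         : [y] "x" (Y), [c] "m" (Constant));
        return Result;
    }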

The full video is for paid subscribers