In each Q&A video, I answer questions from the comments on the previous Q&A video, which can be from any part of the course.
The questions addressed in this video are:
[01:02] “Could you suggest a resource to learn how to work with large datasets that don’t fit into VRAM, or even regular RAM?”
[07:24] “I do not know how this compares to other approaches, since I haven’t really tried to optimize my profilers much and it’s not my main area of work, but your profiling explanation made me think of a lock-free profiler I built. It sounds similar, at least to me. Do you agree?
In my solution, each worker thread owns its profiling buffers completely. I keep a thread-local pointer to a block-based buffer that grows in chunks when needed, and all writes are done by the thread with no locks. The only shared step is that when a thread creates its first block, it registers it once by pushing it into a global linked list so the UI can later iterate all threads. To avoid stalling the writers, I double-buffered it. Each thread allocates two buffers and writes into one of them.”
[09:46] “About memory usage and the program stack, I believed that this stack could more easily fit in the L1 cache. Isn’t there a higher risk of cache misses at that level if the data to use has been allocated elsewhere? Would it even be noticeable in terms of performance?”
[14:32] “What’s the best way to multithread a software rasterizer? In my experience, scan line interlacing per triangle was horrific.”
[18:18] “In the file-processing test, do you know why, on the same machine, mmapping the file on Linux could outperform all the other methods at medium and large mapped-chunk sizes, while on Windows the same test was worse than everything else?”
[21:18] “Regarding callbacks, I think I understood the argument for moving things to a queue instead. However, I can also see the benefit of a callback because it’s synchronous. Suppose the file I/O reads some data and stores it in a buffer, and the callback is now supposed to handle that data. If you move things to a queue instead, you have to keep allocating new buffers to keep servicing those asynchronous reads, because you don’t know when the client code will handle them. Whereas if you can very quickly handle the buffer of read data synchronously in the callback, the file I/O can reuse the same buffer to store the next chunk.”
[25:45] “Can you steelman cases for which an arena/bump allocator (are these the same thing?) is not the preferred way to allocate memory? (I imagine it is when lifetimes are not known a priori, but perhaps I am missing more subtlety.) In such cases, what is your preferred method of allocation? Are you forced to go back to new/delete?”
[33:01] “Re: Dead Code Elimination Prevention Macros, I’ve got identical results with `asm volatile ("" : "+v"(Value));`, which tells the compiler that `Value` is both input and output, forcing it to initialize it as well as preventing it from assuming a specific value. It would generate `vpxor` for `0.0` and `vmovaps` for `0.5`. The advantage here is that it’s quite generic and doesn’t depend on operand size/type. Could you please elaborate on your choice of explicitly using instructions?”
[35:14] “Concerning callbacks, is there a benefit to using callbacks for print-outs/messaging? We have a part of the program that does some calculations which can take time, and it uses callbacks to notify the user about the progress, since waiting until the end of all the computations is not good enough. We also use a callback for a type of mesh calculation that depends on things that this other part of the program shouldn’t know about. (These things were not my decision, but I guess making the module as isolated as possible makes it easier to use it as a module in another program.)”
[37:42] “How do you determine if a solution to a problem is more complex than it needs to be? And for inherently complex and interconnected problems, how do you determine if it needs to be subdivided or not? Is there a general approach for working on a complex system with lots of moving parts? (other than me complaining about it :) )”
