In each Q&A video, I answer questions from the comments on the previous Q&A video. The questions can be about any part of the course.
Questions addressed in this video:
[00:03] “Can you give some discussion on lookup tables vs. compute? In traditional CS classes we are told that you can trade space for compute, and that for speed a lookup table would be faster than computation. However, as we are exploring here, maybe memory is really the bottleneck, and a large lookup table would require constant reads that miss the cache.
Also, some computations like sine(x) have symmetry and can be stored as a quarter-wave table where you have four 'if' sections for the different symmetries. Does that now have issues with branching as well?
I'm sure the answer is 'it depends' on how big the table is versus how much compute it replaces, how unique the lookups are, and so on, but I'd like to hear your thoughts on lookup tables and conditional lookups vs. compute.”
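An illustrative sketch of the quarter-wave idea (not code from the video): the table stores sin() only on [0, pi/2], and the other three quadrants are folded onto it by symmetry. The folds are written as branches on the quadrant bits here for clarity, but since the quadrant is just the top two bits of the index, they can also be done branchlessly with a mirror and a sign flip.

    // Quarter-wave sine table sketch: the table covers [0, pi/2]; symmetry handles the rest.
    // Build with: cc this_file.c -lm
    #include <math.h>
    #include <stdio.h>

    #define QUARTER_COUNT 1024
    static const double Pi = 3.14159265358979323846;
    static float QuarterTable[QUARTER_COUNT + 1]; // +1 so the x = pi/2 endpoint has a slot

    static void BuildTable(void)
    {
        for(int I = 0; I <= QUARTER_COUNT; ++I)
        {
            QuarterTable[I] = (float)sin((double)I * (Pi / 2.0) / (double)QUARTER_COUNT);
        }
    }

    static float TableSin(float X) // expects X in [0, 2*pi)
    {
        int Index = (int)(X * (float)(2.0 * QUARTER_COUNT / Pi)); // position in quarter-steps
        int Quadrant = Index / QUARTER_COUNT;                     // 0..3
        int Within = Index % QUARTER_COUNT;

        if(Quadrant & 1) Within = QUARTER_COUNT - Within; // quadrants 1 and 3 mirror the table
        float Result = QuarterTable[Within];
        if(Quadrant & 2) Result = -Result;                // quadrants 2 and 3 flip the sign
        return Result;
    }

    int main(void)
    {
        BuildTable();
        for(float X = 0.0f; X < 6.28f; X += 0.7f)
        {
            printf("x=%.2f  table=%+.4f  libm=%+.4f\n", X, TableSin(X), sinf(X));
        }
        return 0;
    }

Whether this beats just calling sinf() is exactly the 'it depends' part: a roughly 4 KB quarter table fits comfortably in L1, but a larger or full-wave table may not, and the folding adds a few extra operations per lookup.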
[03:25] “Are we not going to measure the latency of our caches/DRAM? In my limited experience this is a lot more intricate to set up, and it is harder to get properly reproducible results.”
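A common way to measure load latency is a pointer chase: link the slots of a buffer into one shuffled cycle so every load's address comes from the previous load, which defeats both out-of-order overlap and the hardware prefetcher, then divide the total time by the number of hops. A rough sketch (not code from the course; assumes a POSIX clock_gettime and 8-byte pointers):

    // Pointer-chase latency sketch: each load's address comes from the previous load,
    // so the time per iteration approximates the latency of whichever level of the
    // hierarchy the buffer fits in.
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static unsigned long long RandomState = 0x853c49e6748fea9bULL;
    static unsigned long long RandomNext(void) // xorshift64; any decent shuffle source works
    {
        RandomState ^= RandomState << 13;
        RandomState ^= RandomState >> 7;
        RandomState ^= RandomState << 17;
        return RandomState;
    }

    int main(void)
    {
        size_t Count = 1 << 20; // ~8 MB of pointers on a 64-bit machine; vary this to sweep levels
        void **Slots = malloc(Count * sizeof(void *));
        size_t *Order = malloc(Count * sizeof(size_t));

        // Build a random permutation, then link the slots into one shuffled cycle.
        for(size_t I = 0; I < Count; ++I) Order[I] = I;
        for(size_t I = Count - 1; I > 0; --I)
        {
            size_t J = (size_t)(RandomNext() % (I + 1));
            size_t Temp = Order[I]; Order[I] = Order[J]; Order[J] = Temp;
        }
        for(size_t I = 0; I < Count; ++I)
        {
            Slots[Order[I]] = &Slots[Order[(I + 1) % Count]];
        }
        free(Order);

        size_t Hops = 1u << 26; // ~67M dependent loads; reduce for quick runs on big buffers
        void **At = &Slots[0];
        struct timespec Start, End;
        clock_gettime(CLOCK_MONOTONIC, &Start);
        for(size_t I = 0; I < Hops; ++I)
        {
            At = (void **)*At; // dependent load: the next address is the value just loaded
        }
        clock_gettime(CLOCK_MONOTONIC, &End);

        double Seconds = (double)(End.tv_sec - Start.tv_sec) + 1e-9 * (double)(End.tv_nsec - Start.tv_nsec);
        printf("%p: %.2f ns per dependent load\n", (void *)At, 1e9 * Seconds / (double)Hops); // printing At keeps the loop live

        free(Slots);
        return 0;
    }

Sweeping the buffer size from a few KB up to hundreds of MB shows the latency stepping up at each cache level; the fiddly part is the reproducibility the question mentions, since frequency scaling, core migration, and TLB effects all move the numbers.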
[04:37] “With regard to cache optimisation, how does that manifest in the design of real software, say a game with meshes, textures, entity data, sound, etc.?
My intuition says that within such a system it would be easy for one part of the program to 'clobber' the cache used by a different part. Thus it seems best to design the system to chunk its data processing: do everything that operates on one dataset, then process something else, avoiding overlapping data dependencies.
Does that actually matter? How does the OS preempting processes factor in?”
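As a toy illustration of the 'chunked' shape described above (not from the video): instead of touching several datasets per entity inside one loop, run one pass per dataset so each pass streams through a single contiguous array. Whether the split actually wins depends on how much data each pass touches and how scattered it is; with the tiny structs below the difference would be negligible, so treat it as a sketch of the structure, not a benchmark.

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct { float Position[3]; float Velocity[3]; } Transform;
    typedef struct { float Pan; float Volume; } SoundEmitter;

    // Interleaved: each iteration touches more than one dataset at once, so they
    // share cache capacity for the duration of the whole loop.
    void UpdateInterleaved(size_t Count, Transform *Transforms, SoundEmitter *Sounds, float Dt)
    {
        for(size_t I = 0; I < Count; ++I)
        {
            Transforms[I].Position[0] += Transforms[I].Velocity[0] * Dt;
            Transforms[I].Position[1] += Transforms[I].Velocity[1] * Dt;
            Transforms[I].Position[2] += Transforms[I].Velocity[2] * Dt;
            Sounds[I].Pan = Transforms[I].Position[0] * 0.01f; // sound data pulled in mid-pass
        }
    }

    // Chunked: one pass per dataset; each pass streams sequentially through one array.
    void UpdateChunked(size_t Count, Transform *Transforms, SoundEmitter *Sounds, float Dt)
    {
        for(size_t I = 0; I < Count; ++I)
        {
            Transforms[I].Position[0] += Transforms[I].Velocity[0] * Dt;
            Transforms[I].Position[1] += Transforms[I].Velocity[1] * Dt;
            Transforms[I].Position[2] += Transforms[I].Velocity[2] * Dt;
        }
        for(size_t I = 0; I < Count; ++I)
        {
            Sounds[I].Pan = Transforms[I].Position[0] * 0.01f;
        }
    }

    int main(void)
    {
        size_t Count = 100000;
        Transform *Transforms = calloc(Count, sizeof(Transform));
        SoundEmitter *Sounds = calloc(Count, sizeof(SoundEmitter));
        UpdateInterleaved(Count, Transforms, Sounds, 0.016f);
        UpdateChunked(Count, Transforms, Sounds, 0.016f);
        free(Transforms);
        free(Sounds);
        return 0;
    }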
[10:33] “What does the mechanism of an exclusive cache look like? The inclusive style makes sense to me (I think): data flows from main memory -> L1, the oldest cache line in L1 gets evicted but still exists in L2, and the new cache line exists in both because it went through the pipeline.
For an exclusive cache, when the new cache line gets to L1 it must have passed through L2, so where does it go in L2 for it to be exclusive to L1? I'd guess the oldest cache line from L1 gets evicted _into_ L2, taking the slot the new cache line occupied on its way through the pipe, but that seems like it would add an extra bit of work.”
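For reference, a toy simulation (entirely hypothetical, not modeled on any specific CPU) of the bookkeeping the question guesses at: on an L1 miss in an exclusive pair, a line that hits in L2 moves up to L1 and leaves L2, a line that misses both fills straight into L1 without allocating in L2, and the L1 victim drops down into L2, which is the extra bit of work the question mentions. Round-robin replacement stands in for real LRU, and real caches are set-associative, so this only shows where lines go, not timing.

    // Toy, fully associative, exclusive-style L1/L2 bookkeeping (hypothetical).
    #include <stdio.h>

    #define L1_LINES 2
    #define L2_LINES 4

    static long L1[L1_LINES]; // stores line tags; -1 means empty
    static long L2[L2_LINES];
    static int  L1Next, L2Next; // round-robin "oldest" pointers (stand-in for real LRU)

    static int Find(long *Cache, int Lines, long Tag)
    {
        for(int I = 0; I < Lines; ++I) if(Cache[I] == Tag) return I;
        return -1;
    }

    static void Access(long Tag)
    {
        if(Find(L1, L1_LINES, Tag) >= 0) { printf("%ld: L1 hit\n", Tag); return; }

        long Victim = L1[L1Next]; // evict the oldest L1 line to make room
        int  InL2 = Find(L2, L2_LINES, Tag);
        if(InL2 >= 0)
        {
            L2[InL2] = -1; // exclusive: the line leaves L2 as it moves up to L1
            printf("%ld: L2 hit, moved up to L1\n", Tag);
        }
        else
        {
            printf("%ld: miss, filled into L1 from memory (not into L2)\n", Tag);
        }
        L1[L1Next] = Tag;
        L1Next = (L1Next + 1) % L1_LINES;

        if(Victim >= 0)
        {
            // The L1 victim drops down into L2, possibly pushing an L2 line out to memory.
            if(InL2 >= 0) { L2[InL2] = Victim; } // reuse the slot the promoted line vacated
            else { L2[L2Next] = Victim; L2Next = (L2Next + 1) % L2_LINES; }
        }
    }

    int main(void)
    {
        for(int I = 0; I < L1_LINES; ++I) L1[I] = -1;
        for(int I = 0; I < L2_LINES; ++I) L2[I] = -1;
        long Pattern[] = {1, 2, 3, 1, 2, 3, 4, 1};
        for(int I = 0; I < 8; ++I) Access(Pattern[I]);
        return 0;
    }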
[13:18] “Fetching the stack trace is slower the more frames you jump through, and after just a few it gets to the point where it feels too expensive. This program sadly has places where the stack is over 60 calls deep, which is very expensive. But it made me think... most of the time, a previous allocation call has already fetched most of the stack that the current call needs... there should be a way to partially reuse that. It also made me think this should already be a solved problem that many profilers have dealt with. So I just wonder: have you ever done something like this, and/or do you know about methods that have been used to speed this up?”
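One way the partial-reuse idea can be sketched (an illustration, not a description of how any particular profiler does it): while walking frame pointers, stop as soon as you reach a frame that was already live during the previous capture, because from there out to main the chain is unchanged, and splice in the cached tail. The sketch assumes x86-64, GCC/Clang, and frame pointers kept (compile with -O0 -fno-omit-frame-pointer); the frame-address check can be fooled when the stack is reused at the same addresses by a different call chain, so a real version needs extra validation, and each thread would need its own cache.

    #include <stdio.h>
    #include <string.h>

    #define MAX_DEPTH 128

    // x86-64 frame layout with frame pointers: [saved RBP][return address].
    typedef struct StackFrame { struct StackFrame *Prev; void *Return; } StackFrame;

    // The previous capture: one frame pointer and one return address per level.
    static void *PrevFrames[MAX_DEPTH];
    static void *PrevReturns[MAX_DEPTH];
    static int   PrevCount;

    // Walks the current stack; when it reaches a frame that was live during the previous
    // capture, it splices in the cached tail instead of walking the rest again.
    static int CaptureStack(void **Returns, void **Frames, int *FramesWalked)
    {
        int Count = 0;
        StackFrame *Frame = (StackFrame *)__builtin_frame_address(0);
        while(Frame && (Count < MAX_DEPTH))
        {
            // Same frame address and same return address as last time means the rest of
            // the chain out to main is unchanged (modulo the stack-reuse caveat above).
            for(int Prev = 0; Prev < PrevCount; ++Prev)
            {
                if((PrevFrames[Prev] == (void *)Frame) && (PrevReturns[Prev] == Frame->Return))
                {
                    int Reused = PrevCount - Prev;
                    if(Count + Reused > MAX_DEPTH) Reused = MAX_DEPTH - Count;
                    memcpy(Returns + Count, PrevReturns + Prev, (size_t)Reused * sizeof(void *));
                    memcpy(Frames  + Count, PrevFrames  + Prev, (size_t)Reused * sizeof(void *));
                    *FramesWalked = Count;
                    return Count + Reused;
                }
            }

            Returns[Count] = Frame->Return;
            Frames[Count]  = (void *)Frame;
            ++Count;

            if(Frame->Prev <= Frame) break; // caller frames live at higher addresses; stop if not
            Frame = Frame->Prev;
        }
        *FramesWalked = Count;
        return Count;
    }

    // Stand-in for an instrumented allocation site: capture, then remember the capture.
    static void RecordAllocationSite(void)
    {
        void *Returns[MAX_DEPTH], *Frames[MAX_DEPTH];
        int Walked = 0;
        int Count = CaptureStack(Returns, Frames, &Walked);
        printf("captured %d frames, walked only %d of them\n", Count, Walked);
        memcpy(PrevReturns, Returns, (size_t)Count * sizeof(void *));
        memcpy(PrevFrames,  Frames,  (size_t)Count * sizeof(void *));
        PrevCount = Count;
    }

    static void Nest(int Depth) { if(Depth > 0) Nest(Depth - 1); else RecordAllocationSite(); }

    int main(void)
    {
        Nest(20); // first capture has to walk the whole chain
        Nest(20); // second capture splices most of the chain from the first
        return 0;
    }

The linear search over the previous capture could also be cheapened: the cached frame addresses increase from the innermost frame outward, so a binary search, or just remembering a single anchor frame from last time, would avoid scanning the whole array per frame.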