This seems like it should be so obvious but for some reason has never really occurred to me: splitting tasks into smaller tasks that can be pipelined improves throughput.

I mean, obviously! That's why CPUs have a pipeline for decoding and executing instructions.

I'm pretty sure I studied this during University but somehow it didn't stick with me because I ended up doing web development in Python and Javascript where no one cares about how the system works under the hood.

So now, given this picture, perhaps the job of a system engineer (or at least one aspect of his job) is to come up with a way to split complex tasks in the system into a series of smaller tasks that can be processed in parallel, have a queue for each task, and assign one CPU Core to it.

So if you have a complex task T that needs to be applied to N items, we can split it into 4 steps, T0, T1, T2, T3 and have each core handle one subtask: take an item from the Ti queue, apply Ti, pass the object to T[i+1] queue.

Hmm. I wonder if memory caching issues introduce complications to the dependency chains. Assuming a shared L2 cache, the system should work fairly well assuming each subtask can start and finish in a timely manner with no delays in moving items from one step in the pipeline to the next.

(Sorry for the all over the place mumbling comment; just thinking out loud).

Expand full comment
Nov 11, 2023·edited Nov 11, 2023

This is correct, but also because of the latency of individual instructions and dependency chains made of individual instructions, single-core workloads also end up benefitting from some explicit pipelining. It's hard to explain this in terms of the video, maybe there are some loads of laundry that really do require as an input a dry sock from a completed load of laundry, but if you're clever somehow you can get multiple independent streams of laundry loads going, so that you don't just have to wait all the time.

The rub here is that if you are e.g. computing prefix sums in two steps as in https://en.algorithmica.org/hpc/algorithms/prefix then immediately using the value you "just computed" will be slower than using values you "computed earlier" because the one you "just computed" is actually still in the washing machine. On the other hand, the cursor for performing the 2nd pass of the computation shouldn't be so far behind the cursor for performing the 1st pass that the data in between them doesn't fit into L1, or you'll have to pay for constant cache misses to L2.

simdjson contains a similar thing, where the whitespace masks just sit around for 1 loop iteration because otherwise they wouldn't actually be ready. The code has been refactored since I last looked at it so this is not as clear, but a comment explains that json_structural_indexer::next actually processes the structural character location bitmask (a u64 representing the positions of structural characters within a 64-byte block of the input) for the previous chunk, not the current chunk, because the one for the current chunk won't actually be computed by the time these instructions would want it. Link: https://github.com/simdjson/simdjson/blob/9c45f1f292d9b3feb812b4a2c0329c459af27eb7/src/generic/stage1/json_structural_indexer.h#L244

Expand full comment

Hi, my opinion (ignorant) I see contention already on the registers themselves, right? If I have some of those SIMD instruction where I am reading multiple registers, and the other instr. that wants to write in same cycle...complicated.

I guess (at cost) you can solve this with more hardware.

Expand full comment

That's a real contention, but it _has been_ solved by more hardware for the most part.

Internally the CPU has a register file which has more registers than actually exposed by the instruction set architecture. When you modify the contents of a register in a way that discards its previous value (like a mov), then internally that register will be remapped to a different one in the register file through a process called register renaming, and it can be used straight away even if some previous instruction hasn't yet completed (like a previous mov from this register to memory, as in your example).

Relating back to this episode, what this does is it breaks _false_ dependency chains in the instruction stream that arise from not having enough register names in the ISA.

I'm sure Casey will cover this throughout the course in more detail, but your comment might be worth a Q&A question for a little teaser. :)

Expand full comment

Comment worth saving, thank you.

Expand full comment

here is a video that shows a real world example of this pipelining idea, and how to increase throughput by doing many things at the same time: https://www.youtube.com/watch?v=CPaNNiB2H-s

Expand full comment