Single Instruction, Multiple Data
How many operations can the CPU perform with a single instruction?
This is the fourth video in the Prologue of the Performance-Aware Programming series. It discusses one of five multipliers that cause programs to be slow. Please see the Table of Contents to quickly navigate through the rest of the course as it is updated weekly. A lightly-edited transcript of the video appears below.
In the overview video for the course, I specifically said that there are only two things we can do to improve our performance:
Reduce the total number of instructions that we issue
Figure out how to move instructions through the CPU faster by either changing what they, changing their order, or changing their memory access pattern
We saw in the post about waste that there are often a lot of instructions we can simply eliminate. They aren't necessary for solving the problem. When we get rid of those instructions, the code goes faster.
In the previous post on instruction parallelism, we saw how even the same instructions can sometimes be made to go through the CPU a lot faster. If we change things like whether or not they're dependent on each other, which is something we often have the flexibility to do, we can get more performance out of the same instructions.
Now it’s time to go back to that first type of improvement, where we reduce the total number of instructions.
The idea of instruction level parallelism that we talked about last time — where we're looking at a bunch of adds and trying to get the CPU to simultaneously issue as many of them as possible — can also be approached in another way. We can do the same thing, but using a different method. We can actually reduce the total number of instructions that the CPU has to process to do the same number of adds.
That approach is something you've probably heard of before, and it's the third multiplier in our set of five: SIMD.