How many bytes per cycle can the CPU read in?
This is the fifth video in the Prologue of the Performance-Aware Programming series. It discusses one of five multipliers that cause programs to be slow. Please see the Table of Contents to quickly navigate through the rest of the course as it is updated weekly. A lightly-edited transcript of the video appears below.
In all the previous videos, when we looked at performance, we were focused on trying to get as many adds per cycle as we could. The summation was the work we actually cared about, so it made sense to measure the routine specifically in terms of those adds.
But as we saw when we looked at loop overhead, there are many other things the CPU has to do each time through the loop. To get the add done for the summation, it has to do loop maintenance overhead, like incrementing an index and comparing it to make sure we're not off the end of the loop. There was also a big part that we didn't talk much about, which was what I referred to as a “load”.
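As a rough reminder of what that loop looks like, here is a minimal C sketch of a summation routine. The comments mark which part of each iteration is the load, which is the add, and which is loop overhead. The name `sum_array` and its signature are made up for illustration and are not taken from the course code, and the compiler is free to generate different instructions than the comments suggest.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical summation routine (name and signature are illustrative only).
uint32_t sum_array(size_t count, const uint32_t *input)
{
    uint32_t sum = 0;

    // Loop overhead: compare index against count, then increment index.
    for (size_t index = 0; index < count; ++index)
    {
        // The work we care about: load input[index] from memory,
        // then add the loaded value into the accumulator.
        sum += input[index];
    }

    return sum;
}
```

Calling `sum_array(4, values)` on an array holding 1, 2, 3, 4 returns 10. The point is only that every add in the loop comes paired with one load.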
Loads and stores, which are two different things that we'll talk about in great detail later in the class, are the ways CPUs get things from memory and put things back into memory. You can think of them as reading and writing. When we read from memory it's called a load. When we write to memory it's called a store.
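To make the distinction concrete, here is a small C sketch with one function that reads through a pointer and one that writes through a pointer. The names `read_value` and `write_value` are hypothetical. It also assumes the compiler does not optimize the memory access away, which it may do if it can keep the value in a register.

```c
#include <stdint.h>

// Reading through a pointer: the CPU has to load the value from memory.
uint32_t read_value(const uint32_t *source)
{
    return *source; // load
}

// Writing through a pointer: the CPU has to store the value to memory.
void write_value(uint32_t *dest, uint32_t value)
{
    *dest = value; // store
}
```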
I mentioned loads in passing because they are an important part of this loop, but I didn't talk very much about how they might impact performance. Loads are actually very important to the performance of any piece of code and we will have to learn to analyze them in detail if we want to fully understand why our code runs the way it does. But right now what I'd like to do is just give you a little perspective on how much loads matter.
Stores are also very important, of course, but it happens that our loop doesn't do any stores. So they’ll have to wait for later.
When the CPU sees an instruction like the one we had written, where we add a value from memory into an accumulator, the CPU sees the memory access as a load.
When we used the term dependent to describe the relationship between two adds, we said that if one add took as input the result of another add, that created a dependency between them. It creates a dependency because the input to one is the output of the other. The CPU can't proceed with the dependent instruction until it knows the result of the instruction on which it depends.
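As a reminder of what that looks like in source form, here is a minimal C sketch contrasting one serial chain of adds with two shorter chains. The function names are hypothetical. The comments describe the dependencies as written in the source; an optimizing compiler may rearrange integer adds like these on its own, so treat this as an illustration of the idea rather than a prediction of the generated code.

```c
#include <stdint.h>

// One serial chain: each add takes the result of the previous add as input.
uint32_t dependent_adds(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t sum = a;
    sum += b; // needs the initial value of sum
    sum += c; // needs the result of the add above
    sum += d; // needs the result of the add above
    return sum;
}

// Two shorter chains: the first two adds do not depend on each other,
// so the CPU does not have to finish one before starting the other.
uint32_t independent_adds(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t sum_a = a + b;
    uint32_t sum_b = c + d; // independent of sum_a
    return sum_a + sum_b;   // depends on both partial sums
}
```

Both functions compute the same total. They differ only in how many adds have to wait on an earlier result.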
The exact same thing happens with the add and its associated load: