Instructions Per Clock

How many instructions is the CPU executing in a single clock cycle?

A dual-propeller plane taxiing down a runway in bright sunlight.

This is the third video in the Prologue of the Performance-Aware Programming series. It discusses one of five multipliers that cause programs to be slow. Please see the Table of Contents to quickly navigate through the rest of the course as it is updated weekly. A lightly-edited transcript of the video appears below.

In the previous post we talked about wasted instructions. Waste is typically the largest multiplier for slow software: it consists of instructions that you're forcing the CPU to execute even though they aren't necessary for your workload. But there are other multipliers that occur in the instructions that are necessary.

In other words, if you get rid of all the instructions you really don’t need, and you're left with just instructions you do need, there's still a great deal of variability in how fast the CPU can actually do those instructions. Continuing the example from last time, modern CPUs have lots of ways in which they can do basic operations like addition. If you're not careful, you can easily get yet more large slowdowns because you ask the CPU to perform the addition in an inefficient way.
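To make "asking the CPU in an inefficient way" concrete, here is a hedged C sketch (my illustration, not code from the series): two loops that perform exactly the same additions, where the first chains every add through one accumulator, and the second splits the work across two independent accumulators that a CPU capable of multiple adds per cycle can overlap.

```c
#include <stdint.h>
#include <stddef.h>

/* Every add depends on the previous value of Sum, so the additions
   form a serial chain the CPU cannot overlap. */
static uint32_t SerialSum(uint32_t *Input, size_t Count)
{
    uint32_t Sum = 0;
    for(size_t Index = 0; Index < Count; ++Index)
    {
        Sum += Input[Index];
    }
    return Sum;
}

/* Same additions, but split across two accumulators. SumA and SumB do
   not depend on each other, so the CPU is free to execute both adds in
   the same cycle. (Assumes Count is even, purely to keep the sketch
   short.) */
static uint32_t PairedSum(uint32_t *Input, size_t Count)
{
    uint32_t SumA = 0;
    uint32_t SumB = 0;
    for(size_t Index = 0; Index < Count; Index += 2)
    {
        SumA += Input[Index];
        SumB += Input[Index + 1];
    }
    return SumA + SumB;
}
```

Both functions produce the same answer; the only difference is how much freedom the instruction stream gives the CPU.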

The code I showed you last time was adding an array of integers together. It was written and compiled by me in such a way as to demonstrate a poorly running version of a summation loop. We were trying to compare Python to C in terms of waste, so I wanted to focus on just the wasted instructions.

Now it's time to talk about the C loop itself, and how to make it run more efficiently. To do that, the first thing we need to understand is the next multiplier in our list of multipliers. Waste was number one, and number two is a thing called “IPC”, or sometimes “ILP”.

IPC (instructions per clock) and ILP (instruction-level parallelism) are two terms that mean essentially the same thing, although they are each used in a slightly different way. Instructions per clock is exactly what it sounds like: it's the average number of machine instructions the CPU executes on every clock cycle. It's exactly like that value I measured last time when I said “adds per cycle”. Instructions per clock is what that number would be if we counted all the instructions, not just the adds.
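As a formula, IPC is just a ratio: instructions retired divided by clock cycles elapsed over the measured region. A tiny C sketch (the numbers below are illustrative, not measurements from the video):

```c
/* IPC = total instructions retired / total clock cycles elapsed */
static double ComputeIPC(unsigned long long Instructions,
                         unsigned long long Cycles)
{
    return (double)Instructions / (double)Cycles;
}
```

For example, a loop that retired 4096 adds in 5120 cycles would be running at `ComputeIPC(4096, 5120)`, which is 0.8 adds per cycle.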

Instruction-level parallelism is more of a general term used to refer to the fact that a particular CPU is capable of doing some number of instructions at the same time.

Either way, they both describe the aspect of CPU performance which we need to look at now: how was the CPU getting 0.8 adds per cycle last time anyway? And is that the best it could do for integer addition like that? Is 0.8 the highest, or could we have gotten it to do more if we’d asked it to perform the summation in a slightly different way?

Let's think about our loop again. I want to talk about two separate things we can do in the loop to see what happens to the number of instructions we are able to execute per clock. Our loop looked something like this:
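The exact code isn't reproduced in this excerpt, but a straightforward single-accumulator summation loop of the kind described, written in C, might look like this (a hypothetical reconstruction, not necessarily the video's exact code):

```c
#include <stdint.h>
#include <stddef.h>

/* Sum an array of 32-bit integers with a single accumulator. Each
   addition depends on the previous one through Sum. */
static uint32_t SingleScalarSum(uint32_t *Input, size_t Count)
{
    uint32_t Sum = 0;
    for(size_t Index = 0; Index < Count; ++Index)
    {
        Sum += Input[Index];
    }
    return Sum;
}
```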
