Such an interesting read! Thank you for sharing it with us!
Another Casey banger :)! Thank you!
Stupid me, I forgot I have a laptop with a 12700H. Not sure what the difference is between mobile and desktop CPUs, but I could test things on Windows 11 and Linux.
With discoveries like this, my jestful claim that Assembly is really a declarative language rings more and more true with every passing day. Thank you for these deep dives!
I have an Alder Lake CPU, and observed this when doing the homework for part 3.
I have also tested this on a Raptor Lake CPU, which follows the exact same pattern.
I'm using Google Benchmark, since it makes it really easy to read the performance counters you want, without recompiling.
I was able to create a few other benchmark programs that probe a bit at how the CPU executes these things. I've observed that the front-end will fuse these instructions with a jump when they occur contiguously, at which point each cycle can only execute a single immediate addition or subtraction. Adding a nop prevents the fusion, and the optimization happens again. So one question is how often the CPUs actually leverage this optimization in the wild.
I also found something in Agner Fog's microarchitecture manual, in the section about Alder Lake, where it says: "Integer addition with a small immediate constant has zero latency in some cases."
I've created a gist with my benchmark program, and the output from running these on Alder Lake and Raptor Lake CPUs, with a few relevant performance counters: https://gist.github.com/danielbendix/a377a976e62b6e8a8ea9c93636f0ff1e
If anyone has something they'd really like tried on these, let me know and I'll see what I can do.
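For anyone who wants to poke at the fusion observation without Google Benchmark, here is a minimal sketch of the kind of loop described above. It is not the program from the gist. The layout is an assumption: a chain of eight dependent `add reg, 1` instructions, followed by a `sub`/`jnz` loop tail that either sits back to back (fusable) or has a `nop` in between. The names `run_chain`, `ns_per_add` and `print_comparison` are made up for illustration, the inline assembly assumes GCC or Clang on x86-64, and it only reports coarse wall-clock time rather than performance counters.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Eight back-to-back `add reg, 1`; each one depends on the previous result.
#define ADD8                                   \
    "add $1, %[x]\n\t" "add $1, %[x]\n\t"      \
    "add $1, %[x]\n\t" "add $1, %[x]\n\t"      \
    "add $1, %[x]\n\t" "add $1, %[x]\n\t"      \
    "add $1, %[x]\n\t" "add $1, %[x]\n\t"

// Runs `iters` iterations of the eight-add chain and returns the accumulator
// (8 * iters). With `nop_before_jump`, a nop separates the loop counter's
// `sub` from the `jnz`, which is intended to keep the pair from macro-fusing.
std::uint64_t run_chain(std::uint64_t iters, bool nop_before_jump) {
    assert(iters > 0);  // the sub/jnz loop would wrap around on zero
    std::uint64_t x = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    std::uint64_t n = iters;
    if (nop_before_jump) {
        asm volatile("1:\n\t" ADD8 "sub $1, %[n]\n\t" "nop\n\t" "jnz 1b"
                     : [x] "+r"(x), [n] "+r"(n) : : "cc");
    } else {
        asm volatile("1:\n\t" ADD8 "sub $1, %[n]\n\t" "jnz 1b"
                     : [x] "+r"(x), [n] "+r"(n) : : "cc");
    }
#else
    // Portable fallback so the sketch still builds elsewhere; it does not
    // exercise the behaviour discussed here.
    (void)nop_before_jump;
    for (std::uint64_t i = 0; i < iters; ++i) x += 8;
#endif
    return x;
}

// Wall-clock nanoseconds per add for one variant (coarse, not cycle counts).
double ns_per_add(std::uint64_t iters, bool nop_before_jump) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t adds = run_chain(iters, nop_before_jump);
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / static_cast<double>(adds);
}

// Prints both variants side by side.
void print_comparison(std::uint64_t iters) {
    std::printf("sub+jnz adjacent: %.4f ns/add\n", ns_per_add(iters, false));
    std::printf("nop in between:   %.4f ns/add\n", ns_per_add(iters, true));
}
```

Calling `print_comparison(100000000)` from `main` gives a rough comparison. To turn the times into adds per cycle you would still need the core clock or, better, real cycle counters like the ones in the gist.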
I definitely initially misread the title as “The Case of the Missing Excrement” and was *really* confused
Since I had to use Event Tracing for Windows, I can assure you that not only was the excrement not missing, it was present in abundance.
- Casey
This article: https://chipsandcheese.com/p/lion-cove-intels-p-core-roars
says "Intel’s renamer received some impressive capabilities in Golden Cove, including the ability to execute up to 6 dependent adds with small immediates per cycle."
It was also apparently published on the exact same day as this post?
Great article, thanks for sharing this research! And sorry you had to go through the Ultimate Sadness...
On another note, I wonder why they didn't stick with this for newer processors? Maybe it was only something they experimented with in Golden Cove that turned out not to be as beneficial?
I am not certain which processors have this, since I was only able to test Golden Cove. It's possible that it does happen in some other Intel processors!
- Casey