Jun 21, 2023 · edited Jun 21, 2023 · Liked by Casey Muratori

Interesting findings for macOS / M1. There is a `_mach_absolute_time` in `libsystem_kernel.dylib`.

It starts by reading a byte from the absolute address (!) 0xfffffc088 + 8, and then, based on the value of that byte, it does one of the following:

- makes a syscall `mov w16, #-3; svc #0x80`, or

- reads value from `CNTVCT_EL0`, or

- reads value from msr `S3_3_C14_C0_6`, or

- reads value from msr `S3_4_C15_C10_6`.

Looks like a dispatch by machine type. One of the latter registers appears in Asahi Linux MSR dumps; I guess it is Apple Silicon-specific.

Does anyone know what that address with the magic byte is, where the data comes from, and how it gets mapped into the userspace process's address space?

Jun 21, 2023 · edited Jun 21, 2023 · Liked by Casey Muratori

Answering it myself (with the help of Asahi folks):

0xfffffc000 is `commpage` with a bunch of useful data, Darwin kernel maps it into userspace: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/arm/cpu_capabilities.h#L203

Also:

[12:09:43] <j`ey_> S3_3_C14_C0_6 is CNTVCTSS_EL0

[12:10:01] <j`ey_> it's like CNTVCT_EL0 but you don't need an ISB
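For anyone who wants to poke at these counters directly, here is a minimal sketch (my own, not from the library) that reads CNTVCT_EL0 with gcc/clang inline assembly on AArch64. The ISB before the plain CNTVCT_EL0 read is exactly what CNTVCTSS_EL0 makes unnecessary:

```
#include <stdint.h>
#include <stdio.h>

// Read the ARMv8 virtual counter directly (AArch64, gcc/clang only).
// The ISB keeps the mrs from being speculated early; the
// self-synchronizing CNTVCTSS_EL0 variant doesn't need it.
static inline uint64_t read_cntvct(void)
{
    uint64_t val;
    __asm__ volatile("isb\n\tmrs %0, CNTVCT_EL0" : "=r"(val) :: "memory");
    return val;
}

int main(void)
{
    printf("CNTVCT_EL0 = %llu\n", (unsigned long long)read_cntvct());
    return 0;
}
```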


Thanks to Marcan, the last MSR is also demystified:

<@marcan> dottedmag: and S3_4_C15_C10_6 is ACNTVCT_EL0 which is a pre-standard version of CNTVCTSS_EL0


What's the difference between mach_absolute_time and _mach_absolute_time? Couldn't you have done something like the following to get the current CPU ticks?

```
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

int main() {
    uint64_t start_time;
    start_time = mach_absolute_time();
    printf("mach abs time %llu\n", start_time);
    return 0;
}
```


mach_absolute_time is _mach_absolute_time (the leading _ is the standard C symbol decoration on Darwin, IIRC).

mach_absolute_time gives you a 24 MHz timer, not CPU ticks, unfortunately.
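For completeness, the stock way to turn those ticks into nanoseconds is mach_timebase_info; a minimal sketch (on Apple Silicon the numer/denom pair reflects the 24 MHz counter — 125/3 — while on Intel Macs it's 1/1):

```
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

// Convert mach_absolute_time ticks to nanoseconds using the
// timebase the kernel reports.
int main(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    uint64_t ticks = mach_absolute_time();
    uint64_t ns = ticks * tb.numer / tb.denom;

    printf("ticks=%llu ns=%llu (timebase %u/%u)\n",
           (unsigned long long)ticks, (unsigned long long)ns,
           tb.numer, tb.denom);
    return 0;
}
```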


Not sure if I am following you. What do you mean by timer?

Official documentation[1] says this function returns the current value of a clock that increments monotonically in tick units.

[1] https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time


The absolute time clock is not an instruction count, it's just time, so it doesn't give you any good idea of how many instructions were actually executed, only how much time has passed.

Jun 26, 2023 · edited Jun 26, 2023

Is 1 GHz high enough resolution for the homework we'll be doing later in the course? That's the frequency I'm getting using mach_absolute_time in last week's homework, and I'm tempted to use it with my Mac, but I don't want to invest the time in it if not.

Jun 20, 2023 · Liked by Casey Muratori

What a coincidence! My copy of Hacker's Delight just arrived this morning.

Jun 24, 2023 · Liked by Casey Muratori

Here's what mach_absolute_time does on my Intel MacBook Pro running macOS 12.6:

```
; Symbol: mach_absolute_time
; Source: unknown
7FF80322BE20: 55                          push   rbp
7FF80322BE21: 48 89 E5                    mov    rbp, rsp
7FF80322BE24: 48 BE 50 00 E0 FF FF 7F >   movabs rsi, 0x7fffffe00050
7FF80322BE2E: 44 8B 46 18                 mov    r8d, dword ptr [rsi + 0x18]
7FF80322BE32: 45 85 C0                    test   r8d, r8d
7FF80322BE35: 74 F7                       je     0x7ff80322be2e ; <+14>
7FF80322BE37: 0F AE E8                    lfence
7FF80322BE3A: 0F 31                       rdtsc
7FF80322BE3C: 0F AE E8                    lfence
7FF80322BE3F: 48 C1 E2 20                 shl    rdx, 0x20
7FF80322BE43: 48 09 D0                    or     rax, rdx
7FF80322BE46: 8B 4E 0C                    mov    ecx, dword ptr [rsi + 0xc]
7FF80322BE49: 83 E1 1F                    and    ecx, 0x1f
7FF80322BE4C: 48 2B 06                    sub    rax, qword ptr [rsi]
7FF80322BE4F: 48 D3 E0                    shl    rax, cl
7FF80322BE52: 8B 4E 08                    mov    ecx, dword ptr [rsi + 0x8]
7FF80322BE55: 48 F7 E1                    mul    rcx
7FF80322BE58: 48 0F AC D0 20              shrd   rax, rdx, 0x20
7FF80322BE5D: 48 03 46 10                 add    rax, qword ptr [rsi + 0x10]
7FF80322BE61: 44 3B 46 18                 cmp    r8d, dword ptr [rsi + 0x18]
7FF80322BE65: 75 C7                       jne    0x7ff80322be2e ; <+14>
7FF80322BE67: 5D                          pop    rbp
7FF80322BE68: C3                          ret
7FF80322BE69: 90                          nop
7FF80322BE6A: 90                          nop
7FF80322BE6B: 90                          nop
```

It brackets `rdtsc` (not `rdtscp`!) with `lfence` instructions and combines the two 32-bit halves from edx:eax into a single 64-bit value (the `shl`/`or` pair). After that, it does a bunch of transformations I don't semantically understand yet.
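The lfence bracketing is reproducible from C with intrinsics; a sketch (gcc/clang spelling shown; MSVC gets the same intrinsics from <intrin.h>):

```
#include <stdint.h>
#include <x86intrin.h> // __rdtsc, _mm_lfence

// The same lfence/rdtsc/lfence bracket the disassembly shows: the
// first fence makes prior instructions finish before rdtsc runs;
// the second keeps later instructions from starting until the
// counter has actually been read.
static uint64_t serialized_rdtsc(void)
{
    _mm_lfence();
    uint64_t tsc = __rdtsc();
    _mm_lfence();
    return tsc;
}
```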

Jun 24, 2023 · Liked by Casey Muratori

I found Apple's annotated assembly here, which explains the algorithm used: https://opensource.apple.com/source/xnu/xnu-4570.1.46/libsyscall/wrappers/mach_absolute_time.s.auto.html

```
/*
 * 64-bit version _mach_absolute_time. We return the 64-bit nanotime in %rax.
 *
 * The algorithm we use is:
 *
 *   ns = ((((rdtsc - rnt_tsc_base) << rnt_shift) * rnt_tsc_scale) / 2**32) + rnt_ns_base;
 *
 * rnt_shift, a constant computed during initialization, is the smallest value for which:
 *
 *   (tscFreq << rnt_shift) > SLOW_TSC_THRESHOLD
 *
 * Where SLOW_TSC_THRESHOLD is about 10e9. Since most processors' tscFreqs are greater
 * than 1GHz, rnt_shift is usually 0. rnt_tsc_scale is also a 32-bit constant:
 *
 *   rnt_tsc_scale = (10e9 * 2**32) / (tscFreq << rnt_shift);
 */
```

On my machine, `rnt_tsc_scale` seems to always be `0x6aaaaaaa`, while `rnt_tsc_base` and `rnt_ns_base` change on reboot.

I used a debugger to manually modify `rax` and `rdx` after the call to `rdtsc` to pretend that the full 64-bit value was `0x1` in one run and `0x2` in another. The transformed results in `rax` were just `0x1` apart, so I think there is no loss of precision from using `mach_absolute_time`.
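To make the arithmetic concrete, here is the formula as C — a sketch using the names from Apple's comment. The struct is illustrative; the real fields live in the commpage at the address the code loads into rsi (offsets noted per the disassembly above):

```
#include <stdint.h>

// Illustrative layout, not the actual commpage definition.
typedef struct {
    uint64_t rnt_tsc_base;  // TSC at calibration      ([rsi])
    uint32_t rnt_tsc_scale; // (1e9 * 2^32) / tscFreq  ([rsi + 0x8])
    uint32_t rnt_shift;     // low 5 bits, usually 0   ([rsi + 0xc])
    uint64_t rnt_ns_base;   // ns at calibration       ([rsi + 0x10])
} nanotime_t;

static uint64_t tsc_to_ns(uint64_t tsc, const nanotime_t *nt)
{
    uint64_t d = (tsc - nt->rnt_tsc_base) << (nt->rnt_shift & 0x1f);
    // 64x32-bit multiply keeping bits [95:32] of the product -- the
    // mul/shrd pair in the disassembly. unsigned __int128 is a
    // gcc/clang extension.
    return (uint64_t)(((unsigned __int128)d * nt->rnt_tsc_scale) >> 32)
           + nt->rnt_ns_base;
}
```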


When I single-step through using lldb, I seem to get into an infinite loop in the outer function __commpage_gettimeofday_internal, which repeatedly calls mach_absolute_time. It's strange, since I only make a single call to the clock_gettime library function. Has anyone else observed similar behavior?

My test program:

```
#include <time.h>
#include <stdio.h>
#include <assert.h>

int main()
{
    struct timespec tp;
    int rc;

    rc = clock_gettime(CLOCK_MONOTONIC, &tp);
    assert(rc == 0);
    return 0;
}
```


For folks on Linux who haven't done a ton with GDB, I highly recommend a GDB wrapper to make it a bit more friendly and visual. My two favorites are below:

Pwndbg : https://github.com/pwndbg/pwndbg/blob/dev/FEATURES.md

GEF : https://github.com/hugsy/gef


I haven't used it much, but there is also this GUI for GDB: https://www.gdbgui.com/installation/


Ah for sure! I'm mostly curious about terminal things, but gdbgui looks good for those who like getting out of the terminal.


It has a terminal at the bottom-left too: https://pasteboard.co/hqvyVTfWrxQ0.png

And this way I don't have to remember all the commands. It's also supposed to be able to visualize the data, but I haven't figured out how to make it do that.

I do like how it merges the assembly with the source. I like it more than how Visual Studio does it. I don't like how RemedyBG does it at all; I can't understand assembly if there is only assembly there. It takes me way too long.


Agreed. I really love RemedyBG but I do wish it had the option to interleave the disassembly with program symbols.


The multiply-to-divide trick requires accessing the top 64 bits that the multiply instruction puts in RDX. Is there a way to access those top 64 bits in C? Are there intrinsics for this?

author

Yes, there is. It's __mulh.

- Casey


A 64-bit value will be of type long long, so I would assume you can just read it directly. I don't know if you can say "read this register into this variable" in C, though. But that trick is for assembly, and I would imagine it's supposed to be written in assembly directly.

Although I have never seen it used in code I have read, and I have most certainly never used it myself, there is this feature that supposedly allows you to write assembly in a C file: https://en.cppreference.com/w/c/language/asm


You can write the same multiply, or even a 128-by-64-bit division, with the help of compiler intrinsics/types. No need for assembly. In MSVC you can do a full 64x64->128 multiply with _umul128 (or __umulh if you need just the top 64 bits). In gcc/clang you do it with the __int128 type (it represents a pair of 64-bit registers).

See an example of how to do that in the PCG64_MUL macro here: https://gist.github.com/mmozeiko/1561361cd4105749f80bb0b9223e9db8#file-pcg64-h
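A self-contained sketch of both spellings described above:

```
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Top 64 bits of a 64x64 -> 128-bit unsigned multiply, no assembly.
static uint64_t mul_hi64(uint64_t a, uint64_t b)
{
#if defined(_MSC_VER)
    return __umulh(a, b); // or _umul128(a, b, &hi) for both halves
#else
    return (uint64_t)(((unsigned __int128)a * b) >> 64); // gcc/clang
#endif
}
```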


If you can always make divisions faster with the mult/add/shift trick, why isn't division implemented that way at the CPU level?
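Part of the answer is that the magic constant depends on the divisor, so it has to be precomputed; hardware dividing two runtime values can't do that up front. A worked instance of the trick for one fixed divisor (the standard Hacker's Delight magic number for unsigned divide-by-10, checked by brute force here):

```
#include <stdint.h>
#include <assert.h>

// Divide a 32-bit unsigned value by the constant 10 via multiply+shift:
// 0xCCCCCCCD is ceil(2^35 / 10), so (n * 0xCCCCCCCD) >> 35 == n / 10
// for every uint32_t n.
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
    for (uint64_t n = 0; n <= UINT32_MAX; ++n) // exhaustive (slow) check
        assert(div10((uint32_t)n) == (uint32_t)n / 10);
    return 0;
}
```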


Is the homework applicable in Linux?

In Listing 74 (platform_metrics), a Linux implementation of ReadOSTimer is provided. The Windows version calls QueryPerformanceCounter, but the Linux version calls gettimeofday(), which returns the time since 1970: https://github.com/cmuratori/computer_enhance/blob/main/perfaware/part2/listing_0074_platform_metrics.cpp#L32-L51

I'm not sure where to continue from here.
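The absolute epoch shouldn't matter for this kind of measurement, since you only ever subtract two readings. If you'd rather use the monotonic clock anyway, a sketch (my names, not the course listing's):

```
#include <stdint.h>
#include <time.h>

// Monotonic alternative to the listing's gettimeofday-based timer.
// Same idea: a counter plus a known frequency; only differences of
// two readings are meaningful.
static uint64_t GetMonotonicTimerFreq(void) { return 1000000000ull; }

static uint64_t ReadMonotonicTimer(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
```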


So, I've been doing the homework in Rust. You have to use the winapi crate to get access to QPF. However, as best as I can tell, there is no call to rdtsc/p. When the exe runs, somehow there is a value in memory set to 10,000,000, and it just loads that into RAX, then copies it to a location held by RCX.

I ran through the assembly a handful of times via RemedyBG, and the only call to rdtsc is when I call it directly later in the program.

Link to a Git repo outlining the steps in RemedyBG: https://github.com/ruinedme/cpu-timer/blob/master/remedybg-steps.txt

Has anyone else run into something like this?

author

I could be misreading your TXT file, but that looks like QueryPerformanceFrequency, not QueryPerformanceCounter. It makes sense that QueryPerformanceFrequency would load a constant 10 MHz, because that is what the Windows timer is specified to be, so it just returns you that...

- Casey
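For reference, the two calls side by side — a minimal sketch of what the homework exercises (on recent Windows the reported frequency is the fixed 10 MHz Casey mentions):

```
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq); // constant, typically 10,000,000
    QueryPerformanceCounter(&t0);
    Sleep(100);
    QueryPerformanceCounter(&t1);
    printf("freq=%lld Hz, elapsed=%.3f ms\n", (long long)freq.QuadPart,
           1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
    return 0;
}
```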

Jun 21, 2023 · edited Jun 21, 2023

You're right, I'm not sure what I was thinking. I found the QPC call, and it looks similar to your example.

Jul 15, 2023 · edited Jul 15, 2023

Did a little bit of digging on the Linux side (AMD Zen 2, Linux 6.3.9), looking at clock_gettime(CLOCK_MONOTONIC, ...), which seems to be the recommended call for monotonic OS time nowadays (gettimeofday(...) maps to clock_gettime(CLOCK_REALTIME, ...), but this does depend on which libc you run).

On my machine, clock_gettime(CLOCK_MONOTONIC, ...) ends up performing a handful of jumps and 2 calls before ending up at a rdtscp in the kernel's vDSO library. Whether or not you end up with a rdtscp here depends on what clocksource your kernel chooses (see dmesg | grep -i clocksource); it's probably tsc, but it might fall back on hpet or similar, which maps to a syscall instead. In doing this digging I found out that my system often considers tsc unstable and falls back on hpet due to a BIOS bug :(

Here's some benchmarking of various methods of getting a monotonic timestamp (average over 1000 samples, measured using rdtsc; a sketch of the measurement loop follows the list):

- clock_gettime (hpet) 1262 ns

- clock_gettime (tsc) 33 ns

- rdtsc 15 ns
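A rough sketch of that measurement loop (gcc/clang on x64; convert ticks to ns with your measured TSC frequency):

```
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h> // __rdtsc

int main(void)
{
    enum { SAMPLES = 1000 };
    struct timespec ts;

    uint64_t start = __rdtsc();
    for (int i = 0; i < SAMPLES; ++i)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    uint64_t end = __rdtsc();

    printf("avg %llu ticks per clock_gettime call\n",
           (unsigned long long)((end - start) / SAMPLES));
    return 0;
}
```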

A word of caution if you're stepping through the asm of clock_gettime in a debugger: I eventually ended up in the vDSO, which actually calls rdtscp (assuming the tsc clocksource) in a loop; see the source here: https://github.com/torvalds/linux/blob/master/lib/vdso/gettimeofday.c#L144. My guess is that it's waiting for the sequence number `seq` to stabilize before returning (not 100% on this), and when single-stepping in a debugger `seq` never settles, so you'll be stuck in a seemingly infinite loop. Running at full speed with gdb still attached, it seems to pass through the loop twice. Maybe it would only perform a single rdtscp when running without gdb attached, no idea. Maybe that's the reason clock_gettime is ~2x rdtsc in the timing above.
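The loop in question is the usual seqcount reader pattern; a simplified sketch (field names are mine, not the kernel's) of why single-stepping can spin forever if a writer keeps bumping `seq` underneath you:

```
#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h> // __rdtsc

// Simplified seqcount reader, the pattern the vDSO uses around its
// counter read. An odd seq means a writer is mid-update; a changed
// seq means we raced one. Either way we retry.
typedef struct {
    _Atomic uint32_t seq;
    uint64_t tsc_base, ns_base, mult;
    uint32_t shift;
} vdso_data;

static uint64_t read_time_ns(vdso_data *vd)
{
    uint32_t seq;
    uint64_t ns;
    do {
        seq = atomic_load_explicit(&vd->seq, memory_order_acquire);
        ns = vd->ns_base +
             (((__rdtsc() - vd->tsc_base) * vd->mult) >> vd->shift);
    } while ((seq & 1) ||
             seq != atomic_load_explicit(&vd->seq, memory_order_acquire));
    return ns;
}
```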


So here we are again. The evening before the new episode. The worst is when it gets on my mind right before I hit the bed (like now) and then the anticipation starts building and building... Can't wait but have to.


After the sad chuckle promised in the previous video, I have two questions:

Why go through all the extra work to reduce the precision from GHz to MHz? If API users are supposed to query the frequency through QueryPerformanceFrequency and take that value into account, why not just provide the best frequency available?

To accomplish the frequency reduction, the OS has to somehow find out the frequency of rdtsc. Any hint of how this is done, and whether it can be done at user level?

author

A) I don't know why they reduce the frequency. I assume it is for some kind of compatibility reason: they don't want an API that might return 4 GHz or might return 10 MHz, so they keep it at 10 MHz to maximize compatibility, knowing that apps wouldn't test on both kinds of counters, etc.

B) AFAIK the OS times the TSC against a known clock, but the known clock depends on the hardware. There is a way to get the TSC frequency from the chip on modern Intel chips (there's CPUID info for it), but not on AMD. So AFAIK, AMD chips are always timed, whereas Intel ones _might_ be read out of the CPUID, or might not be (you'd have to ask Microsoft which they do; I don't know).

- Casey
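For the Intel CPUID path, leaf 0x15 is the one to look at; a sketch using gcc/clang's __get_cpuid (note that ecx, the crystal frequency, is 0 on many parts, in which case the OS still has to know or measure the crystal itself):

```
#include <stdint.h>
#include <stdio.h>
#include <cpuid.h> // gcc/clang __get_cpuid

// Query the TSC frequency from CPUID leaf 0x15 on Intel parts that
// report it: eax/ebx give the crystal-to-TSC ratio, and ecx, when
// nonzero, gives the crystal frequency in Hz.
int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || eax == 0 || ebx == 0) {
        printf("CPUID leaf 0x15 not supported\n");
        return 1;
    }
    printf("TSC/crystal ratio: %u/%u\n", ebx, eax);
    if (ecx)
        printf("TSC frequency: %llu Hz\n",
               (unsigned long long)ecx * ebx / eax);
    else
        printf("crystal frequency not reported by CPUID\n");
    return 0;
}
```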


For those who are on Linux and use GDB, here is the command that sets the disassembly style to Intel (the default style is AT&T):

set disassembly-flavor intel

If you want to make it the default, just add the command to ~/.gdbinit.


I've tested this on Linux Mint, and I can see the rdtscp instruction followed by the others, but it's buried behind a couple of calls and it does a lot of checks before and after. One of those checks makes the whole thing loop indefinitely; my guess is that it checks something time-sensitive for accuracy and fails because I'm debugging.


XD Elon Musk would be proud haha. Thanks for making all of this information digestible! As I start to understand more and more of this stuff, programming gets EVEN MORE enjoyable.


Interesting. On an older machine (2 cores, 2.8 GHz), rax is 66,033,427,158,431,422 and cl is 0.

10 MHz / 2.8 GHz = 0.0035714285714286

RAX / 2^64 = 0.0035796792590917

So they're equal up to the 5th digit, a bit worse than Casey's result.


I was trying to debug Rust in Visual Studio with lldb and found I couldn't step into QPC; it would just step over the function. It turns out that once the breakpoint hits on QPC, you can right-click and "Open Disassembly View" to start stepping through the disassembly. At that point, you can also just set breakpoints in the disassembly, which bypasses doing that song and dance every time.

I was able to confirm things are as expected on Windows 10: https://pastebin.com/YbK1Q12w

My machine runs at 3.5 GHz.

So in my case I was looking for 10 MHz / 3.5 GHz = 1/350, which is...

1 / 350 = 0.00285714285714285714

and what was loaded into rax for the mulq was 52,644,714,841,660,999, and...

52,644,714,841,660,999 / 2^64 = 0.00285387571006043668

Not quite as close as Casey's measurements, but definitely what we're looking for.
