Another Day In The Code Mines: Fact-checking some Apple Silicon exuberance

The first Apple Silicon Macs are out in customers' hands now, and people are really excited about them. But there are a couple of odd ideas bouncing around on the Internet that are annoying me. So, here's a quick fact check on a couple of the more breathless claims that are swirling around these new Macs.

Claim: Apple added custom instructions to accelerate JavaScript!

False.

There is one instruction on the M1 which is JavaScript-related, but it's part of the standard ARM v8.3 instruction set, not a custom Apple extension. As far as I know, there are no (documented) instruction set extensions for M1 outside of the ARM standard.

The "JavaScript accelerator" instruction is FJCVTZS, or "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero". You might well wonder why this *one* operation is deserving of its own opcode. Well, there are a couple of things at work here: All JavaScript numbers are stored as double-precision floats (unless the runtime optimizes that out), and if you do something as simple as applying a bit mask to a number, you've got to convert it from the internal representation to an integer, first.

Claim: Faster reference counting means the M1 can do more with less RAM!

False.

This one is just a baffling combination of ideas, mashed together. David Smith, an Apple engineer, posted a thread on Twitter about how the M1 is able to do a retain/release operation much faster than Intel processors. They're even faster when translated with Rosetta than they are on native x86 code.

fun fact: retaining and releasing an NSObject takes ~30 nanoseconds on current gen Intel, and ~6.5 nanoseconds on an M1
— David Smith (@Catfish_Man) November 10, 2020

This is a pretty awesome micro-optimization, but it got blown all out of proportion, I think largely due to this post on Daring Fireball, which takes the idea that this one memory management operation is faster, and combines that with a poorly understood comparison between Android and iOS from this other blog post, and somehow comes to the conclusion that this means that 8GB of RAM in an M1 system is basically equivalent to 16GB on Intel. There's a caveat in there, which I guess isn't being noticed by most people quoting him, that if you "really need" 16GB, then you really need it.

Yes, reference counting is lighter on RAM than tracing garbage collection, at a potential cost in additional overhead to manipulate the reference counts. But that advantage is between MacOS and other operating systems, not between ARM and Intel processors. It's great that retain/release are faster on M1, but application code was not spending a significant percentage of time doing that. It was already really fast on Intel. If you actually use 16GB of RAM, then 8GB is going to be an issue, regardless of Apple's optimizations.

Claim: The M1 implements special silicon to make emulating Intel code faster!

True.

Amazingly, this one is basically true. One of the most surprising things about Rosetta2 is that the performance of the emulated code is really, really good. There's a lot of very smart software engineering going into that of course, but it turns out that there is actually a hardware trick up Apple's sleeve here.

The ARM64 and x64 environments are very similar - much more so than PowerPC and x86 were, back in the day. But there's one major difference: the memory model is "weaker" on the ARM chips. There are more cases where loads and stores can be rearranged. This means that if you just translate the code in the "obvious" way, you'll get all sorts of race conditions showing up in the translated code, that wouldn't have happened on the Intel processor. And the "obvious" fix for that is to synchronize on every single write to memory, which will absolutely kill performance.

Apple's clever solution here is to implement an optional store ordering mode on the Performance cores on the M1, that the OS can automatically turn on for Rosetta-translated processes, and leave off for everything else. Some random person posted about this on GitHub, along with a proof of concept for enabling it on your own code. Apple hasn't documented this switch, because of course they don't want third-party developers to use it.

But it's a very clever way to avoid unnecessary overhead in translated applications, without giving up the performance benefits of store re-ordering for native code.

Claim: The unified memory on M1 means that you don't need as much memory, since the GPU and CPU share a common pool.

False.

This one is kind of hard to rule on, because it depends a lot on how you're counting. If you compare to a typical CPU with a discrete GPU, then you will need less total memory in the unified design. Instead of having two copies of a lot of data (one for the CPU, and one for the GPU), they can just share one copy. But then again, if you've got a discrete GPU, it'll have its own RAM attached, and we normally don't count that when thinking about "how much RAM" a computer has.

An 8GB M1 Mac is still going to have the same amount of usable RAM as an 8GB Intel Mac, so again, this is not a reason to think that an 8GB model will be an adequate replacement for your 16GB Intel Mac.

Claim: Apple's M1 is an amazing accomplishment

True!

Leaving aside some of the the wilder claims, it is true that these processors represent am exciting new chapter for Apple, and for the larger industry. I'm excited to see what they do next. I have some thoughts about that, for a future post...

Another Day In The Code Mines

Monday, November 30, 2020

Fact-checking some Apple Silicon exuberance

Claim: Apple added custom instructions to accelerate JavaScript!

Claim: Faster reference counting means the M1 can do more with less RAM!

Claim: The M1 implements special silicon to make emulating Intel code faster!

Claim: The unified memory on M1 means that you don't need as much memory, since the GPU and CPU share a common pool.

Claim: Apple's M1 is an amazing accomplishment

No comments: