Recently I landed changes to enable WebAssembly (Wasm) on Windows in WebKit. This built upon previous work to support fast WebAssembly memory. There were some fun challenges in getting this to work, which I’ll dig into below.

I’m currently exploring ways to fund continuing my work to improve the Windows port for WebKit. I’m also thinking about exploring other avenues for improving web engine diversity (and web platform funding diversity). Get in touch if either of those sounds interesting to you, I’d love to chat.

Background

WebKit’s WebAssembly low level interpreter (LLInt) is written in a DSL called “offlineasm”, which looks like assembly. A compiler for this language, written in Ruby, translates offlineasm into assembly for the target architectures supported by WebKit: x86, x86_64, ARM64, ARMv7, RISC-V and more!

The LLInt is quite slow at running code, but it’s able to start executing quickly. For small short-lived WebAssembly code blocks, or code that’s executed once, startup time dominates the total time taken so it’s a good option to have. For code that’s executed many times, WebKit wants to tier up into a JIT quickly - that’s the goal of the BBQ JIT tier (Build Bytecode Quickly). There’s a good talk about the LLInt from WebAssembly Summit 2020 that’s worth a watch if you’re interested in learning more, Tadeu Zagallo — JavaScriptCore’s new WebAssembly interpreter.

Windows has a few differences from macOS and Linux which are relevant for the LLInt. I think the build is still using ML64 despite the great work to move the Windows port to build with clang instead of MSVC. This means offlineasm needs to generate x86_64 assembly code in Intel syntax instead of AT&T for the Windows port. The calling convention is also different: other platforms follow the System V AMD64 ABI, while Windows uses the Microsoft x64 calling convention. The Windows convention passes only four integer arguments in registers instead of six, and treats rsi and rdi as callee-save, so there are two more callee-save registers (and two fewer caller-save). Dynamic libraries follow a different format (DLLs instead of shared objects). There’s also a long tail of Unix things that aren’t present, like signals, which we previously tackled for fast WebAssembly memory.
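To make the calling convention difference concrete, here is a small sketch of how integer arguments are assigned to registers under each ABI. The register orders are the documented ones for the two conventions; the helper name `assign_integer_args` and the argument names are made up for illustration, and the model ignores floating point arguments and the Windows shadow space.

```python
# Integer argument registers, in order, for each x86_64 calling convention.
SYSV_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
WIN64_ARG_REGS = ["rcx", "rdx", "r8", "r9"]


def assign_integer_args(arg_names, arg_regs):
    """Map each argument to a register, or to "stack" once registers run out.

    Hypothetical helper for illustration only, not part of offlineasm.
    """
    placement = {}
    for index, name in enumerate(arg_names):
        if index < len(arg_regs):
            placement[name] = arg_regs[index]
        else:
            placement[name] = "stack"
    return placement


if __name__ == "__main__":
    args = ["callFrame", "pc", "instance"]
    print("System V:", assign_integer_args(args, SYSV_ARG_REGS))
    print("Windows: ", assign_integer_args(args, WIN64_ARG_REGS))
```

The same three arguments end up in entirely different registers, and a fifth integer argument already spills to the stack on Windows, which is why any code that hard-codes register assumptions needs a Windows-specific path.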

First steps - getting it to compile

I started out by enabling the compile flag in the CMake config to see what breaks! First up was fixing some compilation errors where MSVC wouldn’t accept some C++ that clang would. This work was started prior to the switch to clang, though there might still be differences between what clang and clang-cl accept.

After that, the WebAssembly.asm offlineasm file wouldn’t compile, as ws1 was unmapped on Windows. This problem went through some iteration. To start with, I used r14 as a placeholder to get it to compile. More on this later.

There were various operators in offlineasm which needed modification to add Intel-syntax output. Thankfully many (all?) of these surfaced as errors in the assembler, because instructions took two different register types and the assembler would validate they were correct. For example - the cvttss2si instruction takes a floating point register and a general purpose register, so if you pass them in the wrong order the assembler will fail.
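The operand order is the main thing that changes between the two syntaxes. As a rough illustration (this is not the offlineasm Ruby code, and `emit` is a hypothetical helper), the same source/destination pair is printed like this:

```python
def emit(mnemonic, source, destination, syntax):
    """Format a two-operand register instruction in AT&T or Intel syntax.

    AT&T puts the source first and prefixes registers with '%';
    Intel puts the destination first and uses bare register names.
    """
    if syntax == "att":
        return f"{mnemonic} %{source}, %{destination}"
    if syntax == "intel":
        return f"{mnemonic} {destination}, {source}"
    raise ValueError(f"unknown syntax: {syntax}")


if __name__ == "__main__":
    # Convert a float in xmm0 to a truncated integer in eax.
    print(emit("cvttss2si", "xmm0", "eax", "att"))
    print(emit("cvttss2si", "xmm0", "eax", "intel"))
```

If an emitter forgets to swap the operands for Intel syntax, the floating point register lands in the destination slot of cvttss2si, which the assembler rejects. Instructions whose two operands are the same register type have no such safety net.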

A hack needed to be put in place to attempt to load a label from a dynamic library. LLInt references a global configuration struct from the WTF (Web Template Framework) library.

Segfault #1

The label referring to the location of the global configuration struct successfully compiled, but at runtime was pointing at garbage. To resolve this, I ended up copying the struct in C++ in JavaScriptCore, and publishing a label to that so offlineasm didn’t need to load labels from a DLL on Windows.

This one took a couple of weeks to track down, hampered by the time taken to set up debugging in Visual Studio, long recompile times, and a number of dead ends.

Segfault #2

When looping over locals on the stack to zero out the values, we’d underflow an integer and try to zero memory out of bounds. This was fixed by updating the NumberOfWasmArgumentJSRs constant to reflect the NUMBER_OF_ARGUMENT_REGISTERS on Windows (caused by the different calling convention).
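The exact loop isn’t reproduced here, so the following only illustrates the failure mode: a count computed in unsigned 64-bit arithmetic wraps around when the subtracted constant is too large. The function names and the numbers (6 versus 4 argument registers, 5 slots) are assumptions chosen for the example.

```python
MASK64 = (1 << 64) - 1


def unsigned_sub64(a, b):
    """Subtract the way a 64-bit register does: the result wraps modulo 2**64."""
    return (a - b) & MASK64


def slots_to_zero(slot_count, argument_register_count):
    """Hypothetical loop bound: slots left to clear after the argument registers."""
    return unsigned_sub64(slot_count, argument_register_count)


if __name__ == "__main__":
    # With the constant matching the platform, the bound is small and sane.
    print(slots_to_zero(5, 4))
    # With a constant that is too large, the bound wraps to a huge value.
    print(hex(slots_to_zero(5, 6)))
```

A loop bounded by the wrapped value walks far past the end of the frame, which is consistent with an out of bounds write and a segfault.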

C++ Assertion Failure

There’s a concept of Wasm “slow path” functions - code that’s defined in C++ which offlineasm calls into for various operations. These are exported C functions which are named “slow_path_wasm_##name”, and take three arguments - the call frame, the program counter, and the Wasm Instance.

When we were calling the “slow_path_wasm_call”, we’d attempt to cast the passed program counter bytecode to a WasmCall and it’d fail the assertion that it is a WasmCall.

I tracked this down to using the offlineasm cCall4 macro (designed for 4 arguments) to make this function call, instead of the cCall3 macro (designed for 3 arguments). These macros are identical on non-Windows platforms, but under the Windows calling convention they pass their arguments differently, so they aren’t interchangeable.

Segfault #3

Inside the slowPathForWasmCall macro we were saving the wasmInstance pointer into the PB register and restoring it later, but in between we’d call reloadMemoryRegistersFromInstance (which loads the memoryBase and boundsCheckingSize for the associated WebAssembly Memory object):

# Preserve the current instance
move wasmInstance, PB

# ... a bunch of code, including
reloadMemoryRegistersFromInstance(targetWasmInstance, wa0, wa1)

# ... later on we restore the instance from PB
move PB, wasmInstance

On Windows, PB and boundsCheckingSize were both using the csr4 offlineasm register (which maps to r13 on Windows). This meant the wasmInstance pointer was clobbered when we called reloadMemoryRegistersFromInstance. I switched the WebAssembly memory registers (memoryBase, boundsCheckingSize) to use (csr5, csr6) on Windows to resolve this.
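This kind of aliasing can be checked mechanically. The sketch below is not WebKit code: `find_aliases` is a hypothetical helper, and the register originally used for memoryBase (csr3 here) is an assumption, since only the PB and boundsCheckingSize assignments are described above.

```python
from collections import defaultdict


def find_aliases(register_map):
    """Return every physical register that more than one logical name maps to."""
    users = defaultdict(list)
    for logical_name, physical in register_map.items():
        users[physical].append(logical_name)
    return {reg: sorted(names) for reg, names in users.items() if len(names) > 1}


# Windows mapping before the fix (memoryBase's register is assumed).
WINDOWS_BEFORE = {"PB": "csr4", "memoryBase": "csr3", "boundsCheckingSize": "csr4"}
# Windows mapping after moving the memory registers to csr5 and csr6.
WINDOWS_AFTER = {"PB": "csr4", "memoryBase": "csr5", "boundsCheckingSize": "csr6"}

if __name__ == "__main__":
    print("before:", find_aliases(WINDOWS_BEFORE))
    print("after: ", find_aliases(WINDOWS_AFTER))
```

A check like this only helps for names that are meant to be live at the same time, so it would need to be limited to the registers a given macro relies on.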

Divide by zero

The x86_64 div and idiv instructions expect the dividend in the edx:eax register pair, and divide it by the register passed as the operand. The quotient is stored in eax and the remainder in edx.

However the offlineasm register mapping is different on Windows, with edx mapping to t1 instead of t2 in offlineasm. This broke the division and remainder WebAssembly operations (there are signed and unsigned opcodes for both 32 bit and 64 bit). I modified those operations to use the correct registers on Windows.
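A tiny register-level model shows why this surfaces as a divide by zero. It is a simplified sketch rather than the LLInt code: the function names are hypothetical, only 32-bit signed division is modelled, and the only thing that varies is which physical register holds the divisor.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def cdq(regs):
    """Sign-extend eax into edx, overwriting whatever edx held."""
    regs["edx"] = MASK32 if regs["eax"] & 0x80000000 else 0


def idiv32(regs, divisor_reg):
    """Signed divide edx:eax by a register: quotient to eax, remainder to edx."""
    divisor = to_signed32(regs[divisor_reg])
    if divisor == 0:
        raise ZeroDivisionError("idiv by zero")
    dividend = (regs["edx"] << 32) | regs["eax"]
    if dividend & (1 << 63):
        dividend -= 1 << 64
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient  # x86 truncates toward zero
    if not -(1 << 31) <= quotient < (1 << 31):
        raise OverflowError("quotient does not fit in eax")
    remainder = dividend - quotient * divisor
    regs["eax"] = quotient & MASK32
    regs["edx"] = remainder & MASK32


def wasm_i32_div_s(lhs, rhs, divisor_reg):
    """Load lhs into eax and rhs into divisor_reg, then run cdq and idiv."""
    regs = {"eax": 0, "edx": 0, "esi": 0}
    regs["eax"] = lhs & MASK32
    regs[divisor_reg] = rhs & MASK32
    cdq(regs)
    idiv32(regs, divisor_reg)
    return to_signed32(regs["eax"])
```

With the divisor in esi, `wasm_i32_div_s(7, 2, "esi")` behaves as expected. With the divisor in edx, the sign extension overwrites it before the divide runs, so a positive dividend ends up dividing by zero and a negative one divides by -1.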

There’s an open bug for adding static asserts to ensure the register mapping matches the assumptions made here, which was raised when the code was originally written. These static asserts would’ve caught this set of bugs at build time.

Scratch register problems

After starting with r14 as a placeholder, I switched to using r11. This register was reserved for usage by the compiler, and I thought by changing the offlineasm instructions that used it to instead take it as a parameter, I could return control of r11 to the programmer. However it was pointed out in the review that MacroAssemblerX86_64 also used it behind the scenes, so that approach wasn’t viable. It seemed to work for the tests I ran but it’d be brittle moving forward as Windows would differ from other x86_64 ports in this regard, and future changes in the MacroAssembler could break the LLInt on Windows only.

I switched to using one of the remaining callee-save registers, and saved / restored it as appropriate. I wrote about this a little previously when it was working on debug builds, but crashing on release builds.

Wrapping up

The review cycle on this PR took a while - the JavaScriptCore team is relatively small and I created the review during a particularly busy time for them. I gave a talk at the WebKit Contributors Meeting about this work, and was able to meet most of the people who had been helping and encouraging me along the way. It was a great way to wrap up the last week in my batch at the Recurse Center.

About halfway through this work I upgraded my computer, as the 60-75 minute full build times had become unbearable. Compute has never been cheaper, and I was able to pick up a bundle with an Intel Core i9-12900K, motherboard and 32GB of RAM from Micro Center for $400. That brought my full build times down to 15-20 minutes (faster if I don’t use the computer while it’s building), which is a major improvement but still slower than I’d like. I should’ve upgraded weeks earlier; I wasted a lot of time waiting for builds.

I found a concurrency bug in WebAssembly LLInt compilation when testing this work. This is likely a bug that’s present for all platforms, so it’s good that this work was able to surface that.

Repeating what I put at the top: I’m currently exploring ways to fund continuing my work to improve the Windows port for WebKit. I’m also thinking about exploring other avenues for improving web engine diversity (and web platform funding diversity). Get in touch if either of those sounds interesting to you, I’d love to chat.