
Discussion (9 Comments) · Read Original on HackerNews

nxobject•about 2 hours ago
A follow-up question I'd like to ask the author: if I understand correctly, one of the ideas is that you can write "high-level", almost platform-independent translations like:

    regs.eax = 3;
    regs.eax = add(regs.eax, 4);
    windows_api();  // some native implementation of the API that was called
and rely on optimizers to do as much "lowering" or specialization to a platform as possible. I'd like to ask: how reliable have optimizers been at getting those results (e.g. how much fiddling with various -Oxxx flags was needed)?
evmar•about 1 hour ago
It depends on what results you're expecting! Relative to an interpreter, even the simplest unoptimized translation of that code is already significantly more efficient at runtime.

In the post there is a godbolt link showing a compiler inlining a simple add, but a real implementation of x86 add would be much more complex.

I have read about other projects where the authors put some effort into getting exactly the machine code they wanted out. For example, maybe you want the virtual regs.eax to actually live in a machine register, and one way you might be able to convince a compiler to do that is by passing it around as a function parameter everywhere instead of as a struct element. I have not investigated this myself.

jcranmer•about 2 hours ago
So I've definitely played around with a lot of these ideas, because the thing that I'm trying to emulate/port/reverse engineer is a 16-bit Windows application, which coincidentally means that a lot of the existing infrastructure just keels over and dies in supporting it.

I originally started the emulation route, but gave that up when I realized just how annoying it was to emulate the libc startup routine that invoked WinMain, and from poking around at the disassembly, I really didn't need to actually support anything before WinMain. And then poking further, I realized that if I patched the calls to libc, I only had to emulate a call to malloc at the, well, malloc level, as opposed to implementing the DOS interrupt that's equivalent to sbrk. (And a side advantage to working with Win16 code is that, since every inter-segment call requires a relocation, you can very easily find all the calls to all libc methods, even if the disassembler is still struggling with all of the jump tables.)

After realizing that the syscalls to modify the LDT still exist and work on Linux (they don't on Windows), I switched from trying to write my own 386 emulator to natively executing all the 16-bit code inside a 64-bit process, relying on the existing relocation mechanisms to insert the thunking between 16-bit and 64-bit code [1]. The side advantage was that this also made it very easy to just call individual methods given their address rather than emulating from the beginning of main, which also lets me inspect what those methods are doing a lot better.

Decompilation I've tried to throw in the mix a couple of times, but... 16-bit code really just screws with everything. Modern compilers don't have the ontology to really support it. Doing my own decompilation to LLVM IR runs into the issue that LLVM isn't good at optimizing all of the implemented-via-bitslicing logic, and I still have to write all of the patterns manually just to simplify the code for "x[i]" where x is a huge pointer. And because it's hard to reinsert recompiled decompiled code into the binary (as tools don't speak Win16 file formats anymore), it's difficult to verify that any decompilation is correct.

[1] The thunking works by running 16-bit code in one thread and 64-bit code in another thread and using hand-rolled spinlocks to wait for responses to calls. It's a lot easier than trying to make sure that all the registers and the stack are in the appropriate state.

evmar•38 minutes ago
[post author] I went down some similar paths in retrowin32, though 32-bit x86 is likely easier.

I was also surprised by how much goop there is between startup and main. In retrowin32 I just implemented it all, though I wonder how much I could get away with not running it in the Theseus replace-some-parts model.

I mostly relied on my own x86 emulator, but I also implemented the thunking between 64-bit and 32-bit mode just to see how it was. It definitely involved some asm, but once I wrapped my head around it, it wasn't so bad; check out the 'trans64' and 'trans32' snippets in https://github.com/evmar/retrowin32/blob/ffd8665795ae6c6bdd7... for, I believe, all of it. One reframing that helped me (after a few false starts) was to put as much code as possible in my high-level language and use asm only to bridge to it.

charcircuit•about 2 hours ago
Have you tried just asking an LLM to reverse it to C and then work from there?
mmastrac•about 4 hours ago
I think "recompiler" is the accepted term in the art, is it not?
evmar•30 minutes ago
Looking through Wikipedia at least, it's not exactly clear to me. They have separate pages for 'binary recompiler' and 'binary translation' that link to each other, and the latter is more about going between architectures (which is the main objective here).
mysterydip•about 4 hours ago
That’s a really clever idea/solution I hadn’t thought of before, and yet it makes “of course, duh” sense as well.
pema99•about 3 hours ago
See also the author's previous project, retrowin32: https://github.com/evmar/retrowin32