Mathew J. Hostetter <mat@ardi.com>
Clifford T. Matthews <ctm@ardi.com>
Executor is a commercial Macintosh emulator that uses no software from Apple, but is still able to run much 680x0 based Macintosh software faster on Pentiums than the same software runs on 680x0 based Macs. This paper contains some implementation details, including descriptions of Executor's synthetic CPU, graphics subsystem and debugging environment. Portability issues, current limitations and future plans are also presented.
Executor is a commercial emulator that allows PCs to run many Macintosh applications. Executor does not require Macintosh ROMs or a Macintosh System file and contains no Appple code itself. Executor was written entiredly by engineers without Macintosh backgrounds who have not disassembled any of Apple's ROMs or System file.
Because Executor was written strictly from publicly available documentation (Inside Macintosh, Tech. Notes, etc.), programs which make use of undocumented features of MacOS may fail under Executor. Furthermore, there are some portions of MacOS that we haven't implemented yet. Executor is sufficiently large that there are probably bugs in some of our code as well. We realize these are major limitations, but this paper is primarily concerned with implementation details that are interesting to our fellow programmers as opposed to feature sets and limitations which are of more concern to end users and our marketing department.
Our goal is for Executor to be accurate, fast and portable. Beyond that, completeness is a secondary issue.
Accuracy means that each subsystem that we implement should behave exactly according to the functional specs for the subsystem that we've derived from a combination of reading documentation, writing test cases and running programs under Executor.
Fast is harder to qualify. As programmers we like to use advanced techniques that will result in programs running under Executor as quickly as possible. Unfortunately, we have a limited number of engineer hours in a week and most engineering time is spent implementing new subsystems or finding and fixing subtle incompatibilities. We're proud of the speed that we've obtained so far, but we know that we can do better in the future.
Portability is the ability to support multiple platforms from the same source base. A platform is a combination of CPU, operating system and graphics device or windowing system. Executor currently supports Intel 80[3456]86 and compatible CPUs, Motorola m680[34]0 CPUs, the operating systems DOS, Linux and NEXTSTEP and can interact with VGA, SVGA, Display PostScript and X-Windows. To get the best performance on some architectures we do use architecture specific code, but we also write portable versions to be used where the platform specific versions can't be. Although not supported as a product, Executor was ported to DEC's Alpha, but since ARDI has no Alpha and DEC lost interest, the Alpha port is no longer current. Although not recently, ROMlib, ARDI's rewrite of the MacOS OS and Toolbox routines, has been ported to a wide variety of platforms, including MIPS , m88k , Clipper, IBM RT, SPARC and even VAX based systems.
Those three design goals have led us in the direction of dynamic code generation for both the 680x0 emulation and for our blitter. In both cases we use high level descriptions of what we want accomplished and then use special purpose tools at compile time to translate these high level descriptions into constructs that we can then use at run time.
High level descriptions are less error prone, allowing us to document the semantics that we wish to see in our synthetic CPU or blitter using a special purpose language that is directly suited to the task at hand, rather than a general purpose language like C or the traditional language of speed freaks -- assembler.
High level descriptions also lend themselves to portability. We have our tools generate portable constructs for the general case and, with a little more programming effort, faster architecture specific constructs for the architectures that we consider most important.
Since the conversion from high level description to useful construct takes place at compile time, there is no need to worry about the CPU cycles spent doing the mapping. This allows us to design our code by thinking: "At runtime, what would be the optimal instruction sequence to perform a specific task?" Once we know the answer to that question we can ask: "How can we represent at a high level, the task is being accomplished by that optimal set of instructions?". Then, the final question is "Given what we want to generate and how we want to represent it, what does the compile time mapping look like?". The entire time we're pondering those three questions, we're keeping accuracy, portability and efficiency in mind.
Syn68k is the name of the synthetic CPU that Executor 2 uses. Syn68k is both highly portable and fast. The portable core of Syn68k, which works by dynamically compiling 680x0 code into an efficient interpreted form, was designed to run on all major CPU's. On supported architectures, Syn68k can also translate 680x0 code into native code that the host processor can run directly.
Syngen analyzes a lisp-like file describing the bit patterns and semantics of the 680x0 instruction set and produces lookup tables and C code for the runtime system to use. The code and tables generated by syngen depend somewhat on the characteristics of the host processor; for example, on a little endian machine it is advantageous to byte swap some extracted 680x0 operands at translation time instead of at runtime.
The 680x0 description file can describe multiple ways to emulate any particular 680x0 opcode. The runtime system looks at what CC bits are live after the instruction and chooses the fastest variant it can legally use. In Figure 1, we have two CC variants of lsrw; one computes no CC bits, and the other computes all of them.
(defopcode lsrw_ea (list 68000 amode_alterable_memory () (list "1110001011mmmmmm")) (list "-----" "-----" dont_expand (assign $1.muw (>> $1.muw 1))) (list "CN0XZ" "-----" dont_expand (list (assign ccx (assign ccc (& $1.muw 1))) (ASSIGN_NNZ_WORD (assign $1.muw (>> $1.muw 1)))))) Figure 1. Syn68k description of lsrw
The 680x0 description file can also specify which 680x0 operands should be "expanded" to become implicitly known by the corresponding synthetic opcode. For example, fully expanding out "addl dx,dy" would result in 64 synthetic opcodes, one for each combination of data register operands. This results in smaller and faster synthetic opcodes at the expense of increasing the total number of synthetic opcodes. To conserve space, we only "expand out "common 680x0 opcodes. On host architectures where we can compile to native code, we don't waste space by "expanding out" common synthetic opcodes.
Our interpreted code consists of contiguous sequences of "synthetic opcodes" and their operands. Syngen can generate ANSI C, but when compiled with GCC it uses C language extensions that make synthetic opcodes be pointers to the C code responsible for interpreting that opcode. This "threaded interpreting" entirely eliminates switch dispatch and loop overhead.
For the 80x86 architecture, Syn68k supports an optional architecture-specific native code extension that tries to generate native code whenever possible. In those rare cases when it cannot, it reverts to our interpreted code. Since Syn68k supports both native and synthetic code, the runtime system automatically inserts gateways between the two whenever there is a transition.
Three major problems make translating 680x0 code to 80x86 code difficult:
On the other hand, several factors make the job easier than it would be for RISC machines:
The toughest problem is the lack of registers. On 32-register RISC architectures it's easy to allocate one RISC register for each 680x0 register, but on the 80x86 a different approach is needed. The obvious solution is to perform full-blown inter-block register allocation, but we fear that using traditional compiler techniques would be unacceptably slow.
For now, we have adopted a simple constraint: between basic blocks, all registers and live CC bits must reside in their canonical home in memory. Within a block, anything goes. So what liberties does Syn68k take within a block?
The 80x86 register set is treated as a cache for recently used 680x0 registers, and the 80x86 CC bits are used as a cache for the 680x0 CC bits. At any particular point within a block, each 680x0 register is either sitting in its memory home or is cached in an 80x86 register, and each live 680x0 CC bit is either cached in its 80x86 equivalent or stored in its memory home. Cached registers may be in canonical form, may be byte swapped, may have only their low two bytes swapped, or may be offset by a known constant from their actual value.
Each 680x0 instruction can require that 680x0 registers be cached in particular ways. For example, {\f22 movel d0, mem} requires d0 to be cached in big endian byte order. The compilation engine generates the minimal code needed to satisfy those constraints and then calls a sequence of routines to generate the native code. As each 680x0 instruction is processed, each 680x0 register's cache status is updated. Dirty registers are canonicalized and spilled back to memory at the end of each block (or when we run out of 80x86 registers and we need to make room).
We allow 680x0 registers to be cached with varying byte orders and offsets so that we can perform the optimizations of lazy byte swapping and lazy constant offsetting. If the 680x0 program loads a register from memory and then ends up writing it out later, we avoid unnecessary byte swaps by not canonicalizing the value immediately. Lazy constant offsetting mitigates the overhead of postincrement and predecrement side effects. Figure 2 is an example of lazy constant offsetting.
pea 0x1 pea 0x2 pea 0x3 pea 0x4 becomes this 80x86 code: movl _a7,%edi movl $0x01000000,-4(%edi) ; "push" big-endian constant movl $0x02000000,-8(%edi) movl $0x03000000,-12(%edi) movl $0x04000000,-16(%edi) ... <more uses of a7 may follow, and they'll use %edi> subl $16,%edi movl\tab $edi,_a7 ... Figure 2. Lazy Constant Offsetting
As mentioned above, we use the 80x86 condition code bits as a cache for the real 680x0 CC bits. Although live cached CC bits are occasionally spilled back to memory because some 80x86 instruction is about to clobber them, this trick almost always works.Using 80x86 CC bits, we can frequently get away with extremely concise code sequences; for example, a 680x0 compare and conditional branch becomes an 80x86 compare and conditional branch.
Like most dynamically compiling emulators, Syn68k doesn't detect self-modifying code; the overhead is too high. Fortunately, self-modifying programs don't work on the real 68040 either. We rely on the program making explicit system calls to flush the caches whenever 680x0 code may have been modified or created. Some programs (like HyperCard) flush the caches very often, which can cause real performance headaches if code is continuously recompiled. We have solved this problem by checksumming 680x0 blocks as they are compiled and only decompiling blocks which fail their checksums. This optimization alone sped up some HyperCard stacks by a factor of three or so.
Figure 3 contains two sample 680x0 code sequences from real applications, and the 80x86 code that Syn68k generates for them. We chose these code sequences specifically to showcase several of the techniques we use, so you shouldn't use them as a substitute for benchmarks. Not all 680x0 code translates as well as these examples do, but these examples are far from exotic.
Example 1 (Solarian):
680x0 code: addqb #1,a4@(1) movel #0,d0 moveb a4@,d0 swap d0 clrw d0 swap d0 asll #2,d0 lea a5@(-13462),a0 addal d0,a0 moveal a0@,a0 movel #0,d0 moveb a4@(1),d0 cmpw a0@,d0 bcs 0x3fffee2} 80x86 code: movl _a4,%edi ; addqb #1,a4@(1) addb $0x1,0x1(%edi) xorl %ebx,%ebx ; movel #0,d0 movb (%edi),%bl ; moveb a4@,d0 rorl $0x10,%ebx ; swap d0 xorw %bx,%bx ; clrw d0 rorl $0x10,%ebx ; swap d0 shll $0x2,%ebx ; asll #2,d0 movl _a5,%esi ; lea a5@(-13462),a0 leal 0xffffcb6a(%esi),%edx addl %ebx,%edx ; addal d0,a0 movl (%edx),%edx ; moveal a0@,a0 xorl %ebx,%ebx ; movel #0,d0 movb 0x1(%edi),%bl ; moveb a4@(1),d0 bswap %edx ; cmpw a0@,d0 movw (%edx),%cx rorw $0x8,%cx cmpw %cx,%bx movl %edx,_a0 ; <spill dirty 68k movl %ebx,_d0 ; registers back to memory< jb 0x6fae0c ; bcs 0x3fffee2 jmp 0x6faf0c ; <go to "fall through" code>Example 2 (PageMaker):
680x0 code: movel #0,d2 moveb d0,d2 lslw #8,d0 orw d0,d2 movel d2,d0 swap d2 orl d2,d0 movel a0,d2 lsrb #1,d2 bcc 0x3fffed4} 80x86 code: xorl %ebx,%ebx ; movel #0,d2 movl _d0,%edx ; moveb d0,d2 movb %dl,%bl shlw $0x8,%dx ; lslw #8,d0 orw %dx,%bx ; orw d0,d2 movl %ebx,%edx ; movel d2,d0 rorl $0x10,%ebx ; swap d2 orl %ebx,%edx ; orl d2,d0 movl _a0,%ecx ; movel a0,d2 movl %ecx,%ebx shrb %bl ; lsrb #1,d2 movl %ebx,_d2 ; <spill dirty 68k movl %edx,_d0 ; registers back to memory> jae 0x3b734c ; bcc 0x3fffed4 jmp 0x43d48c ; <go to "fall through" 68k code>Figure 3. 680x0 -> 80x86 examples
The DOS world is one of standards. Many standards. Standards made by engineers who were even more short-sighted than the folks who brought you ROM85, only to be replaced by SysEnvirons which was then replaced by Gestalt. The first color graphics adapter for the PC (CGA) was replaced with EGA, which was then replaced by VGA, which eventually gave way to several different Super Video Graphics Array (SVGA) cards.
SVGA cards have a couple of properties that make them less than perfect targets for the output of Macintosh emulators. First, the default is for SVGA's video memory to only be mapped into the PC address space through a 64k window (or bank). If you want to display 640x480x8 bits you need to write 64k of information to the 64k screen address range, then tell the video card that you want that same address to represent a different 64k chunk of the screen, then you write to that address range again, then you switch banks again, and so forth.
The second major complication is that under DPMI, the address space that contains the SVGA video memory is not in the same address space that a 32-bit application uses. For those of you used to programming in a flat address space, it might be hard to believe that you need special machine language address space overriding prefixes to access screen memory, but under DPMI 0.9 (which is the version of DPMI that Microsoft supports; we wouldn't have to do this under 1.0) "selector" overrides really are necessary.
A Region is a data structure that describes a set of pixels. Regions can be created by the application by calling various MacOS toolbox routines. In addition the toolbox routines themselves sometimes create Regions for their own purposes.
A blitter is a set of software or hardware which takes sets of bits, representing pixels, and combines them with other sets of bits in a variety of different ways. A Region blitter is a blitter that processes pixels by Regions (rather than by rectangles or rectangle lists).
One way to write a simple Region blitter is to start with a subroutine that parses the start/stop pairs of a Region scanline and draws the corresponding pixels. This subroutine is then called once for each row of pixels to be displayed.
Unfortunately, this approach is slow since each scanline gets re-parsed every time it is drawn. The Region for a 300 pixel tall rectangle consists of a single scanline with a repeat count of "300"; this "simple Region blitter" will parse that scanline 300 times! That's a lot of redundant work.
There are many possible ways to get away with parsing each scanline only once. One approach is to convert the start/stop pairs into a bit mask where the bits in the mask correspond to the bits in the target bitmap that are to be changed. The inner blitting loop then becomes an exercise in bitwise arithmetic. In C, such a loop might look something like this:
for (x = left; x < right; x++) dst[x] = (dst[x] & ~mask[x]) | (pattern_value & mask[x]);That's not bad, but we can do better.
Using an explicit bit mask array is unnecessarily slow in the common case of filling a rectangle. For a rectangular Region, mask[x] is usually all one bits, making the bit munging a waste of time. And even when the masks are never solid (e.g. when drawing a thin vertical line), this technique is still unnecessarily slow. As it turns out, even the cycles the CPU spends loading mask bits from memory are unnecessary. Furthermore, even if we were satisfied with the level of performance that C code like the above provides, we couldn't use it on a stock SVGA system because it wouldn't know how to access the SVGA portion of memory.
Executor's blitter uses the techniques of partial evaluation and dynamic code generation to eliminate redundant work and also give us access to SVGA memory. On the 80x86 each scanline is quickly translated into executable code, and that code gets executed once each time the scanline needs to be drawn. On non-80x86 platforms, each scanline is compiled into threaded code which is executed by a machine-generated interpreter to draw the scanlines.
Before describing how the dynamic compilation process works, let's take a look at an example. Consider the case where a 401x300 rectangle is to be filled with white pixels (pixel value zero on the Macintosh). This might happen, for example, when erasing a window. Furthermore, let's assume that the target bitmap has four bits per pixel, since that's somewhat tricker to handle than 8 bits per pixel. Figure 4 shows the subroutine that Executor dynamically generates to draw this rectangle on a Pentium.
loop: andl $0xff,0x50(%edi) ; clear leftmost 6 boundary pixels addl $0x54,%edi ; set up pointer for loop movl $0x31,%ecx ; set up loop counter rep stosl ; slam out 49 aligned longs andl $0xffff0f00,0x0(%edi) ; clear 3 right boundary pixels addl $0x28,%edi ; move to next row decl %edx ; decrement # of rows left jne loop ; continue looping if appropriate ret ; we're done! Figure 4. Dynamically generated blitting code
This code, when called with the proper values in its input registers, will draw the entire rectangle. Note how the inner loop is merely a
"{\f22 rep ; stosl}"...it doesn't get much more concise than that! The astute reader will know that on certain 80x86 processors "rep ; stosl" is not the fastest possible way to set a range of memory. This is true, but because our code generation is dynamic, in the future we can tailor the specific code sequence generated to the processor on which Executor is currently running. The blitter already does this when it needs to emit a byte swap; on the 80486 and up we use the "bswap" instruction, and on the 80386 (which doesn't support "bswap") we use a sequence of rotates.
One thing you may notice in this example is that the bit masks used to clear the boundary pixels look strange. They are actually correct, since 80x86 processors are little endian.
Unlike some processors, such as the 68040, the 80x86 instruction and data caches are always coherent. Consequently, no cache flushes need to be performed bef ore the dynamically created code can be executed.
Figure 5 contains another example, this time drawn from a real application. The program "Globe", by Paul Mercer, draws a spinning globe on the screen as fast as it can. Each "globe frame" is a 128x128 Pixmap. Here is the code that Executor generates and runs when Globe uses CopyBits to transfer one frame to the screen at 8 bits per pixel.
Again the inner loop is very tight, just a
"rep ; movsl"
this time.
No matter how fast the generated code, if Executor spends too much time generating that code then any speedup will be negated by the increased time required for dynamic compilation. Consequently, the dynamic compilation from Region to 80x86 code needs to be fast. We solved this problem with a "meta-assembler" written in Perl.
Whereas an assembler tells a computer how to translate assembly instructions into machine code, our meta-assembler tells the computer how to generate tiny translators. These translators will then be used to translate pixel manipulation requests into machine code. Another way of looking at it is that the meta-assembler generates code that generates code. This meta-assembly process is done only once: when Executor is compiled.
The blitter operates on aligned longs in the destination bitmap. As the compilation engine strides through the scanline's start/stop pairs from left to right, it identifies which bits in each long are part of the Region and determines which of several pixel manipulation requests to issue to the tiny translators that were created by the meta-assembler.
loop: movl $0x20,%ecx ; set up loop counter for 32 longs rep movsl ; copy one row (128 bytes) addl $0xffffff00,%esi ; advance to previous src row addl $0xfffffd00,%edi ; advance to previous dst row decl %edx ; decrement # of rows remaining jne loop ret Figure 5. Blitting code from Globe
The particular case encountered determines which function pointer to load from a lookup table corresponding to the current drawing mode. For example, the "patCopy" drawing mode has one table of function pointers, "patXor" another. There are also some special case tables for drawing patterns that are either all zero bits or all one bits.
The main blitter doesn't care what drawing mode is being used, since it does all mode-specific work through the supplied function pointer table.
Each function pointer points to a function that generates 80x86 code for the appropriate case. For example, one function generates code for a "patCopy" to three contiguous longs, one generates code for "patXor" only to certain specified bits within one long, etc.
The blitter compilation engine marches through the Region scanline from left to right, calling code generation functions as it goes. The generated code is accrued into a 32-byte aligned buffer on the stack. In this way, the blitter constructs a subroutine to draw the Region.
The compilation engine isn't very complicated. The tricky part is the numerous generation subroutines, which need to be fast since they are called so often and need to be easy to write since there are so many of them. For each drawing mode there's one for each case the compilation engine cares about. For pattern drawing modes, there are separate specialized subroutines for cases like patterns that can be entirely expressed in one 32-bit value ("short/narrow") patterns, patterns which can be expressed as one 32-bit value for each row, but which vary per row ("tall/narrow"), as well as "wide" variants of both. Beyond that, there are some versions specialized for 80486 and higher processors (which have the "bswap" instruction).
Generating fast and robust code generators is where the Perl meta-assembler comes into play.
The meta-assembler takes as input an assembly language template, and generates as output Pentium-scheduled assembly code that outputs an 80x86 binary for the input template. This process only takes place when Executor is compiled. Got it? This can be a little confusing, so a few examples are in order.
Here is perhaps the simplest template:
@meta copy_short_narrow_1 movl %eax,@param_offset@(%edi) @endmeta
This template describes what should be done when the blitter wants to write one long to memory. The meta-assembler processes that into this 80x86 assembly code which is to be called by the blitter compilation engine:
.align 4,0x90 _xdblt_copy_short_narrow_1: movw $0x8789,(%edi) movl %eax,2(%edi) addl $6,%edi ret
The subroutine that the meta-assembler has produced above, when executed, will generate the movl instruction (i.e. the movl instruction in the template) followed by its argument. The meta-assembler has deduced that "movl" in the example template is 80x86 opcode 0x8789.
Let's take a look at a more complicated template. This template handles the case where we want to bitwise OR a pattern to the destination bitmap, and the number of longs to transfer equals zero mod 4 (e.g. if the blitter wants to OR 36 longs to memory):
@meta or_short_narrow_many_mod_0 addl $@param_offset@,%edi movl $@param_l_cnt_div_4@,%ecx 1: orl %eax,(%edi) orl %eax,4(%edi) orl %eax,8(%edi) orl %eax,12(%edi) addl $16,%edi decl %ecx jnz 1b @lit leal (%eax,%edx,4),%ecx @lit addl %ecx,edi_offset @endmeta
The meta-assembler compiles that to this:
.align 4,0x90 _xdblt_or_short_narrow_many_mod_0: movw $0xC781,(%edi) movl %eax,2(%edi) movl $0x47090709,11(%edi) movb $0xB9,6(%edi) movl $0x8470904,15(%edi) movl $0x754910C7,23(%edi) movl $0x830C4709,19(%edi) movb $0xEF,27(%edi) movl %edx,%ecx shrl $2,%ecx movl %ecx,7(%edi) addl $28,%edi leal (%eax,%edx,4),%ecx addl %ecx,edi_offset ret
This mechanically generated subroutine generates the executable 80x86 binary for the "or_short_narrow_many_mod_0" template. It gets called by the blitter compilation engine when it needs code to OR a bunch of longs to memory.
The output of the meta-assembler isn't meant for human consumption. As such, the output contains a hodge-podge of magic numbers ({\f22 0x47090709, 0xB9, 0x8470904,} etc.). These numbers are fixed machine code values corresponding to opcodes, constant operands, and other values.
Even though this subroutine is longer than the previous example, it still doesn't take very long to execute. Furthermore, it only gets called when the blitter has determined that many longs are to be ORed to memory, so the time taken actually blitting to memory will typically dwarf the time taken to execute these 15 code generation instructions.
The meta-assembler is a Perl script that works by running numerous syntactically modified versions of the assembly template through "gas", the GNU assembler, and examining the output bytes to discover which bits are fixed opcode bits and which bits correspond to operands. Once it has figured out what goes where, it generates 80x86 assembly code which writes out the constant bytes and computes and writes out the operand bytes. That code is run through a simple Pentium instruction scheduler and the meta-assembler is done. This entire process is, of course, done only once, when Executor is compiled.
Although the meta-assembler-based blitter works only on 80x86 processors, Executor itself can run on non-Intel processors. On other CPUs (such as the 68040 used in the NeXTstation) Executor's blitter works somewhat differently.
The basic idea is still the same: translate Region scanlines into an efficient form once and then use that effi