Lisp Machine Inc. K-machine:


The Deffenbaugh, Marshall, Powell, Willison architecture
as remembered by Joe Marshall

Abstract:

The LMI K-machine was the last processor designed and built by Lisp Machine, Inc. Unlike other Lisp Machines, the K-machine is not descended from Tom Knight's original CONS architecture; the K-machine is an original design. This paper provides an overview of the more interesting elements of the K-machine architecture.

Introduction

The LMI K-machine was the last processor designed and built by Lisp Machine, Inc. The K-machine was designed in late 1985 and early 1986, and the first instructions were executed December 31, 1986. The hardware was significantly debugged by April 1987 and an effort to port the legacy Lisp Machine code to the K-machine was underway. Financial difficulties kept the K-machine from reaching the marketplace. Lisp Machine's successor, Gigamos, was beset by legal difficulties which again prevented marketing of the K-machine processor. Nonetheless, several K-machine board sets were produced and debugged before the demise of Lisp Machine, Inc.

The K-machine processor is notable for several reasons: Unlike other Lisp Machines, the K-machine is not descended from Tom Knight's original CONS architecture; the K-machine is an original design. The K-machine is designed to run Lisp, and Lisp only. As a result, the performance of the K-machine on the Gabriel benchmarks is significantly higher than that of the competing products of the mid-to-late 1980s, among them the TI Explorer chip, the Symbolics Ivory, and Lucid Common Lisp. In fact, the K-machine was competitive with Lisp implementations on the Cray 1 and the IBM 3670. Since the project was kept relatively secret during its development, very few people know the details of its architecture.

The authors of this paper were the primary architects of the original K-machine as designed and built by Lisp Machine, Inc. The hardware implementation of the architecture was primarily designed by Robert Powell, Bruce Deffenbaugh, and Kent Hoult. Subsequent minor hardware changes were made by the staff of Gigamos. Many people were involved in the project.

This paper provides an overview of the more interesting elements of the K-machine architecture. The K-machine contained its share of mundane elements such as instruction caches, virtual memory hardware, etc. Details about the mundane hardware are beyond the scope of this paper.


Architectural Principles

In any processor design there are fundamental "principles" from which the rest of the architecture derives. For the LMI K-machine, these were as follows:

The final data layout is a 6-bit tag at the most-significant end of the (4-byte) word, and a 26-bit word pointer at the least-significant end. This makes a 2^26-word (64 MWord) address space, or a 1/4 GB address space. Again, this is smaller than Symbolics, but nothing to sneeze at. The machine is correct-endian, i.e. little.
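
To make the layout concrete, here is a sketch in Common Lisp of how the fields could be picked apart; the names and accessors are illustrative, not the hardware's own nomenclature:

(defconstant +tag-byte+     (byte 6 26))  ; 6-bit tag at the most-significant end
(defconstant +pointer-byte+ (byte 26 0))  ; 26-bit word pointer at the least-significant end

(defun word-tag (word)
  (ldb +tag-byte+ word))

(defun word-pointer (word)
  (ldb +pointer-byte+ word))

(defun make-word (tag pointer)
  (dpb tag +tag-byte+ pointer))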


The Pipeline

The K-machine has a 4-stage pipeline. While this is transparent to the end user, and mostly transparent to the system programmer, there are occasions where the pipeline is exposed. Notes about this will be interspersed below.

The clock to the machine registers runs at twice the speed of the main processor clock, alternating between read and write cycles. This means that each instruction can both read from the registers and write back to them.

The four pipeline stages are as follows:

  1. Fetch
    The instruction is fetched from the instruction cache.


  2. Read
    The instruction source data are fetched from the register file, functional input bus, or instruction register (for immediate values).


  3. ALU
    The data are passed through the ALU and latched at the output register.


  4. Write
    The value in the output register is written back to the register file or to a functional destination.

An instruction may be aborted at any time before the output register value is written.

A four-stage pipeline generally means that branch instructions are not taken immediately. Usually, there are "branch delay instructions" that are unconditionally executed after the branch is specified. The K-machine branch delay is unusual in that the condition codes are set in the instruction before the conditional branch instruction and the branch target is specified in the instruction after the conditional branch. Thus, when the condition codes are generated by the ALU, the conditional branch is in the instruction register and the machine can then decide whether to follow the branch at this point. This ameliorates the effect of branch delay by spreading the branch across several instructions. Of course, the compiler hides the details of conditional branches.


The Primary Data Paths

Within the processor itself, the data paths are 33 bits wide. The extra "boxed" bit allows us to manipulate 32-bit non-pointer data (such as floating point numbers) directly. The 33rd bit is constructed on the fly from context when a pointer is read into the machine, and appropriate action is taken to "box" non-pointer data when it is written to memory.

The register file

Unlike the CONS derivatives, but much like a RISC machine, the K-machine has a register file consisting of 4096 33-bit words. This register file is duplicated to allow 2 simultaneous reads from different registers. Writes to the register file always update both copies. The output of the register file is latched and presented only to the ALU. In order to use the contents of a register as a virtual address or to write it to memory, it must be passed through the ALU. This incurs a one-cycle delay.

The register file is both read and written during each machine instruction, and thus requires fast RAM. Static RAM is used both for speed, and to allow single stepping of the K-machine hardware for debugging purposes.

The register file is addressed by the function calling hardware. This will be described in detail below.

The ALU

Integer unit

The K-machine was designed to use the AMD 29332 ALU. This is a 32-bit ALU with quite a bit of functionality, including a barrel shifter (used for tag extraction) and 8, 16, 24, and 32-bit modes of operation. The decision was made to make fixnums 24 bits wide to take advantage of the ALU's natural ability to handle this width. This was a step backward, as the LMI LAMBDA had 25-bit fixnums, but the gain in performance was considered more important.

Floating point unit

In parallel with the integer ALU (the AMD 29332) are a floating point adder and a floating point multiplier. The output register, which holds the value to be written back to the registers, may receive its value from any of these sources.

Type checking unit

Neither the CADR nor the LMI LAMBDA had hardware for checking the types of operands in parallel with the actual computation done by the ALU. It is my understanding that the Symbolics 3600 has a type checking unit, but we are unfamiliar with it. We invented a simple, but effective, type checking hardware module that operates in parallel with the arithmetic units. This hardware consists of a 256Kbit x 1bit RAM. This RAM is addressed by 7 bits each from the left and right operands of the ALU (the boxed and tag bits), two bits from the instruction to specify the type check operation (for instance, operands must be boxed fixnums, or the left operand is an aggregate data structure and the right operand is a fixnum), and two bits from a mode register (this allows the program to change type checking strategies when garbage collecting, for instance). These 7+7+2+2 bits are concatenated to form an 18-bit (2^18) address for the type check RAM. The single-bit output indicates whether the operation should succeed or cause a type check error. This turns out to be an elegant, simple, and inexpensive mechanism for parallel type checking.
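
To make the addressing concrete, here is a sketch in Common Lisp of how the 18-bit address could be assembled. The field widths follow the text, but the ordering of the fields is an assumption; the actual wiring order is not recorded:

(defun type-check-address (left-bits right-bits check-op mode)
  ;; LEFT-BITS and RIGHT-BITS are the 7 bits (boxed bit plus 6-bit tag)
  ;; taken from each ALU operand.  CHECK-OP is the 2-bit field from the
  ;; instruction; MODE is the 2-bit field from the mode register.
  (logior (ash mode 16)
          (ash check-op 14)
          (ash left-bits 7)
          right-bits))

(defun type-check-ok-p (type-check-ram left-bits right-bits check-op mode)
  ;; The single-bit output decides between success and a type-check trap.
  (= 1 (sbit type-check-ram
             (type-check-address left-bits right-bits check-op mode))))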

The type check RAM is loaded at boot time by presenting all possible combinations to the RAM (by sending synthesized data through the ALU). Traps are turned off during programming, and a special control register toggles the write line at the appropriate time as the data flies by. It is tricky, but it avoids the necessity of adding special data paths in order to program the type checking hardware. By implementing the checking in a RAM, we can experiment with which types are represented, and even add new hardware tags on the fly.

Boxed unit

A small piece of logic computes the boxed bit for the ALU result. The instruction can specify that the result is forced to be boxed, forced to be unboxed, is to have the same boxedness as the left-hand operand, or is to have the opposite boxedness as the left-hand operand (this last one has no discernible use, but it is what the hardware does).
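
In other words, the boxed bit of the result is a four-way multiplexer controlled by a 2-bit field; a sketch, with an invented encoding for the field:

(defun result-boxed-bit (select left-boxed)
  ;; SELECT is the 2-bit field from the instruction; LEFT-BOXED is the
  ;; boxed bit of the left-hand operand.  The numeric encoding of SELECT
  ;; is a placeholder, not the actual instruction encoding.
  (ecase select
    (0 1)                        ; force the result boxed
    (1 0)                        ; force the result unboxed
    (2 left-boxed)               ; copy the left operand's boxedness
    (3 (logxor 1 left-boxed))))  ; invert the left operand's boxedness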

Interconnection

As mentioned previously, the output of the register file is latched and presented to the inputs of the ALU. The output of the ALU is similarly latched (by the output register) and written back to the register file in the next instruction. Thus the primary data paths in the machine are composed of a two-stage pipeline.

A set of address comparators enable a pass-around path when the result of an ALU operation is needed immediately for the next computation. This relieves the compiler from part of the instruction scheduling burden. In general, the compiler can ignore the effect of the pipeline except when functional sources and destinations are involved.
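
In modern terms, the comparators implement operand forwarding; a sketch of the selection logic, with invented names:

(defun read-source (reg-address previous-dest output-register register-file)
  ;; If the previous instruction wrote the very register this instruction
  ;; reads, take the value from the output register (the pass-around path)
  ;; rather than the not-yet-updated register file.
  (if (eql reg-address previous-dest)
      output-register
      (aref register-file reg-address)))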

There are additional points at which data may be injected into and extracted from the main data paths. Functional destinations can be written with the value of the output register. These consist of various control registers, the memory maps, and the virtual memory address and data registers. Functional sources may be injected into the right-hand ALU input register. Additionally, the virtual memory data may be injected directly into the output register (see below for why).

Like the CADR, but unlike the LAMBDA, the functional busses (called the MFI and MFO busses in the CADR) are not connected. The MFI bus passes data destined for the processor; the MFO bus passes data sourced from the processor. This admits the possibility that some inconsistency may arise in a read-write register (for instance, the memory data register), but the difficulty of "turning around" the bus, quite evident in the LMI LAMBDA, is easily avoided in this manner.

Because the functional destinations are located beyond the output register, the effect of the pipeline is exposed to the compiler, and care must be taken to ensure that enough time is allowed for the data to arrive at its destination before an attempt is made to use it; in particular, the virtual memory data register must not be used until two instructions after initiating the read. Like the CADR and LAMBDA, the K-machine has interlocks to stall the pipeline while the processor awaits data from the memory system; the two-instruction rule simply ensures that the request to read from the memory system precedes the request for the value read.

As mentioned before, the virtual memory data can be injected into the pipeline at the output register rather than at the right hand input to the ALU. The unusual placement of the injection point of the memory data saves time; memory data is usually written to a register and this will bypass a pointless trip through the ALU.

Since the functional destinations are delayed by a clock tick, certain operations, such as loading the type checking RAM, can take advantage of this delay by specifying that the data is to be written and then providing said data in the next instruction in the stream. The data will therefore be at the correct point in the pipeline when the write pulse is generated by the delayed functional destination. This is a bit of a hack, but it allows us to use a baroque set of instructions to program the hardware without requiring any extra data paths. (Of course, this can only be done with interrupts off.)

The instruction register can also be injected into the right hand side of the ALU. This allows us to embed immediate constants in the instruction stream. However, since an immediate constant has to be part of an instruction, there is no room for an ALU operation to be specified when this mechanism is used. The hardware forces the ALU to perform a pass-right instruction, so an immediate must be loaded into a register before any operation can take place (note that if the immediate were loaded to a functional destination it would simply pass through the ALU on its way to the functional output bus).

A special case is made for immediate fixnums. These are represented in the instruction stream as untagged immediate values. The hardware generates the fixnum tag on the fly as the value is injected, and the 8 bits saved (remember that fixnums are 24 bits) leave room for an ALU operation to be specified.
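
Reusing the field definitions from the earlier layout sketch, the on-the-fly tagging might look like this; the tag value itself is a placeholder, as the actual encoding is not recorded:

(defconstant +fixnum-tag+ 0)  ; placeholder; the actual fixnum tag value is not recorded

(defun inject-immediate-fixnum (immediate)
  ;; The instruction carries a raw 24-bit value; the hardware supplies
  ;; the fixnum tag as the value enters the data path.
  (dpb +fixnum-tag+ +tag-byte+ (ldb (byte 24 0) immediate)))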

The PC can be loaded from the output of the ALU. This allows computed branches to be executed.

Function calling hardware

While the LMI Lambda was advertised as having special hardware to speed function calls, it had no particular piece of hardware devoted to rapid function calling. No machine that I am aware of had function calling hardware like the K-machine's.

Register addressing

The function calling mechanism is intimately tied to the register addressing scheme. A register address consists of 6 bits. Two bits select a "frame" and 4 bits provide an offset within that frame. Each frame therefore has 16 general purpose registers.
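
A sketch of the decode; placing the frame-select bits at the high end of the address is an assumption:

(defun decode-register-address (address)
  ;; A register address is 6 bits: 2 bits select one of 4 frames,
  ;; 4 bits give the offset within the 16-register frame.
  (values (ldb (byte 2 4) address)     ; frame select
          (ldb (byte 4 0) address)))   ; offset within the frame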

There are 4 frames available to each instruction (more below). Analysis of the LMI LAMBDA showed that most functions have one or two arguments and one or two local parameters. The unusual functions that overflow the register frame are handled in software. The compiler generates the calls to the register spilling code.

The register frames are not overlapped as in other RISC machines. This allows us to allocate and free register frames in a non-stack (non-LIFO) manner. The advantages of this are discussed below.

Unlike the CADR, LAMBDA, and 3600, registers do not have a virtual address. In the LAMBDA, virtual memory accesses to the stack cause a load from the stack cache. In the K-machine, it is not possible to address register values in this way. Thus it is not possible to create a pointer into the stack.

The 4 frames available to each instruction are as follows: the OPEN frame (O), which holds the arguments of a call being set up; the ACTIVE frame (A), which holds the arguments and locals of the currently executing function; the RETURN frame (R), through which the frame of the most recently returned callee remains visible; and a GLOBAL frame, which holds system-wide values.

Call sequence

The K-machine uses a "3-point" calling sequence. A new open frame is allocated, then it is filled with arguments, then control is transferred to the callee. When the callee returns, the open frame that was allocated is discarded (actually moved to the RETURN frame). While it may not sound like it, this implements a callee saves convention.

When a function is called as a subproblem, a continuation must be created (and pushed) in order for the function to return a value. Note, however, that the continuation does not need to be in a register. In fact, it is rare for a continuation to be manipulated as a value, so a separate "hidden" stack is used to hold it. This allows us to manipulate continuations in parallel with operations that use the register set.

The continuation has 4 parts: the caller's active frame, the caller's previous open frame, the return PC, and the return destination. The actual control sequence that saves and restores these is complicated, but the state involved is small.
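
As a sketch, that state amounts to a four-field record (the names are mine):

(defstruct continuation
  ;; Pushed on the hidden stack, not held in a general-purpose register.
  callers-active-frame
  callers-previous-open-frame
  return-pc
  return-destination)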

Return destination

In addition to all the above activity, the call instruction specifies a "return destination" for the computed value. Rather than return a value to a canonical location and then move the value to the desired register, a register address is specified in the call instruction, and the return value is loaded into that register when the function returns.

The open, call, and return instructions each can be executed in one clock cycle.

For the sake of illustration, here is how the compiler might generate some code:

LISP:  (defun xxx (f0 f1 z0 b1 b2 f3)
               (foo f0 f1 (bar (baz z0) b1 b2) f3))

OPEN                ; open a frame for call to foo, 

O0 <- A0            ; load open register 0 with contents of active
                    ; register 0 (f0)

O1 <- A1            ; load open register 1 with contents of active
                    ; register 1 (f1)

OPEN                ; open a frame for call to bar, 

OPEN                ; open a frame for call to baz,

O0 <- A2            ; load open register 0 with contents of active
                    ; register 2 (z0)

O0 <- CALL BAZ      ; call function BAZ, result to appear in slot 0
                    ; in the frame for BAR.

O1 <- A3            ; BAZ's open frame is gone, result of call is in
                    ; open register 0, load open register 1
                    ; with contents of active register 3 (b1)

O2 <- A4            ; load open register 2 with contents of active
                    ; register 4 (b2)

O2 <- CALL BAR      ; call function BAR, result to appear in slot 2
                    ; of frame for FOO

O3 <- A5            ; BAR's open frame is gone, result of call is in
                    ; open register 2, load open register 3
                    ; with contents of active register 5 (f3)

CALL FOO            ; and away we go.

12 instructions. Now, you might think that this is pretty good code, but we can do better. The K-machine supports the following optimizations:

  1. Tail calls. TOPEN opens a frame for a tail-recursive call, and TCALL transfers control without pushing a continuation.

  2. Parallel moves. A register-to-register move can be executed in the same instruction as a TOPEN, OPEN, CALL, or TCALL.

  3. Combined OPEN and CALL. A call instruction can simultaneously allocate the open frame for the next call, with the return destination naming a slot in the frame to be opened upon return (the NEW-OPEN0 destination below).

Here is what the code looks like with the above optimizations:

LISP:  (defun xxx (f0 f1 z0 b1 b2 f3)
               (foo f0 f1 (bar (baz z0) b1 b2) f3))

TOPEN, O0 <- A0     ; open a frame for tail recursive call to foo,
                    ; in parallel, load open register 0 with contents of active
                    ; register 0 (f0)

O1 <- A1            ; load open register 1 with contents of active
                    ; register 1 (f1)

NEW-OPEN0 <- OPEN CALL BAZ, O0 <- A2
                    ; open a frame for a call to baz,
                    ; load open register 0 with the contents
                    ; of active register 2 (z0), result to appear
                    ; in slot 0 of open frame to be allocated upon
                    ; return from BAZ.

O1 <- A3            ; load open register 1 with contents of
                    ; active register 3 (b1)


O2 <- CALL BAR, O2 <- A4
                    ; call bar, returning value to slot 2 in FOO's
                    ; frame.
                    ; in parallel, load open register 2 with contents
                    ; of active register 4 (b2)

TCALL FOO, O3 <- A5 ; tail-recurse to foo, in parallel move 
                    ; contents of active register 5 (f3) into
                    ; open register 3

As you can see, we have removed half the instructions! This kind of optimization turns out to be fairly common. Remember that each of these instructions executes in one clock cycle.

The K-machine function calling hardware is implemented as a maze of registered 2-to-1 multiplexers and RAM. The only restriction that the implementation places upon programmers is that two return operations cannot be performed in sequence. In other words, when a function returns, its caller is not allowed to return in the next instruction. This is not a problem because a return immediately followed by a return is a tail-recursive call and is handled differently.

The K-machine function calling hardware is easy to describe, but many hours were spent proving that it operated correctly even in the presence of interrupts. The hardware implementation evolved through several versions. Each had equivalent functionality, but quite different approaches to getting the data to the right place at the right time.

To compare the K-machine call hardware with the LMI LAMBDA, we counted the number of microinstructions executed by a simple function call. The LMI LAMBDA executes a minimum of 100 microinstructions to implement a function call. These instructions execute at about 250 ns each. The K-machine executes a function call in a single 80 ns instruction. This is a factor of 300 in performance. The LMI LAMBDA takes 7 seconds to run the tak benchmark. The K-machine executes the same code in .03 seconds. This is faster than PCL on a Cray-1.

Multiple values

There are two mechanisms for returning values. When a value is returned, the special return functional destination makes sure the return value ends up in the appropriate slot of the appropriate frame. This implements the standard case of the caller expecting one return value and the callee providing one return value.

When a callee wishes to return multiple values, the first value is returned via a mechanism that differs from the single value return only in that it sets a flag indicating that additional values are present. Callers expecting a single value never check this flag.

When a caller is prepared to accept multiple values, it checks the multiple value return flag. If this flag is clear, the caller sets the remaining multiple values to NIL. If this flag is set, the multiple values are specified by special info left in the stack frame of the callee. This is available to the caller in the RETURN frame.
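
A sketch of the caller's side, for a caller prepared to accept two values; the accessor on the RETURN frame is invented for illustration:

(defun accept-two-values (first-value multiple-values-flag return-frame)
  ;; If the callee returned a single value, the second defaults to NIL;
  ;; otherwise it is recovered from the callee's frame, which is still
  ;; visible to the caller through the RETURN frame.
  (if multiple-values-flag
      (values first-value (extra-return-value return-frame 1))  ; hypothetical accessor
      (values first-value nil)))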

Some functions, such as unwind-protect, handle multiple values in a pass-through mode. There is a return path that leaves the multiple-value flag unchanged.

Linking

As you may have noticed in the code above, functions are statically linked. Functions are re-linked on the fly to simulate function cells. Special hardware support is provided to enable "lazy linking".

Each instruction has a link bit that is normally set to 0. When a function is re-defined, the old definition is "killed" by setting the link bit to 1. No other action is taken at this point.

The link bit is noticed by the hardware when an instruction is fetched. Because of the pipeline, the processor will still be executing code within the caller. If a function is linked to a dead function, the interrupt handler calls the linker. It is important to interrupt at instruction fetch time because a call to a tail-recursive function leaves no trace of where it came from.

If a function is linked to a dead function, that does not mean it should be unconditionally re-linked to the new definition. Imagine this scenario:

(defun foo () 'foo)
(setf (symbol-function 'bar) (symbol-function 'foo))

(defun test1 () (foo))
(defun test2 () (bar))

In this case, the symbol functions of foo and bar point to the same function object. Thus the code for test1 and test2 will be statically linked to the same function.

Now suppose foo is re-defined as follows:

(defun foo () 'new-foo)

Test1 and test2 are not immediately re-linked to the new definition. If test1 is executed, when the code for foo is fetched in preparation for the call, an interrupt occurs that invokes the linker. The linker determines (from extra data associated with test1 and foo) that test1 should be re-linked to the new definition of foo. The linker performs this re-linking and restarts the call instruction.

The situation is different for test2, however. In this case, the linker determines that the code for bar needs to be unshared from that associated with foo. The linker "resurrects" a copy of the dead code associated with foo and re-links bar to this new code. Essentially, if a function is called via a function cell, the linker relinks to the new definition; if the function is called via a function pointer, the linker relinks to a copy of the old definition.
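
The linker's rule can be summarized in a few lines; every name here is invented for illustration:

(defun relink (caller link-site)
  ;; LINK-SITE describes a call statically linked to killed code.
  (if (linked-via-function-cell-p link-site)
      ;; Called through a symbol's function cell: adopt the new definition.
      (link caller (symbol-function (link-site-symbol link-site)))
      ;; Called through a captured function pointer: resurrect a private
      ;; copy of the old, killed code and link to that instead.
      (link caller (resurrect-copy (link-site-old-code link-site)))))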

I have been told that this is similar to UUO-LINKS used in ITS, but we developed this mechanism independently.

Interrupt handling

The CONS, CADR, and LAMBDA machines polled for interrupts at the microcode level. The K-machine, not having microcode, does not poll for interrupts. Interrupts are either synchronous, for instance integer overflow or page fault, or asynchronous, such as the 16 millisecond clock.

Interrupts are indicated by a bit vector stored in a functional source. If multiple interrupts occur at the same time, the interrupt code can check this source to decide in what order to process the interrupts.

The Interrupt Handler

When interrupts are enabled, an interrupt causes several things to happen: the clocks to the pipeline registers are disabled, freezing the state of the machine, and control is forced to location 0.

As the next few instructions proceed, various parts of the pipeline are re-enabled. If the right code is in place at location 0, the pipeline state will be correctly saved in a global frame reserved for this purpose. The interrupt sequence goes something like the following (remember that the pipeline registers are not being clocked until further notice):

  1. The output of the ALU is saved, then the output register clock is enabled.


  2. The condition codes are read from a functional source. This allows us to restart conditional branches should the processor be interrupted during one of the "branch delay" instructions. The saved condition code clock is re-enabled.


  3. The right hand side of the ALU is passed through the ALU to be saved in the global frame, then the right hand side clock is enabled.


  4. The left hand side of the ALU is passed through the ALU to be saved in the global frame, then the left hand side clock is enabled.


  5. The PC of the interrupted instruction is read and stored in the global frame.

At this point, the entire state of the pipeline has been saved and execution proceeds as normal. The interrupt functional source contains the value of the interrupt, so the correct interrupt handler can be located.

Note that there is no other state to indicate that the processor has been interrupted. Thus it is not necessary to balance every interrupt with an interrupt restore.

When the interrupt has been processed, there are three paths back into the interrupted code. All of them reload the pipeline in a similar way, differing only in the last instruction. As the pipeline is reloaded, the pipeline clocks are incrementally disabled in the same order that they were enabled for unloading the pipeline. Pipeline reload happens something like this:

  1. The saved ALU output is passed through and captured in the output register. The output register is then disabled.


  2. The saved left hand and right hand operands are read and latched into the ALU input registers.


  3. At this point, the pipeline is reloaded. All the pipeline clocks are enabled and control is transferred to the saved PC. In addition, the saved condition codes are used to specify a conditional branch. Should a pending branch be present when restarting, it will be taken as if the condition codes had been generated normally.

As mentioned before, there were different interrupt crawlout routines. The standard one, described above, was used for asynchronous interrupts such as the clock. In this case, the pipeline state after the interrupt crawlout is identical to that when the interrupt occurred, and the interrupted instruction is re-executed.

Another interrupt crawlout is used for synchronous interrupts. These interrupts are caused by the program and generally indicate an error-like condition that can be fixed (something like integer overflow). For these interrupts, the ALU output register is not enabled until one instruction after the interrupted one. To see why this is interesting, consider the case of integer overflow.

When integer overflow occurs, an interrupt is taken. The interrupt handler conses a bignum result and proceeds to restart the instruction. If the instruction were simply re-executed, the integer overflow would occur again. Instead, the bignum value is loaded into the ALU output register during the crawlout. When the instruction is re-executed, the overflow condition and the output of the ALU are ignored, and the bignum is used instead. This presents the illusion that the addition of two fixnums resulted in a bignum pointer. Thus we are able to abstract away the limitations of the ALU.
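
A sketch of the handler's side; the two primitives for manipulating the saved pipeline state are invented names:

(defun integer-overflow-handler (left right)
  ;; LEFT and RIGHT are the saved ALU operands from the global frame.
  ;; Compute the full-precision result in software (consing a bignum),
  ;; plant it in the saved ALU output register, and crawl out along the
  ;; path that re-enables the output register one instruction late, so
  ;; the re-executed instruction sees the bignum instead of the
  ;; overflowed ALU output.
  (set-saved-output-register (+ left right))
  (crawlout-with-deferred-output-register))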

One more crawlout is provided. This causes an immediate interrupt before the execution of a single instruction. This seemingly useless ability allows the computer to read the contents of the instruction cache without actually executing instructions.

The critical positioning of the instructions in the interrupt crawlin and crawlout routines forced us to write them by hand. Other than a few pieces of code that clearly benefited from hand tuning (for instance, a mapcar would speculatively fetch the CDR while testing the CAR), there was no code in the system that was not written in Lisp.

Single Stepping

The interrupt logic could be programmed to re-interrupt after a single instruction. This allows a process to single step another for debugging purposes.

Interrupt disabling

It is important to be able to disable interrupts, and the instruction to do so must be atomic. A functional source is used to disable interrupts. Reads from this source clear the interrupt enable and return the value that the interrupt enable flag had before the source was read. This allows reliable control of interrupts, even if an asynchronous interrupt interrupts the disable interrupts instruction.
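
The read-and-clear semantics make an atomic critical section straightforward to build; a sketch, in which the two primitives stand in for the actual functional source and the re-enable operation:

(defmacro without-interrupts (&body body)
  ;; Reading the disable source atomically clears the interrupt-enable
  ;; flag and returns its previous value, so these forms nest correctly.
  `(let ((was-enabled (read-interrupt-disable-source)))
     (unwind-protect (progn ,@body)
       (when was-enabled (enable-interrupts)))))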

Synchronous interrupts are discarded when interrupts are disabled, but asynchronous interrupts are simply deferred. When interrupts are re-enabled, any pending asynchronous interrupts are immediately taken. Synchronous interrupts may be disabled independently.

Bibliography

Many sources were consulted for the design of the K-machine. Unfortunately, a complete list of sources has been lost in the mists of time. For those forgotten, be assured that no intellectual dishonesty is intended.

Appendix

What ever happened to the hardware?

During the investigation of Guy Montpetit, owner of Gigamos, for fraud surrounding his use of Japanese investment capital, the Canadian government requested that the United States assets be seized pending resolution of the case. Coopers and Lybrand held the assets. Gigamos eventually filed for bankruptcy in the U.S., and Coopers and Lybrand sought out a buyer. After a period of time, when no investors were found, the material assets of Gigamos, including the K-machine board sets, specifications, schematics, and printed circuit artwork were sold for salvage to Eli Hefron and Sons in Cambridge, MA. I purchased these materials from Eli Hefron and Sons and they are currently in my possession.