Notes

A Stack-Based Bytecode VM for Vorpal

This weekly note is a bit less self-contained than most of the previous ones, as it is about implementing a simple bytecode virtual machine for the language I'm experimenting with, Vorpal. The VM is not particularly novel, so this note is more of a summary for me than an article with a central argument.

Desugaring Into Lambda Calculus

All source code is first parsed, then desugared into a simple form that is similar to untyped lambda calculus. Explicit bindings are desugared into normal lambda abstractions. Expressions can either be lambda calculus forms (variables with de Bruijn indices, lambda abstractions or lambda applications), string constants or recursion, or one of two built-in primitives, “pop” or “if”:

enum Expr {
    Var(usize), // variables (with de Bruijn indices)
    String(usize), // interned strings
    Abs(Box<Expr>), // lambda abstraction
    Rec(Box<Expr>), // recursion
    App(Box<Expr>, Box<Expr>), // lambda application
    Pop(Box<Expr>, Box<Expr>, Box<Expr>),
    If(Box<Expr>, Box<Expr>, Box<Expr>, Box<Expr>),
}

Recursion binds a lambda abstraction to itself: While Abs(...) corresponds to a function x => ..., the expression Rec(Abs(...)) corresponds to the function x => ... where x is bound to that very function. This can be used to build recursive functions or values. For example, assuming that the string "Foo" is interned as String(0), the expression Rec(Abs(App(String(0), Var(0)))) would correspond to the infinite expansion App(Foo, App(Foo, App(Foo, ...))).

The primitive function “pop” can be used to destructure strings applied to other expressions. If x is App(y, z), the expression pop(x, f, g) will evluate to f applied to the arguments y and z (in other words App(App(f, y), z)). If x is not an application, pop(x, f, g) will evaluate to g applied to some nil-like value.

The primitive function “if” compares two strings for equality. If a and b in if(a, b, t, f) are the same interned strings, the expression will evaluate to t applied to some nil-like value, otherwise it will evaluate to f applied to some nil-like value.

A Stack-Based VM

Desugared expressions are then compiled into a list of bytecode instructions:

enum Op {
    PushVar(usize /* index */),
    PushString(usize /* interned */),
    PushFn(usize /* code */, usize /* captured vars */),
    Rec, // bind the topmost value to itself
    Apply, // apply the two topmost values
    Return, // return from the current fn
    Pop,
    If,
}

The VM uses 3 stacks:

  1. A stack of variables, corresponding to function arguments: A function can access the variable with de Bruijn index 0 at the top of the variable stack, the variable with de Bruijn index 1 is one lower on the stack, etc.
  2. A stack of temporaries, which are used for the operands and results of Rec, Apply, Return, Pop and If. A PushVar instruction copies a value from the variable stack and pushes it onto the stack of temporaries.
  3. A stack of call frames, which store both how many captured variables a closure has pushed onto the variable stack (which will all be popped once the closure returns) and which instruction in the list of bytecodes should be used as the return address when the function / closure returns.

Call frames are stored as pairs of unsigned numbers, whereas the stacks for variables and temporaries both store values, which are of the following form:

pub enum V {
    Fn(usize, usize, usize),
    String(usize),
    Record(usize, Vec<Rc<V>>),
    Closure(usize, Rc<Vec<V>>),
    Recursive(usize, Rc<Vec<V>>),
}

All values start out as either Fns (which store their code pointer, the number of variables to capture and the call frame they belong to) or Strings (which store the index of the string in the interned string pool). When strings are applied to other values, they become Records, which store their applicands in a Vec. When Fns are not applied immediately but instead used as data (either as arguments of other functions or returned from their containing function) they capture the variables that they reference as a closure. (The compiler statically tracks how many variables a function might capture.) A Recursive value is a closure whose function argument is bound to itself.

Tracking the number of variables-to-capture as well as the call frame where a function originated allows the creation of closures to be avoided in many cases, because the current call frame (and an unchanged stack of variables) can just be reused if a function is applied directly. The idea is similar to upvalues in Lua 5, where values are only captured when a closure is returned from a function.

Right now there is no tail call optimization, but it should be possible to add it relatively easily by checking whether an Apply instruction is immediately followed by a Return, then reuse the current call frame if possible. This could even be checked statically, which would allow emitting a special TailReturn instruction, but it's probably just as easy to do it directly in the VM. (The opposite case, a Return followed by an Apply, which can be optimized so that no variables are captured unnecessarily in a closure, cannot be checked statically by the compiler, as only the VM is aware of the return address and thus the following instruction.)

Unifying Vars and Temps using Registers?

Separating variables and temporaries into separate stacks turned out to be easier than trying to track the locations of variables using call frames and putting everything on a single stack, but still seems somewhat suboptimal, as a lot of variables are cloned and pushed onto the temp stack using PushVar instructions only to be immediately consumed by Rec / Apply / Pop / If instructions.

A possible alternative that I want to explore more might be a register-based approach which gets rid of the temp stack completely and uses a single stack for variables and values. PushVar instructions would completely fall away, Rec / Apply / Pop / If would directly specify their operands using indices pointing to variables on the stack.

Here's a sketch of how the instructions might change:

enum Op {
    PushString(usize /* interned */),
    PushFn(usize /* code */, usize /* captured vars */),
    Rec(usize), // bind the topmost value to itself
    Apply(usize, usize), // apply the two topmost values
    Return, // return from the current fn
    Pop(usize, usize, usize),
    If(usize, usize, usize, usize),
}