Notes
2025/12/27
Thanks to a post by Max Bernstein (where he links to a comment by Bob Nystrom), I've been thinking about the relationship between bytecode and ASTs, especially in relation to comptime evaluation / partial evaluation.
In a nutshell, the observation is that a tree structure with pointers to its child nodes can not only be represented as a flat structure in an array by using array indices instead of pointers, but pointers/indices can even be omitted completely if we represent nodes as operations that manipulate their child nodes as elements on a stack.
This is basically what bytecode for a stack-based programming language is: operands are pushed onto the stack, and operations then consume these values by popping from and pushing to the stack. Since a stack can be represented as a flat, contiguous array, we can achieve much better cache locality with bytecode than with a boxed AST that requires a lot of pointer chasing.
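To make the correspondence concrete, here is a minimal sketch (the node tags and opcode names are illustrative, not from any particular VM): the same expression, once as a boxed tree and once as flat stack bytecode, where each operation's "child pointers" are implicit in stack order.

```python
# Boxed AST: nested tuples standing in for heap-allocated nodes.
ast = ("mul", ("add", ("lit", 1), ("lit", 2)), ("lit", 3))

def eval_ast(node):
    """Tree-walking evaluator: chases child pointers recursively."""
    tag = node[0]
    if tag == "lit":
        return node[1]
    lhs, rhs = eval_ast(node[1]), eval_ast(node[2])
    return lhs + rhs if tag == "add" else lhs * rhs

# Flat bytecode: a postorder walk of the same tree. No pointers or
# indices -- each operation pops its operands off the stack.
code = [("push", 1), ("push", 2), ("add",), ("push", 3), ("mul",)]

def eval_bytecode(code):
    stack = []
    for op, *args in code:
        if op == "push":
            stack.append(args[0])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "add" else a * b)
    return stack.pop()

print(eval_ast(ast), eval_bytecode(code))  # both evaluate to 9
```

Note that the bytecode is exactly the postorder serialization of the tree, which is why no indices are needed: by the time an operation runs, its children have already left their results on the stack.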
So far, so good. But how far can we push this similarity? When I was experimenting with partial evaluation, I ended up implementing two interpreters that had very similar semantics: the “main” one was a bytecode interpreter that was executed at run time, but the partial evaluator was built as a tree-walk interpreter for the (boxed) AST and ran at compile time. Can these two evaluators be unified?
What would it take for the compile time partial evaluator to operate on bytecode? The current run time evaluator turns bytecode into run time values (like strings, closures, cons lists, etc.), but using bytecode for the compile time partial evaluation step would require turning bytecode into bytecode, for later use by the run time evaluator. This is clearly doable: transforming a (boxed) AST into a (boxed) AST is routinely done in compiler passes, and using (flat) bytecode would amount to just using a different representation of such an AST.
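A hedged sketch of what bytecode-to-bytecode evaluation could look like, using the same invented opcodes as before plus a hypothetical `load` for run-time variables: instructions whose operands are fully known get folded at compile time, everything else is emitted as residual bytecode.

```python
KNOWN, UNKNOWN = "known", "unknown"

def partial_eval(code):
    # Abstract stack: (KNOWN, value) or (UNKNOWN, residual bytecode).
    stack = []
    for ins in code:
        op = ins[0]
        if op == "push":
            stack.append((KNOWN, ins[1]))
        elif op == "load":  # run-time variable: unknown at compile time
            stack.append((UNKNOWN, [ins]))
        else:               # binary op: "add" or "mul"
            (tb, b), (ta, a) = stack.pop(), stack.pop()
            if ta == KNOWN and tb == KNOWN:
                # Both operands known: fold the operation away.
                stack.append((KNOWN, a + b if op == "add" else a * b))
            else:
                # Residualize: re-materialize known operands as pushes.
                ra = [("push", a)] if ta == KNOWN else a
                rb = [("push", b)] if tb == KNOWN else b
                stack.append((UNKNOWN, ra + rb + [ins]))
    tag, v = stack.pop()
    return [("push", v)] if tag == KNOWN else v

# (1 + 2) * x  specializes to  push 3; load x; mul
print(partial_eval([("push", 1), ("push", 2), ("add",),
                    ("load", "x"), ("mul",)]))
```

The evaluator's shape is the same as the run time interpreter's main loop; only the stack entries are abstract values instead of concrete ones, which hints at how the two evaluators might share code.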
The more interesting part is the shift in perspective that comes with turning bytecode into bytecode, instead of turning it into a run time value representation. Compared to a traditional bytecode evaluator that produces run time values, a bytecode-to-bytecode evaluator that operates exclusively on flat bytecode differs in two main ways: its output stays flat and tightly packed, and it favors copying and in-place mutation over heap sharing. Let's look at each of these differences in turn:
The most interesting aspect of using a bytecode representation as the output of an evaluator (be it a partial evaluator or otherwise) is that the output is flat and tightly packed. Lists wouldn't be represented as cons cells, but rather as a sequence of values followed by one or more CONS instructions (or similar).
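As an illustrative guess at what this encoding could look like (the `nil`/`cons` opcodes are invented for the example), here is the list (1 2 3) as a flat instruction sequence, and a traditional evaluator that turns it back into boxed cons cells:

```python
# The list (1 2 3), not as three heap-allocated cons cells, but as a
# flat, tightly packed sequence of values followed by CONS instructions.
flat_list = [("push", 1), ("push", 2), ("push", 3),
             ("nil",), ("cons",), ("cons",), ("cons",)]

def build(code):
    """A traditional evaluator rebuilds boxed cons cells from the flat form."""
    stack = []
    for ins in code:
        if ins[0] == "push":
            stack.append(ins[1])
        elif ins[0] == "nil":
            stack.append(None)
        else:  # cons: pop tail, then head
            tail, head = stack.pop(), stack.pop()
            stack.append((head, tail))
    return stack.pop()

print(build(flat_list))  # (1, (2, (3, None)))
```

A bytecode-to-bytecode evaluator would instead keep `flat_list` as-is, only splicing and rewriting the instruction sequence.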
Could this be adopted as an actual run time value representation? This might seem absurd at first, especially because it has the drawback that different list elements can have different sizes, which makes traversing a list awkward, to say the least. But isn't such a tightly packed representation similar to how structs are laid out in memory? A flat bytecode representation would not be a good fit for homogeneous lists with a large number of elements, but could it offer much better locality for small struct-like use cases that would otherwise be represented by cons cells in dynamic languages?
Notably, the Sixten language seems to be built around a similar idea.
We take it for granted that in the run time value representation of a functional language, heap allocation is used to share parts of larger data structures, so that the same large list or closure can be stored in two other lists by just cloning a pointer twice. But a fully flattened-out bytecode AST is just a bunch of instructions, so the easiest way of partially evaluating a lambda application is to substitute the argument for the bound variable in the lambda body by copying the argument's bytecode instructions.
This sounds extremely inefficient (and it would be, if taken to the extreme), but the flip side is that values can be directly mutated, because (viewed from the bytecode interpreter perspective) each value at the top of the stack “belongs” to the bytecode instruction that follows it, which can do with the value whatever it likes, by popping and pushing to the stack.
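The substitution-by-copying idea can be sketched as a pure instruction-splicing pass (again with invented opcodes; a real implementation would have to handle shadowing and evaluation order):

```python
def substitute(body, var, arg_code):
    """Beta-reduce by splicing the argument's bytecode wherever the
    body loads the bound variable -- no environments, no heap sharing."""
    out = []
    for ins in body:
        if ins[0] == "load" and ins[1] == var:
            out.extend(arg_code)  # copy the argument's instructions
        else:
            out.append(ins)
    return out

# (\x -> x + x) applied to (2 * 3): the argument code is duplicated,
# which is exactly the inefficiency (and the freedom) discussed above.
body = [("load", "x"), ("load", "x"), ("add",)]
arg  = [("push", 2), ("push", 3), ("mul",)]
print(substitute(body, "x", arg))
```

The duplicated `arg` instructions make the cost of copying visible, but each copy is now owned by the instruction that follows it and can be rewritten or folded independently.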
There are some echoes of Perceus reference counting (as used by Koka) and Mutable Value Semantics (MVS) here. Both approaches (as well as Rust's ownership system) are built on the insight that whenever there's a single owner of a data structure, it can safely be mutated in place. (Called “functional but in place” in the Perceus paper.) The two approaches differ in how they deal with multiple owners: Perceus uses reference counting together with the persistent data structures that are common in functional programming, whereas MVS uses copy-on-write semantics that copy the entire value on the next write.
Crucially, whether sharing through persistent data structures or copy-on-write makes more sense depends on how common the single-owner case is compared to the multi-owner case, as well as how common reads are compared to writes: If single owners are rare and reads are common, persistent data structures as used by Perceus shine. If multiple owners are rare and writes are common, copy-on-write as used by MVS is likely the better choice. (I've heard that Roc implements a variant of Perceus that uses plain arrays under the hood, which sounds pretty close to the ideas in MVS, but I haven't found any details on Roc's implementation.)
It seems unlikely that using a flat-and-copied bytecode representation instead of a boxed-and-shared run time value representation would work for all use cases. But that's not the point: What's interesting is that a flat bytecode representation gives us flat-and-copied by default, which could potentially be mixed with boxed-and-shared, for example by introducing a separate instruction for boxing parts of a data structure. If we further add instructions that can selectively use ref counting or copy-on-write, we are getting pretty close to some of the ideas explored in Perceus or MVS.
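A speculative sketch of what such a mixed instruction set could look like (the `box` and `dup` opcodes and the heap layout are invented for the example): `box` lifts a flat value off the stack into a shared heap cell, after which later instructions copy only a small reference.

```python
def run(code, heap):
    stack = []
    for ins in code:
        op = ins[0]
        if op == "push":
            stack.append(ins[1])            # flat: the value itself
        elif op == "box":
            heap.append(stack.pop())        # shared: move value to the heap,
            stack.append(("ref", len(heap) - 1))  # leave a reference behind
        elif op == "dup":
            stack.append(stack[-1])         # duplicating a ref is cheap; a
                                            # flat value would be copied whole
    return stack

heap = []
print(run([("push", [1, 2, 3]), ("box",), ("dup",)], heap))
# two references on the stack, one shared cell on the heap
```

From here, one could imagine further opcodes that attach a reference count to a boxed cell (Perceus-style) or mark it copy-on-write (MVS-style), making the sharing strategy an explicit, per-value choice in the instruction stream.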
Whether using a bytecode-like representation for run time values is a good idea remains to be seen. (Well, it's probably not.) However, at the very least it could lead to a clean and unified architecture that shares a lot of code between the compile time partial evaluator and the run time bytecode interpreter. At best it could lead to a bytecode instruction set that cleanly exposes the distinctions between boxed and flat values as well as sharing and copying (and moving) values.