Notes

Throw Away the Compiler, But Keep the Tests

While working on a stack-based VM for my programming language Vorpal, I have been experimenting with throwing away all unfinished code at the end of the day: less dogmatically than at first, but still on a regular basis. In Tyler Neely's talk, where I first heard of the idea, he mentions keeping the tests. This makes sense: after all, tests benefit much less from the conceptual simplification you get when you repeatedly throw away and rewrite the core logic of your code. Tests are also, by their very nature, orthogonal and cumulative: you want a large collection of them, ideally a very large one, to cover all the edge cases.

On the other hand, writing tests too early on can tie you to a particular interface and make refactoring code a lot more cumbersome than it should be. I experienced this several times while testing the intermediate representation of Vorpal (basically a slightly extended lambda calculus) and its stack-based bytecode. Changing these representations slightly meant changing a lot of tests in a very superficial way.

Testing Different Compiler Stages

This is why I've been experimenting with a more rigorous testing infrastructure that can adapt more quickly to changes in the various intermediate representations. Here's what one test looks like:

    use std::path::PathBuf;

    #[test]
    fn test_run_apply() -> Result<(), String> {
        // `test` is the shared helper that runs the file through all compiler stages.
        test(PathBuf::from("tests/run_fn.txt"))
    }

And here's (some of) the content of tests/run_fn.txt:

('x => { Bar })(Foo)

((=>
  'x
  {
    Bar})
  Foo)

( ( ( =>
        =>
          0
      "x" )
    =>
      "Bar" )
  "Foo" )

00000: PushString("Bar")
00001:   Return
00002: PushVar(00000)
00003:   Return
00004: PushFn(00002, 00000)
00005:   Return
00006: PushString("Foo")
00007: PushFn(00000, 00000)
00008: PushString("x")
00009: PushFn(00004, 00000)
00010: Apply
00011: Apply
00012: Apply

Bar

---

('x => { x })(Foo)

((=>
  'x
  {
    x})
  Foo)

( ( ( =>
        =>
          0
      "x" )
    =>
      0 )
  "Foo" )

00000: PushVar(00000)
00001:   Return
00002: PushVar(00000)
00003:   Return
00004: PushFn(00002, 00000)
00005:   Return
00006: PushString("Foo")
00007: PushFn(00000, 00000)
00008: PushString("x")
00009: PushFn(00004, 00000)
00010: Apply
00011: Apply
00012: Apply

Foo

The file contains two tests, separated by “\n\n---\n\n”. Each test is a sequence of five stages, separated by “\n\n”, that go from the source code all the way to the resulting expression (a sketch of how such a file could be split up follows the list):

  1. “('x => { Bar })(Foo)” is the source code before it's parsed.
  2. “((=> 'x {Bar}) Foo)” is a pretty-printed representation of the AST as S-expressions. It deliberately abstracts away some details of the AST (such as whether a call is prefix or infix), so that those aspects of the AST can be changed without having to change this pretty-printed representation in the test files.
  3. “( ( ( => => 0 "x" ) => "Bar" ) "Foo" )” is a pretty-printed representation of the lambda calculus intermediate representation that the AST is desugared to, using de Bruijn indices. “=>” is used instead of “λ” to stick more closely to the surface syntax of Vorpal, but it's otherwise basically a lambda calculus extended with string constants and a couple of primitive functions.
  4. “00000: PushString("Bar") ...” is the pretty-printed bytecode that the lambda calculus expressions are compiled to. All strings are interned in a separate section and referred to by their index, but the pretty-printed representation resolves them to be more human-readable.
  5. “Bar” is the expected output of running the bytecode using Vorpal's VM.
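
Here is a rough sketch of how such a file could be split into tests and stages; the type and function names are invented for illustration and are not the actual Vorpal test code:

    /// One test case from a test file: the source code followed by the
    /// expected output of each later stage. (Hypothetical type.)
    struct TestCase {
        stages: Vec<String>,
    }

    /// Split a test file into tests, and each test into its stages,
    /// mirroring the separators described above.
    fn split_test_file(contents: &str) -> Vec<TestCase> {
        contents
            .split("\n\n---\n\n") // tests are separated by a "---" line
            .map(|test| TestCase {
                // within a test, the stages are separated by blank lines
                stages: test.split("\n\n").map(|s| s.trim().to_string()).collect(),
            })
            .collect()
    }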

Partial Tests

A test doesn't have to specify all stages; it's also possible to test that the compiler aborts with an expected error message by leaving out the later stages. For example, this is part of the test file for the parser:

a + b - c

Expected all infix functions starting at line 1, col 3 to be named '+', but found '-' at line 1, col 7

---

a +

Expected an infix argument after the function '+' at line 1, col 3, but the code just ended

---

f(a +)

Expected an infix argument after the function '+' at line 1, col 5, but found ')' at line 1, col 6

Because the stages are separated by “\n\n”, the test files stay quite compact, and all error messages can be seen at a glance in a single file, without having to set up a ton of separate tests.
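
The comparison for a partial test could then work roughly like this (again with invented names, not the actual Vorpal code): only as many stages as the file specifies are checked, so a test that ends with an error message simply stops early.

    /// Compare the computed stage outputs against the expected ones from the
    /// test file. A partial test only lists the first few stages (e.g. the
    /// source plus an error message), so only those are compared.
    /// (Hypothetical sketch.)
    fn check_stages(expected: &[String], actual: &[String]) -> Result<(), String> {
        for (i, (want, got)) in expected.iter().zip(actual.iter()).enumerate() {
            if want != got {
                return Err(format!(
                    "stage {} mismatch:\nexpected:\n{}\ngot:\n{}",
                    i, want, got
                ));
            }
        }
        Ok(())
    }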

Iterating Quickly by Overwriting Tests

The nice thing about keeping the tests in text files that specify all (or just a subset) of the compiler stages is that the testing infrastructure can overwrite those files with the results of a test run when a boolean flag is set in the Rust code. This means that whenever I change the bytecode representation, I can just run the test suite again, compare the results with the test files as they were last committed, and then commit the changed files.
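
As a minimal sketch of what that could look like (the flag name and helper are assumptions, not the actual Vorpal code):

    use std::fs;
    use std::path::Path;

    /// When set, a test run rewrites the test files with the actual stage
    /// outputs instead of failing on a mismatch. (Hypothetical flag.)
    const OVERWRITE: bool = false;

    /// Write the computed stages of every test back into the file, using the
    /// same separators the file is parsed with, so that `git diff` shows
    /// exactly which stages changed.
    fn maybe_overwrite(path: &Path, tests: &[Vec<String>]) -> Result<(), String> {
        if !OVERWRITE {
            return Ok(());
        }
        let contents = tests
            .iter()
            .map(|stages| stages.join("\n\n"))
            .collect::<Vec<_>>()
            .join("\n\n---\n\n");
        fs::write(path, contents).map_err(|e| e.to_string())
    }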

Whenever I add a new test, I only specify the initial source code, write “?” for each of the other 4 stages, and let the test suite overwrite the intermediate representations, which would be tedious to specify by hand. When I later rewrite some implementation code, I still have all the stages in the test file and can immediately see whether a regression occurred in the VM, in the compiler, or perhaps even earlier, when the code is desugared.
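
For example, a freshly added test presumably looks something like this before the first run fills in the remaining stages (reusing the source line from above):

('x => { Bar })(Foo)

?

?

?

?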

Abstraction by Pretty-Printing

All the stages after the initial source code that appear in a test file abstract away some of the details of their corresponding intermediate representations: instead of their debug representation, they are rendered by a pretty-printer that is part of the test code.
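
As an illustration, a pretty-printer for the lambda calculus stage might look roughly like this; the IR type and the exact output format are simplified guesses, not the actual Vorpal code:

    /// A rough guess at the lambda calculus IR: variables are de Bruijn
    /// indices, and constants are (interned) strings. (Hypothetical type;
    /// the real IR has more primitives.)
    enum Ir {
        Var(usize),
        Str(String),
        Fn(Box<Ir>),
        Apply(Box<Ir>, Box<Ir>),
    }

    /// Render the IR as S-expressions, using "=>" instead of "λ" to match
    /// Vorpal's surface syntax.
    fn pretty(ir: &Ir) -> String {
        match ir {
            Ir::Var(index) => index.to_string(),
            Ir::Str(s) => format!("{:?}", s),
            Ir::Fn(body) => format!("( => {} )", pretty(body)),
            Ir::Apply(f, arg) => format!("( {} {} )", pretty(f), pretty(arg)),
        }
    }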

This allows changing the intermediate representations without having to touch the test files; it's enough to update the pretty-printer.

All in all, it didn't take long to build this simple testing infrastructure, and so far I'm happy with the balance it strikes: I can pin down intermediate stages without writing bytecode by hand, and I can still change the intermediate representations on a whim by overwriting the tests.