Notes
2025/08/31

RC week 15: Kombucha's fundamental data types

I've recently been thinking a bit more about the fundamental data types in Kombucha. For context, Kombucha started as an attempt to build a programming language that stays as close as possible to lambda calculus while avoiding to be a Turing tarpit. In contrast to other languages, which take it for granted that numbers, lists and hashmaps are built into the language, I wanted to explore the middle ground between lambda calculus and traditional languages.

Kombucha's minimalism isn't just an end in itself, but also relevant because Kombucha is meant to be used symbiotically in combination with a host platform that implements Kombucha's bytecode virtual machine. Since the goal is to keep Kombucha small and simple enough that its virtual machine can be implemented in just a few hundred lines of code (so that I can easily host it for example both in Rust and in JS), every data type that is built in increases the instruction set of the virtual machine.

Let's look at what Kombucha offers right now (atoms and lists) and then explore what other data types might make sense (maps, numbers and bytes).

Atoms

Every identifier that starts with an uppercase letter, such as True, False, Zero or Foo is an atom in Kombucha. Atoms are just string-like things that do not support any operations apart from comparing two atoms for equality. Under the hood, atoms are simply interned strings (known as atoms in Elixir, keywords in Clojure and symbols in Ruby).

Technically, atoms are the only form of data type in Kombucha. Composite data structures can be built by using atoms as if they were functions: Applying an atom Foo to the two arguments Bar and Baz creates the structure Foo(Bar, Baz), which is conceptually a data type Foo that holds the two values Bar and Baz. Since Kombucha is dynamically typed, there is no need to declare data types. To construct a pair of the two values Foo and Bar it's enough to just apply the atom Pair to Foo and Bar, using the syntax Pair(Foo, Bar).

Composite data structures can be destructured, by splitting off the last element of a composite data structure. For example, splitting Pair(Foo, Bar) returns Pair(Foo) and Bar, while splitting Pair(Foo) returns Pair and Foo. This also makes it possible to build higher level constructs such as pattern matching out of step-by-step destructuring operations. It is currently only possible to split off elements at the end, but it might make sense to extend to instruction set to allow elements to be split off from the front as well (and to inspect the “head” of a composite data structure, in other words Pair in the case of Pair(Foo, Bar)).

Lists

Lists in Kombucha are written as [Foo, Bar] and are merely syntactic sugar for applying the atom Nil to a bunch of arguments. In other words, [Foo, Bar] is sugar for Nil(Foo, Bar), [Foo, [Bar, Baz], Qux] is sugar for Nil(Foo, Nil(Bar, Baz), Qux).

Since lists are just sugar, the are only special from the perspective of the parser and the pretty printer, but the compiler and virtual machine can be completely oblivious to this special treatment, all they see are atoms and composite data structures built out of atoms.

Maps

Kombucha currently does not provide any associative maps. Instead, lists of pairs are used, for example [[One, Foo], [Two, Bar]] for the association of One with Foo and Two with Bar.

This leads to a problem when serializing/deserializing to/from other data structures such as JSON: While something like JSON's null can just be mapped to/from some Kombucha atom such as Null or None, the same is not possible for the empty JSON object {}, because both {} and [] in JSON would map to the empty list [] in Kombucha, which makes the conversion between JSON and Kombucha lossy.

There are three ways to solve this:

The easiest “solution” would be to accept that conversion is lossy, because Kombucha simply does not have a distinction between an empty list and map. This does not feel like a great solution.
Maps in Kombucha could become a primitive data type of unordered associative pairs, so that the JSON object {"One": "Foo", "Two": "Bar"} is represented by a hashmap in Kombucha's virtual machine and has special syntax, such as {One: Foo, Two: Bar}.
Maps in Kombucha could use a special atom, for example Map, instead of relying on lists (which are just sugar for the atom Nil). The JSON object {"One": "Foo", "Two": "Bar"} would then correspond to Kombucha's Map([One, Foo], [Two, Bar]), possibly with special syntactic treatment just like lists.

If the goal is to maximize performance, the best option would be to add associative maps as a primitive data type to Kombucha, so that each virtual machine implementation can use efficient hashmaps to represent associative data.

But if the goal is to keep the language minimal, both for its theoretical properties and ease of virtual machine implementation, it probably makes more sense to just use a different atom such as Map for associative maps. (However, is it then still justified to privilege lists syntactically? Perhaps the most consistent approach would be to use List(Foo, Bar) for lists and Map(List(One, Foo), List(Two, Bar)). This feels extremely verbose though.)

Numbers

The same question arises for numbers, which Kombucha simply doesn't support at the moment. The easiest option would be to add fixed-size integers, for example 64 bit ints. This is problematic however, because it would commit an otherwise machine-independent language to a specific number of bits and require all virtual machine implementations to use 64 bit ints, even if that ends up being a bad fit for the architecture (for example because it uses 32 bit words). Kombucha's aim is to be quite high level in other regards, which would make 64 bit ints feel strangely out of place.

The alternative is to use arbitrary-precision arithmetic and guarantee that Kombucha's numbers are always bignums. This leaves us with two options on how to implement this:

Bignums could become a primitive data type, with each virtual machine having to implement arbitrary-precision arithmetic.
Bignums could be encoded in Kombucha itself as data types built out of (composite) atoms, either as (unary) Peano arithmetic or using binary encoding.

Once again, if the goal is to maximize performance, the best option would clearly be to add support for numbers to Kombucha's virtual machine. However, implementing arbitrary-precision arithmetic is more complicated than just relying on a platform's native arithmetic operations and would likely make it impossible to implement a virtual machine in a few hundred lines of code.

If the goal is to keep the language minimal, it might make sense to encode numbers using Kombucha data types. This would make it much easier to implement virtual machines and keep the instruction set small, but would result in a big performance hit. Using (unary) Peano arithmetic would certainly be prohibitively slow, but even a binary encoding would lead to a lot of pointer chasing and most likely be several orders of magnitude slower than most bignum libraries. This would make Kombucha unsuitable for anything math-heavy, but perhaps that's acceptable given that Kombucha's symbiotic nature makes it possible to offload math-heavy computation to the host?

Bits, bytes and blobs

Lastly, it might be useful to extend Kombucha with a data type for storing bits and bytes. Such a data type could hold opaque blobs of data, which are not meant to be operated on inside of Kombucha but rather by the host platform. As a result, there is no need to extend the instruction set, as Kombucha's virtual machine would not need to implement any operations on these opaque blobs of data.

The question is then mostly how to represent these blobs in Kombucha: Would it be enough to just add bytes to the language, with larger structures represented as lists of bytes? Or should it be possible to declare blobs with an arbitrary number of bits (or perhaps bytes), so that machine words can be efficiently represented?

A tentative answer

A first approach worth exploring might be to encode both maps and numbers using Kombucha's existing data structures and to only extend the virtual machine with a way to represent opaque blobs of bytes. This would keep the virtual machine small while also providing an “escape hatch” through blobs of bytes, by allowing the host platform to handle math-heavy operations.

It would then even be possible to switch out the standard library if higher performance is necessary: By default, the standard library would use an encoding of numbers using Kombucha data structures, with minimal requirements for the virtual machine. But a different standard library might implement numbers using blobs of bytes and let the host handle the number operations through a bignum library outside of Kombucha, without having to change any of the application code.