Notes
2026/01/17
What are the most fundamental ways of representing data in a minimal programming language? Let's assume a language with some model of computation (such as lambda calculus functions and variables), what are the most minimal ways to extend it so that we can specify arbitrary data types?
There is of course the option of sidestepping the issue and pretending that computation and data are the same thing. This is what pure lambda calculus does. Another option is to add a single collection type to the language and build everything else out of it, for example using cons lists in Lisp, or using objects in a minimal OOP language like Self. Both of these options are quite abstract and hard to map to an efficient abstract machine model.
Two other well-known options, which allow us to specify new types using low level data type definitions, are ML-style data types and C's combination of unions and structs. Apart from C's unions being untagged (and thus allowing for some unsafe shenanigans), these are conceptually pretty similar, because they allow us to encode sum types (variants of a data type in ML-style languages, tagged unions in C) and product types (fields of a data type in ML-style languages, structs in C).
But where they differ is that C's abstract machine model exposes additional distinctions that ML-style data types abstract away. What are these distinctions? And would it be possible to build a functional language with an ML-like feel that exposes these distinctions?
The most glaring difference between ML and C is that C gives us control over when data is allocated “boxed” on the heap vs. “unboxed” on the stack, by exposing pointers as part of the language. In ML-style languages, the decision of whether to box or unbox values is not exposed: Values could be kept unboxed for data type definitions that are small enough or don't have parameters, or boxed if the compiler decides so.
The only way to build a list-like data structure out of data type definitions (in the absence of other collection types) is to recursively nest data types, such as using cons lists or trees. This is possible in C as well, but C also provides arrays and thus a contiguous way of storing sequential data. ML-style data types and C-style structs are tuples, which can pack heterogeneous data together, whereas C arrays contain homogenous data in the form of elements of the same size.
In the purely functional subset of ML-style languages with data types definitions, all values are shared and no operation destructively mutates them. In C, direct mutation is possible. From the perspective of a functional programmer, this comes with a wealth of problems, but it also makes C flexible enough to serve as a compilation target for languages that distinguish data with a single owner (thus allowing safe mutation) from data with multiple references to it (where mutation would thus have observable side effects).
Traditionally, languages have presented these three distinctions as intimately linked: Either we have C's free for all with all the implications of unsafety that C brings with it, or the choice has been made for us and the three distinctions are rolled together, in the form of persistent data types in purely functional languages, which are boxed, nested and shared.
But as more recent languages such as Rust have demonstrated, these distinctions can be untangled without necessarily introducing C-style unsafety. Rust exposes all of the above distinctions, at the cost of requiring a borrow checker and rejecting some valid programs.
It's possible to untangle these distinctions even in the absence of Rust-style static borrow checking. Koka's Perceus reference counting mutates values with a single owner, while sharing values with multiple references to it, based on runtime semantics, so that the same algorithm can be used for both owned and shared data. Mutable Value Semantics is another approach that uses copying instead of sharing when multiple references are involved.
In a functional language, the above distinctions could be fully independent while relying purely on runtime checks:
The distinction between owned and shared data can either be tracked statically (using linear types / borrowing like in Rust) or dynamically at run-time (like in Koka/Perceus and MVS).
Koka/Perceus and MVS each go beyond an ML-style model by building on the distinction between owned and shared data, but without exposing that distinction as user-level annotations and static checks. Instead, algorithms are polymorphic over owned/shared data and mutate in place when possible. Koka/Perceus is (mainly) built on boxed and nested data, whereas MVS is (mainly) built on boxed and contiguous data. Neither of them make it possible to mix and match between the two approaches, nor do they make it possible to keep all data unboxed.
What would such a language look like? Hard to say. No existing language seems to expose all three distinctions independently while maintaining functional semantics. The design space remains largely unexplored.
A functional language that exposed all of these distinctions would let programmers choose: boxing for indirection, contiguous data for cache locality, and ownership for safe mutation. Each choice independent, each with clear tradeoffs, none privileged by the language designer's assumptions about what matters most.