The Boehm GC Feels Like Cheating
A while ago, I worked on a simple interpreted Lisp, primarily as a way for me to understand the philosphy and ideas behind Lisp languages.
However, I ran into a problem when trying to integrate garbage collection. Because I didn’t plan garbage collection from the beginning, my code left pointers to Lisp objects (expr*
) all over the C stack. As an example, take a look at my implementation of the CONS
function, which is a builtin (a language primitive written in C). CONS
takes two expressions, evaluates them, and returns a cell where the head and tail point to the first and second evals respectively.
expr *builtin_cons(scope *scope, expr *arguments) {
expr *first = eval(scope, nth(0, arguments));
expr *second = eval(scope, nth(1, arguments));
return create_cell(first, second);
}
Note how expr *first
is stored on the stack (in a local variable) while expr *second
is being evaulated. This is a problem - what if a GC triggers during the eval of the second argument 1. How will I know that there is still a reference to the first expression so that it isn’t freed?
I considered rewriting my interpreter to do the following:
- Keep a global stack of
expr*
pointers. - Push
expr*
pointers to the stack before they are in danger of being collected. - Pop them when they are no longer in danger of being collected.
The code would basically look like this:
expr *builtin_cons(scope *scope, expr *arguments) {
expr *first = expr_stack_push(eval(scope, nth(0, arguments)));
expr *second = expr_stack_push(eval(scope, nth(1, arguments)));
expr* result = create_cell(first, second);
expr_stack_pop(2);
return result;
}
Now, I can discover all the reachable objects by scanning the global stack of expr*
pointers.
However, I already wrote most of the interpreter, so it’s too much work to go back and fix everything. Turns out there’s a shortcut - enter conservative garbage collection.
Conservative Garbage Collection
Conservative garabage collectors are drop-in GCs that work with any C program. At GC points, they look at the C stack and registers directly to figure out what objects are still reachable. However, the GC doesn’t actually know if a value is pointer, thus it must behave conservatively. If a number happens to look like a heap pointer, then the GC will assume that it’s a heap pointer. I decided to integrate the Boehm GC, which has been used by Mono and other production languages.
To start, we add a call to GC_INIT()
before we do any allocations. Now, we replace all our malloc(size)
calls with GC_MALLOC(size)
and explicit free(ptr)
calls with GC_FREE(ptr)
. We can also use GC_MALLOC_ATOMIC(size)
to allocate atomic buffers - the GC won’t scan inside these when looking for more heap pointers, so it’s a slight optimization. I used these for strings, and you probably want to use it for things like matrices.
Now we… Wait! That’s it! Full garbage collection in a few lines. This is the diff where I added garbage collection.
This is pretty awesome - the Boehm GC works for any program, as long as pointers aren’t obfuscated in some way. It was far easier than having to do the hard work of writing my interpreter properly!
Precise Mode
I noticed that the Boehm GC has a precise mode, where you can tell it exactly where in structs the pointers to other objects might be. This should make GC faster, as it doesn’t have to scan as much, but I couldn’t find any documentation on how to use this.
Another Method for Tracking the Root Set
I later came across another lisp project - minilisp by rui314. It features an interesting approach to tracking the root set.
In each C function, you create an array void* root
of n + 2
pointers, where n
is the number of expressions that this C function will have handles to. Obj*
(the equivalent of my expr*
) pointers will go in here. The C code will then hold Obj**
double pointers in local variables, which point into the root
array.
Now, when you recieve an Obj*
, you save it using code like this *obj = read_expr(...);
, where obj
is an Obj**
. Thus, all Obj*
pointers used in a C function are in this array.
Here’s the important part - the first of the n + 2
pointers is a pointer to the root
array of the caller of the current function. Furthermore, whenever we call a function, we always pass in the current root
as the first argument, so that our callee’s root
can point back to our root
. Therefore we have a stack of root
arrays.
Finally, the last of the n + 2
pointers is a sentinel ROOT_END
. Now, when we do a GC, we start with our current root
, scan until the ROOT_END
, then we traverse to the caller’s root
and scan until ROOT_END
, and so on. Thus we traverse the entire set of reachable objects.
-
Interpreters typically trigger GC during the allocation routine, so if the second expression allocates a new object, it could trigger a GC. ↩