The Boehm GC Feels Like Cheating

Thursday, March 15th 2018

A while ago, I worked on a simple interpreted Lisp, primarily as a way for me to understand the philosphy and ideas behind Lisp languages.

However, I ran into a problem when trying to integrate garbage collection. Because I didn’t plan garbage collection from the beginning, I shot myself in the foot, due to the fact that I had pointers to Lisp objects (expr*) all over the C stack. As an example, take a look at my implementation of the CONS function, which is a builtin (a language primitive written in C). CONS takes two expressions, evaluates them, and returns a cell where the head and tail point to the first and second evals respectively.

expr *builtin_cons(scope *scope, expr *arguments) {
  expr *first = eval(scope, nth(0, arguments));
  expr *second = eval(scope, nth(1, arguments));
  return create_cell(first, second);

Note how expr *first is stored on the stack while expr *second is being evaulated. This is a problem - what if a GC triggers during the eval of the second argument 1. How will I know that there is still a reference to the first expression so that it isn’t freed?

I considered rewriting my interpreter to do the following:

The code would basically look like this:

expr *builtin_cons(scope *scope, expr *arguments) {
  expr *first = expr_stack_push(eval(scope, nth(0, arguments)));
  expr *second = expr_stack_push(eval(scope, nth(1, arguments)));

  expr* result = create_cell(first, second);
  return result;

Well, I already wrote most of the interpreter, so it’s too much work to go back and fix everything. Turns out there’s a shortcut - enter conservative garbage collection.

Conservative Garbage Collection

Conservative garabage collectors are drop-in GCs that work with any C program. At GC points, they look at the C stack and registers directly to figure out what objects are still reachable. However, the GC doesn’t actually know if something is a pointer to the stack, thus it must behave conservatively. If a number happens to look like a heap pointer, then the GC will assume that it’s a heap pointer. I decided to integrate the Boehm GC, which has been used by Mono and other production languages.

To start, we add a call to GC_INIT() before we do any allocations. Now, we replace all our malloc(size) calls with GC_MALLOC(size) and explicit free(ptr) calls with GC_FREE(ptr). We can also use GC_MALLOC_ATOMIC(size) to allocate atomic buffers - the GC won’t scan inside these when looking for more heap pointers, so it’s a slight optimization. I used these for strings, and you probably want to use it for things like matrices.

Now we… Wait! That’s it! Full garbage collection in a few lines. This is the diff where I added garbage collection.

This is pretty awesome - the Boehm GC works for any program, as long as pointers aren’t obfuscated in some way. It was far easier than having to do the hard work of writing my interpreter properly!

Precise Mode

I noticed that the Boehm GC has a precise mode, where you can tell it exactly where in structs the pointers to other objects might be. This should make GC faster, as it doesn’t have to scan as much, but I couldn’t find any documentation on how to use this.

Another Method for Tracking the Root Set

I later came across another lisp project - minilisp by rui314. It features an interesting approach to tracking the root set.

In each C function, you create an array void* root of n + 2 pointers, where n is the number of expressions that this C function will have handles to. Obj* (the equivalent of my expr*) pointers will go in here. The C code will then hold Obj** double pointers in local variables, which point into the root array.

Now, when you recieve an Obj*, you save it using code like this *obj = read_expr(...);, where obj is an Obj**. Thus, all Obj* pointers used in a C function are in this array.

Here’s the important part - the first of the n + 2 pointers is a pointer to the root array of the caller of the current function. Furthermore, whenever we call a function, we always pass in the current root as the first argument, so that our callee’s root can point back to our root. Therefore we have a stack of root arrays.

Finally, the last of the n + 2 pointers is a sentinel ROOT_END. Now, when we do a GC, we start with our current root, scan until the ROOT_END, then we traverse to the caller’s root and scan until ROOT_END, and so on. Thus we traverse the entire set of reachable objects.

  1. Interpreters typically trigger GC during the allocation routine, so if the second expression allocates a new object, it could trigger a GC. 

#programming-languages #c #lisp #garbage-collection

Similar Posts