No More Primitives - What Python and Java Get Wrong

Wednesday, March 14th 2018

Many languages such as Java distinguish between primitives and objects. Primitives are stored directly on the stack, and don’t have any callable methods. Objects have callable methods and fields, and are typically allocated on the heap with only a pointer stored on the stack ¹.

Furthermore, in Java, every primitive has a corresponding object that wraps the value of the primitives - these objects have methods, and can also be used to allocate primitives on the heap when necessary. These wrappers are typically known as boxed values. This duplication can lead to confusion, as in the snippet below. We have two variables that hold the same integer, but we can only call methods on one.

class Main {
  public static void main(String[] args) {
    Integer a = new Integer(3);
    int b = 3;
    System.out.println(a.floatValue()); // Works
    System.out.println(b.floatValue()); // Doesn't work
  }
}

In theory, this confusion is worth it because we get a performance benefit - primitive ints take up exactly 32 bits and don’t require heap allocations, which are slow relative to stack operations. They are also stored directly on the stack, and thus don’t require a dereference (and possible cache miss) to access the value.

However, I believe primitive types could be removed entirely without hurting performance. All you need are objects. First let’s see how dynamic languages do this.

Dynamic Languages

Lots of dynamic languages unify values under one object model, meaning that there are only objects. Python does this, and as a result integers are always boxed and stored on the heap. This means that addition requires dereferences, as well as the allocation of new objects, which makes integer operations slow ².

Unlike Python, Ruby stores integers, floating points, and other simple values directly on the stack. However, at the same time, they act as objects, and are part of a unified object model.

1.next() #=> 2

Ruby seems to have the best of both worlds - we have callable methods and we don’t have dereferences/allocations for primitive operations. Ruby does this through a special-case hack for certain types called immediates, which it stores directly on the stack. Every stack value has a simple enum tag which allows us to differentiate between an object pointer on the stack and integers and floats on the stack³.

Now assume that we are searching for the next method on a reciever⁴. We start by getting the class of the object on the stack. For immediates, the rb_class_of method returns hard-coded pointers to the appropriate class for the immediate - for regular objects we use the stored class pointer. Now we can scan the class’s method table to find the method implementation. An outline of the ruby VM code that does this is below.

// https://github.com/ruby/ruby/blob/ff43e24374fadd8f20dd6c63e0f8447dc6c560a9/vm_insnhelper.c#L1311
static void vm_search_method(..., VALUE recv)
{
  VALUE klass = rb_class_of(recv);
  ...
}

// https://github.com/ruby/ruby/blob/96db72ce38b27799dd8e80ca00696e41234db6ba/include/ruby/ruby.h#L2001
static inline VALUE rb_class_of(VALUE obj)
{
  if (RB_IMMEDIATE_P(obj)) {
    if (RB_FIXNUM_P(obj)) return rb_cInteger;
    if (RB_FLONUM_P(obj)) return rb_cFloat;
    if (obj == RUBY_Qtrue)  return rb_cTrueClass;
    if (RB_STATIC_SYM_P(obj)) return rb_cSymbol;
  }
  else if (!RB_TEST(obj)) {
    if (obj == RUBY_Qnil)   return rb_cNilClass;
    if (obj == RUBY_Qfalse) return rb_cFalseClass;
  }
  return RBASIC(obj)->klass;
}

Back to Statically Typed Languages

It’s clear that we can have the best of both worlds in dynamic languages. However, we can also have this in statically typed languages like Java. Consider C#, where primitive types can be dispatched on.

class MainClass {
  public static void Main (string[] args) {
    double a = 0.3;
    Console.WriteLine (a.ToString());
  }
}

This is likely implemented in a similar fashion where certain types are hard-coded to look up in the appropriate method table. In this situation there is no type enum on the stack, because we can statically know the types of objects at any position on the stack. Java could have done something similar. Primitives would still need to be boxed when dealing with generics and polymorphism ⁵, but that could happen transparently without the user noticing. Furthermore, the Integer class would be both immutable and final, in order to prevent mutation and subclassing.

Ultimately, we would have had a cleaner, object-oriented heirarchy suitable for a higher level language.

The Java VM can use escape analysis to determine if an object will leave the method where it was created. If it doesn’t escape, then the object can be allocated on the stack. Note that objects still need space for a class pointer, which primitives don’t have. ↩
Python implements an optimization whereby small integers are preallocated in an array so that they don’t need to be reallocated. Some discussion into this can be found here. The CPython code related to this is here. ↩
Check out NaN Tagging for an interesting discussion on how to keep the tag and data for primitives in one 64-bit value. ↩
In a.foo(), a is the “reciever.” ↩
Java generics use the same byte-code for different specializations. For example ArrayList<Integer> must have the same bytecode code as ArrayList<Object>. This means primitives must be boxed to look like objects. C# does better - it actually generates ‘reified’ bytecode for different specializations of a generic class. This means that List<int> can use stack values, while List<object> uses pointers to the heap. ↩

#programming-languages #java #python #c-sharp #ruby

No More Primitives - What Python and Java Get Wrong

Dynamic Languages

Back to Statically Typed Languages

Similar Posts