Tuesday 22 November 2011

Re: Highbrow Java; Or: Java Generics and Why I Still Hate Them

This, well, rant, originated as a more elaborate answer to a comment on an article I left unanswered for over half a year. I wanted to apologise by coming up with a rather exhaustive answer. Turns out I got exhausted before I could finish complaining, so I'll just post what I have right now and will continue ranting at some later stages.

Because this is so long, I'll save everyone the trouble and give an abstract: I argue that, while generics were a neccessary and "good" addition to Java, this particular implementation of what is in essence second-order typed lambda calculus, is poor and severly limits the compiler's ability to guarantee run-time type safety. Turns out I'm focussing on arrays (again,) but that was the low hanging fruit. I'll get to the lack of expressive power in a later post.

Type Safety and Type Polymorphism

Why do we care about static type systems and type safety? I think one can discern two major points, quoting from Wikipedia:

  • Static typing is a limited form of program verification
  • Program execution may also be made more efficient by omitting runtime type checks and enabling other optimizations.

I actually only care about the first point: a good type system can catch mistakes in code paths rarely taken, which may have otherwise eluded testing. It may successfully shift some of the debugging effort from run-time to compile-time. The merit of any performance gain during run-time is debatable, and certainly not a really strong argument in my opinion.

What object-oriented programmers refer to as generic programming is parametric polymorphism (as opposed to ad-hoc or subtype polymorphism, both of which Java also supports.) Parametric polymorphism is also famously captured in System F, or polymorphic (or second-order typed) lambda calculus. It introduces universal quantification over types. There's a similar system developed by Mitchell & Plotkin (1988), which introduces existential quantification instead.

Prior to 1.5, Java lacked any kind of parametric polymorphism1. The need for parametric polymorphism in a statically typed programming language should be obvious to any reader: it heightens the expressive power of the type system without compromising on type safety. Thus, generics were a step in the right direction.

An important mental note: in OOP terminology, polymorphism usually refers to subtype polymorphism. Sometimes polymorphic methods are mentioned, which correspond to ad-hoc polymorphism. When I refer to polymorphism in the rest of this text, I will mean the parametric kind, i.e. generics.

Erasure

In Java, generics are only used at compile-time. The JVM isn't even aware of the existence of generics. This is called erasure, and it exists purely for backwards compatibility reasons. It leads to several unfortunate problems with type safety.

When programming with generics in java, I'm always amazed by the amount of explicit casting one has to perform. Ideally, you'd never want to ever do that. Explicit casts can't generally be checked at compile time — we give up type safety by using them.

Reifcation

The opposite of erasure is reification — a very fancy name deriving from the latin word res (thing.) Naftalin and Wadler thus call it thingification. For us, it means that a type carries run-time information about itself. A Number knows it's a Number at run-time, and you can retrieve that information using the reflection API. A List<Number> only knows it's a List, but it has no idea about the fact that it's carrying Numbers.

The proposal for Reified Generics thus would either completely throw out type erasure, or allow some sort of explicitly reified generics syntax.

There currently are only a couple of non-reifiable types in Java:

  • type variables
  • instantiations of polymorphic types (such as List<String>)
  • bounded instantiations of polymorphic types (such as List<? extends Number>)2

Generics Have Poor Support for Arrays

Arrays in Java feel like a wart nowadays, especially in the presence of generics. There is one cruicial constraint on array creation, which makes it utterly annoying to use them: the component type of an array must be reifiable.

See, given a type variable T, T[] is perfectly valid in Java. The Collection interface defines a method <T> T[] toArray(T[] a) after all! But there's something fishy about this method: why do I need to pass it an array? And isn't Collection defined as Collection<E>? T is only in scope for this one method, as can be gleaned from the type signature. T has nothing to do with E. The following code typechecks:

Collection<String> cs = new ArrayList<String>();
Number[] na = cs.toArray(new Number[10]);

There's an alternative method Object[] toArray(), which is more along the lines of what you'd expect such a method to do. It creates the array for you, (that's the point after all,) but it creates an array of the generic Object type which you'd have to cast into something more appropriate yourself.

The cause of this idiosyncracy is simple: you cannot write code that explictly creates arrays with non-reifiable component types. Recalling our earlier list, it means that type variables and (bounded) instantiations of polymorphic types cannot be used for arrays without casting explicitly.

<T> void f() {
T[] a = new T[1]; // error
List<Integer>[] il = { Arrays.asList(1,2) }; // error
}

Both of these lines will fail with generic array creation (hooray for descriptive compiler error messages.)

But wait, how do the collection classes like ArrayList do it? Don't they have to use arrays with a polymorphic component type internally? Yes, they do. And you can create generic arrays:

T[] ta = (T[]) new Object[1]; // unchecked

Since the compiler cannot ensure the above line will actually work at runtime, it issues an "unchecked" warning for this line of code (see below.) The compiler is right, too. ta now has the run-time type Object not whatever T is! So if you instantiate T to, say String, and try to store it as a String[], you will receive a run-time error even though you didn't explicitly cast anything!

Naftalin & Wadler thus tell you to adhere to the The Principle of Truth in Advertising:

The reified type of an array must be a subtype of the erasure of its static type

That's quite a mouthful isn't it? I recommend reading the appropriate chapter 6.5 in Naftalin & Wadler (2006) for more information, but the gist is this: The run-time type of any array must be a the same as or a subtype of what is left of its compile-time type after erasure kicks in. If you don't adhere to this principle, you will get into trouble selling something as being of type a where it actually is of a completely different type b which can be anything. The compiler can't catch this, and it will result in an unchecked run-time exception most likely terminating your entire program. It's your responsibility (and it shouldn't be.)

This is the reason writing T[] toArray() is not advisable; you have to pass toArray a T[] parameter that you create yourself, and that you are responsible for. The compiler can't help you.

The solution to this is one of: a) don't mix arrays and generics3, it's bad for your mental health, or b) arcane magic aka reflection. I will leave it to the reader to figure out method b)4, since I'm tired of writing this at the moment. It's a pointless exercise anyway: the reality of the matter is, b) doesn't really exist in case you want to write a well-designed library. Why? Well…

I find the next example particularly devious (from N&W 2006, pp. 87)

List<Integer>[] ils =
(List<Integer>[]) new List[] {Arrays.asList(1)}; // unchecked
List<? extends Number> nls = ils;
ils[0] = Arrays.asList(1.01) // storing a Double in an Integer list!
int n = ils[0].get(0); // class cast exception

You just need to inadvertently pass a reference of your array with a non-reifiable type to some devious method and it might put stuff in there that will make your program crash. And the compiler never saw it coming5. N&W thus call for adherence to the Principle of Indecent Exposure

Never Publicly expose an array where the components do not have a reifiable type

These two principles together mean that you should avoid non-reifible component types in both the source and run-time, which means, you should stick rule a) don't mix arrays and generics. At all. Of course the Java standard library doesn't have to follow these rules.

The way ArrayList et al. handle the issue internally is to not externally expose any generic array they do very much internally use, which is why you have to pass your own T[] to toArray. This also entails to tread lightly whenever using generic arrays internally, as well.

Inner Classes of Polymorphic Classes May Not Be Used as Array Component Types

Now that was a long subsection title. But it really is that silly:

public class C<E> {
N[] ns = new N[10]; // *error*: Generic Array Creation
private class N { int data; }
}

This will not compile. If you omit the parameter for C, it's no problem. If you put N into its own class file, it's fine again. I'm not exactly sure why that is, since the type parameter isn't even used, and N should be a reifiable type, since it's not parametric.

No Polymorphism for Exceptions

Type erasure also leads to the awkward consequence that anything deriving from Throwable cannot be a generic type, since the JVM couldn't distinguish different instantiations of that type, but it needs to in the case of exception handling.

Static Fields of Generic Types

It also makes it impossible to have any static fields of a class have the same type as a type parameter:

public class C<A> {
static A a;
}

In order for this to work, the runtime would have to keep track of a value for a for each instantiation of type A, which is not possible, since it doesn't even know that A exists! It only ever knows about the raw type C.

Downcasting Towards a Polymorphic Type is Always Unsafe

And it can result in run-time failures with a ClassCastException that are far removed from the actual origin of the erroneous code. Consider the following snippet, loosely based on FAQ005:

void f() {
List<Date> dl = new ArrayList<Date>();
List<String> sl = (List<String>) ((Object) dl); // unchecked warning
g(sl);
}
void g(List<String> l) { String s = l.get(0); } // ClassCastException

The above code will always issue a warning, and we see why: the compiler cannot generally ensure that the code won't lead to a runtime error. The problem is: there are scenarios were doing this is actually useful, and legal: given Token, a subtype of Annotation, casting a List<Annotation> to a List<Token>, for example, which I have had to do in the context of UIMA very often will always result in a warning. The compiler cannot reason about whether the warning is justified or not.

Don't just ignore those unchecked warnings, it can end badly, and it will confuse you.

Theoretically…

I would recommend to anybody further interested in exploring the theory behind (and above!) Java's Generics, to have a look at Wadler's page on them. A lot of very interesting papers are linked there, as well as the authoritative O'Reilly book on Java Generics.


  1. Well, that's not exactly true. Technically, the array type does count for a polymorphic type

  2. Curiously, wildcards are reified, so List<?> has full run-time type information, but List<? extends Object>, which is equivalent, does not.

  3. Did I mention that varargs are actually just arrays? Yes, don't mix generics and varargs either; same restrictions apply. Some methods in the standard library use varargs, among them Arrays.asList. And this does mean that creating a List<T> or List<List<Integer>> with said method shouldn't be attempted…

  4. Hint, it involves Arrays.asList and can't avoid unchecked casts either.

  5. Well, it did issue a warning alright, but as we know, there is no way around that in case you're dealing with arrays anyway.