Saturday, October 28, 2023

The two times I have been misled by Compiler Canon (and the Dragon book)

Twice in my life there have been these weird moments in compilers where I realize that something I have been taught, something that has been considered canon, turns out to be misleading.

The first time this happened was when I started at IBM Research in a compiler group. Previously, my formal knowledge of compilers came from the first edition of the Dragon Book.

The focus of register allocation was on reducing the stack depth needed to evaluate expressions. There was something called Sethi-Ullman numbers, which computed the maximum stack depth.
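
To make the idea concrete, here is a minimal sketch (in Python, my choice) of computing Sethi-Ullman numbers over a binary expression tree. The Node class and its field names are made up purely for illustration.

```python
# A minimal sketch of computing Sethi-Ullman numbers over a binary
# expression tree. The Node class and its fields are made up here
# purely for illustration.

class Node:
    def __init__(self, op, left=None, right=None):
        self.op = op          # operator name, or a leaf variable/constant
        self.left = left
        self.right = right

def sethi_ullman(node):
    """Minimum number of registers (equivalently, maximum evaluation
    stack depth) needed to evaluate the tree rooted at `node`."""
    if node.left is None and node.right is None:
        return 1                      # a leaf needs one register
    l = sethi_ullman(node.left)
    r = sethi_ullman(node.right)
    if l == r:
        return l + 1                  # both subtrees are equally demanding
    return max(l, r)                  # evaluate the demanding subtree first

# (a + b) * (c - d): each subtree needs 2 registers, the whole tree 3.
expr = Node("*",
            Node("+", Node("a"), Node("b")),
            Node("-", Node("c"), Node("d")))
print(sethi_ullman(expr))             # -> 3
```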

Since one of the authors of the Dragon Book was Jeff Ullman, it is not a surprise that they would be mentioned.

And for the DEC PDP-11 machines this made sense, because there were stack-like increment/decrement instructions on a special "stack pointer" register. It never made sense for IBM's computers, which had general-purpose registers without such increment/decrement instructions.

When I got to IBM around 1982, there was a register allocation implementation by Greg Chaitin, based on an idea from the Russian computer scientist A. P. Ershov that came via IBM Fellow John Cocke. While John Cocke had been floating the idea around since 1971 or so, Greg's simple, elegant implementation made it happen and put it into the PL.8 compiler.
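
The post doesn't describe the algorithm itself, but the core of Chaitin's allocator is coloring an interference graph. Here is a greatly simplified sketch of that idea -- no liveness analysis, spilling, or coalescing -- with a hand-made interference graph and function names of my own choosing.

```python
# A greatly simplified sketch of graph-coloring register allocation in
# the style Chaitin popularized. No liveness analysis, spilling, or
# coalescing; the interference graph is hand-made for illustration.

def color(interference, k):
    """Assign each variable one of k registers so that no two
    interfering variables share a register, or return None."""
    graph = {v: set(ns) for v, ns in interference.items()}
    stack = []

    # Simplify: repeatedly remove a node with fewer than k neighbors.
    while graph:
        node = next((v for v, ns in graph.items() if len(ns) < k), None)
        if node is None:
            return None            # would need to spill; not handled here
        stack.append((node, graph.pop(node)))
        for ns in graph.values():
            ns.discard(node)

    # Select: pop nodes back, giving each the lowest-numbered register
    # not already used by a colored neighbor.
    assignment = {}
    for node, neighbors in reversed(stack):
        used = {assignment[n] for n in neighbors if n in assignment}
        assignment[node] = min(r for r in range(k) if r not in used)
    return assignment

# Variables a..d with their (symmetric) interference edges.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(color(interference, 3))   # -> {'d': 0, 'c': 1, 'b': 0, 'a': 2}
```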

Subsequent editions of the Dragon Book ditched Sethi-Ullman numbers and described this instead.

The second time is more recent, and relates to how compilers work and how decompilation works.

Around 2006, I became interested in decompilation as a way to get precise location information about where a program is while it is running code, such as inside a debugger, or just before producing a stack dump. There was some code that had been started around 2000 and was largely abandoned a year or two later after going through two people. I don't think subsequent maintainers understood it as well, since it has a compiler-centric orientation.

There were a couple of people who wrote theses on this topic, and their work seems to be canon for decompilation. It assumes a general-purpose decompiler -- one that starts with machine code and produces source code. It has no understanding of the source language, the translator, or the compilation system used. Therefore, its output source code has to be some sort of lower-level language bridging the semantics of machine-code-level instructions and control flow. C is a good language for this kind of thing.

The decompiler-as-a-kind-of-compiler approach, however, is at odds with this. It does assume a specific compilation system and its transformations into a high-level bytecode. And by doing this, the decompiled source code is the same as the source code we started out with. Furthermore, by doing this we don't really need to understand the semantics of the bytecode instructions!
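
As a small illustration (my own example; the post doesn't name a particular bytecode), Python's dis module shows how closely a language-specific bytecode tracks the source that produced it, down to source line numbers -- which is what a compiler-aware decompiler can exploit.

```python
# My own illustration: a language-specific bytecode stays close to the
# source that produced it, and each instruction records the source line
# it came from. A compiler-aware decompiler can exploit exactly this.
import dis

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

# Print the compiled bytecode; the left-hand column shows the source
# line each group of instructions came from.
dis.dis(celsius_to_fahrenheit)
```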

At first I thought this was kind of a one-off. Part of this was thinking that bytecode as a general principle is a one-off. I now realize that this is not the case. As the Wikipedia article on p-code machines explains, the idea goes back as early as 1966 and traces through Euler and Pascal via Niklaus Wirth. Rather than being a one-off, bytecode is used in a number of systems like Smalltalk and Java, probably gaining popularity from the Pascal P-code interpreter, UCSD Pascal, and Turbo Pascal.

There was a school of thought at IBM that there was a holy grail: a universal, source-language- and machine-independent, register-based intermediate language, something akin to LLVM. To some extent there still is a school of thought that believes this, and there is possibly some truth to it. However, there are a lot of systems that use a bytecode designed for a particular language or for a particular family of languages. Microsoft CIL and the JVM are examples of the family-of-languages camp.

See https://www.youtube.com/watch?v=IT__Nrr3PNI&t=5311s for the history of the JVM and how it was influenced by earlier bytecode systems.