Saturday, October 7, 2017

What the Dragon Book Didn't Tell Us?

Originally posted on Quora, but they've stopped allowing updates. The Quora post has some reader comments.

First I'll say that I like this book on Compilers. It is probably the single book that has changed the landscape of compiling technology the most.

Also, in the first edition there was a nice dragon on the cover and the tail flowed around the spine over to the back. That green tail on the spine looked nice on a bookshelf of mostly technical books with boring covers and spines. Think of Knuth's Art of Computer Programming or Aho and Ullman's earlier book, The Theory of Parsing, Translation, and Compiling, to get the idea. Later editions didn't look so nice on the shelf.

Andrew Appel’s excellent book “Modern Compiler Implementation in Java: Basic Techniques” gets honorable mention for having what I guess is the midsection of a tiger on the spine, but that is not as easily recognizable on the shelf as a dragon’s tail. It also has a nice pun in the epigraph for the chapter on method lookups in object-oriented languages. (The dragon book has nothing on how to handle method lookups in OO languages.)

But there is a lot that I have learned about compilers and compiling technology in the 40 years since that book came out that just isn't in there. The last U.S. edition came out around 2007, so I wondered if it has some of the things I've come to learn. At $100–160 for a hardback or even a Kindle edition, that's a bit steep for a book more than a decade old. So I asked a computer scientist in the Maldives for his copy, which can be had there for about $15.

The book arrived the other day and I was eager to see what's been updated. None of the topics I had hoped would be covered are in there. So lemme describe what I think should be in there.

In my case, I make lots of mistakes. So I have needed and have used debuggers a lot. And I've written quite a few of them, even for some obscure things like GNU Make, bash, or zsh. (But also for more conventional languages like Python and Go.)

More discussion of debuggers, or just of the run-time system and environment support for debuggers, would be nice. The runtime system for dynamic languages versus statically compiled languages is a bit different. And just-in-time compilers build on, but work a bit differently from, either interpreters or compilers. None of this is mentioned.

I’ve needed to beef up the run-time support in many of the dynamic programming languages that I’ve worked on to create or extend a debugger. For example, between bash 2.05b and bash 3, changes were made based on patches that I submitted. Those changes made it possible to write a debugger.

Such information would include the properties of stack frames or code objects that assist stack backtraces, execution tracing, execution profiling, and debugging. Or the different ways one could implement breakpoints, and the different ways of storing program-location information. Or the various kinds of debugger APIs: remote out-of-process debugging, as in Java, nodejs/V8, or PHP, versus the in-process callback-hook debugging found in many dynamic languages.
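
To make the in-process callback-hook style concrete, here is a minimal sketch using Python's real sys.settrace hook, which is exactly this kind of API: the runtime calls the hook on call, line, return, and exception events, and the hook decides whether a breakpoint fires. The breakpoint table and its layout are my own illustration, not taken from any particular debugger.

    import sys

    def target():
        x = 1
        x = x + 1      # we'll set a "breakpoint" on this line
        return x

    # Breakpoints keyed by (filename, line number); illustrative only.
    BREAKPOINTS = {(target.__code__.co_filename,
                    target.__code__.co_firstlineno + 2)}

    def trace(frame, event, arg):
        # The interpreter invokes this hook on 'call', 'line', 'return',
        # and 'exception' events; returning the hook keeps tracing active.
        if event == "line":
            key = (frame.f_code.co_filename, frame.f_lineno)
            if key in BREAKPOINTS:
                # A real debugger would enter a command loop here.
                print("breakpoint at %s:%d, locals %r" % (key + (frame.f_locals,)))
        return trace

    sys.settrace(trace)
    target()
    sys.settrace(None)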

Or issues around debugging optimized code. One simple idea that I would like to see explored is the optimizer treating each transformation as a transaction in the database sense, so that it would just log what transformation took place. That might make it easier for the hapless programmer who has to unscramble the result.
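
I have no worked-out design for this, but the journal might look something like the sketch below, where every rewrite appends a record that a debugger could later walk backwards. Everything here, names included, is hypothetical.

    from collections import namedtuple

    # Hypothetical transformation journal: each optimizer rewrite is
    # recorded like a database transaction entry, so a debugger can map
    # the optimized code back toward what the programmer wrote.
    Rewrite = namedtuple("Rewrite", "pass_name before after source_lines")

    journal = []

    def apply_rewrite(pass_name, before, after, source_lines):
        journal.append(Rewrite(pass_name, before, after, source_lines))
        return after

    # For instance, constant folding of an expression from line 12:
    expr = apply_rewrite("constant-fold", "2 * 3 * x", "6 * x", (12,))

    for r in reversed(journal):      # replay the journal backwards
        print("line(s) %s: %s rewrote %r to %r"
              % (r.source_lines, r.pass_name, r.before, r.after))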

In an area similar to debugging and run time, there are object formats and what goes into an object format. Or what goes on in a binder or linker. When I look in the index for object formats like COFF, ELF, EXE, OMF, or I'll even take the old Unix a.out, there is nothing. Likewise for the associated symbol-table sections like DWARF or stabs. John R. Levine's book "Linkers and Loaders", while 17 years old, is still very relevant. (And it too has a cool and appropriate cover.)
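
As a taste of what such a chapter could cover, here is a short sketch that reads the fixed 64-byte header of an ELF64 file; the field layout comes straight from the ELF specification, though a real tool would use a library or readelf.

    import struct

    # The 64-byte ELF64 file header, per the ELF specification: 16 bytes
    # of identification, then half-words, words, and 64-bit offsets.
    # The "<" assumes a little-endian file (encoded in e_ident byte 5).
    ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")

    def read_elf_header(path):
        with open(path, "rb") as f:
            data = f.read(ELF64_HEADER.size)
        (ident, e_type, e_machine, _version, entry, phoff, shoff, _flags,
         _ehsize, _phentsize, phnum, _shentsize, shnum,
         shstrndx) = ELF64_HEADER.unpack(data)
        assert ident[:4] == b"\x7fELF", "not an ELF file"
        print("entry point: %#x" % entry)
        print("%d program headers at offset %d" % (phnum, phoff))
        print("%d section headers at offset %d" % (shnum, shoff))

    read_elf_header("/bin/ls")    # any ELF binary on a Linux system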

And there is the corresponding thing to linking and loading that goes on in dynamic languages: importing, requiring, or loading, and method-lookup strategies.
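
Method-lookup strategies at least have well-documented behavior to draw on. In Python, for instance, a method is found by walking the class's method resolution order, the C3 linearization of its base classes, which the interpreter exposes directly:

    # Python resolves a method by walking the class's MRO, the C3
    # linearization of its bases.
    class A:
        def who(self): return "A"

    class B(A):
        pass

    class C(A):
        def who(self): return "C"

    class D(B, C):
        pass

    print([k.__name__ for k in D.__mro__])   # ['D', 'B', 'C', 'A', 'object']
    print(D().who())                          # 'C': C is found before A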

In working with the Go interpreter, I became aware of the "boxed-value" memory layout it uses. And "thunks". Those are not in the index either. The newer book “Engineering a Compiler” by Keith Cooper and Linda Torczon does talk about the memory shape of datatypes and that is a welcome addition to compiler books.
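
For readers who haven't met the terms, a sketch: a boxed value carries a type tag alongside its payload, so any slot can hold any type, and a thunk is a suspended computation forced on first use. The classes below are illustrative only, not the layout of any particular interpreter.

    # Illustrative boxed value: a type tag paired with the payload.
    class Box:
        __slots__ = ("tag", "value")
        def __init__(self, tag, value):
            self.tag, self.value = tag, value

    # Illustrative thunk: delays a computation, caching it once forced.
    class Thunk:
        def __init__(self, compute):
            self._compute, self._forced = compute, False
        def force(self):
            if not self._forced:
                self._value, self._forced = self._compute(), True
            return self._value

    n = Box("int", 42)
    t = Thunk(lambda: n.value * 2)
    print(n.tag, t.force())    # int 84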

Recently I've been interested in using decompilers as a way of communicating the exact location in a program when an error occurs or when inside a debugger. Since Python and Ruby do very little in the way of “compiling”, reversing the process is very doable. And in such cases you get freakishly exact results. (Possibly not so surprising in Python, where they try to maintain that there is just one right “Pythonic” way to do things.)
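
Python's standard dis module shows why reversing is so tractable: the bytecode for a statement is close to a postorder reading of its parse tree, so little information has been lost. (The exact opcodes vary between CPython versions.)

    import dis

    # The bytecode for a simple statement is nearly a postorder walk of
    # its parse tree, which is what makes decompiling back to source so
    # exact.
    code = compile("z = x + y * 2", "<example>", "exec")
    dis.dis(code)
    # Typically something like: LOAD_NAME x, LOAD_NAME y, LOAD_CONST 2,
    # BINARY_MULTIPLY, BINARY_ADD, STORE_NAME z: a postorder traversal.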

But many decompilers don't handle this as a kind of syntax-directed compilation pipeline (for reasons I need to describe elsewhere), so the implementation of the decompiler code is very hacky. If guidance were given in the same way the dragon book gives it for a compiler pipeline, implementations might be less hacky and have more rigor and robustness.

There are some small but important differences between a compiler and a decompiler. Namely:

  • you don't start out with program text but with instructions of some sort.
  • you can assume that what you are given is correct, which means…
  • you don't have to understand everything completely. For example, you can ignore bookkeeping instructions (popping a stack value to make it match a join with another branch).
  • you can assume type checking is correct.

I could elaborate on each of the points above, but understanding the differences helps in adapting standard compiling theory.

This is true, for example, with respect to parsing, where what I've found most useful is the Earley parser, due to Jay Earley. It handles ambiguous grammars, which would generally be considered a bad kind of grammar to create for a programming language. But in this context ambiguity is the right thing to allow. As with translating from one human language to another, there can be several valid ways to do it.

For flexibility and ease of use, then, you want the Earley parser. The dragon book mentions Earley only in passing in the last chapter's end notes, and takes a dismissive view of its efficiency. Wikipedia says that for most context-free grammars (or nearly context-free grammars, which is the case here), it runs in linear time if you prefer left recursion over right recursion, except in pathological cases. In my experience this is true. Furthermore, the pathological cases can often be addressed once you know where they are. Given its extreme flexibility, such as adding rules at run time based on instructions seen, or the ability to run checks when a grammar reduction occurs, I wouldn't consider using anything else at the moment.
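
To show how little machinery the core algorithm needs, here is a minimal Earley recognizer sketch. It is nothing like a production parser (no parse trees, no runtime rule addition, no reduction checks), just the chart with its predictor, scanner, and completer steps in their plainest form.

    # A minimal Earley recognizer.  Grammar rules are (lhs, rhs) pairs
    # where rhs is a tuple; anything that isn't a rule's left-hand side
    # is a terminal, matched against the tokens by equality.

    def recognizes(grammar, start, tokens):
        nonterminals = {lhs for lhs, _ in grammar}
        # An Earley item is (lhs, rhs, dot position, origin chart index).
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(lhs, rhs, 0, 0) for lhs, rhs in grammar if lhs == start}
        for i in range(len(tokens) + 1):
            changed = True
            while changed:                      # run to a fixed point
                changed = False
                for lhs, rhs, dot, origin in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in nonterminals:
                        # Predictor: expand the nonterminal after the dot.
                        for l2, r2 in grammar:
                            if l2 == rhs[dot] and (l2, r2, 0, i) not in chart[i]:
                                chart[i].add((l2, r2, 0, i))
                                changed = True
                    elif dot == len(rhs):
                        # Completer: advance items that were waiting on lhs.
                        for l2, r2, d2, o2 in list(chart[origin]):
                            if (d2 < len(r2) and r2[d2] == lhs
                                    and (l2, r2, d2 + 1, o2) not in chart[i]):
                                chart[i].add((l2, r2, d2 + 1, o2))
                                changed = True
            if i < len(tokens):
                # Scanner: match the next token against terminals.
                for lhs, rhs, dot, origin in chart[i]:
                    if (dot < len(rhs) and rhs[dot] not in nonterminals
                            and rhs[dot] == tokens[i]):
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[-1])

    # An ambiguous grammar, E -> E + E | n, is no problem here:
    G = [("E", ("E", "+", "E")), ("E", ("n",))]
    print(recognizes(G, "E", ["n", "+", "n", "+", "n"]))   # True

A recognizer like this simply accepts; a full Earley parser would walk the completed chart and hand back every derivation, which is exactly what an ambiguity-tolerant application wants.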

Speaking of grammars, missing is the whole topic of how to debug a grammar, and common testing practices, e.g. round-tripping, bootstrap testing, and running tests over an entire standard library. In the tool I use, I've added grammar coverage profiling to show which grammar rules are extraneous, and whether my test suite is adequate.
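
The coverage idea is simple enough to sketch: count how often each rule takes part in a reduction while the test suite runs, then report the rules that never fired. The recording hook below is hypothetical; where it attaches depends on the parser.

    from collections import Counter

    # Hypothetical grammar-coverage profiler: bump a counter whenever a
    # rule is used in a reduction, then report rules the tests never hit.
    rule_hits = Counter()

    def on_reduce(rule):       # called from the parser's reduce step
        rule_hits[rule] += 1

    GRAMMAR_RULES = ["expr ::= expr + term",
                     "expr ::= term",
                     "term ::= NUMBER"]

    # ... run the test suite through the parser here ...
    on_reduce("expr ::= term"); on_reduce("term ::= NUMBER")

    for rule in GRAMMAR_RULES:
        if rule_hits[rule] == 0:
            print("never exercised:", rule)   # dead rule, or a test gap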

There is a curse that comes with any book being successful and good at what it does: I suspect it reduces the incentive for research on compiler topics that broaden or change things slightly. My decompiler example is one such case.

So we are stuck with a book that, while over 950 pages long, is still very abbreviated (or totally lacking) for at least my use cases where compiling technology has come in handy. I wonder to what extent the material in there has stifled research and innovation in other areas, because compilers were considered a done deal. I mentioned the short shrift given the Earley parser and the pessimistic view of its time complexity. I knew about this, and so it put me off when the first decompiler I ran into used it. I had the same feeling when I first learned about register allocation by graph coloring, because the first edition of the dragon book talked only about Sethi-Ullman stack-based register allocation. I am pleased to say that that omission has been corrected in later editions.

And speaking of parsers, what about packrat parsing and Parsing Expression Grammars, or PEGs? And for that there's something called the omega trick. PEG is not in the index either.
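
Packrat parsing is recursive descent over a PEG in which every (rule, position) result is memoized, buying linear time in exchange for memory. A toy sketch for the grammar Sum <- Num ('+' Num)*:

    from functools import lru_cache

    # Toy packrat parser for the PEG:  Sum <- Num ('+' Num)*
    # Memoizing each (rule, position) result is what makes it packrat:
    # a rule runs at most once per input position.

    def parses(text):
        @lru_cache(maxsize=None)           # the memo table for Num
        def num(pos):
            end = pos
            while end < len(text) and text[end].isdigit():
                end += 1
            return end if end > pos else None    # new position, or failure

        @lru_cache(maxsize=None)           # ...and for Sum
        def sum_(pos):
            pos = num(pos)
            if pos is None:
                return None
            while pos < len(text) and text[pos] == "+":
                nxt = num(pos + 1)
                if nxt is None:
                    break                  # a PEG '*' stops when its body fails
                pos = nxt
            return pos

        return sum_(0) == len(text)

    print(parses("1+22+333"))    # True
    print(parses("1++2"))        # False: Sum stops before the second '+'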

In the 2013 Pearson edition I have, the index, while decently long, omits authors. This is in contrast to the 1986 Addison-Wesley book, “Compilers: Principles, Techniques, and Tools”.

Also omitted is an accumulated list of references for looking up authors; there is just what is in the end notes. The preface is gone, along with any description of what's changed between editions, although Amazon's advertising suggests some things. There are no acknowledgements in the newest edition. If you accumulate the end-of-chapter references, they come to maybe half of what was in the 1986 accumulated bibliography. The end-of-chapter references are so abbreviated that they don't cover the important papers. Only in a dozen or so references is the publication date after 1990, the bulk coming from 1960–1985.

Here is one example of a missing reference, probably among several. In the “Code Generation” section they mention the programming language ML (as in OCaml), which their pedagogical programming language is modeled after. ML is listed neither in the references nor in the index entries for the pages discussing characteristics of ML.

So what I'd like to see is a book of similar length covering all the things that I feel are missing from the dragon book. It would cover, or cover in more depth, JIT compilers, decompilers, transpilers, debuggers, object formats, bytecode VMs, packrat grammars, and the dynamic interpreters that are given short shrift there.

See also Jeffrey Kegler’s Parsing: a timeline -- V3.0