Saturday, October 7, 2017

What the Dragon Book Didn't Tell Us

Originally posted on Quora, but they've stopped allowing updates. The Quora post has some reader comments.

First I'll say that I like this book on Compilers. It is probably the single book that has changed the landscape of compiling technology the most.

Also, in the first edition there was a nice dragon on the cover, and the tail flowed around the spine over to the back. That green tail on the spine looked nice on a bookshelf of mostly technical books with boring covers and spines. Think: Knuth's Art of Computer Programming or Aho and Ullman's earlier book, The Theory of Parsing, Translation, and Compiling, to get the idea. Later editions didn't look so nice on the shelf.

Andrew Appel’s excellent book “Modern Compiler Implementation in Java: Basic Techniques” gets honorable mention for having, I guess, the mid-section of a tiger on the spine, but that is not as easily recognizable on the shelf as a dragon’s tail. It also has a nice pun in the epigram for the chapter on method lookups in object-oriented languages. (The dragon book has nothing on how to handle method lookups in OO languages.)

But there is a lot that I have learned about compilers and compiling technology in the 40 years since that book came out that just isn't in there. The last U.S. edition of this book came out around 2007, so I wondered if it has some of the things I've come to learn. At over $100–160 for a hardback or even a Kindle edition, that's a bit steep for a book more than a decade old. So I asked a computer scientist in the Maldives for his copy, which can be had for about $15.

The book arrived the other day and I was eager to see what's been updated. None of the topics I had hoped would be covered are in there. So lemme describe what I think should be in there.

In my case, I make lots of mistakes. So I have needed and have used debuggers a lot. And I've written quite a few of them, even for some obscure things like GNU Make, bash, or zsh. (But also for more conventional languages like Python and Go.)

More discussion of debuggers, or just of the run-time system and environment support for debuggers, would be nice. The runtime system for dynamic languages versus statically compiled languages is a bit different. And just-in-time compilers build on, but work a bit differently from, both interpreters and compilers. None of this is mentioned.

I’ve needed to beef up the run-time support in many of the dynamic programming languages that I’ve worked on to create or extend a debugger. For example, between bash 2.05b and bash 3, changes were made based on patches that I submitted. Those changes made it possible to write a debugger.

Such information would include properties of stack frames or code objects that assist stack backtraces, execution tracing, execution profiling, and debugging. Or the different ways one could implement breakpoints; the different ways of storing program-location information. Or the various kinds of debugger APIs: remote out-of-process debugging as in Java, nodejs/V8, or PHP, versus the in-process callback-hook debugging found in many dynamic languages.
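To make the callback-hook style concrete, here is a minimal sketch in Python, whose sys.settrace() is exactly this kind of API. The breakpoint table is hypothetical; a real debugger would drop into a command loop when it hits.

    import sys

    # Hypothetical breakpoint table: (filename, line-number) pairs.
    BREAKPOINTS = {("example.py", 3)}

    def trace(frame, event, arg):
        # The interpreter calls this on "call", "line", "return", ... events.
        if event == "line":
            loc = (frame.f_code.co_filename, frame.f_lineno)
            if loc in BREAKPOINTS:
                print("breakpoint hit at %s:%d in %s"
                      % (loc[0], loc[1], frame.f_code.co_name))
        return trace  # keep tracing inside this frame

    sys.settrace(trace)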

Or issues around debugging optimized code. One simple idea that I would like to see explored is the optimizer treating each transformation as a transaction in the database sense, so that it would just log what transformation took place. That might make unravelling the result easier for the hapless programmer who has to unscramble things.
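A sketch of what I have in mind, with made-up names; the point is only that each rewrite is committed to a log keyed by source location, so a debugger can replay what happened to a given line:

    from dataclasses import dataclass, field

    @dataclass
    class Transformation:
        name: str              # e.g. "constant-fold", "inline", "dead-code"
        before: str            # IR before the rewrite
        after: str             # IR after the rewrite
        source_lines: tuple    # source lines the rewrite touched

    @dataclass
    class TransformationLog:
        entries: list = field(default_factory=list)

        def commit(self, t: Transformation):
            self.entries.append(t)

        def explain(self, line: int):
            # Replay, in order, every rewrite that touched this source line.
            return [t for t in self.entries if line in t.source_lines]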

In an area similar to debugging and run time, there are object formats and what goes into an object format. Or what goes on in a binder or linker. When I look in the index for object formats like COFF, ELF, EXE, OMF, or even the old Unix a.out, there is nothing. Likewise for the associated symbol-table sections like DWARF or stabs. John R. Levine’s book “Linkers and Loaders”, while 17 years old, is still very relevant. (And it too has a cool and appropriate cover.)
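As a taste of the kind of detail I mean, here are a few lines of Python that peek at an ELF file's identification bytes; the field layout comes from the ELF spec:

    def elf_class(path):
        with open(path, "rb") as f:
            ident = f.read(16)                # the e_ident header bytes
        if ident[:4] != b"\x7fELF":
            raise ValueError("not an ELF file")
        return {1: "32-bit", 2: "64-bit"}[ident[4]]   # the EI_CLASS byte

    print(elf_class("/bin/ls"))               # "64-bit" on most Linux boxes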

And there is the corresponding thing to linking and loading that goes on in dynamic languages: importing, requiring, or loading, and method-lookup strategies.
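Python's method-lookup strategy, for instance, is the C3 linearization, and every class carries its computed method resolution order:

    class A: pass
    class B(A): pass
    class C(A): pass
    class D(B, C): pass

    # A lookup on a D instance walks this list left to right:
    print([k.__name__ for k in D.__mro__])   # ['D', 'B', 'C', 'A', 'object']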

In working with the Go interpreter, I became aware of the "boxed-value" memory layout it uses. And "thunks". Those are not in the index either. The newer book “Engineering a Compiler” by Keith Cooper and Linda Torczon does talk about the memory shape of datatypes, and that is a welcome addition to compiler books.
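Roughly, in Python terms (every CPython object is in effect boxed already; the thunk here is just a memoized zero-argument closure, a sketch rather than how any particular runtime does it):

    class Box:
        # A boxed value: a header carrying a type tag plus the payload,
        # reached through a pointer, rather than a raw machine word.
        def __init__(self, typ, payload):
            self.typ = typ
            self.payload = payload

    def make_thunk(f, *args):
        # Delay f(*args) until forced; cache the result after the first call.
        result = []
        def force():
            if not result:
                result.append(f(*args))
            return result[0]
        return force

    t = make_thunk(sum, range(10**6))   # nothing computed yet
    print(t())                          # the computation happens here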

Recently I've been interested in using decompilers as a way of communicating the exact location in a program when an error occurs or when inside a debugger. Since Python and Ruby do very little in the way of “compiling”, reversing the process is very doable. And in such cases you get freakishly exact results. (Possibly not so surprising in Python, where they try to maintain that there is just one right “Pythonic” way to do things.)
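You can see how little gets lost using the dis module from the standard library; the bytecode reads almost like the source (exact opcode names vary by Python version):

    import dis

    def f(x):
        return x + 1

    dis.dis(f)
    #   LOAD_FAST     x
    #   LOAD_CONST    1
    #   BINARY_ADD        (BINARY_OP on 3.11 and later)
    #   RETURN_VALUE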

But many decompilers don't handle this as a kind of syntax-directed compilation pipeline (for reasons I need to describe elsewhere), so the implementation of the decompiler code is very hacky. If guidance were given the way it is in the dragon book for a compiler pipeline, implementations might be less hacky, have more rigor, and be more robust.

There are some small but important differences between a compiler and a decompiler. Namely:

  • you don't start out with program text but with instructions of some sort,
  • you can assume that what you are given is correct, which means…
  • you don't have to understand everything completely; for example, you can ignore bookkeeping instructions (popping a stack value to make it match a join with another branch),
  • you can assume type checking is correct.

I could elaborate on each of the points above, but understanding the differences helps in adapting standard compiling theory.

This is true, for example, with respect to parsing, where what I've found most useful is the Jay Earley parser. It handles ambiguous grammars, which would generally be considered a bad kind of grammar to create for a programming language. But in this context it is the right thing to do. It's like translating from one human language to another: there can be several valid ways to do it.

For flexibility and ease of use, then, you want the Earley parser. The dragon book mentions Earley only in passing in the last chapter’s end notes, and takes a dismissive view of its efficiency. Wikipedia says that for most context-free grammars (or almost-context-free grammars, which is the case here), it is linear if you prefer left recursion over right recursion, except in pathological cases. And in my experience this is true. Furthermore, the pathological cases can often be addressed once you know where they are. Given its extreme flexibility, like adding rules at run time based on instructions seen, or the ability to run checks when a grammar reduction occurs, I wouldn't consider using anything else at the moment.
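If you want to play with this, the lark library for Python uses Earley by default and, if I have its API right, can hand back every parse of an ambiguous grammar. The toy grammar here is mine:

    from lark import Lark

    grammar = """
        e: e "+" e        // deliberately ambiguous: both groupings are valid
         | NUMBER
        %import common.NUMBER
    """
    parser = Lark(grammar, start="e", ambiguity="explicit")
    print(parser.parse("1+2+3").pretty())   # an _ambig node holds both parses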

Speaking of grammars, missing is the whole topic of how to debug a grammar, and common test practices, e.g. round-tripping, bootstrap testing, and running tests over an entire standard library. In the tool I use, I've added grammar-coverage profiling to give an understanding of which grammar rules are extraneous, and of whether my test suite is adequate.
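Round-tripping, for a decompiler, can be as simple as: compile, decompile, recompile, compare. A sketch, where decompile() is a hypothetical stand-in for whatever tool is under test:

    import dis, io

    def listing(source):
        out = io.StringIO()
        dis.dis(compile(source, "<test>", "exec"), file=out)
        return out.getvalue()

    def round_trips(source, decompile):
        # decompile() takes a code object and returns equivalent source text.
        recovered = decompile(compile(source, "<test>", "exec"))
        return listing(source) == listing(recovered)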

There is a curse that comes with any book being successful and good at what it does: I suspect it reduces the incentive to research compiler topics that broaden and change things slightly. My decompiler example is one such case.

So we are stuck with a book that, while over 950 pages long, is still very abbreviated (or totally lacking) for at least my use cases where compiling technology has come in handy. I wonder to what extent the material in there has stifled research and innovation in other areas, because compilers were considered a done deal. I mentioned the short shrift given the Earley parser and its pessimistic worst-case time complexity. I knew about this, and so it put me off when the first decompiler I ran into used it. I had the same feeling when I first learned about register allocation by graph coloring, because the first edition of the dragon book talked only about Sethi-Ullman stack-based register allocation. I am pleased to say that that omission has been corrected in later editions.

And speaking of parsers, what about Parsing Expression Grammars, or PEGs, and the packrat parsers that implement them? And for that there's something called the omega trick. PEG is not in the index either.
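The packrat idea in a nutshell, as I understand it: a recursive-descent PEG parser whose rule functions are memoized on input position, which is what makes it linear. A toy rule over a toy input:

    from functools import lru_cache

    TEXT = "aaab"

    @lru_cache(maxsize=None)       # the memoization is the "packrat" part
    def parse_as(pos):
        # PEG rule  As <- "a" As / ""  ; returns the position after the match.
        if pos < len(TEXT) and TEXT[pos] == "a":
            return parse_as(pos + 1)
        return pos

    print(parse_as(0))             # 3: matched "aaa", stopped at the "b"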

In the 2013 Pearson edition I have, the index, while decently long, omits authors. This is in contrast to the 1986 Addison-Wesley book, “Compilers: Principles, Techniques, and Tools”.

Also omitted is an accumulated list of references for looking up authors; there is just what is in the end notes. The preface is gone, along with any description of what's changed between the editions, although Amazon’s advertising suggests some things. There are no acknowledgements in the newest edition. If you accumulate the end-of-chapter references, they come to maybe half of what was in the 1986 accumulated bibliography. The end-of-chapter references are so abbreviated that they don’t cover the important papers. Only in a dozen or so references is the publication date after 1990, the bulk coming from 1960–1985.

Here is one example of a missing reference, probably among several. In the “Code Generation” section they mention the programming language ML (as in OCaml), which their pedagogical programming language is modeled after. ML is neither listed in the references nor indexed to the pages that talk about its characteristics.

So what I'd like to see is a book of similar length covering all the things that I feel are missing in the dragon book. This book would cover, or cover in more depth, JIT compilers, decompilers, transpilers, debuggers, object formats, bytecode VMs, packrat grammars, and the dynamic interpreters that are given short shrift in the dragon book.

See also Jeffrey Kegler’s Parsing: a timeline -- V3.0


Saturday, February 4, 2017

Name that interval: Brahms 1st Symphony Horn solo

I have an obsession with pitch and timbres which brings me to this ramble.

We start with the Big Ben chimes:

Compare with this horn solo part of the 4th movement of the Brahms 1st symphony:

It is similar.

But the part I want to focus on is the interval between the 8th and 9th notes, which is not in the clock chimes. In the Brahms symphony it is between a B natural and a Bb. By the way, when you listen to the solo flute repeat the theme, you can hear a little pitch glitch between the 5th and 6th notes (a D and E), and you can see what's going on when the fingers go between 011-000 and 110-110. I am always fascinated by such little glitches that give a little character to the sound.

But now I get to the part that fascinated me. I had heard that Brahms's horn parts were written not for the modern valve horns you see in the YouTube clip, but for a hand horn. And that instrument doesn't play that interval as perfectly.

In my mind I had it that this interval was slightly less than the equal-temperament note. When I mentioned it to an eminent musicologist, he said: well, you know, in the harmonic series, you have it wrong; it should be slightly greater than equal temperament.

Brahms wrote Clara Schumann in a letter that he got this theme from listening to alpine horns while on vacation in the Alps. So I went to check out what the alpine horn does. See

and he is right! It is slightly flatter than equal temperament. If you map this to the natural overtone sequence, these are harmonics 12, 11, and 9 (G, F#, D). And even more indisputably, listen to this, 7 seconds in:

and even played with a C crook, 4 minutes in:

you will hear it less flat, but not sharp as it was in my memory. So what gives with my bad memory? Is it just more senility?

Possibly. It could be that I was just hearing it as the intervals between F#, E, and C. Or, more likely, translating the slightly-less-than-half-step between C and B on the natural horn when the B is muted with the hand. With hand adjustments, the French horn no longer strictly follows the harmonic series.

Or just wrong. I’ve updated my memory on this though.
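For the record, here's a quick back-of-the-envelope check, in cents (1200 times the log base 2 of the frequency ratio), of where those harmonics sit; the arithmetic is mine:

    from math import log2

    def cents(ratio):
        return 1200 * log2(ratio)

    # Harmonic 11 sits above harmonic 8 (the octave) by:
    print(round(cents(11 / 8)))    # 551 cents; equal-tempered F# is 600,
                                   # so the 11th harmonic is ~49 cents flat
    # And the G-to-"F#" step between harmonics 12 and 11:
    print(round(cents(12 / 11)))   # 151 cents, half again a semitone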

Python in 2016 feels like C of 1996

In 2016 I use Python a lot. I've been using it for over 10 years or so. But even with that much use it still feels clunky. It reminds me of my experiences with C, which go back even longer.

Let me explain.

There are people who feel that Python has revolutionized the ease with which they work. xkcd had a comic about it: xkcd: Python. I guess the pythonistas liked it so much that since Python 2.7 (and PyPy 2.6) it has been in the standard library distribution: you can get to it via the antigravity module, added after the comic came out.

And yes, many find Python an improvement over earlier languages, much the same way people felt about C. I can see how they feel that way. It is true that there is a great body of libraries in the language, as with C. Python has about everything.

Except elegance, and orthogonality.

But before diving into that, let me come back to C circa 1996. As I said, that's exactly how people felt, and I suppose still feel, about C. For example, what is the most-used version of Python written in? C, of course. In fact, that version of Python is often called CPython.

Back in 1996, people were ranting about C, I guess rightfully, because it was like the Fortran of systems programming. So I guess in a sense C around 1989 or so was the Fortran of 1959. And I'll go into the Fortran comparison just a little bit as well.

When I first started working at IBM Research I worked with a guy, Dick Goldberg, who worked on the original Fortran project. Back then I felt Fortran was a little clunky too: there was no BNF language definition. (Note that the B in BNF stands for Backus, the guy who was also behind Fortran. Fortran came before BNF, and this aspect was later addressed in languages generally, if not in Fortran itself.)

And it had these weird rules: you couldn't index by arbitrary expressions like x[i*j], but you could do some limited forms of that, like x[2*i], or is it x[i*2]? The reason was that back then they translated this kind of thing into IBM 704 assembler, which allowed a memory index (a multiplication) along with a memory offset adjustment (an add) as a single instruction, and the Fortran compiler could turn some of these subscripts into that form, although it couldn't even re-associate 2*i to i*2 if that was needed.

But the big deal, as Dick explained to me, is that scientific programmers and numerical analysts didn't have to learn assembly language, because Fortran did a good enough job, on average, to obviate that need for the most part.

And that brings us back to C. Fortran is not suitable for writing operating systems or systems code, and so around 1990 many operating systems were written in assembly language. Microsoft's OSes were. IBM's were as well.

But with Unix, then Minix, and then Linux, you could write the OS in a higher-level language, the same as you could in Fortran for numerical or scientific software.

And libraries were written, and to this day the bulk of complex libraries are still written in C: for regular expressions, for XML parsing, for database systems like MySQL and Postgres, and a lot of systems code. Even the guts of numpy, the numerical package for Python, have a lot of C code at the core, and that is very common. You'll find other popular programming languages like Ruby or Perl are written in C, and their numerical packages fall back to C as well, as do the bindings that interface to those C systems.

Inelegance

Ok. So now let’s get back to inelegance in Python.

In Python, some functions are top-level functions, like len(), hasattr(), type(), getattr(), str(), repr() and so on, and some functions are methods on an object instance, like the .join() or .append() methods of string and list objects, even though lists and strings are built-in types. So instead of writing uniformly o.join('a').split('x').len().str() or str(len(split('x', join(o)))), you have to go back and forth depending on which arbitrary way it was decided for each function, e.g. str(len('o'.join('a').split('x'))).
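A small illustration of the back-and-forth:

    words = ["a", "b", "c"]
    print(len(words))          # length is spelled as a top-level function...
    print("-".join(words))     # ...but joining is spelled as a method
    print(words.__len__())     # the method that len() calls under the hood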

This is like having to learn, for arithmetic expressions, which operators are infix, postfix, or prefix, and what the precedence is when there aren't parentheses. And yes, Python follows in that tradition, like many programming languages, C included. This is in contrast to the Polish prefix notation of Lisp, or the postfix of the old HP calculators, Forth, and Adobe's PostScript.

I'm not suggesting languages change it. But I am saying it is more cumbersome. And when you extend that notion, it hurts my little brain.

C was notorious for extending that into "address of" (&) and "indirection" (*), pre- and post-increment (e.g. ++), bitwise operators (|), logical operators (&&), indexing ([]), and selection (. and ->). Keeping the precedence in mind was so difficult that most people just used parentheses even when not strictly needed.

I do need to back off the comment about top-level versus method functions in Python. But only a little...

"len" for example is a method off of some type like string or list. But it has these ugly double underscores around it, I think that was added to deter people from writing it in the more natural or uniform way like you would in say javascript.

Python does stuff like this: take something that would be usable, but muck it up to make it less usable. And there is a community of Python Nazis out there who will scream at you if you use x.__len__(), however ugly it is, instead of len(x).

These are the same kinds of people who insist that you should write:

import os

import sys

rather than the shorter:

import os, sys

When you ask why, they say: just because. Well, actually they don't say it that way; they refer you to an acronym and number like PEP 8, which is taken as law. Everyone has to do it that way. Fascism.

The Fascists say that doing things this way makes it easier to understand if people do things the same way. Really? Mankind has long been able to deal with variation in expression without any effort whatsoever. I can write "It is one thing" or "It is a thing" and most people don't obsess over whether it has to be "one" or "a". Actually, in English there is a certain style difference: if I want to emphasize the singleness I might choose "one" over "a".

And so I'd argue that that kind of nice subtlety is missing from the Python fascists. For my part, I'd prefer to stipulate that every program imports "os" and "sys", be done with that, and get on with the rest of my life and the more important task of programming. The next best thing is just to string along that preamble crap in one line, the way Jewish blessings always start "Baruch atoy adenoy eluhanu" ("Blessed is God, King of the universe").

Oh, and by the way, the "import" statements have to be at the top (except that sometimes there are situations where they won’t work if they are at the top). That seems so 1980s-Pascal to me: Pascal required all constants, types, and variables to be put at the top of the program. But in Pascal it was not the same kind of fascism; the Pascal grammar was just limited that way.

And while on the topic of Pascal, let me mention another related aspect. Originally, and to the end of Python 2, "print" was a reserved word, very similar to Pascal's "writeln". And because it was a reserved word, you didn't put parentheses around the arguments to print. I'm sure this was thought clever in the same way it was thought clever in Pascal. When Python 3 came about, that changed, so that print is now a more regular and uniform function call. Oh, but by then a decade-or-so-old code base had to change.

Now you might think: who knew? Well, although Pascal had that feature (and most other languages, including C, just didn't), by the time Pascal's successor, Modula, was developed, the mistake had been corrected. And here's the point: this was all done before, or contemporaneous with, Python's start.

It's one thing to make a mistake and correct it. It's another thing to make the same mistake that was just fixed in another language, and then spend several years before fixing it the same way everyone else did. Pascal and Modula were developed in Europe, same as Python, so there's really no excuse not to have known about it.

Stubbornness and Doing things differently

So why is it that such things take so long to address? The Python language has been stubborn and unrelenting in its wrongness. I think Guido now agrees that the indentation thing was a mistake. However, there was a long period of time when its superiority was asserted. To me, the mistake is not so much using indentation for a block begin, but the lack of even an optional "end" terminator.

But because of Python's indentation, Python programs reproduced in print are subject to error, since there isn’t the redundancy needed to check the indentation. And more seriously, you can't embed Python in a templating language, because the templating can mess up the fragile indenting. So if you are doing, say, web development in Django, you need to learn another language, which of course doesn't have that indentation rule. Ruby programmers don't suffer that limitation, and their templating systems use Ruby.

It also feels to me like stubbornness and arrogance that Python has long resisted using the common idioms of other languages. (By the way, I sense that Dennis Ritchie and Guido van Rossum are a little similar here.) So the backtick notation of shell, Perl, and Ruby is not in Python. Nor, for a long time, was there simple variable interpolation. (In Ruby that's done via # rather than $, but the change was for technical reasons, not arrogance or a desire to do things differently.)

In Python, how to do subprocesses has changed over the years, and still isn't as simple as backtick. But I suppose that has some cover, because there are security considerations. Variable interpolation inside a string was also long resisted, although there was something like the C format specifiers inside a string with the '%'. But once Python added its own flavor of variable interpolation using the .format() method, the C-style format specifiers were eschewed.
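The drift, side by side; all of these are real, current Python:

    import subprocess

    out = subprocess.check_output(["ls", "-l"])   # today's spelling of `ls -l`

    name = "world"
    print("hello %s" % name)           # the old C-style specifier
    print("hello {}".format(name))     # the .format() flavor
    print(f"hello {name}")             # f-strings, added later still (3.6)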

This kind of being the last on the block to give in to something common among other languages, and then adding it in a different way, seems to be the Python way. The ternary if/else operator that C introduced and other languages adopted is another example.
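For instance, C's cond ? a : b comes out reordered in Python:

    x = 10
    # C would write:  x % 2 == 0 ? "even" : "odd"
    parity = "even" if x % 2 == 0 else "odd"
    print(parity)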

For those who want to work across multiple languages, this kind of awkward originality just adds more warts.

Like Python, C when it first came out had a number of unfamiliar things: all those post and pre operators, I guess the format-specifier thing, variable arguments, the C preprocessor with its macro expansion. (Many of these weren’t strictly unique: you could find examples of these things elsewhere, such as in assemblers; but compared to, say, the high-level ALGOL-like languages and things like Cobol or Fortran, this was different.) In C's case, though, they largely stuck with this and didn't change course. When Go was developed, the mistakes of C were just fixed, and so I feel Go is a very welcome upgrade to C.

Misc other things: comprehensions (not needed in better-designed languages), language drift, import statements.