Writing Compilers Quickly

tptacek · on Feb 5, 2010

Could not disagree more about lexer generators. Every device we support comes with 1-2 parsers, and we started out thinking that Ruby's built-in regex support made flex unnecessary. Bzzzt. Lexers are fiddley whether you write them in Flex or Ruby, but the Ruby version is harder to write.

We've settled on Ragel for lexers; Craig wrote "ralex", which I don't use or understand but which he insists we're about to open source. So I recommend Ragel. Or Flex. Or anything but just hand-hacking your own tokenizer.

barrkel · on Feb 6, 2010

I don't agree, for a simple reason - most programming language lexical analysis is extremely simple and a lexer can be hand-written in less than an hour. Here's the outline of a NextToken routine (or whatever you want to call it) that does the lexical analysis:

* Skip whitespace and comments, keeping track of newlines as needed

* Check for end of file

* Determine token type from first character (typically a switch statement or equivalent)

* Scan in remainder of token's characters (depending on token type - and possibly more discrimination for common prefixes, such as 10 vs 10.25, ints vs floats)

* Do any conversions, such as string -> numbers as necessary

* Look up identifiers for keywords

A simple hand-written scanner for a language like C (excluding the preprocessor etc.) written in C itself shouldn't exceed a couple of hundred lines or so. If you're willing to sacrifice some readability and performance, I guess you could squish it down close to 60.

With languages that need more complicated state transition mechanics than C or similar, using a generator may make sense. But fully general regexes are ill-suited for creating lexical analysers. With lexical analysis, you're looking to classify a pattern of characters from the head of a list; language and library regex support will generally only tell you if a particular string matches / starts with a regex, or lets you search a string for the regex. A lexer effectively wants to annotate multiple acceptance states in the FSA, and return different values depending on which acceptance state was chosen. Regexes from languages and libraries generally only pop out with a boolean, having merged the acceptance states together in the DFA (never mind non-regular regular expressions, such as those with backrefs etc.). So using regexes for writing a lexer will be inefficient and won't give you hints when the language defined by one regex is actually a subset of the language defined by another.

camccann · on Feb 6, 2010

As an addendum: In some languages (who will remain unnamed) lexical analysis is actually a pain due to funny whitespace rules or syntax-dependent tokenizing. For instance, something like:

* whitespace around operators is optional, except that sometimes a symbol adjacent to a variable means something completely different

* "foo" is a keyword in type signatures, but a normal identifier otherwise

* indentation is meaningful, and must generate INDENT or DEDENT tokens depending on the indentation of some arbitrary number of prior lines

* indentation is meaningful based on whether the first non-whitespace token on a line is aligned in the same column as a specific token in the prior line, based on parsed syntax, oh god what were these people smoking

...and so on. It's easy to get into situations where merely lexing requires keeping some sort of context, requires interleaving with parsing such that parse errors may cause the lexer to backtrack and reinterpret things, or possibly even worse.

In fact, I wouldn't be surprised if it were possible to construct an especially pathological bit of perl that couldn't be lexically analyzed without running the program.

chromatic · on Feb 6, 2010

> I wouldn't be surprised if it were possible to construct an especially pathological bit of perl that couldn't be lexically analyzed without running the program.

Yes and no. It's trivial to construct a pathological Perl 5 program which has indeterminate parsing without running... but in those cases where it's possible, you have only one of two or three possible interpretations for a specific sequence of tokens.

As well, no one would write code that indeterminate by accident. It's always malicious.

statictype · on Feb 6, 2010

>* "foo" is a keyword in type signatures, but a normal identifier otherwise

Isn't this decision something that should be punted out of the tokenizer and into the parser?

camccann · on Feb 6, 2010

You can punt as much into the parsing stage as you like, of course. It's not like there's even any requirement that the two stages be separated.

All of the stuff I mentioned gives you the choice of either more work during parsing (which is usually hard enough already), or building a crazy Rube Goldberg-esque tokenizer--neither of which is appealing.

tptacek · on Feb 6, 2010

I'd rather complicate a lexer a little bit than debug parser conflicts.

pkaler · on Feb 6, 2010

Mostly, agreed. Here is the lexer for Lua: http://www.lua.org/source/5.1/llex.c.html

It's 382 lines of ANSI C according to sloccount. Reading the array at the top of the file, there are about 30 tokens in Lua. So, you are probably going to write at least 30 test cases.

Writing that in less than an hour is being generous. But, it should be doable in a day.

ssp · on Feb 6, 2010

Here is the complete scanner for a C-ish language I wrote a compiler for a while back:

http://www.daimi.au.dk/~sandmann/scanner.c

It's 387 lines of C, and if you wanted to prove a point, you could get the line-count down by compacting the table and removing braces etc.

Note that it doesn't make any C specific assumptions. Each type of token is matched by its own function, and then in the end the longest match is chosen. This is exactly what lex does - with two differences: (a) my scanner is slower, and (b) you have the full recognition power of C instead of just regular expressions. These days (b) outweighs (a) so much that it's just not funny.

I'd agree that flex/lex are just not worth it anymore.

nostrademons · on Feb 6, 2010

And here's the lexer for a Python-ish language I'm working on:

http://github.com/nostrademons/eve-language/blob/master/boot...

That's 52 lines of rules. 68 lines of boilerplate code that was copied verbatim out of the Alex manual and changed a bit to work in a different monad. About 70 lines of code to add the indent/dedent/newline tokens for whitespace sensitivity. Note that that's with whitespace rules that are more permissive than just about any language out there (you can break lines without a continuation character between any pair of matched delimiters, as well as before or after any infix operator).

Yes, hand-writing a lexer is easy. Writing one with a lexer generator is even easier (and would be easier still if Alex had an easier API to work with).

wglb · on Feb 6, 2010

I would agree with you on this. However, the opinion on this seems to be somewhat varied: in Coders at Work, Ken Thompson says I love yacc. I just love yacc. It just does exactly what you want done. Its complement, lex, is horrible. It does nothing you want done. On the other hand, Dave Conroy says YACC makes the hard part harder and the easy part easier.

Personally, I like lexers, but would rather move the language definition to one where recursive descent parsing works.

DavidMcLaughlin · on Feb 6, 2010

> Or anything but just hand-hacking your own tokenizer.

I've only just got into parsing and I know for sure that there are more challenging problems, but for a dynamic language like Javascript there is a pretty major example of where this worked. JS Lint: http://www.jslint.com/fulljslint.js

1) Split source into lines

2) Run each line through a regex: /^\s([(){}\[.,:;'"~\?\]#@]|==?=?|\/(\(jslint|members?|global)?|=|\/)?|\[\/=]?|\+[+=]?|-[\-=]?|%=?|&[&=]?|\|[|=]?|>>?>?=?|<([\/=!]|\!(\[|--)?|<=?)?|\^=?|\!=?=?|[a-zA-Z_$][a-zA-Z0-9_$]|[0-9]+([xX][0-9a-fA-F]+|\.[0-9]*)?([eE][+\-]?[0-9]+)?)/

3) Look at the resulting token character by character and define what kind of token it is.

Once it has these parsed tokens JS Lint then uses a Pratt Parser to check syntax and style, but you could easily use the same tokenizer and build a parse tree as part of a translator or compiler instead.

hga · on Feb 5, 2010

Read the linked http://en.wikipedia.org/wiki/Monkey_patch if you want to find out what monkey patching and duck punching are ^_^.

pmjordan · on Feb 5, 2010

I'm wondering if he's tried to do real work with ANTLR and ANTLRWorks in particular. I found them extremely temperamental and fiddly to work with, but this was more than a year ago.

gchpaco · on Feb 6, 2010

They haven't gotten any less temperamental or brittle. The default tokens have a rude tendency to store the entire input stream as it is being parsed, which was a really rude surprise for us (had to rip it out and do LL(1) by hand, with minimal memory demands). I like a lot of features of ANTLR, but it is not the sort of tool I like to use and abuse.

kingkongreveng_ · on Feb 5, 2010

A perl programmer talking about compilers in 2010 doesn't mention Parrot? Parrot has to be the easiest and fastest way to get a compiler up and running these days.

tptacek · on Feb 5, 2010

He's not targeting a VM or a runtime, is he? His compiler looks more like a code generator, a la yacc.

kingkongreveng_ · on Feb 5, 2010

Well the CaPerl project he mentions just translates perl5 with his special keywords to standard perl5 code, I think. Who knows what he's doing here.

tptacek · on Feb 5, 2010

He's writing yacc for crypto protocols. You write an algorithm in his Stupid language, you and others verify it Stupidly, and then compile it to your real language.

statictype · on Feb 6, 2010

So the language's compiler does analysis of the algorithm for common crypto flaws? (like buffer-overruns and timeable iterations ...)?

tptacek · on Feb 6, 2010

Nope. It does nothing at all except to standardize the notation.

marshallp · on Feb 6, 2010

use lisp macros or prolog dcg's