Parsers for network protocols (24 Apr 2005)
(If you wish, you can see this post as being related to the previous two. This is all about automatically building state machines, which happens to be very useful in Actor systems. If you can make all your actors stack-free then you can save memory, schedule them across worker threads etc. But, it also stands alone.)
Parser theory is well established. People usually reach for the parser generator when it comes to handling complex files. (Possibly not quite as often as they should if the state of some config parsers is anything to go by.) Parsers are understood and work.
Why then does no one use them for parsing network protocols? It's not like network protocols are too simple. Take section nine of the IMAP RFC. That's complex. Why does anyone want to write a parser for that when the EBNF is provided?
Did you know that the following is valid HTTP?:
GET / HTTP/1.1 Host: www.webserver.com
If you read the HTTP RFC and the EBNF definitions of LWS etc you can check it. It's the sort of thing that hand built parsers often miss. It doesn't work with whatever webserver /. is using for one.
These aren't simple protocols and, if you're coding in C/C++, chances are you'll screw it up in a buffer-overflowable or a seg-fault-killing-the-whole-process way. Parsers can do it better.
So why don't people use generated parsers? Probably because they've been designed for parsing files and all the toolkits are built around that idea:
- Common generated parsers don't allow partial results. They parse the whole thing and give you a big tree but you probably want information about what's happening as they happen.
- Parsers often generate foul right-recursive parse trees.
- Parsers often need a tokenising pre-processing step which doesn't work well for network protocols which have complex token rules and binary data mixed in with the text.
- Parsers often need lots of rule mangling before the grammar works.
Less importantly...
- Parser generators aren't that simple. Even an SLR parser (a simple one) takes 800 lines of Python in my implementation.
- They're slower. Probably not an issue with C code, but my pure Python parser does only 200 HTTP headers/second on a 450MHz PII.
I've addressed the first four problems something I'm hacking up at the moment. Firstly, you can tell when any reduction happens as soon as you feed the data into the parser. So you could ask for the HTTP headers as they happen.
To explain the second problem, consider SLR type rules:
Token := TokenChar Token := TokenChar Token
That parses one-or-more TokenChars into a single Token. But that leaves you with a parse tree like this for the input POST: ['P', ['O', ['S', ['T']]]]. The key to this are reduction functions which do something sensible with the parse tree as they're being generated. The above would be:
Token = OneOrMore(TokenChar, string_)
Where string_ is a special marker which says "It's a string, so give me something sensible in the parse tree" If nothing does what you need, you can define your own:
Token = OneOrMore(TokenChar) def Token_reduce(values): return ''.join(values)
(and that does exactly the same thing.) Also, special forms like OneOrMore save you from thinking (too much) about how to write the grammar. OneOrMore is a simple case but something like Alternation(a, b) (a sequence of a, separated by b with optional bs at the beginning and end with reduction functions which give you a flat list of them) isn't.
So that's the evangelising for now. People should use parser generators for network protocols because who the hell wants to write another damm HTTP parser by hand (which you'll probably screw up)?
More tricks
Parsing is good. But there are more state machines in network protocols than just parsing. Take SMTP:
220 imperialviolet.org ESMTP MAIL FROM: foo@foo.com 250 ok RCPT TO: bar@bar.net 250 ok DATA 354 go ahead Subject: testing Hello there! . 250 ok 1114338782 qp 5232 QUIT 221 imperialviolet.org
Here a valid conversation can't have a DATA after the RCPT TO if the RCPT TO failed. So you could have the parser working line-by-line and a higher level state machine tracking the valid commands at this point etc. (I would admit that a per-line generated parser for SMTP would be overkill.)
So let us introduce two new terminals: T and ⊥ which are «command successful» and «command failed». We can inject these into the parser when we're finished processing a command and then define a conversation something like this:
RCPTTO = RCPTTOCommand RCPTTO = RCPTTOCommand, ⊥ RCPTTO SMTPConversation := HELOCommand, T, MAILFROM, T, RCPTTO, T, DATA, T
(That's not RFC at all, but humor me.) So a client can have as many failed RCPT TO commands as they like, but can't send a DATA command until one has completed. Thus, if the parser parsed it, it's a valid command and you don't need to keep track of the state yourself.