Inside F#

Brian's thoughts on F# and .NET

Monadic parser combinators, part five

Posted by Brian on December 8, 2007

Last time we added error-handling to our parsers, so that when they fail, they yield good diagnostics.  Today, we’ll discuss three more useful features to add to the library.  First, we’ll add support for the new C# LINQ syntax, as I mentioned here.  This will make our code easier to read and write.  Second, we’ll make our parsers generic with respect to the input type, so that rather than parsing just strings of characters, we can parse an IEnumerable<Token> for an arbitrary Token type.  This makes it possible to attach a separate "scanner" stage to the front of our parsers that tokenizes the input.  Finally, we’ll add user-defined state to the parser.  This will allow us to remember arbitrary information about what has already been parsed (such as what variables have already been declared in a program), enabling some semantic checking.  With all these cool features, we’ll be able to author parsers for interesting languages by writing very little code.

Today’s goal

Let’s be more concrete about where we’re headed.  By the end of today, we’ll be able to parse a little expression language that comprises a number of assignment statements followed by a final expression to evaluate.  Since that’s today’s goal, I’ll refer to this little language as "Goal".  Here’s an example "Goal" program that yields the final value 3:

let x = 3;
let y = 2 * x;
let xSquared = pow(x,2);
  let  zero =  ( 611    - pow(2 ,pow(3  , 2) ) + 1   ) /   10  -5   * 2   ;
xSquared-y+zero

"Goal" deals with the same kind of arithmetic expressions as before, but now we have introduced assignment statements that use variable identifiers and the keyword "let".  The language now also handles arbitrary whitespace interspersed between the various tokens in the language.  A scanner for this language would recognize four kinds of tokens: keywords like "let", identifiers like "x", symbols like "+", and numbers like "611".

The semantics of this language are straightforward.  We evaluate mathematical expressions in the usual way.  The right-hand side of an assignment statement is an expression that can only reference variables defined before that statement.  Duplicate variable definitions are disallowed.  The final "statement" of the language is just an expression representing the final result of evaluating the program.  Our parser will evaluate programs in a single pass, either yielding a successful value (like 3 for the previous program) or else failing gracefully with an error message that diagnoses problems in syntax or semantics.  Here’s an example of a program with a semantic error:

let x = 3;
x+y

And here’s what our parser will report:

Failure: At line 2 column 3, variable "y" was not defined

Get the idea?  All right, let’s get to it!

Leveraging C#’s LINQ syntax sugars

As promised, the first thing we will do is leverage the new LINQ syntax sugar.  If you haven’t seen it, I’m talking about the stuff that lets you write code like this:

var strings = new[] { "one", "two", "three", "four" };
var firstCharsOfShortWords = from s in strings
                             where s.Length <= 4
                             select s[0];

These new syntax features were designed with the main intention of succinctly expressing queries on arbitrary data.  The language designers were clever and recognized that these query operations form a monad, and so they designed the sugars to work with any data type that supported functions with the right names and shapes.  As a result, we can add these extension functions to our parser type and then use the new syntax. 

(Aside: Bummer, I just noticed that the Windows Live Writer plugin I am using for code-snippets doesn’t know the new C# features, and so var, from, and select are not getting automagically highlighted in blue like they do inside Visual Studio.  But we’ll live.)

In order to use from and select, we just need to add this code to our library:

// LINQ syntax enablers
public static P<U, Token> Select<T, U, Token>(this P<T, Token> p, Func<T, U> selector)
{
    return p.Then(x => Return<U, Token>(selector(x)));
}
public static P<V, Token> SelectMany<T, U, V, Token>(this P<T, Token> p, 
                                  Func<T, P<U, Token>> selector, Func<T, U, V> projector)
{
    return p.Then(r1 => selector(r1).Then(r2 => Return<V, Token>(projector(r1, r2))));
}

Explaining those definitions would just cause us to stray into more explanation of monads, and I want to avoid that, so just take them at face value.  (One thing worth noting, though, is that these functions are built atop the existing library, just using Then and Return calls.  As a result, if this code was missing from the library, you could have added it yourself in user code, without needing to know anything about the underlying parser representation.)  Once you add those functions, you can start using the syntax sugar.  Specifically, code that used to appear as

p1.Then(x =>
p2.Then(y =>
p3.Then(z =>
Return<T>(f(x,y,z)))))

can now be written more elegantly as

from x in p1
from y in p2
from z in p3
select f(x,y,z)

It’s easiest to grok the new syntax with an example, so here’s some code from the final goal of today – this is the part of the "Goal" parser that parses and evaluates exponentiation – that is, stuff like "pow(e1,e2)":

pow = from p  in Keyword("pow")
      from o  in Symbol("(")
      from e1 in expression
      from x  in Symbol(",")
      from e2 in expression
      from c  in Symbol(")")
      select (int)Math.Pow(e1, e2);

As we’ll see in a little bit when we get to the code for the example parser, Keyword() and Symbol() return parsers that parse the corresponding tokens in the input, and expression parses and evaluates an expression.  So this bit of code describes a parser that parses a string like "pow(3,4)", binding each portion to a local variable name (so, for instance, e1 gets bound to the result of parsing the "3") and returning the result expressed on the last line.  That gives you an idea of how to "read" parser code that uses the from and select syntax.

Now that we have LINQ, let’s move on to generalizing our parsers for different kinds of input.

Scanners and parsing arbitrary token types

The parsers we looked at in my previous blog entries were all character-based.  That is, parsers operated on strings, and the "atomic" parsers Item and Sat were parsers whose results had type char.  This is a nice way to make a simple introduction to a parsing library using familiar data types, but in the real world, you’ll rarely see a character-based parser.  Instead, the input typically gets streamed through two stages.  The first stage is a scanner (sometimes called a lexer), which groups characters together into the logical tokens of the language.  The second stage is the actual parser, which then parses the stream of tokens that exits the scanner.

This two-stage process of scanning and parsing is motivated by a few different considerations.  First, most languages you will parse allow the significant tokens to be separated by optional, insignificant characters.  Whitespace and comments are the two main examples here.  The scanner can strip these out in the first stage so that the second stage doesn’t need to worry about them.  Second, the logic of scanners is usually simple, and scanners lend themselves very nicely to being represented as highly efficient state machines, which can be generated by tools like lex.  Finally, the scanner stage massages the input into a form that’s at a better thought/abstraction layer for parsing.  When you think about parsing text like

if x = 34 or s = "foo"

you are more likely to grok it as a series of 8 tokens (keywords, identifiers, operators, literals) rather than as a series of 22 characters.  Indeed, when you glance at the example text above, your consciousness probably doesn’t even see the individual quotation marks; rather, you just see the string foo.  Similarly, you don’t see the digit three followed by the digit four, you just see thirty-four.

To sum up, then, a separate scanner stage has a few benefits.  With a scanner, we can localize all the processing of insignificant characters (rather than having whitespace/comment processing throughout our parser), we can use tools to make this stage very efficient, and we end up parsing a stream of tokens that’s at a more appropriate layer of abstraction.

In order for our parser library to accommodate tokens, we’ll need to change the representation.  Recall that the parser representation type and top-level interface function had been defined like this:

// representation type for parsers
public delegate Consumed<T> P<T>(ParserState input);
// clients call p.Parse("input") to avoid implementation details like Consumed & ParserState
public static ParseResult<T> Parse<T>(this P<T> p, string toParse)
{
    return p(new ParserState(0, toParse)).ParseResult;
}

We’ll now change it to this:

public delegate Consumed<T, Token> P<T, Token>(ParserState<Token> input);
public static ParseResult<T, Token> Parse<T, Token>(this P<T, Token> p, 
                                                    IEnumerable<Token> toParse) { ... }

That is, the parser type P (and its associated helper representation types like Consumed and ParserState) now has an added generic parameter called Token, which is the type of tokens we are parsing.  Whereas previously the code assumed the tokens were always chars, now we allow any token type.  Rather than Parse() taking a string, it now takes a list of tokens (an IEnumerable<Token>).  This is a breaking change to all our code, as we have fundamentally altered the structure of the public type P.  Rather than walk through all the changes, I’ll just point out the highlights.

Most of the code itself is relatively unchanged, but for the extra generic parameter.  For instance, in the definition of the Many combinator, the only changes are that <T> becomes <T, Token> in four places in the code.  The vast majority of changes are similar to that.
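
To make that concrete, here is a sketch of what the generalized Many might look like if it were defined in terms of Many1, Or, and Return (the actual library definition may differ slightly, but the flavor of the change is the same – only the extra Token parameters are new):

// Many(p) parses zero or more occurrences of p: either one-or-more
// occurrences, or else the empty sequence
public static P<IEnumerable<T>, Token> Many<T, Token>(P<T, Token> p)
{
    return Many1(p).Or(Return<IEnumerable<T>, Token>(new T[0]));
}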

Most of the rest of the changes involve parsers that fundamentally deal with characters and don’t generalize to arbitrary tokens.  For example, the Letter and Nat parsers now explicitly call out the token type char in their type:

// Letter parses char tokens and returns a char
public static P<char, char> Letter = Sat((char c) => char.IsLetter(c)).Tag("letter");
// Nat parses char tokens and returns an int
public static P<int, char> Nat = ...

There are a few other minor changes to the library, but that is the gist of the representation change.

Once we have made these changes, another challenging issue surfaces.  Previously, errors were reported by position – the message "Failure: At position n, unexpected…" represented a failure at the nth character of the input.  However, with an arbitrary token type, it is not very useful to issue diagnostics to humans using language that points to the nth token.  The position still ought to be described in terms of the character input, and while we are at it, we should generalize the Position type to represent both line numbers and column numbers.  That generalization is straightforward, and we can simply have the definer of the Token type also provide a function to compute the Position of a given Token.
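
For reference, here is a minimal sketch of what the richer Position type might look like (the member names here are my own guess, not necessarily the real definition):

// a Position now carries both a line number and a column number
public class Position
{
    public readonly int Line;
    public readonly int Column;
    public Position(int line, int column) { Line = line; Column = column; }
    public override string ToString()
    {
        return string.Format("line {0} column {1}", Line, Column);
    }
}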

To sum up, once we have made the changes described in this section, the parser library now works on an arbitrary token type (which means we can attach it to a separate scanner that returns Tokens), and the library now has improved error-reporting, in that the location of errors is described by both a line number and a column number.

Adding state

The final new feature we need to add to the library is state.  During the course of parsing, we sometimes need to remember some things about what has already been parsed in order to parse the rest of the input correctly.  A typical example of a context-sensitive language that requires memory while parsing is XML – after parsing the string "<foo>", we parse some arbitrary amount of other stuff in the body of foo and then we eventually need to parse "</foo>".  That is, we need to remember the name of the opening tag to ensure it matches the closing tag name.  Another example is our "Goal" language, where we need to detect both duplicate definitions of already-defined variables as well as uses of not-yet-defined variables.

Though we could keep track of such state in the usual C# way – by newing up a mutable object on the heap and side-effecting that state object while we parse – it wouldn’t be very much in the spirit of the library to do things that way.  Furthermore, it becomes messy to deal with mutable state when the parser may be backtracking as it tries to find a legal parse.  Instead, we can keep track of the state in a side-effect-free manner in the parser monad.  Indeed, we have already seen how to do this; the parser representation code contains a type called ParserState that keeps track of the state of the parser.  This is an immutable object; each time we move forward in the parse, we create a new ParserState object with updated values (e.g. for the current line and column numbers).  It is easy to add another field to this type to store some user-defined state:

public class ParserState<Token>
{
    // arbitrary state object for clients to use
    public readonly object UserState;   
    // other fields as before...
    public readonly Position CurrentPosition;
    ...

Then we just need to expose some new functions for clients of the library to read and write that state.  And while we are at it, there is one other bit of state already inside the parser that is worth exposing to clients – the current position.  So we define these new core functions:

// return the user state
public static P<object, Token> GetUserState<Token>()
// set the user state, return previous value
public static P<object, Token> SetUserState<Token>(object newState)
// get current Position
public static P<Position, Token> GetCurrentPosition<Token>()

I’ve elided the definitions, but they’re all one-liners.  These functions all return parsers that don’t actually parse anything – that is, they don’t affect the input tokens.  Instead these functions provide access to state values that are threaded through the parser during a parse.  We’ll see examples of their utility shortly.
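
To give a flavor of how the user state gets used, here is a hypothetical sketch (not the actual "Goal" code, which comes later in this series) of a parser that records a newly defined variable, assuming the state object is a Dictionary<string, int> mapping variable names to values:

// record a variable definition in the user state; we copy the dictionary
// rather than mutate it, so backtracking remains safe
static P<int, Token> DefineVariable<Token>(string name, int value)
{
    return from state in GetUserState<Token>()
           from old   in SetUserState<Token>(
                             new Dictionary<string, int>((Dictionary<string, int>)state)
                             {
                                 { name, value }
                             })
           select value;
}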

Ok, now that we have the ability to separate scanning from parsing, a way to keep track of state during the parse, and the new LINQ syntax, we are ready to write code to parse the "Goal" language.

Scanning the "Goal" language

Though we could write a scanner using a separate tool, we might as well illustrate more uses of our parser library.  So we’ll use the parser library to create a scanner.  A scanner is, after all, just a parser whose input token type is char and whose result is IEnumerable<X> for some new token type X.

Since the language is called "Goal", we’ll define a type called GToken (‘G’ for "Goal") that defines the tokens of the language.  As mentioned when introducing the language at the start of this blog entry, there are four types of tokens: KeywordToken, IdentifierToken, NaturalToken, and SymbolToken.  These tokens represent keywords, identifiers, natural numbers, and the various operators and symbols of the language.  Each token has a Value field that stores the associated value; for example IdentifierToken has a string Value field that holds the name of a variable, and NaturalToken has an int Value field that holds the value of a natural number.  A GToken is one of these tokens, coupled with a Position describing the location in the input (e.g. line 3, column 4).  I won’t show the code definitions of these types because they are so straightforward (as always, I’ll post the full code at the end).
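
That said, here is a rough sketch of the shape of these types, just to make the scanner code below easier to follow (the exact definitions may differ; in particular, the common base class and its name are my own invention):

// hypothetical common base class for the four kinds of tokens
public abstract class GoalTokenValue { }
public class KeywordToken : GoalTokenValue
{
    public readonly string Value;
    public KeywordToken(string value) { Value = value; }
}
public class IdentifierToken : GoalTokenValue
{
    public readonly string Value;
    public IdentifierToken(string value) { Value = value; }
}
public class NaturalToken : GoalTokenValue
{
    public readonly int Value;
    public NaturalToken(int value) { Value = value; }
}
public class SymbolToken : GoalTokenValue
{
    public readonly string Value;
    public SymbolToken(string value) { Value = value; }
}
// a GToken couples one of the above with its Position in the input
public class GToken
{
    public readonly Position Position;
    public readonly GoalTokenValue Value;
    public GToken(Position position, GoalTokenValue value) { Position = position; Value = value; }
}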

Before we start, let’s define one more handy character parser:

// OptionalWhiteSpace consumes zero or more whitespace characters
public static P<IEnumerable<char>, char> OptionalWhiteSpace = Many(Sat<char>(char.IsWhiteSpace));

as well as a handy combinator:

// p1.FollowedBy(p2) matches p1 then p2, and returns p1's result (discards p2's result)
public static P<T, Token> FollowedBy<T, U, Token>(this P<T, Token> p1, P<U, Token> p2)
{
    return from result in p1
           from discard in p2
           select result;
}

Those go in the general library.

Let’s define the scanner for "Goal", starting with the ability to tokenize the symbol characters:

var symbols = new[] { "+", "-", "*", "/", "(", ")", "=", ";", "," }
              .Select(tok => from pos in GetCurrentPosition<char>()
                             from sym in Literal(tok).FollowedBy(OptionalWhiteSpace)
                             select new GToken(pos, new SymbolToken(sym)))
              .Aggregate(Or)
              .Tag("symbol");

Wow!  Ok, let’s break that code apart.  We start with an array of strings – all the symbols we want to scan.  Recall that Select() applies a function to every element in a list, returning a list of results.  So for each of those symbols, we make a parser that

  • gets the current input position (line & column number) and calls it pos
  • parses the literal string for this symbol (followed by optional whitespace) and calls it sym
  • returns a new GToken with location pos and token value new SymbolToken(sym)

So that yields a list of parsers, one for each symbol.  We then Or them all together using Aggregate, yielding a single parser that can tokenize any symbol.  Finally we Tag that resultant parser with a descriptive name that can be used when reporting errors about expected tokens.  Neat!
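
If the Aggregate call looks mysterious: folding a list of parsers with Or just nests the alternatives pairwise.  For three hypothetical symbol parsers plus, minus, and times, it builds the same parser you would get by writing

// equivalent to new[] { plus, minus, times }.Aggregate(Or)
var anySymbol = plus.Or(minus).Or(times);

that is, a parser that tries each alternative in turn until one succeeds.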

Next let’s deal with keywords.  One thing to be careful of with keywords is that keywords can be prefixes of identifiers.  That is, whereas "let" is a keyword, "lettuce" is a legal identifier, so we need to ensure that the character after an apparent keyword isn’t an identifier character.  Here’s the code:

var identifierChar = Letter.Or(Digit).Or(Literal('_')).Tag("identifier character");
var keywords = new[] { "pow", "let" }
               .Select(tok => 
                   Try(from pos in GetCurrentPosition<char>()
                       from key in Literal(tok)
                                   .NotFollowedBy(identifierChar,"identifier character")
                                   .FollowedBy(OptionalWhiteSpace)
                       select new GToken(pos, new KeywordToken(key))
                       ).Tag("keyword \"" + tok + "\"")
                   )
               .Aggregate(Or);

Once again, we use Select() to iterate over each keyword, and then eventually Aggregate() the results with Or.  To build a parser for each keyword token, we make a parser that will

  • get the current input position
  • parse the literal string for this keyword (ensuring it’s not just the prefix of some longer identifier)
  • return a GToken containing the right position and KeywordToken

and we

  • wrap the parser in a Try(), so that, for example, we can still recognize the identifier "potato" even though the keyword parser for "pow" consumes the first two characters before failing
  • Tag each individual keyword parser with a useful description (so as to enable error messages like "expected keyword "let"")

Great, we’re done with half our token types (two of the four) already.  Let’s do identifiers next:

var identifier =
        (from pos   in GetCurrentPosition<char>()
         from first in Letter
         from rest  in Many(identifierChar).FollowedBy(OptionalWhiteSpace)
         select new GToken(pos, new IdentifierToken(new string(Cons(first, rest).ToArray())))
         ).Tag("identifier");

This is getting to be old hat.  Identifiers must start with a letter, followed by any number of other identifier characters (letter/digit/underscore).  We turn the list of characters into a string and store it in the appropriate kind of token, along with the starting position.
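
Cons is presumably just a small helper that prepends a single element to a sequence; here is a minimal sketch in case you need to add it to your copy of the library:

// Cons(x, xs) yields x followed by all the elements of xs
public static IEnumerable<T> Cons<T>(T first, IEnumerable<T> rest)
{
    yield return first;
    foreach (var x in rest)
        yield return x;
}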

Finally, numbers:

var number = 
    (from pos   in GetCurrentPosition<char>()
     from chars in Many1(Digit.Tag("")).FollowedBy(OptionalWhiteSpace)
     select new GToken(pos, new NaturalToken(Nat.Parse(new string(chars.ToArray())).Result))
     ).Tag("natural number");

The same general pattern applies.  Note that we use our trusty parser Nat to turn the string of digits into the corresponding integer.
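
As a quick sanity check of that reuse (a hypothetical snippet – since a string is an IEnumerable<char>, we can run the character-level Nat parser directly over the digit characters):

int n = Nat.Parse("611").Result;   // n == 611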

Finally, the complete scanner is just

var anyLegalToken = symbols.Or(number).Or(keywords).Or(identifier).Tag("any legal token");
var scanner = from ws     in OptionalWhiteSpace
              from tokens in Many(anyLegalToken)
              from eof    in EndOfInput()
              select tokens;

Each individual token parser consumes trailing whitespace, so we need to suck up the initial optional whitespace, followed by a bunch of tokens, followed by the end of the input stream.  If you hover over the "var" in Visual Studio, you’ll see a tool-tip pop up that tells you that scanner has type

P<IEnumerable<GToken>, char>

which is exactly what we want.  We’re done with the scanner!
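
Before moving on, here is a hypothetical snippet showing the scanner in action (as in the number-scanning code above, the parsed value is available via ParseResult’s Result property):

// scan a small "Goal" fragment into its token stream
var goalProgram = "let x = 3;\nx + 1";
IEnumerable<GToken> tokens = scanner.Parse(goalProgram).Result;
// tokens is roughly: keyword "let", identifier "x", symbol "=", number 3,
//                    symbol ";", identifier "x", symbol "+", number 1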

More

There’s more to write, but the server doesn’t like long blog posts, so I am forced to break today’s entry in two.  Check out the rest in "part six".
