Inside F#

Brian's thoughts on F# and .NET

Monadic parser combinators, part six

Posted by Brian on December 8, 2007

The entire blog for today was too long for the server to upload all at once, so this is the continuation of the previous entry.  If you haven’t read that already, you should now, as you need it for context.

Parsing and evaluating the “Goal” language

I don’t know if you noticed, but ignoring the definitions of the various token classes (which are mostly boilerplate code), the scanner took only about 30 lines of code to define!  That’s pretty nice.  But then again, it is just the scanner.  Surely parsing and evaluating the language will be a lot more work, right?  Let’s find out.

Since “Goal” has variables that will be assigned integer values, we need a way to store those variable-value associations in the state of the parser.  A list of key-value pairs is a fine representation:

using VarEnv = IEnumerable<KeyValuePair<string,int>>;

At the start of a program we’ll store an empty list as the user state of the parser, to represent that at the start, no variables have yet been defined in the environment.  As we parse and evaluate assignment statements, we can add <variable-name,value> pairs to that list.

When parsing, it is helpful to have parsers that recognize each individual kind of token.  So we define some helper functions.  In the case of identifiers and natural numbers, we will want to match them regardless of their value, whereas in the case of symbols and keywords, we will want to match specific values.  For example, there are places in the language grammar where any identifier can appear, whereas there are places in the grammar where the specific keyword “let” (and not just any keyword) must appear.  So these helpers differ based on whether they match a general type of token or a token with a specific value:

static P<T, GToken> MatchType<T>() where T : TokenBase
{
    return from x in SatShow<GToken>(t => t.Token.GetType() == typeof(T), GToken.ShowToken)
           select x.Token as T;
}
static P<IdentifierToken, GToken> Identifier = MatchType<IdentifierToken>().Tag("identifier");
static P<NaturalToken, GToken> Natural = MatchType<NaturalToken>().Tag("natural number");

static P<GToken, GToken> Symbol(string val)
{
    return SatShow<GToken>(t => t.Token.Equals(new SymbolToken(val)), GToken.ShowToken)
        .Tag("symbol \"" + val + "\"");
}
static P<GToken, GToken> Keyword(string val)
{
    return SatShow<GToken>(t => t.Token.Equals(new KeywordToken(val)), GToken.ShowToken)
        .Tag("keyword \"" + val + "\"");
}

SatShow is a new function in the parser library, that works like Sat (e.g. recognizes tokens that match a particular predicate), but takes an extra function that returns a string representation of tokens (for use in error messages).  The definition of GToken.ShowToken is just

public static string ShowToken(GToken pt)
{
    return "token \"" + pt.Token.ToString() + "\"";
}

To sum up, these helpful parsers (Identifier, Natural, Symbol(), and Keyword()) match particular kinds of tokens, or else yield useful error diagnostics (e.g. “unexpected token “+”, expected keyword “let””).

One more useful helper function is

static P<Func<int, int, int>, GToken> BinaryOp(string op, Func<int, int, int> function)
{
    return from o in Symbol(op)
           select function;
}

which makes it easy to turn operator symbols into the corresponding functions.  Note that this just calls Symbol(), which parses the corresponding SymbolToken.

With these helper functions out of the way, let’s build the parser, starting with the infix operators:

var addOp = BinaryOp("+", (x, y) => x + y)
        .Or(BinaryOp("-", (x, y) => x - y))
        .Tag("add/subtract op");
var mulOp = BinaryOp("*", (x, y) => x * y)
        .Or(BinaryOp("/", (x, y) => x / y))
        .Tag("multiply/divide op");

We’ve seen code like this before, nothing new to explain.  Recall that these are defined as two separate parsers because they have different precedence and will be used in different layers of the grammar to enforce that precedence.

Parsing numbers is a snap:

var nat = from n in Natural select n.Value;

We use the Natural parser to parse a NaturalToken and return its Value field.

Now let’s define the various main portions of expressions.  We’ll declare all the names up front, to deal with the recursion/forward-declaration issue, as before.  We’ll start with the “pow” function for exponentiation:

P<int, GToken> expression = null, term = null, factor = null,
               paren = null, pow = null, variable = null;
pow = from p  in Keyword("pow")
      from o  in Symbol("(")
      from e1 in expression
      from x  in Symbol(",")
      from e2 in expression
      from c  in Symbol(")")
      select (int)Math.Pow(e1, e2);

We saw this code earlier in the blog entry.  The Keyword() and Symbol() calls match those specific tokens, and each expression parser will match a full expression and evaluate it (binding the results to variables named e1 and e2).  The result is computed by calling the corresponding C# Pow(e1,e2) function.

Parenthesized expressions are a breeze:

paren = from o in Symbol("(")
        from e in expression
        from c in Symbol(")")
        select e;

Variables are where it first gets really interesting:

variable = from pos in GetCurrentPosition<GToken>()
           from v   in Identifier
           from env in GetUserState<GToken>()
           from lookup in (env as VarEnv).Where(kvp => kvp.Key == v.Value)
                   .IfNonEmpty((kvp, rest) => Return<int, GToken>(kvp.Value))
                   .Else(() => Fail<int, GToken>("variable \"" + v + "\" was not defined", pos))
           select lookup;

There are a couple new pieces here.  First we get the current token location of the parse, and save that Position in the variable pos (we’ll see why in a moment).  Then we parse an identifier, calling it v.  Then we fetch the state stored in the parser, which will be the variable environment of type VarEnv.  Finally we run this interesting bit of code labeled lookup.  It takes the variable environment and calls Where(), the C# function on an IEnumerable<T> that filters out just those values that match a predicate.  Here we filter out just those key-value pairs whose key is the variable name v.Value (recall that since v is an IdentifierToken, its Value is the string that is the variable name).  For reasons we’ll enforce later, there will be at most one such element in the list.  So we have a list of either zero or one values, and need to make a choice depending on which it is (zero or one).  Which brings us to a C# aside…

IEnumerables do not offer a convenient way to test if they are empty or not.  You can do

someEnumerable.GetEnumerator().MoveNext() == false

as a test for empty-ness, but that is verbose, and it involves creating a new enumerator and calling a side-effect-ful method on it.  To deal with this more nicely, I have defined an extension method on IEnumerable<T> that allows you to write code like this:

someList.IfNonEmpty( (firstVal, restOfList) => yadda )
        .Else( () => yadda )

That is, you hand in two functions, one of which (that gets called when the list isn’t empty) takes as its two arguments the single first value and the rest of the list, and the other (which gets called when it is empty) has no arguments.  Here’s an example of the function in practice:

static int Sum(IEnumerable<int> list)
{
    return list.IfNonEmpty( (x, rest) => x + Sum(rest) )
               .Else( () => 0 );
}
Sum(new[] { 1, 2, 3, 4 });     // 10

Get the idea?  (For those of you familiar with functional languages, this is a kind of like pattern matching on an algebraic data type; if a List is either a Cons(first,rest) or Nil, then this helper is like a catamorphism on the type List.)

Ok, so enough of an aside, let’s get back to our parser.  We were examining this code:

variable = from pos in GetCurrentPosition<GToken>()
           from v   in Identifier
           from env in GetUserState<GToken>()
           from lookup in (env as VarEnv).Where(kvp => kvp.Key == v.Value)
                   .IfNonEmpty((kvp, rest) => Return<int, GToken>(kvp.Value))
                   .Else(() => Fail<int, GToken>("variable \"" + v + "\" was not defined", pos))
           select lookup;

We had gotten to where we filter out the <variable name, value> pairs from the environment that had the right name key.  So, if that list is non-empty, we Return the corresponding value from the first element of the list.  If the list is empty, then, uh-oh, someone used an undefined variable, so we Fail with a helpful error message.  (Exception handling without throwing Exceptions!)  And this is a new overload for Fail that we haven’t see yet before.  In addition to an error message, it takes a Position to describe the location of the failure.  We pass in the pos we saved before parsing the variable name.  That’s so the error gets reported on the first character of the IdentifierToken.  If we hadn’t done this, the error would be reported at the current position of the parse, which is just after the variable.  In this example, the difference is minor, but we’ll see shortly another example where this kind of care makes a big difference.

The rest of the parser for expressions is mostly familiar:

factor = nat.Or(paren).Or(pow).Or(variable);
term = factor.Chainl1(mulOp);
expression = term.Chainl1(addOp);

The new bit is that factor can now be a “pow” call or a variable, in addition to a number or a parenthesized expression.  This reflects the grammar of the language “Goal”.

Now for assignment statements:

var stmt =
  from l   in Keyword("let")
  from pos in GetCurrentPosition<GToken>()
  from v   in Identifier
  from q   in Symbol("=")
  from e   in expression
  from s   in Symbol(";")
  from env in GetUserState<GToken>()
  from update in (env as VarEnv).Where(kvp => kvp.Key == v.Value)
         .IfNonEmpty((kvp, rest) => Fail<object, GToken>(
                                         "variable \"" + v + "\" was already defined", pos))
         .Else(() => SetUserState<GToken>(Cons(new KeyValuePair<string, int>(v.Value, e),
                                               env as VarEnv)))
  select e;

The “parsing” part of this parser is straightforward, we match the “let” keword, we match an identifier, then an equals sign, … The interesting bit is the “update” part.  As before, we filter the environment state looking for variables with this name.  If one has already been defined, we Fail with a helpful error message.  If not, then we create a new key-value pair with the variable name and its value, and Cons it onto the front of the environment list, setting this new environment value as the updated user state.  Note the use of pos to save the position just as we are about to parse the Identifier.  If we hadn’t done this here, and just called the one-argument version of Fail (that reports an error at the current location), then the error would mistakenly point to the character after the semicolon in the assignment statement.  This illustrates the value of GetCurrentPosition – it makes it easy to save the location of a token so that you can choose to parse more, but then report errors back from an earlier location.

Finally we can write the code that parses an entire “Goal” program:

var program = from setup in SetUserState<GToken>(Enumerable.Empty<KeyValuePair<string, int>>())
              from stmts in Many(stmt)
              from e     in expression
              from eof   in EndOfInput<GToken>()
              select e;

First we initialize the variable environment as an empty list of key-value pairs.  Then we parts any number of statements.  As we just saw, the act of parsing statements will update the variable environment with new variable-value pairs as each statement is parsed.  Finally we parse an expression (calling the result e), followed by the end of input, and return the value e.  That’s it – we’ve finished the parser!  Though the exposition was long, if you were keeping track it was only about 70 lines of code!  That’s impressively succinct.  Look back at those code, and you’ll see that many of those lines were mostly whitespace, too (e.g. just “from x in something”).

Putting together the pieces

We have a scanner, we have a parser.  How do we connect them together?  For this I have defined a function in the library called Compose that has this signature:

public static P<T1,Tok2> Compose<T1,Tok1,Tok2>(
    P<T1,Tok1> p1,
    P<IEnumerable<Tok1>,Tok2> p2,
    Func<Position, Tok1, IEnumerable<Tok1>, Position> computeNextTokenPosition)

It takes two parsers (the first of which consumes the same kind of tokens that the second one produces), as well as a function to compute token-positions (recall that for user-defined tokens, we need a way to translate those tokens back into line and column numbers).  As a result, we can chain together our parser and scanner like so:

P<int, char> PROGRAM = Compose(program, scanner, GToken.ComputeNextTokenPosition);

and then parse and evaluate whole programs (here’s a tiny one squeezed on one line):

PROGRAM.Parse("let x=3;let y=2*x;let xSq=pow(x,2);xSq-y")  // Success: 3

Great!  Check out the full source code for more examples, including a few that have errors and yield terrific diagnostics that explain the failures.

Summary

Today we added a few more features to the parser library that make it really useful.  By adding a couple of simple extension methods to our parser type, we enabled the elegant new LINQ syntax using from and select that saves us from uglier code with Then and Return.  By generalizing away from the fixed token type char, we made it possible to hook up our parsers to external scanners that yield a stream of tokens.  And we saw that we can still use the library to build a scanner, too, and chain both the scanner and the parser together into one parser that goes directly from characters to results.  By adding user-defined state to the parser, we enabled a straightforward way to do on-the-fly semantic checking during the parse.  In the end we were able to quickly author a parser for the “Goal” language, a language which, while very tiny, exhibits many of the major issues and features one encounters in real languages.

We built an entire scanner and parser for “Goal” in just about 100 lines of code.  Even when you add in another 100 lines or so of mostly-boilerplate code to define the token classes, that’s still a pretty succinct way to express a parser-evaluator for “Goal” that also yields great error diagnostics.  Of course, we built it using a monadic parser combinator library which itself is another 600 lines of code or so.  But that library of parsers and combinators is reusable – indeed at this point the library is a pretty decent little chunk of code to use when playing around with parsers.  I say “playing around”, as this isn’t at all a production-quality library; it’s just some code I enjoyed hacking together over the last week or two.  In addition to undoubtedly having some bugs (as well as not admitting a particularly friendly debugging experience), I expect the perf is awful.  We’re lambda-capturing, IEnumerable<T>ing, and newing up immutable data structures all over the place, which is probably creating a ton of work for the garbage collector.  Even when just running the parser on a handful of “Goal” programs inside the debugger, there’s noticeable delay of about a half-second before the program finishes on my (slow, old) box.  At some point I should lay my hands on a profiler to get a feel for whether the performance is intrinsic to the idiomatic style of programming I’m using for this, or whether I’ve simply perpetrated some gross buffoonery in a core combinator like Then that makes things slow or garbage-prone.  Regardless of the deficiencies (some of which can surely be corrected with more effort), I think the monadic combinator approach to structuring parsers is extremely expressive and succinct.

Looking ahead

What’s left to do?  On the one hand, we’ve reached a nice plateau at this point; the library is sufficiently fleshed out to be useful for expressing parsers for a number of languages.  On the other hand, there is still plenty that can be done, including

  • introducing more combinators in the library (Parsec, for example, has tons more useful stuff)
  • improving the perf (I would really enjoy profiling this to get a better feel for how lambda and enumerables behave at runtime; I would also like to find and fix the leaks)
  • adding laziness, so that it’s possible to get online behavior, where initial results can be produced while the rest of the input is still streaming in
  • adding laziness to the internals, so that for example, rather than eagerly computing error information as we go along, we wait until there is an error (if there is one) to compute the corresponding diagnostics (I am not sure if this is pragmatic, but it sounds kinda fun)
  • a ton of ideas I haven’t thought of yet, I’m sure

But after five major blogging (and coding!) installments, I think I’m probably ready for a short break.  :)

Full Source code

I’ll post the code in a separate blog entry soon.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: