Inside F#

Brian's thoughts on F# and .NET

Archive for December, 2007

Monadic parser combinators, part seven

Posted by Brian on December 12, 2007

Today’s blog entry will be a little scattered, as there’s a handful of small topics I want to write about.  I was looking for an excuse to download and try out the trial version of VSTS, so of course I decided that monadic parser combinators were a good reason to try out the VSTS features for unit testing, code coverage, and performance profiling. 

Debugging monadic parser combinators

To start today off, I want to discuss what it’s like to use the debugger on the monadic parser combinator code.  Among the most useful features a debugger provides are the abilities to set break points, see call stacks, and inspect the values of local variables.  All three of these features I just named are somewhat hindered by the monadic style of writing parsers.  Let’s talk about each problem and possible mitigations.

Breakpoints

The only thing problematic about breakpoints is how they interact with lambdas whose source code is formatted on a single line.  When you write code like this:

s => f(s)

and press F9 on that line of code to set a breakpoint in VS, the breakpoint is set at the lambda creation point.  If instead you want to break when the function f is about to be called, it seems like either you have to be single-stepping through the code and hit F9 when you get inside the body of the lambda (which sets a breakpoint on the same "line", but with only the body of the lambda highlighted, rather than the whole lambda expression), or you must break up the code so that the body is a statement on a separate line like so:

s => 
{
    return f(s);
}

in which case you can press F9 on the first line to break at the lambda-creation point, or set a breakpoint on the return statement to break inside the body of the lambda.  Perhaps there is a way to tell VS that you want to set a breakpoint inside the lambda body when the whole thing is on one line, but if there is, I haven’t discovered it yet.  (I think there are similar issues arising from use of the ?: operator, and as we shall see, "line-oriented tools" are a little problematic when it comes to code coverage, too.)  To sum up, you may have to exert a little more effort/care to get breakpoints where you want them when using lambdas.

Call stacks

An unfortunate consequence of the monadic style of programming parsers is that practically every call stack during the course of the program is an unreadable pile of calls to Then(), Or(), and the lambdas nested therein.  Whereas in more typical styles of programming, a glance at a call stack provides immediate insight into "where you are" in the course of the computation, in the case of monadic parsers there’s no such luck.  I haven’t come up with any good way to deal with this issue; call stacks are just not very useful for debugging monadic parsers.

Local variables 

Things here start off dismally, but there is potential for salvation.  The representation type we have chosen for parsers (P<T>) has been a delegate type, and delegates are effectively opaque.  That is to say, when you inspect a local parser variable p in the debugger, it has no useful members that suggest what p actually is (e.g. is this a parser for expressions, for natural numbers, or for what?).  This is straightforward to overcome, however; we can alter the representation type like so:

public delegate Consumed<T, Token> PFunc<T, Token>(ParserState<Token> input);
public class P<T, Token>
{
    public readonly string Name;
    readonly PFunc<T, Token> func;
    public Consumed<T, Token> Run(ParserState<Token> input)
    {
        return this.func(input);
    }
    public P(PFunc<T, Token> f, string name) { this.func = f; this.Name = name; }
    public override string ToString()
    {
        return this.Name;
    }
}

Representing parsers as a class type, rather than a delegate type, allows us to decorate parsers with extra metadata that provides informative descriptions about parser objects in the debugger.  The Tag() function, which is used to provide useful runtime error messages for failed parses, can be re-leveraged to set high-level names for parsers.  The other core functions can be written to also supply useful names, as suggested by one example here:

public static P<T, Token> Or<T, Token>(this P<T, Token> p1, P<T, Token> p2)
{
    return new P<T,Token>(input => { /* usual logic */ }, 
                          string.Format("'{0}'.Or('{1}')", p1.Name, p2.Name));
}

so that in the end, code like this:

var p3 = nat.Or(paren);

yields a value for p3 that shows up as:

'natural number'.Or('parenthesized expression')

inside the debugger when inspecting the value of p3.  This drastically improves the debugging experience, as parsers are no longer completely opaque – you can now look at a given parser object and know what it should parse.
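To make the name-composition concrete, here is a minimal compilable sketch of the idea.  The delegate signature is simplified down to strings and the parse logic is stubbed out, so only the Name plumbing from the discussion above is real:

```csharp
using System;

// Simplified stand-in for the real PFunc/P<T,Token> pair: the parse logic is
// stubbed out; only the Name plumbing is demonstrated.
public delegate string PFunc(string input);

public class P
{
    public readonly string Name;
    readonly PFunc func;
    public P(PFunc f, string name) { this.func = f; this.Name = name; }
    public string Run(string input) { return this.func(input); }
    public override string ToString() { return this.Name; }
}

public static class Combinators
{
    // Or() composes the operands' Names, so the debugger shows a readable description
    public static P Or(this P p1, P p2)
    {
        return new P(input => p1.Run(input), string.Format("'{0}'.Or('{1}')", p1.Name, p2.Name));
    }
}

public class Demo
{
    public static void Main()
    {
        var nat = new P(s => s, "natural number");
        var paren = new P(s => s, "parenthesized expression");
        var p3 = nat.Or(paren);
        // ToString() is what the debugger displays when inspecting p3
        Console.WriteLine(p3);  // 'natural number'.Or('parenthesized expression')
    }
}
```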

Other pragmatics

I was talking to my friend Jeff about parser combinators, and whining about the poor debugging experience and how there is no context for "where you currently are in the parse", and he offered a great suggestion – keep the entire history as you’re parsing, and display where you currently are.  I chewed on this a little and came up with an effective history representation, as well as a (barely) adequate visualization of the history.  Here’s a picture of what it looks like to parse an expression that starts with "(611-2^3":

[Screenshot: MPCViz, a tree view visualizing the history of the parse]

The idea is to have a tree view: each time we have a Tag()-ed parser, we start a new level of nesting, and each Then() or Or() adds a new child to the current level.  Since it’s a tree view, you can expand/collapse parts so as to drill down or sum up the history of the parse as you like.  I don’t know enough about WinForms (yet) to make this pretty, but rather than the "starting at (line,col) ending at (line,col)" you could imagine a much nicer visualization that highlighted the input in context, so that maybe it shows the input string and when you mouse over a successful portion of the parse, it highlights that portion of the input: "(611-2^3…".  Anyway, there is a lot of potential here.  The representation is easy, you just add a "history" field to the ParserState<Token> type, and have the core combinators update the history as they go (being especially careful in Or() to "backtrack" on the input string, but thread the history forward, so failures are not forgotten).  Of course you’d want to add all this extra stuff to the library only in DEBUG mode, so that you’re not lugging around the entire history in a release build.  I’m not sharing the code for all this, but it is straightforward to implement.

(This is also a good opportunity to take an aside to point out the utility of DebuggerVisualizers in Visual Studio.  If you haven’t ever seen or used them, you absolutely must check them out.  They are a tremendously useful tool for debugging; I authored a couple of useful ones back when I was a developer on the WCF team which I’ll have to dig up and post in a separate blog entry.  Additionally, for PM-types, DebuggerVisualizers can effectively be used to provide a kind of "live" algorithm animation, which can be a great alternative to PowerPoint slides for presenting some new code-oriented topic.)

A final pragmatic note about debugging these parsers – putting a single breakpoint in the SatShow() function is a great way to keep track of the progress of the parse.  In the most recent bit of code I posted, things were factored so that single function was the only place in the entire library that actually consumed a token of input and advanced down the token stream, making it a great location for a breakpoint to keep track of the running progress of the computation.

Unit testing and code coverage

I am a fan of test-driven development, as you may already know if you have read this previous blog entry of mine on the topic.  I saw that VSTS has a unit test framework and tools for automatically creating tests, as well as ways to measure code coverage and complexity, so of course I had to try it out.  Though I haven’t used these VSTS features enough yet to form a strong opinion about them, I am very glad to see a tool "in the box" that enables you to very easily instrument your code for coverage, run a set of unit tests, and then bring up data about coverage, either in a class/method table breakdown (showing percentage numbers), or by showing the code with ‘hit’ blocks highlighted in green and ‘miss’ blocks highlighted in red.  I love the red-and-green code coverage highlighting, as it makes it easy to quickly see if your unit tests are at least covering all the various branches of the code you are writing.

One interesting thing to note about coverage as it relates to monadic parser combinators is that block coverage data alone is not very useful.  Since parsers are functions, the combinators like Then and Or end up getting all the coverage that describes different paths through a parse.  For example, if part of your parser has "nat.Or(paren)", that line of code will get covered, but that just tells you that such a parser was constructed.  When the resultant parser eventually runs and makes a choice between parsing a natural number or a parenthesized expression, the corresponding if-then-else branch is the one inside the lambda inside the definition of the Or combinator.  And there is no easy way to tell one instance of a call to Or from another.  Whereas a traditionally structured parser may have two unique blocks in the source code that represent whether a portion of an expression was parsed as a natural number or a parenthesized expression (e.g. so there was a line of code you could put a breakpoint on to determine which alternative was taken at this choice point of the grammar you are parsing), there is no corresponding unique line of code in a monadic parser.  Instead, our expression parser is built using a big block of straight-line code, and the result is a giant lambda that repeatedly calls back into functions like Or() to make all of the decisions that affect the control of the algorithm.  As a result, you would need more discriminating coverage tools (that consider callers or the original constructors of lambdas) to discern the coverage of all the different choice points of a language grammar you are parsing.  And you’d need some kind of conditional breakpoint to break, for example, just on the Or() that chooses between "nat" and "paren" (but not break on every other Or).  
I haven’t thought about this much yet, but it’s very much a similar kind of issue as the call-stack problem I described earlier in this blog entry, and I wonder if there’s somehow some clever solution to both.  Deserves more thought.

The take-home moral of the previous paragraph is to be wary of the coverage numbers.  The coverage metrics I got showed nearly 100% block coverage for the handful of examples I do in the Main() algorithm, but there are undoubtedly still many uncovered "branches" which would have shown up as un-covered code blocks in a traditionally structured parser.

One final note about the code coverage highlighting that VSTS provides is that it is "per line", which means that you need to be careful about separating lambda bodies out onto separate lines of code if you want to be able to see the lambda body coverage distinct from the coverage of the code creating the lambda.

Performance

VSTS has tools for performance profiling, so of course I put them to use by profiling the code I posted last time.  Good performance analysis requires more effort than I was willing to put forth this morning, so here are a bunch of quick caveats regarding the data I’m sharing.  I just used the main program from the previously posted code as the code to profile.  It’s not necessarily a very representative sample, it runs on very short inputs, the whole program is pretty short and runs in about 1.5 seconds on my box, and it doesn’t appear to be CPU-bound.  But despite all that, I did a profile run to check out "where the time was spent" to see if there were any hot spots to identify.

Two things stood out.  First is that the top five "application exclusive" time users were all calls to delegate constructors (e.g. constructors of P<T,Token> or Func<X,Y,Z> objects).  This is, of course, partly because there are just so many lambdas being constructed, but it also makes me wonder about the performance of creating lambdas in general.  I haven’t tried rewriting "hot" lambdas as classes-with-invoke-methods that explicitly capture the environment in their constructors, to see if the perf changes at all, but it might be interesting to try.  (I doubt it would have an effect, though, as I imagine the C# compiler ends up emitting similar CIL code in either case.)  The second thing that stood out was how many objects were allocated.  There were about 140,000 calls to the object constructor in that run, and I knew those numbers were high as a result of using pure functional data structures (that is, data structures without any side effects).  So the obvious thing to do (which I wanted an excuse to do anyway) was to rewrite the core of the library to use effects and see if and how this affected perf.
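As an aside, the "class with an Invoke method" rewrite contemplated above would look roughly like this (a hypothetical illustration; the C# compiler generates an equivalent closure class for the lambda behind the scenes):

```csharp
using System;

public class Demo
{
    // Hand-written equivalent of a closure: the captured environment (n)
    // becomes an explicit field, and the lambda body becomes Invoke().
    public class AddN
    {
        readonly int n;
        public AddN(int n) { this.n = n; }
        public int Invoke(int x) { return x + n; }
    }

    public static void Main()
    {
        int n = 5;
        Func<int, int> viaLambda = x => x + n;  // compiler emits a similar closure class
        var viaClass = new AddN(n);
        Console.WriteLine(viaLambda(2));        // 7
        Console.WriteLine(viaClass.Invoke(2));  // 7
    }
}
```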

Rewriting the library core to use effects under the hood was surprisingly easy.  In addition to using mutable variables, I also changed some of the data structures away from IEnumerable<T>s to more concrete types.  Here is an example of a typical change:

// old, "pure" code
public ErrorInfo SetExpectation(string label)
{
    if (string.IsNullOrEmpty(label))
        return new ErrorInfo(this.Position, this.History, Enumerable.Empty<string>(), this.Message);
    else
        return new ErrorInfo(this.Position, this.History, Parser.Cons(label, Enumerable.Empty<string>()), this.Message);
}
// new imperative code
public void SetExpectation(string label)
{
    this.Expectations.Clear();
    if (!string.IsNullOrEmpty(label))
        this.Expectations.Add(label);
}

This is a method inside the ErrorInfo class that gets called by Tag() – it updates the list of expectations that correspond to error messages like "expected character ‘(‘ or natural number".  Whereas previously this function returned a new ErrorInfo object containing the new values, now it mutates the existing object, which now contains a List<string> to represent the expectations.

Of course, when you use effects, you need to be careful about aliasing.  If you mutate an object, and some other object has a reference to that object, you are changing the view of the world for that other object, too.  Fortunately inside the core of a monadically-structured parser library, you can impose simple clear rules of object ownership.  The ParserState<T,Token> objects passed as input to each P<T,Token> delegate "belong" to the delegate, which means if you run p(input), then you are giving p permission to modify input.  The result of running a parser is a Consumed<T, Token> object which may contain references pointing back into the input.  Most of the parser is "single-threaded" in the sense that data has a single source and sink, and so the same named object is rarely used in two different expressions inside a function.  The main exception is Or(), which can pass the same input to both p1 and p2.  Since Or() only calls p2 when p1 doesn’t consume (which means p1 doesn’t mutate the input object), this is usually ok.  The exception is Try(), which lies about whether it mutates the input.  As a result, Try() has to clone the input and restore it back to the previous value if it consumes and fails:

public static P<T, Token> Try<T, Token>(this P<T, Token> p)
{
    return input =>
    {
        ParserState<Token> copy = input.Clone();
        Consumed<T, Token> consumed = p(input);
        if (consumed.HasConsumedInput && !consumed.ParseResult.Succeeded)
        {
            consumed.HasConsumedInput = false;
            input.ResetBackTo(copy);
        }
        return consumed;
    };
}

Clone() and ResetBackTo() are new functions on the ParserState<T,Token> class that exist to enable Try to backtrack to a previous state after a mutation.  You can see the full code at the end of this blog entry.
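The full source has the real definitions; as a rough sketch of the shape these methods might take (the field names here are illustrative assumptions, not the actual library code):

```csharp
using System;
using System.Collections.Generic;

// Rough sketch only: a mutable parser state with snapshot/restore support.
// The fields (Position, Expectations) are illustrative assumptions.
public class ParserState
{
    public int Position;
    public List<string> Expectations = new List<string>();

    // Snapshot the mutable state before handing it to a parser that may fail
    public ParserState Clone()
    {
        var copy = new ParserState { Position = this.Position };
        copy.Expectations.AddRange(this.Expectations);
        return copy;
    }

    // Undo any mutations by restoring the snapshot
    public void ResetBackTo(ParserState saved)
    {
        this.Position = saved.Position;
        this.Expectations.Clear();
        this.Expectations.AddRange(saved.Expectations);
    }
}

public class Demo
{
    public static void Main()
    {
        var state = new ParserState { Position = 5 };
        var saved = state.Clone();
        state.Position = 9;           // the parser consumed input and then failed...
        state.ResetBackTo(saved);     // ...so Try() backtracks to the snapshot
        Console.WriteLine(state.Position);  // 5
    }
}
```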

Since these changes are all limited to the "core" of the library, we don’t need to make any changes to the rest of the library (e.g. combinators like Chainl1 and SepBy) or to client code (e.g. the expression parser).  The public interface and behavior remain the same.

Profiling this new program that used effects under the hood revealed a couple noteworthy things.  First, the number of allocated objects dropped dramatically, from about 140,000 objects in the original to about 34,000 in the new version.  Allocating fewer objects is almost always a good thing.  Furthermore, the "application exclusive" time still listed the same top five functions, which means these changes did not introduce new CPU hogs.  (I did note that some new functions like "List.Clear()" now appeared in the top 20 or so functions, but these newcomers were not significant contributors overall.)  In terms of wall-clock time, the program runs about 15% faster, dropping from about 1.4s to 1.2s for a release build on my box.  It’s hard to draw any firm conclusions from the meager perf analysis I did today.  But in the end, I was able to change the code so that it runs faster and allocates less, and I localized those changes to the core of the library without changing the main public interface.  So that makes me happy, as I think it’s suggestive of a general potential design and implementation approach: start things off effect-free, get a correct implementation with a nice interface, and then later, as necessary, go back and add side effects under the hood to make things more efficient.

One thing that might be interesting is to rewrite the same parsing program using a traditional recursive-descent structure, and see how it performs in comparison.  I doubt I’ll find the time and motivation to do that, though.

There’s clearly other potential low-hanging fruit to make some portions faster.  For example, representing the input as a character stream all the way through and peeling off individual characters one by one is pretty awful; buffering big blocks of characters under the hood and then using fast access to those buffers might make a big difference on large inputs.  But for the purposes of this blog, I am less concerned with specific points of practical performance, and more concerned with thinking about the big picture and ensuring that most such optimizations could be made without altering the overall structure and interface of the parsing library.

Summary

Though today’s blog entry is a smattering of different topics, they are all related in the sense that, for each topic, I discuss how monadic parser combinators differ from typical applications with respect to a specific concern.  I’ve discussed how the parsing library has an unusual appearance in the debugger, and how to get the debugging information you desire.  I’ve talked about how this code interacts with code coverage tools and needs special consideration to meaningfully understand the resultant coverage metrics.  And I’ve shown a little about how a pure functional program’s performance characteristics can be altered by introducing effects under the hood.  All of these topics demonstrate ways that this kind of code is unusual compared to typical applications and libraries you find in C#.

Full source code

You can find the code for the library-using-side-effects-under-the-hood here.



Monadic parser combinators, parts five and six – source code

Posted by Brian on December 9, 2007

Here is the source code for part five and part six.  (I had to just link externally to the source code, as it’s too big to post inside a blog entry.)

In addition to the parser code, there is also some utility code involving IEnumerable<T> that I haven’t discussed yet on the blog.  It enables you to "backtrack" through the input stream even if the input is a forward-only one-pass enumerable.


Monadic parser combinators, part six

Posted by Brian on December 8, 2007

The entire blog entry for today was too long for the server to upload all at once, so this is the continuation of the previous entry.  If you haven’t read that already, you should do so now, as you need it for context.

Parsing and evaluating the “Goal” language

I don’t know if you noticed, but ignoring the definitions of the various token classes (which are mostly boilerplate code), the scanner took only about 30 lines of code to define!  That’s pretty nice.  But then again, it is just the scanner.  Surely parsing and evaluating the language will be a lot more work, right?  Let’s find out.

Since “Goal” has variables that will be assigned integer values, we need a way to store those variable-value associations in the state of the parser.  A list of key-value pairs is a fine representation:

using VarEnv = IEnumerable<KeyValuePair<string,int>>;

At the start of a program we’ll store an empty list as the user state of the parser, to represent that at the start, no variables have yet been defined in the environment.  As we parse and evaluate assignment statements, we can add <variable-name,value> pairs to that list.
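In plain C# terms (a standalone illustration, not library code), the environment behaves like this: we extend it by consing a new pair onto the front, and look up a variable by filtering with Where().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Standalone illustration of the environment representation: extend by
// consing a new pair onto the front, look up with Where().
public class Demo
{
    public static IEnumerable<T> Cons<T>(T head, IEnumerable<T> tail)
    {
        yield return head;
        foreach (var t in tail) yield return t;
    }

    public static void Main()
    {
        IEnumerable<KeyValuePair<string, int>> env =
            Enumerable.Empty<KeyValuePair<string, int>>();          // start of program
        env = Cons(new KeyValuePair<string, int>("x", 3), env);     // let x = 3;
        env = Cons(new KeyValuePair<string, int>("y", 6), env);     // let y = 2*x;

        var match = env.Where(kvp => kvp.Key == "x").ToList();
        Console.WriteLine(match[0].Value);  // 3
    }
}
```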

When parsing, it is helpful to have parsers that recognize each individual kind of token.  So we define some helper functions.  In the case of identifiers and natural numbers, we will want to match them regardless of their value, whereas in the case of symbols and keywords, we will want to match specific values.  For example, there are places in the language grammar where any identifier can appear, whereas there are places in the grammar where the specific keyword “let” (and not just any keyword) must appear.  So these helpers differ based on whether they match a general type of token or a token with a specific value:

static P<T, GToken> MatchType<T>() where T : TokenBase
{
    return from x in SatShow<GToken>(t => t.Token.GetType() == typeof(T), GToken.ShowToken)
           select x.Token as T;
}
static P<IdentifierToken, GToken> Identifier = MatchType<IdentifierToken>().Tag("identifier");
static P<NaturalToken, GToken> Natural = MatchType<NaturalToken>().Tag("natural number");

static P<GToken, GToken> Symbol(string val)
{
    return SatShow<GToken>(t => t.Token.Equals(new SymbolToken(val)), GToken.ShowToken)
        .Tag("symbol \"" + val + "\"");
}
static P<GToken, GToken> Keyword(string val)
{
    return SatShow<GToken>(t => t.Token.Equals(new KeywordToken(val)), GToken.ShowToken)
        .Tag("keyword \"" + val + "\"");
}

SatShow is a new function in the parser library that works like Sat (e.g. recognizes tokens that match a particular predicate), but takes an extra function that returns a string representation of tokens (for use in error messages).  The definition of GToken.ShowToken is just

public static string ShowToken(GToken pt)
{
    return "token \"" + pt.Token.ToString() + "\"";
}

To sum up, these helpful parsers (Identifier, Natural, Symbol(), and Keyword()) match particular kinds of tokens, or else yield useful error diagnostics (e.g. “unexpected token “+”, expected keyword “let””).

One more useful helper function is

static P<Func<int, int, int>, GToken> BinaryOp(string op, Func<int, int, int> function)
{
    return from o in Symbol(op)
           select function;
}

which makes it easy to turn operator symbols into the corresponding functions.  Note that this just calls Symbol(), which parses the corresponding SymbolToken.

With these helper functions out of the way, let’s build the parser, starting with the infix operators:

var addOp = BinaryOp("+", (x, y) => x + y)
        .Or(BinaryOp("-", (x, y) => x - y))
        .Tag("add/subtract op");
var mulOp = BinaryOp("*", (x, y) => x * y)
        .Or(BinaryOp("/", (x, y) => x / y))
        .Tag("multiply/divide op");

We’ve seen code like this before, nothing new to explain.  Recall that these are defined as two separate parsers because they have different precedence and will be used in different layers of the grammar to enforce that precedence.

Parsing numbers is a snap:

var nat = from n in Natural select n.Value;

We use the Natural parser to parse a NaturalToken and return its Value field.

Now let’s define the various main portions of expressions.  We’ll declare all the names up front, to deal with the recursion/forward-declaration issue, as before.  We’ll start with the “pow” function for exponentiation:

P<int, GToken> expression = null, term = null, factor = null,
               paren = null, pow = null, variable = null;
pow = from p  in Keyword("pow")
      from o  in Symbol("(")
      from e1 in expression
      from x  in Symbol(",")
      from e2 in expression
      from c  in Symbol(")")
      select (int)Math.Pow(e1, e2);

We saw this code earlier in the blog entry.  The Keyword() and Symbol() calls match those specific tokens, and each expression parser will match a full expression and evaluate it (binding the results to variables named e1 and e2).  The result is computed by calling the corresponding C# Pow(e1,e2) function.

Parenthesized expressions are a breeze:

paren = from o in Symbol("(")
        from e in expression
        from c in Symbol(")")
        select e;

Variables are where it first gets really interesting:

variable = from pos in GetCurrentPosition<GToken>()
           from v   in Identifier
           from env in GetUserState<GToken>()
           from lookup in (env as VarEnv).Where(kvp => kvp.Key == v.Value)
                   .IfNonEmpty((kvp, rest) => Return<int, GToken>(kvp.Value))
                   .Else(() => Fail<int, GToken>("variable \"" + v + "\" was not defined", pos))
           select lookup;

There are a couple new pieces here.  First we get the current token location of the parse, and save that Position in the variable pos (we’ll see why in a moment).  Then we parse an identifier, calling it v.  Then we fetch the state stored in the parser, which will be the variable environment of type VarEnv.  Finally we run this interesting bit of code labeled lookup.  It takes the variable environment and calls Where(), the C# function on an IEnumerable<T> that filters out just those values that match a predicate.  Here we filter out just those key-value pairs whose key is the variable name v.Value (recall that since v is an IdentifierToken, its Value is the string that is the variable name).  For reasons we’ll enforce later, there will be at most one such element in the list.  So we have a list of either zero or one values, and need to make a choice depending on which it is (zero or one).  Which brings us to a C# aside…

IEnumerables do not offer a convenient way to test if they are empty or not.  You can do

someEnumerable.GetEnumerator().MoveNext() == false

as a test for empty-ness, but that is verbose, and it involves creating a new enumerator and calling a side-effect-ful method on it.  To deal with this more nicely, I have defined an extension method on IEnumerable<T> that allows you to write code like this:

someList.IfNonEmpty( (firstVal, restOfList) => yadda )
        .Else( () => yadda )

That is, you hand in two functions, one of which (that gets called when the list isn’t empty) takes as its two arguments the single first value and the rest of the list, and the other (which gets called when it is empty) has no arguments.  Here’s an example of the function in practice:

static int Sum(IEnumerable<int> list)
{
    return list.IfNonEmpty( (x, rest) => x + Sum(rest) )
               .Else( () => 0 );
}
Sum(new[] { 1, 2, 3, 4 });     // 10

Get the idea?  (For those of you familiar with functional languages, this is kind of like pattern matching on an algebraic data type; if a List is either a Cons(first,rest) or Nil, then this helper is like a catamorphism on the type List.)
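For reference, here is one possible implementation of IfNonEmpty/Else.  This is only a sketch (the published library's version may differ), and note that this version enumerates the source twice, so it assumes a re-enumerable sequence:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of one possible IfNonEmpty/Else implementation.  IfNonEmpty captures
// the non-empty case; Else supplies the empty case and forces the choice.
// Note: this version enumerates the source twice (once to test emptiness,
// once via Skip), so it assumes a re-enumerable sequence.
public class IfNonEmptyResult<T, R>
{
    readonly IEnumerable<T> source;
    readonly Func<T, IEnumerable<T>, R> nonEmptyCase;
    public IfNonEmptyResult(IEnumerable<T> source, Func<T, IEnumerable<T>, R> nonEmptyCase)
    {
        this.source = source;
        this.nonEmptyCase = nonEmptyCase;
    }
    public R Else(Func<R> emptyCase)
    {
        using (var e = source.GetEnumerator())
        {
            if (e.MoveNext())
                return nonEmptyCase(e.Current, source.Skip(1));
        }
        return emptyCase();
    }
}

public static class EnumerableExtensions
{
    public static IfNonEmptyResult<T, R> IfNonEmpty<T, R>(
        this IEnumerable<T> source, Func<T, IEnumerable<T>, R> nonEmptyCase)
    {
        return new IfNonEmptyResult<T, R>(source, nonEmptyCase);
    }
}

public class Demo
{
    public static int Sum(IEnumerable<int> list)
    {
        return list.IfNonEmpty((x, rest) => x + Sum(rest))
                   .Else(() => 0);
    }
    public static void Main()
    {
        Console.WriteLine(Sum(new[] { 1, 2, 3, 4 }));  // 10
    }
}
```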

Ok, so enough of an aside, let’s get back to our parser.  We were examining this code:

variable = from pos in GetCurrentPosition<GToken>()
           from v   in Identifier
           from env in GetUserState<GToken>()
           from lookup in (env as VarEnv).Where(kvp => kvp.Key == v.Value)
                   .IfNonEmpty((kvp, rest) => Return<int, GToken>(kvp.Value))
                   .Else(() => Fail<int, GToken>("variable \"" + v + "\" was not defined", pos))
           select lookup;

We had gotten to where we filter out the <variable name, value> pairs from the environment that had the right name key.  So, if that list is non-empty, we Return the corresponding value from the first element of the list.  If the list is empty, then, uh-oh, someone used an undefined variable, so we Fail with a helpful error message.  (Exception handling without throwing Exceptions!)  And this is a new overload for Fail that we haven’t seen before.  In addition to an error message, it takes a Position to describe the location of the failure.  We pass in the pos we saved before parsing the variable name.  That’s so the error gets reported on the first character of the IdentifierToken.  If we hadn’t done this, the error would be reported at the current position of the parse, which is just after the variable.  In this example, the difference is minor, but we’ll see shortly another example where this kind of care makes a big difference.

The rest of the parser for expressions is mostly familiar:

factor = nat.Or(paren).Or(pow).Or(variable);
term = factor.Chainl1(mulOp);
expression = term.Chainl1(addOp);

The new bit is that factor can now be a “pow” call or a variable, in addition to a number or a parenthesized expression.  This reflects the grammar of the language “Goal”.

Now for assignment statements:

var stmt =
  from l   in Keyword("let")
  from pos in GetCurrentPosition<GToken>()
  from v   in Identifier
  from q   in Symbol("=")
  from e   in expression
  from s   in Symbol(";")
  from env in GetUserState<GToken>()
  from update in (env as VarEnv).Where(kvp => kvp.Key == v.Value)
         .IfNonEmpty((kvp, rest) => Fail<object, GToken>(
                                         "variable \"" + v + "\" was already defined", pos))
         .Else(() => SetUserState<GToken>(Cons(new KeyValuePair<string, int>(v.Value, e),
                                               env as VarEnv)))
  select e;

The “parsing” part of this parser is straightforward: we match the “let” keyword, we match an identifier, then an equals sign, … The interesting bit is the “update” part.  As before, we filter the environment state looking for variables with this name.  If one has already been defined, we Fail with a helpful error message.  If not, then we create a new key-value pair with the variable name and its value, and Cons it onto the front of the environment list, setting this new environment value as the updated user state.  Note the use of pos to save the position just as we are about to parse the Identifier.  If we hadn’t done this here, and just called the one-argument version of Fail (that reports an error at the current location), then the error would mistakenly point to the character after the semicolon in the assignment statement.  This illustrates the value of GetCurrentPosition – it makes it easy to save the location of a token so that you can choose to parse more, but then report errors back from an earlier location.

Finally we can write the code that parses an entire “Goal” program:

var program = from setup in SetUserState<GToken>(Enumerable.Empty<KeyValuePair<string, int>>())
              from stmts in Many(stmt)
              from e     in expression
              from eof   in EndOfInput<GToken>()
              select e;

First we initialize the variable environment as an empty list of key-value pairs.  Then we parse any number of statements.  As we just saw, the act of parsing statements will update the variable environment with new variable-value pairs as each statement is parsed.  Finally we parse an expression (calling the result e), followed by the end of input, and return the value e.  That’s it – we’ve finished the parser!  Though the exposition was long, if you were keeping track it was only about 70 lines of code!  That’s impressively succinct.  Look back at that code, and you’ll see that many of those lines were mostly whitespace, too (e.g. just “from x in something”).

Putting together the pieces

We have a scanner, we have a parser.  How do we connect them together?  For this I have defined a function in the library called Compose that has this signature:

public static P<T1,Tok2> Compose<T1,Tok1,Tok2>(
    P<T1,Tok1> p1,
    P<IEnumerable<Tok1>,Tok2> p2,
    Func<Position, Tok1, IEnumerable<Tok1>, Position> computeNextTokenPosition)

It takes two parsers (the first of which consumes the same kind of tokens that the second one produces), as well as a function to compute token-positions (recall that for user-defined tokens, we need a way to translate those tokens back into line and column numbers).  As a result, we can chain together our parser and scanner like so:

P<int, char> PROGRAM = Compose(program, scanner, GToken.ComputeNextTokenPosition);

and then parse and evaluate whole programs (here’s a tiny one squeezed on one line):

PROGRAM.Parse("let x=3;let y=2*x;let xSq=pow(x,2);xSq-y")  // Success: 3

Great!  Check out the full source code for more examples, including a few that have errors and yield terrific diagnostics that explain the failures.

Summary

Today we added a few more features to the parser library that make it really useful.  By adding a couple of simple extension methods to our parser type, we enabled the elegant new LINQ syntax using from and select that saves us from uglier code with Then and Return.  By generalizing away from the fixed token type char, we made it possible to hook up our parsers to external scanners that yield a stream of tokens.  And we saw that we can still use the library to build a scanner, too, and chain both the scanner and the parser together into one parser that goes directly from characters to results.  By adding user-defined state to the parser, we enabled a straightforward way to do on-the-fly semantic checking during the parse.  In the end we were able to quickly author a parser for the “Goal” language, a language which, while very tiny, exhibits many of the major issues and features one encounters in real languages.

We built an entire scanner and parser for “Goal” in just about 100 lines of code.  Even when you add in another 100 lines or so of mostly-boilerplate code to define the token classes, that’s still a pretty succinct way to express a parser-evaluator for “Goal” that also yields great error diagnostics.  Of course, we built it using a monadic parser combinator library which itself is another 600 lines of code or so.  But that library of parsers and combinators is reusable – indeed at this point the library is a pretty decent little chunk of code to use when playing around with parsers.  I say “playing around”, as this isn’t at all a production-quality library; it’s just some code I enjoyed hacking together over the last week or two.  In addition to undoubtedly having some bugs (as well as not offering a particularly friendly debugging experience), I expect the perf is awful.  We’re lambda-capturing, IEnumerable<T>ing, and newing up immutable data structures all over the place, which is probably creating a ton of work for the garbage collector.  Even when just running the parser on a handful of “Goal” programs inside the debugger, there’s a noticeable delay of about half a second before the program finishes on my (slow, old) box.  At some point I should lay my hands on a profiler to get a feel for whether the performance is intrinsic to the idiomatic style of programming I’m using here, or whether I’ve simply perpetrated some gross buffoonery in a core combinator like Then that makes things slow or garbage-prone.  Regardless of the deficiencies (some of which can surely be corrected with more effort), I think the monadic combinator approach to structuring parsers is extremely expressive and succinct.

Looking ahead

What’s left to do?  On the one hand, we’ve reached a nice plateau at this point; the library is sufficiently fleshed out to be useful for expressing parsers for a number of languages.  On the other hand, there is still plenty that can be done, including

  • introducing more combinators in the library (Parsec, for example, has tons more useful stuff)
  • improving the perf (I would really enjoy profiling this to get a better feel for how lambda and enumerables behave at runtime; I would also like to find and fix the leaks)
  • adding laziness, so that it’s possible to get online behavior, where initial results can be produced while the rest of the input is still streaming in
  • adding laziness to the internals, so that for example, rather than eagerly computing error information as we go along, we wait until there is an error (if there is one) to compute the corresponding diagnostics (I am not sure if this is pragmatic, but it sounds kinda fun)
  • a ton of ideas I haven’t thought of yet, I’m sure

But after five major blogging (and coding!) installments, I think I’m probably ready for a short break.  :)

Full Source code

I’ll post the code in a separate blog entry soon.

Posted in Uncategorized | Leave a Comment »

Monadic parser combinators, part five

Posted by Brian on December 8, 2007

Last time we added error-handling to our parsers, so that when they fail, they yield good diagnostics.  Today, we’ll discuss three more useful features to add to the library.  First, we’ll add support for the new C# LINQ syntax, as I mentioned here.  This will make our code easier to read and write.  Second, we’ll make our parsers generic with respect to the input type, so that rather than parsing just strings of characters, we can parse IEnumerable<Token> for an arbitrary Token type.  This makes it possible to attach a separate "scanner" stage to the front of our parsers that tokenizes the input.  Finally, we’ll add user-defined state to the parser.  This will allow us to remember arbitrary information about what has already been parsed (such as what variables have already been declared in a program), enabling some semantic checking.  With all these cool features, we’ll be able to author parsers for interesting languages by writing very little code.

Today’s goal

Let’s be more concrete about where we’re headed.  By the end of today, we’ll be able to parse a little expression language that comprises a number of assignment statements followed by a final expression to evaluate.  Since that’s today’s goal, I’ll refer to this little language as "Goal".  Here’s an example "Goal" program that yields the final value 3:

let x = 3;
let y = 2 * x;
let xSquared = pow(x,2);
  let  zero =  ( 611    - pow(2 ,pow(3  , 2) ) + 1   ) /   10  -5   * 2   ;
xSquared-y+zero

"Goal" deals with the same kind of arithmetic expressions as before, but now we have introduced assignment statements that use variable identifiers and the keyword "let".  The language now also handles arbitrary whitespace interspersed between the various tokens in the language.  A scanner for this language would recognize four kinds of tokens: keywords like "let", identifiers like "x", symbols like "+", and numbers like "611".

The semantics of this language are straightforward.  We evaluate mathematical expressions in the usual way.  The right-hand side of an assignment statement is an expression that can only reference variables defined before that statement.  Duplicate variable definitions are disallowed.  The final "statement" of the language is just an expression representing the final result of evaluating the program.  Our parser will evaluate programs in a single pass, either yielding a successful value (like 3 for the previous program) or else failing gracefully with an error message that diagnoses problems in syntax or semantics.  Here’s an example of a program with semantic error:

let x = 3;
x+y

And here’s what our parser will report:

Failure: At line 2 column 3, variable "y" was not defined

Get the idea?  All right, let’s get to it!

Leveraging C#’s LINQ syntax sugars

As promised, the first thing we will do is leverage the new LINQ syntax sugar.  If you haven’t seen it, I’m talking about the stuff that lets you write code like this:

var strings = new[] { "one", "two", "three", "four" };
var firstCharsOfShortWords = from s in strings
                             where s.Length <= 4
                             select s[0];

These new syntax features were designed with the main intention of succinctly expressing queries on arbitrary data.  The language designers were clever and recognized that these query operations form a monad, and so they designed the sugars to work with any data type that supported functions with the right names and shapes.  As a result, we can add these extension functions to our parser type and then use the new syntax. 

(Aside: Bummer, I just noticed that the Windows Live Writer plugin I am using for code-snippets doesn’t know the new C# features, and so var, from, and select are not getting automagically highlighted in blue like they do inside Visual Studio.  But we’ll live.)

In order to get to use from and select, we just need to add this code to our library:

// LINQ syntax enablers
public static P<U, Token> Select<T, U, Token>(this P<T, Token> p, Func<T, U> selector)
{
    return p.Then(x => Return<U, Token>(selector(x)));
}
public static P<V, Token> SelectMany<T, U, V, Token>(this P<T, Token> p, 
                                  Func<T, P<U, Token>> selector, Func<T, U, V> projector)
{
    return p.Then(r1 => selector(r1).Then(r2 => Return<V, Token>(projector(r1, r2))));
}

Explaining those definitions would just cause us to stray into more explanation of monads, and I want to avoid that, so just take them at face value.  (One thing worth noting, though, is that these functions are built atop the existing library, just using Then and Return calls.  As a result, if this code was missing from the library, you could have added it yourself in user code, without needing to know anything about the underlying parser representation.)  Once you add those functions, you can start using the syntax sugar.  Specifically, code that used to appear as

p1.Then(x =>
p2.Then(y =>
p3.Then(z =>
Return<T>(f(x,y,z)))))

can now be written more elegantly as

from x in p1
from y in p2
from z in p3
select f(x,y,z)

It’s easiest to grok the new syntax with an example, so here’s some code from the final goal of today – this is the part of the "Goal" parser that parses and evaluates exponentiation – that is, stuff like "pow(e1,e2)":

pow = from p  in Keyword("pow")
      from o  in Symbol("(")
      from e1 in expression
      from x  in Symbol(",")
      from e2 in expression
      from c  in Symbol(")")
      select (int)Math.Pow(e1, e2);

As we’ll see in a little bit when we get to the code for the example parser, Keyword() and Symbol() return parsers that parse the corresponding tokens in the input, and expression parses and evaluates an expression.  So this bit of code describes a parser that parses a string like "pow(3,4)", binding each portion to a local variable name (so, for instance, e1 gets bound to the result of parsing the "3") and returning the result expressed on the last line.  That gives you an idea of how to "read" parser code that uses the from and select syntax.

Now that we have LINQ, let’s move on to generalizing our parsers for different kinds of input.

Scanners and parsing arbitrary token types

The parsers we looked at in my previous blog entries were all character-based.  That is, parsers operated on strings, and the "atomic" parsers Item and Sat were parsers whose results had type char.  This is a nice way to make a simple introduction to a parsing library using familiar data types, but in the real world, you’ll rarely see a character-based parser.  Instead, the input typically gets streamed through two stages.  The first stage is a scanner (sometimes called a lexer), which groups characters together into the logical tokens of the language.  The second stage is the actual parser, which then parses the stream of tokens that exits the scanner.

This two-stage process of scanning and parsing is motivated by a few different considerations.  First, most languages you will parse allow for all the significant tokens to be separated by optional, insignificant characters.  Whitespace and comments are the two main examples here.  The scanner can strip these out in the first stage so that the second stage doesn’t need to worry about them.  Second, the logic of scanners is usually simple, and scanners lend themselves very nicely to being represented as highly performant state machines, which can be generated by tools like lex.  Finally, the scanner stage massages the input into a form that’s at a better thought/abstraction layer for parsing.  When you think about parsing text like

if x = 34 or s = "foo"

you are more likely to grok it as a series of 8 tokens (keywords, identifiers, operators, literals) rather than as a series of 22 characters.  Indeed, when you glance at the example text above, your consciousness probably doesn’t even see the individual quotation marks; rather, you just see the string foo.  Similarly, you don’t see the digit three followed by the digit four, you just see thirty-four.

To sum up, then, a separate scanner stage has a few benefits.  With a scanner, we can localize all the processing of insignificant characters (rather than having whitespace/comment processing throughout our parser), we can use tools to make this stage very efficient, and we end up parsing a stream of tokens that’s at a more appropriate layer of abstraction.

In order for our parser library to accommodate tokens, we’ll need to change the representation.  Recall that the parser representation type and top-level interface function had been defined like this:

// representation type for parsers
public delegate Consumed<T> P<T>(ParserState input);
// clients call p.Parse("input") to avoid implementation details like Consumed & ParserState
public static ParseResult<T> Parse<T>(this P<T> p, string toParse)
{
    return p(new ParserState(0, toParse)).ParseResult;
}

We’ll now change it to this:

public delegate Consumed<T, Token> P<T, Token>(ParserState<Token> input);
public static ParseResult<T, Token> Parse<T, Token>(this P<T, Token> p, 
                                                    IEnumerable<Token> toParse) { ... }

That is, the parser type P (and its associated helper representation types like Consumed and ParserState) now has an added generic parameter called Token which is the type of tokens we are parsing.  Whereas previously the code assumed the tokens were always chars, now we allow any token type.  Rather than Parse() taking a string, now it takes a list of tokens (an IEnumerable<Token>).  This is a breaking change to all our code, as we have fundamentally altered the structure of the public type P.  Rather than walk through all the changes, I’ll just point out the highlights.

Most of the code itself is relatively unchanged, but for the extra generic parameter.  For instance, in the definition of the Many combinator, the only changes are that <T> becomes <T, Token> in four places in the code.  The vast majority of changes are similar to that.
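For instance, Many’s signature (body elided here) merely picks up the extra parameter:

```csharp
// before: public static P<IEnumerable<T>> Many<T>(P<T> p)
public static P<IEnumerable<T>, Token> Many<T, Token>(P<T, Token> p) { ... }
```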

Most of the rest of the changes involve parsers that fundamentally deal with characters and don’t generalize to arbitrary tokens.  For example, the Letter and Nat parsers now explicitly call out the token type char in their type:

// Letter parses char tokens and returns a char
public static P<char, char> Letter = Sat((char c) => char.IsLetter(c)).Tag("letter");
// Nat parses char tokens and returns an int
public static P<int, char> Nat = ...

There are a few other minor changes to the library, but that is the gist of the representation change.

Once we have made these changes, another challenging issue surfaces.  Previously, errors were reported by position – the message "Failure: At position n, unexpected…" represented a failure at the nth character of the input.  However, with an arbitrary token type, it is not very useful to issue diagnostics to humans using language that points at the nth token.  The position still ought to be described in terms of the character input, and while we are at it, we should generalize the Position type to represent both line numbers and column numbers.  That generalization is very straightforward, and we can just have the definer of the Token type also provide a function to compute the Position of a given Token.
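As a sketch (the names and details here are assumptions, not the actual definitions), the richer Position type and the per-token position hook might look like:

```csharp
// a Position now carries line and column rather than a single index
public class Position
{
    public readonly int Line;
    public readonly int Column;
    public Position(int line, int column) { this.Line = line; this.Column = column; }
    public override string ToString()
    {
        return string.Format("line {0} column {1}", this.Line, this.Column);
    }
}

// supplied by the definer of the token type: given the previous token's
// position, the token just consumed, and the remaining tokens, compute the
// position of the next token (hypothetical delegate name; the library could
// equally take a Func<Position, Token, IEnumerable<Token>, Position>)
public delegate Position NextTokenPosition<Token>(
    Position previous, Token consumed, IEnumerable<Token> rest);
```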

To sum up, once we have made the changes described in this section, the parser library now works on an arbitrary token type (which means we can attach it to a separate scanner that returns Tokens), and the library now has improved error-reporting, in that the location of errors is described by both a line number and a column number.

Adding state

The final new feature we need to add to the library is state.  During the course of parsing, we sometimes need to remember some things about what has already been parsed in order to parse the rest of the input correctly.  A typical example of a context-sensitive language that requires memory while parsing is XML – after parsing the string "<foo>", we parse some arbitrary amount of other stuff in the body of foo and then we eventually need to parse "</foo>".  That is, we need to remember the name of the opening tag to ensure it matches the closing tag name.  Another example is our "Goal" language, where we need to detect both duplicate definitions of already-defined variables as well as uses of not-yet-defined variables.

Though we could keep track of such state in the usual C# way – by newing up a mutable object on the heap and side-effecting that state object while we parse – it wouldn’t be very much in the spirit of the library to do things that way.  Furthermore, it becomes messy to deal with mutable state when the parser may be backtracking as it tries to find a legal parse.  Instead, we can keep track of the state in a side-effect-free manner in the parser monad.  Indeed, we have already seen how to do this; the parser representation code contains a type called ParserState that keeps track of the state of the parser.  This is an immutable object; each time we move forward in the parse, we create a new ParserState object with updated values (e.g. for the current line and column numbers).  It is easy to add another field to this type to store some user-defined state:

public class ParserState<Token>
{
    // arbitrary state object for clients to use
    public readonly object UserState;   
    // other fields as before...
    public readonly Position CurrentPosition;
    ...

Then we just need to expose some new functions for clients of the library to read and write that state.  And while we are at it, there is one other bit of state already inside the parser that is useful to expose to clients – the current position.  So we define these new core functions:

// return the user state
public static P<object, Token> GetUserState<Token>()
// set the user state, return previous value
public static P<object, Token> SetUserState<Token>(object newState)
// get current Position
public static P<Position, Token> GetCurrentPosition<Token>()

I’ve elided the definitions, but they’re all one-liners.  These functions all return parsers that don’t actually parse anything – that is, they don’t affect the input tokens.  Instead these functions provide access to state values that are threaded through the parser during a parse.  We’ll see examples of their utility shortly.
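To see the flavor of what these enable, here is a hypothetical little combinator (not part of the library) built solely from the public primitives and the LINQ sugar – it pairs a parser’s result with the position at which that result began:

```csharp
// hypothetical helper: annotate a result with its starting position
public static P<KeyValuePair<Position, T>, Token> WithPosition<T, Token>(this P<T, Token> p)
{
    return from pos    in GetCurrentPosition<Token>()   // consumes no input
           from result in p
           select new KeyValuePair<Position, T>(pos, result);
}
```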

Ok, now that we have the ability to distinguish scanners and parsers, a way to keep track of state during the parse, and new LINQ syntax, we are ready to write code to parse the "Goal" language.

Scanning the "Goal" language

Though we could write a scanner using a separate tool, we might as well illustrate more uses of our parser library.  So we’ll use the parser library to create a scanner.  A scanner is, after all, just a parser whose input token type is char and whose result is IEnumerable<X> for some new token type X.

Since the language is called "Goal", we’ll define a type called GToken (‘G’ for "Goal") that defines the tokens of the language.  As mentioned when introducing the language at the start of this blog entry, there are four types of tokens: KeywordToken, IdentifierToken, NaturalToken, and SymbolToken.  These tokens represent keywords, identifiers, natural numbers, and the various operators and symbols of the language.  Each token has a Value field that stores the associated value; for example IdentifierToken has a string Value field that holds the name of a variable, and NaturalToken has an int Value field that holds the value of a natural number.  A GToken is one of these tokens, coupled with a Position describing the location in the input (e.g. line 3, column 4).  I won’t show the code definitions of these types because they are so straightforward (as always, I’ll post the full code at the end).
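Since those definitions are elided, here is an illustrative sketch of the shape of these types (constructor and field details beyond Value are assumptions; the real code is in the full source posting):

```csharp
// illustrative sketch only - the actual definitions appear in the full source
public abstract class Token { }
public class KeywordToken : Token
{
    public readonly string Value;
    public KeywordToken(string value) { this.Value = value; }
}
public class IdentifierToken : Token
{
    public readonly string Value;   // the variable name
    public IdentifierToken(string value) { this.Value = value; }
}
public class NaturalToken : Token
{
    public readonly int Value;      // the value of the natural number
    public NaturalToken(int value) { this.Value = value; }
}
public class SymbolToken : Token
{
    public readonly string Value;
    public SymbolToken(string value) { this.Value = value; }
}
public class GToken
{
    public readonly Position Position;   // line & column where the token started
    public readonly Token Token;
    public GToken(Position position, Token token) { this.Position = position; this.Token = token; }
}
```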

Before we start, let’s define one more handy character parser:

// OptionalWhiteSpace consumes zero or more whitespace characters
public static P<IEnumerable<char>, char> OptionalWhiteSpace = Many(Sat<char>(char.IsWhiteSpace));

as well as a handy combinator:

// p1.FollowedBy(p2) matches p1 then p2, and returns p1's result (discards p2's result)
public static P<T, Token> FollowedBy<T, U, Token>(this P<T, Token> p1, P<U, Token> p2)
{
    return from result in p1
           from discard in p2
           select result;
}

Those go in the general library.

Let’s define the scanner for "Goal", starting with the ability to tokenize the symbol characters:

var symbols = new[] { "+", "-", "*", "/", "(", ")", "=", ";", "," }
              .Select(tok => from pos in GetCurrentPosition<char>()
                             from sym in Literal(tok).FollowedBy(OptionalWhiteSpace)
                             select new GToken(pos, new SymbolToken(sym)))
              .Aggregate(Or)
              .Tag("symbol");

Wow!  Ok, let’s break that code apart.  We start with an array of strings – all the symbols we want to scan.  Recall that Select() applies a function to every element in a list, returning a list of results.  So for each of those symbols, we make a parser that

  • gets the current input position (line & column number), calling it pos
  • parses the literal string for this symbol (followed by optional whitespace), calling it sym
  • returns a new GToken with location pos and token value new SymbolToken(sym)

So that yields a list of parsers, one for each symbol.  We then Or them all together using Aggregate, yielding a single parser that can tokenize any symbol.  Finally we Tag that resultant parser with a descriptive name that can be used when reporting errors about expected tokens.  Neat!

Next let’s deal with keywords.  One thing to be careful of with keywords is that keywords can be prefixes of identifiers.  That is, whereas "let" is a keyword, "lettuce" is a legal identifier, so we need to ensure that the character after an apparent keyword isn’t an identifier character.  Here’s the code:

var identifierChar = Letter.Or(Digit).Or(Literal('_')).Tag("identifier character");
var keywords = new[] { "pow", "let" }
               .Select(tok => 
                   Try(from pos in GetCurrentPosition<char>()
                       from key in Literal(tok)
                                   .NotFollowedBy(identifierChar,"identifier character")
                                   .FollowedBy(OptionalWhiteSpace)
                       select new GToken(pos, new KeywordToken(key))
                       ).Tag("keyword \"" + tok + "\"")
                   )
               .Aggregate(Or);

Once again, we use Select() to iterate over each keyword, and then eventually Aggregate() the results with Or.  To build a parser for each keyword token, we make a parser that will

  • get the current input position
  • parse the literal string for this keyword (ensuring it’s not just the prefix of some longer identifier)
  • return a GToken containing the right position and KeywordToken

and we

  • wrap the parser in a Try(), so that for example we can recognize the identifier "potato" even after it starts matching the keyword "pow" but then fails after consuming the first two characters
  • Tag each individual keyword parser with a useful description (so as to enable error messages like "expected keyword "let"")

Great, we’re done with half our token types (two of the four) already.  Let’s do identifiers next:

var identifier =
        (from pos   in GetCurrentPosition<char>()
         from first in Letter
         from rest  in Many(identifierChar).FollowedBy(OptionalWhiteSpace)
         select new GToken(pos, new IdentifierToken(new string(Cons(first, rest).ToArray())))
         ).Tag("identifier");

This is getting to be old hat.  Identifiers must start with a letter, followed by any number of other identifier characters (letter/digit/underscore).  We turn the list of characters into a string and store it in the appropriate kind of token, along with the starting position.

Finally, numbers:

var number = 
    (from pos   in GetCurrentPosition<char>()
     from chars in Many1(Digit.Tag("")).FollowedBy(OptionalWhiteSpace)
     select new GToken(pos, new NaturalToken(Nat.Parse(new string(chars.ToArray())).Result))
     ).Tag("natural number");

The same general pattern applies.  Note that we use our trusty parser Nat to turn the string of digits into the corresponding integer.

Finally, the complete scanner is just

var anyLegalToken = symbols.Or(number).Or(keywords).Or(identifier).Tag("any legal token");
var scanner = from ws     in OptionalWhiteSpace
              from tokens in Many(anyLegalToken)
              from eof    in EndOfInput<char>()
              select tokens;

Each individual token consumes trailing whitespace, so we need to suck up the initial optional whitespace, followed by a bunch of tokens, followed by the end of the input stream.  If you hover over the "var" in Visual Studio, you’ll see a tool-tip pop up that tells you that scanner has type

P<IEnumerable<GToken>, char>

which is exactly what we want.  We’re done with the scanner!
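To get a feel for the result (the rendering of tokens below is illustrative, not actual ToString output), we can run the scanner directly – a string works as input because string implements IEnumerable<char>:

```csharp
var result = scanner.Parse("let x = 3;");
// on success, result.Result is the stream of GTokens (positions elided):
//   KeywordToken("let"), IdentifierToken("x"), SymbolToken("="),
//   NaturalToken(3), SymbolToken(";")
```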

More

There’s more to write, but the server doesn’t like long blog posts, so I am forced to break today’s entry in two.  Check out the rest in "part six".


Monadic parser combinators, a noteworthy aside

Posted by Brian on December 7, 2007

So I’ve been blissfully coding and blogging about monadic parser combinators for the past few days, enjoying using the new C# features to tackle one of my old-favorite tasks.  And then today I was browsing around and discovered that I’ve been beaten to the punch.  :)  LukeH has already blogged about monadic parser combinators in C#, and furthermore, he’s doing a better job than I am, in that he actually knows the new LINQ syntax sugars.  From scanning his post, I can see that the new LINQ sugars that allow you to write

from x in p1
from y in p2
from z in p3
select f(x,y,z)

can (given a suitable parser representation type) be used in place of the heavier-syntax lambdas I’ve been using like so

p1.Then(x =>
p2.Then(y =>
p3.Then(z =>
Return(f(x,y,z)))))

And where clauses can automagically do stuff like Sat…  Wow, that is so much more syntactically elegant than what I have been doing.  I feel foolish now for going as far as I have without leveraging this syntax.  Well, now I have a good excuse to go learn the new LINQ syntax sugars!

I don’t think this will invalidate any of the structure of what I’ve done so far in my previous blog entries on parser combinators, but I have a lot of code to go reformat and lightly rework now.  :)

Anyway, this short post is just to inform everyone of my previous ignorance.  :)  I’m going to go learn the rest of the LINQ stuff now, map it into my general understanding of functional programming and monads, and then return to blogging again about parsers, only this time with more readable code, I expect.  Wish me luck!


Monadic parser combinators, part four – source code

Posted by Brian on December 6, 2007

As I said at the end of the previous entry, the server wouldn’t accept a post that large, so I had to post the source code separately from the text.  So here is the code for part four.

using System;
using System.Linq;
using System.Collections.Generic;
using System.Text;
using Position = System.Int32;

namespace MPC_Part4
{
    public class ParserState
    {
        public readonly Position Position;
        public readonly string Input;
        public ParserState(Position position, string input) { this.Position = position; this.Input = input; }
        public override string ToString()
        {
            // place a '^' before the current location in the string; this is helpful when looking at
            // values inside the debugger
            return this.Input.Substring(0, this.Position) + "^" + this.Input.Substring(this.Position);
        }
    }

    public class ErrorInfo
    {
        public readonly Position Position;
        public readonly IEnumerable<string> Expectations;
        public readonly string Message;
        public ErrorInfo(Position position) : this(position, Enumerable.Empty<string>(), "unknown error") { }
        public ErrorInfo(Position position, IEnumerable<string> expectations, string message) { this.Position = position; this.Expectations = expectations; this.Message = message; }
        public ErrorInfo Merge(ErrorInfo other)
        {
            // Note that this.Position == other.Position holds true, except when one is a failed Try that looked ahead.
            // Since the most common idiom is Try(p1).Or(p2), we favor keeping position2 and message2, as p2 is 
            // likely to consume and thus be the only contributor to the final expectations.
            // But if Try(p1).Or(Try(p2)) fails, and both alternatives looked ahead different number of tokens, then
            // a single error message is bound to be confusing, reporting some wrong position or details.
            // This is an intrinsic difficulty when it comes to reporting errors for languages that do not admit 
            // predictive parsers, and it illustrates why Try() and backtracking should be used judiciously.
            return new ErrorInfo(other.Position, this.Expectations.Concat(other.Expectations), other.Message);
        }
        public ErrorInfo SetExpectation(string label)
        {
            if (string.IsNullOrEmpty(label))
                return new ErrorInfo(this.Position, Enumerable.Empty<string>(), this.Message);
            else
                return new ErrorInfo(this.Position, Parser.Cons(label, Enumerable.Empty<string>()), this.Message);
        }
        public override string ToString()
        {
            List<string> expectations = this.Expectations.ToList();
            StringBuilder result = new StringBuilder(string.Format("At position {0}, {1}", this.Position, this.Message));
            if (expectations.Count != 0)
            {
                result.Append(", ");
                if (expectations.Count == 1)
                {
                    result.Append("expected " + expectations[0]);
                }
                else
                {
                    result.Append("expected ");
                    for (int i = 0; i < expectations.Count - 2; ++i)
                    {
                        result.Append(expectations[i]);
                        result.Append(", ");
                    }
                    result.Append(expectations[expectations.Count - 2]);
                    result.Append(" or ");
                    result.Append(expectations[expectations.Count - 1]);
                }
            }
            return result.ToString();
        }
    }

    public class ParseResult<T>
    {
        public readonly bool Succeeded;
        public readonly ErrorInfo ErrorInfo;
        readonly T result;
        readonly ParserState remainingInput;
        public T Result
        {
            get { if (!this.Succeeded) throw new InvalidOperationException(); return this.result; }
        }
        public ParserState RemainingInput
        {
            get { if (!this.Succeeded) throw new InvalidOperationException(); return this.remainingInput; }
        }
        public ParseResult(T result, ParserState remainingInput, ErrorInfo errorInfo) 
        { 
            this.result = result;
            this.remainingInput = remainingInput;
            this.ErrorInfo = errorInfo;
            this.Succeeded = true;
        }
        public ParseResult(ErrorInfo errorInfo) { this.ErrorInfo = errorInfo; }
        public ParseResult<T> MergeError(ErrorInfo otherError)
        {
            if (this.Succeeded)
                return new ParseResult<T>(this.Result, this.RemainingInput, this.ErrorInfo.Merge(otherError));
            else
                return new ParseResult<T>(this.ErrorInfo.Merge(otherError));
        }
        public ParseResult<T> SetExpectation(string label)
        {
            if (this.Succeeded)
                return new ParseResult<T>(this.Result, this.RemainingInput, this.ErrorInfo.SetExpectation(label));
            else
                return new ParseResult<T>(this.ErrorInfo.SetExpectation(label));
        }
        public override bool Equals(object obj)
        {
            ParseResult<T> other = obj as ParseResult<T>;
            if (other == null)
                return false;
            if (other.Succeeded != this.Succeeded)
                return false;
            // only compare Result/RemainingInput on success; their getters throw on failure
            if (this.Succeeded)
            {
                if (!object.Equals(other.Result, this.Result))
                    return false;
                if (!object.Equals(other.RemainingInput, this.RemainingInput))
                    return false;
            }
            return object.Equals(other.ErrorInfo, this.ErrorInfo);
        }
        public override int GetHashCode()
        {
            // a weak but valid hash; Equals does the detailed comparison
            return this.Succeeded.GetHashCode();
        }
        public override string ToString()
        {
            if (this.Succeeded)
                return string.Format("Success: {0}", this.Result.ToString());
            else
                return string.Format("Failure: {0}", this.ErrorInfo.ToString());
        }
    }

    public class Consumed<T>
    {
        public readonly bool HasConsumedInput;
        public readonly ParseResult<T> ParseResult;
        public Consumed(bool hasConsumedInput, ParseResult<T> result) { this.HasConsumedInput = hasConsumedInput; this.ParseResult = result; }
        public Consumed<T> Tag(string label)
        {
            if (this.HasConsumedInput)
                return this;
            return new Consumed<T>(this.HasConsumedInput, this.ParseResult.SetExpectation(label));
        }
        public override string ToString()
        {
            return this.ParseResult.ToString();
        }
    }

    // representation type for parsers
    public delegate Consumed<T> P<T>(ParserState input);

    public static class Parser
    {
        // core functions that know the underlying parser representation
        // clients call p.Parse("someString") so as not to deal with implementation details like Consumed & ParserState
        public static ParseResult<T> Parse<T>(this P<T> p, string toParse)
        {
            return p(new ParserState(0, toParse)).ParseResult;
        }
        // Fail() is a parser that always fails with "message", without consuming any input
        public static P<T> Fail<T>(string message)
        {
            return input => new Consumed<T>(false, new ParseResult<T>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), message)));
        }
        // Return(x) is a parser that always succeeds with result x, without consuming any input
        public static P<T> Return<T>(T x)
        {
            return input => new Consumed<T>(false, new ParseResult<T>(x, input, new ErrorInfo(input.Position)));
        }
        // overload to help type inference when returning binary-operator lambdas
        public static P<Func<T, T, T>> Return<T>(Func<T, T, T> f)
        {
            return Return<Func<T, T, T>>(f);
        }
        // p1.Then_(p2) runs p1, discards the result, then runs p2 on the remaining input
        public static P<U> Then_<T, U>(this P<T> p1, P<U> p2)
        {
            return p1.Then(dummy => p2);
        }
        // p1.Then(v => p2) runs p1, binds the result to variable v, then runs p2 on the remaining input
        public static P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
        {
            return input =>
            {
                Consumed<T> consumed1 = p1(input);
                if (consumed1.ParseResult.Succeeded)
                {
                    Consumed<U> consumed2 = f(consumed1.ParseResult.Result)(consumed1.ParseResult.RemainingInput);
                    return new Consumed<U>(consumed1.HasConsumedInput || consumed2.HasConsumedInput,
                                           consumed2.HasConsumedInput
                                               ? consumed2.ParseResult 
                                               : consumed2.ParseResult.MergeError(consumed1.ParseResult.ErrorInfo));
                }
                else
                {
                    return new Consumed<U>(consumed1.HasConsumedInput, new ParseResult<U>(consumed1.ParseResult.ErrorInfo));
                }
            };
        }
        // p1.Or(p2) tries p1, but if it fails without consuming input, runs p2 instead
        public static P<T> Or<T>(this P<T> p1, P<T> p2)
        {
            return input =>
            {
                Consumed<T> consumed1 = p1(input);
                if (consumed1.ParseResult.Succeeded || consumed1.HasConsumedInput)
                {
                    return consumed1;
                }
                else
                {
                    Consumed<T> consumed2 = p2(input);
                    if (consumed2.HasConsumedInput)
                        return consumed2;
                    return new Consumed<T>(consumed2.HasConsumedInput,
                        consumed2.ParseResult.MergeError(consumed1.ParseResult.ErrorInfo));
                }
            };
        }
        // Try(p) behaves like p, except if p fails, it pretends it hasn't consumed any input
        public static P<T> Try<T>(this P<T> p)
        {
            return input =>
            {
                Consumed<T> consumed = p(input);
                if (consumed.HasConsumedInput && !consumed.ParseResult.Succeeded)
                    return new Consumed<T>(false, consumed.ParseResult);
                else
                    return consumed;
            };
        }
        // p.Tag(label) makes it so if p fails without consuming input, the error "expected label" occurs
        public static P<T> Tag<T>(this P<T> p, string label)
        {
            return input => p(input).Tag(label);
        }
        // Sat(pred) succeeds parsing a character only if the character matches the predicate
        public static P<char> Sat(Predicate<char> pred)
        {
            return input =>
            {
                if (input.Position >= input.Input.Length)
                    return new Consumed<char>(false, new ParseResult<char>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), "unexpected end of input")));
                else if (!pred(input.Input[input.Position]))
                    return new Consumed<char>(false, new ParseResult<char>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), "unexpected character '" + input.Input[input.Position] + "'")));
                else
                    return new Consumed<char>(true, new ParseResult<char>(input.Input[input.Position], new ParserState(input.Position+1, input.Input), new ErrorInfo(input.Position+1)));
            };
        }
        // other handy parser functions
        // Item() consumes the first character of the input string and returns it as a result
        public static P<char> Item()
        {
            return Sat(c => true).Tag("any character");
        }
        // Letter parses a letter
        public static P<char> Letter = Sat(char.IsLetter).Tag("letter");
        // Digit parses a digit character
        public static P<char> Digit = Sat(char.IsDigit).Tag("digit");
        // DigitVal parses a digit character and returns the corresponding numeric value
        public static P<int> DigitVal = Sat(char.IsDigit).Then(c => Return((int)(c - '0')));
        // Literal(c) parses a particular character
        public static P<char> Literal(char c) { return Sat(x => x == c).Tag("character '"+c+"'"); }
        // Literal(s, x) parses a particular string, and returns a fixed result
        public static P<T> Literal<T>(string toParse, T result)
        {
            return LiteralHelper(toParse, result).Tag("literal string \"" + toParse + "\"");
        }
        static P<T> LiteralHelper<T>(string toParse, T result)
        {
            if (toParse == "")
                return Parser.Return(result);
            else
                return Literal(toParse[0]).Then_(LiteralHelper(toParse.Substring(1), result));
        }
        // Cons(x, rest) returns a new sequence like rest but with x prepended
        public static IEnumerable<T> Cons<T>(T x, IEnumerable<T> rest)
        {
            yield return x;
            foreach (T t in rest)
                yield return t;
        }
        // Many(p) parses with p zero or more times, returning the sequence of results
        public static P<IEnumerable<T>> Many<T>(P<T> p)
        {
            return Many1<T>(p).Or(Return(Enumerable.Empty<T>()));
        }
        // Many1(p) parses with p one or more times, returning the sequence of results
        public static P<IEnumerable<T>> Many1<T>(P<T> p)
        {
            return p.Then(x =>
                   Many(p).Then(xs =>
                   Return(Cons(x, xs))));
        }
        // one way to parse natural numbers
        public static P<int> Natural = DigitVal.Then(x => NaturalHelper(x)).Tag("natural number");
        static P<int> NaturalHelper(int x)
        {
            return DigitVal.Then(y =>
                   NaturalHelper(10 * x + y))
            .Or(Return(x));
        }
        // p.Chainl1(op) parses a grammar like "p (op p)*"
        // the result is the p results combined by the op results, left-associatively
        public static P<T> Chainl1<T>(this P<T> p, P<Func<T, T, T>> op)
        {
            return p.Then(x =>
                Chainl1Helper(x, p, op));
        }
        static P<T> Chainl1Helper<T>(T x, P<T> p, P<Func<T, T, T>> op)
        {
            return op.Then(f =>
                   p.Then(y =>
                   Chainl1Helper<T>(f(x, y), p, op)))
            .Or(Return(x));
        }
        // another way to parse naturals
        public static P<int> Nat = DigitVal.Chainl1(Return((int x, int y) => 10 * x + y)).Tag("natural number");
        // p.Chainr1(op) is just like p.Chainl1(op), except it's right-associative
        public static P<T> Chainr1<T>(this P<T> p, P<Func<T, T, T>> op)
        {
            return p.Then(x =>
                   (
                   op.Then(f =>
                   Chainr1(p, op).Then(y =>
                   Return(f(x, y))))
                   )
                   .Or(Return(x)));
        }
        // p.SepBy(sep) parses zero or more p, each separated by a sep, returning a list of the p results
        public static P<IEnumerable<T>> SepBy<T, U>(this P<T> p, P<U> sep)
        {
            return SepBy1(p, sep).Or(Return(Enumerable.Empty<T>()));
        }
        // p.SepBy1(sep) parses one or more p, each separated by a sep, returning a list of the p results
        public static P<IEnumerable<T>> SepBy1<T, U>(this P<T> p, P<U> sep)
        {
            return p.Then(x =>
                SepByHelper(p, sep).Then(xs =>
                Return(Cons(x, xs))));
        }
        static P<IEnumerable<T>> SepByHelper<T, U>(P<T> p, P<U> sep)
        {
            return sep.Then_(SepBy1(p, sep))
                .Or(Return(Enumerable.Empty<T>()));
        }
        // p1.NotFollowedBy(p2,label) runs p1 and then looks ahead for p2, failing if p2 is there
        public static P<T> NotFollowedBy<T, U>(this P<T> p1, P<U> p2, string p2Label)
        {
            return p1.Then(result =>
                Try(p2.Then_(Fail<T>("unexpected " + p2Label))
                    .Or(Return(result))));
        }
        public static P<int> EndOfInput = Return(0).NotFollowedBy(Item(), "character").Tag("end of input");

        public static P<T> FullWordLiteral<T>(string s, T result)
        {
            return Try(Literal(s, result).NotFollowedBy(Letter, "letter"));
        }

        public static P<int> OneToNine =
            FullWordLiteral("one", 1).Or(
            FullWordLiteral("two", 2).Or(
            FullWordLiteral("three", 3).Or(
            FullWordLiteral("four", 4).Or(
            FullWordLiteral("five", 5).Or(
            FullWordLiteral("six", 6).Or(
            FullWordLiteral("seven", 7).Or(
            FullWordLiteral("eight", 8).Or(
            FullWordLiteral("nine", 9))))))))).Tag("english number 1-9");

        public static P<int> TenToNineteen =
            FullWordLiteral("ten", 10).Or(
            FullWordLiteral("eleven", 11).Or(
            FullWordLiteral("twelve", 12).Or(
            FullWordLiteral("thirteen", 13).Or(
            FullWordLiteral("fourteen", 14).Or(
            FullWordLiteral("fifteen", 15).Or(
            FullWordLiteral("sixteen", 16).Or(
            FullWordLiteral("seventeen", 17).Or(
            FullWordLiteral("eighteen", 18).Or(
            FullWordLiteral("nineteen", 19)))))))))).Tag("english number 10-19");

        public static P<int> HigherTens(string tenName, int tenValue)
        {
            return FullWordLiteral(tenName, tenValue).Then(tens =>
                   (
                       Sat(c => c == '-').Then_(
                       OneToNine.Then(ones =>
                       Parser.Return(tens + ones)))
                   ).Or(Parser.Return(tens)));
        }

        public static P<int> OneToNinetyNine =
            OneToNine
            .Or(TenToNineteen)
            .Or(HigherTens("twenty", 20))
            .Or(HigherTens("thirty", 30))
            .Or(HigherTens("forty", 40))
            .Or(HigherTens("fifty", 50))
            .Or(HigherTens("sixty", 60))
            .Or(HigherTens("seventy", 70))
            .Or(HigherTens("eighty", 80))
            .Or(HigherTens("ninety", 90)).Tag("english number 1-99");

        // various running examples from the blog
        static void AssertEqual<T>(T x, T y)
        {
            if (!object.Equals(x, y))
                throw new Exception("not equal");
        }
        static void AssertEqualEnumerable<T>(IEnumerable<T> x, IEnumerable<T> y)
        {
            List<T> xl = x.ToList();
            List<T> yl = y.ToList();
            AssertEqual(xl.Count, yl.Count);
            for (int i = 0; i < xl.Count; ++i)
                AssertEqual(xl[i], yl[i]);
        }

        static void Main(string[] args)
        {
            Console.WriteLine(Sat(c => c == 'x').Parse(""));
            Console.WriteLine(Sat(c => c == 'x').Parse("y"));
            Console.WriteLine(Sat(c => c == 'x').Parse("x"));
            Console.WriteLine();

            Console.WriteLine(Letter.Or(Digit).Parse("?"));
            P<string> word = Many1(Letter).Then(chars => Return(new string(chars.ToArray())));
            AssertEqual(word.Parse("hello world").Result, "hello");

            AssertEqual(Natural.Parse("256K").Result, 256);
            AssertEqual(Nat.Parse("256K").Result, 256);

            P<Func<int, int, int>> addOp = Literal('+').Then_(Return((int x, int y) => x + y))
                                     .Or(Literal('-').Then_(Return((int x, int y) => x - y)))
                                     .Tag("add/subtract op");
            P<int> simpleExpr = Natural.Chainl1(addOp);
            AssertEqual(simpleExpr.Parse("10+2+3-14").Result, 1);

            P<Func<int, int, int>> mulOp = Literal('*').Then_(Return((int x, int y) => x * y))
                                     .Or(Literal('/').Then_(Return((int x, int y) => x / y))).Tag("multiply/divide op");
            P<Func<int, int, int>> expOp = Literal('^').Then_(Return((int x, int y) => (int)Math.Pow(x, y))).Tag("exponentiation op");
            P<int> expr = null, term = null, factor = null, part = null, paren = null;
            paren = Literal('(').Then(dummy => // use dummy lambda to capture reference to mutable "expr"
                    expr.Then(e =>
                    Literal(')').Then_(
                    Return<int>(e))));
            part = Nat.Or(paren);
            factor = part.Chainr1(expOp);
            term = factor.Chainl1(mulOp);
            expr = term.Chainl1(addOp);
            AssertEqual(expr.Parse("(611-2^3^2+1)/10-5*2").Result, 0);
            Console.WriteLine(expr.Parse("(611-2^-3^2+1)/10-5*2"));
            Console.WriteLine(expr.Parse("(611-2^3^2.2+1)/10-5*2"));
            Console.WriteLine();

            P<IEnumerable<int>> listOfNums = Literal('[').Then_(
                                             Nat.SepBy(Literal(',')).Then(nats =>
                                             Literal(']').Then_(
                                             Return(nats))));
            AssertEqualEnumerable(listOfNums.Parse("[1,10,100]").Result, new int[] { 1, 10, 100 });
            AssertEqualEnumerable(listOfNums.Parse("[1]").Result, new int[] { 1 });
            AssertEqualEnumerable(listOfNums.Parse("[]").Result, new int[] { });
            Console.WriteLine(listOfNums.Parse("[1,2x"));
            Console.WriteLine();

            Console.WriteLine(Literal('1').NotFollowedBy(Literal('2'), "character '2'").Parse("12"));
            Console.WriteLine(Literal('1').Then_(EndOfInput).Parse("1"));
            Console.WriteLine(Literal('1').Then_(EndOfInput).Parse("12"));
            Console.WriteLine();

            Console.WriteLine(OneToNinetyNine.Parse("seven"));
            Console.WriteLine(OneToNinetyNine.Parse("twenty-seven"));
            Console.WriteLine(OneToNinetyNine.Parse("seventy-seven"));
            Console.WriteLine(OneToNinetyNine.Parse("so"));
            Console.WriteLine();

            Console.WriteLine("done, press a key");
            Console.ReadKey();
        }
    }
}

Posted in Uncategorized

Monadic parser combinators, part four

Posted by Brian on December 6, 2007

Last time we developed some useful combinators that enabled us to easily author some useful little parsers, including an expression evaluator.  Today we’ll focus on error messages, so that if a parse fails, we get back something more useful than just "null".  Along the way we’ll look into some of the guts of the implementation, and make some alterations to our parser "basis" – the small set of core functions that knows the parser implementation.

When parsers fail

Last time we created an expression parser that could parse expressions with natural numbers, the four main math operators as well as right-associative exponentiation, and parentheses.  Thus:

expr("(611-2^3^2+1)/10-5*2")

would succeed in parsing the string and evaluate it to an answer (in this case, zero).  But what happens when we have a bad input string?  "Bad" is a relative term; someone unfamiliar with our little expression language/grammar might reasonably guess that negative numbers or floating-point values were legal, and try strings like these:

expr("(611-2^-3^2+1)/10-5*2"))
expr("(611-2^3^2.2+1)/10-5*2"))

The result?  Our expr parser will fail, yielding the oh-so-helpful result "null".  Clearly we need to improve things so we can fail more gracefully.

There is a huge design space for error diagnostics/recovery in parsers.  For today, we’ll stick with modest ambitions.  When a parse fails, we’d like to be able to point to the location of the failure in the input string and report what we expected to see there instead.  By the end of this blog entry, we’ll be able to get error messages like these

Failure: At position 7, unexpected character '-', 
expected character '(' or natural number
Failure: At position 10, unexpected character '.', 
expected character ')', add/subtract op, multiply/divide op or exponentiation op

from our little expression parser, with only tiny changes to the client code.  Nothing comes free, however, and enabling this kind of useful error diagnostics will require substantially rewriting the parser representation and the basis (the five core functions).  Doing so will provide us with a nice opportunity to investigate more of what’s going on under the hood of the core parser library.

(I’ll emphasize that for the most part I am just mimicking Haskell’s Parsec library… for a good discussion of the error-handling design, check out this Parsec paper.)

Predictive parsers

A predictive parser is a parser that uses a specific, finite amount of lookahead to choose among alternatives.  (This is in contrast to a fully backtracking parser, which after continuing an arbitrarily long way down a given alternative, will back up – through the entire input if necessary – in order to find a legal alternative parse.)  For example, consider parsing a statement of C# code.  There are a lot of alternatives for statements, among them: if statements, while statements, and switch statements.  We only need a single keyword/token’s worth of lookahead in order to decide which of those alternatives can succeed.  That is, if we had defined a parser along the lines of

stmt = ifStmt.Or(whileStmt).Or(switchStmt)

and we encounter input that begins "if (x ==…", then once we have recognized the "if ", we really don’t need to consider the while or switch alternatives.  However the Or combinator we have seen so far has no way to know this.  So if the ifStmt parse fails for some reason, the stmt parser will backtrack and dutifully try the whileStmt and switchStmt alternatives.  That is, our current parser library implementation is backtracking. 

(Note: In the previous paragraph I used the word "token" for the first time.  A token is the atomic unit of input – the input to a parser is just a list of tokens.  For the parsers we’ve been dealing with so far, the input is a string, and the tokens are the individual characters.  In many real-world parsers, there is a separate pre-processing step that groups individual characters into tokens that represent the atomic entities of the language – things like keywords, identifiers, string literals, etc.  In a future installment we’ll see how to generalize our parser library so, rather than assuming a string of characters as input, the parser will accept a stream of an arbitrary user-specified token type.)

Backtracking makes a parser very flexible, but it also causes problems. Backtracking indiscriminately is clearly undesirable for performance reasons, and it also makes it tougher to generate good error diagnostics.  If you manage to parse your way down the input string 1000 characters or so before encountering an error, then not only is backtracking all the way to the beginning of the input unlikely to yield a better parse, but it may cause you to lose useful local context surrounding the problem, making it more difficult to yield useful error diagnostics.  (For example, if you encounter an error while parsing an "if" statement, you don’t want to back out of it and end up issuing an error message involving "while" or "switch" statements.)  When there is an error in the input, we’d like to be able to point at the location of the problem with specific, local information about what failed.

On the other hand, it is unreasonable to want a parser to be completely predictive, as most grammars (including those for most programming languages) have certain features/aspects that require an arbitrary amount of lookahead.  So there is a tension between predictiveness (which enables good performance and diagnostics) and the ability to backtrack (without which, many languages could not be recognized by a parser).

The design of the Parsec library illustrates one good way to mediate this tension.  Parsec’s parser library defaults to being predictive based on just one token of lookahead, but allows the user to selectively enable arbitrary amounts of lookahead for backtracking when necessary.  Today we’ll re-implement the core parser library functions with this design.

Looking at the Or combinator, and introducing the Try combinator

We haven’t looked inside the Or combinator yet, but now is a good time, so as to understand the issue at hand.  Here’s Or:

// p1.Or(p2) tries p1, but if it fails, runs p2 on the original input instead
public static P<T> Or<T>(this P<T> p1, P<T> p2)
{
    return input =>
    {
        ParseResult<T> result1 = p1(input);
        if (result1 == null)
            return p2(input);
        else
            return result1;
    };
}

The body of the function just returns a big lambda – recall that a parser of type "T" (a P<T>) is just a function that takes in a string and returns a ParseResult<T> (which is a structure containing the result of the parse, as well as the remaining unparsed input).  And so the lambda tries parsing the input with p1, but if that fails (returns null), then it tries p2 on the same input.

What we’ll do is change the semantics, so that p2 is only tried if p1 fails without consuming any input.  If p1 progressed some ways down the string and then failed, then typically we don’t want to try p2.  From the previous example, if p1 is the ifStmt parser and p2 is the whileStmt parser, and we already parsed the "if" with p1, then even if p1 eventually fails, we have no interest in trying p2.  On the other hand, if p1 fails without consuming any input (perhaps because the input string had a prefix that the ifStmt parser would immediately fail to match on the first token… for example, the string "while"), then of course we still want to try alternative p2.  Making this change to Or makes everything completely predictive.

To allow for selective backtracking, we’ll add a new combinator called Try:

// Try(p) behaves like p, except if p fails, it pretends it hasn't consumed any input
public static P<T> Try<T>(this P<T> p)

That is, Try(p) is a parser that’s just like p, except that when it fails, it will tell a lie and claim that it didn’t consume any input even if it did.  Phrased another way, Try lies by saying that it discovered it failed to parse the input by looking at just the first token, even though in fact it may have parsed arbitrarily far ahead in the token stream.  Given the semantics of Try, and the new semantics we shall give Or:

// p1.Or(p2) tries p1, but if it fails without consuming input, runs p2 instead
public static P<T> Or<T>(this P<T> p1, P<T> p2)

we can see how to do backtracking:

// old version
p1.Or(p2)           // always backtracks
// new version
p1.Or(p2)           // doesn't backtrack if p1 consumes input
Try(p1).Or(p2)      // backtracks any time p1 fails, just like the old version

We’ll see examples of using Try later on.  For now, we just need a sense of where we are headed.
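As a quick preview, here is a sketch of the kind of situation where Try earns its keep.  (This is a code fragment, not part of the library; it assumes the Literal, Try, Or, and Parse members from the full listing are in scope.)

```csharp
// Two keyword parsers that share the prefix "fo"
P<string> forKw = Parser.Literal("for", "for");
P<string> fooKw = Parser.Literal("foo", "foo");

// Without Try: parsing "foo" with forKw consumes "fo" before failing
// on 'o' != 'r', so the predictive Or refuses to try the alternative.
Console.WriteLine(forKw.Or(fooKw).Parse("foo"));               // Failure

// With Try: the failure pretends no input was consumed, so Or
// falls through to fooKw and the parse succeeds.
Console.WriteLine(Parser.Try(forKw).Or(fooKw).Parse("foo"));   // Success: foo
```

In other words, Try buys back exactly as much backtracking as you ask for, and no more.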

New representation types for parsers

Ok, so to enable better error diagnostics, we’ll change the representation and implementation of the core parser library in a few ways.  We need to add a way to represent error messages (e.g. "unexpected character ‘-‘") and expectations (e.g. "expected natural number"), as well as locations within the input.  We also need to keep track of which parsers have consumed input and which haven’t.  So we create some new data types and update the parser delegate type P<T> to utilize the new types.  Here’s the skeleton code that describes the structure; we’ll discuss each part in turn.

using Position = System.Int32;

public class ParserState
{
    public Position Position;
    public string Input;
}

public class ErrorInfo
{
    public Position Position;
    public IEnumerable<string> Expectations;
    public string Message;
}

public class ParseResult<T>
{
    public bool Succeeded;
    // if Succeeded, all fields below are used, else only ErrorInfo is used
    public ErrorInfo ErrorInfo;
    public T Result;
    public ParserState RemainingInput;
}

public class Consumed<T>
{
    public bool HasConsumedInput;
    public ParseResult<T> ParseResult;
}

// representation type for parsers
public delegate Consumed<T> P<T>(ParserState input);

First, we define a type alias for int called Position, which represents a location within the input (the nth character of the input).  Using a type alias here is good, because in the future we might enhance the representation of positions in the input (for instance, to deal with newlines, we may keep track of both the line number and the column number), and it is easier to make such changes by looking through the code for every occurrence of "Position", rather than by looking for "int" and trying to determine which integers were representing positions and which ones were doing other tasks.

ParserState will be the new input type for the parser.  It records the input string as well as the current location.  (The previous version used only a string as input, and repeatedly called "input.Substring(1)" as characters were consumed one by one.)
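To make the difference concrete, here is a sketch contrasting the two ways of advancing through the input (not code from the library itself):

```csharp
// Old style: each consumed character copies the entire remaining input,
// so parsing an n-character string does O(n^2) work in string copies.
string rest = input.Substring(1);

// New style: the input string is shared and never copied;
// advancing is just arithmetic on the position.  O(1) per character.
ParserState next = new ParserState(input.Position + 1, input.Input);
```

Besides the performance benefit, carrying the position along is what makes it possible to report *where* in the input an error occurred.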

ErrorInfo represents error diagnostic information.  In addition to the location of the error, we keep track of a list of expectations (that is, the list of things that could have validly come next at the point of error, such as "expected character ‘(‘ or natural number") as well as an arbitrary message describing the problem (e.g. "unexpected character ‘-‘").

ParseResult<T> is modified to include an ErrorInfo field as well as a boolean representing whether or not the parse succeeded.  Whereas previously parsers returned null on failure, now failed parses will return a ParseResult with Succeeded set to false.

Consumed<T> wraps the ParseResult information in another structure with an additional boolean that says whether or not the parser consumed any input (advanced through the input string).  As described above, this information will be useful to the Or and Try combinators to do predictive parsing with selective backtracking.

Finally, a parser P<T> is now a function that takes in a ParserState and returns a Consumed<T>.

These new classes have some additional constructors and methods which I’ve elided from the code snippet above.  Moving forward, as we encounter any interesting bits involving those members, I’ll call them out.

A small change to the client interface

Up until now, we could run a parser simply by passing it an input string and it would return a ParseResult.  Now that the delegate representation has changed, that’s no longer possible (we now need to pass in a ParserState object, and a Consumed object is returned).  To re-enable the friendlier interface for clients, we add an extension method:

// clients call p.Parse("input") to avoid implementation details like Consumed & ParserState
public static ParseResult<T> Parse<T>(this P<T> p, string toParse)
{
    return p(new ParserState(0, toParse)).ParseResult;
}
expr("3+4")        // old way
expr.Parse("3+4")  // new way

Now clients call "parser.Parse()", and the Parse() method will cons up the initial state object, as well as strip off the "consumed" data from the result.  Adding this method also shields clients from future breaking changes, in the event that we ever change the parser representation type P<T> from a delegate type to a class type.

Additionally, I’ve added a ToString() method to the ParseResult class, which displays the result in the case of a successful parse, or a nicely worded error message in the case of a failure.

The parser basis with the new representation

Now we’ll look at the implementation of the core parser functions using the new representation.

Item and Sat

Previously, we treated Item as a core function that knew about the underlying parser representation, and then implemented Sat in terms of Item.  Now it makes sense to do the reverse – promote Sat to the "core" functions that know the underlying parser representation, and define Item in terms of Sat.  This enables Sat to directly access the error information so as to distinguish its two different modes of failure (end of input, or character not matching predicate) with different error messages.

// Sat(pred) succeeds parsing a character only if the character matches the predicate
public static P<char> Sat(Predicate<char> pred)
{
    return input =>
    {
        if (input.Position >= input.Input.Length)
            return new Consumed<char>(false, new ParseResult<char>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), "unexpected end of input")));
        else if (!pred(input.Input[input.Position]))
            return new Consumed<char>(false, new ParseResult<char>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), "unexpected character '"+input.Input[input.Position]+"'")));
        else
            return new Consumed<char>(true, new ParseResult<char>(input.Input[input.Position], new ParserState(input.Position+1, input.Input), new ErrorInfo(input.Position+1)));
    };
}
// Item() consumes the first character of the input string and returns it as a result
public static P<char> Item()
{
    return Sat(c => true);
}

So, for example, we can call

Console.WriteLine(Sat(c => c == 'x').Parse(""));
Console.WriteLine(Sat(c => c == 'x').Parse("y"));
Console.WriteLine(Sat(c => c == 'x').Parse("x"));

and get

Failure: At position 0, unexpected end of input
Failure: At position 0, unexpected character 'y'
Success: x

Already we have an improvement!  Whereas previously all failures just returned "null" with no information as to the cause of the problem, now we can distinguish failure cases and offer more useful diagnostics.

Note also that Sat populates the resultant "consumed" data with either true or false, based on whether it succeeded in parsing a character of the input or not.

Or and Try

Now let’s look at the new Or combinator:

// p1.Or(p2) tries p1, but if it fails without consuming input, runs p2 instead
public static P<T> Or<T>(this P<T> p1, P<T> p2)
{
    return input =>
    {
        Consumed<T> consumed1 = p1(input);
        if (consumed1.ParseResult.Succeeded || consumed1.HasConsumedInput)
        {
            return consumed1;
        }
        else
        {
            Consumed<T> consumed2 = p2(input);
            if (consumed2.HasConsumedInput)
                return consumed2;
            return new Consumed<T>(consumed2.HasConsumedInput,
                consumed2.ParseResult.MergeError(consumed1.ParseResult.ErrorInfo));
        }
    };
}

The logic for whether or not to run p2 has changed; now p2 runs only if p1 fails without consuming any input.  If p1 succeeds or consumes input, we short-circuit the other alternative and immediately return.

In the case where neither p1 nor p2 consumes any input, you’ll note there’s a call to a new function "MergeError".  The high-level description of that function is that it combines the error information about expectations.  This is what enables error messages of the form "…expected x, y, or z…"; we’ll see an example later.

Here is the new Try combinator:

// Try(p) behaves like p, except if p fails, it pretends it hasn't consumed any input
public static P<T> Try<T>(this P<T> p)
{
    return input =>
    {
        Consumed<T> consumed = p(input);
        if (consumed.HasConsumedInput && !consumed.ParseResult.Succeeded)
            return new Consumed<T>(false, consumed.ParseResult);
        else
            return consumed;
    };
}

It works just as billed – if p consumes input and fails, it returns a value with the same ParseResult, only with HasConsumedInput set to false.
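To see why Try matters now that Or is predictive, consider two alternatives that share a prefix.  This sketch assumes the string overload Literal(s, result) from part three carries over to the new representation:

```csharp
// Without Try: parsing "food" runs Literal("fork", 1) first, which consumes
// 'f' and 'o' before failing on 'r'.  Since input was consumed, Or will NOT
// run the second alternative, and the whole parse fails.
P<int> p = Literal("fork", 1).Or(Literal("food", 2));

// With Try: the failed "fork" branch pretends it consumed nothing, so Or
// falls through to Literal("food", 2), which succeeds with result 2.
P<int> q = Try(Literal("fork", 1)).Or(Literal("food", 2));
```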

A new combinator: Tag

In addition to Try, there is one other new core combinator I failed to mention.  It’s called Tag, and it’s the way we can assign labels to individual parsers, so as to provide a high-level description of expectations in error messages.  Here’s the declaration of Tag, along with some typical uses:

// p.Tag(label) makes it so if p fails without consuming input, the error "expected label" occurs
public static P<T> Tag<T>(this P<T> p, string label);

public static P<char> Letter = Sat(char.IsLetter).Tag("letter");
public static P<char> Digit = Sat(char.IsDigit).Tag("digit");

This is what enables error messages like the one here:

Console.WriteLine(Letter.Or(Digit).Parse("?"));
// Failure: At position 0, unexpected character '?', expected digit or letter

Now we can better appreciate what was happening in that MergeError() call inside the definition of Or.  MergeError() concatenates the lists of expectations inside the resultant error information, so that when a set of Or alternatives fails, we can issue an error message that lists the various alternatives we were expecting at that point in the parse.
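The ErrorInfo class itself was elided earlier, so here is a rough sketch of a Merge that concatenates expectations (the member names and constructor shapes are guesses based on the Sat snippet above, not the post's actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of an ErrorInfo whose Merge concatenates expectation lists.
public class ErrorInfo
{
    public readonly int Position;
    public readonly IEnumerable<string> Expectations;
    public readonly string Message;

    public ErrorInfo(int position)
        : this(position, Enumerable.Empty<string>(), "unknown error") { }
    public ErrorInfo(int position, IEnumerable<string> expectations, string message)
    {
        this.Position = position;
        this.Expectations = expectations;
        this.Message = message;
    }

    // Keep this error's position and message, but remember both sets of
    // expectations, so that Or can report "expected digit or letter".
    public ErrorInfo Merge(ErrorInfo other)
    {
        return new ErrorInfo(this.Position,
                             this.Expectations.Concat(other.Expectations),
                             this.Message);
    }
}
```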

The rest

There are other changes to the core parser functions, but I’ll skip describing them, to prevent this blog entry from becoming extremely long.  As always, the entire code appears at the end.

Nevertheless, to sum up, we created a new parser representation and changed the parser basis – the set of core functions – to work with the new representation.  The basis now contains these functions: Fail, Return, Then, Or, Try, Tag, Sat, and Parse.  The rest of the library, including combinators like Chainl1 and Many, is unchanged, other than the tiny addition of a few Tag calls to label such parsers as Letter and Digit.  By building the parser library atop a small set of core functionality, we’ve been able to change the representation to greatly enhance the utility of the library, with a minimum of repercussions outside of the core.

Examples

We’ve already encountered one of the best examples of the new utility: the motivating example of the expression-parser error messages at the beginning of today’s blog entry.  In order to get error messages like

Failure: At position 7, unexpected character '-', 
expected character '(' or natural number
Failure: At position 10, unexpected character '.', 
expected character ')', add/subtract op, multiply/divide op or exponentiation op

the only changes we had to make to our old code was the addition of a few Tag calls.  Here’s one such example:

P<Func<int, int, int>> addOp = Literal('+').Then_(Return((int x, int y) => x + y))
                           .Or(Literal('-').Then_(Return((int x, int y) => x - y)))
                           .Tag("add/subtract op");

The only difference compared to our previous code is the addition of the Tag call on the last line of the snippet above.

NotFollowedBy and EndOfInput

The Try combinator enables us to create another useful combinator called NotFollowedBy.  Here is its definition, along with an example of its use.

// p1.NotFollowedBy(p2,label) runs p1 and then looks ahead for p2, failing if p2 is there
public static P<T> NotFollowedBy<T, U>(this P<T> p1, P<U> p2, string p2Label)
{
    return p1.Then(result =>
        Try(p2.Then_(Fail<T>("unexpected " + p2Label))
            .Or(Return(result))));
}
Console.WriteLine(Literal('1').NotFollowedBy(Literal('2'), "character '2'").Parse("12"));
// Failure: At position 1, unexpected character '2'

This lets us write a handy function to recognize the end of the input:

P<int> EndOfInput = Return(0).NotFollowedBy(Item(), "character").Tag("end of input");

Console.WriteLine(Literal('1').Then_(EndOfInput).Parse("1"));
//Success: 0
Console.WriteLine(Literal('1').Then_(EndOfInput).Parse("12"));
//Failure: At position 1, unexpected character, expected end of input

EndOfInput does not need to return any useful value, so it always returns the integer 0 when it succeeds.  It is interesting that this parser can be written in client code.  At first glance, detecting the end of the input seems like something that would require access to the underlying representation.  This just goes to show how expressive the small set of core functions – the parser basis – really is.

NotFollowedBy also helps us correct a problem with the English number parser we previously encountered.  Back in part two, we created a parser to recognize and evaluate English numbers.  However, if you read the source code, you discovered it had a problem: when trying to parse a number like "seventy-seven", the literal string "seven" would match and immediately return the value 7 (along with remaining unparsed input "ty-seven").  We can correct this by appending a NotFollowedBy(Letter) call to each Literal string we want to match.  That way, for example, "seven" only gets matched when the next character is not another letter.  Making that fix and adding a few Tags yields an English number parser we can call like this:

Console.WriteLine(OneToNinetyNine.Parse("seven"));
Console.WriteLine(OneToNinetyNine.Parse("twenty-seven"));
Console.WriteLine(OneToNinetyNine.Parse("seventy-seven"));
Console.WriteLine(OneToNinetyNine.Parse("so"));

to yield these results:

Success: 7
Success: 27
Success: 77
Failure: At position 0, unexpected character 's', expected english number 1-99

Nifty.  The full code appears below.
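For a single literal, the fix described above might look something like this (a hypothetical snippet; the actual code is in the full source):

```csharp
// Match "seven" only when the next character is not another letter; Try lets
// an enclosing Or backtrack and attempt "seventy" instead when it is.
P<int> seven = Try(Literal("seven", 7).NotFollowedBy(Letter, "letter"));
```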

Summary

Today we added error diagnostics to the parser library.  By changing the parser representation type we can keep track of the position in the parse, an error message, and a list of expectations for that location.  We changed the semantics of Or to be predictive by default.  And we added three new functions to the parser basis (the core functions that know the parser representation):

  • Try(p), which pretends it didn’t consume any input when p fails, enabling selective backtracking
  • p.Tag(label), which makes it easy to get descriptive expectations into parser error messages
  • p.Parse(string), which shields clients from the underlying parser representation

We also created a handy combinator called NotFollowedBy, as well as a parser for recognizing the EndOfInput.

Full source code

Hmmm, it appears the blog server thinks this post is too big.  So I’m going to try posting the code as a separate entry.

Posted in Uncategorized | Leave a Comment »

Monadic parser combinators, part three

Posted by Brian on December 4, 2007

Last time we learned a bit about monadic parser combinators.  Today we are going to build on that foundation and create some new combinators for the parser library.  We’ll use those new combinators to create a few interesting parsers, including a mathematical expression evaluator.  Along the way we’ll learn a little more about lists and lambdas in C#, too.

Review

Let’s review the parser combinator library we have accumulated so far:

// Fail() is a parser that always fails without consuming any input
P<T> Fail<T>()
// Return(x) is a parser that always succeeds with result x, without consuming any input
P<T> Return<T>(T x)
// p1.Then_(p2) runs p1, discards the result, then runs p2 on the remaining input
P<U> Then_<T, U>(this P<T> p1, P<U> p2)
// p1.Then(v => p2) runs p1, binds the result to variable v, then runs p2 on the remaining input
P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
// p1.Or(p2) tries p1, but if it fails, runs p2 on the original input instead
P<T> Or<T>(this P<T> p1, P<T> p2)
// Item() consumes the first character of the input string and returns it as a result
P<char> Item()
// Sat(pred) succeeds parsing a character only if the character matches the predicate
P<char> Sat(Predicate<char> pred)
// Letter parses a letter
P<char> Letter
// Digit parses a digit character
P<char> Digit
// DigitVal parses a digit character and returns the corresponding numeric value
P<int> DigitVal
// Literal(c) parses a particular character
P<char> Literal(char c)
// Literal(s, x) parses a particular string, and returns a fixed result
T Literal<T>(string toParse, T result)

Not a bad start!  Still, all the parsers we have built thus far could only parse a finite amount of input.  Most parsers need to be able to parse an arbitrary number of things.  For example, a mathematical expression can have an arbitrary number of terms (e.g. "3+4+5+…"), and a C# function body is composed of some arbitrary number of statements.  To make realistic parsers, we need to be able to parse lists of things.  So let’s get to it.

Lists in C#

What data type should we use to represent a list of things?  The .NET framework offers a number of different kinds of concrete data structures, including arrays, a List class, a LinkedList class, and some others.  When considering lists from a functional programming perspective, however, there are a few common idioms regarding how lists are typically used.  Common scenarios include applying a function to each value in a list to compute a new list of results, and accumulating all of the values from a list into a single result using some accumulator function.

It turns out that these common idioms appear as extension methods on the IEnumerable<T> interface.  Query comprehensions in LINQ are extremely similar to list comprehensions in a functional programming language like Haskell.  So we’ll use IEnumerable<T> as our list representation when it comes to parsing lists of things.  (In a future blog, IEnumerable<T> might also be valuable to us because of how it can enable lazy evaluation.)

There are all kinds of cool functions for manipulating IEnumerable<T>s in LINQ – I could spend an entire blog entry discussing them.  That would be a needless aside, though; check out the docs yourself to learn more.  Here is just a quick taste:

IEnumerable<int> nums = new int[] { 1, 2, 3, 4 };
IEnumerable<int> squares = nums.Select(x => x * x);  // [1, 4, 9, 16]
int sumOfSquares = squares.Aggregate((x,y) => x+y);  // 30
Cons(0, squares);                                    // [0, 1, 4, 9, 16]

Select applies a given function to each element of a list and returns a list of results.  Aggregate combines the values in a list with a given function.  Cons is not in the .NET library, but rather is a useful little function I defined like this:

public static IEnumerable<T> Cons<T>(T x, IEnumerable<T> rest)
{
    yield return x;
    foreach (T t in rest)
        yield return t;
}

Cons just prepends a single value onto the front of a list.  (If you’re not familiar with iterator blocks and the yield statement, don’t sweat it.)  By the way, I am using the term list loosely, simply because repeatedly using the word "IEnumerable" makes for unwieldy prose.

Ok, now that we have a way to represent lists, let’s start authoring some parsers that can consume arbitrary amounts of input.

The Many() combinator

Let’s make a parser for words.  A word is a sequence of one or more letters.  So the goal is to be able to author a parser that will work like so:

word("hello world")  // yields { "hello", " world" }

(Recall that I am using the shorthand "{ Result, RemainingInput }" to describe the ParseResult; see the previous blog for details, if you’ve forgotten.)

To write word, we need a way to parse an arbitrary number of items.  So we invent these combinators:

// Many(p) parses with p zero or more times, returning the sequence of results
public static P<IEnumerable<T>> Many<T>(P<T> p)
{
    return Many1<T>(p).Or(Return(Enumerable.Empty<T>()));
}
// Many1(p) parses with p one or more times, returning the sequence of results
public static P<IEnumerable<T>> Many1<T>(P<T> p)
{
    return p.Then(x =>
           Many(p).Then(xs =>
           Return(Cons(x, xs))));
}

Many and Many1 are mutually recursive; the former parses zero or more occurrences, the latter one or more.  The definition of Many is easy; we try to parse one or more times with p, but if that fails (because there wasn’t even a single instance of the thing that p parses at the start of the input), then we just return the empty list without consuming any input.  Many1 is only a little more complicated.  It parses a single instance of p, and names the result value x.  Then it calls Many to parse the remaining occurrences, and names those results xs.  Now that we’ve parsed all that we can, we return a list containing the first result (x) tacked on the front of the list of the rest of the results (xs).

Now we can define word:

P<string> word = Many1(Letter).Then(chars => Return(new string(chars.ToArray())));
word("arbitrary number of characters") // yields { "arbitrary", " number of characters" }

Many1(Letter) does all the work.  The result type is IEnumerable<char>, though, and we would prefer that word result in a string, so after parsing the letters, we convert the resulting list to a string.  Easy!  This is our first parser that can parse an arbitrarily long input string.

Before moving on, I should note that names of these functions demonstrate a naming convention we’ll use throughout the rest of the library.  "Foo" and "Foo1" are similar combinators, except the former parses zero or more things whereas the latter parses one or more things.

Parsing numbers

Now let’s try to parse numbers.  A natural number is just a fancy term for a non-negative integer.  Let’s try to write a parser for natural numbers.  Suppose we are parsing the string "153".  We already have the parser DigitVal, so we can get the numeric value of each individual character.  We need to accumulate a value as we parse along – each time we see another digit, we multiply the previously-accumulated value by 10 and add the value of the new digit.  Here is one way we could write such a parser:

public static P<int> Natural = DigitVal.Then(x => NaturalHelper(x));
static P<int> NaturalHelper(int x)
{
    return DigitVal.Then(y =>
           NaturalHelper(10 * x + y))
    .Or(Return(x));
}
Natural("153 happy bunnies") // yields { 153, " happy bunnies" }

The logic is clear.  First we parse the initial digit value to seed the accumulator.  Then we start repeatedly calling a helper, which either sees another digit and continues with an accumulated value of 10*x+y, or when it fails to see another digit, just returns the accumulated value.

Simple expressions, and the Chainl1 combinator

Let’s write a very tiny expression parser.  The expressions we want to parse are just natural numbers and the addition and subtraction operators.  Here’s an illustrative example that tests the limits of the desired parser:

simpleExpr("10+8-13-4")  // yields { 1, "" }

Now, it just so happens that left-associative binary operators are a very common thing to want to parse – practically every computer language has expressions like these.  And as it turns out, there’s a very handy combinator that makes parsing these a breeze.  The combinator is called Chainl1 – "Chain" because it chains together values with operators, "l" because it works left-associatively, and "1" because it parses one or more values.  Here’s the code for it, as well as the code for our simple expression parser:

// p.Chainl1(op) parses a grammar like "p (op p)*"
// the result is the p results combined by the op results, left-associatively
public static P<T> Chainl1<T>(this P<T> p, P<Func<T, T, T>> op)
{
    return p.Then(x => Chainl1Helper(x, p, op));
}
static P<T> Chainl1Helper<T>(T x, P<T> p, P<Func<T, T, T>> op)
{
    return op.Then(f =>
           p.Then(y =>
           Chainl1Helper<T>(f(x, y), p, op)))
    .Or(Return(x));
}
P<Func<int,int,int>> addOp = Literal('+').Then_(Return((x, y) => x + y))
                         .Or(Literal('-').Then_(Return((x, y) => x - y)));
P<int> simpleExpr = Natural.Chainl1(addOp);

Yikes!  Let’s break it down.  Chainl1 works kind of like Natural, only it is more general.  Like Natural, it parses an initial seed value, and then calls a recursive helper to continually accumulate values.  But the values it parses are separated by operators, and the operators define the way the values are combined.  We’ll trace through it in a moment.  First have a look at addOp.  It recognizes operators, and depending on what operator it sees, it returns a corresponding C# function.  Our expression parser-evaluator is then just defined as Natural.Chainl1(addOp).

Let’s trace what happens when we try to parse "10+8-13-4".  Chainl1 is called with p bound to Natural and op bound to addOp.  First we run p.  This will parse the initial Natural 10, and store that value in x.  Then we call the recursive helper.  It runs op, which will parse the + character and bind the name f to the C# lambda function that adds two integers.  Then we run p again, parsing 8 and storing the value in y.  Next we compute f(x,y), which is +(10,8), or 18.  And so having parsed the prefix "10+8" we recurse with 18 as the accumulated value x.  This time we parse a '-' and bind it to f, parse the 13 as y, and recurse with 18-13=5.  This time we’ll parse the remaining "-4" and recurse with 1.  And now when we try to parse another operator, we fail (because we’ve reached the end of the input), so control continues to the right-hand-side of the Or call, and we return the accumulated value 1, which is the answer.  Whew!
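The accumulation in that trace is just a left fold over the parsed values and operators.  Here is the same computation written out as plain C# (with made-up names), which makes the left-associativity explicit:

```csharp
using System;

class LeftAssocDemo
{
    // The pieces Chainl1 would parse out of "10+8-13-4".
    static readonly int[] Values = { 10, 8, 13, 4 };
    static readonly Func<int, int, int>[] Ops =
        { (x, y) => x + y, (x, y) => x - y, (x, y) => x - y };

    public static int Evaluate()
    {
        int acc = Values[0];                  // seed: 10
        for (int i = 0; i < Ops.Length; i++)
            acc = Ops[i](acc, Values[i + 1]); // ((10+8)-13)-4
        return acc;
    }

    static void Main()
    {
        Console.WriteLine(Evaluate());        // prints 1
    }
}
```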

Sometimes a picture is worth a thousand words.  Here’s a diagram that explains the parse:

[diagram: trace of simpleExpr parsing "10+8-13-4"]

The first call to Chainl1 is circled in the bottom left, and then the subsequent recursive calls to Chainl1Helper are the larger circles (the whole diagram is the last "circle").  Compare the diagram to the code until you grok what’s going on.

A short C# aside

I fudged things a tiny bit in the previous code snippet.  The code included expressions like

Return((x, y) => x + y)

and it turns out that doesn’t compile.  The compiler complains that it can’t infer the type argument of the Return<T> function.  You might think the problem is due to the fact that + can work on doubles as well as on ints, and so you might try changing it to

Return((int x, int y) => x + y)

but you still get the same error.  The problem is that lambdas do not have types (they are merely convertible to delegate types with compatible signatures).  We can fix this by saying

Return<Func<int,int,int>>((x, y) => x + y)

that is, explicitly calling out the T type of the generic Return function.  However this is pretty verbose, and as you’ll see, we’ll be Returning lambdas somewhat frequently.  We can make things more bearable by adding a new Return overload to our library:

public static P<Func<T, T, T>> Return<T>(Func<T, T, T> f)
{
    return Return<Func<T, T, T>>(f);
}

Now there is a named delegate type (Func) to participate in the overload/coercion rules the compiler uses, and as a result, now the code

Return((int x, int y) => x + y)

will compile and work as desired.

Parsing numbers with combinators

I mentioned that the Chainl1 combinator was very generally applicable.  To drive home that point, let’s rewrite the function to parse natural numbers.  Ready?  Ok, here it is!

P<int> Nat = DigitVal.Chainl1(Return((int x, int y) => 10 * x + y));

That’s it!  We chain the DigitVal parser with a Return call – that Return call parses no characters of the input, but returns a useful function for combining the DigitVal results.  Here’s another diagram of how this particular parser works on the string "153":

[diagram: trace of Nat parsing "153"]

In this case, the "operator" between the digit characters is just the empty string.  Neat, huh?  See, I told you Chainl1 was versatile.
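Since Chainl1 is just doing left-associative accumulation here, the same computation on the digit values can be mirrored with LINQ's Aggregate:

```csharp
using System;
using System.Linq;

class NatDemo
{
    // The digit-by-digit accumulation Nat performs on "153", as a plain fold.
    public static int Evaluate(string digits)
    {
        return digits.Aggregate(0, (acc, c) => 10 * acc + (c - '0'));
    }

    static void Main()
    {
        Console.WriteLine(Evaluate("153"));   // prints 153
    }
}
```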

The full expression parser

Now let’s revisit the expression parser, but add the ability to do multiplication, division, exponentiation, and parentheses.  The usual precedence rules will apply, and exponentiation will associate to the right.  To deal with right-associative operators, we need Chainr1:

// p.Chainr1(op) is just like p.Chainl1(op), except it's right-associative
public static P<T> Chainr1<T>(this P<T> p, P<Func<T, T, T>> op) { ... }

We won’t look at the code now; you can find it (and all the code from today) at the end of this blog entry.  With this new combinator in hand, we can write a full expression parser:

P<Func<int,int,int>> addOp = Literal('+').Then_(Return((int x, int y) => x + y))
                         .Or(Literal('-').Then_(Return((int x, int y) => x - y)));
P<Func<int,int,int>> mulOp = Literal('*').Then_(Return((int x, int y) => x * y))
                         .Or(Literal('/').Then_(Return((int x, int y) => x / y)));
P<Func<int,int,int>> expOp = Literal('^').Then_(Return((int x, int y) => (int)Math.Pow(x,y)));
P<int> expr = null, term = null, factor = null, part = null, paren = null;
paren  = Literal('(').Then(dummy =>   
         expr.Then(e =>                // (1)
         Literal(')').Then_(
         Return<int>(e))));
part   = Nat.Or(paren);
factor = part.Chainr1(expOp);
term   = factor.Chainl1(mulOp);
expr   = term.Chainl1(addOp);          // (2)
expr("(611-2^3^2+1)/10-5*2")   // yields { 0, "" }

First we define parsers that recognize the strings representing operators and return the corresponding C# function as a result.  We’ve seen this before, with addOp.  Next, we define a bunch of integer parsers for each portion of the language grammar.  Since these guys are mutually recursive, I declare them all up front to avoid a forward-declaration problem.  Then we create parsers for each portion of the grammar, starting with highest precedence.  The paren parser recognizes strings of the form "(exp)", and returns the value that’s the result of parsing the expression inside the parentheses.  The part parser recognizes natural numbers or parenthesized expressions.  factor chains together parts with the exponentiation operator, which has the highest precedence.  And so on for terms and expressions.  The end result is a nice expression parser that can parse and evaluate complicated expressions like the example at the end of the code snippet.

There is one cute aspect worth pointing out.  These guys are mutually recursive, and so there’s no order in which we could define them that makes everything immediately work out.  Note especially that paren references expr (on the line marked (1)), but at that moment in time, expr is still null.  expr doesn’t get assigned the right value until the line marked (2).  So why does it work?  The key is the way C# lambdas capture values from the external environment.  Note that in the line above (1), we call "Then(dummy=>" rather than just "Then_(".  This way, expr is referenced inside a lambda, and since it’s a name from outside the lambda, C# captures a reference to this mutable local variable.  A few statements later, when we mutate the value of the local variable expr, this also affects the lambda inside paren.  So by the time we actually use expr to parse a string, everything works.  This is a cute trick, and it saves us from having to define paren/part/factor/term/expr as functions instead of just as variables inside a function.  It’s also a nice opportunity to point out the subtleties of lambda capture in C#; to learn more, check out section 7.14.4.2 of the C# 3.0 spec.
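The same capture trick, reduced to a minimal self-contained example with plain delegates:

```csharp
using System;

class CaptureDemo
{
    public static int Run()
    {
        Func<int> expr = null;
        // The lambda captures the *variable* expr, not its current (null) value...
        Func<int> paren = () => expr();
        // ...so assigning expr afterwards is visible inside paren.
        expr = () => 42;
        return paren();
    }

    static void Main()
    {
        Console.WriteLine(Run());   // prints 42
    }
}
```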

Parsing lists with SepBy

There’s one more handy combinator I want to show off today.  SepBy makes it easy to parse lists of things that are separated by other things.  Here’s an example that shows off parsing a bracket-delimited list of numbers-separated-by-commas:

// p.SepBy(sep) parses zero or more p, each separated by a sep, returning a list of the p results
public static P<IEnumerable<T>> SepBy<T, U>(this P<T> p, P<U> sep) { ... }
P<IEnumerable<int>> listOfNums = Literal('[').Then_(
                                 Nat.SepBy(Literal(',')).Then(nats =>
                                 Literal(']').Then_(
                                 Return(nats))));
listOfNums("[1,10,100]")  // yields { [1, 10, 100], "" }
listOfNums("[1]")         // yields { [1], "" }
listOfNums("[]")          // yields { [], "" }

You can see the code for SepBy at the end of this blog entry.
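As a sketch of how SepBy could be built from the combinators we already have (following the Foo/Foo1 naming convention; the post's actual definition may differ):

```csharp
// SepBy1(p, sep) parses one p, then any number of (sep p) pairs, keeping
// only the p results; SepBy also allows the zero-element case.
public static P<IEnumerable<T>> SepBy1<T, U>(this P<T> p, P<U> sep)
{
    return p.Then(x =>
           Many(sep.Then_(p)).Then(xs =>
           Return(Cons(x, xs))));
}
public static P<IEnumerable<T>> SepBy<T, U>(this P<T> p, P<U> sep)
{
    return SepBy1(p, sep).Or(Return(Enumerable.Empty<T>()));
}
```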

Summary

We’ve covered a lot today!  We introduced a number of combinators for parsing lists of things, including SepBy and Many.  We also learned about the Chainl1 and Chainr1 combinators, which can be used to build expression parsers and even simple parsers for individual numbers.  We built a few cool little parsers that parse things like words, expressions, and comma-separated lists.  And we learned a little more about new C# features, like extension methods for IEnumerable<T>s and lambda capture semantics.  I hope you enjoyed it!

Looking ahead

There’s still a lot of ground we can cover.  A number of potential future topics spring to mind:

  • walking through the plumbing of functions like Then and Or
  • adding better error handling (so that when a parse fails, we know where and why)
  • generalizing the parser library to parse arbitrary token types (rather than just characters)
  • enabling laziness, so that we can stream text through a parser and deliver initial results without parsing the entire string beforehand

We’ll see what I’m in the mood to write about next.

Full source code

As always, here’s today’s code.

using System;
using System.Linq;
using System.Collections.Generic;

namespace MPC_Part3
{
    public class ParseResult<T>
    {
        public readonly T Result;
        public readonly string RemainingInput;
        public ParseResult(T r, string ri) { this.Result = r; this.RemainingInput = ri; }
        public override bool Equals(object obj)
        {
            ParseResult<T> other = obj as ParseResult<T>;
            if (other == null)
                return false;
            if (!object.Equals(other.Result, this.Result))
                return false;
            if (!object.Equals(other.RemainingInput, this.RemainingInput))
                return false;
            return true;
        }
        public override int GetHashCode()
        {
            return this.RemainingInput.GetHashCode();
        }
    }

    // representation type for parsers
    public delegate ParseResult<T> P<T>(string input);

    public static class Parser
    {
        // core functions that know the underlying parser representation
        // Fail() is a parser that always fails without consuming any input
        public static P<T> Fail<T>()
        {
            return input => null;
        }
        // Return(x) is a parser that always succeeds with result x, without consuming any input
        public static P<T> Return<T>(T x)
        {
            return input => new ParseResult<T>(x, input);
        }
        public static P<Func<T, T, T>> Return<T>(Func<T, T, T> f)
        {
            return Return<Func<T, T, T>>(f);
        }
        // p1.Then_(p2) runs p1, discards the result, then runs p2 on the remaining input
        public static P<U> Then_<T, U>(this P<T> p1, P<U> p2)
        {
            return p1.Then(dummy => p2);
        }
        // p1.Then(v => p2) runs p1, binds the result to variable v, then runs p2 on the remaining input
        public static P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
        {
            return input =>
            {
                ParseResult<T> result1 = p1(input);
                if (result1 == null)
                    return null;
                else
                    return f(result1.Result)(result1.RemainingInput);
            };
        }
        // p1.Or(p2) tries p1, but if it fails, runs p2 on the original input instead
        public static P<T> Or<T>(this P<T> p1, P<T> p2)
        {
            return input =>
            {
                ParseResult<T> result1 = p1(input);
                if (result1 == null)
                    return p2(input);
                else
                    return result1;
            };
        }
        // Item() consumes the first character of the input string and returns it as a result
        public static P<char> Item()
        {
            return input =>
            {
                if (string.IsNullOrEmpty(input))
                    return null;
                else
                    return new ParseResult<char>(input[0], input.Substring(1));
            };
        }
        // other handy parser functions
        // Sat(pred) succeeds parsing a character only if the character matches the predicate
        public static P<char> Sat(Predicate<char> pred)
        {
            return Item().Then(c => pred(c) ? Parser.Return(c) : Parser.Fail<char>());
        }
        // Letter parses a letter
        public static P<char> Letter = Sat(char.IsLetter);
        // Digit parses a digit character
        public static P<char> Digit = Sat(char.IsDigit);
        // DigitVal parses a digit character and returns the corresponding numeric value
        public static P<int> DigitVal = Sat(char.IsDigit).Then(c => Return((int)(c - '0')));
        // Literal(c) parses a particular character
        public static P<char> Literal(char c) { return Sat(x => x == c); }
        // Literal(s, result) parses the particular string s, and returns the fixed result
        public static P<T> Literal<T>(string toParse, T result)
        {
            if (toParse == "")
                return Parser.Return(result);
            else
                return Literal(toParse[0]).Then_(Literal(toParse.Substring(1), result));
        }
        // Cons(x, rest) returns a new sequence like rest, but with x added at the front
        public static IEnumerable<T> Cons<T>(T x, IEnumerable<T> rest)
        {
            yield return x;
            foreach (T t in rest)
                yield return t;
        }
        // Many(p) parses with p zero or more times, returning the sequence of results
        public static P<IEnumerable<T>> Many<T>(P<T> p)
        {
            return Many1<T>(p).Or(Return(Enumerable.Empty<T>()));
        }
        // Many1(p) parses with p one or more times, returning the sequence of results
        public static P<IEnumerable<T>> Many1<T>(P<T> p)
        {
            return p.Then(x =>
                   Many(p).Then(xs =>
                   Return(Cons(x, xs))));
        }
        // one way to parse natural numbers
        public static P<int> Natural = DigitVal.Then(x => NaturalHelper(x));
        static P<int> NaturalHelper(int x)
        {
            return DigitVal.Then(y =>
                   NaturalHelper(10*x + y))
            .Or(Return(x));
        }
        // p.Chainl1(op) parses a grammar like "p (op p)*"
        // the result is the p results combined by the op results, left-associatively
        public static P<T> Chainl1<T>(this P<T> p, P<Func<T, T, T>> op)
        {
            return p.Then(x =>
                Chainl1Helper(x, p, op));
        }
        static P<T> Chainl1Helper<T>(T x, P<T> p, P<Func<T, T, T>> op)
        {
            return op.Then(f =>
                   p.Then(y =>
                   Chainl1Helper<T>(f(x, y), p, op)))
            .Or(Return(x));
        }
        // another way to parse naturals
        public static P<int> Nat = DigitVal.Chainl1(Return((int x, int y) => 10 * x + y));
        // p.Chainr1(op) is just like p.Chainl1(op), except it's right-associative
        public static P<T> Chainr1<T>(this P<T> p, P<Func<T, T, T>> op)
        {
            return p.Then(x =>
                   (
                   op.Then(f =>
                   Chainr1(p, op).Then(y =>
                   Return(f(x, y))))
                   )
                   .Or(Return(x)));
        }
        // p.SepBy(sep) parses zero or more p, each separated by a sep, returning a list of the p results
        public static P<IEnumerable<T>> SepBy<T, U>(this P<T> p, P<U> sep)
        {
            return SepBy1(p, sep).Or(Return(Enumerable.Empty<T>()));
        }
        // p.SepBy1(sep) parses one or more p, each separated by a sep, returning a list of the p results
        public static P<IEnumerable<T>> SepBy1<T, U>(this P<T> p, P<U> sep)
        {
            return p.Then(x =>
                SepByHelper(p, sep).Then(xs =>
                Return(Cons(x, xs))));
        }
        static P<IEnumerable<T>> SepByHelper<T, U>(P<T> p, P<U> sep)
        {
            return sep.Then_(SepBy1(p, sep))
                .Or(Return(Enumerable.Empty<T>()));
        }

        // various running examples from the blog
        static void AssertEqual<T>(T x, T y)
        {
            if (!object.Equals(x, y))
                throw new Exception("not equal");
        }
        static void AssertEqualEnumerable<T>(ParseResult<IEnumerable<T>> x, ParseResult<IEnumerable<T>> y)
        {
            AssertEqual(x.RemainingInput, y.RemainingInput);
            List<T> xl = x.Result.ToList();
            List<T> yl = y.Result.ToList();
            AssertEqual(xl.Count, yl.Count);
            for (int i = 0; i < xl.Count; ++i)
                AssertEqual(xl[i], yl[i]);
        }

        static void Main(string[] args)
        {
            IEnumerable<int> nums = new int[] { 1, 2, 3, 4 };
            IEnumerable<int> squares = nums.Select(x => x * x);  // [1, 4, 9, 16]
            int sumOfSquares = squares.Aggregate((x,y) => x+y);  // 30
            foreach (int x in Cons(0, squares))                  // [0, 1, 4, 9, 16]
                Console.WriteLine(x);

            P<string> word = Many1(Letter).Then(chars => Return(new string(chars.ToArray())));
            AssertEqual(word("hello world"), new ParseResult<string>("hello", " world"));

            AssertEqual(Natural("256K"), new ParseResult<int>(256, "K"));
            AssertEqual(Nat("256K"), new ParseResult<int>(256, "K"));

            P<Func<int,int,int>> addOp = Literal('+').Then_(Return((int x, int y) => x + y))
                                     .Or(Literal('-').Then_(Return((int x, int y) => x - y)));
            P<int> simpleExpr = Natural.Chainl1(addOp);
            AssertEqual(simpleExpr("10+2+3-14"), new ParseResult<int>(1, ""));

            P<Func<int,int,int>> mulOp = Literal('*').Then_(Return((int x, int y) => x * y))
                                     .Or(Literal('/').Then_(Return((int x, int y) => x / y)));
            P<Func<int,int,int>> expOp = Literal('^').Then_(Return((int x, int y) => (int)Math.Pow(x,y)));
            P<int> expr = null, term = null, factor = null, part = null, paren = null;
            paren = Literal('(').Then(dummy => // use dummy lambda to capture reference to mutable "expr"
                    expr.Then(e =>
                    Literal(')').Then_(
                    Return<int>(e))));
            part = Nat.Or(paren);
            factor = part.Chainr1(expOp);
            term = factor.Chainl1(mulOp);
            expr = term.Chainl1(addOp);
            AssertEqual(expr("(611-2^3^2+1)/10-5*2"), new ParseResult<int>(0, ""));

            P<IEnumerable<int>> listOfNums = Literal('[').Then_(
                                             Nat.SepBy(Literal(',')).Then(nats =>
                                             Literal(']').Then_(
                                             Return(nats))));
            AssertEqualEnumerable(listOfNums("[1,10,100]"), new ParseResult<IEnumerable<int>>(new int[] { 1, 10, 100 }, ""));
            AssertEqualEnumerable(listOfNums("[1]"), new ParseResult<IEnumerable<int>>(new int[] { 1 }, ""));
            AssertEqualEnumerable(listOfNums("[]"), new ParseResult<IEnumerable<int>>(new int[] { }, ""));
            
            Console.WriteLine("done, press a key");
            Console.ReadKey();
        }
    }
}

Posted in Uncategorized | Leave a Comment »

Monadic parser combinators, part two

Posted by Brian on December 3, 2007

Last time we introduced the concept of monadic parser combinators, but got only a vague feel for what they are and how they work.  Today we will start from the ground up and lay the foundation for a cool parsing library.  We’ll see that just five little functions form a basis for parsing, and that we can build an entire expressive library of parsers and parser-combinators building atop just those five functions.  Along the way we’ll also explore some more aspects of C# 3.0, namely delegates and extension methods.

Representing parsers

To start things off, let’s describe the data type we’ll use to represent parsers.  If we consider a few examples, we can discover an effective representation type.  Consider the expression parser we looked briefly at last time.  We want to be able to take in a string like "40+2" and yield the result 42.  This one example in a vacuum suggests that parsers might be functions that take strings and return integers.  However it’s clear that returning an integer is specific to that particular example.  We might write another parser that will be able to parse "3.1+2.1" and yield 5.2, for example, in which case the return type would be float or double.  Or we might write a parser that, rather than evaluating an expression, just returns a parse tree representing the expression, so that parsing "40+2" yields a tree with a ‘+’ at the root and with two integer leaf nodes.  In each case, the input was a string, but the output depended on what exactly our parser intended to do.  So we might imagine a parser function signature looking something like this:

T Parse<T>(string input);

That is, a parser is a generic function, something that can take in a string and return a value of some arbitrary type.  This is a good first approximation.

Consider another issue.  We intend to build a parser combinator library, which means we intend to build parsers by combining simpler parsers.  For example, the expression parser that parses "40+2" and yields 42 will probably be built atop a simpler parser that just parses natural numbers – that is a parser that can just take in the string "40" and yield the integer 40.  The point is, while the top-level parser is designed to consume the entire string, we want to build it out of smaller component parsers that just eat the portions of the input that they know about.  So, if we feed the string "40+2" to a parser designed to parse just natural numbers, we would like it to return the result 40, but only consume the first two characters of the input.  We can accomplish this by having parsers return not just the result (40), but also return the remaining unparsed input ("+2").  That is, the return type should actually be:

public class ParseResult<T>
{
    public readonly T Result;
    public readonly string RemainingInput;
    public ParseResult(T r, string ri) { this.Result = r; this.RemainingInput = ri; }
}

Great.  One final issue is that a parser can fail.  For example if we hand our expression parser the string ";x%" or some other garbage, we want the result to indicate that the parse failed.  For now, we’ll just return null if the parse fails.  (In the future, we’ll come back and add more elaborate error-handling, so we can create customized error messages and point to the location of the failure in the original input string.)

Ok, so with all these ideas in mind, we can now describe a C# type that represents parsers.  I’ll call the type P, which stands for "Parser" (we’ll be using this type a lot, and so a very short name is good for bloggers like me who want to show you lots of code without worrying that the line lengths will get too long and scroll off the right-hand page margin).  P is generic, and since it is a function type, we represent it in C# as a delegate:

public delegate ParseResult<T> P<T>(string input);

And here are some examples of how we intend to be able to use that type:

P<int> expr = ...;
expr("40+2")  // yields { 42, "" }

P<int> nat = ...;
nat("40+2")   // yields { 40, "+2" }
nat(";x%")    // yields null

P<string> word = ...;
word("This is a sentence.")  // yields { "This", " is a sentence." }

Get it?  A parser of type P<T> takes in a string, and either succeeds (consuming some prefix of the string, and returning a result of type T as well as the remaining unparsed string), or fails (returning null).

(Note that throughout this blog entry, I’ll often have code snippets like the one above that have comments with values in curly braces.  The curly braces are just shorthand; { 40, "+2" } is an easy way for me to write a ParseResult<int> whose Result field is 40 and whose RemainingInput field is "+2".)

C# delegate types

Now that we know the representation type for parsers, it’s almost time to describe those five core functions I alluded to at the beginning of this blog entry.  But first, let’s take a quick aside to look at some C# delegate types that are new to the latest release.  As you may know, the .NET library has included a delegate type for predicates for a while:

public delegate bool Predicate<T>(T x);

An example use of this type would be

Predicate<char> isDigit = char.IsDigit;

The latest library release includes a new set of delegates called just Func, that represent arbitrary function signatures.  The general form is "Func<A1,A2,…,An,Result>" – that is, it’s a generic delegate type where the first few type parameters are the argument types of the function, and the final one is the result type.  So, for example, a function that compares two integers to test if the first is less than the second could be represented using Func like so:

Func<int, int, bool> less = (x, y) => x < y;

That is, a Func<int,int,bool> is a function that takes two integers and returns a boolean.  Here I’ve assigned it a lambda that computes "x less than y", so that, for example, less(3,4) yields true.

Note that Func<T, bool> describes the same shape function as Predicate<T>.
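A tiny sketch makes the equivalence (and the remaining distinction) concrete; the variable names here are invented for the example:

```csharp
using System;

class FuncVsPredicate
{
    static void Main()
    {
        // The same method can back either delegate type...
        Predicate<char> isDigitPred = char.IsDigit;
        Func<char, bool> isDigitFunc = char.IsDigit;
        Console.WriteLine(isDigitPred('7'));   // True
        Console.WriteLine(isDigitFunc('7'));   // True
        Console.WriteLine(isDigitFunc('x'));   // False

        // ...but the two delegate types are still distinct, so converting
        // between them requires re-wrapping, e.g. with a lambda:
        Predicate<char> rewrapped = c => isDigitFunc(c);
        Console.WriteLine(rewrapped('3'));     // True
    }
}
```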

These new function delegate types are important to know, since we’ll occasionally encounter them as we do more with monadic parser combinators.  We’ll see one in just a moment, as it’s time now to look at those five core functions that provide a basis for all of parsing.

A monadic basis for parsing

In linear algebra, a basis is a set of vectors, linear combinations of which span an entire vector space.  In the domain of parsing, we’re about to see that five core functions provide a similar kind of programming basis that spans the entire space of parsing.  That is, we can describe any parser using just combinations of these five core components.  Without further ado, here are the five functions:

public static P<T> Fail<T>();                             // zero
public static P<T> Return<T>(T x);                        // unit
public static P<U> Then<T, U>(P<T> p1, Func<T, P<U>> f);  // bind
public static P<T> Or<T>(P<T> p1, P<T> p2);               // plus
public static P<char> Item();

We’ll describe each one in turn in just a moment.  The first four functions are what make these parsers a monad.  Previously I said that a monad was a specific kind of abstract data type, but didn’t go into much detail.  Monads are a deep topic and very hard to grasp, so I don’t want to spend time talking about them now, other than to say that what all monads have in common is four functions with particular type signatures, and the first four functions here have those specific signatures.  Oftentimes you’ll see these monadic functions represented by the names I have in comments at the right.  However, for a parser library, the names I’ve chosen on the left make more sense, so we’ll stick with them.  Like most monads, this one supports, in addition to the core monad functions, some other functions that are specific to the underlying monad representation and the purpose of the monad.  In this case, Item is the only "extra" function we need for the parser monad.

Ok, let’s describe each function now.  Fail returns a parser that always fails, regardless of the input string.  Useful, huh?  So, for example:

P<int> failure = Fail<int>();
failure("anything")  // yields null

That was easy.  Return is almost as simple.  This function results in a parser that does not consume any input, but always succeeds, yielding the result value that was passed to it.  So, for example:

P<int> p = Return(42);
p("whatever")  // yields { 42, "whatever" }

We’re off to a running start; we’ve got a way to create a parser that always fails, and a way to create a parser that ignores the input and always succeeds, yielding a particular value.  Next up is the Then combinator, the most complicated one of the bunch.  Then is used to chain parsers in a sequence.  Let’s take a close look at its signature, which is a little complicated:

public static P<U> Then<T, U>(P<T> p1, Func<T, P<U>> f);

There are two arguments.  The first argument is a parser that yields a result of type T.  The second argument is a function; the function takes a T and returns a new parser that yields a U.  The result is a parser that yields a U.  Let’s give a rough description of the semantics of the resultant parser.  First we run parser p1 on the input.  If it fails, we’re done, and the parse fails.  If it succeeds, we take the resulting T value and pass it to function f, resulting in a new parser that yields a U.  We’ll run that parser on the remaining unparsed input (that’s left over after running p1).  Don’t worry if that description was hard to follow, we’ll see examples shortly that make things clearer.

Next let’s consider the Or function:

public static P<T> Or<T>(P<T> p1, P<T> p2);

This combinator takes two parsers of the same type and combines them into a new parser of that same type.  The new parser has simple semantics: first try running parser p1 on the input; if it succeeds, great, that’s the answer, else if it fails, then run parser p2 on the input instead.  This is the way we’ll express alternatives: "try parsing like this, but if that doesn’t work then try this instead".  Again, we’ll see examples shortly.

Finally, we have Item, the only one of the five that does any "real work".  The resultant parser consumes the first character of the input string and returns it.  Or, if the input string is empty, then this parser fails.  So, for example:

P<char> item = Item();
item("foo")  // yields { 'f', "oo" }
item("")     // yields null

That’s it!  With these five components, we’re now prepared to conquer any parsing challenge.  Let’s sum up:

  • Fail() returns a parser that always fails
  • Return() returns a parser that always succeeds with a given value, ignoring the input string
  • Then() is used for chaining parsers in a sequence (first parse this, then do this)
  • Or() returns a parser that tries two alternatives (first parse this, but if that fails, parse this instead)
  • Item() returns a parser that consumes the first character of the input string

Now let’s start authoring some more interesting parsers.

Some small examples

Time to write our first non-trivial parser.  Here it is:

public static P<char> Sat(Predicate<char> pred)
{
    return Then(Item(), c => pred(c) ? Return(c) : Fail<char>());
}
 
Sat(pred) is a character parser that considers the first character of the input.  If that character matches predicate pred, then the parse succeeds, yielding that character.  If the character doesn’t match the predicate (or if we’ve already reached the end of the input string), the parse fails.  Here’s an example of using Sat:
 
P<char> letter = Sat(char.IsLetter);
letter("foo")  // yields { 'f', "oo" }
letter("123")  // yields null
letter("")     // yields null

We used the predicate char.IsLetter to create a parser that parses letters.  If the first character of the input is a letter, the parse succeeds, returning that letter, along with the remaining unparsed input.  If the first character is not a letter, the parse fails.

Let’s trace through these examples.  When we call letter("foo"), here’s what happens.  Look inside the body of Sat.  First, we call Item.  This consumes the first character of the input, the ‘f’.  Next we call Then.  The way Then works is it binds the parse result of its first argument (‘f’) as the parameter to the function that is Then‘s second argument (in this instance, the lambda).  So now we’re inside the lambda expression, with variable c bound to the character value ‘f’.  We test the predicate, and it returns true, so now we call Return(c), which just returns that result (‘f’) without consuming any more input.

Similarly, when we call letter("123"), we call Item to parse off the ‘1’, Then binds that value to c, but this time the predicate returns false, so we call Fail and the parse fails.  And in the final example, when we call letter(""), the original Item call fails, and so the call to Then immediately fails without even considering its second argument.

It’s not important to fully understand the details of the logic when tracing through these calls yet.  I haven’t yet shown the bodies of functions like Then, so we can’t trace all the way through yet anyway.  (We’ll see all the code in its entirety at the end of this article.)  For now, you just need to have a sense of what’s going on.  Here’s another example; with each example you’ll get a little more intuitive sense of this.
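If you do want to trace it all the way through right now, here is a minimal, self-contained peek at the plumbing.  The bodies below mirror the full source at the end of this article; only the class name LetterTrace is invented for this sketch:

```csharp
using System;

public class ParseResult<T>
{
    public readonly T Result;
    public readonly string RemainingInput;
    public ParseResult(T r, string ri) { Result = r; RemainingInput = ri; }
}
public delegate ParseResult<T> P<T>(string input);

public static class LetterTrace
{
    public static P<T> Fail<T>() { return input => null; }
    public static P<T> Return<T>(T x) { return input => new ParseResult<T>(x, input); }
    public static P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
    {
        return input =>
        {
            ParseResult<T> r1 = p1(input);          // run the first parser
            if (r1 == null) return null;             // it failed: whole parse fails
            return f(r1.Result)(r1.RemainingInput);  // bind result, run the next parser
        };
    }
    public static P<char> Item()
    {
        return input => string.IsNullOrEmpty(input)
            ? null
            : new ParseResult<char>(input[0], input.Substring(1));
    }
    public static P<char> Sat(Predicate<char> pred)
    {
        return Item().Then(c => pred(c) ? Return(c) : Fail<char>());
    }

    public static void Main()
    {
        P<char> letter = Sat(char.IsLetter);
        Console.WriteLine(letter("foo").Result);          // f
        Console.WriteLine(letter("foo").RemainingInput);  // oo
        Console.WriteLine(letter("123") == null);         // True
        Console.WriteLine(letter("") == null);            // True
    }
}
```

Compile and run this, and you can set a breakpoint inside the body of Then and watch it bind ‘f’ to c for yourself.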

P<char> letter = Sat(char.IsLetter);
P<char> digit = Sat(char.IsDigit);
P<char> letterOrDigit = Or(letter, digit);
letterOrDigit("foo")  // yields { 'f', "oo" }
letterOrDigit("123")  // yields { '1', "23" }
letterOrDigit(";x%")  // yields null

The example here demonstrates a call to Or.  As a result, letterOrDigit is a parser that will first try letter, and if that succeeds, great, but if it fails, it will then try digit on the same input instead.  Get the idea?  Here’s another example of using Then:

P<int> two = Then(Sat(c => c == 't'), t =>
             Then(Sat(c => c == 'w'), w =>
             Then(Sat(c => c == 'o'), o =>
             Return(2))));
two("two bits")  // yields { 2, " bits" }
two("four")      // yields null

Here, two is a parser that recognizes the string "two" and results in the integer 2.  It does so by using Then to string together a number of parsers in sequence.  Each of the Sat calls recognizes an individual character.  Once we have consumed all three characters, we return the desired result (the integer 2).  By the way, this idiom of stringing together a number of Then calls and finally returning a result is a common one, and in monad-lingo the idiom is known as a comprehension.

Ok, hopefully you’re starting to grok how we can build useful parsers in this framework.  In a moment we’ll build a cute parser that has the behavior suggested here:

englishNumber(@"one hundred forty-six thousand 
                five hundred twenty-two widgets"); // yields { 146522, " widgets" }

But first let’s look at another new feature of C# that we can leverage to improve the syntax of our parser combinators a little.

C# extension methods

Extension methods are a kind of syntax sugar that allows static methods to appear as though they are member methods of the type of the first argument.  For example, if you want to "add a method to the string class", you can do something like this:

public static class StringExtensions
{
    public static string PigLatin(this string s, int n)
    {
        return s.Substring(n) + s.Substring(0, n) + "ay";
    }
}

Note the this keyword before the first parameter of the method – that’s what makes it an extension method.  As a result, you can now write code like this:

"foo".PigLatin(1)  // yields "oofay"
"star".PigLatin(2) // yields "arstay"
      

Cute.  Now that we know what extension methods are, let’s put them to use.  Two of our parser combinators, Then and Or, read better as extension methods.  That is, rather than the code on the left, we can write the code on the right:

Or(p1, p2)              p1.Or(p2)
Then(p, x => rest)      p.Then(x => rest)

just by adding the this keyword to the first parameter of the definition of each of these functions.  This makes the code read better in a few places already; let’s revisit a couple little parsers we wrote, and use the new extension method syntax:

P<char> letterOrDigit = letter.Or(digit);
P<int> two = Sat(c => c == 't').Then(t =>
             Sat(c => c == 'w').Then(w =>
             Sat(c => c == 'o').Then(o =>
             Return(2))));

The code now reads better: "letter or digit", "parse a ‘t’, then parse a ‘w’, …"  While we’re at it, let’s add one more combinator.  In the two parser, notice that we bind the result of each Sat call to a lambda parameter variable, but we don’t ever use those variables (like t, which gets bound to the value ‘t’).  As we’ll see, frequently we won’t need the value produced from parsing the left-hand-side of a Then.  As a result, we’ll add one more function:

public static P<U> Then_<T, U>(this P<T> p1, P<U> p2)
{
    return p1.Then(dummy => p2);
}

Then_ differs from Then in that Then_ discards the value produced by the first parser.  Incidentally, if you’ve still been having trouble grokking Then, you’ll probably find Then_ easier to understand.  The result of p1.Then_(p2) is a parser that first parses the input with p1 and then parses the remainder of the input with p2, yielding the same result as p2.  In contrast, Then does the same thing, except along the way it binds the result from p1 to a new lambda variable name.  Put another way, it’s easy to describe Then_ as a combinator that "sequences" two parsers.  Relative to that, Then is the same kind of thing, except that it remembers the first result (whereas Then_ forgets it).
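To see the contrast concretely, here is a small self-contained sketch; the class name ThenUnderscoreDemo and the parsers second and both are invented for this example, while the plumbing mirrors the full source at the end of this entry:

```csharp
using System;

public class ParseResult<T>
{
    public readonly T Result;
    public readonly string RemainingInput;
    public ParseResult(T r, string ri) { Result = r; RemainingInput = ri; }
}
public delegate ParseResult<T> P<T>(string input);

public static class ThenUnderscoreDemo
{
    public static P<T> Return<T>(T x) { return input => new ParseResult<T>(x, input); }
    public static P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
    {
        return input =>
        {
            ParseResult<T> r1 = p1(input);
            return r1 == null ? null : f(r1.Result)(r1.RemainingInput);
        };
    }
    public static P<U> Then_<T, U>(this P<T> p1, P<U> p2) { return p1.Then(dummy => p2); }
    public static P<char> Item()
    {
        return input => string.IsNullOrEmpty(input)
            ? null
            : new ParseResult<char>(input[0], input.Substring(1));
    }

    public static void Main()
    {
        // Then_ forgets the first result: parse two chars, keep only the second
        P<char> second = Item().Then_(Item());
        // Then remembers it: parse two chars, combine both into a string
        P<string> both = Item().Then(c1 => Item().Then(c2 => Return("" + c1 + c2)));
        Console.WriteLine(second("ab").Result);  // b
        Console.WriteLine(both("ab").Result);    // ab
    }
}
```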

Let’s move on now and bring this all together to create our first non-trivial parser that does something interesting, namely parse numbers spelled out in English.

English number parser

We’ve already seen a parser that parses the word "two" and returns the number 2.  Let’s generalize that parser a little into this function:

public static P<int> Literal(string toParse, int result)
{
    if (toParse == "")
        return Parser.Return(result);
    else
        return Sat(c => c == toParse[0]).Then_(Literal(toParse.Substring(1), result));
}

Now it’s easy to create parsers for other numbers:

P<int> three = Literal("three", 3);
three("three!")  // yields { 3, "!" }
three("blah")    // yields null

The Literal function is recursive.  It takes in a literal string we want to parse, and parses it letter by letter, consuming one character of input and making a recursive call with the rest.  Once we’ve parsed the whole string, we return the value passed in to the original Literal call.  If at any point along the way a Sat call fails, the whole parse fails (yielding a null result for parsing that input string).

Now we can start making parsers for lots of numbers:

public static P<int> OneToNine =
    Literal("one", 1).Or(
    Literal("two", 2).Or(
    Literal("three", 3).Or(
    Literal("four", 4).Or(
    Literal("five", 5).Or(
    Literal("six", 6).Or(
    Literal("seven", 7).Or(
    Literal("eight", 8).Or(
    Literal("nine", 9)))))))));

public static P<int> TenToNineteen = 
    Literal("ten", 10).Or(
    Literal("eleven", 11).Or(
    Literal("twelve", 12).Or(
    Literal("thirteen", 13).Or(
    Literal("fourteen", 14).Or(
    Literal("fifteen", 15).Or(
    Literal("sixteen", 16).Or(
    Literal("seventeen", 17).Or(
    Literal("eighteen", 18).Or(
    Literal("nineteen", 19))))))))));

Why did we split these up into two different parsers?  Because we plan to reuse OneToNine, of course.  Consider how you write numbers in the 20s: twenty, twenty-one, twenty-two, twenty-three, … Though we could write a function that just knows about the 20s, the same pattern generalizes to the 30s, 40s, … to the 90s.  Here’s our next function, along with a sample use:

public static P<int> HigherTens(string tenName, int tenValue)
{
    return Literal(tenName, tenValue).Then(tens =>
           (
               Sat(c => c == '-').Then_(
               OneToNine.Then(ones => 
               Return(tens + ones)))
           ).Or(Return(tens)));
}
P<int> thirties = HigherTens("thirty", 30);
thirties("thirty!")         // yields { 30, "!" }
thirties("thirty-two!")     // yields { 32, "!" }

Do you see how HigherTens works?  First we parse a literal string (e.g. "thirty"), and bind the value (30) to the lambda parameter tens.  Then, either (1) we parse a hyphen, then parse a OneToNine, binding its value to variable ones, and then return the sum of tens and ones, or else if that fails we just (2) return tens.  This function is our first really nice example of using combinators (like Then and Or) to create an interesting new parser out of some simpler parsers.  It’s easy to keep going along these lines to create a parser that goes up to a million or more.  Rather than belabor the point by talking about all the rest of the code, just check out the full source code for it at the end of this blog entry.
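For instance, continuing the pattern one more step might look like the following self-contained sketch of a hypothetical Hundreds parser.  It is deliberately abridged: OneToNine is truncated to three entries, and the tail after "hundred" only handles a ones word rather than the full tens logic, but it shows how the same combinators keep composing:

```csharp
using System;

public class ParseResult<T>
{
    public readonly T Result;
    public readonly string RemainingInput;
    public ParseResult(T r, string ri) { Result = r; RemainingInput = ri; }
}
public delegate ParseResult<T> P<T>(string input);

public static class HundredsSketch
{
    public static P<T> Fail<T>() { return input => null; }
    public static P<T> Return<T>(T x) { return input => new ParseResult<T>(x, input); }
    public static P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
    {
        return input =>
        {
            ParseResult<T> r1 = p1(input);
            return r1 == null ? null : f(r1.Result)(r1.RemainingInput);
        };
    }
    public static P<U> Then_<T, U>(this P<T> p1, P<U> p2) { return p1.Then(dummy => p2); }
    public static P<T> Or<T>(this P<T> p1, P<T> p2)
    {
        return input =>
        {
            ParseResult<T> r1 = p1(input);
            return r1 != null ? r1 : p2(input);
        };
    }
    public static P<char> Item()
    {
        return input => string.IsNullOrEmpty(input)
            ? null
            : new ParseResult<char>(input[0], input.Substring(1));
    }
    public static P<char> Sat(Predicate<char> pred)
    {
        return Item().Then(c => pred(c) ? Return(c) : Fail<char>());
    }
    public static P<int> Literal(string toParse, int result)
    {
        if (toParse == "") return Return(result);
        return Sat(c => c == toParse[0]).Then_(Literal(toParse.Substring(1), result));
    }

    static P<int> OneToNine =
        Literal("one", 1).Or(Literal("two", 2).Or(Literal("three", 3)));  // abridged

    // parse e.g. "two hundred three", or just "two hundred"
    static P<int> Hundreds =
        OneToNine.Then(h =>
        Literal(" hundred", 0).Then_(
        (
            Sat(c => c == ' ').Then_(
            OneToNine.Then(ones =>
            Return(h * 100 + ones)))
        ).Or(Return(h * 100))));

    public static void Main()
    {
        Console.WriteLine(Hundreds("two hundred three!").Result);  // 203
        Console.WriteLine(Hundreds("one hundred!").Result);        // 100
    }
}
```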

Summary

Let’s sum up what we’ve discussed today.  We learned about five functions that form a basis for a monadic parser combinator library:

  • Fail() returns a parser that always fails
  • Return(x) returns a parser that always succeeds with the given value, ignoring the input string
  • Then(p1, f) is used for chaining parsers in a sequence (first parse this, then do this)
  • Or(p1, p2) returns a parser that tries two alternatives (first use p1, but if that fails, use p2 instead)
  • Item() returns a parser that consumes the first character of the input string

We wrote a few more generally useful functions, like

  • Sat(pred) which parses a character only if the character matches a predicate
  • Then_(p1, p2) which runs two parsers in sequence, discarding the first result
  • Literal(str, val) which parses a literal string and returns a given value

and then created a parser that can parse English numbers from one to a million.  Along the way, we also learned some new C# features, including the Func delegate types for representing arbitrary function shapes, and a syntax sugar known as extension methods.

Looking ahead

There’s plenty more to explore with monadic parser combinators.  In my next blog entries, I know I want to cover a few more topics, including

  • a walk through the "plumbing details" of functions like Then and Or
  • the numeric expression parser evaluator, along with fun combinators like Many and Chainl1
  • how to enhance the parser representation to deal with error handling

And who knows what else.  I only hope you’ll enjoy reading it as much as I enjoy writing it!

Full source code

No doubt some of you are itching to see the code for functions like Then, so as to gain a better understanding of the "plumbing".  Here is the full code used in today’s blog entry.  Note that the main parser functions are in static class Parser, which means some code says Parser.foo whereas the code snippets in the blog just said foo.  Apart from that, the code should all match up with the blog. Enjoy!

using System;

namespace MPC_Part2
{
    public class ParseResult<T>
    {
        public readonly T Result;
        public readonly string RemainingInput;
        public ParseResult(T r, string ri) { this.Result = r; this.RemainingInput = ri; }
        public override bool Equals(object obj)
        {
            ParseResult<T> other = obj as ParseResult<T>;
            if (other == null)
                return false;
            if (!object.Equals(other.Result, this.Result))
                return false;
            if (!object.Equals(other.RemainingInput, this.RemainingInput))
                return false;
            return true;
        }
        public override int GetHashCode()
        {
            return this.RemainingInput.GetHashCode();
        }
    }

    // representation type for parsers
    public delegate ParseResult<T> P<T>(string input);

    public static class Parser
    {
        // core functions that know the underlying parser representation
        public static P<T> Fail<T>()
        {
            return input => null;
        }
        public static P<T> Return<T>(T x)
        {
            return input => new ParseResult<T>(x, input);
        }
        public static P<U> Then<T, U>(this P<T> p1, Func<T, P<U>> f)
        {
            return input =>
                {
                    ParseResult<T> result1 = p1(input);
                    if (result1 == null)
                        return null;
                    else
                        return f(result1.Result)(result1.RemainingInput);
                };
        }
        public static P<T> Or<T>(this P<T> p1, P<T> p2)
        {
            return input =>
                {
                    ParseResult<T> result1 = p1(input);
                    if (result1 == null)
                        return p2(input);
                    else
                        return result1;
                };
        }
        public static P<char> Item()
        {
            return input =>
                {
                    if (string.IsNullOrEmpty(input))
                        return null;
                    else
                        return new ParseResult<char>(input[0], input.Substring(1));
                };
        }
        // other handy functions
        // Then_: sequence two parsers, discarding the first result
        public static P<U> Then_<T, U>(this P<T> p1, P<U> p2)
        {
            return p1.Then(dummy => p2);
        }
    }

    // example of extension methods
    public static class StringExtensions
    {
        public static string PigLatin(this string s, int n)
        {
            return s.Substring(n) + s.Substring(0, n) + "ay";
        }
    }

    // various running examples from the blog
    class Program
    {
        static void AssertEqual<T>(T x, T y)
        {
            if (!object.Equals(x, y))
                throw new Exception("not equal");
        }
        public static P<char> Item = Parser.Item();
        public static P<char> Sat(Predicate<char> pred)
        {
            return Item.Then(c => pred(c) ? Parser.Return(c) : Parser.Fail<char>());
        }

        public static P<char> Letter = Sat(char.IsLetter);

        public static P<int> Literal(string toParse, int result)
        {
            if (toParse == "")
                return Parser.Return(result);
            else
                return Sat(c => c == toParse[0]).Then_(Literal(toParse.Substring(1), result));
        }

        public static P<int> OneToNine =
            Literal("one", 1).Or(
            Literal("two", 2).Or(
            Literal("three", 3).Or(
            Literal("four", 4).Or(
            Literal("five", 5).Or(
            Literal("six", 6).Or(
            Literal("seven", 7).Or(
            Literal("eight", 8).Or(
            Literal("nine", 9)))))))));

        public static P<int> TenToNineteen = 
            Literal("ten", 10).Or(
            Literal("eleven", 11).Or(
            Literal("twelve", 12).Or(
            Literal("thirteen", 13).Or(
            Literal("fourteen", 14).Or(
            Literal("fifteen", 15).Or(
            Literal("sixteen", 16).Or(
            Literal("seventeen", 17).Or(
            Literal("eighteen", 18).Or(
            Literal("nineteen", 19))))))))));
        
        public static P<int> HigherTens(string tenName, int tenValue)
        {
            return Literal(tenName, tenValue).Then(tens =>
                   (
                       Sat(c => c == '-').Then_(
                       OneToNine.Then(ones => 
                       Parser.Return(tens + ones)))
                   ).Or(Parser.Return(tens)));
        }

        public static P<int> OneToNinetyNine = 
            OneToNine
            .Or(TenToNineteen)
            .Or(HigherTens("twenty", 20))
            .Or(HigherTens("thirty", 30))
            .Or(HigherTens("forty", 40))
            .Or(HigherTens("fifty", 50))
            .Or(HigherTens("sixty", 60))
            .Or(HigherTens("seventy", 70))
            .Or(HigherTens("eighty", 80))
            .Or(HigherTens("ninety", 90));

        public static P<char> Space = Sat(char.IsWhiteSpace);

        public static P<int> OneHundredTo999 =
            OneToNine.Then(hundreds =>
            Space.Then_(
            Literal("hundred", 0).Then_(
            Space.Then_(
            OneToNinetyNine.Then(rest =>
            Parser.Return(hundreds * 100 + rest))))));

        public static P<int> OneTo999 = OneHundredTo999.Or(OneToNinetyNine);

        public static P<int> OneTo999999 =
                OneTo999.Then(thousands =>
                Space.Then_(
                Literal("thousand", 0).Then_(
                Space.Then_(
                OneTo999.Then(rest =>
                Parser.Return(thousands * 1000 + rest))))))
            .Or(OneTo999);

        static void Main(string[] args)
        {
            Console.WriteLine("foo".PigLatin(1));
            Console.WriteLine("star".PigLatin(2));

            Predicate<char> isDigit = char.IsDigit;
            Func<int, int, bool> less = (x, y) => x < y;

            AssertEqual(Item(""), null);
            AssertEqual(Item("foo"), new ParseResult<char>('f', "oo"));
            P<char> letter = Sat(char.IsLetter);
            P<char> digit = Sat(char.IsDigit);
            AssertEqual(letter("foo"), new ParseResult<char>('f', "oo"));
            AssertEqual(digit("foo"), null);
            P<char> letterOrDigit = letter.Or(digit);
            AssertEqual(letterOrDigit("foo"), new ParseResult<char>('f', "oo"));
            AssertEqual(letterOrDigit("123"), new ParseResult<char>('1', "23"));
            AssertEqual(letterOrDigit(";x%"), null);

            P<int> two = Sat(c => c == 't').Then_(
                         Sat(c => c == 'w').Then_(
                         Sat(c => c == 'o').Then_(
                         Parser.Return(2))));
            AssertEqual(two("two bits"), new ParseResult<int>(2, " bits"));
            AssertEqual(two("four"), null);

            P<int> three = Literal("three", 3);
            AssertEqual(three("three!"), new ParseResult<int>(3, "!"));
            P<int> thirties = HigherTens("thirty", 30);
            AssertEqual(thirties("thirty!"), new ParseResult<int>(30, "!"));
            AssertEqual(thirties("thirty-two!"), new ParseResult<int>(32, "!"));
            
            AssertEqual(OneToNinetyNine("twenty!"), new ParseResult<int>(20, "!"));
            AssertEqual(OneToNinetyNine("twenty-five!"), new ParseResult<int>(25, "!"));
            // Note how the next two get the 'wrong' answer.  We'll talk about
            // a combinator called NotFollowedBy, which allows for lookahead, 
            // in a future installment, rather than fix this now.
            AssertEqual(OneToNinetyNine("seventeen!"), new ParseResult<int>(7, "teen!"));
            AssertEqual(OneToNinetyNine("eighty!"), new ParseResult<int>(8, "y!"));

            P<int> englishNumber = OneTo999999;
            AssertEqual(englishNumber("one."), new ParseResult<int>(1, "."));
            AssertEqual(englishNumber("one hundred six."), new ParseResult<int>(106, "."));
            AssertEqual(englishNumber("one hundred forty-six thousand five hundred twenty-two widgets"), new ParseResult<int>(146522, " widgets"));

            Console.WriteLine("done, press a key");
            Console.ReadKey();
        }
    }
}


C# 3.0, lambda, and the first post of a series about “monadic parser combinators”

Posted by Brian on December 2, 2007

Where to begin?  Today I’ll start briefly with a little background about me.  Then I want to describe the new lambda feature of C# 3.0 and why it is good. Finally I’ll introduce "monadic parser combinators" as a cool application of the new C# 3.0 features – I really enjoy parser combinator libraries, and so they are my current motivation for writing a series of blog entries about code-related stuff.

Some quick background on Brian

Since this is a new blog for me, I should write a tiny bit of relevant history about myself.  In my previous life, I was a programming languages Ph.D. student who focused on multi-paradigm programming.  Much of my work in grad school went into researching and developing a library called FC++, a C++ template library that enabled functional programming in the style of modern functional programming languages like Haskell.  I think the work I did in school is indicative of a more general trend.

For a while now, it seems that pragmatic real-world programming has been dominated by object-oriented languages, whereas functional languages have been outside the mainstream, used mostly in academia.  But the trend is that more and more functional features are appearing in mainstream object-oriented languages.  I am glad for this trend, as the functional programming style often makes it possible to raise the abstraction bar to a higher level, realize new kinds of designs, and author more succinct, precise code.

Lambda and C# 3.0

Ok, let’s talk some code.  One of the most important features common to any functional programming language is lambda.  In a functional language, functions are first-class values, and anonymous functions can be consed up on the fly using lambda syntax.  Both the choice of syntax and the anonymity matter a lot.  Consider, for example, a few different ways to write a particular function.  First, in C#, without delegates:

public static int TenXPlusY(int x, int y) 
{ return 10 * x + y; }

Then in C# 2.0, using delegates:

delegate(int x, int y) { return 10*x+y; }

Next, in C# 3.0 using the new lambda feature:

(x, y) => 10*x+y

And finally, for comparison, the same function expressed in Haskell (a popular functional language):

\x y -> 10*x+y

Lambdas differ from the other options in two main ways: anonymity, and succinct syntax.

Anonymity is important, because in functional programming, one often codes in a style that involves using combinators and higher-order functions (functions that take other functions as parameters).  When coding in this style, you end up writing a lot of tiny one-off functions that don’t need or deserve a name.  Being forced to use named functions clutters up the code and puts needless physical distance between the function definition and its sole use.  To be sure, in functional languages you still author plenty of named functions.  The point is that some functions need names and reuse, whereas others do not, and so it is important for a language to support both styles.
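To make this concrete, here is a small sketch (mine, not from the original post) contrasting the two styles with List&lt;T&gt;.FindAll, a higher-order method that takes a predicate function as a parameter:

```csharp
using System;
using System.Collections.Generic;

class AnonymityDemo
{
    // named-function style: a tiny one-off predicate that must live somewhere else
    static bool IsShort(string s) { return s.Length <= 3; }

    static void Main()
    {
        List<string> words = new List<string> { "a", "lambda", "is", "succinct" };

        // C# 2.0 style: refer to the named method defined above
        List<string> short1 = words.FindAll(IsShort);

        // C# 3.0 style: the predicate is anonymous and sits at its sole point of use
        List<string> short2 = words.FindAll(s => s.Length <= 3);

        Console.WriteLine(string.Join(",", short1.ToArray()));  // a,is
        Console.WriteLine(string.Join(",", short2.ToArray()));  // a,is
    }
}
```

Both calls compute the same thing; the lambda version just avoids giving a name (and a separate physical location) to a predicate that is used exactly once.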

Succinctness of syntax is the other important detail.  Some people may argue that syntax details of a language are just that – details – and those details don’t qualitatively affect the use of a language.  Those people are wrong.  :)  Syntax is an important consideration; the most commonly used features of a language must be succinct.  To illustrate the point at the extreme, imagine if instead of the code on the left you had to use the code on the right:

x+3               Plus(var x, literal integer 3)
obj.ToString()    select method ToString from obj and invoke with parameters: ()

The point of this contrived example: when common language constructs are verbose, simple everyday code becomes a complex deluge of noise.  This is what makes the new lambda capabilities of C# 3.0 so notable to me; there is a big difference between these two forms:

delegate(int x, int y) { return 10*x+y; }   // C# 2.0
(x, y) => 10*x+y                            // C# 3.0

The obvious difference is the succinctness, which I’ve already demonstrated the value of.  The other notable difference is that the new lambda does not require redundant type annotations.  Static typing is an invaluable asset, but statically-typed languages often go overboard when it comes to requiring the programmer to decorate every function, parameter, and variable with a type annotation.  When source code is riddled with too many type annotations, they become a noisy hindrance to readability.
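As a quick sketch of the inference at work (the variable names here are mine, not from the post), the same function can be bound to a Func&lt;int, int, int&gt; delegate three ways, and only the lambda form lets the parameter types be left off entirely:

```csharp
using System;

class InferenceDemo
{
    static void Main()
    {
        // C# 2.0: parameter types must be repeated inside the delegate expression
        Func<int, int, int> f1 = delegate(int x, int y) { return 10 * x + y; };

        // C# 3.0: parameter types inferred from the Func<int, int, int> target type
        Func<int, int, int> f2 = (x, y) => 10 * x + y;

        // explicit annotations are still allowed on lambdas when you want them
        Func<int, int, int> f3 = (int x, int y) => 10 * x + y;

        Console.WriteLine(f1(4, 2));  // 42
        Console.WriteLine(f2(4, 2));  // 42
        Console.WriteLine(f3(4, 2));  // 42
    }
}
```

The annotations live in exactly one place (the declared delegate type), and the compiler propagates them into the lambda.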

Applications – introducing monadic parser combinators

Though it is fun to wear the hat of the hobbyist programmer and tool around with the new features in C# 3.0, at the end of the day I would like to be at least a little pragmatic and talk about useful applications of those features.  First I must pause briefly to note that programming language design itself requires a great deal of balance among pragmatics, aesthetics, and complexity; language design is tricky business, and I think the C# guys have done a good job incrementally delivering important features cleanly into the language.  Here’s a quote from a blog from someone on the C# team; I think it captures the spirit of what I’m trying to communicate about the pragmatics of language design:

We’re not adding lambdas because we’re a bunch of functional language wonks who love lambdas — I mean, we are that, but that’s not why we’re adding lambdas.  We’re adding lambdas because lambdas will help massively with query comprehensions, and we’re adding query comprehensions because our research shows that pro devs could really use a language-integrated approach to querying disparate datasets without losing the benefits of static typing.

Nice.  (Check out the various documents about LINQ to learn more about query comprehensions.)

Ok, so let’s talk about a cool application of the new C# features: they make it easier to author monadic parser combinators.  What in the world are monadic parser combinators? Let’s look at each individual word.  Monads are a concept from functional languages like Haskell – they are a specific kind of abstract data type with certain properties, which I won’t go into the details of now.  A Parser is basically a function that takes as input a string (or more generally, a stream of tokens), and returns the result of parsing the input (which might be a parse tree, or any other evaluation of the input).  A Combinator is a function that combines other functions to create new ones.  And thus, a monadic parser combinator library enables a programmer to assemble complex parsers out of simpler ones, via a library of generally useful parser combinators, which manipulate parsers that are represented using a specific sort of abstract data type.  Clear as mud, right?
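To make those words concrete, here is a minimal self-contained sketch of the idea, specialized to int results for brevity (the library in this series makes the result type generic; the Digit and zero parsers here are invented for illustration):

```csharp
using System;

class MiniParse
{
    // a Parser: a function from the input string to a result plus the
    // remaining input, with null signaling failure
    delegate ParseResult P(string input);
    class ParseResult
    {
        public int Result; public string Rest;
        public ParseResult(int r, string rest) { Result = r; Rest = rest; }
    }

    // a primitive parser: consume one decimal digit
    static ParseResult Digit(string input)
    {
        if (input.Length > 0 && char.IsDigit(input[0]))
            return new ParseResult(input[0] - '0', input.Substring(1));
        return null;
    }

    // a combinator: build a new parser that tries p1, falling back to p2
    static P Or(P p1, P p2)
    {
        return input => { ParseResult r = p1(input); return r != null ? r : p2(input); };
    }

    static void Main()
    {
        P zero = input => input.StartsWith("zero")
            ? new ParseResult(0, input.Substring(4)) : null;
        P digitOrZero = Or(Digit, zero);
        Console.WriteLine(digitOrZero("7!").Result);     // 7
        Console.WriteLine(digitOrZero("zero!").Result);  // 0
        Console.WriteLine(digitOrZero("x") == null);     // True
    }
}
```

This mirrors the shape of the real library's P&lt;T&gt; delegate and its Or combinator; the full library just generalizes the result type and adds Then, Return, and friends.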

Let’s look at an example.  Consider a tiny expression parser, which can parse and evaluate numeric expressions like this one:

"(100+41+1-42)/10-5*2"      // evaluates to 0

That is, the language we want to parse knows about natural numbers, the four main math operators with the usual precedence, and parentheses.  A grammar for this little language would look something like this:

expression ::= term (addOp term)*
term ::= factor (mulOp factor)*
factor ::= natural | group
group ::= '(' expression ')'

It turns out that with a suitable library, we can write C# code for this parser-evaluator that looks like this:

Expr   = Term.Chainl1(AddOp);
Term   = Factor.Chainl1(MulOp);
Factor = Nat.Or(Group);
Group  = Parser.Literal("(").Then_(
         Expr.Then(e =>
         Parser.Literal(")").Then_(
         Parser.Return(e))));

Note the similarity to the original language grammar.  We end up with an expression parser that works like this:

Expr.Parse("(100+41+1-42)/10-5*2")  // returns 0

Here, entities like "Expr", "Term", and "Nat" are parsers that take a string and evaluate it down to an integer.  The various functions like "Chainl1", "Or" and "Then" are the parser combinators, which build up complex parsers out of simpler ones.  I’ve left out the definitions of some of the parsers for this expression evaluator (like "AddOp" and "Nat") to keep things simple for now.  (In fact, I posted the code for "Nat", which parses a natural number, in my previous blog entry.)  But hopefully you at least get some vague sense of what parser combinators are, based on this quick glimpse at an example.  There is still a lot of mystery, but it’s time to quit writing for today.

Over the next few days I intend to write some more blog entries that delve into the details of monadic parser combinators.  Apart from being an interesting application to show off new features of C# in practice, a monadic parser combinator library demonstrates an effective way to rapidly author parsers, letting you succinctly code at a very high level of abstraction (on par with the grammar of the language being parsed).  The other chief advantage of parser combinators is that they are expressed as a library in a familiar host language (e.g. C#), rather than being expressed in a separate language (with its own toolset) that then generates parsers in a host language (as with such tools as yacc or ANTLR).

If you want to preview some of the material I intend to present, you can check out a talk I gave a few years back about monadic parser combinators in C++ using FC++.  Or check out the Parsec library for parsing in Haskell.  Regardless, stay tuned to this blog space to learn about monadic parsing in C#.

Fun!  :)
