Inside F#

Brian's thoughts on F# and .NET

Monadic parser combinators, part four

Posted by Brian on December 6, 2007

Last time we developed some useful combinators that enabled us to easily author some useful little parsers, including an expression evaluator.  Today we’ll focus on error messages, so that if a parse fails, we get back something more useful than just "null".  Along the way we’ll look into some of the guts of the implementation, and make some alterations to our parser "basis" – the small set of core functions that knows the parser implementation.

When parsers fail

Last time we created an expression parser that could parse expressions with natural numbers, the four main math operators as well as right-associative exponentiation, and parentheses.  Thus:

expr("(611-2^3^2+1)/10-5*2")

would succeed in parsing the string and evaluate it to an answer (in this case, zero).  But what happens when we have a bad input string?  "Bad" is a relative term; someone unfamiliar with our little expression language/grammar might reasonably guess that negative numbers or floating-point values were legal, and try strings like these:

expr("(611-2^-3^2+1)/10-5*2"))
expr("(611-2^3^2.2+1)/10-5*2"))

The result?  Our expr parser will fail, yielding the oh-so-helpful result "null".  Clearly we need to improve things so we can fail more gracefully.

There is a huge design space for error diagnostics/recovery in parsers.  For today, we’ll stick with modest ambitions.  When a parse fails, we’d like to be able to point to the location of the failure in the input string and report what we expected to see there instead.  By the end of this blog entry, we’ll be able to get error messages like these

Failure: At position 7, unexpected character '-', 
expected character '(' or natural number
Failure: At position 10, unexpected character '.', 
expected character ')', add/subtract op, multiply/divide op or exponentiation op

from our little expression parser, with only tiny changes to the client code.  Nothing comes free, however, and enabling this kid of useful error diagnostics will require substantially rewriting the parser representation and the basis (the five core functions).  Doing so will provide us with a nice opportunity to investigate more of what’s going on under the hood of the core parser library.

(I’ll emphasize that for the most part I am just mimicking Haskell’s Parsec library… for a good discussion of the error-handling design, check out this Parsec paper.)

Predictive parsers

A predictive parser is a parser that uses a specific, finite amount of lookahead to choose among alternatives.  (This is in contrast to a fully backtracking parser, which after continuing an arbitrarily long way down a given alternative, will back up – through the entire input if necessary – in order to find a legal alternative parse.)  For example, consider parsing a statement of C# code.  There are a lot of alternatives for statements, among them: if statements, while statements, and switch statements.  We only need a single keyword/token’s worth of lookahead in order to decide which of those alternatives can succeed.  That is, if we had defined a parser along the lines of

stmt = ifStmt.Or(whileStmt).Or(switchStmt)

and we encounter input that begins "if (x ==…", then once we have recognized the "if ", we really don’t need to consider the while or switch alternatives.  However the Or combinator we have seen so far has no way to know this.  So if the ifStmt parse fails for some reason, the stmt parser will backtrack and dutifully try the whileStmt and switchStmt alternatives.  That is, our current parser library implementation is backtracking. 

(Note: In the previous paragraph I used the word "token" for the first time.  A token is the atomic unit of input – the input to a parser is just a list of tokens.  For the parsers we’ve been dealing with so far, the input is a string, and the tokens are the individual characters.  In many real-world parsers, there is a separate pre-processing step that groups individual characters into tokens that represent the atomic entities of the language – things like keywords, identifiers, string literals, etc.  In a future installment we’ll see how to generalize our parser library so, rather than assuming a string of characters as input, the parser will accept a stream of an arbitrary user-specified token type.)

Backtracking makes a parser very flexible, but it also causes problems. Backtracking indiscriminately is clearly undesirable for performance reasons, and it also makes it tougher to generate good error diagnostics.  If you manage to parse your way down the input string 1000 characters or so before encountering an error, then not only is backtracking all the way to the beginning of the input unlikely to yield a better parse, but it may cause you to lose useful local context surrounding the problem, making it more difficult to yield useful error diagnostics.  (For example, if you encounter an error while parsing an "if" statement, you don’t want to back out of it and end up issuing an error message involving "while" or "switch" statements.)  When there is an error in the input, we’d like to be able to point at the location of the problem with specific, local information about what failed.

On the other hand, it is unreasonable to want a parser to be completely predictive, as most grammars (including those for most programming languages) have certain features/aspects that require an arbitrary amount of lookahead.  So there is a tension between predictiveness (which enables good performance and diagnostics) and the ability to backtrack (without which, many languages could not be recognized by a parser).

The design of the Parsec library illustrates one good way to mediate this tension.  Parsec’s parser library defaults to being predictive based on just one token of lookahead, but allows the user to selectively enable arbitrary amounts of lookahead for backtracking when necessary.  Today we’ll re-implement the core parser library functions with this design.

Looking at the Or combinator, and introducing the Try combinator

We haven’t looked inside the Or combinator yet, but now is a good time, so as to understand the issue at hand.  Here’s Or:

// p1.Or(p2) tries p1, but if it fails, runs p2 on the original input instead
public static P<T> Or<T>(this P<T> p1, P<T> p2)
{
    return input =>
    {
        ParseResult<T> result1 = p1(input);
        if (result1 == null)
            return p2(input);
        else
            return result1;
    };
}

The body of the function just returns a big lambda – recall that a parser of type "T" (a P<T>) is just a function that takes in a string and returns a ParseResult<T> (which is a structure containing the result of the parse, as well as the remaining unparsed input).  And so the lambda tries parsing the input with p1, but if that fails (returns null), then it tries p2 on the same input.

What we’ll do is change the semantics, so that p2 is only tried if p1 fails without consuming any input.  If p1 progressed some ways down the string and then failed, then typically we don’t want to try p2.  From the previous example, if p1 is the ifStmt parser and p2 is the whileStmt parser, and we already parsed the "if" with p1, then even if p1 eventually fails, we have no interest in trying p2.  On the other hand, if p1 fails without consuming any input (perhaps because the input string had a prefix that the ifStmt parser would immediately fail to match on the first token… for example, the string "while"), then of course we still want to try alternative p2.  Making this change to Or makes everything completely predictive.

To allow for selective backtracking, we’ll add a new combinator called Try:

// Try(p) behaves like p, except if p fails, it pretends it hasn't consumed any input
public static P<T> Try<T>(this P<T> p)

That is, Try(p) is a parser that’s just like p, except that when it fails, it will tell a lie and claim that it didn’t consume any input even if it did.  Phrased another way, Try lies by saying that it discovered it failed to parse the input by looking at just the first token, even though in fact it may have parsed arbitrarily far ahead in the token stream.  Given the semantics of Try, and the new semantics we shall give Or:

// p1.Or(p2) tries p1, but if it fails without consuming input, runs p2 instead
public static P<T> Or<T>(this P<T> p1, P<T> p2)

we can see how to do backtracking:

// old version
p1.Or(p2)           // always backtracks
// new version
p1.Or(p2)           // doesn't backtrack if p1 consumes input
Try(p1).Or(p2)      // backtracks any time p1 fails, just like the old version

We’ll see examples of using Try later on.  For now, we just need a sense of where we are headed.

New representation types for parsers

Ok, so to enable better error diagnostics, we’ll change the representation and implementation of the core parser library in a few ways.  We need to add a way to represent error messages (e.g. "unexpected character ‘-‘") and expectations (e.g. "expected natural number"), as well as locations within the input.  We also need to keep track of which parsers have consumed input and which haven’t.  So we create some new data types and update the parser delegate type P<T> to utilize the new types.  Here’s the skeleton code that describes the structure; we’ll discuss each part in turn.

using Position = System.Int32;

public class ParserState
{
    public Position Position;
    public string Input;
}

public class ErrorInfo
{
    public Position Position;
    public IEnumerable<string> Expectations;
    public string Message;
}

public class ParseResult<T>
{
    public bool Succeeded;
    // if Succeeded, all fields below are used, else only ErrorInfo is used
    public ErrorInfo ErrorInfo;
    public T Result;
    public ParserState RemainingInput;
}

public class Consumed<T>
{
    public bool HasConsumedInput;
    public ParseResult<T> ParseResult;
}

// representation type for parsers
public delegate Consumed<T> P<T>(ParserState input);

First, we define a type alias for int called Position, which represents a location within the input (the nth character of the input).  Using a type alias here is good, because in the future we might enhance the representation of positions in the input (for instance, to deal with newlines, we may keep track of both the line number and the column number), and it is easier to make such changes by looking through the code for every occurrence of "Position", rather than by looking for "int" and trying to determine which integers were representing positions and which ones were doing other tasks.

ParserState will be the new input type for the parser.  It records the input string as well as the current location.  (The previous version used only a string as input, and repeatedly called "input.Substring(1)" as characters were consumed one by one.)

ErrorInfo represents error diagnostic information.  In addition to the location of the error, we keep track of a list of expectations (that is, the list of things that could have validly come next at the point of error, such as "expected character ‘(‘ or natural number") as well as an arbitrary message describing the problem (e.g. "unexpected character ‘-‘").

ParseResult<T> is modified to include an ErrorInfo field as well as a boolean representing whether or not the parse succeeded.  Whereas previously parsers returned null on failure, now failed parses will return a ParseResult with Succeeded set to false.

Consumed<T> wraps the ParseResult information in another structure with an additional boolean that says whether or not the parser consumed any input (advanced through the input string).  As described above, this information will be useful to the Or and Try combinators to do predictive parsing with selective backtracking.

Finally, a parser P<T> is now a function that takes in a ParserState and returns a Consumed<T>.

These new classes have some additional constructors and methods which I’ve elided from the code snippet above.  Moving forward, as we encounter any interesting bits involving those members, I’ll call them out.

A small change to the client interface

Up until now, we could run a parser simply by passing it an input string and it would return a ParseResult.  Now that the delegate representation has changed, that’s no longer possible (we now need to pass in a ParserState object, and a Consumed object is returned).  To re-enable the friendlier interface for clients, we add an extension method:

// clients call p.Parse("input") to avoid implementation details like Consumed & ParserState
public static ParseResult<T> Parse<T>(this P<T> p, string toParse)
{
    return p(new ParserState(0, toParse)).ParseResult;
}
expr("3+4")        // old way
expr.Parse("3+4")  // new way

Now clients call "parser.Parse()", and the Parse() method will cons up the initial state object, as well as strip off the "consumed" data from the result.  Adding this method also shields clients from future breaking changes, in the event that we ever change the parser representation type P<T> from a delegate type to a class type.

Additionally, I’ve added a ToString() method to the ParseResult class, which displays the result in the case of a successful parse, or a nicely worded error message in the case of a failure.

The parser basis with the new representation

Now we’ll look at the implementation of the core parser functions using the new representation.

Item and Sat

Previously, we treated Item as a core function that knew about the underlying parser representation, and then implemented Sat in terms of Item.  Now it makes sense to do the reverse – promote Sat to the "core" functions that knows the underlying parser representation, and define Item in terms of Sat.  This enables Sat to directly access the error information so as to distinguish its two different modes of failure (end of input, or character not matching predicate) with different error messages.

// Sat(pred) succeeds parsing a character only if the character matches the predicate
public static P<char> Sat(Predicate<char> pred)
{
    return input =>
    {
        if (input.Position >= input.Input.Length)
            return new Consumed<char>(false, new ParseResult<char>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), "unexpected end of input")));
        else if (!pred(input.Input[input.Position]))
            return new Consumed<char>(false, new ParseResult<char>(new ErrorInfo(input.Position, Enumerable.Empty<string>(), "unexpected character '"+input.Input[input.Position]+"'")));
        else
            return new Consumed<char>(true, new ParseResult<char>(input.Input[input.Position], new ParserState(input.Position+1, input.Input), new ErrorInfo(input.Position+1)));
    };
}
// Item() consumes the first character of the input string and returns it as a result
public static P<char> Item()
{
    return Sat(c => true);
}

So, for example, we can call

Console.WriteLine(Sat(c => c == 'x').Parse("")); Console.WriteLine(Sat(c => c == 'x').Parse("y"));
Console.WriteLine(Sat(c => c == 'x').Parse("x"));

and get

Failure: At position 0, unexpected end of input
Failure: At position 0, unexpected character 'y'
Success: x

Already we have an improvement!  Whereas previously all failures just returned "null" with no information as to the cause of the problem, now we can distinguish failure cases and offer more useful diagnostics.

Note also that Sat populates the resultant "consumed" data with either true or false, based on whether it succeeded in parsing a character of the input or not.

Or and Try

Now let’s look at the new Or combinator:

// p1.Or(p2) tries p1, but if it fails without consuming input, runs p2 instead
public static P<T> Or<T>(this P<T> p1, P<T> p2)
{
    return input =>
    {
        Consumed<T> consumed1 = p1(input);
        if (consumed1.ParseResult.Succeeded || consumed1.HasConsumedInput)
        {
            return consumed1;
        }
        else
        {
            Consumed<T> consumed2 = p2(input);
            if (consumed2.HasConsumedInput)
                return consumed2;
            return new Consumed<T>(consumed2.HasConsumedInput,
                consumed2.ParseResult.MergeError(consumed1.ParseResult.ErrorInfo));
        }
    };
}

The logic for whether or not to run p2 has changed; now p2 runs only if p1 fails and it does not consume any input.  If p1 succeeds or consumes, we short-circuit the other alternative and immediately return.

In the case where neither p1 nor p2 consumes any input, you’ll note there’s a call to a new function "MergeError".  The high-level description of that function is that it combines the error information about expectations.  This is what enables error messages of the form "…expected x, y, or z…"; we’ll see an example later.

Here is the new Try combinator:

// Try(p) behaves like p, except if p fails, it pretends it hasn't consumed any input
public static P<T> Try<T>(this P<T> p)
{
    return input =>
    {
        Consumed<T> consumed = p(input);
        if (consumed.HasConsumedInput && !consumed.ParseResult.Succeeded)
            return new Consumed<T>(false, consumed.ParseResult);
        else
            return consumed;
    };
}

It works just as billed – if p consumes input and fails, it returns a value with the same ParseResult, only with HasConsumedInput set to false.

A new combinator: Tag

In addition to Try, there is one other new core combinator I failed to mention.  It’s called Tag, and its the way we can assign labels to individual parsers, so as to provide a high-level description of expectations in error messages.  Here’s the declaration of Tag, along with some typical uses:

// p.Tag(label) makes it so if p fails without consuming input, the error "expected label" occurs
public static P<T> Tag<T>(this P<T> p, string label);

public static P<char> Letter = Sat(char.IsLetter).Tag("letter");
public static P<char> Digit = Sat(char.IsDigit).Tag("digit");

This is what enables error messages like the one here:

Console.WriteLine(Letter.Or(Digit).Parse("?"));
// Failure: At position 0, unexpected character '?', expected digit or letter

Now we can better appreciate what was happening in that MergeError() call inside the definition of Or.  MergeError() concatenates the lists of expectations inside the resultant error information, so that when a set of Or alternatives fails, we can issue an error message that lists the various alternatives we were expecting at that point in the parse.

The rest

There are other changes to the core parser functions, but I’ll skip describing them, to prevent this blog entry from becoming extremely long.  As always, the entire code appears at the end.

Nevertheless, to sum up, we created a new parser representation and changed the parser basis – the set of core functions – to work with the new representation.  The basis now contains these functions: Fail, Return, Then, Or, Try, Tag, Sat, and Parse.  The rest of the library, including combinators like Chainl1 and Many, is unchanged, other than the tiny addition of a few Tag calls to label such parsers as Letter and Digit.  By building the parser library atop a small set of core functionality, we’ve been able to change the representation to greatly enhance the utility of the library, with a minimum of repercussions outside of the core.

Examples

We’ve already encountered one of the best examples of the new utility: the motivating example of the expression-parser error messages at the beginning of today’s blog entry.  In order to get error messages like

Failure: At position 7, unexpected character '-', 
expected character '(' or natural number
Failure: At position 10, unexpected character '.', 
expected character ')', add/subtract op, multiply/divide op or exponentiation op

the only changes we had to make to our old code was the addition of a few Tag calls.  Here’s one such example:

P<Func<int, int, int>> addOp = Literal('+').Then_(Return((int x, int y) => x + y))
                           .Or(Literal('-').Then_(Return((int x, int y) => x - y)))
                           .Tag("add/subtract op");

The only difference compared to our previous code is the addition of the Tag call (underlined in the snippet above).

NotFollowedBy and EndOfInput

The Try combinator enables us to create another useful combinator called NotFollowedBy.  Here is its definition, along with an example of its use.

// p1.NotFollowedBy(p2,label) runs p1 and then looks ahead for p2, failing if p2 is there
public static P<T> NotFollowedBy<T, U>(this P<T> p1, P<U> p2, string p2Label)
{
    return p1.Then(result =>
        Try(p2.Then_(Fail<T>("unexpected " + p2Label))
            .Or(Return(result))));
}
Console.WriteLine(Literal('1').NotFollowedBy(Literal('2'), "character '2'").Parse("12"));
// Failure: At position 1, unexpected character '2'

This lets us write a handy function to recognize the end of the input:

P<int> EndOfInput = Return(0).NotFollowedBy(Item(), "character").Tag("end of input");

Console.WriteLine(Literal('1').Then_(EndOfInput).Parse("1"));
//Success: 0
Console.WriteLine(Literal('1').Then_(EndOfInput).Parse("12"));
//Failure: At position 1, unexpected character, expected end of input

EndOfInput does not need to return any useful value, so it always returns the integer 0 when it succeeds.  It is interesting that this parser can be written in client code.  At first glance, detecting the end of the input seems like something that would require access to the underlying representation.  This just goes to show how expressive the small set of core functions – the parser basis – really is.

NotFollowedBy also helps us correct a problem with the English number parser we previously encountered.  Back in part two, we created a parser to recognize and evaluate English numbers.  However if you read the source code, you discovered it had a problem; when trying to parse a number like "seventy-seven", the literal string "seven" would match and immediately return the value 7 (along with remaining unparsed input "ty-seven").  We can correct this by appending a NotFollowedBy(Letter) call to each Literal string we want to match.  That way, for example, "seven" only gets matched when the next character is not another letter.  Making that fix, and adding a few Tags yields an English number parser we can call like this:

Console.WriteLine(OneToNinetyNine.Parse("seven"));
Console.WriteLine(OneToNinetyNine.Parse("twenty-seven"));
Console.WriteLine(OneToNinetyNine.Parse("seventy-seven"));
Console.WriteLine(OneToNinetyNine.Parse("so"));

to yield these results:

Success: 7
Success: 27
Success: 77
Failure: At position 0, unexpected character 's', expected english number 1-99

Nifty.  The full code appears below.

Summary

Today we added error diagnostics to the parser library.  By changing the parser representation type we can keep track of the position in the parse, an error message, and a list of expectations for that location.  We changed the semantics of Or to be predictive by default.  And we added three new functions to the parser basis (the core functions that know the parser representation):

  • Try(p), which pretends it didn’t consume any input when p fails, enabling selective backtracking
  • p.Tag(label), which makes it easy to get descriptive expectations into parser error messages
  • p.Parse(string), which shields clients from the underlying parser representation

We also created a handy combinator called NotFollowedBy, as well as a parser for recognizing the EndOfInput.

Full source code

Hmmm, it appears the blog server thinks this post is too big.  So I’m going to try posting the code as a separate entry.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: