Easy text parsing in C# with Sprache

A few days ago, I discovered a little gem: Sprache. The name means "language" in German. It’s a very elegant and easy to use library to create text parsers, using parser combinators, which are a very common technique in functional programming. The theorical concept may seem a bit scary, but as you’ll see in a minute, Sprache makes it very simple.

Text parsing

Parsing text is a common task, but it can be tedious and error-prone. There are plenty of ways to do it:

  • manual parsing based on Split, IndexOf, Substring etc.
  • regular expressions
  • hand-built parser that scans the string for tokens
  • full blown parser generated with ANTLR or a similar tool
  • and probably many others…

None of these options is very appealing. For simple cases, splitting the string or using a regex can be enough, but it doesn’t scale to more complex grammars. Building a real parser by hand for non-trivial grammars is, well, non-trivial. ANTLR requires Java, a bit of knowledge, and it relies on code generation, which complicates the build process.

Fortunately, Sprache offers a very nice alternative. It provides many predefined parsers and combinators that you can use to define a grammar. Let’s walk through an example: parsing the challenge in the WWW-Authenticate header of an HTTP response (I recently had to write a parser by hand for this recently, and I wish I had known Sprache then).

The grammar

The WWW-Authenticate header is sent by an HTTP server as part of a 401 (Unauthorized) response to indicate how you should authenticate:

# Basic challenge
WWW-Authenticate: Basic realm="FooCorp"

# OAuth 2.0 challenge after sending an expired token
WWW-Authenticate: Bearer realm="FooCorp", error=invalid_token, error_description="The access token has expired"

What we want to parse is the "challenge", i.e. the value of the header. So, we have an authentication scheme (Basic, Bearer), followed by one or more parameters (name-value pairs). This looks simple enough, we could probably just split by ',' then by '=' to get the values… but the double quotes complicate things, since quoted strings could contain the ',' or '=' characters. Also, the double quotes are optional if the parameter value is a single token, so we can’t rely on the fact they will (or won’t) be there. If we want to parse this reliably, we’re going to have to look at the specs.

The WWW-Authenticate header is described in detail in RFC-2617. The grammar looks like this, in what the RFC calls "augmented Backus-Naur Form" (see RFC 2616 §2.1):

# from RFC-2617 (HTTP Basic and Digest authentication)

challenge      = auth-scheme 1*SP 1#auth-param
auth-scheme    = token
auth-param     = token "=" ( token | quoted-string )

# from RFC2616 (HTTP/1.1)

token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT
quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext         = <any TEXT except <">>
quoted-pair    = "\" CHAR

So, we have a few grammar rules, let’s see how we can encode them in C# code with Sprache, and use them to parse a challenge.

Parsing tokens

Let’s start with the most simple parts of the grammar: tokens. A token is declared as one or more of any characters that are not control chars or separators.

We’ll define our rules in a Grammar class. Let’s start by defining some character classes:

static class Grammar
{
    private static readonly Parser<char> SeparatorChar =
        Parse.Chars("()<>@,;:\\\"/[]?={} \t");

    private static readonly Parser<char> ControlChar =
        Parse.Char(Char.IsControl, "Control character");

}
  • Each rule is declared as a Parser<T>; since these rules match single characters, they are of type Parser<char>.
  • The Parse class from Sprache exposes parser primitives and combinators.
  • Parse.Chars matches any character from the specified string, we use it to specify the list of separator characters.
  • The overload of Parse.Char that we use here takes a predicate that will be called to check if the character matches, and a description of the character class. Here we just use System.Char.IsControl as the predicate to match control characters.

Now, let’s define a TokenChar character class to match characters that can be part of a token. As per the RFC, this can be any character not in the previous classes:

    private static readonly Parser<char> TokenChar =
        Parse.AnyChar
            .Except(SeparatorChar)
            .Except(ControlChar);
  • Parse.AnyChar, as the name implies, matches any character.
  • Except specifies exceptions to the rule.

Finally, a token is a sequence of one or more of these characters:

    private static readonly Parser<string> Token =
        TokenChar.AtLeastOnce().Text();
  • A token is a string, so the rule for a token is of type Parser<string>.
  • AtLeastOnce() means one or more repetitions, and since TokenChar is a Parser<char>, it returns a Parser<IEnumerable<char>>.
  • Text() combines the sequence of characters into a string, returning a Parser<string>.

We’re now able to parse a token. But it’s just a small step, and we still have a lot to do…

Parsing quoted strings

The grammar defines a quoted string as a sequence of:

  • an opening double quote
  • any number of either
    • a "qdtext", which is any character except a double quote
    • a "quoted pair", which is any character preceded by a backslash (this is used to escape double quotes inside a string)
  • a closing double quote

Let’s write the rules for "qdtext" and "quoted pair":

    private static readonly Parser<char> DoubleQuote = Parse.Char('"');
    private static readonly Parser<char> Backslash = Parse.Char('\\');

    private static readonly Parser<char> QdText =
        Parse.AnyChar.Except(DoubleQuote);

    private static readonly Parser<char> QuotedPair =
        from _ in Backslash
        from c in Parse.AnyChar
        select c;

The QdText rule doesn’t require much explanation, but QuotedPair is more interesting… As you can see, it looks like a Linq query: this is Sprache’s way of specifying a sequence. This particular query means: match a backslash (named _ because we ignore it) followed by any character named c, and return just c (quoted pairs are not escape sequences in the same sense as in C, Java or C#, so "\n" isn’t interpreted as "new line" but just as "n").

We can now write the rule for a quoted string:

    private static readonly Parser<string> QuotedString =
        from open in DoubleQuote
        from text in QuotedPair.Or(QdText).Many().Text()
        from close in DoubleQuote
        select text;
  • the Or method indicates a choice between two parsers. QuotedPair.Or(QdText) will try to match a quoted pair, and if that fails, it will try to match a QdText instead.
  • Many() indicates any number of repetition
  • Text() combines the characters into a string

We now have all the basic building blocks, so we can move on to higher level rules.

Parsing challenge parameters

A challenge is made of an auth scheme followed by one or more parameters. The auth scheme is trivial (it’s just a token), so let’s start by parsing the parameters.

Although there isn’t a named rule for it in the grammar, let’s define a rule for parameter values. The value can be either a token or a quoted string:

    private static readonly Parser<string> ParameterValue =
        Token.Or(QuotedString);

Since a parameter is a composite element (name and value), let’s define a class to represent it:

class Parameter
{
    public Parameter(string name, string value)
    {
        Name = name;
        Value = value;
    }
    
    public string Name { get; }
    public string Value { get; }
}

The T in Parser<T> isn’t restricted to characters and strings, it can be any type. So the rule for parsing parameters will be of type Parser<Parameter>:

    private static readonly Parser<char> EqualSign = Parse.Char('=');

    private static readonly Parser<Parameter> Parameter =
        from name in Token
        from _ in EqualSign
        from value in ParameterValue
        select new Parameter(name, value);

Here we match a token (the parameter name), followed by the '=' sign, followed by a parameter value, and we combine the name and value into a Parameter instance.

Now let’s parse a sequence of one or more parameters. Parameters are separated by commas (','), with optional leading and trailing whitespace (look for "#rule" in RFC 2616 §2.1). The grammar for lists allows several commas without items in between, e.g. "item1 ,, item2,item3, ,item4", so the rule for the delimiter can be written like this:

    private static readonly Parser<char> Comma = Parse.Char(',');

    private static readonly Parser<char> ListDelimiter =
        from leading in Parse.WhiteSpace.Many()
        from c in Comma
        from trailing in Parse.WhiteSpace.Or(Comma).Many()
        select c;

We just match the first comma, the rest can be any number of commas or whitespace characters. We return the comma because we have to return something, but we won’t actually use it.

We could now match the sequence of parameters like this:

    private static readonly Parser<Parameter[]> Parameters =
        from first in Parameter.Once()
        from others in (
            from _ in ListDelimiter
            from p in Parameter
            select p).Many()
        select first.Concat(others).ToArray();

But it’s not very straightforward… fortunately Sprache provides an easier option with the DelimitedBy method:

    private static readonly Parser<Parameter[]> Parameters =
        from p in Parameter.DelimitedBy(ListDelimiter)
        select p.ToArray();

Parsing the challenge

We’re almost done. We now have everything we need to parse the whole challenge. Let’s define a class to represent it first:

class Challenge
{
    public Challenge(string scheme, Parameter[] parameters)
    {
        Scheme = scheme;
        Parameters = parameters;
    }
    public string Scheme { get; }
    public Parameter[] Parameters { get; }
}

And finally we can write the top-level rule:

    public static readonly Parser<Challenge> Challenge =
        from scheme in Token
        from _ in Parse.WhiteSpace.AtLeastOnce()
        from parameters in Parameters
        select new Challenge(scheme, parameters);

Note that I made this rule public, unlike the others: it’s the only one we need to expose.

Using the parser

Our parser is done, now we just have to use it, which is pretty straightforward:

void ParseAndPrintChallenge(string input)
{
    var challenge = Grammar.Challenge.Parse(input);
    Console.WriteLine($"Scheme: {challenge.Scheme}");
    Console.WriteLine($"Parameters:");
    foreach (var p in challenge.Parameters)
    {
        Console.WriteLine($"- {p.Name} = {p.Value}");
    }
}

With the OAuth 2.0 challenge example from earlier, this produces the following output:

Scheme: Bearer
Parameters:
- realm = FooCorp
- error = invalid_token
- error_description = The access token has expired

If there’s a syntax error in the input text, the Parse will throw a ParseException with a message describing where and why the parsing failed. For instance, if I remove the space between "Bearer" and "realm", I get the following error:

Parsing failure: unexpected ‘=’; expected whitespace (Line 1, Column 12); recently consumed: earerrealm

You can find the full code for this article here.

Conclusion

As you can see, Sprache makes it very simple to parse complex text. The code isn’t particularly short, but it’s completely declarative; there are no loops, no conditionals, no temporary variables, no state… This makes it very easy to understand, and it can easily be compared with the actual grammar definition to check its correctness. It also provides pretty good feedback in case of error, which is hard to accomplish with a hand-built parser.

What’s new in FakeItEasy 3.0.0?

FakeItEasy is a popular mocking framework for .NET, with an very intuitive and easy-to-use API. For about one year, I’ve been a maintainer of FakeItEasy, along with Adam Ralph and Blair Conrad. It’s been a real pleasure working with them and I had a lot of fun!

Today I’m glad to announce that we’re releasing FakeItEasy 3.0.0, which supports .NET Core and introduces a few useful features.

Let’s see what’s new!

.NET Core support

In addition to .NET 4+, FakeItEasy now supports .NET Standard 1.6, so you can use it in .NET Core projects.

Note that due to limitations in .NET Standard 1.x, there are some minor differences with the .NET 4 version:

  • Fakes are not binary serializable;
  • Self-initializing fakes are not supported (i.e. fakeService = A.Fake<IService>(options => options.Wrapping(realService).RecordedBy(recorder)).

Huge thanks to the people who made .NET Core support possible:

  • Jonathon Rossi, who maintains Castle.Core. FakeItEasy relies heavily on Castle.Core, so it couldn’t have supported .NET Core if Castle.Core didn’t.
  • Jeremy Meng from Microsoft, who did most of the heavy lifting to make both FakeItEasy 3.0.0 and Castle.Core 4.0.0 work on .NET Core.

Analyzer

VB.NET support

The FakeItEasy analyzer, which warns you of incorrect usage of the library, now supports VB.NET as well as C#.

New or improved features

Better syntax for configuring successive calls to the same member

When you configure calls on a fake, it creates rules that are "stacked" on each other, which means you can override a previously configured rule. Combined with the ability to specify the number of times a rule must apply, this lets you say things like "return 42 twice, then throw an exception". Until now, to do that you had to configure the calls in reverse order, which wasn’t very intuitive and meant you had to repeat the call specification:

A.CallTo(() => foo.Bar()).Throws(new Exception("oops"));
A.CallTo(() => foo.Bar()).Returns(42).Twice();

FakeItEasy 3.0.0 introduces a new syntax to make this easier:

A.CallTo(() => foo.Bar()).Returns(42).Twice()
    .Then.Throws(new Exception("oops"));

Note that if you don’t specify how many times the rule must apply, it will apply forever until explicitly overridden. Hence, you can only use Then after Once(), Twice() or NumberOfTimes(...).

This is a breaking change at the API level, as the shape of the configuration interfaces has changed, but unless you manipulate those interfaces explicitly, you shouldn’t be affected.

Automatic support for cancellation

When a method accepts a CancellationToken, it should usually throw an exception when it’s called with a token that is already canceled. Previously this behavior had to be configured manually. In FakeItEasy 3.0.0, fake methods will now throw an OperationCanceledException by default when called with a canceled token. Asynchronous methods will return a canceled task.

This is technically a breaking change, but most users are unlikely to be affected.

Throw asynchronously

FakeItEasy lets you configure a method to throw an exception with Throws. But for async methods, there are actually two ways to "throw":

  • throw an exception synchronously, before actually returning a task (this is what Throws does)
  • return a failed task (which had to be done manually until now)

In some cases the difference can be important to the caller, if it doesn’t directly await the async method. FakeItEasy 3.0.0 introduces a ThrowsAsync method to configure a method to return a failed task:

A.CallTo(() => foo.BarAsync()).ThrowsAsync(new Exception("foo"));

Configure property setters on unnatural fakes

Unnatural fakes (i.e. Fake<T>) now have a CallsToSet method, which does the same as A.CallToSet on natural fakes:

var fake = new Fake<IFoo>();
fake.CallsToSet(foo => foo.Bar).To(0).Throws(new Exception("The value of Bar can't be 0"));

Better API for specifying additional attributes

The syntax to specify additional attributes on fakes was a bit unwieldy; you had to create a collection of CustomAttributeBuilders, which themselves had to be created by specifying the constructor and argument values. The WithAdditionalAttributes method has been retired in FakeItEasy 3.0.0 and replaced with a simpler WithAttributes that accepts expressions:

var foo = A.Fake<IFoo>(x => x.WithAttributes(() => new FooAttribute()));

This is a breaking change.

Other notable changes

Deprecation of self-initializing fakes

Self-initializing fakes are a feature that lets you record the results of calls to a real object, and replay them on a fake object. This feature was used by very few people, and didn’t seem to be a good fit in the core FakeItEasy library, so it will be removed in a future version. We’re considering providing the same functionality as a separate package.

Bug fixes

  • Faking a type multiple times and applying different attributes to the fakes now correctly generates different fake types. (#436)
  • All non-void members attempt to return a Dummy by default, even after being reconfigured by Invokes or DoesNothing (#830)

The full list of changes for this release is available in the release notes.

Other contributors to this release include:

A big thanks to them!