Parsing Expression Grammars vs. Regular Expressions
Parsing expression grammars are some of those poorly appreciated and rarely used programming tools that really deserve a better rap. As a generalization, they are both more powerful and arguably more clear than their regular expression counterparts (while any regular expression can be implemented as a parsing expression grammar, there is an entire class of parsing expression grammars that is impossible to implement with regular expressions alone).
Regular Expressions
In day-to-day coding, regular expressions are usually temporary hacks, their particular structure intimately married to the form of their data. Viewed in this way, a regular expression can be seen as a specification of data, an implicit description of the format and structure of the data it works on. The usual life cycle of a regular expression is short; a programmer comes across some moderately complex parsing task and, rather than spin out a series of loops, condition checking and character-stepping code to extract the needed information, the programmer simply whips up a quick and dirty regex, pulls out the result, and moves on. Rarely are regular expressions reused, littering the codebase like cryptic shells.
I had a devious professor once who tasked my “Intro to Programming” class with writing a regular expression to tell if a given string exhibited balanced parentheses (its match would return true if so, false otherwise). Seems simple enough. But once we got into it, it became obvious this was no simple task. We had just a single night to work on this extra credit challenge, for which the professor promised extravagant rewards to anyone who could pull it off. I persisted for a bit and realized quickly I was out of my league. My mind was battered with paradoxes. I still remember a peer of mine, a brilliant fellow, but still green in the ways of programming, crashing into my room late at night, wild-eyed and babbling incoherently, insisting he had solved it. Excited, I took a look and, though it worked for some simple cases he was using as test cases, I readily produced an example that foiled his brittle scheme, and I watched as his whole world crumbled. I think he was in tears.
The punchline, of course, is that it is impossible with regular expressions alone to detect any kind of recursive structure like nested parentheses. We all cursed the professor, but he just continued grinning wickedly and said something along the lines of “It is helpful and illuminating to take impossible tasks seriously, and see in what ways you fail.” We groaned at the time (it was hysterical later), but to this day I thank him for making such a dramatic and unforgettable point, as it has saved me countless times when confronted with a problem which seems appropriate for regexes, but within which lurk hidden nestings or tree-like structure.
Enter Parsing Expression Grammars
The idea is simple: you provide a set of “rules,” which specify the expected structure in terms of other rules. At some point some of the rules must decompose to actual text characters, which can also include regex-like matching parameters.
I will give a real-world example to demonstrate: Interface has support for nested structures, the most notable of which is the page hierarchy, though models can be nested as well. When reorganizing these nested trees, I found that after reorganizing and making a nested organization of the pages, the front-end was decomposing this structure into a string representation and sending this representation to the server for updating. An example (in Ruby) of one of these strings would be like so:
> example = "1,5,2[3,4,8,9],7,6[11,10[12,15,18],19],13[14[16,17]]"
In this example, the numbers represent the ids of the objects and their relative ordering and the brackets signify that the enclosed ids act as children of the preceding id. A fairly simple format, but it is instructive to attempt to extract the necessary information with regexes alone. The main task would be to group ids with their children in some usable way. Obviously, splitting on commas fails:
> example.split(',')
["1", "5", "2[3", "4", "8", "9]", "7", "6[11", "10[12", "15", "18]", "19]", "13[14[16", "17]]"]
This gives nonsense like “2[3” and “19]”. Grouping by brackets fails in two ways, depending on whether you are using greedy or lazy operators:
> example.scan(/(\d+)(\[(.*)\])?,?/) [["1", nil, nil], ["5", nil, nil], ["2", "[3,4,8,9],7,6[11,10[12,15,18],19],13[14[16,17]]", "3,4,8,9],7,6[11,10[12,15,18],19],13[14[16,17]"]] > example.scan(/(\d+)(\[(.*?)\])?,?/) [["1", nil, nil], ["5", nil, nil], ["2", "[3,4,8,9]", "3,4,8,9"], ["7", nil, nil], ["6", "[11,10[12,15,18]", "11,10[12,15,18"], ["19", nil, nil], ["13", "[14[16,17]", "14[16,17"]]
The greedy example gets to the first bracket and takes everything until the last bracket, which happens to be at the end of the string. The lazy example gets somewhat closer, giving correct output for the ids without children and the first level of ids having only direct children (ie, the [“2”, “[3,4,8,9]”, “3,4,8,9”] entry, which correctly associates the children ids with their parent id), but fails for anything deeper than one level. This could be done with something such as this example embedded in a recursive descent function, but this is quickly getting more complicated than it needs to be.
To see how to go about solving this problem with parsing expression grammars, notice first how the structure of the nested strings is recursively self-similar, i.e., a “child” element has the same structure as its parent element. The nesting as a whole is represented by elements separated by commas, and an element can be either:
1. A single number
2. A single number followed by a nesting within brackets
This is a simple description, which implies there should be a simple way to capture this structure. In a parsing expression grammar, the implementation is a straightforward encoding of this idea. Since Interface is written in Ruby, I use the Treetop PEG library, which expresses this nesting in the following way:
grammar Nesting
rule nesting
element ',' nesting / element
end
rule element
id '[' nesting ']' / id
end
rule id
[0-9]+
end
end
The “grammar” directive is analogous to the “class” keyword, specifying that we will be defining a new grammar. This particular grammar has three “rules”: a nesting, an element, and an id. Notice how these rules refer to each other, and even themselves (in the case of the “nesting” rule). Circularity is not a problem here, indeed it is the whole raison d’etre for using parsing expression grammars in the first place. In each case, the ‘/’ signifies an alternative, to be tried if the first expression fails to match. It always tries to match the first rule first, so we can read this as follows:
1. A nesting is either an element followed by a comma and then another nesting, or just a plain element.
2. An element is either an id with another nesting inside of brackets, or just an id.
3. An id is at least one digit character, possibly an arbitrary number.
Believe it or not, these three rules are entirely sufficient to describe any string of the given nestable form. In Treetop, this grammar is compiled with the ‘tt’ command which outputs a .rb file containing the new “NestingParser” class that can be included in your code. Treetop has some other features too, such as associating each rule with a subclass of Treetop’s “SyntaxNode” class that can be used to do further processing once the string has been parsed into a tree (we use this to update all of the affected objects’ parent and children references directly while descending through the syntax tree), but the general idea is right there. Beautiful eh? Well, at least I think so.
(Upon further investigation, it turns out the latest version of Perl regexes allow for “grammars” in the sense above. Perl: Still awesome?)

Comments
You inspire me, Ryan. Nice writing, and I will devour this information as if it were soul food.