SimpleParse A Parser Generator for mxTextTools -- version 1.0

Note: Version 2.0.0 of the SimpleParse engine has been released (and provides a compatability API for SimpleParse 1.0 applications). This release has a number of exciting new features, better documentation, and an optional mx.TextTools engine upgrade. The 1.0 release is not likely to recieve any further development effort and you are strongly encouraged to use the 2.0.0 release for any new development.

A (deterministic) parser generation system based on Marc-André Lemburg's mxTextTools tagging engine. SimpleParse is fast, though not, as I had originally assumed it would be, orders of magnitude faster than my mcf.pars engine. That engine had far more tweaking and optimisation done than has SimpleParse, possibly with time an order of magnitude will be realized. On the whole, however, SimpleParse is the fastest parser generator I have found for use from Python (i.e. which doesn't require a compiler to create each new grammar).

This document assumes the you understand parsing fairly well, and can read an extended BNF (ebnf) grammar easily. The declaration grammar used in SimpleParse is available in the file simple.def in the installed simpleparse directory. The choice of this grammar was primarily influenced by my original mcf.pars parser. Those wishing to replace the various elements (such as the use of "/" to mean "first of") are, of course, free to do so.

Usage Example

declaration = r'''# note use of raw string when embedding in python code...
file           :=  [ \t\n]*, section+
section        :=  '[',identifier,']', ts,'\n', body
body           :=  statement*
statement      :=  (ts,';',comment,'\n')/equality/nullline
nullline       :=  ts,'\n'
comment        :=  -'\n'*
equality       :=  ts, identifier,ts,'=',ts,identified,ts,'\n'
identifier     :=  [a-zA-Z], [a-zA-Z0-9_]*
identified     :=  ('"',string,'"')/number/identifier
ts             :=  [ \t]*
char           :=  -[\134"]+
number         :=  [0-9eE+.-]+
string         :=  (char/escapedchar)*
escapedchar    :=  '\134"' / '\134\134'
'''
testdata = '''[test1]
	val=23
	val2="23"
	wherefore="art thou"
	; why not
	log = heavy_wood

[test2]
loose=lips

'''
from simpleparse import generator
from TextTools import TextTools
import pprint

parser = generator.buildParser( declaration ).parserbyname( 'file' )
pprint.pprint( TextTools.tag( testdata, parser ))

As you can see, this is not a very complex process. The format being defined is similar to a windows ini file, but with a more limited class of values (for instance, multiple numbers are not supported here). The declaration in this case is embedded directly in the file. I do not generally recommend this, as the various escaping paths through which the string passes tend to hide errors. On the other hand, it does prevent the atrociously ugly "find the file" logic you'll see most of my examples. Eventually, I may standardise on using the (new) features in mxTextTools 1.1 which provide for pickling/restoring parse-definition tuples and figure out some way of conveniently embedding those in the python file.

SimpleParse Grammar

Element Tokens

SimpleParse has four classes of element token: the group, the literal, the character range, and the production reference. A group may be of either the type "sequential", denoted by a "," operator, or "first of", denoted by a "/" operator.

Modifier	Example	Meaning
-	-"this" -those* -[A-Z]+ -(them,their) -(them/their)	Match any single character as long as the current text does not match "this" at this position. Match any number of characters until the production "those" matches. Match one or more characters not in the set A to Z. Match one character as long as the current text does not match a "them" production followed by a "their" production Match one character as long as the current text does not match a "them" production or a "their" production
?	"this"? those? [A-Z]? (them,their)? (them/their)?	Either the exact string "this", or nothing. Either match the production "those", or nothing. Either match a single character in the range A to Z, or nothing. Either match the production "them" followed by the production "their", or nothing. Either match the production "them", or the production "their", or nothing.
*	"this"* those* [A-Z]* (them,their)* (them/their)*	Match any number of exact strings "this". Match any number of production "those". Match any number of characters in the range A to Z. Match the two productions "them" and "their" in order any number of times. Match either of the two productions "them" or "their" any number of times.
+	"this"+ those+ [A-Z]+ (them,their)+ (them/their)+	Match one or more exact strings "this". Match one or more production "those". Match one or more characters in the range A to Z. Match the two productions "them" and "their" in order one or more times. Match either of the two productions "them" or "their" one or more times.

Declarations (Productions)

A simpleparse grammar is a series of declarations. Each declaration defines a single identifier. The identifier may be either reported, or unreported. A reported identifier is added to the parse tree, while an unreported identifier is not. Note: all children of an unreported identifier are also eliminated from the parse tree. The mapping between production and parse tree is fairly simple:

s := "this"	('s', start, stop, [] ) # no children
s := them, those?	('s', start, stop, [ ("them", start, stop, []), ("those", start, stop, []) ] ) ('s', start, stop, [ ("them", start, stop, []) ) # optional value not there
s := them*, those	('s', start, stop, [ ("them", start, stop, []), ("them", start, stop, []), ("those", start, stop, []) ] ) ('s', start, stop, [ ("those", start, stop, []) ) # optional repeating value not present

Where start and stop represent indexes in the source text such that sourcetext [ start: stop] is the text which matched this production. The list of children is the list of those reported productions which matched to allow the parent production to match.

Character Classes and Strings

Both character classes and strings in simpleparse may use octal escaping (of 1 to 3 octal digits) or standard Python character escapes ( \a\b\f\n\r\t\v (Note that "\n" is interpreted according to the local machine, and thus may be non-portable, so an explicit declaration using octal code might be more suitable in cases where differentiation is required)). Strings may be either single or double quoted (but not triple quoted).

To include a "]" character in a character class, make it the first character. Similarly, a literal "-" character must be either the first (after the optional "]" character) or the last character. The grammar definition for a character class is as follows:

'[',CHARBRACE?,CHARDASH?, (CHARRANGE/CHARNOBRACE)*, CHARDASH?,']'

Common Errors

If you have errors reported during the compilation process stating that an identifier is not found (but you have defined it), likely you have a syntax error. At the moment, there is their little support in simpleparse for error reporting or handling. This is something Marc-André and I will have to work on together I would assume.

Back tracking does not work within groups. Simpleparse does not currently provide for regex-style back tracking within repeating groups, so, consider using the following constructs:

t*  -->  s := u?  u := (t,u)/t # repeating, back tracking, optional
t+  -->  u := (t,u)/t # repeating, back tracking, required

It would be possible to create a subclass of the current generator which automatically creates these types of constructs for each repeating element. I just don't like them, they have a good chance of blowing the C stack, what with potentially creating recursive calls many hundreds of thousands of levels deep. Still, if you have a language which requires this type of back tracking, the risk might be worth it. This should have special case handling to make the entire list show up as a single list of elements, not an arbitrarily deep nested tuple tree.

Pre-built Parsing Elements

The latest version of Simpleparse allows for mixing and matching generated and hand-coded tuples. With this mechanism, it should be possible to use any mxTextTools tuple as a production within Simpleparse, thus giving access to the full features set of the package (though not in a very elegant manner).

Passing a sequence of (key, value) pairs to the Generator's build method will allow the values to be referenced by their keys as if they were declared names. Similarly, after having built a parser, you can extract the entire list of (key, value) pairs representing the grammar (you could thereby compose hierarchies of parsers). This mechanism has not been tested, but is rather simplistic and so should just work. The tuples passed in this way will be reported as normal productions unless their name is of the form '<name>' (surrounded by angle brackets (just as in the base simpleparse grammar)).

Examples

The examples sub-package contains a number of examples, including:

findliterals.py: Find all instances of double quoted strings in a file.
formatvrml.py: Syntax color VRML 97 source code, with text-processing-focused VRML 97 grammar.
vrml.def: VRML 97 parser focused on sceneGraph building (ignores comments etc.)
pyebnf_test.py: A parser for the EBNF grammar used to describe python's grammar. Does not include a tokenizer, so it cannot yet parse python itself.
simpleexample.py: The usage example above.

Possible Future Directions

Support for embedded code or calling external functions automatically (or an advanced mechanism for processing parse trees)
Improved error handling.
A "cut" directive, in effect, a way to spell "if you get this far, you must complete this production or register an error".
More efficient tuple trees, elimination of huge numbers of recursive calls generated by the current system. This may include recursively resolving each referenced identifier as we progress through the transformation stage.
Better documentation.
Add a way to specify that a production should be expanded in-place, as opposed to being reported or ignored, so that partial productions may be shared among productions.

Copyright & Disclaimer

© 1998-1999, Copyright by Mike C. Fletcher; All Rights Reserved. mailto: mcfletch@vrplumber.com

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee or royalty is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice appear in supporting documentation or portions thereof, including modifications, that you make.

THE AUTHOR MIKE C. FLETCHER DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE !

Use at your own risk.

Installation

Install Python (normal Python ("CPython"), not "JPython", as mxTextTools is not available for the Java VM).
Install mxTextTools.
Expand the simpleparse archive into a temporary directory, run "setup.py install" (requires distutils).

To test:

Switch to the simpleparse/examples directory and run the simpleexample.py script (the example above).
If the system is properly installed and functioning, the parse tree for the example above will be printed.

Only the files in the simpleparse directory are required by the parser generator, so if you are distributing the package, you only need to include these.