LMNL Parser Experiment

From LMNLWiki

I have been experimenting with creating an LMNL Parser. These pages describe the experiments and their status. All feedback or help is welcome.

Contents

Goals

  • to define the LMNL grammar at a high level using something like EBNF
  • to auto generate (most of) the software using a standard tool
  • to work with the parser in Python (nice if other languages are also supported)

Longer term goals are:

  • to define a set of LMNL objects/functions with a standard API
  • to construct those objects using the abstract syntax tree of the parser
  • to begin to do useful work parsing lmnl documents

Parsing speed & memory utilisation will be ignored unless performance is prohibitively poor.

Current Status

  • Fully working grammar and java parser using ANTLR
  • Mostly working grammar and python parser using SimpleParse
    • some limitations (no Content in Annotations)

Lessons Learned

  • Separate lexers and parsers make parsing tag languages difficult.
    • Lexers are not intelligent, so any context sensitive processing becomes very hard at the lexer level.
      • One common solution is to use syntactic predicates (a boolean flag that turns on and off depending on the context - e.g. inside the versionInfo or tagContent.)
      • A limitation of this technique is that context must be determined entirely at the lexer stage (you can't supply context from the higher level rules). Unfortunately, this means it becomes very difficult to parse LMNL like this.
      • Nested tags (annotations, and content inside annotations) would all have to be represented by the lexer, in order to determine the proper scope - i.e. if you were still inside a tag. Otherwise, the first close of an annotation inside its tag will reset the syntactic predicate back to being outside a tag.
    • The easiest solution is to accept a very broad range of characters at the lexer level, but supply syntactic validating predicates attached to the higher level rules that raise an error if the contents of the token aren't right in that context (supplied as code in the target language)

Experiments

Experiment 1: SimpleParse

Requirements

  • Python 2.4+
  • SimpleParse 2.1+

http://simpleparse.sourceforge.net/

SimpleParse is a parser generator for Python. It accepts grammars in a slightly verbose EBNF style, and operates as a front end to another text processing engine for Python called mxTextTools. mxTextTools is very fast, but has a very idiosyncratic way of specifying matches at a fairly low level.

Although originally built to make mxTextTools usable, some underlying limitations of the mxTextTools engine (stack limits due to recursion and infinite loops due to matching * zero width productions) have led the SimpleParse author to debate whether or not to move to another text processing back end. The latest version of SimpleParse has a significant rebuild of the mxTextTools engine included in it, as the author of mxTextTools isn't interested in going where SimpleParse wants to go.

Building the grammar

The style of SimpleParse grammars is very close to EBNF, but adds a bit more punctuation. Error reporting isn't particularly good. The resulting LMNL grammar is very close to the EBNF on the wiki, although I have expanded some productions a bit - partly to learn and partly to make parsing easier.

One issue I encountered is that the lmnl grammar specifies a lot of productions that can match zero to many times (*). Combining this with optional matches in higher level productions led to problems matching, and always with the suspicion that one of these might be triggering the "infinite loop" issue. With this in mind, I refactored most of the productions to match something one to many times (+), with the higher level productions specifying optionality (?). This seemed to help to figure out what was matching and what wasn't.

Some constructions particularly caused problems. SimpleParse does not do backtracking, so some of the productions could not match - in particular the grouping of spaces inside TagContent would not match. Grouping these another way allowed the productions to match (I think I got it to mean the same).

Having MetaData inside AnnotationTags caused me some recursion problems, although I could sometimes make this work if I disabled other productions. EmptyTags seem to cause matching problems, as the first two tokens of StartTag and EmptyTag are the same; there may be some grammar tricks to get around that (creating another intermediate token).

I could not resolve the recursion due to Content inside Annotations - it causes SimpleParse to go into an infinite loop. This sounds like one of the limitations of the mxTextTools engine - but it can be a subtle effect, occurring inside child productions of child productions etc. It is also possible that it is a left-recursion problem (but SimpleParse grammars aren't quite LL, LR, LALR or anything else, so it's hard to be entirely sure). Replacing Content with CharData allowed the grammar to build and parse, but is not ideal.

Running the Parser

I couldn't upload a non-image file to this wiki, so a link to a page containing the lmnl grammar is here Lmnl.def. A small python script to read a sample file (called 'example-lmnl-text.txt') and parse it is here Lmnl.SimpleParse.py. The example lmnl file is here example-lmnl-text.txt.

Place all of these files into the same directory, and (assuming you have SimpleParse & python installed and available on your path), type:

python Lmnl.SimpleParse.py

Results

Within the limitations noted above, I could successfully parse an example lmnl text (taken from the wiki). Parsing is fast and seems to be accurate. It is quite possible that all of these limitations can be worked around, but since they are a non-standard limitation, for a language novice like myself, it's one too many things to worry about. For those reasons, I decided to explore using more standard parser generators.

Experiment 2: ANTLR & ANTLRWorks

Requirements

  • AntlrWorks (from the antlr web site - includes a working antlr3 tool inside it)
  • The antlr python runtime (from http://www.antlr.org/download/Python/ - must run "python setup.py install" to install into your python environment.
  • Python 2.4+ (I am using 2.5)
  • Java 1.5+ (I am using 1.6)

http://www.antlr.org/

ANTLR (Another Tool for Language Recognition) is written in Java by Terence Par and accepts EBNF style grammars. It is a bit closer in spirit to lex/yacc than SimpleParse, involving producing lexer tokens, then and higher level productions built on top of those. It can generate parsers in a variety of languages (C, Java, Python, Ruby, C#, Perl, and many others), although Java is best supported as the native language. It is an LL(*) parser (apparently a refinement of LL(k)), so left recursion is still something to watch for, although ANLTR3 does support backtracking.

ANLTRWorks is a Java IDE for building and debugging ANTLR grammars. It provides sophisticated features to help build, visualise, navigate and debug grammars. It will check grammars, draw syntax diagrams in real time of the productions, visually indicate alternative problems, syntax highlights the grammars, and lots more. Like most Java programs, it doesn't quite integrate perfectly into the desktop environment, but it is still very usable.

If nothing else, it is worth downloading ANTLRWorks (it's free), and loading up the LMNL.g file inside it, just to explore the lmnl language. The visualisations are useful, and better than anything else I've seen.

Building the grammar

My first attempt at building an ANTLR grammar for lmnl, by directly translating the lmnl grammar into lexer tokens didn't work at all - so I went back and built it up from the bottom. A convention I had not appreciated is that a rule starting with an uppercase letter is a lexer token, and one with a lower case letter is a production.

Aside from those issues, I followed a similar style to the SimpleParse grammar - expanding some productions and defining most of the literals by name. I also encountered similar issues to SimpleParse, leading me to believe that these may be generic LL grammar issues.

I stayed with the idea of making productions specify what they match (one, or one-to-many times), and leaving optionality up to higher level productions, rather than having a lot of zero-to-many matches scattered about.

Like SimpleParse, EmptyTags could cause matching problems (which I could resolve using an intermediate token in StartTag and EmptyTag for "OpenStartTag TagContent"). Likewise, MetaData in AnnotationTags, and Content in Annotations caused fatal LL recursion errors, which apparently can be resolved by left factoring (something I have yet to fully explore).

I had some problems with the definitions of MetaData and whitespace in various productions. To resolve these, I simplified the statement to just MetaData* (with whitespace included in MetaData). This isn't exactly the same as the lmnl grammar on the wiki, but it does work. I will try to figure out those remaining issues at a later date, as they are not large.

I finally came up with a grammar that antlr will accept by specifying the "backtrack=true" option. This automatically backtracks if a match can't be made and tries other alternatives. The resulting gramar is almost exactly like the lmnl grammar (including recursive MetaData and Content in Annotations). ANTLR says the grammar checks out OK, and happily builds parser code.

Running the Parser

The LMNL grammar can be found at this page LMNL.g. A simple python test harness script, to load an example lmnl text file example-lmnl-text.txt and parse it can be found here lmnl-ANTLR.py. To use these files, load LMNL.g into AntlrWorks, and select Generate/Generate Code. This will place the generated python files wherever AntlrWorks is configured to put them (File/Preferences). Put these files into the same directory as all the others, and run the test harness script:

python lmnl-ANTLR.py

Results

I have tried Java and Python so far. Java I am having build issues with (probably don't have a good java environment set up). The Python parser runs, but currently doesn't recognise anything, so there are clearly some problems with my antlr grammar. It looks OK (at least, to my untrained eye), but currently matches nothing. Back to the drawing board then. I intend to play with some cut down versions to get matching working, then apply what I've learned to the current lmnl antlr grammar.

Update

I tried an incredibly cut down grammar, with only the fragment literals, and the definition of the prolog. It still didn't match anything. The grammar problem seems to be in my understanding of fragments. Without fragments, lots of tokens suddenly become unreachable. This is actually a clue to why it's not working - when the tokens are directly surfaced, some of them can never match because another token would match first. The next step is to make the basic tokens match properly, and ignore most of the higher level productions for the time being. That should let me understand what fragments are really for too.

Update 2

I figured out the matching problem; it was due to my lack of understanding of the separation of lexer and parser. The lexer runs first, without knowledge of the higher level productions. It must be able to unambiguously split up the input stream into lexer tokens, without knowing how they will be consumed. This presents a few problems for the lmnl grammar, as it has several lexer tokens that are subsets of another (e.g. encodingDecl and versionInfo). This means the lexer cannot disambiguate them, without additional help. One way to do this is via syntactic predicates - a boolean flag that is set or unset depending on the context.

However, we cannot use context supplied by the higher level productions - it must all happen at the lexer token level. This is not particularly easy, but probably not impossible. It does make for hard to read grammars though. A similar problem will be encountered with content inside annotation tags, as a global boolean flag will not cope properly with nested tags (but I haven't got that far yet).

The bottom line is that lexer/parser combinations are not particularly well suited to parsing tag languages with lexer tokens that should apply or not apply contextually. SimpleParse does not have this limitation, as it has no separate lexing stage. An alternative would be to explore PEG parsers (Parsing Expression Grammar), which do not have a separate lexing stage. For the time being, I will continue with Antlr and AntlrWorks and see how far I get.

Experiment 3: ANTLR with a dumb lexer

Requirements

The same as for experiment 2.

Building the grammar

Having tried to make an intelligent lexer (and succeeding, except for recursive structures like annotations), I decided to make a dumb lexer, that just splits up the input text into vaguely useful structures, and to place all the intelligence in the parser layer. The current grammar processes all LMNL constructs correctly, although I have not yet added predicates to test that the characters are all valid in each section.

There are some other weaknesses, notably that ending an annotation currently just tests that there are no other open tags in the content inside the annotation. It has proved difficult to match the start and end annotation tags using their tagnames, as I had to decide whether an end tag was an end tag or an end annotation tag before the parser gets to see the name of the tag. It may be possible to improve on this. The only real downside of this is that, if the content in an annotation is poorly formed, it prevents the parser from recovering gracefully, and probably stops the rest of the document from parsing correctly. So it works, but it is not very robust.

Finally, the parser does nothing except parse a LMNL document. It does not construct useful objects from parsing the LMNL, so at present it does nothing except prove it is possible to parse LMNL with ANLTR.

Running the Parser

(I can't currently upload grammar files to the wiki; hopefully this will soon be resolved.)

The parser currently has no test harness, but this is not necessary if using the ANTLRWorks IDE, as the target language is java, which ANTLRWorks natively understands. So you can "debug" a grammar (i.e. feed it an input text or file) directly in the IDE. This also lets you see the resulting parse tree as a nice graphical diagram, which is very useful.

Results

The current version I have produced is still a beta development version; it's a bit messy and needs tidying up a bit, but it does work. As such, it isn't fully conformant with the LMNL on the detailed syntax page, only because I have experimented with various things as I went along. I will produce a conformant version, and an experimental version. I will also produce a python version (the current version is java only).


___________________________

Matt Palmer, September 2008. I can be contacted at ieee.org using my name and a dot in the usual way.