Creole Open Issues
From LMNLWiki
This page discusses some of the open issues with Creole.
Annotation and Tag Ordering
There are two related issues that we need to find answers to: how to handle markup languages like XML where attributes are unordered (in LMNL, annotations are [at the moment] ordered), and how to handle markup languages like LMNL where adjacent tags are unordered.
To illustrate the first of these issues, with the current algorithm, if you have:
<element name="foo"> <attribute name="bar" /> <attribute name="baz" /> <text /> </element>
then
[foo [bar}...{bar] [baz}...{baz]}...{foo]
is OK but
[foo [baz}...{baz] [bar}...{bar]}...{foo]
is not. This is OK for LMNL, but someone using Creole with events generated from an XML document would expect
<foo bar="..." baz="...">...</foo>
to work.
Similarly, with the current algorithm, if you have:
<range name="foo"> <range name="bar"><text /></range> </range>
then
[foo}[bar}...{bar]{foo]
is valid but
[foo}[bar}...{foo]{bar]
[bar}[foo}...{foo]{bar]
[bar}[foo}...{bar]{foo]
are invalid. We have always said that the four options above are equivalent. Even if we use the order of start tags to determine the ordering of ranges within the range layer, the above schema still ought to validate
[foo}[bar}...{foo]{bar]
We could have two flags (annotationsAreUnordered and tagsAreUnordered) that determine whether the order of annotations/tags is significant. Or we could use different kinds of events to indicate the different behaviour of different markup languages: generate StartUnorderedAnnotation events for unordered annotations, for example.
- Note: Jeni's very wary of using flags to modify validation behaviour.
Modifying significance of annotation order
If annotationsAreUnordered is true and we receive a StartAnnotation event, or if we receive a StartUnorderedAnnotation event (i.e. when processing XML), then all Group patterns would be treated like Interleave patterns. For the purposes of dealing with the event, the schema
<element name="foo"> <attribute name="bar" /> <attribute name="baz" /> <text /> </element>
is effectively transformed into the schema
<element name="foo">
<interleave>
<attribute name="bar" />
<attribute name="baz" />
<text />
</interleave>
</element>
which means the attributes can appear in any order.
This is the way that attributes are handled in RELAX NG.
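As a rough illustration of the rule above, the following Python sketch decides when a Group should be treated as an Interleave and compares the two matching modes. It is only a sketch under stated assumptions: the function names are hypothetical and not part of Creole, and it assumes that every annotation named in the schema is required exactly once.

```python
def treat_group_as_interleave(event_kind, annotations_are_unordered=False):
    """Decide whether Group patterns act like Interleave for this event.

    The event names and the flag follow the proposal in the text; the
    function itself is a hypothetical helper.
    """
    if event_kind == "StartUnorderedAnnotation":
        return True
    return event_kind == "StartAnnotation" and annotations_are_unordered


def annotations_match(expected, actual, unordered):
    """Check annotation names against the schema (each required exactly once)."""
    if unordered:
        # Interleave: the same names in any order.
        return sorted(actual) == sorted(expected)
    # Group: the same names in schema order.
    return list(actual) == list(expected)


schema = ["bar", "baz"]

# Events generated from XML: attribute order is not significant.
unordered = treat_group_as_interleave("StartUnorderedAnnotation")
print(annotations_match(schema, ["baz", "bar"], unordered))

# Ordered LMNL annotations: the same sequence is rejected.
print(annotations_match(schema, ["baz", "bar"], False))
```

A real implementation would work on pattern derivatives rather than name lists; the point here is only the switch between ordered and unordered matching.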
Modifying significance of tag order
To handle a tagsAreUnordered flag, we would introduce (to the algorithm, not the language) StartTag and EndTag patterns, and transform every Range pattern into a group containing a StartTag pattern, the content pattern, and an EndTag pattern. So the schema
<range name="foo"> <range name="bar"><text /></range> <range name="baz"><text /></range> </range>
would be effectively
<group>
<startTag name="foo" />
<startTag name="bar" />
<text />
<endTag name="bar" />
<startTag name="baz" />
<text />
<endTag name="baz" />
<endTag name="foo" />
</group>
We could then say that all Group patterns are expanded, and that each run of adjacent StartTag and EndTag patterns is then grouped together into an Interleave. The above example would become effectively
<group>
<interleave>
<startTag name="foo" />
<startTag name="bar" />
</interleave>
<text />
<interleave>
<endTag name="bar" />
<startTag name="baz" />
</interleave>
<text />
<interleave>
<endTag name="baz" />
<endTag name="foo" />
</interleave>
</group>
which would permit (among other permutations)
[bar}[foo}...[baz}{bar]...{foo]{baz]
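The transformed schema above can be checked against a tag sequence with a few lines of Python. This is a minimal sketch, not the Creole algorithm: the `matches` function and the list-of-steps encoding are hypothetical, and it assumes each `<text />` matches exactly one non-empty text event (written `"text"` here).

```python
def matches(pattern, events):
    """Match events against a group whose steps are either the string
    "text" or a frozenset of tags that may appear in any order."""
    pos = 0
    for step in pattern:
        if step == "text":
            if pos >= len(events) or events[pos] != "text":
                return False
            pos += 1
        else:
            # Interleave of distinct tags: the next len(step) events
            # must be exactly those tags, in any order.
            chunk = events[pos:pos + len(step)]
            if len(chunk) != len(step) or set(chunk) != set(step):
                return False
            pos += len(step)
    return pos == len(events)


# The group of interleaves from the example above.
PATTERN = [
    frozenset({"[foo}", "[bar}"}),
    "text",
    frozenset({"{bar]", "[baz}"}),
    "text",
    frozenset({"{baz]", "{foo]"}),
]

# [bar}[foo}...[baz}{bar]...{foo]{baz]
permuted = ["[bar}", "[foo}", "text", "[baz}", "{bar]", "text", "{foo]", "{baz]"]
print(matches(PATTERN, permuted))
```

The same pattern also accepts the tags in their original schema order, and rejects a sequence in which text separates two tags that the schema places next to each other.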
A problem with this approach arises when you have recursive structures: it's simply not possible to expand all ranges into start tags, content and end tags when you have a structure like:
<define name="section">
<range name="section">
<oneOrMore><ref name="para" /></oneOrMore>
<zeroOrMore><ref name="section" /></zeroOrMore>
</range>
</define>
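The difficulty can be made concrete with a small Python sketch. It unrolls the self-reference in the definition above a fixed number of times; the `expand_section` function and the string encoding of the steps are hypothetical, chosen only for illustration. However deep the unrolling goes, the result still contains an unexpanded reference to "section", so no finite flat sequence of start tags, content and end tags exists.

```python
def expand_section(depth):
    """Unroll the recursive 'section' definition `depth` times.

    Returns a flat list of steps; the innermost self-reference is left
    as an unexpanded "ref:section" marker.
    """
    if depth == 0:
        return ["ref:section"]
    return (
        ["startTag:section", "oneOrMore(ref:para)", "zeroOrMore("]
        + expand_section(depth - 1)
        + [")", "endTag:section"]
    )


for depth in range(4):
    steps = expand_section(depth)
    # The flat form keeps growing, and the reference never disappears.
    print(depth, len(steps), "ref:section" in steps)
```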
Currently unresolved. Jeni 05:13, 8 September 2006 (EDT)
Self-Contained Subsections
Rather than having a distinction between ranges and elements, we could introduce a <contained> pattern. The syntax would be:
<contained> pattern+ </contained>
and if more than one pattern were given in the content, they would be wrapped within a <group> during simplification.
The semantics are that the portion of the document matching the pattern must be self-contained: no range can overlap it. The <element> pattern then becomes a shorthand:
<element> nc p </element>
is a shorthand for
<contained> <range> nc p </range> </contained>
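One possible reading of the self-containment constraint is sketched below in Python. The names `overlaps` and `is_self_contained` are hypothetical, and the sketch makes two assumptions that the proposal does not settle: spans are half-open (start, end) character offsets, and ranges that share a boundary with the contained span, nest inside it, or enclose it do not count as overlapping.

```python
def overlaps(a, b):
    """True if spans a and b partially overlap: they intersect, but
    neither one contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    return (a_start < b_start < a_end < b_end
            or b_start < a_start < b_end < a_end)


def is_self_contained(span, ranges):
    """True if no range in `ranges` overlaps the contained span."""
    return not any(overlaps(span, r) for r in ranges)


contained = (5, 10)

# An enclosing range, a nested range and an adjacent range are all fine.
print(is_self_contained(contained, [(0, 20), (6, 8), (10, 15)]))

# A range that starts outside and ends inside breaks containment.
print(is_self_contained(contained, [(3, 7)]))
```

Under this reading, an <element> behaves as in XML: it may nest inside other ranges and contain others, but nothing may straddle its boundaries.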
More to come...
