Creole Open Issues

From LMNLWiki

This page discusses some of the open issues with Creole.

Annotation and Tag Ordering

There are two related issues that we need to resolve: how to handle markup languages like XML, where attributes are unordered (in LMNL, annotations are [at the moment] ordered), and how to handle markup languages like LMNL, where adjacent tags are unordered.

To illustrate the first of these issues, with the current algorithm, if you have:

<element name="foo">
  <attribute name="bar" />
  <attribute name="baz" />
  <text />
</element>

then

[foo [bar}...{bar] [baz}...{baz]}...{foo]

is OK but

[foo [baz}...{baz] [bar}...{bar]}...{foo]

is not. This is OK for LMNL, but if someone were to use Creole with events generated from an XML document, they would rather expect

<foo bar="..." baz="...">...</foo>

to work.

Similarly, with the current algorithm, if you have:

<range name="foo">
  <range name="bar"><text /></range>
</range>

then

[foo}[bar}...{bar]{foo]

is valid but

[foo}[bar}...{foo]{bar]
[bar}[foo}...{foo]{bar]
[bar}[foo}...{bar]{foo]

are invalid. We have always said that the four options above are equivalent. Even if we use the order of the start tags to determine the ordering of ranges within the range layer, the above schema still ought to validate

[foo}[bar}...{foo]{bar]

We could have two flags (annotationsAreUnordered and tagsAreUnordered) that determine whether the order of annotations/tags is significant. Or we could use different kinds of events to indicate the different behaviour of different markup languages: generate StartUnorderedAnnotation events for unordered annotations, for example.

Note: Jeni's very wary of using flags to modify validation behaviour.

Modifying significance of annotation order

If annotationsAreUnordered is true and we receive a StartAnnotation event, or if we receive a StartUnorderedAnnotation event (i.e. when processing XML), then all Group patterns would be treated like Interleave patterns. For the purposes of dealing with the event, the schema

<element name="foo">
  <attribute name="bar" />
  <attribute name="baz" />
  <text />
</element>

is effectively transformed into the schema

<element name="foo">
  <interleave>
    <attribute name="bar" />
    <attribute name="baz" />
    <text />
  </interleave>
</element>

which means the attributes can appear in any order.

This is the way that attributes are handled in RELAX NG.
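The difference between the two treatments can be sketched in a few lines of Python. This is only an illustration, not the Creole algorithm: it assumes every annotation is required and appears once, it ignores the text content, and the function names are hypothetical.

```python
from collections import Counter

def matches_group(expected, seen):
    """Ordered match (Group): annotations must arrive in schema order."""
    return list(seen) == list(expected)

def matches_interleave(expected, seen):
    """Unordered match (Interleave): same annotations, in any order."""
    return Counter(seen) == Counter(expected)

# Annotation names taken from the example schema above.
schema_annotations = ["bar", "baz"]

print(matches_group(schema_annotations, ["baz", "bar"]))       # False
print(matches_interleave(schema_annotations, ["baz", "bar"]))  # True
```

With the Group treatment only the order bar, baz is accepted; with the Interleave treatment both orders are accepted, which is what someone feeding in XML events would expect.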

Modifying significance of tag order

To handle a tagsAreUnordered flag, we would introduce (to the algorithm, not the language) StartTag and EndTag patterns, and transform every Range pattern into a group containing a StartTag pattern, the content pattern, and an EndTag pattern. So the schema

<range name="foo">
  <range name="bar"><text /></range>
  <range name="baz"><text /></range>
</range>

would be effectively

<group>
  <startTag name="foo" />
  <startTag name="bar" />
  <text />
  <endTag name="bar" />
  <startTag name="baz" />
  <text />
  <endTag name="baz" />
  <endTag name="foo" />
</group>

We could then say that all nested Group patterns are expanded into their parent, and that each run of adjacent StartTag and EndTag patterns is then grouped together into an Interleave. The above example would effectively become

<group>
  <interleave>
    <startTag name="foo" />
    <startTag name="bar" />
  </interleave>
  <text />
  <interleave>
    <endTag name="bar" />
    <startTag name="baz" />
  </interleave>
  <text />
  <interleave>
    <endTag name="baz" />
    <endTag name="foo" />
  </interleave>
</group>

which would permit (among other permutations)

[bar}[foo}...[baz}{bar]...{foo]{baz]
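As a rough illustration of the transformation just described, the following Python sketch flattens a non-recursive range into start tag, content and end tag items, groups each run of adjacent tags into an unordered segment, and checks an event sequence against the result. The data representation (tuples for ranges and events) and the function names are assumptions made for this sketch; it handles only ranges and text, not the full pattern language.

```python
TEXT = ("text", "")

def flatten(range_):
    """Expand a (name, children) range into start tag, content, end tag items."""
    name, children = range_
    items = [("start", name)]
    for child in children:
        if child == "text":
            items.append(TEXT)
        else:
            items.extend(flatten(child))
    items.append(("end", name))
    return items

def group_tags(items):
    """Collect each run of adjacent tag items into one unordered segment."""
    segments = []
    for item in items:
        if item == TEXT:
            segments.append("text")
        elif segments and segments[-1] != "text":
            segments[-1].append(item)
        else:
            segments.append([item])
    return segments

def validates(segments, events):
    """Check events against segments; tags within a segment may come in any order."""
    pos = 0
    for seg in segments:
        if seg == "text":
            if pos >= len(events) or events[pos] != TEXT:
                return False
            pos += 1
        else:
            if sorted(events[pos:pos + len(seg)]) != sorted(seg):
                return False
            pos += len(seg)
    return pos == len(events)

# The foo/bar/baz schema from above.
schema = ("foo", [("bar", ["text"]), ("baz", ["text"])])
segments = group_tags(flatten(schema))

# [bar}[foo}...[baz}{bar]...{foo]{baz]
events = [("start", "bar"), ("start", "foo"), TEXT,
          ("start", "baz"), ("end", "bar"), TEXT,
          ("end", "foo"), ("end", "baz")]
print(validates(segments, events))  # True
```

Note that flatten relies on being able to expand every range completely, so it only works when the schema contains no recursive references.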

A problem with this approach arises when you have recursive structures: it's simply not possible to expand all ranges into start tags, content and end tags when you have a structure like:

<define name="section">
  <range name="section">
    <oneOrMore><ref name="para" /></oneOrMore>
    <zeroOrMore><ref name="section" /></zeroOrMore>
  </range>
</define>
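To make the difficulty concrete: a full expansion of this definition would never terminate, because expanding section requires expanding section again. The Python sketch below only detects the situation; it does not solve it. It assumes a hypothetical representation in which each define is mapped to the names it references.

```python
def is_recursive(defines, name, seen=()):
    """True if expanding `name` would revisit a define that is already being expanded."""
    if name in seen:
        return True
    return any(is_recursive(defines, ref, seen + (name,))
               for ref in defines.get(name, []))

# The "section" define above refers to "para" and to itself.
defines = {"section": ["para", "section"], "para": []}

print(is_recursive(defines, "section"))  # True
print(is_recursive(defines, "para"))     # False
```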

Currently unresolved. Jeni 05:13, 8 September 2006 (EDT)

Self-Contained Subsections

Rather than having a distinction between ranges and elements, we could introduce a <contained> pattern. The syntax would be:

<contained> pattern+ </contained>

and if more than one pattern were given in the content, they would be wrapped within a <group> during simplification.

The semantics are that the portion of the document matching the pattern must be self-contained: no range may overlap it. The <element> pattern then becomes a shorthand:

<element> nc p </element>

is a shorthand for

<contained> <range> nc p </range> </contained>
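One way to picture the self-containment constraint is as a check on spans of the document. The Python sketch below assumes that the matched portion and every range can be represented as (start, end) character offsets; that representation and the function names are assumptions of the sketch, not part of Creole.

```python
def overlaps(a, b):
    """True if spans a and b partially overlap: they intersect, but neither contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_start < a_end < b_end or b_start < a_start < b_end < a_end

def is_self_contained(span, ranges):
    """True if no range in `ranges` overlaps the span matched by <contained>."""
    return not any(overlaps(span, r) for r in ranges)

# Ranges that enclose the span or sit entirely inside it are fine.
print(is_self_contained((5, 15), [(0, 20), (6, 9)]))  # True
# A range that starts outside the span and ends inside it is not.
print(is_self_contained((5, 15), [(0, 10)]))          # False
```

Under this reading, an element is simply a range whose span passes this check, which matches the shorthand given above.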

More to come...