Creole
From LMNLWiki
Creole (Composable Regular Expressions for Overlapping Languages etc.) is a schema language that Jeni is developing for LMNL and other overlapping markup languages. It's based heavily on RELAX NG, with a few extensions that make it suitable for overlapping structures.
- As well as the items under discussion on the discussion page, there's a page of open issues to look at.
Contents |
Aims and Scope
Creole has the following aims:
- Composable
- The definition of overlap between ranges should be composable, rather than being done at a grammar level. Users should be able to define overlap within particular parts of a document rather than having to repeat the definition of an entire tree.
- Ranges, not tags
- Users should be able to specify how ranges overlap each other without detailing where start and end tags can appear.
- RELAX NG extension
- Creole should be an extension of RELAX NG, such that all RELAX NG grammars are Creole grammars.
- Language neutral
- Creole should be applicable to many different kinds of markup languages, including LMNL, TexMecs, and plain old XML.
It's important to note that Creole is not designed to be used to identify hierarchies that can be extracted from a document as XML structures.
Syntax
Creole is RELAX NG with the following extensions:
- It has
<annotation>patterns. These match annotations, which can have annotations themselves and can have ranges in their content. The<attribute>pattern is a shorthand for an<annotation>pattern. - It has
<atom>patterns. These match atoms, which have names and annotations but are treated as content (on a par with text) in LMNL. - It has
<range>patterns. These match ranges, which can overlap other ranges. The<element>pattern is a shorthand for a<range>wrapped in a<partition>. - It has
<partition>patterns. These indicate blocks of content that must be self-contained: ranges cannot start within them unless they also end within them, and vice versa. - It has
<interleave>and<concur>patterns. The only difference between them is that any content (text or atoms) must match both subpatterns of a concur rather than matching either subpattern of an interleave. - It has
<concurOneOrMore>and<concurZeroOrMore>patterns. These provide for self-overlap.
Annotations
In LMNL, annotations can have annotations themselves, and can have structured content, putting them on a par with elements. For example, you can do
[book [title}Beginning XSLT 2.0{]
[author [title}Dr{]}Jeni [surname}Tennison{surname]{]}
...
{book]
Annotation patterns match annotations, with the syntax
<annotation name="QName"> pattern </annotation>
or
<annotation> nameClass pattern </annotation>
Attribute patterns are still allowed in Creole. They are equivalent to an annotation pattern with the same name attribute or nameClass and pattern. If the attribute pattern doesn't have a subpattern, then the equivalent annotation pattern has the subpattern <text/>. Thus
<attribute name="title" />
is equivalent to
<annotation name="title"><text /></annotation>
Atoms
The basic building block of LMNL documents are atoms, which are pieces of content. Ranges range over sequences of atoms, and in fact text is considered simply a sequence of character atoms. Atoms have names and can have annotations. They are typically used in LMNL documents for media objects such as graphics which are part of the content of the page.
- Note: Atoms are usually represented as empty elements in XML-based markup languages.
Creole adds an atom pattern to RELAX NG, which has the same syntax as the element pattern
<atom name="QName"> pattern </atom>
or
<atom> nameClass pattern </atom>
For example, the schema
<define name="img-atom">
<atom name="img">
<annotation name="src"><text /></annotation>
<annotation name="alt"><ref name="inline" /></annotation>
<annotation name="desc"><ref name="blocks" /></annotation>
</atom>
</define>
would be used for an atom such as
{{img [src}logo.jpg{]
[alt}[acronym}LMNL{acronym] logo{]
[desc}
[para}
The [acronym}LMNL{acronym] logo is a jester's hat...
{para]
{desc]}}
Because atoms only have annotations, the following paths are disallowed in simplified Creole.
atom//range atom//element atom//atom atom//text
Simplified Creole ensures that every <atom>, <annotation>, <element> and <range> element resides in its own <define> element, so would transform the above example to
<define name="img-atom">
<atom name="img">
<ref name="img-src-annotation" />
<ref name="img-alt-annotation" />
<ref name="img-desc-annotation" />
</atom>
</define>
<define name="img-src-annotation">
<annotation name="src"><text /></annotation>
</define>
<define name="img-alt-annotation">
<annotation name="alt">...</annotation>
</define>
<define name="img-desc-annotation">
<annotation name="desc">...</annotation>
</define>
Thus the constraints on the content of the <atom> element are not broken.
Ranges
Creole adds a range pattern to RELAX NG, which has exactly the same syntax as the element pattern
<range name="QName"> pattern </range>
or
<range> nameClass pattern </range>
A range can be overlapped (other ranges can start within it but not end within it, and vice versa). Element patterns are still allowed in Creole. They are equivalent to a range pattern with the same name attribute or nameClass and pattern, wrapped in a <partition> element. For example,
<element name="foo"> <text /> </element>
is equivalent to
<partition> <range name="foo"> <text /> </range> </partition>
Ranges can only overlap with ranges that are part of another interleaving or overlapping subpattern; any ranges that are defined to be in the content of a range must be completely enclosed within that range. For example, the interleave pattern
<interleave>
<range name="foo">
<mixed><range name="baz"><text /></range></mixed>
</range>
<range name="bar"><text /></range>
</interleave>
allows overlap such as
[foo}...[bar}...[baz}...{baz]...{foo]...{bar]
[foo}...[baz}...[bar}...{baz]...{foo]...{bar]
but not
[foo}...[bar}...[baz}...{foo]...{baz]...{bar]
because the [baz] range is defined to be within the [foo] range, and therefore cannot overlap it. If [baz] were to be declared as an element rather than a range, then
[foo}...[baz}...[bar}...{baz]...{foo]...{bar]
would not be legal because the range [bar] starts within [baz] but does not finish within it.
Partitions
The most common use of partitions is to indicate a range that cannot be overlapped, otherwise known as an element. Some of the time, whether you classify a start/end tag pair as an element or a range doesn't matter. For example,
[foo}...{foo]
will match both the patterns
<element name="foo"><text /></element>
which is a shorthand for
<partition><range name="foo"><text /></range></partition>
and
<range name="foo"><text /></range>
However, when you use interleave (or overlap), the difference becomes apparent. Take the interleave pattern, as applied to elements:
<interleave> <element name="foo"><text /></element> <element name="bar"><text /></element> </interleave>
This pattern allows
[foo}...{foo][bar}...{bar]
[bar}...{bar][foo}...{foo]
but not
[foo}...[bar}...{foo]...{bar]
[foo}...[bar}...{bar]...{foo]
[bar}...[foo}...{bar]...{foo]
[bar}...[foo}...{foo]...{bar]
However, the interleave pattern when applied to ranges
<interleave> <range name="foo"><text /></range> <range name="bar"><text /></range> </interleave>
allows all of the above.
The <partition> element can hold any pattern. For example, you can use it to say that a particular group of ranges must be adjacent to each other.
<interleave>
<partition>
<range name="foo"><text /></range>
<range name="bar"><text /></range>
</partition>
<range name="baz"><text /></range>
</interleave>
allows
[foo}...{foo][bar}...{bar][baz}...{baz]
but not
[foo}...{foo][baz}...{baz][bar}...{bar]
Interleave and Concur
Creole adds an concur pattern to RELAX NG. The syntax is the same as interleave, group, choice and so on
<concur> pattern+ </concur>
As you can see from the example in the previous section, interleave patterns actually allow ranges to overlap each other in any way. Sometimes, though, you want to constrain the overlap such that the same content (text and atoms) is completely covered by the ranges. For example, when marking up the Bible, every piece of text in the Bible should be within both a [v] (verse) range and a [s] (sentence) range, though [v] and [s] ranges can, of course, overlap. We want to allow
[v}[s}In the beginning of creation, when God made heaven and earth,{v]
[v}the earth was without form and void, with darkness over the face of
the abyss, and a mighty wind that swept over the surface of the waters.{s]{v]
[v}[s}God said, "Let there be a light," and there was light;{v]
[v}and God saw that the light was good, and he separated the light from darkness.{s]{v]
[v}[s}He called the light day, and the darkness night. So evening came, and
morning came, the first day.{s]{v]
but not allow
[s}In the beginning of creation, when God made heaven and earth,
the earth was without form and void, with darkness over the face of
the abyss, and a mighty wind that swept over the surface of the waters.{s]
[v}God said, "Let there be a light," and there was light;{v]
[v}and God saw that the light was good, and he separated the light from darkness.{v]
[v}[s}He called the light day, and the darkness night. So evening came, and
morning came, the first day.{s]{v]
If we used an interleave pattern like
<interleave>
<oneOrMore>
<range name="s"><text /></range>
</oneOrMore>
<oneOrMore>
<range name="v"><text /></range>
</oneOrMore>
</interleave>
then both would be allowed. However, if we use a concur pattern like
<concur>
<oneOrMore>
<range name="s"><text /></range>
</oneOrMore>
<oneOrMore>
<range name="v"><text /></range>
</oneOrMore>
</concur>
then the second is invalid because text appears before the [v] range.
A more complicated overlap example is
[chapter}
[section}
[para}[v}...{v][v}...{para]
[para}...{v][v}...{v]{para]
... more paras ...
{section]
... more sections ...
{chapter]
Here, there are two hierarchies in play — a chapter/verse hierarchy and a section/para hierarchy — and the text must be covered in both. So we would use:
<concur>
<oneOrMore>
<range name="chapter">
<oneOrMore>
<range name="v"><text /></range>
</oneOrMore>
</range>
</oneOrMore>
<oneOrMore>
<range name="section">
<oneOrMore>
<range name="para"><text /></range>
<oneOrMore>
</range>
</oneOrMore>
</concur>
Now let's introduce a heading for section:
[chapter}
[section}[heading}The creation of the world{heading]
[para}[v}...{v][v}...{para]
[para}...{v][v}...{v]{para]
... more paras ...
{section]
... more sections ...
{chapter]
The heading is a self-contained range (it doesn't overlap anything), and we don't want the text that it contains to be included within a verse. When a partition is encountered, everything else is set aside until that partition is completed; the concur goes temporarily "out of scope", and the constraint that all text must appear in both parts of the concur is lifted. So the document is valid against the schema
<concur>
<oneOrMore>
<range name="chapter">
<oneOrMore>
<range name="v"><text /></range>
</oneOrMore>
</range>
</oneOrMore>
<oneOrMore>
<range name="section">
<element name="heading"><text /></element>
<oneOrMore>
<range name="para"><text /></range>
<oneOrMore>
</range>
</oneOrMore>
</concur>
- Note: Were we to extract a chapter/verse hierarchy from the document, we would want to ignore any text within headings.
Usually a range will be in only one of the two branches of a <concur>, but in some cases low-level ranges belong to both branches. For example, consider the following poem which appears across pages 199 and 200 in a particular book:
[page [n}199{]}
...
[poem}
[title}[pl}Recueillement{pl]{title]
[stanza}
[sl}[pl}Sois sage, ô ma douleur, et tiens-toi plus {pl]
[pl}tranquille.{pl]{sl]
[sl}[pl}Tu réclamais le Soir; il descend; le voici:{pl]{sl]
[sl}[pl}Une atmosphère obscure enveloppe la ville,{pl]{sl]
[sl}[pl}Aux uns portant la paix, aux autres le souci.{pl]{sl]
{stanza]
[stanza}
[sl}[pl}Pendant que des mortels la multitude vile,{pl]{sl]
[sl}[pl}Sous le fouet du Plaisir, ce bourreau sans merci,{pl]{sl]
[sl}[pl}Va cueillir des remords dans la fête servile,{pl]{sl]
[sl}[pl}Ma douleur, donne moi la main; viens par ici,{pl]{sl]
{stanza]
[stanza}
[sl}[pl}Loin d'eux.{s] [s}Vois se pencher les défuntes {pl]
[pl}Années,{pl]{sl]
[sl}[pl}Sur les balcons du ciel, en robes surannées;{pl]{sl]
[sl}[pl}Surgir du fond des eaux le Regret souriant;{pl]{sl]
{stanza]{page]
[page [n}200{]}[stanza}
[sl}[pl}Le Soleil moribund s'endormir sous une arche,{pl]{sl]
[sl}[pl}Et, comme un long linceul traînant à l'Orient,{pl]{sl]
[sl}[pl}Entends, ma chère, entends la douce Nuit qui {pl]
[pl}marche.{pl]{sl]{s]
{stanza]
{poem]
...
{page]
Here, the [sl] ranges indicate lines within a stanza, and the [pl] ranges indicate lines on the page. Some lines in the stanza appear over two lines in the page, but it's not possible for the stanza lines and the page lines to overlap. The pattern for this is:
<concur>
<oneOrMore>
<range name="poem">
<range name="title">
<range name="pl"><text /></range>
</range>
<oneOrMore>
<range name="stanza">
<oneOrMore>
<range name="sl">
<oneOrMore>
<range name="pl"><text /></range>
</oneOrMore>
</range>
</oneOrMore>
</range>
</oneOrMore>
</range>
</oneOrMore>
<oneOrMore>
<range name="page">
<attribute name="n" />
<oneOrMore>
<range name="pl"><text /></range>
</oneOrMore>
</range>
</oneOrMore>
</concur>
When a [pl] range is encountered, it will match the range patterns for [pl] ranges within both [sl] and [page] ranges.
Like the interleave, group and choice patterns, concur is associative, and
<concur> p1 p2 p3 </concur>
is transformed to
<concur> <concur> p1 p2 </concur> p3 </concur>
during simplification.
Repeated Concur
Self-overlap is one of the harder problems to handle in overlapping markup languages. The problem comes where two ranges of the same name overlap each other, such as in
[phrase=a}Broadway [phrase=b}Hit{phrase=a] or Miss{phrase=b]
Creole introduces two new patterns — concurOneOrMore and concurZeroOrMore — to handle this situation. They have the same syntax as oneOrMore and zeroOrMore patterns in RELAX NG:
<concurOneOrMore> pattern+ </concurOneOrMore> <concurZeroOrMore> pattern+ </concurZeroOrMore>
The subpattern can be repeated one-or-more (or zero-or-more) times, and each repetition is concurrent with previous ones. The pattern for the [phrase] range in the above would be
<concurOneOrMore>
<mixed>
<range name="phrase"><text /></range>
</mixed>
</concurOneOrMore>
Another example is indicating paginations within a document. A document might have different page breaks in different editions, but all editions should cover the same text. The pattern for the pagination view would look something like
<concurOneOrMore>
<range name="edition">
<attribute name="id" />
<oneOrMore>
<range name="page">
<attribute name="n" />
...
</range>
</oneOrMore>
</range>
</concurOneOrMore>
As with <oneOrMore> and <zeroOrMore>, during simplification of the schema, the content of the <concurOneOrMore> or <concurZeroOrMore> element is wrapped in a <group>, and a <concurZeroOrMore> is transformed such that
<concurZeroOrMore> p </concurZeroOrMore>
becomes
<choice> <concurOneOrMore> p </concurOneOrMore> <empty /> </choice>
Extended Example
Here is an example of Creole in use. The document is (in LMNL syntax):
[play [title}Peer Gynt{title]}
[act [n}1{]}
[scene [n}i{]}
[sp [who}Aase{]}[l} Peer, you're lying!{sp]
[sp [who}Peer{]}[stage}without stopping{stage] No, I'm not! {l]{sp]
[sp [who}Aase{]}[l} Well then, swear to me it's true! {l]{sp]
[sp [who}Peer{]}[l} Swear? Why should I? {sp]
[sp [who}Aase{]} See, you dare not! {l]
[l} Every word of it's a lie! {l] {sp]
[!-- ... more dialogue ... --]
{scene]
[!-- ... more scenes ... --]
{act]
[!-- ... more acts ... --]
{play]
A Creole schema that validates this document is:
<grammar xmlns="http://www.lmnl.org/schema/pattern"> <start> <ref name="play"/> </start> <define name="play"> <element name="play"> <annotation name="title"><text/></annotation> <oneOrMore><ref name="act"/></oneOrMore> </element> </define> <define name="act"> <element name="act"> <annotation name="n"><text/></annotation> <oneOrMore><ref name="scene"/></oneOrMore> </element> </define> <define name="scene"> <element name="scene"> <annotation name="n"><text/></annotation> <interleave> <zeroOrMore><ref name="stage"/></zeroOrMore> <concur> <oneOrMore><ref name="sp"/></oneOrMore> <oneOrMore><ref name="l"/></oneOrMore> </concur> </interleave> </element> </define> <define name="sp"> <range name="sp"> <annotation name="who"><text /></annotation> <mixed> <zeroOrMore><ref name="stage"/></zeroOrMore> </mixed> </range> </define> <define name="l"> <range name="l"> <mixed> <zeroOrMore><ref name="stage"/></zeroOrMore> </mixed> </range> </define> <define name="stage"> <element name="stage"><text/></element> </define> </grammar>
The most pertinent part is the scene pattern:
<define name="scene">
<element name="scene">
<annotation name="n"><text/></annotation>
<interleave>
<zeroOrMore><ref name="stage"/></zeroOrMore>
<concur>
<oneOrMore><ref name="sp"/></oneOrMore>
<oneOrMore><ref name="l"/></oneOrMore>
</concur>
</interleave>
</element>
</define>
This says that a [scene] range acts as an element (it can't have anything overlapping it). There are three kinds of ranges that can appear within a [scene]: [stage], [sp] and [l]. [sp] and [l] ranges overlap each other, and [stage] ranges interleave these overlapped sections. Because the [stage] range is treated as an element, it can't itself be overlapped. Because the [sp] and [l] ranges are concurrent rather than being interleaved, they must cover the same content: any content that doesn't appear within a [stage] must appear in both a [sp] and a [l].
Implementation
Creole can be implemented using a similar algorithm to the algorithm for RELAX NG validation supplied by James Clark. This section only describes the modifications to the algorithm described there.
A proof-of-concept implementation in XSLT 2.0 is available. The implementation involves the following stylesheets:
- parse-lmnl.xsl (download)
- parses a document in LMNL syntax into a sequence of event elements
- compile-creole.xsl (download)
- compiles a schema document into a sequence of
<define>elements, one for each distinct pattern in the schema - creole-impl.xsl (download)
- validates a sequence of events against a compiled schema
- validate-lmnl.xsl (download)
- takes a LMNL document (which it parses) and a schema document (which it compiles) and validates
Note that a sequence of events could be generated from any markup language, including XML, for validation against a Creole grammar.
Background
Michael Sperberg McQueen's Extreme 2006 paper on Rabbit/Duck grammars introduced a method of validating overlapping structures by defining two or more grammars that are each applied to the same document in parallel. Within a single grammar, an element can fall into one of four categories:
- first-class elements are normal elements in the hierarchy
- second-class elements are milestones within the hierarchy
- third-class elements are transparent: their content is included but they are not
- fourth-class elements are opaque: neither they nor their content appear
Michael described how to use Brzozowski derivatives to validate documents against such grammars. Brzozowski derivatives are a way of implementing (and formalising) regular expressions. The algorithm for RELAX NG specified by James Clark similarly applies Brzozowski derivatives to RELAX NG. Brzozowski derivatives were originally applied to matching strings with normal regular expressions. The idea is that if you've got a regular expression and the first character from the string, you can then work out a new regular expression that will match the rest of the string. The new regular expression is called the derivative.
When you apply Brzozowski derivatives to validation of markup languages, the "characters" that you're looking at are actually events generated from parsing a document, and the "regular expression" is held in a schema, but the basic idea (and a lot of the fundamental algorithm) is the same.
One place where regular expressions for markup languages departs from regular expressions for strings is in the treatment of element content. When parsing XML and similar markup languages, if you match a start tag against an element pattern, you need to make sure that the next bunch of events satisfy the pattern you've got for the content of that element and are then followed by an end tag. The naive way of doing this is to add the content model and end tag patterns to the start of the derivative, but if you have an interleave pattern then this will permit events that should be matched against the content model of the element to be matched against the interleaved pattern instead.
In James Clark's algorithm for RELAX NG, he introduces a special after pattern, which doesn't appear in RELAX NG itself, to deal with precisely this problem. In Michael's paper, for some reason he didn't, despite the fact that individual rabbit/duck grammars don't allow overlapping elements. This oversight prompted the line of research which has led to Creole...
Publications
- Tennison, J. (2007) Creole: Validating Overlapping Markup. In Proceedings of XTech 2007, Paris, France.
