Algorithm for Creole Validation
From LMNLWiki
A Creole validator is provided with a sequence of events and a pattern. To validate the events against the pattern, the validator takes the first event and calculates a derivative: a new pattern that should be applied to the remainder of the events. It continues doing this until it runs out of events. At that point, if the remaining pattern is nullable (matches an empty sequence) then the validation has succeeded, otherwise it has failed and the document is invalid.
James Clark's algorithm for RELAX NG defines how to calculate a derivative for each combination of XML event and RELAX NG pattern, as well as some supporting functions. The algorithm for Creole is basically the same, but has to deal with overlapping content and a few more patterns. Only the modifications (shown in bold where applicable) are specified here.
- Note: This algorithm does not (yet) handle the fact that in some languages, such as XML, attributes/annotations are essentially unordered, whereas in others, such as LMNL, they are ordered (though you can use interleave to specify that the order doesn't matter). One possibility is to introduce
start-attributeandend-attributeevents for unordered annotations.
Contents |
Datatypes
The only datatype we have to add is an Id. Ids are required to make sure that the correct end tag matches a given start tag, in the case of self-overlapping ranges.
type Id = String
Events
Rather than a hierarchy of nodes, a document is considered as a stream of events. The important events to generate from a document are
data Event = StartTagEvent QName Id
| EndTagEvent QName Id
| StartAnnotationEvent QName
| EndAnnotationEvent QName
| TextEvent String Context
| StartAtomEvent QName
| EndAtomEvent QName
Patterns
A Pattern represents a pattern after simplification. Creole adds eight new patterns – Concur, ConcurOneOrMore, Partition, All, Atom, EndRange, EndAnnotation and EndAtom – and replaces Attribute with Annotation and Element with Range.
data Pattern = Empty
| NotAllowed
| Text
| Choice Pattern Pattern
| Interleave Pattern Pattern
| Group Pattern Pattern
| Concur Pattern Pattern
| Partition Pattern
| OneOrMore Pattern
| ConcurOneOrMore Pattern
| List Pattern
| Data Datatype ParamList
| DataExcept Datatype ParamList Pattern
| Value Datatype String Context
| Atom NameClass Pattern
| Annotation NameClass Pattern
| Range NameClass Pattern
| EndRange QName Id
| EndAnnotation QName
| EndAtom QName
| After Pattern Pattern
| All Pattern Pattern
The EndRange, EndAnnotation, EndAtom and All patterns, like the After pattern, is an artifact of the algorithm and doesn't correspond to any element used in a Creole grammar.
Utility Functions
nullable tests whether a pattern matches the empty sequence.
nullable:: Pattern -> Bool
For our new patterns, nullable is computed as follows:
nullable (Concur p1 p2) = nullable p1 && nullable p2 nullable (All p1 p2) = nullable p1 && nullable p2 nullable (ConcurOneOrMore p) = nullable p nullable (Partition p) = nullable p nullable (Atom _ _) = False nullable (Range _ _) = False nullable (Annotation _ _) = False nullable (EndAtom _) = False nullable (EndRange _ _) = False nullable (EndAnnotation _) = False
onlyAnnotations tests whether a pattern only contains Annotation patterns. This is used when calculating derivatives for Group patterns, since Annotation patterns can appear anywhere within a Group pattern.
onlyAnnotations:: Pattern -> Bool
onlyAnnotations (Group p1 p2) = onlyAnnotations p1 && onlyAnnotations p2 onlyAnnotations (Interleave p1 p2) = onlyAnnotations p1 && onlyAnnotations p2 onlyAnnotations (Concur p1 p2) = onlyAnnotations p1 && onlyAnnotations p2 onlyAnnotations (Choice p1 p2) = onlyAnnotations p1 && onlyAnnotations p2 onlyAnnotations (Partition p) = onlyAnnotations p onlyAnnotations (OneOrMore p) = onlyAnnotations p onlyAnnotations (ConcurOneOrMore p) = onlyAnnotations p onlyAnnotations (After p1 p2) = onlyAnnotations p1 && onlyAnnotations p2 onlyAnnotations (All p1 p2) = onlyAnnotations p1 && onlyAnnotations p2 onlyAnnotations (Annotation _ _) = True onlyAnnotations _ = False
allowsAnnotations returns true if the pattern allows annotations at all, not just when an annotation is next.
allowsAnnotations :: Pattern -> Bool
allowsAnnotations (Choice p1 p2) = allowsAnnotations p1 || allowsAnnotations p2 allowsAnnotations (Group p1 p2) = allowsAnnotations p1 || allowsAnnotations p2 allowsAnnotations (Interleave p1 p2) = allowsAnnotations p1 || allowsAnnotations p2 allowsAnnotations (Concur p1 p2) = allowsAnnotations p1 || allowsAnnotations p2 allowsAnnotations (Partition p) = allowsAnnotations p allowsAnnotations (OneOrMore p) = allowsAnnotations p allowsAnnotations (ConcurOneOrMore p) = allowsAnnotations p allowsAnnotations (After p1 p2) = if nullable p1 then (allowsAnnotations p1 || allowsAnnotations p2) else allowsAnnotations p1 allowsAnnotations (All p1 p2) = allowsAnnotations p1 && allowsAnnotations p2 allowsAnnotations (Annotation _ _) = True allowsAnnotations _ = False
allowsText tests whether text is allowed within a pattern. Whitespace-only text nodes are ignored if a pattern doesn't allow text.
allowsText:: Pattern -> Bool
allowsText (Group p1 p2) = if onlyAnnotations p1 then allowsText p2 else (if nullable p1 then (allowsText p1 || allowsText p2) else allowsText p1) allowsText (Interleave p1 p2) = allowsText p1 || allowsText p2 allowsText (Concur p1 p2) = allowsText p1 && allowsText p2 allowsText (Choice p1 p2) = allowsText p1 || allowsText p2 allowsText (OneOrMore p) = allowsText p allowsText (ConcurOneOrMore p) = allowsText p allowsText (After p1 p2) = if nullable p1 then (allowsText p1 || allowsText p2) else allowsText p1 allowsText (All p1 p2) = allowsText p1 && allowsText p2 allowsText Text = True allowsText (List _) = True allowsText (Value _ _ _) = True allowsText (Data _ _) = True allowsText (DataExcept _ _) = True allowsText _ = False
Constructors
Constructors are required for Choice, Group, Interleave, Concur, Partition, OneOrMore, ConcurOneOrMore, After, and All patterns. These are broadly the same as in James' RELAX NG algorithm, except that those that have two branches recognise the case where one of the branches of the patterns is an After pattern, and in this case use the equivalent of the applyAfter function.
choice :: Pattern -> Pattern -> Pattern choice p NotAllowed = p choice NotAllowed p = p choice Empty Empty = Empty choice p1 p2 = Choice p1 p2
group :: Pattern -> Pattern -> Pattern group p NotAllowed = NotAllowed group NotAllowed p = NotAllowed group p Empty = p group Empty p = p group (After p1 p2) p3 = after p1 (group p2 p3) group p1 (After p2 p3) = after p2 (group p1 p3) group p1 p2 = Group p1 p2
interleave :: Pattern -> Pattern -> Pattern interleave p NotAllowed = NotAllowed interleave NotAllowed p = NotAllowed interleave p Empty = p interleave Empty p = p interleave (After p1 p2) p3 = after p1 (interleave p2 p3) interleave p1 (After p2 p3) = after p2 (interleave p1 p3) interleave p1 p2 = Interleave p1 p2
concur :: Pattern -> Pattern -> Pattern concur p NotAllowed = NotAllowed concur NotAllowed p = NotAllowed concur p Empty = p concur Empty p = p concur (After p1 p2) (After p3 p4) = after (all p1 p3) (concur p2 p4) concur (After p1 p2) p3 = after p1 (concur p2 p3) concur p1 (After p2 p3) = after p2 (concur p1 p3) concur p1 p2 = Concur p1 p2
partition :: Pattern -> Pattern partition NotAllowed = NotAllowed partition Empty = Empty partition p = Partition p
oneOrMore :: Pattern -> Pattern oneOrMore NotAllowed = NotAllowed oneOrMore Empty = Empty oneOrMore p = OneOrMore p
concurOneOrMore :: Pattern -> Pattern concurOneOrMore NotAllowed = NotAllowed concurOneOrMore Empty = Empty concurOneOrMore p = ConcurOneOrMore p
after :: Pattern -> Pattern -> Pattern after p NotAllowed = NotAllowed after NotAllowed p = NotAllowed after Empty p = p after (After p1 p2) p3 = after p1 (after p2 p3) after p1 p2 = After p1 p2
all :: Pattern -> Pattern -> Pattern all p NotAllowed = NotAllowed all NotAllowed p = NotAllowed all p Empty = if nullable p then Empty else NotAllowed all Empty p = if nullable p then Empty else NotAllowed all (After p1 p2) (After p3 p4) = after (all p1 p3) (all p2 p4) all p1 p2 = All p1 p2
Derivatives
A document is a sequence of events. The derivative for a sequence of events against a pattern is the derivative of the remaining events against the derivative of the first event.
eventsDeriv :: Pattern -> [Event] -> Pattern eventsDeriv p [] = p eventsDeriv p (h:t) = eventsDeriv (eventDeriv p h) t
The derivative for an event depends on the kind of event. Whitespace-only text nodes can be ignored if the pattern doesn't allow text.
eventDeriv :: Pattern -> Event -> Pattern eventDeriv p (TextEvent s cx) = if whitespace s && not allowsText p then p else (textDeriv cx p s) eventDeriv p (StartTagEvent qn id) = startTagDeriv p qn id eventDeriv p (EndTagEvent qn id) = endTagDeriv p qn id eventDeriv p (StartAnnotationEvent qn) = startAnnotationDeriv p qn eventDeriv p (EndAnnotationEvent qn) = endAnnotationDeriv p qn eventDeriv p (StartAtomEvent qn) = startAtomDeriv p qn eventDeriv p (EndAtomEvent qn) = endAtomDeriv p qn
Text Derivatives
textDeriv computes the derivative of a pattern with respect to a text event.
textDeriv :: Context -> Pattern -> String -> Pattern
For Concur, text is only allowed if it is allowed by both of the subpatterns.
textDeriv cx (Concur p1 p2) s =
concur (textDeriv cx p1 s)
(textDeriv cx p2 s)
For ConcurOneOrMore, we partially expand the ConcurOneOrMore into an Concur.
textDeriv cx (ConcurOneOrMore p) s =
concur (textDeriv cx p s)
(choice (ConcurOneOrMore p) anyContent)
The anyContent function returns a pattern that matches any number of atoms (with any name and with any number of annotations with any content and ranges) with interleaved text. It is equivalent to the anyContent pattern in the schema
<define name="anyContent">
<mixed>
<zeroOrMore>
<ref name="anyAtom" />
</zeroOrMore>
</mixed>
</define>
<define name="anyAtom">
<atom>
<anyName />
<ref name="anyAnnotations" />
</atom>
</define>
<define name="anyAnnotations">
<zeroOrMore>
<ref name="anyAnnotation" />
</zeroOrMore>
</define>
<define name="anyAnnotation">
<annotation>
<anyName />
<interleave>
<ref name="anyAnnotations" />
<concurZeroOrMore>
<interleave>
<ref name="anyContent" />
<optional>
<ref name="anyRange" />
</optional>
</interleave>
</concurZeroOrMore>
</interleave>
</annotation>
</define>
<define name="anyRange">
<range>
<anyName />
<ref name="anyAnnotations" />
<ref name="anyContent" />
</range>
</define>
For Partition, we create a After pattern that contains the derivative.
textDeriv cx (Partition p) s = after (textDeriv cx p s) Empty
A text event does not match any of the other new patterns; they are caught by the generic
textDeriv _ _ _ = NotAllowed
Start Tags
Start tags are handled in a very generic way by all the patterns, except the Range pattern, whose derivative is a group of the content pattern for the range followed by an end tag for the range.
startTagDeriv :: Pattern -> QName -> Id -> Pattern startTagDeriv (Range nc p) qn id = if contains nc qn then group p (EndTag qn id) else NotAllowed
startTagDeriv (Choice p1 p2) qn id =
choice (startTagDeriv p1 qn id)
(startTagDeriv p2 qn id)
startTagDeriv (Group p1 p2) qn id =
if onlyAnnotations p1 then
group p1 (startTagDeriv p2 qn id)
else
let p = group (startTagDeriv p1 qn id) p2
in if nullable p1 then choice p (startTagDeriv p2 qn id) else p
startTagDeriv (Interleave p1 p2) qn id =
choice (interleave (startTagDeriv p1 qn id) p2)
(interleave p1 (startTagDeriv p2 qn id))
startTagDeriv (Concur p1 p2) qn id =
choice (concur (startTagDeriv p1 qn id) p2)
(concur p1 (startTagDeriv p2 qn id))
startTagDeriv (Partition p) qn id =
after (startTagDeriv p qn id) Empty
startTagDeriv (OneOrMore p) qn id =
group (startTagDeriv p qn id)
(choice (OneOrMore p) Empty)
startTagDeriv (ConcurOneOrMore p) qn id =
concur (startTagDeriv p qn id)
(choice (ConcurOneOrMore p) anyContent)
startTagDeriv (After p1 p2) qn id =
after (startTagDeriv p1 qn id) p2
startTagDeriv _ _ _ = NotAllowed
End Tags
End tags are matched by EngTag events. An id is used to support self-overlap: the names have to match and so do the ids.
endTagDeriv :: Pattern -> QName -> Id -> Pattern
endTagDeriv (EndTag (QName ns1 ln1) id1) (QName ns2 ln2) id2 =
if id1 == id2 || (id1 == && id2 == && ns1 == ns2 && ln1 == ln2)
then Empty
else NotAllowed
endTagDeriv (Choice p1 p2) qn id =
choice (endTagDeriv p1 qn id)
(endTagDeriv p2 qn id)
endTagDeriv (Group p1 p2) qn id =
if onlyAnnotations p1 then
group p1 (endTagDeriv p1 qn id)
else
let p = group (endTagDeriv p1 qn id) p2
if nullable p1 then choice p (endTagDeriv p2 qn id) else p
endTagDeriv (Interleave p1 p2) qn id =
choice (interleave (endTagDeriv p1 qn id) p2)
(interleave p1 (endTagDeriv p2 qn id))
endTagDeriv (Concur p1 p2) qn id =
choice (concur (endTagDeriv p1 qn id) p2)
(concur p1 (endTagDeriv p2 qn id))
endTagDeriv (Partition p) qn id =
after (endTagDeriv p qn id) Empty
endTagDeriv (OneOrMore p) qn id =
group (endTagDeriv p qn id)
(choice (OneOrMore p) Empty)
endTagDeriv (ConcurOneOrMore p) qn id =
concur (endTagDeriv p qn id)
(choice (ConcurOneOrMore p) anyContent)
endTagDeriv (After p1 p2) qn id =
after (endTagDeriv p1 qn id) p2
endTagDeriv _ _ _ = NotAllowed
Start Annotations
Computing the derivative for a StartAnnotationEvent is similar to computing that of a StartTagEvent, with the exceptions that the derivative of an Annotation pattern creates a After pattern (not a Group pattern), because annotations can't overlap, plus the derivative of a Group pattern is computed differently to allow annotations to be declared anywhere within a pattern.
startAnnotationDeriv :: Pattern -> QName -> Pattern
startAnnotationDeriv (Annotation nc p) qn = if contains nc qn then after (group p (EndAnnotation qn)) Empty else NotAllowed
The derivative of a Group pattern depends on whether the first branch allows annotations or not. If it does, the result is the normal one; if it doesn't, we get the derivative of the second branch and use that.
startAnnotationDeriv (Group p1 p2) qn =
if allowsAnnotations p1 then
let p = group (startAnnotationDeriv p1 qn) p2
in if nullable p1 then choice p (startAnnotationDeriv p2 qn) else p
else
group p1 (startAnnotationDeriv p2 qn)
The rest is mostly the same as when dealing with a StartTag event.
startAnnotationDeriv (Choice p1 p2) qn =
choice (startAnnotationDeriv p1 qn)
(startAnnotationDeriv p2 qn)
startAnnotationDeriv (Interleave p1 p2) qn =
choice (interleave (startAnnotationDeriv p1 qn) p2)
(interleave p1 (startAnnotationDeriv p2 qn))
startAnnotationDeriv (Concur p1 p2) qn =
choice (concur (startAnnotationDeriv p1 qn) p2)
(concur p1 (startAnnotationDeriv p2 qn))
startAnnotationDeriv (Partition p) qn =
after (startAnnotationDeriv p qn) Empty
startAnnotationDeriv (OneOrMore p) qn =
group (startAnnotationDeriv p qn)
(choice (OneOrMore p) Empty)
startAnnotationDeriv (ConcurOneOrMore p) qn =
concur (startAnnotationDeriv p qn)
(choice (ConcurOneOrMore p) anyContent)
startAnnotationDeriv (After p1 p2) qn =
after (startAnnotationDeriv p1 qn) p2
- TODO: Check derivative of
ConcurOneOrMorehere.
The exception is that the derivatives of Range, Text and similar patterns is themselves, so they get carried through to deal with the next event.
startAnnotationDeriv NotAllowed _ = NotAllowed startAnnotationDeriv Empty _ = NotAllowed
startAnnotationDeriv p _ = p
End Annotations
Calculating the derivative from an EndAnnotationEvent is the same as doing so for a EndTagEvent, except we don't have to worry about matching ids.
endAnnotationDeriv :: Pattern -> QName -> Pattern
endAnnotationDeriv (EndAnnotation (QName ns1 ln1)) (QName ns2 ln2) =
if ns1 == ns2 && ln1 == ln2 then Empty else NotAllowed
endAnnotationDeriv (Choice p1 p2) qn =
choice (endAnnotationDeriv p1 qn)
(endAnnotationDeriv p2 qn)
endAnnotationDeriv (Group p1 p2) qn =
if onlyAnnotations p1 then
group p1 (endAnnotationDeriv p1 qn)
else
let p = group (endAnnotationDeriv p1 qn) p2
if nullable p1 then choice p (endAnnotationDeriv p2 qn) else p
endAnnotationDeriv (Interleave p1 p2) qn =
choice (interleave (endAnnotationDeriv p1 qn) p2)
(interleave p1 (endAnnotationDeriv p2 qn))
endAnnotationDeriv (Concur p1 p2) qn =
choice (concur (endAnnotationDeriv p1 qn) p2)
(concur p1 (endAnnotationDeriv p2 qn))
endAnnotationDeriv (Partition p) qn =
after (endAnnotationDeriv p qn) Empty
endAnnotationDeriv (OneOrMore p) qn =
group (endAnnotationDeriv p qn)
(choice (OneOrMore p) Empty)
endAnnotationDeriv (ConcurOneOrMore p) qn =
concur (endAnnotationDeriv p qn)
(choice (ConcurOneOrMore p) anyContent)
endTagDeriv (After p1 p2) qn =
after (endAnnotationDeriv p1 qn) p2
endTagDeriv _ _ _ = NotAllowed
Start Atoms
The StartAtomEvent event is treated similarly to the StartTagEvent, except that since atoms can't overlap, the Atom pattern generates an After pattern rather than a Group.
startAtomDeriv :: Pattern -> QName -> Pattern
startAtomDeriv (Atom nc p) qn = if contains nc qn then after (group p (EndAtom qn)) Empty else NotAllowed
startAtomDeriv (Choice p1 p2) qn =
choice (startAtomDeriv p1 qn)
(startAtomDeriv p2 qn)
startAtomDeriv (Group p1 p2) qn =
if onlyAnnotations p1 then
group p1 (startAtomDeriv p2 qn)
else
let p = group (startAtomDeriv p1 qn) p2
in if nullable p1 then choice p (startAtomDeriv p2 qn) else p
startAtomDeriv (Interleave p1 p2) qn =
choice (interleave (startAtomDeriv p1 qn) p2)
(interleave p1 (startAtomDeriv p2 qn))
startAtomDeriv (Concur p1 p2) qn =
concur (startAtomDeriv p1 qn)
(startAtomDeriv p2 qn)
startAtomDeriv (Partition p) qn =
after (startAtomDeriv p qn) Empty
startAtomDeriv (OneOrMore p) qn =
group (startAtomDeriv p qn)
(choice (OneOrMore p) Empty)
startAtomDeriv (ConcurOneOrMore p) qn =
concur (startAtomDeriv p qn)
(choice (ConcurOneOrMore p) anyContent)
startAtomDeriv (After p1 p2) qn =
after (startAtomDeriv p1 qn) p2
startAtomDeriv _ _ _ = NotAllowed
End Atoms
The EndAtomEvent is again treated in a similar way to the EndTagEvent.
endAtomDeriv :: Pattern -> QName -> Pattern
endAtomDeriv (EndAtom (QName ns1 ln1)) (QName ns2 ln2) =
if ns1 == ns2 && ln1 == ln2 then Empty else NotAllowed
endAtomDeriv (Choice p1 p2) qn =
choice (endAtomDeriv p1 qn)
(endAtomDeriv p2 qn)
endAtomDeriv (Group p1 p2) qn =
if onlyAnnotations p1 then
group p1 (endAtomDeriv p1 qn)
else
let p = group (endAtomDeriv p1 qn) p2
if nullable p1 then choice p (endAtomDeriv p2 qn) else p
endAtomDeriv (Interleave p1 p2) qn =
choice (interleave (endAtomDeriv p1 qn) p2)
(interleave p1 (endAtomDeriv p2 qn))
endAtomDeriv (Concur p1 p2) qn =
choice (concur (endAtomDeriv p1 qn) p2)
(concur p1 (endAtomDeriv p2 qn))
endAtomDeriv (Partition p) qn =
after (endAtomDeriv p qn) Empty
endAtomDeriv (OneOrMore p) qn =
group (endAtomDeriv p qn)
(choice (OneOrMore p) Empty)
endAtomDeriv (ConcurOneOrMore p) qn =
concur (endAtomDeriv p qn)
(choice (ConcurOneOrMore p) anyContent)
endAtomDeriv (After p1 p2) qn =
after (endAtomDeriv p1 qn) p2
endAtomDeriv _ _ _ = NotAllowed
Optimizations
As when implementing RELAX NG, the following optimizations are useful.
- storing the result of
nullable pfor each pattern - storing the result of
allowsText pfor each pattern - storing the result of
allowsAnnotations pfor each pattern - storing the result of
onlyAnnotations pfor each pattern - interning patterns, so that identical patterns are only represented once in memory
- avoiding exponential blowup by eliminating duplicate choices
- memoization: keeping a record of the results of particular events with a given pattern
