Algorithm for Creole Validation

From LMNLWiki

A Creole validator is provided with a sequence of events and a pattern. To validate the events against the pattern, the validator takes the first event and calculates a derivative: a new pattern that should be applied to the remainder of the events. It continues doing this until it runs out of events. At that point, if the remaining pattern is nullable (matches an empty sequence) then the validation has succeeded, otherwise it has failed and the document is invalid.

James Clark's algorithm for RELAX NG defines how to calculate a derivative for each combination of XML event and RELAX NG pattern, as well as some supporting functions. The algorithm for Creole is basically the same, but has to deal with overlapping content and a few more patterns. Only the modifications (shown in bold where applicable) are specified here.

Note: This algorithm does not (yet) handle the fact that in some languages, such as XML, attributes/annotations are essentially unordered, whereas in others, such as LMNL, they are ordered (though you can use interleave to specify that the order doesn't matter). One possibility is to introduce start-attribute and end-attribute events for unordered annotations.

Contents

Datatypes

The only datatype we have to add is an Id. Ids are required to make sure that the correct end tag matches a given start tag, in the case of self-overlapping ranges.

type Id = String

Events

Rather than a hierarchy of nodes, a document is considered as a stream of events. The important events to generate from a document are

data Event = StartTagEvent QName Id
             | EndTagEvent QName Id
             | StartAnnotationEvent QName
             | EndAnnotationEvent QName
             | TextEvent String Context
             | StartAtomEvent QName
             | EndAtomEvent QName

Patterns

A Pattern represents a pattern after simplification. Creole adds eight new patterns – Concur, ConcurOneOrMore, Partition, All, Atom, EndRange, EndAnnotation and EndAtom – and replaces Attribute with Annotation and Element with Range.

data Pattern = Empty
               | NotAllowed
               | Text
               | Choice Pattern Pattern
               | Interleave Pattern Pattern
               | Group Pattern Pattern
               | Concur Pattern Pattern
               | Partition Pattern
               | OneOrMore Pattern
               | ConcurOneOrMore Pattern
               | List Pattern
               | Data Datatype ParamList
               | DataExcept Datatype ParamList Pattern
               | Value Datatype String Context
               | Atom NameClass Pattern
               | Annotation NameClass Pattern
               | Range NameClass Pattern
               | EndRange QName Id
               | EndAnnotation QName
               | EndAtom QName
               | After Pattern Pattern
               | All Pattern Pattern

The EndRange, EndAnnotation, EndAtom and All patterns, like the After pattern, is an artifact of the algorithm and doesn't correspond to any element used in a Creole grammar.

Utility Functions

nullable tests whether a pattern matches the empty sequence.

nullable:: Pattern -> Bool

For our new patterns, nullable is computed as follows:

nullable (Concur p1 p2) = nullable p1 && nullable p2
nullable (All p1 p2) = nullable p1 && nullable p2
nullable (ConcurOneOrMore p) = nullable p
nullable (Partition p) = nullable p
nullable (Atom _ _) = False
nullable (Range _ _) = False
nullable (Annotation _ _) = False
nullable (EndAtom _) = False
nullable (EndRange _ _) = False
nullable (EndAnnotation _) = False

onlyAnnotations tests whether a pattern only contains Annotation patterns. This is used when calculating derivatives for Group patterns, since Annotation patterns can appear anywhere within a Group pattern.

onlyAnnotations:: Pattern -> Bool
onlyAnnotations (Group p1 p2) = onlyAnnotations p1 && onlyAnnotations p2
onlyAnnotations (Interleave p1 p2) = onlyAnnotations p1 && onlyAnnotations p2
onlyAnnotations (Concur p1 p2) = onlyAnnotations p1 && onlyAnnotations p2
onlyAnnotations (Choice p1 p2) = onlyAnnotations p1 && onlyAnnotations p2
onlyAnnotations (Partition p) = onlyAnnotations p
onlyAnnotations (OneOrMore p) = onlyAnnotations p
onlyAnnotations (ConcurOneOrMore p) = onlyAnnotations p
onlyAnnotations (After p1 p2) = onlyAnnotations p1 && onlyAnnotations p2
onlyAnnotations (All p1 p2) = onlyAnnotations p1 && onlyAnnotations p2
onlyAnnotations (Annotation _ _) = True
onlyAnnotations _ = False

allowsAnnotations returns true if the pattern allows annotations at all, not just when an annotation is next.

allowsAnnotations :: Pattern -> Bool
allowsAnnotations (Choice p1 p2) = allowsAnnotations p1 || allowsAnnotations p2
allowsAnnotations (Group p1 p2) = allowsAnnotations p1 || allowsAnnotations p2
allowsAnnotations (Interleave p1 p2) = allowsAnnotations p1 || allowsAnnotations p2
allowsAnnotations (Concur p1 p2) = allowsAnnotations p1 || allowsAnnotations p2
allowsAnnotations (Partition p) = allowsAnnotations p
allowsAnnotations (OneOrMore p) = allowsAnnotations p
allowsAnnotations (ConcurOneOrMore p) = allowsAnnotations p
allowsAnnotations (After p1 p2) = if nullable p1 then (allowsAnnotations p1 || allowsAnnotations p2) else allowsAnnotations p1
allowsAnnotations (All p1 p2) = allowsAnnotations p1 && allowsAnnotations p2
allowsAnnotations (Annotation _ _) = True
allowsAnnotations _ = False

allowsText tests whether text is allowed within a pattern. Whitespace-only text nodes are ignored if a pattern doesn't allow text.

allowsText:: Pattern -> Bool
allowsText (Group p1 p2) = 
  if onlyAnnotations p1 then allowsText p2
  else (if nullable p1 then (allowsText p1 || allowsText p2) else allowsText p1)
allowsText (Interleave p1 p2) = allowsText p1 || allowsText p2
allowsText (Concur p1 p2) = allowsText p1 && allowsText p2
allowsText (Choice p1 p2) = allowsText p1 || allowsText p2
allowsText (OneOrMore p) = allowsText p
allowsText (ConcurOneOrMore p) = allowsText p
allowsText (After p1 p2) = if nullable p1 then (allowsText p1 || allowsText p2) else allowsText p1
allowsText (All p1 p2) = allowsText p1 && allowsText p2
allowsText Text = True
allowsText (List _) = True
allowsText (Value _ _ _) = True
allowsText (Data _ _) = True
allowsText (DataExcept _ _) = True
allowsText _ = False

Constructors

Constructors are required for Choice, Group, Interleave, Concur, Partition, OneOrMore, ConcurOneOrMore, After, and All patterns. These are broadly the same as in James' RELAX NG algorithm, except that those that have two branches recognise the case where one of the branches of the patterns is an After pattern, and in this case use the equivalent of the applyAfter function.

choice :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice Empty Empty = Empty
choice p1 p2 = Choice p1 p2
group :: Pattern -> Pattern -> Pattern
group p NotAllowed = NotAllowed
group NotAllowed p = NotAllowed
group p Empty = p
group Empty p = p
group (After p1 p2) p3 = after p1 (group p2 p3)
group p1 (After p2 p3) = after p2 (group p1 p3)
group p1 p2 = Group p1 p2
interleave :: Pattern -> Pattern -> Pattern
interleave p NotAllowed = NotAllowed
interleave NotAllowed p = NotAllowed
interleave p Empty = p
interleave Empty p = p
interleave (After p1 p2) p3 = after p1 (interleave p2 p3)
interleave p1 (After p2 p3) = after p2 (interleave p1 p3)
interleave p1 p2 = Interleave p1 p2
concur :: Pattern -> Pattern -> Pattern
concur p NotAllowed = NotAllowed
concur NotAllowed p = NotAllowed
concur p Empty = p
concur Empty p = p
concur (After p1 p2) (After p3 p4) = after (all p1 p3) (concur p2 p4)
concur (After p1 p2) p3 = after p1 (concur p2 p3)
concur p1 (After p2 p3) = after p2 (concur p1 p3)
concur p1 p2 = Concur p1 p2
partition :: Pattern -> Pattern
partition NotAllowed = NotAllowed
partition Empty = Empty
partition p = Partition p
oneOrMore :: Pattern -> Pattern
oneOrMore NotAllowed = NotAllowed
oneOrMore Empty = Empty
oneOrMore p = OneOrMore p
concurOneOrMore :: Pattern -> Pattern
concurOneOrMore NotAllowed = NotAllowed
concurOneOrMore Empty = Empty
concurOneOrMore p = ConcurOneOrMore p
after :: Pattern -> Pattern -> Pattern
after p NotAllowed = NotAllowed
after NotAllowed p = NotAllowed
after Empty p = p
after (After p1 p2) p3 = after p1 (after p2 p3)
after p1 p2 = After p1 p2
all :: Pattern -> Pattern -> Pattern
all p NotAllowed = NotAllowed
all NotAllowed p = NotAllowed
all p Empty = if nullable p then Empty else NotAllowed
all Empty p = if nullable p then Empty else NotAllowed
all (After p1 p2) (After p3 p4) = after (all p1 p3) (all p2 p4)
all p1 p2 = All p1 p2

Derivatives

A document is a sequence of events. The derivative for a sequence of events against a pattern is the derivative of the remaining events against the derivative of the first event.

eventsDeriv :: Pattern -> [Event] -> Pattern
eventsDeriv p [] = p
eventsDeriv p (h:t) = eventsDeriv (eventDeriv p h) t

The derivative for an event depends on the kind of event. Whitespace-only text nodes can be ignored if the pattern doesn't allow text.

eventDeriv :: Pattern -> Event -> Pattern
eventDeriv p (TextEvent s cx) = 
  if whitespace s && not allowsText p then p else (textDeriv cx p s)
eventDeriv p (StartTagEvent qn id) = startTagDeriv p qn id
eventDeriv p (EndTagEvent qn id) = endTagDeriv p qn id
eventDeriv p (StartAnnotationEvent qn) = startAnnotationDeriv p qn
eventDeriv p (EndAnnotationEvent qn) = endAnnotationDeriv p qn
eventDeriv p (StartAtomEvent qn) = startAtomDeriv p qn
eventDeriv p (EndAtomEvent qn) = endAtomDeriv p qn

Text Derivatives

textDeriv computes the derivative of a pattern with respect to a text event.

textDeriv :: Context -> Pattern -> String -> Pattern

For Concur, text is only allowed if it is allowed by both of the subpatterns.

textDeriv cx (Concur p1 p2) s =
  concur (textDeriv cx p1 s)
         (textDeriv cx p2 s)

For ConcurOneOrMore, we partially expand the ConcurOneOrMore into an Concur.

textDeriv cx (ConcurOneOrMore p) s =
  concur (textDeriv cx p s) 
         (choice (ConcurOneOrMore p) anyContent)

The anyContent function returns a pattern that matches any number of atoms (with any name and with any number of annotations with any content and ranges) with interleaved text. It is equivalent to the anyContent pattern in the schema

<define name="anyContent">
  <mixed>
    <zeroOrMore>
      <ref name="anyAtom" />
    </zeroOrMore>
  </mixed>
</define>
<define name="anyAtom">
  <atom>
    <anyName />
    <ref name="anyAnnotations" />
  </atom>
</define>
<define name="anyAnnotations">
  <zeroOrMore>
    <ref name="anyAnnotation" />
  </zeroOrMore>
</define>
<define name="anyAnnotation">
  <annotation>
    <anyName />
    <interleave>
      <ref name="anyAnnotations" />
      <concurZeroOrMore>
        <interleave>
          <ref name="anyContent" />
          <optional>
            <ref name="anyRange" />
          </optional>
        </interleave>
      </concurZeroOrMore>
    </interleave>
  </annotation>
</define>
<define name="anyRange">
  <range>
    <anyName />
    <ref name="anyAnnotations" />
    <ref name="anyContent" />
  </range>
</define>

For Partition, we create a After pattern that contains the derivative.

textDeriv cx (Partition p) s =
  after (textDeriv cx p s) Empty

A text event does not match any of the other new patterns; they are caught by the generic

textDeriv _ _ _ = NotAllowed

Start Tags

Start tags are handled in a very generic way by all the patterns, except the Range pattern, whose derivative is a group of the content pattern for the range followed by an end tag for the range.

startTagDeriv :: Pattern -> QName -> Id -> Pattern

startTagDeriv (Range nc p) qn id =
  if contains nc qn then group p (EndTag qn id) else NotAllowed
startTagDeriv (Choice p1 p2) qn id =
  choice (startTagDeriv p1 qn id)
         (startTagDeriv p2 qn id)
startTagDeriv (Group p1 p2) qn id =
  if onlyAnnotations p1 then
    group p1 (startTagDeriv p2 qn id)
  else
    let p = group (startTagDeriv p1 qn id) p2
    in if nullable p1 then choice p (startTagDeriv p2 qn id) else p
startTagDeriv (Interleave p1 p2) qn id =
  choice (interleave (startTagDeriv p1 qn id) p2)
         (interleave p1 (startTagDeriv p2 qn id))
startTagDeriv (Concur p1 p2) qn id =
  choice (concur (startTagDeriv p1 qn id) p2)
         (concur p1 (startTagDeriv p2 qn id))
startTagDeriv (Partition p) qn id =
  after (startTagDeriv p qn id) Empty
startTagDeriv (OneOrMore p) qn id =
  group (startTagDeriv p qn id)
        (choice (OneOrMore p) Empty)              
startTagDeriv (ConcurOneOrMore p) qn id =
  concur (startTagDeriv p qn id)
         (choice (ConcurOneOrMore p) anyContent)              
startTagDeriv (After p1 p2) qn id =
  after (startTagDeriv p1 qn id) p2
startTagDeriv _ _ _ = NotAllowed

End Tags

End tags are matched by EngTag events. An id is used to support self-overlap: the names have to match and so do the ids.

endTagDeriv :: Pattern -> QName -> Id -> Pattern

endTagDeriv (EndTag (QName ns1 ln1) id1) (QName ns2 ln2) id2 =
 if id1 == id2 || (id1 ==  && id2 ==  && ns1 == ns2 && ln1 == ln2)
 then Empty
 else NotAllowed
 
endTagDeriv (Choice p1 p2) qn id =
  choice (endTagDeriv p1 qn id)
         (endTagDeriv p2 qn id)
endTagDeriv (Group p1 p2) qn id =
  if onlyAnnotations p1 then
    group p1 (endTagDeriv p1 qn id)
  else
    let p = group (endTagDeriv p1 qn id) p2
    if nullable p1 then choice p (endTagDeriv p2 qn id) else p
endTagDeriv (Interleave p1 p2) qn id =
  choice (interleave (endTagDeriv p1 qn id) p2)
         (interleave p1 (endTagDeriv p2 qn id))
endTagDeriv (Concur p1 p2) qn id =
  choice (concur (endTagDeriv p1 qn id) p2)
         (concur p1 (endTagDeriv p2 qn id))
endTagDeriv (Partition p) qn id =
  after (endTagDeriv p qn id) Empty
endTagDeriv (OneOrMore p) qn id =
  group (endTagDeriv p qn id)
        (choice (OneOrMore p) Empty)              
endTagDeriv (ConcurOneOrMore p) qn id =
  concur (endTagDeriv p qn id)
         (choice (ConcurOneOrMore p) anyContent)              
endTagDeriv (After p1 p2) qn id =
  after (endTagDeriv p1 qn id) p2
  
endTagDeriv _ _ _ = NotAllowed

Start Annotations

Computing the derivative for a StartAnnotationEvent is similar to computing that of a StartTagEvent, with the exceptions that the derivative of an Annotation pattern creates a After pattern (not a Group pattern), because annotations can't overlap, plus the derivative of a Group pattern is computed differently to allow annotations to be declared anywhere within a pattern.

startAnnotationDeriv :: Pattern -> QName -> Pattern
startAnnotationDeriv (Annotation nc p) qn =
  if contains nc qn then after (group p (EndAnnotation qn)) Empty else NotAllowed

The derivative of a Group pattern depends on whether the first branch allows annotations or not. If it does, the result is the normal one; if it doesn't, we get the derivative of the second branch and use that.

startAnnotationDeriv (Group p1 p2) qn =
  if allowsAnnotations p1 then 
    let p = group (startAnnotationDeriv p1 qn) p2
    in if nullable p1 then choice p (startAnnotationDeriv p2 qn) else p
  else 
    group p1 (startAnnotationDeriv p2 qn)

The rest is mostly the same as when dealing with a StartTag event.

startAnnotationDeriv (Choice p1 p2) qn =
  choice (startAnnotationDeriv p1 qn)
         (startAnnotationDeriv p2 qn)
startAnnotationDeriv (Interleave p1 p2) qn =
  choice (interleave (startAnnotationDeriv p1 qn) p2)
         (interleave p1 (startAnnotationDeriv p2 qn))
startAnnotationDeriv (Concur p1 p2) qn =
  choice (concur (startAnnotationDeriv p1 qn) p2)
         (concur p1 (startAnnotationDeriv p2 qn))
startAnnotationDeriv (Partition p) qn =
  after (startAnnotationDeriv p qn) Empty
startAnnotationDeriv (OneOrMore p) qn =
  group (startAnnotationDeriv p qn)
        (choice (OneOrMore p) Empty)              
startAnnotationDeriv (ConcurOneOrMore p) qn =
  concur (startAnnotationDeriv p qn)
         (choice (ConcurOneOrMore p) anyContent)              
startAnnotationDeriv (After p1 p2) qn =
  after (startAnnotationDeriv p1 qn) p2
TODO: Check derivative of ConcurOneOrMore here.

The exception is that the derivatives of Range, Text and similar patterns is themselves, so they get carried through to deal with the next event.

startAnnotationDeriv NotAllowed _ = NotAllowed
startAnnotationDeriv Empty _ = NotAllowed
startAnnotationDeriv p _ = p

End Annotations

Calculating the derivative from an EndAnnotationEvent is the same as doing so for a EndTagEvent, except we don't have to worry about matching ids.

endAnnotationDeriv :: Pattern -> QName -> Pattern

endAnnotationDeriv (EndAnnotation (QName ns1 ln1)) (QName ns2 ln2) =
 if ns1 == ns2 && ln1 == ln2 then Empty else NotAllowed
 
endAnnotationDeriv (Choice p1 p2) qn =
  choice (endAnnotationDeriv p1 qn)
         (endAnnotationDeriv p2 qn)
endAnnotationDeriv (Group p1 p2) qn =
  if onlyAnnotations p1 then
    group p1 (endAnnotationDeriv p1 qn)
  else
    let p = group (endAnnotationDeriv p1 qn) p2
    if nullable p1 then choice p (endAnnotationDeriv p2 qn) else p
endAnnotationDeriv (Interleave p1 p2) qn =
  choice (interleave (endAnnotationDeriv p1 qn) p2)
         (interleave p1 (endAnnotationDeriv p2 qn))
endAnnotationDeriv (Concur p1 p2) qn =
  choice (concur (endAnnotationDeriv p1 qn) p2)
         (concur p1 (endAnnotationDeriv p2 qn))
endAnnotationDeriv (Partition p) qn =
  after (endAnnotationDeriv p qn) Empty
endAnnotationDeriv (OneOrMore p) qn =
  group (endAnnotationDeriv p qn)
        (choice (OneOrMore p) Empty)              
endAnnotationDeriv (ConcurOneOrMore p) qn =
  concur (endAnnotationDeriv p qn)
         (choice (ConcurOneOrMore p) anyContent)              
endTagDeriv (After p1 p2) qn =
  after (endAnnotationDeriv p1 qn) p2
  
endTagDeriv _ _ _ = NotAllowed

Start Atoms

The StartAtomEvent event is treated similarly to the StartTagEvent, except that since atoms can't overlap, the Atom pattern generates an After pattern rather than a Group.

startAtomDeriv :: Pattern -> QName -> Pattern
startAtomDeriv (Atom nc p) qn =
  if contains nc qn then after (group p (EndAtom qn)) Empty else NotAllowed
startAtomDeriv (Choice p1 p2) qn =
  choice (startAtomDeriv p1 qn)
         (startAtomDeriv p2 qn)
startAtomDeriv (Group p1 p2) qn =
  if onlyAnnotations p1 then
    group p1 (startAtomDeriv p2 qn)
  else
    let p = group (startAtomDeriv p1 qn) p2
    in if nullable p1 then choice p (startAtomDeriv p2 qn) else p
startAtomDeriv (Interleave p1 p2) qn =
  choice (interleave (startAtomDeriv p1 qn) p2)
         (interleave p1 (startAtomDeriv p2 qn))
startAtomDeriv (Concur p1 p2) qn =
  concur (startAtomDeriv p1 qn)
         (startAtomDeriv p2 qn)
startAtomDeriv (Partition p) qn =
  after (startAtomDeriv p qn) Empty
startAtomDeriv (OneOrMore p) qn =
  group (startAtomDeriv p qn)
        (choice (OneOrMore p) Empty)              
startAtomDeriv (ConcurOneOrMore p) qn =
  concur (startAtomDeriv p qn)
         (choice (ConcurOneOrMore p) anyContent)              
startAtomDeriv (After p1 p2) qn =
  after (startAtomDeriv p1 qn) p2
startAtomDeriv _ _ _ = NotAllowed

End Atoms

The EndAtomEvent is again treated in a similar way to the EndTagEvent.

endAtomDeriv :: Pattern -> QName -> Pattern

endAtomDeriv (EndAtom (QName ns1 ln1)) (QName ns2 ln2) =
 if ns1 == ns2 && ln1 == ln2 then Empty else NotAllowed
 
endAtomDeriv (Choice p1 p2) qn =
  choice (endAtomDeriv p1 qn)
         (endAtomDeriv p2 qn)
endAtomDeriv (Group p1 p2) qn =
  if onlyAnnotations p1 then
    group p1 (endAtomDeriv p1 qn)
  else
    let p = group (endAtomDeriv p1 qn) p2
    if nullable p1 then choice p (endAtomDeriv p2 qn) else p
endAtomDeriv (Interleave p1 p2) qn =
  choice (interleave (endAtomDeriv p1 qn) p2)
         (interleave p1 (endAtomDeriv p2 qn))
endAtomDeriv (Concur p1 p2) qn =
  choice (concur (endAtomDeriv p1 qn) p2)
         (concur p1 (endAtomDeriv p2 qn))
endAtomDeriv (Partition p) qn =
  after (endAtomDeriv p qn) Empty
endAtomDeriv (OneOrMore p) qn =
  group (endAtomDeriv p qn)
        (choice (OneOrMore p) Empty)              
endAtomDeriv (ConcurOneOrMore p) qn =
  concur (endAtomDeriv p qn)
         (choice (ConcurOneOrMore p) anyContent)              
endAtomDeriv (After p1 p2) qn =
  after (endAtomDeriv p1 qn) p2
  
endAtomDeriv _ _ _ = NotAllowed

Optimizations

As when implementing RELAX NG, the following optimizations are useful.

  • storing the result of nullable p for each pattern
  • storing the result of allowsText p for each pattern
  • storing the result of allowsAnnotations p for each pattern
  • storing the result of onlyAnnotations p for each pattern
  • interning patterns, so that identical patterns are only represented once in memory
  • avoiding exponential blowup by eliminating duplicate choices
  • memoization: keeping a record of the results of particular events with a given pattern