Talk:LMNL data model
From LMNLWiki
Limina
There's an implication that this data model supports ranges over ranges, but I don't think it does: is the intention that a Document has more than one Limen? That would imply more than one set of content.
- The idea (which I admit may not be completely thought out) is that the ranges of one Limen are the content of another Limen. --John Cowan 14:01, 10 September 2006 (EDT)
- It's clearer now how Limens work, but I still don't think the current model does what we want it to, because we need a method of having ranges over subsets of ranges (i.e. exclude particular ranges because they don't belong in this hierarchy). In old terminology, it currently supports the Tree Subset of LMNL. So either you need something like the original data model below, or you need to say that if Limen A is owned by Limen B then Limen A's content contains the ranges from Limen B. You need to either say that the ranges in Limen A's content are in the same relative order as they were in Limen B's ranges property (which is what was implied in the old model), or you need to explicitly say that they can be in a different order (which wasn't allowed before, but certain use cases suggest is useful). — Jeni 04:39, 15 September 2006 (EDT)
The original data model for LMNL had a Layer object with the properties
- owner - the Document or Annotation that this Layer belongs to
- base - the Layer over which the Ranges in this Layer range (null if this is an atomic layer)
- content - a sequence of Atoms or Ranges that are the content of this Layer; a Layer can only hold content of one type — if these are Atoms then it's termed an atomic layer, if these are Ranges then it's termed a range layer.
Document and Annotation then had the property
- layers - one or more Layers that hold the Atoms and Ranges for this Document/Annotation
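For anyone who wants to poke at it, here is a rough Python sketch of that original Layer model. The property names (owner, base, content, layers) come from the description above; everything else (the Atom and Range fields, half-open offsets, the kind/validate helpers) is an assumption made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(frozen=True)
class Atom:
    value: str  # hypothetical: one character of text


@dataclass(frozen=True)
class Range:
    name: str
    start: int  # hypothetical half-open offsets into the base Layer's content
    end: int


@dataclass(eq=False)
class Layer:
    owner: object = field(repr=False)  # the Document or Annotation that owns this Layer
    base: Optional[Layer] = None       # Layer being ranged over; None for an atomic layer
    content: List[Union[Atom, Range]] = field(default_factory=list)

    def kind(self) -> str:
        """Return 'atomic', 'range' or 'empty'; raise if the content mixes types."""
        kinds = {type(item) for item in self.content}
        if len(kinds) > 1:
            raise ValueError("a Layer can only hold content of one type")
        if not kinds:
            return "empty"
        return "atomic" if kinds == {Atom} else "range"

    def validate(self) -> None:
        """Check the base rule: atomic layers have no base, range layers need one."""
        kind = self.kind()
        if kind == "atomic" and self.base is not None:
            raise ValueError("an atomic layer has a null base")
        if kind == "range" and self.base is None:
            raise ValueError("a range layer needs a base Layer")


@dataclass(eq=False)
class Document:
    layers: List[Layer] = field(default_factory=list)  # one or more Layers


# Usage: one atomic layer of text, one range layer over it.
doc = Document()
text = Layer(owner=doc, content=[Atom(c) for c in "hyphen-ated"])
parts = Layer(owner=doc, base=text, content=[Range("wp", 0, 6), Range("wp", 7, 11)])
doc.layers.extend([text, parts])
for layer in doc.layers:
    layer.validate()
```

A range layer whose base is itself a range layer is how this model would express ranges over ranges.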
FWIW, I don't particularly like the term "limen", but can't think of anything else at the moment. — Jeni 10:08, 10 September 2006 (EDT)
- I didn't either at first (I introduced it because I couldn't think of anything better) but it's grown on me. Lewis and Short define limen as "threshold, lintel, sill"; in effect, the boundary zone incorporating both content and ranges over the content.
- Hmm, maybe we should change the name of the language to "Liminal Markup and aNnotation Language", in the GNU tradition? --John Cowan 14:01, 10 September 2006 (EDT)
- I didn't like "limen" at first but have also grown to like it (and the plural "limina").
- By the way, a selection function that operates on a limen to specify another limen with a single clean hierarchy is properly an "elimination" function (which literally means it throws something out of doors). ;->
- But I also think it's still a "layered" language. — Wendell 16:16, 16 July 2007 (BST)
Base URIs
(How) should we include base URIs in the data model? Documents probably need them, but what other objects do? Ranges? Annotations? Atoms? — Jeni 10:08, 10 September 2006 (EDT)
- I had originally placed base URI as a property of Limina, and then removed it tentatively. Currently there is no way to set the base URI within a document (or even at the start of a document), so it's academic. In general, though, I think Limen is the right level, because the value of an annotation can be in effect a whole document itself; so perhaps on the syntax side there should be some way to set the base URI for the whole document and for each annotation.
- The main reason I can see to have base URIs is to enable the correct interpretation of URIs in a document, so that you can have relative URIs that are resolved based on the location of the document. A Document can be given a base URI simply by virtue of being located at or retrieved from somewhere: the URI you use to get the document is the base URI for the Document. The only reason for changing the base URI within the document is if you've gotten part of the document from elsewhere and want the base URI to be preserved (in the general entity/XInclude tradition).
- I'd be happy to introduce a [lmnl:base] annotation to provide a base URI for the content of an Annotation or Atom. I'm less happy to allow something like that on a range: because they don't stack neatly, you'd end up with situations where it's not clear what the base URI should be:
[foo [lmnl:base}uri1{]}...[bar [lmnl:base}uri2{]}...{{img [src}image.gif{]}}...{foo]...{bar]
- (We could perhaps solve this by saying that the enclosing Range with the closest start is the one that supplies the base URI for a particular URI in content.)
- More troubling is where the URI in content has ranges over it with different base URIs:
...[foo [lmnl:base}uri1{]}some/directory/[bar [lmnl:base}uri2{]}file.xml{bar]{foo]...
- I guess that an application would use the URI from the range that covers the entirety of the URI that it's interpreting; so if you were processing the [foo] range then the URI "some/directory/file.xml" would be relative to uri1, but if you were processing the [bar] range then the URI "file.xml" would be relative to uri2.
- All pretty complicated. — Jeni 16:42, 10 September 2006 (EDT)
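To make the two suggested rules concrete, here is a small Python sketch. It assumes that a range "encloses" a URI only if it covers the URI's whole span, and that among enclosing ranges the one with the closest (latest) start wins. The BasedRange type, the half-open offsets and the example.org base URIs are all hypothetical stand-ins for uri1 and uri2; nothing here is part of the data model.

```python
from typing import NamedTuple, Sequence
from urllib.parse import urljoin


class BasedRange(NamedTuple):
    name: str
    start: int  # half-open offsets into the text (an assumption of this sketch)
    end: int
    base: str   # value of the hypothetical [lmnl:base] annotation


def base_for(span_start: int, span_end: int,
             ranges: Sequence[BasedRange], document_base: str) -> str:
    """Pick the base URI for a URI occupying [span_start, span_end)."""
    covering = [r for r in ranges
                if r.start <= span_start and span_end <= r.end]
    if not covering:
        return document_base
    # Closest start wins; ties between ranges with equal starts are not
    # resolved by the discussion, so max() just keeps the first one found.
    return max(covering, key=lambda r: r.start).base


def resolve(uri: str, span_start: int, span_end: int,
            ranges: Sequence[BasedRange], document_base: str) -> str:
    return urljoin(base_for(span_start, span_end, ranges, document_base), uri)


# The second example above: "some/directory/" is 15 characters, "file.xml" is 8.
ranges = [
    BasedRange("foo", 0, 23, "http://example.org/one/"),  # stands in for uri1
    BasedRange("bar", 15, 23, "http://example.org/two/"),  # stands in for uri2
]
doc_base = "http://example.org/doc/"

whole = resolve("some/directory/file.xml", 0, 23, ranges, doc_base)
tail = resolve("file.xml", 15, 23, ranges, doc_base)
```

Under these assumptions the full path resolves against the [foo] base and the bare file name against the [bar] base, which matches the reading above. The untagged transclusion case falls through to the document base.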
- Here's another problematic example: lmnl1.lmnl contains only the string "read more at readmore.html"; lmnl2.lmnl contains only " and more at stillmore.html". They are transcluded into a document. Unless the transclusion mechanism marks these strings with ranges, there's no way to assign Base URIs, is there?
- I wonder whether Base URIs don't belong in the application, as an aspect of a definition of on-the-fly document assembly or transclusion (which would also have to define whether overlap examples such as the ones above can even happen, and what should happen if they do), where Base URIs really matter. This leads me to think that although the model may need them, it should perhaps not yet try to tackle them at levels lower than the document. We don't have the equivalent of external parsed entities, do we? — Wendell 21:52, 13 September 2006 (EDT)
Ranges and Limen
Some thoughts about the ranges in the data model, from Jeni 11:26, 3 November 2006 (EST)
Partial Order
Ranges over the same sequence of content are partially ordered: you can define an order for them based on their start and end, but there's no way to order two Ranges with the same start and end. (Well, we could define an arbitrary order based on their names and annotations, but I'd rather not.)
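A minimal Python sketch of such a partial order, for illustration only. The exact rule is not spelled out above, so this assumes "earlier start first, and for equal starts the later end first"; the Range fields are hypothetical. The point is only that two Ranges with the same start and end come back as unordered.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    name: str
    start: int
    end: int


def compare(a: Range, b: Range) -> Optional[int]:
    """Return -1 if a precedes b, 1 if a follows b, None if they are unordered."""
    key_a = (a.start, -a.end)  # assumed rule: earlier start, then longer range
    key_b = (b.start, -b.end)
    if key_a == key_b:
        return None  # same start and end: the model gives no way to order them
    return -1 if key_a < key_b else 1


outer = Range("foo", 0, 10)
inner = Range("bar", 2, 5)
twin = Range("baz", 0, 10)

ordered = compare(outer, inner)    # outer starts first
unordered = compare(outer, twin)   # identical extent, so no order
```

Names are deliberately left out of the comparison, in line with the preference above not to order by names and annotations.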
Sequences or Sets?
If Ranges within a document are held in a sequence, as in the current data model, then it implies that two documents that contain different sequences of Ranges are different, and therefore that (in a serialization) tag order does matter. The two serializations
[foo}[bar}...{bar]{foo]
and
[bar}[foo}...{foo]{bar]
would produce different sequences (one with [foo] first, one with [bar] first) and therefore represent different documents.
Having Ranges held in a sequence also raises the possibility of the same Range appearing twice in a given sequence, or Ranges appearing in an order that doesn't reflect their natural order (i.e. a Range that starts before another Range might appear later in the sequence). We only avoid this by having extra wording in the data model that forbids it.
If Ranges within a document were held in a set, then it would imply that two documents that contain the same set of Ranges are equal, and therefore that (in a serialization) tag order doesn't matter, and the two examples above would create the same data model.
I propose that we hold Ranges in a set in the simple, flat model.
- I completely accept this, and will change the main page to agree with it.
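The difference between the two choices can be seen in a few lines of Python. This is only a sketch: the Range tuple and its offsets are hypothetical, and Ranges are compared here by value, which is itself an assumption about identity that the model would need to settle.

```python
from typing import NamedTuple


class Range(NamedTuple):
    name: str
    start: int
    end: int


foo = Range("foo", 0, 3)
bar = Range("bar", 0, 3)

# [foo}[bar}...{bar]{foo]  versus  [bar}[foo}...{foo]{bar]
as_sequence_1 = (foo, bar)
as_sequence_2 = (bar, foo)
as_set_1 = frozenset([foo, bar])
as_set_2 = frozenset([bar, foo])

sequences_equal = as_sequence_1 == as_sequence_2  # tag order matters
sets_equal = as_set_1 == as_set_2                 # tag order does not matter

# A set also rules out the same Range appearing twice, with no extra wording.
no_duplicates = len(frozenset([foo, foo])) == 1
```

With sequences the two serializations give different documents; with sets they give the same one.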
Ranges Over Ranges
The reason we originally had Ranges held in a sequence was so we could define other Ranges over those Ranges. There are three situations where you might want to do this.
- I think that all three of these can be handled by saying that when we create a Limen from another Limen, we provide a selection function to pick out which ranges wind up as content in the new Limen, and an ordering function to say how they get ordered. I have modified the article to say this, and will now modify LOM to add these. At the moment I am opposed to adding syntax for this; let's just keep it at the application level. --John Cowan 16:58, 10 November 2006 (EST)
- I agree with John here. Saying a limen can be constructed from another limen supports this sort of thing fairly nicely, I think, while not getting us into particular elaborations. I also agree that until we've done more of it for "real", discussing syntax may be premature. It might be better to work out first how the selection and ordering functions work.
- Nevertheless the fact that these sorts of things shouldn't be hard to do is reassuring! — Wendell 16:22, 16 July 2007 (BST)
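Here is one way the selection and ordering functions might look at the application level, sketched in Python. The names derive_limen, select and order_key are hypothetical, as are the Limen and Range shapes and the half-open offsets; this is not the LOM API, just an illustration of the idea using the word-part case.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Range:
    name: str
    start: int  # half-open offsets into the owning Limen's content
    end: int


@dataclass(frozen=True)
class Limen:
    content: Tuple[Any, ...]                 # atoms (characters here) or Ranges
    ranges: FrozenSet[Range] = frozenset()   # Ranges over that content, held as a set


def derive_limen(source: Limen,
                 select: Callable[[Range], bool],
                 order_key: Callable[[Range], Any]) -> Limen:
    """Build a new Limen whose content is the selected Ranges of `source`."""
    chosen = sorted((r for r in source.ranges if select(r)), key=order_key)
    return Limen(content=tuple(chosen))


text = Limen(
    content=tuple("hyphen-ated"),
    ranges=frozenset({
        Range("line", 0, 7), Range("line", 7, 11),
        Range("wp", 0, 6), Range("wp", 7, 11),
    }),
)

# Selection: keep only word parts. Ordering: by start (assumed natural order).
parts = derive_limen(text, lambda r: r.name == "wp", lambda r: (r.start, -r.end))

# A [word] range can now range over the two [wp] ranges, skipping the hyphen.
words = Limen(content=parts.content, ranges=frozenset({Range("word", 0, 2)}))
```

Swapping in a different order_key is all it would take to allow content in an order other than start order.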
Interrupted Ranges
For example, in TexMECS, something like:
{line{... {word{hyphen}-word}-}line}
{line{{+word{ated}word} ...}line}
The {word} element here is suspended with the }-word} suspend tag and resumed with the {+word{ resume tag, so that the hyphen in the middle of the word "hyphenated" isn't included within it.
In LMNL, we might do this by introducing a [wp] range to mark up each word part, and then define the [word] range to range over the [wp] elements rather than over the text "hyphen-ated", with something like:
[!container word wp]
[line}... [word}[wp}hyphen{wp]-{line]
[line}[wp}ated{wp]{word] ...{line]
Here, the [word] range ranges over two [wp] ranges. The order of the [wp] ranges within the [word] range is the same as their start order.
Matrix Structures
For example, a 2x2 table will usually have its rows marked up like:
[table}
[row}[cell=r1c1}...{cell]
[cell=r1c2}...{cell]{row]
[row}[cell=r2c1}...{cell]
[cell=r2c2}...{cell]{row]
{table]
but logically, the cells r1c1 and r2c1 belong to a column and the cells r1c2 and r2c2 to another column.
Here, we want a [col] range that contains [cell=r1c1] and [cell=r2c1] but doesn't contain [cell=r1c2], despite the fact that [cell=r1c2] appears between [cell=r1c1] and [cell=r2c1] in the start order of those ranges.
Random Rearrangements
The classic example is:
{sp who="HUGHIE"{{p{How did that translation go?}p}
{lg type="haiku"{{l{da de dum de dum,}l}
{l@frog{gets a new frog,}l}
{l{...}l}}lg}}sp}
{sp who="LOUIS"{{p{Er ...}p}
{lg{{l@new{it's a new pond.}l}}lg}}sp}
{sp who="DEWEY"{{p{Ah ...}p}
{lg{{l@pond{When the old pond}l}}lg}
{p{Right. That's it.}p}}sp}
...
{lg{{^l^pond}{^l^frog}{^l^new}}lg}
Discussion
In each of these examples, a new sequence containing existing ranges is created. The ranges in this new sequence have ranges defined over them.
I propose that we change the data model so that Documents and Annotations have a number of Limen, one of which has Atoms in its content and the others of which have Ranges. Limen have no relationship with each other except that any Ranges in their content must be Ranges in the ranges of another Limen with the same owner.
We could place some additional constraints which build on each other:
- All the Ranges in the content of a Limen must have the same owner (I think this is reasonable, and it means that relationships could be defined between Limen.)
- The Ranges in the content of a Limen must be partially ordered according to their start and end; in other words, they must appear in the content of the Limen in the same order as they appeared in the ranges of their owner Limen. This prevents us from supporting the third example above, but means that an in-line syntax is at least theoretically feasible.
- The Ranges in the content of a Limen must be a subsequence from the ranges of their owner Limen. This prevents us from supporting the second example above, but also makes in-line syntax easier.
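The last two constraints can be checked mechanically. The Python sketch below is illustrative only: it reads "subsequence" as "contiguous run", which is an assumption made because that reading is what excludes the matrix example, and it leaves open which ranges count as the owner's list (here the caller passes just the relevant ones, in start order). The function names and the string labels standing in for Ranges are hypothetical.

```python
from typing import List, Sequence


def positions_in_owner(content: Sequence[str], owner_ranges: Sequence[str]) -> List[int]:
    """Index of each content Range in the owner's list; raises ValueError if absent."""
    return [owner_ranges.index(r) for r in content]


def keeps_owner_order(content: Sequence[str], owner_ranges: Sequence[str]) -> bool:
    """Second constraint: same relative order as in the owner Limen."""
    pos = positions_in_owner(content, owner_ranges)
    return all(a < b for a, b in zip(pos, pos[1:]))


def is_contiguous_run(content: Sequence[str], owner_ranges: Sequence[str]) -> bool:
    """Third constraint, read as: no owner Range is skipped between members."""
    pos = positions_in_owner(content, owner_ranges)
    return all(b == a + 1 for a, b in zip(pos, pos[1:]))


# Interrupted range: [word] over the two [wp] ranges.
word_parts = ["wp1", "wp2"]
word = ["wp1", "wp2"]

# Matrix: a column over cells that are not adjacent in start order.
cells = ["r1c1", "r1c2", "r2c1", "r2c2"]
column_1 = ["r1c1", "r2c1"]

# Rearrangement: the labelled lines in document order, then the haiku order.
lines = ["frog", "new", "pond"]
haiku = ["pond", "frog", "new"]

results = {
    "word": (keeps_owner_order(word, word_parts), is_contiguous_run(word, word_parts)),
    "column": (keeps_owner_order(column_1, cells), is_contiguous_run(column_1, cells)),
    "haiku": (keeps_owner_order(haiku, lines), is_contiguous_run(haiku, lines)),
}
```

Under these assumptions the word case passes both checks, the column keeps order but is not contiguous, and the haiku fails the ordering check, which lines up with the trade-offs listed above.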
I don't have the dramatic aversion that John has to out-of-line markup but I agree that, where it's possible, in-line markup is a lot easier to maintain.
Fundamentally, we need to decide whether we want to support Ranges-over-Ranges in the data model at all. We could argue that Ranges-over-Ranges should be handled by another processing layer. Most other overlapping-markup solutions have data model and syntax support for the examples above, but we could argue that simplicity is more important, especially if we have a good story about what the other processing layer looks like.
