Automatic Fragmentation
From LMNLWiki
This page is the Wiki-version of the poster "Automatic Fragmentation of Overlapping Structures" I presented at Digital Humanities 2007, University of Illinois, Urbana-Champaign. Section 5 is a new addition.
-- Bert Van Elsacker
Contents |
The script (in Perl)
See Fragmentation Script.
What’s the problem?
In real texts, we often encounter overlapping structures. In prose, for instance, a quotation may cross a paragraph boundary, in the Bible, the chapter/verse hierarchy is independent from the paragraph/sentence hierarchy, in annotated texts, one stretch of annotated text may overlap another one. The subject is an evergreen in markup theory. It has been discussed extensively on the leading conference in this field, Extreme Markup Languages. One suggestion to encode overlap is the use of a milestones. Instead of the illegal
<p>Dorothy said: <q>...</p> <p>...</q> and that was all.</p>
you could encode
<p>Dorothy said: <q sID=”q1”/>... </p> <p>...<q eID=”q1”/> and that was all.</p>
which is at least well-formed xml.
Another workaround is fragmenting one of the overlapping elements:
<p>Dorothy said: <q id=”q1.0” next=”q1.1”>...</q></p> <p><q id=”q1.1” prev=”q1.0”>...</q> and that was all.</p>
This poster shows a technique to automatically convert the first type of encoding into the second. Why would you want to do this? The milestone technique is easier for humans encoding a text or editing an XML-document in source code. In practice it can be quite challenging and time-consuming to keep track of fragments and write code which will return the original element when all fragments are joined. But the fragmented representation is easier to process further, as the text data is wrapped in the appropriate element. This has advantages for validation and display. The programming language most suited for processing XML documents, XSLT, is oriented towards hierarchically encoded and properly nested documents. It’s not straightforward to write templates which will extract, for instance, all text in between page breaks, if pages are encoded as empty delimiter elements. For projects which cannot rely on an experienced programmer, ready-to-use code may come in very handy. And in practice, there seem to be not so many programmers keen on programming XSLT.
Preliminary Observations
If you look at this, you see three segments: one covered by A, one covered by A and B, and one covered by B alone. There are two ways the structure can break in order to avoid overlap:
So in order to create a well-formed structure, the computer needs to know which element has priority. The current implementation of this algorithm sticks to a simple table of elements and priorities; if two elements have equal priority, the leftmost one will be given preference.
The extremity of the higher ranked element which falls in the content of the lower ranked element is the point where the latter should break. The lower ranked element falls apart in two segments, one of which is nested in the higher ranked element. Multiple overlapping elements may have an impact on each other, even if they have only an indirect connection.
In this example, three elements overlap. Assume A < C < B. If we consider only the overlap of A and C, it’s clear what should happen:
However, since C overlaps B, C should be split too:
As a result, A falls apart in four fragments: A3 and A4 fitting into C, A1 and A2 as the free standing part.
So even though A and B do not overlap directly, because of their mutual connection with C, B has an impact on A. Therefore overlapping elements cannot be processed in pairs; it’s necessary to look at the complete cluster -- and to identify the clusters in the first place.
Obviously, not all overlapping elements in a document are connected. In this example, clusters 1 and 2 may be processed independently.
The Algorithm
These simple observations are the basis of an algorithm to automatically fragment overlapping structures. We assume an element is represented by a pair of markers and that in each marker the element ID is present.
Identify overlap
First, go through the document and identify all elements involved in overlap. To do this, check if an element X has in its content one and not more than one marker belonging to element Y. Per element, keep a record of the ID’s of the overlapping elements.
Identify clusters of elements involved in overlap
Process the elements from left to right. Check if the element has already a cluster ID. If not, check if an overlapping element more to the right has a cluster ID. If so, add the current element to this cluster. If not, add the current element and all overlapping elements to a new cluster.
Process the clusters
Make a list of all elements in the cluster. Sort this list on element priority; if priorities are equal, give preference to the leftmost element.
Process all elements in the list. Firstly, each element looks at a per-cluster list of the boundaries of the preceding elements and checks if there are boundaries which fall in its content. If so, for each of such boundaries it adds its ID to a per-boundary list of elements which break at the boundary. Afterwards, it adds its own start and end points to the per-cluster list.
An illustration based on the previous example, which has this structure:
[1] [2] [3] [4] [5] [6] START START START END END END ID: 1 ID: 2 ID: 3 ID: 2 ID: 1 ID: 3 A B C B A C
There is one cluster which contains 3 elements with the ID’s 1 (A), 2 (B) and 3 (C). Ordered by priority: 2 3 1
Processing 2
ELEMENT BOUNDARIES CLUSTER LIST PER BOUNDARY LISTS
2, 4 empty none
- There are no boundaries in the per-cluster list.
- Adding the boundaries of 2 to the list.
- The list now contains markers 2 and 4.
Processing 3
ELEMENT BOUNDARIES CLUSTER LIST PER BOUNDARY LISTS
3, 6 2, 4 none
- The list contains markers 2 and 4.
- Marker 4 falls into the content of element 3, therefore id 3 is added to the per-boundary list of marker 4.
- Adding the boundaries of 3 to the per-cluster list.
- The list now contains the markers 2, 3, 4 and 6.
Processing 1
ELEMENT BOUNDARIES CLUSTER LIST PER BOUNDARY LISTS
1, 5 2, 3, 4, 6 4 => ID3
- The list contains markers 2, 3 and 4. They all fall in the content of 1, so ID 1 is added to the per-boundary list of these three markers.
- Adding the boundaries of 1 to the per-cluster list.
- The list now contains the markers 1, 2, 3, 4, 5 and 6.
End state
ELEMENT BOUNDARIES CLUSTER LIST PER BOUNDARY LISTS
-- 1, 2, 3, 4, 5, 6 4 => ID3, ID1
2 => ID1
3 => ID1
Output the results
To create a well-formed structure, output per marker
- end tags for the elements split at this boundary
- the marker itself
- start tags for the elements split at this boundary
It’s trivial to add an attribute in order to distinguish original and added tags.
<a xid="1" orig="y"> </a><b xid="2" orig="y"><a xid="1">
[1] [2]
</a><c xid="3" orig="y"><a xid="1"> </a></c></b><c xid="3"><a xid="1"> </a> </c>
[3] [4] [5] [6]
Implementation
Source code of a basic implementation in Perl is available here.
Discussion
See for some reactions to the poster http://www.jenitennison.com/blog/node/27.
Limitations
Using this algorithm could be frustrating because it will interprete the markup contsructs literally and doesn't take into account what they mean. So if you have something like:
<title>..<x/>..</title><note>...</note><para>...<x/>..</para>
(x-elements delimiters for overlapping element)
you'll get output like this:
<title> <x> </x> </title> <x> <note> </note> </x> <para> <x> </x> </para>
...the note will be wrapped in a fragment of x -- which may or may not be what you expect. For example, when encoding variants, <rdg>-elements ("reading") contain different versions of the same text. In this case, the standard result is undesirable:
<p>
.....
<app>
<rdg version="A">
<x>
</x>
</rdg>
<x>
<rdg version="B">
</rdg>
</x>
</app>
<x>
.....
</x>
.....
</p>
(where <app> is a container for <rdg>'s)
In this case, <rdg version="B"> should be left out of the fragment. We would like the script to understand A and B are alternative versions, so an edit in version A should not have an effect on B.
This kind of relation cannot be expressed in a DTD or schema. The DTD could state an app contains several rdg's, but there is in this respect no difference between a sequence of <rdg>'s and a sequence of paragraphs or list items.
In this way, the script makes obvious the need for a framework which makes it possible to express a broader and richer set of semantic relations. If such a grammar were available, a preprocessor could scan the document and use information about semantic relations to automatically adjust the markup (for example, removing the second <rdg> from the marked selection by splitting the range in two discontinuous sections). This kind of relations also apply to more straightforward issues, like "is a note part of a paragraph in the same way as a paragraph is part of a section?". (See the article Meaning and interpretation of markup.)
