From lou@ermine.ox.ac.uk Fri Sep 21 15:44:23 2001
Date: Thu, 13 Sep 2001 23:14:25 +0100 (GMT Daylight Time)
From: Lou Burnard <lou@ermine.ox.ac.uk>
Reply-To: Lou.Burnard@oucs.ox.ac.uk
To: Martin Wynne <martin.wynne@computing-services.oxford.ac.uk>
Cc: steven_derose@brown.edu
Subject: Re: TEI P4 chap23

Many thanks for your comments, which I've annotated below.

On Thu, 13 Sep 2001, Martin Wynne wrote:

>
> Chapter 23: Language Corpora
>
> 1. XML errors in examples
>
> p.502   <TEI.2> should be <tei.2> in XML
> p.502   example of overall structure of the corpus uses single quotes;
> should  be double.
> p.504   double quotes for 'textdesc' in Header Extensions example

Why do you think these are errors? Tags are case sensitive in XML, but
uppercase is permitted, and the current name for the root element is
TEI.2, not tei.2. However, there *are* several occurrences of "tei.2"
which should all be TEI.2, and will be in the next revision.

Single quotes or double quotes are equally valid, though it has been
suggested that we should standardize on one or the other to avoid
confusion.
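
Both points are easy to check with any XML parser. What follows is only a
minimal sketch, assuming Python's standard xml.etree.ElementTree module;
the id value "c1" and the helper name is_well_formed are made up for
illustration and appear nowhere in the chapter.

```python
import xml.etree.ElementTree as ET

# The same element, once with single-quoted and once with double-quoted
# attribute values. The id value is a hypothetical placeholder.
single = ET.fromstring("<TEI.2 id='c1'><text/></TEI.2>")
double = ET.fromstring('<TEI.2 id="c1"><text/></TEI.2>')

# The parser reports the same element name and attributes for both forms.
same = (single.tag, single.attrib) == (double.tag, double.attrib)


def is_well_formed(doc: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Names are case sensitive: a <TEI.2> start tag is not closed by </tei.2>.
mixed_case_ok = is_well_formed("<TEI.2><text/></tei.2>")

print(same, mixed_case_ok)
```

This checks well-formedness only; it says nothing about validity against
the TEI DTD.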

> p.505   Is this document type declaration the most up-to-date (shouldn't it
>         be P4?)
>

This is a matter of some debate.

> 2. Formatting errors
>
> p.506   Formatting error in paragraph in centre of page ("Schemes
> similar..."):
>         should be a footnote?

This is a tagging error, to be fixed.

>
> 3. More substantive points
>
> 3.1 Groups
>
> It's not clear how to use the group elements.  The existence of the
> group element allows two a new method for encoding metadata.

eh?

> Information on text classification can still be added in the
> teiHeader, but can now also instead (or in addition) be encoded in the
> group element attributes.

No, this is a misunderstanding, I think. See further below.

>
> For example, if you have a corpus of fiction texts, wherein some of
> the texts are first person narratives and some of them third person,
> and you want to encode this information, you can include this in the
> header somewhere, or group the texts according to this criterion.
>
> It seems to me that groups should only be used for assigning fixed and
> external classifications of texts (probably in electronic versions of
> print anthologies), not for ad hoc classifications in text corpora.

>
> There seem to be two problems with this:

> - groups are not a good mechanism for assigning classifications, so
> this potential use should be warned against in the guidelines;

> - it is not clear from chapter 23 (or chapter 7) whether the semantics
> of this classification can be encoded. Are attributes of the group tag
> possible, or some other sort of definition of the group identity?
> Otherwise they are just arbitrary and semantically uninterpretable.

>
> While flexibility is a good thing, and I think the option of groups is
> potentially very useful in corpora, I think the dangers of abuse and
> overuse should be flagged up more clearly.
>

Clearly, the possibility of using groups for this purpose needs a better
explanation. I'll try to explain it here, and think further about how to
improve the text. The key issue is that you can use <group> to combine a
number of <text> elements into a single (higher level) <text>, to an
arbitrary degree of complexity. The prototypical use for a <group> is
something like a printed anthology; the suggestion here is that you could
also use it for a corpus in which a number of texts are to be combined
into semantically meaningful groups. You suggest third vs first person
narratives, and that's a good example. Another might be a corpus of
ephemera in which you wanted to have one "text" of, say, small ads,
another of letters, another of pamphlets, etc. In such a case, it might be
more convenient to create the "collection of letters" as a group, assuming
that most of the metadata about its members were much the same.

There's no necessary implication that all the texts in a group should have
the same classification, however, any more than there is any necessary
implication that all the chapters of a single novel would necessarily get
the same classification. The mechanism by which you classify things
smaller than a text (and allow for multiple classifications of a single
text) is discussed in 23.3.2.
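
To make the nesting concrete, here is a rough sketch of the ephemera
case: one top-level <text> whose <group> holds three member texts, one of
which is itself a group of letters. All id values and the outline helper
are hypothetical, the header is left empty for brevity, and the fragment
is only checked for well-formedness (with Python's standard
xml.etree.ElementTree), not validated against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus skeleton; a real teiHeader needs at least a fileDesc.
corpus = """
<TEI.2>
  <teiHeader/>
  <text>
    <group>
      <text id="ads"><body><p>Small ads ...</p></body></text>
      <text id="letters">
        <group>
          <text id="letter1"><body><p>First letter ...</p></body></text>
          <text id="letter2"><body><p>Second letter ...</p></body></text>
        </group>
      </text>
      <text id="pamphlets"><body><p>Pamphlets ...</p></body></text>
    </group>
  </text>
</TEI.2>
"""


def outline(text, depth=0):
    """Return (depth, id) pairs for a <text> and every <text> grouped in it."""
    rows = [(depth, text.get("id", "(top)"))]
    for group in text.findall("group"):
        for member in group.findall("text"):
            rows.extend(outline(member, depth + 1))
    return rows


root = ET.fromstring(corpus)
rows = outline(root.find("text"))
for depth, ident in rows:
    print("  " * depth + ident)
```

Nothing in this structure says what the members of a group have in
common; it only records that they have been combined.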



> 3.2 Text Structure, Linguistic Annotation and Bracketing Paradoxes
>

> As is well known, developers of linguistic corpora frequently run into
> problems with bracketing paradoxes when they wish to encode elements
> of linguistic annotation but they overlap with elements of text
> structure. I believe that this problem is not surprising given that
> the TEI allows (and even appears to encourage)  the attempt to put two
> conceptually distinct systems (textual markup and linguistic
> annotation) in a single tree. I believe that users of the TEI for
> developing corpora urgently need an accessible and usable system of
> remote or stand-off annotation.

We will have to have an argument some time about why you think textual
markup and linguistic annotation are conceptually distinct. You are right
in saying that combining the two sometimes leads to bracketing problems,
of course, and also that standoff annotation is a good solution to them.
You are probably also right in saying (or at least implying) that the TEI
methods for doing it are in need of improvement or explication. But I
don't think there's anything actually wrong or unusable in what's
presented here: probably some more specific discussion beyond the cross
references to other chapters might help, but I think the proper place for
that would be a tutorial on how to use TEI for corpora, not this chapter.
(Earlier drafts did in fact have more examples, but they were moved to
other chapters!)
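
As a toy illustration of the bracketing problem and the standoff way
round it: in the sketch below the text is left untouched, and both the
structural units and the linguistic annotation are recorded as
character-offset spans held outside it. The offsets, the Span type and
the crosses helper are all assumptions made for this example; they stand
in for, and are not, the TEI's own pointing mechanisms.

```python
from typing import NamedTuple


class Span(NamedTuple):
    layer: str
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


text = "One two three four"

# Two structural units (say, lines) and one phrase that straddles them.
structure = [Span("structure", "l", 0, 7), Span("structure", "l", 8, 18)]
annotation = [Span("annotation", "phrase", 4, 13)]


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap without either one containing the other."""
    return (a.start < b.start < a.end < b.end
            or b.start < a.start < b.end < a.end)


# Pairs that could not both be element boundaries in a single tree.
conflicts = [(s, a) for s in structure for a in annotation if crosses(s, a)]

for s, a in conflicts:
    print(repr(text[s.start:s.end]), "crosses", repr(text[a.start:a.end]))
```

Kept as separate span lists, the two layers coexist without difficulty;
the conflict arises only when both have to be expressed as nested
elements in one tree.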


>
> 3.3 Dublin Core
>

> There seems to be a lot of interest in use of the DC Metadata set as a
> sort of minimum level of metadata encoding for interchange.  OLAC
> proposals and the promised LinguistList search engine may be widely
> used. Perhaps some help in mapping the richer TEI system to the OLAC
> implementation of the Dublin Core would be useful. If the Linguistlist
> language resource search system really takes off, then this could be a
> way to promote the use of TEI.
>

Yes, I'm entirely in agreement here, and there is definitely scope for
updating (e.g.) the standalone-header chapter to discuss mapping to DCMI,
and indeed OLAC. The OAI protocol seems to be the best approach currently
available for implementing the mapping, and it would be useful to cite it
here, or somewhere. I am hoping that a workgroup will actually be set up
to address this.
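
To give an idea of what such a mapping might look like, here is a rough
sketch that pulls a few Dublin Core elements out of a TEI header. The
crosswalk table is purely an assumption for illustration, not an agreed
DCMI or OLAC mapping, the header content is invented, and nothing here
touches the OAI protocol itself. It uses Python's standard
xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Invented header content, for illustration only.
header = ET.fromstring("""
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A sample corpus</title>
      <author>A. N. Editor</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Archive</publisher>
      <date>2001</date>
    </publicationStmt>
    <sourceDesc><p>Hypothetical source.</p></sourceDesc>
  </fileDesc>
</teiHeader>
""")

# Hypothetical crosswalk: Dublin Core element name -> path in the header.
CROSSWALK = {
    "title": "fileDesc/titleStmt/title",
    "creator": "fileDesc/titleStmt/author",
    "publisher": "fileDesc/publicationStmt/publisher",
    "date": "fileDesc/publicationStmt/date",
}


def to_dublin_core(hdr):
    """Collect the text of each mapped header element under its DC name."""
    record = {}
    for dc_name, path in CROSSWALK.items():
        values = [el.text.strip() for el in hdr.findall(path) if el.text]
        if values:
            record[dc_name] = values
    return record


record = to_dublin_core(header)
for name, values in record.items():
    print(name, values)
```

A real mapping would have to decide what to do with the much richer parts
of the header that have no obvious Dublin Core counterpart.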


Thanks again for your comments.


Lou


> NB I'm not an expert on XML, so I can't guarantee that all the examples are
> valid.
> __
> Martin Wynne
> martin@ota.ahds.ac.uk
> Information Officer (Linguistics)
> Oxford Text Archive
> Arts and Humanities Data Service
>
> Tel: 01865 283299
> Fax: 01865 273275
>
> 13 Banbury Road
> Oxford
> OX2 6NN
>
>


