From sb@unagi.cis.upenn.edu Fri Sep 21 15:55:41 2001
Date: Thu, 20 Sep 2001 17:48:54 EDT
From: Steven Bird <sb@unagi.cis.upenn.edu>
Reply-To: Steven Bird <sb@ldc.upenn.edu>
To: lou.burnard@computing-services.oxford.ac.uk
Subject: Re: P4 Review 


Dear Lou,

Here is my review of chapter 11.  As I mentioned in an earlier email, I'm
focussing on the conceptual issues rather than the typography.

Yours,
-Steven

----

TEI P3 Chapter 11. Transcriptions of Speech
Steven Bird, University of Pennsylvania

As this chapter makes clear, there is a wide variety of transcribed
spoken material, and a wide variety of researchers - from phonologists
to discourse analysts - who depend on such material.  In recent years,
both these kinds of diversity have broadened significantly, as has the
sheer quantity of material.  Over the same period, *reannotation* has
become commonplace; spoken texts collected for one purpose are found
to be useful for a completely different purpose, and completely new
layers of annotation are added.  For example, the Switchboard corpus
of conversational speech began with three basic levels:
conversation, speaker turn, and word. Various parts of it have since
been annotated for syntactic structure, for breath groups and
disfluencies, for speech act type, for phonetic segments, and for
sociolinguistic variables.

The situation for transcriptions of speech is particularly acute since
there are many types of entities and relations, on many scales, from
acoustic features spanning a hundredth of a second to narrative
structures spanning tens of minutes.  Moreover, there are many
alternative representations or construals of any given kind of
linguistic information.  Sometimes these alternatives are simply more
or less convenient for a certain purpose.  Thus a researcher who
thinks theoretically of phonological features organized into moras,
syllables and feet, will often find it convenient to use a phonemic
string as a representational approximation. In other cases, however,
different sorts of transcription or annotation reflect different
theories about the ontology of linguistic structure or the functional
categories of communication.  An additional complication is that
recordings of speech events are increasingly multichannel and
multimodal.  Whereas once a monophonic recording was the primary
record, now it is common to make multiple simultaneous audio, video
and/or physiological recordings of a linguistic interaction.  The
interaction may be transcribed in toto, or each speaker's contribution
may be transcribed in isolation, or just a few interesting fragments
may be transcribed in excruciating detail while the rest is left
untranscribed.  The temporal alignment of the transcription with the
recordings may be fine (phoneme-level), coarse (paragraph-level),
intermediate, absent, or some opportunistic mixture of these.

Many aspects of the TEI model for transcriptions of spoken language
are poorly suited to this new reality, while certain other aspects -
the ontology of speech events, tempi, and voice qualities, and key
concepts such as timelines and anchors - still apply.  The key change,
I believe, is that embedded markup must be replaced with "standoff"
markup.  Other formats for speech transcription that are structurally
similar to the TEI, such as the CHILDES CHAT format, are adapting in
just this way.  More recent models, such as Annotation Graphs (Penn)
and MATE (Edinburgh), are based on standoff markup.  The fundamental
idea is very simple and has been discovered multiple times
(e.g. "remote markup", Ide et al; "standoff markup", Thompson et al).
Each layer of annotation must establish identifiers to which other
layers, stored elsewhere, can make reference.  Multiple layers may be
present in the same file, or spread across multiple files on different
machines.  The point is simply that an annotation is anchored to some
linguistic material by naming it, not by being embedded in it.
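The mechanism can be sketched in a few lines of Python (a toy model for
exposition only; the layer and identifier names are invented and do not
come from any of the proposals cited above).  A part-of-speech layer
refers to words in a separately stored transcription layer purely by
identifier:

```python
# Layer 1: base transcription; it mints the identifiers.
words = {"w1": "have", "w2": "you", "w3": "read"}

# Layer 2: stored elsewhere; anchored to layer 1 by name alone.
pos = {"w1": "VB", "w2": "PRP", "w3": "VBN"}

# Query layer 2 without touching layer 1's representation ...
verb_ids = [i for i, tag in pos.items() if tag.startswith("VB")]

# ... and resolve back to the words only when needed.
verbs = [words[i] for i in verb_ids]
```

The identifiers are the entire interface between the layers; neither
layer needs to know how the other is represented or where it is stored.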

Consequences of standoff annotation are that the multiple layers of
annotation are not convolved with each other; one can query a layer
without needing to know about the representation (or even the
existence) of other layers; incompatible annotations of the same
material can co-exist; different specialists can be responsible for
the content of different layers; the degree of overlap between speaker
turns can be specified with as much precision as required without
complicating the internal structure of either turn; certain layers
(e.g. signal fidelity, background noise) can be set up as annotations
of the recording itself, rather than annotations of the transcription;
and so forth.

As an illustration of the difference between TEI embedded annotation
and standoff annotation, consider the following examples.  The first
is from 11.3.2:

<text>
     <body>
        <u id="u1" trans="smooth" who="jane">have you read Vanity
  <anchor synch="u2 k1" id="a1"/> Fair</u>
        <u id="u2" trans="smooth" who="stig">yes</u>
        <kinesic id="k1" who="lou" iterated="y" desc="nod"/>
     </body>
</text>

We can make the following observations about this example:

1) The coordination of speaker overlap is assumed to occur at a word
   boundary (see the position of the anchor tag), whereas no such
   coordination occurs in natural conversation.

2) A query for the phrase "Vanity Fair" would need to take account of
   the possibility that markup relating to an independent event (Stig's
   utterance) intervenes.

3) The overlap is specified asymmetrically; only Jane's contribution
   is marked for overlap, even though Stig's contribution is also
   overlapping.

4) The approach breaks down when more speakers are involved.  If both
   Stig and Lou cut in on Jane's utterance at different times, the
   result would be multiple indeterminacies as to whose utterances
   were marked as overlapping with whose.  This multiplicity of
   distinct yet equivalent and possibly redundant ways to encode the
   same information is highly undesirable.

5) Whitespace is required before the "F" of "Fair", suggesting that
   the presence or absence of optional markup may interfere with the
   tokenization of the sentence.

The representation of the same text as an annotation graph is given
below.  This is just one particular model of standoff annotation.  We
begin by setting up a timeline consisting of three signals (here, two
audio and one video).  Anchors are defined to serve as the start and
end points of the annotations.  (For further information, see
[http://arXiv.org/abs/cs/0010033, http://sf.net/projects/agtk/]).

<text>
  <Timeline id="T1">
    <Signal id="S1" mimeClass="audio" mimeType="wav" encoding="wav"
        unit="16kHz" xlink:href="jane.wav"/>         
    <Signal id="S2" mimeClass="audio" mimeType="wav" encoding="wav"
        unit="16kHz" xlink:href="stig.wav"/>         
    <Signal id="S3" mimeClass="video" mimeType="mpeg" encoding="mpeg2"
        unit="sec" xlink:href="lou.mpeg"/>
  </Timeline>

  <AG id="t1" timeline="T1">
    <Anchor id="A0" signals="S1" offset="10375" unit="16kHz"/>
    <Anchor id="A1" signals="S1" offset="21925" unit="16kHz"/>
    <Anchor id="A2" signals="S2" offset="19112" unit="16kHz"/>
    <Anchor id="A3" signals="S2" offset="24050" unit="16kHz"/>
    <Anchor id="A4" signals="S3" offset="23550" unit="16kHz"/>
    <Anchor id="A5" signals="S3" offset="35372" unit="16kHz"/>

    <Annotation id="Ann0" type="transcription" start="A0" end="A1">
      <Feature name="trans">have you read Vanity Fair</Feature>
      <Feature name="speaker">Jane</Feature>
    </Annotation>

    <Annotation id="Ann1" type="transcription" start="A2" end="A3">
      <Feature name="trans">yes</Feature>
      <Feature name="speaker">Stig</Feature>
    </Annotation>

    <Annotation id="Ann2" type="transcription" start="A4" end="A5">
      <Feature name="kinesic">nod</Feature>
      <Feature name="speaker">Lou</Feature>
    </Annotation>
  </AG>
</text>
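Because the layers are not convolved, a query over the transcription
layer needs nothing beyond that layer.  As a sketch, in Python with the
standard-library ElementTree parser (the embedded fragment is a trimmed
copy of the example above, with the timeline and anchors omitted so it
is self-contained):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annotation graph above; timeline, anchors and
# signal references omitted for brevity.
AG_XML = """
<AG id="t1" timeline="T1">
  <Annotation id="Ann0" type="transcription" start="A0" end="A1">
    <Feature name="trans">have you read Vanity Fair</Feature>
    <Feature name="speaker">Jane</Feature>
  </Annotation>
  <Annotation id="Ann1" type="transcription" start="A2" end="A3">
    <Feature name="trans">yes</Feature>
    <Feature name="speaker">Stig</Feature>
  </Annotation>
</AG>
"""

def features(ann):
    """Collect the name/value pairs of one Annotation's Features."""
    return {f.get("name"): f.text for f in ann.findall("Feature")}

root = ET.fromstring(AG_XML)
turns = [features(a) for a in root.findall("Annotation")]
# Each turn is recovered without reference to any other layer, and a
# phrase query over the "trans" feature never has to skip over markup
# belonging to an independent event.
```

Note that a search for "Vanity Fair" here is a plain substring match on
one feature value; no intervening markup has to be accounted for.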

Note that such annotations are normally created by software tools
which hide most of the structure from users, and which store the
information in a relational database.  The XML format is for long-term
storage and interchange.  In the above example, we could specify the
degree of overlap between "Fair" and "yes" in as much detail as
necessary (e.g. by breaking up Jane's utterance into a sequence of
word-level annotations).  Here's another version of this standoff
annotation in which no recorded signals are available, and where
word-level segmentation and overlap are specified:

<text>
  <Timeline id="T1"/>

  <AG id="t1" timeline="T1">
    <Anchor id="A0" offset="0.6" unit="sec"/>
    <Anchor id="A1" offset="1.0" unit="sec"/>
    <Anchor id="A2" offset="1.4" unit="sec"/>
    <Anchor id="A3" offset="1.2" unit="sec"/>
    <Anchor id="A4" offset="1.5" unit="sec"/>

    <Annotation id="Ann0" type="transcription" start="A0" end="A1">
      <Feature name="trans">Vanity</Feature>
    </Annotation>

    <Annotation id="Ann1" type="transcription" start="A1" end="A2">
      <Feature name="trans">Fair</Feature>
    </Annotation>

    <Annotation id="Ann2" type="transcription" start="A3" end="A4">
      <Feature name="trans">yes</Feature>
    </Annotation>
  </AG>
</text>
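Given those anchors, the degree of overlap falls out by simple
arithmetic on the offsets.  A Python sketch (toy code, not part of any
of the cited toolkits):

```python
# Anchor offsets in seconds, taken from the example above.
anchors = {"A0": 0.6, "A1": 1.0, "A2": 1.4, "A3": 1.2, "A4": 1.5}

# Each annotation is a (start_anchor, end_anchor, label) triple.
fair = ("A1", "A2", "Fair")
yes = ("A3", "A4", "yes")

def overlap(a, b):
    """Duration in seconds for which two annotations co-occur."""
    start = max(anchors[a[0]], anchors[b[0]])
    end = min(anchors[a[1]], anchors[b[1]])
    return max(0.0, end - start)

# "Fair" spans 1.0-1.4s and "yes" spans 1.2-1.5s, so they overlap for
# roughly 0.2 seconds; non-overlapping pairs simply return 0.0.
```

The overlap is symmetric between the two speakers' annotations, and
refining it further (say, to phoneme-level anchors) changes nothing in
the structure of either layer.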

----end----



