From lou.burnard@computing-services.oxford.ac.uk Thu Aug 30 12:24:37 2001
Date: Thu, 30 Aug 2001 00:08:24 +0100 (GMT Daylight Time)
From: Lou Burnard <lou.burnard@computing-services.oxford.ac.uk>
To: Daniel Pitti <dpitti@virginia.edu>
Cc: editors@tei-c.org
Subject: Re: P4 revision

Dear Daniel

Many thanks for your comments on ST and apologies for the delay in
responding to them.

On Wed, 22 Aug 2001, Daniel Pitti wrote:

> The minor critique.
>
> 3.2.1 second p
>
> "In general, exactly one base tag set must be selected for any
> TEI-conformant document. Errors will result if none, or more than one, is
> selected, because the same elements may be differently defined in different
> base tag sets."
>
> This conflicts with the example given in 2.9.2:
>
> <!DOCTYPE tei.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN">
> <tei.2>
> This is an instance of an unmodified TEI type document
> </tei.2>
>
> There is no base set selected in this example. Later in 3.3, this example
> is given:
>
> <!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN"
> "tei2.dtd" [
> <!ENTITY % TEI.prose 'INCLUDE' >
> ]>
>
> This is more accurate (and also gives a sysid for the DTD). The first
> example in fact gives rise to a useless DTD (or nearly useless).

Agreed. We'll fix the example in SG when that chapter gets revised.
(It's also wrong -- "tei.2" is not the name of the root element in a
TEI document!)

>
> The final clause "because ..." applies to "or more than one" but not to
> "none." I think "none" and "more than one" each need to be explained
> separately.
>
> I think the entire paragraph needs to begin something like this: At least
> one base must be invoked,  though for documents that mingle structurally
> dissimilar elements, two or more bases can be invoked. There are two
> different methods for invoking two or more bases: mixed base and general
> base. ...

I agree that the paragraph is a little too condensed, as are many in this
chapter. I propose to revise more or less as follows:

"For most documents, exactly one base tag set must be selected. If no base
tag set is selected, no elements will be available to encode the basic
components of the document. If more than one base tag set is selected,
errors will result because the same elements may be differently defined in
different base tag sets. The special-purpose mixed and general bases
discussed in section 3.4 may however be used to overcome this problem, for
documents which mingle structurally dissimilar elements."

[Note that I want to convey that you really don't need to mix bases in
most cases!]
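
As a rough illustration of the two cases (a sketch only: the two DOCTYPE
declarations are alternatives, not one file, and the mixed-base form is my
recollection of section 3.4, so it should be checked against the DTD
before being quoted anywhere):

```xml
<!-- Usual case: exactly one base tag set is selected -->
<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN"
  "tei2.dtd" [
<!ENTITY % TEI.prose 'INCLUDE' >
]>

<!-- Exceptional case (assumed form): two bases combined via the mixed base -->
<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN"
  "tei2.dtd" [
<!ENTITY % TEI.mixed 'INCLUDE' >
<!ENTITY % TEI.prose 'INCLUDE' >
<!ENTITY % TEI.verse 'INCLUDE' >
]>
```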

>
> 3.4 first p
>
> The paragraph is confusing. It seems to suggest that one only needs the core:
>
> "For more detailed tagging, the encoder MAY choose the prose base ..."
>
> I'd replace this sentence with something like the following:
>
> "To decide which base tag set is appropriate, one should determine the
> predominant nature of the text to be encoded. Is it exclusively or
> predominantly prose, verse, and so on? If the text is not predominately of
> one type, or if detailed encoding of several types is desired, then either
> the mixed base or general base approach to invoking more than one base
> should be employed."
>

I think the main problem with this section is that it confuses the issue
of multiple texts (corpora, groups, etc) with the issue of texts that
actually need different bases. I'm not sure how to go about rewriting it
though. Your suggestion is OK, but it does imply that one cannot deal with
a text that combines prose and verse without using the mixed or general
base. Which is not the case: you only need it if you are using the
specialist features provided by e.g. the verse base. You can encode a text
which is entirely verse using the prose base, provided the verse tags in
the core are adequate to your needs. Maybe using dictionaries and spoken
texts would be a better example.



> 3.4 fifth? p
>
> "This is the only exception to the general rule that no more than one base
> tag set ..."
>
> This and similar text in other places seems to suggest that one base is the
> rule, combining bases is somehow exceptional. I would do away with this
> approach and simply lay them out as two options, one, or two or more, with
> the decision contingent on the nature of the text(s) and the analytic
> objectives.

Well, yes, that's sort of where we differ. There IS a suggestion that
one base is the rule.

>
> 3.5 Global Attributes
>
> While mentioning the global attributes here is appropriate, the detailed
> discussion of their semantics and mechanics should be moved elsewhere. The
> detailed discussion is out of character with the rest of the discussion,
> which is about the structure of the DTD. In addition to disrupting the
> description of the structure, it is counterintuitive for readers to look in
> a chapter on structure for this discussion. I am not sure where it belongs.
> Section II might be the right place.

I remember some agonising about where to put this section. Your comment
suggests that we didn't get it right but, like you, we couldn't think
where else to put it. Where do you mean in "section II"?  The logical
place would be in chapter 6, which is already overweight. But then we
wouldn't have discussed the global attributes before they are used in
chapter 5. So should we put them into a chapter of their own between 4 and
5? That would be quite an upheaval...


>
> Later in this section, the following sentence appears:
>
> "The contents of the rend attribute are free text."
>
> I checked around to see if the expression "free text" appears with any
> frequency. It does not. While moving away from technical terms (like
> "CDATA") might be good in general, I think there are certain terms that
> ought to be presented in a glossary and then used throughout. #PCDATA and
> CDATA are two of them. They are so ubiquitous and basic that anyone working
> with SG/XML ought to know what they mean. And working around them using
> expressions like "free text" and "text" gives rise to ambiguities.

A glossary is certainly a good idea. Maybe we should include it in chapter
2, which would be an appropriate place to explain for example
the difference between CDATA and #PCDATA. I think we chose to use looser
phrases like "free text", or "simple text", because early readers were put
off by the technical terms, which were not so familiar then as now.
But what kind of ambiguity are you thinking of here?

>
> 3.6 first p
>
> "The main TEI DTD is always invoked by specifying the file tei2.dtd."
>
> Either add "See section 3.3 for an example" or, if you accept my view that
> repetition is the mother of education (and boredom), give an example,

OK. An example here is a good idea.

>
> Under 3.6, first p, 1. iii.
>
> "... for TEI generic identifiers ..."
>
> I'd revise to read
>
> "... for TEI generic identifiers (or element names) ..."
>
> Contrary to what I said above, here I think the technical term is less than
> well-known (and with much less need to be). A lot of folks who work with
> SG/XML all the time will pause when you say "generic identifier."
>

Interestingly enough, I got a comment from another P4 reviewer only
yesterday asking whether "generic identifier" was a special XML technical
term so maybe you're right. But I don't think you can argue for CDATA and
against generic identifier! "element name" is just as imprecise as "free
text"!


> 3.6.1 first p
>
> in the list, item 3 has "component )". The close parenthesis appears to be
> a typo. In the pdf form, "component" though I think it is in a different
> font, really does not stand out very well.

Yes, it's a typo (actually a cockup introduced in the XMLification). Well
spotted!

>
> 3.6.2 second (or is it third) paragraph
>
> "When an entity is declared more than once, the first declaration is
> binding and the others ignored."
>
> I'll get back to this later in the major critique.
>
> In the following example and the one that follows, I think you ought to
> include the TEIform attribute. This would bring it into line with
> recommendations concerning modifications.
>
> <!ELEMENT it - - (%phrase.seq) >
> <!ATTLIST it
> id ID #IMPLIED
> lang IDREF %INHERITED
> n CDATA #IMPLIED
> rend CDATA #FIXED 'italics' >

Good point. Have added it.
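
On the 3.6.2 rule quoted above (the first declaration of an entity is
binding), a minimal sketch may make the mechanism clearer. It assumes that
the DTD's own default declaration for TEI.prose is 'IGNORE'. The two
parts are shown together only for comparison.

```xml
<!-- 1. The DTD subset in the document is read first, so this
        declaration is the binding one -->
<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN"
  "tei2.dtd" [
<!ENTITY % TEI.prose 'INCLUDE' >
]>

<!-- 2. Later, inside tei2.dtd, the default declaration (assumed form)
        is met second and therefore ignored -->
<!ENTITY % TEI.prose 'IGNORE' >
```

This is why a declaration placed in the DTD subset can override a default
without any change to the distributed DTD files.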


>
> 3.6.3 end
>
> "The default text structure tags, which are also documented as part of the
> core, are embedded by the base tag set, unless the base defines its own
> text structure tags; see the chapters on the individual bases."
>
> This is not clear. And the reference should give the location of the
> chapters in the guidelines, that is, Section III.
>

This is supplied at the other point where "default text structure" is
referenced. How would you clarify the prose though?


> 3.7 first p
>
> I think that "a-class" and "m-class" should be followed by "(attribute
> class)" and "(model or member class)". And then eliminate spelling it out
> later.

Or maybe it would be better to drop "a-class" and "m-class" as terms
completely?

>
> 3.7.1 List beginning "The a-classes declared in the core tag sets of these
> Guidelines are:"
>
> First, "of these Guidelines" seems to be an unnecessary qualifier.

Agreed

>
> But I think I detect a minor logic error in the pointer models.
>
> pointer elements which point from one location in the document to another
> (section 6.6, Simple
> Links and Cross References)
>
> This defines pointer as encompassing a pointer from one location to another
> in the same document, but in fact, the a.pointer is referenced in
> a.xPointer. And why is there an a.xPointer and not, let us say, an
> a.inPointer which also references a.pointer? Shouldn't there be, even if
> nothing is gained in maintainability (since it is only one attribute), for
> the sake of consistency.

I think I follow your argument, but I'm not sure what to do about it. The
reasoning  is simply that a.xPointer is a subclass of a.pointer, because
it inherits attributes from it. You're right in saying that a.xPointer
extends the definition of a.pointer of course, so maybe this isn't proper
subclassing, but that's how we do it...

>
> At any rate, the definition of pointer in the list seems at odds with its
> use in xPointer (that is the statement that the attribute class is about
> pointing within a document and not just about pointing).

Using a.pointer in the definition of a.xPointer doesn't imply that the
semantics of a.pointer are *all* inherited. A flying fish is still a kind
of fish, even though a reasonable definition of a fish might be that it
doesn't fly!


>
> 3.7.2 first p
>
> This paragraph is confusing as written:
>
> When the members of a class are structurally similar and can appear at the
> same kinds of structural
> locations in the document, they are grouped together into an m-class (or
> 'model-class'). M-classes are implemented by defining a parameter entity
> for use in the formal declaration of element content models. The parameter
> entity takes the name of the class it defines, and prefixes the string
> 'm.', which can be interpreted as model or as members. The replacement text
> of the entity is a list of the members of the class, separated by '|', the
> content model symbol for alternation.
>
> First, it would help to change "appear at the" to "appear in the."
>

OK


> Better yet, rephrase it more like this:
>
> When the members of a class are structurally similar and can appear in
> elements that are structurally similar, they are grouped together into an
> m-class (or 'model-class').
>
> Your explanation of the m-class, namely that they embody elements that are
> structurally similar and can appear in elements that are structurally
> similar is in tension with 3.7.4, where the low-level element classes are
> grouped because they are "semantically or structurally" similar. Perhaps
> all of this needs to be teased out a bit to be clearer and more consistent.
> In fact, I think some very interesting design issues are at play here.

The definition of a model class could certainly be improved. I'll have a
crack at it when I feel a bit stronger. If I do go through systematically
removing the "m-class" term, this will be a good opportunity also to do
something about the consistency issue.

Conceptually at least some model classes are more "semantic" than others.
For example, the m.edit class groups all elements which are used for
editorial emendation. Now, these elements are *structurally*
indistinguishable  from any other of the various phrase level elements
-- m.emph elements for example -- i.e. they have the same content model
and can appear in the same content models. But it seems useful to class
them separately -- even more so in a world which is moving to schemas!
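
For what it's worth, here is a sketch of how such a class is declared,
following the description quoted from 3.7.2. The member list is
illustrative rather than checked against the current DTD, and "myEdit" is
a made-up element name.

```xml
<!-- x-dot entity: defaults to an empty string, so it adds nothing -->
<!ENTITY % x.edit '' >

<!-- m-class entity: the class members, separated by '|'
     (member list illustrative) -->
<!ENTITY % m.edit '%x.edit; add | corr | del | orig | reg | sic | unclear' >

<!-- A local modification, read before the default above, could extend
     the class; myEdit is hypothetical -->
<!ENTITY % x.edit 'myEdit |' >
```

Nothing in the declarations distinguishes m.edit structurally from a class
like m.emph. The grouping is carried entirely by the entity name.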

>
> 3.7.2 second p
>
> Somehow "the default value of these x-dot entities is always the empty
> string" seems to elevate empty string to an interesting status (which
> perhaps it should have). "an empty string" would read better.

OK, "an" empty string it is.

>
> 3.7.4
>
> Within Low-Level you have phrase-level, inter-level, and an unnamed class.
> I found inter-level within low-level to be confusing. Perhaps there is a
> better descriptive name than low-level. I can't think of one myself. As for
> the unnamed group, I think they need a descriptive name, for stylistic
> reasons, but also to add clarity to the description. Something like the
> "in-and-among classes", "multi-level classes", "anti-hierarchy classes",
> "class-less classes" (because they are anti-hierarchy) or, to follow the
> sg/xml fragment, "Included Classes".
>

Yes, "low-level" is confusing and probably unnecessary. It's meant to be a
characterization of these classes, rather than a technical term. I will
reword the three headings to be more consistent (inter, phrase, and incl
are all defined elsewhere of course)

> In the list at the top of 3.7.4 you have "versePhrase" but in the dtd
> fragment you have "phrase.verse".
>

Whoops! The name in the DTD fragment is wrong (or at least,
inconsistent in form), in fact, so fixing it is not as simple
as it looks: but I'll give it a go...


> The "included classes" are at the end of the list, but in the fragment are
> in between in the sg/xml fragment that follows. In the list and fragment,
> data and date are in different orders. It makes it a bit confusing if one
> is reading the list and then reading the fragment, item by item.
>

OK. Will reorder consistently.

> 3.7.7
>
> component.seq I think "sequence" is misleading, or at least confusing,
> because it seems to suggest that the elements are in a specified order. (At
> least the common understanding of the word sequence is that it implies an
> ordering.) Instead the sequence, if I understand correctly, only applies to
> the component.seq as defined for the general base, and the sequence
> describes the sequence of the content model groups. Yes? Otherwise the
> component.seq is more or less (because of the %m.Incl;) a repeatable or
> group. The phrase.seq is also not a sequence (in the sense of ordered).
> Perhaps this is a legacy of the migration to XML?

No, it's just the name we've always used for them. It isn't meant to imply
sequence, just repetition, so you're right that it's not the best name.
But I think we're stuck with it now.

>
> 3.8.2
>
> Again, I would use "element name" or "name" rather than generic identifier.
> In fact in the first paragraph "standard name" is used, followed by
> "generic identifier" in parentheses. I think the latter is more intuitive
> for most of the likely readers.

Since we use "standard generic identifier" elsewhere in the paragraph, I
think I prefer to simply remove the word "name" and the parentheses. That
way I can replace all "generic identifier" occurrences by "element name"
if we decide to do so.


>
> Also the last sentence about mixed case in the parameter name might be
> enhanced and made a little clearer. While matching case in the name is
> essential, either upper, lower, or mixed case can be used in the literal
> (that is the element name), though users should bear in mind that XML is
> case sensitive with regard to the element name.
>

The sentence is misleading: it is an SGML relic [you'll remember that in
SGML entity names are case-sensitive by default where element names
are not]. I have replaced it with

"Since all names in the TEI dtds are case sensitive, the ... "


> Footnote 7
>
> Sentence beginning "It should be noted however ...generic identifiers;
> attribute names ... modified." I'd replace the semicolon with a full stop.
> It makes it a little clearer.
>

Fair enough. Some people, like me, just love semicolons... I think the
whole footnote could go actually, since we haven't done anything about
alternate sets of GIs. The thing about indirecting attribute names etc. is
only there because we did allow for that in P1 and it drove everyone nuts.



> 3.8.3
>
> The ISO-date can also be entered as 19680922 (basic format as opposed to
> extended). Thus "a date like 'September 22, 1968' would be entered as
> '19680922' or alternatively as '19680922'."
>

err? OK, I see what you mean. You don't comment on the fact that the PDF
version has the footnote in the wrong place. Now fixed.
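
For the record, the two ISO 8601 forms of that date might be encoded as
below. This is a sketch which assumes that the value attribute of <date>
is where the ISO form is recorded.

```xml
<!-- extended ISO 8601 format -->
<date value="1968-09-22">September 22, 1968</date>

<!-- basic ISO 8601 format, same day -->
<date value="19680922">September 22, 1968</date>
```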


> Now the major critique.
>
> This chapter is one of the most important in the entire guidelines, at
> least for anyone that wants to really understand the DTDs and how they are
> put together. The complexity of the subject matter is such that it is
> unquestionably difficult to describe. One problem is that it is difficult
> to lay it out progressively; one needs to know everything to know anything.
> I am sorry to say that I have no substantive suggestion on a rhetorical
> strategy, but I do have a few suggestions about how it might be improved.
> The only strategy I can think of is to work "from the outside in," which is
> to say, with an overview of the whole, and then progressively more detailed
> discussions of the whole. Linear, I think, will only work for the
> persistent and patient. I am sorry I cannot come up with a more precise
> description of what I mean, but perhaps this will be suggestive.
>

Well, I agree with all of that, but I also share your inability to
come up with a better rhetorical strategy in what is after all a reference
book, not an introductory guide. I've now explained all of this often
enough to have a better sense for how it should be presented in a
tutorial context but that's not what we're after here.


> Understanding declarations subsets, parameter entities (including the order
> in which they are read) and marked sections is critical to understanding
> TEI. While they get mentioned over and over and here and there, it would be
> much better to have a major section at or near the beginning of the chapter
> that explains them and their use in TEI thoroughly and clearly. The bits in
> the gentle introduction chapter and scattered about are not sufficient to
> make this chapter stand alone, which is to say, to be intelligible without
> going out and finding a good SG/XML book. I think it should stand alone.

Good point, but I think it belongs in the preceding chapter.

>
> I also think that there needs to be a section describing entity management,
> public identifiers, system identifiers, catalogs, and how they all relate
> to parsers and parsing. It probably does not belong in this chapter, but in
> the gentle introduction chapter, in and around 2.9 Putting It All Together.
> This would be of extra-TEI use, as no one ever seems to take this on,
> despite the flood of tomes in the last few years. It would be good if TEI
> did this, and did it well.

OK, Steve and I are having a crack at that, and will fly the result past
you as soon as it's available.

>
> XML is mentioned here and there. Given that you appear to be rewriting the
> DTD in such a way that it no longer uses global inclusions/exclusions and
> all of the content models are XML conforming, I think you can deal with
> both XML and SGML in this chapter together. If my perception is true, and
> you do decide to simply deal with both, then you need to comb carefully
> through the chapter to make mention of SGML and XML consistent. As it
> stands, XML is mentioned here and there along the way.
>
> When technical terminology is used, I'd migrate it to XML terminology, as
> this is increasingly more likely to be the terminology understood and
> encountered by users.
>

Yes, we are trying to define the thing so that it can be both XML and SGML
compliant (some irritating person on the list told me off for implying
that XML wasn't SGML the other day, but you know what I mean!)

> I note that you have maintained mixed case in the generic identifiers. I am
> strongly in favor of folding all names into lower case, as it makes
> converting SGML into XML easier (either retrospective conversion, or as we
> frequently do, using SGML for maintenance and XML for delivery), does not
> require editors to know casing when using SG/XML ignorant software for
> markup, and, at any rate, makes keying names easier. The loss of
> readability is minor, I think, in relation to ease of conversion. I think
> legacy issues are moot, as the default behavior for SGML software was to
> fold to upper (though emacs, contrary to the default behavior, folds to
> lower case). In fact, using only lower case all NAME and NMTOKEN makes
> dealing with legacy information very easy, as SX has an option to fold to
> lower case when converting.

Sorry, this is a lost cause. We are sticking with camelcase and we are
sticking with case-sensitive names. Thus it was decreed by the founding
fathers of the metalanguage committee after many long hours of wrangling
and torment, and thus it shall remain, lo, even until the end of time.

>
> Even if you do not agree, you should at least make sure that all of the
> "enumeration attribute types" and, where appropriate, "default attribute
> values" are lower case. In P3 there was some inconsistency with respect to
> these. For example, I remember running across a lot of "(yes | no) 'yes'"
> and "(YES | NO) 'YES'." It would be much better if all of these were simply
> lower case, as it avoids some unexpected tripping. This is, of course,
> invisible when parsing the DTD, but appears the first time you have an
> instance that has a clash, for example, when the instance literal is "yes"
> and the enumeration or default value is "YES." I only discovered this by
> accident, when doing an SGML to XML conversion.

Here, on the other hand, I entirely agree. We need to take a good look at
the enumeration attribute values and make them consistent throughout --
please tell me of any specific cases you come across.
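
A sketch of the kind of clash Daniel describes, using a made-up element
and attribute (neither is taken from the TEI DTD):

```xml
<!-- Hypothetical declaration with lower-case enumerated values -->
<!ATTLIST myElem
  complete (yes | no) 'yes' >

<!-- Matches a declared value: valid under case-sensitive XML rules -->
<myElem complete="yes"/>

<!-- Not among the declared values: rejected by an XML parser, although
     an SGML parser that folds case would have accepted it -->
<myElem complete="YES"/>
```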

>
> I also think it would be helpful to explain the use of the dtd fragments in
> the chapter. It would be useful to explain some of the conventions,
> especially in regard to the bracketed information. Since entries like the
> following are not SG/XML, they need to be explained:
>
> [definitions from 3.6.2: Local modifications to parameter entities inserted
> here ]

I thought we did explain that in the introduction somewhere, but will
check.

>
> I assume the ODD instance takes care of these references, yes? But your
> uninitiated readers are going to be mystified by it.
>
> I hope this helps at least a little. Good luck in wrapping this up.

Very helpful. Many thanks.


>
> Daniel
>
>
> At 12:52 PM 8/20/01 +0100, Lou Burnard wrote:
> >A brave man! Thanks a lot ... your cheesy form letter follows. Look
> >forward to seeing you downunder
> >
> >Lou
> >
> >Dear Daniel
> >
> >Many thanks for your kind offer to help us in the revision of TEI
> >P4. You are hereby assigned to the reading of chapter 3 (Structure of
> >the TEI DTD). Good luck!
> >
> >Just to recap on what we're hoping to achieve this time round:
> >
> >- you should be reading the text carefully for incompatibilities
> >between what is said in the text and the use of XML; and for errors
> >and typos.
> >
> >- you may also identify more basic issues which you think need
> >substantial revision or rethinking
> >
> >- you should aim to send your list of proposed corrections/revision to
> >the TEI editors (editors@tei-c.org) by 20 September 2001
> >
> >- if you wish to propose topics which need more substantial
> >attention, rewriting, or further work, please do so: if we receive
> >them before the November members meeting, we will be able to present
> >them to the TEI Council for its consideration at that time, but
> >proposals for further work can be considered at any time after the
> >Council is in existence.
> >
> >The current draft text is at http://www.tei-c.org/P4X/ST.htm (or
> >ST.pdf in PDF format). We are not planning to distribute the XML form
> >of the chapter at this stage since its format is likely to change
> >during the editing process, and we don't want to give you the
> >additional burden of learning how the ODD system works!
> >
> >To see who else is working on the revision process, keep an eye on
> >http://www.tei-c.org/P4X/Status/
> >
> >Please feel free to discuss any points of detail you think of interest
> >to the TEI community using tei-l@listserv.brown.edu in the usual way.
> >
> >Once more, many thanks for your assistance, and good hunting!
> >
> >Lou Burnard
> >Steven De Rose
> >
> >editors@tei-c.org
> >
> >
> >On Mon, 20 Aug 2001, Daniel Pitti wrote:
> >
> >|I'll take a look at 3. Structure of the TEI Document Type Definition --Daniel
> >|
> >|
> >|Daniel V. Pitti         Project Director
> >|Institute for Advanced Technology in the Humanities
> >|Alderman Library        University of Virginia  Charlottesville, Virginia 22903
> >|Phone: 434 924-6594     Fax: 434 982-2363       Email: dpitti@Virginia.edu
> >|http://jefferson.village.virginia.edu
> >|AREA CODE IS NEW EFFECTIVE JUNE 2001
> >|
> >|
> >
> >  ----------------------------------------------------------------
> >  Lou Burnard                           http://users.ox.ac.uk/~lou
> >  ----------------------------------------------------------------
>
>

