Object schemata
---------------

An object schema consists of the following components:

  * a set of atomic types, usually a subset of Python's builtin
    types.  The default atomic types are string, int, long, float, and
    complex.  In principle, you can add other builtin types (like
    function, class, or file) or extension types to a schema, but
    Grouch currently has problems with many builtin types.  (In
    particular, only types whose values can be pickled may be atomic
    types in Grouch.)

  * a type alias mapping, letting you define shorthand names for common
    types.

  * a set of class definitions.  A class definition maps instance
    attribute names to attribute types.  This serves two purposes: it
    defines the expected set of attributes for instances of a class, and
    it defines the type of each attribute.

In the current version of Grouch, an object schema is defined through a
project description file and the class docstrings in a set of source
files.  This is useful in practice, but it's kind of hard to talk about
object schemata without a simple, compact schema description language.
Thus, consider the following pseudo-schema:

  class Thing:
    name : string

  class Animal (Thing):
    num_legs : int
    furry : boolean

(Coincidentally, this is the syntax emitted by gen_schema's "-t" option.
However, this is currently a write-only language; Grouch has no way to
parse schemata created by "gen_schema -t".)

This defines an object schema with no additional atomic types (just the
default five: string, int, long, float, and complex), no aliases, and
two classes (both, presumably, in the __main__ module, since the class
names are unqualified).

If you ask Grouch to type-check an instance of Thing under this schema,
or if it comes across a Thing instance in the course of type-checking a
larger object graph, it does the following:
  * ensure that the instance has exactly one attribute, 'name'
  * ensure that the value of this attribute is a string

Similarly, Grouch type-checks an Animal instance under this schema as
follows:
  * ensure that it has exactly three attributes, 'name', 'num_legs',
    and 'furry' (note that 'name' is inherited from Thing)
  * ensure that the value of 'name' is a string, 'num_legs' an int,
    and 'furry' a boolean (i.e. 0, 1, or None)
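
The two checks above can be sketched in a few lines of Python.  This is
only an illustration, not Grouch's implementation: SCHEMA, TYPE_CHECKS,
expected_attributes() and check_instance() are hypothetical names, the
schema is hard-coded rather than read from docstrings, and the code uses
Python 3 syntax.

```python
# Illustrative sketch only; none of these names are part of Grouch.

# The pseudo-schema: class name -> (base class name or None, own attributes).
SCHEMA = {
    "Thing": (None, {"name": "string"}),
    "Animal": ("Thing", {"num_legs": "int", "furry": "boolean"}),
}

# One predicate per type name used in the pseudo-schema.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so reject it explicitly
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    # the text describes a boolean as 0, 1, or None
    "boolean": lambda v: v is None or (isinstance(v, int) and v in (0, 1)),
}


def expected_attributes(class_name):
    """Collect attribute types for a class, including inherited ones."""
    base, own = SCHEMA[class_name]
    attrs = dict(expected_attributes(base)) if base else {}
    attrs.update(own)
    return attrs


def check_instance(obj):
    """Return a list of complaints; an empty list means obj conforms."""
    expected = expected_attributes(type(obj).__name__)
    actual = vars(obj)
    errors = []
    for name in sorted(set(expected) - set(actual)):
        errors.append("missing attribute %r" % name)
    for name in sorted(set(actual) - set(expected)):
        errors.append("unexpected attribute %r" % name)
    for name in sorted(set(expected) & set(actual)):
        if not TYPE_CHECKS[expected[name]](actual[name]):
            errors.append("attribute %r: expected %s" % (name, expected[name]))
    return errors


class Thing:
    def __init__(self, name):
        self.name = name


class Animal(Thing):
    def __init__(self, name, num_legs, furry):
        Thing.__init__(self, name)
        self.num_legs = num_legs
        self.furry = furry
```

Calling check_instance(Animal("cat", 4, 1)) returns an empty list, while
an Animal whose num_legs is the string "four" produces one complaint.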


Defining an object schema: class docstrings
-------------------------------------------

Currently, you define an object schema by writing specially-formatted
class docstrings.  (There is no separate schema description
language... yet.)  For example, the Thing class in the above
pseudo-schema might be documented as:

  class Thing:
      """A single thing, which may be an animal, vegetable, or mineral.
      The only property common to all things is a name.

      Instance attributes:
        name : string
          the name of the thing
      """

Grouch (specifically, the gen_schema script that parses these docstrings)
ignores everything in the docstring up to the "Instance attributes:"
line.  After that, things get fairly rigid:
    
  * the "Instance attributes:" line must be indented to the same depth
    as the main body of the docstring
    
  * each attribute name is indented two spaces relative to that,
    and followed by a colon (":") and the attribute's type
    
  * attribute descriptions (which are optional, and are ignored by
    Grouch) are indented a further two spaces
    
  * when indentation returns to the same level as the "Instance
    attributes:" line, Grouch stops processing the docstring and
    goes on to the next class in the module (thus, blank lines
    are allowed in the attribute list)
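
To make the indentation rules concrete, here is a rough sketch of a
parser that follows them.  It is not gen_schema's actual code:
parse_instance_attributes() is a hypothetical helper, and details such
as how gen_schema handles tabs or a line without a colon are
assumptions.

```python
# Illustrative sketch only; parse_instance_attributes() is not part of Grouch.

def parse_instance_attributes(docstring):
    """Return {attribute name: type string}, or None if no list is found."""
    header_indent = None
    attrs = {}
    for line in docstring.expandtabs().splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if header_indent is None:
            # everything before the header line is ignored
            if stripped == "Instance attributes:":
                header_indent = indent
            continue
        if not stripped:
            continue            # blank lines are allowed inside the list
        if indent <= header_indent:
            break               # back at the header's depth: list is over
        if indent == header_indent + 2:
            # "name : type" (assumption: a missing colon gives an empty type)
            name, _, type_name = stripped.partition(":")
            attrs[name.strip()] = type_name.strip()
        # anything indented further is a description and is skipped
    return attrs if header_indent is not None else None


ANIMAL_DOC = """An animal, ie. a thing with multiple legs and possibly fur.

      Instance attributes:
        num_legs : int
          the number of legs this animal has
        furry : boolean
          whether this animal is furry or not

      Outsiders should use 'get_num_legs()' and 'is_furry()' to access
      these attributes.
      """

animal_attrs = parse_instance_attributes(ANIMAL_DOC)
```

Here animal_attrs maps 'num_legs' to 'int' and 'furry' to 'boolean'; the
descriptions and the trailing paragraph are dropped.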

Here is a slightly more elaborate example:

  class Animal (Thing):
      """An animal, ie. a thing with multiple legs and possibly fur.

      Instance attributes:
        num_legs : int
          the number of legs this animal has
        furry : boolean
          whether this animal is furry or not

      Outsiders should use 'get_num_legs()' and 'is_furry()' to access
      these attributes.
      """

Here is a stripped-down version of this docstring that is exactly
equivalent as far as Grouch is concerned:

  class Animal (Thing):
      """
      Instance attributes:
        num_legs : int
        furry : boolean
      """

Sometimes a class will have no instance attributes of its own; Grouch has
special syntax for this:

  class Mammal (Animal):
    """Instance attributes: none"""

This is different from simply omitting the list of instance attributes,
or omitting the docstring entirely.  If Grouch sees a Mammal instance
with any attributes apart from those inherited from Animal, it will
complain.  However, if Mammal has no docstring or attribute list, Grouch
can't do detailed type-checking of instances of that class.  Instead, it
  * complains that the class has no docstring (or no attribute list)
  * excludes the class from the schema
  * when type-checking an object graph, complains about any instances of
    that class it discovers
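
The three situations can be told apart with a check along these lines.
This is a sketch under assumptions: classify_docstring() is a
hypothetical name, and gen_schema's exact matching (case, surrounding
whitespace) may differ.

```python
# Illustrative sketch only; classify_docstring() is not part of Grouch.

def classify_docstring(docstring):
    """Say how a class would be treated, based on its docstring alone."""
    if docstring is None:
        return "excluded: no docstring"
    for line in docstring.splitlines():
        text = line.strip()
        if text == "Instance attributes: none":
            return "included: no attributes of its own"
        if text == "Instance attributes:":
            return "included: attribute list follows"
    return "excluded: no attribute list"
```

Only the two "included" cases give Grouch something to check instances
against.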


Defining an object schema: the project description file
-------------------------------------------------------

Writing class docstrings that document every instance attribute is the
key part of defining an object schema.  However, you still have to tell
Grouch how to find those class docstrings and what to do with them.  This
is done with the gen_schema script and its project description file.


[Searching by directory]

At its simplest, the project description file contains a list of
directories to search for Python source files, and possibly a prefix to
use in turning source filenames into module names.  Those directories
are interpreted relative to a base directory that you supply to
gen_schema.

For example, the project description file for Grouch itself (grouch.proj
in the top-level Grouch directory) starts out with this:

  dirs = ["."]
  prefix = "grouch"

(The project description file is just Python code; it's execfile'd by
gen_schema.)  This instructs gen_schema to search for *.py in the
base directory, and to assume that all modules found actually live in
the "grouch" package.  Hence when it finds schema.py, it considers that
module to be "grouch.schema", and a class ObjectSchema in that file will
be called "grouch.schema.ObjectSchema".
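
As a rough illustration of both ideas (a project description that is
plain Python, and a prefix that turns filenames into module names),
consider the sketch below.  load_project() and module_name() are
hypothetical helpers, not gen_schema functions; the sketch uses Python
3's exec() and only handles files that sit directly in the base
directory.

```python
# Illustrative sketch only; these helpers are not part of Grouch.
import os

PROJECT_TEXT = '''
dirs = ["."]
prefix = "grouch"
'''


def load_project(text):
    """Execute project description source and return the names it defines."""
    namespace = {}
    exec(text, namespace)
    # exec() adds __builtins__ to the namespace; drop such names
    return {k: v for k, v in namespace.items() if not k.startswith("__")}


def module_name(filename, prefix=None):
    """Turn "schema.py" into "grouch.schema" when prefix is "grouch"."""
    base = os.path.splitext(os.path.basename(filename))[0]
    return prefix + "." + base if prefix else base


project = load_project(PROJECT_TEXT)
schema_module = module_name("schema.py", project["prefix"])
qualified_class = schema_module + ".ObjectSchema"
```

With the project text shown, qualified_class is
"grouch.schema.ObjectSchema".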

gen_schema does *not* search recursively; if you want it to descend into
sub-directories, you must specify them explicitly:
  dirs = ["compiler", "compiler/parser", "compiler/optimizer"]
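
A non-recursive search of this kind might look like the following
sketch.  find_sources() is a hypothetical helper and the use of the glob
module is an assumption; gen_schema's own search code may differ.

```python
# Illustrative sketch only; find_sources() is not part of Grouch.
import glob
import os


def find_sources(base_dir, dirs):
    """List the *.py files directly inside each directory (no recursion)."""
    found = []
    for d in dirs:
        pattern = os.path.join(base_dir, d, "*.py")
        found.extend(sorted(glob.glob(pattern)))
    return found
```

A file in a sub-directory is only picked up if that sub-directory is
itself named in the list passed as 'dirs'.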

The directories in 'dirs' are interpreted relative to a base directory
supplied with the "-d" (or --base-dir) option to gen_schema.  If you run
gen_schema from Grouch's top directory, you must specify "lib" as the
base directory, since that's where the Grouch source code (schema.py and
friends) lives:
  ./scripts/gen_schema -d lib -p grouch.proj

The resulting schema will be written (as a pickle) to schema.pkl; if you
want a human-readable representation of the schema, add "-t schema.txt"
to the command.


[Specifying individual modules]

If you don't want to search every "*.py" file in a list of directories,
you can supply a list of explicit module names, e.g.:
  extra_modules = ["grouch.schema",
                   "grouch.valuetype"]

Note that extra_modules is a list of fully-qualified module names, *not*
filenames.

This variable is called 'extra_modules' because these modules are added
to the list of modules found by searching the directories named in
'dirs'.  If 'dirs' isn't supplied, the modules in 'extra_modules' are
Grouch's only source for class definitions.


[Excluding individual modules]

You can refine gen_schema's search for classes by excluding certain
modules.  As an example, Grouch includes a copy of SPARK (John Aycock's
nifty parser framework) as the "grouch.spark" module; since this is
really someone else's code, it doesn't have Grouch-style docstrings to
parse.  Also, the parser classes are transient and shouldn't wind up in
any persistent store of a Grouch object graph, so there's not much point
in type-checking them.  Thus, I exclude both grouch.spark and
grouch.type_parser (which provides classes derived from the SPARK
classes) from gen_schema's scan:
  exclude_modules = ["grouch.spark", "grouch.type_parser"]

Like extra_modules, exclude_modules is a list of fully-qualified module
names.


[Excluding individual classes]

You can also exclude specific classes from the search, instead of whole
modules.  This is useful if a particular module provides some transient
classes and other first-class persistent classes.  For example, I might
wish to exclude the TypecheckContext class, defined in grouch.context,
from schema generation:
  exclude_classes = ["grouch.context.TypecheckContext"]

Again, classes are specified as fully-qualified Python names.
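
The way these three variables combine can be pictured with the sketch
below.  select_modules() and select_classes() are hypothetical helpers;
in particular, letting an exclusion win over an extra module of the same
name is an assumption, since that case isn't covered above.

```python
# Illustrative sketch only; these helpers are not part of Grouch.

def select_modules(found, extra_modules=(), exclude_modules=()):
    """Modules found via 'dirs', plus extras, minus exclusions."""
    selected = []
    for name in list(found) + list(extra_modules):
        # assumption: exclusions take precedence; duplicates are dropped
        if name not in exclude_modules and name not in selected:
            selected.append(name)
    return selected


def select_classes(class_names, exclude_classes=()):
    """Drop individually excluded classes, keeping the original order."""
    return [c for c in class_names if c not in exclude_classes]


modules = select_modules(
    found=["grouch.schema", "grouch.spark", "grouch.type_parser"],
    extra_modules=["grouch.valuetype"],
    exclude_modules=["grouch.spark", "grouch.type_parser"],
)
```

With these inputs, modules contains just "grouch.schema" and
"grouch.valuetype".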


[Adding atomic types]

If the five default atomic types aren't enough for your project, you'll
have to add new ones.  This might happen if you use extension types in
your application, or if you store slightly odd objects in your
persistent object graph, like functions or class objects.  New atomic
types are specified using an example value, not using the type object
itself.  (This is necessary because type objects can't be pickled, and
gen_schema pickles the schema for future use.  We can't store type
objects in the pickled schema, so we store sample values instead.)

For instance, to add Marc-André Lemburg's DateTime type to your schema,
add this to your project description file:
  import mx.DateTime
  atomic_types = [mx.DateTime.now()]

The structure of 'atomic_types' is a tad complex.  Most often, each
element of the list is simply a value of the atomic type you want to add
to your schema -- e.g. here I created a sample DateTime object.  Since
these sample values go straight into the object schema, which is
subsequently pickled by gen_schema, these must be pickle-able values.
Grouch probably needs to grow a real schema definition language before
you can have, say, Python function or file objects as atomic types in an
object schema.  (In other words, I think this is an implementation
problem due to reliance on pickling rather than a fundamental problem.)

In this simple case, the name of the atomic type is implicit, because
the type itself supplies its name -- "DateTime" in the above example.
(Try "type(mx.DateTime.now()).__name__".)

In some cases, though, you may want to specify your own name for an
atomic type.  In that case, just supply a tuple (sample_value,
type_name) in atomic_types.  This is useful if you're dealing with
ExtensionClass, where every class is a new type.  (This is also the case
with new-style classes in Python 2.2.)  For instance, a ZODB application
that needs "class" and "instance" types (for class objects and generic
instance objects) might do this:

   import ZODB
   from Persistence import Persistent
   # ...
   atomic_types = [(Persistent(), "instance"),
                   (Persistent, "class")]

If you don't understand why you might need this, you probably don't need
it.
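
To show how the two forms of entry might be interpreted, here is a small
sketch.  atomic_type_table() is a hypothetical helper, not Grouch code.
It uses the standard datetime module and plain 'object' as stand-ins for
mx.DateTime and Persistent so that it runs anywhere, and the rule it uses
to recognise a (sample_value, type_name) pair is an assumption.

```python
# Illustrative sketch only; atomic_type_table() is not part of Grouch.
import datetime


def atomic_type_table(atomic_types):
    """Map type name -> type for bare samples and (sample, name) pairs."""
    table = {}
    for entry in atomic_types:
        # assumption: a 2-tuple ending in a string is a (sample, name) pair
        if (isinstance(entry, tuple) and len(entry) == 2
                and isinstance(entry[1], str)):
            sample, name = entry
        else:
            # implicit name: the type supplies it
            sample, name = entry, type(entry).__name__
        table[name] = type(sample)
    return table


atomic_types = [
    datetime.date(2003, 1, 16),   # stand-in for mx.DateTime.now()
    (object(), "instance"),       # stand-in for Persistent()
    (object, "class"),            # stand-in for Persistent
]
table = atomic_type_table(atomic_types)
```

The first entry is registered under the name "date", taken from the type
itself; the other two get the names supplied in the tuples.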


Putting it all together
-----------------------

For a simple example of defining an object schema, take a look in the
"examples" sub-directory of Grouch's source distribution.  There, you'll
find:
  * the thing.py and animal.py modules, which provide the classes
    ThingCollection, Thing, Animal, and Mammal
  * the make_things script, which creates some things, bundles
    them in a collection, and pickles them to things.pkl
  * the things.proj project description file, which tells
    gen_schema how to generate a schema for this project

For now, we're just going to generate a schema from the Python source
files and things.proj.  Later (in "checking.txt", the document that
covers type-checking an object graph) we'll run make_things and
type-check the results.

If you haven't installed Grouch yet, you should either do so now or
perpetrate your favourite kludge for ensuring that it's available
through sys.path.  (If you don't have a favourite kludge, just install
it.)  Run
  python -c 'import grouch'
to make sure it worked -- if this command completes silently, all is
well.

Installing Grouch should also install the gen_schema and check_data
scripts.  I'll assume they're in your shell's PATH; you might have to
adjust your PATH or the commands here accordingly.

Before we run gen_schema, let's take a look at the ingredients of this
project.  First, the project description file, things.proj, is quite
simple:

  extra_modules = [("thing", "thing.py"), ("animal", "animal.py")]

There's no 'dirs' here, meaning gen_schema won't go searching for "*.py"
anywhere.  Each entry in this 'extra_modules' list is a (module name,
source filename) pair rather than a bare module name, so gen_schema just
looks for the 'thing' module in thing.py and the 'animal' module in
animal.py.  Since explicit source filenames are supplied, the 'thing'
and 'animal' modules don't have to be in Python's path -- gen_schema
simply parses the source files.

Next, take a look at thing.py.  You'll see that it defines two classes,
Thing and ThingCollection, and that the instance attributes of each are
fully documented.  Similarly, animal.py provides the Animal and Mammal
classes.

Finally, let's run gen_schema.  We'll save the schema for this project
to things_schema.pkl and things_schema.txt -- the two files have the same
content, but only the latter is human-readable.  From the "examples"
directory, run this:

  gen_schema -p things.proj -o things_schema.pkl -t things_schema.txt

If you're really curious about what's going on here, add the "-v"
option.  The output of gen_schema (without "-v") should look like this:

  looking for classes...
  found 4 classes
  parsing class docstrings...
  writing object schema to things_schema.txt...
  pickling object schema to things_schema.pkl...

Take a look at things_schema.txt for a human-readable representation of
the schema also saved in things_schema.pkl.

Now that we have an object schema for this project, we can use it later
to type-check a persistent object graph created by applications that use
this project, such as make_things.  This will be done in the next
document, "checking.txt".


$Id: schema.txt 20229 2003-01-16 21:29:07Z akuchlin $
