
  README file for PyLucene
  ------------------------

  Contents
  --------

   - Welcome
   - Installing PyLucene
   - Libgcj security initialization
   - Running the -db- flavor of PyLucene
   - PyLucene paper presented at PyCON 2005 and EuroPython 2005
   - API documentation for PyLucene


  Welcome
  -------

  Welcome to PyLucene, a gcj-compiled python extension for Java Lucene.
  PyLucene is a project maintained by the Open Source Applications Foundation.

  For more information on PyLucene, consider joining the PyLucene mailing list
  at pylucene-dev@osafoundation.org or visit http://pylucene.osafoundation.org.
  See http://lists.osafoundation.org/mailman/listinfo/pylucene-dev for more
  information on the PyLucene mailing list.


  Installing PyLucene
  -------------------

  To build PyLucene from sources, please see the INSTALL file.

  To install PyLucene binaries you just downloaded:

    - install the files in the python directory into python's site-packages
      directory
    - if you downloaded binaries with Berkeley DB support, install the files
      in the db directory into the directory containing your Berkeley DB
      shared libraries, such as /usr/local/BerkeleyDB.4.3/lib
    - if you are installing Unix (Mac OS X or Linux) binaries, install the
      files in the gcj directory into /usr/local/lib

  For a Debian package, consider fixing http://bugs.debian.org/256283 !


  Libgcj security initialization
  ------------------------------

  By default, PyLucene runs with the default libgcj classpath.security and
  libgcj.security files. These files are included in the PyLucene binary
  distributions in the python/security subdirectory. 

  If you installed these files into a 'security' subdirectory of python's
  'site-packages' directory and PyLucene is the first or only library in
  your python process to initialize libgcj, then these files are picked up
  by libgcj.

  If this default behaviour is unsuitable for your application, you need to
  re-build PyLucene from sources after having changed or removed the code in
  PyLucene.i's %init body that changes the 'gnu.classpath.home.url' System
  property to a URL to python's 'site-packages' directory.


  Running the -db- flavor of PyLucene
  -----------------------------------

  PyLucene can be built with support for DbDirectory, a Berkeley DB-based
  implementation of Lucene's abstract Directory class. DbDirectory can be
  used as an alternative to FSDirectory when transactional support is
  required.

  To use DbDirectory you need an understanding of the python Berkeley DB
  API documented at http://pybsddb.sourceforge.net/bsddb3.html. This API
  mirrors the Berkeley DB C API documented at http://www.sleepycat.com/docs.
  For sample code using DbDirectory, please refer to the test_DbDirectory.py
  unit test.

  In order for DbDirectory to work properly, PyLucene and python's _bsddb
  extension MUST be using the same Berkeley DB SHARED library. PyLucene and
  python's _bsddb extension MUST be built against the SAME VERSION of
  Berkeley DB's shared library. If you downloaded a pre-compiled
  PyLucene-db- binary, that version is Berkeley DB 4.3.29.

  On Unix, you need to verify that python's _bsddb.so is linked against the
  same Berkeley DB shared library (on Linux, use 'ldd', on Mac OS X, use
  'otool -L'). If it is not, you need to rebuild the _bsddb extension so
  that it is. You may have to modify the logic in python's setup.py that
  deals with finding a suitable Berkeley DB installation accordingly.
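
  The check above can be scripted. The following is a minimal sketch, not
  part of PyLucene: it assumes 'ldd'-style output in which each dependency
  appears as 'name => path (address)'. The function name berkeley_db_libs
  and the sample text are hypothetical.

```python
import re

def berkeley_db_libs(ldd_output):
    """Return (name, path) pairs for Berkeley DB libraries in ldd-style output."""
    libs = []
    for line in ldd_output.splitlines():
        # ldd prints lines such as: "libdb-4.3.so => /some/path (0x...)"
        match = re.match(r"\s*(libdb[^\s]*)\s+=>\s+(\S+)", line)
        if match:
            libs.append((match.group(1), match.group(2)))
    return libs

# Hypothetical sample output, for illustration only.
SAMPLE = """\
    libdb-4.3.so => /usr/local/BerkeleyDB.4.3/lib/libdb-4.3.so (0x00002aaa)
    libc.so.6 => /lib/libc.so.6 (0x00002bbb)
"""
```

  Running the same function over the 'ldd' output for both _bsddb.so and
  the PyLucene extension and comparing the resulting paths is one way to
  confirm that both resolve to the same shared library.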

  On Windows, this means that you most likely need to build a custom version
  of python as the default one appears to be shipping with a _bsddb.pyd
  extension that is statically linked against Berkeley DB's 4.2.52 lib.
  Building a dynamically linked _bsddb.pyd extension requires changing the
  _bsddb.vcproj file accordingly. For an example, please see:
      http://svn.osafoundation.org/chandler/trunk/external/python/win32/
  and visit the directory corresponding to your python version.


  PyLucene paper presented at PyCON 2005 and EuroPython 2005
  ----------------------------------------------------------

    Title : Pulling Java Lucene into Python: PyLucene
    Author: Andi Vajda

     Abstract
     --------

     As OSAF needed an open source text search engine library for its Python
     based project, Chandler, we made the following bet: what if we pulled
     together Java Lucene, GNU's gcj Java compiler and SWIG to build a
     Python extension ?

     This paper examines the issues of pulling an open source Java library
     into Python by matching the differences in memory management, thread
     support and type systems. It also describes a technique for extending a
     Java class from Python by providing a python implementation to a Java
     extension proxy.

     Introduction
     ------------

     OSAF's flagship project, Chandler (http://www.osafoundation.org), is a
     personal information manager. As such, it needs the ability to run
     unstructured full text queries over arbitrarily large repositories of
     text.

     There are not that many open source text search engines
     available. Lucene is considered among the better ones and it is
     licensed under the Apache license, both of which make it a very
     attractive solution.

     But Lucene is written in Java. 

     Why PyLucene ?
     --------------

     Java presents quite a challenge since it is a very insular
     environment. It doesn't play well, both culturally and technically,
     with other non-Java environments. The Java Native Interface (JNI) is
     clunky, and the Java Runtime Environment is something we didn't want
     to be encumbered with either.

     Really, for Lucene to be an acceptable solution for Chandler, it had to
     be able to be compiled down to a simple Python extension that is loaded
     and managed just like any other Python extension written in C or C++.

     Enter GNU's Java Compiler, gcj.

     GCJ compiles a set of Java classes together into a shared library and
     exports these classes as if they were implemented in C++ via the Common
     Native Interface (CNI) which makes them very easily accessible to other
     C/C++ programs.

     GCJ does not depend on the Apple or Sun Java Runtime Environments. It
     comes with its own, 100% open source, clean room implementation of the
     Java runtime classes. 

     Enter the interface compiler, SWIG.

     SWIG is an interface compiler that connects programs written in C and
     C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It
     works by taking the declarations found in C/C++ header files and using
     them to generate the wrapper code that scripting languages need to
     access the underlying C/C++ code. (quoted from
     http://www.swig.org/exec.html)

     Using SWIG and gcj makes it possible to keep the resulting Python
     extension very close to the actual Java Lucene library, which is under
     active development. This approach, while involving many thousands of
     lines of boilerplate code, almost all generated by SWIG, ultimately
     yields a much closer and more up to date library than a handcrafted
     port.

     Alternatives to the gcj/SWIG approach include a manual port of Java
     Lucene to Python such as Lupy, which is ten times slower than Java
     Lucene and way behind the Java Lucene version; a manual port to C++
     integratable into Python such as CLucene, which is four times faster
     than Java Lucene but comes with its own set of C++ bugs and is also way
     behind the original; or finally, a JPype-based solution, which is very
     clever but relies on the JNI and the JRE.

     PyLucene Architecture
     ---------------------

     PyLucene is compiled as a shared library - a python extension - loaded
     into the Python process by an 'import' statement.
     On Mac OS X and Linux, gcj's libgcj.so and libstdc++.so need to be
     shipped along with _PyLucene.pyd. On Windows however, all these
     libraries are linked together into a single shared library of a few
     megabytes in size.

      -----------              --------------          -------------------
      | Java    | --- gcj ---> | lucene.o   |->------<-| PyLucene_wrap.o |
      | Lucene  | --- gcjh     --------------     |    -------------------
      -----------      |                          v             ^
                       v                   -----------------    |
                    ---------------        | _PyLucene.pyd |   g++
                    | C++ headers | -->|   -----------------    |
                    ---------------    |                        |
                                       |--- SWIG ---> ---------------------
      --------------                   |              | PyLucene_wrap.cxx |
      | PyLucene.i | ----------------->|              ---------------------
      --------------                                  |    PyLucene.py    |
                                                      ---------------------

     Preparing the Java Lucene library for compilation by gcj
     --------------------------------------------------------

     The GNU java compiler, gcj, has been under very active development for
     a number of years now and is quite usable. Still, it comes with a
     number of bugs and kinks that need to be worked around. For example,
     gcj can compile .java source files and .class bytecode files to .o
     files alike. There are, however, more bugs in the .java to .o
     compilation process, notably with handling anonymous inner
     classes. 
     The infamous Miranda bug,
     http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411, needs to be
     carefully worked around by adding abstract method declarations to
     abstract java classes for the methods of interfaces they claim to
     implement. 

     Differences in reserved words between Java, C++ and Python can also
     cause problems. For example, 'delete' is a perfectly acceptable name
     for a method in Java but a reserved keyword in C++. Similarly, 'or' is
     acceptable in Java but a reserved keyword in Python. Some of these
     incompatibilities can be dealt with in the SWIG layer, others need to
     be patched in the original Java sources.
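
     Such conflicts can be spotted ahead of time. The sketch below is an
     illustration only, not part of the PyLucene build: it checks names
     against Python's own keyword list and against a small, hand-picked
     and deliberately incomplete set of C++ reserved words. The names
     CPP_KEYWORDS and naming_conflicts are hypothetical.

```python
import keyword

# Partial, hand-picked list of C++ reserved words; not exhaustive.
CPP_KEYWORDS = {"delete", "new", "operator", "template", "typename",
                "namespace", "friend", "virtual", "union"}

def naming_conflicts(java_method_names):
    """Map each conflicting Java method name to the languages reserving it."""
    conflicts = {}
    for name in java_method_names:
        reserved_in = []
        if name in CPP_KEYWORDS:
            reserved_in.append("C++")
        if keyword.iskeyword(name):
            reserved_in.append("Python")
        if reserved_in:
            conflicts[name] = reserved_in
    return conflicts
```

     For the two names mentioned above, naming_conflicts(['delete', 'or'])
     flags 'delete' for C++ and 'or' for Python.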


     Matching the memory models
     --------------------------

     PyLucene involves three languages, Java, C++ and Python, each with
     their own conflicting memory management policies. Java's memory is
     garbage collected, C++ memory is more or less managed by hand and
     Python's memory objects are reference counted.

     The GNU Java class header file generator, gcjh, makes Java classes
     available to C++/CNI programs as C++ classes via the C++ headers it
     generates.

     SWIG, which generates the glue code between Python and C++, is not
     aware that the C++ pointers it wraps for Python's use are not really
     C++ but pointers to Java objects that cannot be explicitly 'deleted'
     when Python releases them.

     The gcj runtime library, libgcj, includes a garbage collector which
     keeps track of Java objects in the Java heap, on the stack and in static
     variables but not in the Python heap.

     To solve these problems all Java objects returned to Python are kept in
     a Java IdentityHashMap where they are reference counted. That way, they
     are not garbage collected until they are removed from the table when
     their refcount reaches zero.
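
     The idea can be sketched in pure Python. This is an illustration
     only, not PyLucene's actual code: the class name IdentityRefTable
     and its methods are hypothetical, and a dict keyed by id() stands in
     for the Java IdentityHashMap.

```python
class IdentityRefTable:
    """Keeps objects alive while a foreign runtime still references them."""

    def __init__(self):
        # Keyed by id(obj), i.e. by identity rather than by equality,
        # which is the property IdentityHashMap provides in Java.
        self._entries = {}

    def incref(self, obj):
        key = id(obj)
        held, count = self._entries.get(key, (obj, 0))
        self._entries[key] = (held, count + 1)
        return obj

    def decref(self, obj):
        key = id(obj)
        held, count = self._entries[key]
        if count == 1:
            # Last reference released: the table no longer pins the object.
            del self._entries[key]
        else:
            self._entries[key] = (held, count - 1)

    def refcount(self, obj):
        return self._entries.get(id(obj), (None, 0))[1]
```

     Keying by identity matters because two distinct objects that compare
     equal must still be counted separately.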

     Matching thread support and concurrency models
     ----------------------------------------------

     Both Python and Java support threads. On the platforms where PyLucene is
     currently supported, Mac OS X, Linux and Windows, Python and Java use
     the same operating system thread implementations. One would hope that
     threads created in Python could be used to call into Java and
     vice-versa, but it is not that simple.

     Before any Java memory can be allocated by a thread using gcj's Java
     runtime library, libgcj, the garbage collector has to be made aware of
     the thread and its stack. The current implementation of the garbage
     collector, boehm-gc, does not support the registration of a thread
     after the fact. Hans Boehm, the author of the garbage collector module,
     intends to eventually support this but making it work reliably on the
     three supported PyLucene operating systems is a lost cause at the
     moment.

     Luckily, Python is a lot more nimble about threading support. It
     happily runs in threads it did not create and can even be coaxed into
     treating such a thread as one of its own.

     PyLucene exposes a new class, called PythonThread, which is a subclass
     of Python's threading.Thread class whose start() method delegates to
     Java the creation, initialization and starting of the actual operating
     system thread. There are then in fact two distinct thread objects, one
     Python, one Java, that use the same operating system thread.
     Any thread in python wishing to use the PyLucene APIs needs to be an
     instance of this PythonThread class.
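
     The delegation in start() can be illustrated without libgcj. The
     sketch below is not PyLucene's implementation: DelegatingThread and
     foreign_launcher are hypothetical names, and Python's low-level
     _thread module stands in for the runtime that has to create the
     operating system thread itself. Bookkeeping such as is_alive() is
     deliberately left out.

```python
import threading
import _thread

def foreign_launcher(entry_point):
    # Stand-in for the runtime that must create the OS thread itself
    # (libgcj in PyLucene's case); here it is just Python's low-level API.
    _thread.start_new_thread(entry_point, ())

class DelegatingThread(threading.Thread):
    """A Thread whose start() hands OS thread creation to another runtime."""

    def __init__(self, launcher, target, args=()):
        threading.Thread.__init__(self, target=target, args=args)
        self._launcher = launcher
        self._foreign_done = threading.Event()

    def start(self):
        # Thread.start() is not called: the launcher creates the OS thread
        # and calls back into Python on it.
        self._launcher(self._entry_point)

    def _entry_point(self):
        try:
            self.run()               # runs the target passed to __init__
        finally:
            self._foreign_done.set()

    def join(self, timeout=None):
        self._foreign_done.wait(timeout)
```

     A caller uses it like any other thread object, for example
     DelegatingThread(foreign_launcher, work, args=(1,)) followed by
     start() and join(), where work is any callable.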

     Python's threading support is not truly concurrent. While a Python
     thread is running, it is holding a Global Interpreter Lock (GIL) that
     prevents any other Python thread in the same process from
     running. Java, on the other hand, is fully concurrent. Several Java
     threads can run truly concurrently on hardware that supports it.

     Because of this concurrency mismatch, deadlocks may occur if the GIL is
     not released whenever a call is made from Python to Java since Java
     code may call back into Python when invoking Python 'extensions' to
     Java Lucene classes.

     Releasing the GIL also helps the Python process in not appearing hung
     when a longer running Java Lucene API such as IndexWriter.optimize() is
     in progress.

     'Extending' Java classes from Python
     ------------------------------------

     Many areas of the Lucene API expect the programmer to provide their own
     implementation or specialization of a feature where the default is
     inappropriate. For example, text analyzers and tokenizers are an area
     where many parameters and environmental or cultural factors call for
     customization.

     PyLucene enables this by providing Java extensions that serve as
     proxies for Java to call back into the Python implementations of these
     customizations.

     Technically, the PyLucene programmer is not providing an 'extension'
     but a Python implementation of a set of methods encapsulated by a
     Python class whose instances are wrapped by the Java proxies provided
     by PyLucene.

     For example, the code below, extracted from a PyLucene unit test,
     defines a custom analyzer using a custom token stream that returns the
     tokens '1', '2', '3', '4', '5' for any document it is called on.

     All that is needed in order to provide a custom analyzer in Python is
     defining a class that implements a method called 'tokenStream'. The
     presence of the 'tokenStream' method is detected by the corresponding
     SWIG type handler and the python instance passed in is wrapped by a new
     Java PythonAnalyzer instance that extends Lucene's abstract Analyzer
     class.

     In other words, SWIG in reverse.

          from PyLucene import Document, Field, IndexWriter, RAMDirectory, Token

          class _analyzer(object):
              def tokenStream(self, fieldName, reader):
                  class _tokenStream(object):
                      def __init__(self):
                          self.tokens = ['1', '2', '3', '4', '5']
                          self.increments = [1, 2, 1, 0, 1]
                          self.i = 0
                      def next(self):
                          # Returning None signals the end of the stream.
                          if self.i == len(self.tokens):
                              return None
                          t = Token(self.tokens[self.i], self.i, self.i)
                          t.setPositionIncrement(self.increments[self.i])
                          self.i += 1
                          return t
                  return _tokenStream()

          analyzer = _analyzer()

          store = RAMDirectory()
          writer = IndexWriter(store, analyzer, True)

          d = Document()
          d.add(Field.Text("field", "bogus"))
          writer.addDocument(d)
          writer.optimize()
          writer.close()
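
     The detection-and-wrapping step described above happens in PyLucene's
     SWIG type handlers and Java proxies. The pure Python sketch below only
     mimics the shape of that logic; AnalyzerProxy, as_analyzer and
     FixedTokens are hypothetical names, and a plain iterator stands in
     for a token stream.

```python
class AnalyzerProxy:
    """Stand-in for the Java-side proxy that forwards calls to Python."""

    def __init__(self, python_impl):
        self._impl = python_impl

    def tokenStream(self, fieldName, reader):
        # The proxy only forwards; the behaviour lives in the Python object.
        return self._impl.tokenStream(fieldName, reader)


def as_analyzer(obj):
    """Wrap obj in a proxy if it looks like an analyzer, else reject it."""
    if isinstance(obj, AnalyzerProxy):
        return obj
    if callable(getattr(obj, "tokenStream", None)):
        return AnalyzerProxy(obj)
    raise TypeError("object does not implement tokenStream()")


class FixedTokens:
    """Any class with a tokenStream method qualifies; no base class needed."""

    def tokenStream(self, fieldName, reader):
        return iter(["1", "2", "3", "4", "5"])
```

     The point of the design is that the Python class needs no particular
     base class: the presence of the method is what qualifies it.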

     Supporting downcasting
     ----------------------

     Python's type system does not require type casting. On the other hand,
     downcasting is a very common operation in Java. SWIG will wrap a C++
     object with a Python object matching the object's declared protocol.

     For example, if a Lucene API is declared to return Query, the resulting
     Python wrapper implements the Query methods, exactly. If the wrapped
     object is actually an instance of a subclass of Query, such as
     BooleanQuery, the subclass's methods are not available on the Python
     proxy.

     Where this is a problem, PyLucene extends the types in question with
     is<Type>() type checkers and to<Type>() type casters to work around
     it.
     For example: 
       analyzer = StandardAnalyzer()
       query = QueryParser("data", analyzer).parse("a AND b").toBooleanQuery()

     Pythonic API flavors
     --------------------

     Java is a rather verbose programming language. In places where it made
     sense, PyLucene added some pythonic extensions to the Lucene APIs such
     as iterators or property accessors. For example, one of the most
     commonly used Lucene classes, Hits, which returns the documents found
     in a search, is not iterable in Java Lucene; a hit counter is used
     instead. PyLucene wraps this nicely with a Python iterator. This can be
     illustrated as follows:

       The Java loop:

           for (int i = 0; i < hits.length(); i++) {
               Document doc = hits.doc(i);
               System.out.println(hits.score(i) + " : " + doc.get("title"));
           }

       with PyLucene becomes:

           for i, doc in hits:
               print hits.score(i), ':', doc['title']

       or:

           for hit in hits:
               print hit.getScore(), ':', hit.get('title')

     Error reporting
     ---------------

     Java exceptions are caught at the Python-Java call boundary and
     wrapped into a Python JavaError exception. Errors that occur in Python
     code called from Java are also caught there and reported as usual.

     In effect, for every call made from Python to Java, the glue code is
     the following:

           try {
               PythonThreadState state;             // release/restore GIL
               $action                              // call Java API
           } catch (org::osafoundation::util::PythonException *e) {
               return NULL;                         // report Python error
           } catch (java::lang::Throwable *e) {
               PyErr_SetObject(PyExc_JavaError,     // wrap Java exception
                               jo2pr_type(e, "java::lang::Throwable *"));
               return NULL;                         // report Java error
           }

     The PythonThreadState type is a simple C++ class that ensures that the
     Python thread state is saved and the GIL released when it enters into
     scope and conversely that the Python thread state is restored and the
     GIL reacquired when it goes out of scope.

           class PythonThreadState {
             private:
               PyThreadState *state;
             public:
               PythonThreadState()
               {
                   state = PyEval_SaveThread();
               }
               ~PythonThreadState()
               {
                   PyEval_RestoreThread(state);
               }
           };

     Samples
     -------

     A large number of samples are shipped with PyLucene. Most notably, all
     the samples published in the "Lucene in Action" book that did not
     depend on a third party Java library for which there was no obvious
     Python equivalent were ported to Python and PyLucene.

     "Lucene in Action" is a great companion to learning Lucene. Having all
     the samples available in Python should make it even easier for Python
     developers.

     "Lucene in Action" was written by Erik Hatcher and Otis Gospodnetic,
     both part of the Java Lucene development team, and is available from
     Manning Publications at http://www.manning.com/hatcher2.

     Future work
     -----------

     Most of PyLucene's SWIG code is boilerplate code that could itself be
     generated before being fed to SWIG. Such a tool, generating SWIG
     declarations from Java classes, remains to be written.

     The same gcj/SWIG-based techniques could be used for other languages
     supported by SWIG such as Perl, Ruby, Lisp, etc... Various people have
     shown interest in a Ruby version for a while now. It would be exciting
     to see PyLucene morph into SWIGLucene as support for some of these
     languages is added.

     PyLucene could be ported to other operating systems. That effort is
     effectively bounded by the state of the implementation of libgcj on
     these platforms. In particular, threading and garbage collection
     support can be problematic. GNU gcj is under very active development
     and progress is made on a regular basis.

     Acknowledgements
     ----------------
 
     PyLucene wouldn't be possible without the tireless efforts of the
     people contributing to the open source projects below:

      - the contributors to the GCC/GCJ compiler suite,
        http://gcc.gnu.org/onlinedocs/gcc/Contributors.html

      - the GNU classpath team
        http://savannah.gnu.org/project/memberlist.php?group_id=85

      - the Java Lucene developers,
        http://jakarta.apache.org/lucene/docs/whoweare.html

      - the SWIG developers,
        http://www.swig.org/guilty.html

      - the Open Source Applications Foundation, hosting the PyLucene
        project http://www.osafoundation.org

     Thank you all !

     License
     -------

     This paper is licensed under a Creative Commons License:
         http://creativecommons.org/licenses/by/2.0


  API documentation for PyLucene
  ------------------------------

  PyLucene is currently built against Java Lucene 1.4.3 and all its APIs
  except for the RemoteSearchable class are supported (patches are welcome).
  PyLucene also includes the Snowball analyzer and stemmers, the
  highlighter package, the brazilian, chinese-japanese-korean, chinese,
  czech, french and dutch analyzers not currently included in the main
  Lucene JAR file.

  This document only covers the pythonic extensions to Lucene offered
  by PyLucene as well as some differences between the Java and Python
  APIs. For the API documentation on Java Lucene, please visit:
      http://lucene.apache.org/java/docs/api/index.html

  To help with debugging and to support some Lucene APIs, PyLucene also
  exposes some Java runtime APIs described later.

   - Contents

     . Samples
     . Threading support with PyLucene.PythonThread
     . Exception handling with PyLucene.JavaError
     . Differences between the Java Lucene and PyLucene APIs
     . Pythonic extensions to the Java Lucene APIs
     . Java Runtime classes exposed by PyLucene
     . Extending Lucene classes from Python

   - Samples

     The best way to learn PyLucene is to look at the many samples included
     with the PyLucene source release or on the web at

         http://svn.osafoundation.org/pylucene/trunk/samples/
         http://svn.osafoundation.org/pylucene/trunk/samples/LuceneInAction/

     A large number of samples are shipped with PyLucene. Most notably, all
     the samples published in the "Lucene in Action" book that did not
     depend on a third party Java library for which there was no obvious
     Python equivalent were ported to Python and PyLucene.

     "Lucene in Action" is a great companion to learning Lucene. Having all
     the samples available in Python should make it even easier for Python
     developers. 

     "Lucene in Action" was written by Erik Hatcher and Otis Gospodnetic,
     both part of the Java Lucene development team, and is available from
     Manning Publications at http://www.manning.com/hatcher2.

   - Threading support with PyLucene.PythonThread

     The garbage collector implemented by the Java runtime support in libgcj
     insists on having full control over the creation of threads used by
     it. At the moment, it cannot be told about a thread 'after the fact'.

     Therefore all Python threads, except for the main thread, using any
     PyLucene code must be an instance of PythonThread. A PythonThread
     instance is an extension of a Python thread delegating the creation and
     initialization of the actual operating system thread to libgcj. There
     are in fact two thread objects, one Python, one Java, for the same
     operating system thread. The Python and Java runtimes are fully aware
     of that thread and view it as one of their own.

   - Exception handling with PyLucene.JavaError

     Java exceptions are caught at the language barrier and reported to
     Python by raising a JavaError instance whose args tuple contains the
     actual Java Exception instance.
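
     A handler can therefore reach the Java exception through args[0].
     The sketch below runs without PyLucene: JavaError here is a local
     stand-in for PyLucene.JavaError, FakeThrowable and parse are
     hypothetical, and the availability of getMessage() on the wrapped
     exception is an assumption.

```python
class JavaError(Exception):
    """Local stand-in for PyLucene.JavaError, for illustration only."""


class FakeThrowable:
    """Hypothetical stand-in for a wrapped java.lang.Throwable."""

    def __init__(self, message):
        self._message = message

    def getMessage(self):
        return self._message


def parse(text):
    # Stand-in for a call that crosses into Java and fails there.
    raise JavaError(FakeThrowable("cannot parse: " + text))


def describe_failure(text):
    try:
        parse(text)
    except JavaError as error:
        java_exception = error.args[0]     # the wrapped Java exception
        return java_exception.getMessage()
    return None
```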

   - Differences between the Java Lucene and PyLucene APIs

     . The PyLucene API exposes all Java Lucene classes in a flat namespace
       in the PyLucene module.
       For example, the Java import statement:
         import org.apache.lucene.index.IndexReader;
       corresponds to the Python import statement:
         from PyLucene import IndexReader

     . The static 'parse' method defined on
       org.apache.lucene.queryParser.MultiFieldQueryParser was renamed to
       'parseQueries'. 

     . Because 'delete' is a C++ keyword, the delete(int) and
       delete(org.apache.lucene.index.Term) methods defined on
       org.apache.lucene.index.IndexReader were renamed deleteDocument(int)
       and deleteDocuments(org.apache.lucene.index.Term) respectively.

     . Instead of taking array arguments, the readBytes and readChars
       methods defined on org.apache.lucene.store.IndexInput take the number
       of bytes or unicode characters to read and return Python 'str' and
       'unicode' objects respectively. For example:

           bytes = indexInput.readBytes(256)
           chars = indexInput.readChars(256)

     . Similarly, instead of taking array arguments, the read method defined
       on org.apache.lucene.index.TermDocs takes the number of entries to
       read from the enumeration and returns two lists, the document
       numbers and the term frequencies actually read. For example:

           docs, freqs = termDocs.read(16)

     . Downcasting is a common operation in Java but not a concept in
       Python. Because the wrapper objects implement exactly the APIs of
       the declared type of the wrapped object, a number of downcasting and
       type checking methods were added to Lucene classes:

         org.apache.lucene.analysis.Analyzer:

           GermanAnalyzer toGermanAnalyzer()
           PerFieldAnalyzerWrapper toPerFieldAnalyzerWrapper()
           RussianAnalyzer toRussianAnalyzer()
           SimpleAnalyzer toSimpleAnalyzer()
           StandardAnalyzer toStandardAnalyzer()
           StopAnalyzer toStopAnalyzer()
           WhitespaceAnalyzer toWhitespaceAnalyzer()
           boolean isGermanAnalyzer()
           boolean isPerFieldAnalyzerWrapper()
           boolean isRussianAnalyzer()
           boolean isSimpleAnalyzer()
           boolean isStandardAnalyzer()
           boolean isStopAnalyzer()
           boolean isWhitespaceAnalyzer()

         org.apache.lucene.search.Searchable:

           Searcher toSearcher()
           boolean isSearcher()

         org.apache.lucene.search.Searcher:

           Searchable toSearchable()
           Searcher toSearcher()
           IndexSearcher toIndexSearcher()
           MultiSearcher toMultiSearcher()
           ParallelMultiSearcher toParallelMultiSearcher()
           boolean isSearchable()
           boolean isSearcher()
           boolean isIndexSearcher()
           boolean isMultiSearcher()
           boolean isParallelMultiSearcher()

         org.apache.lucene.search.Query:

           Query toQuery()
           BooleanQuery toBooleanQuery()
           PrefixQuery toPrefixQuery()
           TermQuery toTermQuery()
           PhraseQuery toPhraseQuery()
           FilteredQuery toFilteredQuery()
           RangeQuery toRangeQuery()
           MultiTermQuery toMultiTermQuery()
           FuzzyQuery toFuzzyQuery()
           WildcardQuery toWildcardQuery()
           SpanQuery toSpanQuery()
           SpanFirstQuery toSpanFirstQuery()
           SpanNearQuery toSpanNearQuery()
           SpanNotQuery toSpanNotQuery()
           SpanOrQuery toSpanOrQuery()
           SpanTermQuery toSpanTermQuery()
           boolean isQuery()
           boolean isBooleanQuery()
           boolean isPrefixQuery()
           boolean isTermQuery()
           boolean isPhraseQuery()
           boolean isFilteredQuery()
           boolean isRangeQuery()
           boolean isMultiTermQuery()
           boolean isFuzzyQuery()
           boolean isWildcardQuery()
           boolean isSpanQuery()
           boolean isSpanFirstQuery()
           boolean isSpanNearQuery()
           boolean isSpanNotQuery()
           boolean isSpanOrQuery()
           boolean isSpanTermQuery()

         org.apache.lucene.search.ScoreDoc:

           FieldDoc toFieldDoc()
           boolean isFieldDoc()
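
     As an illustration of the TermDocs read(n) convention described
     earlier in this list, the sketch below drains an enumeration in
     batches. FakeTermDocs and read_all are hypothetical, and treating an
     empty document list as the end of the enumeration is an assumption
     of this sketch rather than documented PyLucene behaviour.

```python
class FakeTermDocs:
    """Hypothetical stand-in that follows the read(n) convention."""

    def __init__(self, postings):
        self._postings = list(postings)   # [(doc number, frequency), ...]

    def read(self, count):
        chunk = self._postings[:count]
        self._postings = self._postings[count:]
        docs = [doc for doc, freq in chunk]
        freqs = [freq for doc, freq in chunk]
        return docs, freqs


def read_all(term_docs, batch_size=16):
    """Collect every (doc, freq) pair by reading in batches."""
    pairs = []
    while True:
        docs, freqs = term_docs.read(batch_size)
        if not docs:          # assumed end-of-enumeration signal
            break
        pairs.extend(zip(docs, freqs))
    return pairs
```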

   - Pythonic extensions to the Java Lucene APIs

     Java is a very verbose language. Python, on the other hand, offers
     many syntactically attractive constructs for iteration, property
     access, etc... As the Java Lucene samples from the 'Lucene in Action'
     book were ported to Python, PyLucene received a number of pythonic
     extensions listed here:

     . Iterating search hits is a very common operation. Hits instances are
       iterable in Python. Two values are returned for each iteration, the
       zero-based number of the document in the Hits instance and the
       document instance itself.

         The Java loop:

             for (int i = 0; i < hits.length(); i++) {
                 Document doc = hits.doc(i);
                 System.out.println(hits.score(i) + " : " + doc.get("title"));
             }

         is better written in Python:

             for i, doc in hits:
                 print hits.score(i), ':', doc['title']

     . Hits instances partially implement the Python 'list' protocol.

         The Java expressions:

             hits.length()
             doc = hits.doc(i)

         are better written in Python:

             len(hits)
             doc = hits[i]

     . Similarly, IndexReader instances partially implement the 'list'
       protocol and can be iterated over for their documents.

         The Java expressions:

             indexReader.maxDoc()
             indexReader.document(i)

         are better written in Python:

             len(indexReader)
             indexReader[i]

         The Java loop:

             for (int i = 0; i < indexReader.maxDoc(); i++) {
                 Document doc = indexReader.document(i);
                 ...
             }

         is better written in Python:

             for i, doc in indexReader:
                 ...

     . Document instances have fields whose values can be accessed through
       the dict and attribute protocol.

         The Java expressions:

             doc.get("title")
             doc.getField("title")
             doc.removeField("title")

         are better written in Python:

             doc['title']
             doc.title
             del doc.title

     . Document instances can be iterated over for their fields.

         The Java loop:

             Enumeration fields = doc.fields();
             while (fields.hasMoreElements()) {
                 Field field = (Field) fields.nextElement();
                 ...
             }

         is better written in Python:

             for field in doc:
                 ...
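
     As a rough illustration of the protocols listed above, the following
     sketch uses hypothetical stand-in classes (FakeField, FakeDocument and
     FakeHits) that are not part of PyLucene. It only shows which Python
     special methods would produce the behaviour described; it makes no
     claim about how PyLucene implements them.

```python
# Hypothetical stand-ins, NOT PyLucene classes: they only mimic the Python
# protocols described above so that the sketch runs on its own.

class FakeField(object):
    def __init__(self, name, value):
        self.name = name
        self.value = value


class FakeDocument(object):
    def __init__(self, **values):
        self._fields = dict((name, FakeField(name, value))
                            for name, value in values.items())

    def __getitem__(self, name):      # doc['title'] -> field value
        return self._fields[name].value

    def __getattr__(self, name):      # doc.title -> field object
        fields = self.__dict__.get('_fields', {})
        if name in fields:
            return fields[name]
        raise AttributeError(name)

    def __delattr__(self, name):      # del doc.title -> remove the field
        if name not in self._fields:
            raise AttributeError(name)
        del self._fields[name]

    def __iter__(self):               # for field in doc
        return iter(sorted(self._fields.values(), key=lambda f: f.name))


class FakeHits(object):
    def __init__(self, docs, scores):
        self._docs = list(docs)
        self._scores = list(scores)

    def __len__(self):                # len(hits)
        return len(self._docs)

    def __getitem__(self, i):         # hits[i]
        return self._docs[i]

    def __iter__(self):               # for i, doc in hits
        return iter(enumerate(self._docs))

    def score(self, i):
        return self._scores[i]


doc = FakeDocument(title='Lucene in Action', category='search')
hits = FakeHits([doc, FakeDocument(title='PyLucene')], [1.0, 0.5])

rows = [(i, hits.score(i), d['title']) for i, d in hits]
field_names = [field.name for field in doc]
del doc.category
remaining = [field.name for field in doc]
```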

   - Java Runtime classes exposed by PyLucene

     To help with debugging and to support some Lucene APIs, PyLucene also
     exposes some Java runtime APIs. As with the Java Lucene APIs, these
     APIs are fully documented on their development website at
         http://developer.classpath.org/doc/

       . java.lang.Object
            boolean equals(object)
            int hashCode()
            string toString()
            class getClass()
            void notify()
            void notifyAll()
            void wait()
            void wait(long)
            void wait(long, int)

       . java.lang.Thread
            Thread(runnable)
            Thread(runnable, string)
            string getName()
            boolean isAlive()
            boolean isDaemon()
            boolean isInterrupted()
            void setDaemon(boolean)
            void setName(string)
            void start()
            void join()
            void join(long)
            void join(long, int)

       . java.lang.Class
            string getName()
            boolean isArray()
            boolean isInterface()
            boolean isPrimitive()
            boolean isAssignableFrom(class)
            boolean isInstance(object)

       . java.lang.System
            static long currentTimeMillis()
            static void gc()
            static string getProperty(string)
            static string getProperty(string, string)
            static string setProperty(string, string)
            static java.util.Properties getProperties()
            static void load(string)
            static void loadLibrary(string)
            static string mapLibraryName(string)
            static void runFinalization()
            static int identityHashCode(object)
            static java.io.PrintStream out
            static java.io.PrintStream err

       . java.lang.Process
            void destroy()
            int exitValue()
            void waitFor()

       . java.lang.Runtime
            static Runtime getRuntime()
            long freeMemory()
            long totalMemory()
            long maxMemory()
            void gc()
            void runFinalization()
            int availableProcessors()
            void addShutdownHook(Thread)
            void removeShutdownHook(Thread)
            Process execute(string)
            Process execute(string[])
            Process execute(string, string[])
            Process execute(string[], string[])
            void traceInstructions(boolean)
            void traceMethodCalls(boolean)

         Because 'exec' is a keyword in Python, the exec() methods were
         renamed to 'execute'.

       . java.lang.Throwable
            Throwable getCause()
            string getLocalizedMessage()
            string getMessage()
            void printStackTrace()

       . java.io.Reader

         Instead of taking an array argument, the read() method returns a
         unicode string containing the fully read stream. To read only a
         specific number of unicode characters, this method also accepts a
         length argument.
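
         This calling convention resembles that of Python's own text
         streams. A minimal sketch, using io.StringIO as a stand-in so that
         it runs without PyLucene; whether the wrapped Reader behaves the
         same way at the end of the stream is not confirmed here:

```python
import io

# io.StringIO is a plain Python stand-in, not PyLucene's wrapped
# java.io.Reader; it is used only because its read() has the same shape.
reader = io.StringIO(u"hello world")

first = reader.read(5)    # read at most 5 unicode characters
rest = reader.read()      # no argument: read the remainder of the stream
```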

       . java.io.OutputStream

       . java.io.FilterOutputStream

       . java.io.PrintStream
            void flush()
            void printString(string)
            void printObject(object)
            void println()
            void println(string)
            void println(object)

         Because 'print' is a reserved word in Python, the print(string) and
         print(object) methods were renamed to 'printString' and
         'printObject' respectively.

       . java.util.Locale
            Locale(string, string, string)
            Locale(string, string)
            Locale(string)
            static Locale getDefault()
            static void setDefault(Locale)
            static Locale[] getAvailableLocales()
            static string[] getISOCountries()
            static string[] getISOLanguages()
            string getLanguage()
            string getCountry()
            string getVariant()
            string getISO3Language()
            string getISO3Country()
            string getDisplayLanguage()
            string getDisplayLanguage(Locale)
            string getDisplayCountry()
            string getDisplayCountry(Locale)
            string getDisplayVariant()
            string getDisplayVariant(Locale)
            string getDisplayName()
            string getDisplayName(Locale)
            static Locale ENGLISH
            static Locale FRENCH
            static Locale GERMAN
            static Locale ITALIAN
            static Locale JAPANESE
            static Locale KOREAN
            static Locale CHINESE
            static Locale SIMPLIFIED_CHINESE
            static Locale TRADITIONAL_CHINESE
            static Locale FRANCE
            static Locale GERMANY
            static Locale ITALY
            static Locale JAPAN
            static Locale KOREA
            static Locale CHINA
            static Locale PRC
            static Locale TAIWAN
            static Locale UK
            static Locale US
            static Locale CANADA
            static Locale CANADA_FRENCH

       . java.util.BitSet
            BitSet()
            BitSet(int)
            void andSet(BitSet)
            void andNot(BitSet)
            int cardinality()
            void clear()
            void clear(int)
            void clear(int, int)
            void flip(int)
            void flip(int, int)
            boolean get(int)
            BitSet get(int, int)
            boolean intersects(BitSet)
            boolean isEmpty()
            int length()
            int nextClearBit(int)
            int nextSetBit(int)
            void orSet(BitSet)
            void set(int)
            void set(int, boolean)
            void set(int, int)
            void set(int, int, boolean)
            int size()
            void xorSet(BitSet)

         Because 'and' and 'or' are reserved words in Python, the
         corresponding BitSet methods were renamed 'andSet' and 'orSet'
         respectively. The xor() method was likewise renamed 'xorSet',
         although 'xor' is not itself a reserved word.
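
         Like their Java counterparts, the renamed methods modify the
         receiver in place. The sketch below uses a hypothetical stand-in
         class, FakeBitSet, backed by a Python set (it is not PyLucene's
         BitSet), and uses the standard keyword module to check which of the
         three Java method names Python reserves:

```python
import keyword

# Which of the Java method names are reserved words in Python?
reserved = [name for name in ('and', 'or', 'xor') if keyword.iskeyword(name)]


class FakeBitSet(object):
    """Hypothetical stand-in (not PyLucene's BitSet) backed by a set."""

    def __init__(self, bits=()):
        self.bits = set(bits)

    def andSet(self, other):    # java.util.BitSet.and(other)
        self.bits &= other.bits

    def orSet(self, other):     # java.util.BitSet.or(other)
        self.bits |= other.bits

    def xorSet(self, other):    # java.util.BitSet.xor(other)
        self.bits ^= other.bits

    def cardinality(self):
        return len(self.bits)


a = FakeBitSet([1, 2, 3])
a.andSet(FakeBitSet([2, 3, 4]))   # a now holds bits 2 and 3
a.xorSet(FakeBitSet([3, 5]))      # a now holds bits 2 and 5
```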

       . java.util.Date
            Date()
            Date(long)
            boolean after(Date)
            boolean before(Date)
            int compareTo(Date)
            long getTime()
            void setTime(long)

       . java.util.Calendar
            static Calendar getInstance()
            static Calendar getInstance(Locale)
            static Locale[] getAvailableLocales()
            Date getTime()
            void setTime(Date)
            long getTimeInMillis()
            void setTimeInMillis(long)
            int get(int)
            void set(int, int)
            void set(int, int, int)
            void set(int, int, int, int, int)
            void set(int, int, int, int, int, int)
            void clear()
            void clear(int)
            boolean isSet(int)
            boolean before(object)
            boolean after(object)
            void add(int, int)
            void roll(int, boolean)
            void roll(int, int)
            void setLenient(boolean)
            boolean isLenient()
            void setFirstDayOfWeek(int)
            int getFirstDayOfWeek()
            void setMinimalDaysInFirstWeek(int)
            int getMinimalDaysInFirstWeek()
            int getMinimum(int)
            int getMaximum(int)
            int getGreatestMinimum(int)
            int getLeastMaximum(int)
            int getActualMinimum(int)
            int getActualMaximum(int)
            static int ERA
            static int YEAR
            static int MONTH
            static int WEEK_OF_YEAR
            static int WEEK_OF_MONTH
            static int DATE
            static int DAY_OF_MONTH
            static int DAY_OF_YEAR
            static int DAY_OF_WEEK
            static int DAY_OF_WEEK_IN_MONTH
            static int AM_PM
            static int HOUR
            static int HOUR_OF_DAY
            static int MINUTE
            static int SECOND
            static int MILLISECOND
            static int ZONE_OFFSET
            static int DST_OFFSET
            static int FIELD_COUNT
            static int SUNDAY
            static int MONDAY
            static int TUESDAY
            static int WEDNESDAY
            static int THURSDAY
            static int FRIDAY
            static int SATURDAY
            static int JANUARY
            static int FEBRUARY
            static int MARCH
            static int APRIL
            static int MAY
            static int JUNE
            static int JULY
            static int AUGUST
            static int SEPTEMBER
            static int OCTOBER
            static int NOVEMBER
            static int DECEMBER
            static int UNDECIMBER
            static int AM
            static int PM

         The following downcast and type checking methods are also included:
            GregorianCalendar toGregorianCalendar()
            boolean isGregorianCalendar()

       . java.util.GregorianCalendar
            GregorianCalendar()
            GregorianCalendar(int, int, int)
            GregorianCalendar(int, int, int, int, int)
            GregorianCalendar(int, int, int, int, int, int)
            GregorianCalendar(Locale)
            Date getGregorianChange()
            void setGregorianChange(Date)
            boolean isLeapYear(int)
            static int BC
            static int AD

       . java.util.Enumeration
            boolean hasMoreElements()
            object nextElement()

       . java.util.Dictionary

       . java.util.Hashtable

       . java.util.Properties
            Properties()
            Properties(Properties)
            string getProperty(string)
            string getProperty(string, string)
            object setProperty(string, string)
            stringEnumeration propertyNames()

         The Properties class partially implements the Python 'dict'
         protocol.

             The Java expressions:

                 props.getProperty("title")
                 props.getProperty("title", "default")
                 props.setProperty("title", "foo")
                 props.containsKey("title")

             are better written in Python:

                 props['title']
                 props.get('title', 'default')
                 props['title'] = 'foo'
                 'title' in props
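
         The mapping above can be pictured with a hypothetical stand-in
         class, FakeProperties, which is not PyLucene's Properties. Raising
         KeyError for a missing key in props[...] is a choice made for this
         sketch only; the behaviour of the real class is not stated here.

```python
class FakeProperties(object):
    """Hypothetical stand-in mapping the Python 'dict' protocol onto the
    Java-style method names listed above."""

    def __init__(self):
        self._table = {}

    # Java-style methods
    def getProperty(self, key, default=None):
        return self._table.get(key, default)

    def setProperty(self, key, value):
        previous = self._table.get(key)
        self._table[key] = value
        return previous               # the previous value, if any

    def containsKey(self, key):
        return key in self._table

    # Python protocol, delegating to the methods above
    def __getitem__(self, key):       # props['title']
        if not self.containsKey(key):
            raise KeyError(key)       # assumption made for this sketch
        return self.getProperty(key)

    def get(self, key, default=None): # props.get('title', 'default')
        return self.getProperty(key, default)

    def __setitem__(self, key, value):  # props['title'] = 'foo'
        self.setProperty(key, value)

    def __contains__(self, key):      # 'title' in props
        return self.containsKey(key)


props = FakeProperties()
props['title'] = 'foo'
fallback = props.get('missing', 'default')
```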

       . java.text.Format
            string format(object)

       . java.text.NumberFormat
            string format(long)
            string format(double)
            static Locale[] getAvailableLocales()
            static NumberFormat getCurrencyInstance()
            static NumberFormat getCurrencyInstance(Locale)
            static NumberFormat getInstance()
            static NumberFormat getInstance(Locale)
            static NumberFormat getNumberInstance()
            static NumberFormat getNumberInstance(Locale)
            static NumberFormat getIntegerInstance()
            static NumberFormat getIntegerInstance(Locale)
            static NumberFormat getPercentInstance()
            static NumberFormat getPercentInstance(Locale)
            int getMaximumFractionDigits()
            int getMaximumIntegerDigits()
            int getMinimumFractionDigits()
            int getMinimumIntegerDigits()
            boolean isGroupingUsed()
            boolean isParseIntegerOnly()
            void setGroupingUsed(boolean)
            void setMaximumFractionDigits(int)
            void setMaximumIntegerDigits(int)
            void setMinimumFractionDigits(int)
            void setMinimumIntegerDigits(int)
            void setParseIntegerOnly(boolean)
            static int INTEGER_FIELD
            static int FRACTION_FIELD

       . java.text.DateFormat
            string format(Date)
            static Locale[] getAvailableLocales()
            Calendar getCalendar()
            static DateFormat getDateInstance()
            static DateFormat getDateInstance(int)
            static DateFormat getDateInstance(int, Locale)
            static DateFormat getDateTimeInstance()
            static DateFormat getDateTimeInstance(int, int)
            static DateFormat getDateTimeInstance(int, int, Locale)
            static DateFormat getInstance()
            NumberFormat getNumberFormat()
            static DateFormat getTimeInstance()
            static DateFormat getTimeInstance(int)
            static DateFormat getTimeInstance(int, Locale)
            boolean isLenient()
            Date parse(string)
            void setCalendar(Calendar)
            void setLenient(boolean)
            void setNumberFormat(NumberFormat)
            static int FULL
            static int LONG
            static int MEDIUM
            static int SHORT
            static int DEFAULT
            static int ERA_FIELD
            static int YEAR_FIELD
            static int MONTH_FIELD
            static int DATE_FIELD
            static int HOUR_OF_DAY1_FIELD
            static int HOUR_OF_DAY0_FIELD
            static int MINUTE_FIELD
            static int SECOND_FIELD
            static int MILLISECOND_FIELD
            static int DAY_OF_WEEK_FIELD
            static int DAY_OF_YEAR_FIELD
            static int DAY_OF_WEEK_IN_MONTH_FIELD
            static int WEEK_OF_YEAR_FIELD
            static int WEEK_OF_MONTH_FIELD
            static int AM_PM_FIELD
            static int HOUR1_FIELD
            static int HOUR0_FIELD
            static int TIMEZONE_FIELD

       . java.text.DecimalFormat
            DecimalFormat()
            DecimalFormat(string)
            void applyLocalizedPattern(string)
            void applyPattern(string)
            int getGroupingSize()
            int getMultiplier()
            string getNegativePrefix()
            string getNegativeSuffix()
            string getPositivePrefix()
            string getPositiveSuffix()
            boolean isDecimalSeparatorAlwaysShown()
            void setDecimalSeparatorAlwaysShown(boolean)
            void setGroupingSize(int)
            void setMultiplier(int)
            void setNegativePrefix(string)
            void setNegativeSuffix(string)
            void setPositivePrefix(string)
            void setPositiveSuffix(string)
            string toLocalizedPattern()
            string toPattern()

       . java.text.SimpleDateFormat
            SimpleDateFormat()
            SimpleDateFormat(string)
            SimpleDateFormat(string, Locale)
            void applyLocalizedPattern(string)
            void applyPattern(string)
            Date get2DigitYearStart()
            void set2DigitYearStart(Date)
            string toLocalizedPattern()
            string toPattern()

   - Extending Java Lucene classes from Python

     Many areas of the Lucene API expect the programmer to provide their own
     implementation or specialization of a feature where the default is
     inappropriate. For example, text analyzers and tokenizers are an area
     where many parameters and environmental or cultural factors call for
     customization.

     PyLucene enables this by providing the Java extension points listed
     below, which serve as proxies for Java to call back into the Python
     implementations of these customizations.

     To learn more about this topic, please refer to the PyLucene paper
     included earlier.

     Unless otherwise documented, it is sufficient to pass the Python
     extension instance wherever a wrapped Java instance returned by
     PyLucene is normally expected; the Python instance is then wrapped for
     use by Java.

     Each extension point below enumerates the methods that a Python class
     needs to implement in order to function as an 'extension' of the
     corresponding Java Lucene class.

     . org.apache.lucene.analysis.Analyzer extension point:
           TokenStream tokenStream(fieldName, reader)

     . org.apache.lucene.analysis.CharTokenizer extension point:
           boolean isTokenChar(char)
           char normalize(char)

       In order to instantiate such a custom char tokenizer, the additional
       charTokenizer() factory method defined on
       org.apache.lucene.analysis.TokenStream instances needs to be invoked
       with the Python extension instance.

     . org.apache.lucene.analysis.TokenFilter extension point:
           Token next()

       In order to instantiate such a custom token filter, the additional
       tokenFilter() factory method defined on
       org.apache.lucene.analysis.TokenStream instances needs to be invoked
       with the Python extension instance.

     . org.apache.lucene.analysis.TokenStream extension point:
           Token next()
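
       A minimal sketch of a Python class providing the next() method
       follows. FakeToken is a hypothetical stand-in for
       org.apache.lucene.analysis.Token so that the example runs without
       PyLucene, and returning None at the end of the stream is an
       assumption modelled on Java's null.

```python
from collections import namedtuple

# Hypothetical stand-in for a Lucene Token: term text plus start and end
# offsets. A real extension would return PyLucene Token instances instead.
FakeToken = namedtuple('FakeToken', 'text start end')


class WhitespaceTokenStream(object):
    """Implements the single method of the TokenStream extension point."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def next(self):
        text, n = self._text, len(self._text)
        # Skip whitespace before the next token.
        while self._pos < n and text[self._pos].isspace():
            self._pos += 1
        if self._pos >= n:
            return None         # assumption: None signals end of stream
        start = self._pos
        # Consume characters up to the next whitespace.
        while self._pos < n and not text[self._pos].isspace():
            self._pos += 1
        return FakeToken(text[start:self._pos], start, self._pos)


stream = WhitespaceTokenStream("to be  or not")
tokens = []
token = stream.next()
while token is not None:
    tokens.append(token)
    token = stream.next()
```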

     . org.apache.lucene.queryParser.QueryParser extension point:
           Query getBooleanQuery(super, clauses)
           Query getFieldQuery(super, fieldName, queryText, slop=None)
           Query getFuzzyQuery(super, fieldName, termText, minSimilarity)
           Query getPrefixQuery(super, fieldName, termText)
           Query getRangeQuery(super, fieldName, part1, part2, inclusive)
           Query getWildcardQuery(super, fieldName, termText)

       The 'super' argument is provided to invoke the default Java
       implementation of these methods as needed. 

       In order to instantiate such a custom query parser, the additional
       queryParser() factory method defined on
       org.apache.lucene.analysis.Analyzer instances needs to be invoked
       with the Python extension instance.

       Please refer to the AdvancedQueryParserTest.py and
       CustomQueryParser.py 'Lucene in Action' samples for more details.
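
       The role of the 'super' argument can be sketched as follows. Only two
       of the methods are shown, and a real extension may need to provide
       all of them. FakeSuper is a hypothetical stand-in for the object
       PyLucene passes in, and calling super.getFieldQuery(fieldName,
       queryText) on it is an assumption about its interface; the samples
       named above show the actual usage.

```python
class LowercasingParser(object):
    """Sketch of a QueryParser extension overriding two of the methods."""

    def getFieldQuery(self, super, fieldName, queryText, slop=None):
        # Normalize the text, then fall back to the default implementation.
        # Assumption: 'super' exposes the same method without the 'super'
        # argument. The slop argument is ignored in this sketch.
        return super.getFieldQuery(fieldName, queryText.lower())

    def getWildcardQuery(self, super, fieldName, termText):
        # Reject wildcard queries instead of delegating to the default.
        raise ValueError("wildcard queries are disabled: %s" % termText)


class FakeSuper(object):
    """Hypothetical stand-in for the object PyLucene passes as 'super'."""

    def getFieldQuery(self, fieldName, queryText):
        return "%s:%s" % (fieldName, queryText)


parser = LowercasingParser()
query = parser.getFieldQuery(FakeSuper(), "title", "PyLucene")

try:
    parser.getWildcardQuery(FakeSuper(), "title", "py*")
    wildcard_rejected = False
except ValueError:
    wildcard_rejected = True
```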

     . org.apache.lucene.search.Filter extension point:
           BitSet bits(indexReader)

     . org.apache.lucene.search.FilteredTermEnum extension point:
           float difference()
           boolean termCompare(term)
           boolean endEnum()
           void setEnum(termEnum)

     . org.apache.lucene.search.HitCollector extension point:
           void collect(docNum, score)
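
       A minimal sketch of a Python collector follows. The class name
       ThresholdCollector and its threshold parameter are illustrative, and
       the loop at the end merely stands in for the calls Java Lucene would
       make during a search.

```python
class ThresholdCollector(object):
    """Keeps the documents whose score reaches a given threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.matches = []

    def collect(self, docNum, score):
        # Called once per matching document; keep only strong matches.
        if score >= self.threshold:
            self.matches.append((docNum, score))


collector = ThresholdCollector(threshold=0.5)

# Stand-in for the calls made by the searcher during a search.
for docNum, score in [(0, 0.9), (3, 0.2), (7, 0.5)]:
    collector.collect(docNum, score)
```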

     . org.apache.lucene.search.ScoreDocComparator extension point:
           int compare(scoreDoc0, scoreDoc1)
           int sortType()
           Comparable sortValue(scoreDoc)

       Please refer to the DistanceComparatorSource.py and
       DistanceSortingTest.py 'Lucene in Action' samples for more details on
       writing custom sorting code in Python.

     . org.apache.lucene.search.SortComparator extension point:
           ScoreDocComparator newComparator(indexReader, fieldName)
           Comparable getComparable(termText)

       Please refer to the DistanceComparatorSource.py and
       DistanceSortingTest.py 'Lucene in Action' samples for more details on
       writing custom sorting code in Python.

     . org.apache.lucene.search.SortComparatorSource extension point:
           ScoreDocComparator newComparator(indexReader, fieldName)

       Please refer to the DistanceComparatorSource.py and
       DistanceSortingTest.py 'Lucene in Action' samples for more details on
       writing custom sorting code in Python.

     . org.apache.lucene.search.Searchable extension point:
           void close()
           int docFreq(term)
           Document doc(n)
           int maxDoc()
           void searchAll(query, filter, hitCollector)
           TopDocs search(query, filter, n)
           TopFieldDocs searchSorted(query, filter, n, sort)
           Query rewrite(query)
           Explanation explain(query, docNum)

     . org.apache.lucene.search.Similarity extension point:
           float coord(overlap, maxOverlap)
           float idf(term, searcher)
           float idf(terms, searcher)
           float idf(docFreq, numDocs)
           float lengthNorm(fieldName, numTokens)
           float queryNorm(sumOfSquaredWeights)
           float sloppyFreq(distance)
           float tf(freq)

     . org.apache.lucene.search.highlight.Formatter extension point:
           string highlightTerm(originalText, tokenGroup)

     . org.apache.lucene.store.Directory extension point:
           void close()
           IndexOutput createOutput(name)
           void deleteFile(name)
           boolean fileExists(name)
           long fileLength(name)
           long fileModified(name)
           string[] list()
           Lock makeLock(name)
           IndexInput openInput(name)
           void renameFile(from, to)
           void touchFile(name)

     . org.apache.lucene.store.IndexInput extension point:
           void close(isClone)
           long length()
           string read(length, pos)
           void seek(pos)

       Because IndexInput instances may be cloned, the close() method takes
       an extra argument in Python telling whether a clone is being closed.
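
       A sketch of such a class over an in-memory buffer follows. The
       semantics assumed here are not spelled out in this file: read(length,
       pos) is taken to return 'length' bytes starting at offset 'pos', and
       only closing the original instance (not a clone) is taken to release
       the underlying resource.

```python
class BufferIndexInput(object):
    """Sketch of an IndexInput extension reading from an in-memory buffer."""

    def __init__(self, data):
        self._data = data
        self._pos = 0
        self.closed = False

    def close(self, isClone):
        # Assumption: only the original instance owns the underlying
        # resource, so closing a clone leaves it open.
        if not isClone:
            self.closed = True

    def length(self):
        return len(self._data)

    def read(self, length, pos):
        # Assumption: return 'length' bytes starting at offset 'pos'.
        return self._data[pos:pos + length]

    def seek(self, pos):
        # Remember the requested position; read() above is given its
        # position explicitly and does not depend on it.
        self._pos = pos


index_input = BufferIndexInput(b"abcdefgh")
chunk = index_input.read(3, 2)

index_input.close(True)               # a clone is being closed
open_after_clone_close = not index_input.closed
index_input.close(False)              # the original is being closed
```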

     . org.apache.lucene.store.IndexOutput extension point:
           void close()
           long length()
           void write(string)
           void seek(pos)

     . org.apache.lucene.store.Lock extension point:
           boolean isLocked()
           boolean obtain()
           boolean obtain(lockWaitTimeout)
           void release()

     . java.io.Reader extension point:
           void close()
           string read(len)
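
       A minimal sketch of a Python class providing these two methods
       follows. The class name StringReader is illustrative, and returning
       an empty string once the text is exhausted is an assumption, since
       the end-of-stream convention is not described here.

```python
class StringReader(object):
    """Sketch of a java.io.Reader extension serving a Python string."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self, length):
        # Return up to 'length' characters, or an empty string once the
        # text is exhausted (assumed end-of-stream convention).
        chunk = self._text[self._pos:self._pos + length]
        self._pos += len(chunk)
        return chunk

    def close(self):
        # Drop the reference to the text; nothing else to release here.
        self._text = u""
        self._pos = 0


reader = StringReader(u"abcdef")
chunks = [reader.read(4), reader.read(4), reader.read(4)]
reader.close()
```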

     . java.lang.Comparable extension point:
           int compareTo(object)

     . java.lang.Runnable extension point:
           void run()
