How indexer walks through hypertext links
=========================================

When indexer is about to insert a new URL into the database, or to
reindex an existing one, it first checks whether the URL has a
corresponding Server record in indexer.conf. To find the Server command
that corresponds to a URL, indexer compares the first bytes of the URL
with the start URL given as the argument of each Server command.
During startup indexer sorts all server records by the length of their
start URL, longest first, so the longest matching one is found first.
This scheme makes it possible to give different parameters to, for
example, a whole server and one of its subsections. Imagine a server
with a subdirectory that contains news articles. Those articles should
be reindexed more often than the rest of the server. The following
combination is useful in such a case:

Period 600000
Server http://www/
Period 200000
Server http://www/news/

These commands give the /news/ subdirectory a reindexing period different
from that of the whole server. indexer will choose the second Server
record for http://www/news/page1.html because, due to the sort order,
that record comes first in memory.
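
The lookup described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not indexer's actual code: the
list of (start URL, period) pairs and the find_server() function are
hypothetical names invented for this example.

```python
# Hypothetical model of Server records: (start_url, period) pairs.
servers = [
    ("http://www/", 600000),
    ("http://www/news/", 200000),
]

# Sort once at "startup": the longest start URL comes first.
servers.sort(key=lambda rec: len(rec[0]), reverse=True)

def find_server(url):
    """Return the first record whose start URL is a prefix of url, or None."""
    for start_url, period in servers:
        if url.startswith(start_url):
            return start_url, period
    return None

print(find_server("http://www/news/page1.html"))  # ('http://www/news/', 200000)
print(find_server("http://www/about.html"))       # ('http://www/', 600000)
print(find_server("http://other/"))               # None
```

Because the list is sorted by length, the first prefix match is always the
most specific one, so no further comparison between candidates is needed.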

There are three different types of indexer behaviour when it decides
whether to index a URL or not.


1) Default rules

By default indexer follows only those links that have a corresponding
Server command in the indexer.conf file. It also jumps between servers
if both of them are listed in indexer.conf.
For example, imagine that we have two Server commands:

Server http://www/
Server http://web/

When indexing http://www/page1.html indexer WILL follow a link to
http://web/page2.html if it finds one. Note that these pages are on
different servers, but BOTH of them have a corresponding Server record.

If one of the Server commands is deleted, indexer will remove
all expired URLs of that server during the next reindexing.


2) Using "FollowOutside yes"

The first way to change the default behaviour described above is to use
the "FollowOutside yes" indexer.conf command. indexer will then walk through
ANY URL it finds and will jump between different servers. Theoretically,
it will index the whole Internet in this case if there are no hardware limits :-)

When the "FollowOutside yes" command is specified, indexer simply adds
one server record with an empty start URL to memory while loading indexer.conf.
Because of the sort order, this empty server is found only when no other
Server record with a longer start URL matches.
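
A minimal sketch of why the empty record acts as a catch-all, assuming
the prefix comparison described earlier. The load_servers() and
find_server() functions and the follow_outside parameter are hypothetical
names used only for this illustration; they are not part of indexer.

```python
def load_servers(start_urls, follow_outside=False):
    """Build the in-memory list of start URLs, longest first."""
    records = list(start_urls)
    if follow_outside:
        # The empty start URL is a prefix of every URL.
        records.append("")
    return sorted(records, key=len, reverse=True)

def find_server(records, url):
    """Return the first start URL that is a prefix of url, or None."""
    for start_url in records:
        if url.startswith(start_url):
            return start_url
    return None

records = load_servers(["http://www/"], follow_outside=True)
print(repr(find_server(records, "http://www/a.html")))    # 'http://www/'
print(repr(find_server(records, "http://elsewhere/x")))   # ''
```

The empty string has length zero, so it always sorts last and is reached
only after every real Server record has failed to match.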


3) Using "DeleteNoServer no"

The second way to change the default behaviour is to use the "DeleteNoServer no"
command. It means that URLs which are already in the database will not be
deleted even if they have no corresponding Server command. "DeleteNoServer no"
is implemented by adding one empty server, just like "FollowOutside yes".
The difference between the two commands is that with "DeleteNoServer no"
indexer will follow ONLY links INSIDE a server and will not jump between
different servers.

Imagine this sequence of commands:

DeleteNoServer no
Server http://www/
Server http://web/

When indexing http://www/page1.html indexer WILL follow the link 
http://www/page2.html but WILL NOT follow http://web/page2.html
because http://www/page1.html and http://web/page2.html are on different
servers. 
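
The "inside the server" rule from this example can be modelled roughly as
follows. This is a sketch under an assumption: it treats two URLs as being
on the same server when their scheme and host part are equal, which may
differ from the comparison indexer really performs. The same_server()
function is a hypothetical name.

```python
from urllib.parse import urlsplit

def same_server(page_url, link_url):
    """Assumed rule: same scheme and same host[:port] means same server."""
    page, link = urlsplit(page_url), urlsplit(link_url)
    return (page.scheme, page.netloc) == (link.scheme, link.netloc)

page = "http://www/page1.html"
print(same_server(page, "http://www/page2.html"))  # True:  link is followed
print(same_server(page, "http://web/page2.html"))  # False: link is not followed
```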


4) Using "indexer -f <filename>"

The third scheme ("DeleteNoServer no") is very useful when running
"indexer -i -f url.txt". You can maintain the list of required servers in
url.txt. When a new URL is added to url.txt, indexer will index the server
of this URL during the next startup. It does not matter whether you have
given the root URL of the server (http://www/) or one of its internal
pages (http://www/path/to/some/page.html): indexer will index the whole
server http://www/.


Note that if you delete a URL from the list in url.txt while using the scheme
with "DeleteNoServer no", indexer WILL NOT delete the URLs of that server
from the database. For example, if you have removed http://www/ from url.txt,
then to remove all URLs of this server from the database you will have to run
"indexer -C -u http://www/%".

