.nr % 1
.OH ''PBS IDS'MOM'
.EH 'MOM'PBS IDS''
.P1
.so ids_setup.ms
.Rv $Revision: 2.8 $
.nr Fi 0 1
.nr H1 7
.NH 1
.Tc "\f3\s+2MOM - Machine-Oriented Miniserver\s-2\fP"
.LP
.OF 'Chapt \*(rV''\n(H1-%'
.EF '\n(H1-%''Chapt \*(rV'
.\"         Portable Batch System (PBS) Software License
.\" 
.\" Copyright (c) 1999, MRJ Technology Solutions.
.\" All rights reserved.
.\" 
.\" Acknowledgment: The Portable Batch System Software was originally developed
.\" as a joint project between the Numerical Aerospace Simulation (NAS) Systems
.\" Division of NASA Ames Research Center and the National Energy Research
.\" Supercomputer Center (NERSC) of Lawrence Livermore National Laboratory.
.\" 
.\" Redistribution of the Portable Batch System Software and use in source
.\" and binary forms, with or without modification, are permitted provided
.\" that the following conditions are met:
.\" 
.\" - Redistributions of source code must retain the above copyright and
.\"   acknowledgment notices, this list of conditions and the following
.\"   disclaimer.
.\" 
.\" - Redistributions in binary form must reproduce the above copyright and 
.\"   acknowledgment notices, this list of conditions and the following
.\"   disclaimer in the documentation and/or other materials provided with the
.\"   distribution.
.\" 
.\" - All advertising materials mentioning features or use of this software must
.\"   display the following acknowledgment:
.\" 
.\"   This product includes software developed by NASA Ames Research Center,
.\"   Lawrence Livermore National Laboratory, and MRJ Technology Solutions.
.\" 
.\"         DISCLAIMER OF WARRANTY
.\" 
.\" THIS SOFTWARE IS PROVIDED BY MRJ TECHNOLOGY SOLUTIONS ("MRJ") "AS IS" 
.\" WITHOUT WARRANTY OF ANY KIND, AND ANY EXPRESS OR IMPLIED WARRANTIES, 
.\" INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, 
.\" FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT ARE EXPRESSLY
.\" DISCLAIMED.
.\"
.\" IN NO EVENT, UNLESS REQUIRED BY APPLICABLE LAW, SHALL MRJ, NASA, NOR
.\" THE U.S. GOVERNMENT BE LIABLE FOR ANY DIRECT DAMAGES WHATSOEVER,
.\" NOR ANY INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\" 
.\" This license will be governed by the laws of the Commonwealth of Virginia,
.\" without reference to its choice of law rules.
.NH 2
.Tc "\f3Machine-Oriented Miniserver Overview\fP"
.LP
The purpose of the Machine-Oriented Miniserver (MOM) daemon is to place
batch jobs into execution, watch over and control their execution, and report
on their demise to the Batch Server which issued the job to MOM.
One MOM exists on each machine under the Batch Server's control.
Though the Batch Server maintains responsibility for each batch job it
executes, MOM takes care of the housekeeping details required to
actually initiate, monitor and clean up after batch jobs.
A running job has one or more running tasks.
MOM is the parent of each task which runs on its machine.
Unlike other batch systems, there is only one MOM for all running tasks;
there is no per-task shepherd process.
.LP
The MOM daemon has been combined with the Resource Monitor in an
effort to consolidate code.  The functions will be described separately.
MOM also acts as a Task Manager for each job she controls.  
.LP
The Batch Server uses essentially the same protocol to talk with MOM as
the Batch Server's clients use to talk with it.
MOM uses restricted interpretations of some of the Batch protocol and
makes use of a few special messages.
MOM acts as a client of the Batch Server for only one message type,
whose purpose is to announce the demise of a batch job.
.LP
A Batch Server which is responsible for a simple uniprocessor machine
will work with only one MOM.
Batch Servers for multiprocessors or for extended families of
workstations may deal with multiple MOMs.
.NH 3
.Tc MOM's Interpretation of PBS Protocol
.LP
MOM interprets a few PBS Protocol messages exactly as does the Batch Server.
There are several other PBS Protocol messages which MOM interprets in a more
restrictive way than specified in the ERS.
Finally, there are a few PBS Protocol messages which are unique to
communication with MOM.
The desired effect of these PBS Protocol interpretations is to simplify
MOM at the expense of requiring a more sophisticated Batch Server.
Since there will only be one Batch Server per workstation cluster or
distributed memory multiprocessor, and there will be many MOMs,
simplifying MOM seems to be a good idea.
.NH 4
.Tc \s-2Unchanged PBS Protocol Messages\s+2
.LP
The following PBS Protocol messages are interpreted by MOM exactly as
specified in the ERS explanation of the protocol.
.IP \(bu
Message Job
.IP \(bu
Signal Job
.IP \(bu
Status Job
.NH 4
.Tc \s-2Re-interpreted PBS Protocol Messages\s+2
.LP
The following paragraphs cover MOM's restricted PBS Protocol interpretations.
.NH 5
.Tc Modify Job
.LP
Of all the aspects of a batch job which can be modified by the Modify
Job message, MOM only supports reducing the limits of a running job,
and that only if it is possible for the resident machine.
Any other attempted modification will result in an error response.
.NH 5
.Tc Delete Job
.LP
If the designated job has been checkpointed and if MOM has the
checkpoint file, MOM will honor a Delete Job message by deleting the
file.
This situation will only arise if the Batch Server is configured to let
MOM keep checkpoint files.
The alternative Batch Server configuration will ask MOM to send it any
checkpoint files.
If the designated job is running or is unknown to MOM, an error
results.
.NH 5
.Tc Hold Job
.LP
If the designated job is running and if checkpoint is supported on the
resident machine, MOM will checkpoint the job.
The checkpoint file may later be sent back to the Batch Server or it may be
left in place, at the whim of the Batch Server.
If the designated job is not running, or if the target job cannot be
checkpointed by the resident machine, an error results.
.NH 5
.Tc Queue Job
.LP
Rather than put a job received through the Queue Job protocol into a
queue, MOM puts it into execution.  If a corresponding checkpoint file
exists, the job is actually restarted.
.NH 5
.Tc Server Shutdown
.LP
If any job is still running, MOM reports an error.
Otherwise, she exits.
It is the responsibility of the Batch Server to send MOM a Signal Job
message or a Hold Job message for each running job
before sending the Server Shutdown message.
.NH 4
.Tc "\s-2Unused PBS Protocol Messages\s+2"
.LP
Much of the PBS Protocol has no meaning to MOM.  Any of the following
messages, if received by MOM, will result in an error response.
.IP \(bu
Manager
.IP \(bu
Move Job
.IP \(bu
Rerun Job
.IP \(bu
Run Job
.IP \(bu
Select Jobs
.IP \(bu
Status Queue
.IP \(bu
Status Server
.IP \(bu
Locate Job
.IP \(bu
Track Job
.IP \(bu
Pull Job
.IP \(bu
Register Job
.NH 4
.Tc "\s-2MOM-specific PBS Protocol Messages\s+2"
.LP
The following PBS Protocol messages are used exclusively by the Batch
Server while acting as a client of MOM.
.NH 5
.Tc Copy Files
.LP
The Copy Files message provides MOM with a list of filename pairs, a
direction flag, a user identification on MOM's machine, a file owner's
name, and a hostname.
MOM treats the first name of each pair as a filename local to MOM's machine.
It treats the second name as a filename local to the named host.
MOM arranges for a copy to be made in the direction specified by the
direction flag.
MOM acts in the name of the identified user on MOM's machine.
.LP
When files are being copied outward from MOM and the copy is successful,
MOM deletes the file on her machine.
.LP
If the file transfers cannot take place, an error response is given.
If the transfer that failed was inbound to MOM's machine, then MOM deletes
all the files which were copied in prior to the failed file.
.NH 5
.Tc Delete Files
.LP
The Delete Files message provides MOM with a list of filenames and a
user identification on MOM's machine.
MOM interprets the filenames as the names of local files, and deletes them.
MOM acts in the name of the identified user.
.LP
If the files cannot be deleted, an error response is given.
.NH 4
.Tc "\s-2MOM-specific PBS Protocol Message Sent by MOM\s+2"
.LP
The following PBS Protocol message is used exclusively by MOM
while acting as a client of the Batch Server.
.NH 5
.Tc Job Obituary
.LP
MOM uses the Job Obituary message to tell the Batch Server that a batch
job has ended and how.
The message contains the job_id, the termination status and the total
resource utilization of the process which was the job's session leader.
The termination status is the value returned through the integer pointer
which is the argument of the POSIX wait() function.
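.LP
As a hedged illustration of what such a termination status carries, the
sketch below forks a child, collects the value stored through the wait()
status pointer, and decodes it with the standard POSIX macros.
The helper names child_status() and exit_code_of() and the constant
STATUS_UNKNOWN are hypothetical and are not part of MOM.
.Cs
```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_UNKNOWN (-1000)  /* hypothetical marker, not a PBS value */

/*
 * Fork a child that exits with exit_code and return the raw status
 * stored through the wait() status pointer, or -1 on error.
 * Hypothetical helper, for illustration only.
 */
int child_status(int exit_code)
{
    int status = 0;
    pid_t pid;

    assert(exit_code >= 0 && exit_code <= 255);
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(exit_code);               /* child ends at once */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/*
 * Decode a raw status: the exit code for a normal exit, or the
 * negated signal number if the process was killed by a signal.
 */
int exit_code_of(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return STATUS_UNKNOWN;              /* stopped or not decodable */
}
```
.Ce
.LP
A receiver of the obituary would apply the same macros to the status field
to tell a normal exit from death by signal.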
.NH 2
.Tc \f3Program: pbs_mom\fP
.LP
.NH 3
.Tc Overview
.LP
.NH 3
.Tc Packaging
.LP
MOM is composed of three parts:
.IP \(bu
PBS-generic routines for communication and server operation, drawn from
the Batch libraries under directory
.I src/lib/*
and from Batch Server files from directory
.I src/server ,
.IP \(bu
Machine-independent, MOM-specific information, in the files mom_func.h
and various C source files located in the 
.I src/resmom
directory, and
.IP \(bu
Machine-dependent, MOM-specific information, in the files mom_mach.h,
mom_mach.c, mom_start.c, and pe_input.c.
.NH 3
.Tc External Interfaces
.LP
MOM has the following external interfaces:
.IP \(bu
Arguments supplied by the pbs_mom command line,
.IP \(bu
Inter-Server Protocol messages,
.IP \(bu
Resource Monitor Protocol messages,
.IP \(bu
Batch Protocol messages received from the Batch Server, and
.IP \(bu
Task Manager messages exchanged with running jobs and other MOMs.
.NH 3
.Tc Machine-independent Files
.LP
.NH 4
.Fi mom_func.h
.LP
The file
.I
src/include/mom_func.h
.R
contains the machine-independent macro definitions which are unique to MOM 
as well as the function prototypes for MOM.
.NH 4
.Fi job.h
.LP
The file
.I
src/include/job.h
.R
contains many structure and flag defines for both the Batch Server and MOM.
The structures for MOM have become more complicated with the need to track
.B tasks .
The job structure contains a number of fields not needed in the Batch Server.
It has entries for the number of nodes in the job, the local MOM's node
id, an array of node resources, an array of node entries,
and a list of task structures.  Each node entry gives
the node id and host name for the node it represents.  It also has
an RPP stream number and a list of events which are being waited for
from other MOMs.
.NH 4
.Fi mom_main.c
.LP
The file
.I
src/resmom/mom_main.c
.R
contains the machine-independent source code which is unique to MOM.
.Fn main()
.Cs
main(int argc, char **argv)
.Ce
.IP Args: 4
.RS
.IP argc
The count of the number of arguments.
.IP argv
A null-terminated list of character pointers.
For the option key letters and arguments that
.Ar argv
may point to, see the pbs_mom(8B) man page.
.RE
.IP Return: 4
.RS
.IP zero
if success.
.IP non-zero
an error code defined in pbs_errno.h.
.RE
.LP
.B "Start Up"
.LP
Mom must be run with a real and effective UID of root.
Her service port and that of the server are obtained by calling
.I get_svrport() .
Mom then processes the options specified on the command line.
Resource limits which will be inherited by the job and might not be reset
are set to unlimited.  Mom then sets up paths and checks the security of 
her files and directories.
.LP
The local host and the name of Mom's own host are added to the list of systems
which may contact Mom, see
.I addclient() .
If a configuration file was specified with the -c option, the config file is
processed by calling
.I read_config() .
.LP
The routine
.I mom_open_poll()
is called to initialize the machine dependent polling routines.
.LP
The routine
.I init_abort_jobs()
is called if jobs were running when MOM last ceased operation.  This routine
will kill those jobs.
.LP
.B "Main Loop"
.LP
In normal operation to place a job into execution, MOM will determine if
the job is to run on more than one node by checking the attribute
"exec_host".  If so, the MOMs on the other hosts are contacted to
request they join the job.  If this succeeds, or no other nodes
are part of the job, MOM will fork herself, see
.I start_exec() .
The child process will establish the script as standard input, and set up
standard output and error as required by the job.  It will then set, by
whatever means are supported on the system, the resource limits of the job.
The child will then \*Qexec\*U the shell on top of itself and become
the job.
.LP
The single parent MOM after forking the child will determine in
.I mom_do_poll()
if any of the resource limits cannot be enforced by the system directly and
therefore require MOM to monitor the usage by the job by polling.  If polling
is required, the job is added to a special list.
In the main loop, once every 
.Sc CHECK_POLL_TIME
(120) seconds, Mom will obtain the process information for all running 
processes by calling
.I mom_get_sample() .
For all running jobs, their resource usage is updated by calling
.I mom_set_use() .
.I rpp_io()
is called to see if any RPP i/o is required.
If any (running) job has the Mom flag
.Sc MOM_NO_PROC
set, then for each task in the job the session leader's existence is verified
by calling kill(2) with signal zero (SIGNULL).  If -1 is returned and errno
is ESRCH, then the process no longer exists (even as a zombie).  The task is
forced into the
.Sc TI_STATE_EXITED
state.  This allows Mom to catch the termination of tasks for which she is not
the parent (say after a restart).  Note, the MOM_NO_PROC flag is set in
.I cput_sum()
if no processes are found when summing up the job's CPU usage.
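.LP
The existence test with signal zero can be sketched as follows.
This is a minimal illustration under stated assumptions, not MOM's own code:
the names process_is_gone() and spawn_and_reap() are hypothetical, and the
second helper exists only to produce a pid that is known to have been reaped.
.Cs
```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if no process with this pid exists, not even a zombie. */
int process_is_gone(pid_t pid)
{
    assert(pid > 0);
    /* Signal zero delivers nothing; only the error checks are made. */
    return kill(pid, 0) == -1 && errno == ESRCH;
}

/* Fork a child that exits at once, reap it, and return its pid. */
pid_t spawn_and_reap(void)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(0);
    if (pid > 0 && waitpid(pid, NULL, 0) != pid)
        return -1;
    return pid;
}
```
.Ce
.LP
An EPERM error from kill() means the process exists but may not be signalled,
so the sketch treats only ESRCH as proof that the process is gone.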
.LP
If checkpointing is supported ...
.br
In the main loop when jobs are running, MOM will determine if there is any need
to checkpoint jobs by looking for a non-zero
.Ar ji_chkpttime ,
the checkpoint interval time.  If set, MOM checks
.Ar ji_chkptnext 
to see if the time for the next checkpoint has been reached.  If so, that
time is updated to now + ji_chkpttime and 
.I start_checkpoint ()
is called to checkpoint the job.
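.LP
The periodic checkpoint test described above amounts to a small piece of
time arithmetic.  The sketch below assumes both fields hold seconds and uses
a hypothetical structure, ckpt_times, in place of the real job structure;
the function name checkpoint_due() is likewise an illustration.
.Cs
```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the two job fields named in the text. */
struct ckpt_times {
    time_t ji_chkpttime;    /* checkpoint interval, 0 means none */
    time_t ji_chkptnext;    /* time at which the next one is due */
};

/*
 * Return 1 and schedule the next checkpoint if one is due at time now,
 * otherwise return 0 and leave the structure unchanged.
 */
int checkpoint_due(struct ckpt_times *t, time_t now)
{
    assert(t != NULL);
    if (t->ji_chkpttime == 0)
        return 0;                       /* no periodic checkpoint requested */
    if (now < t->ji_chkptnext)
        return 0;                       /* not yet time */
    t->ji_chkptnext = now + t->ji_chkpttime;
    return 1;                           /* caller would start the checkpoint */
}
```
.Ce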
.LP
Then for each \*Qpolled\*U job, MOM will call
.I mom_over_limit()
to determine if any of the usage is over limit.   When that occurs, a message
is written on the standard error file by calling
.I message_job()
and
.I kill_job() 
is called to terminate the job.
kill_job() will be called up to three times
(due to problems with IBM poe on the SP-2).  The first two
times, kill_job() is called with SIGTERM.  The
last time, MOM gets serious and calls it with SIGKILL.
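.LP
The escalation from SIGTERM to SIGKILL can be expressed as a function of the
attempt number.  This is a sketch only; the name signal_for_attempt() and the
constant KILL_ATTEMPTS are hypothetical, and the timing between attempts is
not modelled.
.Cs
```c
#include <assert.h>
#include <signal.h>

#define KILL_ATTEMPTS 3     /* the text says up to three attempts */

/*
 * Pick the signal for a given attempt, counting from zero:
 * SIGTERM for the first two attempts, SIGKILL for the last.
 */
int signal_for_attempt(int attempt)
{
    assert(attempt >= 0 && attempt < KILL_ATTEMPTS);
    return (attempt < KILL_ATTEMPTS - 1) ? SIGTERM : SIGKILL;
}
```
.Ce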
.LP
When a job terminates, the SIGCHLD signal is sent to MOM.  The post job
processing requires a two step approach, so the SIGCHLD signal handler only
sets a flag,
.Av termin_child ,
which indicates that some child process has terminated.  The child may not
even be a task, but some other child process of MOM.  However, the terminated
process (task) cannot be reaped immediately.  Reaping a child on a system
where resource usage is maintained in the process table causes the process
table entry to be freed and the information lost.  Before the 
.I wait() 
is called, MOM on finding that
.Av termin_child
is set, will call
.I scan_for_terminated()
to get the latest resource usage and then determine which task (if any) 
terminated.  The
.Av exiting_tasks
flag is set.  This flag may also be set on recovery.  When MOM finds this flag
set, 
.I scan_for_exiting()
is called to post process any jobs marked as exited.  MOM then sits and waits
for another service request.
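.LP
The two step approach, a handler that only sets a flag followed by a later
reap, can be sketched as below.  The flag is named after
termin_child in the text; the function deferred_reap_demo() and the handler
catch_child() are hypothetical, and the point at which a real MOM would
sample resource usage is marked only by a comment.
.Cs
```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler; named after the flag described in the text. */
static volatile sig_atomic_t termin_child = 0;

/* The handler does nothing except record that some child has ended. */
static void catch_child(int sig)
{
    (void)sig;
    termin_child = 1;
}

/*
 * Fork a child that exits with exit_code, wait for the flag to be
 * raised, and only then reap the child.  Returns the child's exit
 * code, or -1 on error.
 */
int deferred_reap_demo(int exit_code)
{
    struct sigaction act;
    sigset_t block, old, wait_mask;
    pid_t pid;
    int status = 0;

    assert(exit_code >= 0 && exit_code <= 255);
    act.sa_handler = catch_child;
    act.sa_flags = 0;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGCHLD, &act, NULL) == -1)
        return -1;

    /* Block SIGCHLD so it cannot slip in between the test and the sleep. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    termin_child = 0;
    pid = fork();
    if (pid == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(exit_code);

    while (!termin_child)
        sigsuspend(&wait_mask);         /* atomically unblock and sleep */
    sigprocmask(SIG_SETMASK, &old, NULL);

    /* A real MOM would collect resource usage here, before reaping. */
    termin_child = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```
.Ce
.LP
Keeping the handler this small avoids calling anything from signal context
that is not safe there, and leaves the order of sampling and reaping under
the control of the main loop.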
.LP
.B "Termination"
.LP
When MOM exits as a result of a SIGTERM,
.I mom_close_poll()
is called to perform any required cleanup for actions
established in
.I mom_open_poll() ,
such as closing access to the kernel.
Then MOM attempts to kill any running job and marks each one as exiting.
Clean up will occur when MOM is restarted.
.Fn do_rpp()
.Cs
int do_rpp(int stream)
.Ce
.IP Args: 4
.RS
.IP stream
a stream index to read.
.RE
.LP
Read the stream to get the protocol number.
Read the protocol version number and call
.B rm_request()
if it is a Resource Monitor request, or
.B im_request()
if it is an Inter-MOM request, or
.B is_request()
if it is an Inter-Server request.
.Fn rpp_request()
.Cs
int rpp_request(int fd)
.Ce
.IP Args: 4
.RS
.IP fd
not used.
.RE
.LP
Input is coming from an RPP stream.  Call
.B rpp_poll()
to get the stream index to process.  If it is a valid stream, call
.B do_rpp() .
Continue this until there are no more streams to process.
.Fn do_tcp()
.Cs
int do_tcp(int fd)
.Ce
.IP Args: 4
.RS
.IP fd
a file descriptor to read.
.RE
.LP
Read the file descriptor to get the protocol number.  If the call to
.B disrsi(\|)
returns DIS_EOF, the connection is closed.  If it returns DIS_EOD,
there is no more data, but the connection is still open.
Read the protocol version number and call
.B rm_request()
if it is a Resource Monitor request, or
.B tm_request()
if it is a Task Manager request.
.Fn tcp_request()
.Cs
int tcp_request(int fd)
.Ce
.IP Args: 4
.RS
.IP fd
a file descriptor to read.
.RE
.LP
Input is coming from a tcp stream as either a Resource Monitor request
or a Task Manager request.  
Check that it is coming from a machine in the okclients array then
go into a loop calling
.B do_tcp()
until there are no more messages to process.
.Fn read_config()
.Cs
static int read_config(char *file)
.Ce
.IP Args: 4
.RS
.IP file
name of the configuration file specified on the -c option.
.RE
.IP Returns: 4
zero if ok, non-zero otherwise.  Errors are logged.
.LP
Each line in the configuration file is read.  Lines starting with a hash
mark (#) are
comments and are ignored as are null lines.
.LP
Non-comment lines can have a static resource definition or a
command that causes a function to be called with a token.
A resource definition is described in the Resource Monitor IDS.
A command begins with a dollar sign ($).  The command names and
the functions they call are as follows:
.TS
center box tab(/);
l | l .
command/function
_
clienthost/addclient
restricted/restricted
logevent/setlogevent
.TE
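.LP
The line handling described above for read_config() can be sketched as a
small classifier.  This is an illustration under assumptions: leading white
space is skipped before the first character is examined, only the first
token after the command name is captured, and the names classify_line() and
line_kind are hypothetical.
.Cs
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_COMMAND, LINE_RESOURCE };

/*
 * Classify one configuration line.  For a command line, copy the
 * command name (without the dollar sign) into name and the first
 * token after it into value.  Both buffers hold up to len bytes.
 */
enum line_kind classify_line(const char *line, char *name, char *value,
                             size_t len)
{
    size_t n;

    assert(line != NULL && name != NULL && value != NULL && len > 0);
    name[0] = 0;
    value[0] = 0;
    while (isspace((unsigned char)*line))
        line++;
    if (*line == 0 || *line == '#')
        return LINE_IGNORED;            /* empty line or comment */
    if (*line != '$')
        return LINE_RESOURCE;           /* static resource definition */

    line++;                             /* skip the dollar sign */
    n = 0;
    while (*line && !isspace((unsigned char)*line)) {
        if (n + 1 < len)
            name[n++] = *line;
        line++;
    }
    name[n] = 0;
    while (isspace((unsigned char)*line))
        line++;
    n = 0;
    while (*line && !isspace((unsigned char)*line)) {
        if (n + 1 < len)
            value[n++] = *line;
        line++;
    }
    value[n] = 0;
    return LINE_COMMAND;
}
```
.Ce
.LP
A caller would then look the returned name up in a table like the one above
and pass the value to the matching function.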
.Fn addclient()
.Cs
static u_long addclient(char *hostname)
.Ce
.IP Args: 4
.RS
.IP hostname
name of a host to be added to the allowed clients of MOM.
.RE
.IP Returns: 4
the IP address of hostname if ok, zero otherwise.
.LP
The routine
.I get_hostaddr()
is called to return the IP address of the listed host.  Any invalid or
unknown host causes addclient() to return 0, and MOM shuts down.
Valid addresses are added to the global binary tree
.Ar okclients .
If a signal causes MOM to re-read the config file, the IP addresses
previously in
.Ar okclients
are not deleted.  MOM must be restarted to remove an IP address from
those allowed to connect.
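.LP
A binary tree of allowed addresses that supports insertion and lookup but no
deletion might look like the sketch below.  The node type addr_node and the
functions addr_insert() and addr_allowed() are hypothetical; MOM's own tree
routines are not reproduced here.
.Cs
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type; MOM's own tree structure is not shown here. */
struct addr_node {
    unsigned long addr;
    struct addr_node *left, *right;
};

/* Insert addr into the tree at *root.  Returns 0, or -1 if out of memory. */
int addr_insert(struct addr_node **root, unsigned long addr)
{
    assert(root != NULL);
    while (*root != NULL) {
        if (addr == (*root)->addr)
            return 0;                   /* already present */
        root = (addr < (*root)->addr) ? &(*root)->left : &(*root)->right;
    }
    *root = malloc(sizeof **root);
    if (*root == NULL)
        return -1;
    (*root)->addr = addr;
    (*root)->left = (*root)->right = NULL;
    return 0;
}

/* Return 1 if addr is in the tree, 0 otherwise. */
int addr_allowed(const struct addr_node *root, unsigned long addr)
{
    while (root != NULL) {
        if (addr == root->addr)
            return 1;
        root = (addr < root->addr) ? root->left : root->right;
    }
    return 0;
}
```
.Ce
.LP
Because the sketch offers no removal routine, re-reading a configuration file
can only add entries, which mirrors the behavior described above.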
.Fn restricted()
.Cs
static u_long restricted(char *name)
.Ce
.IP Args: 4
.RS
.IP name
the name to be matched against the name of any host sending a request
with a non-privileged port number.
.RE
.IP Returns: 4
non-zero if ok, zero otherwise.
.LP
The first character of
.I name
can be a star (*) to allow wildcard matches of hostnames.  For example, if
.I name
is "*.spam.com", any host in the domain "spam.com"
will be allowed to perform restricted queries.
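.LP
One way to implement the leading star rule is a suffix comparison, sketched
below.  The function name host_matches() is hypothetical, and ignoring case
is an assumption made because host names are normally compared that way;
MOM's actual matching may differ in detail.
.Cs
```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <string.h>
#include <strings.h>

/*
 * Return 1 if host matches pattern.  A leading star in pattern matches
 * any prefix, so the rest of the pattern must equal the end of host.
 * Without a star the two names must be equal.  Case is ignored.
 */
int host_matches(const char *pattern, const char *host)
{
    size_t plen, hlen;

    assert(pattern != NULL && host != NULL);
    if (pattern[0] != '*')
        return strcasecmp(pattern, host) == 0;

    pattern++;                          /* skip the star */
    plen = strlen(pattern);
    hlen = strlen(host);
    if (hlen < plen)
        return 0;
    return strcasecmp(pattern, host + (hlen - plen)) == 0;
}
```
.Ce
.LP
With this rule the pattern *.spam.com accepts eggs.spam.com but not the bare
name spam.com, since the dot is part of the required suffix.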
.Fn setlogevent()
.Cs
static u_long setlogevent(char *value)
.Ce
.IP Args: 4
.RS
.IP value
new value for log_event_mask
.RE
.IP Returns: 4
non-zero if ok, zero otherwise.
.LP
Set a new value for log_event_mask.
.NH 4
.Fi start_exec.c
.LP
The file
.I src/resmom/start_exec.c
contains machine independent functions used to place a job into execution.
.Fn start_exec
.Cs
void start_exec(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job to place into execution.
.RE
.LP
This function is called from MOM's version of
.I req_commit() 
within the file req_quejob.c.  The purpose is to place the job into
execution.  The following steps are taken:
.LP
The
.At JOB_ATR_Cookie
attribute is set for the job.  The cookie is used to validate inter-mom and
task management (tm_ API) calls.  By calling
.I job_nodes() ,
the nodes allocated to the job are determined by examining the
.At JOB_ATR_exec_host
attribute.  Note that the flag
.Sc JOB_SVFLG_HERE
was set back in req_commit() when the job was received.  It indicates this Mom
is designated "mother superior" for the job.  Also note that start_exec() is
not called on the sister nodes.
.LP
If other nodes are to be part of the job...
Two sockets (for standard out and error) are opened which will be used by
pbs_demux to collect output from tasks on the other nodes.
The port numbers bound to the sockets are saved in
.I ji_stdout
and 
.I ji_stderr .
The other nodes are sent a 
.Sc JOIN_JOB
inter-mom message with the job information including the above ports,
the logical node numbers, and the job attributes.
.LP
If the job will only run on the local machine,
.I finish_exec()
is called.
Note, for multiple node jobs, finish_exec() is called when all sisters have
acknowledged the JOIN_JOB message, see 
.I im_request()
in mom_comm.c.
.Fn finish_exec()
.Cs
void finish_exec(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job to place into execution.
.RE
.LP
Start a job running by establishing the resource usage limits, setting up the
standard output and error files for the job, connecting the script as the
standard input to the job and then invoking the login or user specified
shell to interpret the script.
When called, MOM is running as a single process with root privilege and
her current working directory is her private directory.
.LP
If other nodes are allocated to the job, this is the Mom which will run
the job script.  She is known as \*QMother Superior\*U.  Mother Superior
obtains the port number associated with the sockets allocated for communication
between the job and the pbs_demux process (which will be started later).
The port numbers must be passed to the other Moms associated with the job.
The flag 
.Sc MOM_HAS_NODES
is set in 
.I ji_flags
of the job structure.
.LP
The next thing finish_exec() does is obtain the password entry for the
user specified by the server in the job attribute
.At JOB_ATR_euser .
This is the user name under which the job should be executed.  The
corresponding uid is saved later in the job structure for future use, see
.I check_pwd() .
.LP
The machine dependent function
.I mom_do_poll()
is called to determine if the newly started job has resources which require
MOM to poll its usage.   If it returns true, or if the job has more than one
node, then the job is
added to a special polling list as described under mom_main.c.
.LP
If checkpoint is enabled ...
.br
If the job's checkpoint attribute,
.At JOB_ATR_chkpnt ,
has a value of
.Ty c=nnn
then the user is requesting periodic checkpoint at an interval of 
.Ty nnn
minutes.  The interval is set into the job structure in
.Ar ji_chkpttime
and
.Ar ji_chkptnext
is set to time now plus the interval.
.LP
If checkpoint is enabled ...
.br
If the job is marked as having been checkpointed, 
.Sc JOB_SVFLG_CHKPT
is set in 
.Ar ji_svrflags ,
and if the restart file exists, then the machine independent routines
.I site_mom_prerst ()
and
.I mom_restart_job ()
are called to restart the process.
The time the job was started which is kept in
.Av ji_stime
is adjusted to the current time minus the wall clock time the job ran 
before it was checkpointed and held.  This is so the held time is not counted
against the job's wall clock time.   
On the CRAY, if the job is in a suspended state when restarted, the job start
time is not adjusted as it will be when the job is resumed.
.LP
If the restart fails for a \*Qpermanent\*U
reason, the job is marked as exiting and the exit status is set to 
.Sc JOB_EXEC_FAIL .
If the reason is temporary, see restart() on the Cray, the job is marked as
exiting, but the exit status is set to 
.Sc JOB_EXEC_RETRY
which directs the server to re-queue the job.
.LP
.B "If Interactive support is enabled ..."
.br
If the job is an interactive job, attribute
.At JOB_ATR_interactive
is non-zero, a master side pseudo tty is opened by calling
.I open_master() .
This is done before finish_exec() forks because the name of the slave tty
must be saved in the job's output path attribute
.At JOB_ATR_outpath
so that it can be found if a message (qmsg) is sent to the job, see
.I message_job() .
.LP
.B "If not interactive ..."
.br
and if
.Sc SHELL_INVOKE
is defined as 1, the default, a pipe is created which will be the shell's
standard input and will be used to pass the name of the job script.  This is
equivalent to saying:
.Ty "echo script_path | shell"
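.LP
The effect of feeding the script path to the shell through a pipe can be
sketched as follows.  The function run_via_stdin() is hypothetical; unlike
MOM, it waits for the shell to finish and returns its exit code, which keeps
the example self-contained.
.Cs
```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start shell with a pipe as standard input and write script_path,
 * followed by a newline, into the pipe.  Returns the exit code of
 * the shell, or -1 on error.
 */
int run_via_stdin(const char *shell, const char *script_path)
{
    const char newline = 10;            /* ASCII line feed */
    size_t len;
    int fds[2];
    int status = 0;
    int ok;
    pid_t pid;

    assert(shell != NULL && script_path != NULL);
    if (pipe(fds) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: make the read end of the pipe standard input. */
        close(fds[1]);
        if (fds[0] != STDIN_FILENO) {
            if (dup2(fds[0], STDIN_FILENO) == -1)
                _exit(127);
            close(fds[0]);
        }
        execl(shell, shell, (char *)NULL);
        _exit(127);                     /* exec failed */
    }

    /* Parent: write the line, then close the pipe to signal end of input. */
    close(fds[0]);
    len = strlen(script_path);
    ok = write(fds[1], script_path, len) == (ssize_t)len &&
         write(fds[1], &newline, 1) == 1;
    close(fds[1]);
    if (waitpid(pid, &status, 0) != pid || !ok || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```
.Ce
.LP
The shell treats whatever line arrives on standard input as a command, so
passing the path of an executable script causes that script to be run.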
.LP
A pair of pipes is created for communication between the child of MOM and
MOM.
MOM then forks the child process which will
become the job.  Standard files and resource limits must
be established by the child process which becomes the job.  Certain conditions
may exist which result in the failure to establish the files or limits.
If the conditions are temporary, the job should be re-queued by the server.
If the conditions are permanent, then the job should be aborted.  The child
returns via the pipe either the
.B "session/job id"
(greater than zero) if the job is placed into execution, 
.Sc JOB_EXEC_RETRY
if a temporary condition prevented execution, or
.Sc JOB_EXEC_FAIL
if the job cannot ever be run.
When MOM reads from the pipe either status indicating that the job
did not execute, that value is saved as the job exit status and returned
to the server.  Note that the true exit status of a job, the argument to
the exit() call, cannot be negative.  Thus JOB_EXEC_RETRY and JOB_EXEC_FAIL
are negative values to indicate to the server that these are from MOM.
After the session id or job status is read from one pipe, MOM will acknowledge
by returning the same value back to the child on the other pipe.
This releases the child to continue with
.B exec()
or
.B exit() .
This prevents the possibility of SIGCHLD interrupting the pipe read
and confusing the parent MOM.
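.LP
The report and acknowledge exchange over the two pipes can be sketched as
below.  The constants DEMO_EXEC_FAIL and DEMO_EXEC_RETRY are hypothetical
placeholders for the negative codes, whose real values are defined in the PBS
headers, and start_with_handshake() is an illustration rather than MOM's code.
The child's final exit stands in for the exec of the shell.
.Cs
```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical negative codes; the real values live in the PBS headers. */
#define DEMO_EXEC_FAIL  (-1)
#define DEMO_EXEC_RETRY (-2)

/*
 * Fork a child that reports either its new session id (simulate == 0)
 * or the negative code in simulate, then waits for the parent to echo
 * the value back.  Returns the value the parent read, or 0 on error.
 */
int start_with_handshake(int simulate)
{
    int up[2], down[2];                 /* child to parent, parent to child */
    int value = 0, ack = 0, status = 0;
    pid_t pid;

    assert(simulate <= 0);
    if (pipe(up) == -1)
        return 0;
    if (pipe(down) == -1) {
        close(up[0]);
        close(up[1]);
        return 0;
    }
    pid = fork();
    if (pid == -1) {
        close(up[0]);
        close(up[1]);
        close(down[0]);
        close(down[1]);
        return 0;
    }
    if (pid == 0) {
        close(up[0]);
        close(down[1]);
        value = (simulate < 0) ? simulate : (int)setsid();
        if (write(up[1], &value, sizeof value) != (ssize_t)sizeof value)
            _exit(2);
        /* Held here until the parent has recorded the value. */
        if (read(down[0], &ack, sizeof ack) != (ssize_t)sizeof ack)
            _exit(2);
        _exit((ack == value && value > 0) ? 0 : 1);
    }
    close(up[1]);
    close(down[0]);
    if (read(up[0], &value, sizeof value) != (ssize_t)sizeof value)
        value = 0;
    else if (write(down[1], &value, sizeof value) != (ssize_t)sizeof value)
        value = 0;
    close(up[0]);
    close(down[1]);
    waitpid(pid, &status, 0);
    return value;
}
```
.Ce
.LP
Because the child cannot exit or exec until the acknowledgment arrives, the
parent's read of the first pipe always completes before the child can
terminate.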
.LP
After forking the child and receiving a positive return, the parent MOM
will record the job start time (used to determine wall clock execution time),
and the session id of the child. 
The 
.I "global id"
which was created to hold an SGI Array Session Handle, ASH, is saved and
the task state is set to 
.Sc TI_STATE_RUNNING .
.LP
If there is more than one node,  the sockets created for communication between
the job and pbs_demux are no longer needed by Mom herself, so they are closed.
The associated port numbers are saved in the job structure.
.LP
At this point, the main MOM returns to req_commit() and then to the main loop.
.LP
The 
.B "child process of MOM"
does the following:
.IP 1. 5
Ensures that the file descriptors for the pipes are greater than 2 (standard
error).
.IP 2.
Determines the correct shell to invoke by calling the machine dependent
.I set_shell() .
.IP 3. 
Sets up a whole slew of environment variables, including those
passed with the job in the attribute
.At JOB_ATR_variables ,
and HOME, LOGNAME, PATH, SHELL, and USER, which would be set by login(8).
Other variables set are those mandated by POSIX batch, including
\*QPBS_ENVIRONMENT=PBS_BATCH\*U and for compatibility with NQS,
\*QENVIRONMENT=BATCH\*U.
In addition, PBS_JOBCOOKIE, PBS_NODENUM, PBS_TASKNUM and PBS_MOMPORT
are set so the Task Management library can communicate back with MOM. 
Also, the machine dependent routine
.I set_mach_vars()
is called to set any machine specific variables (typically none).
.IP 4a.
If interactive support is enabled...
.br
If the job is interactive, the environment variable PBS_ENVIRONMENT is set to
PBS_INTERACTIVE.  The name of the host where qsub is running is extracted from
the environment variable PBS_O_HOST and the port number on which qsub is
listening is taken from the interactive attribute
.At JOB_ATR_interactive .
.I conn_qsub()
is called to open a network connection back to qsub.  Over the connection we
send the job id as a simple validation as to who we are.  Qsub sends the
window size, terminal type, and terminal characters (special characters) of
its controlling terminal.  
An alarm is established around the code which connects to qsub and reads the
terminal characters.   This prevents a suspended qsub or a network problem
from holding up MOM which in turn would hold up the server and the scheduler
which is waiting for the pbs_runjob() to be acknowledged.
.IP
.I set_job() 
is called to establish a new session.  This must be done to free the job
of any prior controlling terminal.
The ownership of the slave pseudo tty is changed to the user and mode is
set to 0620.
Then the slave side of the pseudo
terminal is opened; it becomes the controlling terminal of the job.
For the CRAY only \- ioctl calls are made to force the slave to be the
controlling terminal.
.IP
The child process forks a grandchild.  This process becomes the
writer process,
.I mom_writer() .
It reads from the master tty (data written by the job on the slave
side) and sends it over the socket to qsub.
The original child of MOM, parent to the grandchild, 
sets up stdout and stderr for the
prolog to run.  This is why the writer process was started so
the output can be delivered to the user's screen.
The function
.I run_pelog()
is called to execute the prologue script, if one exists.
Run_pelog() is called with
.Sc PE_PROLOGUE
to signify the prologue and
.Sc PE_IO_TYPE_ASIS
because the standard output and error files of the job are already opened on
descriptors 1 and 2.
Note, the child's current working directory is still mom_priv.
After the prolog is run, it forks again to create grandchild number two.
This will become the job while the original child will become the
reader process,
.I mom_reader() ,
which reads the input from qsub (socket) and passes it to the job
by writing on the master side of the pseudo tty.
When mom_reader() exits, the pseudo tty is reset to root ownership and
mode 0666.
.IP 4b.
For non-interactive jobs ...
.br
The environment variable PBS_ENVIRONMENT is set to PBS_BATCH.
.IP
Standard output and standard error are established depending on the setting
of the
.At JOB_ATR_join 
attribute by calling 
.I open_std_file() .
.IP
The call to
.I run_pelog()
is made in a similar fashion to the interactive case except this
takes place before the call to
.I set_job() .
This is so the time spent running the prolog will not be charged
to the user's job.
.IP
The machine dependent function
.I set_job()
is called to establish the session and process group id.  This is machine
dependent because a few vendors (such as CRAY) support a \*Qjob\*U concept.
.IP 5.
For both interactive and normal batch ...
.br
Resource limits are established by calling
.I mom_set_limits() .
If this fails, either 
.Sc JOB_EXEC_RETRY
or
.Sc JOB_EXEC_FAIL
is returned to MOM depending on the permanence of the failure and the job exits.
Interactive jobs cannot tolerate a JOB_EXEC_RETRY attempt because they
have lost the chance to connect with
.B qsub
so a JOB_EXEC_FAIL will be returned for either case.
.IP 6.
The argv array to pass to the shell is established.  Note that the shell
name in argv[0] is prefixed with a '-', following the traditional login
shell convention.
.IP
The supplementary groups are set by calling the system routine setgroups().
The real group and user ids are set to those of the user.
The child changes the current working directory to the user's home directory.
The user must be able to access the directory or this will fail.  Some sites
remove the permissions on the home directory when an account is disabled, which
is one reason the chdir() is delayed until this point; if done as root it would
succeed.
.IP
The log is closed.
.IP
The session id is returned to the parent mom by calling
.I starter_return() .
Finally, the shell is exec-ed.
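.LP
As an illustration of the login shell convention mentioned in step 6, the
following is a minimal sketch and not the code in start_exec.c.  The helper
name login_arg0() is hypothetical, and using only the last component of the
shell path is an assumption of the sketch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the argv[0] string for a login shell: the shell name with a
 * leading '-'.  Keeping only the last path component is an assumption
 * of this sketch.  Returns 0 on success, -1 if the result does not
 * fit in buf.
 */
static int login_arg0(const char *shell, char *buf, size_t len)
{
    const char *base = strrchr(shell, '/');
    int n;

    base = (base != NULL) ? base + 1 : shell;
    n = snprintf(buf, len, "-%s", base);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```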
.LP
.Fn start_process()
.Cs 
int start_process(task *ptask, char **argv, char **envp)
.Ce
.IP Args: 4
.RS
.IP ptask
the task structure which has already been created for the session.
.IP argv
an array of arguments to pass to execve().
.IP envp
an array of environment strings, the last must be NULL.
.RE
.IP Returns: 4
.RS
.IP 0
if no error occurred.
.IP -1
on error.
.RE
.LP
Start a process for a spawn request.  This will be different from
a job's initial shell task in that the environment will be specified
and no interactive code need be included.
.LP
.Fn fork_me()
.Cs 
pid_t fork_me(int connection)
.Ce
.IP Args: 4
.RS
.IP connection
all network connections except 
.Ar connection
are closed.
.RE
.IP Returns: 4
.RS
.IP pid
of the newly forked child process.
.RE
.LP
Forks a new process.  In the child, closes all network connections except
the one specified (-1 means close all).  The action for SIGCHLD is
reset to 
.Sc SIG_DFL ,
otherwise the 
.I system()
call used in various places will not function.
The actions for SIGHUP, SIGINT and SIGTERM are also reset to
.Sc SIG_DFL
and the signal mask is reset so no signals are blocked.
The machine dependent function
.I mom_close_poll()
is called to close or clean up any files/items/... opened/created in
.I mom_open_poll() .
.LP
.Fn nodes_free()
.Cs
void nodes_free(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job structure of interest.
.RE
.LP
Free the ji_nodes array for a job.  If any events are
attached to an array element, free them as well.
.LP
.Fn job_nodes()
.Cs
void job_nodes(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job structure of interest.
.RE
.LP
Generate a ji_nodes array for a job from the exec_host attribute.
Call nodes_free() just in case we have seen this job before.
Parse exec_host first to count the number of nodes and allocate
an array of nodeent's.  Then, parse it again to get the hostname
of each node and init the other fields of each nodeent element.
The final element will have the ne_node field set to TM_ERROR_NODE.
.Fn starter_return()
.Cs
void starter_return(int up_pipe, int down_pipe, int code)
.Ce
.IP Args: 4
.RS
.IP up_pipe
file descriptor of pipe to the parent MOM.
.IP down_pipe
file descriptor of pipe from the parent MOM.
.IP code
which is written on the up pipe to MOM to indicate the job state.
.RE
.LP
The code is written to mom and the up pipe is closed.
A read is made from the down pipe as a sync mechanism, then it is closed.
If the code is less than zero,
.B exit()
is called.
.Fn std_file_name()
.Cs
char *std_file_name(job *pjob, enum job_file which)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job.
.IP which
file's name should be returned, values are: 
.Sc StdOut ,
.Sc StdErr ,
or
.Sc Chkpt .
.RE
.IP Returns: 4
.RS
.IP 
The file name created for the indicated file.  Note, the
return points to a static area that will be overwritten on the next call.
.RE
.LP
If interactive jobs are supported ...
.br
If the job is interactive, 
.At JOB_ATR_interactive
set, the slave tty name has been stored in the output path attribute,
.At JOB_ATR_outpath .
That name is returned.
.LP
Otherwise, if
the file is to be retained on the execution host as determined by attribute
.At JOB_ATR_keep ,
the file name generated is the \*Qdefault\*U name
.br
.Ty "job_name.Xjob_sequence_number"
.br
where 
.Ty X
is either 
.Ty o
or 
.Ty e .
This file will be created in the user's home directory.
The home directory path is maintained in the job structure.
.LP
If the file is not kept, then it is created in the PBS spool directory
unless MOM is built with
.Sc NO_SPOOL_OUTPUT
defined, in which case it is created in the user's home directory.
In either case, the name is the 11 character prefix obtained from 
.Av ji_fileprefix
appended with a suffix corresponding to which file is being created.
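.LP
The default name used for a kept file can be illustrated with a short sketch.
The helper default_std_name() and its argument list are hypothetical; the
real routine works from the job structure and returns a pointer to a static
area.

```c
#include <stdio.h>

/*
 * Format the default name job_name.Xseq, where X is 'o' for standard
 * output or 'e' for standard error.  Returns 0 on success, -1 if X is
 * invalid or the name does not fit in buf.
 */
static int default_std_name(char *buf, size_t len, const char *job_name,
                            char x, const char *seq)
{
    int n;

    if (x != 'o' && x != 'e')
        return -1;
    n = snprintf(buf, len, "%s.%c%s", job_name, x, seq);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```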
.Fn open_std_file()
.Cs
int open_std_file(job *pjob, enum job_file which, int flag, gid_t gid)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job.
.IP which
file is to be opened, see 
.I std_file_name() .
.IP flag
for open, specifies create or truncate options as well as read/write mode.
.IP gid
which group should own the file.
.RE
.IP Returns: 4
.RS
.IP descriptor
for the open file, -1 if the open fails.
.RE
.LP
Calls 
.I std_file_name()
to obtain the name and then opens the file.
.Fn bld_env_variables()
.Cs
void bld_env_variables(struct var_table *table, char *name, char *value)
.Ce
.IP Args:
.RS
.IP table
pointer to the var_table structure which controls the buffer and array 
of variables being built for the job.
.IP name
of a variable to add.
.IP value
of a variable to add.
.RE
.LP
An environment variable of the form, keyword=value, is added to the set to
be placed in the job's environment.  The argument
.Ar name
may be either a name or the whole keyword=value string.   If the
.Ar value
argument is a null pointer, then 
.Ar name
is assumed to be the whole string.  If
.Ar value
is not null, then a '=' is appended to name and the value appended to that.
If there is no room in the control table or the buffer, nothing is added.
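.LP
A minimal sketch of this logic follows.  The layout of struct var_table
shown here is hypothetical, chosen only to make the example self-contained,
and add_env() is a stand-in name for the real routine.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical layout; the real var_table may differ. */
struct var_table {
    char  *v_block;    /* next free byte in the string buffer */
    size_t v_bsize;    /* bytes left in the buffer */
    char **v_envp;     /* array of pointers being built */
    int    v_ensize;   /* slots in v_envp, including the final NULL */
    int    v_used;     /* slots filled so far */
};

static void add_env(struct var_table *t, const char *name, const char *value)
{
    size_t need = strlen(name) + 1;          /* string plus terminator */

    if (value != NULL)
        need += 1 + strlen(value);           /* '=' plus the value */
    if (t->v_used + 1 >= t->v_ensize || need > t->v_bsize)
        return;                              /* no room: add nothing */

    if (value != NULL)
        snprintf(t->v_block, t->v_bsize, "%s=%s", name, value);
    else
        snprintf(t->v_block, t->v_bsize, "%s", name);

    t->v_envp[t->v_used++] = t->v_block;
    t->v_envp[t->v_used] = NULL;             /* keep the array terminated */
    t->v_block += need;
    t->v_bsize -= need;
}
```
.LP
Keeping the pointer array NULL terminated after every addition means it can
be handed to execve() at any point.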
.Fn init_groups()
.Cs
int init_groups(char *user, int pwgroup, int group_size, int *groups)
.Ce
.IP Args: 4
.RS
.IP user
name.
.IP pwgroup
primary user's group from password entry.
.IP group_size
size of the
.Ar groups
array, typically 
.Sc NGROUPS_MAX .
.IP groups
pointer to integer array of size
.Ar group_size .
.RE
.IP Returns: 4
The number of supplementary groups placed in 
.Ar groups ,
-1 on an error.
.LP
The primary group gid is placed in 
.Ar groups .
The C library routine getgrent() is used to scan the group file to locate
all groups in which the
.Ar user
is a member.  The gids for those groups are added to the array
.Ar groups .
An error, -1, is returned if the number of groups exceeds the array size
given by 
.Ar group_size .
.Fn catchinter()
.Cs
static void catchinter()
.Ce
.LP
This routine applies only to \*Qinteractive\*U jobs.
This routine catches the death of child signal when either the reader child or
the job grand-child of MOM dies.  Remember, the direct child of MOM is not the
job in this case, but is the writer() process.
.LP
When SIGCHLD is received, the other processes in the group are killed to make
sure the job ends.
The variable
.Ar mom_writer_go
is set to zero, see
.I mom_writer() .
.Fn check_pwd()
.Cs
struct passwd * check_pwd(job *pjob)
.Ce
.LP
This routine obtains the password entry for the
user specified by the server in the job attribute
.At JOB_ATR_euser .
This is the user name under which the job should be executed.  The
corresponding uid is saved later in the job structure for future use. The
execution group is handled likewise.  If the execution group is not
specified (it always will be) or the
.Sc ATTR_VFLAG_DFLT
bit is set indicating the normal login group, the primary group from the
password entry is used.   The routine
.I init_groups()
is called to scan the group file and build a list of the supplementary
groups of which the user is a member.  This group list and the user's
home directory are saved in a
.Ar grpcache
structure as an extension to the job structure.
.LP
The routine
.I site_mom_chkuser()
is called.  This is a stub routine provided to allow a site the ability
to customize checking of an account's validity.  The supplied version
always returns false.   If a site's modified version returns true, meaning
the account is invalid, MOM will set the job state to 
.Sc JOB_SUBSTATE_EXITING 
and the exit status to 
.Sc JOB_EXEC_FAIL
aborting the job.
.LP
.Fn mom_restart_job()
.Cs
int mom_restart_job(job *pjob, char *path)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job structure of job to restart.
.IP path
Name of path of directory containing restart files for tasks within the job.
.RE
.IP Return:
The number of tasks restarted, -1 implies an error occurred.
.LP
The directory specified by 
.I path
is read and for each entry the task id is taken from the entry name.
The task must be part of the job.
The machine dependent routine
.I mach_restart()
is called with the task pointer and the path to the specific restart file to
restart the job as required on that type of machine.
.LP
If there are any errors, -1 is returned.
.NH 4
.Fi catch_child.c
.LP
The file 
.I src/resmom/catch_child.c
contains machine independent functions dealing with the termination of 
a job. A name like on_job_termination would be better suited, but the
name started with the first function placed in the file, oh well...
.LP
.Fn catch_child()
.Cs
void catch_child()
.Ce
.LP
This is the signal handler for SIGCHLD for the main mom.
All it does is set
.Av termin_child
to indicate some process died, maybe even a job task.
.Fn scan_for_exiting()
.Cs
void scan_for_exiting()
.Ce
.LP
This function is called from MOM's main loop when it finds the flag
.Av exiting_tasks
set.  As explained in the general narrative on MOM, resource usage by
tasks must be collected before the child process is reaped and the process
table entry is erased.  This is done in 
.Av scan_for_terminated() .
If a job is marked
.Sc MOM_CHKPT_ACTIVE ,
it will be skipped since we do not want to change its state as tasks exit.
If a job has
.Sc MOM_CHKPT_POST
set, a checkpoint attempt had an error and some tasks were aborted.
In this case, the function
.I chkpt_partial()
is called.
If the death of child is from a task, then 
.Av exiting_tasks
is set.  If the task was the original shell started by mother superior,
.Sc JOB_SUBSTATE_EXITING
is set if no other nodes are part of the job or none can be communicated
with.  If other nodes exist which can be communicated with,
they are sent a message to terminate the job.
.LP
For a job now marked as substate
.Sc JOB_SUBSTATE_EXITING ,
the following operations are performed:
.IP 1. 4
Call
.I kill_job()
to kill off any processes in the task sessions that tried to escape from the
job, i.e. were forked or placed into the background.
.IP 2.
The job is unlinked from the resource usage polling list.
.IP 3.
A connection is opened to the server which sent MOM the job.
The connection is set to receive the reply from the server by calling
.I add_conn()
to set the read function to be
.I obit_reply() .
.IP 4A.
A child process is forked.  The child runs the epilogue script, if one exists, 
by calling
.I run_pelog() 
with
.Sc PE_EPILOGUE .
If the job is a \*Qnormal\*U batch job, run_pelog() is called with
.Sc PE_IO_TYPE_STD
so the epilogue output goes to the job's output file.  But if the job
is an interactive job, the pseudo terminal connection back to qsub has already
been lost, so run_pelog() is called with
.Sc PE_IO_TYPE_NULL
so the epilogue output is sent to /dev/null.
.IP
A Job Obituary Notice is sent to that server, see 
.I send_jobobit()
in server/isode_write.c.
The notice contains the exit status of the
job including any of the special statuses discussed in start_exec.c.
The notice also contains the most recent, hopefully last, accounting of the
resources used by the job.
.IP 4B.
In the parent, the real and true MOM, the job substate is set to
.Sc JOB_SUBSTATE_OBIT
indicating that the notice has been sent.   Note, this state is not recorded
on disk (save_job() is not called).   If MOM crashes, we want her to resend
the obit notice.
.LP
MOM will process up to two jobs in the exiting state before returning to
the main loop.  The two job limit is to keep from being out of touch with the
network (mainly doing accept()s) for too long.
.LP
The addition of multiple tasks running on more than one MOM has made
the state changes more difficult to understand.  Just the
RUNNING and EXITING states are considered in the following figure.
.DS
.so job_state.pic
.sp
.ce
Figure \n(H1 - \n+(Fi
.DE
.Fn obit_reply()
.Cs
static void obit_reply(socket)
.Ce
.IP Args: 4
.RS
.IP socket
descriptor of the connection on which data has been received.
.RE
.LP
This function is entered when data is ready to read on a connection whose
read function is this function.  That is set up in 
.I scan_for_exiting() 
as described above.  A request structure is allocated in which the reply
is decoded by calling 
.I isode_reply_read() .
If the reply does not decode or the connection has closed (the server timed
it out - a bad network situation), the reply code is set to -1.
.LP
When scan_for_exiting() sent the obit notice to the server, it set the
socket number in the job structure in ji_momhandle.  So we now scan the
jobs for the job with the substate of 
.Sc JOB_SUBSTATE_OBIT
and the corresponding socket number recorded there.
.LP
The server may respond with one of four different codes:
.IP PBSE_NONE
The job is placed into substate 
.Sc JOB_SUBSTATE_EXITED
and the connection to the server is added to the set on which MOM will accept
requests.  The server will then direct disposition of the job's files, etc.
.IP PBSE_ALRDYEXIT
The server is already performing exit processing for the job.  MOM must have
been restarted after sending a Job Obituary request and is now sending a
second.  The server will continue processing the job on the thread and
connection started by the first request.
MOM will update her copy of the job state and close this connection.
.IP PBSE_CLEANEDOUT
The server did not own the job and the most recent server restart was of type
cold.  Therefore, the job was discarded by the server and mom should do the
same.  Thus 
.I mom_deljob()
is called.  Note, this is the reason a pointer to the next job in the list
has already been recorded.  Otherwise, at the bottom of the loop there would
be no next job.
.IP -1
The special connection is closed if EOF is read or the reply
did not decode.  The job substate is reset to JOB_SUBSTATE_EXITING so another
Obit notice will be sent to the server.
.IP "Any other"
If the server returns any other response code, such as PBSE_NOJOBID, then
the event is logged, and the job discarded by calling
.I mom_deljob() .
We hope this never happens.
.LP
Finally, the request structure is freed and the socket shutdown and closed.
.LP
.Fn chkpt_partial()
.Cs
void chkpt_partial(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job which had a checkpoint error
.RE
.LP
This routine is called to restart tasks for a job which had a
checkpoint cause tasks to abort but was not able to
finish a complete checkpoint of every task.  It gets called
from scan_for_exiting() when the work process spawned to do
the checkpoint returns.
.LP
Loop through each task of the job.  Each task that was checkpointed
and has been reaped is restarted by calling mach_restart().
If all tasks for the job are running when we are done with the loop,
turn off MOM_CHKPT_POST flag so job is back to where it was before
the bad checkpoint attempt.  Then get rid of incomplete checkpoint
directory and move old chkpt dir back to regular if it exists.
If any task restart fails, kill the job.
.LP
.Fn init_abort_jobs()
.Cs
void init_abort_jobs(int mode)
.Ce
.IP Args: 4
.RS
.IP mode
of MOM's initialization.
.RE
.LP
This function is called from mom_main when MOM is first started.  One of
three conditions exists.
.IP 1. 4
There are/were no jobs running on the host.  We will not find any in MOM's
job directory.
.IP 2.
There are jobs found and the 
.Ty -r
command option was set, the
.Av mode 
flag is non-zero.  MOM is coming up after being killed or (heaven forbid)
crashing.  The jobs that were running are no longer the children of MOM, they
now belong to the
.I init
process.  MOM will not receive the death of child notice.  Therefore MOM
goes into a homicidal rage, reaping vengeance for being abandoned, and 
kills off all the tasks associated with jobs she had managed, by calling
.I kill_job() .
.IP
The session id for each task
is cleared so that scan_for_exiting() will not issue another kill.
If the
.Sc JOB_SVFLG_HERE
flag is not set for the job (i.e. local MOM is not mother superior),
the job is thrown away.  If the flag is on (i.e. we are mother superior),
and a sisterhood exists, a message is sent to all the other MOM's to
kill the job.  If we are mother superior and no sisterhood exists,
the job exit status is set to one of three values depending on the job
to indicate it was killed on recovery:
.in +0.2i
.br
.Sc JOB_EXEC_INITABT 
\- normal, non-checkpointed job.
.br
.Sc JOB_EXEC_INITRST
\- job has a non-migratable (Cray style) checkpoint file, thus could restart.
.br
.Sc JOB_EXEC_INITRMG
\- job has a migratable checkpoint file (not implemented).
.in -0.2i
The substate is set to
.Sc JOB_SUBSTATE_EXITING ,
and the 
.Av exiting_tasks
flag is set to tell the main loop that jobs have \*Qdied\*U.
.IP 3.
There are jobs but the recovery mode is not set.  This by convention indicates
that the system was also down and there are no jobs still running.  In this
case we cannot go killing sessions as MOM might hit innocent bystanders;
the session id may have been already re-used.  All of the other procedures
described in case 2 are followed.
.LP
.Fn mom_deljob()
.Cs
void mom_deljob(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job structure.
.RE
.LP
This is a system semi-dependent routine.   On Unicos (Cray), 
.I rmtmpdir()
is called to remove the temporary directories.
Then 
.I job_purge()
is called to completely remove any knowledge MOM has of the job.
.NH 4
.Fi mom_inter.c
.LP
The file
.I src/resmom/mom_inter.c
contains functions to support the execution of an interactive batch job.
Mostly, these functions deal with the creation and setup of the pseudo
tty used by the job for input and output and the routines to move data
between the master tty and the socket connection to qsub,
see figure \n(H1\-\n+(Fi.
.DS B
.so interact.pic
.sp
\f3Figure \n(H1\-\n(Fi: Interactive Job Communication Flow\f1
.DE
.Fn read_net()
.Cs
static int read_net(int socket, char *buffer, int amount)
.Ce
.IP Args: 4
.RS
.IP socket
The socket connection back to qsub.
.IP buffer
Pointer to a data buffer.
.IP amount
The amount of data to read from the network.
.RE
.IP Returns: 4
Positive number which is amount of data read, or -1 on error.
.LP
This routine will read data from the network until the expected
.Ar amount
of data has been read.
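.LP
A read loop of this kind might look like the following sketch.  The name
read_fully() is hypothetical, and both the retry on EINTR and the early
return of a short count at EOF are assumptions rather than documented
behavior of read_net().

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read until "amount" bytes have arrived.  Returns the number of
 * bytes read, or -1 on a hard error.
 */
static int read_fully(int fd, char *buf, int amount)
{
    int total = 0;

    while (total < amount) {
        ssize_t got = read(fd, buf + total, (size_t)(amount - total));

        if (got > 0)
            total += (int)got;
        else if (got == 0)
            break;              /* EOF: return what we have (assumption) */
        else if (errno != EINTR)
            return -1;          /* hard error */
    }
    return total;
}
```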
.Fn rcvttype()
.Cs
char *rcvttype(int socket)
.Ce
.IP Args: 4
.RS
.IP socket
The socket connection back to qsub.
.RE
.IP Returns: 4
The terminal type string.
.LP
This routine reads and validates the terminal type string and terminal control
characters sent by qsub.
The terminal type string must be of the form:
.Ty TERM= type.
The number of control characters expected from qsub is defined by
.Sc PBS_TERM_CCA .
See 
.I set_termcc() ,
below, for which characters are expected.
.Fn set_termcc()
.Cs
void set_termcc(fds)
.Ce
.IP Args: 4
.RS
.IP fds
The file descriptor of the slave pseudo tty.
.RE
.LP
This routine makes the system call
.I tcsetattr()
with
.Sc TCSANOW
to set the control characters of the pseudo terminal to match those of qsub's
controlling terminal.  The characters set are VINTR, VQUIT, VERASE, VKILL,
VEOF, and VSUSP.
.Fn rcvwinsize()
.Cs
int rcvwinsize(int socket)
.Ce
.IP Args: 4
.RS
.IP socket
connection back to qsub.
.RE
.IP Returns: 4
Zero if successful, -1 on an error.
.LP
This function receives the window size as sent by qsub.  It is a string of
the form:
.Ty "WINSIZE rn cn xn yn" ,
where rn, cn, xn, and yn are numerical values specifying the row size, column
size, and number of pixels in x and y.  The received values will be used by
.I setwinsize()
to set the window size for the pseudo tty.
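.LP
Parsing the window size string can be sketched as below.  The structure
win_size and the function parse_winsize() are hypothetical names used for
illustration; the real routine reads the string from the socket and saves
the values for setwinsize().

```c
#include <stdio.h>

/* Hypothetical holder for the four values carried by the message. */
struct win_size {
    int rows, cols, xpixel, ypixel;
};

/* Returns 0 if all four values were parsed, -1 otherwise. */
static int parse_winsize(const char *msg, struct win_size *ws)
{
    if (sscanf(msg, "WINSIZE %d %d %d %d",
               &ws->rows, &ws->cols, &ws->xpixel, &ws->ypixel) != 4)
        return -1;
    return 0;
}
```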
.Fn setwinsize()
.Cs
int setwinsize(fds)
.Ce
.IP Args: 4
.RS
.IP fds
The file descriptor of the slave pseudo tty.
.RE
.IP Returns: 4
Zero if successful, -1 on an error.
.LP
Sets the window size by calling
.I ioctl()
with
.Sc TIOCSWINSZ
and the size values obtained by 
.I rcvwinsize() .
.Fn mom_reader()
.Cs
int mom_reader(int socket, int ptc)
.Ce
.IP Args: 4
.RS
.IP socket
connection to qsub.
.IP ptc
The file descriptor of the master side pseudo tty.
.RE
.IP Returns: 4
Zero on EOF, -1 on an error.
.LP
Data is read from the network and written to the master tty until either
EOF is read or an error occurs.  If either the read or the write are
interrupted by a signal, the operation is retried.
Minus one (-1) is returned for any other error.  Zero (0) is returned when
EOF is reached.
.Fn mom_writer()
.Cs
int mom_writer(int socket, int ptc)
.Ce
.IP Args: 4
.RS
.IP socket
connection to qsub.
.IP ptc
The file descriptor of the master side pseudo tty.
.RE
.IP Returns: 4
Zero on EOF, -1 on an error.
.LP
Data is read from the master tty and written to qsub over the network until
EOF is read, an error occurs or the variable
.Ar mom_writer_go
is set to zero in 
.I catchinter() 
on death of the job or reader process.
If either the read or the write are
interrupted by a signal, the operation is retried.
Minus one (-1) is returned for any other error.  Zero (0) is returned when
EOF is reached.
.Fn conn_qsub()
.Cs
int conn_qsub(char *host, int port)
.Ce
.IP Args: 4
.RS
.IP host
the name of the host on which qsub is running.
.IP port
the port on which qsub will accept a connection.
.RE
.IP Returns: 4
The socket descriptor if the connection is made, or -1.
.LP
The host address is obtained from
.I get_hostaddr() .
.I client_to_svr() 
is called to establish the connection.
.NH 4
.Fi requests.c
.LP
The file
.I src/resmom/requests.c
contains various request processors for MOM.  These are for requests which
are handled completely differently than by the server, therefore MOM has
her own separate code.  The Queue Job sequence is very similar to the server's,
thus that code, with a few #ifdefs, is used directly as described below.
.Fn fork_to_user()
.Cs
static pid_t fork_to_user(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer, must point to a 
.Ty cpyfiles
request as used by either the Copy Files or Delete Files request from the
server.
.RE
.IP Returns: 4
.RS
.IP pid 
of the new process.
.RE
.LP
The function
.I fork_me()
is called to fork a new process, a child of mine (MOM).
For the child process,
the user identification information is obtained from one of two places.
If the request is for a job about which MOM knows, she is able to find the
job structure for the provided job id, then she uses the execution user id
and group id from the basic job structure and the home directory and
supplementary groups from the 
.Ar grpcache
structure extension to the job structure set up by 
.I finish_exec() .
If the job is not known by MOM, true for a stage-in request, then MOM must
go to the password entry and call
.I init_groups()
for the above information.  The user name is obtained from the 
.Ty cpyfile
request.
.LP
The current working directory is changed to that user's home and the pid from
the fork() call is returned.
.Fn add_bad_list()
.Cs
static void add_bad_list(char **plist, char *newentry, int newlines);
.Ce
.IP Args: 4
.RS
.IP plist
pointer to character pointer which may point to pre-allocated area or be null.
.IP newentry
The new text entry to add to the list of bad file messages.
.IP newlines
The number of new lines to prefix the new text.
.RE
.LP
If 
.Ar plist
is null, no memory has yet been allocated for this message, so malloc the
amount needed to hold the new text.
Otherwise, a prior message has been built, so realloc the memory to hold it
plus the new message.
.LP
Prepend the required number of new-lines for formatting.
Append the new text.
.Fn return_file()
.Cs 
static int return_file(job *pjob, enum job_file which, int socket)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to a job structure.
.IP which
file should be returned to the server.
.IP socket
connection to the server.
.RE
.IP Returns: 4
.RS
.IP zero
on success.
.IP non-zero
error number if an error occurred.
.RE
.LP
This function is used to return \*Qstandard\*U files of a job back to the
server.  Such files are the job's standard output and error and the job's
checkpoint file.  Typically, this function is called when a job is being rerun.
The standard files must be returned to the server in order to have them
available to send to a different MOM when the rerun actually occurs.
.LP
.I std_file_name() 
is called to obtain the file name associated with the job and the type (which).
If the file can be opened, one or more Move Job File requests are generated
and sent to the server.  Each Move Job File request sends a chunk of the file,
up to 
.Sc RT_BLK_SZ
(4k) bytes. 
.I send_jobfile()
is used to encode and send the request to the server.  When all chunks have
been sent the file is closed.  Note the connection to the server is left open
by this routine.
.Fn local_or_remote()
.Cs
static int local_or_remote(char **path)
.Ce
.IP Args: 4
.RS
.IP path
(In and Out) pointer to string containing pathname
.RE
.IP Return:
.RS
.IP value
1 if file is remote, 0 if local on this host.
.IP path
If the file is local,
.Ar path
is updated to point to the local path,  without the 
.Ty host:
prefix.
.RE
.LP
The argument
.Ar path
is a file name in the form
.Ty host:filepath .
If the
.Ty host
prefix matches this machine's host name, 0 is returned and path is
reset to point after the colon.  Otherwise 1 is returned and path is
undisturbed.
.LP
The
.Ty host
prefix will match if 
.IP 1.
It exactly matches the local host name or if it matches
a leading substring of the local host name and the next character in the
local host name is a dot ('.').  E.g.  a host prefix of
.Ty x.y
will match a local host name of
.Ty x.y.z .
.IP 2.
It is 
.Ty localhost .
.IP 3.
The helper function
.I told_to_cp()
returns true saying the host and path match a $usecp entry in the config file.
In this case, 
.Ar path
is updated to point to the \*Qlocal\*U substitute path.
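.LP
The first two matching rules can be sketched as a small predicate.  The name
host_is_local() is hypothetical, the case-sensitive comparison is an
assumption, and the $usecp rule handled by told_to_cp() is left out.  The
predicate returns non-zero on a match, which is the case where
local_or_remote() reports the file as local.

```c
#include <string.h>

/*
 * Return 1 if "prefix" names this host, 0 otherwise.  A prefix
 * matches if it is "localhost", equals the local host name, or is a
 * leading part of the local host name that ends just before a dot.
 */
static int host_is_local(const char *prefix, const char *local)
{
    size_t len = strlen(prefix);

    if (strcmp(prefix, "localhost") == 0)
        return 1;
    if (strncmp(prefix, local, len) != 0)
        return 0;
    /* the match must end at the end of the name or at a dot */
    return local[len] == 0 || local[len] == '.';
}
```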
.LP
.Fn told_to_cp()
.Cs
static int told_to_cp(char  *host, char  *oldpath, char **newpath)
.Ce
.IP Args:
.RS
.IP host
is a host name from a file destination.
.IP oldpath
is the path from a file destination.
.IP newpath
(RETURN) is updated to the local path if a match is found.
.RE
.IP Returns:
1 if match is found, 0 otherwise.
.LP
This routine checks a file destination against a
.Ty $usecp
entry in the config file.   This entry tells Mom that a remote destination
is also mounted locally and what the local path is.  This allows the use of
/bin/cp instead of rcp to deliver (or stage) the file.
.LP
If the
.Ar host
and
.Ar oldpath
from a destination supplied to 
.I local_or_remote()
matches a $usecp config file entry, then
.Ar newpath
is updated to point to the alternate path supplied on the config file entry.
As called from local_or_remote(), 
.Ar host
is the host portion of the output (or staged file) path, 
.Ar oldpath
is the path portion of that same entry and 
.Ar newpath 
originally points to the whole entry.   Newpath is changed if host and oldpath
match up with an entry.
.Fn is_file_same()
.Cs
static int is_file_same(char *file1, char *file2)
.Ce
.IP Args: 4
.RS
.IP "file1 file2"
names of two files, presumed local to this host.
.RE
.IP Returns:
1 if the two names point to the same file (inode), 0 if not.
.LP
Using
.I stat() ,
if the two file names point to the same device and same inode, then they are
the same file and one (1) is returned.  Otherwise zero (0) is returned.
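.LP
The comparison can be sketched in a few lines.  The name same_file() is used
here to mark this as an illustration, and treating a stat() failure as
"not the same file" is an assumption of the sketch.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Return 1 if both names refer to the same device and inode, else 0. */
static int same_file(const char *file1, const char *file2)
{
    struct stat sb1, sb2;

    /* assumption: a name that cannot be examined is "not the same" */
    if (stat(file1, &sb1) != 0 || stat(file2, &sb2) != 0)
        return 0;
    return sb1.st_dev == sb2.st_dev && sb1.st_ino == sb2.st_ino;
}
```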
.Fn req_deletejob()
.Cs
void req_deletejob(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Delete Job Request received from the Server.
.RE
.LP
The job is located by calling
.I find_job()
with the job id from the request and purged by calling
.I mom_deljob() .
For Unicos, any temporary directories are removed by calling
.I rmtmpdir() .
.Fn req_holdjob()
.Cs
void req_holdjob(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Hold Job Request received from the Server.
.RE
.LP
The job is located by calling
.I find_job() .
The routine
.I start_checkpoint() 
is called to checkpoint the job if that is supported.
If checkpoint is not supported, start_checkpoint() will return
.Er PBSE_NOSUP .
.Fn message_job()
.Cs
int message_job(job *pjob, enum job_file jft, char *text)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job.
.IP jft
indicates which file: StdOut or StdErr.
.IP text
to write on the file.
.RE
.IP Returns:
A PBS error number or zero.
.LP
This routine is used to write a message onto either the standard output or
standard error file of the job.  The file is opened by calling
.I open_std_file() .
If the last character of the supplied text is not a new-line character,
one is appended.
.Fn req_messagejob()
.Cs
void req_messagejob(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Message Job Request received from the Server.
.RE
.LP
A flag is set according to which file the message is to be written, the
default is the job's standard output.
The job structure is located by calling 
.I find_job() .
The message from the request is passed to
.I message_job() .
.Fn req_modifyjob()
.Cs
void req_modifyjob(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Modify Job Request received from the Server.
.RE
.LP
.I attr_atomic_set()
is called to decode the resource limits (and attributes). 
Each is updated in the job structure.
If a resource list (limit) item is changed, 
.I mom_set_limits()
is called with the mode parameter of 
.Sc SET_LIMIT_ALTER
to update the limits.
Any errors are returned to the server.
.LP
.Fn req_shutdown()
.Cs
void req_shutdown(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Shutdown Request received from the Server.
.RE
.LP
.QP
Currently does nothing from a lack of time and understanding of how to do it.
.LP
.Fn req_signaljob()
.Cs
void req_signaljob(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Signal Job Request received from the Server.
.RE
.LP
This function forwards a signal from the server to the running job.
The signal is a numeric or alphanumeric string with or without the
prefix \*QSIG\*U.
.LP
Two names are treated specially under Unicos.  If 
.Sc SIG_SUSPEND
(
.Ty suspend )
is received, a running job is to be suspended via the system call suspend(2).
This is done within the Cray specific routine
.I cray_susp_resum()
which is called with the argument
.Ar which
set to 1 to indicate a suspend should be performed.
If 
.Sc SIG_RESUME
(
.Ty resume )
is received, a suspended job is to be resumed via the system call resume(2).
This is also done by
.I cray_susp_resum()
where
.Ar which 
is set to 0 to indicate a resume.
.LP
Otherwise, if the first character is numeric, the string is
converted into a number. 
If the signal is a string, it is taken to be a signal name, and the SIG prefix,
if present, is stripped.  The name is converted into a numeric value via
a table, 
.I sig_tbl
located in 
.I mom_start.c .
The table is system dependent in order to correctly map the varying
signal names.
The signal value is issued to the job by calling
the routine
.I kill_job() .  
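.LP
The conversion from signal string to number might be sketched as follows.
The table sig_names here is a tiny hypothetical stand-in for the system
dependent sig_tbl, and sig_from_string() is an illustrative name.  The
Unicos suspend and resume names are not handled.

```c
#include <ctype.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* A tiny stand-in for sig_tbl; the real table is system dependent. */
static const struct {
    const char *name;
    int num;
} sig_names[] = {
    { "HUP", SIGHUP }, { "INT", SIGINT },
    { "KILL", SIGKILL }, { "TERM", SIGTERM }
};

/* Return the signal number for the string, or -1 if it is unknown. */
static int sig_from_string(const char *s)
{
    size_t i;

    if (isdigit((unsigned char)s[0]))
        return atoi(s);                 /* numeric form */
    if (strncmp(s, "SIG", 3) == 0)
        s += 3;                         /* strip the optional prefix */
    for (i = 0; i < sizeof(sig_names) / sizeof(sig_names[0]); i++)
        if (strcmp(s, sig_names[i].name) == 0)
            return sig_names[i].num;
    return -1;
}
```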
.LP
If the signal being sent to the job is
.Sc SIGKILL
and the job is not in substate
.Sc JOB_SUBSTATE_RUNNING ,
we have troubles as the server would not have passed on the signal job request
unless it thought the job was running.  So if MOM believes the job is not 
running, we mark it as
.Sc JOB_SUBSTATE_EXITING
and set
.Ar exiting_tasks
to cause a (new) Job Obit notice to be sent to the server.  This was added
in response to bug #779, in which a job had no live processes but the server thought
it was still running.   Sending a signal (via qdel) did nothing because
no process would die to generate a SIGCHLD to mom to cause her to (re)issue
the obit notice.   This fix (spelled "kludge") will cause the obit notice.
.LP
Another \*Qkludge\*U is the SIGNUL and no processes found check.   If 
.I kill_job()
returns zero, then no processes were found that were part of the job, hence the
job should have exited.  We use SIGNUL because of timing between the server and
mom \- the server may well send SIGKILL after a prior SIGTERM because it didn't
receive the Obit notice in time, but the job may in fact have exited.
Here there would not be any processes alive and we do not wish to trigger the
recycle or log the nominal case.
.Fn req_stat_job()
.Cs
void req_stat_job(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Status Job Request received from the Server.
.RE
.LP
If a null job id was passed in the request, the request is for status
of all jobs, otherwise it is for the one identified job.
For each job for which status is to be returned to the server,
.I mom_set_use()
is called to update the latest resource usage figures.
Attributes are returned to the server only if they are modified, that is, if
.Sc ATR_VFLAG_MODIFY
is set.   This is done to keep down traffic and
make sure Mom doesn't update any attribute she shouldn't.  The attributes
returned are: JOB_ATR_resc_used, JOB_ATR_errpath, JOB_ATR_outpath and
JOB_ATR_session_id.
The resources used are calculated with a special case for "cput" and
"mem".  These are added with the polled information from any sisterhood
to give a total for all nodes in the job.
.Fn del_files()
.Cs
static int del_files(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Copy/Delete File Request received from the Server.
.RE
.LP
This is a local support function.  Before being called the following two
things must be true: (1) the external variables
.Av useruid ,
.Av usergid ,
.Av ngroup 
(the number of supplementary groups), and
.Av groups
(the supplementary group array)
must be set, and (2) the current working
directory must be the owner's home directory.  Both of these are
accomplished by first calling 
.I fork_to_user() .
.LP
Each file contained in the basic Delete (Copy) Files Request is deleted from
MOM's spool directory (or the user's home directory) by unlinking it if the file
is not a standard (output or error) file and if the remote and local names
in the request point to different files.  If the remote file is actually local
as determined by
.I local_or_remote() ,
and if 
.I is_file_same()
indicates the two names point to the same file, then the file is not deleted
as it was here before the job started.  We only delete files staged in or
staged out to a different file.   For example, if the user said to stage in 
a file from foo:/tmp/bar to /tmp/bar and if the job ran on host foo, the file
should not be deleted.
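.LP
One common way to decide whether two names point to the same file is to
compare device and inode numbers.  The sketch below shows that technique
under the assumption that both names are reachable from the local host; it is
not taken from
.I is_file_same() ,
and the function name
.I same_file_sketch()
is hypothetical.
.Cs
```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Return 1 if both paths name the same file (same device and inode),
 * 0 if they differ or if either path cannot be examined.
 */
static int same_file_sketch(const char *path1, const char *path2)
{
    struct stat sb1;
    struct stat sb2;

    if (stat(path1, &sb1) != 0 || stat(path2, &sb2) != 0)
        return 0;
    return (sb1.st_dev == sb2.st_dev) && (sb1.st_ino == sb2.st_ino);
}
```
.Ce
.LP
Treating a name that cannot be examined as "not the same" errs toward
deleting the staged copy, which matches the intent of removing only files that
the job's staging created.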
.LP
If the file is marked as being a 
.I "standard job file" ,
meaning output, error, or checkpoint, the unlink is done as root.
These files must be listed first by the server because, once the process
changes to act with the user's level of privilege, it cannot go back.
If the file is not marked as a standard job file, it is likely one that was
spooled in, and it is unlinked as the user.  This prevents the removal of files
owned by a different user.
.Fn req_rerunjob()
.Cs
void req_rerunjob(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Rerun Job Request received from the Server.
.RE
.LP
This request is sent to MOM by the server to tell MOM the job is being
rerun and the so-called
.I "standard files"
(output, error, and checkpoint) should be returned to the server for
safe keeping.
This happens after the server has killed the job by sending a request for
a SIGKILL signal.
As the file return will require multiple requests back to the server,
.I fork_me()
is called to create a child process.  The password entry is found for the
uid under which the job was executed.
A new connection is opened by the child to the server which owns the job,
.I client_to_svr() .
The local function 
.I return_file()
is called three times, once each for output, error, and checkpoint.
.Fn req_cpyfile()
.Cs 
void req_cpyfile(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Copy Files Request received from the Server.
.RE
.LP
The Copy Files request is sent by the server to MOM after the job has
terminated.  It directs MOM to deliver the files to their destinations.
It may also be sent to stage in a file before the job is sent to MOM.
The routine
.I fork_to_user()
is called to fork a child, set up the external variables
.Av useruid ,
.Av usergid ,
.Av ngroup
(the number of supplementary groups), and
.Av groups
(the supplementary group array),
and change the current working
directory to the owner's home directory.
The child process sets its real and effective uid and gid and supplementary
groups to that of the user.
Then for each file listed in the request, the destination is parsed.
.LP
If the destination host name (the part before the colon, see
.I local_or_remote() )
is the same as the host on which MOM is running, the child sets up to do
a local copy using the
.Ty /bin/cp
command.
For a different host, the child will set up to do a remote copy using the
.Ty pbs_rcp
command.  The copy (cp or rcp) command is built.  If the file does not
exist, no attempt to copy is made and no error is returned.
.LP
If the file is local and if the source and destination are the same file,
.I is_file_same() ,
then the copy operation is skipped.
Otherwise, the 
.I sys_copy()
function, see below, is used to issue the copy command.
If the return is zero, it is assumed the copy was successful.
That file is then deleted, but only if it was being
copied outbound from MOM's spool directory.
.LP
If the sys_copy call returns an error, then we assume the copy failed.
If the copy was to stage in files, any prior file in the request, that
was successfully copied, is deleted to prevent unused files from being
left lying around as the job will not be run.  If MOM was copying a file
outward from her spool area at the time of the failure, then that file
is relinked (moved) into an
.I undelivered
directory.  It is up to the administrator to deal with any files in that
directory.  Any copy failure results in a log message with an 
.I "event class"
of \*QFile\*U.
The information is relayed back to the server so that the server can send
mail to the user.
.Fn sys_copy()
.Cs
static int sys_copy(char *ag0, char *ag1, char *ag2, conn)
.Ce
.IP Args: 4
.RS
.IP ag0
Argument zero of the copy, the copy command.   It must be a full path name.
.IP ag1
The source file name; it should be fully qualified.   If a remote file, it
should be of the form:  user@host:/full/path/name
.IP ag2
The destination file name; it should be fully qualified.   If a remote file, it
should be of the form:  user@host:/full/path/name
.RE
.IP Returns: 4
The result code:
.RS
.IP 0 6
Successful copy.
.IP 13
Exec() of copy program failed.
.IP 10xxx
Fork() failed.  xxx is the system error number.
.IP 20xxx
Error on wait(), xxx is the system error number.
.IP 30xxx
The copy process was stopped, xxx is the stop signal.
.IP 40xxx
The copy process was killed with a signal, xxx is the signal.
.RE
.LP
Sys_copy will attempt to fork() and exec() the copy program up to 4 times
with a 15 second delay between each try.   Any failures are logged and if
all four attempts fail, the error value described above is returned.
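.LP
The retry loop and the result codes listed above can be sketched as follows.
This is an illustration under stated assumptions, not the body of
.I sys_copy() :
the name
.I copy_sketch()
and the
.Ar delay
parameter are hypothetical, the connection argument is omitted,
a non-zero exit status of the copy program is returned unchanged,
and a stopped child is simply killed and reaped.
Logging is left out.
.Cs
```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define COPY_TRIES 4        /* attempts, per the description above */

/*
 * Run "ag0 ag1 ag2" in a child and map the outcome to a result code.
 * delay is the pause in seconds between tries (15 in the text above).
 */
static int copy_sketch(char *ag0, char *ag1, char *ag2, unsigned int delay)
{
    int   rc = 0;
    int   status;
    int   try;
    pid_t pid;

    for (try = 0; try < COPY_TRIES; try++) {
        if (try > 0)
            sleep(delay);
        pid = fork();
        if (pid == -1) {
            rc = 10000 + errno;             /* fork() failed */
            continue;
        }
        if (pid == 0) {
            execl(ag0, ag0, ag1, ag2, (char *)NULL);
            _exit(13);                      /* exec() failed */
        }
        if (waitpid(pid, &status, WUNTRACED) == -1) {
            rc = 20000 + errno;             /* error on wait */
            continue;
        }
        if (WIFEXITED(status)) {
            rc = WEXITSTATUS(status);       /* 0 means the copy worked */
        } else if (WIFSTOPPED(status)) {
            rc = 30000 + WSTOPSIG(status);  /* copy process was stopped */
            kill(pid, SIGKILL);             /* sketch only: discard it */
            waitpid(pid, &status, 0);
        } else if (WIFSIGNALED(status)) {
            rc = 40000 + WTERMSIG(status);  /* copy process was killed */
        }
        if (rc == 0)
            break;                          /* success, stop retrying */
    }
    return rc;                              /* code of the last attempt */
}
```
.Ce
.LP
A caller would pass the full path of the copy command as the first argument
and 15 as the delay.  Encoding the failure class in the thousands digit lets
the caller log both the kind of failure and the errno or signal number from a
single integer.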
.Fn req_delfile()
.Cs
void req_delfile(struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP request
pointer to the Delete File Request received from the Server.
.RE
.LP
This request is sent to MOM by the server to tell MOM to delete job related
files, see 
.I on_job_end()
and
.I on_job_rerun()
in server/req_jobobit.c.
.LP
The request uses the same structure as the Copy File Request.  Only the
local file name is used.  As with that request, the first thing is to call
.I fork_to_user()
to create a child process and change to the user's home
directory.   However, at this level, the child retains root privileges
for the time being.   For each file in the list,
.I del_files()
is called to delete the file.
.Fn start_checkpoint()
.Cs
int start_checkpoint(job *pjob, int abort, struct batch_request *preq);
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job.
.IP abort
If non-zero, the checkpoint call is to abort the running processes.
.IP preq
If non-null, points to the request from the server;  if null, this is an
internal call.
.RE
.IP Returns: 4
Zero if checkpointing is supported and succeeded, 
.Er PBSE_NOSUP
if checkpoint is not supported, and other non-zero error returns
for errors.
.LP
On the Cray, the checkpoint call may wait for two minutes to start if the
processes are swapped out.  The actual checkpoint call must therefore be done
by a child process to keep from locking up MOM for that time.
.LP
This function first calls a machine dependent routine,
.I mom_does_chkpnt (),
to determine if checkpointing is supported.  If it is, a child process is
forked to call the routine
.I mom_checkpoint_job ()
to do the checkpoint.  If the checkpoint is being performed at the request
of the server, 
.Av preq
points to the request;  the child process will reply based on the return of
mom_checkpoint_job().  The parent (original MOM) will set the function
post_chkpt() to be called when the child is done with the checkpoint.
Also, if
.I abort
is set, the flag
.I MOM_CHKPT_ACTIVE
is turned on so dying tasks don't cause obit messages to be sent.
.Fn mom_checkpoint_job()
.Cs
int mom_checkpoint_job(job *pjob, int abort)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job
.IP abort
kill the job if TRUE.
.RE
.LP
Form the pathname for the checkpoint directory of the job.  If one
already exists, rename it with the postfix ".old".  Create a new
checkpoint directory.  On the Cray, check to see if the job is suspended
and if
.I abort
is set.  If so, resume the job first so the job will be "Q"ueued and then
back into "R"unning when restarted.  For each task in the job, call
the machine dependent routine
.B mach_checkpoint
to checkpoint each task.  If any checkpoints fail and
.I abort
is set, return the error
.I PBSE_CKPSHORT .
If any checkpoints fail but
.I abort
is not set, just remove the checkpoint directory and, if there
is an old directory, rename it to the original name.
.Fn post_chkpt()
.Cs
void post_chkpt(job *pjob, int ev)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job
.IP ev
error value
.RE
.LP
This function is called from scan_for_terminated() when found in ji_mompost
to clean up after a checkpoint.
We save the value of the flag
.I MOM_CHKPT_ACTIVE
so we can tell if the checkpoint was being done with abort.  The flag
.I MOM_CHKPT_ACTIVE
is turned off and the value of
.I ev
is checked to see if there was an error.  If no error occurred, turn on
the flag
.I JOB_SVFLG_CHKPT
and return.  If an error took place, but the checkpoint was not done
with abort, just return.  Otherwise, turn on the flag
.I MOM_CHKPT_POST
and loop through the job's checkpoint directory looking for checkpoint
images for tasks.  For each checkpoint image found, look for the
corresponding task structure and set the task flag
.I TI_FLAGS_CHKPT .
Then set exiting_tasks so we call scan_for_exiting().
.Fn cray_susp_resum()
.Cs
static void cray_susp_resum(job *pjob, int which, struct batch_request *request)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job
.IP which
1 for suspend, 0 for resume.
.IP request
Pointer to the signal job request.
.RE
.LP
Under Unicos, the system functions suspend() and resume() can take a while,
up to 120 seconds, and MOM cannot afford to sit and wait.   Therefore, the
functions are performed by a child process.
.LP
MOM, the parent, still needs to know whether the operation succeeded or failed
in order to update the job structure.  
When MOM forks, the parent records in the job structure the pid of the child
process,
.Ar ji_momsubt
(for subtask), and a pointer to a post processing function,
.Ar ji_mompost ,
which is called when the child process exits.
.LP
For suspend, the post processing function is
.I post_suspend()
located in unicos8 or unicosmk2 mom_start.c.
MOM also notes the current time in
.Ar ji_momstat
in the job structure.  This is needed to adjust the walltime used when
the job is resumed.
.LP
The child process performs the suspend() or resume() system call and
then acknowledges or rejects the original request from the server.
The system call may be retried up to 3 times if it returns EAGAIN or
EINTR.  If the system call does not return an error, the child exits with
zero; it exits with 1 if there was an error.  See scan_for_terminated() in
unicos8 or unicosmk2 mom_start.c.
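.LP
The retry rule for the child can be sketched as below.
The sketch assumes that the system call returns -1 and sets errno on failure
and reads "up to 3 times" as three attempts in total; both are assumptions.
The names
.I retry_syscall_sketch()
and
.I flaky_op()
are hypothetical, and
.I flaky_op()
is only a stand-in for suspend() or resume() so the example can be run on any
system.
.Cs
```c
#include <errno.h>

#define SUSP_TRIES 3    /* assumption: three attempts in total */

/*
 * Call op(arg) until it succeeds, fails with an error other than
 * EAGAIN or EINTR, or the attempts run out.  Returns the exit value
 * the child would use: 0 on success, 1 on error.
 */
static int retry_syscall_sketch(int (*op)(int), int arg)
{
    int try;

    for (try = 0; try < SUSP_TRIES; try++) {
        if (op(arg) != -1)
            return 0;
        if (errno != EAGAIN && errno != EINTR)
            break;                      /* not a transient error */
    }
    return 1;
}

/* Number of times flaky_op() will still fail before it succeeds. */
static int flaky_failures_left;

/* Stand-in for suspend()/resume(): fails with EAGAIN, then succeeds. */
static int flaky_op(int arg)
{
    (void)arg;
    if (flaky_failures_left-- > 0) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```
.Ce
.LP
With two pending failures the third attempt succeeds and the result is 0;
with three pending failures every attempt fails and the result is 1, which is
the value the parent would see as the child's exit status.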
.NH 4
.Fi prolog.c
.LP
The file
.I src/resmom/prolog.c
contains the various functions to support administrator supplied
prologue and epilogue scripts.  These scripts are run with root privilege
before and after the user's job.
.LP
The prologue script arguments (argv) are:
.IP argv[1]
The job ID.
.IP argv[2]
The user's name.
.IP argv[3]
The user's group name.
.LP
The epilogue arguments are the above plus:
.IP argv[4]
The Job Name.
.IP argv[5]
The Session ID.
.IP argv[6]
The list of requested resource limits, attribute 
.At Resource_List  .
.IP argv[7]
The list of resources used, attribute
.At resources_used .
.IP argv[8]
The name of the queue in which the job resides.
.IP argv[9]
The account string (qsub -A option) if it is set.
.LP
The input file to the script is architecture dependent, see pe_input().
The script's standard output and standard error are connected to the files
which are the output and error of the job.  One exception being when the
job is interactive, the output and error are closed before the epilogue
is run, hence the epilogue is connected to /dev/null for output and error.
.Fn pelog_err()
.Cs
static int pelog_err(char *file, int error, char *text)
.Ce
.IP Args: 4
.RS
.IP file
name of prologue/epilogue script.
.IP error
number to record in log.
.IP text
message to record in log.
.RE
.IP Returns: 4
The error number is returned.
.LP
This function records an error number and text message in MOM's log when
the prologue or epilogue fails.
.Fn pelogalm()
.Cs
static void pelogalm()
.Ce
.LP
This function is the SIGALRM handler for prolog.c.  When the alarm set 
around the prologue or epilogue script times out before the script
completes, this function is called.  It kills the child running the script
and sets the script exit to -4.
.Fn run_pelog()
.Cs
int run_pelog(int which, char *file, job *pjob, int type)
.Ce
.IP Args: 4
.RS
.IP which
script to run, 
.Sc IP_PROLOGUE
or 
.Sc IP_EPILOGUE .
.IP file
name of script to execute.
.IP pjob
pointer to job structure.
.IP type
of operation to connect to output/error.
.RE
.IP Returns: 4
The exit status of the script.
.LP
This is the heart of the prologue/epilogue processing.
If the script file does not exist, there is no action performed and it is
not considered an error.
Before the script (which may be an executable binary) is executed, the
following checks are made to ensure that it is \*Qsafe\*U to execute the
script:
.IP \(bu
The file must be owned by root.
.IP \(bu
The file must be a regular file.
.IP \(bu
The file must be readable and executable by root (the owner).
.IP \(bu
The file must not be writable by anyone other than root.
.LP
The system dependent input file is opened by calling
.I pe_input() .
If an error occurs, it is logged,
.I pelog_err() ,
and the error returned.
A child is forked, inheriting the null environment from MOM.
The parent process sets an alarm to prevent the child from taking forever.
The parent then waits for the child to complete.
When it does, the exit status is returned.
.LP
If the output operation type is
.Sc PE_IO_TYPE_NULL ,
/dev/null is opened for both standard output and standard error.
This is done when running the epilogue for an interactive job because
the pseudo terminal has already been lost.
If the output operation type is
.Sc PE_IO_TYPE_STD ,
the standard output and error files of the job are opened and passed
to the script.  This is the case for the epilogue for normal jobs.
If the output operation type is
.Sc PE_IO_TYPE_ASIS ,
we go with the current file descriptors for 1 and 2.  When called to
run the prologue, the caller, 
.I finish_exec()
is already attached to the standard output and error of the job.
.NH 4
.Fi req_quejob.c
.LP
MOM borrows the receive job functions req_quejob(), req_jobcredential(),
req_jobscript(), req_rdytocommit(), and req_commit() from the server.
There are some differences created by \*Q#ifdef\ PBS_MOM\*U that should
be pointed out.  Additionally, MOM has her own version of req_mvjobfile().
.Fn req_quejob()
.LP
MOM requires that the request be from another daemon, the server.
Also MOM does not worry about \*Qqueues\*U.
.LP
If MOM finds the job being sent to her already exists, she sees if the
existing version is marked as a checkpointed job,
.Sc JOB_SVFLG_CHKPT
set in 
.Ar ji_svrflags .
If so, she keeps the existing version, but marks it as state
.Sc JOB_SUBSTATE_TRANSICM
for req_commit().  The server should not be sending a script or the \*Qready
to commit\*U requests.
.LP
For new jobs, MOM insists that the job owner attribute,
.At JOB_ATR_job_owner
be set by the server.
.LP
When decoding the job attributes, any error is fatal to the request.
Also, if the 
.Ar al_op
field in the received svrattrl structure is DFLT rather than SET, then
the attribute being passed (likely a resource_list entry) contains a value
set by the server rather than the user based on either a queue or server
.At resource_default
attribute (default value).  Under Unicos, the default value may be overridden
by the limit set in the user's User Data Base (UDB) entry.  This check of
DFLT is thus required.  If DFLT, then 
.Sc ATR_VFLAG_DEFLT
is set in the attribute (resource) structure at_flags member.  This flag will
be checked in 
.I mom_set_limits() ,
see src/resmom/unicos8/mom_mach.c, when limits are being actually set.
.Fn req_jobcredential()
.LP
The sender must be a server.
.Fn req_jobscript()
.LP
The sender must be a server.
.Fn req_mvjobfile()
.LP
This is MOM's own version.  The files are owned by the user and placed
in either the spool area or the user's home directory depending
on the compile option, see 
.I std_file_name ().
.Fn req_rdytocommit()
.LP
The sender must be a server.
.Fn req_commit()
.LP
The job is linked into the all job list, marked in state
.Sc JOB_STATE_RUNNING
and substate 
.Sc JOB_SUBSTATE_RUNNING ,
and the server's network address is saved in
.Ar ji_momt.ji_svraddr .
Then 
.I start_exec()
is called to place the job into execution.
On return, the job information is saved with a call to
.I job_save() .
Then the attributes
.At JOB_ATR_errpath ,
.At JOB_ATR_outpath ,
and
.At JOB_ATR_session_id
are marked as modified so their values will be returned (once) to the server in 
.I status_attrib() ,
see stat_job.c.
.NH 4
.Fi mom_comm.c
.LP
The file
.I
src/resmom/mom_comm.c
.R
groups together functions that deal with communication between MOM
and tasks requesting Task Management functions, and communication between MOMs
within a job acting as a sisterhood of nodes.  Some miscellaneous
functions are here for convenience.
.LP
.Fn save_task()
.Cs
int save_task(ptask)
.Ce
.IP Args: 4
.RS
.IP ptask
A pointer to the task structure representing the target task.
.RE
.IP Return: 4
.RS
.IP 0
if no errors take place.
.IP -1
if an error occurs.
.RE
.LP
This function is used to save the critical information associated with
a task to disk.
.LP
.Fn event_alloc()
.Cs
eventent *event_alloc(int com, nodeent *pnode, tm_event_t event, tm_task_id taskid)
.Ce
.IP Args: 4
.RS
.IP com
the command associated with the event.
.IP pnode
an entry in the nodeent array of the job to which the event belongs.
.IP event
the event number given by the requesting task or TM_NULL_EVENT if it
is an internally generated event.
.IP taskid
the task id of the requesting task.
.RE
.IP Return: 4
.RS
.IP "pointer to malloc'ed eventent structure"
.RE
.LP
This function will allocate an event and link it to the given nodeent entry.
.LP
.Fn task_create()
.Cs
task *task_create(job *pjob, tm_task_id taskid)
.Ce
.IP Args: 4
.RS
.IP pjob
a pointer to the job structure which the new task will join.
.IP taskid
the task id of the new task.
.RE
.IP Return: 4
.RS
.IP "pointer to malloc'ed task structure"
.RE
.LP
This function will allocate a task and link it to the given job.
If a limit for the number of tasks allowed to be created on
a single node exists for the job (taskspn), a NULL is returned if
the new task would go over the limit.
.LP
.Fn task_recov()
.Cs
int task_recov(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
a pointer to the job structure which is to have its tasks read from disk.
.RE
.IP Return: 4
.RS
.IP 0
if no error occurs.
.IP -1
on error.
.RE
.LP
Recover (read in) the tasks from their save files for a job.
This function is only needed upon MOM start up.
.LP
.Fn tm_reply()
.Cs
int tm_reply(int stream, int com, tm_event_t event)
.Ce
.IP Args: 4
.RS
.IP stream
the TCP stream to communicate with the user task.
.IP com
the command to send.
.IP event
the event number for the message.
.RE
.IP Return: 4
.RS
.IP "a DIS library error value"
.RE
.LP
Send a reply message to a user proc over a TCP stream.  The message
will have the protocol type (TM_PROTOCOL), followed by the version
(TM_PROTOCOL_VER), the command number then the event.
.LP
.Fn im_compose()
.Cs
int im_compose(int stream, char *jobid, char *cookie, int com, tm_event_t event, tm_task_id taskid)
.Ce
.IP Args: 4
.RS
.IP stream
the RPP stream to another MOM.
.IP jobid
the job id of the job this message concerns.
.IP cookie
the job cookie of the job.
.IP com
the command of the message.
.IP event
the event of the message.
.IP taskid
the task which this message concerns.
.RE
.IP Return: 4
.RS
.IP "a DIS library error value"
.RE
.LP
Send a reply message to another MOM over an RPP stream.  The message
will have the protocol type (IM_PROTOCOL), followed by the version
(IM_PROTOCOL_VER), the job id, cookie, command, event and then task id.
.LP
.Fn send_sisters()
.Cs
int send_sisters(job *pjob, int com)
.Ce
.IP Return: 4
count of messages sent.
.LP
Send a message (command = com) to all the other MOMs in the job pjob.
.LP
.Fn find_node()
.Cs
nodeent *find_node(job *pjob, int stream, tm_node_id nodeid)
.Ce
.LP
Check to see which node a stream is coming from.  Return a NULL
if it is not assigned to this job.  Return a nodeent pointer if
it is.
.LP
.Fn job_start_error()
.Cs
void job_start_error(job *pjob, int code)
.Ce
.LP
An error has been encountered starting a job.
Format a message to all the sisterhood to get rid of their copy
of the job.  There should be no processes running at this point.
.Fn stream_eof()
.Cs
void stream_eof(int stream, int ret)
.Ce
.IP Args: 4
.RS
.IP stream
an RPP stream that needs to be closed due to an error.
.IP ret
the DIS error which caused the problem.
.RE
.LP
Enter a loop to search through all the jobs looking for
.B stream .
We want to find if any events are being waited for
from the "dead" stream and do something with them.
If the stream is not found, just return.  If it is found,
enter a loop for the events being waited for.
For each event, check the command and execute code to
process an error for that type of request.
If the command is
.I IM_JOIN_JOB ,
call
.I send_sisters
to send an
.I IM_ABORT_JOB
to all the other MOMs to get rid of their copy of the job.
Then mark the job with JOB_EXEC_RETRY.
There should be no processes running at this point.
If the command is
.I IM_ABORT_JOB
or
.I IM_KILL_JOB ,
the job is already in the process of being killed but somebody has
dropped off the face of the earth.  Just check to see if everybody
has been heard from in some form or another and set JOB_SUBSTATE_EXITING if so.
If the command is a user request (such as IM_SPAWN_TASK), just inform
the requesting process.
If the command is
.I IM_POLL_JOB ,
mark the job to die.
If the stream turns out to come from Mother Superior, we are an orphan
and just kill the job.
.LP
.Fn im_request()
.Cs
void im_request(int stream, int version)
.Ce
.IP Args: 4
.RS
.IP stream
an RPP stream that has a message to read.
.IP version
the protocol version read by
.B do_rpp()
in mom_main.c.
.RE
.LP
Check that the
.B version
of the protocol is one we understand.  Make sure the address of
the incoming stream is from a host that is in our cluster.
Read the jobid, cookie, command, event and task and verify that
they are meaningful.  A large switch statement is entered with code
for each type of command.
.IP IM_JOIN_JOB
Make sure it is Mother
Superior calling.  Then read the node id to be assigned to me, the
number of nodes in the job and the node id for each node.  The job
attributes follow and are read by calling
.I decode_DIS_svrattrl() .
Send an IM_ALL_OKAY message back.
.LP
Anything other than IM_JOIN_JOB should be a request for a job
we know about so call
.I find_job
and send an error if we come up empty.  Make sure the cookie checks
out.  If the message is a reply to a request we sent (IM_ALL_OKAY or
IM_ERROR), look for the event that corresponds to the message.
.IP IM_KILL_JOB
Sender is mom superior commanding me to kill a job which I should be a part of.
Send a signal and set the jobstate to begin the
kill.  We wait for all tasks to exit before sending
an obit to mother superior.
.IP IM_SPAWN_TASK
Read the parent node id and the task id for the new task.  Next, read
strings until a zero length string.  These are the argv array for the
exec.  Finally, read strings until end of message.  These are the
environment variables.  Call
.I task_create
then send an IM_ALL_OKAY message back.
.IP IM_GET_TASKS
Sender is MOM which controls a task that wants to get
the list of tasks running here.  Read the node id of the sending node
and call
.I find_node()
to verify it is okay.  Send a reply with the task id of each task
running on the local node.
.IP IM_SIGNAL_TASK
Sender is MOM sending a task and signal to
deliver.  Read the node id of the sending node, the task id of the task
to signal and the signal number to deliver.  Call
.I kill_task
and send a reply back.
.IP IM_OBIT_TASK
Sender is MOM sending a request to monitor a task for exit.
Read the node id of the sending node and the task id of the task
to monitor.  Check to make sure the task is local.  If it has
already exited, send a reply with the exit status.  If it is
still running, generate an obitent structure and link it to
the task.
.IP IM_POLL_JOB
Sender is (must be) mom superior commanding me to send
information for a job which I should be a part of.
Reply with a flag which gives a "recommendation" as to whether
the job should be killed or not, followed by the cpu time and
memory usage of the tasks on the local node.
.IP IM_ABORT_JOB
Sender is (must be) mom superior commanding me to
abort a JOIN_JOB request.  Make sure it is Mother Superior calling,
then call
.I job_purge() .
.IP IM_GET_TID
This request is only sent to Mother Superior from a sub-mom to get a task
id.  Reply with a new task id for the job.
.LP
If the message received is a reply to one we sent, the event which
is being completed will have a command number.  Another switch statement
will be entered for a reply of either IM_ALL_OKAY or IM_ERROR.
A summary for IM_ALL_OKAY follows.
.IP IM_JOIN_JOB
I'm mother superior and the sender is one of the sisterhood saying
she got the job structure sent and she accepts it. 
Check to see if any other sisters still need to reply.  If not, call
.I finish_exec
to get the job going.
.IP IM_KILL_JOB
Sender is sending a response that a job
which needs to die has been given the ax.  Read the summed cpu time and
memory usage of the tasks on the node responding.  If no nodes
have a KILL_JOB request outstanding, set JOB_SUBSTATE_EXITING.
.IP IM_SPAWN_TASK
Sender is MOM responding to a spawn request.  Read the task
id of the new task and compose a message to the requesting task with
.I tm_reply .
.IP IM_GET_TASKS
Sender is MOM giving a list of tasks which she has started for this job.
Send a reply to the requesting task, reading task id's from the remote
MOM and writing them to the task.
.IP IM_SIGNAL_TASK
Sender is MOM with a good signal to report.  Just send a TM_OKAY
reply to the requesting task.
.IP IM_OBIT_TASK
Sender is MOM with a death report.  Read the exit value for the
task and compose a reply to the requesting task.
.IP IM_POLL_JOB
I must be Mother Superior for the job and this is a reply with job
resources to tally up.  Read the recommendation to kill or not kill
the job, the cpu time and memory sums for the sending node.
If the recommendation is true, mark the job to be killed.
.IP IM_GET_TID
Sender must be Mother Superior with a TID.
We should have a saved SPAWN request which corresponds to this TID
request.  Check to see if the SPAWN request needs to be forwarded to
another MOM.  If so, call
.I im_compose
with the new task id.  If the SPAWN is local, call
.I task_create
to launch the new task, then reply to the requesting task with the
SPAWN result.
.LP
The second type of reply which can come back from another MOM from
the sisterhood is IM_ERROR.  The type of request which is being replied to
determines what is to be done with the error.
.IP IM_JOIN_JOB
A MOM has rejected a request to join a job.
We need to send an ABORT_JOB to all the sisterhood
and fail the job start to the server.
I must be mother superior.  Call
.I job_start_error
with the error code sent with the reply.
.IP IM_ABORT_JOB
.IP IM_KILL_JOB
Both these requests indicate job cleanup failed on a sister.
Wait for everybody to respond then finish up.
I must be mother superior.
.IP IM_SPAWN_TASK
.IP IM_GET_TASKS
.IP IM_SIGNAL_TASK
.IP IM_OBIT_TASK
These are all requests which originate with a task.  Find the task
which needs to be informed of the error and call
.I tm_reply
to send it.
.IP IM_POLL_JOB
I must be Mother Superior for the job and
this is an error reply to a poll request.
The job needs to die so mark it to be killed.
.IP IM_GET_TID
Sender must be Mother Superior failing to
send a TID.
Send a fail to the task which called SPAWN.
.LP
.Fn tm_request()
.Cs
int tm_request(int fd, int version)
.Ce
.IP Args: 4
.RS
.IP fd
a file descriptor to read.
.IP version
the protocol version being sent.
.RE
.IP Return: 4
.RS
.IP -1
on error.
.IP 1
if no more data is available.
.IP 0
if more data is available.
.RE
.LP
Check that the source machine is localhost.
If reading the jobid, cookie, command, event and task all
work and make sense, the command is checked to see if it is TM_INIT.
If so, a reply is generated and sent and the function returns.
If the command is not TM_INIT, the node number where the requested
action will take place is read.  If the node number is part of the
job, a large switch statement is entered with code
for each type of command.  If the action node is not the local host,
a message to the remote action node will be composed and sent
and an event attached to that node's element in the job's node
array.
.NH 4
.Fi mom_server.c
.LP
The file
.I
src/resmom/mom_server.c
.R
groups together functions that deal with communication between a server
and MOMs composing a cluster.
This only includes a few message types, but will become more complex
as the scalability of the code improves.  Right now, no attempt is
made to deal with scale issues.
.LP
.Fn is_compose()
.Cs
int is_compose(int stream, int command)
.Ce
.IP Args: 4
.RS
.IP stream
the RPP stream to a server.
.IP command
the command of the message.
.RE
.IP Return: 4
.RS
.IP "a DIS library error value"
.RE
.LP
Send a reply message to a server over an RPP stream.  The message
will have the protocol type (IS_PROTOCOL), followed by the version
(IS_PROTOCOL_VER), and the command.
.LP
.Fn is_request()
.Cs
void is_request(int stream, int version)
.Ce
.IP Args: 4
.RS
.IP stream
an RPP stream that has a message to read.
.IP version
the protocol version read by
.B do_rpp()
in mom_main.c.
.RE
.LP
Check that the
.B version
of the protocol is one we understand.  Make sure the address of
the incoming stream is from the server.  The commands that
are recognized follow:
.LP
.IP IS_NULL
This is used to send a "ping" from the server to MOMs that
are not active.  No response is needed.
.IP IS_HELLO
The server wants us to send an IS_HELLO packet.  This is used
by the server to contact a MOM that was already up when the
server came up.  In this case, the server needs to initiate
communication with any MOM that has not been heard from.
MOM will send an IS_HELLO message back.
.IP IS_CLUSTER_ADDRS
This is a response to an IS_HELLO message.
It contains a list of IP addresses of the machines in the cluster.
They get added to the okclients binary tree.  Since IP addresses
do not get deleted from okclients, there is a problem if a server
is brought down and comes up again with a different node list.
If any MOMs stay up through this process, they will get the new
list added to the old.
.LP
.NH 3
.Tc Machine-dependent Files
.LP
Within the directory,
.I
src/resmom ,
.R
there is one subdirectory for machine dependent code for each class of
machine on which MOM runs.
The basic structure of the
machine-dependent code
is identical for each machine class.  Variations exist between systems as to how to accomplish
the required function.  The following sections will describe in machine
independent terms what function each common module performs.  Later sections
will address the machine dependent methods used where there is significant
difference from the \*Qcommon model.\*U
.NH 4
.Fi mom_mach.h
.LP
The file
.I
src/resmom/<machine>/mom_mach.h
.R
contains the machine-dependent macro definitions which are unique to
MOM.
It also contains the function prototypes for the equivalent
machine-dependent functions.
.NH 4
.Fi mom_mach.c
.LP
The file
.I
src/resmom/<machine>/mom_mach.c
.R
contains the machine-dependent source code which is unique to MOM and generally
relates to setting resource usage limits or determining resource usage by a job.
.Fn mom_set_limits()
.Cs
int mom_set_limits(job *pjob, int set_mode)
.Ce
.IP Args: 4
.RS
.IP pjob
A pointer to the job structure representing the target job.
.IP set_mode
specifies if this call is for the initial setting of limits or for altering
existing limits.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
This function recognizes all resources controlled on the machine
and tests their values for sanity.
.LP
If 
.Ar set_mode
is 
.Sc SET_LIMIT_SET ,
then it also sets the limits for the designated
job to the limits specified by the job's resources.  In this case,
the function assumes that it is called from the child process before
execing the job's shell.
Additionally, it assumes that it is running as root and has access to the job's
standard error file handle.
.LP
If
.Ar set_mode
is
.Sc SET_LIMIT_ALTER ,
this function is being called to test and, on some systems (the Cray), alter
the limits.  Systems which use the BSD-based setrlimit() call cannot alter
kernel-enforced limits because setrlimit() assumes the limits are
being set for the current process.  In the alter case on such systems, the main MOM cannot
alter the job limits, only check them.  The Cray's limits() call does allow
another session's limits to be set.
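.LP
The restriction can be seen in the shape of the POSIX setrlimit() interface
itself.
The following is a minimal sketch, not MOM's code; the helper name
set_own_limit() is hypothetical.
.Cs
#include <sys/resource.h>

/*
 * Hypothetical helper: set the soft limit for one resource on the
 * calling process.  No argument names another process or session,
 * so a parent cannot use this call to change a running child.
 */
int set_own_limit(int resource, rlim_t value)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && value > rl.rlim_max)
        return -1;          /* would exceed the hard limit */
    rl.rlim_cur = value;
    return setrlimit(resource, &rl);
}
.Ce
.LP
This is why the set case must run in the child process before the job's
shell is exec-ed.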
.LP
In case of an error, in addition to returning a code defined in
pbs_errno.h, the function puts an error message on standard error.
.LP
For implementation of this function on machine X for all values of X 
except unicos8 or unicosmk2,
the method is to validate the limit value for all supplied
resource limits, including default values, and set that limit if valid.
If a limit is not supplied, the limit is set to unlimited.
.LP
For the different types of Unicos,
the limits declared in the User Data Base (UDB) must also be
considered, so things are done a bit differently.  A private function
.Ix which_limit() 
chooses the real limit based on:
.IP 1.
If a limit is specified by the user, not a server default, and if less than
the UDB limit, the user supplied limit is set.
If the user limit is greater than the UDB limit, the job is aborted \-
why waste cycles since it likely would fail assuming the user specified the
limit correctly.
.IP 2.
If a default limit was established by the server (the user didn't supply one),
then the lesser of the server default and the UDB limit is the true limit.
.IP 3.
If no limit was supplied, the UDB limit is used.
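.LP
The three rules can be sketched as follows.
This is not the actual which_limit() code; the name choose_limit(), the
enumeration, and the use of long values are assumptions made for the
example.
The text does not say what happens when the user limit equals the UDB
limit; the sketch assumes that case is acceptable.
.Cs
enum limit_source { LIMIT_NONE, LIMIT_SERVER_DEFAULT, LIMIT_USER };

/*
 * Hypothetical stand-in for the private which_limit() routine.
 * Stores the limit to enforce in *result and returns 0, or returns
 * -1 if the job should be aborted.
 */
int choose_limit(enum limit_source src, long requested, long udb,
                 long *result)
{
    switch (src) {
    case LIMIT_USER:
        if (requested > udb)
            return -1;          /* rule 1: exceeds UDB, abort */
        *result = requested;    /* rule 1: within UDB */
        return 0;
    case LIMIT_SERVER_DEFAULT:
        *result = (requested < udb) ? requested : udb;  /* rule 2 */
        return 0;
    case LIMIT_NONE:
    default:
        *result = udb;          /* rule 3 */
        return 0;
    }
}
.Ce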
.LP
.Fn mom_do_poll()
.Cs
int mom_do_poll(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
A pointer to the job structure representing the target job.
.RE
.IP Return: 4
.RS
.IP 0
if the job has no machine-dependent resource limits which require polling.
.IP 1
if at least one machine-dependent resource limit requires polling.
.RE
.LP
This function is called by MOM before forking the job as a child.
It tells MOM whether it will be necessary to poll the job's condition
to determine if any specified resource limit has been exceeded.
.Fn mom_open_poll()
.Cs
int mom_open_poll()
.Ce
.IP Args: 4
.RS
.IP
None.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
This routine's purpose is to establish a connection with kernel data
structures which will be used in job resource use polling cycles.
.Fn mom_get_sample()
.Cs
int mom_get_sample()
.Ce
.IP Args: 4
.RS
.IP
None.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
If there is at least one machine-dependent resource to be polled for
at least one job, this routine is called before each MOM resource
limit polling cycle.
It samples the state of every job on the system in preparation for
job-by-job polling.
.Fn mom_over_limit()
.Cs
int mom_over_limit(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
A pointer to the job structure representing the target job.
.RE
.IP Return: 4
.RS
.IP 0
if polling the state of the job shows that all resource consumption is
within limits.
.IP 1
if polling the state of the job reveals that it has exhausted a
controlled resource.
.RE
.LP
MOM's job polling loop calls this routine to see if the specified job
is over its limits.
The function returns a logical value telling whether to kill the job.
.Fn mom_set_use()
.Cs
int mom_set_use(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
A pointer to the job structure representing the target job.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
This function sets the job's resources_used attribute to the
list of resources and amounts used so far by the job.
Any call to this function must appear in the execution order between a
call to mom_open_poll and mom_close_poll and must come after the job's
session_id attribute is defined.
The values it inserts into the resc_used attribute reflect
conditions at the last mom_get_sample call.
Note that the attribute
.At JOB_ATR_resc_used
is marked as modified,
.Sc ATR_VFLAG_MODIFY
set, on each call so the latest information will be returned to the server,
see status_attrib().
.LP
On the Cray (unicos8 or unicosmk2), if 
.Sc JOB_SVFLG_Suspend
is set in 
.Av ji_svrflags ,
walltime is not updated as the job is suspended.
.Fn mom_close_poll()
.Cs
int mom_close_poll()
.Ce
.IP Args: 4
.RS
.IP
None.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
Close polling connections to the kernel.
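.LP
The order in which the polling routines are used can be sketched as
follows.
This is an illustration only: the job structure is a small placeholder,
the sk_ functions are stubs standing in for the machine-dependent routines,
and bracketing a single cycle with the open and close calls is an
assumption made for brevity (the text only requires that mom_set_use be
called between them).
.Cs
/* Placeholder job record; the real structure is far larger. */
typedef struct job {
    int needs_poll;   /* what mom_do_poll() would have reported */
    int usage;        /* pretend resource consumption */
    int limit;        /* pretend resource limit */
    int killed;       /* set when the loop decides to kill the job */
} job;

/* Stubs standing in for the machine-dependent routines. */
static int sk_open_poll(void)       { return 0; }
static int sk_get_sample(void)      { return 0; }
static int sk_over_limit(job *pjob) { return pjob->usage > pjob->limit; }
static int sk_set_use(job *pjob)    { (void)pjob; return 0; }
static int sk_close_poll(void)      { return 0; }

/*
 * One polling cycle over njobs jobs.  Returns the number of jobs
 * marked to be killed, or -1 if sampling could not be done.
 */
int poll_cycle(job *jobs, int njobs)
{
    int i, nkill = 0;

    if (sk_open_poll() != 0)
        return -1;
    if (sk_get_sample() != 0) {     /* sample once per cycle */
        sk_close_poll();
        return -1;
    }
    for (i = 0; i < njobs; i++) {
        if (!jobs[i].needs_poll)
            continue;
        sk_set_use(&jobs[i]);       /* reflects the last sample */
        if (sk_over_limit(&jobs[i])) {
            jobs[i].killed = 1;
            nkill++;
        }
    }
    sk_close_poll();
    return nkill;
}
.Ce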
.Fn mom_does_chkpnt()
.Cs
int mom_does_chkpnt()
.Ce
.IP Returns: 4
True (1) if checkpoint supported, false (0) if not.
.LP
This routine is machine dependent.  The PBS_MACH types unicos8, unicosmk2 and
irix6array currently return true.
.Fn mach_checkpoint()
.Cs
int mach_checkpoint(job *pjob, int abort)
.Ce
.IP Args: 4
.RS
.IP pjob
A pointer to the job structure representing the target job.
.IP abort
A logical value specifying whether the job should be aborted after the
checkpoint has been taken.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
If checkpointing is not supported on the machine, mach_checkpoint does nothing
and returns
.Sc PBSE_NOSUP .
Otherwise, for those machine types that support checkpoint,
this function causes the designated
job to be checkpointed into the restart file named in the job structure.
If the checkpoint file already exists from a prior checkpoint, the file
is renamed.
Additionally, if
.Ar abort
is true, the job is killed if the checkpoint succeeds.
.LP
If the checkpoint succeeds, any old checkpoint file is unlinked.
If the checkpoint fails, the new file is unlinked and the old file is renamed
back to its original name.
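.LP
The handling of the old and new checkpoint files can be sketched as
follows.
This is not the machine-dependent code itself: the function names, the
callback interface, and the ".old" suffix used for the set-aside file are
all assumptions made for the example.
.Cs
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical wrapper.  do_ckpt is a caller-supplied function that
 * writes the checkpoint image to path and returns 0 on success.
 */
int checkpoint_with_backup(const char *path, int (*do_ckpt)(const char *))
{
    char oldpath[1024];
    int  have_old = 0;
    int  rc;

    if (strlen(path) + sizeof(".old") > sizeof(oldpath))
        return -1;
    strcpy(oldpath, path);
    strcat(oldpath, ".old");

    /* Set aside the image from a prior checkpoint, if any. */
    if (access(path, F_OK) == 0) {
        if (rename(path, oldpath) != 0)
            return -1;
        have_old = 1;
    }

    rc = do_ckpt(path);
    if (rc == 0) {
        if (have_old)
            (void)unlink(oldpath);          /* success: drop old image */
    } else {
        (void)unlink(path);                 /* failure: drop new image */
        if (have_old)
            (void)rename(oldpath, path);    /* and restore the old one */
    }
    return rc;
}

/* Demonstration callbacks: one writes a file, one always fails. */
int demo_ckpt_ok(const char *path)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    fputs("image", fp);
    return fclose(fp);
}

int demo_ckpt_fail(const char *path)
{
    (void)path;
    return -1;
}
.Ce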
.Fn mach_restart()
.Cs
int mach_restart(task *ptask, char *file)
.Ce
.IP Args: 4
.RS
.IP ptask
A pointer to the task structure for the task being restarted.
At the current time, only irix6array/mom_mach.c uses this parameter.
.IP file
the path of the task's checkpoint image.
.RE
.IP Return: 4
.RS
.IP 0
if success.
.IP non-0
an error code defined in pbs_errno.h.
.RE
.LP
If checkpointing is not supported on the machine, mach_restart does nothing
and returns
.Sc PBSE_NOSUP .
Otherwise, the system's restart call along with any other required supporting
code is executed.
.Fn kill_task()
.Cs
int kill_task(task *ptask, int signal)
.Ce
.IP Args: 4
.RS
.IP ptask
pointer to task structure.
.IP signal
to send to the running task.
.RE
.IP Returns:
Number of processes killed, i.e. zero if no processes belonging to the task
were found.
.LP
This function sends the specified signal to each process which is a member
of the task's session.  At the current time, most systems do not support a
direct method of signaling the members of a session, only a single process
or the members of a process group.  There may be more than one process group
active within the job session.
.LP
If the task session number
is less than or equal to one (1), the function returns without doing anything:
either the session has not yet been established, it has already been
signaled, or the session no longer exists for other reasons.
.LP
The function
.I mom_get_sample() 
is called to update the job usage attribute with the latest information.
.LP
For most systems, the method of signaling the session's processes is to
walk the process table looking for any process which is a member of the
session.  If it is, the system function kill() is called with the supplied
signal.
A count of the number of processes found and signaled is returned as the 
function return.
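.LP
The table walk can be sketched as follows.
Real process tables are system specific, so the proc_entry structure and the
function signal_session() are hypothetical; only kill() is a real system
call.
.Cs
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified process table entry. */
struct proc_entry {
    pid_t pid;
    pid_t sid;
};

/*
 * Hypothetical sketch: signal every process in the table whose
 * session matches sid.  Returns the number of processes signaled.
 */
int signal_session(const struct proc_entry *tab, int n, pid_t sid, int sig)
{
    int i, count = 0;

    if (sid <= 1)
        return 0;       /* no usable session, do nothing */
    for (i = 0; i < n; i++) {
        if (tab[i].sid != sid)
            continue;
        if (kill(tab[i].pid, sig) == 0)
            count++;
    }
    return count;
}
.Ce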
.LP
The Unicos OS does support a kill "job" system call, killm().  Using this call
saves walking the process table.   However, we can only return 1 or 0 for the
kill_task() value since we only know that zero or more than zero processes
existed in the  task.
.NH 4
.Fi mom_start.c
.LP
The file
.I src/resmom/<machine>/mom_start.c
contains machine dependent code dealing with placing a job into execution
and with post termination processing.
.Fn set_job()
.Cs
int set_job(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job structure
.RE
.IP Returns: 4
.RS
.IP zero 
if successful, non-zero if error.
.RE
.LP
This machine-dependent routine establishes a new session.  Typically, this is
done by calling setsid().
.LP
The Unicos version requires a bit extra because its concept of \*Qjob\*U is a bit
different.   Also, the batch bit must be set in a sesscnt() call.
.LP
Irix 6.x supports Project IDs, which are an accounting entity.  The project ID
for the job is set here.
.Fn set_globid()
.Cs
int set_globid(job *pjob, struct startjob_rtn *sjr)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job structure
.IP sjr
pointer to info returned from new job.
.RE
.LP
This machine-dependent routine sets a value for the job structure field
ji_globid.  If any kind of job management software can independently
track processes using a special identifier, that identifier can be formatted
into a string in this routine.  Otherwise, "none" is filled in.
.LP
Irix 6.x supports the Array Session Handle (ASH) which is formatted into
a hex number in the ji_globid field.  This allows any task spawned
to carry the same ASH.
.Fn set_mach_vars()
.Cs
void set_mach_vars(char *buffer, int space, char **environ, int enspace)
.Ce
.IP Args: 4
.RS
.IP buffer
a character buffer in which the various environmental variable strings are
placed.
.IP space
the amount of space (left) in the buffer.
.IP environ
a pointer to (the first available member of) an array of pointers to strings.
This array is included as the environment when the \*Qshell\*U is exec-ed.
.IP enspace
the number of unused members in the array 
.Av environ .
.RE
.LP
This function is provided in case there is a need for machine dependent
environment variables.  Any required string of the form \*Qkeyword=value\*U
should be placed into the
.Av buffer
provided it does not exceed the available
.Av space .
A pointer to the start of the string should be placed in
.Av environ . 
No more than 
.Av enspace 
- 1 variables should be added.  
.LP 
The 
.Av environ
array must be null terminated, which is all the default function does.
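.LP
A machine-dependent version that adds a single variable might look like the
following sketch.
The function add_mach_var() and its extra keyword and value arguments are
hypothetical; the real set_mach_vars() takes only the four arguments listed
above.
.Cs
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: add one keyword=value string if it fits,
 * then terminate the array.  Returns the number of variables
 * added (0 or 1).
 */
int add_mach_var(char *buffer, int space, char **environ_tail, int enspace,
                 const char *keyword, const char *value)
{
    int added = 0;
    int need;

    if (enspace < 1)
        return 0;                   /* no room even for the terminator */
    need = (int)(strlen(keyword) + strlen(value) + 2);  /* '=' and NUL */
    if (enspace > 1 && need <= space) {
        sprintf(buffer, "%s=%s", keyword, value);
        environ_tail[0] = buffer;
        added = 1;
    }
    environ_tail[added] = NULL;     /* array must be null terminated */
    return added;
}
.Ce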
.Fn set_shell()
.Cs
char *set_shell(job *pjob, struct passwd *pwd)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job.
.IP pwd
pointer to the password entry for the user under whose uid the job will be run.
.RE
.IP Returns: 4
.RS
.IP pointer
to the shell to execute as the job.
.RE
.LP
This routine returns a pointer to the name of the shell program which should
be executed.   The pointer is to an area which might be overwritten by another
call to obtain a different password entry.
.LP
The shell is chosen from the following, in order of preference:
.IP 1. 4
The entry in the job attribute
.I JOB_ATR_shell
which has a host name that matches the current host.
.IP 2.
The entry in the attribute
which has a wild card (null) hostname.
.IP 3.
The user's login shell.
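.LP
The order of preference can be sketched as follows.
The shell_entry structure, the function pick_shell(), and the use of an exact
string comparison for host names are assumptions made for the example;
the real routine works from the job attribute and the password entry.
.Cs
#include <stddef.h>
#include <string.h>

/*
 * Simplified form of one shell attribute entry; host is NULL or
 * empty for the wild card case.
 */
struct shell_entry {
    const char *path;
    const char *host;
};

/* Hypothetical sketch of the selection order. */
const char *pick_shell(const struct shell_entry *ent, int n,
                       const char *thishost, const char *login_shell)
{
    const char *wild = NULL;
    int i;

    for (i = 0; i < n; i++) {
        if (ent[i].host == NULL || ent[i].host[0] == 0) {
            if (wild == NULL)
                wild = ent[i].path;     /* remember first wild card */
        } else if (strcmp(ent[i].host, thishost) == 0) {
            return ent[i].path;         /* 1. entry for this host */
        }
    }
    if (wild != NULL)
        return wild;                    /* 2. wild card entry */
    return login_shell;                 /* 3. login shell */
}
.Ce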
.LP
.Fn scan_for_terminated()
.Cs
void scan_for_terminated()
.Ce
.LP
This routine is called from MOM's main loop when the 
.I termin_child
flag is set in
.I catch_child() .
Its purpose is to determine which job, if any, has terminated execution and
to update the resource usage information for that job.
.LP
On most machines, the resource usage information is maintained in the process
table of each process and rolled upward to the parent as each child dies.
Thus the session leader, the shell, ends up with the total usage numbers.
The trick is that the process table entry goes away when the child is reaped.
Thus, when MOM receives a SIGCHLD and does a wait() to obtain the pid of the
dead process to determine which job has terminated, the information in the
process table is lost.  Hence, for these machines, MOM must, before calling
wait(), call 
.I mom_get_sample() 
to obtain the basic information from the system and then call
.I mom_set_use()
for each job which might have terminated, i.e. all running jobs.
.LP
Now, the wait(), actually waitpid(), is called and the returned pid
can be matched against the various jobs' session ids to determine which
job has terminated.  The exit status returned by waitpid() is saved in
.Av ji_exitstat
(with some modification)
and the job is marked as being in substate 
.Sc JOB_SUBSTATE_EXITING 
to identify the job to 
.I scan_for_exiting() .
.Av exiting_tasks
is set so MOM will call scan_for_exiting().
The exit status is filtered by the WIFEXITED and WIFSIGNALED macros.  If the
process exited normally, its exit value is returned unchanged.  If it was
terminated by a signal, the signal number plus 10000 is returned.   See 
.I req_jobobit()
in server/req_jobobit.c for how the 10000 is used.
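.LP
The filtering of the exit status can be sketched as follows.
The function name filter_exit_status() is hypothetical, and passing through
a status that is neither a normal exit nor a signal is an assumption; the
macros are the standard ones from <sys/wait.h>.
.Cs
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper showing the filtering of a waitpid() status. */
int filter_exit_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);         /* plain exit value */
    if (WIFSIGNALED(status))
        return WTERMSIG(status) + 10000;    /* signal number plus 10000 */
    return status;      /* neither case; passed through unchanged */
}
.Ce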
.LP
See the Cray C90 version for a system which provides integrated support
for jobs, limits, and usage.
.LP
The Unicos version of scan_for_terminated() also checks if the terminated
process was a special helper which was performing a time consuming task for
MOM.  Such tasks are checkpointing, suspending, or resuming a job.  
MOM checks the pid returned by waitpid() against 
.Ar ji_momsubt
in each job.  If a match is found, the function pointed to by
.Ar ji_mompost
is invoked as
.Cs
void func(job *, int)
.Ce
where the second argument is the exit status of the child.
.Fn open_master()
.Cs
int open_master(char **rtn_name)
.Ce
.IP Args: 4
.RS
.IP rtn_name
[Return] A pointer to a character pointer in which a pointer to the
slave side pseudo terminal name is placed.
.RE
.IP Returns: 4
The file descriptor of the master side pseudo terminal is returned.
-1 if one was not opened.
.LP
There are several versions of this routine, just about one per system type.
Some systems, notably AIX, provide a multiplexor device to provide both the
master and slave tty without searching.
The Intel Paragon has a similar routine,
.I openpty() .
On most systems without a multiplexor or library routine, open_master must try
opening each possible master name until the open succeeds.
The slave name is derived by changing the sub-string
.LP
The important issues are to return the file descriptor of the master side
as the function return, or -1 on an error, and a pointer to the slave name
into the argument.
.NH 3
.Tc Site Modifiable Files
.LP
MOM contains several modules which are meant to be easily
modifiable by a site.
The supplied version of these files may be found in the
.I src/lib/Libsite
directory and are linked via the libsite.a library.
How to modify these files is discussed in the IDS chapter on libsite.a.
.NH 4
.Ix site_mom_chu.c
.LP
The file
.I src/lib/Libsite/site_mom_chu.c
contains the function:
.Fn site_mom_chkuser()
.Cs
int site_mom_chkuser(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob 
pointer to the job being placed into execution.
.RE
.IP Returns:
zero if the account is valid, non-zero if the account is invalid and the job should be aborted.
.LP
This routine is provided to allow a site to add whatever type of account 
validation it chooses.  It should return non-zero if the job should be 
aborted for whatever reason.
As provided it always returns zero.
.NH 4
.Ix site_mom_ckpt.c
.LP
The file
.I src/lib/Libsite/site_mom_ckpt.c
contains two functions:
.Fn site_mom_postchk()
.Cs
int site_mom_postchk(job *pjob, int hold_type)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to job structure.
.IP hold_type
type of hold being applied to the job.
.RE
.IP Returns:
Zero if successful, non-zero if failed.
.LP
This routine is called following a successful checkpoint-and-terminate of
a job as the result of a qhold of a running job or a pbs_server shutdown.
(This applies only to the Cray implementation.)
The return value is used as the exit code of the child process doing the
checkpoint.   It has little impact on the job.
.LP
As an example of usage, at NAS this routine is being used to migrate the
checkpoint image of certain large, low priority jobs.
.Fn site_mom_prerst()
.Cs
int site_mom_prerst(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job structure.
.RE
.IP Returns: 4
.RS
.IP zero
(0) if successful.
.IP JOB_EXIT_FAIL1
if job should be permanently aborted.
.IP JOB_EXIT_RETRY
if job should be requeued.
.RE
.LP
This routine is called before a job is restarted from a checkpoint image.
(This applies only to the Cray implementation.)
.LP
As an example of usage, at NAS this routine is being used to reload the
checkpoint image of large, low priority jobs before the restart.
.NH 4
.Ix site_mom_jset.c
.LP
The file
.I src/lib/Libsite/site_mom_jset.c
contains the following function:
.Fn site_job_setup()
.Cs
int site_job_setup(job *pjob)
.Ce
.IP Args: 4
.RS
.IP pjob
pointer to the job structure of the job being placed into execution.
.RE
.IP Returns: 4
Zero on success, non-zero if job should be aborted.
.LP
This routine is called from 
.I finish_exec()
shortly after the job session is established.  A site may use it to
perform any additional session related setup required at that site.
.LP
Return zero (0), if the setup is successful, or non-zero if the job
is to be aborted.

.NH 2
.Tc \f3Program: pbs_rcp\fP
.LP
.NH 3
.Tc Overview
.LP
Included with the source for MOM, in subdirectory
.I src/resmom/mom_rcp
is the source code for the rcp(1) command from the bsd4.4-Lite distribution.
This code is copyrighted by UCB as noted in the source files.
The code has been slightly modified to allow it to
compile under systems other than bsd4.4;  note the liberal use of functions such
as vwarnx() and snprintf() not found in POSIX.
The copyright clearly grants the right to modify and redistribute the source.
.NH 3
Why pbs_rcp
.LP
Why is this code supplied as part of PBS?
Within PBS, there are three cases in which MOM must move files
between her machine and some other:
.RS
.IP a. 4
Pre-execution stage in of files.
.IP b.
Post-execution stage out of files.
.IP c.
Post-execution return of the job's standard output and standard error.
.RE
.LP
The PBS project did not wish to be dependent on NFS, AFS, or any
other distributed file system in order to support file delivery.
Nor did we wish to restrict the source/target of file movement to
those systems with a PBS server.  This ruled out using the "job"
protocol as a file transport.  Ftp(1) and ftam require the user's
password.  We did not wish to require that knowledge.  Thus rcp(1)
was selected as the transport method.   MOM uses the system(3)
library routine to execute the rcp command.
.LP
However, many rcp implementations come with a serious flaw.  They
may exit and return an exit status of zero (0), when the file was
not delivered.  If this happens, MOM would believe that the file
was delivered when it was not.  
.LP
One solution would have been to implement a new copy utility for MOM
very similar to rcp.  But this would have required its installation
on every system to/from which the user may wish to move files.  Rather
than duplicate rcp, let's just fix it.  As only the rcp used by MOM
must be "fixed", the PBS team opted to provide a version of rcp that
works correctly.   The bsd4.4-Lite version was chosen because of the
freedom to copy and modify it granted by its copyright.
.NH 3
Use of pbs_rcp
.LP
The supplied rcp source is compiled and the program is named
"pbs_rcp" in order to reduce the confusion that would come from having
two "rcp"s installed on the system.  It is installed in the same
system binary directory as MOM (pbs_mom).
This path is compiled into MOM, see
.I src/resmom/requests.c .
.LP
When MOM invokes pbs_rcp, MOM has forked a child which has set its
effective and real uid to that of the user on whose behalf MOM
is operating.  This child of MOM, as the user, will use system(3)
to fork a shell and execute pbs_rcp.  The path to the pbs_rcp 
is specified in building src/resmom/requests.c and contains the directory
where MOM is (will be) installed.
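.LP
The way the child depends on the exit status can be sketched as follows.
This is not the code in requests.c; the function run_copy_command() is
hypothetical and the caller is assumed to have already built the full
command line.
.Cs
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: run a copy command line through system(3)
 * and reduce the result to 0 (the command exited with status zero)
 * or -1 (anything else).
 */
int run_copy_command(const char *cmdline)
{
    int status = system(cmdline);

    if (status == -1)
        return -1;                      /* could not start the shell */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;                          /* non-zero exit or signal */
}
.Ce
.LP
A check of this kind is only as good as the exit status of the program being
run, which is the reason for supplying a corrected rcp.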
.LP
Pbs_rcp, like the normal rcp, must be installed "setuid" and owned by root.
.\" force next chapter to odd page
.bp
.if e \{
\&
.sp 10
.DS C
[This page is blank.]
.DE
.bp
\}
