\" following macro cuts the date provided by RCS so only yy/mm/dd shows
.de Cd
\&\\$2
..
.sp 3
.in +0.7i
.sp 0.22i
.vs 11
.ps +22
P
.br
\h'11p'B\ \ 
.ps -4
\v'-4p'Portable Batch System\v'4p'
.br
.ps +4
\h'21p'S
.br
.ps -22
.vs 12
\l'4.9i'
.br
.in -0.8i
.sp 3
.TL
Requirements Specification
.AU
Robert L. Henderson \(dg
.FS \(dg
MRJ Technology Solutions, NASA Contract NAS 2-14303, Moffett Field, CA 
94035
.FE
Dave Tweten
.AI
NAS Scientific Computing Branch
NAS Systems Division
NASA Ames Research Center
.sp
.so ../ers/release.ms
.br
Printed: \*(DY
.LP
.ds CH PBS Requirements
.bp
.DS C
\s+2\f3Portable Batch System (PBS) Software License\fP\s-2
.sp
Copyright \(co 1999, MRJ Technology Solutions.
.br
All rights reserved.
.DE
.LP
Acknowledgment: The Portable Batch System Software was originally developed
as a joint project between the Numerical Aerospace Simulation (NAS) Systems
Division of NASA Ames Research Center and the National Energy Research
Supercomputer Center (NERSC) of Lawrence Livermore National Laboratory.
.LP
Redistribution of the Portable Batch System Software and use in source
and binary forms, with or without modification, are permitted provided
that the following conditions are met:
.IP -
Redistributions of source code must retain the above copyright and
acknowledgment notices, this list of conditions and the following disclaimer.
.IP -
Redistributions in binary form must reproduce the above copyright and
acknowledgment notices, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution.
.IP -
All advertising materials mentioning features or use of this software must
display the following acknowledgment:
.RS
.QP
This product includes software developed by NASA Ames Research
Center, Lawrence Livermore National Laboratory, and MRJ Technology Solutions.
.RE
.sp
.LP
.ce
DISCLAIMER OF WARRANTY
.QP
THIS SOFTWARE IS PROVIDED BY MRJ TECHNOLOGY SOLUTIONS ("MRJ") "AS IS" WITHOUT
WARRANTY OF ANY KIND,  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,  FITNESS
FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT ARE EXPRESSLY  DISCLAIMED.
.QP
IN NO EVENT, UNLESS REQUIRED BY APPLICABLE LAW, SHALL MRJ, NASA, NOR
THE U.S. GOVERNMENT  BE LIABLE FOR ANY DIRECT DAMAGES WHATSOEVER,
NOR ANY  INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.LP
This license will be governed by the laws of the Commonwealth of Virginia,
without reference to its choice of law rules.
.LP
.ds LF $Revision: 2.1 $
.ds CF -%-
.SH
PBS Revision History
.so ../ers/rel_history.ms
.sp 1
.SH
Acknowledgements
.LP
Special acknowledgements are extended to the following groups for
their contributions to this requirements specification:
.DS
Lawrence Livermore National Laboratory - LCC
Lawrence Livermore National Laboratory - NERSC
Members of the POSIX P1003.15 Working Group
.DE
.bp
.NH 1
INTRODUCTION
.NH 2
Purpose
.LP
This document specifies the functional requirements for the software
package \*QPortable Batch System\*U or
.B PBS .
PBS is an extension to a POSIX\**
.FS
IEEE Standard 1003,
.I
Information technology \(em portable operating system interface.
.R
See the glossary section for more information on POSIX standard groups.
.FE
or Unix\**
.FS
Unix is a trademark of AT&T.
.FE
operating system which provides the capability to submit and control
jobs in a batch environment, independent of the interactive environment.
.LP
Within this requirements document, the following definitions apply:
.LP
The term
.B must
is used to introduce a hard requirement which the implementation
must satisfy.
.LP
The term
.B may
is used to introduce an optional behavior.  The implementation
should attempt to satisfy this behavior.  If it cannot, the rationale
must be specified within the design documents.
.LP
The project to develop the PBS package has the following goals:
.IP \(bu
The package must provide extensions to the capabilities of the base
operating system and utilities.
The extensions must provide for batch job processing on single and
networked heterogeneous systems.
The package must provide for the ability to control:
.RS
.IP \(em
Initiation time of a batch job.
.IP \(em
Use of resources by a single batch job.
.IP \(em
Use of resources by all running batch jobs.
.RE
.IP \(bu
PBS will be a replacement for an earlier batch queuing subsystem, 
NQS, which was also developed by the Numerical Aerodynamic Simulation,
NAS, program.
The new package, PBS, will extend the capabilities of NQS.
.IP \(bu
The package must be portable to new systems added to the NAS complex.
.IP \(bu
This package must provide a
.I "Batch Queuing"
subsystem which is a superset of that defined in POSIX 1003.2d
.UL "Batch Queuing Extensions for Portable Operating Systems" .
.IP \(bu
The package must be designed and implemented in a manner that ensures
PBS is modular and easily modified and extended.
.IP \(bu
PBS will be available
through COSMIC to vendors and will be the base upon which
vendors build their own Batch Queuing product.
.NH 2
Limitations
.LP
This document:
.IP \(bu
Does not specify design; that is to be found in the
.UL "PBS External Design Specification"
and the 
.UL "PBS Internal Design Specification" .
.IP \(bu
Does not present a risk analysis or contingency plan.
.IP \(bu
Does not provide an implementation schedule.
.NH 2
Constraints
.LP
The specification of requirements, design, and implementation of PBS
are constrained by the following:
.NH 3
Explicit Constraints
.NH 4
Portability
.LP
It is a goal of PBS to be a portable package, installable on all
systems currently available at the Numerical Aerodynamic Simulation
complex or which might be acquired in the near future.  This requirement
constrains the requirements, design, and implementation of PBS to be
based upon the specifications of an \*Qopen system.\*U
In the past, the closest operating system to an open system has been Unix.
However, even Unix had different flavors of implementation which complicated
the implementation of a portable package.
.LP
Currently, an international standard for open systems is being defined
by IEEE.  This standard, POSIX, is the best base for a package which must
be portable to platforms from a large number of vendors.
Most current Unix systems have
a subset which approaches the POSIX standard.  Most Unix systems will be
POSIX compliant in a few years.
.LP
Therefore, to maximize portability, PBS must be based upon the
system features defined in POSIX.1.
.NH 4
Inter-operability
.LP
Part of the effort to standardize an open system under POSIX is working
group POSIX.15 \(em
.I
Batch Queuing Extensions for Portable Operating Systems.
.R
A number of vendors plan to implement batch systems which are compliant
with POSIX 1003.2d, the standard developed by the working group.
In order to be inter-operable with those systems, PBS must also be
compliant with the standard.
As there may be features
required in PBS which are not described in the standard,
the PBS package will be a superset of POSIX 1003.2d, \*Qconforming with
extensions,\*U not \*Qstrictly conforming.\*U
.LP
Features described in POSIX 1003.2d which are not described within this
document are to be required in PBS unless explicitly excluded.
.NH 3
Implicit Constraints
.LP
The implementation of PBS will be constrained by the systems available
for the implementation.  At the current time, those systems consist of:
.DS
Cray 2
Cray YMP
Amdahl 5880
Silicon Graphics Power Servers and workstations
Sun SparcStations
Thinking Machines CM-2
Intel iPSC/860
.DE
All of these systems have a Unix based operating system.
None of the systems are expected to be certified POSIX compliant
during the early development phase of PBS.  It is a goal of the project
to ensure the system usage by PBS does not conflict with
the POSIX.1 and POSIX.2 standards.  However, without a compliant 
development base, some deviation from the standard may occur.
.NH 2
Glossary
.LP
The following definitions apply to usage of the terms within this 
document.  The definitions may not completely agree with those
supplied in POSIX documents.
.LP
.so glossary.ms
.NH 2
Assumptions
.LP
While it is the goal of PBS to be conforming with POSIX 1003.2d,
the design and development phase of PBS will begin before the POSIX 1003.2d
standard is completed and approved.  It is assumed that the final standard
will be close to the present draft standard at version D8.
The areas of least certainty are the Administration commands and the
Network protocol.  It is the intent of the PBS project to define a 
network protocol which will be flexible, extensible, simple to implement,
and become the accepted standard.
.LP
It is assumed PBS will be implemented in several phases.  The least
certain and most system dependent parts of PBS will be implemented in the
later phases.

.NH 1
GENERAL CONCEPTS
.LP
In the past, Unix systems were used in a totally interactive manner.
Background jobs were just processes with their input disconnected from the
terminal.  However, as Unix moved onto larger and larger processors,
the need to be able to schedule tasks based on available resources increased
in importance.  The advent of networked compute servers, smaller general
systems, and workstations leads to the requirement of a networked batch
capability.
.NH 2
Overview of the Batch System
.LP
The purpose of the batch system (or subsystem) is to provide additional
controls over the initiating or scheduling of execution of batch jobs;
and to  allow routing of those jobs between different hosts.
The batch system allows a site to define and implement policy as to what
types of resources and how much of each resource can be used by different jobs. 
The batch system also provides a mechanism with which a user can ensure
a job will have access to the resources required to complete.
.NH 2
Beyond Cosmic NQS
.LP
A forerunner of PBS was Cosmic NQS, which was also developed by
the NAS program.  It became the early standard for batch systems under
Unix.  However, Cosmic NQS had several limitations
and it was difficult to maintain and enhance.
Some of the problems with Cosmic NQS which PBS is to avoid were:
.IP \(bu
There was no design or implementation documentation.
.IP \(bu
Cosmic NQS was monolithic in structure; new features or capabilities were
difficult to add.
.IP \(bu
Cosmic NQS used a special method for mapping system network names and 
providing security called
.I nmap .
Nmap required assigning an additional unique identifier to each system
which could connect to the batch system.
This required coordination between all system administrators.
.IP \(bu
A very rigid protocol was used between the client and server processes.
There was no provision for adding additional resources or capabilities.
.IP \(bu
Cosmic NQS provided no job tracking.  If a user submitted a job to a queue
which routed the job to another system, the user had to directly query
the other system for status about the job.  Users were not always aware
of where their jobs currently resided.
.IP \(bu
Under Cosmic NQS, to route a job from the local host to a queue on a remote
host, the local host must have had a \*Qpipe queue\*U naming the remote
queue as its destination.  Thus, queue names had to be propagated across all
interconnected batch systems.
.IP \(bu
Cosmic NQS only provided interactive commands for modifying the attributes
of the batch system or a queue.  If the batch processing policy changed at
various times of the day, an operator had to manually enter the
new parameters.
.IP \(bu
Cosmic NQS provided only a very simple FIFO job selection algorithm.
Any change to that policy required modification of a routine buried deeply
within NQS.
.IP \(bu
Cosmic NQS required the full implementation to be installed on a system.
Without it users could not submit jobs, but the full batch server was too
large to fit on the typical workstation.

.NH 1
FEATURES
.LP
This section of the Specification addresses the specific functional
requirements of the subsystem which are visible to the user.
The requirements are broken into the following groups:
.DS
Commands; user, operator, and administrator,
Programmatic interface,
Resource management,
Job routing,
Job initiation,
File staging,
Job tracking,
History file,
Error detection and recovery,
Batch system security, and a
Client only subsystem.
.DE
.sp 1
.NH 2
Commands
.LP
The following sections list requirements for the user, operator, and
administrator commands which must be implemented.
.NH 3
User Commands
.LP
.so user_cmds.ms
.NH 3
Operator Commands
.LP
.so op_cmds.ms
.NH 3
Administration Commands
.LP
.so admin_cmds.ms
.sp 1
.NH 2
Programmatic Interface
.LP
A library of function calls must be available to provide a programmatic
interface to the batch system.
Functions available in the library must provide the same capabilities
available through commands.
Users must have functions to submit, status check, modify, and control their
own batch jobs.  Administrators must have functions to allow modification
of operational characteristics.
This enables a site to develop programs to automatically
adjust batch server and queue attributes depending on policy, loading,
time of day, or other criteria.
.sp 1
.NH 2
Resource Management
.LP
A main function of PBS is to initiate jobs based on resource requirements
and resource allocations.
Different sites may wish to manage the resources in different ways,
and different systems place different demands on resource management.
.NH 3
Specifying Job Resource Requirements
.LP
To manage the resource usage on a system requires knowledge of the
types and amounts of resources available.
Different systems support a variety of resource types.  New resource
types can be added to a system by updates to the system's operating system.
It is necessary therefore to be flexible in the processing of resources.
.LP
Throughout the batch system, resources must be maintained
in a form which allows simple addition of new types of resources.
.LP
Resource options, from the job submit command or the job script,
must be passed as
.DS C
.I keyword=value
.DE
strings.
The units of measurement for the 
.I value
depend on the type of resource.
Only limited checking (mainly of the units) may be performed by the job
submittal command prior to submittal, and only on the common resources
such as memory and disk space.
.LP
The final destination server must fully check the strings upon arrival.
No checking is to be performed by intermediate servers.
.LP
At the final destination, unrecognized keywords or recognized keywords
with errors in the
.I value
must cause the job to be aborted.
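.LP
As an illustration only, the check made at the final destination might be
sketched as follows in Python.
The names parse_resources, KNOWN_RESOURCES and ResourceError, the small
subset of keywords, and the unit suffixes are assumptions made for this
sketch; none of them is defined by this specification or by POSIX 1003.2d.

```python
# Hypothetical sketch: validate keyword=value resource strings at the
# final destination server. All names and unit suffixes are assumptions.

# Assumed table of recognized keywords and the kind of value each takes.
KNOWN_RESOURCES = {
    "cput": "time",   # job cpu time, seconds or hh:mm:ss
    "mem": "size",    # job memory size, for example 64mb
    "nice": "int",    # nice value
}

SIZE_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


class ResourceError(ValueError):
    """Raised when a resource string must cause the job to be aborted."""


def parse_time(text):
    # Accept plain seconds or colon separated fields such as hh:mm:ss.
    seconds = 0
    for part in text.split(":"):
        if not part.isdecimal():
            raise ResourceError("bad time value: " + text)
        seconds = seconds * 60 + int(part)
    return seconds


def parse_size(text):
    # Split the leading digits from the trailing unit suffix.
    digits = text.rstrip("abcdefghijklmnopqrstuvwxyz")
    unit = text[len(digits):] or "b"
    if not digits.isdecimal() or unit not in SIZE_UNITS:
        raise ResourceError("bad size value: " + text)
    return int(digits) * SIZE_UNITS[unit]


def parse_int(text):
    try:
        return int(text)
    except ValueError:
        raise ResourceError("bad integer value: " + text)


PARSERS = {"time": parse_time, "size": parse_size, "int": parse_int}


def parse_resources(strings):
    """Return a dict of checked resources or raise ResourceError."""
    checked = {}
    for item in strings:
        keyword, sep, value = item.partition("=")
        if not sep or keyword not in KNOWN_RESOURCES:
            raise ResourceError("unrecognized resource: " + item)
        checked[keyword] = PARSERS[KNOWN_RESOURCES[keyword]](value)
    return checked
```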
.NH 3
Standard Resources
.LP
The following are very common resource types and are specified
in POSIX 1003.2d:
.DS B
.TS
box ;
c c
l l .
Keyword	Definition
_
cput	job cpu time
pcput	process cpu time
mem	job memory size
pmem	process memory size
nice	nice value
pf	job permanent file space
ppf	process permanent file space
tf	temporary file space
.TE
.DE
.NH 3
System Specific Resource
.LP
The following resources may also be recognized on some of the systems
supported by PBS:
.DS B
.TS
box ;
c c
l l .
Keyword	Definition
_
file	file exists and is readable
infile	file to stage in before job executes
outfile	file to stage out after job executes
memt	Maximum job memory * time  (byte_seconds)
ncpus	Number of cpus
memhier	memory hierarchy
typecpu	type of cpu
cpugroup	set of cpus (in a parallel system, i.e. a quadrant)
9trk	number of 9 track tape drives  (round reel)
3480	number of 18 track tape drives (square reel)
3490	number of 36 track tape drives (square reel)
8mm	number of 8mm (Exabyte) tape drives
.TE
.DE
.NH 3
Queue Resource Limits
.LP
Each queue may have a set of resource limits associated with it.
These limits control which jobs enter each queue.
There may be limits on each resource as to the amount of that resource
which can be used by all batch jobs running on the system.
There may be 
.I complex
limits which limit the amount of each resource that can be allocated to jobs
from all queues which are contained in that complex.
.LP
All of these limits must be settable at time of queue definition.
They must be changeable by a privileged administrator using PBS commands.
Additionally, a library of interface functions must be provided which will
enable a site to build a program which can interrogate and modify the
resource limits.
.NH 3
System Wide Resource Management
.LP
If a host system supported the capability for a privileged process to
monitor system wide resource usage and to control allocation of resources
to
.I "interactive sessions" ,
then PBS would be able to coordinate resources between batch jobs
and interactive users.
However, this capability is virtually non-existent.\**
System vendors are encouraged to provide the interfaces that would allow
implementation of total resource management.
.FS
An exception to this is the current implementation of NQS on the CM-2,
which provides for controlling the number
of processor nodes assigned to interactive sessions as well as batch jobs.
.FE
.LP
Therefore, PBS will be limited to controlling resource utilization
by batch jobs, except where special features are provided by the 
system vendor.
.NH 3
Resource Allocation and Limits
.LP
Each site will have different policies on managing jobs and resources.
For example, a site may wish to manage resources or costs by assigning
cumulative resource limits to certain users, groups or accounts.
A job which has resource requirements that would exceed a resource limit would
not be scheduled for execution.
When a job begins execution, the applicable user (group, account, ...)
resource limits would be reduced by the amount of resource requested by the job.
When the job terminates, the applicable limits may be adjusted for any
unused resources.  For example, if a job terminates early, the user may be
credited for the amount of cpu time the job allocated but failed to use.
.LP
A clearly defined interface must be specified and provided to allow
implementation of a resource management capability without impacting the
design or implementation of other portions of PBS.
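.LP
The debit and credit cycle in the example above can be sketched as
follows.
The class name Allocation and its methods are hypothetical and are used
only to make the arithmetic concrete; they are not a proposed PBS interface.

```python
# Hypothetical sketch of cumulative limit accounting; not a PBS interface.


class Allocation:
    """Remaining cumulative allocation of one resource for one user."""

    def __init__(self, remaining):
        self.remaining = remaining

    def can_schedule(self, requested):
        # A job whose request exceeds the remaining limit is not scheduled.
        return requested <= self.remaining

    def job_started(self, requested):
        # Reduce the limit by the amount requested, not the amount used.
        if not self.can_schedule(requested):
            raise ValueError("request exceeds remaining allocation")
        self.remaining -= requested

    def job_ended(self, requested, used):
        # Credit back whatever the job requested but did not use.
        self.remaining += max(requested - used, 0)
```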
.NH 3
Isolation of Resource Management Policy/Functionality
.LP
The functions which make up resource management must be packaged in
a manner to isolate the functions from the other pieces of the batch
system.  The interfaces between the resource management subsystems must
be well defined and documented.
.LP
One possible example of the isolation is shown in figure 2.
.sp 1
.NH 2
Job Routing
.LP
The term
.I "job routing"
refers to the process of moving a batch job from one destination to
another.  This process occurs when the job is entered into a queue which
has a destination attribute rather than an execution attribute.
Two basic methods of job routing can be implemented within the
overall scope of PBS, a 
.I push
model and a 
.I pull
model.
.LP
In the push model, jobs are routed to other queues based upon the
resource requirements of the job.  While the work load in execution
queues might be factored into the process of determining the destination,
the determination is associated with the routing queue.
Cosmic NQS was based solely on the push model.
.LP
In the pull model, jobs are collected in a "central" pool and distributed
upon the request of an execution server when that server has resources
available for jobs.  This model is very effective when the execution 
servers are spread over a large number of identical systems. 
.LP
With the diversity of the systems located within NAS, the push model
appears to be the more appropriate model to implement.  However, when
possible, the design and implementation of PBS should allow for the
support of either model.
.NH 3
Single versus Multiple Destinations
.LP
A queue with a destination attribute, for short known as a
.I "routing queue" ,
may have either a single destination or have multiple destinations.
Any of the destinations may be in the same batch server or on a different
batch server.
For a routing queue within the push model, any of the destinations may be
another routing queue or an execution queue.
.NH 3
Load Leveling
.LP
Within this section, the term
.I "load"
is defined as the sum of the cpu time of the jobs in the destination (queue)
that are runnable (other definitions are possible).
.I "Load leveling"
means to balance the load across destinations.
.NH 3
Job Routing Policy
.LP
The following sections describe the job routing requirements for the
push model of routing.  The pull model will be covered in lesser detail
in Appendix B.
.NH 4
Isolation of Policy/Functionality
.LP
When a routing queue has multiple destinations, a
.I "job routing"
function must be provided to select from the destinations.
This function must be designed and implemented in a manner to isolate
its implementation.  This simplifies the process of changing the
routing policy.
.NH 4
Default Policy
.LP
For routing queues with multiple destinations, there are two routing policies
which must be supported: load leveling and first available destination.
If load leveling is active, the destination must be selected with the
intent to load level across the listed destinations.
This policy is most useful when the routing queue points to multiple
queues on a single server or to 
destinations across multiple homogeneous hosts.
If \*Qfirst available\*U is active, the first destination specified which
is available will be selected for the job destination.
.LP
Which of the two methods must be used is controlled by
an attribute of the routing queue.
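.LP
A minimal sketch of the two default policies follows, using the
definition of load given in the Load Leveling section.
The function names, the dictionary layout, and the policy strings
first_available and load_level are assumptions for illustration.
Restricting load leveling to available destinations is also an assumption
of the sketch.

```python
# Hypothetical sketch of the two default routing policies.


def queue_load(jobs):
    # Load as defined earlier: summed cpu time of the runnable jobs.
    return sum(job["cput"] for job in jobs if job["runnable"])


def select_destination(destinations, policy):
    """destinations: ordered list of dicts with name, available, jobs."""
    candidates = [d for d in destinations if d["available"]]
    if not candidates:
        return None
    if policy == "first_available":
        # The first listed destination that is available wins.
        return candidates[0]["name"]
    if policy == "load_level":
        # min() returns the first of several equally loaded destinations.
        return min(candidates, key=lambda d: queue_load(d["jobs"]))["name"]
    raise ValueError("unknown routing policy: " + policy)
```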
.NH 4
Load Value Polling and Cache 
.LP
Within a single batch server, the load on all queues is known.
To achieve the multiple destination load leveling,
there must be a method of polling
other destinations for the loading of their queues.  A protocol for querying
other batch servers for the loading of a specified queue and for returning
the response to said query must be defined as part of the network
application level protocol.
.LP
To minimize the amount of querying, the response must be cached on the
local system.  Any additional routing decisions may use the cached value
until it is declared 
.I stale
after
.B "<some period of time>" .
Additionally, the protocol for transferring a batch job from one
server to another must include the ability to return an updated loading value.
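.LP
The caching behavior might look like the following sketch.
The class name LoadCache, the query callable, and the max_age parameter
(standing in for the period of time left unspecified above) are assumptions.

```python
# Hypothetical sketch of a load value cache with a staleness period.
import time


class LoadCache:
    def __init__(self, query, max_age, clock=time.monotonic):
        self._query = query      # callable(destination) returning a load
        self._max_age = max_age  # seconds before an entry is stale
        self._clock = clock
        self._entries = {}       # destination -> (load, time recorded)

    def update(self, destination, load):
        # Also usable when a job transfer reply carries a fresh load value.
        self._entries[destination] = (load, self._clock())

    def load(self, destination):
        entry = self._entries.get(destination)
        if entry is None or self._clock() - entry[1] >= self._max_age:
            # Missing or stale: poll the remote server again.
            self.update(destination, self._query(destination))
            entry = self._entries[destination]
        return entry[0]
```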
.sp 1
.NH 2
Job Initiation
.LP
The selection of a certain job or jobs for execution from
a set of runnable jobs in an execution queue is the function of the
job initiator.
Note, the term 
.I initiate
is used rather than 
.I schedule
to indicate that once the batch job has begun execution, this part of
PBS no longer has any control over the job.
This is opposed to 
.I "job management" ,
where control exists to preempt a running job in order to start a
higher priority job.
.LP
Specifying a default job initiation policy is one of the hardest
tasks in developing a set of requirements.  Each site, and often various
groups within a site, have their own concept of how best to schedule jobs.
These different concepts can lead to very different requirements and
implementations of the batch system, including impacting the usage of
queues as a means of separating jobs into groups.
Additional discussion of this topic is included as Appendix A.
Discussion of the impact of a pull model of routing on job initiation
is in Appendix B.
.NH 3
Isolation of Job Initiation Policy/Functionality
.LP
The functions which make up the job initiator must be packaged
in a manner to isolate the function from the other pieces of the batch system.
Additionally, the interface between the bulk of PBS and the initiator
must be well defined.
This is to ease the modification or total replacement of the initiation
algorithm.  
.LP
At the same time, the job initiator must have direct and total control over
what jobs are running (excluding operator intervention).  No initiation
of jobs is to take place outside of the initiator.
.NH 3
Default Job Initiation Policy
.LP
No job initiation policy is called out by this requirements specification.
Instead, the PBS implementation must allow for the widest possible
variation in policy and be capable of simply and quickly changing between
members of a set of policies.
.LP
Therefore, a default PBS job initiator will be provided that is
table driven such that
the policies governing job initiation may easily be changed.
An interface library must be created to enable sites to build their own
programs to modify the values controlling job initiation.
The default job initiation included as part of PBS may
be based upon weights assigned to each of the following attributes
and objectives:
.IP \(em
Each of the various resources including cpu time.
.IP \(em
A cpu time \(** memory usage product.
.IP \(em
A range or list of job priorities.
.IP \(em
Job age (time in queue).
.IP \(em
A list of job identifiers.
.IP \(em
User names (or ids).
.IP \(em
Group names (or ids).
.IP \(em
Other specifications may be included.
.LP
The default job initiation function must also include the ability to
prevent jobs from being run according to the values of certain specifiable
job attributes such as resources, job id, user, and/or group.
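.LP
One way a table driven initiator could combine weights and exclusions is
sketched below.
The POLICY table, its keys, and the weight values are invented for the
example; a real table would cover more of the attributes listed above.

```python
# Hypothetical table driven initiator: weights and exclusions are data,
# so policy changes do not require changes to the selection code.

POLICY = {
    "weights": {"priority": 10.0, "age": 0.01, "cput": -0.001},
    "users": {"alice": 50.0},        # extra weight for named users
    "exclude_users": {"mallory"},    # jobs from these users never run
}


def job_score(job, policy):
    weights = policy["weights"]
    score = weights["priority"] * job["priority"]
    score += weights["age"] * job["age"]      # seconds in queue
    score += weights["cput"] * job["cput"]    # requested cpu seconds
    score += policy["users"].get(job["user"], 0.0)
    return score


def select_job(jobs, policy):
    """Return the highest scoring eligible job, or None."""
    eligible = [j for j in jobs if j["user"] not in policy["exclude_users"]]
    if not eligible:
        return None
    return max(eligible, key=lambda j: job_score(j, policy))
```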
.NH 3
Job/Queue Attributes
.LP
Each job in a queue must have a 
.I "base priority" .
.LP
Each queue must have a
.I "queue priority" .
Each queue must have a 
.I "run limit"
which is the maximum number of jobs from that queue which may be executed
concurrently.  Each queue must have a set of 
.I "maximum resources" .
.LP
A queue may belong to a 
.I complex .
Each complex must have a run limit and a set of maximum resources.
.LP
The batch server must have a total run limit and a set of total
maximum resources.
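.LP
The run limits at the queue, complex, and server levels could be checked
together as in this sketch.
The function name may_start and the dictionary keys are assumptions; the
maximum resource sets would be checked in the same layered way.

```python
# Hypothetical sketch: a job may start only if every applicable run limit
# (queue, optional complex, server) still has room.


def may_start(queue, complexes, server):
    levels = [queue, server]
    if queue.get("complex") is not None:
        levels.append(complexes[queue["complex"]])
    return all(level["running"] < level["run_limit"] for level in levels)
```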
.NH 3
Support of Job Dependency
.LP
PBS must support job dependency. The user must have the ability
to specify that a
.I child
job cannot be scheduled for execution until one or more specified 
.I parent
jobs have:
.IP \(em
Started execution.
.IP \(em
Completed execution successfully.
.IP \(em
Completed execution successfully or unsuccessfully (non zero exit status).
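.LP
The three conditions above can be expressed as a small eligibility test.
The condition names after_start, after_ok and after_any and the job state
names are hypothetical labels chosen for this sketch.

```python
# Hypothetical dependency check; state and condition names are assumptions.


def dependency_met(condition, parent):
    """parent: dict with state and, once finished, exit_status."""
    state = parent["state"]
    if condition == "after_start":
        return state in ("running", "finished")
    if condition == "after_ok":
        return state == "finished" and parent["exit_status"] == 0
    if condition == "after_any":
        return state == "finished"
    raise ValueError("unknown dependency condition: " + condition)


def child_eligible(dependencies, jobs):
    """dependencies: list of (condition, parent job id) pairs."""
    return all(dependency_met(cond, jobs[pid]) for cond, pid in dependencies)
```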
.NH 4
Simultaneous Start of Multiple Jobs Across Systems
.LP
As a special case of job dependency,
PBS may provide the capability of simultaneously starting multiple jobs
residing in different queues on different heterogeneous systems.
This control would provide for starting all jobs in the set when
the resources required for each job are available simultaneously.
This capability is intended to support distributed computing.
.sp 1
.NH 2
File Staging
.LP
Because disk space on supercomputers is often at a premium, PBS must
include file 
.I staging .
Files residing on known hosts on the network may be identified as
resources required before the job can be run.  PBS may use standard
network file transfer utilities to copy the file to the execution host
(see the section on
.B "File Transmission" ).
PBS must ensure the files are available under the specified local
path before the job is executed.
If a time is specified for the job's execution, PBS must attempt to
have all files staged in before that time.
If a file already exists on the local (execution) system with the path name
specified then PBS, acting with the capabilities of the user, must attempt
to delete (unlink) the file prior to staging in the requested file.
If PBS is unable to delete the file, PBS must treat this condition
as an error.
If an error occurs on staging in one or more files, the user must be
notified and the batch job must be aborted.
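.LP
The stage in rule for a pre-existing local file is sketched below.
The names stage_in and StageError are hypothetical, and a local file copy
stands in for the network file transfer utility.
The sketch does not model acting with the capabilities of the user.

```python
# Hypothetical sketch of stage in: delete any existing local file first,
# and treat a failed delete or a failed transfer as an error.
import os
import shutil


class StageError(Exception):
    """Staging failed; the user is notified and the job is aborted."""


def stage_in(source, local_path, transfer=shutil.copyfile):
    try:
        os.unlink(local_path)
    except FileNotFoundError:
        pass                      # nothing to delete is not an error
    except OSError as err:
        raise StageError("cannot delete " + local_path) from err
    try:
        transfer(source, local_path)
    except OSError as err:
        raise StageError("cannot stage in " + source) from err
```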
.LP
Files residing on the execution host may be specified for staging out
after the job has completed execution.  PBS must attempt to copy the
files to the specified host and path using the standard file transfer utility.
If the stage out fails, PBS must notify the user.
.LP
All files staged in or out must be deleted from the execution host
upon the job's completion.
.LP
In the case of an error during file staging, the user must be notified.
The notification may be either by appending an error message to the
standard error file of the batch job or by sending the user mail.
.sp 1
.NH 2
Job Tracking
.LP
With current implementations of NQS, the user who submits a job
must track it from host to host.  If the user attempts to status the
job on the system where it was submitted, but the job has been routed
to another host, the 
.B qstat
fails to show any information.  The qstat must be directed to the
host where the job resides at that time.
.LP
To simplify requesting job status,
PBS must implement automatic job tracking.
The batch server where the job was originally submitted must retain status
information about the job until the job has completed execution.
If the job is routed to a different queue or batch server at any time, 
update information must be returned to the original server.
A request for status of the job on the original server must display the
latest information regardless of the current location of the job.
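.LP
A sketch of the bookkeeping on the original server follows.
The class OriginServer, its method names, and the job identifier format
used in any example are assumptions, not part of the requirements.

```python
# Hypothetical sketch of job tracking on the server of original submission.


class OriginServer:
    def __init__(self):
        self._status = {}   # job id -> latest known location and state

    def submit(self, job_id, queue):
        self._status[job_id] = {"location": queue, "state": "queued"}

    def location_update(self, job_id, location, state):
        # Returned by whichever server currently holds the job.
        self._status[job_id] = {"location": location, "state": state}

    def job_completed(self, job_id):
        # Status is retained only until the job has completed execution.
        self._status.pop(job_id, None)

    def stat(self, job_id):
        # Answers from local records wherever the job currently resides.
        return self._status.get(job_id)
```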
.sp 1
.NH 2
History File
.LP
PBS must maintain a 
.I "history file" .
The history file must be either human readable, yet arranged in such manner
that it can be processed by utility programs; or it must be a binary file,
in which case a utility is provided to display the contents in human readable
form.
.LP
This file must contain at least the following information about each
job processed by PBS:
.IP \(em 4
Time job entered batch system.
.IP \(em 4
Time job entered (each) queue.
.IP \(em 4
Time job started execution.
.IP \(em 4
Time job suspended execution.
.IP \(em 4
Time job restarted execution.
.IP \(em 4
Time job terminated.
.IP \(em 4
Exit status of job.
.IP \(em 4
Periodic resource usage by job, where the period is administrator selected.
.IP \(em 4
Maximum resource usage by job.
.LP
Entries may be included in the log describing
administrative and operator actions which affect the operational characteristics
of the batch system, such as modification of job selection parameters.
.LP
Specification of which messages of differing types are to be written to the log
must be by a list rather than by a single level.  This allows maximum
flexibility.
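.LP
Selecting messages by a list of types rather than by a single level might
be sketched as follows.
The class HistoryLog, the event type names, and the semicolon separated
record layout are assumptions made for the example.

```python
# Hypothetical sketch: history records are selected by a list of event
# types, so any combination of types can be enabled independently.


class HistoryLog:
    def __init__(self, enabled_types):
        self.enabled = set(enabled_types)
        self.lines = []

    def write(self, event_type, job_id, when, detail=""):
        if event_type not in self.enabled:
            return False
        # One record per line: readable by people and by utility programs.
        self.lines.append("%d;%s;%s;%s" % (when, event_type, job_id, detail))
        return True
```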
.LP
.sp 1
.NH 2
Error Detection and Recovery
.LP
PBS must take every precaution to ensure submitted batch jobs
are not lost in the event of user, administrator, system, or network
errors.
.NH 3
User and Administrator Errors
.LP
Error checking must be performed on all command options (See section on 
.B "Specifying Job Resource Requirements" ).
.LP
Administrative commands which, when executed, cause loss of batch
jobs as a side effect must require positive confirmation by the
administrator.
.NH 3
System Hardware and Software Errors
.NH 4
PBS Internal Errors
.LP
Error returns on system calls are to be evaluated.
Boundary conditions on table and internal resources are to be checked.
In the event of a detected fatal error, PBS must attempt to ensure
integrity of its data.
Upon restart for any reason, PBS must validate its data and repair
and/or report any anomalous conditions.
.NH 4
Network Outage / System Crash
.LP
The ordering of events during routing a batch job to another server or
any network communication must protect against loss of the
batch job.
The transfer of the job request is to be in two steps.  First, a copy of the
job is transferred to the receiving server.  Second, ownership (control)
of the job is transferred.  This is known in Cosmic NQS as a
.I "two phase commit" .
This approach decreases the uncertainty of the
state on the receiving server following a communication failure.
For all other requests, positive acknowledgement must be required.
State must be recorded so that network services can resume upon
restoration of service following a network, remote system, or local
system failure.
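.LP
The two steps can be sketched in C as follows.  This is a single process
model under stated assumptions, not the PBS implementation: the type
job_slot, its states, and the three functions are hypothetical, and in a real
transfer each step would be a network request with positive acknowledgement.
.LD
```c
#include <assert.h>
#include <string.h>

enum job_state { JOB_NONE, JOB_OWNED, JOB_PENDING };

struct job_slot {              /* one job as held by one server */
    enum job_state state;
    char script[64];
};

/* Step 1: the receiver stores a copy but does not yet own the job. */
int receive_copy(struct job_slot *dst, const char *script)
{
    assert(dst != 0 && script != 0);
    if (dst->state != JOB_NONE || strlen(script) >= sizeof dst->script)
        return -1;
    strcpy(dst->script, script);
    dst->state = JOB_PENDING;
    return 0;
}

/* Step 2: ownership moves; the sender gives up the job only now. */
int commit_transfer(struct job_slot *src, struct job_slot *dst)
{
    assert(src != 0 && dst != 0);
    if (src->state != JOB_OWNED || dst->state != JOB_PENDING)
        return -1;
    dst->state = JOB_OWNED;
    src->state = JOB_NONE;
    return 0;
}

/* Receiver recovery after a failure: an uncommitted copy is discarded. */
void recover_receiver(struct job_slot *dst)
{
    assert(dst != 0);
    if (dst->state == JOB_PENDING)
        dst->state = JOB_NONE;
}
```
.DE
.LP
If communication fails between the two steps, the sender still owns the job
and the receiver holds only a copy it knows to be uncommitted, so the job is
neither lost nor run twice.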
.NH 4
Synchronized File Update
.LP
PBS must maintain copies of its task requests and internal tables
on disk.  These disk copies are used to resume processing when PBS
is restarted after either an orderly shutdown or a system crash.
To prevent corruption of its internal information and loss of
synchronization of tasks between host systems, either of which might lead to
the loss of jobs, the disk copies must be identical to the in\*-memory copies.
.LP
POSIX systems and Unix systems normally have 
.I "delayed write"
disk I/O.  This can result in a time lag between when PBS requests a write
to disk and when the data is actually recorded on disk.  If a system crash
occurs before the data is actually recorded on disk, the crashed system
may lose information or be out of sync with the other systems on the network.
To avoid this condition, it is necessary to ensure the data is written
synchronously.
POSIX.1 does not currently specify a method of performing synchronized I/O.
However, most Unix systems currently provide the capability via the
.B O\(ruSYNC
flag on the 
.I open(2)
call.  This flag forces all write operations to the file to be synchronous.
.LP
To protect against loss of data without resorting to very round\*-about
procedures which make the system hard to understand and maintain, PBS
must make use of the O\(ruSYNC capability or the 
.I fsync()
call on BSD-based Unix systems.  If the underlying system does
not support O\(ruSYNC or fsync, then there will be an increased risk of
losing jobs following a system crash.
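.LP
A minimal sketch of such a write in C is shown below.  The function name
save_record and the choice of file mode are assumptions made for
illustration.  The sketch opens the file with O_SYNC where the flag is defined
and otherwise calls fsync before closing.
.LD
```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to path so that they are recorded on disk before
 * the function returns.  Returns 0 on success and -1 on failure. */
int save_record(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    int fd;

    assert(path != 0 && buf != 0);
#ifdef O_SYNC
    flags |= O_SYNC;           /* each write returns after the data is on disk */
#endif
    fd = open(path, flags, 0600);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before any data was written */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
#ifndef O_SYNC
    if (fsync(fd) < 0) {       /* fall back to an explicit flush */
        close(fd);
        return -1;
    }
#endif
    return close(fd);
}
```
.DE
.LP
A caller would treat a return of -1 as a failure to record the job, and
would not acknowledge the request that caused the write.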
.sp 1
.NH 2
Batch System Security
.LP
PBS must provide for restricting access to the batch system.
Restriction is at two levels:  batch server and individual queue.
.NH 3
Access to Batch Server
.LP
PBS must provide a security system to restrict access to a batch processing
server to a specifiable list of systems.  This list must be easy to
maintain.   The list must be independent of any coordination with other systems 
outside of what is required to maintain a network connection.
.NH 3
Access to Queues
.LP
PBS must provide for the ability to place access restrictions on individual
queues by user name (uid) and group name (gid).  The access list, or data
base, must be easy to maintain.  
.sp 1
.NH 2
A Client Only Subsystem
.LP
PBS may provide a version of the batch subsystem which would allow
users to submit, query the status of, and delete batch jobs without
the ability to execute batch jobs on that subsystem.
This
.I "client only"
subsystem would be intended primarily for workstations.  From these
workstations the user could submit and control batch jobs for execution
on processing servers.  The workstation would not be burdened with
the full implementation of the batch server.
.sp 1
.NH 1
OPERATING SYSTEM SUPPORT
.LP
This section addresses the features required of the
underlying operating system kernel or operating environment, and those
features which are optional but the lack of which would impact PBS.
.NH 2
Resource Management
.LP
The following sections discuss the external (typically kernel) functions
required to implement resource management in PBS.
.NH 3
Setting Resource Limits
.LP
A system call which allows a process to set resource limits
must be available in the host operating system.
The system call should support all settable resources and provide for
setting limits on either a job or process basis.
.LP
.I Set_resource_limit(2)
is specified in POSIX.10, the Supercomputing Profile.
It is patterned after the
.I getrlimit(2)
call in
.B BSD
Unix and
.I limit(2)
in UNICOS by Cray.
.LP
If a vendor's system provides some other method of resource control, that
method may be incorporated into the implementation of PBS on that
system.
.LP
It is not intended to place a requirement on PBS to develop and support
special system calls which might increase the capability of PBS at
the expense of requiring special kernel modifications and additional on-going
support.
However, that development is not prohibited by this specification.
.NH 3
Notification When Exceeding a Resource Limit
.LP
In order to prevent a batch job from exceeding an established resource limit,
the usage of resources by the job must be monitored and the job must be
notified when a limit is reached. 
Typically this is performed by the kernel.
Without this kernel feature, resources can be scheduled by PBS but
the enforcement of run time limits may not be possible.
.LP
Two resource limit notifications are preferred.  The first is a warning
signal sent when the job reaches a soft limit (a warning limit which is less than
the hard limit).  This provides the job with time to take action to preserve
itself or release some of the resources.  The second is a fatal signal sent
to the job when the hard limit is reached.
It is also preferred that the soft limit be adjustable by the user as long
as it cannot exceed the hard limit which will be set by PBS
based upon the job specification.
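.LP
On systems that provide the BSD getrlimit and setrlimit calls, the relationship
between the two limits might be sketched as follows.  The function names
set_job_limits and adjust_soft_limit are hypothetical.  On such systems,
reaching the soft CPU time limit delivers SIGXCPU to the process; the
behavior for other resources varies from system to system.
.LD
```c
#include <assert.h>
#include <sys/resource.h>

/* Batch system side: establish both limits before the job starts. */
int set_job_limits(int resource, rlim_t soft, rlim_t hard)
{
    struct rlimit rl;

    assert(hard == RLIM_INFINITY || soft <= hard);
    rl.rlim_cur = soft;        /* warning limit */
    rl.rlim_max = hard;        /* limit taken from the job specification */
    return setrlimit(resource, &rl);
}

/* Job side: move the soft limit, but never past the hard limit. */
int adjust_soft_limit(int resource, rlim_t new_soft)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) < 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY &&
        (new_soft == RLIM_INFINITY || new_soft > rl.rlim_max))
        return -1;             /* request exceeds the hard limit */
    rl.rlim_cur = new_soft;
    return setrlimit(resource, &rl);
}
```
.DE
.LP
An unprivileged process can lower its hard limit but cannot raise it, so a
hard limit set by the batch system before the job starts remains the ceiling
for any later adjustment by the user.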
.NH 3
Resource Reservations
.LP
A resource limit ensures that a job will not use more of a resource than
was requested.  It does not guarantee that the job will have access to the
full amount unless the resource is solely controlled by PBS 
(see section 3.3.5).  It is desirable to be able to guarantee that a
job will in fact receive the full allotment of resources requested when the
job is scheduled.  To provide this guarantee requires kernel support which
varies from kernel to kernel.  PBS will make use of resource reservations
when supported by the kernel.  System vendors are encouraged to provide
the needed kernel functions to support reservation of resources.
.NH 2
Checkpoint / Restart
.LP
Several features of PBS depend on support of checkpoint/restart 
by the host operating system.
Checkpoint implementation varies greatly from one
system to another. 
Therefore, the interface between PBS and checkpoint must be isolated and
limited so that it can be easily modified.
If checkpoint / restart is absent from the host kernel, then those features
and command options referencing checkpoint will not be available on that
host.
.sp
.NH 1
NETWORK INTERFACE
.LP
Batch servers residing on different hosts require a network subsystem
in order to communicate.  In addition to the physical network, the network
software services, and the network subsystem protocol, PBS must define
an application protocol for transmitting requests and replies.
PBS must be built upon the \*Qclient \*- server model\*U
of distributed processing [see figure 1].
Any batch system may at one time be a client requesting service from another
server, and at another time be a server providing a processing service to a client.
The user commands will be clients requesting service from the local batch
server.  In a \*Qclient only\*U version, the commands will be clients to
a remote batch server.
.NH 2
Support of TCP/IP\*-Sockets and OSI
.LP
PBS requires a
.I "reliable stream"
virtual circuit between client and server portions of each batch
system.  At present, POSIX lists two different network services/protocols:
sockets/TCP/IP and OSI.
.LP
PBS implementation may be based on sockets and TCP/IP.  However, every
attempt must be made to ensure a simple port to the OSI interface.  Part of
this attempt is to isolate the network interface.  All network interfaces
must be collected into a single module containing a simple set of public
functions.  These public functions must be the only means of controlling and
accessing the network for PBS.
.NH 2
Basic Network Services Required
.LP
The following five functions are the public functions upon which the PBS
network interface is built.
.NH 3
PBS_listen
.LP
The PBS_listen
function is used by a batch server to indicate to the underlying network that
the server is prepared to accept incoming requests for stream connections.
The listen function must return a separate virtual connection for each
connection request (each client).
.LP
In the implementation based upon sockets and TCP/IP, the PBS_listen function
may be implemented via calls to:
.I "socket(2), bind(2), listen(2),"
and
.I accept(2) .
.NH 3
PBS_connect
.LP
The PBS_connect
function is used by a client to establish a stream connection with a
server.  If the PBS_connect function succeeds, then a unique virtual stream
connection must exist between the invoking client and requested server.
.LP
In the implementation based upon sockets and TCP/IP, the PBS_connect function
may be implemented via calls to:
.I "socket(2), bind(2),"
and 
.I connect(2) .
.NH 3
PBS_disconnect
.LP
The PBS_disconnect
function is used by a client or a server to terminate the stream
connection.
.LP
In the implementation based upon sockets and TCP/IP, the PBS_disconnect
function may be implemented via a call to:
.I close(2) .
.NH 3
PBS_write_data
.LP
The PBS_write_data
function is used by both client and server to send data of any type
over the network stream connection.  It is the intent of this specification that upon
completion of the PBS_write_data function, delivery of the data to the
remote system is guaranteed.
.LP
In the implementation based upon sockets and TCP/IP, the PBS_write_data
function may be implemented via a call to:
.I write(2) .
.NH 3
PBS_read_data
.LP
The PBS_read_data
function is used by both client and server to receive data of any
type over the network stream connection.
.LP
In the implementation based upon sockets and TCP/IP, the PBS_read_data
function may be implemented via a call to:
.I read(2) .
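.LP
The five functions might be collected into a single module along the
following lines.  This is a sketch under stated assumptions, not the PBS
interface: the lower case names, the use of the loopback address only, and
the split of the listen function into net_listen and net_accept are choices
made to keep the example short.
.LD
```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

static void loopback(struct sockaddr_in *sa, unsigned short port)
{
    memset(sa, 0, sizeof *sa);
    sa->sin_family = AF_INET;
    sa->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa->sin_port = htons(port);
}

/* Listen on *port; a port of 0 lets the system choose and report one. */
int net_listen(unsigned short *port)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    assert(port != 0);
    if (fd < 0)
        return -1;
    loopback(&sa, *port);
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(sa.sin_port);
    return fd;
}

/* Return a separate connection for the next client. */
int net_accept(int listen_fd)
{
    return accept(listen_fd, 0, 0);
}

int net_connect(unsigned short port)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    loopback(&sa, port);
    if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int net_disconnect(int fd)
{
    return close(fd);
}

/* Send all len bytes, retrying after short or interrupted writes. */
int net_write_data(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes; an early close by the peer is an error. */
int net_read_data(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```
.DE
.LP
A successful return from write(2) on a TCP socket means only that the data
has been accepted by the local system.  A design built on this sketch would
therefore rely on an acknowledgement from the peer, at the application
protocol level, to meet the stated intent that delivery is guaranteed.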
.NH 2
Network Protocol
.LP
The design of the PBS application level protocol must meet the following
goals.  It must...
.IP \(bu
Be able to be supported on a variety of
systems with different word sizes and data types.
.IP \(bu
Be extensible, allowing addition of new features, either standard or single
vendor.
.IP \(bu
Be flexible, allowing optional requests or features.
.IP \(bu
Support multiple versions of the protocol, with bidding between client and
server to reach the highest common version.
.IP \(bu
Allow for reporting and otherwise disregarding
unrecognized packets.
.LP
The protocol must support three types of packets: task requests, replies,
and data packets.  Data packets are used to transmit streams of data,
such as the job's required resources or the job's script file.
Packets of different types need not be of uniform length;
the size must be found in a common location within each packet.
The type of request must be coded in a common location in the packet.
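.LP
One way to meet these goals is a fixed header written one byte at a
time, so that the layout does not depend on the word size or byte order of
either host.  The sketch below is an assumption made for illustration: the
field widths, the type codes, and the names pkt_encode, pkt_decode, and
pkt_bid are hypothetical and are not defined by this specification.
.LD
```c
#include <assert.h>

enum pkt_type { PKT_REQUEST = 1, PKT_REPLY = 2, PKT_DATA = 3 };

#define PKT_HDR_LEN 8   /* type (2) + version (2) + body length (4) */

struct pkt_hdr {
    unsigned type;          /* one of enum pkt_type */
    unsigned version;       /* protocol version in use */
    unsigned long length;   /* bytes of body following the header */
};

/* Encode the header, most significant byte first. */
void pkt_encode(const struct pkt_hdr *h, unsigned char out[PKT_HDR_LEN])
{
    assert(h != 0 && out != 0);
    out[0] = (unsigned char)((h->type >> 8) & 0xff);
    out[1] = (unsigned char)(h->type & 0xff);
    out[2] = (unsigned char)((h->version >> 8) & 0xff);
    out[3] = (unsigned char)(h->version & 0xff);
    out[4] = (unsigned char)((h->length >> 24) & 0xff);
    out[5] = (unsigned char)((h->length >> 16) & 0xff);
    out[6] = (unsigned char)((h->length >> 8) & 0xff);
    out[7] = (unsigned char)(h->length & 0xff);
}

/* Decode the header.  Returns -1 for an unrecognized type; the length
 * is still filled in so the caller can report and skip the packet. */
int pkt_decode(const unsigned char in[PKT_HDR_LEN], struct pkt_hdr *h)
{
    assert(in != 0 && h != 0);
    h->type = ((unsigned)in[0] << 8) | in[1];
    h->version = ((unsigned)in[2] << 8) | in[3];
    h->length = ((unsigned long)in[4] << 24) | ((unsigned long)in[5] << 16) |
                ((unsigned long)in[6] << 8) | (unsigned long)in[7];
    if (h->type < PKT_REQUEST || h->type > PKT_DATA)
        return -1;
    return 0;
}

/* Highest version supported by both sides, or 0 if there is none. */
unsigned pkt_bid(unsigned my_min, unsigned my_max,
                 unsigned peer_min, unsigned peer_max)
{
    unsigned hi = my_max < peer_max ? my_max : peer_max;
    unsigned lo = my_min > peer_min ? my_min : peer_min;

    return hi >= lo ? hi : 0;
}
```
.DE
.LP
Because the length sits at the same offset in every packet, a receiver can
step over a packet whose type it does not recognize and continue with the
next one.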
.NH 2
File Transmission
.LP
Transmission of files for staging input or returning output may use
the underlying network's standard file transmission program.  For
TCP/IP, this is either
.I rcp
or
.I ftp .
For OSI, this is 
.I FTAM .
.bp
.so appendix.ms
