Question: My batch job exited with code ###. What does that mean?
When a program finishes executing, it returns an exit code to
the system, and the batch system reports this exit code. There are
three general ways for the exit code of a program to be set.
1) The program can explicitly call exit() (or return from main(),
which eventually calls exit()). In this case the exit code is
the argument to exit() and its meaning depends on the program.
The call to exit() may actually occur in a library routine that
your program uses. An example of this is the SGI FORTRAN
I/O library, whose routines set an exit code in the range
100 - 185 when an error occurs. The specific meaning of these codes
can be found in the appendix to the Fortran 77 Programmer's Guide
(available online as an InSight book).
2) The program runs off the end of main() without calling
exit() or executing a return statement. In this case the system
sets the exit code to 0.
3) The program can terminate due to the receipt of a signal. In
this case the system sets the exit code to 128 + <signal number>.
(This assumes that the program doesn't have a signal handler
which calls exit(); if it does, we're back in case 1.)
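As a rough illustration of the three cases, the following Python sketch classifies a reported exit code. The function name describe_exit_code is made up for this example, and the result is only a guess: a program can pass any value to exit(), and library codes such as the Fortran I/O range 100 - 185 overlap the 128 + <signal number> range.

```python
def describe_exit_code(code):
    """Guess how an exit code was set, following the three cases above.

    Hypothetical helper for illustration only; the answer is a heuristic.
    """
    if code == 0:
        return "0: normal completion"
    if 128 < code < 256:
        # Case 3: the system reports 128 + <signal number>.
        # A program or library could also have passed this value to exit().
        return f"{code}: probably killed by signal {code - 128}"
    # Case 1: the value was passed to exit() by the program or a library.
    return f"{code}: set by the program, meaning is program-specific"


for code in (0, 2, 139, 143):
    print(describe_exit_code(code))
```

The signal number printed for a code above 128 still has to be looked up for the platform the job ran on.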
The following table lists the various signals whose default
action is to terminate a program. Note that the signal numbers, and
therefore the exit codes, differ depending on the platform you run on
(SGI Origins or IBM pSeries). See /usr/include/sys/signal.h for more info.
Name         Number (SGI)   Number (IBM)
SIGHUP       1              1
SIGINT       2              2
SIGQUIT      3              3
SIGILL       4              4
SIGTRAP      5              5
SIGABRT      6              6
SIGEMT       7              7
SIGFPE       8              8
SIGKILL      9              9
SIGBUS       10             10
SIGSEGV      11             11
SIGSYS       12             12
SIGPIPE      13             13
SIGALRM      14             14
SIGTERM      15             15
SIGUSR1      16             30
SIGUSR2      17             31
SIGPOLL      22             23
SIGIO        22             23
SIGVTALRM    28             34
SIGPROF      29             32
SIGXCPU      30             24
SIGXFSZ      31             25
SIGRTMIN     49             888
SIGRTMAX     64             999
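Because the numbers differ, the same exit code can point to different signals on the two platforms. The sketch below copies a few rows of the table into a Python dictionary and applies the 128 + <signal number> rule in both directions. The names SIGNAL_NUMBERS, exit_code_for, and signals_for_exit_code are made up for this example.

```python
# A few rows copied from the table above: name -> (SGI number, IBM number).
SIGNAL_NUMBERS = {
    "SIGKILL": (9, 9),
    "SIGSEGV": (11, 11),
    "SIGTERM": (15, 15),
    "SIGUSR1": (16, 30),
    "SIGUSR2": (17, 31),
    "SIGXCPU": (30, 24),
    "SIGXFSZ": (31, 25),
}
PLATFORMS = ("sgi", "ibm")  # column order used in SIGNAL_NUMBERS


def exit_code_for(name, platform):
    """Exit code reported when the named signal kills a job."""
    number = SIGNAL_NUMBERS[name][PLATFORMS.index(platform)]
    return 128 + number


def signals_for_exit_code(code, platform):
    """Names from the rows above whose exit code matches on this platform."""
    return [name for name in SIGNAL_NUMBERS
            if exit_code_for(name, platform) == code]


print(exit_code_for("SIGUSR2", "sgi"), exit_code_for("SIGUSR2", "ibm"))
print(signals_for_exit_code(159, "sgi"), signals_for_exit_code(159, "ibm"))
```

With these rows, exit code 159 maps to SIGXFSZ on the SGI systems but to SIGUSR2 on the IBM systems, so always check which platform the job ran on before interpreting a code.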
There are several reasons that your program might receive
a signal.
a) You sent it a signal with kill, bkill, or bdel. If you don't
specify which signal to send, kill defaults to SIGTERM
(exit code 143) and bkill defaults to SIGKILL (exit code 137).
The bdel command sends SIGINT (exit code 130), then SIGTERM, then
SIGKILL until your job dies.
b) The system sent it a signal because an error occurred or
a system resource limit was reached. In this case, in addition
to the exit code, the batch system will usually report an
error message. Examples of this case are:
(Where two exit codes are given, the first is for SGI and the second for IBM.)
Signal    Exit Code   Typical Reason
SIGILL    132         illegal instruction, binary probably corrupt
SIGTRAP   133         integer divide-by-zero
SIGFPE    136         floating point exception or integer overflow
                      (these exceptions aren't generated unless
                      special action is taken, see man sigfpe for
                      more information)
SIGBUS    138         unaligned memory access (e.g. loading a word
                      that is not aligned on a word boundary)
SIGSEGV   139         attempt to access a virtual address which
                      is not in your address space
SIGXCPU   158/152     CPU time limit exceeded
SIGXFSZ   159/153     File size limit exceeded
c) The batch system sent it a signal because it exceeded a limit
on the queue it was running in. Two queue limits are enforced
in this way:
i) CPU limit. When the total CPU usage of all processes in a
batch job exceeds the queue CPU limit, the batch system
kills the job by sending SIGXCPU, then SIGINT, then SIGTERM,
then SIGKILL until the job dies. In this case the exit
message says "Exited with signal termination: Cputime limit
exceeded, and core dumped."
Under rare circumstances, the system (as opposed to the batch
system) could kill a job by sending SIGXCPU. In this case the
exit message would say "Exited with exit code 158." on the SGI
systems (152 on the IBM systems).
ii) RUNTIME limit. Most batch queues have a limit on the actual
time that a job can run. When this limit is exceeded, the
batch system kills the job by sending SIGUSR2, then SIGINT,
then SIGTERM, then SIGKILL until the job dies. Usually the
SIGUSR2 kills the job and the exit message says
"Exited with exit code 145." on the SGI systems
"Exited with exit code 159." on the IBM systems
(Since the batch system is killing the job, one could argue
that it is a bug not to give a more informative message.)
A third queue limit, the STACKSIZE limit, is enforced by the
system (rather than the batch system) killing the job by sending
a SIGSEGV. The exit message says "Exited with exit code 139."
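The kill sequences described above (send one signal, wait, then send a stronger one until the job dies) can be sketched in Python as follows. This is not the batch system's implementation: the names escalate and shell_style_code, the grace period, and the throwaway child process are all assumptions of this example, and the signal order mirrors the CPU limit case. It needs a POSIX system and uses the signal numbers of the machine it runs on, which may differ from the tables above.

```python
import signal
import subprocess
import sys


def escalate(proc, signals, grace=5.0):
    """Send each signal in turn until proc exits; return the last one sent.

    Assumes the sequence ends with SIGKILL, which cannot be caught.
    """
    for sig in signals:
        proc.send_signal(sig)
        try:
            proc.wait(timeout=grace)
            return sig
        except subprocess.TimeoutExpired:
            continue  # still alive, try the next signal
    proc.wait()
    return signals[-1]


def shell_style_code(returncode):
    """Convert subprocess's returncode to the 128 + signal convention."""
    if returncode < 0:  # subprocess reports death by signal N as -N
        return 128 + (-returncode)
    return returncode


CPU_LIMIT_SEQUENCE = [signal.SIGXCPU, signal.SIGINT,
                      signal.SIGTERM, signal.SIGKILL]

# Stand-in for a batch job: a child that would otherwise sleep for 30 seconds.
job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
fatal = escalate(job, CPU_LIMIT_SEQUENCE)
print(fatal.name, shell_style_code(job.returncode))
```

Since the child installs no handler for SIGXCPU, the first signal in the sequence is expected to kill it, and the printed code is 128 plus the local SIGXCPU number.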