Question: My batch job exited with code ###.  What does that mean?

When a program finishes executing, it returns an exit code to
the system, and the batch system reports this exit code. There are
three general ways for the exit code of a program to be set.

1) The program can explicitly call exit() (or return from main(),
   which eventually calls exit()). In this case the exit code is
   the argument to exit() and its meaning depends on the program.
   The call to exit() may actually occur in a library routine that
   your program uses. An example of this is the SGI FORTRAN
   I/O library, whose routines set an exit code in the range
   100 - 185 when an error occurs. The specific meaning of these codes
   can be found in the appendix to the Fortran 77 Programmer's Guide
   (available online as an insight book).
2) The program executes the last instruction in main() without
   calling exit() or executing a return statement. In this case the
   system sets the exit code to 0.
3) The program can terminate due to the receipt of a signal. In
   this case the system sets the exit code to 128 + <signal number>.
   (This assumes that the program doesn't have a signal handler
   which calls exit(); if it does, we're back in case 1.)
   The following table lists the various signals whose default
   action is to terminate a program. Note that some signal numbers,
   and therefore the resulting exit codes, differ depending on the
   platform you run on (SGI Origins or IBM pSeries). See
   /usr/include/sys/signal.h on that platform for more info.

          Name     Number (SGI)   Number (IBM)
          SIGHUP      1              1
          SIGINT      2              2
          SIGQUIT     3              3
          SIGILL      4              4
          SIGTRAP     5              5
          SIGABRT     6              6
          SIGEMT      7              7
          SIGFPE      8              8
          SIGKILL     9              9
          SIGBUS      10             10
          SIGSEGV     11             11
          SIGSYS      12             12
          SIGPIPE     13             13
          SIGALRM     14             14
          SIGTERM     15             15
          SIGUSR1     16             30
          SIGUSR2     17             31
          SIGPOLL     22             23
          SIGIO       22             23
          SIGVTALRM   28             34
          SIGPROF     29             32
          SIGXCPU     30             24
          SIGXFSZ     31             25
          SIGRTMIN    49             888
          SIGRTMAX    64             999
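
As a rough illustration of the 128 + <signal number> rule, the sketch
below maps a reported exit code back to a signal name using the numbers
from the table above. The names SIGNALS, PLATFORM_COLUMN, and
describe_exit_code are made up for this example and are not part of any
batch system tool. The result is only a hint: a code above 128 may
equally have been passed to exit() by the program or a library (the
FORTRAN I/O range of 100 - 185 overlaps the signal range).

```python
# Signal numbers copied from the table above: name -> (SGI number, IBM number).
# The real-time signals (SIGRTMIN, SIGRTMAX) are left out of this sketch.
SIGNALS = {
    "SIGHUP": (1, 1),
    "SIGINT": (2, 2),
    "SIGQUIT": (3, 3),
    "SIGILL": (4, 4),
    "SIGTRAP": (5, 5),
    "SIGABRT": (6, 6),
    "SIGEMT": (7, 7),
    "SIGFPE": (8, 8),
    "SIGKILL": (9, 9),
    "SIGBUS": (10, 10),
    "SIGSEGV": (11, 11),
    "SIGSYS": (12, 12),
    "SIGPIPE": (13, 13),
    "SIGALRM": (14, 14),
    "SIGTERM": (15, 15),
    "SIGUSR1": (16, 30),
    "SIGUSR2": (17, 31),
    "SIGPOLL": (22, 23),
    "SIGIO": (22, 23),
    "SIGVTALRM": (28, 34),
    "SIGPROF": (29, 32),
    "SIGXCPU": (30, 24),
    "SIGXFSZ": (31, 25),
}

# Hypothetical platform keys selecting a column of the table.
PLATFORM_COLUMN = {"sgi": 0, "ibm": 1}


def describe_exit_code(code, platform):
    """Return the signal name(s) implied by an exit code, or None."""
    if code <= 128:
        # Codes up to 128 were set by exit() or a normal return (cases 1 and 2).
        return None
    column = PLATFORM_COLUMN[platform]
    number = code - 128
    # Several names can share one number (SIGPOLL and SIGIO), so collect all.
    names = [name for name, nums in SIGNALS.items() if nums[column] == number]
    return "/".join(names) if names else None
```

With this table, describe_exit_code(159, "sgi") gives "SIGXFSZ" while
describe_exit_code(159, "ibm") gives "SIGUSR2", which is why the platform
has to be known before an exit code can be interpreted.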

  There are several reasons that your program might receive
  a signal.
  a) You sent it a signal with kill, bkill, or bdel. If you don't
     specify which signal to send, kill defaults to SIGTERM
     (exit code 143) and bkill defaults to SIGKILL (exit code 137).
     The bdel command sends SIGINT (exit code 130), then SIGTERM,
     then SIGKILL until your job dies.
  b) The system sent it a signal because an error occurred or
     a system resource limit was reached. In this case, in addition
     to the exit code, the batch system will usually report an
     error message. Examples of this case are:

     Signal  Exit Code  Typical Reason        
     SIGILL     132     illegal instruction, binary probably corrupt
     SIGTRAP    133     integer divide-by-zero
     SIGFPE     136     floating point exception or integer overflow
                        (these exceptions aren't generated unless
                         special action is taken, see man sigfpe for
                         more information)
     SIGBUS     138     unaligned memory access (e.g. loading a word
                        that is not aligned on a word boundary)
     SIGSEGV    139     attempt to access a virtual address which
                        is not in your address space
     SIGXCPU    158/152 CPU time limit exceeded (SGI/IBM exit code)
     SIGXFSZ    159/153 File size limit exceeded (SGI/IBM exit code)
      
  c) The batch system sent it a signal because it exceeded a limit
     on the queue it was running in. Two queue limits are enforced
     in this way:
     i) CPU limit. When the total CPU usage of all processes in a
        batch job exceeds the queue CPU limit, the batch system
        kills the job by sending SIGXCPU, then SIGINT, then SIGTERM,
        then SIGKILL until the job dies. In this case the exit
        message says "Exited with signal termination: Cputime limit
        exceeded, and core dumped."
        Under rare circumstances, the system (as opposed to the batch
        system) could kill a job by sending SIGXCPU. In this case the
        exit message would say "Exited with exit code 158." (152 on
        the IBM systems).
    ii) RUNTIME limit. Most batch queues have a limit on the actual
        time that a job can run. When this limit is exceeded, the
        batch system kills the job by sending SIGUSR2, then SIGINT,
        then SIGTERM, then SIGKILL until the job dies. Usually the
        SIGUSR2 kills the job and the exit message says
        "Exited with exit code 145." on the SGI systems or
        "Exited with exit code 159." on the IBM systems.

        (Since the batch system is killing the job, one could argue
         that it is a bug not to give a more informative message.)

     A third queue limit, the STACKSIZE limit, is enforced by the
     system (rather than the batch system), which kills the job by
     sending SIGSEGV. The exit message says "Exited with exit code 139."
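
The three ways of setting an exit code described in this answer can be
reproduced on any POSIX system with a small experiment. The sketch below
is only an illustration: it uses Python child processes in place of a
batch job, and run_child is a name invented for this example. Python
reports a child killed by a signal as the negated signal number, so the
helper converts that to the 128 + <signal number> form used above. The
signal number comes from the machine the script runs on, not from the
SGI or IBM columns of the table.

```python
import signal
import subprocess
import sys


def run_child(source):
    """Run a Python snippet in a child process; return a shell-style exit code."""
    result = subprocess.run([sys.executable, "-c", source])
    if result.returncode < 0:
        # subprocess reports death by signal N as -N; convert to 128 + N.
        return 128 + (-result.returncode)
    return result.returncode


# Case 1: the program explicitly exits with a code of its own choosing.
explicit = run_child("import sys; sys.exit(3)")

# Case 2: the program simply runs to completion.
fell_off = run_child("pass")

# Case 3: the program is terminated by a signal (here it signals itself).
killed = run_child("import os, signal; os.kill(os.getpid(), signal.SIGTERM)")

print(explicit, fell_off, killed)
```

On a system where SIGTERM is signal 15, the last value printed is 143,
matching the exit code given above for a job ended with kill.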
