Running Jobs

Job Scripts

Users submit jobs to Slurm for scheduling by means of a job script: a plain text file containing Slurm-specific directives together with ordinary shell commands.

The following example script launches a job that executes the my_exe program and charges it to the my_alloc account:

#!/bin/csh

#SBATCH --account my_alloc                     # account to charge
#SBATCH --time 30                              # 30 minute time limit
#SBATCH --nodes 2                              # 2 nodes
#SBATCH --ntasks-per-node 16                   # 16 processes per node
#SBATCH --job-name my_job_name                 # job name in queue (``squeue``)
#SBATCH --error my_job_name-%j.err             # stderr file with job_name-job_id.err
#SBATCH --output my_job_name-%j.out            # stdout file
#SBATCH --mail-user=my_email_address@pnnl.gov  # email user
#SBATCH --mail-type END                        # when job ends

module purge                                   # removes the default module set
module load intel/16.1.150
module load impi/5.1.2.150

mpirun -n 32 ./my_exe

Notes

  • After purging the modules, the script should load the same compiler, MPI library, and math library that were used to compile the my_exe program. The order of the module load commands is important.

  • All #SBATCH lines must come before shell script commands.

  • Include your preferred shell as the first line of your batch script, as shown in the sketch below.
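For example, the first line of the script selects the shell that interprets the remaining commands; either of the following shebang lines works, depending on your preference:

#!/bin/csh      # the shell used in the example above
#!/bin/bash     # for users who prefer bash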

Slurm Directives

Options passed to Slurm are referred to as directives. They can be placed in a submission script, as shown above in Job Scripts, or given on the command line when the job is submitted. The directives are the same in either case, but when inserted into a job script each line must begin with the #SBATCH prefix. Some common directives and their descriptions:

-A, --account=<account>

Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command.

-d, --dependency=<dependency_list>

Defer the start of this job until the specified dependencies have been satisfied, e.g. -d afterany:226783.
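For example, to hold a post-processing job until an earlier job finishes (the script names are placeholders; 226783 is the job ID used elsewhere on this page):

$ sbatch first_job_script                        # suppose this returns job ID 226783
$ sbatch -d afterany:226783 postprocess_script   # starts only after job 226783 terminates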

-D, --workdir=<directory>

Set the working directory of the batch script to directory before it is executed. The path can be specified as an absolute path or as a path relative to the directory in which the command is executed.

-e, --error=<filename pattern>

Instruct Slurm to connect the batch script’s standard error directly to the file name specified in the “filename pattern”. By default both standard output and standard error are directed to the same file. The default file name is “slurm-%j.out”, where the “%j” is replaced by the job ID.

-J, --job-name=<jobname>

Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just “sbatch” if the script is read on sbatch’s standard input.

--mail-type=<type>

Notify user by email when certain event types occur. Some type values are NONE, BEGIN, END, FAIL, TIME_LIMIT_90 (reached 90 percent of time limit). Multiple type values may be specified in a comma separated list.

--mail-user=<user>

User to receive email notification of state changes as defined by --mail-type.
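For example, to be notified when the job starts and when it ends or fails (the address below is a placeholder):

#SBATCH --mail-user=my_email_address@pnnl.gov
#SBATCH --mail-type=BEGIN,END,FAIL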

-N, --nodes=<minnodes[-maxnodes]>

Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count.

--ntasks-per-node=<ntasks>

Request that ntasks be invoked on each node. Meant to be used with the --nodes option.
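Together, --nodes and --ntasks-per-node determine the total number of MPI ranks. The example script above uses them to produce the 32 processes passed to mpirun:

#SBATCH --nodes 2                # 2 nodes
#SBATCH --ntasks-per-node 16     # 16 tasks per node, for 2 x 16 = 32 ranks in total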

-o, --output=<filename pattern>

Instruct Slurm to connect the batch script’s standard output directly to the file name specified in the “filename pattern”. By default both standard output and standard error are directed to the same file. The default file name is “slurm-%j.out”, where the “%j” is replaced by the job ID.

-p, --partition=<partition_names>

Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the Slurm controller to select the default partition as designated by the system administrator. If the job can use more than one partition, specify their names in a comma-separated list and the one offering earliest initiation will be used, with no regard given to the partition name ordering.
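On this system the scheduler normally chooses the partition for you (see Queues and Queue Limits below), but the directive can be given explicitly; for example, to allow the job to run in either the small or the medium partition (names taken from the table below):

#SBATCH --partition=small,medium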

-t, --time=<time>

Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition’s time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition’s default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
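For example, each of the following lines requests the same 90-minute limit in a different accepted format (use only one of them in a real script):

#SBATCH --time=90          # minutes
#SBATCH --time=1:30:00     # hours:minutes:seconds
#SBATCH --time=0-1:30      # days-hours:minutes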

--gres=gpu:2

Run this job on the NVIDIA Tesla GPGPU nodes. Note: the maximum time limit for GPU jobs is 2 hours.
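A minimal sketch of a GPU job header combining this directive with those described above (the account is a placeholder; the time reflects the stated 2-hour maximum):

#SBATCH --account my_alloc
#SBATCH --gres=gpu:2       # request 2 GPUs per node
#SBATCH --time 120         # GPU jobs are limited to 2 hours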

All available directives can be listed with sbatch --help; more complete descriptions are available from man sbatch or online at https://slurm.schedmd.com/sbatch.html.

Job Submission

Job scripts are submitted with the sbatch command. A successful submission returns the job’s ID:

$ sbatch myjobscript

Submitted batch job 226783
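You can then follow the job’s progress with squeue (mentioned in the script comments above), for example by listing your own jobs or the specific job ID that sbatch returned:

$ squeue -u $USER
$ squeue -j 226783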

Queues and Queue Limits

Queue (partition) limits determine how long a job with a given number of nodes is permitted to run. The scheduler automatically places your job in the appropriate partition based on the resources it requests.

Queue/Partition name    Nodes in 1 Job    Time Limit (hours)    Notes
large                   128 - 460         48                    highest priority
medium                  16 - 127          48                    higher priority than smaller jobs
small                   1 - 15            48                    used to backfill around jobs in the larger queues
long                    1 - 16            168                   lower priority; for jobs requesting longer run times, limited to 16 nodes/job

Notes

  • Changes to the contents of the script after the job has been submitted do not affect the current job.

  • The default working directory is the directory from which you submit your script. If your job needs to run in another directory, either cd to that directory in the job script or specify sbatch --workdir=<directory>, as in the sketch below.
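For example, either of the following runs the job from a different directory (the path is a placeholder):

# inside the job script, after the #SBATCH lines:
cd /path/to/run_dir

# or at submission time:
$ sbatch --workdir=/path/to/run_dir myjobscript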

Job Policy Constraints

The total number of jobs a user may have running at once depends on overall system activity. When the system is busy with many users:

  • Maximum number of jobs in the Running state: 20

  • Maximum number of nodes per job (except by special arrangement): 460 (7360 processor-cores)

  • Minimum number of nodes per job: 1 (for test or interactive purposes only)

Warning

Jobs must be run through the batch queue rather than on the login node. Jobs found running on the login node will be terminated by the system administrator and the user will be sent an e-mail message. Jobs in the batch queue that have been suspended for more than 12 hours will be deleted.

Large Jobs

Jobs larger than 460 nodes can be run when needed. Please first contact your Science Point-of-Contact (your PI should know who that is) to make arrangements, or send an e-mail to msc-consulting@emsl.pnl.gov with details of what is required.

Scheduling

For communication-intensive codes, using nodes on different top-level switches results in a fairly dramatic performance decrease. For this reason the scheduler is configured to preferentially allocate all of a job’s nodes within a single switch. For example, if you have a 20-node job and only 7 nodes are available on each of the 3 top-level switches (21 available nodes in total), your job will not start until 20 nodes are available within one switch. This behavior may be overridden for benchmarking purposes, or after discussion with MSC staff to ensure that the application’s use of the interconnect is topology aware and/or light enough not to cause congestion on the Tahoma or Boreal interconnect.

Running Interactive Jobs

Boreal supports interactive jobs. The partition in which the job runs is determined by the time the user has requested.

If your application requires an X session (usually a graphical application), make sure you logged in with the -Y option to ssh, which enables X11 forwarding:

ssh -Y <name>@boreal.emsl.pnl.gov

To start an interactive job, use the srun command:

srun -A <allocation> -t <time> -N <nodes> --pty /bin/bash

Where:

  • -A - the allocation (project) to charge for the job.

  • -t - the time limit; a bare number is interpreted as minutes.

  • -N - the number of nodes you want.

  • /bin/bash - the command to run; typically it is your shell.

Alternatively, you can use the long options:

srun --account=<allocation> --time=<time> --nodes=<nodes> --pty /bin/bash

As with all jobs, an interactive job will wait in the queue until there is a set of available nodes for it.
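For example, a one-node, 30-minute interactive session charged to a placeholder allocation can be started as follows; once the nodes are allocated you are placed in a shell on a compute node, and typing exit ends the job and releases the nodes:

$ srun -A my_alloc -t 30 -N 1 --pty /bin/bash
(work interactively on the allocated compute node)
$ exit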