
Chinook Details

Contents: Configuration - Access - File Systems - Environment - Modules - Compilers and Optimization - MPI - Math Libraries - Debuggers - Job Submission and MOAB - NWChem jobs - Sample Script - Controlling Node Sets - Interactive jobs - Time Allocation Accounts - Job Policies - User Policies

Configuration

Chinook is a 160 TFlops system that consists of 2310 HP DL185 nodes with dual-socket, 64-bit, quad-core AMD 2.2 GHz Opteron processors (also called Barcelona). Each node has 32 Gbytes of memory, i.e. 4 Gbytes per core, and 365 Gbytes of local disk space. Fast communication between the nodes is provided by a single-rail InfiniBand interconnect from Voltaire (switches) and Mellanox (NICs). The system runs a version of Linux based on Red Hat Linux Advanced Server. A global 297 Tbyte SFS file system is available to all the nodes. Node allocation is scheduled using MOAB and the SLURM resource manager.
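The headline figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in the paragraph above, plus one assumption that is not stated in this document: that each Barcelona core completes 4 double-precision floating-point operations per clock cycle.

```shell
# Figures taken from the Configuration paragraph.
nodes=2310           # HP DL185 nodes
cores_per_node=8     # dual socket x quad core
ghz=2.2              # core clock in GHz
mem_per_node_gb=32   # memory per node in Gbytes

# ASSUMPTION (not from this document): 4 double-precision flops per cycle per core.
flops_per_cycle=4

total_cores=$((nodes * cores_per_node))
mem_per_core_gb=$((mem_per_node_gb / cores_per_node))

# Shell arithmetic is integer-only, so awk handles the floating-point product.
peak_tflops=$(awk -v c="$total_cores" -v g="$ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "%.1f", c * g * f / 1000 }')

echo "total cores:     $total_cores"
echo "memory per core: $mem_per_core_gb Gbytes"
echo "peak estimate:   $peak_tflops TFlops"
```

Under that assumption the estimate lands slightly above the rounded 160 TFlops figure, and the 4 Gbytes per core follows directly from 32 Gbytes shared by 8 cores.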

Access

Accessing Chinook with SecurID®

For security reasons, access to EMSL computers is obtained through one-time passcodes using SecurID® cards.

The procedure for remote access to Chinook is presented below. You must have a SecurID® card from MSC-EMSL and have initialized your passcode BEFORE you try to log on to Chinook.

More information on SecurID®

From Linux or Unix systems

{Note: Our machines use protocol 2; you may need to use ssh2 or ssh -2 for it to work.}

  1. Type the following at the window prompt:
    ssh <Username>@chinook.emsl.pnl.gov
  2. When prompted for the PASSCODE, enter your PIN and SecurID® number
  3. When prompted for your password, enter your PNNL network (kerberos) password for Chinook

From PC or Mac systems

You will need to use at least version 5.3 build 23 of SSH software from F-Secure (current version is 8.01). When connecting to Chinook, the Authentication method must be set to "Keyboard Interactive". Alternatively, you can use PuTTY on Win32 platforms.

  1. Start F-Secure or PuTTY
  2. Set Host name to chinook.emsl.pnl.gov
  3. Set User Name to your Username (userID) on Chinook
  4. Set Authentication method to "Keyboard Interactive" (the default for PuTTY)
  5. Click on 'Connect' {F-Secure} or 'Open' {PuTTY}
  6. When prompted for the PASSCODE, enter your PIN and SecurID® number
  7. When prompted for your password, enter your PNNL network (kerberos) password for Chinook

File Systems

There are five file systems available on Chinook:

Environment

Developing and running applications requires a correct set of compilers, communication libraries, math libraries, and tools. These pieces are not interchangeable with other pieces in the software development suite, and they are regularly updated. Another area of variability is the individual user. We provide here basic recommendations for how to set up and manage your Chinook environment.

Home, Environment Variables, and Path for Chinook

Environment setup through modules

To make software more supportable and environment setup more automatic, and therefore easier for the user community, we have adopted "modules" as a way to present packages of software that work together and to describe required dependencies among software packages. Loading a module environment provides the user with the correct paths to commands, compilers, and libraries, and sets up the necessary environment variables. The default module environment, which is considered to be integer*4 (i.e. -i4), is loaded at login time.

Various commands are available to probe your environment and to add or change (pieces of) the user environment:

Notes on using modules

You Need to Load Your Modules in This Order:

  1. Start with module purge to remove the defaults.
  2. Load an integer precision module (precision/i4 (default) or precision/i8).
  3. Load a compiler module (intel/10.1.008, intel/10.1.015 (default), intel/10.1.018, pathscale/3.1, pathscale/3.2, gcc/3.4, or gcc/4.1).
  4. Load an MPI module (mvapich/1.0.0-2533 (default), mvapich2/1.2p1, hpmpi/2.02.07.01, hpmpi/2.02.07.02 (default), hpmpi/2.03.00.01, intelmpi/3.1, voltaire/1.2.5, or openmpi/1.3).
  5. Load the math library module of choice (mkl/10.0.11 (default), mkl-scalapack(10.0.2.018), acml/4.0.1 (default), acml/4.0.1.omp, acml/4.2.0, or acml/4.2.0.omp).
  6. Load the MOAB module.

If you wish to change one of those pieces, it's probably best to do a module purge and load them again in the order above. Remember to also load devtools when debugging.
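The load order can be wrapped in a small shell function so that it is always replayed from a clean state. This is a minimal sketch, not a site-provided tool: the function name load_chinook_env is hypothetical, the defaults are entries marked "(default)" in the list above (hpmpi is picked arbitrarily from the two MPI defaults), and "moab" as the name of the MOAB module is an assumption. It is written in sh/bash function syntax; the individual module commands are typed the same way in csh.

```shell
# load_chinook_env [PRECISION] [COMPILER] [MPI] [MATHLIB]
# Replays the documented load order, stopping at the first failure.
load_chinook_env() {
    module purge                          || return 1   # 1. drop the login defaults
    module load "${1:-precision/i4}"      || return 1   # 2. integer precision
    module load "${2:-intel/10.1.015}"    || return 1   # 3. compiler
    module load "${3:-hpmpi/2.02.07.02}"  || return 1   # 4. MPI
    module load "${4:-mkl/10.0.11}"       || return 1   # 5. math library
    module load moab                      || return 1   # 6. MOAB (module name assumed)
    module list                                         # show what ended up loaded
}

# Example calls (commented out; they need the modules environment):
#   load_chinook_env
#   load_chinook_env precision/i8 pathscale/3.2 mvapich/1.0.0-2533 acml/4.2.0
```

Changing one piece then means calling the function again with a different argument, which performs the purge and reload in one step.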

Compilers and Code Optimization

The default compilers are Intel's ifort (for Fortran) and icc (for C/C++). Version 10.1.015 is currently the default on the system, though newer and older versions are often available.
The PathScale compilers (version 3.1), pathf95 (for Fortran), pathcc (for C), and pathCC (for C++), are available with the "pathscale" module.
The GNU compiler, gcc, is also available; the current default version is 3.4. The newer 4.1 version is available with the "gcc/4.1" module.

Optimizing Code for the Barcelona

The Barcelona processor features an L3 2MB cache shared by the four processor-cores. Each processor-core has its own 512KB L2 and 64KB L1 cache. Poor instruction optimization can cost you a few CPU cycles. Cache misses can cost 300 cycles. TLB (Translation Lookaside Buffers) misses can cost 1000 cycles. Cache contention between cores can cost many cycles on all cores at the same time. The following general practices are recommended based on our experience with the quad-core processors on Chinook. For more details check out the AMD Opteron Developer Central web site.

  1. Avoid using more kernel-level threads than cores on a node. Under Linux kernel-level threads are processes, so the TLB is flushed every time the thread is switched. This does not apply to user-level threads done via user-level threading libraries (such as GNU Portable Threads (Pth)).
  2. Keep data local. Data shared between cores (such as in OpenMP) can lead to multiple cores accessing the same cache line at the same time, which leads to expensive cache line probing. If writing data, then cache line bouncing can occur, which leads to extreme slowness. To fix this, data should be manipulated on per process copies and then communicated when done.
  3. Optimize your code for the PathScale compilers with the following flags:
    • -Os   -march=barcelona   -m64   -msse   -msse2   -msse3   -m3dnow   -msse4a   -OPT:Olimit=0
    • You may find that -O2 optimized code is a little faster than -Os.
  4. Optimize your code for the Intel compilers with the following flags:
    • -xW produces code better suited to the Barcelona.
    • -O1 optimizes code for smaller executable size.
    • -O2 optimization might produce faster code than -xW.
    • -O3 optimization adds loop transformations that may or may not help.
    • Do NOT use the -ftz or -fpe flags; they will slow down the code.
  5. Run the "strip" command on your final executable.
  6. Put related subroutines in the same page of memory. This is done with carefully crafted makefiles; see your local consultant (not for the novice).
  7. Follow good coding practices. Declare C functions static whenever possible. Use function prototypes in C. Use array indices, not pointers, to access arrays. Avoid recursion.
  8. Align data. In common blocks and structures, put the larger data types first (doubles before singles).
  9. Use large data packets over the network. This not only reduces MPI/IB overhead, but the DMA writes from the network do expensive cache line probes.
  10. Code order matters. If you are going to pin processes to CPUs, do so before allocating memory (some MPI libraries pin for you, depending upon options). This way each core is "close" to its memory.
  11. Minimize use of CPU microcode. The most common use of microcode is division. Another is math involving 2-byte integers (short, integer*2).
  12. All 2-byte integer math uses microcode, which is slow. On the other hand, 2-byte integers use less memory, which means fewer cache misses and fewer TLB misses. But 2-byte integers can also introduce false register dependencies, which are slow. Thus, you have to test to see which is better for your code.
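The compiler flags from items 3 and 4 are easy to mistype, so it can help to keep them in one place. The helper below is a hypothetical sketch (the name opt_flags is not a Chinook command) that prints the starting-point flags listed above for each compiler family. The 3DNow! flag is spelled -m3dnow here. For Intel only -xW is returned, since the list leaves the choice between -O1, -O2, and -O3 to timing your own code.

```shell
# opt_flags FAMILY: print the starting-point optimization flags for a
# compiler family ("pathscale" or "intel").
opt_flags() {
    case "$1" in
        pathscale)
            echo "-Os -march=barcelona -m64 -msse -msse2 -msse3 -m3dnow -msse4a -OPT:Olimit=0" ;;
        intel)
            echo "-xW" ;;
        *)
            echo "opt_flags: unknown compiler family '$1'" >&2
            return 1 ;;
    esac
}

# Example use (commented out; needs the compilers and a source file of your own):
#   pathcc $(opt_flags pathscale) -o my_program my_program.c
#   icc    $(opt_flags intel) -O2  -o my_program my_program.c
#   strip my_program      # item 5: strip the final executable
```

Because item 3 notes that -O2 may beat -Os, it is worth building both variants and timing them on a representative input.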

MPI Libraries

The primary communication protocol for running parallel jobs is MPI. There are four different flavors of MPI libraries available on Chinook and there are a number of ways you can compile your parallel codes. All MPI versions on Chinook are set to be "fork() safe".

The environment variables MPI_INCLUDE and MPI_*LIBRARIES get set by the modules environment to point to the appropriate locations and libraries.

The four MPI flavors are:

Math libraries

For each of the math libraries you will need to add the following environment variable to your link line:

The MLIB_LDFLAGS environment variable gets set by the modules environment to point to the appropriate library location and list of libraries to link.
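As an illustration of how MLIB_LDFLAGS might appear on a link line, the sketch below assembles the command and prints it rather than running it. The helper name link_line and the file names are hypothetical; only the MLIB_LDFLAGS variable comes from this page.

```shell
# link_line DRIVER OUTPUT OBJECT...
# Prints a link command with $MLIB_LDFLAGS appended after the object files.
link_line() {
    if [ -z "${MLIB_LDFLAGS:-}" ]; then
        echo "MLIB_LDFLAGS is empty; load a math library module first" >&2
        return 1
    fi
    driver=$1
    output=$2
    shift 2
    echo "$driver -o $output $* $MLIB_LDFLAGS"
}

# Example (commented out; needs a math library module to be loaded):
#   link_line ifort my_program main.o solver.o
```

Placing the variable after the object files matters with most linkers, because libraries are searched in the order they appear on the command line.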

Debugging

Intel's idb parallel debugger is available on the system when the Intel module is loaded. The GNU gdb debugger can be used to debug individual processes of a parallel program on each processor-core. In addition, the TotalView debugger (module load totalview) is available for debugging. For details on how to use it, see the TotalView documentation.

Job Submission and MOAB

Scheduling on Chinook is handled by MOAB, a batch scheduler from Cluster Resources, together with the Simple Linux Utility for Resource Management (SLURM) resource manager. The "msub" command is used to submit jobs; it is similar to the LSF bsub command, with some changes discussed below. The msub command returns (prints to your screen) a MOAB jobid. Use the "canceljob" command to cancel a job and the "checkjob" command to check the status of a job. Both of these commands need the MOAB jobid. You can use the "showq" command to check which jobs are in the queue. The format of the job submission script is discussed in the Sample Script section below.

Jobs displayed by the "showq" command are in three groups: Active, Eligible, and Blocked. Within these groups, jobs are listed in one of the following states: Running, Starting, Idle, SystemHold, BatchHold, and Deferred.

  1. Active Jobs: Those that are either Running or Starting and consuming resources.
    • Running - means that the job is running (this should not be surprising). A good practice is to monitor the output to make sure one node has not hung the whole job.
    • Starting - briefly appears while the job is getting ready to start.
  2. Eligible Jobs: Those that are queued and eligible to be scheduled. They are all in the Idle job state and do not violate any fairness policies or have any job holds in place.
    • Idle - means the job is waiting for resources to become available (at least the right number of nodes for the requested time) or for a soft policy criterion to be met, such as the limit on how many jobs a user can have in the queue simultaneously.
  3. Blocked Jobs: Those that are ineligible to be run or queued. Jobs listed here could be in a number of states for the following reasons:
    • Idle - means the job is waiting for resources to become available (at least the right number of nodes for the requested time).
    • SystemHold - An administrative or system hold is in place.
    • BatchHold - A scheduler batch hold is in place because the job cannot be run, either because the requested resources are not available in the system or because the resource manager has repeatedly failed in attempts to start the job. To get a job out of BatchHold, you need to kill the job with "canceljob <jobID>", change the offending request (for example, a run time of 50 hours instead of the maximum of 48 hours), and finally resubmit the job.
    • Deferred - this means that a temporary hold is put on the job after a specified number of attempts to start. This is automatically removed after a short time.

To submit a MOAB jobfile use a command with this form:

To view the MOAB queue:

To remove a jobfile from the queue:

An overview of the processor-core status can be obtained from:

Check how many processor-cores are available and for how long:

Estimate how long your job may wait in the queue before starting:
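A typical submit-and-monitor sequence can be sketched with the four commands named earlier in this section (msub, checkjob, showq, canceljob). The helper submit_job and the file name my_job.csh are hypothetical; the sketch assumes only that msub prints the jobid, possibly surrounded by blank lines, as described above.

```shell
# submit_job JOBFILE: submit with msub and print the bare MOAB jobid.
submit_job() {
    # Strip the whitespace and blank lines that may surround the jobid.
    jobid=$(msub "$1" | tr -d '[:space:]')
    if [ -z "$jobid" ]; then
        echo "submit_job: msub returned no jobid for $1" >&2
        return 1
    fi
    echo "$jobid"
}

# Typical sequence (commented out; needs a real jobfile and the MOAB tools):
#   id=$(submit_job my_job.csh)
#   checkjob "$id"      # status of this one job
#   showq               # all Active, Eligible, and Blocked jobs
#   canceljob "$id"     # remove the job, e.g. to fix a BatchHold and resubmit
```

Capturing the jobid at submission time avoids having to search the showq listing for it later.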

Submitting NWChem batch jobs

When running NWChem calculations, users are encouraged to submit their jobs through the submit_nwchem script (available in the /home/scicons/bin/ directory). This script will set up the submission script, load the running environment, and make sure the appropriate files get copied from and to your working directory.

Sample Script for Batch Jobs

Here is a csh example of a MOAB jobfile for submitting a batch parallel job. Replace the placeholder items (such as account, number-of-nodes, jobname, and the names in angle brackets) with your own information.

	      #!/bin/csh
	      #MSUB -A account
	      #MSUB -l "nodes=number-of-nodes:ppn=8"
	      #MSUB -l "walltime=04:00:00"      # (HH:MM:SS)
	      #MSUB -N jobname
	      #MSUB -e sample.err.%j
	      #MSUB -o sample.out.%j
	      #MSUB -M your_email@pnl.gov
	      #MSUB -m ae     # Flag for sending the e-mail on "abort" or "end"

	      #############################################################################################
	      # Copy files to /scratch (if necessary). Always put files to be copied
	      # to the local disks in your /dtemp/ directory for improved efficiency.
	      #############################################################################################

	      bcastf   /dtemp/<userID>/<your file>   /scratch/<your file>

	      #############################################################################################
	      # In case your job fails:
	      # Capture useful information to be put into your output and error file.
	      #############################################################################################

	      echo
	      echo Environment
	      echo
	      printenv      # List all environment variables to assist in debugging
	      echo
	      echo Limits
	      echo
	      limit         # Show the limits on cputime, filesize, memoryuse, stacksize, etc.
	      echo
	      echo module list
	      echo
	      # List the active modules on compute nodes (goes to your error file unless you redirect to stdout).
	      module list
	      echo
	      echo Ldd output.
	      echo
	      ldd <your_program>    # List the libraries needed to run your code

	      # Run code (or multiple codes by repeating the crun command)

	      crun -nodes <number of nodes> -cores <number of processor-cores> <your_program>  <your_args>

	      #############################################################################################
	      # Copy back important files from Node rank 0 to working_directory
	      #############################################################################################

	      rank0scrcp   /scratch/<your file>   /dtemp/<userID>/<your file>


The #MSUB options in the script above will be discussed briefly. Note: the -l flag is a lower case "el" and not an upper case "eye".

To see the project accounts available to you, type the following command:

The gbalance command will display all the project accounts a user has available. If they have closed, the time will be zero. If they are still active, the amount of time (in hours) remaining in the account will be displayed. Ignore the lines that say "on MPP2", as that machine is no longer available. On lines that have "on Chinook", your project account name is the second word on the line (not the Id number (digits only) but the alphanumeric word after it, such as gc11111 or emsl22222 or st33333).

There are more options that can be specified. For those please read the man pages of the msub command.

The crun command in the job script specifies the parallel run. The options are:

The crun command is a wrapper program that is still being developed. It lives in /mscf/scicons/bin/, which should be in your default path. It is provided as a unified launch command because each of the four different flavors of MPI on Chinook uses a different launch command. The correct syntax of the launch command for each MPI "flavor" is shown below, along with pointers to documentation on the additional arguments it takes:

 srun --mpi=mvapich -N <number of nodes> -n <total number of processor-cores> <your_program> <your_args>
	Doing man srun will provide documentation on other srun options.
	Those would be placed before <your_program>.

 mpirun -srun -n <total number of processor-cores> <your_program> <your_args>
	With the hpmpi module loaded do a man mpirun to see a list of other mpirun options.
	Those would be placed before <your_program>.

 mpiexec -n <total number of processor-cores> <your_program> <your_args>
	With the voltaire module loaded do a man mpiexec to see a list of other mpiexec options.
	Those would be placed before <your_program>.

 mpirun --rsh=ssh -np <total number of processor-cores> <your_program> <your_args>
	With the intelmpi module loaded do a man mpirun to see a list of other mpirun options.
	Those would be placed before <your_program>. Currently the Intel MPI is not capable of
	launching jobs on more than 512 nodes.
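To make the differences between the four launch commands concrete, the sketch below maps an MPI flavor name to the matching command line and prints it. It is an illustration only and does not describe how crun is actually implemented. The function name launch_cmd is hypothetical, and pairing the srun form with the mvapich module is inferred from its --mpi=mvapich option.

```shell
# launch_cmd FLAVOR NODES CORES PROGRAM [ARGS...]
# Prints the native launch command for the given MPI flavor.
launch_cmd() {
    flavor=$1
    nodes=$2
    cores=$3
    shift 3
    case "$flavor" in
        mvapich)  echo "srun --mpi=mvapich -N $nodes -n $cores $*" ;;
        hpmpi)    echo "mpirun -srun -n $cores $*" ;;
        voltaire) echo "mpiexec -n $cores $*" ;;
        intelmpi) echo "mpirun --rsh=ssh -np $cores $*" ;;
        *)
            echo "launch_cmd: unknown MPI flavor '$flavor'" >&2
            return 1 ;;
    esac
}

# With ppn=8 as in the sample script, 4 nodes give 4 * 8 = 32 processor-cores:
launch_cmd mvapich 4 $((4 * 8)) ./my_program input.dat
```

Only the srun form takes the node count explicitly; the other three take just the total number of processor-cores.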
	  	   

Controlling Job Distribution Across Computational Units (CUs)

The interconnect topology on Chinook organizes the nodes into 12 computational units (CUs) labeled cu1, cu2, ... cu12. All of the nodes in one computational unit are connected to a 288-port InfiniBand leaf switch. The leaf switches are in turn connected to top-level switches. Thus jobs that run in a single CU can have lower communication latency (and therefore run faster) than jobs spread across several CUs, because no communication over the top-level switches is required. By default the MOAB scheduler will try to schedule your job on the nodes of a single CU, which comprises 192 compute nodes, if the job needs fewer than 192 compute nodes and that number of nodes is available on a single CU. If the job needs more than 192 compute nodes, or if no CU has the number of nodes your job needs available, MOAB will try to allocate your job as evenly across CUs as the availability of nodes permits.
For small jobs, you can alter this behavior to require your job to run on a single CU by adding the following line to your submission script:

	      #MSUB -l "nodeset=ONEOF,nodesetdelay=99:00:00:00"  	

The nodesetdelay in the line above tells MOAB to effectively wait forever (99 days) for nodes on a single CU to be available. You could instead specify that you want to run on a single CU only if it doesn't delay scheduling your job by more than 2 hours, by setting nodesetdelay to 2:00:00 in the line above.

Running Interactive Jobs

Chinook has a pseudo-interactive queue of 8 nodes available for software testing and debugging purposes. You can use a maximum of 4 of these nodes for a maximum of 30 minutes. You will need to make sure that you have used the -X option on your ssh command to login (enables X tunneling). To bring up an interactive window, use the isub command:

     % isub   -A   <your-account>   -W   <mm>   -N   <n>   -s   <your-shell> 

where <mm> is the number of minutes (<=30), <n> is the number of nodes you want (<=4), and <your-shell> is one of csh, bash, sh, tcsh, zsh or ksh.

Note that you can use isub to ask for more than 30 minutes and more than 4 nodes, but then you will wait in the normal queue for resources. To see the currently available number of nodes and the length of time they are available, use the following command:

     % showbf -c normal  
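The isub limits are easy to overshoot by accident. The sketch below assembles an isub command line from its four arguments and warns when the request exceeds the 30-minute or 4-node limit of the interactive nodes. The helper name isub_cmd is hypothetical, and gc11111 is the example account format used on this page, not a real account.

```shell
# isub_cmd ACCOUNT MINUTES NODES SHELL
# Prints the isub command; warns on stderr if the request leaves the
# pseudo-interactive limits (<= 30 minutes and <= 4 nodes).
isub_cmd() {
    if [ "$2" -gt 30 ] || [ "$3" -gt 4 ]; then
        echo "note: more than 30 minutes or 4 nodes will wait in the normal queue" >&2
    fi
    echo "isub -A $1 -W $2 -N $3 -s $4"
}

# A request that stays inside the interactive limits:
isub_cmd gc11111 30 2 bash
```

Remember that the session also needs X tunneling, so log in with ssh -X before running the real isub command.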

Time Allocation Accounts

Time allocation is tracked by assigning a project account for both batch and interactive jobs. The account management software is called GOLD and can best be seen as a bank account holding the node hours allotted to the project. The name of the account can be obtained from your project PI, or by typing "gbalance -h -u <userID>". This command will show you the account name and the number of hours available on this account to you and the other users on the account. If no accounts are shown please contact the MSC-Consulting team. Some users are involved in multiple projects and have multiple account names to choose from. Please make sure you use the appropriate project account for the job you are planning to submit. If you are not sure which account to use, please contact your PI.
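Since the account holds node hours, a rough budget check before submitting is a matter of multiplication. The sketch below assumes a job is charged its node count times its wall-clock time; that charging rule is an assumption for illustration, not something this page states, and the helper name node_hours is hypothetical.

```shell
# node_hours NODES HH:MM:SS
# Prints NODES x wall-clock time in node hours, to two decimals.
# ASSUMPTION: jobs are charged nodes x wall-clock time.
node_hours() {
    echo "$2" | awk -F: -v n="$1" \
        '{ printf "%.2f\n", n * ($1 + $2 / 60 + $3 / 3600) }'
}

# 64 nodes for the 4-hour walltime used in the sample script:
node_hours 64 04:00:00
# 4 nodes for a 30-minute interactive session:
node_hours 4 00:30:00
```

Comparing the result with the hours reported by gbalance -h -u <userID> shows whether the account can cover the job.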

Job Policies

The primary objective of EMSL's Molecular Science Computing capability is to provide teraflop computing resources for large scale computational needs associated with environmental problems as given in the Mission statement of our sponsor, the Office of Biological & Environmental Research. The job scheduling policy has been established to provide higher priorities to large jobs that cannot be run on local clusters or other HPC systems not designed for computational chemistry code. To maximize system flexibility, all batch jobs are submitted in identical fashion. The job scheduler controls the allocation of compute nodes to the user's job and will place the job in one of seven available queues, including tiny, small, medium, large, and extra-large (xl). This allocation is governed by a number of queue constraints. All queue constraint values must be satisfied or the job will go into "BatchHold" until deleted. For information on MSC User policies, please see User Policies.

These constraints are used as system default values. If you require resources beyond these limits (more nodes, longer run times), please have your Principal Investigator contact the EMSL Computer Projects Capability Steward, and the appropriate user account can be configured with exceptions to override the default values.

Job Policy Constraints:

The total number of jobs a person can have running depends on Chinook user activity.
While busy with many users:

When Chinook activity is light, a user can have as many as 15 Active jobs in the Running state. This maximum number is subject to change in the future.

There is also a set of default values that limit the time a single job with a particular number of nodes can have.

Number of Nodes in a Single Job | Time Limit | Notes
513 - 2048 | 12 wall-clock hours | These jobs (xl) will be placed ahead of the jobs in the queues below, i.e., they will receive highest priority.
257 - 512 | 24 wall-clock hours | These jobs (large) will be placed ahead of the jobs in the queues below, i.e., they will receive higher priority.
9 - 256 | 48 wall-clock hours | Normal priority jobs (medium). Note that many of these jobs will backfill with the large jobs in the larger queues.
2 - 8 | 48 wall-clock hours | Normal priority jobs (small). Note: this queue will have a longer time limit in the future.
1 - 4 | 30 minutes | Test / Interactive queue (tiny), with only 16 nodes. A maximum of 4 nodes per job may be used.
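A job whose requested walltime exceeds the limit for its node count goes into BatchHold, so it is worth checking the limit before submitting. The sketch below encodes the default limits listed above as a lookup. The function name queue_for is hypothetical. The node ranges of the small and tiny queues overlap, and this page does not say how the scheduler chooses between them, so the sketch simply reports a single-node job as tiny and jobs of 2 to 8 nodes as small.

```shell
# queue_for NODES
# Prints "<queue> <walltime limit as HH:MM:SS>" for a batch job of NODES nodes.
queue_for() {
    n=$1
    if   [ "$n" -lt 1 ];    then echo "queue_for: need at least 1 node" >&2; return 1
    elif [ "$n" -eq 1 ];    then echo "tiny 00:30:00"     # test / interactive queue
    elif [ "$n" -le 8 ];    then echo "small 48:00:00"
    elif [ "$n" -le 256 ];  then echo "medium 48:00:00"
    elif [ "$n" -le 512 ];  then echo "large 24:00:00"
    elif [ "$n" -le 2048 ]; then echo "xl 12:00:00"
    else
        echo "queue_for: more than 2048 nodes needs an exception or a SIGHTS request" >&2
        return 1
    fi
}

queue_for 300    # a 300-node job falls in the large queue
```

Jobs of 2 to 4 nodes that need at most 30 minutes can also use the test / interactive queue.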

Idle Queue:

{This queue has not been implemented on Chinook pending further investigation.}

SIGHTS Special Purpose Queue:

In addition to the queue limitations mentioned above, users can request access to a special purpose queue called Scientific Impact Generated by High Teraflop Simulations (SIGHTS). The SIGHTS queue is for compute jobs that require resources beyond the normal queue limits for Chinook and that serve uniquely impactful, cutting-edge OBER/EMSL mission science opportunities which cannot be performed at any other computing facility. SIGHTS jobs should require the use of 1600 nodes or more, up to the capacity of Chinook (2310 nodes). SIGHTS jobs are not automatically set in the Chinook queue. SIGHTS jobs can be submitted anytime after approval and will be tended by an MSC scientific consultant and operations personnel to assist in successful job completion. All requests for a week with a monthly outage will need to be submitted by 12 noon on the day before the scheduled outage.

Access to the SIGHTS queue is by request only and is subject to time availability. All requests are submitted to the MSCF consulting group for review. Please pick keyword "SIGHTS". In your request, please provide a short (one to two page) description of what you plan to do and how you plan on doing it. Upon receipt of the request, a consultant will be assigned to the job. The consultant will work with the users to be sure the job is ready. The consultants and operations staff will watch all SIGHTS jobs to be sure they are running correctly. Details about SIGHTS jobs are:

To be eligible as a SIGHTS job, the science completed must have a high expectation of being published in a high-impact journal upon successful completion. Time used during the SIGHTS job will not be deducted from your project account.

Short Pool:

The short pool of 16 reserved nodes allows users to run small and short jobs to test or debug their codes. Interactive or test jobs will be limited to a maximum of 4 nodes and a 30 minute time limit per job.

User Policies

Policies for Using Computer Resources

MSCF Lack-of-Use policy

We invite you to log in, exercise the system and report any problematic issues you have with the machine.

For application software, hardware and/or system software questions or problems, please contact the MSC consulting group by sending an email to EMSL MSC Consulting group.