10. High performance computing
High performance computing on OIST infrastructure
Science is increasingly moving towards problems that involve data sizes that are only tractable with HPC. HPC refers to computing that occurs on a networked cluster of machines or on so-called supercomputers. OIST currently operates two clusters, Sango and Tombo, of which the latter is being phased out.
Concepts
A cluster consists of several independent computers (called nodes) that are connected over a very fast network. Each node contains several CPUs (processors) that are grouped into NUMA domains. The CPUs within a NUMA domain share memory, but they can also access memory in other NUMA domains. Communication between CPUs within a node is fast, while communication between nodes is relatively slow; when a computation is distributed across several nodes, communication can therefore become a bottleneck.
There are two main paradigms of parallel computation: shared memory and message passing. In the shared-memory paradigm every process has access to the same memory and can read and write to it at any point in time, so access to critical regions has to be regulated by programming constructs like mutexes. The message-passing model is based on the idea that each process has its own memory, the necessary information is passed around via messages, and synchronization of the computation happens only during the communication steps. Shared-memory programs are usually easier to write, but they can only run on one node, since the basic assumption is that memory access is cheap. Reasoning about message-passing programs can be harder, but their computations can span thousands of nodes.
Embarrassingly parallel problems sidestep this issue entirely: no communication is necessary between processes. An example is running 10,000 stochastic simulations at once.
Lastly, modern supercomputers make use of accelerator cards. An accelerator card is either a GPU or specialized hardware like the Intel Xeon Phi. The general idea is to combine hundreds of relatively simple compute units that all perform the same operation in parallel over a dataset. These accelerators are particularly suited to vectorizable problems in numerical simulations.
Most clusters are controlled by a scheduler that manages job submissions, tries to distribute the workload optimally over the cluster, and terminates jobs that run longer than their allocated time.
SLURM
The scheduler used at OIST is called SLURM. To submit jobs you can either use the command-line tools directly or write scripts that do it for you.
SLURM distinguishes between jobs, tasks, nodes, and CPUs. A job consists of several tasks that can be spread over several nodes, and each task has a number of CPUs associated with it. When submitting a job you can request specific resources with --ntasks=, --ntasks-per-node=, --cpus-per-task=, and --mem-per-cpu= or --mem=. It is also good form to limit your program's run time with --time=; otherwise your job either runs away and wastes resources or is cut off at the default limit of 8 h.
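As a sketch of how these options fit together, the following request asks for two tasks with four CPUs each, 2 GB of memory per CPU, and a two-hour time limit (my_program is just a placeholder for your own executable):

srun --ntasks=2 --cpus-per-task=4 --mem-per-cpu=2G --time=02:00:00 ./my_program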
List of useful commands
- Job submission
  - salloc: Allocate resources and create a shell
  - sbatch: Submit a job script for later execution
  - srun: Submit a job for execution (standalone or as part of salloc)
- Job info
  - sacct: Accounting information for jobs
  - sstat: Statistics about jobs
  - squeue: Reports the status of jobs
  - sinfo: State of partitions and nodes
  - scancel: Cancel jobs
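A few of these in action; the job ID 12345 is just a placeholder for one of your own job IDs:

squeue -u $USER   # list your jobs currently in the queue
sacct -j 12345    # accounting information for a (possibly finished) job
scancel 12345     # cancel that job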
Launching scripts
You can launch script files using sbatch. sbatch (as well as srun and salloc) accepts switches on the command line, but you can also specify them in the header of the script file on lines prefixed with #SBATCH, which minimizes typing:
#!/bin/bash
# %j in the file names below is replaced by the job ID
#SBATCH --job-name=helloJob
#SBATCH --output=helloJob_%j.out
#SBATCH --error=helloJob_%j.err
#SBATCH --time=01:00:00
echo "Hello, World!"
Array Jobs
Array jobs are particularly handy because in biology we often have many different input files to which we want to apply the same series of commands.
To start an array job you have to write a batch script. The option --array=1-16%4 specifies that you want to run 16 copies of the same script, and the variable $SLURM_ARRAY_TASK_ID contains the index of the current array task. (Side note: the %4 suffix means that only four tasks run at once.)
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-16
#SBATCH --time=01:00:00
#SBATCH --partition=compute
#SBATCH --ntasks=1
######################
# Begin work section #
######################
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
# Do some work based on the SLURM_ARRAY_TASK_ID
# For example:
# ./my_process $SLURM_ARRAY_TASK_ID
#
# where my_process is your executable
See also: http://slurm.schedmd.com/job_array.html
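A common pattern is to map the array task ID onto an input file. The sketch below assumes input files named sample_1.txt through sample_16.txt and simply counts the lines in each; swap in your own analysis command.

#!/bin/bash
#SBATCH --job-name=countLines
#SBATCH --output=countLines_%A_%a.out
#SBATCH --array=1-16
#SBATCH --time=00:10:00
#SBATCH --partition=compute
#SBATCH --ntasks=1

# Pick the input file that corresponds to this array task
INPUT=sample_${SLURM_ARRAY_TASK_ID}.txt

# Replace wc -l with the command you actually want to run per file
wc -l "$INPUT"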
Sub jobs
You can also write a batch script that runs several srun commands from within one allocation and thus starts several job steps at once.
#!/bin/bash
#SBATCH --ntasks=4 # run four tasks in parallel
#SBATCH --partition=compute
#SBATCH --time=01:00:00

# Launch 16 job steps inside the four-task allocation
for i in `seq 1 16`; do
    srun --ntasks=1 echo $i &
done

# Wait for all job steps to finish before the script (and with it the allocation) ends
wait
You could use this to allocate four tasks and then run a sub job for each file in a directory. Note the --ntasks=1 for the srun command; it makes each step claim only one of the four allocated tasks. Next class we are going to learn a better way to solve this.
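As a sketch, assuming your input files live in a directory called data/, the loop could iterate over the files instead of over numbers:

for f in data/*.txt; do
    srun --ntasks=1 wc -l "$f" &
done
wait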
Interactive Sessions
In order to run graphical programs on Sango you have to connect to Sango with ssh -Y. An example session would look like this:
ssh -Y sango
module load matlab
salloc -c 4 --mem=8g # allocate 4 cores and 8 GB of memory
srun --x11=last --pty matlab
Modules
Sango uses a module system to make it possible to have multiple versions of the same software installed.
module avail        # list all available modules
module load julia   # load the julia module into the current shell
module show julia   # display what loading the julia module changes
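Two more module commands that tend to be useful (both are standard in environment-modules setups; which modules exist depends on the cluster):

module list           # show the modules currently loaded in this shell
module unload julia   # remove a previously loaded module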
Resources
- SCS @ OIST https://groups.oist.jp/scs SCS is the authoritative information source about HPC at OIST. They manage and maintain our clusters and provide support for research.
- SLURM http://slurm.schedmd.com/
  - Documentation
  - Tutorial
Check job resource use and progress
If you want to see what your job is doing in real time, you can ssh to the node it is running on, e.g., ssh sangoXXXX. You can then see all the processes running on that node using the command top. This gives you a sense of whether your job is still using CPU cycles and memory, which is really handy for troubleshooting problematic software.
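To find out which node a job is running on before you ssh to it, you can ask SLURM (12345 is a placeholder job ID):

squeue -j 12345 -o "%i %N"                    # the %N field lists the allocated node(s)
scontrol show job 12345 | grep -i nodelist    # alternative with the full job details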
If the login node is slow, you may want to run top to see whether anyone is running matlab or something else CPU-intensive on the login nodes. You can report these people to IT and have them sent to re-education.
Mounting networked drives
You can directly interact with the Sango filesystem from your own computer. This can be handy for running your favorite text editor and modifying source code that lives on Sango. You can connect to Sango at smb://sango-cifs.oist.jp. In Mac OS this is available through the Finder (Go > Connect to Server).
Tombo vs Sango
The previous cluster, Tombo, is still available at OIST and should be used for educational purposes like our class. Please run all commands that actually involve SLURM on Tombo and not on Sango, since Sango is used for active research.
Homework
The homework is due on November 27, 2015 at 13:00 (i.e. 1 pm).
- Take a fairly large file of your choice (a few GB) and copy it to the /work folder on Sango in two ways. First, use scp or rsync. Next, mount the /work folder and copy the file with cp in the terminal. Which way is faster?
- Create a directory with an arbitrary number of files (at least 10), each containing a number. Now write a SLURM script that performs an operation on each one of these files in parallel, printing the number inside. Make sure your script generalizes to any number of files.
- Many programs use multiple threads. How do you allocate them using SLURM?
- How do you know how much memory your process actually consumed? Why should you care?
- How can you pass a shell variable to a SLURM command executed by sbatch?
- How do you cancel just one task in an array job?
- Run a job that (a) runs out of memory, and another that (b) runs out of time. Interpret the log files resulting from these experiments.