10. High performance computing
High performance computing on OIST infrastructure
Science is increasingly moving towards problems that involve data sizes that are only tractable with HPC. HPC refers to computing that occurs on a networked cluster of machines or on so-called supercomputers. OIST currently operates two clusters, Sango and Tombo, of which the latter is being phased out.
Concepts
A cluster consists of several independent computers (called nodes) that are connected over a very fast network. Each node contains several CPUs (processors) that are grouped into NUMA domains. The CPUs within a NUMA domain share memory, but they can also access memory in other NUMA domains. Communication between CPUs within a node is fast, while communication between nodes is relatively slow; when a computation is distributed across several nodes, communication can therefore become a bottleneck.
There are two main paradigms of parallel computation: shared memory and message passing. In the shared-memory paradigm every process has access to the same memory and can read and write to it at any point in time, so access to critical regions has to be regulated by programming constructs like mutexes. The message-passing model is based on the idea that each process has its own memory, the necessary information is passed around via messages, and synchronization of the computation happens only during the communication steps. Shared-memory programs are usually easier to write, but they can only run on one node, since the basic assumption is that memory access is cheap. Reasoning about message-passing programs can be harder, but their computations can span thousands of nodes.
Embarrassingly parallel problems sidestep this issue entirely: no communication is necessary between processes. An example is running 10,000 stochastic simulations at once.
Lastly, modern supercomputers make use of accelerator cards. An accelerator card is either a GPU or specialized hardware like the Intel Xeon Phi. The general idea is to combine hundreds of relatively simple compute units that all perform the same operation in parallel over a dataset. These accelerators are particularly suited to vectorizable problems in numerical simulations.
Most clusters are controlled by a scheduler that manages job submissions, tries to distribute the workload optimally over the cluster, and terminates jobs that run longer than their allocated time.
SLURM
The scheduler used at OIST is called SLURM. To submit jobs you can either use the command-line tools directly or write scripts that do it for you.
SLURM distinguishes between jobs, tasks, nodes, and CPUs. A job consists of several tasks that can be spread over several nodes, and each task has a number of CPUs associated with it. When submitting a job you can request specific resources with --ntasks=, --ntasks-per-node=, --cpus-per-task=, and --mem-per-cpu= or --mem=. It is also good form to limit your program's run time with --time=; otherwise your job either runs away and wastes resources or is cut off at the default limit of 8 h.
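As a sketch of how these options fit together, the following request asks for two tasks with four CPUs each, 2 GB of memory per CPU, and a two-hour time limit (my_program is just a placeholder for your own executable):

srun --ntasks=2 --cpus-per-task=4 --mem-per-cpu=2G --time=02:00:00 ./my_program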
List of useful commands
- Job submission
  - salloc: Allocate resources and create a shell
  - sbatch: Submit a job script for later execution
  - srun: Submit a job for execution (standalone or as part of salloc)
- Job info
  - sacct: Accounting information for jobs
  - sstat: Statistics about jobs
  - squeue: Reports the status of jobs
  - sinfo: State of partitions and nodes
  - scancel: Cancel jobs
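A few of these in action; the job ID 12345 is just a placeholder for one of your own job IDs:

squeue -u $USER   # list your jobs currently in the queue
sacct -j 12345    # accounting information for a (possibly finished) job
scancel 12345     # cancel that job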
Launching scripts
You can launch script files using sbatch. sbatch (as well as srun and salloc) accepts switches on the command line, but you can also specify them in the header of the script file on lines prefixed with #SBATCH, which minimizes typing:
#!/bin/bash
# %j in the file names below is replaced by the job ID
#SBATCH --job-name=helloJob
#SBATCH --output=helloJob_%j.out
#SBATCH --error=helloJob_%j.err
#SBATCH --time=01:00:00
echo "Hello, World!"
Array Jobs
Array jobs are particularly handy because in biology we often have many different input files to which we want to apply the same series of commands.
To start an array job you have to write a batch script. The option --array=1-16%4 specifies that you want to run 16 copies of the same script, and the variable $SLURM_ARRAY_TASK_ID contains the index of the current array task. (Side note: the %4 suffix means that only four tasks run at once.)
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-16
#SBATCH --time=01:00:00
#SBATCH --partition=compute
#SBATCH --ntasks=1
######################
# Begin work section #
######################
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
# Do some work based on the SLURM_ARRAY_TASK_ID
# For example:
# ./my_process $SLURM_ARRAY_TASK_ID
#
# where my_process is your executable
See also: http://slurm.schedmd.com/job_array.html
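A common pattern is to map the array task ID onto an input file. The sketch below assumes input files named sample_1.txt through sample_16.txt and simply counts the lines in each; swap in your own analysis command.

#!/bin/bash
#SBATCH --job-name=countLines
#SBATCH --output=countLines_%A_%a.out
#SBATCH --array=1-16
#SBATCH --time=00:10:00
#SBATCH --partition=compute
#SBATCH --ntasks=1

# Pick the input file that corresponds to this array task
INPUT=sample_${SLURM_ARRAY_TASK_ID}.txt

# Replace wc -l with the command you actually want to run per file
wc -l "$INPUT"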
Sub jobs
You can also write a batch script that runs several srun commands from within one allocation and thus starts several job steps at once.
#!/bin/bash
#SBATCH --ntasks=4 # run four tasks in parallel
#SBATCH --partition=compute
#SBATCH --time=01:00:00

# Launch 16 job steps inside the four-task allocation
for i in `seq 1 16`; do
    srun --ntasks=1 echo $i &
done

# Wait for all job steps to finish before the script (and with it the allocation) ends
wait
You could use this to allocate four tasks and then run a sub job for each file in a directory. Note the --ntasks=1 for the srun command; it makes each step claim only one of the four allocated tasks. Next class we are going to learn a better way to solve this.
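As a sketch, assuming your input files live in a directory called data/, the loop could iterate over the files instead of over numbers:

for f in data/*.txt; do
    srun --ntasks=1 wc -l "$f" &
done
wait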
Interactive Sessions
In order to run graphical programs on Sango you have to connect to Sango with ssh -Y. An example session would look like this:
ssh -Y sango
module load matlab
salloc -c 4 --mem=8g # allocate 4 cores and 8 GB of memory
srun --x11=last --pty matlab
Modules
Sango uses a module system to make it possible to have multiple versions of the same software installed.
module avail        # list all available modules
module load julia   # load the julia module into the current shell
module show julia   # display what loading the julia module changes
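Two more module commands that tend to be useful (both are standard in environment-modules setups; which modules exist depends on the cluster):

module list           # show the modules currently loaded in this shell
module unload julia   # remove a previously loaded module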
Resources
- SCS @ OIST https://groups.oist.jp/scs SCS is the authoritative information source about HPC at OIST. They manage and maintain our clusters and provide support for research.
- SLURM http://slurm.schedmd.com/
  - Documentation
  - Tutorial
Check job resource use and progress
If you want to see what your job is doing in real time, you can ssh to the node it is running on, e.g., ssh sangoXXXX. You can then see all the processes running on that node using the command top. This gives you a sense of whether your job is still using CPU cycles and memory, which is really handy for troubleshooting problematic software.
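To find out which node a job is running on before you ssh to it, you can ask SLURM (12345 is a placeholder job ID):

squeue -j 12345 -o "%i %N"                    # the %N field lists the allocated node(s)
scontrol show job 12345 | grep -i nodelist    # alternative with the full job details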
If the login node is slow, you may want to run top to see whether anyone is running matlab or something else CPU-intensive on the login nodes. You can report these people to IT and have them sent to re-education.
Mounting networked drives
You can directly interact with the Sango filesystem from your own computer. This can be handy for running your favorite text editor and modifying source code that lives on Sango. You can connect to Sango at smb://sango-cifs.oist.jp. In Mac OS this is available through the Finder (Go > Connect to Server).
Tombo vs Sango
The previous cluster, Tombo, is still available at OIST and should be used for educational purposes like our class. Please run all commands that actually involve SLURM on Tombo and not on Sango, since Sango is used for active research.
Homework
The homework is due on November 27, 2015 at 13:00 (i.e. 1 pm).
- Take a fairly large file of your choice (a few GB) and copy it to the /work folder on Sango in two ways. First, use scp or rsync. Next, mount the /work folder and copy the file with cp in the terminal. Which way is faster?
- Create a directory with an arbitrary number of files (at least 10), each containing a number. Now write a SLURM script that performs an operation on each one of these files in parallel, printing the number inside. Make sure your script generalizes to any number of files.
- Many programs use multiple threads. How do you allocate them using SLURM?
- How do you know how much memory your process actually consumed? Why should you care?
- How can you pass a shell variable to a SLURM command executed by sbatch?
- How do you cancel just one task in an array job?
- Run a job that (a) runs out of memory, and another that (b) runs out of time. Interpret the log files resulting from these experiments.