Using Hypnotoad

GETTING STUFF DONE..

Though it's OK to do development and related functions [upload/download..] on hypnotoad, the real work is farmed out to the hypnonodes.

You'll first need some actual work to submit to the node(s). That can be an application you wish to run on a node or nodes or it could be data or instructions that another application [like Mathematica] will process.

Once you've identified a job to run, you then send that job to the SLURM scheduler.  This can be done through SLURM's "srun" or "sbatch" commands.  We'll provide some explanation and examples of how to use those here.

We'll start by presenting some common SLURM tools, and then we'll provide you with some information on creating a Job Submission script to configure more complex jobs.  We're only covering the basics here, so please check the Man pages for these commands and take a look at the links at the bottom of this page for more information.

Working with SLURM

SLURM includes some command-line utilities to allow you to submit, monitor, and control jobs. For your convenience, they very cleverly start with "s".  A few of the more useful ones are described below.

These commands are invoked by you in a shell on hypnotoad.psd.uchicago.edu. You should run man <command name> for more information about them.

  • sbatch [options] script [arguments]
    • submit a job to the job queues. This is the more automated approach
    • e.g.  ->  sbatch --partition=gpu My_CUDA_job  will submit the My_CUDA_job script to run on nodes in the "gpu" partition
  • srun [options] executable [arguments]
    • the interactive job command. This is the "do this now" command
  • squeue
    • see what jobs are sitting in the job queue
  • sinfo
    • View info about the nodes & partitions (groups of nodes) in the cluster.
    • sinfo -Nel will show you details about each node, which is useful for configuring your jobs since it tells you how many physical cores and how much RAM (in MB) each node has.  "man sinfo" will show you all the options.  A few quick examples of these commands follow this list.
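
For example, a quick session on hypnotoad might look something like the sketch below (My_CUDA_job and the "gpu" partition are just the placeholder names used above; substitute your own script and partition):

## submit a batch script to the "gpu" partition
sbatch --partition=gpu My_CUDA_job

## run a single command on a node right now, interactively
srun hostname

## list everything currently running or waiting in the queue
squeue

## show the state, core count, and memory of every node
sinfo -Nel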

JOB SCRIPT..

A simple script is required to submit your computation job to the hypnonodes. It defines the requirements of a compute job and describes the work to be done.

 Below is an example of a Job script. Please note the use of the following types of entries in this script.

  • COMMENTS
    • Comment lines begin with “##”
    • should precede directives (I believe the Slurm docs were picky about this, but it's not required)
  • DIRECTIVES
    • Directives are passed to SLURM through  “#SBATCH” 
    • run man sbatch in a shell to see all the sbatch directives
  • MODULES
    • Other software required by your job
    • Module load instructions begin with “module” followed by “load” or “unload” and the module name
  • WORK
    • The actual name of the application that you intend to run on the cluster, along with any parameters you need to pass to that application.

Example Job Script


#!/bin/sh 
## as with any shell script, the "#!" line must be the very first line; it names the shell that will execute the script

##Use the #SBATCH directive to tell SLURM how to handle our job.
##First we'll set the number of cores we'll require to eight. 
##All non-GPU systems currently have 8 physical CPU cores.
#SBATCH --ntasks=8

## --job-name defines a string that will be used to identify your job in the job queue.
## You can run "squeue" in your shell to see all jobs running or queued on the cluster, or "squeue --name=test" to see info about this specific job. 
#SBATCH --job-name=test 

## use --output= to set where your output should be written. 
## here, your job output is written to a file named "test.out"
## It's a good idea to specify a full path here. 
## if no path is included, the output is written to the directory the job was submitted from.
#SBATCH --output=~/Output/test.out 

##the "--time=" directive sets the wall time for the job. 
## below, we tell SLURM the job cannot run for more than one hour. 
## if it's still running after one hour it will be terminated. 
## please set a reasonable wall time to stop runaway jobs... 
## ..especially while you're debugging code.
#SBATCH --time=1:00:00

## "module" specifies additional files (usually libraries) to load or unload.
## there is no leading "#" required before "module"
## below we're telling slurm to “unload” the standard openmpi libraries 
## ..and then “load” the faster Intel MPI libraries in their place. 
module unload openmpi
module load intelmpi  

##last bit. You need to give SLURM some work to do. 
##In this example we're telling SLURM to run an application called “Super_Duper_Code” 
~/my_research_code/Super_Duper_Code  

Once we've typed that script up, we'll save it. For the purposes of this document, let's assume the script has been saved as "MY_TEST.scp".  You can run chmod +x MY_TEST.scp to make it executable, though sbatch doesn't actually require that.

Submit the job by invoking "sbatch MY_TEST.scp" at your shell prompt (sbatch is described above under "Working with SLURM").  Note that running "./MY_TEST.scp" directly just executes the script on hypnotoad itself; the #SBATCH directives are only honored when the script is submitted through sbatch.
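
Putting that together, a typical submit-and-monitor session might look something like the sketch below (the job ID in the comment is just an illustration; yours will differ):

## submit the script to the scheduler
sbatch MY_TEST.scp
## sbatch replies with a line like "Submitted batch job 12345"

## check on the job by the --job-name we set, or run plain "squeue" to see the whole queue
squeue --name=test

## once the job finishes, read the file we pointed --output at
cat ~/Output/test.out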

USE OTHER APPS TO PROCESS MY WORK..

Some applications require special Modules to run as a job on the cluster.  Mathematica is one example.  Here's some code that you'd use in a Job Script to have Mathematica evaluate your Mathematica code on the cluster.  

NOTE:  Mathematica is not yet installed on the new Hypnotoad cluster (08/29/2018).  It's on the ToDo list so I'm leaving this section in the Documentation.


## The following module command loads the Mathematica module into your job's environment 
module load Mathematica
## You still have to tell SLURM what you want to run 
## At the end of the script we invoke Mathematica with test.m as the work to evaluate
## this is the same syntax you'd use to run Mathematica from a shell prompt 
math -noprompt -script test.m 

## Note... the Mathematica kernel can't take your work in notebook (*.nb) format.
## You must convert your notebooks to *.m format to run them as cluster jobs.
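
For context, here's a rough sketch of how those lines would sit inside a complete job script (the resource numbers, file names, and the "Mathematica" module name are assumptions for illustration; check "module avail" for the real module name once Mathematica is installed):

#!/bin/sh
## one core is plenty for a serial Mathematica script
#SBATCH --ntasks=1
#SBATCH --job-name=math_test
#SBATCH --output=~/Output/math_test.out
#SBATCH --time=1:00:00

## load the Mathematica module so the "math" command is available to the job
module load Mathematica

## evaluate test.m with the Mathematica kernel
math -noprompt -script test.m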

We'll continue to work on our documentation but in the meantime I've provided links to some more thorough resources below.

Useful External resources for SLURM

Common SLURM directives.. 

http://www.lrz.de/services/compute/linux-cluster/batch_parallel/

A bunch of example scripts..

https://wiki.uio.no/usit/suf/vd/hpc/index.php/SLURM_example_scripts

Info about running Mathematica jobs..

http://www.tchpc.tcd.ie/node/1086

For more info about Hypnotoad and for instructions on requesting access to it, please review our other Hypnotoad webpage.