Whatever you read here may need to be adjusted to fit your specific case.
Do not hesitate to ask for help when needed.
Not all filesystems are available on the compute nodes. Compute nodes cannot access any data on filesystems that are not listed here.
There are currently two partitions, normal and bigmem.
The normal partition is the default: if you submit a job without specifying which partition should be used, your job will be placed in the normal partition.
The normal partition is limited to 250 GB of RAM; in case you need more than that, please use the bigmem partition.
Use the -p option to specify the partition you need.
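As a minimal sketch (the partition names are the ones above; the script name is illustrative), the partition can be selected on the command line or inside the script:

# On the command line
sbatch -p bigmem my_script.sbatch

# Or as a directive inside the script
#SBATCH -p bigmem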
By default, only some software is available at login. To use other software, you must first load the corresponding module.
The module command will help you manage modules and their dependencies.
To check which programs are loaded (ready to be used), use the command below.
module list
Expected output is,
Currently Loaded Modules:
  1) autotools   2) prun/1.3   3) gnu8/8.3.0   4) openmpi3/3.1.4   5) ohpc
To check which programs are available (i.e., can be loaded), use the command below. The same command can also be used to search for a specific package.
module avail
Expected output is,
--------------------- /opt/ohpc/pub/moduledeps/gnu8-openmpi3 ---------------------
   adios/1.13.1        hypre/2.18.1           netcdf-cxx/4.3.1    petsc/3.12.0
   py2-scipy/1.2.1     scorep/6.0             trilinos/12.14.1    boost/1.71.0
   imb/2018.1          netcdf-fortran/4.5.2   phdf5/1.10.5        py3-mpi4py/3.0.1
   sionlib/1.7.4       dimemas/5.4.1          mfem/4.0            netcdf/4.7.1
   pnetcdf/1.12.0      py3-scipy/1.2.1        slepc/3.12.0        extrae/3.7.0
   mpiP/3.4.1          omb/5.6.2              ptscotch/6.0.6      scalapack/2.0.2
   superlu_dist/6.1.1  fftw/3.3.8             mumps/5.2.1         opencoarrays/2.8.0
   py2-mpi4py/3.0.2    scalasca/2.5           tau/2.28

-------------------------- /opt/ohpc/pub/moduledeps/gnu8 -------------------------
   hdf5/1.10.5        metis/5.1.0    mvapich2/2.3.2   openblas/0.3.7       pdtoolkit/3.25
   py3-numpy/1.15.3   likwid/4.3.4   mpich/3.3.1      ocr/1.0.1            openmpi3/3.1.4 (L)
   py2-numpy/1.15.3   superlu/5.2.1

------------------------------- /tools/modulefiles -------------------------------
   MEGAHIT/1.2.9

--------------------------- /opt/ohpc/pub/modulefiles ----------------------------
   EasyBuild/3.9.4    clustershell/1.8.2   gnu7/7.3.0       llvm5/5.0.1   pmix/2.2.2
   valgrind/3.15.0    autotools (L)        cmake/3.15.4     gnu8/8.3.0 (L)   ohpc (L)
   prun/1.3 (L)       charliecloud/0.11    gnu/5.4.0        hwloc/2.1.0   papi/5.7.0
   singularity/3.4.1

  Where:
   L:  Module is loaded

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".
To search for a module,
module avail <<keyword>> #OR module spider <<keyword>>
To load a module,
module load <<MODULENAME/VERSION>>
Loading a module can be done following these four steps,
Locate the module, module avail
Check how to load it, module spider <<MODULENAME/VERSION>>
Load your module using the instructions, module load <<MODULENAME/VERSION>>
Check that the module is loaded, module list
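For example, a minimal walk-through with the MEGAHIT module listed above (the exact output on your system may differ):

module avail MEGAHIT          # locate the module
module spider MEGAHIT/1.2.9   # check how to load it
module load MEGAHIT/1.2.9     # load it
module list                   # confirm it appears among the loaded modules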
Enjoy!!!
Read more about module usage at https://lmod.readthedocs.io/en/latest/010_user.html
This template should be saved in a script file, for example my_first_script.sbatch
#!/bin/bash

#SBATCH -J test                       # Job name
#SBATCH -o /work/<<UID>>/job.%j.out   # Name of stdout output file (%j expands to jobId)
#SBATCH -e /work/<<UID>>/job.%j.err   # Name of stderr output file (%j expands to jobId)
#SBATCH -n 16                         # Total number of threads requested or total number of MPI tasks
#SBATCH --mem=2000                    # Maximum amount of memory in MB the job can use
#SBATCH -t 00:30:00                   # Run time ([d-]hh:mm:ss) - 30 minutes
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.email@wur.nl

# Load your software/command
module load CMD/version

# Run your command
CMD [OPTIONS] ARGUMENTS
To submit an sbatch script, use
sbatch <<script name>>
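For example, assuming the template above was saved as my_first_script.sbatch (the job ID below is illustrative):

sbatch my_first_script.sbatch
# Submitted batch job 12345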
Here are some explanations for the less obvious parts of this template,
The format for the time is days-hours:minutes:seconds. This sets a time limit for your task. With 00:05:00, your job can run for at most 5 minutes. What if it is not finished by then? You will have to rerun it with a higher time limit. If the command you are running can continue from a checkpoint, use that ability to reduce the rerun time. This parameter is difficult to estimate in most cases, so do not hesitate to overestimate at the beginning. A job can run for a maximum of 10 days, 9-23:59:59.
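A few examples of valid time specifications in the [d-]hh:mm:ss syntax described above:

#SBATCH -t 00:05:00    # 5 minutes
#SBATCH -t 12:00:00    # 12 hours
#SBATCH -t 2-00:00:00  # 2 days
#SBATCH -t 9-23:59:59  # the 10-day maximum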
Let's assume a few things here: the job uses /work/test as its work directory, and the input reads are the test files under /tools/test_data/assembly.
Preparing for the run,
mkdir /work/test/
Let's try to run an assembly using megahit,
#!/bin/bash

#SBATCH -J test                   # Job name
#SBATCH -o /work/test/job.%j.out  # Name of stdout output file (%j expands to jobId)
#SBATCH -e /work/test/job.%j.err  # Name of stderr output file (%j expands to jobId)
#SBATCH -n 16                     # Total number of threads or total number of MPI tasks
#SBATCH --mem=2000                # Maximum amount of memory in MB the job can use, 2000 is 2 GB
#SBATCH -t 01:30:00               # Run time ([d-]hh:mm:ss) - 1.5 hours

# Load the available megahit
module load MEGAHIT/1.2.9

# Defining a variable for the work directory
based=/work/test
# Defining a variable for the temporary directory
# We do this because our command uses a tmp folder
tmp_dir=$based/tmp
# Defining a variable for the output directory
output_dir=$based/output
# Defining variables for the forward and reverse read files
f_read=/tools/test_data/assembly/r3_1.fa.gz
r_read=/tools/test_data/assembly/r3_2.fa.gz

# Creating the temporary folder
# (we do not create the output folder: megahit creates it itself
# and will complain if it already exists)
mkdir $tmp_dir

# Command to run
# We use the previously defined variables to set the values of the megahit options
megahit -1 $f_read -2 $r_read --tmp-dir $tmp_dir --out-dir $output_dir --out-prefix r3
YOUR ATTENTION PLEASE! Once your job is submitted, Slurm takes care of it until it is done. There is no need to run your job in screen or any special terminal.
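You can log out and check on the job later. A minimal sketch using standard Slurm commands (the job ID is illustrative):

squeue -u $USER   # list your jobs that are still pending or running
sacct -j 12345    # accounting information, also available after the job has finished
scancel 12345     # cancel the job if needed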
As the title suggests, every compute node comes with local storage (only visible on that node). Using that space can make your analysis faster, but it must be used with care so as not to lose any data. It is important to understand that this space is created when a job or a task starts to run, and that the space is specific to that job or task. So for jobs that run over multiple nodes this might not be a solution. Finally, the space is DELETED as soon as your job is finished (no one can access it anymore).
To reach this space use the variable $TMPDIR in your sbatch script.
Here are two ways to use the local storage. In the first, the whole job runs on the local storage and everything is copied back to the work folder at the end:
#!/bin/bash

#SBATCH --mail-user=your.email@wur.nl
#SBATCH --mem 3000
#SBATCH -n 16
#SBATCH -t 00:15:00
#SBATCH -D $TMPDIR        # slurm automatically moves to this folder before the run starts
#SBATCH -J MYJOB          # The name of my job, could be any name
#SBATCH -o $TMPDIR/MYJOB.out
#SBATCH -e $TMPDIR/MYJOB.err

# Load spades
module load spades

# Make a directory for temporary files
mkdir $TMPDIR/spades_tmp

# Make a directory for the output of spades
mkdir $TMPDIR/spades_out

# Run spades, please adjust the resource allocation to your needs
spades.py --meta -t 16 -m 3 \
    --tmp-dir $TMPDIR/spades_tmp \
    -1 /work/username/forward.fastq.gz \
    -2 /work/username/reverse.fastq.gz \
    -o $TMPDIR/spades_out

# Copy the results back to the user's work folder. Can be done with "cp -r" as well
rsync -rt $TMPDIR/spades_out /work/username/

# Copy the slurm output files back to the user's work folder
cp $TMPDIR/MYJOB.out $TMPDIR/MYJOB.err /work/username/
In the second, only the temporary files live on the local storage, while the output is written directly to the work folder:

#!/bin/bash

#SBATCH --mail-user=your.email@wur.nl
#SBATCH --mem 3000
#SBATCH -n 16
#SBATCH -t 00:15:00
#SBATCH -D /work/username/output
#SBATCH -J MYTEST
#SBATCH -o /work/username/output/MYTEST.out
#SBATCH -e /work/username/output/MYTEST.err

# Load spades
module load spades

# Make a directory for temporary files
mkdir $TMPDIR/spades_tmp

# Run spades, please adjust the resource allocation to your needs
spades.py --meta -t 16 -m 3 \
    --tmp-dir $TMPDIR/spades_tmp \
    -1 /work/username/forward.fastq.gz \
    -2 /work/username/reverse.fastq.gz \
    -o /work/username/output/spades_out
It sometimes happens that we need to run the same command with different input files and maybe different parameters. This is often the case with metagenomic samples, for example. Two options are then possible: write one script for all your samples and submit it, or write a script that generates an individual script per sample and submits them. The latter is the one we develop in this section. It is important to be comfortable with bash scripts, to be able to make the few adjustments that are needed to run your own commands.
1 #!/usr/bin/env bash
2
3 set -e
4
5 fastqs=$1; shift
6 suffix=$1; shift
7 output=$1; shift
8
9 if [ -z "$fastqs" ]; then fastqs='./'; fi
10 if [ -z "$suffix" ]; then suffix='.fastq.gz'; fi
11 if [ -z "$output" ]; then output='./'; fi
12
13
14 for fl in $(find $fastqs -name "*_[12]$suffix" | sort | paste -d',' - -)
15 do
16 fq1=$(echo $fl | cut -d',' -f1)
17 fq1=$(realpath $fq1)
18 fq2=$(echo $fl | cut -d',' -f2)
19 fq2=$(realpath $fq2)
20 bs=$(basename $fq1 1$suffix)
21 bs=$(echo $bs | sed -r "s/[,._+=@[:space:]-]+$//")
22
23 o=$(realpath $output/$bs)
24
25 if [ ! -d $o ] ; then mkdir $o; fi
26
27 cat >./${bs}.sbatch <<EOF
28 #!/bin/bash
29
30 #SBATCH --mail-user=your.email@wur.nl
31 #SBATCH --mem 3000
32 #SBATCH -n 10
33 #SBATCH -t 00:15:00
34 #SBATCH -D $o
35 #SBATCH -J $bs
36 #SBATCH -o ${bs}.out
37 #SBATCH -e ${bs}.err
38
39 module load fastqc;
40
41 fastqc -o ${o} -t 10 -f fastq $fq1 $fq2
42 EOF
43
44 #sbatch ./${bs}.sbatch
45 done
This is a simple script that looks for specific pairs of fastq files filtered by their extension, sorts the file names, and writes an sbatch script to run the program fastqc on each pair. Let's look in a bit more detail at what the script does.
Line 1
The shebang: the script is executed with bash, located through env.
Line 3
set -e makes the script stop immediately if any command fails.
Line 5-7
Read the three positional arguments: the folder containing the fastq files, the file name suffix, and the output folder.
Line 9-11
Give a default value to any argument that was not provided: the current folder for the input and output folders, and .fastq.gz for the suffix.
Line 14
This is a complex line, but to put it simply, it collects the fastq pairs separated by a comma (forward,reverse) on the same line. The fastq files usually follow the same naming scheme, SampleID_ADDITIONAL-INFO_[R]1.fastq[.gz] & SampleID_ADDITIONAL-INFO_[R]2.fastq[.gz], where the elements between square brackets are optional.
On that line we collect all the fastq files in the folder specified by the user, based on their extension (find command). Then we sort the list of files we obtained, so that we end up with pairs of files belonging to the same sample, with the forward file on top of the reverse file (sort command). At last, the paste command merges every two lines into one, separated by a comma.
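As a small illustration (the file names are hypothetical), for a folder containing two samples the pipeline behaves like this:

$ find ./fastqs -name "*_[12].fastq.gz" | sort
./fastqs/SampleA_1.fastq.gz
./fastqs/SampleA_2.fastq.gz
./fastqs/SampleB_1.fastq.gz
./fastqs/SampleB_2.fastq.gz

$ find ./fastqs -name "*_[12].fastq.gz" | sort | paste -d',' - -
./fastqs/SampleA_1.fastq.gz,./fastqs/SampleA_2.fastq.gz
./fastqs/SampleB_1.fastq.gz,./fastqs/SampleB_2.fastq.gz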
Line 16-19
Makes sure that the paths of the forward & reverse read files are in absolute form. It is easier to work with absolute paths in a script.
Line 20-21
From the forward read path, it isolates the file name and removes the suffix (basename command), then any trailing separator character (sed command). The result is used to identify the analysis for each sample. If the file name is SampleID_ADDITIONAL-INFO_R1.fastq.gz, the result of these lines will be SampleID_ADDITIONAL-INFO_R.
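For example, with a hypothetical file name and the default suffix .fastq.gz:

$ basename /work/username/fastqs/SampleA_R1.fastq.gz 1.fastq.gz
SampleA_R
# the sed on line 21 then strips any trailing separator,
# e.g. a name ending in "_" would lose that underscore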
Line 23
Build the absolute path of the output directory for the current sample.
Line 25
Create the per-sample output directory if it does not exist yet.
Line 27-42
The sbatch script for each sample is generated here, using a heredoc. Refer to the information above and to the manual of the sbatch command to understand the script. You will need to adjust this part to your own case: the command to run is of course important, but the resources and time needed for your samples should also be set.
Line 44
The submission of the generated script is commented out; remove the leading # once you are happy with the generated sbatch files.
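Putting it together, assuming the generator was saved as make_fastqc_jobs.sh (the name is hypothetical):

# Generate one sbatch script per sample
bash make_fastqc_jobs.sh /work/username/fastqs .fastq.gz /work/username/fastqc_out

# Inspect one generated script, then submit them all
less SampleA.sbatch
for s in *.sbatch; do sbatch $s; done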