Slurm Jobs Tutorial - Rutgers

Job scheduler (batch system)

The job scheduler used on OARC-managed clusters is SchedMD’s Slurm Workload Manager.

Any memory-intensive or compute-intensive process should be run using scheduled resources and not simply run on one of our shared login nodes. In short, do not run applications on the login nodes.
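
If you just need to test something interactively, you can ask the scheduler for a small session on a compute node instead of working on a login node. A minimal sketch (the partition name and resource limits here are placeholders; check your cluster's defaults):

      # request a 1-hour interactive session on a compute node (partition is a placeholder)
      srun --partition=main --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=2G --time=01:00:00 --pty bash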

Basic Slurm Job Example:

  1. Create a script named <filename>.sh (make sure to add #!/bin/bash at the top)
  2. Fill it in with your specs:
#!/bin/bash
#SBATCH --partition=p_dz268_1
#SBATCH --job-name=<some job name>.sh 
#SBATCH --requeue
#SBATCH --nodes=1 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2000
#SBATCH --time=48:00:00
#SBATCH --output=/path/to/folder/<some job name>.out 
#SBATCH --error=/path/to/folder/<some job name>.err

module purge

# Activate the holmesenv environment to use installed packages
# (if you need additional packages for your script, install them into the holmesenv environment)
eval "$(conda shell.bash hook)"  # properly initialize Conda in this non-interactive shell
conda activate /projects/community/holmesenv

bash /path/to/your/file/<filename>.sh
# OR
python3 /path/to/your/file/<filename>.py
  3. Then save that whole script as run_NAME.sh (even if the file it runs is a Python script)
  4. Then make both files executable:

     chmod +x NAME.py
     chmod +x run_NAME.sh
    
  5. Then run:
sbatch run_NAME.sh
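
A successful submission prints the assigned job ID, which you can then use to check on the job. A sketch of what this looks like (the job ID below is illustrative):

      $ sbatch run_NAME.sh
      Submitted batch job 12345678
      $ squeue -u <your netID>   # see your queued/running jobs
      $ sacct -j 12345678        # see accounting info for this specific job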

Slurm Jobs Tutorial - In Depth

Slurm jobs (jobs sent to run on the compute cluster) should be used for everything EXCEPT downloads from the internet. Downloads from the internet should be run on the login nodes (see below); all non-download jobs should be packaged and run via Slurm on the compute nodes.

  1. Save your script as a scriptname.sh file (or, if it’s a Python script, scriptname.py)
  2. Create the shell script that submits it:
    1. Open a new file in a text editor (BBEdit, TextEdit, VS Code, etc.)
    2. Paste this code:
     #!/bin/bash
        
     #SBATCH --partition=p_dz268_1
     #SBATCH --job-name=name.sh 
     #SBATCH --cpus-per-task=9
     #SBATCH --mem=1G
     #SBATCH --time=2-00:00:00
     #SBATCH --output=/path/batch_jobs/out/name_%A.out
     #SBATCH --error=/path/batch_jobs/err/name_%A.err
        
     module purge
        
     # Activate the holmesenv virtual environment to use installed packages
     eval "$(conda shell.bash hook)"  # Properly initialize Conda
     conda activate /projects/community/holmesenv #change to whatever conda env you need
        
     # Run the Python script (or bash)
     python3 /projects/f_ah1491_1/analysis_tools/script.py
    
    • Change --time=2-00:00:00 (2 days, i.e. 48 hours) to however much time you think you’ll need. The max you can request is 2 weeks, but the more time you request, the longer your slurm job will sit in the queue before running.
      • To estimate timing, try running the job for 1 subject and time how long it takes, then multiply that by the number of subjects
    • Change python3 /projects/f_ah1491_1/analysis_tools/script.py to whatever script you want to run
    • Change /projects/community/holmesenv to whatever conda environment you need, or keep this as the default
    • Change the #SBATCH --output and --error paths
      • If you use a fixed name like ‘name.out’ that doesn’t change per job, it will be overwritten each time you run this job, so the .err and .out files will only be from the most recent run
      • If you want to save the .err and .out files from each run, use a name like name_%A.out
        • %A = job ID
          • IMPORTANT if running a job array (see the job-array sketch after the steps below)
      • Other ways to name:
        • %N = node name
        • %j = job allocation number
        • %a = array index
    • Change #SBATCH --job-name=name.sh to a name you want to see under ‘Running jobs’ when you call sacct
      • Make it short: sacct or watch only lets you see the first 8 characters of this name
      • It doesn’t need to be consistent with anything else
      • It can also use the % options listed above
        • IMPORTANT if running a job array
  3. Save this file as a run_scriptname.sh file, naming it something relevant to the package + shell

  4. Make sure both .sh files are in the SAME folder in your home directory, or somewhere else on Amarel, not on your local computer (one way to copy them over is sketched below)
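
If your scripts currently live on your local computer, one way to copy them over is scp, run from your local terminal (the hostname and destination path here are assumptions; substitute your own):

      scp scriptname.py run_scriptname.sh <netID>@amarel.rutgers.edu:/home/<netID>/folder/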


  5. Open a terminal ($ indicates a terminal entry)
    1. $ cd /home/netID/folder... ← replace with wherever your run_scriptname.sh files are saved
    2. $ chmod ugo+rwx filename.ext
       $ chmod ugo+rwx run_filename.ext
    3. $ chmod ugo+rwx dirname
    4. $ sbatch run_filename.sh
  6. Check in the terminal using sacct to see if your job worked
    1. Make sure the state says “RUNNING”
    2. Two days a month are reserved for maintenance, so jobs will say “FAILED” during those times. Maintenance calendar: https://oarc.rutgers.edu/amarel-system-status/
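
As referenced in the naming bullets above, here is a minimal job-array sketch showing how %A (master job ID) and %a (array index) keep each task's output separate. The partition, paths, array range, and the --subject argument are all placeholders:

     #!/bin/bash
     #SBATCH --partition=p_dz268_1
     #SBATCH --job-name=arrayjob
     #SBATCH --array=1-10                # 10 tasks; SLURM_ARRAY_TASK_ID runs 1..10
     #SBATCH --mem=1G
     #SBATCH --time=01:00:00
     #SBATCH --output=/path/batch_jobs/out/arrayjob_%A_%a.out
     #SBATCH --error=/path/batch_jobs/err/arrayjob_%A_%a.err

     module purge
     # each task processes one subject, indexed by the array task ID
     python3 /path/to/your/file/script.py --subject "${SLURM_ARRAY_TASK_ID}"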

Helpful commands

  • if anything weird comes up, you can cancel a job with scancel <job ID>, or all of your jobs with scancel -u <your netID>
  • sacct -e shows all the fields you could pull up for existing/past jobs
  • sacct --state=<failed|running|pending|completed> filters jobs by state
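
For example, to pull a readable summary of your recent jobs (replace <netID> with your own; pick field names from sacct -e):

      sacct -u <netID> --format=JobID,JobName%20,Partition,State,Elapsed
      sacct -u <netID> --state=FAILED   # only show jobs that failed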


Troubleshooting:

  • open error files via:

      cd /dir/where/the/error/file/is/
      vi slurm.most.recent.err

    the top line will usually give the reason for the failure

  • check permissions
  • check everything is in the right folders
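
A quick sketch that checks both at once (all paths are placeholders):

      ls -ld /path/to/folder                        # folder exists and you have rwx on it?
      ls -l /path/to/folder/run_scriptname.sh       # script is where sbatch expects it?
      head -n 5 /path/to/folder/name_12345678.err   # the reason is usually in the first lines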

Tips:

  • try the job with one subject first, time it, then multiply by the number of subjects to estimate how much time to request
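
For instance, a rough timing sketch (the script name and subject ID are hypothetical):

      time python3 /path/to/your/file/script.py sub-001
      # if one subject takes ~10 minutes, 100 subjects need ~1000 minutes (~17 hours),
      # so pad it and request something like #SBATCH --time=20:00:00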

OARC Tutorials