Last updated: August 31, 2024
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler in high-performance computing (HPC) environments, where it manages and schedules jobs across large clusters.
Sometimes, we need to cancel all our submitted jobs, either due to incorrect submissions or to free up resources for other tasks.
In this tutorial, we’ll go over how to cancel all SLURM jobs in the shell.
The scancel command is SLURM's tool for canceling jobs in the queue. Besides canceling a single job by ID, it can cancel many jobs at once by matching them against a filter such as the owning user.
To cancel all jobs related to our account, we can enter our username with the scancel command:
$ scancel -u our_username
scancel: job 22222 canceled
In this example, SLURM terminates every job owned by our_username, whether it's running, pending, or in any other state.
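A confirmation message like the one above tells us which jobs have been canceled. Whether it appears depends on the SLURM version and verbosity settings: in many installations, scancel prints nothing on success unless we add the -v option. If we try to cancel a job that has already completed or doesn't exist, scancel reports an error instead.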
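To narrow the cancellation, scancel also accepts filters alongside -u, such as --state (short form -t), which matches only jobs in a given state. The sketch below wraps that in a small function. The function name cancel_my_jobs and the DRY_RUN variable are illustrative assumptions of this example, not part of SLURM. With DRY_RUN=1, the function only prints the command it would run.

```shell
# Hypothetical helper: cancel the current user's jobs, optionally by state.
# DRY_RUN is an assumed convention of this sketch, not a SLURM feature.
cancel_my_jobs() (
    state="$1"                      # optional, e.g. PENDING or RUNNING
    user="${USER:-$(id -un)}"       # fall back to id if USER is unset
    set -- scancel -u "$user"       # build the command as positional args
    if [ -n "$state" ]; then
        set -- "$@" "--state=$state"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"                   # show the command instead of running it
    else
        "$@"
    fi
)
```

For instance, calling cancel_my_jobs PENDING with DRY_RUN=1 set would print the scancel command for pending jobs only, so we can check it before running it for real.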
Overall, scancel with the -u option is the most direct way to clear our jobs from the queue, since a single command covers every job we own.
Canceling jobs by hand can be time-consuming, especially when many jobs are queued and we want per-job feedback. In this case, a shell script can automate the process and keep it consistent.
To get started, we’ll need to create a new file for our script. We can do this with a text editor like nano:
$ nano cancel_all_jobs.sh
Here, we create and open a new file named cancel_all_jobs.sh. Then, we write the script that fetches all our active jobs and cancels them:
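#!/bin/bash
# User whose jobs we want to cancel
USER_NAME="our_username_goes_here"
# Collect that user's job IDs, one per line, without the header
JOB_IDS=$(squeue -u "$USER_NAME" -h -o "%A")
for JOB_ID in $JOB_IDS; do
    # Report success only if scancel exits cleanly
    if scancel "$JOB_ID"; then
        echo "Canceled job $JOB_ID"
    fi
done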
In this example, we specify the username whose jobs we want to terminate by assigning it to the USER_NAME variable.
Next, we use the squeue command to retrieve the job IDs associated with this user. The -u option specifies the user, the -h option suppresses the header, and the -o "%A" option prints only the job IDs.
Once we get a list of job IDs, the script will iterate through them and cancel the associated job with the scancel command. After terminating each job, a confirmation message is written to the console.
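The loop described above can also be written to read job IDs line by line and skip anything that isn't a plain number. This is a minimal sketch, not the only way to do it. The names cancel_listed_jobs and SCANCEL are hypothetical, introduced here so the loop can be tried without touching real jobs.

```shell
# Hypothetical helper: read job IDs from stdin, one per line, and cancel each.
# SCANCEL is an assumed override for testing; it defaults to the real scancel.
cancel_listed_jobs() {
    while IFS= read -r job_id; do
        case "$job_id" in
            ''|*[!0-9]*)            # skip blank and non-numeric lines
                continue ;;
        esac
        # Redirect stdin so the command cannot consume the remaining IDs
        if "${SCANCEL:-scancel}" "$job_id" </dev/null; then
            echo "Canceled job $job_id"
        else
            echo "Failed to cancel job $job_id" >&2
        fi
    done
}
```

To use it, we could pipe the output of squeue -u "$USER" -h -o "%A" into cancel_listed_jobs. Setting SCANCEL=echo first turns the run into a harmless preview.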
Once we’ve finished writing the script, we can save and close the file. However, before we can run the script, we’ll need to make it executable using the chmod command:
$ chmod +x cancel_all_jobs.sh
Now, let's run the script:
$ ./cancel_all_jobs.sh
As the script runs, it prints a message for each job ID it processes.
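After the script finishes, it can be useful to confirm that nothing is left in the queue. A rough sketch follows, assuming a hypothetical helper named count_jobs and an SQUEUE override variable (neither is part of SLURM). It counts the lines that squeue still reports for a given user.

```shell
# Hypothetical helper: print how many jobs squeue still lists for a user.
# SQUEUE is an assumed override for testing; it defaults to the real squeue.
count_jobs() {
    # grep -c . counts non-empty lines; "|| true" ignores grep's
    # non-zero exit status when the count is zero
    "${SQUEUE:-squeue}" -u "$1" -h -o "%A" | grep -c . || true
}
```

Jobs may stay listed for a short while as they finish completing, so a nonzero count right after cancellation isn't necessarily a failure.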
Automating job cancellation with a shell script is valuable when handling many jobs, since it saves time and reduces the chance of manual mistakes.
Managing jobs in SLURM, especially in high-performance computing environments, often requires canceling many jobs quickly. Whether we call scancel directly or wrap it in a shell script, both approaches are reliable ways to manage our workload.
We can use the scancel command with the -u option to cancel every job belonging to a specific user in one step, which frees up resources and stops incorrect or unnecessary jobs. Alternatively, a shell script that loops over the job IDs reported by squeue automates the cancellation and reports on each job as it's processed, which saves time and lowers the chance of human error.