Job Management Guide (4)
By Hongyu Xiao
Contact: hongyu.xiao@ou.edu
Using SLURM for Efficient Computing
While Jupyter notebook access through tunneling is available, using SLURM for job management often provides better efficiency and resource utilization. Here's my template for a GPU-enabled SLURM script for deep learning tasks:
#!/bin/bash
#SBATCH --partition=disc_dual_a100 # GPU partition
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH --output=job_%J_.txt # Output file
#SBATCH --error=job_%J_.txt # Error file (same file, so stdout and stderr are merged)
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --mem=100G # Memory request
#SBATCH --time=24:00:00 # Time limit
# Run your deep learning script
python your_training_script.py
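Depending on how your cluster environment is set up, you may also need to load modules or activate a Python environment before the training command runs. The module and environment names below are placeholders, not the cluster's actual names:

# Add before the python line if your environment requires it
module load Python/3.10 # hypothetical module name
conda activate my_env # hypothetical environment name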
When using GPU resources, make sure to specify the appropriate partition (disc_dual_a100) and request GPU resources using the --gres flag. This ensures your job gets scheduled on nodes with available GPUs.
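To confirm the scheduler actually granted a GPU, it can help to print the device before training starts. A minimal sketch, assuming nvidia-smi is available on the compute node:

# Optional sanity check near the top of the script: show the allocated GPU
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
# SLURM exports CUDA_VISIBLE_DEVICES for the GPU(s) it granted
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"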
To submit your SLURM job, use:
sbatch your_script.sbatch
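If the submission succeeds, sbatch prints the assigned job ID, which you can reuse with the commands below:

$ sbatch your_script.sbatch
Submitted batch job 123456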
Common SLURM commands for job management:
squeue -u $USER # Check your job queue
scancel job_id # Cancel a specific job
sinfo # Check partition information
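A few variants of these commands that I find handy (the flags below are standard SLURM options):

scancel -u $USER # Cancel all of your jobs at once
squeue -u $USER --start # Show estimated start times for pending jobs
sinfo -p disc_dual_a100 # Check the state of one partition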
This approach allows for better resource management and more efficient execution of computational tasks compared to interactive notebook sessions.
Here are examples of using squeue and grep to monitor jobs:
# View all jobs in the queue
$ squeue
 JOBID  PARTITION  NAME       USER     ST  TIME      NODES  NODELIST(REASON)
123456  disc_dual  python_tr  hongyux  R   2:30:15   1      node001
123457  disc_dual  tensor_jo  user2    R   12:45:22  1      node002
123458  disc_dual  pytorch_t  user3    PD  0:00:00   1      (Resources)
# Filter jobs on disc partitions
$ squeue | grep disc
123456  disc_dual  python_tr  hongyux  R   2:30:15   1      node001
123457  disc_dual  tensor_jo  user2    R   12:45:22  1      node002
123458  disc_dual  pytorch_t  user3    PD  0:00:00   1      (Resources)
123459  disc_a100  train_ml   user4    R   5:12:33   1      node003
The output shows job ID, partition name, job name, user, status (R=running, PD=pending), runtime, number of nodes, and node assignment or reason for pending.
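If the default columns truncate your job names, squeue's --format option lets you choose columns and widths. The format codes below are standard; the widths are just one reasonable choice:

squeue -u $USER --format="%.10i %.12P %.30j %.8T %.12M %R"
# %i=job ID, %P=partition, %j=job name, %T=state, %M=elapsed time, %R=node or reason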
Advanced SLURM Usage Tips
Here are some additional SLURM commands and features that can help you manage your computational jobs more effectively:
1. Job Dependencies
You can make jobs wait for other jobs to complete:
# Wait for job 123456 to complete before starting
sbatch --dependency=afterok:123456 next_job.sbatch
# Wait for job 123456 to fail before starting
sbatch --dependency=afternotok:123456 cleanup_job.sbatch
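To chain jobs without copying IDs by hand, sbatch's --parsable flag prints just the job ID so a small wrapper script can capture it. A sketch, with placeholder script names:

#!/bin/bash
# Submit a three-stage pipeline where each stage waits for the previous one
jid1=$(sbatch --parsable preprocess.sbatch)
jid2=$(sbatch --parsable --dependency=afterok:${jid1} train.sbatch)
sbatch --dependency=afterok:${jid2} evaluate.sbatch
echo "Submitted pipeline: ${jid1} -> ${jid2}"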
2. Resource Monitoring
Monitor your job's resource usage:
sstat - View resource usage of running jobs
sacct - View completed job information
# View detailed job information
sacct -j JobID --format=JobID,JobName,MaxRSS,Elapsed
# Monitor memory usage of a running job (for a batch script, you may need
# to query the batch step, e.g. JobID.batch)
sstat --format=AveCPU,AveRSS,AveVMSize --jobs JobID
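Many SLURM installations also ship the seff utility, which summarizes CPU and memory efficiency once a job finishes; whether it is installed on a given cluster is an assumption worth checking:

# Report what fraction of the requested CPU time and memory a finished job used
seff 123456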