Distributed Training on ALCF Polaris
Overview
Polaris is a high-performance computing (HPC) system at the Argonne Leadership Computing Facility (ALCF) with robust support for distributed training workflows and advanced scientific computing applications. This post walks through training a deep learning model across multiple GPUs and nodes on Polaris using Hugging Face Accelerate, a library that simplifies distributed training.
Prerequisites
Before starting this tutorial, ensure you have:
- ALCF Account: Active account with access to Polaris system
- Project Allocation: Computing time allocation on a project (you’ll need the project name)
- MFA Setup: CRYPTOCard or MobilePASS+ token configured for authentication
- Basic Knowledge: Familiarity with SSH, Linux command line, and Python virtual environments
- Python Experience: Understanding of deep learning concepts and PyTorch/Transformers
What is DeepSpeed?
DeepSpeed is Microsoft’s deep learning optimization library that enables efficient distributed training. It provides memory optimization techniques like ZeRO (Zero Redundancy Optimizer) and supports large model training across multiple GPUs and nodes with minimal code changes.
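For reference, DeepSpeed is configured through a small JSON file. The sketch below is only an illustration of a minimal ZeRO stage-2 setup; the file name ds_config.json and all of the values are placeholders, and the example run later in this post does not require one:
cat > ds_config.json <<'EOF'
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true }
}
EOF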
Log in to Polaris
Access Polaris via SSH with multi-factor authentication:
ssh <username>@polaris.alcf.anl.gov
Authentication requires CRYPTOCard/MobilePASS+ token for secure access.
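If you log in frequently, a host alias in your local ~/.ssh/config saves retyping the full hostname (a minimal sketch; replace <username> with your ALCF username):
cat >> ~/.ssh/config <<'EOF'
Host polaris
    HostName polaris.alcf.anl.gov
    User <username>
EOF
After that, ssh polaris is enough; the MFA prompt is unchanged.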
Setup Environment on a Login Node
Load conda module
module use /soft/modulefiles
module load conda
conda activate base
Install uv Package Manager
Install uv, a fast Python package manager that's more efficient than pip for large installations:
curl -LsSf https://astral.sh/uv/install.sh | sh
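The installer typically places the uv binary in ~/.local/bin (older releases used ~/.cargo/bin). If your shell does not pick it up automatically, add it to PATH and verify:
export PATH="$HOME/.local/bin:$PATH"   # adjust if the installer reported a different location
uv --version                           # confirm uv is available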
Configure Cache Directories
User home storage is very limited (50GB). Set the cache directories to your project's space under /eagle/, which has a much larger storage allocation. Replace <ProjectName> with your actual project name:
PROJECT_NAME=<ProjectName> # Replace with your actual project name
export UV_CACHE_DIR="/eagle/$PROJECT_NAME/uv"
export PIP_CACHE_DIR="/eagle/$PROJECT_NAME/cache/pip"
uv venv $UV_CACHE_DIR/accelerate_venv --seed --python 3.11
source $UV_CACHE_DIR/accelerate_venv/bin/activate
which pip
pip cache dir
These two commands confirm that pip comes from accelerate_venv and that the pip cache points to your project directory.
Then install Accelerate (this may take a while):
pip install accelerate
Afterwards, run
pip show accelerate
to confirm the package was installed into the virtual environment rather than your home directory.
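Note that the cache-directory exports above only apply to the current shell. One way to make them persistent across logins (a sketch, assuming a bash login shell) is to append them to ~/.bashrc:
cat >> ~/.bashrc <<EOF
export UV_CACHE_DIR="/eagle/$PROJECT_NAME/uv"
export PIP_CACHE_DIR="/eagle/$PROJECT_NAME/cache/pip"
EOF
The unquoted EOF lets the current value of PROJECT_NAME be written into the file, so new shells do not need it set.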
Run Distributed Training
Clone the accelerate code
The accelerate code base contains training code examples.
git clone https://github.com/huggingface/accelerate.git
cd accelerate
ln -s $UV_CACHE_DIR/accelerate_venv/ .venv
source .venv/bin/activate
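One caveat: pip install accelerate does not pull in everything the example uses. At the time of writing, examples/nlp_example.py also imports transformers, datasets, and evaluate, so install them into the same venv while still on the login node:
pip install transformers datasets evaluate   # imported by examples/nlp_example.py
# If the `deepspeed` launcher is not already provided by the system conda environment,
# install it into the venv as well (compiling it can take a while):
# pip install deepspeed
python -c "import accelerate, transformers, datasets; print('imports OK')"   # quick sanity check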
Request Interactive Compute Nodes
Request 2 compute nodes for interactive distributed training. The parameters specify:
- select=2: Request 2 compute nodes
- filesystems=home:eagle: Access to both home and eagle filesystems
- walltime=1:00:00: 1-hour time limit
- debug: Debug queue for faster allocation (limited resources)
- A $PROJECT_NAME: Charge time to your project allocation
qsub -I -l select=2 -l filesystems=home:eagle -l walltime=1:00:00 -q debug -A $PROJECT_NAME
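Allocation in the debug queue may still take a few minutes; from another login shell you can watch the job state with:
qstat -u $USER   # Q = queued, R = running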
Load Required Modules
Once your interactive job starts and you’re on the compute nodes:
module use /soft/modulefiles
module load conda
conda activate base
cd $PBS_O_WORKDIR
source .venv/bin/activate
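It is worth confirming what the job actually received before launching anything; each Polaris compute node has four NVIDIA A100 GPUs:
nvidia-smi -L        # list the GPUs visible on this node (expect 4)
uniq $PBS_NODEFILE   # list the nodes assigned to this job (expect 2)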
Setup Hostfile for DeepSpeed
Create a hostfile that tells DeepSpeed which nodes and GPUs are available for distributed training:
JOBID=$(echo $PBS_JOBID | tr -d '\n' | cut -d '.' -f 1)
HOSTFILE="hostfile.$JOBID"
gpu_per_node=$(nvidia-smi -L | wc -l) # Count available GPUs per node
cat $PBS_NODEFILE > $HOSTFILE # Copy allocated nodes to hostfile
sed -e "s/$/ slots=${gpu_per_node}/" -i $HOSTFILE # Add GPU count to each node
This creates a file like:
polaris-compute-02 slots=4
polaris-compute-03 slots=4
Setup Environment Variables
NNODES=$(uniq "$PBS_NODEFILE" | wc -l)                # Number of allocated nodes
WORLD_SIZE=$((NNODES * gpu_per_node))                 # Total number of GPU workers
MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | tr -d '\n')   # First node hosts the rendezvous
MASTER_PORT=29500
export MASTER_ADDR MASTER_PORT WORLD_SIZE
export HF_HOME="/eagle/$PBS_ACCOUNT/cache/huggingface"
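The NLP example downloads a model and dataset from the Hugging Face Hub on first run. Polaris compute nodes reach the internet through the ALCF proxy, so if downloads hang in the interactive session, export the same proxy settings used in the batch script below:
export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"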
Launch the example
deepspeed --hostfile ./$HOSTFILE --launcher MPICH \
examples/nlp_example.py
echo "Job ended at $(date)"
Putting everything in a qsub script
#!/bin/bash
#PBS -l select=2
#PBS -l filesystems=home:eagle
#PBS -l walltime=1:00:00
#PBS -q debug
#PBS -A <ProjectName>
# Load any required modules here
source /etc/profile
## project specific setting
export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"
export no_proxy="admin,polaris-adminvm-01,localhost,*.cm.polaris.alcf.anl.gov,polaris-*,*.polaris.alcf.anl.gov,*.alcf.anl.gov"
export UV_CACHE_DIR="/eagle/$PBS_ACCOUNT/cache/uv"
export HF_HOME="/eagle/$PBS_ACCOUNT/cache/huggingface"
module use /soft/modulefiles
module load conda
conda activate base
cd $PBS_O_WORKDIR
source .venv/bin/activate
# Set up job-specific environment
JOBID=$(echo $PBS_JOBID | tr -d '\n' | cut -d '.' -f 1)
HOSTFILE="hostfile.$JOBID"
gpu_per_node=$(nvidia-smi -L | wc -l)
cat $PBS_NODEFILE > $HOSTFILE
sed -e "s/$/ slots=${gpu_per_node}/" -i $HOSTFILE
NNODES=$(uniq "$PBS_NODEFILE" | wc -l)
WORLD_SIZE=$((NNODES * gpu_per_node))
MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | tr -d '\n')
MASTER_PORT=29500
export MASTER_ADDR MASTER_PORT WORLD_SIZE
deepspeed --hostfile ./$HOSTFILE --launcher MPICH \
examples/nlp_example.py
echo "Job ended at $(date)"
Additional Resources
For troubleshooting and detailed system information, see the ALCF Polaris User Guide and ALCF Support Center.