Distributed Training on ALCF Polaris
Overview
Polaris is a high-performance computing (HPC) system at the Argonne Leadership Computing Facility (ALCF) with robust support for distributed training workflows and advanced scientific computing applications. This post walks through training a deep learning model across multiple GPUs and nodes on Polaris using Hugging Face Accelerate, a library that simplifies distributed training.
Prerequisites
Before starting this tutorial, ensure you have:
- ALCF Account: Active account with access to Polaris system
- Project Allocation: Computing time allocation on a project (you’ll need the project name)
- MFA Setup: CRYPTOCard or MobilePASS+ token configured for authentication
- Basic Knowledge: Familiarity with SSH, Linux command line, and Python virtual environments
- Python Experience: Understanding of deep learning concepts and PyTorch/Transformers
What is DeepSpeed?
DeepSpeed is Microsoft’s deep learning optimization library that enables efficient distributed training. It provides memory optimization techniques like ZeRO (Zero Redundancy Optimizer) and supports large model training across multiple GPUs and nodes with minimal code changes.
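For reference, DeepSpeed is configured through a small JSON file. The sketch below is only an illustration of a minimal ZeRO stage-2 setup; the file name ds_config.json and all of the values are placeholders, and the example run later in this post does not require one:
cat > ds_config.json <<'EOF'
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true }
}
EOF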
Log in to Polaris
Access Polaris via SSH with multi-factor authentication:
ssh <username>@polaris.alcf.anl.gov
Authentication requires CRYPTOCard/MobilePASS+ token for secure access.
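If you log in frequently, a host alias in your local ~/.ssh/config saves retyping the full hostname (a minimal sketch; replace <username> with your ALCF username):
cat >> ~/.ssh/config <<'EOF'
Host polaris
    HostName polaris.alcf.anl.gov
    User <username>
EOF
After that, ssh polaris is enough; the MFA prompt is unchanged.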
Setup Environment on a Login Node
Load conda module
module use /soft/modulefiles
module load conda
conda activate base
Install uv Package Manager
Install uv, a fast Python package manager that's more efficient than pip for large installations:
curl -LsSf https://astral.sh/uv/install.sh | sh
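The installer typically places the uv binary in ~/.local/bin (older releases used ~/.cargo/bin). If your shell does not pick it up automatically, add it to PATH and verify:
export PATH="$HOME/.local/bin:$PATH"   # adjust if the installer reported a different location
uv --version                           # confirm uv is available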
Configure Cache Directories
User home storage is very limited (50GB). Set the cache directories to your project's space under /eagle/, which has a much larger storage allocation. Replace <ProjectName> with your actual project name:
PROJECT_NAME=<ProjectName> # Replace with your actual project name
export UV_CACHE_DIR="/eagle/$PROJECT_NAME/uv"
export PIP_CACHE_DIR="/eagle/$PROJECT_NAME/cache/pip"
uv venv $UV_CACHE_DIR/accelerate_venv --seed --python 3.11
source $UV_CACHE_DIR/accelerate_venv/bin/activate
which pip
pip cache dir
These two commands confirm that pip comes from accelerate_venv and that the pip cache points to your project directory.
Then install Accelerate (this may take a while):
pip install accelerate
Afterwards, run
pip show accelerate
to confirm the package was installed into the virtual environment rather than your home directory.
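Note that the cache-directory exports above only apply to the current shell. One way to make them persistent across logins (a sketch, assuming a bash login shell) is to append them to ~/.bashrc:
cat >> ~/.bashrc <<EOF
export UV_CACHE_DIR="/eagle/$PROJECT_NAME/uv"
export PIP_CACHE_DIR="/eagle/$PROJECT_NAME/cache/pip"
EOF
The unquoted EOF lets the current value of PROJECT_NAME be written into the file, so new shells do not need it set.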
Run Distributed Training
Clone the accelerate code
The accelerate code base contains training code examples.
git clone https://github.com/huggingface/accelerate.git
cd accelerate
ln -s $UV_CACHE_DIR/accelerate_venv/ .venv
source .venv/bin/activate
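One caveat: pip install accelerate does not pull in everything the example uses. At the time of writing, examples/nlp_example.py also imports transformers, datasets, and evaluate, so install them into the same venv while still on the login node:
pip install transformers datasets evaluate   # imported by examples/nlp_example.py
# If the `deepspeed` launcher is not already provided by the system conda environment,
# install it into the venv as well (compiling it can take a while):
# pip install deepspeed
python -c "import accelerate, transformers, datasets; print('imports OK')"   # quick sanity check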
Request Interactive Compute Nodes
Request 2 compute nodes for interactive distributed training. The parameters specify:
- select=2: Request 2 compute nodes
- filesystems=home:eagle: Access to both home and eagle filesystems
- walltime=1:00:00: 1-hour time limit
- debug: Debug queue for faster allocation (limited resources)
- A $PROJECT_NAME: Charge time to your project allocation
qsub -I -l select=2 -l filesystems=home:eagle -l walltime=1:00:00 -q debug -A $PROJECT_NAME
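Allocation in the debug queue may still take a few minutes; from another login shell you can watch the job state with:
qstat -u $USER   # Q = queued, R = running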
Load Required Modules
Once your interactive job starts and you’re on the compute nodes:
module use /soft/modulefiles
module load conda
conda activate base
cd $PBS_O_WORKDIR
source .venv/bin/activate
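It is worth confirming what the job actually received before launching anything; each Polaris compute node has four NVIDIA A100 GPUs:
nvidia-smi -L        # list the GPUs visible on this node (expect 4)
uniq $PBS_NODEFILE   # list the nodes assigned to this job (expect 2)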
Setup Hostfile for DeepSpeed
Create a hostfile that tells DeepSpeed which nodes and GPUs are available for distributed training:
JOBID=$(echo $PBS_JOBID | tr -d '\n' | cut -d '.' -f 1)
HOSTFILE="hostfile.$JOBID"
gpu_per_node=$(nvidia-smi -L | wc -l) # Count available GPUs per node
cat $PBS_NODEFILE > $HOSTFILE # Copy allocated nodes to hostfile
sed -e "s/$/ slots=${gpu_per_node}/" -i $HOSTFILE # Add GPU count to each node
This creates a file like:
polaris-compute-02 slots=4
polaris-compute-03 slots=4
Setup Environment Variables
NNODES=$(uniq "$PBS_NODEFILE" | wc -l)                # Number of allocated nodes
WORLD_SIZE=$((NNODES * gpu_per_node))                 # Total number of GPU workers
MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | tr -d '\n')   # First node hosts the rendezvous
MASTER_PORT=29500
export MASTER_ADDR MASTER_PORT WORLD_SIZE
export HF_HOME="/eagle/$PBS_ACCOUNT/cache/huggingface"
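The NLP example downloads a model and dataset from the Hugging Face Hub on first run. Polaris compute nodes reach the internet through the ALCF proxy, so if downloads hang in the interactive session, export the same proxy settings used in the batch script below:
export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"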
Launch the example
deepspeed --hostfile ./$HOSTFILE --launcher MPICH \
examples/nlp_example.py
echo "Job ended at $(date)"
Putting everything in a qsub script
#!/bin/bash
#PBS -l select=2
#PBS -l filesystems=home:eagle
#PBS -l walltime=1:00:00
#PBS -q debug
#PBS -A <ProjectName>
# Load any required modules here
source /etc/profile
## project specific setting
export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"
export no_proxy="admin,polaris-adminvm-01,localhost,*.cm.polaris.alcf.anl.gov,polaris-*,*.polaris.alcf.anl.gov,*.alcf.anl.gov"
export UV_CACHE_DIR="/eagle/$PBS_ACCOUNT/cache/uv"
export HF_HOME="/eagle/$PBS_ACCOUNT/cache/huggingface"
module use /soft/modulefiles
module load conda
conda activate base
cd $PBS_O_WORKDIR
source .venv/bin/activate
# Set up job-specific environment
JOBID=$(echo $PBS_JOBID | tr -d '\n' | cut -d '.' -f 1)
HOSTFILE="hostfile.$JOBID"
gpu_per_node=$(nvidia-smi -L | wc -l)
cat $PBS_NODEFILE > $HOSTFILE
sed -e "s/$/ slots=${gpu_per_node}/" -i $HOSTFILE
NNODES=$(uniq "$PBS_NODEFILE" | wc -l)
WORLD_SIZE=$((NNODES * gpu_per_node))
MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | tr -d '\n')
MASTER_PORT=29500
export MASTER_ADDR MASTER_PORT WORLD_SIZE
deepspeed --hostfile ./$HOSTFILE --launcher MPICH \
examples/nlp_example.py
echo "Job ended at $(date)"
Additional Resources
For troubleshooting and detailed system information, see the ALCF Polaris User Guide and ALCF Support Center.