Hyperparameter optimization

The Ray Tune library offers scalable hyperparameter optimization (HPO) for neural networks and other ML models, tuning parameters such as the learning rate or batch size. It integrates smoothly with PyTorch-based training scripts and enables two levels of parallelism (sketched in the code example after the list below):

  • Each trial (a training run of the model with a given set of hyperparameters) can itself run in parallel on multiple GPUs (e.g., via PyTorch DDP)

  • Several trials can run in parallel on an HPC machine (via Ray Tune itself)
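
For illustration, here is a minimal sketch of how these two levels map onto the classic Ray Tune API; the trainable train_cifar, the search space, and the resource numbers are placeholders rather than the exact code of the installed examples.

from ray import tune

def train_cifar(config):
    # placeholder for the PyTorch (DDP) training loop of a single trial
    # ... train ResNet18 with config["lr"] and config["batch_size"] ...
    tune.report(accuracy=0.0)  # report the metric Ray Tune optimizes

analysis = tune.run(
    train_cifar,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=10,                  # level 2: several trials scheduled in parallel by Ray Tune
    resources_per_trial={"gpu": 2},  # level 1: GPUs reserved per trial (e.g., for DDP workers)
    metric="accuracy",
    mode="max",
)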

At the moment, three HPO examples that optimize ResNet18 on the CIFAR-10 dataset with Ray Tune are available, using the following algorithms:

  1. ASHA

  2. BOHB

  3. PBT

Setup

As a first step, check the usage instructions via

$ ./setup_hpo.py --help

Any of these HPO examples can then be installed; for example, case 1 (the ASHA algorithm) is set up with

$ ./setup_hpo.py --case 1

After the installation, the main folder should contain

  1. the hpo.py source file for the chosen case

  2. a startscript.sh for submitting the job

The following parameters can be set in startscript.sh for each case (a sketch of how they might be consumed follows the list):

  • num-samples: number of samples (trials) to evaluate

  • max-iterations: the maximum number of training iterations per trial

  • ngpus: how many GPU workers to allocate per trial

  • scheduler: which scheduler to use

  • data-dir: directory where the datasets are stored
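
As a rough sketch (the exact argument handling in the installed hpo.py may differ), these parameters could be consumed as follows; the default values and scheduler names are assumptions:

import argparse

parser = argparse.ArgumentParser(description="Ray Tune HPO of ResNet18 on CIFAR-10")
parser.add_argument("--num-samples", type=int, default=10,
                    help="number of trials to evaluate")
parser.add_argument("--max-iterations", type=int, default=20,
                    help="maximum number of training iterations per trial")
parser.add_argument("--ngpus", type=int, default=1,
                    help="number of GPU workers to allocate per trial")
parser.add_argument("--scheduler", type=str, default="asha",
                    choices=["asha", "bohb", "pbt"],
                    help="which Ray Tune scheduler to use")
parser.add_argument("--data-dir", type=str, default="./data",
                    help="directory where the datasets are stored")
args = parser.parse_args()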

In the simplest scenario, modify the <account> section of startscript.sh and submit the job via

$ sbatch startscript.sh

Note that, for communication via the InfiniBand network, it is important to specify the node IP address in the startscript (when launching Ray Tune) in the following format: --node-ip-address="$head_node"i and --address "$head_node"i:"$port".

If multiple Ray instances run on the same machine, conflicts can arise if they all use the same port value (e.g., 29500), so it is advisable to choose a different port in that case.

ASHA

The ASHA scheduler is a variation of Random Search with early stopping of under-performing trials.
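
A minimal ASHA configuration could look as follows; the concrete values are illustrative, with max_t typically tied to --max-iterations:

from ray.tune.schedulers import ASHAScheduler

scheduler = ASHAScheduler(
    time_attr="training_iteration",
    max_t=20,            # train each trial for at most 20 iterations (assumed value)
    grace_period=1,      # give every trial at least one iteration before stopping
    reduction_factor=2,  # promote roughly the best half of the trials at each rung
)

The scheduler is then passed to tune.run(..., scheduler=scheduler).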

BOHB

The BOHB scheduler uses Bayesian Optimization in combination with early stopping.
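
A minimal BOHB setup pairs a HyperBand-style scheduler with the TuneBOHB search algorithm (which requires the ConfigSpace and hpbandster packages). The import path below follows Ray 2.x (older releases use ray.tune.suggest.bohb), and the values are illustrative:

from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=20,            # assumed value, typically tied to --max-iterations
    reduction_factor=3,
)
search_alg = TuneBOHB()  # Bayesian Optimization over the search space

Both objects are then passed to tune.run(..., scheduler=scheduler, search_alg=search_alg).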

PBT

The PBT scheduler uses evolutionary optimization and is well suited for optimizing non-stationary hyperparameters (such as learning rate schedules).
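
A minimal PBT configuration could look as follows; the mutation ranges and the perturbation interval are illustrative only:

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=4,                # mutate the population every 4 iterations
    hyperparam_mutations={
        "lr": tune.loguniform(1e-4, 1e-1),  # resample/perturb the learning rate over time
        "batch_size": [32, 64, 128],
    },
)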

CIFAR-10

The CIFAR-10 dataset consists of 60,000 low-resolution images in 10 different classes, such as birds, dogs, or cats. Training is performed on 50,000 images, while the remaining 10,000 images are used for validation and testing. It was first introduced by the Canadian Institute for Advanced Research (CIFAR) and is one of the main benchmark datasets in the computer vision domain.
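
For reference, a hedged sketch of loading the dataset with torchvision (the installed examples may use different transforms and take the root directory from --data-dir); the normalization statistics are the commonly used CIFAR-10 per-channel mean and standard deviation:

import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)  # 50,000 images
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)   # 10,000 images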