# Hyperparameter optimization

The [Ray Tune](https://www.ray.io/ray-tune) library offers scalable HyperParameter Optimization (HPO) of hyperparameters (such as the learning rate or batch size) for NNs and other ML models. The library integrates smoothly with PyTorch-based training scripts and enables two levels of parallelism:

- Each training run of a model with a particular set of hyperparameters (a trial) can itself run in parallel on multiple GPUs (e.g., via PyTorch-DDP)
- Several trials can run in parallel on an HPC machine (via Ray Tune itself)

At the moment, three HPO examples that optimize [ResNet18](https://arxiv.org/abs/1512.03385) on the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset with Ray Tune are available, using the following algorithms:

1. [ASHA](https://gitlab.jsc.fz-juelich.de/CoE-RAISE/FZJ/ai4hpc/ai4hpc/-/blob/main/HPO/Cases/cifar_tune_asha.py)
1. [BOHB](https://gitlab.jsc.fz-juelich.de/CoE-RAISE/FZJ/ai4hpc/ai4hpc/-/blob/main/HPO/Cases/cifar_tune_bohb.py)
1. [PBT](https://gitlab.jsc.fz-juelich.de/CoE-RAISE/FZJ/ai4hpc/ai4hpc/-/blob/main/HPO/Cases/cifar_tune_pbt.py)

### Setup

To start, check the manual via

```bash
$ ./setup_hpo.py --help
```

Any of these HPO examples can be installed; for example, the first case (the ASHA algorithm) is installed with

```bash
$ ./setup_hpo.py --case 1
```

After the installation, the main folder should contain

1. `hpo.py`, the source file for the chosen case
2. `startscript.sh`, for submitting the corresponding job

The following parameters in `startscript.sh` can be set for each script:

- `num-samples`: number of samples (trials) to evaluate
- `max-iterations`: maximum number of training iterations per trial
- `ngpus`: number of GPU workers to allocate per trial
- `scheduler`: which scheduler to use
- `data-dir`: directory where the datasets are stored

In the simplest scenario, modify the parameters listed above in `startscript.sh` and submit the job via

```bash
$ sbatch startscript.sh
```

Note that, for communication via the InfiniBand network, it is important to specify the node IP address in the startscript (when launching Ray Tune) in the following format: `--node-ip-address="$head_node"i` and `--address "$head_node"i:"$port"`. If multiple Ray instances run on the same machine, problems may arise if they all use the same port value (e.g., 29500), so it is advisable to change the port to a different value in that case. A sketch of these startup lines is given at the end of this section.

### ASHA

The [ASHA](https://arxiv.org/pdf/1810.05934.pdf) scheduler is a variation of random search with early stopping of under-performing trials; a minimal usage sketch is given at the end of this section.

### BOHB

The [BOHB](http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf) scheduler uses Bayesian optimization in combination with early stopping.

### PBT

The [PBT](https://arxiv.org/pdf/1711.09846.pdf) scheduler uses evolutionary optimization and is well suited for optimizing non-stationary hyperparameters (such as learning rate schedules).

### CIFAR-10

The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 60,000 low-resolution images in 10 different classes, such as birds, dogs, or cats. Training is performed on 50,000 images, while the remaining 10,000 images are used for validation and testing. It was first introduced by the Canadian Institute for Advanced Research (CIFAR) and is one of the main benchmark datasets in the computer vision domain.
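For orientation, below is a hedged sketch of the Ray startup portion of such a Slurm startscript. The variable names (`nodes`, `head_node`, `port`) and the final `python hpo.py ...` call with its flags are illustrative assumptions modelled on the parameter list above, not the repository's actual `startscript.sh`:

```bash
# Illustrative sketch only -- not the repository's startscript.sh.
port=29500                                   # change if several Ray instances share a machine
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
head_node=$(echo "$nodes" | head -n 1)

# Start the Ray head on the first node; the trailing "i" selects the
# InfiniBand address of the node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node"i --port="$port" --block &
sleep 10

# Attach the remaining nodes as Ray workers, again over InfiniBand.
for node in $(echo "$nodes" | tail -n +2); do
    srun --nodes=1 --ntasks=1 -w "$node" \
        ray start --address="$head_node"i:"$port" --node-ip-address="$node"i --block &
done
sleep 10

# Launch the HPO driver; the flag names mirror the parameters listed above (assumed).
python hpo.py --num-samples 16 --max-iterations 10 --ngpus 1 --scheduler asha
```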
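Finally, here is a minimal sketch of how a Ray Tune run with the ASHA scheduler can be wired up, assuming the classic `tune.run`/`tune.report` API (Ray 1.x style). The training function `train_cifar` and its placeholder metric are illustrative stand-ins for the actual PyTorch training loop in `hpo.py`:

```python
"""Minimal ASHA sketch (illustrative only, not the repository's hpo.py)."""
from ray import tune
from ray.tune.schedulers import ASHAScheduler


def train_cifar(config):
    # Build the model/optimizer from the sampled hyperparameters, e.g.
    # lr = config["lr"], batch_size = config["batch_size"], then train.
    for epoch in range(10):
        val_accuracy = 0.1 * epoch          # placeholder for a real validation pass
        tune.report(mean_accuracy=val_accuracy)  # metric consumed by ASHA


search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

scheduler = ASHAScheduler(
    metric="mean_accuracy",
    mode="max",
    max_t=10,          # corresponds to --max-iterations
    grace_period=1,    # minimum epochs before a trial can be stopped early
)

analysis = tune.run(
    train_cifar,
    config=search_space,
    num_samples=16,                             # corresponds to --num-samples
    scheduler=scheduler,
    resources_per_trial={"cpu": 4, "gpu": 1},   # corresponds to --ngpus
)
print("Best config:", analysis.get_best_config(metric="mean_accuracy", mode="max"))
```

Swapping `ASHAScheduler` for Ray Tune's `HyperBandForBOHB` (paired with the `TuneBOHB` search algorithm) or `PopulationBasedTraining` corresponds to the BOHB and PBT cases, respectively.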