Distributed FWs

AI4HPC consists of 4 (existing) distributed frameworks (FWs):

  1. PyTorch-DDP

  2. DeepSpeed

  3. HeAT

  4. Horovod

Why do we need such frameworks? For example, training a CAE with exceptionally large datasets is computationally challenging and can only be performed efficiently when parallelization strategies are exploited. A common parallelization strategy is to distribute the input dataset across separate GPUs, while the gradients of the trainable parameters are periodically synchronized between the GPUs. This method is called distributed data parallelism (DDP) and greatly reduces the training time. Depending on the size of the training dataset and the data exchange rate between the CPUs and/or GPUs, this type of parallelized training can scale to very large systems (even exascale)!
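
As an illustration, a minimal PyTorch-DDP training loop could look like the sketch below. This is not AI4HPC code: the linear model, the random dataset, and the two-epoch loop are placeholders standing in for e.g. a CAE and its inputs, and the script is assumed to be launched with torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Join the process group; NCCL is the usual backend for GPU training
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and dataset (a real CAE and its inputs would go here)
    model = torch.nn.Linear(64, 64).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 64), torch.randn(1024, 64))
    # DistributedSampler gives each rank a disjoint shard of the dataset
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()       # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```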

AI4HPC utilizes these FWs at HPC scale: it can use almost all of the available resources on an HPC system to drastically reduce training runtimes! For scaling performance results of AI4HPC, do not forget to check ./Bench/Results/bench_horovod_juwelsbooster.ipynb.
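
On an HPC system the worker processes are typically spawned by the batch scheduler rather than by torchrun. The snippet below is a hedged sketch (not AI4HPC's actual launch code) of how a SLURM-launched job, e.g. on JUWELS Booster, could wire the standard SLURM environment variables into the process-group initialization; MASTER_ADDR and MASTER_PORT are assumed to be exported by the job script.

```python
import os
import torch.distributed as dist

# SLURM exposes the global rank, total task count and node-local rank via env vars
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

# MASTER_ADDR / MASTER_PORT must point at one reachable node (e.g. the first
# host in the allocation); here they are assumed to be set by the job script
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```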