Installation

AI4HPC can be cloned via

$ git clone https://gitlab.jsc.fz-juelich.de/CoE-RAISE/FZJ/ai4hpc/ai4hpc
$ cd ai4hpc

To install AI4HPC, first consult the built-in help of the setup script via

$ ./setup.py --help

Note that the minimum Python version required to run AI4HPC is 3.10. Check your version via

$ python --version

If your version is older than 3.10, run the ./Scripts/installPython.sh script.
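The version check can also be scripted. The following is a minimal sketch, not part of AI4HPC: the function name version_ok is hypothetical, and the snippet assumes that `python --version` prints a line of the form "Python X.Y.Z".

```shell
#!/bin/sh
# Hypothetical helper (not part of AI4HPC): succeeds if a version string
# such as "3.11.4" is at least 3.10.
version_ok() {
    v_major=${1%%.*}        # text before the first dot
    v_rest=${1#*.}          # text after the first dot
    v_minor=${v_rest%%.*}   # text up to the next dot
    # Reject empty or non-numeric components.
    case "$v_major" in ''|*[!0-9]*) return 1 ;; esac
    case "$v_minor" in ''|*[!0-9]*) return 1 ;; esac
    [ "$v_major" -gt 3 ] || { [ "$v_major" -eq 3 ] && [ "$v_minor" -ge 10 ]; }
}

# Assumption: the second field of the output is the version number.
if command -v python >/dev/null 2>&1; then
    py_version=$(python --version 2>&1 | awk '{print $2}')
    if version_ok "$py_version"; then
        echo "Python $py_version is sufficient"
    else
        echo "Python $py_version is too old; run ./Scripts/installPython.sh"
    fi
fi
```

Comparing major and minor numbers separately avoids the pitfall of a plain string comparison, which would rank 3.9 above 3.10.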

AI4HPC includes five trainable models:

  1. Convolutional AutoEncoder (CAE)

  2. Convolutional Defiltering Model (CDM)

  3. Regression Neural Network (REG)

  4. Transformer Network (TR)

  5. Convolutional Neural Network (CNN)

Optionally, one of the distributed backends implemented in AI4HPC can be selected for training:

  1. PyTorch-DDP (default)

  2. DeepSpeed

  3. HeAT

  4. Horovod

For example, AI4HPC for CDM training (model 2) with Horovod as the distributed backend (backend 4) can be set up with

$ python setup.py --model 2 --fw 4
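To avoid looking up the indices by hand, the name-to-number mapping can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, and it assumes that the values of --model and --fw follow the numbering of the two lists above, which the source confirms only for CDM (2) and Horovod (4). The helper prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of AI4HPC). Assumption: the list positions
# above are the values expected by --model and --fw.
model_index() {
    case "$1" in
        CAE) echo 1 ;;
        CDM) echo 2 ;;
        REG) echo 3 ;;
        TR)  echo 4 ;;
        CNN) echo 5 ;;
        *)   echo "unknown model: $1" >&2; return 1 ;;
    esac
}

framework_index() {
    case "$1" in
        DDP)       echo 1 ;;
        DeepSpeed) echo 2 ;;
        HeAT)      echo 3 ;;
        Horovod)   echo 4 ;;
        *)         echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

# Print the setup command for a model name and a backend name.
setup_command() {
    m_idx=$(model_index "$1") || return 1
    f_idx=$(framework_index "$2") || return 1
    echo "python setup.py --model $m_idx --fw $f_idx"
}

setup_command CDM Horovod   # prints: python setup.py --model 2 --fw 4
```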

If the HPC system is preconfigured in the ./Scripts/setup.sh file, a Python environment with the following libraries is installed on the system:

  1. PyTorch

  2. Horovod

  3. DeepSpeed

  4. HeAT

  5. JUBE

The preconfigured HPC systems are

  1. JUWELS

  2. JURECA

  3. DEEP-EST

  4. LUMI

  5. CTE-AMD*
    *CTE-AMD does not allow incoming or outgoing network communication; refer to the CTE-AMD installation guide instead.

After the installation, the main folder should contain

  1. ai4hpc.py source file for the chosen case

  2. startscript.sh for submitting that job

  3. src folder with the rest of the dependencies
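A quick way to confirm that the three items listed above are present is sketched below. The function name check_install is hypothetical and not part of AI4HPC; it only tests for existence, not for correct contents.

```shell
#!/bin/sh
# Hypothetical helper (not part of AI4HPC): verify that a folder contains
# ai4hpc.py, startscript.sh and a src folder. Defaults to the current folder.
check_install() {
    ci_dir=${1:-.}
    ci_missing=0
    for ci_file in ai4hpc.py startscript.sh; do
        if [ ! -f "$ci_dir/$ci_file" ]; then
            echo "missing file: $ci_file"
            ci_missing=1
        fi
    done
    if [ ! -d "$ci_dir/src" ]; then
        echo "missing folder: src"
        ci_missing=1
    fi
    return "$ci_missing"
}

if check_install .; then
    echo "installation looks complete"
else
    echo "installation looks incomplete"
fi
```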

In the simplest scenario, submit startscript.sh with

$ sbatch startscript.sh

That is it!
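If the job ID is needed afterwards, for example to query the queue, the submission can be wrapped as sketched below. The function name submit_job is hypothetical; the sketch relies on the standard Slurm option `sbatch --parsable`, which prints the job ID (optionally followed by a semicolon and the cluster name) instead of a full sentence.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of AI4HPC): submit a script and print only
# the numeric job ID.
submit_job() {
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; is Slurm available on this node?" >&2
        return 1
    fi
    # --parsable output is "jobid" or "jobid;cluster".
    sj_out=$(sbatch --parsable "$1") || return 1
    # Strip the optional ";cluster" suffix.
    echo "${sj_out%%;*}"
}
```

A possible use is `jobid=$(submit_job startscript.sh) && squeue -j "$jobid"`, which shows the state of the submitted job.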