Run

Each case in the Cases folder is unique; however, they all share the same command-line arguments, defined in ./Cases/src/parsings.py. These arguments are:

--data-dir                   location of the dataset (str)
--restart-int                checkpoint interval, in epochs (int)
--concM                      multiply the dataset size by concM (for test purposes) (int)
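
For example, to read the dataset from a custom location and write a checkpoint every 10 epochs, a run could be started as follows (the path and values are purely illustrative):

$ ai4hpc.py --data-dir /path/to/dataset --restart-int 10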

### model parsers
--batch-size                 batch size (int)
--epochs                     number of training epochs (int)
--lr                         learning rate (float)
--wdecay                     weight decay (float)
--gamma                      gamma in the scheduler (float)
--shuff                      shuffle the dataset every epoch (bool)
--schedule                   enable the learning-rate scheduler (bool)
# Horovod only
--gradient-predivide-factor  divide gradients during backprop (float)
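
As an illustration, the model hyperparameters above can be combined on a single command line. The values are arbitrary examples, not recommended defaults, and the boolean switches are shown in store_true style, which is an assumption about how parsings.py defines them; run ai4hpc.py --help to confirm the exact syntax:

$ ai4hpc.py --batch-size 64 --epochs 100 --lr 0.001 --wdecay 0.01 --gamma 0.95 --schedule --shuff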

### debug parsers
--testrun                    do a test-run (bool)
--skipplot                   disable plots in post-processing (bool)
--nseed                      random seed for deterministic runs (int)
--log-int                    logging interval, in batches (int)
--export-latent              export latent space (bool)
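
For a reproducible debugging session, one might fix the seed and log every 10 batches (illustrative values; booleans again shown in store_true style):

$ ai4hpc.py --testrun --nseed 42 --log-int 10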

### parallel parsers
--backend                    parallelization backend (str: nccl, gloo, mpi)
--nworker                    number of dataloader CPU workers (int)
--prefetch                   dataloader prefetch factor (int)
--no-cuda                    disable CUDA for CPU runs (bool)
# Horovod only
--use-fork                   use forkserver for dataloader (bool) 
# DeepSpeed only
--local_rank                 local rank of a worker in a node (int: from env)
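
A possible single-node configuration of the backend and data pipeline is sketched below (illustrative values; multi-node runs are normally launched through startscript.sh and the job scheduler rather than by invoking ai4hpc.py directly):

$ ai4hpc.py --backend nccl --nworker 8 --prefetch 2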

### benchmarking parsers
--synt                       use synthetic data instead of the real dataset (disables I/O) (bool)
--synt-dpw                   synthetic data per worker (used with --synt) (int)
--benchrun                   do a benchmark with profiler (bool)
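
To measure pure compute throughput without any I/O, a synthetic benchmark run could look like this (illustrative value for --synt-dpw; store_true-style booleans assumed):

$ ai4hpc.py --synt --synt-dpw 1000 --benchrun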

### optimization parsers
--cudnn                      enable cuDNN optimizations (bool)
--amp                        enable automatic mixed precision (AMP) (bool)
--reduce-prec                reduce dataset precision (bool)
--scale-lr                   scale the learning rate with the number of workers (bool)
--accum-iter                 gradient accumulation interval, in epochs (int)
# Horovod only
--fp16-allreduce             reduce parameter precision during allreduce (bool)
--use-adasum                 use the Adasum adaptive summation algorithm (bool)
--batch-per-allreduce        number of batches per allreduce (allreduce is skipped while gradients accumulate) (int)
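
A throughput-oriented configuration might combine several of these switches; the second line applies only when running under Horovod (illustrative combinations, not tested recommendations):

$ ai4hpc.py --cudnn --amp --scale-lr --accum-iter 1
$ ai4hpc.py --cudnn --amp --fp16-allreduce --use-adasum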

### CDM additions
--cube                       enable patching of the input with cube cuts (enabling this activates the next three arguments) (bool)
--cubeC                      patch locations: random (1) or monotonic (2) (int: 1, 2)
--cubeD                      cube edge length n (patch size is n**3) (int)
--cubeM                      number of patches (int)
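
For example, cutting 64 randomly placed cubic patches of edge length 32 (i.e. 32**3 = 32768 points each) from the input could be requested as follows (illustrative values; store_true-style --cube assumed):

$ ai4hpc.py --cube --cubeC 1 --cubeD 32 --cubeM 64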

The full argument list, with more details on the accepted values, can be printed via

$ ai4hpc.py --help

In the default startscript.sh, none of these parameters is explicitly set, so remember to add the ones your run needs.

To run AI4HPC, submitting startscript.sh is enough; however, the script is only meant as a guide. As usual, the user should adapt the number of nodes, the partition, the dataset folder, and so on to their system, as sketched below. A great guide to these settings can be found here.
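
The mention of nodes and partitions suggests a SLURM-style scheduler; on such a system, a typical workflow is to edit the resource lines in startscript.sh (nodes, partition, account, paths) and then submit it, for example:

$ sbatch startscript.sh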