## Run
Each case in the `Cases` folder is unique; however, they all share the same parsable arguments, located in `./Cases/src/parsings.py`. These arguments are:
- `--data-dir`: location of the dataset (str)
- `--restart-int`: checkpoint interval, in epochs (int)
- `--concM`: multiply the dataset size by concM (for testing purposes) (int)

### model parsers
- `--batch-size`: batch size (int)
- `--epochs`: number of epochs (int)
- `--lr`: learning rate (float)
- `--wdecay`: weight decay in the scheduler (float)
- `--gamma`: gamma in the scheduler (float)
- `--shuff`: shuffle the dataset every epoch (bool)
- `--schedule`: enable schedulers (bool)

Horovod only:
- `--gradient-predivide-factor`: divide gradients during backprop (float)

### debug parsers
- `--testrun`: do a test run (bool)
- `--skipplot`: disable plots in post-processing (bool)
- `--nseed`: fixed seed for deterministic runs (int)
- `--log-int`: logging interval, in batches (int)
- `--export-latent`: export the latent space (bool)

### parallel parsers
- `--backend`: parallelization backend (str: nccl, gloo, mpi)
- `--nworker`: number of dataloader CPU workers (int)
- `--prefetch`: dataloader prefetch factor (int)
- `--no-cuda`: disable CUDA for CPU-only runs (bool)

Horovod only:
- `--use-fork`: use forkserver for the dataloader (bool)

DeepSpeed only:
- `--local_rank`: local rank of a worker within a node (int, from env)

### benchmarking parsers
- `--synt`: use synthetic data instead of the real dataset (disables I/O) (bool)
- `--synt-dpw`: data items per worker (when `--synt` is enabled) (int)
- `--benchrun`: run a benchmark with the profiler (bool)

### optimization parsers
- `--cudnn`: enable cuDNN optimizations (bool)
- `--amp`: enable AMP (bool)
- `--reduce-prec`: reduce dataset precision (bool)
- `--scale-lr`: scale the learning rate with the number of workers (bool)
- `--accum-iter`: gradient accumulation interval, in epochs (int)

Horovod only:
- `--fp16-allreduce`: reduce parameter precision during allreduce (bool)
- `--use-adasum`: use the adaptive summation (Adasum) algorithm (bool)
- `--batch-per-allreduce`: number of batches per allreduce (skips allreduce while gradients accumulate) (int)

### CDM additions
- `--cube`: enable patching of the input with cube cuts (enabling this activates the next three arguments) (bool)
- `--cubeC`: random (1) or monotonic (2) patch locations (int: 1, 2)
- `--cubeD`: cube edge length n (patch size is n**3) (int)
- `--cubeM`: number of patches (int)
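As an illustration of how such arguments are typically wired up, the sketch below declares a handful of them with Python's `argparse`. This is only a sketch under assumptions: the real definitions live in `./Cases/src/parsings.py`, and the defaults and flag behaviors shown here (e.g. `--shuff` as a store-true switch) are illustrative guesses, not AI4HPC's actual values.

```python
# Illustrative sketch only: the authoritative definitions live in
# ./Cases/src/parsings.py. Defaults below are assumptions for demonstration.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="AI4HPC case arguments (sketch)")
    parser.add_argument("--data-dir", type=str, default="./data",
                        help="location of the dataset")
    parser.add_argument("--batch-size", type=int, default=64,
                        help="batch size")
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="learning rate")
    # Boolean switch: absent -> False, present -> True (assumed behavior)
    parser.add_argument("--shuff", action="store_true",
                        help="shuffle the dataset every epoch")
    parser.add_argument("--backend", type=str, default="nccl",
                        choices=["nccl", "gloo", "mpi"],
                        help="parallelization backend")
    return parser

# Parse a sample command line; argparse maps --batch-size to args.batch_size
args = build_parser().parse_args(["--batch-size", "128", "--shuff"])
print(args.batch_size, args.lr, args.shuff, args.backend)
```

Unlisted flags simply keep their defaults, which mirrors how the default `startscript.sh` behaves when no parameters are passed.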
The full argument list, with more details on accepted values, can be printed via:

$ ai4hpc.py --help
In the default `startscript.sh`, none of these parameters is explicitly defined, so remember to add the ones your run needs.
To run AI4HPC, submitting `startscript.sh` is enough, but the script is only meant as a guide. As usual, users should modify the number of nodes, the partition, the dataset folder, and so on. A great guide to these settings can be found here.