Dataloaders

Each case in the Cases folder is unique, but they all share the same dataloading routines, located in ./Cases/src/dataloaders.py.

Currently, AI4HPC can load the following datasets:

  1. Actuated TBL dataset

  2. MNIST dataset

If a new dataloader is required, dataloaders.py should be extended.
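As a reference point, a new dataloader can follow the same PyTorch Dataset pattern as the existing ones. The sketch below is only illustrative: the class name MyCFDDataset_train, the .npy file layout, and the field shape are assumptions for this example, not part of the current dataloaders.py.

# Minimal sketch of a custom dataset that could be added to dataloaders.py.
# Class name, file layout, and field shape are illustrative assumptions.
import os
import numpy as np
import torch
from torch.utils.data import Dataset

class MyCFDDataset_train(Dataset):
    """Loads one flow-field snapshot per .npy file found in data_dir."""

    def __init__(self, data_dir):
        self.files = sorted(
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir)
            if f.endswith(".npy")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        field = np.load(self.files[idx])  # e.g. shape (channels, height, width)
        return torch.from_numpy(field).float()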

Cubic extractions

A simple approach is to use the complete field as input to the chosen architecture. However, this is not only memory-intensive, since the resulting tensors can be enormous, but also makes padding hard to adjust when dataset shapes vary. AI4HPC rectifies this by extracting cubes from the dataset instead of using the complete field, which is particularly well suited to CFD datasets. This option is enabled simply by adding the --cube argument to startscript.sh, as

# command
dataDir=./
COMMAND="ai4hpc.py"
EXEC="$COMMAND \
  --nworker $SLURM_CPUS_PER_TASK \
  --data-dir $dataDir \
  --cube"

In its default configuration, AI4HPC extracts 20 cubes in each direction (8000 cubes in total) with dimensions of 16x16x16 in monotonic order, meaning the cuts are made at equidistant locations in the field. Depending on the problem, one can also modify the number of extracted cubes, the cube dimensions, and whether the extraction locations are monotonic or random, as

# command
dataDir=./
COMMAND="ai4hpc.py"
EXEC="$COMMAND \
  --nworker $SLURM_CPUS_PER_TASK \
  --data-dir $dataDir \
  --cube \
  --cubeD 32 \
  --cubeM 100 \
  --cubeC 1"

These flags extract 100 cubes in each direction with dimensions of 32x32x32 at random locations. More information on the training options can be found in Run.
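As a conceptual illustration of the two extraction modes, the NumPy sketch below cuts cubes at either equidistant ("monotonic") or random locations. The function name, arguments, and field size are assumptions made for this example and do not mirror the internals of dataloaders.py.

# Sketch of equidistant ("monotonic") vs. random cube extraction with NumPy.
# Names and defaults are illustrative, not the actual dataloaders.py code.
import numpy as np

def extract_cubes(field, n_per_dim=20, cube_dim=16, random_cuts=False, rng=None):
    """Cut n_per_dim**3 cubes of edge length cube_dim out of a 3D field."""
    rng = rng or np.random.default_rng()
    starts = []
    for size in field.shape[:3]:
        if random_cuts:
            # random extraction locations
            starts.append(rng.integers(0, size - cube_dim + 1, n_per_dim))
        else:
            # equidistant ("monotonic") extraction locations
            starts.append(np.linspace(0, size - cube_dim, n_per_dim).astype(int))
    cubes = []
    for i in starts[0]:
        for j in starts[1]:
            for k in starts[2]:
                cubes.append(field[i:i + cube_dim, j:j + cube_dim, k:k + cube_dim])
    return np.stack(cubes)  # shape: (n_per_dim**3, cube_dim, cube_dim, cube_dim)

# Example: the default settings yield 20**3 = 8000 cubes of 16x16x16
field = np.random.rand(192, 192, 192)
print(extract_cubes(field).shape)  # (8000, 16, 16, 16)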

Synthetic dataset

For pure benchmarking purposes, AI4HPC can also generate synthetic datasets that mimic the memory footprint of a large CFD dataset. This option is enabled via

# command
dataDir=./
COMMAND="ai4hpc.py"
EXEC="$COMMAND \
  --nworker $SLURM_CPUS_PER_TASK \
  --data-dir $dataDir \
  --snyt \
  --snyt-dpw 100"

which generates 100 samples per device (CPU, GPU, IPU, etc.), each of size 3x192x248. The sample size can be changed in the SyntheticDataset_train class in ./Cases/src/dataloaders.py if needed, although this is rarely necessary, as increasing the number of samples eventually yields the same desired memory footprint. More information on the training options can be found in Run.
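For illustration, a synthetic dataset of this kind can be as simple as the sketch below. The class name SyntheticFlowDataset and its constructor signature are assumptions for the example; the actual implementation lives in the SyntheticDataset_train class in ./Cases/src/dataloaders.py.

# Sketch of a synthetic dataset in the spirit of SyntheticDataset_train.
# Class name, constructor signature, and tensor shape are assumptions.
import torch
from torch.utils.data import Dataset

class SyntheticFlowDataset(Dataset):
    """Generates random tensors of a fixed shape, one per requested sample."""

    def __init__(self, samples_per_worker=100, shape=(3, 192, 248)):
        self.n = samples_per_worker
        self.shape = shape

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Random values suffice for benchmarking: only the memory footprint
        # and tensor shapes matter, not the data itself.
        return torch.rand(self.shape)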