Commit e3e7c72

Add documentation about DALI proxy in EfficientNet and ResNet examples (#5800)

Signed-off-by: Joaquin Anton Guirao <[email protected]>
jantonguirao authored Feb 4, 2025
1 parent 43874c2 commit e3e7c72
Showing 4 changed files with 101 additions and 38 deletions.
34 changes: 32 additions & 2 deletions docs/examples/use_cases/pytorch/efficientnet/readme.rst
@@ -89,11 +89,27 @@ You may need to adjust the ``--batch-size`` parameter for your machine.

You can change the data loader and automatic augmentation scheme that are used by adding:

- * ``--data-backend``: ``dali`` | ``pytorch`` | ``synthetic``,
+ * ``--data-backend``: ``dali`` | ``dali_proxy`` | ``pytorch`` | ``synthetic``,
* ``--automatic-augmentation``: ``disabled`` | ``autoaugment`` | ``trivialaugment`` (the last one only for DALI),
* ``--dali-device``: ``cpu`` | ``gpu`` (only for DALI).

- By default, the DALI GPU variant with AutoAugment is used.
+ By default, the DALI GPU variant with AutoAugment is used (``dali`` and ``dali_proxy`` backends).

Data Backends
-------------

- **dali**:
Leverages a DALI pipeline along with DALI's PyTorch iterator for data loading, preprocessing, and augmentation.

- **dali_proxy**:
Uses a DALI pipeline for preprocessing and augmentation while relying on PyTorch's data loader. DALI Proxy facilitates the transfer of data to DALI for processing.
See :ref:`pytorch_dali_proxy`. A minimal usage sketch follows this list.

- **pytorch**:
Employs the native PyTorch data loader for data preprocessing and augmentation.

- **synthetic**:
Creates synthetic data on the fly, which is useful for testing and benchmarking purposes. This backend eliminates the need for actual datasets, providing a convenient way to simulate data loading.
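
The ``dali_proxy`` flow can be summarized in a short sketch. This is a minimal, illustrative example loosely following the DALI proxy documentation — the pipeline body, dataset path, and sizes are assumptions for illustration, not code from this example:

.. code-block:: python

    import numpy as np
    import torchvision.datasets as datasets
    from nvidia.dali import fn, pipeline_def, types
    from nvidia.dali.plugin.pytorch.experimental import proxy as dali_proxy

    @pipeline_def
    def decode_pipe():
        # Filepaths are delivered by the PyTorch dataloader workers via the proxy.
        filepaths = fn.external_source(name="images", no_copy=True)
        jpegs = fn.io.file.read(filepaths)
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        return fn.resize(images, size=[224, 224])

    def read_filepath(path):
        # Hand the proxy the encoded file path instead of a decoded PIL image.
        return np.frombuffer(path.encode(), dtype=np.int8)

    pipe = decode_pipe(batch_size=64, num_threads=3, device_id=0)

    with dali_proxy.DALIServer(pipe) as dali_server:
        # The proxy stands in as the dataset transform; the actual work runs in DALI.
        dataset = datasets.ImageFolder("/path/to/imagenet/train",
                                       transform=dali_server.proxy,
                                       loader=read_filepath)
        loader = dali_proxy.DataLoader(dali_server, dataset,
                                       batch_size=64, num_workers=4)
        for images, labels in loader:
            pass  # images arrive as tensors already processed by the DALI pipeline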

For example, to run EfficientNet with AMP and a batch size of 128, using DALI with TrivialAugment, invoke:

@@ -161,6 +177,20 @@ To run training benchmarks with different data loaders and automatic augmentations:
    --workspace $RESULT_WORKSPACE \
    --report-file bench_report_dali_ta.json $PATH_TO_IMAGENET

# DALI proxy with AutoAugment
python multiproc.py --nproc_per_node 8 ./main.py --amp --static-loss-scale 128 \
    --batch-size 128 --epochs 4 --no-checkpoints --training-only \
    --data-backend dali_proxy --automatic-augmentation autoaugment \
    --workspace $RESULT_WORKSPACE \
    --report-file bench_report_dali_proxy_aa.json $PATH_TO_IMAGENET

# DALI proxy with TrivialAugment
python multiproc.py --nproc_per_node 8 ./main.py --amp --static-loss-scale 128 \
    --batch-size 128 --epochs 4 --no-checkpoints --training-only \
    --data-backend dali_proxy --automatic-augmentation trivialaugment \
    --workspace $RESULT_WORKSPACE \
    --report-file bench_report_dali_proxy_ta.json $PATH_TO_IMAGENET

# PyTorch without automatic augmentations
python multiproc.py --nproc_per_node 8 ./main.py --amp --static-loss-scale 128 \
    --batch-size 128 --epochs 4 --no-checkpoints --training-only \
10 changes: 6 additions & 4 deletions docs/examples/use_cases/pytorch/resnet50/main.py
@@ -93,12 +93,14 @@ def parse():
'"dali" for DALI data loader, or "dali_proxy" for PyTorch dataloader with DALI proxy preprocessing.')
parser.add_argument('--prof', default=-1, type=int,
help='Only run 10 iterations for profiling.')
parser.add_argument('--deterministic', action='store_true')

parser.add_argument('--deterministic', action='store_true',
help='Enable deterministic behavior for reproducibility')
parser.add_argument('--fp16-mode', default=False, action='store_true',
help='Enable half precision mode.')
parser.add_argument('--loss-scale', type=float, default=1)
parser.add_argument('--channels-last', type=bool, default=False)
parser.add_argument('--loss-scale', type=float, default=1,
help='Scaling factor for loss to prevent underflow in FP16 mode.')
parser.add_argument('--channels-last', type=bool, default=False,
help='Use channels last memory format for tensors.')
parser.add_argument('-t', '--test', action='store_true',
help='Launch test mode with preset arguments')
args = parser.parse_args()
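
The two new help strings describe standard mixed-precision practice: the loss is scaled up before ``backward()`` so small FP16 gradients do not flush to zero, and channels-last switches tensors to NHWC memory layout. A minimal sketch of both ideas in plain PyTorch — illustrative only, not the code in ``main.py``:

.. code-block:: python

    import torch

    model = torch.nn.Conv2d(3, 8, 3).cuda().half()
    # Channels-last stores tensors in NHWC order, which Tensor Cores prefer.
    model = model.to(memory_format=torch.channels_last)

    x = torch.randn(4, 3, 32, 32, device="cuda", dtype=torch.float16)
    x = x.to(memory_format=torch.channels_last)

    loss_scale = 128.0  # role of --loss-scale
    loss = model(x).float().mean()

    # Scale up before backward so FP16 gradients stay representable...
    (loss * loss_scale).backward()
    # ...then scale gradients back down before the optimizer step.
    for p in model.parameters():
        p.grad.div_(loss_scale)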
94 changes: 62 additions & 32 deletions docs/examples/use_cases/pytorch/resnet50/pytorch-resnet50.rst
@@ -44,39 +44,69 @@ The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs.
python main.py -a alexnet --lr 0.01 [imagenet-folder with train and val folders]
Data loaders
------------

- **dali**:
Leverages a DALI pipeline along with DALI's PyTorch iterator for data loading, preprocessing, and augmentation.

- **dali_proxy**:
Uses a DALI pipeline for preprocessing and augmentation while relying on PyTorch's data loader. DALI Proxy facilitates the transfer of data to DALI for processing.
See :ref:`pytorch_dali_proxy`. A condensed loader-construction sketch follows this list.

- **pytorch**:
Employs the native PyTorch data loader for data preprocessing and augmentation.
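
All three loaders feed the same training loop; they differ only in how batches are produced. Below is a condensed, hypothetical dispatch — not the actual ``main.py`` code — showing how the ``pytorch`` and ``dali`` options are typically constructed (the ``dali_proxy`` variant is sketched in the EfficientNet section above):

.. code-block:: python

    import torchvision.datasets as datasets
    import torchvision.transforms as T
    from torch.utils.data import DataLoader
    from nvidia.dali import fn, pipeline_def, types
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    @pipeline_def
    def train_pipe(data_dir):
        # Read, decode, and resize entirely inside DALI.
        jpegs, labels = fn.readers.file(file_root=data_dir,
                                        random_shuffle=True, name="Reader")
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        return fn.resize(images, size=[224, 224]), labels

    def build_loader(kind, data_dir, batch_size=256, workers=4):
        if kind == "pytorch":
            ds = datasets.ImageFolder(data_dir, T.Compose(
                [T.RandomResizedCrop(224), T.ToTensor()]))
            return DataLoader(ds, batch_size=batch_size,
                              num_workers=workers, shuffle=True)
        if kind == "dali":
            pipe = train_pipe(data_dir, batch_size=batch_size,
                              num_threads=workers, device_id=0)
            return DALIGenericIterator(pipe, ["data", "label"],
                                       reader_name="Reader")
        raise ValueError(f"unknown loader: {kind}")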

Usage
-----

.. code-block:: bash
- main.py [-h] [--arch ARCH] [-j N] [--epochs N] [--start-epoch N] [-b N] [--lr LR] [--momentum M] [--weight-decay W] [--print-freq N] [--resume PATH] [-e] [--pretrained] [--opt-level] DIR
- PyTorch ImageNet Training
- positional arguments:
-   DIR                   path(s) to dataset (if one path is provided, it is assumed to have subdirectories named "train" and "val"; alternatively, train and val paths can be specified directly by providing both paths as arguments)
- optional arguments (for the full list please check `Apex ImageNet example
- <https://github.com/NVIDIA/apex/tree/master/examples/imagenet>`_)
-   -h, --help            show this help message and exit
-   --arch ARCH, -a ARCH  model architecture: alexnet | resnet | resnet101
-                         | resnet152 | resnet18 | resnet34 | resnet50 | vgg
-                         | vgg11 | vgg11_bn | vgg13 | vgg13_bn | vgg16
-                         | vgg16_bn | vgg19 | vgg19_bn (default: resnet18)
-   -j N, --workers N     number of data loading workers (default: 4)
-   --epochs N            number of total epochs to run
-   --start-epoch N       manual epoch number (useful on restarts)
-   -b N, --batch-size N  mini-batch size (default: 256)
-   --lr LR, --learning-rate LR  initial learning rate
-   --momentum M          momentum
-   --weight-decay W, --wd W  weight decay (default: 1e-4)
-   --print-freq N, -p N  print frequency (default: 10)
-   --resume PATH         path to latest checkpoint (default: none)
-   -e, --evaluate        evaluate model on validation set
-   --pretrained          use pre-trained model
-   --dali_cpu            use CPU based pipeline for DALI, for heavy GPU
-                         networks it may work better, for IO bottlenecked
-                         one like RN18 GPU default should be faster
-   --data_loader         Select data loader: "pytorch" for native PyTorch data loader,
-                         "dali" for DALI data loader, or "dali_proxy" for PyTorch dataloader with DALI proxy preprocessing.
-   --fp16-mode           enables mixed precision mode
main.py [-h] [--arch ARCH] [-j N] [--epochs N] [--start-epoch N] [-b N] [--lr LR] [--momentum M] [--weight-decay W] [--print-freq N] [--resume PATH]
[-e] [--pretrained] [--dali_cpu] [--data_loader {pytorch,dali,dali_proxy}] [--prof PROF] [--deterministic] [--fp16-mode]
[--loss-scale LOSS_SCALE] [--channels-last CHANNELS_LAST] [-t]
[DIR ...]
PyTorch ImageNet Training
positional arguments:
DIR path(s) to dataset (if one path is provided, it is assumed to have subdirectories named "train" and "val"; alternatively, train and val paths can
be specified directly by providing both paths as arguments)
options:
-h, --help show this help message and exit
--arch ARCH, -a ARCH model architecture: alexnet | convnext_base | convnext_large | convnext_small | convnext_tiny | densenet121 | densenet161 | densenet169 |
densenet201 | efficientnet_b0 | efficientnet_b1 | efficientnet_b2 | efficientnet_b3 | efficientnet_b4 | efficientnet_b5 | efficientnet_b6 |
efficientnet_b7 | efficientnet_v2_l | efficientnet_v2_m | efficientnet_v2_s | get_model | get_model_builder | get_model_weights | get_weight |
googlenet | inception_v3 | list_models | maxvit_t | mnasnet0_5 | mnasnet0_75 | mnasnet1_0 | mnasnet1_3 | mobilenet_v2 | mobilenet_v3_large |
mobilenet_v3_small | regnet_x_16gf | regnet_x_1_6gf | regnet_x_32gf | regnet_x_3_2gf | regnet_x_400mf | regnet_x_800mf | regnet_x_8gf |
regnet_y_128gf | regnet_y_16gf | regnet_y_1_6gf | regnet_y_32gf | regnet_y_3_2gf | regnet_y_400mf | regnet_y_800mf | regnet_y_8gf | resnet101 |
resnet152 | resnet18 | resnet34 | resnet50 | resnext101_32x8d | resnext101_64x4d | resnext50_32x4d | shufflenet_v2_x0_5 | shufflenet_v2_x1_0 |
shufflenet_v2_x1_5 | shufflenet_v2_x2_0 | squeezenet1_0 | squeezenet1_1 | swin_b | swin_s | swin_t | swin_v2_b | swin_v2_s | swin_v2_t | vgg11 |
vgg11_bn | vgg13 | vgg13_bn | vgg16 | vgg16_bn | vgg19 | vgg19_bn | vit_b_16 | vit_b_32 | vit_h_14 | vit_l_16 | vit_l_32 | wide_resnet101_2 |
wide_resnet50_2 (default: resnet18)
-j N, --workers N number of data loading workers (default: 4)
--epochs N number of total epochs to run
--start-epoch N manual epoch number (useful on restarts)
-b N, --batch-size N mini-batch size per process (default: 256)
--lr LR, --learning-rate LR
Initial learning rate. Will be scaled by <global batch size>/256: args.lr = args.lr*float(args.batch_size*args.world_size)/256. A warmup schedule
will also be applied over the first 5 epochs.
--momentum M momentum
--weight-decay W, --wd W
weight decay (default: 1e-4)
--print-freq N, -p N print frequency (default: 10)
--resume PATH path to latest checkpoint (default: none)
-e, --evaluate evaluate model on validation set
--pretrained use pre-trained model
--dali_cpu Runs CPU based version of DALI pipeline.
--data_loader {pytorch,dali,dali_proxy}
Select data loader: "pytorch" for native PyTorch data loader, "dali" for DALI data loader, or "dali_proxy" for PyTorch dataloader with DALI proxy
preprocessing.
--prof PROF Only run 10 iterations for profiling.
--deterministic Enable deterministic behavior for reproducibility
--fp16-mode Enable half precision mode.
--loss-scale LOSS_SCALE
Scaling factor for loss to prevent underflow in FP16 mode.
--channels-last CHANNELS_LAST
Use channels last memory format for tensors.
-t, --test Launch test mode with preset arguments
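
As a worked example of the learning-rate rule quoted under ``--lr`` (values here are illustrative): with the default ``lr`` of 0.1, a per-process batch size of 256, and 8 processes, the scaled rate is 0.8.

.. code-block:: python

    # args.lr = args.lr * float(args.batch_size * args.world_size) / 256
    lr, batch_size, world_size = 0.1, 256, 8
    scaled_lr = lr * float(batch_size * world_size) / 256
    print(scaled_lr)  # 0.8 -- reached after the 5-epoch warmup
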
1 change: 1 addition & 0 deletions docs/plugins/pytorch_dali_proxy.rst
@@ -1,3 +1,4 @@
+ .. _pytorch_dali_proxy:
PyTorch DALI Proxy
==================

