Train SwAV model

Author: mathilde@fb.com

VISSL reproduces the self-supervised approach SwAV, "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments", proposed by Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski and Armand Joulin in this paper. SwAV clusters the features while enforcing consistency between the cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning.
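For intuition, the following is a minimal PyTorch sketch of the swapped-prediction idea (illustrative only, not VISSL's implementation; the function names are hypothetical): cluster assignments are computed from prototype scores with a few Sinkhorn-Knopp iterations, and each view predicts the assignment of the other.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, epsilon=0.05, num_iters=3):
    # Turn prototype scores (B x K) into soft assignments Q whose
    # columns roughly equi-partition the batch across prototypes.
    Q = torch.exp(scores / epsilon).t()  # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(num_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # uniform prototype usage
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # unit mass per sample
        Q /= B
    return (Q * B).t()  # B x K

def swapped_prediction_loss(scores_v1, scores_v2, temperature=0.1):
    # Each view predicts the cluster assignment computed from the other view.
    q1, q2 = sinkhorn(scores_v1), sinkhorn(scores_v2)
    p1 = F.log_softmax(scores_v1 / temperature, dim=1)
    p2 = F.log_softmax(scores_v2 / temperature, dim=1)
    return -0.5 * ((q1 * p2).sum(dim=1).mean() + (q2 * p1).sum(dim=1).mean())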

How to train SwAV model

VISSL provides a yaml configuration file containing the exact hyperparameter settings to reproduce the model. VISSL implements all the components including loss, data augmentations, collators etc required for this approach.

To train a ResNet-50 model on ImageNet-1K with the 8-node setup, using a feature projection dimension of 128:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet

Using Synchronized BatchNorm for training

For training SwAV models, we convert all BatchNorm layers to Global BatchNorm. For this, VISSL supports both the PyTorch SyncBatchNorm module and NVIDIA Apex's SyncBatchNorm layers. Set the config param MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE to apex or pytorch accordingly.

If you want to use Apex, VISSL provides anaconda and pip packages of Apex (compiled with optimized C++ extensions/CUDA kernels). The Apex packages are provided for all versions of CUDA (9.2, 10.0, 10.1, 10.2, 11.0), PyTorch >= 1.4, and Python >= 3.6 and <= 3.9.

To use SyncBN during training, set the following parameters in the configuration file:

MODEL:
  SYNC_BN_CONFIG:
    CONVERT_BN_TO_SYNC_BN: True
    SYNC_BN_TYPE: apex
    # 1) if group_size=-1 -> use the VISSL default setting. We synchronize within a
    #     machine and hence will set group_size=num_gpus per node. This gives the best
    #     speedup.
    # 2) if group_size>0 -> will set group_size=value set by user.
    # 3) if group_size=0 -> no groups are created and process_group=None. This means
    #     global sync is done.
    GROUP_SIZE: 8
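For reference, the pytorch option corresponds roughly to PyTorch's own conversion utility (a minimal sketch; VISSL performs the conversion internally based on the config above):

import torch
import torchvision

model = torchvision.models.resnet50()
# Replace every BatchNorm layer with SyncBatchNorm; a non-None process_group
# restricts synchronization to that group (what GROUP_SIZE controls in VISSL).
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=None)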

Using Mixed Precision for training

The SwAV approach uses mixed precision training by default for better training speed and reduced model memory requirements. For this, we use the NVIDIA Apex library with AMP level O1.

To use mixed precision training, set the following parameters in the configuration file:

MODEL:
  AMP_PARAMS:
    USE_AMP: True
    # Use O1 as it is more robust and stable than O3. If you want to use O3, we recommend
    # the following setting:
    # {"opt_level": "O3", "keep_batchnorm_fp32": True, "master_weights": True, "loss_scale": "dynamic"}
    AMP_ARGS: {"opt_level": "O1"}
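Under the hood, this corresponds to Apex AMP initialization along these lines (a minimal sketch; VISSL wires this up from the config above):

import torch
import torchvision
from apex import amp

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# O1 patches functions to run in fp16 where safe and keeps fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# The backward pass then goes through the loss scaler:
#   with amp.scale_loss(loss, optimizer) as scaled_loss:
#       scaled_loss.backward()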

Using LARC for training

SwAV training uses LARC from NVIDIA's Apex. To use LARC, set the config option OPTIMIZER.use_larc=True. VISSL exposes LARC parameters that users can tune. Full list of LARC parameters exposed by VISSL:

OPTIMIZER:
  name: "sgd"
  use_larc: False  # supported for SGD only for now
  larc_config:
    clip: False
    eps: 1e-08
    trust_coefficient: 0.001

Note

LARC is currently supported for SGD optimizer only.
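For reference, wrapping an optimizer with Apex LARC looks like the following (a minimal sketch; VISSL does this internally when use_larc is set, and the keyword arguments map to the larc_config values above):

import torch
import torchvision
from apex.parallel.LARC import LARC

model = torchvision.models.resnet50()
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# LARC rescales each layer's update by a layer-wise trust ratio.
optimizer = LARC(base_optimizer, trust_coefficient=0.001, clip=False, eps=1e-8)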

Vary the training loss settings

Users can adjust several loss settings from the command line to train the model with different hyperparameters. For example, to use a temperature of 0.2 for the logits and an epsilon of 0.04, the training command would look like:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.LOSS.swav_loss.temperature=0.2 \
    config.LOSS.swav_loss.epsilon=0.04

The full set of loss parameters that VISSL allows modifying:

swav_loss:
  temperature: 0.1
  use_double_precision: False
  normalize_last_layer: True
  num_iters: 3
  epsilon: 0.05
  temp_hard_assignment_iters: 0
  crops_for_assign: [0, 1]
  embedding_dim: 128            # automatically inferred from HEAD params
  num_crops: 2                  # automatically inferred from data transforms
  num_prototypes: [3000]        # automatically inferred from model HEAD settings
  # for dumping the debugging info in case loss becomes NaN
  output_dir: ""                # automatically inferred and set to checkpoint dir
  queue:
    start_iter: 0
    queue_length: 0             # automatically adjusted to ensure queue_length % global batch size = 0
    local_queue_length: 0       # automatically inferred to queue_length // world_size
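The automatic queue adjustments described in the comments above amount to the following arithmetic (an illustrative sketch; the function name is hypothetical):

def adjust_queue_lengths(queue_length, global_batch_size, world_size):
    # Round down so queue_length is a multiple of the global batch size.
    queue_length -= queue_length % global_batch_size
    # Each worker holds an equal shard of the queue.
    return queue_length, queue_length // world_size

# e.g. global batch size 4096 on 64 GPUs:
adjust_queue_lengths(3840, 4096, 64)   # -> (0, 0): queue shorter than a batch, disabled
adjust_queue_lengths(8192, 4096, 64)   # -> (8192, 128)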

Training different model architecture

VISSL supports many backbone architectures, including ResNe(X)ts, wider ResNets, and RegNets. Some examples below:

  • Train ResNet-101:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.MODEL.TRUNK.NAME=resnet config.MODEL.TRUNK.RESNETS.DEPTH=101
  • Train ResNet-50-w2 (2x wider):

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.MODEL.TRUNK.NAME=resnet config.MODEL.TRUNK.RESNETS.DEPTH=50 \
    config.MODEL.TRUNK.RESNETS.WIDTH_MULTIPLIER=2
  • Train RegNetY-400MF:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.MODEL.TRUNK.NAME=regnet config.MODEL.TRUNK.REGNET.name=regnet_y_400mf
  • Train RegNetY-256GF:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.MODEL.TRUNK.NAME=regnet \
    config.MODEL.TRUNK.REGNET.depth=27 \
    config.MODEL.TRUNK.REGNET.w_0=640 \
    config.MODEL.TRUNK.REGNET.w_a=230.83 \
    config.MODEL.TRUNK.REGNET.w_m=2.53 \
    config.MODEL.TRUNK.REGNET.group_width=373 \
    config.MODEL.HEAD.PARAMS=[["swav_head", {"dims": [10444, 10444, 128], "use_bn": False, "num_clusters": [3000]}]]

Training with Multi-Crop data augmentation

SwAV is trained using the multi-crop augmentation proposed in the SwAV paper.

Multi-crop augmentation allows using more positives, including positives of different resolutions. To train SwAV with multi-crop augmentation, say 2x224 + 4x96 (i.e. 2 crops of resolution 224 and 4 crops of resolution 96), the training command looks like:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    +config/pretrain/swav/transforms=multicrop_2x224_4x96

The multicrop_2x224_4x96.yaml configuration file changes the transform settings to produce 6 crops at the corresponding resolutions.

Varying the multi-crop augmentation settings

VISSL allows modifying the crops to use. Full settings exposed:

TRANSFORMS:
  - name: ImgPilToMultiCrop
    total_num_crops: 6                      # Total number of crops to extract
    num_crops: [2, 4]                       # Number of crops of each type
    size_crops: [160, 96]                   # Side length (height = width) of each crop type
    crop_scales: [[0.08, 1], [0.05, 0.14]]  # Scale of the crop
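In torchvision terms, the transform above amounts to something like the following sketch (illustrative only, not VISSL's ImgPilToMultiCrop implementation):

from torchvision import transforms

def build_multicrop(num_crops=(2, 4), size_crops=(160, 96),
                    crop_scales=((0.08, 1.0), (0.05, 0.14))):
    # One RandomResizedCrop per crop, grouped by crop type.
    ops = []
    for n, size, scale in zip(num_crops, size_crops, crop_scales):
        ops.extend([transforms.RandomResizedCrop(size, scale=scale)] * n)
    return ops

# Applying each op to a PIL image yields total_num_crops views:
# views = [op(pil_image) for op in build_multicrop()]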

Training with different MLP head

By default, the original SwAV approach uses a 2-layer MLP head, similar to the SimCLR approach. VISSL allows attaching any desired head. To modify the MLP head (more layers, different dimensions, etc.), see the following examples:

  • 3-layer MLP head: Use the following head (example for ResNet model)

MODEL:
  HEAD:
    PARAMS: [
      ["swav_head", {"dims": [2048, 2048, 2048, 128], "use_bn": True, "num_clusters": [3000]}],
    ]
  • 2-layer MLP head with hidden dimension 4096: Use the following head (example for ResNet model)

MODEL:
  HEAD:
    PARAMS: [
      ["swav_head", {"dims": [2048, 4096, 128], "use_bn": True, "num_clusters": [3000]}],
    ]
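For intuition, a swav_head with dims [2048, 4096, 128] and num_clusters [3000] amounts to roughly the following module (an illustrative sketch; the real implementation lives in VISSL's model heads):

import torch.nn as nn

projection = nn.Sequential(
    nn.Linear(2048, 4096),
    nn.BatchNorm1d(4096),   # present because use_bn: True
    nn.ReLU(inplace=True),
    nn.Linear(4096, 128),   # embedding, L2-normalized before scoring prototypes
)
prototypes = nn.Linear(128, 3000, bias=False)  # one matrix per num_clusters entry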

Vary the number of epochs

The number of epochs used to train SwAV models can be varied directly from the command line. For example, to train the SwAV model for 100 epochs instead, pass the num_epochs parameter:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.OPTIMIZER.num_epochs=100

Vary the number of gpus

VISSL makes it easy to vary the number of GPUs used in training. For example, to train the SwAV model on 4 machines (32 GPUs) or on a single GPU, the changes required are:

  • Training on 1 GPU:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 config.DISTRIBUTED.NUM_NODES=1
  • Training on 4 machines, i.e. 32 GPUs:

python tools/run_distributed_engines.py config=pretrain/swav/swav_8node_resnet \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=8 config.DISTRIBUTED.NUM_NODES=4

Note

Please adjust the learning rate following the linear scaling rule from ImageNet in 1-Hour if you change the number of GPUs.
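The rule amounts to scaling the base learning rate linearly with the global batch size, for example (a sketch with illustrative values):

def scaled_lr(base_lr, base_batch_size, num_gpus, batch_size_per_gpu):
    # Linear scaling rule: lr grows proportionally with the global batch size.
    return base_lr * (num_gpus * batch_size_per_gpu) / base_batch_size

scaled_lr(0.3, 256, 64, 64)   # -> 4.8 at global batch size 4096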

Pre-trained models

See the VISSL Model Zoo for the PyTorch models pre-trained with the SwAV approach and their benchmarks.

Citations

  • SwAV

@misc{caron2020unsupervised,
    title={Unsupervised Learning of Visual Features by Contrasting Cluster Assignments},
    author={Mathilde Caron and Ishan Misra and Julien Mairal and Priya Goyal and Piotr Bojanowski and Armand Joulin},
    year={2020},
    eprint={2006.09882},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}