Train MoCo model


VISSL reproduces the self-supervised approach MoCo, introduced in “Momentum Contrast for Unsupervised Visual Representation Learning” by Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. The MoCo baselines were later improved by Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He in “Improved Baselines with Momentum Contrastive Learning” (MoCo-v2).

VISSL closely follows the implementation provided by the MoCo authors.

How to train a MoCo (or MoCo-v2) model

VISSL provides a YAML configuration file containing the exact hyperparameter settings needed to reproduce the model. VISSL also implements all the components this approach requires, including the loss, data augmentations, and collators.

To train a ResNet-50 model on 8 GPUs on the ImageNet-1K dataset with the MoCo-v2 approach, using a feature projection dimension of 128:

python tools/ config=pretrain/moco/moco_1node_resnet

By default, VISSL provides a configuration file for the MoCo-v2 model, since it gives better baseline numbers. To train the original MoCo baseline instead, make two changes to the MoCo configuration file:

  • In config.DATA.TRAIN.TRANSFORMS, remove the ImgPilGaussianBlur transform.

  • Set config.MODEL.HEAD.PARAMS=[["mlp", {"dims": [2048, 128]}]], i.e. replace the MLP head with a single fc head.
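The two changes above can be sketched on a plain Python dictionary. This is only an illustration: the dictionary is a hypothetical stand-in for the real YAML file, the placeholder transform names and the hidden MLP layer are assumptions, and the "name" key for transforms is assumed as well. Only ImgPilGaussianBlur and the final [2048, 128] layer come from the text.

```python
import copy

# Hypothetical stand-in for the relevant part of a MoCo-v2 config.
# Only ImgPilGaussianBlur and the [2048, 128] layer are taken from the text;
# every other entry is a placeholder chosen for illustration.
moco_v2_config = {
    "DATA": {"TRAIN": {"TRANSFORMS": [
        {"name": "PlaceholderCrop"},
        {"name": "ImgPilGaussianBlur"},
        {"name": "PlaceholderToTensor"},
    ]}},
    "MODEL": {"HEAD": {"PARAMS": [
        ["mlp", {"dims": [2048, 2048]}],  # assumed hidden layer of the MLP head
        ["mlp", {"dims": [2048, 128]}],
    ]}},
}


def to_moco_v1(config):
    """Return a copy of the config with the two MoCo-v1 changes applied."""
    cfg = copy.deepcopy(config)
    # Change 1: drop the Gaussian blur augmentation.
    transforms = cfg["DATA"]["TRAIN"]["TRANSFORMS"]
    cfg["DATA"]["TRAIN"]["TRANSFORMS"] = [
        t for t in transforms if t["name"] != "ImgPilGaussianBlur"
    ]
    # Change 2: replace the MLP head with a single fc layer (2048 -> 128).
    cfg["MODEL"]["HEAD"]["PARAMS"] = [["mlp", {"dims": [2048, 128]}]]
    return cfg


moco_v1_config = to_moco_v1(moco_v2_config)
```

The helper works on a deep copy, so the MoCo-v2 settings stay available for comparison.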

Vary the training loss settings

Several settings can be adjusted from the command line to train the model with different hyperparameters. For example, to use a different momentum value (say 0.99) for the memory and a different temperature (0.5) for the logits, the MoCo training command would look like:

python tools/ config=pretrain/moco/moco_1node_resnet \
    config.LOSS.moco_loss.temperature=0.5 \
    config.LOSS.moco_loss.momentum=0.99

The full set of loss params that VISSL allows modifying:

  embedding_dim: 128
  queue_size: 65536
  momentum: 0.999
  temperature: 0.2
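To make the role of these four parameters more concrete, here is a minimal pure-Python sketch of the MoCo mechanics. It is not VISSL's implementation: the function names are hypothetical, vectors are plain lists of length embedding_dim, and the embeddings are assumed to be L2-normalized already.

```python
import math


def momentum_update(key_params, query_params, momentum=0.999):
    """Moving-average update of the key encoder: k <- m * k + (1 - m) * q."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query, positive_key, queue, temperature=0.2):
    """Contrastive loss: cross-entropy over [positive, negatives] logits."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, negative) / temperature for negative in queue]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits))
    # The positive pair sits at index 0.
    return log_sum_exp - logits[0]


def enqueue(queue, new_keys, queue_size=65536):
    """FIFO memory: append the newest keys and keep the last queue_size."""
    return (queue + new_keys)[-queue_size:]
```

In this sketch, a momentum closer to 1 makes the key encoder change more slowly, a lower temperature sharpens the distribution over the logits, and queue_size bounds how many negative keys are compared against each query.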

Training different model architecture

VISSL supports many backbone architectures, including ResNets, ResNeXts, and wider ResNets. Some examples:

  • Train ResNet-101:

python tools/ config=pretrain/moco/moco_1node_resnet \
    config.MODEL.TRUNK.NAME=resnet config.MODEL.TRUNK.RESNETS.DEPTH=101

  • Train ResNet-50-w2 (2x wider ResNet-50):

python tools/ config=pretrain/moco/moco_1node_resnet \
    config.MODEL.TRUNK.NAME=resnet config.MODEL.TRUNK.RESNETS.DEPTH=50 \

Vary the number of gpus

VISSL makes it easy to vary the number of GPUs used in training. For example, to train the MoCo model on 4 machines (32 GPUs) or on a single GPU, the changes required are:

  • Training on 1 GPU:

python tools/ config=pretrain/moco/moco_1node_resnet \

  • Training on 4 machines, i.e. 32 GPUs:

python tools/ config=pretrain/moco/moco_1node_resnet \


If you change the number of GPUs, adjust the learning rate following the linear scaling rule from “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”. Note, however, that according to the MoCo authors this rule does not work very well for MoCo.

If you change the number of GPUs for MoCo training, the models require longer training to reproduce the results. We therefore recommend consulting the MoCo paper.
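As a starting point, the linear scaling rule mentioned above can be sketched as follows. The base learning rate, per-GPU batch size, and reference GPU count are illustrative placeholders rather than values read from the VISSL config, and the function names are hypothetical. Given the caveat above, treat the result as an initial guess to tune, not a guarantee.

```python
def total_gpus(num_nodes, gpus_per_node):
    """Total number of GPUs across all machines."""
    return num_nodes * gpus_per_node


def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size


# Illustrative placeholders (assumptions, not taken from the VISSL config).
BASE_LR = 0.03
PER_GPU_BATCH = 32
BASE_GPUS = 8

for nodes, per_node in [(1, 1), (1, 8), (4, 8)]:
    gpus = total_gpus(nodes, per_node)
    batch_size = gpus * PER_GPU_BATCH
    lr = linearly_scaled_lr(BASE_LR, BASE_GPUS * PER_GPU_BATCH, batch_size)
    print(f"{gpus:>2} GPUs: batch size {batch_size}, learning rate {lr:.5f}")
```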

Pre-trained models

See the VISSL Model Zoo for PyTorch models pre-trained with VISSL using the MoCo-v2 approach, along with their benchmarks.


Citations

  • MoCo: Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. “Momentum Contrast for Unsupervised Visual Representation Learning.”

  • MoCo-v2: Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. “Improved Baselines with Momentum Contrastive Learning.”