Train MoCo model¶
Author: lefaudeux@fb.com
VISSL reproduces the self-supervised approach MoCo (Momentum Contrast for Unsupervised Visual Representation Learning) proposed by Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick in this paper. The MoCo baselines were further improved by Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He in “Improved Baselines with Momentum Contrastive Learning”, proposed in this paper.
VISSL closely follows the implementation provided by MoCo authors themselves.
How to train a MoCo (and MoCo-v2) model¶
VISSL provides a yaml configuration file containing the exact hyperparameter settings to reproduce the model. VISSL implements all the components (loss, data augmentations, collators, etc.) required for this approach.
To train a ResNet-50 model on 8 gpus on the ImageNet-1K dataset with the MoCo-v2 approach, using a feature projection dimension of 128:
python tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet
By default, VISSL provides the configuration file for the MoCo-v2 model, as it has better baseline numbers. To train the original MoCo baseline instead, users should make 2 changes to the moco configuration file:

- change config.DATA.TRAIN.TRANSFORMS by removing the ImgPilGaussianBlur transform.
- change config.MODEL.HEAD.PARAMS=[["mlp", {"dims": [2048, 128]}]] i.e. replace the MLP-head with an fc-head.
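In yaml form, the head override for the MoCo (v1) baseline corresponds to a config fragment like the following. This is a sketch: the key names follow the VISSL config paths quoted above, and the surrounding structure of the shipped config file may differ:

```yaml
MODEL:
  HEAD:
    PARAMS: [["mlp", {"dims": [2048, 128]}]]  # a single fc layer 2048 -> 128 instead of the MoCo-v2 MLP head
```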
Vary the training loss settings¶
Users can adjust several settings from the command line to train the model with different hyperparameters. For example, to use a different momentum value (say 0.99) for the memory queue and a different temperature of 0.5 for the logits, the MoCo training command would look like:
python tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet \
config.LOSS.moco_loss.temperature=0.5 \
config.LOSS.moco_loss.momentum=0.99
The full set of loss parameters that VISSL allows modifying:
moco_loss:
  embedding_dim: 128
  queue_size: 65536
  momentum: 0.999
  temperature: 0.2
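The roles of these parameters can be illustrated with a minimal PyTorch sketch of the MoCo contrastive loss and the momentum (EMA) update of the key encoder. This is an illustration of the mechanism described in the MoCo paper, not VISSL's actual implementation; the function names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def moco_contrastive_loss(q, k, queue, temperature=0.2):
    """Illustrative InfoNCE loss: each query q matches its momentum-encoded
    key k (the positive) against queue_size negatives from the memory queue.

    q, k:   (batch, embedding_dim) L2-normalized embeddings
    queue:  (embedding_dim, queue_size) memory of past keys
    """
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)  # (batch, 1)
    l_neg = torch.einsum("nc,ck->nk", q, queue)           # (batch, queue_size)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.shape[0], dtype=torch.long)    # positive sits at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, momentum=0.999):
    """The key encoder tracks the query encoder via an exponential moving average."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(momentum).add_(p_q, alpha=1.0 - momentum)
```

A lower temperature sharpens the logits, and a momentum close to 1 makes the key encoder evolve slowly, which the MoCo paper argues is important for keeping the queued keys consistent.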
Training different model architecture¶
VISSL supports many backbone architectures, including ResNe(X)ts and wider ResNets. Some examples are below:
Train ResNet-101:
python tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet \
config.MODEL.TRUNK.NAME=resnet config.MODEL.TRUNK.RESNETS.DEPTH=101
Train ResNet-50-w2 (2x wider ResNet-50):
python tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet \
config.MODEL.TRUNK.NAME=resnet config.MODEL.TRUNK.RESNETS.DEPTH=50 \
config.MODEL.TRUNK.RESNETS.WIDTH_MULTIPLIER=2
Vary the number of gpus¶
VISSL makes it extremely easy to vary the number of gpus used in training. For example, to train the MoCo model on 4 machines (32 gpus) or on 1 gpu, the required changes are:
Training on 1 gpu:
python tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet \
config.DISTRIBUTED.NUM_PROC_PER_NODE=1
Training on 4 machines, i.e. 32 gpus:
python tools/run_distributed_engines.py config=pretrain/moco/moco_1node_resnet \
config.DISTRIBUTED.NUM_PROC_PER_NODE=8 config.DISTRIBUTED.NUM_NODES=4
Note
Please adjust the learning rate following ImageNet in 1-Hour if you change the number of gpus. Note, however, that the MoCo authors report in the paper that this scaling rule does not work very well for MoCo.
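For concreteness, the linear scaling rule from ImageNet in 1-Hour can be sketched as follows. The reference values (base learning rate 0.03 at a global batch size of 256) are the defaults reported in the MoCo paper, quoted here as an assumption rather than read from the VISSL config:

```python
def linear_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: the learning rate grows proportionally
    with the total (global) batch size across all gpus."""
    return base_lr * new_batch_size / base_batch_size

# Assumed MoCo reference setup: lr 0.03 at a global batch size of 256 (8 gpus).
# Going to 32 gpus with the same per-gpu batch quadruples the global batch,
# which would suggest lr 0.12 -- subject to the caveat above that the MoCo
# authors found this rule does not transfer well to MoCo.
lr_32_gpus = linear_scaled_lr(0.03, 256, 1024)
```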
Note
If you change the number of gpus for MoCo training, note that MoCo models require longer training to reproduce results. We therefore recommend that users consult the MoCo paper.
Pre-trained models¶
See the VISSL Model Zoo for the PyTorch models pre-trained with VISSL using the MoCo-v2 approach, along with their benchmarks.
Citations¶
MoCo
@misc{he2020momentum,
  title={Momentum Contrast for Unsupervised Visual Representation Learning},
  author={Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross Girshick},
  year={2020},
  eprint={1911.05722},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
MoCo-v2
@misc{chen2020improved,
  title={Improved Baselines with Momentum Contrastive Learning},
  author={Xinlei Chen and Haoqi Fan and Ross Girshick and Kaiming He},
  year={2020},
  eprint={2003.04297},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}