Training Vision Transformer models¶
Author: matthew.l.leavitt@gmail.com
VISSL contains implementations of multiple vision transformer model variants:
Vision Transformers (ViT): Published in Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), ViT was a breakthrough for transformers in vision tasks, and comprises a set of model architectures and hyperparameters that closely follow Vaswani et al.’s original implementation of transformers for NLP.
Data-efficient image Transformers (DeiT): Published in Touvron et al., Training data-efficient image transformers & distillation through attention (2020), DeiT is architecturally similar to ViT, but is distinguished by its hyperparameters and use of distillation during training. Training with distillation is not currently supported in VISSL, but DeiT provides benefits over ViT even when training without distillation.
Convolutional Vision Transformer (ConViT): Published in d’Ascoli et al., ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases (2021), the ConViT was designed with the goal of combining the expressivity of transformers with the sample-efficiency of the convolutional inductive bias. ConViT achieves this by replacing the standard self-attention layers of the vision transformer with “gated positional self-attention” layers that are initialized to perform both convolution and self-attention, and learn the optimal trade-off between the two functions.
We start by demonstrating how to train ViTs, and continue with examples of training DeiTs and ConViTs. The hyperparameter provided in this walkthrough and the provided config files reflect the authors’ recommendations from the associated publications, when available. However, many of the model-method combinations for which we have config files (e.g. ConViT with SimCLR) have not been examined in published work, so the hyperparameters are what we found to work in preliminary analyses. If you find something that works, please contribute new config files to VISSL!
Supervised ViT training on 1 gpu¶
VISSL provides yaml configuration files for many of the common hyperparameter settings for different vision transformer model variants. We will start with supervised training of a ViT-baseline (ViT-B) model on the ImageNet-1k dataset (see Using Data to set up your dataset) using a single GPU:
python tools/run_distributed_engines.py config=pretrain/vision_transformer/supervised/1gpu_vit_example
Let’s take a look at the model
section of the config file:
MODEL:
GRAD_CLIP:
USE_GRAD_CLIP: True
TRUNK:
NAME: vision_transformer
VISION_TRANSFORMERS:
IMAGE_SIZE: 224
PATCH_SIZE: 16
NUM_LAYERS: 12
NUM_HEADS: 12
HIDDEN_DIM: 768
MLP_DIM: 3072
# MLP and projection layer dropout rate
DROPOUT_RATE: 0
# Attention dropout rate
ATTENTION_DROPOUT_RATE: 0
# Use the token for classification. Currently no alternatives
# supported
CLASSIFIER: token
# Stochastic depth dropout rate. Turning on stochastic depth and
# using aggressive augmentation is essentially the difference
# between a DeiT and a ViT.
DROP_PATH_RATE: 0
QKV_BIAS: False # Bias for QKV in attention layers.
QK_SCALE: False # Scale
HEAD:
PARAMS: [
["vision_transformer_head", {"in_plane": 768, "hidden_dim": 3072,
"num_classes": 1000}],
]
Starting at the top, we can see that MODEL.GRAD_CLIP.USE_GRAD_CLIP
is set to True
, indicating that gradients beyond a certain magnitude will be clipped during training, as per the ViT authors’ training recipe. What is this “certain magnitude”? You can find all the default config values in vissl/config/defaults.yaml
. Pro-tip: defaults.yaml
’s heavy annotations make it a great resource for figuring out how VISSL works.
Moving on to MODEL.TRUNK.NAME
, we can see that we are using a vision_transformer
, which corresponds to class of model in vissl/models/trunks
. The architectural hyperparameters are contained in MODEL.TRUNK.VISION_TRANSFORMERS
(again, see defaults.yaml
for details on these hyperparameters). These are the appropriate architectural hyperparameters for a ViT-B as per the publication.
The head architecture is specified in MODEL.HEAD.PARAMS
. See the models documentation for more information about how to specify model head parameters. "in_plane"
is the dimensionality of the input to the head, which must match the output dimensionality of the trunk, which for the ViT-B is 768.
Let’s move on to the OPTIMIZER
section of the configuration file:
OPTIMIZER:
name: adamw
weight_decay: 0.05
num_epochs: 300
betas: [.9, .999] # for Adam/AdamW
param_schedulers:
lr:
auto_lr_scaling:
auto_scale: True
base_value: 0.0005
base_lr_batch_size: 1024
name: composite
schedulers:
- name: linear
start_value: 0.0
end_value: 0.0005
- name: cosine
start_value: 0.0005
end_value: 0
interval_scaling: [rescaled, rescaled]
update_interval: step
lengths: [0.017, 0.983]
# Parameters to omit from regularization.
# We don't want to regularize the class token or position in the ViT.
non_regularized_parameters: [pos_embedding, class_token]
Again, these hyperparameters reflect the authors’ recipe in the original ViT publication. It’s also worth pointing out that VISSL offers a lot control of the optimizer, so be sure to read up on it and poke around in vissl/config/defaults.yaml
. AdamW thus far seems like the most consistently successful optimizer for training vision transformers, so we use it in all our config files.
This config file is for a ViT-B16. What if we wanted instead to train the next larger ViT, ViT-L? This would require the following changes to the model architecture parameters:
MODEL:
GRAD_CLIP:
USE_GRAD_CLIP: True
TRUNK:
NAME: vision_transformer
VISION_TRANSFORMERS:
IMAGE_SIZE: 224
PATCH_SIZE: 16
NUM_LAYERS: 24 # Increased from 12->24
NUM_HEADS: 16 # Increased from 12->16
HIDDEN_DIM: 1024 # Increased from 768->1024
MLP_DIM: 4096 # Increased from 3072->4096
DROPOUT_RATE: 0.1
ATTENTION_DROPOUT_RATE: 0
CLASSIFIER: token
DROP_PATH_RATE: 0
QKV_BIAS: False # Bias for QKV in attention layers.
QK_SCALE: False # Scale
HEAD:
PARAMS: [
["vision_transformer_head", {"in_plane": 1024, "hidden_dim": 4096,
"num_classes": 1000}],
] # in_plane increased from -> 768->1024
Changing only these parameters would likely lead to an out-of-memory error due to the size difference between the ViT-B and ViT-L, so we also need to decrease the batch size:
DATA:
TRAIN:
BATCHSIZE_PER_REPLICA: 16 # Reduced from 128->32
...
(unchanged parameters skipped for brevity)
...
TEST:
BATCHSIZE_PER_REPLICA: 64 # Reduced from 256->64
MoCo ViT-B16 training¶
config/pretrain/vision_transformer/moco/vit_b16.yaml
is the configuration file for training a ViT-B16 with MoCo. There are a few key differences between this configuration file and the configuration for 1-gpu supervised training. First, the data parameters:
DATA:
NUM_DATALOADER_WORKERS: 5
TRAIN:
DATA_SOURCES: [disk_folder]
DATASET_NAMES: [imagenet1k_folder]
BATCHSIZE_PER_REPLICA: 128
LABEL_TYPE: sample_index # just an implementation detail. Label isn't used
TRANSFORMS:
- name: ImgReplicatePil
num_times: 2
- name: RandomResizedCrop
size: 224
- name: RandomHorizontalFlip
p: 0.5
- name: ImgPilColorDistortion
strength: 1.0
- name: ImgPilGaussianBlur
p: 0.5
radius_min: 0.1
radius_max: 2.0
- name: ToTensor
- name: Normalize
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
COLLATE_FUNCTION: moco_collator
MMAP_MODE: True
COPY_TO_LOCAL_DISK: False
COPY_DESTINATION_DIR: /tmp/imagenet1k/
DROP_LAST: True
Most of the contrastive training schemes require duplicating each sample, which is achieved in this case by using the transformation ImgReplicatePil
, which is specified in DATA.TRAIN.TRANSFORMS
. Many of the self-supervised methods also require a specific data collator, specified in DATA.TRAIN.COLLATE_FUNCTION
. See Using Data for more details.
The LOSS section of the config file specifies the parameters for the MoCo loss:
LOSS:
name: moco_loss
moco_loss:
embedding_dim: 128
queue_size: 65536
momentum: 0.999
temperature: 0.2
The output dimensionality of the model head must match LOSS.moco_loss.embedding_dim
.
If you move to the bottom of the file, you can see that this file specifies using 32 gpus across 4 machines:
DISTRIBUTED:
BACKEND: nccl
NUM_NODES: 4
NUM_PROC_PER_NODE: 8
RUN_ID: "60215"
MACHINE:
DEVICE: gpu
See the documentation on running large jobs for more details on scaling up!
Training DeiT with SwAV¶
This section primarily addresses the differences between DeiT and ViT. See here for detailed information about how to use SwAV. Aside from training with distillation, which is not currently supported in VISSL, the differences between DeiT and ViT are mostly in the choice of hyperparameters (see Table 9 in the DeiT paper for details):
MODEL:
TRUNK:
NAME: vision_transformer
VISION_TRANSFORMERS:
IMAGE_SIZE: 224
PATCH_SIZE: 16
NUM_LAYERS: 12
NUM_HEADS: 16
HIDDEN_DIM: 768
MLP_DIM: 3072
CLASSIFIER: token
DROPOUT_RATE: 0 # 0.1 for ViT
ATTENTION_DROPOUT_RATE: 0
DROP_PATH_RATE: 0.1 # stochastic depth dropout probability. 0 for ViT
DROP_PATH_RATE: 0
QKV_BIAS: False # Bias for QKV in attention layers.
QK_SCALE: False # Scale
The DeiT uses stochastic depth, which is set via MODEL.TRUNK.VISION_TRANSORMERS.DROP_PATH_RATE
. In contrast to ViT, DeiT does not use gradient clipping. DeiT also uses a number of data augmentations:
DATA:
NUM_DATALOADER_WORKERS: 8
TRAIN:
DATA_SOURCES: [disk_folder]
DATASET_NAMES: [imagenet1k_folder]
LABEL_TYPE: "zero"
BATCHSIZE_PER_REPLICA: 16
DROP_LAST: True
TRANSFORMS:
- name: ImgPilToMultiCrop
total_num_crops: 2
size_crops: [224]
num_crops: [2]
crop_scales: [[0.14, 1]]
- name: RandomHorizontalFlip
- name: RandAugment
magnitude: 9
magnitude_std: 0.5
increasing_severity: True
- name: ColorJitter
brightness: 0.4
contrast: 0.4
saturation: 0.4
hue: 0.4
- name: ToTensor
- name: RandomErasing
p: 0.25
- name: Normalize
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
COLLATE_FUNCTION: cutmixup_collator
COLLATE_FUNCTION_PARAMS: {
"ssl_method": "swav",
"mixup_alpha": 1.0, # mixup alpha value, mixup is active if > 0.
"cutmix_alpha": 1.0, # cutmix alpha value, cutmix is active if > 0.
"prob": 1.0, # probability of applying mixup or cutmix per batch or element
"switch_prob": 0.5, # probability of switching to cutmix instead of mixup when both are active
"mode": "batch", # how to apply mixup/cutmix params (per 'batch', 'pair' (pair of elements), 'elem' (element)
"correct_lam": True, # apply lambda correction when cutmix bbox clipped by image borders
"label_smoothing": 0.1, # apply label smoothing to the mixed target tensor
"num_classes": 1 # number of classes for target
}
DeiT uses RandAugment, Random Erasing, MixUp, CutMix, and Label Smoothing. Note that MixUp, CutMix, and Label Smoothing are not implemented as VISSL transforms, but instead as a custom collator DATA.TRAIN.COLLATE_FUNCTION: cutmixup_collator
, and using Label Smoothing requires setting DATA.TRAIN.LABEL_TYPE: "zero"
(see vissl/config/defaults.yaml
for details).
The LOSS
section contains the parameters for the SwAV loss (See here for detailed information about how to use SwAV):
LOSS:
name: swav_loss
swav_loss:
temperature: 0.1
use_double_precision: False
normalize_last_layer: True
num_iters: 3
epsilon: 0.05
crops_for_assign: [0, 1]
queue:
queue_length: 0
start_iter: 0
ConViT¶
ConViT was designed with the goal of combining the expressivity of transformers with the sample-efficiency of the convolutional inductive bias. This is achieved by modifying the self-attention layers. In addition to the standard N self-attention heads in each layer, each self-attention head is paired with a positional attention head. The positional attention heads are similar to the standard self-attention heads, except their weights are initialized such that they perform convolution. The network then learns the convolutional kernel weights for the positional attention heads (in addition to all the other parameters that are normally learned in a transformer during training), as well as learning a gating parameter that controls the relative contribution of positional- vs. standard self-attention for each pair of heads. These gated positional self-attention (GPSA) heads allow the network to leverage the benefits of convolution without the rigid structure imposed by traditional convolutional architectures. Let’s take a look at the MODEL
section of configs/config/pretrain/vision_transformer/supervised/16_gpu_convit_b
(a ConViT-B+ in the paper) to see how the ConViT differs from the ViT and DeiT:
MODEL:
TRUNK:
NAME: convit
VISION_TRANSFORMERS:
IMAGE_SIZE: 224
PATCH_SIZE: 16
NUM_LAYERS: 12
NUM_HEADS: 16
HIDDEN_DIM: 1024 # Hidden = 64 * NUM_HEADS
MLP_DIM: 4096 # MLP dimension = 4 * HIDDEN_DIM
CLASSIFIER: token
DROPOUT_RATE: 0
ATTENTION_DROPOUT_RATE: 0
DROP_PATH_RATE: 0.1 # stochastic depth dropout probability
QKV_BIAS: False # Bias for QKV in attention layers.
QK_SCALE: False # Scale
CONVIT:
N_GPSA_LAYERS: 10 # Number of gated positional self-attention layers. Remaining layers are standard self-attention layers.
CLASS_TOKEN_IN_LOCAL_LAYERS: False # Whether to add class token in GPSA layers. Recommended not to because it has been shown to lower performance.
# Locality strength determines how much the positional attention is focused on the
# patch of maximal attention. "Alpha" in the paper. Equivalent to
# the temperature of positional attention softmax.
LOCALITY_STRENGTH: 1.
# Dimensionality of the relative positional embeddings * 1/3
LOCALITY_DIM: 10
# Whether to initialize the positional attention to be local
# (equivalent to a convolution). Not much of a point in having GPSA if not True.
USE_LOCAL_INIT: True
HEAD:
PARAMS: [
["mlp", {"dims": [1024, 1000]}],
] # No hidden layer in head
We use a ConViT trunk by specifying MODEL.TRUNK.NAME: convit
. The parameters that ConViT has in common with other vision transformer trunks, such as NUM_LAYERS
are specified in MODEL.TRUNK.VISION_TRANSFORMERS
, just as with the ViT and DeiT. The ConViT-specific parameters are specified in MODEL.TRUNK.CONVIT
. N_GPSA_LAYERS
specifies the number of GPSA layers. The remaining NUM_LAYERS - N_GPSA_LAYERS
layers (in this case 12 - 10 = 2) will be standard self-attention layers. CLASS_TOKEN_IN_LOCAL_LAYERS
controls whether to include the class token from the beginning, and thus in the GPSA layers, or to add it at the first self-attention layer after the GPSA layers. The ConViT authors found that including the class token in the GPSA layers was detrimental to performance. LOCALITY_STRENGTH
controls the “narrowness” of the positional attention (see Figure 3 in the paper). The ConViT also features a single linear head, in contrast to the MLP head of the ViT and DeiT.
Additional information¶
Other important factors related to training include:
Synchronized batch norm: Vision transformers typically don’t use batch norm, but many self-supervised learning methods obtain optimal performance when using heads that have batch norm. Ensure sync batch norm is set up properly if you’re using batch norm and training on multiple GPUs. See the Swav Documentation for a walk-through on sync batch norm.
Mixed precision: Using mixed precision variables can reduce memory usage and afford larger batch sizes. See the Swav Documentation for a walk-through on sync mixed precision training.
Data augmentations: Read about data augmentations in VISSL; the Swav Documentation has details about using multi-crop.
Pre-trained models¶
Pre-trained models will eventually be available in VISSL Model Zoo
Citations¶
ViT
@misc{dosovitskiy2020image,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
year={2020},
eprint={2010.11929},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
DeiT
@misc{touvron2021training,
title={Training data-efficient image transformers & distillation through attention},
author={Hugo Touvron and Matthieu Cord and Matthijs Douze and Francisco Massa and Alexandre Sablayrolles and Hervé Jégou},
year={2021},
eprint={2012.12877},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
ConViT
@misc{dascoli2021convit,
title={ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases},
author={Stéphane d'Ascoli and Hugo Touvron and Matthew Leavitt and Ari Morcos and Giulio Biroli and Levent Sagun},
year={2021},
eprint={2103.10697},
archivePrefix={arXiv},
primaryClass={cs.CV}
}