Using Optimizers

VISSL supports all PyTorch optimizers (SGD, Adam, etc.) and ClassyVision optimizers.

Creating Optimizers

Optimizers can be created directly from the configuration files. The user sets the optimizer name via the name key and can configure other settings, such as the number of epochs, as follows:

    name: sgd
    weight_decay: 0.0001
    momentum: 0.9
    nesterov: False
    # for how many epochs to do training. only counts training epochs.
    num_epochs: 90
    # whether to regularize batch norm. if set to False, weight decay of batch norm params is 0.
    regularize_bn: False
    # whether to regularize bias parameter. if set to False, weight decay of bias params is 0.
    regularize_bias: True
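
The effect of regularize_bn and regularize_bias can be pictured as splitting the model parameters into two groups, one that keeps weight_decay and one whose weight decay is 0. The sketch below is a minimal illustration of that idea only, not VISSL's implementation: the function name build_param_groups and the flat (name, is_bn, is_bias) parameter description are hypothetical, and it assumes that batch norm biases follow the regularize_bn flag.

```python
def build_param_groups(params, weight_decay, regularize_bn, regularize_bias):
    """Split parameters into a regularized and an unregularized group.

    `params` is a list of (name, is_bn, is_bias) tuples. This flat layout is
    hypothetical and only stands in for real model parameters.
    """
    regularized, unregularized = [], []
    for name, is_bn, is_bias in params:
        if is_bn:
            # Assumption: all batch norm parameters follow regularize_bn.
            keep_decay = regularize_bn
        elif is_bias:
            keep_decay = regularize_bias
        else:
            keep_decay = True
        (regularized if keep_decay else unregularized).append(name)
    return [
        {"params": regularized, "weight_decay": weight_decay},
        {"params": unregularized, "weight_decay": 0.0},
    ]


params = [
    ("conv1.weight", False, False),
    ("bn1.weight", True, False),
    ("bn1.bias", True, True),
    ("fc.weight", False, False),
    ("fc.bias", False, True),
]
# Same flag values as in the configuration above.
groups = build_param_groups(
    params, weight_decay=0.0001, regularize_bn=False, regularize_bias=True
)
```

With the settings above, the batch norm parameters land in the second group (weight decay 0), while the fc bias stays regularized.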

Using different LR for Head and trunk

VISSL supports using a different LR and weight decay for the head and the trunk. To enable this functionality, set the use_different_lr and/or use_different_wd options under OPTIMIZER.head_optimizer_params to True.

    # if the head should use a different LR than the trunk. If yes, then specify the
    # param_schedulers.lr_head settings. Otherwise, if set to False, the trunk LR
    # settings will be used automatically.
    use_different_lr: False
    # if the head should use a different weight decay value than the trunk.
    use_different_wd: False
    # if using different weight decay value for the head, set here. otherwise, the
    # same value as trunk will be automatically used.
    weight_decay: 0.0001

Using LARC

VISSL supports the LARC implementation from NVIDIA’s Apex. To use LARC, users need to set the config option OPTIMIZER.use_larc=True. VISSL exposes the LARC parameters that users can tune. The full list of LARC parameters exposed by VISSL:

  name: "sgd"
  use_larc: False  # supported for SGD only for now
    clip: False
    eps: 1e-08
    trust_coefficient: 0.001


LARC is currently supported for the SGD optimizer only.
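
To give a feel for what clip, eps and trust_coefficient control, here is a small pure-Python sketch of the layer-wise scaling rule as commonly described for LARC. It is an illustration under stated assumptions, not the Apex code: the function name larc_scale is hypothetical, and parameters and gradients are plain lists of floats rather than tensors.

```python
import math


def larc_scale(param, grad, lr, weight_decay=0.0,
               trust_coefficient=0.001, eps=1e-8, clip=False):
    """Return the factor by which one layer's gradient would be multiplied."""
    param_norm = math.sqrt(sum(p * p for p in param))
    grad_norm = math.sqrt(sum(g * g for g in grad))
    if param_norm == 0.0 or grad_norm == 0.0:
        return 1.0  # nothing to adapt; leave the gradient untouched
    adaptive_lr = trust_coefficient * param_norm / (
        grad_norm + weight_decay * param_norm + eps
    )
    if clip:
        # Clipping mode: the effective LR becomes min(adaptive_lr, lr),
        # so the factor applied on top of lr is capped at 1.
        adaptive_lr = min(adaptive_lr / lr, 1.0)
    return adaptive_lr


weights = [3.0, 4.0]  # norm 5
grads = [0.6, 0.8]    # norm 1
scale = larc_scale(weights, grads, lr=0.1)
clipped = larc_scale(weights, grads, lr=0.1, clip=True)
```

In this example the unclipped factor is roughly 0.005 (trust_coefficient times the ratio of the norms), and with clip=True it becomes roughly 0.05, that is, the adaptive rate expressed relative to lr.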

Creating LR Schedulers

Users can choose among different types of learning rate schedules for training their models. VISSL closely follows the LR schedulers supported by ClassyVision and also provides custom learning rate schedules.

How to set learning rate

Below we provide some examples of how to set up various types of learning rate schedules. Note that these are merely examples and you should set your own desired parameter values.

  1. Cosine

      name: cosine
      start_value: 0.15   # LR for batch size 256
      end_value: 0.0000
  2. Multi-Step

      name: multistep
      values: [0.01, 0.001]
      milestones: [1]
      update_interval: epoch  # update LR after every epoch
  3. Linear Warmup + Cosine

      name: composite
        - name: linear
            start_value: 0.6
            end_value: 4.8
        - name: cosine
            start_value: 4.8
            end_value: 0.0048
      interval_scaling: [rescaled, fixed]
      update_interval: step
      lengths: [0.1, 0.9]                 # 100ep
  4. Cosine with restarts

      name: cosine_warm_restart
      start_value: 0.15   # LR for batch size 256
      end_value: 0.00015
      restart_interval_length: 0.5
      wave_type: half  # full | half
  5. Linear warmup + cosine with restarts

      name: composite
        - name: linear
            start_value: 0.6
            end_value: 4.8
        - name: cosine_warm_restart
            start_value: 4.8
            end_value: 0.0048
            # wave_type: half
            # restart_interval_length: 0.5
            wave_type: full
            restart_interval_length: 0.334
      interval_scaling: [rescaled, rescaled]
      update_interval: step
      lengths: [0.1, 0.9]                 # 100ep
  6. Multiple linear warmups and cosine

        - name: linear
            start_value: 0.6
            end_value: 4.8
        - name: cosine
            start_value: 4.8
            end_value: 0.0048
        - name: linear
            start_value: 0.0048
            end_value: 2.114
        - name: cosine
            start_value: 2.114
            end_value: 0.0048
      update_interval: step
      interval_scaling: [rescaled, rescaled, rescaled, rescaled]
      lengths: [0.0256, 0.48722, 0.0256, 0.46166]         # 1ep IG-500M
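
To make the schedule examples above more concrete, the following sketch evaluates a few of them as plain functions of the training progress where in [0, 1). It is a simplified illustration, not the ClassyVision or VISSL code: all function names are hypothetical, the cosine uses the standard half-cosine formula, and num_epochs is passed to multistep explicitly as an assumption about how milestones map to progress.

```python
import bisect
import math


def cosine(where, start_value, end_value):
    """Half-cosine decay from start_value to end_value."""
    return end_value + 0.5 * (start_value - end_value) * (1 + math.cos(math.pi * where))


def linear(where, start_value, end_value):
    """Linear interpolation from start_value to end_value."""
    return start_value + (end_value - start_value) * where


def multistep(where, values, milestones, num_epochs):
    """Piecewise-constant LR that moves to the next value at each milestone epoch."""
    epoch = int(where * num_epochs + 1e-9)
    return values[bisect.bisect_right(milestones, epoch)]


def composite(where, schedulers, lengths, interval_scaling):
    """Pick the sub-schedule that covers `where` and evaluate it."""
    start = 0.0
    for i, (fn, length, scaling) in enumerate(zip(schedulers, lengths, interval_scaling)):
        last = i == len(schedulers) - 1
        if where < start + length or last:
            # "rescaled": the sub-schedule sees its own progress in [0, 1);
            # "fixed": it sees the global training progress unchanged.
            local = (where - start) / length if scaling == "rescaled" else where
            return fn(local)
        start += length


def warmup_cosine(where):
    """Linear warmup for the first 10% of training, cosine decay afterwards."""
    return composite(
        where,
        schedulers=[
            lambda w: linear(w, 0.6, 4.8),
            lambda w: cosine(w, 4.8, 0.0048),
        ],
        lengths=[0.1, 0.9],
        interval_scaling=["rescaled", "fixed"],
    )
```

For instance, warmup_cosine(0.05) is halfway through the warmup and returns a value midway between 0.6 and 4.8, and multistep(0.5, [0.01, 0.001], [1], num_epochs=2) has already passed the single milestone and returns 0.001.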

Auto-scaling of Learning Rate

VISSL supports automatically scaling LR as per To turn this automatic scaling on, set

scaled_lr is calculated as follows. Given

  • base_lr_batch_size = batch size for which the base learning rate is specified,

  • base_value = base learning rate value that will be scaled,

the current global batch size determines how the base learning rate value is scaled:

scaled_lr = ((batchsize_per_gpu * world_size) * base_value) / base_lr_batch_size
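
As a quick worked example of the formula above, the helper below (the name scale_lr is hypothetical and used only for illustration) computes the scaled learning rate for one setup.

```python
def scale_lr(base_value, base_lr_batch_size, batchsize_per_gpu, world_size):
    """Scale the base LR linearly with the global batch size."""
    global_batch_size = batchsize_per_gpu * world_size
    return (global_batch_size * base_value) / base_lr_batch_size


# Base LR 0.1 specified for batch size 256; training on 64 GPUs with
# 32 images per GPU gives a global batch size of 2048, i.e. 8x the base.
scaled_lr = scale_lr(
    base_value=0.1, base_lr_batch_size=256, batchsize_per_gpu=32, world_size=64
)
```

Here the global batch size is 8 times base_lr_batch_size, so scaled_lr comes out at about 0.8, 8 times the base value.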

For different types of learning rate schedules, the LR scaling is handled as below:

1. cosine:
    end_value = scaled_lr * (end_value / start_value)
    start_value = scaled_lr
2. multistep:
    gamma = values[1] / values[0]
    values = [scaled_lr * pow(gamma, idx) for idx in range(len(values))]
3. step_with_fixed_gamma:
    base_value = scaled_lr
4. linear:
    end_value = scaled_lr
5. inverse_sqrt:
    start_value = scaled_lr
6. constant:
    value = scaled_lr
7. composite:
    scale each composed schedule recursively. If the composition consists of a linear
    schedule, we assume that a linear warmup is applied. If the linear warmup is
    applied, it's possible the warmup is not necessary if the global batch_size is smaller
    than the base_lr_batch_size and in that case, we remove the linear warmup from the