Using Optimizers¶
VISSL supports all PyTorch optimizers (SGD, Adam, etc.) as well as ClassyVision optimizers.
Creating Optimizers¶
Optimizers can be created easily from the configuration files. Set the optimizer name in OPTIMIZER.name and configure the other settings (weight decay, momentum, number of epochs, etc.) as follows:
OPTIMIZER:
  name: sgd
  weight_decay: 0.0001
  momentum: 0.9
  nesterov: False
  # for how many epochs to do training. only counts training epochs.
  num_epochs: 90
  # whether to regularize batch norm. if set to False, weight decay of batch norm params is 0.
  regularize_bn: False
  # whether to regularize bias parameter. if set to False, weight decay of bias params is 0.
  regularize_bias: True
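For intuition, the config above corresponds roughly to the plain PyTorch optimizer below. This is only an illustrative sketch (the toy model and parameter grouping are assumptions, not VISSL's actual construction code); the key point is that regularize_bn: False translates to zero weight decay on batch-norm parameters.

import torch
import torch.nn as nn

# toy model used only for illustration
model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 10))

# split the params so that batch-norm params get no weight decay (regularize_bn: False)
bn_params, other_params = [], []
for module in model.modules():
    params = list(module.parameters(recurse=False))
    if isinstance(module, nn.modules.batchnorm._BatchNorm):
        bn_params.extend(params)
    else:
        other_params.extend(params)

optimizer = torch.optim.SGD(
    [
        {"params": other_params, "weight_decay": 0.0001},  # regularized
        {"params": bn_params, "weight_decay": 0.0},        # regularize_bn: False
    ],
    lr=0.1,          # in VISSL the LR comes from param_schedulers.lr (see below)
    momentum=0.9,
    nesterov=False,
)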
Using different LR for Head and trunk¶
VISSL supports using a different LR and weight decay for the head and the trunk. To enable this, set the config options OPTIMIZER.head_optimizer_params.use_different_lr=True and/or OPTIMIZER.head_optimizer_params.use_different_wd=True, as shown below.
OPTIMIZER:
  head_optimizer_params:
    # if the head should use a different LR than the trunk. If yes, then specify the
    # param_schedulers.lr_head settings. Otherwise if set to False, the
    # param_schedulers.lr will be used automatically.
    use_different_lr: False
    # if the head should use a different weight decay value than the trunk.
    use_different_wd: False
    # if using a different weight decay value for the head, set it here. Otherwise, the
    # same value as the trunk will be used automatically.
    weight_decay: 0.0001
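Conceptually, this maps to PyTorch parameter groups with a per-group LR and weight decay. The sketch below is illustrative only (the trunk / head attribute names and values are assumptions, not how VISSL builds its optimizer internally):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Linear(16, 32)   # stand-in for the feature trunk
        self.head = nn.Linear(32, 10)    # stand-in for the head

model = ToyModel()
optimizer = torch.optim.SGD(
    [
        {"params": model.trunk.parameters()},                                    # uses the defaults below
        {"params": model.head.parameters(), "lr": 0.01, "weight_decay": 1e-6},   # head-specific values
    ],
    lr=0.1,
    momentum=0.9,
    weight_decay=0.0001,
)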
Using LARC¶
VISSL supports the LARC implementation from NVIDIA's Apex. To use LARC, set the config option OPTIMIZER.use_larc=True. VISSL exposes the LARC parameters that users can tune. The full list of LARC parameters exposed by VISSL:
OPTIMIZER:
  name: "sgd"
  use_larc: False  # supported for SGD only for now
  larc_config:
    clip: False
    eps: 1e-08
    trust_coefficient: 0.001
Note
LARC is currently supported for SGD optimizer only.
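Conceptually, enabling LARC wraps the base SGD optimizer with Apex's LARC class. A minimal sketch, assuming NVIDIA Apex is installed (the toy model is only for illustration):

import torch
from apex.parallel.LARC import LARC   # requires NVIDIA Apex

model = torch.nn.Linear(16, 10)       # toy model for illustration
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0001)
# the parameter names mirror larc_config above
optimizer = LARC(base_optimizer, trust_coefficient=0.001, clip=False, eps=1e-08)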
Creating LR Schedulers¶
Users can choose from several types of learning rate schedules when training their models. VISSL closely follows the LR schedulers supported by ClassyVision and also provides custom learning rate schedules.
How to set learning rate¶
Below are some examples of how to set up various types of learning rate schedules. These are merely examples; set the parameter values appropriate for your training.
Cosine
OPTIMIZER:
  param_schedulers:
    lr:
      name: cosine
      start_value: 0.15   # LR for batch size 256
      end_value: 0.0000
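For reference, a half-wave cosine schedule like the one above anneals the LR from start_value down to end_value over the whole training run. A minimal sketch of the value it produces (illustrative, not ClassyVision's implementation):

import math

def cosine_lr(progress, start_value=0.15, end_value=0.0):
    """progress is the fraction of training completed, in [0, 1]."""
    return end_value + 0.5 * (start_value - end_value) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0.0))   # 0.15 at the start
print(cosine_lr(0.5))   # 0.075 halfway through training
print(cosine_lr(1.0))   # 0.0 at the end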
Multi-Step
OPTIMIZER:
  param_schedulers:
    lr:
      name: multistep
      values: [0.01, 0.001]
      milestones: [1]
      update_interval: epoch   # update LR after every epoch
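With this config, the LR is 0.01 until the first milestone (epoch 1) and 0.001 from then on. A tiny sketch of the lookup (illustrative only):

def multistep_lr(epoch, values=(0.01, 0.001), milestones=(1,)):
    # the number of milestones already passed selects the entry of `values` to use
    idx = sum(epoch >= m for m in milestones)
    return values[idx]

print(multistep_lr(0))   # 0.01
print(multistep_lr(1))   # 0.001
print(multistep_lr(50))  # 0.001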
Linear Warmup + Cosine
OPTIMIZER:
  param_schedulers:
    lr:
      name: composite
      schedulers:
        - name: linear
          start_value: 0.6
          end_value: 4.8
        - name: cosine
          start_value: 4.8
          end_value: 0.0048
      interval_scaling: [rescaled, fixed]
      update_interval: step
      lengths: [0.1, 0.9]   # 100ep
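To make the composite schedule concrete, the sketch below evaluates a warmup-then-cosine schedule where each segment is rescaled to its own [0, 1] interval. It is a rough, illustrative sketch (not ClassyVision's implementation); note that the config above uses interval_scaling: [rescaled, fixed], so the cosine segment would actually be evaluated at the global training fraction rather than the rescaled one.

import math

def warmup_cosine_lr(progress, lengths=(0.1, 0.9), warmup=(0.6, 4.8), cosine=(4.8, 0.0048)):
    """progress is the fraction of training completed, in [0, 1]."""
    if progress < lengths[0]:
        t = progress / lengths[0]                      # rescale the warmup segment to [0, 1]
        return warmup[0] + t * (warmup[1] - warmup[0])
    t = (progress - lengths[0]) / lengths[1]           # rescale the cosine segment to [0, 1]
    return cosine[1] + 0.5 * (cosine[0] - cosine[1]) * (1.0 + math.cos(math.pi * t))

print(warmup_cosine_lr(0.0))   # 0.6    (start of warmup)
print(warmup_cosine_lr(0.1))   # 4.8    (end of warmup / start of cosine)
print(warmup_cosine_lr(1.0))   # 0.0048 (end of training)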
Cosine with restarts
OPTIMIZER:
  param_schedulers:
    lr:
      name: cosine_warm_restart
      start_value: 0.15    # LR for batch size 256
      end_value: 0.00015
      restart_interval_length: 0.5
      wave_type: half      # full | half
Linear warmup + cosine with restarts
OPTIMIZER:
  param_schedulers:
    lr:
      name: composite
      schedulers:
        - name: linear
          start_value: 0.6
          end_value: 4.8
        - name: cosine_warm_restart
          start_value: 4.8
          end_value: 0.0048
          # wave_type: half
          # restart_interval_length: 0.5
          wave_type: full
          restart_interval_length: 0.334
      interval_scaling: [rescaled, rescaled]
      update_interval: step
      lengths: [0.1, 0.9]   # 100ep
Multiple linear warmups and cosine
OPTIMIZER:
  param_schedulers:
    lr:
      name: composite
      schedulers:
        - name: linear
          start_value: 0.6
          end_value: 4.8
        - name: cosine
          start_value: 4.8
          end_value: 0.0048
        - name: linear
          start_value: 0.0048
          end_value: 2.114
        - name: cosine
          start_value: 2.114
          end_value: 0.0048
      update_interval: step
      interval_scaling: [rescaled, rescaled, rescaled, rescaled]
      lengths: [0.0256, 0.48722, 0.0256, 0.46166]   # 1ep IG-500M
Auto-scaling of Learning Rate¶
VISSL supports automatically scaling LR as per Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
To turn this automatic scaling on, set config.OPTIMIZER.param_schedulers.lr.auto_lr_scaling.auto_scale=true.
The scaled_lr is calculated as follows. Given
base_lr_batch_size = the batch size for which the base learning rate is specified, and
base_value = the base learning rate value that will be scaled,
the current batch size is used to determine how to scale the base learning rate value:
scale_factor = (batchsize_per_gpu * world_size) / base_lr_batch_size
if scaling_type is set to "sqrt", scale_factor = sqrt(scale_factor)
scaled_lr = scale_factor * base_value
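A small helper capturing the rule above (illustrative only, not VISSL's code):

import math

def scale_lr(base_value, base_lr_batch_size, batchsize_per_gpu, world_size, scaling_type="linear"):
    # scale_factor compares the current global batch size to the reference batch size
    scale_factor = (batchsize_per_gpu * world_size) / base_lr_batch_size
    if scaling_type == "sqrt":
        scale_factor = math.sqrt(scale_factor)
    return scale_factor * base_value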
For different types of learning rate schedules, the LR scaling is handled as below:
1. cosine:
start_value = scaled_lr
end_value = scaled_lr * (end_value / start_value)
2. multistep:
gamma = values[1] / values[0]
values = [scaled_lr * pow(gamma, idx) for idx in range(len(values))]
3. step_with_fixed_gamma
base_value = scaled_lr
4. linear:
end_value = scaled_lr
5. inverse_sqrt:
start_value = scaled_lr
6. constant:
value = scaled_lr
7. composite:
recursively scale each schedule in the composition. If the composition contains a linear
schedule, we assume it is a linear warmup. When the global batch size is smaller than
base_lr_batch_size, the warmup is not necessary, and in that case we remove the linear
warmup from the schedule.
Here is an example configuration for linear scaling, with a base batch size of 256 and a base learning rate of 0.1:
OPTIMIZER:
  param_schedulers:
    lr:
      # we make it convenient to scale the learning rate automatically as per the scaling
      # rule specified in https://arxiv.org/abs/1706.02677 (ImageNet in 1 Hour).
      auto_lr_scaling:
        # if set to True, the learning rate will be scaled.
        auto_scale: True
        # base learning rate value that will be scaled.
        base_value: 0.1
        # batch size for which the base learning rate is specified. The current batch size
        # is used to determine how to scale the base learning rate value.
        # scaled_lr = ((batchsize_per_gpu * world_size) * base_value) / base_lr_batch_size
        base_lr_batch_size: 256
        # scaling_type can be set to "sqrt" to reduce the impact of scaling on the base value
        scaling_type: "linear"
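As a worked example of the config above, using the scale_lr helper sketched earlier with a hypothetical run of 64 GPUs and 32 images per GPU (a global batch size of 2048):

scaled_lr = scale_lr(base_value=0.1, base_lr_batch_size=256,
                     batchsize_per_gpu=32, world_size=64)
print(scaled_lr)   # 0.8 = 0.1 * (32 * 64) / 256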