LARC for large batch size training

What is LARC

LARC (Layer-wise Adaptive Rate Control) is a technique, proposed by Yang You, Igor Gitman, and Boris Ginsburg in "Large Batch Training of Convolutional Networks" (https://arxiv.org/abs/1708.03888), for improving the convergence of large batch size training. LARC uses the ratio between the gradient norm and the parameter norm to compute an adaptive local learning rate for each parameter tensor (layer).

See the LARC paper for the derivation of the local learning rate. In practice, the implementation scales each parameter's gradient as a proxy for modifying that parameter's learning rate, so the base optimizer can be used unchanged.
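
As a rough illustration, here is a minimal sketch of that per-tensor gradient scaling, modeled on the step logic in NVIDIA Apex's LARC wrapper; it is not VISSL's actual code, and the function name is hypothetical. The trust_coefficient, clip, and eps arguments correspond to the larc_config options shown in the next section:

import torch

def larc_scale_gradients(params, lr, trust_coefficient=0.001,
                         clip=False, eps=1e-8, weight_decay=0.0):
    # Illustrative sketch of LARC gradient scaling, modeled on
    # apex.parallel.LARC; not VISSL's actual implementation.
    for p in params:
        if p.grad is None:
            continue
        param_norm = torch.norm(p.data)
        grad_norm = torch.norm(p.grad.data)
        if param_norm == 0 or grad_norm == 0:
            continue  # leave the gradient unscaled
        # The local LR is proportional to the ratio ||w|| / ||g||.
        adaptive_lr = trust_coefficient * param_norm / (
            grad_norm + weight_decay * param_norm + eps)
        if clip:
            # Cap the effective learning rate at the global LR.
            adaptive_lr = min(adaptive_lr / lr, 1.0)
        p.grad.data += weight_decay * p.data
        p.grad.data *= adaptive_lr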

How to enable LARC

VISSL supports the LARC implementation from NVIDIA's Apex. To use LARC, set the config option OPTIMIZER.use_larc=True. VISSL exposes the following LARC parameters for tuning:

OPTIMIZER:
  name: "sgd"
  use_larc: False  # supported for SGD only for now
  larc_config:
    clip: False               # if True, cap the local LR at the global LR
    eps: 1e-08                # numerical stability term in the denominator
    trust_coefficient: 0.001  # scales the ||w|| / ||g|| ratio
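
For reference, the config above maps onto NVIDIA Apex's LARC wrapper roughly as follows (a hedged sketch; VISSL applies the wrapper internally when use_larc=True, so users normally only edit the YAML config, and the model here is a stand-in for illustration):

import torch
from apex.parallel.LARC import LARC

model = torch.nn.Linear(10, 10)  # stand-in model for illustration
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Wrap the base optimizer; keyword arguments mirror larc_config above.
optimizer = LARC(base_optimizer, trust_coefficient=0.001, clip=False, eps=1e-8)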

Note

LARC is currently supported for the SGD optimizer only.

Using Apex

VISSL provides anaconda and pip packages of Apex (compiled with optimized C++ extensions/CUDA kernels). The Apex packages are provided for several CUDA versions (9.2, 10.0, 10.1, 10.2, 11.0), PyTorch >= 1.4, and Python >= 3.6 and <= 3.9.

Follow VISSL's instructions to install Apex with pip or the instructions to install Apex with conda.
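
After installing, a quick sanity check (illustrative) is to confirm that Apex and its LARC wrapper are importable:

import apex  # noqa: F401
from apex.parallel.LARC import LARC  # the implementation VISSL uses for LARC
print("Apex LARC is available")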