Train on multiple GPUs

VISSL supports training any model on one or more GPUs. Typically, a single machine has 2, 4 or 8 GPUs. Training on more than one GPU within a single machine is straightforward: set DISTRIBUTED.NUM_PROC_PER_NODE to the number of GPUs to use.

The config will look like:

DISTRIBUTED:
  BACKEND: nccl           # set to "gloo" if desired
  NUM_NODES: 1            # no change needed
  NUM_PROC_PER_NODE: 2    # user sets this to number of gpus to use
  INIT_METHOD: tcp        # set to "file" if desired
  RUN_ID: auto            # set to a file path if using the file method; no change needed for tcp, a free port on the machine is detected automatically.
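
As a quick sanity check before launching, NUM_PROC_PER_NODE should not exceed the number of GPUs PyTorch can see on the machine. A minimal check in plain PyTorch (not VISSL-specific):

import torch

# NUM_PROC_PER_NODE must not exceed the number of visible GPUs,
# since VISSL attaches one process to each GPU.
num_gpus = torch.cuda.device_count()
print(f"GPUs visible on this machine: {num_gpus}")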

The full list of options exposed by VISSL:

DISTRIBUTED:
  # backend for communication across gpus. Use nccl by default. For cpu training, set
  # "gloo" as the backend.
  BACKEND: "nccl"
  # whether to output the NCCL info during training. This allows to debug how
  # nccl communication is configured.
  NCCL_DEBUG: False
  # tuning parameter to speed up all-reduce by specifying the number of NCCL threads
  # to use. By default, we use whatever default NCCL or the user's system sets.
  NCCL_SOCKET_NTHREADS: ""
  # whether model buffers (such as BN buffers) are broadcast in every forward pass
  BROADCAST_BUFFERS: True
  # number of machines to use in training. Each machine can have many gpus. NUM_NODES
  # counts the number of unique hosts.
  NUM_NODES: 1
  # set this to the number of gpus per machine. This ensures that each gpu of the
  # node has a process attached to it.
  NUM_PROC_PER_NODE: 8
  # this could be: tcp | env | file or any other PyTorch-supported method
  INIT_METHOD: "tcp"
  # every training run should have a unique id. Following are the options:
  #   1. If using INIT_METHOD=env, RUN_ID="" is fine.
  #   2. If using INIT_METHOD=tcp,
  #      - if you use > 1 machine, set port yourself. RUN_ID="localhost:{port}".
  #      - If using 1 machine, set RUN_ID=auto and a free port will be selected automatically.
  #   3. If using INIT_METHOD=file, RUN_ID={file_path}
  RUN_ID: "auto"
  # if True, performs the gradient reduction in DDP manually. This is useful with
  # activation checkpointing and can sometimes save memory otherwise consumed by the
  # PyTorch gradient buckets.
  MANUAL_GRADIENT_REDUCTION: False
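
To make these options concrete, the sketch below shows (as an assumption for illustration, not VISSL's actual code) how BACKEND, INIT_METHOD and RUN_ID map onto PyTorch's torch.distributed.init_process_group call, which underlies VISSL's distributed setup; init_distributed is a hypothetical helper:

import torch.distributed as dist

def init_distributed(backend, init_method, run_id, world_size, rank):
    # Illustrative mapping only: translate VISSL's RUN_ID into a
    # PyTorch init_method URL.
    if init_method == "tcp":
        url = f"tcp://{run_id}"    # RUN_ID is "host:port"
    elif init_method == "file":
        url = f"file://{run_id}"   # RUN_ID is a shared file path
    else:
        url = "env://"             # "env" reads MASTER_ADDR / MASTER_PORT
    dist.init_process_group(backend=backend, init_method=url,
                            world_size=world_size, rank=rank)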

Train on multiple machines

VISSL makes it extremely easy to scale training beyond one machine in order to speed it up. Typically for multi-machine training, this involves correctly setting the following options:

  • Number of GPUs to use per machine

  • Number of nodes (machines)

  • INIT_METHOD for PyTorch distributed training, which determines how GPUs communicate for all-reduce operations.

Putting together the above, if a user wants to train on 2 machines, each with 8 GPUs, the config will look like:

DISTRIBUTED:
  BACKEND: nccl
  NUM_NODES: 2               # user sets this to number of machines to use
  NUM_PROC_PER_NODE: 8       # user sets this to number of gpus to use per machine
  INIT_METHOD: tcp           # recommended if feasible; set to "file" otherwise
  RUN_ID: localhost:{port}   # select a free port
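
With these settings, the training spans NUM_NODES * NUM_PROC_PER_NODE processes in total, one per GPU. A small sketch of the resulting process layout (an assumed rank convention for illustration; VISSL computes ranks internally):

NUM_NODES = 2
NUM_PROC_PER_NODE = 8

# One process is attached to each GPU across all machines.
world_size = NUM_NODES * NUM_PROC_PER_NODE  # 16 processes in total

def global_rank(node_id, local_rank):
    # local_rank indexes GPUs within one machine (0..NUM_PROC_PER_NODE-1)
    return node_id * NUM_PROC_PER_NODE + local_rank

assert global_rank(node_id=1, local_rank=0) == 8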

The full list of options exposed by VISSL is the same as for single-machine training; see the DISTRIBUTED options above.

Using SLURM

To train using SLURM, please follow the dedicated SLURM documentation.