Train on multiple gpus

VISSL supports training any model on 1 gpu or more. A single machine typically has 2, 4 or 8 gpus. Training on more than 1 gpu within a single machine is straightforward: it only requires correctly setting the number of gpus to use via DISTRIBUTED.NUM_PROC_PER_NODE.

The config will look like:

DISTRIBUTED:
  BACKEND: nccl           # set to "gloo" if desired
  NUM_NODES: 1            # no change needed
  NUM_PROC_PER_NODE: 2    # user sets this to number of gpus to use
  INIT_METHOD: tcp        # set to "file" if desired
  RUN_ID: auto            # set to a file path if using the file method. No change needed for tcp: a free port on the machine is detected automatically.
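
In practice, these settings can also be passed as command-line overrides when launching a run. The sketch below assumes VISSL's standard launcher tools/run_distributed_engines.py and reuses the example config path from the SLURM section below; substitute your own config.

# minimal sketch: train on 2 gpus of a single machine by overriding the
# distributed options on the command line
python tools/run_distributed_engines.py \
  config=pretrain/swav/swav_8node_resnet \
  config.DISTRIBUTED.NUM_NODES=1 \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=2 \
  config.DISTRIBUTED.INIT_METHOD=tcp \
  config.DISTRIBUTED.RUN_ID=auto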

The full list of options exposed by VISSL:

DISTRIBUTED:
  # backend for communication across gpus. Use nccl by default. For cpu training, set
  # "gloo" as the backend.
  BACKEND: "nccl"
  # whether to output the NCCL info during training. This helps debug how
  # nccl communication is configured.
  NCCL_DEBUG: False
  # tuning parameter to speed up all-reduce by specifying the number of nccl threads to use.
  # By default, we use whatever default is set by nccl or the user's system.
  NCCL_SOCKET_NTHREADS: ""
  # whether model buffers (for example BN buffers) are broadcast in every forward pass
  BROADCAST_BUFFERS: True
  # number of machines to use in training. Each machine can have many gpus. NUM_NODES
  # counts the number of unique hosts.
  NUM_NODES: 1
  # set this to the number of gpus per machine. This ensures that each gpu of the
  # node has a process attached to it.
  NUM_PROC_PER_NODE: 8
  # this could be: tcp | env | file or any other pytorch-supported method
  INIT_METHOD: "tcp"
  # every training run should have a unique id. The options are:
  #   1. If using INIT_METHOD=env, RUN_ID="" is fine.
  #   2. If using INIT_METHOD=tcp:
  #      - if using > 1 machine, set the port yourself: RUN_ID="localhost:{port}".
  #      - if using 1 machine, set RUN_ID=auto and a free port will be selected automatically.
  #   3. If using INIT_METHOD=file, RUN_ID={file_path}
  RUN_ID: "auto"

Train on multiple machines

VISSL allows scaling a training beyond 1 machine in order to speed it up, and makes doing so very easy. Typically, multi-machine training involves correctly setting the following options:

  • Number of gpus to use

  • Number of nodes

  • INIT_METHOD for PyTorch distributed training, which determines how gpus communicate for all-reduce operations.

Putting together the above, if a user wants to train on 2 machines where each machine has 8 gpus, the config will look like:

DISTRIBUTED:
  BACKEND: nccl
  NUM_NODES: 2               # user sets this to number of machines to use
  NUM_PROC_PER_NODE: 8       # user sets this to number of gpus to use per machine
  INIT_METHOD: tcp           # recommended if feasible; otherwise use env or file
  RUN_ID: localhost:{port}   # set a free port yourself
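
As a rough sketch of a manual (non-SLURM) launch, the same command is run on each of the 2 machines, again assuming the tools/run_distributed_engines.py launcher. The per-machine rank is assumed here to be exposed as a node_id override (check your VISSL version), and {port} is a placeholder for a free port you choose.

# sketch: run this on machine 0; on machine 1 run the identical command with node_id=1.
# "node_id" is an assumed name for the per-machine rank override.
python tools/run_distributed_engines.py \
  config=pretrain/swav/swav_8node_resnet \
  config.DISTRIBUTED.NUM_NODES=2 \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=8 \
  config.DISTRIBUTED.INIT_METHOD=tcp \
  config.DISTRIBUTED.RUN_ID="localhost:{port}" \
  node_id=0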

The full list of options exposed by VISSL:

DISTRIBUTED:
  # backend for communication across gpus. Use nccl by default. For cpu training, set
  # "gloo" as the backend.
  BACKEND: "nccl"
  # whether to output the NCCL info during training. This helps debug how
  # nccl communication is configured.
  NCCL_DEBUG: False
  # tuning parameter to speed up all-reduce by specifying the number of nccl threads to use.
  # By default, we use whatever default is set by nccl or the user's system.
  NCCL_SOCKET_NTHREADS: ""
  # whether model buffers (for example BN buffers) are broadcast in every forward pass
  BROADCAST_BUFFERS: True
  # number of machines to use in training. Each machine can have many gpus. NUM_NODES
  # counts the number of unique hosts.
  NUM_NODES: 1
  # set this to the number of gpus per machine. This ensures that each gpu of the
  # node has a process attached to it.
  NUM_PROC_PER_NODE: 8
  # this could be: tcp | env | file or any other pytorch-supported method
  INIT_METHOD: "tcp"
  # every training run should have a unique id. The options are:
  #   1. If using INIT_METHOD=env, RUN_ID="" is fine.
  #   2. If using INIT_METHOD=tcp:
  #      - if using > 1 machine, set the port yourself: RUN_ID="localhost:{port}".
  #      - if using 1 machine, set RUN_ID=auto and a free port will be selected automatically.
  #   3. If using INIT_METHOD=file, RUN_ID={file_path}
  RUN_ID: "auto"

Using SLURM

VISSL supports SLURM by default for training models. The code automatically detects whether the training environment is SLURM based on SLURM environment variables like SLURM_NODEID, SLURMD_NODENAME and SLURM_STEP_NODELIST.
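
For reference, these variables are set automatically inside a SLURM job step, so a quick way to see what the detection relies on is the sketch below (adjust the srun resources to your cluster):

# sketch: print the SLURM variables used for auto-detection from inside a job step
srun --nodes=1 --ntasks=1 bash -c 'echo "$SLURM_NODEID $SLURMD_NODENAME $SLURM_STEP_NODELIST"'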

VISSL also provides a helper bash script dev/launch_slurm.sh that allows launching a given training on SLURM. Users can modify this script to meet their needs.

The bash script takes the following inputs:

# number of machines to distribute training on
NODES=${NODES-1}
# number of gpus per machine to use for training
NUM_GPU=${NUM_GPU-8}
# gpu type: P100 | V100 | V100_32G etc. Users should set this based on their machine
GPU_TYPE=${GPU_TYPE-V100}
# name of the training, for example: simclr_2node_resnet50_in1k. This helps to clearly identify the training run
EXPT_NAME=${EXPT_NAME}
# how much CPU memory to use
MEM=${MEM-250g}
# number of CPUs used for each trainer (i.e. each gpu)
CPU=${CPU-8}
# directory where all the training artifacts like checkpoints etc will be written
OUTPUT_DIR=${OUTPUT_DIR}
# partition of the cluster on which training should run. User should determine this parameter for their cluster
PARTITION=${PARTITION-learnfair}
# any helpful comment that slurm dashboard can display
COMMENT=${COMMENT-vissl_training}
GITHUB_REPO=${GITHUB_REPO-vissl}
# which branch of VISSL should be used. Specify your custom branch if needed
BRANCH=${BRANCH-master}
# automatically determined and used for distributed training.
# each training run must have a unique id and vissl defaults to date
RUN_ID=$(date +'%Y%m%d')
# number of dataloader workers to use per gpu
NUM_DATA_WORKERS=${NUM_DATA_WORKERS-8}
# multi-processing method to use in PyTorch. Options: forkserver | fork | spawn
MULTI_PROCESSING_METHOD=${MULTI_PROCESSING_METHOD-forkserver}
# specify the training configuration to run. For example: to train swav for 100epochs
# config=pretrain/swav/swav_8node_resnet config.OPTIMIZER.num_epochs=100
CFG=( "$@" )

To run the script for training SwAV for 100 epochs on 8 machines where each machine has 8 gpus, the script can be run as:

cd $HOME/vissl && NODES=8 \
  NUM_GPU=8 \
  GPU_TYPE=V100 \
  MEM=200g \
  CPU=8 \
  EXPT_NAME=swav_100ep_rn50_in1k \
  OUTPUT_DIR=/tmp/swav/ \
  PARTITION=learnfair \
  BRANCH=master \
  NUM_DATA_WORKERS=4 \
  MULTI_PROCESSING_METHOD=forkserver \
  ./dev/launch_slurm.sh \
  config=pretrain/swav/swav_8node_resnet config.OPTIMIZER.num_epochs=100