Train models on CPU

VISSL supports training any model on CPUs. Typically, this involves setting MACHINE.DEVICE=cpu and adjusting the distributed settings accordingly. For example, the config settings look like:

MACHINE:
  DEVICE: cpu
DISTRIBUTED:
  BACKEND: gloo           # set to "gloo" for cpu only training
  NUM_NODES: 1            # no change needed
  NUM_PROC_PER_NODE: 2    # user sets this to the number of training processes to run on CPU
  INIT_METHOD: tcp        # set to "file" if desired
  RUN_ID: auto            # Set to the file path if using the "file" init method. For "tcp", no change is needed; a free port on the machine is detected automatically.
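
For instance, assuming the standard tools/run_distributed_engines.py launcher and one of the small integration-test configs (the config name and process count below are illustrative), the same settings can be passed as Hydra overrides from the command line:

python tools/run_distributed_engines.py \
    config=test/integration_test/quick_simclr \
    config.MACHINE.DEVICE=cpu \
    config.DISTRIBUTED.BACKEND=gloo \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=2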

Train anything on 1-gpu

If you have a configuration file (any VISSL-compatible config) for a training that you want to run on 1 GPU only (for example, training SimCLR on 1 GPU), you don't need to modify the config file. VISSL provides a helper script that takes care of all the adjustments. This also facilitates debugging, since users can insert pdb breakpoints in their code.

VISSL also takes care of auto-scaling the learning rate for various schedules (cosine, multistep, step, etc.) if auto-scaling is enabled (see config.OPTIMIZER.param_schedulers.lr.auto_lr_scaling). You can achieve all of this simply by using the low_resource_1gpu_train_wrapper.sh script. An example usage:

cd $HOME/vissl
./dev/low_resource_1gpu_train_wrapper.sh config=test/integration_test/quick_swav
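
If you prefer to toggle the auto-scaling from the command line instead of editing the YAML, a Hydra override can be appended to the wrapper call. A minimal sketch, assuming the auto_scale field under the config path mentioned above and reusing the same test config:

./dev/low_resource_1gpu_train_wrapper.sh \
    config=test/integration_test/quick_swav \
    config.OPTIMIZER.param_schedulers.lr.auto_lr_scaling.auto_scale=true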

Train on SLURM cluster

VISSL supports SLURM by default for training models. VISSL automatically detects whether the training environment is SLURM, based on SLURM environment variables such as SLURM_NODEID, SLURMD_NODENAME and SLURM_STEP_NODELIST.

VISSL also provides a helper bash script dev/launch_slurm.sh that allows launching a given training on SLURM. This script uses the content of the configuration to allocate the right number of nodes and GPUs on SLURM.

More precisely, the number of nodes and the number of GPUs per node to allocate are driven by the usual DISTRIBUTED training configuration:

DISTRIBUTED:
  NUM_NODES: 1            # user sets this to the number of nodes to allocate
  NUM_PROC_PER_NODE: 2    # user sets this to the number of GPUs per node
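
For example, to request 4 nodes with 8 GPUs each through the helper script, these two values can be overridden at launch (a sketch; the config name is the SwAV pretraining config used in the examples below, and the node count is illustrative):

./dev/launch_slurm.sh \
    config=pretrain/swav/swav_8node_resnet \
    config.DISTRIBUTED.NUM_NODES=4 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=8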

The more SLURM-specific options are located in the “SLURM” configuration block:

# ----------------------------------------------------------------------------------- #
# DISTRIBUTED TRAINING ON SLURM: Additional options for SLURM node allocation
# (options like number of nodes and number of GPUs by node are taken from DISTRIBUTED)
# ----------------------------------------------------------------------------------- #
SLURM:
  # Whether or not to run the job on SLURM
  USE_SLURM: false
  # Name of the job on SLURM
  NAME: "vissl"
  # Comment of the job on SLURM
  COMMENT: "vissl job"
  # Partition of SLURM on which to run the job. This is a required field if using SLURM.
  PARTITION: ""
  # Where the logs produced by the SLURM jobs will be output
  LOG_FOLDER: "."
  # Maximum number of hours / minutes needed by the job to complete. Above this limit, the job might be pre-empted.
  TIME_HOURS: 72
  TIME_MINUTES: 0
  # Additional constraints on the hardware of the nodes to allocate (example 'volta' to select a volta GPU)
  CONSTRAINT: ""
  # GB of RAM memory to allocate for each node
  MEM_GB: 250
  # TCP port on which the workers will synchronize themselves with torch distributed
  PORT_ID: 40050
  # Number of CPUs per GPU to request on the cluster.
  NUM_CPU_PER_PROC: 8
  # Any other parameters for SLURM (e.g. account, hint, distribution, etc.) as supported by submitit.
  # Please see https://github.com/facebookincubator/submitit/issues/23#issuecomment-695217824.
  ADDITIONAL_PARAMETERS: {}

Users can customize these values using the standard Hydra override syntax (the same as for any other item in the configuration), or modify the script to fit their needs.
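
For instance, to request volta GPUs, a longer time limit and more RAM per node, the corresponding SLURM fields can be overridden at launch (a sketch; the values below are purely illustrative):

./dev/launch_slurm.sh \
    config=pretrain/swav/swav_8node_resnet \
    config.SLURM.CONSTRAINT=volta \
    config.SLURM.TIME_HOURS=120 \
    config.SLURM.MEM_GB=256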

Examples:

To run a linear evaluation benchmark on a chosen checkpoint, on the SLURM partition named “dev”, with the name “lin_eval”:

./dev/launch_slurm.sh \
    config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE=/path/to/my/checkpoint.torch \
    config.SLURM.NAME=lin_eval \
    config.SLURM.PARTITION=dev

To run a distributed training of SwAV for 100 epochs on 8 nodes with 8 GPUs each, on the default partition, with the name “swav_100ep_rn50_in1k”:

./dev/launch_slurm.sh \
    config=pretrain/swav/swav_8node_resnet \
    config.OPTIMIZER.num_epochs=100 \
    config.SLURM.NAME=swav_100ep_rn50_in1k