vissl.trainer package

vissl.trainer.trainer_main.build_task(config)[source]

Builds a ClassyTask from a config.

This assumes a ‘name’ key in the config which is used to determine what task class to instantiate. For instance, a config {“name”: “my_task”, “foo”: “bar”} will find a class that was registered as “my_task” (see register_task()) and call .from_config on it.
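
As an illustration, a minimal registry-based sketch of this behavior (the registry, decorator, and task class below are hypothetical stand-ins, not the actual VISSL implementations):

TASK_REGISTRY = {}

def register_task(name):
    # Decorator that records a task class under the given name.
    def wrapper(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper

def build_task(config):
    # "name" selects the registered class; the class builds itself
    # from the remaining config via its from_config classmethod.
    return TASK_REGISTRY[config["name"]].from_config(config)

@register_task("my_task")
class MyTask:
    @classmethod
    def from_config(cls, config):
        task = cls()
        task.foo = config["foo"]
        return task

task = build_task({"name": "my_task", "foo": "bar"})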

class vissl.trainer.trainer_main.SelfSupervisionTrainer(cfg: vissl.utils.hydra_config.AttrDict, dist_run_id: str, checkpoint_path: str = None, checkpoint_folder: str = None, hooks: List[classy_vision.hooks.classy_hook.ClassyHook] = None)[source]

Bases: object

The main entry point for any training or feature extraction workflows in VISSL.

The trainer constructs a train_task which prepares all the components of the training (optimizer, loss, meters, model, etc.) using the settings specified by the user in the yaml config file. See vissl/trainer/train_task.py for more details.

Parameters
  • cfg (AttrDict) – user-specified input config that has the optimizer, loss, meters, etc. settings relevant to the training

  • dist_run_id (str) –

    For multi-gpu training with PyTorch, we have to specify how the gpus are going to rendezvous. This requires specifying the communication method (file or tcp) and the unique rendezvous run_id that is specific to one run. We recommend (see the usage sketch after this parameter list):

    1. for a single node: use init_method=tcp and run_id=auto

    2. for multi-node: use init_method=tcp and specify run_id={master_node}:{port}

  • checkpoint_path (str) – if the training is being resumed from a checkpoint, path to the checkpoint. The tools/run_distributed_engines.py automatically looks for the checkpoint in the checkpoint directory.

  • checkpoint_folder (str) – what directory to use for checkpointing. The tools/run_distributed_engines.py creates the directory based on user input in the yaml config file.

  • hooks (List[ClassyHook]) – the list of hooks to use during the training. The engines in vissl/engines/{train, extract_features}.py determine which hooks are used.
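
A minimal usage sketch, assuming cfg is an AttrDict already loaded from the yaml config; the checkpoint folder path and run_id values below are illustrative:

from vissl.trainer.trainer_main import SelfSupervisionTrainer

# Single-node training: tcp rendezvous with run_id=auto (see the dist_run_id
# parameter above); multi-node runs would pass "{master_node}:{port}" instead.
trainer = SelfSupervisionTrainer(
    cfg=cfg,
    dist_run_id="auto",
    checkpoint_path=None,                        # resume checkpoint, if any
    checkpoint_folder="/tmp/vissl_checkpoints",  # illustrative path
)
trainer.train()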

setup_distributed(use_gpu: bool)[source]

Set up distributed training. VISSL supports both GPU and CPU-only training.

  1. Initialize torch.distributed.init_process_group if distributed training is not already initialized. The init_method and backend are specified by the user in the yaml config file. See the vissl/defaults.yaml file for a description of how to set init_method and backend.

  2. Set the global cuda device index using torch.cuda.set_device, or fall back to the cpu device for CPU-only training. A minimal sketch of both steps follows this list.
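
A minimal sketch of the two steps above, using the plain torch.distributed API (the init_method, backend, rank, and world size values here are illustrative defaults, not VISSL's):

import torch
import torch.distributed as dist

def setup_distributed(use_gpu: bool, init_method: str = "tcp://localhost:12345",
                      backend: str = "nccl", rank: int = 0, world_size: int = 1,
                      local_rank: int = 0):
    # Step 1: initialize the process group only if it is not already initialized.
    if not dist.is_initialized():
        dist.init_process_group(
            backend=backend if use_gpu else "gloo",
            init_method=init_method,
            rank=rank,
            world_size=world_size,
        )
    # Step 2: pin this process to its GPU, or stay on the cpu device.
    if use_gpu:
        torch.cuda.set_device(local_rank)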

train()[source]

The train workflow. We get the training loop to use (the VISSL default is standard_train_step), but the user can create their own training loop and specify its name via TRAINER.TRAIN_STEP_NAME.

The training happens as follows (a schematic of this loop follows the list):

  1. Execute any hooks at the start of training (mostly resets variables like the iteration number, phase number, etc.)

  2. For each epoch (train or test), run the hooks at the start of an epoch. Mostly involves setting things like timers, setting the dataloader epoch, etc.

  3. Execute the training loop (1 training iteration) involving forward, loss, backward, optimizer update, metrics collection, etc.

  4. At the end of the epoch, sync meters and execute hooks at the end of the phase. Involves things like checkpointing the model, logging timers, logging to tensorboard, etc.
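
A schematic of the loop above (the hook method names and task attributes are illustrative, not the exact VISSL/ClassyVision interfaces):

def train_loop(task, hooks):
    for hook in hooks:
        hook.on_start(task)                 # 1. reset iteration/phase counters, etc.
    for phase in task.phases:               # train or test phases
        for hook in hooks:
            hook.on_phase_start(task)       # 2. timers, dataloader epoch, ...
        for batch in task.data_iterator:
            task.run_train_step(batch)      # 3. forward, loss, backward, update
        task.sync_meters()                  # 4. sync meters across workers
        for hook in hooks:
            hook.on_phase_end(task)         # checkpointing, timers, tensorboard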

extract()[source]

The extract workflow supports multi-gpu feature extraction. Since we are only extracting features, only the model is built (and initialized from a model weights file if specified by the user). The model is fully set to eval mode.

The features are extracted for whichever data splits (train, val, test, etc.) the user wants.

vissl.trainer.train_task module

class vissl.trainer.train_task.SelfSupervisionTask(config: vissl.utils.hydra_config.AttrDict)[source]

Bases: classy_vision.tasks.classification_task.ClassificationTask

A task prepares and holds all the components of a training, like the optimizer, datasets, dataloaders, losses, meters, etc. The task also contains variables like the training iteration, epoch number, etc. that are updated during the training.

We prepare every single component according to the parameter settings the user specifies in the yaml config file.

The task also supports 2 additional things: 1) converting the model's BatchNorm layers to synchronized BatchNorm, and 2) setting up mixed precision (both Apex and PyTorch are supported).
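
For the BatchNorm conversion, the plain PyTorch equivalent looks roughly like this (the model below is illustrative; VISSL can also use the Apex sync-BN implementation):

import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)
# Replace every BatchNorm layer with its synchronized counterpart so that
# batch statistics are computed across all GPUs in the process group.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)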

set_device()[source]

Set the training device: gpu or cpu. We use self.device in the rest of the workflow to determine whether we should do cpu-only training or use the gpu. Set MACHINE.DEVICE = “gpu” or “cpu”.

set_ddp_bucket_cap_mb()[source]

PyTorch DDP supports setting the bucket_cap_mb for all-reduce. Tuning this parameter can help with the speed of the model. We use the default PyTorch value of 25MB.

set_available_splits()[source]

Given the data settings, we determine whether we are using both the train and test datasets. If TEST_MODEL=true, we add test to the available_splits. If TEST_ONLY=false, we add train to the splits as well.
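
A minimal sketch of the split-selection logic implied above (the helper function is hypothetical; test_model and test_only mirror the config flags mentioned):

def get_available_splits(test_model: bool, test_only: bool):
    splits = []
    if not test_only:
        splits.append("TRAIN")   # train split is used unless TEST_ONLY=true
    if test_model:
        splits.append("TEST")    # test split is added when TEST_MODEL=true
    return splits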

set_amp_args()[source]

Two automatic mixed precision implementations are available: Apex’s and PyTorch’s.

  • If Apex’s AMP is enabled, amp_args is a dictionary containing arguments to be passed to amp.initialize. Set to None to disable amp. To enable mixed precision training, pass amp_args={“opt_level”: “O1”} here. See https://nvidia.github.io/apex/amp.html for more info.

  • If PyTorch’s AMP is enabled, no arguments are needed. Sketches of both options follow this list.
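
Hedged sketches of both options (the model, optimizer, and input below are illustrative; the Apex variant is shown commented out since it requires apex to be installed):

import torch

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Apex AMP: pass amp_args such as {"opt_level": "O1"} to amp.initialize.
# from apex import amp
# model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# PyTorch AMP: no extra arguments are needed; autocast + GradScaler suffice.
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    loss = model(torch.randn(4, 10, device="cuda")).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()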

set_checkpoint_path(checkpoint_path: str)[source]

Set the checkpoint path for the training

set_checkpoint_folder(checkpoint_folder: str)[source]

Set the checkpoint folder for the training

set_iteration(iteration)[source]

Set the iteration number. We maintain and store the iteration in the state itself. It counts the total number of iterations we do in the training phases. It is updated after every forward pass of the training step in UpdateTrainIterationNumHook, and starts from 1.

classmethod from_config(config)[source]

Create the task from the yaml config input.

get_config()[source]

Utility function to store and use the config that was used for the given training.

build_datasets()[source]

Get the datasets for the data splits we will use in the training. The available_splits (set by set_available_splits) determine which datasets are used in the training.

build_dataloaders(pin_memory: bool) → torch.utils.data.dataloader.DataLoader[source]

Build PyTorch dataloaders for all the available_splits. We construct the standard PyTorch DataLoader and allow setting all dataloader options.
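
A minimal sketch of the standard dataloader constructed per split (the dataset and option values are illustrative; VISSL reads them from the yaml config):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.zeros(100, dtype=torch.long))
loader = DataLoader(
    dataset,
    batch_size=32,      # per-replica batch size from the yaml config
    num_workers=4,      # dataloader worker processes
    pin_memory=True,    # the pin_memory argument passed to build_dataloaders
    drop_last=True,
)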

get_global_batchsize()[source]

Return the global batch size used in the training across all the trainers. We check which phase we are in (train or test) and get the dataset used in that phase. We call get_global_batchsize() of the dataset.

recreate_data_iterator(phase_type, epoch, compute_start_iter)[source]

Recreate data iterator (including multiprocessing workers) and destroy the previous iterators.

This is called when we load a new checkpoint or when phase changes during the training (one epoch to the next). DataSampler may need to be informed on those events to update the epoch and start_iteration so that the data is deterministically shuffled, so we call them here.
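
A minimal sketch of the bookkeeping described above, using the standard PyTorch sampler API (the helper and dataloader are illustrative):

def recreate_data_iterator(dataloader, epoch: int):
    sampler = dataloader.sampler
    # Inform the sampler of the new epoch so that shuffling stays
    # deterministic across workers and across resumes.
    if hasattr(sampler, "set_epoch"):
        sampler.set_epoch(epoch)
    # Re-creating the iterator restarts the multiprocessing workers
    # with the updated sampler state.
    return iter(dataloader)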

run_hooks(hook_function_name)[source]

Override the ClassyTask run_hook function and run the hooks whenever called

prepare_optimizer()[source]

Constructs the optimizer using the user defined settings in the yaml config. The model must be on the correct device (cuda or cpu) by this point.

prepare(pin_memory: bool = False)[source]

Prepares the task:

  • dataloaders

  • model

  • copy model to correct device

  • meters

  • loss

  • optimizer

  • LR schedulers

  • AMP state

  • resume from a checkpoint if available

prepare_extraction(pin_memory: bool = False)[source]

Prepares a light-weight task for feature extraction on multi-gpu. The model runs in eval mode only.

property enable_manual_gradient_reduction

Lazily initialize the enable flag once, when the model is not None.

set_manual_gradient_reduction() → None[source]

Called during __init__ to set a flag if manual gradient reduction is enabled.

vissl.trainer.train_steps module

Here we create all the custom train steps required for SSL model trainings.

vissl.trainer.train_steps.register_train_step(name)[source]

Registers Self-Supervision Train step.

This decorator allows VISSL to add custom train steps, even if the train step itself is not part of VISSL. To use it, apply this decorator to a train step function, like this:

@register_train_step('my_step_name')
def my_step_name():
    ...

To get a train step from a configuration file, see get_train_step().

vissl.trainer.train_steps.get_train_step(train_step_name: str)[source]

Look up the train_step_name in the train step registry and return it. If the train step is not implemented, an assertion error will be raised and the workflow will exit.

vissl.trainer.train_steps.standard_train_step module

This is the train step that’s most commonly used in model trainings.

vissl.trainer.train_steps.standard_train_step.construct_sample_for_model(batch_data, task)[source]

Given the input batch from the dataloader, verify the input is as expected: the input data and target data are present in the batch. In case of multi-input trainings like PIRL, make sure the data is in the right format, i.e. the multiple inputs should be nested under a common key “input”.
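
A hedged sketch of the batch/sample structure described above (the field and key names are illustrative of a PIRL-style multi-input setup, not the exact VISSL collator output):

import torch

img = torch.randn(3, 224, 224)
patches = torch.randn(9, 3, 64, 64)
target = torch.tensor(0)

# Single-input batch from the dataloader: input data and target must be present.
batch_data = {"data": [img], "label": [target]}

# Multi-input trainings (e.g. PIRL): the multiple inputs are nested under a
# common "input" key in the sample passed to the model.
sample = {"input": {"images": img, "patches": patches}, "target": target}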

vissl.trainer.train_steps.standard_train_step.standard_train_step(task)[source]

Single training iteration loop of the model.

Performs: data read, forward, loss computation, backward, optimizer step, parameter updates.

Various intermediate steps are also performed:

  • logging the training loss, training eta, LR, etc. to loggers

  • logging to tensorboard

  • performing any self-supervised method specific operations (like updating the momentum encoder in the MoCo approach, or computing the scores in SwAV)

  • checkpointing the model if the user wants to checkpoint in the middle of an epoch
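
A minimal sketch of the core of one iteration, without hooks, AMP, or any SSL-specific operations (the model, loss, and optimizer are illustrative):

import torch

model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(batch):
    inputs, targets = batch
    optimizer.zero_grad()
    outputs = model(inputs)               # forward
    loss = criterion(outputs, targets)    # loss computation
    loss.backward()                       # backward
    optimizer.step()                      # optimizer step / parameter update
    return loss.item()

loss_value = train_step((torch.randn(4, 10), torch.tensor([0, 1, 0, 1])))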