Using Data

To use a dataset in VISSL, the only requirements are:

  • the dataset name should be registered with VisslDatasetCatalog in VISSL. Only the name matters; the paths do not, because paths can be specified in the configuration file. Users can either edit dataset_catalog.json or specify the paths in the configuration file.

  • the dataset should be from a supported data source.

Reading data from several sources

VISSL allows reading data from multiple sources (disk, etc.) and in multiple formats (a folder path, a .npy file). The GenericSSLDataset class supports reading data from multiple data sources, for example data = [dataset1, dataset2], so that the generated minibatches contain the corresponding data from each dataset. For the same reason, labels from multiple sources are also supported, for example targets = [dataset1 targets, dataset2 targets].

Source of the data (disk_filelist | disk_folder):

  • disk_folder: this is simply the root folder path to the downloaded data.

  • disk_filelist: these are numpy (or .pkl) files: (1) a file containing image information and (2) a file containing the corresponding labels for those images. We provide scripts that can be used to prepare these two files for a dataset of choice.

To use a dataset, VISSL takes the following inputs in the configuration file for each dataset split (train, test); a full example follows the list below:

  • DATASET_NAMES: names of the datasets that are registered with VisslDatasetCatalog. Registering the dataset name is required. Example: DATASET_NAMES=[imagenet1k_folder, my_new_dataset_filelist]

  • DATA_SOURCES: the sources of the dataset. Options: disk_folder | disk_filelist. This specifies where the data lives; users can extend it for their own purposes. Example: DATA_SOURCES=[disk_folder, disk_filelist]

  • DATA_PATHS: the paths to the dataset. The paths can be folder paths (for example, the Imagenet1k folder) or .npy filepaths. For folder paths, VISSL uses ImageFolder from PyTorch. Example: DATA_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset>]

  • LABEL_SOURCES: just like images, the targets can also come from several sources. Example: LABEL_SOURCES=[disk_folder] for Imagenet1k. Example: LABEL_SOURCES=[disk_folder, disk_filelist]

  • LABEL_PATHS: similar to DATA_PATHS but for labels. Example: LABEL_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset_labels>]

  • LABEL_TYPE: choose from standard | sample_index. sample_index is a common practice in self-supervised learning, where the label is the id of the sample in the data. The standard label type is used for supervised learning, where the user specifies the annotated labels to use.
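Putting these options together, a minimal sketch of the train split configuration for the two example datasets above (the angle-bracket paths are placeholders to fill in):

DATA:
  TRAIN:
    DATASET_NAMES: [imagenet1k_folder, my_new_dataset_filelist]
    DATA_SOURCES: [disk_folder, disk_filelist]
    DATA_PATHS: [<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset>]
    LABEL_SOURCES: [disk_folder, disk_filelist]
    LABEL_PATHS: [<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset_labels>]
    LABEL_TYPE: standard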

Using dataset_catalog.json

In order to use a dataset with VISSL, the dataset name must be registered with VisslDatasetCatalog. VISSL maintains a dataset_catalog.json that is parsed by VisslDatasetCatalog so that the listed datasets are registered with VISSL and ready to use.

Users can edit the template dataset_catalog.json file to specify their dataset paths. The json file is fully user-defined and can contain any number of supported datasets (one or more). Users can give datasets whatever string names they choose.

Template for a dataset entry in dataset_catalog.json

"data_name": {
   "train": [
     "<images_path_or_folder>", "<labels_path_or_folder>"
   ],
   "val": [
     "<images_path_or_folder>", "<labels_path_or_folder>"
   ]
}

Both images_path_or_folder and labels_path_or_folder can be directories or filepaths (numpy or pickle files).

Users can mix and match the sources of images and labels, i.e. labels can come from a filelist while images come from a folder path. The yaml configuration files require specifying LABEL_SOURCES and DATA_SOURCES, which lets the code figure out how to ingest the various sources.
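For instance, a minimal sketch of such a mixed setup, with images read from a folder and labels read from a filelist (the dataset name and the angle-bracket paths are placeholders):

DATA:
  TRAIN:
    DATASET_NAMES: [my_mixed_dataset]
    DATA_SOURCES: [disk_folder]
    DATA_PATHS: [<images_folder_path>]
    LABEL_SOURCES: [disk_filelist]
    LABEL_PATHS: [<labels_numpy_filepath>]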

Note

Filling in dataset_catalog.json is a one-time process; afterwards, any dataset can be accessed simply by its name in the configuration files for all subsequent trainings.

Using Builtin datasets

VISSL supports several builtin datasets, as indicated in the dataset_catalog.json file. Users can specify paths to those datasets.

Expected dataset structure for ImageNet, Places205, Places365

{imagenet, places205, places365}
train/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...
val/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...
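With the data arranged this way, pointing a configuration at the ImageNet train split reduces to a disk_folder entry. A minimal sketch (the angle-bracket path is a placeholder, and the name must match the corresponding dataset_catalog.json entry):

DATA:
  TRAIN:
    DATASET_NAMES: [imagenet1k_folder]
    DATA_SOURCES: [disk_folder]
    LABEL_SOURCES: [disk_folder]
    DATA_PATHS: [<imagenet1k_train_folder_path>]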

Expected dataset structure for Pascal VOC [2007, 2012]

VOC20{07,12}/
    Annotations/
    ImageSets/
        Main/
            trainval.txt
            test.txt
    JPEGImages/

Expected dataset structure for COCO2014

coco/
    annotations/
        instances_train2014.json
        instances_val2014.json
    train2014/
        # image files that are mentioned in the corresponding json
    val2014/
        # image files that are mentioned in the corresponding json

Dataloader

VISSL uses PyTorch's torch.utils.data.DataLoader and allows setting all the dataloader options listed below. The dataloader is wrapped with DataloaderAsyncGPUWrapper or DataloaderSyncGPUWrapper depending on whether the user wants to copy data to the GPU asynchronously or not.

The settings for the Dataloader in VISSL are as follows (a configuration sketch comes after the list):

dataset (GenericSSLDataset):    the dataset object for which the dataloader is constructed
dataset_config (dict):          configuration of the dataset; should be the DATA.TRAIN or DATA.TEST settings
num_dataloader_workers (int):   number of dataloader workers per gpu (or per cpu for cpu training)
pin_memory (bool):              whether to pin memory or not
multi_processing_method (str):  method to use. options: forkserver | fork | spawn
device (torch.device):          training on cuda or cpu
get_sampler (function):         function that is used to get the sampler
worker_init_fn (default None):  any function that should be executed during initialization of dataloader workers
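Most of these options surface in the YAML configuration rather than being passed by hand. A hedged sketch follows; the exact key names are assumptions based on VISSL's default configuration schema, so verify them against defaults.yaml in your version:

DATA:
  NUM_DATALOADER_WORKERS: 8            # assumed key: workers per gpu
  PIN_MEMORY: True                     # assumed key: pin host memory for faster copies
MULTI_PROCESSING_METHOD: forkserver    # assumed key: forkserver | fork | spawn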

Using Data Collators

VISSL supports the PyTorch default collator torch.utils.data.dataloader.default_collate as well as many custom data collators used in self-supervision. To use any collator, the user simply sets DATA.TRAIN.COLLATE_FUNCTION to the name of the collator to use. See all custom VISSL collators implemented here.

An example for specifying collator for SwAV training:

DATA:
  TRAIN:
    COLLATE_FUNCTION: multicrop_collator

Using Data Transforms

VISSL supports all PyTorch TorchVision transforms as well as many transforms required by self-supervised approaches including MoCo, SwAV, PIRL, SimCLR, BYOL, etc. Using transforms in VISSL is intuitive and easy: users specify the list of transforms they want to apply to the data, in order of application, using the transform name and key: value pairs for the transform's parameter values. See the full list of transforms implemented by VISSL here.

An example of transform for SwAV:

DATA:
  TRAIN:
    TRANSFORMS:
      - name: ImgPilToMultiCrop
        total_num_crops: 6
        size_crops: [224, 96]
        num_crops: [2, 4]
        crop_scales: [[0.14, 1], [0.05, 0.14]]
      - name: RandomHorizontalFlip
        p: 0.5
      - name: ImgPilColorDistortion
        strength: 1.0
      - name: ImgPilGaussianBlur
        p: 0.5
        radius_min: 0.1
        radius_max: 2.0
      - name: ToTensor
      - name: Normalize
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]

Using Data Sampler

VISSL supports 2 types of samplers:

  • PyTorch default torch.utils.data.distributed.DistributedSampler

  • VISSL's sampler StatefulDistributedSampler, written specifically for large-scale dataset training. See the documentation for the sampler.

By default, the PyTorch sampler is used unless the user specifies DATA.TRAIN.USE_STATEFUL_DISTRIBUTED_SAMPLER=true, in which case StatefulDistributedSampler will be used, as in the example below.
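For example, enabling the VISSL sampler for the train split is a one-line change:

DATA:
  TRAIN:
    USE_STATEFUL_DISTRIBUTED_SAMPLER: true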