Using Data

To use a dataset in VISSL, the only requirements are:

  • the dataset name should be registered with VisslDatasetCatalog in VISSL. Only the name is important; the paths are not. The paths can be specified in the configuration file, or users can edit dataset_catalog.json to provide them.

  • the dataset should be from a supported data source.

Reading data from several sources

VISSL allows reading data from multiple sources (disk, etc.) and in multiple formats (a folder path, a .npy file, or torchvision datasets). The GenericSSLDataset class is defined to support reading data from multiple data sources. For example: data = [dataset1, dataset2], and the generated minibatches will contain the corresponding data from each dataset. For this reason, labels from multiple sources are also supported, for example targets = [dataset1 targets, dataset2 targets].

Source of the data (disk_folder | disk_filelist | torchvision_dataset):

  • disk_folder: this is simply the root folder path to the downloaded data.

  • disk_filelist: these are numpy (or .pkl) files: (1) a file containing the image information (paths) and (2) a file containing the corresponding labels for the images. We provide scripts that can be used to prepare these two files for a dataset of choice.

  • torchvision_dataset: the root folder path to the downloaded torchvision dataset. As of now, the supported datasets are: CIFAR10, CIFAR100, MNIST, STL10 and SVHN. A configuration sketch follows this list.
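
A minimal sketch of reading a torchvision dataset (the dataset name below is an assumption and must match an entry registered in the dataset catalog, and the path is a placeholder for the torchvision download root):

DATA:
  TRAIN:
    DATA_SOURCES: [torchvision_dataset]
    DATASET_NAMES: [CIFAR10]
    DATA_PATHS: [/path/to/cifar10/root/]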

To use a dataset, VISSL takes the following inputs in the configuration file for each dataset split (train, test); a combined example is shown after this list:

  • DATASET_NAMES: names of the datasets that are registered with VisslDatasetCatalog. Registering the dataset name is important. Example: DATASET_NAMES=[imagenet1k_folder, my_new_dataset_filelist]

  • DATA_SOURCES: the data sources for the datasets. Options: disk_folder | disk_filelist | torchvision_dataset. This specifies where the data lives. Users can extend it for their purposes. Example: DATA_SOURCES=[disk_folder, disk_filelist]

  • DATA_PATHS: the paths to the datasets. A path can be a folder path (for example, the ImageNet1k folder) or a .npy file path. For folder paths, VISSL uses ImageFolder from torchvision. Example: DATA_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset>]

  • LABEL_SOURCES: just like images, the targets can also come from several sources. Example: LABEL_SOURCES=[disk_folder] for ImageNet1k. Example: LABEL_SOURCES=[disk_folder, disk_filelist]

  • LABEL_PATHS: similar to DATA_PATHS but for labels. Example: LABEL_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset_labels>]

  • LABEL_TYPE: choose from standard | sample_index. sample_index is a common practice in self-supervised learning: the label is simply the index of the sample in the dataset. The standard label type is used for supervised learning, where the user specifies the annotated labels to use.

  • DATA_LIMIT: how many samples to train with per epoch. This can be useful for debugging purposes or for evaluating low-shot learning. Note that the default DATA_LIMIT=-1 uses the full dataset. Additional options, like the sampling seed and class-balanced sampling, can be configured via DATA_LIMIT_SAMPLING.
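
Putting these options together, a sketch of a train split configuration reading ImageNet1k from a disk_folder alongside a second dataset stored as filelists could look as follows (the dataset names and paths are illustrative placeholders):

DATA:
  TRAIN:
    DATASET_NAMES: [imagenet1k_folder, my_new_dataset_filelist]
    DATA_SOURCES: [disk_folder, disk_filelist]
    DATA_PATHS: [/path/to/imagenet1k/train/, /path/to/my_new_dataset_train_images.npy]
    LABEL_SOURCES: [disk_folder, disk_filelist]
    LABEL_PATHS: [/path/to/imagenet1k/train/, /path/to/my_new_dataset_train_labels.npy]
    LABEL_TYPE: standard
    DATA_LIMIT: -1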

Using dataset_catalog.json

In order to use a dataset with VISSL, the dataset name must be registered with VisslDatasetCatalog. VISSL maintains a dataset_catalog.json which is parsed by VisslDatasetCatalog so that the listed datasets are registered with VISSL, ready to use.

Users can edit the template dataset_catalog.json file to specify their dataset paths. Alternatively, users can create their own dataset catalog JSON file and set the environment variable VISSL_DATASET_CATALOG_PATH to its absolute path. This may be helpful if you are not building the code from source or are actively developing on VISSL. The JSON file is fully user-defined and can contain any number of supported datasets (one or more). Users then refer to the datasets they wish to use by their string names.

Template for a dataset entry in dataset_catalog.json

"data_name": {
   "train": [
     "<images_path_or_folder>", "<labels_path_or_folder>"
   ],
   "val": [
     "<images_path_or_folder>", "<labels_path_or_folder>"
   ]
}

The images_path_or_folder and labels_path_or_folder entries can be directories or file paths (numpy or pickle files).

Users can mix and match the sources of images and labels, i.e. labels can come from a filelist while images come from a folder path. The YAML configuration files require specifying LABEL_SOURCES and DATA_SOURCES, which allows the code to figure out how to ingest the various sources.

Note

Filling in the dataset_catalog.json is a one-time process; afterwards, any dataset can be accessed simply by its name in the configuration files for all subsequent trainings.

Using Builtin datasets

VISSL supports several Builtin datasets as indicated in the dataset_catalog.json file. Users can specify paths to those datasets.

Expected dataset structure for ImageNet, Places205, Places365

{imagenet, places205, places365}
train/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...
val/
    <n0......>/
        <im-1-name>.JPEG
        ...
        <im-N-name>.JPEG
        ...
    <n1......>/
        <im-1-name>.JPEG
        ...
        <im-M-name>.JPEG
        ...
    ...

Expected dataset structure for Pascal VOC [2007, 2012]

VOC20{07,12}/
    Annotations/
    ImageSets/
        Main/
            trainval.txt
            test.txt
    JPEGImages/

Expected dataset structure for COCO2014

coco/
    annotations/
        instances_train2014.json
        instances_val2014.json
    train2014/
        # image files that are mentioned in the corresponding json
    val2014/
        # image files that are mentioned in the corresponding json

Expected dataset structure for CIFAR10

The expected format is the exact same format used by torchvision, for example as obtained by instantiating the torchvision.datasets.CIFAR10 class with download=True:

cifar-10-batches-py/
    batches.meta
    data_batch_1
    data_batch_2
    data_batch_3
    data_batch_4
    data_batch_5
    readme.html
    test_batch

Expected dataset structure for CIFAR100

The expected format is the exact same format used by torchvision, for example as obtained by instantiating the torchvision.datasets.CIFAR100 class with download=True:

cifar-100-python/
    meta
    test
    train

Expected dataset structure for MNIST

The expected format is the exact same format used by torchvision, and the exact format obtained after instantiating the torchvision.datasets.MNIST class with the flag download=True.

MNIST/
    processed/
        test.pt
        training.pt
    raw/
        t10k-images-idx3-ubyte
        t10k-images-idx3-ubyte.gz
        t10k-labels-idx1-ubyte
        t10k-labels-idx1-ubyte.gz
        train-images-idx3-ubyte
        train-images-idx3-ubyte.gz
        train-labels-idx1-ubyte
        train-labels-idx1-ubyte.gz

Expected dataset structure for STL10

The expected format is the exact same format used by torchvision, for example as obtained by instantiating the torchvision.datasets.STL10 class with download=True:

stl10_binary/
    class_names.txt
    fold_indices.txt
    test_X.bin
    test_y.bin
    train_X.bin
    train_y.bin
    unlabeled_X.bin

Expected dataset structure for SVHN

The expected format is the exact same format used by torchvision, and the exact format obtained after either:

  • downloading the train_32x32.mat, test_32x32.mat and extra_32x32.mat files available at http://ufldl.stanford.edu/housenumbers/ in the same folder

  • instantiating the torchvision.datasets.SVHN class with download=True

svhn_folder/
    test_32x32.mat
    train_32x32.mat

Expected dataset structure for the other benchmark datasets

VISSL supports benchmarks inspired by the VTAB and CLIP papers, for which the datasets either:

  • Do not directly exist but are transformations of an existing dataset (like images extracted from videos)

  • Are not in a format directly compatible with the disk_folder or the disk_filelist format of VISSL

  • And are not yet part of torchvision datasets

To run these benchmarks, the following data preparation scripts are mandatory:

  • extra_scripts/datasets/create_caltech101_data_files.py: to transform the Caltech101 dataset to the disk_folder format

  • extra_scripts/datasets/create_clevr_count_data_files.py: to create a disk_filelist dataset from CLEVR where the goal is to count the number of objects in the scene

  • extra_scripts/datasets/create_clevr_dist_data_files.py: to create a disk_filelist dataset from CLEVR where the goal is to estimate the distance of the closest object in the scene

  • extra_scripts/datasets/create_dsprites_location_data_files.py: to create a disk_folder dataset from dSprites where the goal is to estimate the x coordinate of the sprite in the scene

  • extra_scripts/datasets/create_dsprites_orientation_data_files.py: to create a disk_folder dataset from dSprites where the goal is to estimate the orientation of the sprite in the scene

  • extra_scripts/datasets/create_dtd_data_files.py: to transform the DTD dataset to the disk_folder format

  • extra_scripts/datasets/create_euro_sat_data_files.py: to transform the EUROSAT dataset to the disk_folder format

  • extra_scripts/datasets/create_fgvc_aircraft_data_files.py: to transform the FGVC Aircraft dataset to the disk_folder format

  • extra_scripts/datasets/create_food101_data_files.py: to transform the FOOD101 dataset to the disk_folder format

  • extra_scripts/datasets/create_gtsrb_data_files.py: to transform the GTSRB dataset to the disk_folder format

  • extra_scripts/datasets/create_imagenet_ood_data_files.py: to create test sets in disk_filelist format for Imagenet based on Imagenet-A and Imagenet-R

  • extra_scripts/datasets/create_kitti_dist_data_files.py: to create a disk_folder dataset from KITTI where the goal is to estimate the distance of the closest car, van or truck

  • extra_scripts/datasets/create_oxford_pets_data_files.py: to transform the Oxford Pets dataset to the disk_folder format

  • extra_scripts/datasets/create_patch_camelyon_data_files.py: to transform the PatchCamelyon dataset to the disk_folder format

  • extra_scripts/datasets/create_small_norb_azimuth_data_files.py: to create a disk_folder dataset from Small NORB where the goal is to find the azimuth of the photographed object

  • extra_scripts/datasets/create_small_norb_elevation_data_files.py: to create a disk_folder dataset from Small NORB where the goal is to predict the elevation in the image

  • extra_scripts/datasets/create_sun397_data_files.py: to transform the SUN397 dataset to the disk_filelist format

  • extra_scripts/datasets/create_ucf101_data_files.py: to create a disk_folder image action recognition dataset from the video action recognition dataset UCF101 by extracting the middle frame

You can read more about how to download these datasets and run these scripts from here.

After data preparation, the output folders are either compatible with the disk_filelist layout:

train_images.npy  # Paths to the train images
train_labels.npy  # Labels for each of the train images
val_images.npy    # Paths to the val images
val_labels.npy    # Labels for each of the val images

Or with the disk_folder layout:

train/
    label1/
        image_1.jpeg
        image_2.jpeg
        ...
    label2/
        image_x.jpeg
        image_y.jpeg
        ...
    ...
val/
    label1/
        image_1.jpeg
        image_2.jpeg
        ...
    label2/
        image_x.jpeg
        image_y.jpeg
        ...
    ...

Note

In the case of the disk_folder layout, the images are copied into the output folder, so the input folder is no longer needed and can, for instance, be deleted.

In the case of the disk_filelist layout, the images are referenced inside the .npy files. It is therefore important to keep the input folder and not alter it (which includes not moving it).

The disk_filelist format has the advantage of using less space, while the disk_folder format offers total decoupling from the original dataset files and is more advantageous for a small number of images or when the inputs do not allow referencing images (for instance, when extracting frames from videos or dealing with images in an unsupported format).

The aforementioned scripts use either the disk_folder or the disk_filelist format based on these constraints.
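
Once prepared, these outputs plug into the configuration like any other dataset. A sketch for consuming a disk_filelist output (the dataset name and output paths are placeholders):

DATA:
  TRAIN:
    DATASET_NAMES: [my_prepared_dataset_filelist]
    DATA_SOURCES: [disk_filelist]
    DATA_PATHS: [/path/to/output/train_images.npy]
    LABEL_SOURCES: [disk_filelist]
    LABEL_PATHS: [/path/to/output/train_labels.npy]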

Dataloader

VISSL uses the PyTorch torch.utils.data.DataLoader and allows setting all the dataloader options as below. The dataloader is wrapped with DataloaderAsyncGPUWrapper or DataloaderSyncGPUWrapper depending on whether the user wants to copy data to the GPU asynchronously or not.

The settings for the Dataloader in VISSL are:

dataset (GenericSSLDataset):    the dataset object for which the dataloader is constructed
dataset_config (dict):          configuration of the dataset; it should be the DATA.TRAIN or DATA.TEST settings
num_dataloader_workers (int):   number of dataloader workers per gpu (or cpu) for training
pin_memory (bool):              whether to pin memory or not
multi_processing_method (str):  method to use. options: forkserver | fork | spawn
device (torch.device):          training on cuda or cpu
get_sampler (get_sampler):      function that is used to get the sampler
worker_init_fn (None default):  any function that should be executed during initialization of dataloader workers
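
Most of these settings map to options in the YAML configuration. A hedged sketch (the exact key names below are assumptions based on typical VISSL defaults; consult the defaults.yaml in the repository for the authoritative names):

DATA:
  NUM_DATALOADER_WORKERS: 8   # dataloader workers per gpu
  PIN_MEMORY: True            # pin host memory for faster host-to-device copies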

Using Data Collators

VISSL supports the PyTorch default collator torch.utils.data.dataloader.default_collate as well as many custom data collators used in self-supervision. To use any collator, the user simply has to set DATA.TRAIN.COLLATE_FUNCTION to the name of the collator to use. See all custom VISSL collators implemented here.

An example for specifying collator for SwAV training:

DATA:
  TRAIN:
    COLLATE_FUNCTION: multicrop_collator

Using Data Transforms

VISSL supports all TorchVision transforms and Augly transforms, as well as many custom transforms required by self-supervised approaches including MoCo, SwAV, PIRL, SimCLR, BYOL, etc. Using transforms is intuitive and easy in VISSL: users specify the list of transforms they want to apply to the data, in the order of application. This involves using the transform name and key: value pairs to specify the parameter values for the transform. See the full list of transforms implemented by VISSL here.

An example of transform for SwAV:

DATA:
  TRAIN:
    TRANSFORMS:
      - name: ImgPilToMultiCrop
        total_num_crops: 6
        size_crops: [224, 96]
        num_crops: [2, 4]
        crop_scales: [[0.14, 1], [0.05, 0.14]]
      - name: RandomHorizontalFlip
        p: 0.5
      - name: ImgPilColorDistortion
        strength: 1.0
      - name: ImgPilGaussianBlur
        p: 0.5
        radius_min: 0.1
        radius_max: 2.0
      - name: ToTensor
      - name: Normalize
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]

To use an Augly transform, first pip install augly (Python >= 3.7 is required). Then specify the Augly transform class name, set transform_type: augly, and specify any other required arguments. For example, using the Overlay Emoji transform:

DATA:
  TRAIN:
    TRANSFORMS:
      - name: OverlayEmoji
        transform_type: augly
        opacity: 0.5
        emoji_size: 0.2
        x_pos: 0.3
        y_pos: 0.4
        p: 0.7

Using Data Sampler

VISSL supports 2 types of samplers:

  • PyTorch default torch.utils.data.distributed.DistributedSampler

  • the VISSL sampler StatefulDistributedSampler, which is written specifically for large-scale dataset trainings. See the sampler's documentation for details.

By default, the PyTorch default sampler is used unless the user specifies DATA.TRAIN.USE_STATEFUL_DISTRIBUTED_SAMPLER=true, in which case StatefulDistributedSampler will be used.
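
For example, to enable the stateful sampler for the train split:

DATA:
  TRAIN:
    USE_STATEFUL_DISTRIBUTED_SAMPLER: True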