Using Data¶
To use a dataset in VISSL, the only requirements are:
- the dataset name should be registered with VisslDatasetCatalog in VISSL. Only the name is important; the paths are not. Users can either edit the dataset_catalog.json or specify the paths in the configuration file.
- the dataset should be from a supported data source.
Reading data from several sources¶
VISSL allows reading data from multiple sources (disk, etc) and in multiple formats (a folder path, a .npy
file, or torchvision datasets).
The GenericSSLDataset class is defined to support reading data from multiple data sources. For example: data = [dataset1, dataset2]
and the minibatches generated will have the corresponding data from each dataset.
For this reason, we also support labels from multiple sources. For example: targets = [dataset1 targets, dataset2 targets].
Source of the data (disk_folder | disk_filelist | torchvision_dataset):
- disk_folder: this is simply the root folder path to the downloaded data.
- disk_filelist: these are numpy (or .pkl) files: (1) a file containing the image information, (2) a file containing the corresponding labels for the images. We provide scripts that can be used to prepare these two files for a dataset of choice.
- torchvision_dataset: the root folder path to the downloaded torchvision dataset. As of now, the supported datasets are: CIFAR10, CIFAR100, MNIST, STL10 and SVHN.
To use a dataset, VISSL takes the following inputs in the configuration file for each dataset split (train, test):
- DATASET_NAMES: names of the datasets that are registered with VisslDatasetCatalog. Registering the dataset name is important. Example: DATASET_NAMES=[imagenet1k_folder, my_new_dataset_filelist]
- DATA_SOURCES: the sources of the dataset. Options: disk_folder | disk_filelist | torchvision_dataset. This specifies where the data lives. Users can extend it for their purposes. Example: DATA_SOURCES=[disk_folder, disk_filelist]
- DATA_PATHS: the paths to the dataset. The paths can be folder paths (for example the Imagenet1k folder) or .npy file paths. For folder paths, VISSL uses ImageFolder from PyTorch. Example: DATA_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset>]
- LABEL_SOURCES: just like images, the targets can also come from several sources. Example: LABEL_SOURCES=[disk_folder] for Imagenet1k. Example: LABEL_SOURCES=[disk_folder, disk_filelist]
- LABEL_PATHS: similar to DATA_PATHS but for labels. Example: LABEL_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset_labels>]
- LABEL_TYPE: choose from standard | sample_index. sample_index is a common practice in self-supervised learning and sets the label to the id of the sample in the data. The standard label type is used for supervised learning, where the user specifies the annotated labels to use.
- DATA_LIMIT: how many samples to train with per epoch. This can be useful for debugging purposes or for evaluating low-shot learning. Note that the default DATA_LIMIT=-1 uses the full dataset. You can also configure additional options, like the seed and ensuring class-balanced sampling, in DATA_LIMIT_SAMPLING.
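Putting these options together, a minimal train split configuration might look like the following sketch (the dataset name and the bracketed paths are placeholders, not shipped defaults):

DATA:
  TRAIN:
    DATASET_NAMES: [my_new_dataset_filelist]
    DATA_SOURCES: [disk_filelist]
    DATA_PATHS: [<numpy_file_path_for_new_dataset>]
    LABEL_SOURCES: [disk_filelist]
    LABEL_PATHS: [<numpy_file_path_for_new_dataset_labels>]
    LABEL_TYPE: standard
    DATA_LIMIT: -1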
Using dataset_catalog.json¶
In order to use a dataset with VISSL, the dataset name must be registered with VisslDatasetCatalog. VISSL maintains a dataset_catalog.json which is parsed by VisslDatasetCatalog, and the datasets are registered with VISSL, ready to use.
Users can edit the template dataset_catalog.json file to specify their dataset paths. Alternatively, users can create their own dataset catalog json file and set the environment variable VISSL_DATASET_CATALOG_PATH to its absolute path. This may be helpful if you are not building the code from source or are actively developing on VISSL. The json file is fully under the user's control and can contain any number of supported datasets (one or more).
Users can then refer to the datasets they wish to use by their string names.
Template for a dataset entry in dataset_catalog.json¶
"data_name": {
"train": [
"<images_path_or_folder>", "<labels_path_or_folder>"
],
"val": [
"<images_path_or_folder>", "<labels_path_or_folder>"
]
}
The images_path_or_folder and labels_path_or_folder can be directories or file paths (numpy or pickle files). Users can mix and match the sources of images and labels, i.e. labels can come from a filelist while images come from a folder path. The yaml configuration files require specifying LABEL_SOURCES and DATA_SOURCES, which allows the code to figure out how to ingest the various sources.
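As an illustration, a hypothetical catalog entry that mixes a folder of images with filelist labels (all paths are placeholders) could look like:

"my_mixed_dataset": {
    "train": [
        "<train_images_folder>", "<train_labels_numpy_filepath>"
    ],
    "val": [
        "<val_images_folder>", "<val_labels_numpy_filepath>"
    ]
}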
Note
Filling the dataset_catalog.json is a one-time process and lets you access any dataset simply by its name in the configuration files for all subsequent trainings.
Using Builtin datasets¶
VISSL supports several builtin datasets, as indicated in the dataset_catalog.json file. Users can specify the paths to those datasets.
Expected dataset structure for ImageNet, Places205, Places365¶
{imagenet, places205, places365}
train/
<n0......>/
<im-1-name>.JPEG
...
<im-N-name>.JPEG
...
<n1......>/
<im-1-name>.JPEG
...
<im-M-name>.JPEG
...
...
val/
<n0......>/
<im-1-name>.JPEG
...
<im-N-name>.JPEG
...
<n1......>/
<im-1-name>.JPEG
...
<im-M-name>.JPEG
...
...
Expected dataset structure for Pascal VOC [2007, 2012]¶
VOC20{07,12}/
Annotations/
ImageSets/
Main/
trainval.txt
test.txt
JPEGImages/
Expected dataset structure for COCO2014¶
coco/
annotations/
instances_train2014.json
instances_val2014.json
train2014/
# image files that are mentioned in the corresponding json
val2014/
# image files that are mentioned in the corresponding json
Expected dataset structure for CIFAR10¶
The expected format is exactly the format used by torchvision, obtained after either:
- expanding the “CIFAR-10 python version” archive available at https://www.cs.toronto.edu/~kriz/cifar.html
- instantiating the torchvision.datasets.CIFAR10 class with download=True
cifar-10-batches-py/
batches.meta
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
readme.html
test_batch
Expected dataset structure for CIFAR100¶
The expected format is exactly the format used by torchvision, obtained after either:
- expanding the “CIFAR-100 python version” archive available at https://www.cs.toronto.edu/~kriz/cifar.html
- instantiating the torchvision.datasets.CIFAR100 class with download=True
cifar-100-python/
meta
test
train
Expected dataset structure for MNIST¶
The expected format is exactly the format used by torchvision, obtained after instantiating the torchvision.datasets.MNIST class with the flag download=True.
MNIST/
processed/
test.pt
training.pt
raw/
t10k-images-idx3-ubyte
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte
t10k-labels-idx1-ubyte.gz
train-images-idx3-ubyte
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte
train-labels-idx1-ubyte.gz
Expected dataset structure for STL10¶
The expected format is exactly the format used by torchvision, obtained after either:
- expanding the stl10_binary.tar.gz archive available at https://cs.stanford.edu/~acoates/stl10/
- instantiating the torchvision.datasets.STL10 class with download=True
stl10_binary/
class_names.txt
fold_indices.txt
test_X.bin
test_y.bin
train_X.bin
train_y.bin
unlabeled_X.bin
Expected dataset structure for SVHN¶
The expected format is exactly the format used by torchvision, obtained after either:
- downloading the train_32x32.mat, test_32x32.mat and extra_32x32.mat files available at http://ufldl.stanford.edu/housenumbers/ into the same folder
- instantiating the torchvision.datasets.SVHN class with download=True
svhn_folder/
test_32x32.mat
train_32x32.mat
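Once one of these torchvision datasets is laid out as above and its name registered in the catalog, pointing a split at it is a small configuration change. A minimal sketch, assuming a hypothetical catalog name and a placeholder path (label handling depends on your catalog entry):

DATA:
  TRAIN:
    DATASET_NAMES: [svhn_folder]
    DATA_SOURCES: [torchvision_dataset]
    DATA_PATHS: [<svhn_folder_path>]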
Expected dataset structure for the other benchmark datasets¶
VISSL supports benchmarks inspired by the VTAB and CLIP papers, for which the datasets either:
- do not directly exist, but are transformations of an existing dataset (like images extracted from videos)
- are not in a format directly compatible with the disk_folder or disk_filelist formats of VISSL, and are not yet part of the torchvision datasets
To run these benchmarks, the following data preparation scripts are mandatory:
- extra_scripts/datasets/create_caltech101_data_files.py: to transform the Caltech101 dataset to the disk_folder format
- extra_scripts/datasets/create_clevr_count_data_files.py: to create a disk_filelist dataset from CLEVR, where the goal is to count the number of objects in the scene
- extra_scripts/datasets/create_clevr_dist_data_files.py: to create a disk_filelist dataset from CLEVR, where the goal is to estimate the distance of the closest object in the scene
- extra_scripts/datasets/create_dsprites_location_data_files.py: to create a disk_folder dataset from dSprites, where the goal is to estimate the x coordinate of the sprite in the scene
- extra_scripts/datasets/create_dsprites_orientation_data_files.py: to create a disk_folder dataset from dSprites, where the goal is to estimate the orientation of the sprite in the scene
- extra_scripts/datasets/create_dtd_data_files.py: to transform the DTD dataset to the disk_folder format
- extra_scripts/datasets/create_euro_sat_data_files.py: to transform the EUROSAT dataset to the disk_folder format
- extra_scripts/datasets/create_fgvc_aircraft_data_files.py: to transform the FGVC Aircrafts dataset to the disk_folder format
- extra_scripts/datasets/create_food101_data_files.py: to transform the FOOD101 dataset to the disk_folder format
- extra_scripts/datasets/create_gtsrb_data_files.py: to transform the GTSRB dataset to the disk_folder format
- extra_scripts/datasets/create_imagenet_ood_data_files.py: to create test sets in disk_filelist format for Imagenet, based on Imagenet-A and Imagenet-R
- extra_scripts/datasets/create_kitti_dist_data_files.py: to create a disk_folder dataset from KITTI, where the goal is to estimate the distance of the closest car, van or truck
- extra_scripts/datasets/create_oxford_pets_data_files.py: to transform the Oxford Pets dataset to the disk_folder format
- extra_scripts/datasets/create_patch_camelyon_data_files.py: to transform the PatchCamelyon dataset to the disk_folder format
- extra_scripts/datasets/create_small_norb_azimuth_data_files.py: to create a disk_folder dataset from Small NORB, where the goal is to find the azimuth of the photographed object
- extra_scripts/datasets/create_small_norb_elevation_data_files.py: to create a disk_folder dataset from Small NORB, where the goal is to predict the elevation in the image
- extra_scripts/datasets/create_sun397_data_files.py: to transform the SUN397 dataset to the disk_filelist format
- extra_scripts/datasets/create_ucf101_data_files.py: to create a disk_folder image action recognition dataset from the video action recognition dataset UCF101, by extracting the middle frame of each video
You can read more about how to download these datasets and run these scripts here.
After data preparation, the output folders are either compatible with the disk_filelist
layout:
train_images.npy # Paths to the train images
train_labels.npy # Labels for each of the train images
val_images.npy # Paths to the val images
val_labels.npy # Labels for each of the val images
Or with the disk_folder
layout:
train/
label1/
image_1.jpeg
image_2.jpeg
...
label2/
image_x.jpeg
image_y.jpeg
...
...
val/
label1/
image_1.jpeg
image_2.jpeg
...
label2/
image_x.jpeg
image_y.jpeg
...
...
Note
In the case of the disk_folder layout, the images are copied into the output folder, so the input folder is no longer needed and can for instance be deleted.
In the case of the disk_filelist layout, the images are referenced inside the .npy files. It is therefore important to keep the input folder and not alter it (which includes not moving it).
The disk_filelist layout has the advantage of using less space, while the disk_folder layout offers total decoupling from the original dataset files and is more advantageous for a small number of images or when the inputs do not allow referencing images (for instance when extracting frames from videos, or when dealing with images in an unsupported format).
The aforementioned scripts use either the disk_folder or the disk_filelist format based on these constraints.
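Either layout can then be consumed through the configuration keys described earlier. A sketch for a disk_filelist train split (the paths are placeholders pointing at the prepared files):

DATA:
  TRAIN:
    DATA_SOURCES: [disk_filelist]
    DATA_PATHS: [<output_folder>/train_images.npy]
    LABEL_SOURCES: [disk_filelist]
    LABEL_PATHS: [<output_folder>/train_labels.npy]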
Dataloader¶
VISSL uses the PyTorch torch.utils.data.DataLoader and allows setting all the dataloader options as below. The dataloader is wrapped with DataloaderAsyncGPUWrapper or DataloaderSyncGPUWrapper, depending on whether the user wants to copy data to the gpu asynchronously or not.
The settings for the Dataloader
in VISSL are:
- dataset (GenericSSLDataset): the dataset object for which the dataloader is constructed
- dataset_config (dict): the configuration of the dataset; it should be the DATA.TRAIN or DATA.TEST settings
- num_dataloader_workers (int): number of workers per gpu (or cpu) for training
- pin_memory (bool): whether to pin memory or not
- multi_processing_method (str): method to use. Options: forkserver | fork | spawn
- device (torch.device): whether training happens on cuda or cpu
- get_sampler (get_sampler): function that is used to get the sampler
- worker_init_fn (None by default): any function that should be executed during initialization of dataloader workers
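A sketch of how some of these are typically driven from the yaml configuration (the key names are assumed from VISSL's default configuration, and the values are illustrative only):

MULTI_PROCESSING_METHOD: forkserver
DATA:
  NUM_DATALOADER_WORKERS: 8
  TRAIN:
    BATCHSIZE_PER_REPLICA: 32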
Using Data Collators¶
VISSL supports the PyTorch default collator torch.utils.data.dataloader.default_collate and also many custom data collators used in self-supervision. To use any collator, the user simply has to set DATA.TRAIN.COLLATE_FUNCTION to the name of the collator to use. See all the custom VISSL collators implemented here.
An example of specifying the collator for SwAV training:
DATA:
TRAIN:
COLLATE_FUNCTION: multicrop_collator
Using Data Transforms¶
VISSL supports all PyTorch TorchVision transforms, Augly transforms, as well as many custom transforms required by self-supervised approaches including MoCo, SwAV, PIRL, SimCLR, BYOL, etc. Using transforms is intuitive and easy in VISSL. Users specify the list of transforms they want to apply on the data in the order of application. This involves using the transform name and key: value pairs to specify the parameter values for the transform. See the full list of transforms implemented by VISSL here.
An example of the transforms used for SwAV:
DATA:
TRAIN:
TRANSFORMS:
- name: ImgPilToMultiCrop
total_num_crops: 6
size_crops: [224, 96]
num_crops: [2, 4]
crop_scales: [[0.14, 1], [0.05, 0.14]]
- name: RandomHorizontalFlip
p: 0.5
- name: ImgPilColorDistortion
strength: 1.0
- name: ImgPilGaussianBlur
p: 0.5
radius_min: 0.1
radius_max: 2.0
- name: ToTensor
- name: Normalize
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
To use an Augly transform, please first pip install augly and use Python >= 3.7. Then specify the Augly transform class name, set transform_type: augly, and specify any other arguments required. For example, using the Overlay Emoji transform:
DATA:
TRAIN:
TRANSFORMS:
- name: OverlayEmoji
transform_type: augly
opacity: 0.5
emoji_size: 0.2
x_pos: 0.3
y_pos: 0.4
p: 0.7
Using Data Sampler¶
VISSL supports 2 types of samplers:
- the PyTorch default torch.utils.data.distributed.DistributedSampler
- the VISSL sampler StatefulDistributedSampler, which is written specifically for large scale dataset trainings. See the documentation for the sampler.
By default, the PyTorch sampler is used unless the user specifies DATA.TRAIN.USE_STATEFUL_DISTRIBUTED_SAMPLER=true, in which case StatefulDistributedSampler will be used.
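In yaml form, opting into the VISSL sampler for the train split is a one-line change:

DATA:
  TRAIN:
    USE_STATEFUL_DISTRIBUTED_SAMPLER: true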