Using Data¶
To use a dataset in VISSL, the only requirements are:

- the dataset name should be registered with VisslDatasetCatalog in VISSL. Only the name is important; the paths are not and can be specified in the configuration file. Users can either edit the dataset_catalog.json or specify the paths in the configuration file.
- the dataset should be from a supported data source.
Reading data from several sources¶
VISSL allows reading data from multiple sources (disk, etc.) and in multiple formats (e.g. a folder path or a .npy file).
The GenericSSLDataset class is defined to support reading data from multiple data sources. For example, with data = [dataset1, dataset2], the generated minibatches will contain the corresponding data from each dataset. For the same reason, we also support labels from multiple sources, e.g. targets = [dataset1 targets, dataset2 targets].
Source of the data (disk_filelist | disk_folder):

- disk_folder: this is simply the root folder path to the downloaded data.
- disk_filelist: these are numpy (or .pkl) files: (1) a file containing the image information and (2) a file containing the corresponding labels for the images. We provide scripts that can be used to prepare these two files for a dataset of choice.
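As an illustration of the disk_filelist layout (not one of the provided preparation scripts), the sketch below builds a hypothetical pair of numpy files: one holding image paths and one holding the matching labels. All paths here are placeholders.

```python
import os
import tempfile

import numpy as np

# Hypothetical example: the two index-aligned files a disk_filelist
# source expects. labels[i] is the label of image_paths[i].
image_paths = np.array([
    "/path/to/images/img_0.jpg",
    "/path/to/images/img_1.jpg",
    "/path/to/images/img_2.jpg",
])
labels = np.array([0, 1, 0])

out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, "train_images.npy"), image_paths)
np.save(os.path.join(out_dir, "train_labels.npy"), labels)

# Reload to confirm the file round-trips.
loaded = np.load(os.path.join(out_dir, "train_images.npy"))
```

The only hard requirement is that the two arrays stay index-aligned so that image i and label i refer to the same sample.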
To use a dataset, VISSL takes the following inputs in the configuration file for each dataset split (train, test):

- DATASET_NAMES: names of the datasets that are registered with VisslDatasetCatalog. Registering the dataset name is important. Example: DATASET_NAMES=[imagenet1k_folder, my_new_dataset_filelist]
- DATA_SOURCES: the sources of the dataset. Options: disk_folder | disk_filelist. This specifies where the data lives. Users can extend it for their purposes. Example: DATA_SOURCES=[disk_folder, disk_filelist]
- DATA_PATHS: the paths to the dataset. The paths can be folder paths (for example, the Imagenet1k folder) or .npy filepaths. For folder paths, VISSL uses ImageFolder from PyTorch. Example: DATA_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset>]
- LABEL_SOURCES: just like images, the targets can also come from several sources. Example: LABEL_SOURCES=[disk_folder] for Imagenet1k. Example: LABEL_SOURCES=[disk_folder, disk_filelist]
- LABEL_PATHS: similar to DATA_PATHS but for labels. Example: LABEL_PATHS=[<imagenet1k_folder_path>, <numpy_file_path_for_new_dataset_labels>]
- LABEL_TYPE: choose from standard | sample_index. sample_index is a common practice in self-supervised learning; the label is simply the index of the sample in the data. The standard label type is used for supervised learning, where the user specifies the annotated labels to use.
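Putting these keys together, a sketch of a train split mixing a folder dataset with a filelist dataset might look like the following (the nesting under DATA.TRAIN follows the document's other examples; all paths are placeholders):

```yaml
DATA:
  TRAIN:
    DATASET_NAMES: [imagenet1k_folder, my_new_dataset_filelist]
    DATA_SOURCES: [disk_folder, disk_filelist]
    DATA_PATHS: [/path/to/imagenet1k/train, /path/to/my_dataset/train_images.npy]
    LABEL_SOURCES: [disk_folder, disk_filelist]
    LABEL_PATHS: [/path/to/imagenet1k/train, /path/to/my_dataset/train_labels.npy]
    LABEL_TYPE: standard
```

The lists are positional: entry i of DATA_SOURCES, DATA_PATHS, LABEL_SOURCES, and LABEL_PATHS all describe the dataset named at position i of DATASET_NAMES.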
Using dataset_catalog.json¶
In order to use a dataset with VISSL, the dataset name must be registered with VisslDatasetCatalog. VISSL maintains a dataset_catalog.json which is parsed by VisslDatasetCatalog, and the datasets are registered with VISSL, ready to use.
Users can edit the template dataset_catalog.json file to specify the paths to their datasets. The json file is fully decided by the user and can contain any number of supported datasets (one or more). Users can give datasets any string names of their choice.
Template for a dataset entry in dataset_catalog.json¶
"data_name": {
    "train": [
        "<images_path_or_folder>", "<labels_path_or_folder>"
    ],
    "val": [
        "<images_path_or_folder>", "<labels_path_or_folder>"
    ]
}
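To make the template concrete, the sketch below parses a hypothetical catalog entry (placeholder name and paths) and pulls out the train split, showing that each split maps to an [images, labels] pair.

```python
import json

# Hypothetical catalog entry following the template above;
# "my_dataset_filelist" and the paths are placeholders.
catalog_json = """
{
  "my_dataset_filelist": {
    "train": ["/path/to/train_images.npy", "/path/to/train_labels.npy"],
    "val": ["/path/to/val_images.npy", "/path/to/val_labels.npy"]
  }
}
"""
catalog = json.loads(catalog_json)

# Each split is a two-element list: [images_path_or_folder, labels_path_or_folder].
train_images, train_labels = catalog["my_dataset_filelist"]["train"]
```

Note that the file must be valid JSON, so the last entry in an object cannot be followed by a trailing comma.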
The images_path_or_folder and labels_path_or_folder can be directories or filepaths (numpy or pickle files).
Users can mix and match the sources of images and labels, i.e. the labels can be a filelist while the images are a folder path. The yaml configuration files require specifying LABEL_SOURCES and DATA_SOURCES, which allows the code to figure out how to ingest the various sources.
Note
Filling the dataset_catalog.json is a one-time process and provides the benefit of simply accessing any dataset by its name in the configuration files for all subsequent trainings.
Using Builtin datasets¶
VISSL supports several builtin datasets, as indicated in the dataset_catalog.json file. Users can specify the paths to those datasets.
Expected dataset structure for ImageNet, Places205, Places365¶
{imagenet, places205, places365}
train/
<n0......>/
<im-1-name>.JPEG
...
<im-N-name>.JPEG
...
<n1......>/
<im-1-name>.JPEG
...
<im-M-name>.JPEG
...
...
val/
<n0......>/
<im-1-name>.JPEG
...
<im-N-name>.JPEG
...
<n1......>/
<im-1-name>.JPEG
...
<im-M-name>.JPEG
...
...
Expected dataset structure for Pascal VOC [2007, 2012]¶
VOC20{07,12}/
Annotations/
ImageSets/
Main/
trainval.txt
test.txt
JPEGImages/
Expected dataset structure for COCO2014¶
coco/
annotations/
instances_train2014.json
instances_val2014.json
train2014/
# image files that are mentioned in the corresponding json
val2014/
# image files that are mentioned in the corresponding json
Dataloader¶
VISSL uses PyTorch's torch.utils.data.DataLoader and allows setting all the dataloader options as below. The dataloader is wrapped with DataloaderAsyncGPUWrapper or DataloaderSyncGPUWrapper depending on whether the user wants to copy data to the GPU asynchronously or not.
The settings for the Dataloader in VISSL are:
- dataset (GenericSSLDataset): the dataset object for which the dataloader is constructed
- dataset_config (dict): configuration of the dataset. It should be the DATA.TRAIN or DATA.TEST settings
- num_dataloader_workers (int): number of workers per gpu (or cpu) training
- pin_memory (bool): whether to pin memory or not
- multi_processing_method (str): method to use. Options: forkserver | fork | spawn
- device (torch.device): training on cuda or cpu
- get_sampler (get_sampler): function that is used to get the sampler
- worker_init_fn (None by default): any function that should be executed during initialization of the dataloader workers
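To show how these settings map onto PyTorch's own dataloader, here is a minimal sketch that builds a torch.utils.data.DataLoader with the options listed above, using a toy dataset as a stand-in for GenericSSLDataset (ToyDataset and its contents are invented for illustration).

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    # Stand-in for GenericSSLDataset: 10 scalar "images" with binary labels.
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.tensor([float(idx)]), idx % 2


dataset = ToyDataset()
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=0,        # num_dataloader_workers (0 = load in the main process)
    pin_memory=False,     # pin_memory
    worker_init_fn=None,  # worker_init_fn
    # sampler=... is where get_sampler would plug in a distributed sampler
)

# 10 samples at batch_size=4 yield batches of size 4, 4, and 2.
batches = [images for images, _labels in loader]
```

In VISSL these knobs come from the configuration rather than being passed by hand, and the resulting loader is then wrapped by the sync/async GPU wrapper described above.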
Using Data Collators¶
VISSL supports the PyTorch default collator torch.utils.data.dataloader.default_collate as well as many custom data collators used in self-supervision. To use any collator, the user simply sets DATA.TRAIN.COLLATE_FUNCTION to the name of the collator to use. See all the custom VISSL collators implemented here.
An example for specifying collator for SwAV training:
DATA:
TRAIN:
COLLATE_FUNCTION: multicrop_collator
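To give a feel for what a multi-crop collator does, here is a pure-Python sketch of the idea (not VISSL's actual multicrop_collator): each sample arrives as a list of crops, and the collator regroups crop i of every sample into one batch.

```python
def multicrop_collate(batch):
    """Regroup per-sample crop lists into per-crop batches.

    batch: list of (crops, label) pairs, where crops is a list of N crops.
    Returns (crop_batches, labels), where crop_batches[i] holds crop i of
    every sample. A sketch of the idea only, not VISSL's implementation.
    """
    crops_per_sample = [crops for crops, _ in batch]
    labels = [label for _, label in batch]
    num_crops = len(crops_per_sample[0])
    crop_batches = [
        [sample_crops[i] for sample_crops in crops_per_sample]
        for i in range(num_crops)
    ]
    return crop_batches, labels


# Two samples, each with 3 crops (crops shown as strings for brevity;
# in practice they would be image tensors).
batch = [(["a0", "a1", "a2"], 0), (["b0", "b1", "b2"], 1)]
crop_batches, labels = multicrop_collate(batch)
```

Grouping by crop index lets the training loop forward all crops of a given resolution through the model in one pass, which is why multi-crop approaches such as SwAV use a custom collator instead of the default one.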
Using Data Transforms¶
VISSL supports all PyTorch TorchVision transforms as well as many transforms required by self-supervised approaches including MoCo, SwAV, PIRL, SimCLR, BYOL, etc. Using transforms in VISSL is very intuitive and easy. Users specify the list of transforms they want to apply to the data in the order of application. This involves using the transform name and key: value pairs to specify the parameter values for the transform. See the full list of transforms implemented by VISSL here.
An example of transform for SwAV:
DATA:
TRAIN:
TRANSFORMS:
- name: ImgPilToMultiCrop
total_num_crops: 6
size_crops: [224, 96]
num_crops: [2, 4]
crop_scales: [[0.14, 1], [0.05, 0.14]]
- name: RandomHorizontalFlip
p: 0.5
- name: ImgPilColorDistortion
strength: 1.0
- name: ImgPilGaussianBlur
p: 0.5
radius_min: 0.1
radius_max: 2.0
- name: ToTensor
- name: Normalize
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
Using Data Sampler¶
VISSL supports 2 types of samplers:

- the PyTorch default torch.utils.data.distributed.DistributedSampler
- the VISSL sampler StatefulDistributedSampler, which is written specifically for large-scale dataset trainings. See the documentation for the sampler.

By default, the PyTorch default sampler is used unless the user specifies DATA.TRAIN.USE_STATEFUL_DISTRIBUTED_SAMPLER=true, in which case StatefulDistributedSampler will be used.
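To illustrate what "stateful" buys you, here is a minimal pure-Python sketch of a resumable distributed sampler. It is a hypothetical toy, not VISSL's StatefulDistributedSampler: each rank gets a disjoint slice of the indices, and a start index lets a resumed training skip samples it has already consumed.

```python
class TinyStatefulSampler:
    """Toy resumable distributed sampler (hypothetical, for illustration).

    Each rank iterates over a disjoint strided slice of the dataset
    indices; set_start_index lets a resumed run continue where the
    previous one stopped instead of replaying seen samples.
    """

    def __init__(self, num_samples, rank, world_size):
        self.indices = list(range(rank, num_samples, world_size))
        self.start = 0

    def set_start_index(self, start):
        # Position within this rank's slice to resume from.
        self.start = start

    def __iter__(self):
        return iter(self.indices[self.start:])


# 8 samples split across 2 ranks; rank 0 resumes after having
# already consumed 2 of its samples (indices 0 and 2).
sampler = TinyStatefulSampler(num_samples=8, rank=0, world_size=2)
sampler.set_start_index(2)
remaining = list(sampler)
```

A plain DistributedSampler restarts from the beginning of an epoch on resume, which is costly for very large datasets; persisting a position like this is the core idea behind the stateful variant.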