Add new Data Source¶
VISSL supports data loading from disk_folder
, disk_filelist
or torchvision_dataset
as default data sources.
If your dataset lives in a custom data storage solution instead, you can extend VISSL to work with your data storage in several different ways:
Exporting the content of the non supported datasource as
disk_filelist
Exporting the content of the non supported datasource as
disk_folder
Implementing a new type of data source inside VISSL
Each of these options is developed below.
Transforming to a disk_filelist format¶
Out of the box, VISSL supports any dataset following the disk_filelist
format:
/path/to/dataset/
train_images.npy
train_labels.npy
val_images.npy
val_labels.npy
Note
The name and number of partitions may differ: you can for instance create 3 different partitions for train/val/test.
The *_images.npy
files should contain the path to the images, one path for each sample, while the *_labels.npy
files should contain the corresponding labels.
There are two formats supported for the labels: either integers (from 0 to N-1 for N classes) or strings.
Once a disk_filelist
dataset is made available at path /path/to/dataset
, plugging the dataset in the library is a simple two step process:
add the paths to the “dataset_catalog.json” registry of dataset
"my_dataset_filelist": {
"train": ["/path/to/dataset/train_images.npy", "/path/to/dataset/train_labels.npy"],
"val": ["/path/to/dataset/val_images.npy", "/path/to/dataset/val_labels.npy"],
},
reference the new dataset in a configuration file:
config:
DATA:
TRAIN:
DATA_SOURCES: [disk_filelist]
LABEL_SOURCES: [disk_filelist]
DATASET_NAMES: [my_dataset_filelist]
...
TEST:
DATA_SOURCES: [disk_filelist]
LABEL_SOURCES: [disk_filelist]
DATASET_NAMES: [my_dataset_filelist]
...
Some examples of scripts transforming existing data sources to the disk_filelist
format can be found in the extra_script folder.
For example, create_clever_count_data_files.py
creates a new classification dataset from the CLEVR dataset, in which the goal is to count the number of object in the scene.
Please refer to the documentation available here to get more information all the available data preparation scripts.
Transforming to a disk_folder format¶
Out of the box, VISSL also supports any dataset following the disk_folder
format:
/path/to/dataset/
train/
class1/
a.jpg
b.jpg
...
class2/
c.jpg
...
val/
class1/
d.jpg
e.jpg
...
class2/
f.jpg
...
This format requires to copy the images, which might take more disk space than the disk_filelist
format, but is nevertheless the best option in many cases.
In particular, if the original dataset does not allow us to reference image paths (it might be a video dataset or a custom binary format), the disk_filelist
is not an option anymore and disk_folder
might be the best option.
Once a disk_folder
dataset is made available at path /path/to/dataset
, plugging the dataset in the library is a simple two step process:
add the paths to the “dataset_catalog.json” registry of dataset
"my_dataset_folder": {
"train": ["/path/to/dataset/train", "<ignored>"],
"val": ["/path/to/dataset/val", "<ignored>"]
},
reference the new dataset in a configuration file:
config:
DATA:
TRAIN:
DATA_SOURCES: [disk_folder]
LABEL_SOURCES: [disk_folder]
DATASET_NAMES: [my_dataset_folder]
...
TEST:
DATA_SOURCES: [disk_folder]
LABEL_SOURCES: [disk_folder]
DATASET_NAMES: [my_dataset_folder]
...
Some examples of scripts transforming existing data sources to the disk_folder
format can be found in the extra_script folder.
For example, create_ucf101_data_files.py
: creates an image action recognition dataset from the video action recognition dataset UCF101 by extracting the middle frame of each video.
Please refer to the documentation available here to get more information all the available data preparation scripts.
Adding a new type of data source¶
If instead, you want to use a custom data storage solution my_data_source
instead of disk_folder
, you can extend VISSL to work with the my_data_source
data storage by following the steps below:
Step1: Implement your custom data source under
vissl/data/my_data_source.py
following the template:
from vissl.data.data_helper import get_mean_image
from torch.utils.data import Dataset
class MyNewSourceDataset(Dataset):
"""
add documentation on how this dataset works
Args:
add docstrings for the parameters
"""
def __init__(self, cfg, data_source, path, split, dataset_name):
super(MyNewSourceDataset, self).__init__()
assert data_source in [
"disk_filelist",
"disk_folder",
"my_data_source"
], "data_source must be either disk_filelist or disk_folder or my_data_source"
self.cfg = cfg
self.split = split
self.dataset_name = dataset_name
self.data_source = data_source
self._path = path
# implement anything that data source init should do
....
....
self._num_samples = ?? # set the length of the dataset
def num_samples(self):
"""
Size of the dataset
"""
return self._num_samples
def __len__(self):
"""
Size of the dataset
"""
return self.num_samples()
def __getitem__(self, idx: int):
"""
implement how to load the data corresponding to idx element in the dataset
from your data source
"""
....
....
# is_success should be True or False indicating whether loading data was successful or failed
# loaded data should be Image.Image if image data
return loaded_data, is_success
Step2: Register the new data source with VISSL. Extend the
DATASET_SOURCE_MAP
dict invissl/data/__init__.py
.
DATASET_SOURCE_MAP = {
"disk_filelist": DiskImageDataset,
"disk_folder": DiskImageDataset,
"torchvision_dataset": TorchvisionDataset,
"synthetic": SyntheticImageDataset,
"my_data_source": MyNewSourceDataset,
}
Step3: Register the name of the datasets you plan to load using the new data source. There are 2 ways to do this:
See our documentation on Using dataset_catalog.json to update the
configs/dataset_catalog.json
file.Insert a python call following:
# insert the following call in your python code from vissl.data.dataset_catalog import VisslDatasetCatalog VisslDatasetCatalog.register_data(name="my_dataset_name", data_dict={"train": ... , "test": ...})
Step4: Test using your dataset
DATA:
TRAIN:
DATA_SOURCES: [my_data_source]
DATASET_NAMES: [my_dataset_name]