Add new Data Source¶
VISSL supports data loading from disk
as the default data source. If users dataset lives in their custom data storage solution my_data_source
instead of disk
, then users can extend VISSL to work with their data storage. Follow the steps below:
Step1: Implement your custom data source under
vissl/data/my_data_source.py
following the template:
from vissl.data.data_helper import get_mean_image
from torch.utils.data import Dataset
class MyNewSourceDataset(Dataset):
"""
add documentation on how this dataset works
Args:
add docstrings for the parameters
"""
def __init__(self, cfg, data_source, path, split, dataset_name):
super(MyNewSourceDataset, self).__init__()
assert data_source in [
"disk_filelist",
"disk_folder",
"my_data_source"
], "data_source must be either disk_filelist or disk_folder or my_data_source"
self.cfg = cfg
self.split = split
self.dataset_name = dataset_name
self.data_source = data_source
self._path = path
# implement anything that data source init should do
....
....
self._num_samples = ?? # set the length of the dataset
def num_samples(self):
"""
Size of the dataset
"""
return self._num_samples
def __len__(self):
"""
Size of the dataset
"""
return self.num_samples()
def __getitem__(self, idx: int):
"""
implement how to load the data corresponding to idx element in the dataset
from your data source
"""
....
....
# is_success should be True or False indicating whether loading data was successful or failed
# loaded data should be Image.Image if image data
return loaded_data, is_success
Step2: Register the new data source with VISSL. Extend the
DATASET_SOURCE_MAP
dict invissl/data/__init__.py
.
DATASET_SOURCE_MAP = {
"disk_filelist": DiskImageDataset,
"disk_folder": DiskImageDataset,
"synthetic": SyntheticImageDataset,
"my_data_source": MyNewSourceDataset,
}
Step3: Register the name of the datasets you plan to load using the new data source. There are 2 ways to do this:
See our documentation on “Using dataset_catalog.json” to update the
configs/dataset_catalog.json
file.Insert a python call following:
# insert the following call in your python code from vissl.data.dataset_catalog import VisslDatasetCatalog VisslDatasetCatalog.register_data(name="my_dataset_name", data_dict={"train": ... , "test": ...})
Step4: Test using your dataset
DATA:
TRAIN:
DATA_SOURCES: [my_data_source]
DATASET_NAMES: [my_dataset_name]