Handling invalid images in dataloader

How VISSL solves it

Self-supervised approaches such as SimCLR and SwAV perform some form of contrastive learning: they contrast the features or clusters of one image with those of another. The training dataset may contain invalid images, or images may fail to load at dataloading time. By default, when the VISSL dataloader encounters an invalid image, it returns a gray image instead. Using gray images for contrastive learning can reduce model accuracy, especially when there are many invalid images.

To solve this issue, VISSL provides a custom base dataset class called QueueDataset that maintains two queues in CPU memory. One queue is used to enqueue valid images seen in previous minibatches, and the other is used to dequeue them. QueueDataset is implemented so that the same minibatch never contains duplicate images. If no valid image can be dequeued, the dequeue returns None. In short, QueueDataset lets the current minibatch reuse valid images from earlier in training in place of invalid ones.
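The two-queue idea can be illustrated with a minimal Python sketch. This is not VISSL's QueueDataset implementation: the class name TwoQueueFallback, its methods on_success and on_failure, and the queue_size parameter are hypothetical names chosen for illustration, and the refill policy (hand samples over only once the enqueue side is full) is an assumption. The delay between enqueueing and dequeueing makes it less likely that a replacement image comes from the current minibatch, but this sketch does not by itself guarantee it.

```python
from collections import deque
from typing import Any, Optional


class TwoQueueFallback:
    """Illustrative two-queue buffer of valid samples (hypothetical, not VISSL's API)."""

    def __init__(self, queue_size: int) -> None:
        self.queue_size = queue_size
        self.enqueue_queue: deque = deque()  # collects recently seen valid samples
        self.dequeue_queue: deque = deque()  # serves replacements for invalid samples

    def on_success(self, sample: Any) -> None:
        """Record a valid sample so that it can stand in for a later invalid one."""
        if len(self.enqueue_queue) < self.queue_size:
            self.enqueue_queue.append(sample)
        # Assumed policy: once the enqueue side is full, move its samples
        # to the dequeue side until that side is full as well.
        if len(self.enqueue_queue) == self.queue_size:
            while self.enqueue_queue and len(self.dequeue_queue) < self.queue_size:
                self.dequeue_queue.append(self.enqueue_queue.popleft())

    def on_failure(self) -> Optional[Any]:
        """Return a previously seen valid sample, or None if none is available."""
        if not self.dequeue_queue:
            return None
        return self.dequeue_queue.popleft()


if __name__ == "__main__":
    fallback = TwoQueueFallback(queue_size=2)
    fallback.on_success("image_a")
    fallback.on_success("image_b")
    replacement = fallback.on_failure()
    print("replacement for an invalid image:", replacement)
```

In this sketch, on_failure returns None while no replacement is available, mirroring the behaviour described above; the caller has to decide what to do in that case.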

Enabling QueueDataset

VISSL makes it convenient to use QueueDataset through simple configuration settings. To enable QueueDataset, set DATA.TRAIN.ENABLE_QUEUE_DATASET=true and DATA.TEST.ENABLE_QUEUE_DATASET=true.

Tuning the queue size of QueueDataset

VISSL exposes the queue settings in the configuration file so that users can tune them. The configuration settings are:

DATA:
  TRAIN:
    ENABLE_QUEUE_DATASET: True
  TEST:
    ENABLE_QUEUE_DATASET: True

Note

If users run into CPU out-of-memory issues, they might want to reduce the queue size.