Filtering Methods for DataComp

Data is the fuel for modern vision-language models. Researchers are now experimenting with a variety of methods for curating training data, including deduplication, recaptioning, and filtering.

This visualization is meant to build intuition for the data pool and the baseline filters used in the DataComp paper. The 5,868 image-text pairs shown here are randomly sampled from the DataComp-small pool. Each pair is embedded with t-SNE applied to the concatenation of its CLIP image and text features, and colored by its CLIP score, which measures how well the image and its caption match.
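For readers who want to build a similar map offline, here is a minimal sketch of that pipeline. It assumes open_clip with the OpenAI ViT-B/32 checkpoint and scikit-learn's t-SNE; the exact model and settings behind this visualization may differ, and `image_paths` / `captions` are hypothetical inputs.

```python
# Minimal sketch: CLIP features -> CLIP score + t-SNE layout.
# Assumptions: open_clip (OpenAI ViT-B/32) and scikit-learn's t-SNE.
import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.manifold import TSNE

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def embed_pairs(image_paths, captions):
    """Return L2-normalized CLIP image/text features for a list of pairs."""
    with torch.no_grad():
        imgs = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
        img_feat = model.encode_image(imgs)
        txt_feat = model.encode_text(tokenizer(captions))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return img_feat.numpy(), txt_feat.numpy()

img_feat, txt_feat = embed_pairs(image_paths, captions)  # hypothetical inputs

# CLIP score = 100 * cosine similarity between an image and its own caption.
clip_scores = 100 * (img_feat * txt_feat).sum(axis=1)

# 2-D layout: t-SNE on the concatenated image + text features.
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
    np.concatenate([img_feat, txt_feat], axis=1)
)
# `xy` gives each dot's position; `clip_scores` drives its color.
```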

Try different values and combinations of the filters yourself to see what they actually do. Are they effective at finding low-quality data? Are they biased? What might better data-curation methods look like? Your idea could make a difference!

Click a dot to display an image and its caption.

Adjust the filters using the sliders on the left, or click one of the following values used in the baseline filters:

Also try Basic filtering below:
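For reference, here is a rough sketch of what these two baseline filters compute on the underlying metadata. It assumes a hypothetical pandas DataFrame `df` with columns `clip_score`, `caption`, `language`, `original_width`, and `original_height`; the thresholds mirror our reading of the DataComp baselines and should be treated as illustrative rather than exact.

```python
# Illustrative sketch of the CLIP-score and Basic filters as boolean masks.
import pandas as pd

def clip_score_filter(df: pd.DataFrame, threshold: float = 30.0) -> pd.Series:
    """Keep pairs whose CLIP score exceeds the threshold (e.g. > 30)."""
    return df["clip_score"] > threshold

def basic_filter(df: pd.DataFrame) -> pd.Series:
    """Keep English captions of reasonable length and reasonably sized images."""
    min_dim = df[["original_width", "original_height"]].min(axis=1)
    aspect = df[["original_width", "original_height"]].max(axis=1) / min_dim
    return (
        (df["language"] == "en")
        & (df["caption"].str.split().str.len() > 2)
        & (df["caption"].str.len() > 5)
        & (min_dim > 200)
        & (aspect < 3)
    )

# Filters compose by intersection: a pair survives only if every mask keeps it.
kept = df[clip_score_filter(df, 30.0) & basic_filter(df)]
print(f"kept {len(kept)} of {len(df)} pairs")
```

Because the filters compose by intersection, comparing the sizes of the individual and combined masks is also a quick way to check how much one filter removes beyond another, which relates to recommendation (1) below.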

Recommendations for your exploration: (1) How do the filters interact with each other? For example, does Basic filtering still help after applying CLIP score > 30? (2) Do the baseline filters throw out anything that might be useful for model training? (3) Try locating some semantically meaningful clusters and observe how they change as you adjust the filters.

License Notices: The original image URL-text samples and metadata were released by DataComp under the Creative Commons CC-BY-4.0 license. The individual images remain under their own copyrights. The visualization itself and any modifications were made for a course project and are the property of Guang Yang and Siting Li. Please contact us with any inquiries.