Metadata

  • Source
  • File: Snapshot
  • Zotero: View Item
  • Type: Webpage
  • Title: CLIP: Connecting text and images
  • Author: Radford, Alec; Sutskever, Ilya; Kim, Jong Wook; Krueger, Gretchen; Agarwal, Sandhini;
  • Year: 2021

Annotations

Notes

Background

CLIP has demonstrated performance equal or superior to an ImageNet-trained ResNet-101 on various datasets, since it can recognize visual concepts regardless of how the classification task presents them (e.g. real-world photographs, sketches, artwork, etc.).

CLIP addresses the following issues in training computer vision models:

  • Training is costly yet the resulting model is narrow: each new task requires training another model.
  • Deep learning models that score well on benchmarks often perform poorly on stress tests.

To generalize better, CLIP is trained with natural language supervision on data freely available across the internet. It performs better than an ImageNet-trained ResNet-101 on multiple benchmarks without being optimized specifically for them.

Rather than predicting from a fixed set of classes, CLIP uses natural language as its prediction space, so the model generalizes more easily to new prediction targets (zero-shot transfer).

Approach

  • scaling a simple pre-training task → good performance when generalized
  • data found all over the internet: roughly 400 million image-text pairs (an image plus text found alongside it on the web)
  • the data is used to create a proxy training task: given an input image, pick the text actually paired with it out of 32,768 sampled text snippets
  • intuition: under the hood, CLIP has to learn to associate words in the text with visual concepts/features in the image, e.g. if the dataset is dog vs. cat and the caption mentions a dog, CLIP should pick the dog caption; hence the "Contrastive" in Contrastive Language-Image Pre-training (see the sketch below)
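
The training objective can be sketched as a symmetric contrastive loss, following the pseudocode in the CLIP paper. A minimal PyTorch sketch, assuming two encoders have already produced batched image and text embeddings; the fixed `temperature` value here is a placeholder (CLIP learns it during training):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] tensors from the two encoders.
    The i-th image and i-th text are the only positive pair; every other
    pairing in the batch acts as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix, scaled by the temperature
    logits = image_features @ text_features.t() / temperature

    # The correct pairings lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```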

benefits of CLIP over other types of CV model:

  • low cost of building the dataset: image-text pairs are freely available on the internet; no manual labelling involved
  • generality: a great deal of effort is needed to repurpose ImageNet models or expand their range of output classes, whereas CLIP can do it with no examples, given only textual descriptions of the new class's visual concepts; it then produces a linear classifier over CLIP's visual representations, with performance comparable to fully supervised models (see the zero-shot sketch after this list)
  • decent real-world performance: deep learning models often perform well (even beyond human capability) on benchmark datasets yet suffer in the real world, whereas CLIP's benchmark performance is closer to its real-world performance, since CLIP is not trained on the benchmark's dataset at all
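
A minimal sketch of that zero-shot use, assuming OpenAI's open-source clip package (https://github.com/openai/CLIP); the image path and class names are placeholders:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Custom classes need no training data, just natural-language descriptions.
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(class_names).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the image and each class description, turned into
# per-class probabilities: effectively a linear classifier whose weights are
# the text embeddings.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({name: round(float(p), 3) for name, p in zip(class_names, probs[0])})
```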

Key takeaways

Highly efficient

  • trained on unfiltered, varied, noisy data; intended for zero-shot use
  • as with GPT-2 and GPT-3, this kind of data is what enables zero-shot performance
  • this would normally require a great deal of compute, but training was made efficient by a better training objective (contrastive learning):
    • initial attempt: a transformer language model that predicts each image's caption (1x baseline efficiency)
    • bag-of-words prediction objective (3x more efficient)
    • CLIP: bag-of-words contrastive objective (12x more efficient)

Flexible & general

  • CLIP learns a wide range of visual concepts → more general
  • achieved competitive performance on various zero-shot tasks

Limitations

  • good at recognizing common objects, but on abstract/systematic tasks (e.g. counting the number of objects) or more complex tasks (e.g. estimating the distance of the nearest car in an image) it is only slightly better than random guessing
  • also struggles with very fine-grained tasks (e.g. classifying car models, aircraft variants, or flower species)
  • poor generalization to images/visual concepts not covered in the pre-training dataset
  • sensitive to wording/phrasing, and may require "prompt engineering" for good performance (see the sketch after this list)
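
One common form of prompt engineering is to wrap each class name in several templates and average the resulting text embeddings, as the CLIP repository's zero-shot examples do. A minimal sketch, again assuming the clip package; the templates and class names below are illustrative placeholders:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# A few illustrative templates; the CLIP repo ships a much longer curated list.
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sketch of a {}.",
]

def class_embedding(class_name):
    """Average the normalized text embeddings of several phrasings of one class."""
    prompts = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        emb = model.encode_text(prompts)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    emb = emb.mean(dim=0)
    return emb / emb.norm()

# The ensembled class embeddings replace single-prompt text features when
# scoring images, which tends to be more robust to wording choices.
zero_shot_weights = torch.stack([class_embedding(c) for c in ["dog", "cat"]], dim=1)
```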

Impacts

  • allows people to design custom classifiers with much less effort (no need to build task-specific training data)
  • speeds up building niche classifiers
  • potential for use in surveillance (identity recognition, etc.)

Conclusion

CLIP shows that task-agnostic pre-training on natural language supervision can produce general, flexible vision models.