Metadata

Annotations

Notes

This paper proposes a new model called CLIP.

Summary

CLIP is a model that achieves competitive performance across many zero-shot tasks and data distributions compared to supervised models. Its generality stems from contrastive learning: a scaled-up pre-training task of matching images with entire captions. Because training only requires the model to match captions rather than predict them word by word, it learns visual concepts from textual descriptions while using less compute than its predictive counterparts. In addition, by using natural language instead of fixed image classes as its output, the model generalizes more easily to never-before-seen labels in new datasets and tasks, which lets users label images with arbitrary classifications. Its natural-language prediction space even gives it surprising performance on an OCR task. Flexible as it is, the model still struggles with more complex and fine-grained tasks. Nonetheless, CLIP’s flexibility enables people to design custom classifiers without building datasets or retraining models.

Abstract

  • contrastive learning: a pre-training task (matching an image to its caption) scaled up to 400 million image-text pairs
    • model learns to associate words with visual concepts/features
  • enables generality of trained models: zero-shot classes are described using natural language and learned concepts
  • SOTA zero-shot
    • no task-specific image data needed
    • performance competitive with supervised baseline models
    • possibilities include OCR, action recognition, etc

Introduction and Motivating Work

  • pre-training methods on raw texts inspired NLP revolution
    • autoregressive language model: a language model based on predicting the next word in the text using past output (?)
    • masked language modeling: model for predicting missing words (“masked” words) in text
    • Development of text-to-text standardized I/O for zero-shot transfer to other datasets (no need for dataset customization)
    • GPT-3
    • These successes suggest that data aggregated from the internet (usable for pre-training) are better than high-quality crowd-labeled NLP datasets.
  • Attempt to reproduce the revolution in computer vision
    • CV still pre-train models with ImageNet. What about data from the web?
  • bag-of-words phrase n-gram model can do zero-shot
  • VirTex, ICMLM, ConVIRT (transformer-based language modeling, masked language modeling, contrastive objectives)
  • natural language supervision for image representation learning is still rare, and benchmark performance is lower (e.g. Li et al. 2017 zero-shot achieves 11.5% accuracy on ImageNet); the limited amount of labeled image data restricts the supervised class vocabulary, which in turn limits zero-shot ability
  • no model had fully committed to learning from raw image + text data on the internet; natural language data offers many more classes, more generality, and better zero-shot capability

  • 400 million image-text pairs gathered from internet
  • CLIP is a simplified ConVIRT trained from scratch using these data and “is an efficient method of learning from natural language supervision.”
  • various zero-shot capabilities: OCR, geo-localization, action recognition, etc
  • competitive to task-specific models
  • CLIP is much more robust; zero-shot performance may be more indicative of a model’s true capability

Approach

Natural Language Supervision

See: natural language supervision

  • over time, natural language supervision has been described variously, ranging from unsupervised/self-supervised to supervised
  • natural language has become a good training material for vision tasks
  • benefit: easier to scale compared to manually labeling images

Creating a Sufficiently Large Dataset

  • high quality labeled datasets are small by modern standards
  • large datasets contain data of varying quality (e.g. no English description)
  • motivation for Natural Language Supervision is quantity of freely available data on internet
  • CLIP uses a new dataset of image-text pairs (“WIT”–WebImageText)
    • image-text pairs are found by searching for queries built from words that appear at least 100 times in English Wikipedia

Selecting an Efficient Pre-Training Method

  • simply predicting exact image caption requires enormous compute
  • contrastive objective learns visual concepts better than predictive objective
  • scale comes from the easier pre-training task: match an image with its caption as a whole instead of predicting the exact wording; moving from the predictive bag-of-words baseline to this contrastive objective gives a further 4x efficiency in zero-shot transfer to ImageNet
  • to score image-caption pairs, CLIP jointly trains an image encoder and a text encoder to maximize the cosine similarity of the correct pairs and minimize the cosine similarity of the incorrect pairs (see the sketch after this list)
  • “optimize[s] a symmetric cross entropy loss over these similarity scores”
  • technique first found in deep metric learning technique “multi-class N-pair loss” (not a new idea), recently applied to contrastive learning for image-text (medical)
  • large dataset, so overfitting shouldn’t be an issue
  • CLIP was trained from scratch (no pre-set weights, i.e. no cheating)
  • used a linear instead of a non-linear projection to map each encoder’s representation into the contrastive embedding space; prior self-supervised methods used a non-linear head, but the authors saw no difference in training efficiency, so they kept the simpler linear layer
  • only used limited data augmentation: “random square crop from resized images”
  • the “logit” detail: the cosine similarities are scaled by a learnable temperature parameter before the softmax (also shown in the sketch below). Probably not relevant to the project.
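
A minimal sketch of that symmetric contrastive objective as I understand it (my own PyTorch rendition, not the authors’ released code); `image_features` and `text_features` stand in for the projected encoder outputs, and the fixed `temperature` stands in for CLIP’s learnable one:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: [batch, dim] projections into the shared embedding space."""
    # L2-normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature
    logits_per_image = image_features @ text_features.t() / temperature  # [batch, batch]
    logits_per_text = logits_per_image.t()

    # The matching caption for image i is text i, so the targets are the diagonal
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross entropy: pick the right text for each image and vice versa
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2
```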

Choosing and Scaling a Model

  • image encoder:
    • ResNet-50 (?)
    • ViT (vision transformer)
  • text encoder: transformer
    • what’s an attention head? masked self-attention? probably need to read the attention paper
  • byte-pair encoding? (the captions are lower-cased and tokenized with a ~49K-token byte-pair-encoding vocabulary; the overall two-encoder layout is sketched after this list)
  • didn’t understand the last part about scaling
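
As far as I can tell the overall layout is a two-tower setup: each encoder produces a representation, and a linear projection maps it into the shared embedding space. A rough skeleton under that assumption (the encoder modules, dimensions, and names are placeholders):

```python
import torch.nn as nn

class TwoTowerCLIP(nn.Module):
    """Skeleton only: the real image encoder is a modified ResNet or a ViT,
    and the real text encoder is a masked self-attention transformer."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Linear (not non-linear) projections into the shared contrastive embedding space
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images, token_ids):
        image_embeds = self.image_proj(self.image_encoder(images))
        text_embeds = self.text_proj(self.text_encoder(token_ids))
        return image_embeds, text_embeds
```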

Training

  • trained 5 ResNets and 3 ViTs
  • Adam optimizer “with decoupled weight decay regularization”, i.e. AdamW, where the weight-decay step is applied separately from the gradient update
  • what’s a “learnable temperature parameter”? (it scales the similarity logits before the softmax and is learned during training rather than tuned by hand; see the sketch after this list)
  • a bunch of techniques that I don’t know
  • wow that’s a lot of computing power and money
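
My attempt to unpack two of those unknowns, assuming (consistent with the released open-source implementation) that the temperature is stored as a learnable log-scale scalar and that “decoupled weight decay” just means AdamW; the numbers here are placeholders, not the paper’s exact configuration:

```python
import math
import torch

# Learnable temperature: a log-scale scalar initialized to log(1/0.07),
# optimized together with the encoder weights.
log_temperature = torch.nn.Parameter(torch.tensor(math.log(1.0 / 0.07)))

# "Decoupled weight decay" = AdamW, where the weight-decay step is applied
# separately from the Adam gradient update.
params = [log_temperature]  # plus all encoder/projection parameters in the real model
optimizer = torch.optim.AdamW(params, lr=5e-4, weight_decay=0.2)

# During the loss computation the similarity logits are multiplied by
# exp(log_temperature), clamped so the scale never exceeds 100.
scale = log_temperature.exp().clamp(max=100.0)
```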

Experiments

Zero-shot Transfer

Motivation

  • generalizes zero-shot from recognizing unseen classes to handling unseen datasets
  • task-learning instead of representation learning
  • datasets represent a specific task; performance on a dataset represents ability to complete the task
  • Even though regular zero-shot learning adds new object categories, these new categories are not new tasks; they are just new classes. The task is still the same: classify a certain kind of object. If the dataset (including the zero-shot images) is real-life, then the task is “identify generic real-life objects.” Other tasks include OCR, action recognition, etc.
  • With a general model, we can evaluate its performance on different tasks by using different datasets, each representing a different task
  • This means the model really is learning concepts generic enough to apply onto other contexts / tasks.

Using CLIP for Zero-Shot Transfer

  • review: CLIP learns to pair words with visual concepts through image-text pairing
  • for new datasets, the new classes are just words that CLIP has already learned; CLIP should be able to match the new classes (e.g. “photo of {adjective} {object}”) if it really learned what visual concept each word represents (see the sketch after this list)
  • skip details, likely not relevant to project
  • No idea how the image and text encoders actually work.
  • The image encoder extracts the features of an image, and the text encoder acts as a “hypernetwork”: it generates the weights of a linear classifier from the class names provided by the dataset
  • Haven’t checked it yet, but using the IAB taxonomy directly might not work; I doubt the taxonomy contains specific objects. Might need to come up with my own list of words that neatly surjects onto the taxonomy. Need to think this through.
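
A sketch of how I expect to use this for zero-shot classification, assuming the released openai/CLIP package (the package, labels, and file name are my own choices, not something prescribed by the paper):

```python
import torch
import clip  # released openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical classes; the prompt template turns each class name into a caption-like sentence
labels = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the image and each class prompt, softmaxed into per-class scores
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```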

Initial Comparison to Visual N-Grams

  • Better performance than visual n-grams; skip details

Prompt Engineering and Ensembling

  • Normal image classification datasets don’t assign importance to descriptive labels—classes are converted to numeric IDs.
  • Polysemy: Mixed word senses. When only the object’s class name/label is provided, CLIP’s text encoder can’t distinguish which sense of the word is being used. This might be an issue because at times there are different classes in a dataset that use the same word but different senses.
  • The pre-training dataset contains full-sentence captions while in normal image classification datasets the label might just be a word. A workaround is to prefix the one-word labels with the text “a photo of a,” so if the label is “dog” the full label would be “a photo of a dog.” This improves ImageNet accuracy by 1.3%.
  • Customizing the label for the task at hand also improves accuracy. For instance, if the current task is to recognize pets, then qualifying the label with “a type of pet” may improve accuracy. For instance, the original label “dog” may be expanded to “a photo of a dog, a type of pet.”
  • Ensembling here means combining many prompt templates per class; because the combination happens in embedding space, the result is still a single classifier, so the compute cost stays close to that of a single prompt while accuracy improves (see the sketch below).
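
As I read it, the ensembling works like this: embed several prompt templates for the same class, average the embeddings, and renormalize, so each class still contributes exactly one classifier weight vector. A sketch under that reading (the templates and classes below are made up; the paper ensembles 80 templates for ImageNet):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a photo of a big {}.", "a photo of a small {}."]  # made-up subset

def ensembled_class_embedding(label: str) -> torch.Tensor:
    # Embed every prompt for this class, average in embedding space, renormalize
    tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
    with torch.no_grad():
        embeddings = model.encode_text(tokens)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    mean = embeddings.mean(dim=0)
    return mean / mean.norm()

# One column per class: this matrix acts as the weights of the zero-shot classifier
zero_shot_weights = torch.stack([ensembled_class_embedding(c) for c in ["dog", "cat"]], dim=1)
```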

Analysis of Zero-Shot CLIP Performance

  • skip

Representation Learning

  • it’s more common to compare representation learning ability between models
  • how to measure / compare performance of representation learning of CLIP
    • fit a linear classifier (a “linear probe”) on the frozen representations produced by each model and measure its performance (what the paper chose; see the sketch after this list)
    • measure performance of “end-to-end fine-tuning of the model”
    • chose the first because fine-tuning can adapt the model to the dataset and mask failures, whereas the limited capacity of a linear classifier forces the model to actually generalize and clearly exposes bad representations; fine-tuning also has more hyperparameters and is more complex, expensive, and hard to evaluate/compare
  • skip performance comparisons
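
A minimal linear-probe sketch of the evaluation the paper chose, as I understand it: freeze CLIP, extract image features, and train only a logistic-regression classifier on top (the dataset variables and hyperparameters are placeholders):

```python
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def linear_probe_accuracy(train_images, train_labels, test_images, test_labels):
    """Images are lists of PIL images, labels are integer class ids (all placeholders)."""
    def featurize(images):
        batch = torch.stack([preprocess(im) for im in images]).to(device)
        with torch.no_grad():
            features = model.encode_image(batch)  # CLIP itself stays frozen
        return features.cpu().numpy()

    probe = LogisticRegression(max_iter=1000)     # the only thing that gets trained
    probe.fit(featurize(train_images), train_labels)
    return probe.score(featurize(test_images), test_labels)
```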

Robustness to Natural Distribution Shift

  • distribution shift
  • natural distribution shift
  • natural distribution shift on many ImageNet models causes degraded performance, perhaps due to specific patterns local to the dataset (overfitting)
  • improve both if trying to improve robustness:
    • effective robustness: how much higher a model’s accuracy is under distribution shift than what the documented relationship between in-distribution and out-of-distribution accuracy predicts
    • relative robustness: any raw improvement in out-of-distribution accuracy
  • Under zero-shot learning, CLIP shouldn’t be able to learn the patterns specific to a dataset and thus observes a significant improvement in effective robustness
  • Adapting CLIP to ImageNet increases its ImageNet accuracy but decreases its robustness across the shifted datasets.
  • skip the rest (supervised settings, adapted to ImageNet/class shifts)

Comparison to Human Performance

  • Humans are tested on zero-shot performance (no example images, no searching allowed)
  • Human performance from zero- to one-shot: high improvement, almost all on “I don’t know” questions—humans know what they don’t know and can correct for that
  • CLIP can’t learn that quickly, large gap
  • gap between human few-shot performance and machine few-shot SOTA methods

Data Overlap Analysis

  • Since the dataset is very large & wide, there could be an overlap between the training data and a supposed “zero-shot” test.
  • Prevention
    • Deduplicate test and train data before training? But it’s hard to know in advance which datasets will be used for evaluation, and retraining is expensive
    • Instead, at evaluation time, check how similar each evaluation image is to the training data and split the evaluation set into “clean” and “overlap” subsets by a similarity threshold (see the sketch after this list)
  • skip the details
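
A simplified sketch of that clean/overlap split (my own simplification: the paper uses a dedicated duplicate detector with manually chosen per-dataset thresholds, which I approximate here with nearest-neighbor cosine similarity over L2-normalized features):

```python
import numpy as np

def split_clean_overlap(eval_feats: np.ndarray, train_feats: np.ndarray, threshold: float = 0.9):
    """eval_feats, train_feats: L2-normalized [n, dim] feature matrices; threshold is a placeholder."""
    sims = eval_feats @ train_feats.T         # cosine similarity of each eval image to every training image
    nearest = sims.max(axis=1)                # similarity to its closest training image
    overlap_idx = np.flatnonzero(nearest >= threshold)  # likely duplicates of training data
    clean_idx = np.flatnonzero(nearest < threshold)     # treated as genuinely unseen
    return clean_idx, overlap_idx
```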

Limitations

  • Some are already noted in Limitations
  • internet data is varied, but also unfiltered. CLIP could learn social biases.

Broader Impacts

  • An overview is covered in Impacts

Bias

  • skipped

Surveillance

  • skipped

Future Work

  • skipped

Conclusion

  • skipped