Metadata

Annotations

Notes

This paper proposes a new model called CLIP.

Summary

CLIP is a model that achieves zero-shot performance competitive with supervised models across a wide range of tasks and data distributions. Its generality stems from contrastive learning: a scaled-up pre-training task of matching images with entire captions. Because training only requires the model to match each image with its caption rather than predict the caption itself, CLIP learns visual concepts from textual descriptions while using far less compute than its predictive counterparts. In addition, by using natural language instead of a fixed set of image classes as its output space, the model generalizes more easily to never-before-seen labels in new datasets and tasks, which lets users label images with any arbitrary classification scheme. Its natural-language prediction space even gives it surprising performance on an OCR task. Flexible as it is, the model still struggles on more complex and fine-grained tasks. Nonetheless, CLIP’s flexibility lets people design custom classifiers without building datasets or retraining models.

Abstract

  • contrastive learning: a pre-training task (matching an image to its caption) scaled up to 400 million image-text pairs
    • model learns to associate words with visual concepts/features
  • enables generality of the trained model: zero-shot classes can be described using natural language and the learned concepts
  • SOTA zero-shot
    • no task-specific image data needed
    • performance competitive with supervised baseline models
    • possibilities include OCR, action recognition, etc

Introduction and Motivating Work

  • pre-training methods on raw texts inspired NLP revolution
    • autoregressive language model: a language model that predicts the next token conditioned on the tokens before it
    • masked language modeling: model for predicting missing words (“masked” words) in text
    • Development of text-to-text standardized I/O for zero-shot transfer to other datasets (no need for dataset customization)
    • GPT-3
    • These successes suggest that web-scale text aggregated from the internet, used for pre-training, can surpass high-quality crowd-labeled NLP datasets.
  • Attempt to reproduce the revolution in computer vision
    • CV still pre-trains models on ImageNet. What about data from the web?
  • bag-of-words phrase n-gram model can do zero-shot
  • VirTex, ICMLM, ConVIRT (transformer-based language modeling, masked language modeling, contrastive objectives)
  • natural language supervision for image representation learning is still rare; benchmark performance has been much lower (e.g. Li et al. 2017 achieve only 11.5% zero-shot accuracy on ImageNet); alternatives supervise over a fixed, limited set of classes (because labeled image data is limited), which limits zero-shot ability
  • no model had fully committed to learning from raw image + text data from the internet; natural language data offers far more classes, more generality, and better zero-shot capability

  • 400 million image-text pairs gathered from internet
  • CLIP is a simplified ConVIRT trained from scratch using these data and “is an efficient method of learning from natural language supervision.”
  • various zero-shot capabilities: OCR, geo-localization, action recognition, etc
  • competitive to task-specific models
  • CLIP is much more robust; zero-shot performance may be a more representative measure of a model’s capability

Approach

Natural Language Supervision

See: natural language supervision

  • over time, natural language supervision has been described with terms ranging from unsupervised and self-supervised to weakly supervised and supervised
  • natural language has become a good training material for vision tasks
  • benefit: easier to scale compared to manually labeling images

Creating a Sufficiently Large Dataset

  • high quality labeled datasets are small by modern standards
  • large datasets contain data of varying quality (e.g. no English description)
  • motivation for Natural Language Supervision is quantity of freely available data on internet
  • CLIP uses a new dataset of image-text pairs (“WIT”–WebImageText)
    • image-text pairs are found using search queries built from words that appear at least 100 times in English Wikipedia

Selecting an Efficient Pre-Training Method

  • simply predicting the exact caption of an image requires enormous compute
  • contrastive objective learns visual concepts better than predictive objective
  • scale comes from an easier pre-training task: match the image with its caption as a whole rather than predicting its exact wording (the bag-of-words predictive baseline) → 4x efficiency gain in zero-shot transfer to ImageNet
  • to match images with captions, CLIP jointly trains an image encoder and a text encoder to maximize the cosine similarity between the embeddings of correct pairs and minimize it for incorrect pairs (see the sketch after this list)
  • “optimized a symmetric cross entropy loss over these similarity scores”
  • not a new idea: this batch-construction technique comes from the “multi-class N-pair loss” in deep metric learning and was recently applied to contrastive (image, text) representation learning in the medical domain (ConVIRT)
  • large dataset, so overfitting shouldn’t be an issue
  • CLIP was trained from scratch (no pre-trained weights, i.e. no cheating)
  • used a linear instead of a non-linear projection to map each encoder’s representation into the shared contrastive embedding space, since the authors found no difference in training efficiency between the two
  • only used limited data augmentation: “random square crop from resized images”
  • the “logit” detail is just that the cosine similarities are multiplied by a learnable temperature parameter before the softmax (see the sketch below); probably not critical to the project
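
A minimal PyTorch sketch of the symmetric contrastive loss, following the paper’s pseudocode; the function name and the assumption that features arrive as (N, d) tensors are mine.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric cross-entropy over cosine-similarity logits for a batch of
    N aligned image-text pairs; the diagonal entries are the correct pairs."""
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix, scaled by the learnable temperature
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # the i-th image belongs with the i-th caption
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # cross entropy in both directions (image->text and text->image), averaged
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2
```

Every other caption in the batch serves as a negative for each image (and vice versa), which is part of why large batch sizes help.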

Choosing and Scaling a Model

  • image encoder:
    • a modified ResNet-50 (ResNet-D improvements, antialiased blur pooling, and attention pooling in place of global average pooling)
    • ViT (vision transformer)
  • text encoder: transformer
    • what’s an attention head? masked self-attention? probably need to read the attention paper
  • byte-pair encoding? (a toy sketch follows this list)
  • the last part about scaling: the ResNet image encoders are scaled by increasing width, depth, and input resolution roughly equally (EfficientNet-style), while the text encoder is scaled only in width
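
Since byte-pair encoding keeps coming up, here is a toy from-scratch sketch of the idea (repeatedly merge the most frequent adjacent symbol pair); this is for intuition only and is not CLIP’s actual tokenizer, which the paper describes as a lower-cased BPE with a 49,152-token vocabulary.

```python
from collections import Counter


def learn_bpe(corpus_words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent adjacent
    pair of symbols into a single new symbol."""
    # start with each word as a tuple of single characters
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs across the (weighted) vocabulary
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # rewrite every word with the chosen pair merged into one symbol
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges


print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=3))
# -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Frequent substrings become single tokens, so common words end up as one token while rare words decompose into smaller pieces.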

Training

  • trained 5 ResNets and 3 ViTs
  • Adam optimizer “with decoupled weight decay regularization”, i.e. AdamW: the weight decay is applied directly to the weights rather than mixed into the gradient-based update (see the sketch after this list)
  • the “learnable temperature parameter” scales the logits in the softmax and is learned during training rather than tuned as a hyperparameter (initialized to the equivalent of 0.07)
  • a bunch of techniques that I don’t know
  • wow that’s a lot of computing power and money
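
A small PyTorch sketch of the two pieces flagged above, AdamW and the learnable temperature; the encoder stand-in and the hyperparameter values are placeholders, not the paper’s exact configuration.

```python
import math

import torch
import torch.nn as nn

# Learnable temperature, stored as log(1/0.07) so the optimizer works in an
# unconstrained space; the paper initializes the temperature to 0.07 and clips
# the scale so logits are never multiplied by more than 100.
logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

# "Adam with decoupled weight decay regularization" is AdamW: the decay is
# applied directly to the weights instead of being folded into the Adam update.
encoder = nn.Linear(512, 512)  # stand-in for the real image/text encoders
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + [logit_scale],
    lr=5e-4,          # illustrative values, not the paper's exact settings
    weight_decay=0.2,
)

# in the training loop: logits = logit_scale.exp() * image_feats @ text_feats.T
# and after each optimizer step: logit_scale.data.clamp_(max=math.log(100))
```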

Experiments

Zero-shot Transfer

Motivation

  • generalize zero-shot learning from recognizing unseen classes to handling unseen datasets
  • task-learning instead of representation learning
  • datasets represent a specific task; performance on a dataset represents ability to complete the task
  • Even though regular zero-shot learning adds new object categories, these new categories are not new tasks, just new classes. The task is still the same: classify some kind of object. If the dataset (including the zero-shot images) consists of real-life photos, then the task is “identify generic real-life objects.” Other tasks include OCR, action recognition, etc.
  • With a general model, we can evaluate performance on different tasks by using different types of datasets (each dataset standing in for a task)
  • This means the model really is learning concepts generic enough to apply onto other contexts / tasks.

Using CLIP for Zero-Shot Transfer

  • review: CLIP learns to pair words with visual concepts through image-text pairing
  • for new datasets, the new classes are just words that CLIP has already learned. CLIP should be able to match images to the new classes (e.g. “a photo of a {adjective} {object}”) if it really learned which visual concept each word represents (see the sketch after this list)
  • skip details, likely not relevant to project
  • No idea how the image and text encoders actually work.
  • The image encoder extracts image features, and the text encoder acts as a “hypernetwork”: embedding the dataset’s class names generates the weights of a linear classifier over those image features
  • Haven’t checked it yet, but using the IAB taxonomy directly might not work, doubt the taxonomy contains specific objects. Might need to come up with my own list of words that neatly surjects to the taxonomy. Need to think this through.
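
A sketch of how zero-shot classification looks in practice, assuming the released openai/CLIP package (clip.load, clip.tokenize, encode_image/encode_text); the class names, prompt template, and image path are placeholders.

```python
import torch
import clip  # the released openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# hypothetical class names for some target dataset
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    # the class-name embeddings act as the weights of a linear classifier
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # cosine similarities -> probabilities over the class names
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

Swapping in a different label set only requires re-encoding the prompts; the image encoder and its weights stay untouched.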

Initial Comparison to Visual N-Grams

  • Better performance than visual n-grams; skip details

Prompt Engineering and Ensembling

  • Normal image classification datasets don’t assign importance to descriptive labels—classes are converted to numeric IDs.
  • Polysemy: mixed word senses. When only the object’s class name/label is provided, CLIP’s text encoder can’t tell which sense of the word is meant. This matters because a dataset can contain different classes that use the same word in different senses (the paper’s example: ImageNet has both construction cranes and cranes, the bird).
  • The pre-training dataset contains full-sentence captions while in normal image classification datasets the label might just be a word. A workaround is to prefix the one-word labels with the text “a photo of a,” so if the label is “dog” the full label would be “a photo of a dog.” This improves ImageNet accuracy by 1.3%.
  • Customizing the label for the task at hand also improves accuracy. For instance, if the current task is to recognize pets, then qualifying the label with “a type of pet” may improve accuracy. For instance, the original label “dog” may be expanded to “a photo of a dog, a type of pet.”
  • Ensembling here means averaging the text embeddings of many different prompt templates for each class; because the averaged embedding is computed once and cached, accuracy improves while the compute cost stays essentially that of a single classifier (see the sketch below).
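
A sketch of prompt ensembling in embedding space, again assuming the openai/CLIP package; the templates and class names here are illustrative (the paper ensembles 80 templates for ImageNet).

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# a few illustrative templates; the paper ensembles 80 of them for ImageNet
templates = [
    "a photo of a {}.",
    "a photo of a {}, a type of pet.",
    "a bad photo of a {}.",
    "a close-up photo of a {}.",
]
class_names = ["dog", "cat"]  # hypothetical classes

class_embeddings = []
with torch.no_grad():
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        embeddings = model.encode_text(tokens)
        embeddings /= embeddings.norm(dim=-1, keepdim=True)
        # average the prompt embeddings, then re-normalize: one vector per class
        mean_embedding = embeddings.mean(dim=0)
        class_embeddings.append(mean_embedding / mean_embedding.norm())

# (embed_dim, num_classes) weight matrix of the zero-shot classifier
zero_shot_weights = torch.stack(class_embeddings, dim=1)
```

Because each class still ends up with a single averaged vector, classifying an image costs one image-encoder pass plus one matrix multiply, same as a single-prompt classifier.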

Analysis of Zero-Shot CLIP Performance

  • skip

Representation Learning

  • it’s more common to compare representation learning ability between models
  • how to measure / compare performance of representation learning of CLIP
    • fit a linear classifier (linear probe) on the frozen representations produced by each model and measure its performance (what the paper chose; see the sketch after this list)
    • measure performance of “end-to-end fine-tuning of the model”
    • chose the first because fine-tuning adapts the representation to each dataset and can mask a failure to learn general representations, whereas a linear classifier’s limited capacity exposes that failure clearly; fine-tuning also has a much larger hyperparameter space and is more expensive and harder to evaluate and compare fairly
  • skip performance comparisons
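
A sketch of the linear-probe protocol using CIFAR100 as a stand-in downstream dataset, assuming the openai/CLIP package, torchvision, and scikit-learn; the paper sweeps the regularization strength, here a single illustrative value is used.

```python
import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# CIFAR100 stands in for a downstream dataset; preprocess matches CLIP's input
train_set = CIFAR100("~/.cache", download=True, train=True, transform=preprocess)
test_set = CIFAR100("~/.cache", download=True, train=False, transform=preprocess)


def extract_features(dataset):
    """Run the frozen image encoder over a dataset, collect features and labels."""
    features, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=256):
            features.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(features), np.concatenate(labels)


train_x, train_y = extract_features(train_set)
test_x, test_y = extract_features(test_set)

# the linear probe: logistic regression on frozen features; the paper sweeps
# the regularization strength, a single illustrative value is used here
probe = LogisticRegression(C=0.316, max_iter=1000)
probe.fit(train_x, train_y)
print("linear-probe accuracy:", probe.score(test_x, test_y))
```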

Robustness to Natural Distribution Shift

  • distribution shift
  • natural distribution shift
  • natural distribution shift degrades the performance of many ImageNet models, perhaps because they exploit patterns specific to the ImageNet training set (overfitting)
  • improve both if trying to improve robustness (definitions written out after this list):
    • effective robustness: how much a model’s accuracy under a distribution shift exceeds what the documented relationship between in-distribution and out-of-distribution accuracy would predict
    • relative robustness: any absolute improvement in out-of-distribution accuracy, e.g. from a robustness intervention
  • Under zero-shot evaluation, CLIP cannot exploit patterns specific to a dataset’s training split, and it shows a large improvement in effective robustness
  • Adapting CLIP to ImageNet (fitting a classifier to its classes) increases ImageNet accuracy but decreases robustness across the shifted datasets.
  • skip the rest (supervised settings, adapted to ImageNet/class shifts)
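
For reference, the two definitions written out as formulas, as I understand them from Taori et al. (the framework the paper uses); $\beta(\cdot)$ is the baseline trend fit across standard ImageNet models:

```latex
% effective robustness: out-of-distribution accuracy beyond the trend line
% beta(.) predicted from in-distribution accuracy
\rho(f) = \operatorname{acc}_{\text{ood}}(f) - \beta\!\left(\operatorname{acc}_{\text{id}}(f)\right)

% relative robustness: any absolute gain in out-of-distribution accuracy when
% an intervention turns model f into model f'
\tau = \operatorname{acc}_{\text{ood}}(f') - \operatorname{acc}_{\text{ood}}(f)
```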

Comparison to Human Performance

  • Humans are tested on zero-shot performance (no examples, no searching)
  • Human performance from zero- to one-shot: high improvement, almost all on “I don’t know” questions—humans know what they don’t know and can correct for that
  • CLIP doesn’t improve from a single example the way humans do; a large gap remains
  • gap between human few-shot performance and machine few-shot SOTA methods

Data Overlap Analysis

  • Since the dataset is very large & wide, there could be an overlap between the training data and a supposed “zero-shot” test.
  • Prevention
    • Deduplicate test and training data before training? But it’s hard to know in advance which datasets will be used for evaluation, and retraining is expensive
    • Instead, at evaluation time, measure how similar each evaluation image is to the training data and split the evaluation set into “clean” and “overlap” subsets by a similarity threshold; comparing accuracy on the two subsets estimates the effect of contamination (see the sketch after this list)
  • skip the details
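
A sketch of what the clean/overlap split could look like; the similarity scores are assumed to come from whatever duplicate detector is used (the paper builds its own detector, which isn’t reproduced here).

```python
import numpy as np


def overlap_analysis(similarities, correct, threshold):
    """Split an evaluation set into Overlap / Clean by similarity to the training
    set and compare zero-shot accuracies on the two halves.

    similarities: (N,) similarity of each eval example to its closest training
                  example, produced by whatever duplicate detector is used
    correct:      (N,) booleans, whether the zero-shot prediction was right
    threshold:    similarity above which an example counts as overlapping
    """
    similarities = np.asarray(similarities)
    correct = np.asarray(correct, dtype=bool)
    overlap = similarities >= threshold

    acc_all = correct.mean()
    acc_clean = correct[~overlap].mean() if (~overlap).any() else float("nan")
    acc_overlap = correct[overlap].mean() if overlap.any() else float("nan")

    return {
        "overlap_fraction": float(overlap.mean()),
        "accuracy_all": float(acc_all),
        "accuracy_clean": float(acc_clean),
        "accuracy_overlap": float(acc_overlap),
        # if contamination inflates results, All should noticeably beat Clean
        "all_minus_clean": float(acc_all - acc_clean),
    }
```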

Limitations

  • Some are already noted in Limitations
  • internet data is varied, but also unfiltered. CLIP could learn social biases.

Broader Impacts

  • An overview is covered in Impacts

Bias

  • skipped

Surveillance

  • skipped

Future Work

  • skipped

Conclusion

  • skipped