Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

Conceptual 12M

We introduce Conceptual 12M (CC12M), a dataset of ~12 million image-text pairs intended for vision-and-language pre-training. It is larger and covers a much more diverse set of visual concepts than Conceptual Captions (CC3M), a dataset widely used for pre-training and end-to-end training of image captioning models. See our paper for further details.

Download

Click here to download (2.5GB)

Format (.tsv)

[image_url_1]\t[caption_1]
[image_url_2]\t[caption_2]
[image_url_3]\t[caption_3]
…
[image_url_N]\t[caption_N]
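
As a rough illustration of the format above, the following Python sketch reads such a file line by line and yields (image URL, caption) pairs. The function name `iter_pairs` and the choice to silently skip blank or tab-less lines are assumptions made for this example, not part of the dataset release.

```python
from typing import Iterator, Tuple


def iter_pairs(tsv_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (image_url, caption) pairs from a tab-separated file.

    Hypothetical helper: assumes one pair per line, with the URL before
    the first tab and the caption after it.
    """
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            url, sep, caption = line.partition("\t")
            if not sep:
                continue  # no tab found: treat as malformed and skip
            yield url, caption
```

Streaming the file this way avoids loading all ~12 million rows into memory at once; for example, `for url, caption in iter_pairs(path): ...` processes one pair at a time.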

Cite

If you use this dataset in your research, please cite:

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. CVPR 2021.

@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}

FAQs

Q1: Can you provide image pixels?

A1: We do not own any of the images in the dataset and hence cannot legally provide them to you. The owner of an image can choose to delete it at any time, in which case the image will no longer be available. As a result, some images in the dataset will unfortunately be lost over time, and we are unable to help with this issue.

Q2: Is it normal that a subset of images cannot be retrieved from the provided URLs?

A2: Yes. See Q1.
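
Because some URLs will fail, a download script should tolerate failures rather than stop at the first one. The sketch below is one possible way to do that with the Python standard library only; the names `fetch_image_bytes` and `download_all`, the 10-second timeout, and the decision to simply count failures are all assumptions for illustration.

```python
import http.client
import urllib.error
import urllib.request
from typing import Iterable, List, Optional, Tuple


def fetch_image_bytes(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the raw bytes at `url`, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Deleted image, dead host, timeout, or malformed URL: report as missing.
        return None


def download_all(
    pairs: Iterable[Tuple[str, str]], timeout: float = 10.0
) -> Tuple[List[Tuple[bytes, str]], int]:
    """Fetch every URL; return the (bytes, caption) pairs that worked and the failure count."""
    kept: List[Tuple[bytes, str]] = []
    failed = 0
    for url, caption in pairs:
        data = fetch_image_bytes(url, timeout)
        if data is None:
            failed += 1
        else:
            kept.append((data, caption))
    return kept, failed
```

This sequential version is only meant to show the error handling; a real download of millions of URLs would likely need concurrency and retries, which are left out here.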

Q3: Is CC12M an “expanded” CC3M?

A3: No. CC12M is purposely designed for vision-and-language pre-training and is meant to be disjoint from CC3M. CC3M is cleaner and more appropriate for fine-tuning, but it can be used in conjunction with CC12M for pre-training, as illustrated in our paper. Coincidentally, their intersection is non-zero: approximately 63K URLs.
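
To check the overlap on your own copies of the two files, a comparison along the following lines should work. This is a minimal sketch assuming exact string matching on URLs; the helper names `load_urls` and `url_overlap` are hypothetical, and the column indices are parameters because the column order of the CC3M file is not described here, so verify which column holds the URL before running it.

```python
from typing import Set


def load_urls(tsv_path: str, url_column: int = 0) -> Set[str]:
    """Collect the distinct values of one tab-separated column as a set of URLs."""
    urls: Set[str] = set()
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > url_column and fields[url_column]:
                urls.add(fields[url_column])
    return urls


def url_overlap(
    cc12m_tsv: str,
    cc3m_tsv: str,
    cc12m_url_column: int = 0,
    cc3m_url_column: int = 0,  # assumption: pass the actual URL column of your CC3M file
) -> Set[str]:
    """Return the URLs that appear in both files (exact string match)."""
    return load_urls(cc12m_tsv, cc12m_url_column) & load_urls(cc3m_tsv, cc3m_url_column)
```

Calling `len(url_overlap(cc12m_path, cc3m_path, 0, cc3m_col))` gives the size of the intersection under these assumptions. URLs that differ only in scheme, query string, or trailing slash count as different here.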

Contact Us

If you have a question that is not answered in the FAQs above, please create an issue in this repository.

If you would like to share feedback or report concerns, please email us at conceptual-captions@google.com.

Comments

The overlap between CC3m and CC12m

Thank you for your excellent work! I have a small question about the overlap between the CC3M and CC12M datasets. As I understood it, CC12M is an expansion of CC3M, so most of the images in CC3M should be included in CC12M. But after I downloaded both TSV files and compared their URLs, I found only about 63K URLs in CC12M that also appear in CC3M. Is this expected, or did I do something wrong? Any help would be greatly appreciated. I believe this dataset will contribute something really interesting to this area.

opened by weiyx16 2

Image-captioning pre-trained model

Hey,

Hope you are all well and thank you for open-sourcing the dataset! 🤗

Was wondering if you are also planning to release any pre-trained models such as the IC ones described in the paper?

Thanks.

opened by JohannesTK 1

Lots of links are not working?

16it [21:34, 19.91s/it]worker  - success: 0.244 - failed to download: 0.752 - failed to resize: 0.003 - images per sec: 8 - count: 10000
total   - success: 0.241 - failed to download: 0.755 - failed to resize: 0.003 - images per sec: 124 - count: 160000

opened by yxchng 0

Owner

Google Research Datasets
