RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. It downloads the images for the officially released annotations, and it can also collect additional image-text data from any subreddit over an arbitrary time span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in the recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── <subreddit>_<year>.json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    │   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    │   └── ...
    │
    ├── <subreddit>/
    │   ├── <image_id>.jpg
    │   └── ...
    └── ...
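
To check that a download matches this layout, a small shell helper can count the annotation files and the images per subreddit. This is only a sketch: `summarize_redcaps` is a hypothetical name that is not part of the tool, and it assumes images are stored as `.jpg` files exactly as in the tree above.

```shell
# Hypothetical helper (not part of redcaps-downloader): summarize a
# RedCaps-style directory tree.
# Assumes annotations/*.json and images/<subreddit>/*.jpg as shown above.
summarize_redcaps() {
    root="$1"
    # Count annotation files directly under annotations/.
    num_anns=$(find "$root/annotations" -maxdepth 1 -type f -name '*.json' | wc -l)
    echo "annotation files: $((num_anns))"
    # Count images inside each subreddit directory.
    for dir in "$root"/images/*/; do
        [ -d "$dir" ] || continue
        num_imgs=$(find "$dir" -type f -name '*.jpg' | wc -l)
        echo "$(basename "$dir"): $((num_imgs)) images"
    done
}

# Usage: summarize_redcaps ./datasets/redcaps
```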

Create an empty directory and symlink it relative to this code directory:

cd redcaps-downloader

# Edit path here:
mkdir -p /path/to/redcaps
ln -s /path/to/redcaps ./datasets/redcaps

Download official RedCaps annotations from Dropbox and unzip them.

cd datasets/redcaps
# Quote the URL and name the output file so that unzip finds it.
wget -O redcaps_v1.0_annotations.zip "https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1"
unzip redcaps_v1.0_annotations.zip

Download images using the redcaps download-imgs command. The command handles a single annotation file, so loop over all of them:
```
for ann_file in ./datasets/redcaps/annotations/*.json; do
    # --resize 512 resizes the shorter edge, which saves disk space.
    # Set --resize -1 to turn resizing off.
    redcaps download-imgs -a "$ann_file" --save-to path/to/images --resize 512 -j 4
done
```
Increase -j to parallelize the download. RedCaps images are sourced from Reddit, Imgur, and Flickr, each of which has its own request limits. This code includes approximate sleep intervals to stay within them. To parallelize massively, use multiple machines (that is, different IP addresses) or a cluster.
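
One way to split the work across machines is to give each machine a disjoint subset of the annotation files. The sketch below assigns files round-robin. `files_for_shard`, the 0-based shard index, and the shard count are hypothetical names introduced here for illustration; the tool itself has no such option that we know of.

```shell
# Hypothetical helper (not part of redcaps-downloader): print the
# annotation files assigned to one machine out of several.
# Arguments: annotation directory, 0-based shard index, number of shards.
files_for_shard() {
    ann_dir="$1"; shard_id="$2"; num_shards="$3"
    index=0
    for ann_file in "$ann_dir"/*.json; do
        [ -e "$ann_file" ] || continue
        # Round-robin: file k goes to machine (k mod num_shards).
        if [ $((index % num_shards)) -eq "$shard_id" ]; then
            echo "$ann_file"
        fi
        index=$((index + 1))
    done
}

# Usage on machine 0 of 4 (download command copied from the loop above):
#   files_for_shard ./datasets/redcaps/annotations 0 4 | while read -r ann_file; do
#       redcaps download-imgs -a "$ann_file" --save-to path/to/images --resize 512 -j 4
#   done
```

The assignment relies on every machine seeing the same files in the same glob order, so keep the annotation directory identical across machines.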

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. The same steps apply to any subreddit and any year.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which is served by the Pushshift API and the official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

{
    "reddit": {
        "client_id": "Your client ID here",
        "client_secret": "Your client secret here",
        "username": "Your Reddit username here",
        "password": "Your Reddit password here",
        "user_agent": "Your user agent here"
    },
    "imgur": {
        "client_id": "Your client ID here",
        "client_secret": "Your client secret here"
    }
}

Register a new Reddit app in your Reddit account's app preferences. Reddit will provide Client ID and Client Secret tokens; fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything (e.g. your name) and keep it unchanged.
Register a new Imgur app by following Imgur's app registration instructions. Fill the provided Client ID and Client Secret in ./credentials.json.
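
Before collecting data, it can help to confirm that no template text is left in ./credentials.json. The following is a minimal sketch: `check_credentials` is a hypothetical helper, and it assumes that every unfilled field still ends with the word "here", as the template fields shown above do.

```shell
# Hypothetical helper (not part of redcaps-downloader): fail if the
# credentials file is missing or still holds template text.
# Assumes unfilled values end with the word "here", as in the template.
check_credentials() {
    cred_file="${1:-./credentials.json}"
    if [ ! -f "$cred_file" ]; then
        echo "missing: $cred_file" >&2
        return 1
    fi
    # Print the offending lines (with line numbers) to stderr.
    if grep -n 'here"' "$cred_file" >&2; then
        echo "fill in the fields listed above" >&2
        return 1
    fi
    echo "no placeholder text found in $cred_file"
}

# Usage: check_credentials ./credentials.json
```

Since the file holds your Reddit password, consider restricting its permissions (for example with chmod 600) and keeping it out of version control.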

Download pre-trained weights of an NSFW detection model.

wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models

Data collection from `r/roses` (2020)

download-anns: Download annotations of image posts made in a single month (e.g. January).

redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations

# Similarly, download annotations for all months of 2020.
# Months are zero-padded to match the --month 2020-01 format used above.
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    redcaps download-anns --subreddit roses --month "2020-$month" -o ./datasets/redcaps/annotations
done

NOTE: You may not get all the annotations present in the official release, as some posts may have been deleted over time. After this step, the dataset directory will contain 12 annotation files:

    ./datasets/redcaps/
    └── annotations/
        ├── roses_2020-01.json
        ├── roses_2020-02.json
        ├── ...
        └── roses_2020-12.json
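
If a monthly download is interrupted, some of these files may be absent. The sketch below lists the months that still need to be fetched. `missing_months` is a hypothetical helper, and it assumes the `<subreddit>_<year>-<month>.json` naming shown in the tree above.

```shell
# Hypothetical helper (not part of redcaps-downloader): list the months
# whose annotation file is absent from a directory.
# Assumes files are named <subreddit>_<year>-<month>.json as shown above.
missing_months() {
    ann_dir="$1"; subreddit="$2"; year="$3"
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        ann_file="$ann_dir/${subreddit}_${year}-${month}.json"
        [ -f "$ann_file" ] || echo "$year-$month"
    done
}

# Usage: rerun download-anns for every month that is printed.
#   missing_months ./datasets/redcaps/annotations roses 2020
```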

merge: Merge all the monthly annotation files into a single file.

redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
    -o ./datasets/redcaps/annotations/roses_2020.json --delete-old

--delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:

    ./datasets/redcaps/
    └── annotations/
        └── roses_2020.json

download-imgs: Download all images for this annotation file. This step is the same as the image download step in basic usage.
```
redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
```
- --update-annotations removes annotations whose images were not downloaded.
filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.
```
redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --images ./datasets/redcaps/images
```
filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.
```
redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --images ./datasets/redcaps/images \
    --model ./datasets/redcaps/models/nsfw.299x299.h5
```
filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.
```
redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
    --images ./datasets/redcaps/images  # Model weights auto-downloaded
```
validate: All the above steps create a single annotation file (and download images) similar to the official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.
```
redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json
```
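
The steps above can be chained into one shell function so that the same pipeline runs for any subreddit and year. This is a sketch rather than an official script: `collect_subreddit` and its arguments are hypothetical, every redcaps command and flag is copied from the steps above, and it assumes the credentials and NSFW model weights from the one-time setup are already in place.

```shell
# Hypothetical wrapper (not part of redcaps-downloader): run the whole
# collection pipeline for one subreddit and year.
# Arguments: subreddit, year, dataset root (defaults to ./datasets/redcaps).
collect_subreddit() {
    subreddit="$1"; year="$2"; root="${3:-./datasets/redcaps}"
    ann_file="$root/annotations/${subreddit}_${year}.json"

    # 1. Download annotations month by month; stop on the first failure.
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        redcaps download-anns --subreddit "$subreddit" --month "$year-$month" \
            -o "$root/annotations" || return 1
    done

    # 2. Merge the monthly files into a single yearly file.
    redcaps merge "$root/annotations/${subreddit}_${year}"-* \
        -o "$ann_file" --delete-old || return 1

    # 3. Download images and drop annotations whose images are missing.
    redcaps download-imgs --annotations "$ann_file" \
        --resize 512 -j 4 -o "$root/images" --update-annotations || return 1

    # 4. Apply the three filters; each edits the annotation file in-place.
    redcaps filter-words --annotations "$ann_file" --images "$root/images" || return 1
    redcaps filter-nsfw --annotations "$ann_file" --images "$root/images" \
        --model "$root/models/nsfw.299x299.h5" || return 1
    redcaps filter-faces --annotations "$ann_file" --images "$root/images" || return 1

    # 5. Validate the final annotation file.
    redcaps validate --annotations "$ann_file"
}

# Usage: collect_subreddit roses 2020
```

Because the filters delete images from disk, consider running the function on a copy of the data the first time.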

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
