SCICAP: Scientific Figures Dataset
SCICAP: Generating Captions for Scientific Figures (Hsu et al., 2021)
This is the GitHub repo for the EMNLP 2021 Findings paper SCICAP, a large-scale figure-caption dataset based on Computer Science arXiv papers published between 2010 and 2020. SCICAP contains 410k figures that focus on one dominant figure type, the graph plot, extracted from over 290,000 papers.
How to Cite?
@inproceedings{hsu2021scicap,
title={SciCap: Generating Captions for Scientific Figures},
author={Hsu, Ting-Yao E. and Giles, C. Lee and Huang, Ting-Hao K.},
booktitle={Findings of 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021 Findings)},
year={2021}
}
Download the Dataset
Download Link (18.15 GB)
You can download the SCICAP dataset here:
Folder Structure
scicap_data.zip
├── SciCap-Caption-All #caption text for all figures
│ ├── Train
│ ├── Val
│ └── Test
├── SciCap-No-Subfig-Img #image files for the figures without subfigures
│ ├── Train
│ ├── Val
│ └── Test
├── SciCap-Yes-Subfig-Img #image files for the figures with subfigures
│ ├── Train
│ ├── Val
│ └── Test
├── arxiv-metadata-oai-snapshot.json #metadata of the arXiv papers (from the arXiv dataset)
└── List-of-Files-for-Each-Experiments #list of figure names used in each experiment
├── Single-Sentence-Caption
│ ├── No-Subfig
│ │ ├── Train
│ │ ├── Val
│ │ └── Test
│ └── Yes-Subfig
│ ├── Train
│ ├── Val
│ └── Test
├── First-Sentence #Same as in Single-Sentence-Caption
└── Caption-No-More-Than-100-Tokens #Same as in Single-Sentence-Caption
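The List-of-Files-for-Each-Experiments folders make it possible to restrict training or evaluation to exactly the figures used in a given experiment. Below is a minimal sketch of pairing caption files with their image files after unzipping; the paths, and the assumption that each figure has an individual caption JSON named after its image file (e.g., 1001.0025v1-Figure2-1.json), are hypothetical and may need adjusting to the actual layout.

```python
import json
from pathlib import Path

# Hypothetical paths -- adjust to wherever scicap_data.zip was extracted.
DATA_ROOT = Path("scicap_data")
CAPTION_DIR = DATA_ROOT / "SciCap-Caption-All" / "Train"
IMAGE_DIR = DATA_ROOT / "SciCap-No-Subfig-Img" / "Train"

def iter_pairs(caption_dir: Path, image_dir: Path):
    """Yield (image_path, caption_dict) for figures present in both folders."""
    for caption_path in sorted(caption_dir.glob("*.json")):
        with open(caption_path, encoding="utf-8") as f:
            caption = json.load(f)
        image_path = image_dir / caption["figure-ID"]  # e.g., 1001.0025v1-Figure2-1.png
        # Only figures without subfigures live in SciCap-No-Subfig-Img,
        # so skip captions whose image file is stored elsewhere.
        if image_path.exists():
            yield image_path, caption

for image_path, caption in iter_pairs(CAPTION_DIR, IMAGE_DIR):
    print(image_path.name)
    print(caption["0-originally-extracted"])
    break
```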
Number of Figures in Each Subset
| Data Collection | Does the figure have subfigures? | Train | Validation | Test |
|---|---|---|---|---|
| First Sentence | Yes | 226,608 | 28,326 | 28,327 |
| First Sentence | No | 106,834 | 13,354 | 13,355 |
| Single-Sent Caption | Yes | 123,698 | 15,469 | 15,531 |
| Single-Sent Caption | No | 75,494 | 9,242 | 9,459 |
| Caption w/ <=100 words | Yes | 216,392 | 27,072 | 27,036 |
| Caption w/ <=100 words | No | 105,687 | 13,215 | 13,226 |
JSON Data Format
Example Data Instance (Caption and Figure)
An actual JSON object from SCICAP:
{
"contains-subfigure": true,
"Img-text": ["(b)", "s]", "[m", "fs", "et", "e", "of", "T", "im", "Attack", "duration", "[s]", "350", "300", "250", "200", "150", "100", "50", "0", "50", "100", "150", "200", "250", "300", "0", "(a)", "]", "[", "m", "fs", "et", "e", "of", "ta", "nc", "D", "is", "Attack", "duration", "[s]", "10000", "9000", "8000", "7000", "6000", "5000", "4000", "3000", "2000", "1000", "0", "50", "100", "150", "200", "250", "300", "0"],
"paper-ID": "1001.0025v1",
"figure-ID": "1001.0025v1-Figure2-1.png",
"figure-type": "Graph Plot",
"0-originally-extracted": "Figure 2: Impact of the replay attack, as a function of the spoofing attack duration. (a) Location offset or error: Distance between the attack-induced and the actual victim receiver position. (b) Time offset or error: Time difference between the attack-induced clock value and the actual time.",
"1-lowercase-and-token-and-remove-figure-index": {
"caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .",
"sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."],
"token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
},
"2-normalized": {
"2-1-basic-num": {
"caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .",
"sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."],
"token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
},
"2-2-advanced-euqation-bracket": {
"caption": "impact of the replay attack , as a function of the spoofing attack duration . BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position . BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time .",
"sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position .", "BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time ."],
"tokens": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "BRACKET-TK", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "BRACKET-TK", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
}
}
}
Corresponding Figure: 1001.0025v1-Figure2-1.png
JSON Schema
- contains-subfigure: boolean (whether the figure contains subfigures)
- paper-ID: the unique paper ID in the arXiv dataset
- figure-ID: the ID of the figure extracted from the paper (the index does not necessarily match the figure number in the caption)
- figure-type: the figure type
- 0-originally-extracted: the originally extracted caption of the figure
- 1-lowercase-and-token-and-remove-figure-index: the caption lowercased and tokenized, with the figure index removed
- 2-normalized:
  - 2-1-basic-num: the caption with numbers replaced by NUM-TK
  - 2-2-advanced-euqation-bracket: the caption with equations and bracketed content additionally replaced by EQUAT-TK and BRACKET-TK
- Img-text: text extracted from the figure image, such as axis labels and legends
Within the caption content, we have three attributes:
- caption: the full caption text after the corresponding normalization step
- sentence: a list of segmented sentences
- token: a list of tokenized words
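As a concrete illustration of the schema above, the following sketch loads a single caption JSON object and prints the different caption versions; the file path is hypothetical and should be adjusted to your local copy.

```python
import json

# Hypothetical path to one caption JSON object from SciCap-Caption-All.
with open("SciCap-Caption-All/Train/1001.0025v1-Figure2-1.json", encoding="utf-8") as f:
    fig = json.load(f)

print(fig["contains-subfigure"])      # True for figures with subfigures
print(fig["figure-type"])             # "Graph Plot"
print(fig["0-originally-extracted"])  # original caption, including the "Figure 2:" prefix

# Lowercased, tokenized caption with the figure index removed.
cleaned = fig["1-lowercase-and-token-and-remove-figure-index"]
print(cleaned["caption"])
print(len(cleaned["sentence"]), "sentences,", len(cleaned["token"]), "tokens")

# Normalized versions.
print(fig["2-normalized"]["2-1-basic-num"]["caption"])
print(fig["2-normalized"]["2-2-advanced-euqation-bracket"]["caption"])
```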
Normalized Tokens
In the paper, we used [NUM], [BRACKET], and [EQUATION], but for the final data release we switched to NUM-TK, BRACKET-TK, and EQUAT-TK to avoid the extra problems caused by "[]".
| Token | Description |
|---|---|
| NUM-TK | Numbers (e.g., 0, -0.2, 3.44%, 1,000,000). |
| BRACKET-TK | Text spans enclosed by any type of bracket pair, including {}, [], and (). |
| EQUAT-TK | Math equations identified using regular expressions. |
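The released files already contain the normalized captions, so no extra processing is needed to use them. For readers who want to apply a roughly similar normalization to new text, the sketch below is an illustrative approximation (not the exact pipeline used to build SCICAP); it covers NUM-TK and BRACKET-TK only and omits the regex-based equation detection behind EQUAT-TK.

```python
import re

def normalize(caption: str) -> str:
    """Rough approximation of SCICAP-style token normalization."""
    # Replace spans enclosed by (), [], or {} with BRACKET-TK.
    caption = re.sub(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}", "BRACKET-TK", caption)
    # Replace numbers such as 0, -0.2, 3.44%, 1,000,000 with NUM-TK.
    caption = re.sub(r"-?\d[\d,]*(?:\.\d+)?%?", "NUM-TK", caption)
    return caption

print(normalize("impact of the replay attack ( a ) over 3.44% of 1,000,000 runs"))
# -> impact of the replay attack BRACKET-TK over NUM-TK of NUM-TK runs
```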
Baseline Performance
To examine the feasibility and challenges of creating an image-captioning model for scientific figures, we established several baselines and tested them on SCICAP. Caption quality was measured with BLEU-4, using the test set of the corresponding data collection as the reference. We trained the models on each data collection with varying levels of data filtering and text normalization; Table 2 in the paper shows the results. We also designed three variations of the baseline model: Vision-only, Vision+Text, and Text-only; Table 3 in the paper shows the results.
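For reference, BLEU-4 can be computed with standard toolkits; the sketch below uses NLTK's corpus_bleu as an assumed stand-in for the evaluation script (it is not necessarily the exact script used in the paper), scoring tokenized model outputs against the reference captions from the corresponding test set.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy example: one reference caption and one generated caption per figure.
references = [[["impact", "of", "the", "replay", "attack", "."]]]   # list of reference lists
hypotheses = [["impact", "of", "replay", "attack", "."]]            # model outputs

smooth = SmoothingFunction().method1  # avoids zero scores on very short captions
bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=smooth,
)
print(f"BLEU-4: {bleu4:.4f}")
```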
Data License
The arXiv dataset uses the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license, which grants permission to remix, remake, annotate, and publish the data.
Acknowledgements
We thank Chieh-Yang Huang, Hua Shen, and Chacha Chen for helping with the data annotation. We thank Chieh-Yang Huang for the feedback and strong technical support. We also thank the anonymous reviewers for their constructive feedback. This research was partially supported by the Seed Grant (2020) from the College of Information Sciences and Technology (IST), Pennsylvania State University.