A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

(Bill) Yuchen Lin

Last update: Jan 1, 2023

Related tags

Text Data & NLP machine-learning natural-language-processing latex bibtex bibliography publication research-paper

Overview

Rebiber: A tool for normalizing bibtex with official info.

We often cite papers using their arXiv versions without noting that they have already been published at a conference. These unofficial bib entries might violate the submission or camera-ready rules of some conferences. We introduce Rebiber, a simple Python tool that fixes them automatically. It is based on the official conference information from DBLP or the ACL Anthology (for NLP conferences). You can check the list of supported conferences here. Apart from handling outdated arXiv citations, Rebiber also normalizes citations in a unified way (DBLP-style), supporting abbreviation and value selection.

You can use this Google Colab notebook as a simple web demo.

Changelogs

2021.02.08 We now support several useful features: 1) turning off certain values, e.g., "-r url,pages,address" removes those values from the output; 2) using abbreviations to shorten the booktitle values, e.g., Proceedings of the .* Annual Meeting of the Association for Computational Linguistics --> Proc. of ACL. More examples are here.
2021.01.30 We build a colab notebook as a simple web demo. link

Installation

pip install rebiber -U
rebiber --update  # update the bib data and the abbr. info

git clone https://github.com/yuchenlin/rebiber.git
cd rebiber/
pip install -e .

If you would like to use the latest GitHub version with more bug fixes, please use the second installation method (cloning the repository and running pip install -e .).

Usage (v1.1.1)

Normalize your bibtex file with the official conference information:

rebiber -i /path/to/input.bib -o /path/to/output.bib

You can find a pair of example input and output files in rebiber/example_input.bib and rebiber/example_output.bib.

argument	usage
`-i`	or `--input_bib`. The path to the input bib file that you want to update
`-o`	or `--output_bib`. The path to the output bib file that you want to save. If you don't specify any `-o` then it will be the same as the `-i`.
`-r`	or `--remove`. A comma-separated list of value names that you want to remove, such as "-r pages,editor,volume,month,url,biburl,address,publisher,bibsource,timestamp,doi". Empty by default.
`-s`	or `--shorten`. A bool argument that is `"False"` by default, used for replacing `booktitle` with abbreviation in `-a`. Used as `-s True`.
`-d`	or `--deduplicate`. A bool argument that is `"True"` by default, used for removing the duplicate bib entries sharing the same key. Used as `-d True`.
`-l`	or `--bib_list`. The path to the list of the bib json files to be loaded. Check rebiber/bib_list.txt for the default file. Usually you don't need to set this argument.
`-a`	or `--abbr_tsv`. The list of conference abbreviation data. Check rebiber/abbr.tsv for the default file. Usually you don't need to set this argument.
`-u`	or `--update`. Update the local bib-related data with the latest GitHub version.
`-v`	or `--version`. Print the version of current Rebiber.
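
The flags in the table can also be assembled programmatically. The following is a minimal sketch, not part of Rebiber itself: the helper name `build_rebiber_command` is hypothetical, and it only uses the `-i`, `-o`, `-r`, `-s`, and `-d` flags documented above.

```python
import shlex
from typing import List, Optional, Sequence


def build_rebiber_command(
    input_bib: str,
    output_bib: Optional[str] = None,
    remove: Sequence[str] = (),
    shorten: bool = False,
    deduplicate: bool = True,
) -> List[str]:
    """Assemble an argv list from the flags documented in the table (hypothetical helper)."""
    cmd = ["rebiber", "-i", input_bib]
    if output_bib is not None:
        # Without -o, the output path is the same as the input path.
        cmd += ["-o", output_bib]
    if remove:
        # -r expects one comma-separated list of value names.
        cmd += ["-r", ",".join(remove)]
    # -s and -d are passed with an explicit True/False value, e.g. "-s True".
    cmd += ["-s", str(shorten), "-d", str(deduplicate)]
    return cmd


cmd = build_rebiber_command(
    "input.bib", "output.bib", remove=["url", "pages", "address"], shorten=True
)
print(shlex.join(cmd))
# prints: rebiber -i input.bib -o output.bib -r url,pages,address -s True -d True
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)` in an environment where Rebiber is installed.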

Example Input and Output

An example input entry with the arXiv information (from Google Scholar or somewhere):

@article{lin2020birds,
	title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
	author={Lin, Bill Yuchen and Lee, Seyeon and Khanna, Rahul and Ren, Xiang},
	journal={arXiv preprint arXiv:2005.00683},
	year={2020}
}

An example normalized output entry with the official information:

@inproceedings{lin2020birds,
    title = "{B}irds have four legs?! {N}umer{S}ense: {P}robing {N}umerical {C}ommonsense {K}nowledge of {P}re-{T}rained {L}anguage {M}odels",
    author = "Lin, Bill Yuchen  and
      Lee, Seyeon  and
      Khanna, Rahul  and
      Ren, Xiang",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.557",
    doi = "10.18653/v1/2020.emnlp-main.557",
    pages = "6862--6868",
}

Supported Conferences

The bib_list.txt contains a list of converted json files of the official bib data. In this repo, we now support the full ACL anthology, i.e., all papers that are published at *CL conferences (ACL, EMNLP, NAACL, etc.) as well as workshops. Also, we support any conference proceedings that can be downloaded from DBLP, for example, ICLR2020.

The following conferences are supported and their bib/json files are in our data folder. You can turn each item on/off in bib_list.txt. Please feel free to create a PR for adding new conferences following this!

Name	Years
ACL Anthology	(until 2021-01)
AAAI	2010 -- 2020
AISTATS	2013 -- 2020
ALENEX	2010 -- 2020
ASONAM	2010 -- 2019
BigDataConf	2013 -- 2019
BMVC	2010 -- 2020
CHI	2010 -- 2020
CIDR	2009 -- 2020
CIKM	2010 -- 2020
COLT	2000 -- 2020
CVPR	2000 -- 2020
ICASSP	2015 -- 2020
ICCV	2003 -- 2019
ICLR	2013 -- 2020
ICML	2000 -- 2020
IJCAI	2011 -- 2020
KDD	2010 -- 2020
MLSys	2019 -- 2020
MM	2016 -- 2020
NeurIPS	2000 -- 2020
RECSYS	2010 -- 2020
SDM	2010 -- 2020
SIGIR	2010 -- 2020
SIGMOD	2010 -- 2020
SODA	2010 -- 2020
STOC	2010 -- 2020
UAI	2010 -- 2020
WSDM	2008 -- 2020
WWW (The Web Conf)	2001 -- 2020

Thanks to Anton Tsitsulin for his great work on collecting such a complete set of bib files!

Adding a new conference

You can manually add any conference from DBLP by downloading its bib files to our raw_data folder and running the prepared script add_conf.sh.

Take ICLR2020 and ICLR2019 as an example:

Step 1: Go to DBLP
Step 2: Download the bib files, and put them here as raw_data/iclr2020.bib and raw_data/iclr2019.bib (file names should follow the format {conf_name}{year}.bib)
Step 3: Run script

bash add_conf.sh iclr 2019 2020
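
Before running the script, it can help to confirm that the downloaded files follow the {conf_name}{year}.bib naming format from Step 2. The sketch below is an illustration only: the helpers `expected_bib_files` and `missing_bib_files` are hypothetical names, and the `raw_data` default simply mirrors the folder mentioned above.

```python
from pathlib import Path
from typing import List


def expected_bib_files(conf_name: str, years: List[int], raw_dir: str = "raw_data") -> List[Path]:
    """Return the paths implied by the {conf_name}{year}.bib naming format."""
    return [Path(raw_dir) / f"{conf_name}{year}.bib" for year in years]


def missing_bib_files(paths: List[Path]) -> List[Path]:
    """Return the expected files that are not on disk yet."""
    return [p for p in paths if not p.is_file()]


paths = expected_bib_files("iclr", [2019, 2020])
for p in paths:
    print(p.as_posix())
for p in missing_bib_files(paths):
    print(f"missing: {p.as_posix()}")
```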

Contact

Please email [email protected] or create GitHub issues here if you have any questions or suggestions.

Comments

Some references are filtered by `load_bib_file`

It's a great tool, but when I try to convert my .bib file, which is generated by the application BibDesk, some references are filtered out. Here is a minimal example from my bib file.

@inproceedings{zhang2019heterogeneous,
        author = {Zhang, Chuxu and Song, Dongjin and Huang, Chao and Swami, Ananthram and Chawla, Nitesh V},
        booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
        date-added = {2021-04-03 01:39:20 +0800},
        date-modified = {2021-04-03 01:44:13 +0800},
        keywords = {Recommender system, Graph Neural Network},
        pages = {793--803},
        title = {Heterogeneous graph neural network},
        year = {2019},
        Bdsk-Url-1 = {https://doi.org/10.1145/3292500.3330961}}

I think this is due to load_bib_file. The last line of this reference contains {, so load_bib_file skipped this reference.

However, in the BibtexParser, this kind of bib file can be recognized.
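
One layout-independent way to split a bib file into entries is to track brace depth instead of inspecting individual lines. The sketch below is not Rebiber's `load_bib_file`; the function name `split_bib_entries` is hypothetical, and it assumes braces are balanced and does not handle escaped braces or a stray @ outside an entry.

```python
from typing import List


def split_bib_entries(text: str) -> List[str]:
    """Split BibTeX text into entries by tracking brace depth, not line layout."""
    entries = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "@" and depth == 0:
            start = i  # a new entry begins outside of any braces
        elif ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0 and start is not None:
                # The brace that returns to depth 0 closes the entry,
                # wherever it happens to sit on the line.
                entries.append(text[start : i + 1])
                start = None
    return entries


sample = """@inproceedings{a,
  title = {First},
  Bdsk-Url-1 = {https://doi.org/10.1145/3292500.3330961}}

@article{b,
  pages={
          77-82
        }
}
"""
entries = split_bib_entries(sample)
print(len(entries))  # prints: 2
```

Both the entry closed by `}}` on its last field line and the entry with a field spread over several lines are kept.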

opened by AyanoClarke 8

Deleted entry after using rebiber

Hi,

Thanks for the great tool! I faced an issue where an entry was deleted after using rebiber though. It's this one:

@article{loon,
  title={Autonomous navigation of stratospheric balloons using reinforcement learning.},
  author={Marc G. Bellemare and Salvatore Candido and P. S. Castro and J. Gong and Marlos C. Machado and Subhodeep Moitra and Sameera S. Ponda and Ziyu Wang},
  journal={Nature},
  year={2020},
  volume={588 7836},
  pages={
          77-82
        }
}

Do you have any idea what could be wrong?

opened by RaghuSpaceRajan 4

Comments in bib file are transformed into `@comments`
I came across two issues here:

[ ] Somehow the tool transforms the comments in my bib file (lines starting with %) into @comment{} and places them at the head of the file;

[ ] All the bib entries are sorted alphabetically by default. Is there a way (an option) to keep the original order of the entries (and thus keep the comments where they are)?

Great tool by the way :)
opened by boredtylin 4
Adding arxiv URL when available

When a bib entry is not found, the script now checks whether it is an arXiv entry. If so, the new script adds the field url = {https://arxiv.org/abs/<ID>} to the bibtex entry.
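
As an illustration of the idea (not the code from this pull request), the arXiv ID can be pulled out of a journal field with a regular expression. The name `arxiv_url` is hypothetical, and the pattern only covers new-style identifiers such as 2005.00683.

```python
import re
from typing import Optional

# New-style arXiv identifiers look like 2005.00683, optionally followed by a version such as v2.
ARXIV_ID = re.compile(r"arxiv[:\s]*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def arxiv_url(journal_field: str) -> Optional[str]:
    """Return the abs URL for the arXiv ID found in the field, or None if there is none."""
    match = ARXIV_ID.search(journal_field)
    if match is None:
        return None
    # The version suffix is dropped so the URL points at the latest version.
    return f"https://arxiv.org/abs/{match.group(1)}"


print(arxiv_url("arXiv preprint arXiv:2005.00683"))  # prints: https://arxiv.org/abs/2005.00683
print(arxiv_url("Nature"))  # prints: None
```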
help wanted

opened by nicola-decao 3

Handle @string

Nice tool! It seems that currently it doesn't handle @string in BibTeX. Any plan to add this feature?

Example:

@string{emnlp = "Empirical Methods in Natural Language Processing (EMNLP)"}

@inproceedings{li2020efficient,
 title={Efficient One-Pass End-to-End Entity Linking for Questions},
 author={Li, Belinda Z. and Min, Sewon and Iyer, Srinivasan and Mehdad, Yashar and Yih, Wen-tau},
 booktitle=emnlp,
 year={2020}
}
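
One possible workaround, sketched here under simplifying assumptions, is to inline the @string macros before the file is normalized. The function `expand_string_macros` is hypothetical and not part of Rebiber; it only handles macros defined with double quotes, and a bare word that is not a known macro (for example a month name) is left untouched.

```python
import re

# Matches definitions of the form @string{name = "value"}.
STRING_DEF = re.compile(r'@string\s*\{\s*(\w+)\s*=\s*"([^"]*)"\s*\}', re.IGNORECASE)
# Matches "field = word" where word is a bare identifier followed by a comma, brace, or newline.
MACRO_USE = re.compile(r"(\b[\w-]+\s*)=\s*([A-Za-z_]\w*)(?=\s*[,}\n])")


def expand_string_macros(bib_text: str) -> str:
    """Inline @string macros so that every field holds a literal value."""
    # BibTeX macro names are case-insensitive, so store them lowercased.
    macros = {name.lower(): value for name, value in STRING_DEF.findall(bib_text)}
    body = STRING_DEF.sub("", bib_text)

    def replace(match):
        field, name = match.group(1), match.group(2)
        value = macros.get(name.lower())
        if value is None:
            return match.group(0)  # not a known macro: keep the text as it is
        return f"{field}={{{value}}}"

    return MACRO_USE.sub(replace, body)


sample = """@string{emnlp = "Empirical Methods in Natural Language Processing (EMNLP)"}

@inproceedings{li2020efficient,
 title={Efficient One-Pass End-to-End Entity Linking for Questions},
 booktitle=emnlp,
 year={2020}
}"""
print(expand_string_macros(sample))
```

After expansion, the booktitle field reads `booktitle={Empirical Methods in Natural Language Processing (EMNLP)}`, which a matcher that expects literal values can work with.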

enhancement normalization

opened by scottyih 3

Incomplete bib entry for conference

Hello, I find that some papers accepted by some conferences (e.g., AAAI 2020) cannot be indexed. The reason might be that only the first 1,000 entries can be downloaded from DBLP when a conference has more than 1,000 accepted papers. Is there any way to address this problem? Thanks very much!

opened by xiaosen-wang 2

Would you consider providing a Python API?

Although a command-line script is provided, would you consider providing a Python API as well?

For example:

import rebiber

# A triple-quoted string is needed because the entry spans several lines.
bib_str = """@article{lin2020birds,
	title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
	author={Lin, Bill Yuchen and Lee, Seyeon and Khanna, Rahul and Ren, Xiang},
	journal={arXiv preprint arXiv:2005.00683},
	year={2020}
}"""
# rebiber.trans is the function proposed in this issue, not an existing API.
res = rebiber.trans(bib_str)
print(res)

opened by SinclairCoder 2

Modified the demo Colab for copy and paste bib

This notebook implements a few modifications to the original Colab demo: (1) it takes pasted BibTeX as a string input (e.g., BibTeX copied from Google Scholar); (2) it prints the processed BibTeX on screen, so it can be copied and pasted into reference management software; (3) it modifies the upload cell; (4) it adds a download cell for downloading the processed outputs.

opened by Herais 1
Update data from recent ICML/ICLR/NeurIPS/AAAI

I've added data from NeurIPS 2020/2021, ICML 2021, ICLR 2021, and also missing entries from ICML 2020 and AAAI2019/2020, because they each have more than 1000 papers but currently each bib only contains the first 1000 entries.

opened by shizhouxing 1
Fix bug with parsing arXiv entries with multiline fields

The bug is as follows. When the input is something like:

@article{sharma19_paral_restar_spider,
  author = {Sharma, Pranay and Kafle, Swatantra and Khanduri, Prashant
            and Bulusu, Saikiran and Rajawat, Ketan and Varshney, Pramod K.},
  title = {Parallel Restarted Spider -- Communication Efficient
           Distributed Nonconvex Optimization With Optimal Computation Complexity},
  journal = {arXiv preprint arXiv:1912.06036},
  year = 2019,
  url = {http://arxiv.org/abs/1912.06036v2},
  archivePrefix = {arXiv},
  primaryClass = {math.OC},
}

It is changed to:

@article{sharma19_paral_restar_spider,
  author = {Sharma, Pranay and Kafle, Swatantra and Khanduri, Prashan},
  journal = {ArXiv preprint},
  title = {Parallel Restarted Spider -- Communication Efficien},
  url = {https://arxiv.org/abs/1912.06036},
  volume = {abs/1912.06036},
  year = {2019}
}

Author information is lost and the title is shortened to just the first line. This happens because if the entry is an unmatched article from arXiv, the normalizer re-parses the bibtex entry manually looking for the title, author, and arXiv id information. This manual parser fails when any of the entries (title/author/etc.) spans multiple lines. Moreover, the external bibtex parsing library used elsewhere already works quite well. This commit just uses the external parser which can handle multiline fields just fine. Under this, we have the correct output:

@article{sharma19_paral_restar_spider,
  author = {Sharma, Pranay and Kafle, Swatantra and Khanduri, Prashant and Bulusu, Saikiran and Rajawat, Ketan and Varshney, Pramod K.},
  journal = {ArXiv preprint},
  title = {Parallel Restarted Spider -- Communication Efficient Distributed Nonconvex Optimization With Optimal Computation Complexity},
  url = {https://arxiv.org/abs/1912.06036},
  volume = {abs/1912.06036},
  year = {2019}
}

opened by rka97 1
The way `abbr.tsv` is loaded removes entries from the file
abbr.tsv has two entries for ICML:

Proc. of ICML | Proceedings of the .* International Conference on Machine Learning
Proc. of ICML | Machine Learning, Proceedings of the .* International Conference

but they are not both loaded, because in load_abbr_tsv() a dictionary is used such that the second entry overwrites the first one:

ls = line.split("|")
if len(ls) == 2:
    abbr_dict[ls[0].strip()] = ls[1].strip()

I see two solutions here: either don't load the file into a dictionary (use a list of tuples instead), or allow regex alternation patterns (i.e., (pattern1|pattern2)), which would require using a character other than | to separate the left- and right-hand sides in abbr.tsv.
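
The first solution (a list of tuples) could look roughly like the sketch below. The names `load_abbr_rules` and `shorten_booktitle` are hypothetical, and matching with `re.fullmatch` is an assumption about how the patterns are meant to be applied, not a description of Rebiber's behavior.

```python
import re
from typing import Iterable, List, Tuple


def load_abbr_rules(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep every (abbreviation, pattern) pair, even when an abbreviation repeats."""
    rules = []
    for line in lines:
        parts = line.split("|")
        if len(parts) == 2:
            rules.append((parts[0].strip(), parts[1].strip()))
    return rules


def shorten_booktitle(booktitle: str, rules: List[Tuple[str, str]]) -> str:
    """Return the abbreviation of the first rule whose pattern matches the whole booktitle."""
    for abbr, pattern in rules:
        if re.fullmatch(pattern, booktitle):
            return abbr
    return booktitle  # no rule matched: leave the value unchanged


rules = load_abbr_rules([
    "Proc. of ICML | Proceedings of the .* International Conference on Machine Learning",
    "Proc. of ICML | Machine Learning, Proceedings of the .* International Conference",
])
print(len(rules))  # prints: 2
print(shorten_booktitle(
    "Machine Learning, Proceedings of the Twenty-Third International Conference", rules
))  # prints: Proc. of ICML
```

Because the rules are stored in a list, both ICML patterns survive loading, and the order of the file decides which rule wins when several match.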
bug
opened by thomkeh 1
Question about month

Hi Yuchen, It seems you try to ignore the month field in a bib entry in is_contain_var() and build_json(). Can you please explain why is that necessary? You also ignore '@string' entry. Why not just let bibtexparser parse the entire bib file? Thank you!

opened by christophe-gu 0
Add new conference files

Add entries for the following conferences (mostly machine learning conferences):

ICML 2022
AISTATS 2022
COLT 2021, 2022
ICLR 2022
MLSYS 2021, 2022
NeurIPS 2021
UAI 2021

opened by rka97 0
Hope add a command for batch files execution
I have multiple bib files for several research fields and would like to convert them all in one click. I've written a bat file that automatically processes every bib file in the working directory:

@echo off
for %%i in (*.bib) do echo "%%i"
for %%i in (*.bib) do rebiber -i %%i -o Pub%%i
pause
exit

But a built-in command would be easier to use. Would you consider adding one?
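
Until such a command exists, the same loop can be written in a platform-independent way. This is a minimal sketch mirroring the batch file above: `batch_commands` is a hypothetical helper, the `Pub` prefix is taken from the batch file, and skipping files that already carry the prefix is an added assumption so that a second run does not reprocess earlier outputs.

```python
from pathlib import Path
from typing import List


def batch_commands(directory: str, prefix: str = "Pub") -> List[List[str]]:
    """Build one rebiber command per .bib file, writing to <prefix><original name>."""
    commands = []
    for bib in sorted(Path(directory).glob("*.bib")):
        if bib.name.startswith(prefix):
            continue  # assumed to be the output of a previous run
        out = bib.with_name(prefix + bib.name)
        commands.append(["rebiber", "-i", str(bib), "-o", str(out)])
    return commands


for cmd in batch_commands("."):
    print(" ".join(cmd))
```

Each printed command can be executed with `subprocess.run(cmd, check=True)` once Rebiber is installed.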
opened by Saltsmart 0

The booktitle contains too much information

I found that the booktitle of many DBLP entries carries too much information.

For example:

@inproceedings{seo-etal-2016-bidirectional,
 author = {Min Joon Seo and
Aniruddha Kembhavi and
Ali Farhadi and
Hannaneh Hajishirzi},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/iclr/SeoKFH17.bib},
 booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
 publisher = {OpenReview.net},
 timestamp = {Thu, 25 Jul 2019 01:00:00 +0200},
 title = {Bidirectional Attention Flow for Machine Comprehension},
 url = {https://openreview.net/forum?id=HJ0UKP9ge},
 year = {2017}
}

The booktitle here contains the full name and abbreviation of ICLR, as well as the location and dates. Could you keep only the first part of this information?

For example: booktitle = "5th International Conference on Learning Representations"
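
A rough sketch of the requested behavior is to keep only the text before the first comma. The helper `trim_booktitle` is hypothetical, and the rule is deliberately naive: a booktitle such as "Machine Learning, Proceedings of the ..." would be cut too early, so a real option would need per-venue rules.

```python
def trim_booktitle(booktitle: str) -> str:
    """Keep only the text before the first comma of a DBLP-style booktitle."""
    head = booktitle.split(",", 1)[0]
    # Collapse line breaks and repeated spaces left over from multi-line fields.
    return " ".join(head.split())


full = (
    "5th International Conference on Learning Representations, {ICLR} 2017,\n"
    "Toulon, France, April 24-26, 2017, Conference Track Proceedings"
)
print(trim_booktitle(full))
# prints: 5th International Conference on Learning Representations
```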

opened by AliceNEET 1

Confusing behavior with some author names

The ImageNet paper has its last author listed as Li Fei-Fei, which is how she publishes in general, both on the paper and in the IEEE metadata; their .bib has her as Li Fei-Fei in the author.

The DBLP record lists her as Li Fei{-}Fei.

And yet rebiber/data/cvpr2009.bib.json has her as Fei{-}Fei Li, and so running either through rebiber incorrectly changes it to that ordering.

The same is true for most (but not all) of her papers in the database. No idea why this would be, since DBLP consistently has her as Li Fei{-}Fei.

cc @pranav-ust

opened by djsutherland 1
Add LREC + Automatically sync

The LREC Sign Language workshop has this website - https://www.sign-lang.uni-hamburg.de/lrec/index.html

It links to two bib files:
without abstracts: https://www.sign-lang.uni-hamburg.de/lrec/sign-lang_lrec.bib
with abstracts: https://www.sign-lang.uni-hamburg.de/lrec/sign-lang_lrec_a.bib

While one can add them manually to this repo, I was wondering if there is a setting somewhere to just put this link, and whenever someone runs an "update" script it will re-fetch the bib file and process it?

opened by AmitMY 0

Releases(v1.1.3)

v1.1.3(Sep 6, 2021)
add a few features (e.g., sorting, url to arXiv, etc.)

update the bib/json files to the latest conferences.

1.1.1(Feb 9, 2021)

Add the --update and --version features.
1.1(Feb 8, 2021)

"Apart from handling outdated arXiv citations, Rebiber also normalizes citations in a unified way (DBLP-style), supporting abbreviation and value selection."
1.0.2(Feb 8, 2021)

Add a few new features such as removing duplicate entries and disabling selected values.

Fixed a few minor bugs, e.g., @software entries are no longer removed.
v1.0.1(Jan 30, 2021)

Fix some minor bugs about duplicate items and casing issues.
v1.0.0(Jan 29, 2021)


Owner

(Bill) Yuchen Lin

CS PhD student @ USC; NLP/AI/ML

GitHub https://yuchenlin.xyz/
