BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

Benjamin Heinzerling

Last update: Jan 3, 2023

BPEmb

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Website ・ Usage ・ Download ・ MultiBPEmb ・ Paper (pdf) ・ Citing BPEmb

Usage

Install BPEmb with pip:

pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

You can do two main things with BPEmb. The first is subword segmentation:

# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]

Whether and how a word is split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many subwords, while a larger vocabulary size leaves frequent words unsplit:

vocabulary size	segmentation
1000	['▁str', 'at', 'f', 'ord']
3000	['▁str', 'at', 'ford']
5000	['▁str', 'at', 'ford']
10000	['▁strat', 'ford']
25000	['▁stratford']
50000	['▁stratford']
100000	['▁stratford']
200000	['▁stratford']
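
The effect of vocabulary size can be illustrated with a toy BPE learner. This is only a minimal sketch and not the SentencePiece implementation that BPEmb uses: the tiny word list and the helper names (`learn_bpe`, `apply_merge`, `segment`) are made up for illustration. The number of merge operations plays the role of the vocabulary size, so more merges mean fewer, longer subwords.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a sequence of characters; "▁" marks the word start.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[tuple(apply_merge(symbols, best))] += freq
        corpus = merged_corpus
    return merges


def segment(word, merges):
    """Segment `word` by applying the learned merges in order."""
    symbols = list("▁" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical mini-corpus, chosen only so that "stratford" is frequent.
words = ["stratford"] * 5 + ["strat", "ford", "oxford", "street", "strand"]
for n in (0, 4, 8, 20):
    print(n, segment("stratford", learn_bpe(words, n)))
```

With zero merges the word is split into single characters, and each additional merge can only keep or reduce the number of pieces, which mirrors the trend in the table above.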

The second purpose of BPEmb is to provide pretrained subword embeddings:

# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)
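
For intuition about what a similarity query like the one above computes, here is a minimal pure-Python sketch of a cosine-similarity nearest-neighbour search. The three-dimensional vectors in `toy` are invented for illustration and are not taken from any BPEmb model, and the `most_similar` function defined here is a stand-in, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, topn=3):
    """Return the `topn` entries closest to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [(tok, cosine(q, vec)) for tok, vec in embeddings.items() if tok != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Made-up toy vectors (assumption): real BPEmb vectors have 25 or more dimensions.
toy = {
    "ford": [0.9, 0.1, 0.0],
    "bridge": [0.8, 0.2, 0.1],
    "ton": [0.7, 0.3, 0.0],
    "▁the": [0.0, 0.1, 0.9],
}
print(most_similar("ford", toy, topn=2))
```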

To use subword embeddings in your neural network, either encode your input into subword IDs:

>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)

Or use the embed method:

# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)
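
The ID-based lookup can be tried without downloading a model. The sketch below assumes NumPy is installed and uses a randomly initialised matrix as a stand-in for `bpemb_zh.vectors`; only the shapes match the example above. The mean pooling at the end is one common way to obtain a fixed-size sentence vector and is an addition for illustration, not something BPEmb does for you.

```python
import numpy as np

# Stand-in for bpemb_zh.vectors (assumption): random values with the same shape.
rng = np.random.default_rng(0)
vocab_size, dim = 100_000, 100
vectors = rng.standard_normal((vocab_size, dim), dtype=np.float32)

# Subword IDs as returned by encode_ids in the example above.
ids = [25950, 695, 20199]

# Indexing with a list of IDs selects one row per subword.
subword_vecs = vectors[ids]

# Average over the subword axis to get a single vector for the sentence.
sentence_vec = subword_vecs.mean(axis=0)

print(subword_vecs.shape, sentence_vec.shape)
```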

Downloads for each language

ab (Abkhazian) ・ ace (Achinese) ・ ady (Adyghe) ・ af (Afrikaans) ・ ak (Akan) ・ als (Alemannic) ・ am (Amharic) ・ an (Aragonese) ・ ang (Old English) ・ ar (Arabic) ・ arc (Official Aramaic) ・ arz (Egyptian Arabic) ・ as (Assamese) ・ ast (Asturian) ・ atj (Atikamekw) ・ av (Avaric) ・ ay (Aymara) ・ az (Azerbaijani) ・ azb (South Azerbaijani)

ba (Bashkir) ・ bar (Bavarian) ・ bcl (Central Bikol) ・ be (Belarusian) ・ bg (Bulgarian) ・ bi (Bislama) ・ bjn (Banjar) ・ bm (Bambara) ・ bn (Bengali) ・ bo (Tibetan) ・ bpy (Bishnupriya) ・ br (Breton) ・ bs (Bosnian) ・ bug (Buginese) ・ bxr (Russia Buriat)

ca (Catalan) ・ cdo (Min Dong Chinese) ・ ce (Chechen) ・ ceb (Cebuano) ・ ch (Chamorro) ・ chr (Cherokee) ・ chy (Cheyenne) ・ ckb (Central Kurdish) ・ co (Corsican) ・ cr (Cree) ・ crh (Crimean Tatar) ・ cs (Czech) ・ csb (Kashubian) ・ cu (Church Slavic) ・ cv (Chuvash) ・ cy (Welsh)

da (Danish) ・ de (German) ・ din (Dinka) ・ diq (Dimli) ・ dsb (Lower Sorbian) ・ dty (Dotyali) ・ dv (Dhivehi) ・ dz (Dzongkha)

ee (Ewe) ・ el (Modern Greek) ・ en (English) ・ eo (Esperanto) ・ es (Spanish) ・ et (Estonian) ・ eu (Basque) ・ ext (Extremaduran)

fa (Persian) ・ ff (Fulah) ・ fi (Finnish) ・ fj (Fijian) ・ fo (Faroese) ・ fr (French) ・ frp (Arpitan) ・ frr (Northern Frisian) ・ fur (Friulian) ・ fy (Western Frisian)

ga (Irish) ・ gag (Gagauz) ・ gan (Gan Chinese) ・ gd (Scottish Gaelic) ・ gl (Galician) ・ glk (Gilaki) ・ gn (Guarani) ・ gom (Goan Konkani) ・ got (Gothic) ・ gu (Gujarati) ・ gv (Manx)

ha (Hausa) ・ hak (Hakka Chinese) ・ haw (Hawaiian) ・ he (Hebrew) ・ hi (Hindi) ・ hif (Fiji Hindi) ・ hr (Croatian) ・ hsb (Upper Sorbian) ・ ht (Haitian) ・ hu (Hungarian) ・ hy (Armenian)

ia (Interlingua) ・ id (Indonesian) ・ ie (Interlingue) ・ ig (Igbo) ・ ik (Inupiaq) ・ ilo (Iloko) ・ io (Ido) ・ is (Icelandic) ・ it (Italian) ・ iu (Inuktitut)

ja (Japanese) ・ jam (Jamaican Creole English) ・ jbo (Lojban) ・ jv (Javanese)

ka (Georgian) ・ kaa (Kara-Kalpak) ・ kab (Kabyle) ・ kbd (Kabardian) ・ kbp (Kabiyè) ・ kg (Kongo) ・ ki (Kikuyu) ・ kk (Kazakh) ・ kl (Kalaallisut) ・ km (Central Khmer) ・ kn (Kannada) ・ ko (Korean) ・ koi (Komi-Permyak) ・ krc (Karachay-Balkar) ・ ks (Kashmiri) ・ ksh (Kölsch) ・ ku (Kurdish) ・ kv (Komi) ・ kw (Cornish) ・ ky (Kirghiz)

la (Latin) ・ lad (Ladino) ・ lb (Luxembourgish) ・ lbe (Lak) ・ lez (Lezghian) ・ lg (Ganda) ・ li (Limburgan) ・ lij (Ligurian) ・ lmo (Lombard) ・ ln (Lingala) ・ lo (Lao) ・ lrc (Northern Luri) ・ lt (Lithuanian) ・ ltg (Latgalian) ・ lv (Latvian)

mai (Maithili) ・ mdf (Moksha) ・ mg (Malagasy) ・ mh (Marshallese) ・ mhr (Eastern Mari) ・ mi (Maori) ・ min (Minangkabau) ・ mk (Macedonian) ・ ml (Malayalam) ・ mn (Mongolian) ・ mr (Marathi) ・ mrj (Western Mari) ・ ms (Malay) ・ mt (Maltese) ・ mwl (Mirandese) ・ my (Burmese) ・ myv (Erzya) ・ mzn (Mazanderani)

na (Nauru) ・ nap (Neapolitan) ・ nds (Low German) ・ ne (Nepali) ・ new (Newari) ・ ng (Ndonga) ・ nl (Dutch) ・ nn (Norwegian Nynorsk) ・ no (Norwegian) ・ nov (Novial) ・ nrm (Narom) ・ nso (Pedi) ・ nv (Navajo) ・ ny (Nyanja)

oc (Occitan) ・ olo (Livvi) ・ om (Oromo) ・ or (Oriya) ・ os (Ossetian)

pa (Panjabi) ・ pag (Pangasinan) ・ pam (Pampanga) ・ pap (Papiamento) ・ pcd (Picard) ・ pdc (Pennsylvania German) ・ pfl (Pfaelzisch) ・ pi (Pali) ・ pih (Pitcairn-Norfolk) ・ pl (Polish) ・ pms (Piemontese) ・ pnb (Western Panjabi) ・ pnt (Pontic) ・ ps (Pushto) ・ pt (Portuguese)

qu (Quechua)

rm (Romansh) ・ rmy (Vlax Romani) ・ rn (Rundi) ・ ro (Romanian) ・ ru (Russian) ・ rue (Rusyn) ・ rw (Kinyarwanda)

sa (Sanskrit) ・ sah (Yakut) ・ sc (Sardinian) ・ scn (Sicilian) ・ sco (Scots) ・ sd (Sindhi) ・ se (Northern Sami) ・ sg (Sango) ・ sh (Serbo-Croatian) ・ si (Sinhala) ・ sk (Slovak) ・ sl (Slovenian) ・ sm (Samoan) ・ sn (Shona) ・ so (Somali) ・ sq (Albanian) ・ sr (Serbian) ・ srn (Sranan Tongo) ・ ss (Swati) ・ st (Southern Sotho) ・ stq (Saterfriesisch) ・ su (Sundanese) ・ sv (Swedish) ・ sw (Swahili) ・ szl (Silesian)

ta (Tamil) ・ tcy (Tulu) ・ te (Telugu) ・ tet (Tetum) ・ tg (Tajik) ・ th (Thai) ・ ti (Tigrinya) ・ tk (Turkmen) ・ tl (Tagalog) ・ tn (Tswana) ・ to (Tonga) ・ tpi (Tok Pisin) ・ tr (Turkish) ・ ts (Tsonga) ・ tt (Tatar) ・ tum (Tumbuka) ・ tw (Twi) ・ ty (Tahitian) ・ tyv (Tuvinian)

udm (Udmurt) ・ ug (Uighur) ・ uk (Ukrainian) ・ ur (Urdu) ・ uz (Uzbek)

ve (Venda) ・ vec (Venetian) ・ vep (Veps) ・ vi (Vietnamese) ・ vls (Vlaams) ・ vo (Volapük)

wa (Walloon) ・ war (Waray) ・ wo (Wolof) ・ wuu (Wu Chinese)

xal (Kalmyk) ・ xh (Xhosa) ・ xmf (Mingrelian)

yi (Yiddish) ・ yo (Yoruba)

za (Zhuang) ・ zea (Zeeuws) ・ zh (Chinese) ・ zu (Zulu)

MultiBPEmb

multi (multilingual)

Citing BPEmb

If you use BPEmb in academic work, please cite:

@InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }

Comments

adding special tokens to a BPEmb model

Hi,

Thanks for this excellent resource! I've been using BPEmbs in my models since learning about them recently and have found them to work quite well. I'm currently trying to figure out how to use them most effectively with my data which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc.

This might be an obvious question, but can you think of a way to extend the vocabulary by adding special tokens to a pre-trained sentencepiece model or is this out of the question? If so, perhaps it would be possible to allow for a certain number of arbitrary special tokens in future iterations of BPEmbs.

Thanks in advance!

opened by tannonk 7
Symbols don't match between model and embedding

It seems that symbols from the sentencepiece model and the associated embeddings are not the same, specifically control symbols are not present in the embedding. For instance, in de.wiki.bpe.op50000.model there are the symbols <s>, </s> and <unk>, but in the associated embedding de.wiki.bpe.vs50000.d300.vectors.bin only <unk> is defined.

Update: Further exploration reveals that the two files mentioned above have 49631 symbols in common.

opened by noe 6
Vocab length != word vector count

Hey, there is an inconsistency between the count of words in the vocabulary file of the English word vectors for 25000 BPE merges and the count of word vectors with a dimensionality of 200. In the vocab file, there is a count of 25000 words, but the .bin file contains 25777 word vectors after loading it into gensim.

Additionally, the order of the word vectors differs from the vocab file. For example, the word "▁explanation" is at position 19387 in the vocab file, but at position 9138 in the .bin file.

It would be very helpful if the index of the vocabulary in all files would be in the same order and have the same count.

opened by tocab 5
Some embeddings are invalid (majority of vectors is inf or nan)

Firstly thanks for your efforts in providing the pretrained embeddings.

Unfortunately some of the embeddings are not trained correctly. For instance, the d300 embeddings for the 10k model for English contain 9640 vectors with inf entries, out of 10817 total. It would be great if you can provide your training script, or double check it and upload fixed vectors. The d100 embeddings that you use in the Readme are indeed fine and do not contain any inf values.

Furthermore, while debugging this issue I noticed that the embeddings contain Chinese characters at the following indices: [10345, 10451, 10458, 10475, 10514, 10531, 10539, 10541, 10601, 10606, 10609, 10622, 10627, 10632, 10633, 10638, 10657, 10702, 10740, 10750, 10755, 10756, 10762, 10781, 10790, 10791, 10802, 10809, 10810, 10815]. Perhaps it would be sensible to filter out sentences containing Chinese characters from the training corpus?

opened by leezu 5
EOFError: Compressed file ended before the end-of-stream marker was reached

Dear @all,

I'm trying to load the Dutch BPEmb model with vocabulary size 50k and 100-dimensional embeddings.

bpemb_de = BPEmb(lang="de", vs=50000)

I got an EOFError:

EOFError                                  Traceback (most recent call last)
      1 import bpemb
      2 from bpemb import BPEmb
----> 3 bpemb_de = BPEmb(lang="de", vs=50000)

~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in __init__(self, lang, vs, dim, cache_dir, preprocess, encode_extra_options, add_pad_emb, vs_fallback, segmentation_only, model_file, emb_file)
    188         else:
    189             emb_file = self.emb_tpl.format(lang=lang, vs=vs, dim=dim)
--> 190         self.emb_file = self._load_file(emb_file, archive=True)
    191         self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
    192         self.most_similar = self.emb.most_similar

~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in _load_file(self, file, archive, cache_dir)
    226         file_url = self.base_url + file + suffix
    227         print("downloading", file_url)
--> 228         return http_get(file_url, cached_file, ignore_tardir=True)
    229
    230     def __repr__(self):

~/anaconda3/lib/python3.6/site-packages/bpemb/util.py in http_get(url, outfile, ignore_tardir)
     47         import tarfile
     48         tf = tarfile.open(fileobj=temp_file)
---> 49         members = tf.getmembers()
     50         if len(members) != 1:
     51             raise NotImplementedError("TODO: extract multiple files")

~/anaconda3/lib/python3.6/tarfile.py in getmembers(self)
   1759         self._check()
   1760         if not self._loaded:    # if we want to obtain a list of
-> 1761             self._load()        # all members, we first have to
   1762                                 # scan the whole archive.
   1763         return self.members

~/anaconda3/lib/python3.6/tarfile.py in _load(self)
   2356         """
   2357         while True:
-> 2358             tarinfo = self.next()
   2359             if tarinfo is None:
   2360                 break

~/anaconda3/lib/python3.6/tarfile.py in next(self)
   2287         # Advance the file pointer.
   2288         if self.offset != self.fileobj.tell():
-> 2289             self.fileobj.seek(self.offset - 1)
   2290             if not self.fileobj.read(1):
   2291                 raise ReadError("unexpected end of data")

~/anaconda3/lib/python3.6/gzip.py in seek(self, offset, whence)
    366         elif self.mode == READ:
    367             self._check_not_closed()
--> 368             return self._buffer.seek(offset, whence)
    369
    370         return self.offset

~/anaconda3/lib/python3.6/_compression.py in seek(self, offset, whence)
    141         # Read and discard data until we reach the desired position.
    142         while offset > 0:
--> 143             data = self.read(min(io.DEFAULT_BUFFER_SIZE, offset))
    144             if not data:
    145                 break

~/anaconda3/lib/python3.6/gzip.py in read(self, size)
    480                 break
    481             if buf == b"":
--> 482                 raise EOFError("Compressed file ended before the "
    483                                "end-of-stream marker was reached")
    484

EOFError: Compressed file ended before the end-of-stream marker was reached

Any suggestions for fixing this issue would be appreciated!

opened by aimanmutasem 4
question on https://nlp.h-its.org
Hi,

Now that issue https://github.com/bheinzerling/bpemb/issues/34 is sorted out, I'm planning to implement a simple wrapper in the R wrapper of sentencepiece at https://github.com/bnosac/sentencepiece to download the models you have been providing, and then put the package on CRAN. Before I do this, I would like to know the intention of that site, https://nlp.h-its.org. Namely:

Who maintains that site?

Will that site persist over time?

Are there any objections to redistributing these sentencepiece models and glove embeddings, license-wise or otherwise?

Thanks for any input.
opened by jwijffels 4
version of sentencepiece used

Hi, I'm writing an R wrapper around sentencepiece and tried to load a few of the models and vocabulary provided here. I have found some inconsistencies. In order to make sure this is not related to the version of SentencePiece you used, can you let me know with which version / commit of SentencePiece the models were constructed? I'm making the R wrapper around sentencepiece release v0.1.84 from Oct 12, 2019.

opened by jwijffels 4
Adding support for own models

Hi,

First of all, thanks for the great package. Currently, the only way to use my own models with bpemb is to first load another model, and then assign the .spm and .emb attributes manually. This is a bit unwieldy.

I am interested in adding a subclass of BPEmb that overrides the __init__ of BPEmb and simply accepts paths to an spm and emb model/file, from which the other attributes (e.g. size/vs) are derived. Is this something you would accept as a PR? Do you see any problems with this approach?

Thanks! Stéphan

opened by stephantul 3
How do you get the embedding/id for the pad token ?
Hi, this may be a dumb question, but when creating a BPEmb with add_pad_emb=True, how do I actually get the padding embedding, and what is its ID? This should perhaps be mentioned somewhere in the docs:

from bpemb import BPEmb
bp = BPEmb(lang="en", vs=1000, dim=50, add_pad_emb=True)
print(bp.vs)  # prints 1000, was expecting 1000+1 ?

Thanks for the great work, derlin
opened by derlin 3
numbers/digits conversion

Hi there, thanks for providing such wonderful pretrained models. I noticed the pretrained model converts all digits/numbers to 0. May I ask why? Numbers are very useful and sometimes crucial for downstream NLP tasks such as question answering (e.g. when asking questions related to numbers).

opened by csarron 3
Missing tokens in German model

For the German models, some tokens that are present in the vocab list and sentencepiece models seem to be missing from the vector models (200k merge operations). For example, ▁plaisor and auens are in the vocab but fail to look up in the model.

len(model.index2word) returns 185712 tokens for the model, while the vocab list has 200k entries.

Am I missing something here, or is there a mismatch between the training of the sentencepiece tokens and the glove vectors?

Great resource otherwise, thanks!

opened by maurice-g 3
Truecase supported.

I'm working on a machine translation task. When I encode a corpus with bpemb, the output is always lower case. Is it possible to retain case information after encoding my corpus?

opened by BrightXiaoHan 0
Train --model_type=unigram

Thank you for making this resource.

It would be nice if we could use a model trained with --model_type=unigram, which is the default mode of sentencepiece.

BPE and Unigram show basically the same BLEU score, but unigram is more flexible and supports subword sampling. https://arxiv.org/abs/1804.10959

opened by taku910 2

Owner

Benjamin Heinzerling

GitHub https://nlp.h-its.org/bpemb
