# Spanish Language Models 💃🏻
This repository is part of the MarIA project.
## 📃 Corpora

Corpora | Number of documents | Number of tokens | Size (GB) |
---|---|---|---|
BNE | 201,080,084 | 135,733,450,668 | 570GB |
## 🤖 Models

- RoBERTa-base BNE: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
- RoBERTa-large BNE: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne
- GPT2-base BNE: https://huggingface.co/PlanTL-GOB-ES/gpt2-base-bne
- GPT2-large BNE: https://huggingface.co/PlanTL-GOB-ES/gpt2-large-bne
- Other models: work in progress
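
The GPT2 checkpoints are causal language models, so they are typically used for text generation rather than mask filling. A minimal sketch using the Hugging Face text-generation pipeline; the prompt and generation settings are illustrative and not part of the original release:

```python
from transformers import pipeline

# Text generation with the GPT2-base BNE checkpoint; prompt and max_new_tokens
# are illustrative values chosen for this sketch.
generator = pipeline("text-generation", model="PlanTL-GOB-ES/gpt2-base-bne")

outputs = generator("La ciencia en España", max_new_tokens=30, num_return_sequences=1)
print(outputs[0]["generated_text"])
```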
## Fine-tuned models 🧗🏼♀️🏇🏼🤽🏼♀️🏌🏼♂️🏄🏼♀️
- RoBERTa-base-BNE for Capitel-POS: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-pos
- RoBERTa-large-BNE for Capitel-POS: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-capitel-pos
- RoBERTa-base-BNE for Capitel-NER: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner
- RoBERTa-base-BNE for Capitel-NER (plus, a more robust variant): https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus
- RoBERTa-large-BNE for Capitel-NER: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-capitel-ner
- RoBERTa-base-BNE for SQAC: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-sqac
- RoBERTa-large-BNE for SQAC: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-sqac
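
The fine-tuned checkpoints can be used directly through task-specific pipelines. As a minimal sketch, assuming the question-answering pipeline with the SQAC-tuned base model (the question and context are illustrative examples, not from the original release):

```python
from transformers import pipeline

# Question answering with the SQAC fine-tuned checkpoint.
qa = pipeline("question-answering", model="PlanTL-GOB-ES/roberta-base-bne-sqac")

result = qa(
    question="¿Dónde vivo?",
    context="Me llamo Wolfgang y vivo en Berlín.",
)
print(result["answer"], result["score"])
```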
## 🔤 Word embeddings

300-dimensional word embeddings trained with FastText:
- CBOW Word embeddings: https://zenodo.org/record/5044988
- Skip-gram Word embeddings: https://zenodo.org/record/5046525
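
As a minimal sketch of loading the vectors, assuming the Zenodo download includes a word2vec-style `.vec` text file (the filename below is a placeholder), gensim's `KeyedVectors` can be used:

```python
from gensim.models import KeyedVectors

# Placeholder filename: replace with the actual .vec file downloaded from Zenodo.
vectors = KeyedVectors.load_word2vec_format("embeddings_cbow_300.vec", binary=False)

# Nearest neighbours of a query word in the embedding space.
print(vectors.most_similar("madrid", topn=5))
```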
## 🗂️ Datasets

- Spanish Question Answering Corpus (SQAC) 🦆: https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC
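
SQAC can be pulled straight from the Hugging Face Hub with the datasets library; a minimal sketch:

```python
from datasets import load_dataset

# Download SQAC from the Hugging Face Hub and inspect the available splits.
sqac = load_dataset("PlanTL-GOB-ES/SQAC")
print(sqac)
```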
## ✅ Evaluation

Dataset | Metric | RoBERTa-base BNE | RoBERTa-large BNE | BETO* | mBERT | BERTIN** | Electricidad*** |
---|---|---|---|---|---|---|---|
UD-POS | F1 | 0.9907 | 0.9898 | 0.9900 | 0.9886 | 0.9898 | 0.9818 |
Conll-NER | F1 | 0.8851 | 0.8772 | 0.8759 | 0.8691 | 0.8835 | 0.7954 |
Capitel-POS | F1 | 0.9846 | 0.9851 | 0.9836 | 0.9839 | 0.9847 | 0.9816 |
Capitel-NER | F1 | 0.8960 | 0.8998 | 0.8772 | 0.8810 | 0.8856 | 0.8035 |
STS | Combined | 0.8533 | 0.8353 | 0.8159 | 0.8164 | 0.7945 | 0.8063 |
MLDoc | Accuracy | 0.9623 | 0.9675 | 0.9663 | 0.9550 | 0.9673 | 0.9493 |
PAWS-X | F1 | 0.9000 | 0.9060 | 0.9000 | 0.8955 | 0.8990 | 0.9025 |
XNLI | Accuracy | 0.8016 | 0.7958 | 0.8130 | 0.7876 | 0.7890 | 0.7878 |
SQAC | F1 | 0.7923 | 0.7993 | 0.7923 | 0.7562 | 0.7678 | 0.7383 |
\* A model based on the BERT architecture.
\*\* A model based on the RoBERTa architecture.
\*\*\* A model based on the ELECTRA architecture.
## ⚗️ Usage example

For RoBERTa-base BNE:
```python
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()

# Build a fill-mask pipeline and print the top predictions for the <mask> token.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
```
For RoBERTa-large BNE:
```python
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model.eval()

# Build a fill-mask pipeline and print the top predictions for the <mask> token.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
```
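
The same result can also be obtained with the high-level pipeline helper, which loads the model and tokenizer in a single call; a minimal sketch:

```python
from transformers import pipeline

# Equivalent fill-mask usage through the high-level pipeline helper.
fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")
print([pred["token_str"] for pred in fill_mask("¡Hola <mask>!")])
```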
## 👩👧👦 Other Spanish Language Models

We are developing domain-specific language models.
## 📣 Cite

```bibtex
@misc{gutierrezfandino2021spanish,
  title={Spanish Language Models},
  author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
  year={2021},
  eprint={2107.07253},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## 📧 Contact

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected]).