Unsupervised Language Model Pre-training for French

GETALP

Last update: Dec 10, 2022

Related tags

Text Data & NLP Flaubert

Overview

FlauBERT and FLUE

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. This repository shares everything: pre-trained models (base and large), the data, the code to use the models and the code to train them if you need.

Along with FlauBERT comes FLUE: an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language.

This repository is still under construction and everything will be available soon.

1. FlauBERT models
2. Using FlauBERT
    2.1. Using FlauBERT with Hugging Face's Transformers
    2.2. Using FlauBERT with Facebook XLM's library
3. Pre-training FlauBERT
    3.1. Data
    3.2. Training
    3.3. Convert an XLM pre-trained model to Hugging Face's Transformers
4. Fine-tuning FlauBERT on the FLUE benchmark
5. Citation

1. FlauBERT models

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We have released the pretrained weights for the following model sizes.

The pretrained models are available for download from here or via Hugging Face's library.

Model name	Number of layers	Attention Heads	Embedding Dimension	Total Parameters
`flaubert-small-cased`	6	8	512	54 M
`flaubert-base-uncased`	12	12	768	137 M
`flaubert-base-cased`	12	12	768	138 M
`flaubert-large-cased`	24	16	1024	373 M

Note: flaubert-small-cased is partially trained so performance is not guaranteed. Consider using it for debugging purpose only.

We also provide the checkpoints from here for model base (cased/uncased) and large (cased).

2. Using FlauBERT

In this section, we describe two ways to obtain sentence embeddings from pretrained FlauBERT models: either via Hugging Face's Transformer library or via Facebook's XLM library. We will intergrate FlauBERT into Facebook' fairseq in the near future.

2.1. Using FlauBERT with Hugging Face's Transformers

You can use FlauBERT with Hugging Face's Transformers library as follow.

import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased' 

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token correspond to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]

Notes: if your transformers version is <=2.10.0, modelname should take one of the following values:

['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']

2.2. Using FlauBERT with Facebook XLM's library

The pretrained FlauBERT models are available for download from here. Each compressed folder includes 3 files:

*.pth: FlauBERT's pretrained model.
codes: BPE codes learned on the training data.
vocab: BPE vocabulary file.

Note: The following example only works for the modified XLM provided in this repo, it won't work for the original XLM. The code is taken from this tutorial.

import sys
import torch
import fastBPE

# Add Flaubert root to system path (change accordingly)
FLAUBERT_ROOT = '/home/user/Flaubert'
sys.path.append(FLAUBERT_ROOT)

from xlm.model.embedder import SentenceEmbedder
from xlm.data.dictionary import PAD_WORD


# Paths to model files
model_path = '/home/user/flaubert_base_cased/flaubert_base_cased_xlm.pth'
codes_path = '/home/user/flaubert_base_cased/codes'
vocab_path = '/home/user/flaubert_base_cased/vocab'
do_lowercase = False # Change this to True if you use uncased FlauBERT

bpe = fastBPE.fastBPE(codes_path, vocab_path)

sentences = "Le chat mange une pomme ."
if do_lowercase:
    sentences = sentences.lower()

# Apply BPE
sentences = bpe.apply([sentences])
sentences = [(('</s> %s </s>' % sent.strip()).split()) for sent in sentences]
print(sentences)

# Create batch
bs = len(sentences)
slen = max([len(sent) for sent in sentences])

# Reload pretrained model
embedder = SentenceEmbedder.reload(model_path)
embedder.eval()
dico = embedder.dico

# Prepare inputs to model
word_ids = torch.LongTensor(slen, bs).fill_(dico.index(PAD_WORD))
for i in range(len(sentences)):
    sent = torch.LongTensor([dico.index(w) for w in sentences[i]])
    word_ids[:len(sent), i] = sent
lengths = torch.LongTensor([len(sent) for sent in sentences])

# Get sentence embeddings (corresponding to the BERT [CLS] token)
cls_embedding = embedder.get_embeddings(x=word_ids, lengths=lengths)
print(cls_embedding.size())

# Get the entire output tensor for all tokens
# Note that cls_embedding = tensor[0]
tensor = embedder.get_embeddings(x=word_ids, lengths=lengths, all_tokens=True)
print(tensor.size())

3. Pre-training FlauBERT

Install dependencies

You should clone this repo and then install WikiExtractor, fastBPE and Moses tokenizer under tools:

git clone https://github.com/getalp/Flaubert.git
cd Flaubert

# Install toolkit
cd tools
git clone https://github.com/attardi/wikiextractor.git
git clone https://github.com/moses-smt/mosesdecoder.git

git clone https://github.com/glample/fastBPE.git
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

3.1. Data

In this section, we describe the pipeline to prepare the data for training FlauBERT. This is based on Facebook XLM's library. The steps are as follows:

Download, clean, and tokenize data using Moses tokenizer.
Split cleaned data into: train, validation, and test sets.
Learn BPE on the training set. Then apply learned BPE codes to train, validation, and test sets.
Binarize data.

(1) Download and Preprocess Data

In the following, replace $DATA_DIR, $corpus_name respectively with the path to the local directory to save the downloaded data and the name of the corpus that you want to download among the options specified in the scripts.

To download and preprocess the data, excecute the following commands:

./download.sh $DATA_DIR $corpus_name fr
./preprocess.sh $DATA_DIR $corpus_name fr

For example:

./download.sh ~/data gutenberg fr
./preprocess.sh ~/data gutenberg fr

The first command will download the raw data to $DATA_DIR/raw/fr_gutenberg, the second one processes them and save to $DATA_DIR/processed/fr_gutenberg.

(2) Split Data

Run the following command to split cleaned corpus into train, validation, and test sets. You can modify the train/validation/test ratio in the script.

bash tools/split_train_val_test.sh $DATA_PATH

where $DATA_PATH is path to the file to be split.

The output files are: fr.train, fr.valid, fr.test which are saved under the same directory as the original file.

(3) & (4) Learn BPE and Prepare Data

Run the following command to learn BPE codes on the training set, and apply BPE codes on the train, validation, and test sets. The data is then binarized and ready for training.

bash tools/create_pretraining_data.sh $DATA_DIR $BPE_size

where $DATA_DIR is path to the directory where the 3 above files fr.train, fr.valid, fr.test are saved. $BPE_size is the number of BPE vocabulary size, for example: 30 for 30k,50 for 50k, etc. The output files are saved in $DATA_DIR/BPE/30k or $DATA_DIR/BPE/50k correspondingly.

3.2. Training

Our codebase for pretraining FlauBERT is largely based on the XLM repo, with some modifications. You can use their code to train FlauBERT, it will work just fine.

Execute the following command to train FlauBERT (base) on your preprocessed data:

python train.py \
    --exp_name flaubert_base_cased \
    --dump_path $dump_path \
    --data_path $data_path \
    --amp 1 \
    --lgs 'fr' \
    --clm_steps '' \
    --mlm_steps 'fr' \
    --emb_dim 768 \
    --n_layers 12 \
    --n_heads 12 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 16 \
    --bptt 512 \
    --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_fr_mlm_ppl \
    --stopping_criterion _valid_fr_mlm_ppl,20 \
    --fp16 true \
    --accumulate_gradients 16 \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'

where $dump_path is the path to where you want to save your pretrained model, $data_path is the path to the binarized data sets, for example $DATA_DIR/BPE/50k.

Run experiments on multiple GPUs and/or multiple nodes

To run experiments on multiple GPUs in a single machine, you can use the following command (the parameters after train.py are the same as above).

export NGPU=4
export CUDA_VISIBLE_DEVICES=0,1,2,3,4 # if you only use some of the GPUs in the machine
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

To run experiments on multiple nodes, multiple GPUs in clusters using SLURM as a resource manager, you can use the following command to launch training after requesting resources with #SBATCH (the parameters after train.py are the same as above plus --master_port parameter).

srun python train.py

3.3. Convert an XLM pre-trained model to Hugging Face's Transformers

To convert an XLM pre-trained model to Hugging Face's Transformers, you can use the following command.

python tools/use_flaubert_with_transformers/convert_to_transformers.py --inputdir $inputdir --outputdir $outputdir

where $inputdir is path to the XLM pretrained model directory, $outputdir is path to the output directory where you want to save the Hugging Face's Transformer model.

4. Fine-tuning FlauBERT on the FLUE benchmark

FLUE (French Language Understanding Evaludation) is a general benchmark for evaluating French NLP systems. Please refer to this page for an example of fine-tuning FlauBERT on this benchmark.

5. Video presentation

You can watch this 7mn video presentation of FlauBERT [VIDEO 7mn] (https://www.youtube.com/watch?v=NgLM9GuwSwc)

6. Citation

If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:

LREC paper

@InProceedings{le2020flaubert,
  author    = {Le, Hang  and  Vial, Lo\"{i}c  and  Frej, Jibril  and  Segonne, Vincent  and  Coavoux, Maximin  and  Lecouteux, Benjamin  and  Allauzen, Alexandre  and  Crabb\'{e}, Beno\^{i}t  and  Besacier, Laurent  and  Schwab, Didier},
  title     = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2479--2490},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}

TALN paper

@inproceedings{le2020flaubert,
  title         = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
  author        = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
  booktitle     = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
  pages         = {268--278},
  year          = {2020},
  organization  = {ATALA}
}

Comments

dataset in French NLI and STS - like

Hi,

When getting data from get-data-xnli.sh , I notice that most dataset is not in French. Hence, I wonder how you used that in practice?

I am currently looking for a some NLI-like and STS-like datasets in French. That would be great to fine-tune Flaubert!

As a suggestion, translation the English version of NLI and STS to French could be a good option to fine tune Flaubert on such tasks.

opened by dataislife 15
How can I train flaubert on a different corpus (not gutenberg, wiki) but for another domain ?

Good afternoon,

I tried to follow your instructions to train my own corpus with Flaubert in order to get a model to use for my classification task but I am having trouble understanding the procedure ?

You said we should use this line to train on our preprocessed data :

/Flaubert$ python train.py --exp_name flaubert_base_lower --dump_path ./dumped/ --data_path ./own_data/data/ --lgs 'fr' --clm_steps '' --mlm_steps 'fr' --emb_dim 768 --n_layers 12 --n_heads 12 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 16 --bptt 512 --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion _valid_fr_mlm_ppl,20 --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

I tried it after cloning flaubert and all the necessary librairies but I am getting this error :

FAISS library was not found. FAISS not available. Switching to standard nearest neighbors search implementation. ./own_data/data/train.fr.pth not found ./own_data/data/valid.fr.pth not found ./own_data/data/test.fr.pth not found Traceback (most recent call last): File "train.py", line 387, in check_data_params(params) File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()]) AssertionError

Does this mean I have to split my own data in three corpus after preprocessing it ? (train , valid and test ??) Should I use your preprossed script on my own before executing the command ?

opened by keloemma 11
Will lemmatization negatively affect FlauBERT word embeddings?

Hello!

I am using FlauBERT to generate word embeddings as part of a study on word sense disambiguation (WSD).

The FlauBERT tokenizer does not recognize a significant number of words in my corpus and, as a result, segments them. For example, the FlauBERT tokenizer does not recognize the archaic orthography of some verbs. It also does not recognize plural forms of a number of other words in the corpus. As a result, the tokenizer segments a number of words in the corpus, some of which are significant to my research.

I understand that I could further train FlauBERT on a corpus of eighteenth-century French in order to create a new model and a new tokenizer specifically for eighteenth-century French. However, compiling a corpus of eighteenth-century French that is large and heterogeneous enough to be useful would be challenging (and perhaps not even possible).

As an alternative to training a new model, I thought I might lemmatize my corpus before running it through FlauBERT. Stanford's Stanza NLP package (stanfordnlp.github.io/stanza/) recognizes the archaic orthography of the verbs in my corpus and turns them into the infinitive form, a form FlauBERT recognizes. Similarly, Stanza also changes the plural forms of other words into singular forms, forms FlauBERT also recognizes. Thus, if I were to lemmatize my corpus in Stanza, the FlauBERT tokenizer would then be able to recognize substantially more words in my corpus.

Would lemmatizing my corpus in this way adversely affect my FlauBERT results and a WSD analysis in particular? In general, does lemmatization have a negative effect on BERT results and WSD analyses more particularly?

Given that FlauBERT is not trained on lemmatized text, I imagine that lemmatizing the corpus would indeed negatively effect the results of the analysis. As an alternative to training FlauBERT on a corpus of eighteenth-century French (which may not be possible), could I instead train it on a corpus of lemmatized French and then use this new model for a WSD analysis on my corpus of lemmatized eighteenth-century French? Would that work?

I'm not sure if this is the right place for these sorts of questions!

Thank you in advance for your time.

opened by mcriggs 9
bug in run_flue.py

Hi I got this error when running the run_flue.py from transformers import flue_compute_metrics as compute_metrics Import Error: cannot import name 'flue_compute_metrics'

I already preinstalled the requirements and update the transformer directory.

opened by keloemma 7
tweet sentiment analysis in french

Hi,

Hope you are all well !

Is it possible with Flaubert to do some tweet sentiment analysis written in french ? If so, how can we do that ?

Vive la France ! :-)

Cheers, X

opened by ghost 7

Model name not found or config.json missing

Transformers version 2.3.0 Pytorch version 1.4

When running readme.md code

import torch
from transformers import XLMModel, XLMTokenizer
modelname="xlm_bert_fra_base_lower" # Or absolute path to where you put the folder

# Load model
flaubert, log = XLMModel.from_pretrained(modelname, output_loading_info=True)
# check import was successful, the dictionary should have empty lists as values
print(log)

# Load tokenizer
flaubert_tokenizer = XLMTokenizer.from_pretrained(modelname, do_lowercase_and_remove_accent=False)

sentence="Le chat mange une pomme."
sentence_lower = sentence.lower()

token_ids = torch.tensor([flaubert_tokenizer.encode(sentence_lower)])
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)

Output

OSError: Model name 'xlm_bert_fra_base_lower' was not found in model name list 
(xlm-mlm-en-2048, xlm-mlm-ende-1024, xlm-mlm-enfr-1024, xlm-mlm-enro-1024, xlm-mlm-tlm-xnli15-1024, xlm-mlm-xnli15-1024, xlm-clm-enfr-1024, xlm-clm-ende-1024,
 xlm-mlm-17-1280, xlm-mlm-100-1280). We assumed 'xlm_bert_fra_base_lower' 
was a path or url to a configuration file named config.json 
or a directory containing such a file but couldn't find any such file at this path or url.

Tried downloading the model as tar file (lower and normal), extracting it and putting its absolute folder path as modelname but keeps getting the same error....

I may be missing something so stupid but I don't see a config.json file in the archive file

Screen Shot 2020-01-21 at 15 59 09

What is wrong?

opened by hzitoun 6

Import Error

Hi I have tried to download Flaubert as described:

import torch
from transformers import FlaubertModel, FlaubertTokenizer

Unfortunately it returns an ImportError:

  from transformers import FlaubertModel, FlaubertTokenizer
ImportError: cannot import name 'FlaubertModel'

opened by rcontesti 5

1.get_toolkit.sh

The script failed with : No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK? I managed to fix it with : sudo update-alternatives --config java and selecting /usr/lib/jvm/java-11-openjdk-amd64/bin/java as default

opened by jourlin 5

Filling masks

Bonjour bonjour ! Merci d'avoir partagé le modèle !

Dans Camembert il est assez facile de deviner un mot à partir du contexte, y a-t-il un working example dans Flaubert ?

Merci d'avance !

from fairseq.models.roberta import CamembertModel
camembert = CamembertModel.from_pretrained('./camembert-base/')
camembert.eval()
masked_line = 'Le camembert est <mask> :)'
camembert.fill_mask(masked_line, topk=3)
# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
]#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]

opened by xiaoouwang 4

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 237414383616 bytes. Error code 12 (Cannot allocate memory)
Environment info

Environment info

transformers version: 2.5.1

Platform: linux

Python version: 3.7

PyTorch version (GPU?): 1.4

Tensorflow version (GPU?):

Using GPU in script?: no

Using distributed or parallel set-up in script?: no

Who can help

Model I am using (FlauBert):

The problem arises when trying to produce features with the model, the output which is generated causes the system run out of memory.

[ ] the official example scripts: (I did not change much , pretty close to the original)

import torch from transformers import FlaubertModel, FlaubertTokenizer # Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', # 'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased'] modelname = 'flaubert/flaubert_base_cased' # Load pretrained model and tokenizer flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True) flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False) # do_lowercase=False if using cased models, True if using uncased ones sentence = "Le chat mange une pomme." token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)]) last_layer = flaubert(token_ids)[0] print(last_layer.shape) # torch.Size([1, 8, 768]) -> (batch size x number of tokens x embedding dimension) # The BERT [CLS] token correspond to the first hidden state of the last layer cls_embedding = last_layer[:, 0, :]

[ ] My own modified scripts: (give details below)

def get_flaubert_layer(texte): modelname = "flaubert-base-uncased" path = './flau/flaubert-base-unc/' flaubert = FlaubertModel.from_pretrained(path) flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path) tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))) max_len = 0 for i in tokenized.values: if len(i) > max_len: max_len = len(i) padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values]) token_ids = torch.tensor(padded) with torch.no_grad(): last_layer = flaubert(token_ids)[0][:,0,:].numpy() return last_layer, modelname

The tasks I am working on is:

[ ] Producing vectors/features from a language model and pass it to others classifiers

To reproduce

Steps to reproduce the behavior:

Get transformers library and scikit-learn, pandas and numpy, pytorch

Last lines of code

# Reading the file filename = "corpus" sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name= 0) data_id = sentences.identifiant print("Total phrases: ", len(data_id)) data = sentences.sent label = sentences.etiquette emb, mdlname = get_flaubert_layer(data) # corpus is dataframe of approximately 40 000 lines

Apperently this line produce something which is huge and which take a lot memory : last_layer = flaubert(token_ids)[0][:,0,:].numpy()

I would have expected it run but I think the fact that I pass the whole dataset to the model is causing the system to break, so I wanted to know if it possible to tell the model to process the data set maybe 500 lines or 1000 lines at at a time so as to not pass the whole dataset. I know that , there is this parameter : batch_size which can be used but since I am not training a model but merely using it to produces embeddings as input for others classifiers , Do you perhaps know how to modify the batch size so the whole dataset is not treated. I am not really familiar with this type of architecture. In the example , they just put one single sentence but in my case I load a whole dataset (dataframe). ?

My expectation is to make the model treat all the sentences and then produced the vectors I need for the task of classification.
opened by keloemma 3
Using flauBERT for similarities between sentences
Hi,

My goal here is to do clustering on sentences. For this purpose, I chose to use similarities between sentence embedding for all my sentences. Unfortunately, camemBERT seems not great for that task and fine-tuning flauBERT could be a solution.

So thanks to @formiel, I managed to fine tune flauBERT on an NLI dataset. My question is about that fine-tuning. What is the output exactly ? I only got a few files in the dump_path:

train.log ==> logs of the training

params.pkl ==> parameters of the training

test.pred.0 ==> prediction of the test dataset after first epoch

valid.pred.0 ==> valid classification of the test dataset after first epoch

test.pred.1 ==> etc

I wonder if after fine-tuning flauBERT, i could use it to make a new embedding of a sentence (like flauBERT before fine-tuning). So where is the new flauBERT model trained on the NLI dataset ? And how use it to make embeddings ?

Thanks in advance
opened by QuentinSpalla 3
Example de paraphrase

Bonjour, Dans la documentation de Flaubert, il est fait mention de la possibilité de paraphraser. Auriez-vous un exemple de code à utiliser pour, par exemple, d'une phrase en générer une dizaines de similaires ?

Merci

opened by ExtReMLapin 1
From pytorch model (with hugging_face library) to XLM model

Hello,

I have a problem currently regarding the finetuning of Flaubert on FLUE. I have a model that I re-trained with custom data, so I have new weights for it, and I currently have it as .json file and .bin files. However, when I want to finetune this model on FLUE tasks, they ask me for vocab and codes files from pretraining, that I don't have when using the hugging face library. I see that there is a module to go from XLM to hugging_face, but not the opposite. Is it possible to transform a model under .json and .bin format to get vocab, codes and .pth files ?

Or maybe there is a clever workaround to this problem ?

Many thanks in advance

opened by ArthurVanSchendel 1
Continued training of FlauBERT (with --reload_model) -- Question about vocab size

Hello. :)

I would like to use the "--reload_model" option with your train.py command to further train one of your pretrained FlauBERT models.

Upon trying to run train.py with the "--reload_model" option I got an error message saying that there was a "size mismatch" between the pretrained FlauBERT model and the adapted model I was trying to train.

The error message referred to a "shape torch.Size([67542]) from checkpoint". This was for the flaubert_base_uncased model. I assume that the number 67542 is the vocabulary size of flaubert-base-uncased.

In order to use the "--reload_model" option with your pretrained FlauBERT models, do I need to ensure that the vocabulary of my training data is identical to that of the pretrained model? If so, do you think that I could manage that simply by concatenating the "vocab" file of the pretrained model with my training data?

Thank you in advance for your help!

opened by mcriggs 1
Finetuning on FLUE

Hi ! I would like to finetune Flaubert on FLUE task with hugging face library. I downloaded the PAWS datas and use the code you gave on your github repo but I have this error message I can't go through :

any idea on what to do ?

Thanks for this project by the way, him looking forward to use it !

Have a good day, Lisa

opened by LisaBanana 12
Pretraining with News Crawls by WMT 19

I have a query regarding regarding your training corpus.

The News Crawl corpora that you use are both shuffled and de-duplicated. However, the corpora used by other models like BERT, RoBERTa etc. use a non-shuffled corpus where each document within the corpus is also demarcated with an empty line. Now with this un-shuffled form, when you create pre-training instances, you will end up with contiguous sentences in segment A and segment B. But in your case, the segments will contain non-contiguous sentences right?

So my question is what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM, or downstream tasks?

opened by divkakwani 1

Owner

GETALP

Study Group for Machine Translation and Automated Processing of Languages and Speech

GitHub

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

148 Dec 26, 2022

PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Feature_CRF_AE Feature_CRF_AE provides a implementation of Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging

6 Apr 29, 2022

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

背景安装教程快速上手（一）预训练模型（二）机器翻译（三）文本分类 TenTrans 进阶 1. 多语言机器翻译 2. 跨语言预训练背景 TrenTrans是一个统一的端到端的多语言多任务预训练平台，支持多种预训练方式，以及序列生成和自然语言理解任务。安装教程 git clone git

Tencent Minority-Mandarin Translation Team

42 Dec 20, 2022

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

End-to-end neural table-text understanding models.

914 Jan 7, 2023

MASS: Masked Sequence to Sequence Pre-training for Language Generation

1.1k Dec 17, 2022

Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

14 Nov 2, 2022

CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

29 Nov 30, 2022

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

117 Jan 7, 2023

Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

4 Jul 1, 2022

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

9 Nov 7, 2022

A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Styleformer A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/cas

431 Dec 19, 2022

Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementaion of our paper: Bridging the

20 Dec 12, 2022

Unsupervised Language Model Pre-training for French

Related tags

Overview

FlauBERT and FLUE

Table of Contents

1. FlauBERT models

2. Using FlauBERT

2.1. Using FlauBERT with Hugging Face's Transformers

2.2. Using FlauBERT with Facebook XLM's library

3. Pre-training FlauBERT

Install dependencies

3.1. Data

(1) Download and Preprocess Data

(2) Split Data

(3) & (4) Learn BPE and Prepare Data

3.2. Training

Run experiments on multiple GPUs and/or multiple nodes

3.3. Convert an XLM pre-trained model to Hugging Face's Transformers

4. Fine-tuning FlauBERT on the FLUE benchmark

5. Video presentation

6. Citation

Comments

Environment info

Environment info

Who can help

To reproduce

Owner

GETALP

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Pre-training BERT masked language models with custom vocabulary

CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

iBOT: Image BERT Pre-Training with Online Tokenizer

Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers