Unsupervised Language Model Pre-training for French

Overview

FlauBERT and FLUE

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. This repository shares everything: the pre-trained models (base and large), the data, the code to use the models, and the code to train them yourself if needed.

Along with FlauBERT comes FLUE: an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language.

This repository is still under construction and everything will be available soon.

Table of Contents

1. FlauBERT models
2. Using FlauBERT
    2.1. Using FlauBERT with Hugging Face's Transformers
    2.2. Using FlauBERT with Facebook XLM's library
3. Pre-training FlauBERT
    3.1. Data
    3.2. Training
    3.3. Convert an XLM pre-trained model to Hugging Face's Transformers
4. Fine-tuning FlauBERT on the FLUE benchmark
5. Video presentation
6. Citation

1. FlauBERT models

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We have released the pretrained weights for the following model sizes.

The pretrained models are available for download from here or via Hugging Face's library.

Model name              Number of layers  Attention heads  Embedding dimension  Total parameters
flaubert-small-cased    6                 8                512                  54 M
flaubert-base-uncased   12                12               768                  137 M
flaubert-base-cased     12                12               768                  138 M
flaubert-large-cased    24                16               1024                 373 M

Note: flaubert-small-cased is only partially trained, so performance is not guaranteed. Consider using it for debugging purposes only.

We also provide the checkpoints from here for the base (cased/uncased) and large (cased) models.
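
To sanity-check a downloaded model against the table above, you can count its parameters after loading it. Below is a minimal sketch using Hugging Face's Transformers (any of the model identifiers listed in Section 2.1 works; the exact count depends on the model):

from transformers import FlaubertModel

# Load one of the released models and count its parameters
flaubert = FlaubertModel.from_pretrained('flaubert/flaubert_base_cased')
n_params = sum(p.numel() for p in flaubert.parameters())
print(f'{n_params / 1e6:.0f} M parameters')  # roughly 138 M for flaubert_base_cased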

2. Using FlauBERT

In this section, we describe two ways to obtain sentence embeddings from pretrained FlauBERT models: either via Hugging Face's Transformers library or via Facebook's XLM library. We will integrate FlauBERT into Facebook's fairseq in the near future.

2.1. Using FlauBERT with Hugging Face's Transformers

You can use FlauBERT with Hugging Face's Transformers library as follows.

import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased' 

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]

Note: if your transformers version is <= 2.10.0, modelname should take one of the following values:

['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']
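
To embed several sentences at once with a reasonably recent version of Transformers, the tokenizer can pad a batch and return an attention mask. Continuing the example above, a minimal sketch (the second sentence is only illustrative):

# Batch two sentences: pad to the longest one and build an attention mask
sentences = ["Le chat mange une pomme.", "Le chien dort."]
batch = flaubert_tokenizer(sentences, padding=True, return_tensors='pt')

with torch.no_grad():
    last_layer = flaubert(batch['input_ids'], attention_mask=batch['attention_mask'])[0]

# One [CLS] embedding per sentence: (batch size x embedding dimension)
cls_embeddings = last_layer[:, 0, :]
print(cls_embeddings.shape)  # e.g. torch.Size([2, 768]) for the base models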

2.2. Using FlauBERT with Facebook XLM's library

The pretrained FlauBERT models are available for download from here. Each compressed folder includes 3 files:

  • *.pth: FlauBERT's pretrained model.
  • codes: BPE codes learned on the training data.
  • vocab: BPE vocabulary file.

Note: The following example only works with the modified XLM provided in this repo; it won't work with the original XLM. The code is taken from this tutorial.

import sys
import torch
import fastBPE

# Add Flaubert root to system path (change accordingly)
FLAUBERT_ROOT = '/home/user/Flaubert'
sys.path.append(FLAUBERT_ROOT)

from xlm.model.embedder import SentenceEmbedder
from xlm.data.dictionary import PAD_WORD


# Paths to model files
model_path = '/home/user/flaubert_base_cased/flaubert_base_cased_xlm.pth'
codes_path = '/home/user/flaubert_base_cased/codes'
vocab_path = '/home/user/flaubert_base_cased/vocab'
do_lowercase = False # Change this to True if you use uncased FlauBERT

bpe = fastBPE.fastBPE(codes_path, vocab_path)

sentences = "Le chat mange une pomme ."
if do_lowercase:
    sentences = sentences.lower()

# Apply BPE
sentences = bpe.apply([sentences])
sentences = [(('</s> %s </s>' % sent.strip()).split()) for sent in sentences]
print(sentences)

# Create batch
bs = len(sentences)
slen = max([len(sent) for sent in sentences])

# Reload pretrained model
embedder = SentenceEmbedder.reload(model_path)
embedder.eval()
dico = embedder.dico

# Prepare inputs to model
word_ids = torch.LongTensor(slen, bs).fill_(dico.index(PAD_WORD))
for i in range(len(sentences)):
    sent = torch.LongTensor([dico.index(w) for w in sentences[i]])
    word_ids[:len(sent), i] = sent
lengths = torch.LongTensor([len(sent) for sent in sentences])

# Get sentence embeddings (corresponding to the BERT [CLS] token)
cls_embedding = embedder.get_embeddings(x=word_ids, lengths=lengths)
print(cls_embedding.size())

# Get the entire output tensor for all tokens
# Note that cls_embedding = tensor[0]
tensor = embedder.get_embeddings(x=word_ids, lengths=lengths, all_tokens=True)
print(tensor.size())

3. Pre-training FlauBERT

Install dependencies

You should clone this repo and then install WikiExtractor, fastBPE and the Moses tokenizer under tools:

git clone https://github.com/getalp/Flaubert.git
cd Flaubert

# Install toolkit
cd tools
git clone https://github.com/attardi/wikiextractor.git
git clone https://github.com/moses-smt/mosesdecoder.git

git clone https://github.com/glample/fastBPE.git
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

3.1. Data

In this section, we describe the pipeline to prepare the data for training FlauBERT. This is based on Facebook XLM's library. The steps are as follows:

  1. Download, clean, and tokenize the data using the Moses tokenizer.
  2. Split cleaned data into: train, validation, and test sets.
  3. Learn BPE on the training set. Then apply learned BPE codes to train, validation, and test sets.
  4. Binarize data.

(1) Download and Preprocess Data

In the following, replace $DATA_DIR and $corpus_name with, respectively, the path to the local directory where the downloaded data will be saved and the name of the corpus to download, chosen among the options specified in the scripts.

To download and preprocess the data, execute the following commands:

./download.sh $DATA_DIR $corpus_name fr
./preprocess.sh $DATA_DIR $corpus_name fr

For example:

./download.sh ~/data gutenberg fr
./preprocess.sh ~/data gutenberg fr

The first command downloads the raw data to $DATA_DIR/raw/fr_gutenberg; the second processes the data and saves it to $DATA_DIR/processed/fr_gutenberg.

(2) Split Data

Run the following command to split the cleaned corpus into train, validation, and test sets. You can modify the train/validation/test ratios in the script.

bash tools/split_train_val_test.sh $DATA_PATH

where $DATA_PATH is the path to the file to be split.

The output files fr.train, fr.valid, and fr.test are saved in the same directory as the original file.

(3) & (4) Learn BPE and Prepare Data

Run the following command to learn BPE codes on the training set and apply them to the train, validation, and test sets. The data is then binarized and ready for training.

bash tools/create_pretraining_data.sh $DATA_DIR $BPE_size

where $DATA_DIR is the path to the directory containing the three files fr.train, fr.valid, and fr.test obtained above, and $BPE_size is the BPE vocabulary size in thousands, for example 30 for 30k, 50 for 50k, etc. The output files are saved in $DATA_DIR/BPE/30k or $DATA_DIR/BPE/50k accordingly.

3.2. Training

Our codebase for pretraining FlauBERT is largely based on the XLM repo, with some modifications. You can also use their original code to train FlauBERT; it will work just fine.

Execute the following command to train FlauBERT (base) on your preprocessed data:

python train.py \
    --exp_name flaubert_base_cased \
    --dump_path $dump_path \
    --data_path $data_path \
    --amp 1 \
    --lgs 'fr' \
    --clm_steps '' \
    --mlm_steps 'fr' \
    --emb_dim 768 \
    --n_layers 12 \
    --n_heads 12 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 16 \
    --bptt 512 \
    --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_fr_mlm_ppl \
    --stopping_criterion _valid_fr_mlm_ppl,20 \
    --fp16 true \
    --accumulate_gradients 16 \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'                      

where $dump_path is the path where you want to save your pretrained model and $data_path is the path to the binarized data sets, for example $DATA_DIR/BPE/50k.

Run experiments on multiple GPUs and/or multiple nodes

To run experiments on multiple GPUs on a single machine, you can use the following command (the parameters after train.py are the same as above).

export NGPU=4
export CUDA_VISIBLE_DEVICES=0,1,2,3 # set this if you only want to use some of the GPUs in the machine
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

To run experiments on multiple nodes and multiple GPUs in clusters using SLURM as a resource manager, you can use the following command to launch training after requesting resources with #SBATCH (the parameters after train.py are the same as above, plus the --master_port parameter).

srun python train.py

3.3. Convert an XLM pre-trained model to Hugging Face's Transformers

To convert an XLM pre-trained model to Hugging Face's Transformers, you can use the following command.

python tools/use_flaubert_with_transformers/convert_to_transformers.py --inputdir $inputdir --outputdir $outputdir

where $inputdir is the path to the XLM pretrained model directory and $outputdir is the path to the output directory where you want to save the Hugging Face's Transformers model.
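
Once converted, the output directory can be loaded directly with Transformers. A quick sanity check (the path below is a placeholder for your $outputdir):

from transformers import FlaubertModel

# Placeholder path: the directory produced by convert_to_transformers.py
outputdir = '/home/user/flaubert_converted'
flaubert, log = FlaubertModel.from_pretrained(outputdir, output_loading_info=True)
print(log)  # the lists should be empty if all weights were matched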

4. Fine-tuning FlauBERT on the FLUE benchmark

FLUE (French Language Understanding Evaluation) is a general benchmark for evaluating French NLP systems. Please refer to this page for an example of fine-tuning FlauBERT on this benchmark.
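
As a minimal illustration of such fine-tuning (this is not the exact FLUE setup: the toy examples, label count, and training settings below are placeholders), FlauBERT can be fine-tuned for sentence classification with Hugging Face's Transformers roughly as follows:

import torch
from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

modelname = 'flaubert/flaubert_base_cased'
tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
model = FlaubertForSequenceClassification.from_pretrained(modelname, num_labels=2)

# Toy labelled examples standing in for a FLUE classification task
texts = ["Le film était excellent .", "Le film était très mauvais ."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors='pt')

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    labels=labels)
    loss = outputs[0]  # the classification loss is the first element of the output
    loss.backward()
    optimizer.step()
    print(f'epoch {epoch}: loss = {loss.item():.4f}')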

5. Video presentation

You can watch this 7-minute video presentation of FlauBERT: https://www.youtube.com/watch?v=NgLM9GuwSwc

6. Citation

If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:

LREC paper

@InProceedings{le2020flaubert,
  author    = {Le, Hang  and  Vial, Lo\"{i}c  and  Frej, Jibril  and  Segonne, Vincent  and  Coavoux, Maximin  and  Lecouteux, Benjamin  and  Allauzen, Alexandre  and  Crabb\'{e}, Beno\^{i}t  and  Besacier, Laurent  and  Schwab, Didier},
  title     = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2479--2490},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}

TALN paper

@inproceedings{le2020flaubert,
  title         = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
  author        = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
  booktitle     = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
  pages         = {268--278},
  year          = {2020},
  organization  = {ATALA}
}
Comments
  • dataset in French NLI and STS - like

    Hi,

    When getting data from get-data-xnli.sh, I noticed that most of the dataset is not in French. Hence, I wonder how you used it in practice?

    I am currently looking for some NLI-like and STS-like datasets in French. They would be great for fine-tuning Flaubert!

    As a suggestion, translating the English versions of NLI and STS into French could be a good option to fine-tune Flaubert on such tasks.

    opened by dataislife 15
  • How can I train flaubert on a different corpus (not gutenberg, wiki) but for another domain ?

    Good afternoon,

    I tried to follow your instructions to train Flaubert on my own corpus in order to get a model to use for my classification task, but I am having trouble understanding the procedure.

    You said we should use this line to train on our preprocessed data:

    /Flaubert$ python train.py --exp_name flaubert_base_lower --dump_path ./dumped/ --data_path ./own_data/data/ --lgs 'fr' --clm_steps '' --mlm_steps 'fr' --emb_dim 768 --n_layers 12 --n_heads 12 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 16 --bptt 512 --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion _valid_fr_mlm_ppl,20 --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

    I tried it after cloning Flaubert and all the necessary libraries, but I am getting this error:

    FAISS library was not found. FAISS not available. Switching to standard nearest neighbors search implementation.
    ./own_data/data/train.fr.pth not found
    ./own_data/data/valid.fr.pth not found
    ./own_data/data/test.fr.pth not found
    Traceback (most recent call last):
      File "train.py", line 387, in check_data_params(params)
      File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params
        assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()])
    AssertionError

    Does this mean I have to split my own data into three corpora (train, valid and test) after preprocessing it? Should I use your preprocessing script on my own data before executing the command?

    opened by keloemma 11
  • Will lemmatization negatively affect FlauBERT word embeddings?

    Hello!

    I am using FlauBERT to generate word embeddings as part of a study on word sense disambiguation (WSD).

    The FlauBERT tokenizer does not recognize a significant number of words in my corpus and, as a result, segments them. For example, the FlauBERT tokenizer does not recognize the archaic orthography of some verbs. It also does not recognize plural forms of a number of other words in the corpus. As a result, the tokenizer segments a number of words in the corpus, some of which are significant to my research.

    I understand that I could further train FlauBERT on a corpus of eighteenth-century French in order to create a new model and a new tokenizer specifically for eighteenth-century French. However, compiling a corpus of eighteenth-century French that is large and heterogeneous enough to be useful would be challenging (and perhaps not even possible).

    As an alternative to training a new model, I thought I might lemmatize my corpus before running it through FlauBERT. Stanford's Stanza NLP package (stanfordnlp.github.io/stanza/) recognizes the archaic orthography of the verbs in my corpus and turns them into the infinitive form, a form FlauBERT recognizes. Similarly, Stanza also changes the plural forms of other words into singular forms, forms FlauBERT also recognizes. Thus, if I were to lemmatize my corpus in Stanza, the FlauBERT tokenizer would then be able to recognize substantially more words in my corpus.

    Would lemmatizing my corpus in this way adversely affect my FlauBERT results and a WSD analysis in particular? In general, does lemmatization have a negative effect on BERT results and WSD analyses more particularly?

    Given that FlauBERT is not trained on lemmatized text, I imagine that lemmatizing the corpus would indeed negatively affect the results of the analysis. As an alternative to training FlauBERT on a corpus of eighteenth-century French (which may not be possible), could I instead train it on a corpus of lemmatized French and then use this new model for a WSD analysis on my corpus of lemmatized eighteenth-century French? Would that work?

    I'm not sure if this is the right place for these sorts of questions!

    Thank you in advance for your time.

    opened by mcriggs 9
  • bug in run_flue.py

    Hi, I got this error when running run_flue.py:

    from transformers import flue_compute_metrics as compute_metrics
    ImportError: cannot import name 'flue_compute_metrics'

    I already preinstalled the requirements and updated the transformers directory.

    opened by keloemma 7
  • tweet sentiment analysis in french

    Hi,

    Hope you are all well !

    Is it possible to use Flaubert for sentiment analysis of tweets written in French? If so, how can we do that?

    Vive la France ! :-)

    Cheers, X

    opened by ghost 7
  • Model name not found or config.json missing

    Transformers version: 2.3.0
    PyTorch version: 1.4

    When running the readme.md code

    import torch
    from transformers import XLMModel, XLMTokenizer
    modelname="xlm_bert_fra_base_lower" # Or absolute path to where you put the folder
    
    # Load model
    flaubert, log = XLMModel.from_pretrained(modelname, output_loading_info=True)
    # check import was successful, the dictionary should have empty lists as values
    print(log)
    
    # Load tokenizer
    flaubert_tokenizer = XLMTokenizer.from_pretrained(modelname, do_lowercase_and_remove_accent=False)
    
    sentence="Le chat mange une pomme."
    sentence_lower = sentence.lower()
    
    token_ids = torch.tensor([flaubert_tokenizer.encode(sentence_lower)])
    last_layer = flaubert(token_ids)[0]
    print(last_layer.shape)
    

    Output

    OSError: Model name 'xlm_bert_fra_base_lower' was not found in model name list 
    (xlm-mlm-en-2048, xlm-mlm-ende-1024, xlm-mlm-enfr-1024, xlm-mlm-enro-1024, xlm-mlm-tlm-xnli15-1024, xlm-mlm-xnli15-1024, xlm-clm-enfr-1024, xlm-clm-ende-1024,
     xlm-mlm-17-1280, xlm-mlm-100-1280). We assumed 'xlm_bert_fra_base_lower' 
    was a path or url to a configuration file named config.json 
    or a directory containing such a file but couldn't find any such file at this path or url.
    

    I tried downloading the model as a tar file (lower and normal), extracting it, and putting its absolute folder path as modelname, but I keep getting the same error.

    I may be missing something stupid, but I don't see a config.json file in the archive.

    What is wrong?

    opened by hzitoun 6
  • Import Error

    Hi I have tried to download Flaubert as described:

    import torch
    from transformers import FlaubertModel, FlaubertTokenizer
    

    Unfortunately it returns an ImportError:

      from transformers import FlaubertModel, FlaubertTokenizer
    ImportError: cannot import name 'FlaubertModel'
    
    opened by rcontesti 5
  • 1.get_toolkit.sh

    The script failed with: "No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK?" I managed to fix it with sudo update-alternatives --config java and selecting /usr/lib/jvm/java-11-openjdk-amd64/bin/java as the default.

    opened by jourlin 5
  • Filling masks

    Hello hello! Thanks for sharing the model!

    In Camembert it is quite easy to guess a word from its context; is there a working example in Flaubert?

    Thanks in advance!

    from fairseq.models.roberta import CamembertModel
    camembert = CamembertModel.from_pretrained('./camembert-base/')
    camembert.eval()
    masked_line = 'Le camembert est <mask> :)'
    camembert.fill_mask(masked_line, topk=3)
    # [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
    #  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
    #  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
    
    opened by xiaoouwang 4
  • RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 237414383616 bytes. Error code 12 (Cannot allocate memory)

    Environment info

    • transformers version: 2.5.1
    • Platform: linux
    • Python version: 3.7
    • PyTorch version (GPU?): 1.4
    • Tensorflow version (GPU?):
    • Using GPU in script?: no
    • Using distributed or parallel set-up in script?: no

    Who can help

    Model I am using (FlauBert):

    The problem arises when trying to produce features with the model; the output it generates causes the system to run out of memory.

    • [ ] the official example scripts: (I did not change much , pretty close to the original)
    import torch
    from transformers import FlaubertModel, FlaubertTokenizer
    # Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
    #               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
    modelname = 'flaubert/flaubert_base_cased' 
    
    # Load pretrained model and tokenizer
    flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
    # do_lowercase=False if using cased models, True if using uncased ones
    
    sentence = "Le chat mange une pomme."
    token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])
    
    last_layer = flaubert(token_ids)[0]
    print(last_layer.shape)
    # torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)
    
    # The BERT [CLS] token correspond to the first hidden state of the last layer
    cls_embedding = last_layer[:, 0, :]
    
    • [ ] My own modified scripts: (give details below)
    def get_flaubert_layer(texte):
    
    	modelname = "flaubert-base-uncased"
    	path = './flau/flaubert-base-unc/'
    
    	flaubert = FlaubertModel.from_pretrained(path)
    	flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
    	tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512)))
    	max_len = 0
    	for i in tokenized.values:
    		if len(i) > max_len:
    			max_len = len(i)
    	padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    	token_ids = torch.tensor(padded)
    	with torch.no_grad():
    		last_layer = flaubert(token_ids)[0][:,0,:].numpy()
    		
    	return last_layer, modelname
    

    The tasks I am working on is:

    • [ ] Producing vectors/features from a language model and pass it to others classifiers

    To reproduce

    Steps to reproduce the behavior:

    1. Get transformers library and scikit-learn, pandas and numpy, pytorch
    2. Last lines of code
    # Reading the file 
    filename = "corpus"
    sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name= 0)
    data_id = sentences.identifiant
    print("Total phrases: ", len(data_id))
    data = sentences.sent
    label = sentences.etiquette
    emb, mdlname = get_flaubert_layer(data)  # corpus is dataframe of approximately 40 000 lines
    
    

    Apparently this line produces something huge which takes a lot of memory: last_layer = flaubert(token_ids)[0][:,0,:].numpy()

    I would have expected it to run, but I think passing the whole dataset to the model at once is causing the system to break. So I wanted to know if it is possible to tell the model to process the dataset, say, 500 or 1000 lines at a time instead of all at once. I know there is a batch_size parameter, but since I am not training a model, only using it to produce embeddings as input for other classifiers, do you know how to set the batch size so that the whole dataset is not processed in one go? I am not really familiar with this type of architecture. In the example only a single sentence is used, but in my case I load a whole dataset (a dataframe).

    My expectation is to have the model process all the sentences and then produce the vectors I need for the classification task.

    opened by keloemma 3
  • Using flauBERT for similarities between sentences

    Hi,

    My goal here is to do clustering on sentences. For this purpose, I chose to use similarities between sentence embeddings for all my sentences. Unfortunately, camemBERT does not seem great for that task, and fine-tuning flauBERT could be a solution.

    So, thanks to @formiel, I managed to fine-tune flauBERT on an NLI dataset. My question is about that fine-tuning: what exactly is the output? I only got a few files in the dump_path:

    • train.log ==> logs of the training
    • params.pkl ==> parameters of the training
    • test.pred.0 ==> prediction of the test dataset after first epoch
    • valid.pred.0 ==> valid classification of the test dataset after first epoch
    • test.pred.1 ==> etc

    I wonder if, after fine-tuning flauBERT, I could use it to make a new embedding of a sentence (like flauBERT before fine-tuning). So where is the new flauBERT model trained on the NLI dataset? And how can I use it to make embeddings?

    Thanks in advance

    opened by QuentinSpalla 3
  • Paraphrase example

    Hello, The Flaubert documentation mentions the possibility of paraphrasing. Would you have an example of code to use to generate, for example, about ten similar sentences from a given one?

    Thanks

    opened by ExtReMLapin 1
  • From pytorch model (with hugging_face library) to XLM model

    Hello,

    I currently have a problem regarding the fine-tuning of Flaubert on FLUE. I have a model that I re-trained with custom data, so I have new weights for it, and I currently have it as a .json file and .bin files. However, when I want to fine-tune this model on FLUE tasks, I am asked for the vocab and codes files from pretraining, which I don't have when using the Hugging Face library. I see that there is a module to go from XLM to hugging_face, but not the opposite. Is it possible to transform a model in .json and .bin format to get the vocab, codes and .pth files?

    Or maybe there is a clever workaround to this problem ?

    Many thanks in advance

    opened by ArthurVanSchendel 1
  • Continued training of FlauBERT (with --reload_model) -- Question about vocab size

    Hello. :)

    I would like to use the "--reload_model" option with your train.py command to further train one of your pretrained FlauBERT models.

    Upon trying to run train.py with the "--reload_model" option I got an error message saying that there was a "size mismatch" between the pretrained FlauBERT model and the adapted model I was trying to train.

    The error message referred to a "shape torch.Size([67542]) from checkpoint". This was for the flaubert_base_uncased model. I assume that the number 67542 is the vocabulary size of flaubert-base-uncased.

    In order to use the "--reload_model" option with your pretrained FlauBERT models, do I need to ensure that the vocabulary of my training data is identical to that of the pretrained model? If so, do you think that I could manage that simply by concatenating the "vocab" file of the pretrained model with my training data?

    Thank you in advance for your help!

    opened by mcriggs 1
  • Finetuning on FLUE

    Hi! I would like to fine-tune Flaubert on a FLUE task with the Hugging Face library. I downloaded the PAWS data and used the code you gave on your GitHub repo, but I get an error message that I can't get past (screenshot attached).

    Any idea what to do?

    Thanks for this project by the way, I'm looking forward to using it!

    Have a good day, Lisa

    opened by LisaBanana 12
  • Pretraining with News Crawls by WMT 19

    I have a query regarding your training corpus.

    The News Crawl corpora that you use are both shuffled and de-duplicated. However, the corpora used by other models like BERT, RoBERTa etc. use a non-shuffled corpus where each document within the corpus is also demarcated with an empty line. Now with this un-shuffled form, when you create pre-training instances, you will end up with contiguous sentences in segment A and segment B. But in your case, the segments will contain non-contiguous sentences right?

    So my question is what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM, or downstream tasks?

    opened by divkakwani 1