🏖 Easy training and deployment of seq2seq models.

Axel Springer Ideas Engineering GmbH

Last update: Nov 18, 2022

Headliner

Headliner is a sequence modeling library that eases the training and, in particular, the deployment of custom sequence models for both researchers and developers. You can deploy your models in a few lines of code. It was originally built for our own research to generate headlines from Welt news articles (see figure 1), which is why we chose the name Headliner.

Figure 1: One example from our Welt.de headline generator.

Update 21.01.2020

The library now supports fine-tuning pre-trained BERT models with custom preprocessing as in Text Summarization with Pretrained Encoders!

Check out this tutorial on Colab!

🧠 Internals

We use a sequence-to-sequence (seq2seq) approach under the hood, i.e. an encoder-decoder framework (see figure 2). We provide a very simple interface to train and deploy seq2seq models. Although this library was created internally to generate headlines, you can also use it for other tasks such as machine translation, text summarization and many more.

Figure 2: Encoder-decoder sequence-to-sequence model.
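
To make the encoder-decoder idea concrete, here is a toy sketch in plain Python. It is not Headliner's implementation: `encode`, `decode_step`, `greedy_decode` and the lookup table `TOY_DECODER` are hypothetical stand-ins for trained networks. It only illustrates the control flow, in which the encoder condenses the input into a context and the decoder emits one token at a time until it produces an end token.

```python
from typing import Dict, List, Tuple

START, END = '<start>', '<end>'

# Hypothetical stand-in for a trained decoder:
# maps (context, previous token) to the next token.
TOY_DECODER: Dict[Tuple[str, str], str] = {
    ('greeting', START): 'hello',
    ('greeting', 'hello'): 'there',
    ('greeting', 'there'): END,
}


def encode(tokens: List[str]) -> str:
    """Toy encoder: condense the whole input sequence into one context value."""
    return 'greeting' if 'hi' in tokens else 'unknown'


def decode_step(context: str, previous: str) -> str:
    """Toy decoder step: pick the next token from the context and the previous token."""
    return TOY_DECODER.get((context, previous), END)


def greedy_decode(tokens: List[str], max_len: int = 10) -> List[str]:
    """Encode once, then decode token by token until END or max_len is reached."""
    context = encode(tokens)
    output: List[str] = []
    previous = START
    for _ in range(max_len):
        previous = decode_step(context, previous)
        if previous == END:
            break
        output.append(previous)
    return output


print(greedy_decode(['hi', 'you']))  # ['hello', 'there']
```

In a real model the context is a tensor (or a sequence of tensors when attention is used) and each decoder step returns a probability distribution over the vocabulary rather than a single token.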

Why Headliner?

You may ask: why another seq2seq library? There are a couple of them out there already. For example, Facebook has fairseq, Google has seq2seq and there is also OpenNMT. Although those libraries are great, they have a few drawbacks for our use case: fairseq doesn't focus much on production, whereas Google's seq2seq is not actively maintained. OpenNMT came closest to matching our requirements, as it has a strong focus on production. However, we didn't like that its workflow (preparing data, training and evaluation) is mainly done via the command line. It also exposes a well-defined API, but the complexity there is still too high, with too much custom code (see their minimal transformer training example).

Therefore, we built this library for us with the following goals in mind:

Easy-to-use API for training and deployment (only a few lines of code)
Uses TensorFlow 2.0 with all its new features (tf.function, tf.keras.layers etc.)
Modular classes: text preprocessing, modeling, evaluation
Extensible for different encoder-decoder models
Works on large text data

For more details on the library, read the documentation at: https://as-ideas.github.io/headliner/

Headliner is compatible with Python 3.6 and is distributed under the MIT license.

⚙️ Installation

⚠️ Before installing Headliner, you need to install TensorFlow as we use this as our deep learning framework. For more details on how to install it, have a look at the TensorFlow installation instructions.

Then you can install Headliner itself. There are two ways to install Headliner:

Install Headliner from PyPI (recommended):

pip install headliner

Install Headliner from the GitHub source:

git clone https://github.com/as-ideas/headliner.git
cd headliner
python setup.py install

📖 Usage

Training

For training, you need to import one of our provided models or create your own custom one. Then you create the dataset, a list of (input, output) sequence tuples, and train the model on it:

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

data = [('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are great, but I have other plans.', 'I like you.')]

summarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=2)
summarizer.save('/tmp/summarizer')

Prediction

The prediction can be done in a few lines of code:

from headliner.model.transformer_summarizer import TransformerSummarizer

summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict('You are the stars, earth and sky for me!')

Models

Currently available models include a basic encoder-decoder, an encoder-decoder with Luong attention, the transformer and a transformer on top of a pre-trained BERT-model:

from headliner.model.basic_summarizer import BasicSummarizer
from headliner.model.attention_summarizer import AttentionSummarizer
from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.model.bert_summarizer import BertSummarizer

basic_summarizer = BasicSummarizer()
attention_summarizer = AttentionSummarizer()
transformer_summarizer = TransformerSummarizer()
bert_summarizer = BertSummarizer()

Advanced training

Training using a validation split and model checkpointing:

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

train_data = [('You are the stars, earth and sky for me!', 'I love you.'),
              ('You are great, but I have other plans.', 'I like you.')]
val_data = [('You are great, but I have other plans.', 'I like you.')]

summarizer = TransformerSummarizer(num_heads=1,
                                   feed_forward_dim=512,
                                   num_layers=1,
                                   embedding_size=64,
                                   max_prediction_len=50)
trainer = Trainer(batch_size=8,
                  steps_per_epoch=50,
                  max_vocab_size_encoder=10000,
                  max_vocab_size_decoder=10000,
                  tensorboard_dir='/tmp/tensorboard',
                  model_save_path='/tmp/summarizer')

trainer.train(summarizer, train_data, val_data=val_data, num_epochs=3)

Advanced prediction

Prediction information such as attention weights and logits can be accessed via predict_vectors returning a dictionary:

from headliner.model.transformer_summarizer import TransformerSummarizer

summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict_vectors('You are the stars, earth and sky for me!')

Resume training

A previously trained summarizer can be loaded and then retrained. In this case the data preprocessing and vectorization are loaded from the model.

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

train_data = [('Some new training data.', 'New data.')] * 10

summarizer_loaded = TransformerSummarizer.load('/tmp/summarizer')
trainer = Trainer(batch_size=2)
trainer.train(summarizer_loaded, train_data)
summarizer_loaded.save('/tmp/summarizer_retrained')

Use pretrained GloVe embeddings

Embeddings in GloVe format can be injected into the trainer as follows. Optionally, set the embeddings to non-trainable.

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

trainer = Trainer(embedding_path_encoder='/tmp/embedding_encoder.txt',
                  embedding_path_decoder='/tmp/embedding_decoder.txt')

# make sure embedding_size matches the vector size in the embedding files
summarizer = TransformerSummarizer(embedding_size=64,
                                   embedding_encoder_trainable=False,
                                   embedding_decoder_trainable=False)
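
Since the embedding size has to match the files, it can help to check the vector size before training. The following is a minimal sketch that relies only on the GloVe text format (one token per line followed by its space-separated values). The helper `glove_dimension` and the demo file are hypothetical and not part of Headliner.

```python
import os
import tempfile


def glove_dimension(path: str) -> int:
    """Return the vector size of a GloVe-format text file and check that all lines agree."""
    dim = None
    with open(path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip blank lines
            current = len(parts) - 1  # the first field is the token itself
            if dim is None:
                dim = current
            elif current != dim:
                raise ValueError(f'line {line_no}: expected {dim} values, got {current}')
    if dim is None:
        raise ValueError('no embedding vectors found')
    return dim


# Write a tiny demo file with two 4-dimensional vectors and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'embedding_encoder.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('you 0.1 0.2 0.3 0.4\n')
        f.write('stars -0.5 0.0 0.25 1.0\n')
    embedding_size = glove_dimension(demo_path)

print(embedding_size)  # 4
```

The returned value is what you would pass as embedding_size to the summarizer.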

Custom preprocessing

A model can be initialized with custom preprocessing and tokenization:

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.preprocessing.preprocessor import Preprocessor
from headliner.preprocessing.vectorizer import Vectorizer
from headliner.trainer import Trainer
from tensorflow_datasets.core.features.text import SubwordTextEncoder

train_data = [('Some inputs.', 'Some outputs.')] * 10

preprocessor = Preprocessor(filter_pattern='',
                            lower_case=True,
                            hash_numbers=False)
train_prep = [preprocessor(t) for t in train_data]
inputs_prep = [t[0] for t in train_prep]
targets_prep = [t[1] for t in train_prep]

# Build tf subword tokenizers. Other custom tokenizers can be implemented
# by subclassing headliner.preprocessing.Tokenizer
reserved_tokens = [preprocessor.start_token, preprocessor.end_token]
tokenizer_input = SubwordTextEncoder.build_from_corpus(
    inputs_prep, target_vocab_size=2**13, reserved_tokens=reserved_tokens)
tokenizer_target = SubwordTextEncoder.build_from_corpus(
    targets_prep, target_vocab_size=2**13, reserved_tokens=reserved_tokens)

vectorizer = Vectorizer(tokenizer_input, tokenizer_target)
summarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=50)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=2)
trainer.train(summarizer, train_data, num_epochs=3)
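
As an illustration of what a custom tokenizer has to do, here is a standalone whitespace tokenizer sketch. It deliberately does not subclass anything from Headliner: the `encode` / `decode` / `vocab_size` interface shown here is an assumption, so check the Tokenizer base class in the library before adapting it. Reserving index 0 for padding is likewise a choice made for this sketch only.

```python
from typing import Dict, Iterable, List


class WhitespaceTokenizer:
    """Standalone sketch of a tokenizer that splits on whitespace.

    The encode/decode/vocab_size interface is an assumption for illustration.
    """

    def __init__(self, corpus: Iterable[str], unknown_token: str = '<unk>'):
        self.unknown_token = unknown_token
        vocab = sorted({tok for text in corpus for tok in text.split()} - {unknown_token})
        # Index 0 is left free for padding; index 1 is the unknown token.
        self.token_to_id: Dict[str, int] = {unknown_token: 1}
        for index, token in enumerate(vocab, start=2):
            self.token_to_id[token] = index
        self.id_to_token: Dict[int, str] = {i: t for t, i in self.token_to_id.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.token_to_id) + 1  # +1 for the padding index 0

    def encode(self, text: str) -> List[int]:
        """Map each whitespace-separated token to its id (1 if unseen)."""
        return [self.token_to_id.get(tok, 1) for tok in text.split()]

    def decode(self, sequence: List[int]) -> str:
        """Map ids back to tokens, silently dropping padding and unknown ids."""
        return ' '.join(self.id_to_token[i] for i in sequence if i in self.id_to_token)


tokenizer = WhitespaceTokenizer(['some inputs .', 'some outputs .'])
encoded = tokenizer.encode('some inputs .')
print(encoded)                    # [5, 3, 2]
print(tokenizer.decode(encoded))  # some inputs .
```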

Use pre-trained BERT embeddings

Pre-trained BERT models can be included as follows. Be aware that pre-trained BERT models are expensive to train and require custom preprocessing!

from headliner.preprocessing.bert_preprocessor import BertPreprocessor
from spacy.lang.en import English

train_data = [('Some inputs.', 'Some outputs.')] * 10

# use BERT-specific start and end token
preprocessor = BertPreprocessor(nlp=English())
train_prep = [preprocessor(t) for t in train_data]
targets_prep = [t[1] for t in train_prep]


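from tensorflow_datasets.core.features.text import SubwordTextEncoder
from transformers import BertTokenizer
from headliner.model.bert_summarizer import BertSummarizer
from headliner.preprocessing.bert_vectorizer import BertVectorizer
from headliner.trainer import Trainer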

# Use a pre-trained BERT embedding and BERT tokenizer for the encoder 
tokenizer_input = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_target = SubwordTextEncoder.build_from_corpus(
    targets_prep, target_vocab_size=2**13,  reserved_tokens=[preprocessor.start_token, preprocessor.end_token])

vectorizer = BertVectorizer(tokenizer_input, tokenizer_target)
summarizer = BertSummarizer(num_heads=2,
                            feed_forward_dim=512,
                            num_layers_encoder=0,
                            num_layers_decoder=4,
                            bert_embedding_encoder='bert-base-uncased',
                            embedding_size_encoder=768,
                            embedding_size_decoder=768,
                            dropout_rate=0.1,
                            max_prediction_len=50)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=2)
trainer.train(summarizer, train_data, num_epochs=3)

Training on large datasets

Large datasets can be handled by using an iterator:

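from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

def read_data_iteratively():
    return (('Some inputs.', 'Some outputs.') for _ in range(1000))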

class DataIterator:
    def __iter__(self):
        return read_data_iteratively()

data_iter = DataIterator()

summarizer = TransformerSummarizer(embedding_size=10, max_prediction_len=20)
trainer = Trainer(batch_size=16, steps_per_epoch=1000)
trainer.train(summarizer, data_iter, num_epochs=3)
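
In practice the pairs usually come from disk. Below is a minimal sketch of a re-iterable reader, assuming a tab-separated file with one input and one target per line. The file layout and the `TsvDataIterator` class are hypothetical and not part of Headliner. The point of implementing `__iter__` on a class, rather than passing a bare generator, is that every call starts a fresh pass over the file, which matters if the data is consumed more than once (for example once per epoch).

```python
import os
import tempfile
from typing import Iterator, Tuple


class TsvDataIterator:
    """Re-iterable view over a file with one `input<TAB>target` pair per line."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        # The file is reopened on every call, so each pass starts from the top
        # and only one line is held in memory at a time.
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                if len(fields) == 2:  # skip malformed or empty lines
                    yield fields[0], fields[1]


# Demo with a tiny temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'pairs.tsv')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('Some inputs.\tSome outputs.\n')
        f.write('\n')
        f.write('More inputs.\tMore outputs.\n')
    tsv_iter = TsvDataIterator(demo_path)
    first_pass = list(tsv_iter)
    second_pass = list(tsv_iter)

print(first_pass[0])              # ('Some inputs.', 'Some outputs.')
print(first_pass == second_pass)  # True
```

An instance of such a class can then be passed to trainer.train in place of data_iter above.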

🤝 Contribute

We welcome all kinds of contributions such as new models, new examples and many more. See the Contribution guide for more details.

📝 Cite this work

Please cite Headliner in your publications if this is useful for your research. Here is an example BibTeX entry:

@misc{axelspringerai2019headliners,
  title={Headliner},
  author={Christian Schäfer and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/as-ideas/headliner}},
}

🏗 Maintainers

Christian Schäfer, github: cschaefer26
Dat Tran, github: datitran

© Copyright

See LICENSE for details.

References

Text Summarization with Pretrained Encoders

Effective Approaches to Attention-based Neural Machine Translation

Acknowledgements

https://www.tensorflow.org/tutorials/text/transformer

https://github.com/huggingface/transformers

https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/

Comments

Bert Model prediction giving same output

Hi @cschaefer26 and @datitran ,

Thank you so much for the headliner library, I have been playing around it for a while and so far really enjoyed it.

I was working on a summarization use case with Headliner's BERT model. I followed the README code below (with a few tweaks to the SummarizerTransformer parameters), using my own train_data:

from headliner.preprocessing import Preprocessor

train_data = [('Some inputs.', 'Some outputs.')] * 10

# use BERT-specific start and end token
preprocessor = Preprocessor(start_token='[CLS]',
                            end_token='[SEP]',
                            lower_case=True)
train_prep = [preprocessor(t) for t in train_data]
targets_prep = [t[1] for t in train_prep]


from tensorflow_datasets.core.features.text import SubwordTextEncoder
from transformers import BertTokenizer
from headliner.model import SummarizerBert

# Use a pre-trained BERT embedding and BERT tokenizer for the encoder 
tokenizer_input = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_target = SubwordTextEncoder.build_from_corpus(
    targets_prep, target_vocab_size=2**13,  reserved_tokens=[preprocessor.start_token, preprocessor.end_token])

vectorizer = Vectorizer(tokenizer_input, tokenizer_target)
summarizer = SummarizerBert(num_heads=4,
                            feed_forward_dim=512,
                            num_layers_encoder=3,
                            num_layers_decoder=3,
                            bert_embedding_encoder='bert-base-uncased',
                            embedding_encoder_trainable=False,
                            embedding_size_encoder=768,
                            embedding_size_decoder=64,
                            dropout_rate=0.1,
                            max_prediction_len=400)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=2)
trainer.train(summarizer, train_data, num_epochs=200)

I trained the model for 200 epochs and can see in the logs that the loss keeps decreasing (from around 4 down to 0.69), so training seems to work fine.

However, when I run prediction on the saved model with the following code, it always returns the same prediction for any test_sentence:

from headliner.model.summarizer_bert import SummarizerBert

summarizer = SummarizerBert.load('/path/to/headliner_bert_model')
summarizer.predict(test_sentence)

Could anyone advise whether I am missing something in the prediction part for the BERT model?

opened by atirpetkar 6

Models are not saved after training.

I am training a transformer model on kaggle. After the training, models are not getting saved. Earlier it was working fine before the addition of BERT transformer.

Here is the notebook: https://www.kaggle.com/mohitsaini235/chatbot?scriptVersionId=22931537

I have commented out some code so that you can run it quickly and see the results.

opened by sainimohit23 6

Training with longer examples leads to `InvalidArgumentError`

When I try to train a SummarizerTransformer on longer training examples I get the following error: InvalidArgumentError: Incompatible shapes: [1,11,64] vs. [1,8,64] in train_step. It looks like it is depending on the length of the targets.

Minimal example:

from headliner.trainer import Trainer
from headliner.model.summarizer_transformer import SummarizerTransformer

data = [
        ('You are the stars, earth and sky for me!', 'I love you I love you I love you.'),
        ('You are the stars, earth and sky for me!', 'I love you.')
]

summarizer = SummarizerTransformer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=1, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=1)

Leads to an InvalidArgumentError while

from headliner.trainer import Trainer
from headliner.model.summarizer_transformer import SummarizerTransformer

data = [
        ('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are the stars, earth and sky for me!', 'I love you.')
]

summarizer = SummarizerTransformer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=1, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=1)

without the ('You are the stars, earth and sky for me!', 'I love you I love you I love you.') pair, works fine.

I use:

python==3.6
tensorflow==2.0.0
headliner== 0.0.22

It happens regardless of whether I run on a GPU or on CPU only.

Can you reproduce this bug?

opened by pschwllr 2

Question: Pre-trained BERT for MT

Hi,

has there been any research done from your side comparing BLEU score to SOTA results for this pre-trained BERT for MT? Or is it merely to illustrate the flexibility of the library, e.g, attaching a decoder and re-training on a custom dataset?

opened by Stamenov 2

Error while loading trained model

I followed the documentation and got an error with the following code:

Training part

NUM_UNITS = 1024
BATCH_SIZE = 32
STEPS_PER_EPOCH = len(data) // BATCH_SIZE
STEPS_TO_LOG = 100
MAX_OUTPUT_LENGTH = 50
EPOCHS = 20
EMB_SIZE = 128

from headliner.trainer import Trainer
from headliner.model.summarizer_attention import SummarizerAttention

summarizer = SummarizerAttention(lstm_size=NUM_UNITS, embedding_size=EMB_SIZE)
trainer = Trainer(batch_size=BATCH_SIZE, 
                  steps_per_epoch=STEPS_PER_EPOCH, 
                  steps_to_log=STEPS_TO_LOG, 
                  max_output_len=MAX_OUTPUT_LENGTH, 
                  model_save_path=save_path)
trainer.train(summarizer, train, num_epochs=EPOCHS, val_data=test)

Loading pre-trained model:

summarizer_loaded = SummarizerAttention.load('summarizer')
trainer = Trainer(batch_size=2)
trainer.train(summarizer_loaded, data)

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-20-1bcb176df5c9> in <module>()
      1 summarizer_loaded = SummarizerAttention.load('summarizer')
      2 trainer = Trainer(batch_size=2)
----> 3 trainer.train(summarizer_loaded, data)
      4 # summarizer_loaded.save('/tmp/summarizer_retrained')

C:\ProgramData\Anaconda3\lib\site-packages\headliner\trainer.py in train(self, summarizer, train_data, val_data, num_epochs, scorers, callbacks)
    203         train_step = summarizer.new_train_step(self.loss_function, self.batch_size, apply_gradients=True)
    204         while epoch_count < num_epochs:
--> 205             for train_source_seq, train_target_seq in train_dataset.take(-1):
    206                 batch_count += 1
    207                 current_loss = train_step(train_source_seq, train_target_seq)

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py in __next__(self)
    620 
    621   def __next__(self):  # For Python 3 compatibility
--> 622     return self.next()
    623 
    624   def _next_internal(self):

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py in next(self)
    664     """Returns a nested structure of `Tensor`s containing the next element."""
    665     try:
--> 666       return self._next_internal()
    667     except errors.OutOfRangeError:
    668       raise StopIteration

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py in _next_internal(self)
    649             self._iterator_resource,
    650             output_types=self._flat_output_types,
--> 651             output_shapes=self._flat_output_shapes)
    652 
    653       try:

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\gen_dataset_ops.py in iterator_get_next_sync(iterator, output_types, output_shapes, name)
   2670       else:
   2671         message = e.message
-> 2672       _six.raise_from(_core._status_to_exception(e.code, message), None)
   2673   # Add nodes to the TensorFlow graph.
   2674   if not isinstance(output_types, (list, tuple)):

C:\ProgramData\Anaconda3\lib\site-packages\six.py in raise_from(value, from_value)

UnknownError: AttributeError: 'Vectorizer' object has no attribute 'max_input_len'
Traceback (most recent call last):

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\script_ops.py", line 221, in __call__
    ret = func(*args)

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 585, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "C:\ProgramData\Anaconda3\lib\site-packages\headliner\trainer.py", line 264, in <genexpr>
    data_vectorized = (vectorizer(d) for d in data_preprocessed)

  File "C:\ProgramData\Anaconda3\lib\site-packages\headliner\preprocessing\vectorizer.py", line 42, in __call__
    if self.max_input_len is not None:

AttributeError: 'Vectorizer' object has no attribute 'max_input_len'


	 [[{{node PyFunc}}]] [Op:IteratorGetNextSync]

opened by sainimohit23 2

Import library

Hi,

Thank you so much for sharing this library. It seems to be very easy to use compared to others.

I try to import the library to reproduce your example but I receive an error message :

File "/home/ubuntu/.local/lib/python3.5/site-packages/headliner/model/summarizer.py", line 14
    self.vectorizer: Union[Vectorizer, None] = None
    ^
SyntaxError: invalid syntax

I am running Python 3.5.2.

Am I doing something wrong?

Thanks

opened by YahyaL 2

Training with custom AmazonFoodReview Dataset for TextSummarization

Hi, first of all, thanks for open-sourcing such an easy-to-use sequence-to-sequence library. I was thinking of using Headliner, and as a test I started training a summarization model on custom data from AmazonFoodReviews. But it ended up with a loss of "4.662851969401042".

I was using TransformerSummarizer to train the custom model and the code is here below.

summarizer = TransformerSummarizer(num_heads=1, embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, training_data, num_epochs=100)
summarizer.save('/tmp/summarizer')

Training data was in form of tuple (only 10 samples to print here) : [('have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like stew than processed meat and it smells better my labrador is finicky and she appreciates this product better than most', 'good quality dog food'), ('product arrived labeled as jumbo salted peanuts the peanuts were actually small sized unsalted not sure if this was an error or if the vendor intended to represent the product as jumbo', 'not as advertised'), ('this is confection that has been around few centuries it is light pillowy citrus gelatin with nuts in this case filberts and it is cut into tiny squares and then liberally coated with powdered sugar and it is tiny mouthful of heaven not too chewy and very flavorful highly recommend this yummy treat if you are familiar with the story of lewis the lion the witch and the wardrobe this is the treat that seduces edmund into selling out his brother and sisters to the witch', 'delight says it all'), ('if you are looking for the secret ingredient in robitussin believe have found it got this in addition to the root beer extract ordered and made some cherry soda the flavor is very medicinal', 'cough medicine'), ('great taffy at great price there was wide assortment of yummy taffy delivery was very quick if your taffy lover this is deal', 'great taffy'), ('got wild hair for taffy and ordered this five pound bag the taffy was all very enjoyable with many flavors watermelon root beer melon peppermint grape etc my only complaint is there was bit too much red black licorice flavored pieces between me my kids and my husband this lasted only two weeks would recommend this brand of taffy it was delightful treat', 'nice taffy'), ('this saltwater taffy had great flavors and was very soft and chewy each candy was individually wrapped well none of the candies were stuck together which did happen in the expensive version fralinger 
would highly recommend this candy served it at beach themed party and everyone loved it', 'great just as good as the expensive brands'), ('this taffy is so good it is very soft and chewy the flavors are amazing would definitely recommend you buying it very satisfying', 'wonderful tasty taffy'), ('right now am mostly just sprouting this so my cats can eat the grass they love it rotate it around with wheatgrass and rye too', 'yay barley'), ('this is very healthy dog food good for their digestion also good for small puppies my dog eats her required amount at every feeding', 'healthy dog food')]

vocab encoder: 18122, vocab decoder: 4439

But the prediction failed badly.

Then I tried BertSummarizer on the same dataset, and it gives the error: TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int32, tf.int32, tf.int32), but the yielded element was ([3, 7293, 1725, 14131, 10785, 16089, 17337, 2220, 4703, 6185, 12287, 574, 7293, 6281, 16104, 414, 16352, 1242, 10785, 6793, 12569, 16089, 12274, 9204, 10148, 9029, 15200, 16066, 12257, 9667, 574, 8317, 14571, 1408, 10321, 8751, 8290, 5960, 574, 14182, 736, 16193, 12274, 1408, 16066, 10171, 2], [3, 1655, 3098, 1136, 1500, 2]).

I would be very thankful if you help me out.

Thanx.

opened by ShoubhikBanerjee 1

Colab hosted notebooks for quick demo.

It is not an issue. It is more like a suggestion.

Would it be possible to have Google Colab hosted notebooks like this in the documentation, to play around with the code and run quick demos?

opened by sainimohit23 1

Enable retraining with non-trainable embedding.

Currently if a model is retrained the embedding is switched to trainable=true. Use an additional flag for trainable embeddings that is restored when loading the model.

opened by cschaefer26 0

Can I bypass tokenizer and predict seq2seq directly?

I am trying to use Transformer to predict a sequence from another sequence. But I failed to see how can I bypass the tokenizer so that I can directly send my sequence to Transformer.

opened by zhangwangwz 0

Does headliner support beam search during decoding?

Hello, I would like to know whether Headliner supports beam search during prediction and, if it does, how I can use it. The documentation doesn't mention it, and when I checked the source code, it seems you use greedy search during decoding. Many thanks.

opened by zhangshouleibupt 2
