ByT5: Towards a token-free future with pre-trained byte-to-byte models

Google Research

Last update: Jan 6, 2023

Overview

ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword vocabulary like most other pretrained language models (BERT, XLM-R, T5, GPT-3), our ByT5 model operates directly on UTF-8 bytes, removing the need for any text preprocessing. Beyond the reduction in system complexity, we find that parameter-matched ByT5 models are competitive with mT5 across a range of tasks, and outperform mT5 on tasks that involve noisy text or are sensitive to spelling and pronunciation. This repo can be used to reproduce the experiments in the ByT5 paper.
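
To make the byte-level input concrete, here is a minimal sketch in plain Python of how a string could be mapped to model ids without any tokenizer. It assumes the convention of reserving the first three ids for padding, end-of-sequence, and unknown, so each byte value is shifted by 3. The helper names `encode_bytes` and `decode_bytes` and the constants are illustrative and are not part of this repo.

```python
from typing import List

# Illustrative sketch only: names and constants below are assumptions.
NUM_SPECIAL_IDS = 3  # assumed layout: 0 = padding, 1 = end-of-sequence, 2 = unknown
EOS_ID = 1


def encode_bytes(text: str, add_eos: bool = True) -> List[int]:
    """Map text to ids by shifting each UTF-8 byte past the special ids."""
    ids = [b + NUM_SPECIAL_IDS for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, dropping special ids such as end-of-sequence."""
    raw = bytes(i - NUM_SPECIAL_IDS for i in ids if i >= NUM_SPECIAL_IDS)
    return raw.decode("utf-8", errors="ignore")


ids = encode_bytes("héllo")
print(len(ids))  # 7: "é" takes two UTF-8 bytes, plus one end-of-sequence id
print(decode_bytes(ids))
```

Because every character becomes one or more bytes, sequences are longer than with a subword vocabulary, which is why the commands in the Usage section use long input lengths.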

Usage

Training

To run this code, you need to install the t5 library. General instructions for training, fine-tuning, evaluation, and exporting models for inference can be found in the t5 repo. In order to use the additional ByT5 tasks provided in this library with the t5_mesh_transformer command, run from this directory and add the flag --module_import="byt5.tasks".

To train a ByT5-Large model on the mc4 task from scratch as described in the paper:

export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=yourbucket
export TPU=yourtpu

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf

TASK=byt5_mc4
MODEL_DIR="${BUCKET}/${TASK}"

python -m t5.models.mesh_transformer_main \
  --tpu="${TPU}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --gin_file="models/byt5.large.gin" \
  --gin_param="MIXTURE_NAME = '${TASK}'" \
  --gin_param="utils.run.sequence_length = {'inputs': 1024, 'targets': 189}" \
  --gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
  --gin_param="utils.run.learning_rate_schedule=@learning_rate_schedules.rsqrt_no_ramp_down" \
  --gin_param="run.train_steps = 1000000" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
  --eval_mode="perplexity_eval" \
  --eval_gin_param="mesh_eval_dataset_fn.num_eval_examples = 10000" \
  --t5_tfds_data_dir="${BUCKET}/t5-tfds" \
  --module_import="byt5.tasks"
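
The `tokens_per_batch` setting in the command above is easier to read as a number of sequences. The sketch below does that arithmetic in Python, under the assumption that the number of sequences per batch is the token budget divided by the longest configured sequence length. The function name `sequences_per_batch` is hypothetical, and the library itself may round or adjust the result.

```python
def sequences_per_batch(tokens_per_batch: int, sequence_length: dict) -> int:
    """Estimate sequences per batch from a token budget (assumed rule)."""
    longest = max(sequence_length.values())
    return max(1, tokens_per_batch // longest)


# Values taken from the pre-training command above.
pretrain = sequences_per_batch(1048576, {"inputs": 1024, "targets": 189})
print(pretrain)  # 1024
```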

Fine-Tuning

The example below shows how to fine-tune the ByT5-Large model on the XNLI zero-shot task.

export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=yourbucket
export TPU=yourtpu

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf

TASK=byt5_xnli_zeroshot
PRETRAINED_DIR=gs://t5-data/pretrained_models/byt5/large
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=262144
MODEL_DIR="${BUCKET}/${TASK}"

# Run fine-tuning
python -m t5.models.mesh_transformer_main \
  --tpu="${TPU}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --gin_file="${PRETRAINED_DIR}/operative_config.gin" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
  --gin_param="MIXTURE_NAME = '${TASK}'" \
  --gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
  --gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
  --t5_tfds_data_dir="${BUCKET}/t5-tfds" \
  --module_import="byt5.tasks" \
  --gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
  --gin_param="utils.run.sequence_length = {'inputs': 2048, 'targets': 56}" \
  --eval_gin_param="Bitransformer.decode.max_decode_length = 56"
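
The shell arithmetic and the checkpoint path in the fine-tuning command can be checked quickly in Python. This sketch only recombines the values given in the example, with variable names mirroring the shell variables. That `train_steps` is set to the sum suggests the step counter continues from the pre-trained checkpoint, though that reading is an inference from the command rather than something stated here.

```python
# Values copied from the fine-tuning example above.
PRETRAINED_DIR = "gs://t5-data/pretrained_models/byt5/large"
PRETRAINED_STEPS = 1000000
FINETUNE_STEPS = 262144

# Equivalent of $((PRETRAINED_STEPS+FINETUNE_STEPS)) in the shell command.
total_train_steps = PRETRAINED_STEPS + FINETUNE_STEPS

# Equivalent of '${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'.
init_checkpoint = f"{PRETRAINED_DIR}/model.ckpt-{PRETRAINED_STEPS}"

print(total_train_steps)  # 1262144
print(init_checkpoint)
```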

The remaining experiments are shown in the tasks.py file.

Released Model Checkpoints

We have released the following checkpoints for pre-trained models described in our paper:

ByT5-Small (300 million parameters): gs://t5-data/pretrained_models/byt5/small
ByT5-Base (580 million parameters): gs://t5-data/pretrained_models/byt5/base
ByT5-Large (1.2 billion parameters): gs://t5-data/pretrained_models/byt5/large
ByT5-XL (3.7 billion parameters): gs://t5-data/pretrained_models/byt5/xl
ByT5-XXL (13 billion parameters): gs://t5-data/pretrained_models/byt5/xxl

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@misc{xue2021byt5,
    title={ByT5: Towards a token-free future with pre-trained byte-to-byte models},
    author={Linting Xue and Aditya Barua and Noah Constant and Rami Al-Rfou and Sharan Narang and Mihir Kale and Adam Roberts and Colin Raffel},
    year={2021},
    eprint={2105.13626},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

This is not an officially supported Google product.

Comments

Using ByT5 for NER

I want to fine-tune the pretrained ByT5 model on a NER task. I can see that a sample NER task has been added to the SeqIO library in tasks.py, but could you explain in detail how to go about it? I am still unsure about the format in which the training data for NER has to be fed in. I'm new to SeqIO and am still trying to figure out how to use the scripts mentioned in the README file.

opened by Akshay0799 0

Issue with span_corruption preprocessors

Hi, I'm trying to pretrain ByT5 on a custom corpus of short texts, but I'm stuck on the data pipeline (the code is below). When I decode the outputs, inputs and targets from different examples are merged, and both are noised.

import functools

import seqio
import t5.data.preprocessors

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=seqio.ByteVocabulary(), add_eos=True),
    "targets": seqio.Feature(vocabulary=seqio.ByteVocabulary(), add_eos=True),
}

MEAN_NOISE_SPAN_LENGTH = 5
SEQUENCE_LENGTH = {"inputs": 128, "targets": 128}

seqio.TaskRegistry.add(
    name="nelma_byt5",
    source=seqio.TextLineDataSource(
        split_to_filepattern={"train": "/disk1/projekti/mondodb_lm/test.tsv"},
    ),
    preprocessors=[
        # Parse each TSV line into "text" and "class" fields.
        functools.partial(
            t5.data.preprocessors.parse_tsv,
            field_names=["text", "class"],
            field_delim="\t",
        ),
        # Keep only the text as "targets"; span corruption builds "inputs" from it.
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"},
        ),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        functools.partial(
            t5.data.preprocessors.span_corruption,
            mean_noise_span_length=MEAN_NOISE_SPAN_LENGTH,
        ),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[],
)

opened by mondonomo 0

How to convert ByT5 model to ONNX format?

Hi,

ONNX makes it possible to compress transformer models and speed up inference on CPU and GPU.

Could anyone share code or a notebook to convert mT5 and ByT5 models to ONNX format?

There is the fastT5 library for T5 conversion (great!), but it has not been updated to the latest version of transformers and therefore does not yet accept mT5 and ByT5 models.

Thanks.

opened by piegu 0
