SeqIO: Task-based datasets, preprocessing, and evaluation for sequence models.

SeqIO is a library for processing sequential data to be fed into downstream sequence models. It uses tf.data.Dataset to create scalable data pipelines but requires minimal use of TensorFlow. In particular, with one line of code, the returned dataset can be transformed to a numpy iterator and hence it is fully compatible with other frameworks such as JAX or PyTorch.

Currently, SeqIO assumes that the dataset is a sequence, i.e., each feature is a one-dimensional array. Modalities such as text or audio are naturally supported. Images are supported as long as they are represented as sequences (e.g., Image GPT). We will relax this constraint in the future in order to support higher-dimensional data.

SeqIO is a refactor of the t5.data library used (in conjunction with the Mesh Tensorflow Transformer implementation) to train the T5 models introduced in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

If you have used t5.data in the past and want to know how SeqIO differs, please read the "Differences from t5.data" section.

Usage Tutorial

At a high level, we use SeqIO with the following steps.

Define a Task (and optionally a Mixture).
Define (or use an existing) FeatureConverter based on the model architecture.
Use the top-level function seqio.get_dataset to obtain the tf.data.Dataset instance.

We will look at each of these steps in detail.

Defining a `Task`

The most important class in SeqIO is the Task. It is an abstraction that combines:

a raw data source
one or more preprocessing steps
a vocabulary to tokenize/detokenize each preprocessed feature for the model
a postprocessor to convert detokenized model outputs into a format for evaluation
one or more metrics to evaluate with

Oftentimes a Task lines up with a common benchmark. In this tutorial, we use the WMT 19 English-German machine translation task. In the end, our Task will look like this:

seqio.TaskRegistry.add(
    "wmt19_ende",
    seqio.TfdsDataSource(tfds_name="wmt19_translate/de-en:1.0.0"),
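    preprocessors=[
        # Extra preprocessor arguments must be bound before creating the Task.
        functools.partial(
            translate, source_language="en", target_language="de"),
        seqio.preprocessors.tokenize, seqio.preprocessors.append_eos
    ],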
    output_features={
        "inputs": seqio.Feature(
           seqio.SentencePieceVocabulary("/path/to/inputs/vocab"),
           add_eos=False, dtype=tf.int32
        ),
        "targets": seqio.Feature(
           seqio.SentencePieceVocabulary("/path/to/targets/vocab"),
           add_eos=True, dtype=tf.int32
        ),
    },
    metric_fns=[metrics.bleu])

We typically add the Task to the global registry when we define it (as shown above) to make it easier to use with model configs and flags. Thus, it must have a unique string name ("wmt19_ende" in this case). Note, however, that you may also instantiate a seqio.Task directly without adding it to the registry, if desired.

We'll now break down each part of the task definition.

Data Source

Data sources are the first step in your pipeline, providing a way to load raw data in many formats as a tf.data.Dataset. All data sources are subclasses of the DataSource base class and are defined in dataset_providers.

Existing implementations include:

TfdsDataSource for loading examples from TensorFlow Datasets.
TextLineDataSource for loading examples from text files (e.g., tsv).
TFExampleDataSource for loading tf.train.Example protos from a file (e.g., a TFRecord file).
FunctionDataSource for providing a custom function that returns a tf.data.Dataset.

In our example, we are using the TfdsDataSource. We specify the name of the WMT dataset in TFDS ("wmt19_translate"), the specific config for the language pair that excludes the context for the open domain setting ("de-en"), and the version number ("1.0.0").

Output Features

The output_features field expects a dictionary that maps string feature names to seqio.Feature objects. This defines what the Task is expected to produce in its output examples. The output examples may contain additional fields, but they must contain these fields in the specified format or exceptions will be raised.

Each Feature includes:

A vocabulary, which must subclass seqio.Vocabulary, to specify how the feature can be tokenized and detokenized. You may use seqio.PassThroughVocabulary if tokenization is not necessary.
add_eos, which specifies whether the feature should end with the vocabulary's EOS token.
The output dtype which must be a tf.dtypes.DType.

Note: specifying these options on Feature does not by itself ensure the proper transformations are applied -- you must also include the necessary preprocessors.

The tasks used in T5 all produce "inputs" and "targets" features to be consumed by the text-to-text model. For a decoder-only language model, only a single feature (e.g., "targets") would be necessary. Nevertheless, SeqIO is flexible enough to generate arbitrary output features that will be converted into model features by the FeatureConverter later in the pipeline.

Preprocessors

Preprocessors are functions that transform one tf.data.Dataset into a new tf.data.Dataset. Typically this involves executing a map over the given dataset. The preprocessors provided to the Task will be executed sequentially.

As an example, let's look at the previously undefined translate from the "wmt19_ende" example above.

def translate(dataset: tf.data.Dataset,
              source_language: str,
              target_language: str) -> tf.data.Dataset:
  def _translate(ex: Mapping[str, tf.Tensor]) -> Mapping[str, tf.Tensor]:
    """Convert a translation example to a text2text pair.

    For example, say the dataset returns examples of this format:
      {'de': 'Das ist gut.', 'en': 'That is good.'}
    If source_language = 'de', target_language = 'en', then the outputs will have
    the format:
      {'inputs': 'translate de to en: Das ist gut.',
      'targets': 'That is good.'}

    Args:
      ex: an example to process.
      source_language: source language code (e.g. 'en') to translate from.
      target_language: target language code (e.g. 'de') to translate to.

    Returns:
      A preprocessed example with the format listed above.
    """
    src_str = f'translate {source_language}'
    tgt_str = f' to {target_language}: '
    return {
        'inputs': tf.strings.join([src_str, tgt_str, ex[source_language]]),
        'targets': ex[target_language],
    }

  return dataset.map(_translate,
                     num_parallel_calls=tf.data.experimental.AUTOTUNE)

The TFDS dataset provides the dataset where each example has the form: {'de': 'Das ist gut.', 'en': 'That is good.'}. We convert this to "inputs" and "targets" with the appropriate prompt to inform the model of the task.

A few important notes:

When instantiating a Task, the preprocessor functions can have the following arguments: dataset, output_features, and sequence_length. The first (positional) dataset argument is always required. If an argument named output_features is provided, the output feature mapping will be passed to the preprocessor. If sequence_length is provided, a mapping from feature name to its maximum final sequence length (provided by the caller) will be passed -- any sequences that are too long after preprocessing will be automatically truncated. If a preprocessor function has other arguments, they must have default values or be bound (e.g., with functools.partial) before instantiating the Task.
Mapping functions operate on and return tf.Tensors using TensorFlow operations, although it is possible to take advantage of automatic AutoGraph conversion for numpy or use tf.py_function to wrap arbitrary Python code. See tf.data.Dataset documentation for more details.
When calling map, it is important to always set num_parallel_calls=tf.data.experimental.AUTOTUNE to avoid creating a bottleneck. The seqio.map_over_dataset decorator helps enforce this as follows.

@seqio.map_over_dataset
def translate(ex: Mapping[str, tf.Tensor],
              source_language: str,
              target_language: str) -> Mapping[str, tf.Tensor]:
  """Convert a translation example to a text2text pair.

  For example, say the dataset returns examples of this format:
    {'de': 'Das ist gut.', 'en': 'That is good.'}
  If source_language = 'de', target_language = 'en', then the outputs will have
  the format:
    {'inputs': 'translate de to en: Das ist gut.',
    'targets': 'That is good.'}

  Args:
    ex: an example to process.
    source_language: source language code (e.g. 'en') to translate from.
    target_language: target language code (e.g. 'de') to translate to.

  Returns:
    A preprocessed example with the format listed above.
  """
  src_str = f'translate {source_language}'
  tgt_str = f' to {target_language}: '
  return {
      'inputs': tf.strings.join([src_str, tgt_str, ex[source_language]]),
      'targets': ex[target_language],
  }

Note that translate takes an individual example as input. seqio.map_over_dataset then wraps it into a function that takes a tf.data.Dataset instance.
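The requirement that extra preprocessor arguments be bound before the Task is created can be illustrated with functools.partial. The sketch below is a pure-Python analogy only: a list of dicts stands in for a tf.data.Dataset, and translate_py is a hypothetical helper that is not part of SeqIO.

```python
import functools
from typing import Dict, Iterable, List


def translate_py(dataset: Iterable[Dict[str, str]],
                 source_language: str,
                 target_language: str) -> List[Dict[str, str]]:
  """Pure-Python stand-in for the `translate` preprocessor (illustrative)."""
  prefix = f'translate {source_language} to {target_language}: '
  return [
      {'inputs': prefix + ex[source_language], 'targets': ex[target_language]}
      for ex in dataset
  ]


# Bind the extra arguments so that only `dataset` remains to be supplied.
translate_en_de = functools.partial(
    translate_py, source_language='en', target_language='de')

raw = [{'de': 'Das ist gut.', 'en': 'That is good.'}]
processed = translate_en_de(raw)
print(processed[0]['inputs'])  # translate en to de: That is good.
```

The bound function now has the single-argument shape that a preprocessor list expects, which is why binding has to happen before the Task is instantiated.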

Stochastic operations must be stateless if deterministic pipelines are needed. To get (optionally deterministic) seeds for these operations, use the seqio.map_over_dataset(num_seeds=n) decorator. For example:

def random_chunk(
    dataset: tf.data.Dataset,
    sequence_length: Mapping[str, int]
) -> tf.data.Dataset:
  """Takes a random chunk out of each feature the size of `sequence_length`."""

  @seqio.map_over_dataset(num_seeds=1)
  def take_chunk(
      ex: Mapping[str, tf.Tensor],
      seed
  ) -> Mapping[str, tf.Tensor]:
    new_ex = {}
    for k, v in ex.items():
      if k in sequence_length:
        length = sequence_length[k]
        # An integer dtype is needed so the result can be used as an index.
        start_idx = tf.random.stateless_uniform(
            (), seed, 0, tf.size(v) - (length + 1), dtype=tf.int32)
        new_ex[k] = v[start_idx:start_idx+length]
      else:
        new_ex[k] = v
    return new_ex

  return take_chunk(dataset)

If num_seeds > 1, the arg will instead be called seeds and will contain a sequence of seeds.
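To make the idea of a stateless stochastic operation concrete, here is a minimal pure-Python sketch. It is an analogy rather than SeqIO code: take_chunk_py is a hypothetical helper, and Python's random module stands in for TensorFlow's stateless random ops. The point is that the output depends only on the input and the seed, so the same seed always reproduces the same chunk.

```python
import random
from typing import List, Sequence


def take_chunk_py(tokens: Sequence[int], length: int, seed: int) -> List[int]:
  """Returns a chunk whose start index depends only on `seed` (illustrative)."""
  if len(tokens) <= length:
    return list(tokens)
  # A fresh generator per call means no hidden state is shared between calls.
  rng = random.Random(seed)
  start = rng.randrange(len(tokens) - length + 1)
  return list(tokens[start:start + length])


tokens = list(range(20))
first = take_chunk_py(tokens, length=5, seed=42)
second = take_chunk_py(tokens, length=5, seed=42)
print(first == second)  # True
```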

In our "wmt19_ende" task, we also use the predefined preprocessors seqio.preprocessors.tokenize and seqio.preprocessors.append_eos. The former uses each Feature.vocabulary to tokenize it, and the latter appends Feature.vocabulary.eos_id to the feature if Feature.add_eos is True. See preprocessors.py for their implementations and other useful preprocessors.

Postprocessor

During evaluation, the model outputs are first detokenized using the output feature vocabulary. Before passing these predictions to the metric functions, they can be run through a Python postprocessing function, alongside the full input example. Similarly, the raw targets are run through this function before being passed to the metrics. Since the postprocess function is used on both the model output and the targets, it is passed an is_target boolean in case the behavior should be different. It is also passed the fully preprocessed example, including fields that were excluded from output_features.

For the "wmt19_ende" task, we don't need any postprocessors. See the "trivia_qa_open" task in the Advanced Postprocessing Task for an example postprocessor.
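As a rough illustration of the postprocessing contract described in this section, the sketch below defines a hypothetical postprocessor for a question-answering task. The function name, the keyword argument names example and is_target, and the "answers" field are assumptions made for this example. Only the existence of an is_target flag and access to the full preprocessed example come from the text above.

```python
from typing import Any, List, Mapping, Optional, Union


def qa_postprocessor(output: str,
                     example: Optional[Mapping[str, Any]] = None,
                     is_target: bool = False) -> Union[str, List[str]]:
  """Hypothetical postprocessor that normalizes text before metrics run."""
  if is_target:
    # For targets, use every acceptable answer stored in the full example,
    # including fields that are not part of output_features.
    return [answer.strip().lower() for answer in example['answers']]
  # For model outputs, normalize the detokenized prediction.
  return output.strip().lower()


prediction = qa_postprocessor('  Paris ')
target = qa_postprocessor(
    'Paris', example={'answers': ['Paris', 'City of Paris']}, is_target=True)
print(prediction in target)  # True
```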

Metrics

Metrics are functions that are passed (by the Evaluator) the fully-materialized list of postprocessed model outputs (or scores) and targets and return a mapping from string names to Metric objects containing their values. These are most commonly floating-point scalars, but may also be text, images, audio, histograms, etc. (see evaluation.py for the full list).

The first argument of a metric function must always be called targets. If the second argument of a metric function is called predictions, it will be passed the decoded and detokenized model prediction. If it is called scores, it will be passed a list of log-likelihood scores for each example.

If multiple metric functions are provided, they will all be used and their returned mappings merged.
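The merging behavior can be sketched in plain Python as follows. This is a simplified illustration, not the Evaluator's implementation: accuracy, mean_prediction_length, and run_metrics are hypothetical helpers, and plain floats stand in for Metric objects.

```python
from typing import Callable, Dict, Sequence

MetricFn = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def accuracy(targets: Sequence[str],
             predictions: Sequence[str]) -> Dict[str, float]:
  """Percentage of predictions that exactly match their target."""
  matches = sum(t == p for t, p in zip(targets, predictions))
  return {'accuracy': 100.0 * matches / len(targets)}


def mean_prediction_length(targets: Sequence[str],
                           predictions: Sequence[str]) -> Dict[str, float]:
  """Average number of whitespace-separated words per prediction."""
  total_words = sum(len(p.split()) for p in predictions)
  return {'mean_prediction_length': total_words / len(predictions)}


def run_metrics(metric_fns: Sequence[MetricFn],
                targets: Sequence[str],
                predictions: Sequence[str]) -> Dict[str, float]:
  """Calls every metric function and merges the returned mappings."""
  merged: Dict[str, float] = {}
  for metric_fn in metric_fns:
    merged.update(metric_fn(targets, predictions))
  return merged


results = run_metrics(
    [accuracy, mean_prediction_length],
    targets=['a b', 'c'],
    predictions=['a b', 'd'])
print(results)  # {'accuracy': 50.0, 'mean_prediction_length': 1.5}
```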

Prediction Metrics

Prediction metrics are computed using the postprocessed targets and model outputs (predictions). The args must be named targets and predictions.

Let's look at the metric function used for the "wmt19_ende" task. A standard metric for the translation task is BLEU, and we use the sacrebleu implementation.

def bleu(targets: Sequence[str], predictions: Sequence[str]):
  """Computes BLEU score.

  Args:
    targets: list of strings or list of list of strings if multiple references
      are present.
    predictions: list of strings

  Returns:
    bleu_score across all targets and predictions
  """
  if isinstance(targets[0], list):
    targets = [[x for x in target] for target in targets]
  else:
    # Need to wrap targets in another list for corpus_bleu.
    targets = [targets]

  bleu_score = sacrebleu.corpus_bleu(predictions, targets,
                                     smooth_method="exp",
                                     smooth_value=0.0,
                                     force=False,
                                     lowercase=False,
                                     tokenize="intl",
                                     use_effective_order=False)
  return {"bleu": bleu_score.score}

Score Metrics

Score metrics are computed using the postprocessed targets and their log-likelihood scores according to the model. The args must be named targets and scores.

def perplexity(targets: Sequence[str], scores: Sequence[float]):
  return {
    "perplexity": seqio.evaluation.Scalar(np.exp(np.mean(scores)))
  }

Defining a `Mixture`

Once you have multiple Tasks added to the TaskRegistry, you can define Mixtures that will combine the examples from them according to some specified rate. Examples will then be sampled from each task in proportion to its rate.

As an example, Multilingual T5 uses a Mixture of per-language Tasks with tail languages up-weighted in the mixture.

There are 3 ways to specify the tasks and their rates:

Provide a rate along with each task's name (rates are normalized before sampling):

seqio.MixtureRegistry.add(
  "mix1",
  [("task1", 1), ("task2", 7)]
)

Provide a constant default rate for some or all tasks, which will be used when only the name is provided. The example below will produce identical mixing rates as the previous one.

seqio.MixtureRegistry.add(
  "mix1",
  [("task1", 0.5), "task2"],
  default_rate=3.5
)

Provide a function that generates the rate for each task at runtime. The example below uses the provided seqio.mixing_rate_num_examples, which uses the number of examples (computed during offline caching) as the rate for each task.

seqio.MixtureRegistry.add(
  "mix2",
  ["task1", "task2"],
  default_rate=seqio.mixing_rate_num_examples
)
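The claim that the first two specifications yield identical mixing rates can be checked with a little arithmetic. The sketch below is not SeqIO's implementation: normalized_rates is a hypothetical helper that resolves default rates and normalizes them so they sum to one.

```python
from fractions import Fraction
from typing import Dict, Optional, Sequence, Tuple, Union

TaskSpec = Union[str, Tuple[str, Union[int, float]]]


def normalized_rates(
    tasks: Sequence[TaskSpec],
    default_rate: Optional[Union[int, float]] = None) -> Dict[str, Fraction]:
  """Resolves default rates and normalizes all rates to sum to one."""
  rates: Dict[str, Fraction] = {}
  for spec in tasks:
    if isinstance(spec, str):
      if default_rate is None:
        raise ValueError(f'No rate given for {spec!r} and no default_rate.')
      name, rate = spec, default_rate
    else:
      name, rate = spec
    rates[name] = Fraction(rate)
  total = sum(rates.values())
  return {name: rate / total for name, rate in rates.items()}


explicit = normalized_rates([('task1', 1), ('task2', 7)])
with_default = normalized_rates([('task1', 0.5), 'task2'], default_rate=3.5)
print(explicit == with_default)  # True
```

Both specifications reduce to 1/8 for "task1" and 7/8 for "task2", because only the ratio between rates matters after normalization.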

You can also include Mixtures in your Mixture! For example, the following mixture would contain 1/24 (from "mix1") + 1/3 of "task1", 7/24 (from "mix1") of "task2", and 1/3 of "task3".

seqio.MixtureRegistry.add(
  "mix3",
  ["mix1", "task1", "task3"],
  default_rate=1
)
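The proportions quoted for the nested mixture can be reproduced with exact fractions. The sketch below is a worked calculation under the assumption that a sub-mixture's share is split among its members according to their normalized rates. The task_proportions helper and the dictionary layout are hypothetical and do not reflect how SeqIO samples internally.

```python
from fractions import Fraction
from typing import Dict, Mapping, Optional


def task_proportions(
    name: str,
    mixtures: Mapping[str, Mapping[str, int]],
    weight: Fraction = Fraction(1),
    out: Optional[Dict[str, Fraction]] = None) -> Dict[str, Fraction]:
  """Recursively expands nested mixtures into per-task proportions."""
  out = {} if out is None else out
  if name not in mixtures:
    # A leaf task: accumulate the share that reaches it.
    out[name] = out.get(name, Fraction(0)) + weight
    return out
  members = mixtures[name]
  total = sum(Fraction(rate) for rate in members.values())
  for member, rate in members.items():
    task_proportions(member, mixtures, weight * Fraction(rate) / total, out)
  return out


mixtures = {
    'mix1': {'task1': 1, 'task2': 7},
    'mix3': {'mix1': 1, 'task1': 1, 'task3': 1},
}
proportions = task_proportions('mix3', mixtures)
print(proportions['task1'])  # 3/8, i.e. 1/24 + 1/3
```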

Getting a Preprocessed Dataset

Now that your Task (and/or Mixture) is defined, its primary use is to generate a dataset.

You may first need to use seqio.get_mixture_or_task(mixture_or_task_name) to access your dataset provider from the registry.

After that, you can call get_dataset to build the tf.data.Dataset. For example:

dataset = seqio.get_mixture_or_task("mix1").get_dataset(
    sequence_length={"inputs": 256, "targets": 128},
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)

# Print the first 5 examples.
for _, ex in zip(range(5), dataset.as_numpy_iterator()):
  print(ex)

Some notes on a few of the arguments:

sequence_length: An optional mapping from feature name to maximum length. Will be passed to the preprocessors with a sequence_length argument. If not None, the final example features will be truncated if they exceed the specified length. Note that this value may be required to be set if any of the preprocessors use the sequence_length argument and do not handle the None case.
num_epochs: The number of times to repeat the source dataset. Preprocessing will be re-applied with new seeds to enable new samples from stochastic steps. Note that if the CacheDatasetPlaceholder is included (see below) preprocessing is only re-applied after that step.
shard_info: An optional sharding specification for loading a deterministic subset of the dataset. Loading will be most efficient if the number of shards evenly divides the number of shards in the raw data source.
use_cached: Specifies whether to load from a pre-cached task for increased performance or to do the preprocessing on-the-fly. See the following section for details on how to cache your task, which must be done before this can be set to True.
seed: An optional seed to use for deterministic shuffling and (stateless) stochastic ops. These operations will still be pseudorandom but will be reproducible with the same seed. Set to None if determinism is not desired.
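The truncation behavior described for sequence_length can be pictured with a small pure-Python sketch. The truncate_features helper is hypothetical and operates on lists rather than tensors. It only illustrates that features named in the mapping are cut to their maximum length while other fields pass through unchanged.

```python
from typing import Any, Dict, Mapping, Optional


def truncate_features(
    example: Mapping[str, Any],
    sequence_length: Optional[Mapping[str, int]] = None) -> Dict[str, Any]:
  """Truncates each feature listed in `sequence_length` to its maximum length."""
  if sequence_length is None:
    # No lengths specified: leave every feature untouched.
    return dict(example)
  return {
      key: value[:sequence_length[key]] if key in sequence_length else value
      for key, value in example.items()
  }


example = {'inputs': [7, 8, 5, 1], 'targets': [3, 9, 1], 'id': 'ex-0'}
truncated = truncate_features(example, {'inputs': 3, 'targets': 8})
print(truncated)  # {'inputs': [7, 8, 5], 'targets': [3, 9, 1], 'id': 'ex-0'}
```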

(Optional) Offline Caching

To improve performance at load time and avoid redundant computations for commonly used tasks, you can pre-cache your Task with all or part of the preprocessing done in advance of training.

The first step to doing so is to add a seqio.CacheDatasetPlaceholder(required=False) as one of the steps in your preprocessing pipeline. All steps before the placeholder will be cached offline and all steps after will be executed on the fly at load time. You may set required=True if you want get_dataset to fail unless use_cached=True.

Caveats:

Any stochastic operations that you wish to be re-run when num_epochs > 1 or with a different seed should go after the placeholder since only a single sample will be cached.
Any preprocessing steps that use the sequence_length argument must come after the seqio.CacheDatasetPlaceholder preprocessor since this is only known at runtime, or an exception will be raised. If you wish to cache for a specific sequence length, you can use seqio.experimental.add_fully_cached_task.

Once your Task is registered, you can run cache_tasks_main to execute the offline preprocessing, providing it with the module containing your task definitions via the --module_import flag. For very large datasets, it's recommended you run this Apache Beam script on a distributed framework like Google Cloud DataFlow.

Finally, you are ready to load the cached version of your Task (or Mixture) containing it. You will need to add the path to the directory you passed to --output_cache_dir via seqio.add_global_cache_dirs(["/my/cache/dir"]). Now when you call task_or_mixture.get_dataset(..., use_cached=True), the data will be loaded from the cache directory instead of the raw data source.

Feature Converters

The role of the Task is to provide a dataset with as few model-specific features as possible (e.g., generic "inputs" and "targets"), while the feature converters transform these model-agnostic features into model-specific features (e.g., "encoder_input_tokens"). We refer to the former as "task features" and the latter as "model features".

Let's use machine translation (English to German) as a running example.

The raw data consists of sentence pairs such as

"That is good\tDas ist gut."

A Task registered in the TaskRegistry (e.g., wmt_t2t_ende_v003) reads these sentence pairs from the data source and applies a series of preprocessors. One of the internal representations looks like

{"inputs": "translate English to German: That is good.",
 "targets": "Das ist gut."}

The final output from the Task is a tokenized version of the parallel sentences. In the following toy example (the token ids do not correspond to the above string example), the dataset consists of 2 examples.

dataset = [{"inputs": [7, 8, 5], "targets": [3, 9]},
           {"inputs": [8, 4, 9, 3], "targets": [4]}]

The data is a tf.data.Dataset (i.e., each example is a dictionary with "inputs" and "targets" fields).

The FeatureConverter then takes this as an input and converts to the model-specific features. In addition, the feature converter performs padding and optionally packing (for model implementations that support it) for efficiency. For example, let's assume that we are using the standard Transformer architecture with an encoder and a decoder. The output of the feature converter is

converted_dataset = [{
    "encoder_input_tokens": [7, 8, 5, 1, 8, 4, 9, 3, 1, 0],
     "encoder_segment_ids": [1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
       "encoder_positions": [0, 1, 2, 3, 0, 1, 2, 3, 4, 0],
   "decoder_target_tokens": [3, 9, 1, 4, 1, 0, 0],
    "decoder_input_tokens": [0, 3, 9, 0, 4, 0, 0],
    "decoder_loss_weights": [1, 1, 1, 1, 1, 0, 0],
       "decoder_positions": [0, 1, 2, 0, 1, 0, 0],
     "decoder_segment_ids": [1, 1, 1, 2, 2, 0, 0],
}]

In this case, two task examples are packed into one. The *_segment_ids and *_positions fields denote the membership and position of each packed token in its original sequence. The EOS ids (i.e., 1) are appended. In addition, each field is padded to the specified length.
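The packing step described above can be sketched in plain Python. This is a simplified illustration that reproduces the token, segment, and position values of the toy example, not SeqIO's implementation (which operates on tf.data.Dataset objects). The pack_feature helper and its output keys are hypothetical, and the EOS id of 1 and padding id of 0 follow the toy example.

```python
from typing import Dict, List, Sequence

EOS_ID = 1  # EOS id used in the toy example.
PAD_ID = 0  # Padding id used in the toy example.


def pack_feature(sequences: Sequence[Sequence[int]],
                 length: int) -> Dict[str, List[int]]:
  """Appends EOS to each sequence, packs them together, and pads to `length`."""
  tokens: List[int] = []
  segment_ids: List[int] = []
  positions: List[int] = []
  for segment, sequence in enumerate(sequences, start=1):
    with_eos = list(sequence) + [EOS_ID]
    tokens.extend(with_eos)
    segment_ids.extend([segment] * len(with_eos))
    positions.extend(range(len(with_eos)))
  if len(tokens) > length:
    raise ValueError('Sequences do not fit into one packed example.')
  padding = [PAD_ID] * (length - len(tokens))
  return {
      'tokens': tokens + padding,
      'segment_ids': segment_ids + padding,
      'positions': positions + padding,
  }


dataset = [{'inputs': [7, 8, 5], 'targets': [3, 9]},
           {'inputs': [8, 4, 9, 3], 'targets': [4]}]
encoder = pack_feature([ex['inputs'] for ex in dataset], length=10)
decoder = pack_feature([ex['targets'] for ex in dataset], length=7)
print(encoder['tokens'])       # [7, 8, 5, 1, 8, 4, 9, 3, 1, 0]
print(encoder['segment_ids'])  # [1, 1, 1, 1, 2, 2, 2, 2, 2, 0]
```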

We will look at the details of this example in the "Encoder-decoder architecture: seqio.EncDecFeatureConverter" section.

Feature converters provided out of the box

We provide feature converters for three common architectures: encoder-decoder, decoder-only and encoder-only. Here we describe how users can use the feature converters for each of these architectures out of the box as a part of the SeqIO library.

In the SeqIO library, each architecture has a class defining how the task features are converted to model features. Since these feature converters are already implemented, it is straightforward to use them by passing an instance as the feature_converter argument of the seqio.get_dataset function. The following sections show example usage of seqio.get_dataset.

Encoder-decoder architecture: `seqio.EncDecFeatureConverter`

This is the architecture of the original Transformer paper. For the English-to-German translation task, the following function call retrieves the tf.data.Dataset object with the model features.

dataset: tf.data.Dataset = seqio.get_dataset(
    mixture_or_task_name="wmt_t2t_ende_v003",
    task_feature_lengths={"inputs": 32, "targets": 32},
    dataset_split="train",
    pack=True,
    shuffle=True,
    feature_converter=seqio.EncDecFeatureConverter(pack=True)
)

The resulting dataset object has the following 8 fields

Feature name	Explanation
`encoder_input_tokens`	Input tokens to the encoder.
`encoder_positions`	Position index in the sequence before packing.
`encoder_segment_ids`	Sequence membership before packing. Two positions with the same positive integer mean that they belong to the same sequence before packing.
`decoder_input_tokens`	Input tokens to the decoder.
`decoder_target_tokens`	Output tokens from the decoder.
`decoder_loss_weights`	A weight on each position that can be used as a mask.
`decoder_positions`	Position index in the sequence before packing.
`decoder_segment_ids`	Same as `encoder_segment_ids` but for decoder.

Decoder-only architecture

This architecture consists of a single autoregressive stack, which we denote as a "decoder".

A decoder autoregressively produces an output sequence. Therefore, it can be used as a standard language model if the task dataset has only "targets" features, i.e., self-supervised. If the task dataset also has an "inputs" field, e.g., supervised machine translation, the decoder can still be used by concatenating the inputs and targets fields. See Raffel et al. (2020), Section 3.2.1 for a more detailed take on this topic.

We support both use cases and refer to the former as standard language model and the latter as prefix language model. Each of these models is described separately below.

Note that we do not provide special features to denote how the dataset should be consumed. For example, a Transformer-based fully autoregressive decoder has a fully-causal self-attention layer. Since there are many ways of implementing the masking pattern for such an attention layer and, more importantly, SeqIO is not limited to attention-based models, we leave it up to the model implementations to apply the masking pattern. There is one exception, and we cover this in the Prefix LM section below.

A common use pattern is to pretrain a decoder model with the left-to-right language modeling objective (unsupervised) using seqio.LMFeatureConverter and then fine-tune (supervised) using seqio.PrefixLMFeatureConverter.

Standard LM

For the standard language model, the task dataset has only a "targets" field. Therefore, the sequence length specification only needs to specify targets.

dataset: tf.data.Dataset = seqio.get_dataset(
    mixture_or_task_name="standard_lm",
    task_feature_lengths={"targets": 32},
    dataset_split="train",
    pack=True,
    shuffle=True,
    feature_converter=seqio.LMFeatureConverter(pack=True)
)

Note that "standard_lm" is not a registered task in the T5 codebase. It is the left-to-right language modeling task, i.e., predict the next token given the previous tokens on some language corpus (e.g., C4).

The output dataset has the following model features.

Feature name	Explanation
`decoder_target_tokens`	Output tokens from the decoder
`decoder_input_tokens`	Input tokens to the decoder
`decoder_loss_weights`	Binary mask to indicate where the loss should be taken
`decoder_positions`	Position index in the sequence before packing
`decoder_segment_ids`	Sequence membership before packing. Two positions with the same positive integer mean that they belong to the same sequence before packing.

The decoder_target_tokens feature is a shifted version of decoder_input_tokens for standard teacher-forced autoregressive training.
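The shift between decoder inputs and decoder targets can be sketched as follows, using the packed decoder values from the encoder-decoder toy example shown earlier. This is an illustrative pure-Python version, not SeqIO's implementation: shift_right is a hypothetical helper, and using id 0 as both the start-of-sequence and padding value is an assumption read off the toy example.

```python
from typing import List, Sequence


def shift_right(target_tokens: Sequence[int],
                segment_ids: Sequence[int],
                bos_id: int = 0) -> List[int]:
  """Shifts targets right by one, restarting at every packed segment."""
  inputs: List[int] = []
  for i, segment in enumerate(segment_ids):
    starts_segment = i == 0 or segment != segment_ids[i - 1]
    is_padding = segment == 0
    if starts_segment or is_padding:
      # A new segment must not see the previous segment's last token.
      inputs.append(bos_id)
    else:
      inputs.append(target_tokens[i - 1])
  return inputs


decoder_targets = [3, 9, 1, 4, 1, 0, 0]
decoder_segments = [1, 1, 1, 2, 2, 0, 0]
decoder_inputs = shift_right(decoder_targets, decoder_segments)
# Loss is taken on every non-padding target position.
loss_weights = [int(token != 0) for token in decoder_targets]
print(decoder_inputs)  # [0, 3, 9, 0, 4, 0, 0]
print(loss_weights)    # [1, 1, 1, 1, 1, 0, 0]
```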

Prefix LM: `seqio.PrefixLMFeatureConverter`

If the input dataset has a notion of "inputs" and "targets", we can concatenate them so that we can still use a single-stack decoder. Therefore, the output only contains "targets", just like the standard LM case.

We use the same toy example for English-to-German translation task as a running example:

{"inputs": "translate English to German: That is good.",
 "targets": "Das ist gut."}

To be consumed by the decoder-only stack, seqio.PrefixLMFeatureConverter concatenates them to form the new "targets". Consider a 2-layer decoder architecture whose activations are shown below.


That  is  good <EOS> Das ist gut <EOS>
 |    |    |    |    |   |    |   |
 u1   u2   u3   u4   u5  u6   u7  u8
 |    |    |    |    |   |    |   |
 v1   v2   v3   v4   v5  v6   v7  v8
 |    |    |    |    |   |    |   |
<BOS> That is  good <EOS> Das ist gut

Let us denote the first layer's activation in the ith position as vi. Similarly, let ui denote the activation of the second layer in the ith position.

For attention-based sequence models such as Transformer decoders, the self-attention layer is used to encode contextualized representation of the sequence. At a given layer, each position's representation is computed as a function of the representations of the tokens before its position in the previous layer.

Referring to the toy example, when computing u2 with fully causal masking, we do not use v3. This results in a representation u2 of the word "is" that does not take into account the word "good", which is unnecessarily limiting.

For Prefix LM, this issue is resolved by having the fully visible masking pattern for the inputs portion only. For example, when computing u2, v1, v2, v3, v4 and v5 are all visible and taken into account. For the tokens in the "targets" of the Task dataset, we use the causal masking. For example, when computing u6, all vi for i <= 6 are taken into account but not v7.
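The masking pattern described above can be written down as a small visibility matrix. The sketch below is one possible way to express it in plain Python and is not tied to any particular model implementation. The prefix_lm_visibility helper is hypothetical, and indices are 0-based, so row 1 corresponds to u2 and column 4 to v5.

```python
from typing import List


def prefix_lm_visibility(num_positions: int,
                         prefix_length: int) -> List[List[int]]:
  """Returns mask[i][j] == 1 if position i may attend to position j."""
  return [
      # Causal attention (j <= i), plus full visibility inside the prefix.
      [int(j <= i or j < prefix_length) for j in range(num_positions)]
      for i in range(num_positions)
  ]


# Eight positions, of which the first five form the non-causal region.
mask = prefix_lm_visibility(num_positions=8, prefix_length=5)
print(mask[1])  # u2 sees v1..v5: [1, 1, 1, 1, 1, 0, 0, 0]
print(mask[5])  # u6 sees v1..v6: [1, 1, 1, 1, 1, 1, 0, 0]
```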

Why `v5` is included in the inputs attention pattern

In the same translation example, we note that when computing `u2`, the activation corresponding to the position where the <EOS> token was input (i.e., `v5`) was visible. This doesn't count as "cheating" because the model doesn't see the next word "Das". It can provide additional context in building the representation for "good": in this case, `u4` has the context that "good" is the last word in the sentence.

seqio.PrefixLMFeatureConverter provides a feature decoder_causal_attention to encode this information. For the above example, we have

decoder_causal_attention = [1, 1, 1, 1, 1, 0, 0, 0]

indicating that the non-causal attention can be applied to the first five positions. Note that this feature seems trivial, but for a packed dataset the boundary between inputs and targets is more nuanced.

A final consideration for the prefix LM is that because we concatenate "inputs" and "targets", which tokens are used for the loss computation is a modeling decision. For example, we can penalize the model only for the "targets" tokens, or we may also choose to penalize it for the "inputs" tokens. This is controlled by the loss_on_targets_only argument (defaults to True) of the seqio.PrefixLMFeatureConverter constructor. In the above example, we would get

decoder_loss_weights = [0, 0, 0, 0, 1, 1, 1, 1]

This indicates that the last 4 positions are used for the loss computation.
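Both features can be derived from the lengths of the inputs and targets. The sketch below handles a single unpacked example and is only an illustration of the reasoning, not SeqIO's implementation. The prefix_lm_features helper is hypothetical, and token strings are used in place of ids for readability.

```python
from typing import Dict, List, Sequence


def prefix_lm_features(inputs: Sequence[str],
                       targets: Sequence[str],
                       loss_on_targets_only: bool = True) -> Dict[str, List]:
  """Builds toy prefix-LM features for one unpacked example (illustrative)."""
  num_inputs, num_targets = len(inputs), len(targets)
  # The decoder input is shifted right by one (<BOS> is prepended), so the
  # inputs portion occupies num_inputs + 1 positions.
  causal_attention = [1] * (num_inputs + 1) + [0] * (num_targets - 1)
  # Loss weights align with the concatenated targets.
  inputs_weight = 0 if loss_on_targets_only else 1
  loss_weights = [inputs_weight] * num_inputs + [1] * num_targets
  return {
      'decoder_target_tokens': list(inputs) + list(targets),
      'decoder_causal_attention': causal_attention,
      'decoder_loss_weights': loss_weights,
  }


features = prefix_lm_features(
    inputs=['That', 'is', 'good', '<EOS>'],
    targets=['Das', 'ist', 'gut', '<EOS>'])
print(features['decoder_causal_attention'])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(features['decoder_loss_weights'])      # [0, 0, 0, 0, 1, 1, 1, 1]
```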

To get the dataset with prefix LM features, we can use

dataset: tf.data.Dataset = seqio.get_dataset(
    mixture_or_task_name="wmt_t2t_ende_v003",
    task_feature_lengths={"inputs": 32, "targets": 32},
    dataset_split="train",
    pack=True,
    shuffle=True,
    feature_converter=seqio.PrefixLMFeatureConverter(
        pack=True,
        loss_on_targets_only=True)
)

The resulting features have length 64 because it concatenates inputs and targets each with length 32.

The output dataset has the following model features. Note that the only additional feature is decoder_causal_attention.

Feature name	Explanation
`decoder_target_tokens`	Output tokens from the decoder
`decoder_input_tokens`	Input tokens to the decoder
`decoder_loss_weights`	Binary mask to indicate where the loss should be taken
`decoder_positions`	Position index in the sequence before packing
`decoder_segment_ids`	Sequence membership before packing. Two positions with the same positive integer mean that they belong to the same sequence before packing.
`decoder_causal_attention`	Binary mask denoting which tokens are in the non-causal masking region.

Encoder-only architecture

Like the decoder-only architecture, this one is a single stack, but it is not autoregressive.

One notable assumption is that the inputs and targets are aligned, i.e., they have the same sequence length and the ith position in the targets corresponds to the output representation of the ith token in the inputs.

A common model using the encoder-only architecture is BERT. We provide the EncoderFeatureConverter class to support the Masked Language Modeling (MLM) objective from BERT.

We assume that a unique sentinel such as the [MASK] token is used to mask some fraction of the input text and the task is to recover the original text. Therefore, the "targets" are naturally defined as the original text whereas the "inputs" are the masked text.

Encoder-only models are often used for classification tasks. In BERT, a special token [CLS] is prepended to the input sequence. The last layer's activation corresponding to this sentinel token is the contextualized representation of the sequence. We assume that such a "classification" sentinel is prepended.

Consider the following example for the MLM task. The input dataset has two examples, which are packed into one example. We assume that mask_id = 9 and the [CLS] token has an id of 8.

dataset = [{"inputs": [8, 9, 9, 3, 4], "targets": [8, 7, 4, 3, 4]},
           {"inputs": [8, 3, 9], "targets": [8, 3, 6]}]

converted_dataset = {
     "encoder_input_tokens": [8, 9, 9, 3, 4, 1, 8, 3, 9, 1, 0],
    "encoder_target_tokens": [8, 7, 4, 3, 4, 1, 8, 3, 6, 1, 0],
      "encoder_segment_ids": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0],
        "encoder_positions": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 0],
     "encoder_loss_weights": [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
}

Note that the packed sequence has a [CLS] token at the beginning of each sequence. Also note that the loss is taken only on the masked positions.
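The loss weights in this example follow directly from where the mask id appears in the encoder inputs. A minimal sketch, assuming mask_id = 9 as in the example above (mlm_loss_weights is a hypothetical helper, not part of SeqIO):

```python
from typing import List, Sequence


def mlm_loss_weights(input_tokens: Sequence[int], mask_id: int) -> List[int]:
  """Returns 1 where the input token is the mask sentinel, 0 elsewhere."""
  return [int(token == mask_id) for token in input_tokens]


encoder_inputs = [8, 9, 9, 3, 4, 1, 8, 3, 9, 1, 0]
weights = mlm_loss_weights(encoder_inputs, mask_id=9)
print(weights)  # [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
```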

To use the pre-defined EncoderFeatureConverter, provide mask_id as an argument.

dataset: tf.data.Dataset = seqio.get_dataset(
    mixture_or_task_name="some mlm task",
    task_feature_lengths={"inputs": 32, "targets": 32},
    dataset_split="train",
    shuffle=True,
    feature_converter=seqio.EncoderFeatureConverter(
        pack=True,
        mask_id=9)
)

The resulting dataset object has the following 5 fields

Feature name	Explanation
`encoder_input_token`	Input tokens to the encoder
`encoder_position`	Position index in the sequence before packing
`encoder_segment_id`	Sequence membership before packing. Two positions with the same positive integer belong to the same sequence before packing.
`encoder_target_token`	Output tokens from the encoder
`encoder_loss_weight`	Binary mask to indicate where the loss should be taken
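
One common use of encoder_segment_id is to keep tokens from attending across the boundaries of packed sequences. The following is a minimal sketch in plain Python rather than anything provided by SeqIO; the function name segment_attention_mask is hypothetical.

```python
from typing import List


def segment_attention_mask(segment_ids: List[int]) -> List[List[int]]:
    """mask[i][j] is 1 iff positions i and j share the same positive segment id."""
    return [[int(a == b and a > 0) for b in segment_ids] for a in segment_ids]


# Two tokens from sequence 1, one token from sequence 2, one padding position.
mask = segment_attention_mask([1, 1, 2, 0])
```

Padding positions carry segment id 0, so they neither attend nor are attended to in this sketch.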

Custom architectures

For custom model architectures, you need to create a subclass of FeatureConverter and override two methods, _convert_features and get_model_feature_lengths, to define how task features are mapped to model features, including the length relationships. The existing feature converters (e.g., seqio.EncDecFeatureConverter) follow the same pattern, so they can be a useful starting point.
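
To illustrate the kind of length relationship that get_model_feature_lengths has to express, here is a standalone sketch for the encoder-only case with packing enabled. It is a plain function rather than an actual FeatureConverter subclass, its name is hypothetical, and it assumes that every model feature shares the (equal) length of the task's inputs and targets.

```python
from typing import Dict, Mapping

# Model features listed in the encoder-only section above.
ENCODER_MODEL_FEATURES = (
    "encoder_input_token",
    "encoder_target_token",
    "encoder_segment_id",
    "encoder_position",
    "encoder_loss_weight",
)


def encoder_model_feature_lengths(
        task_feature_lengths: Mapping[str, int]) -> Dict[str, int]:
    """Maps task feature lengths to model feature lengths (illustrative only)."""
    inputs_length = task_feature_lengths["inputs"]
    targets_length = task_feature_lengths["targets"]
    # Inputs and targets are assumed to be aligned, so their lengths must agree.
    if inputs_length != targets_length:
        raise ValueError("inputs and targets must have the same length")
    return {name: inputs_length for name in ENCODER_MODEL_FEATURES}


lengths = encoder_model_feature_lengths({"inputs": 32, "targets": 32})
```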

`Evaluator`

TODO(hwchung)

Differences from `t5.data`

The original t5 library introduced and implemented the t5.data.Task abstraction for specifying preprocessing and evaluation metrics for text-to-text tasks. When creating a task, users specify a source dataset of raw text, some preprocessing steps, a vocabulary for tokenization, and evaluation metrics. The fully-specified Task can then be used to pre-train or fine-tune an encoder-decoder transformer model. However, the design included many baked-in assumptions about the types of tasks users could specify.

SeqIO removes some of the constraints of this abstraction:

Inputs and outputs are no longer required to be strings (e.g., they may be images or audio).
Architectures other than the original encoder-decoder are supported (e.g., decoder-only language models like GPT or encoder-only models like BERT).
Users can control at which stage of the pipeline offline caching occurs.
Users can control when and where EOS tokens are added.

Furthermore, SeqIO has been made more modular with respect to the Mesh TensorFlow Transformer. This allows it to be used with other model implementations with more consistency and much less code duplication.

Advanced Postprocessing `Task`

TriviaQA (Closed-book, open-domain version)

This version of TriviaQA was introduced in Roberts et al. 2020.

seqio.TaskRegistry.add(
    "trivia_qa_open",
    source=seqio.TfdsDataSource(
      tfds_name="trivia_qa/unfiltered.nocontext:1.1.0",
      splits={
          "train": "train[:90%]",
          "validation": "train[90%:]",
          "test": "validation"
      }),
    preprocessors=[
        tqa_open_preprocessor,
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(
           seqio.SentencePieceVocabulary("/path/to/inputs/vocab"),
           add_eos=False, dtype=tf.int32
        ),
        "targets": seqio.Feature(
           seqio.SentencePieceVocabulary("/path/to/targets/vocab"),
           add_eos=True, dtype=tf.int32
        ),
    },
    postprocess_fn=tqa_open_postprocessor,
    metric_fns=[tqa_metric])

In this example, we are using the TfdsDataSource. We specify the name of the TriviaQA dataset in TFDS ("trivia_qa"), the specific config that excludes the context for the open domain setting ("unfiltered.nocontext"), and the version number ("1.1.0"). We also override the default splits to match what is commonly used for the open domain setting. Specifically, we set our "test" split to be the TFDS "validation" split, and create a small pseudo-"validation" set by taking examples out of the TFDS "train" split.

The preprocessor tqa_open_preprocessor is defined as follows.

import tensorflow as tf

def tqa_open_preprocessor(
    dataset: tf.data.Dataset,
    prefix: str = "trivia_qa question: "
) -> tf.data.Dataset:
  """Convert TriviaQA dataset to open domain qa examples.

  The function takes the trivia_qa TFDS dataset and emits examples of the
  form:
  {
    "inputs": "trivia_qa question: What are the names of the Olsen Twins?",
    "targets": "Mary-Kate and Ashley",
    "answers": ["Mary-Kate and Ashley", "Ashley and Mary-Kate"]
  }

  Args:
    dataset: a tf.data.Dataset to process.
    prefix: str, prefix to prepend to the inputs.

  Returns:
    a tf.data.Dataset
  """
  def tqa_map(ex):
    """Map TriviaQA example to text-to-text example."""
    return {
        "inputs": prefix + ex["question"],
        "targets": ex["answer"]["value"],
        "answers": ex["answer"]["aliases"],
    }

  return dataset.map(tqa_map, num_parallel_calls=tf.data.experimental.AUTOTUNE)

Or with the seqio.map_over_dataset decorator, we have

from typing import Mapping

def tqa_open_preprocessor(
    dataset: tf.data.Dataset,
    prefix: str = "trivia_qa question: "
) -> tf.data.Dataset:

  @seqio.map_over_dataset
  def tqa_map(ex: Mapping[str, tf.Tensor]) -> Mapping[str, tf.Tensor]:
    """Map TriviaQA example to text-to-text example."""
    return {
        "inputs": prefix + ex["question"],
        "targets": ex["answer"]["value"],
        "answers": ex["answer"]["aliases"],
    }

  # Calling the decorated function with a dataset maps it over every example.
  return tqa_map(dataset)

Here we made a thin wrapper to emphasize that the function decorated by seqio.map_over_dataset takes in an instance of tf.data.Dataset. In practice, this wrapper is not necessary.
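
The idea behind such a decorator can be illustrated without TensorFlow. The sketch below is a toy analogue that operates on plain Python lists of dicts; the name map_over_examples is hypothetical and this is not how seqio.map_over_dataset is implemented.

```python
import functools
from typing import Any, Callable, Dict, Iterable, List

ExampleDict = Dict[str, Any]


def map_over_examples(
        fn: Callable[..., ExampleDict]) -> Callable[..., List[ExampleDict]]:
    """Lifts a per-example function to a function over a collection of examples."""
    @functools.wraps(fn)
    def wrapped(examples: Iterable[ExampleDict], *args, **kwargs):
        return [fn(ex, *args, **kwargs) for ex in examples]
    return wrapped


@map_over_examples
def add_prefix(ex: ExampleDict, prefix: str = "trivia_qa question: ") -> ExampleDict:
    # Written as if it handled one example; the decorator applies it to all.
    return {"inputs": prefix + ex["question"], "targets": ex["answer"]["value"]}


result = add_prefix([{"question": "Q?", "answer": {"value": "A"}}])
```

The decorated function is written per example but is called with the whole collection, which mirrors why the wrapper in the README example is not strictly needed.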

The postprocessor for this example is tqa_open_postprocessor, which is defined as follows:

def tqa_open_postprocessor(output_or_target, example=None, is_target=False):
  """Returns output as answer, or all answers if the full example is provided."""
  if is_target:
    return [a.decode("utf-8") for a in example["answers"]]
  else:
    return output_or_target.decode("utf-8")

When processing the target, we ignore output_or_target (equivalent to example["targets"]) since the preprocessor selects only a single answer for it. Instead, we extract the full list of answers from the example and convert them from bytes to text. When handling the model output, we simply convert the detokenized bytes to text.
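
As a quick illustration of the two code paths, the postprocessor can be called by hand. The example dict below is made up for this sketch and assumes, as the decode calls imply, that targets, answers, and model outputs arrive as UTF-8 bytes.

```python
def tqa_open_postprocessor(output_or_target, example=None, is_target=False):
    """Returns output as answer, or all answers if the full example is provided."""
    if is_target:
        return [a.decode("utf-8") for a in example["answers"]]
    else:
        return output_or_target.decode("utf-8")


# Hypothetical example, shaped like the output of the preprocessor.
example = {
    "targets": b"Mary-Kate and Ashley",
    "answers": [b"Mary-Kate and Ashley", b"Ashley and Mary-Kate"],
}

# Target path: the single target is ignored and all answers are returned.
target = tqa_open_postprocessor(example["targets"], example=example, is_target=True)
# Prediction path: the detokenized bytes are decoded to text.
prediction = tqa_open_postprocessor(b"Ashley and Mary-Kate")
```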

The metric function tqa_metric is defined as:

import re
import string
from typing import Mapping, Sequence

import numpy as np
import seqio

def tqa_metric(
    targets: Sequence[Sequence[str]],
    predictions: Sequence[str]
) -> Mapping[str, seqio.Metric]:
  """Computes official TriviaQA metrics.

  Args:
    targets: list of lists of strings
    predictions: list of strings

  Returns:
    dict with "exact_match": exact match score across all targets and
    predictions
  """

  if len(targets) != len(predictions):
    raise ValueError("Number of targets and predictions must match.")

  def _normalize_answer(text):
    """Lower text and remove punctuation, articles and extra whitespace."""
    # Lowercase first so that articles are matched regardless of case.
    text = text.lower()
    # Remove articles.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Remove punctuation.
    for punc in string.punctuation:
      text = text.replace(punc, '')
    # Normalize white space
    text = " ".join(text.split())
    return text

  # Normalize answers before comparing.
  targets = [[_normalize_answer(t) for t in u] for u in targets]
  predictions = [_normalize_answer(p) for p in predictions]

  em = np.mean([
      max(pred == gt for gt in ground_truths)
      for pred, ground_truths in zip(predictions, targets)
  ])
  return {
      "exact_match": seqio.evaluation.Scalar(em),
  }
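
The scoring logic can be tried out without SeqIO or NumPy. The following is a dependency-free sketch of the same normalization and exact-match computation; the function names and the sample data are made up for illustration, and it returns a plain float instead of a seqio metric value.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercases and strips articles, punctuation, and extra whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(targets: Sequence[Sequence[str]],
                predictions: Sequence[str]) -> float:
    """Fraction of predictions matching any of their normalized targets."""
    if len(targets) != len(predictions):
        raise ValueError("Number of targets and predictions must match.")
    hits = [
        any(normalize_answer(pred) == normalize_answer(gt) for gt in gts)
        for pred, gts in zip(predictions, targets)
    ]
    return sum(hits) / len(hits) if hits else 0.0


targets = [["Mary-Kate and Ashley", "Ashley and Mary-Kate"], ["Paris"]]
predictions = ["ashley and mary-kate!", "The city of London"]
score = exact_match(targets, predictions)
```

Here the first prediction matches one of its aliases after normalization and the second does not, so score is 0.5.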

How to just use the mixture functionality in seqio

Hey there, I've been wanting to pretrain MT5 with the Huggingface training script mentioned here: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py

But sadly the Huggingface script doesn't support using a mixture to pretrain MT5 in such a way that the model generalises well on low-resource as well as high-resource languages.

Hence I've been wanting to use the mixture functionality of seqio, but upon using it I have to tokenize the data with the T5 sentencepiece vocabulary, and the seqio tasks do all the preprocessing.

The Huggingface trainer takes care of the preprocessing, mapping the dataset to the tokenizer, etc.

My question is: is there a way I could use just the mixture functionality of seqio without actually doing any preprocessing on the incoming datasets?

I was wondering if there is a way to feed in multiple datasets and get an output dataset (in text str format) that is simply an appropriate mixture of all samples of the datasets passed to the mixture function, which I could then use to pretrain with the HF trainer and do all the preprocessing there.

opened by StephennFernandes 9

Dataset seeking for restarting from a T5X crashed run using HuggingFace datasets

Re-opening here as suggested by @adarob in https://github.com/google-research/t5x/issues/421#issuecomment-1095825702.

I wrote some hacky support for HuggingFace datasets using seqio.FunctionDataSource, specifically for pretraining and further pretraining models using T5X.

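import functools

import seqio
import tensorflow as tf
from datasets import load_dataset

def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        # Compare against None so that seed=0 is still honored.
        if seed is not None:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()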
    while True:  # TODO: add for...loop over num_epochs
        for item in dataset[str(split)]:
            yield item[column]

def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )

dataset_name = 'NbAiLab/NCC'
dataset_params = {"path": dataset_name, "streaming": True}
dataset_shapes = {"train": 20830348, "validation": 473079}
source = seqio.FunctionDataSource(
    dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
    splits=("train", "validation"),
    caching_permitted=False,
    num_input_examples=dataset_shapes,
)

But unfortunately, as I face constant random crashes during training (https://github.com/google-research/t5x/issues/366), I need a way to seek to the right dataset batch to properly continue training.

I see there's a continue_from_last_checkpoint variable in get_dataset(), but it does not seem to be used for anything yet.

Is there a way to pass in the needed information to get_dataset_fn() so I can write the logic without using any hard-coded global variables?

opened by versae 8

Using seqio for T5X Dataset Generation

Hi :hugs:

I would like to pre-train a T5 Base model with the T5X library.

If I understand the pre-training process correctly, I need TFRecords stored in a cloud bucket for that training (as is done for BERT pre-training).

Now I have the following question:

How is it possible to generate such a dataset from my own corpus? The corpus is a plain text file (each line = one sentence). I also have a T5-compatible vocab (sentencepiece model), because I don't want to use the existing T5 or mT5 vocabs.

Many thanks in advance!

opened by stefan-it 7
Throw a warning instead of exiting successfully when no tasks have been selected.

This change provides a better user experience when the user has misspelt a task (youcook2_t5 vs youcook2_T5).
cla: no

opened by copybara-service[bot] 5
Adds the passthrough features option to mixture get_dataset function to specify the features after filtering. This way some nested features not defined in the output_features list could still be kept in the mixture datasets.
cla: no

opened by copybara-service[bot] 5
In certain use cases (e.g. in-context few-shot evaluation) where training examples are randomly sampled during eval time, we might want to pass in random seeds to control randomness.
cla: no

opened by copybara-service[bot] 5
Fix several issues in FewShotDataSource that cause `train` and `eval` examples to be identical in the one-shot setting.

This change addresses a potential lack of proper randomization in FewShotDataSource.

ISSUE #1

Consider the setup where num_shots=1, split=self._train_split, and shuffle=True. Furthermore, suppose that the original_source is a TfdsDataSource.

Note that TfdsDataSource.get_dataset does not implement proper example-level shuffling -- it only implements file-shard-level shuffling. For any TfdsDataSource with only a single file, this amounts to no shuffling at all even when requesting get_dataset(..., shuffle=True).

Due to this issue, FewShotDataSource will zip together two datasets that have not been shuffled, resulting in degenerate few-shot examples where the train and eval fields contain the exact same examples (this makes the few-shot learning task trivially easy).

More generally, we cannot trust original_source.get_dataset to properly implement shuffling. So, we add our own shuffle operation inside FewShotDataSource to protect it from this problem.

ISSUE #2

Consider the setup where num_shots=1, split=self._train_split, shuffle=True and seed=.

The current code applies the exact same shuffle seed to both the train and eval split before they are zipped together. This will again cause degenerate few-shot examples where the train and eval fields contain the exact same examples.

We fix this by using seed for the train split and seed + 1 for the eval split.

ISSUE #3

If the user specifies shuffle=False, then neither the train nor eval splits are shuffled. If num_shots=1 and split=self._train_split, then this will again result in degenerate few-shot examples where train and eval are identical.

To fix this, we assume that a user requesting shuffle=False only wants the eval split to be in its original unshuffled order. We assume that the user still wants the train split to be shuffled. We deterministically shuffle the train split using a seed of 0, so that determinism is still guaranteed.
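
The seeding scheme described in the three issues can be sketched with plain Python lists. This is an illustration of the idea only, not the FewShotDataSource code: the function names are hypothetical, random.Random stands in for the dataset shuffle, and falling back to seed 0 when no seed is given is an assumption of this sketch.

```python
import random
from typing import Iterable, List, Optional, Tuple


def degenerate_pairs(examples: Iterable[int], seed: int) -> List[Tuple[int, int]]:
    """Old behavior: both views are shuffled with the same seed."""
    train, evals = list(examples), list(examples)
    random.Random(seed).shuffle(train)
    random.Random(seed).shuffle(evals)
    return list(zip(train, evals))


def fewshot_pairs(examples: Iterable[int], shuffle: bool,
                  seed: Optional[int] = None) -> List[Tuple[int, int]]:
    """Fixed behavior for num_shots=1 when train and eval use the same split."""
    train, evals = list(examples), list(examples)
    if shuffle:
        base = 0 if seed is None else seed  # assumption of this sketch
        random.Random(base).shuffle(train)      # seed for the train view
        random.Random(base + 1).shuffle(evals)  # seed + 1 for the eval view
    else:
        # Eval keeps its original order; train is shuffled deterministically.
        random.Random(0).shuffle(train)
    return list(zip(train, evals))


old = degenerate_pairs(range(20), seed=42)
new = fewshot_pairs(range(20), shuffle=True, seed=42)
unshuffled = fewshot_pairs(range(20), shuffle=False)
```

In old, every pair contains the same example twice, whereas new pairs examples from two different permutations and unshuffled keeps the eval side in its original order.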
cla: no

opened by copybara-service[bot] 4
Perfectly shuffle files for better randomization.

With a limited buffer size, you are likely to pick the earlier shards irrespective of the seed in the initial cycle_length draws.

import tensorflow as tf

def sample(seed, buffer_size, cycle_length=16, num_files=10155):
  dataset = tf.data.Dataset.range(num_files)
  dataset = dataset.shuffle(buffer_size=buffer_size, seed=seed)
  dataset = dataset.interleave(
      lambda x: tf.data.Dataset.from_tensors(x).repeat(1),
      cycle_length=cycle_length,
      block_length=16)
  return list(dataset.take(cycle_length).as_numpy_iterator())

print("Before (with limited buffer size):")
for seed in range(3):
  print("seed = {}, sorted draw = {}".format(
      seed, sorted(sample(seed=seed, buffer_size=16, num_files=10155))))

print("After (with perfect shuffling):")
for seed in range(3):
  print("seed = {}, sorted draw = {}".format(
      seed, sorted(sample(seed=seed, buffer_size=10155, num_files=10155))))

Before (with limited buffer size):
seed = 0, sorted draw = [2, 3, 6, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 28]
seed = 1, sorted draw = [0, 1, 2, 4, 5, 8, 12, 13, 14, 15, 17, 19, 21, 23, 25, 27]
seed = 2, sorted draw = [0, 1, 2, 4, 5, 9, 10, 11, 12, 14, 17, 18, 19, 22, 23, 26]

After (with perfect shuffling):
seed = 0, sorted draw = [79, 190, 639, 3483, 5805, 5964, 6176, 6792, 7666, 7971, 8131, 8783, 9546, 9742, 9770, 10038]
seed = 1, sorted draw = [514, 1145, 1406, 2416, 3277, 3625, 3697, 4255, 4748, 4873, 7129, 8343, 8749, 8815, 9152, 9975]
seed = 2, sorted draw = [76, 1210, 2266, 2870, 3275, 4990, 5352, 5848, 5889, 6479, 6842, 6971, 7535, 8327, 8932, 10008]
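
The effect can also be reproduced without TensorFlow by modelling a shuffle buffer directly. The generator below is a simplified model written for illustration, not tf.data's implementation, and the name buffered_shuffle is hypothetical.

```python
import random
from typing import Iterable, Iterator, List


def buffered_shuffle(items: Iterable[int], buffer_size: int,
                     seed: int) -> Iterator[int]:
    """Yields items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered element and replace it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


num_files = 10155
limited = list(buffered_shuffle(range(num_files), buffer_size=16, seed=0))[:16]
perfect = list(buffered_shuffle(range(num_files), buffer_size=num_files, seed=0))
```

With buffer_size=16, the k-th draw can only come from the first 16 + k files, so the first 16 draws are all below 32 regardless of the seed. With a buffer as large as the file list, the result is a permutation of all files.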
cla: no

opened by copybara-service[bot] 3
[Seqio] Add a boolean arg `eval_on_fixed_exemplars` to FewshotDataSource.

If eval_on_fixed_exemplars is true, a fixed set of exemplars will be used for all eval examples. This is ignored when train_ds and eval_ds are instantiated from the same split.

TESTED=unittest
cla: no

opened by copybara-service[bot] 3
This applies disable comments (`# pytype: disable=error-class`) to any files affected by releasing --enforce-noniterable-strings to silence type errors. Previously, these errors were implicitly disabled by adding the affected files to feature blocklists; this moves the disables into the code so we can delete the blocklists.
cla: no

opened by copybara-service[bot] 3
Add adhoc dataset builder for Beam collections.

If a user has a Beam collection or transform that produces examples, but doesn't want to define a new dataset builder, then this builder can be used.

opened by copybara-service[bot] 0

How to apply the huggingface tokenizer in seqio.vocabulary

Hello.

I would like to use a huggingface tokenizer as a seqio.vocabulary in t5x.

I inherited from seqio.vocabulary and created my BBPEVocabulary. However, the 'inputs' and 'targets' values cannot be accessed as text inside tf.data.Dataset.map, because the huggingface tokenizer expects a string but tf.data.Dataset provides a tf.Tensor like Tensor("args_0:0", shape=(), dtype=string).

Since the seqio sentencepiece vocabulary loads its ops from a .so file via tf_text.sentencepiece, I don't know how to handle this inside my own vocabulary.

I would like to ask how to get and process a tf.Tensor as text in order to use the huggingface tokenizer in tf.data.Dataset.map.

I am attaching the code I used below.

Thank you:)

seqio/custom_task.py

from src.vocabularies import BBPEVocabulary
bbpe_vocab = BBPEVocabulary('custom_path')

seqio.TaskRegistry.add(
    "my_span_corruption_task",
    source=seqio.TFExampleDataSource(
        split_to_filepattern={"train": os.path.join('[MY_TF_RECORD_PATH]', "*train.tfrecord*")},
        feature_description={"text": tf.io.FixedLenFeature([], tf.string)}
    ),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,

    ],
    output_features=BBPE_OUTPUT_FEATURES,
    metric_fns=[])

seqio/preprocessors.py

def tokenize(dataset: tf.data.Dataset,
             output_features: OutputFeaturesType,
             copy_pretokenized: bool = True,
             with_eos: bool = False) -> tf.data.Dataset:
  tokenize_fn = functools.partial(
      tokenize_impl,
      output_features=output_features,
      copy_pretokenized=copy_pretokenized,
      with_eos=with_eos)
  return utils.map_over_dataset(fn=tokenize_fn)(dataset)

def tokenize_impl(features: Mapping[str, tf.Tensor],
                  output_features: OutputFeaturesType,
                  copy_pretokenized: bool = True,
                  with_eos: bool = False) -> Mapping[str, tf.Tensor]:
  ret = {}
  for k, v in features.items():
    if k in output_features:
      if copy_pretokenized:
        ret[f'{k}_pretokenized'] = v
      vocab = output_features[k].vocabulary
      v = vocab.encode_tf(v) # In this line, the `v` value type is "tf.tensor", and I can't obtain text of `v`
      ...[omitted]...

    ret[k] = v
  print(f'tokenize_impl | complete | return : {ret}')
  return ret

opened by nawnoes 0

seqio 0.0.13 cannot be installed on Apple Silicon due to transitive tensorflow dependency of clu

https://github.com/google/seqio/commit/db4d4b05de7e7151bc2b4a8310f5a7e040f528c1 added clu as a dependency of seqio.

With this change, we can no longer install seqio on Apple Silicon machines (e.g. M1, M2). This is because clu requires tensorflow (https://github.com/google/CommonLoopUtils/blob/85f9d28556f2684e2c5f2e412cbef5119d6682ba/setup.py#L54) but on Apple Silicon tensorflow should be installed as tensorflow-macos based on the instructions at https://developer.apple.com/metal/tensorflow-plugin/.

A simple fix is to update the clu tensorflow line in the setup.py to tensorflow; platform_machine == 'x86_64'. However, that project doesn't accept GitHub issues or contributions so I am creating an issue here.

opened by tuzhucheng 2