Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

Google Research

Last update: Jan 7, 2023

Overview

TAble PArSing (TAPAS)

News

2021/08/24

Added a colab to try predictions on open domain question answering.

2021/08/20

New models and code for DoT: An efficient Double Transformer for NLP tasks with tables released here.

2021/07/23

New release of NQ with tables data used in Open Domain Question Answering over Tables via Dense Retrieval. The use of the data is detailed here.

2021/05/13

New models and code for Open Domain Question Answering over Tables via Dense Retrieval released here.

2021/03/23

The upcoming NAACL 2021 short paper Open Domain Question Answering over Tables via Dense Retrieval extends the TAPAS capabilities to table retrieval and open-domain QA. We are planning to release the new models and code soon.

2020/12/17

TAPAS is added to huggingface/transformers in version 4.1.1. 28 checkpoints are added to the huggingface model hub and can be played with using a custom table question answering widget.

2020/10/19

Small change to WTQ training example creation
- Questions with ambiguous cell matches will now be discarded
- This improves denotation accuracy by ~1 point
- For more details see this issue.
Added option to filter table columns by textual overlap with question
- Based on the HEM method described in section 3.3 of Understanding tables with intermediate pre-training.

2020/10/09

Released code & models to run TAPAS on TabFact for table entailment, companion for the EMNLP 2020 Findings paper Understanding tables with intermediate pre-training.
Added a colab to try predictions on TabFact
Added new page describing the intermediate pre-training process.

2020/08/26

Added a colab to try predictions on WTQ

2020/08/05

New pre-trained models (see Data section below)
reset_position_index_per_cell: New option for training models that reset the position index whenever a new cell starts, instead of using absolute position indices.

2020/06/10

Bump TensorFlow to v2.2

2020/06/08

Released the pre-training data.

2020/05/07

Added a colab to try predictions on SQA

Installation

The easiest way to try out TAPAS with free GPU/TPU is in our Colab, which shows how to do predictions on SQA.

The repository uses protocol buffers, and requires the protoc compiler to run. You can download the latest binary for your OS here. On Ubuntu/Debian, it can be installed with:

sudo apt-get install protobuf-compiler
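
Before installing, it can help to confirm that the compiler is reachable. The check below is a small convenience sketch and not part of the TAPAS repository; the function name protoc_available is made up for illustration, and it only assumes that the binary is named protoc and is on PATH.

```python
import shutil


def protoc_available() -> bool:
    """Return True if a `protoc` binary can be found on PATH."""
    return shutil.which("protoc") is not None


if protoc_available():
    print("protoc found at", shutil.which("protoc"))
else:
    print("protoc not found; install protobuf-compiler first")
```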

Afterwards, clone and install the git repository:

git clone https://github.com/google-research/tapas
cd tapas
pip install -e .

To run the test suite we use the tox library which can be run by calling:

pip install tox
tox

Models

We provide pre-trained models for different model sizes.

The metrics are computed by our tool and are not the official metrics of the respective tasks. We provide them so that you can verify whether your own runs are in the right ballpark. They are medians over three individual runs.

Models with intermediate pre-training (2020/10/07).

New models based on the ideas discussed in Understanding tables with intermediate pre-training. Learn more about the methods used here.

WTQ

Trained from Mask LM, intermediate data, SQA, WikiSQL.

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.5062	tapas_wtq_wikisql_sqa_inter_masklm_large.zip
LARGE	reset	0.5097	tapas_wtq_wikisql_sqa_inter_masklm_large_reset.zip
BASE	noreset	0.4525	tapas_wtq_wikisql_sqa_inter_masklm_base.zip
BASE	reset	0.4638	tapas_wtq_wikisql_sqa_inter_masklm_base_reset.zip
MEDIUM	noreset	0.4324	tapas_wtq_wikisql_sqa_inter_masklm_medium.zip
MEDIUM	reset	0.4324	tapas_wtq_wikisql_sqa_inter_masklm_medium_reset.zip
SMALL	noreset	0.3681	tapas_wtq_wikisql_sqa_inter_masklm_small.zip
SMALL	reset	0.3762	tapas_wtq_wikisql_sqa_inter_masklm_small_reset.zip
MINI	noreset	0.2783	tapas_wtq_wikisql_sqa_inter_masklm_mini.zip
MINI	reset	0.2854	tapas_wtq_wikisql_sqa_inter_masklm_mini_reset.zip
TINY	noreset	0.0823	tapas_wtq_wikisql_sqa_inter_masklm_tiny.zip
TINY	reset	0.1039	tapas_wtq_wikisql_sqa_inter_masklm_tiny_reset.zip

WIKISQL

Trained from Mask LM, intermediate data, SQA.

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.8948	tapas_wikisql_sqa_inter_masklm_large.zip
LARGE	reset	0.8979	tapas_wikisql_sqa_inter_masklm_large_reset.zip
BASE	noreset	0.8859	tapas_wikisql_sqa_inter_masklm_base.zip
BASE	reset	0.8855	tapas_wikisql_sqa_inter_masklm_base_reset.zip
MEDIUM	noreset	0.8766	tapas_wikisql_sqa_inter_masklm_medium.zip
MEDIUM	reset	0.8773	tapas_wikisql_sqa_inter_masklm_medium_reset.zip
SMALL	noreset	0.8552	tapas_wikisql_sqa_inter_masklm_small.zip
SMALL	reset	0.8615	tapas_wikisql_sqa_inter_masklm_small_reset.zip
MINI	noreset	0.8063	tapas_wikisql_sqa_inter_masklm_mini.zip
MINI	reset	0.82	tapas_wikisql_sqa_inter_masklm_mini_reset.zip
TINY	noreset	0.3198	tapas_wikisql_sqa_inter_masklm_tiny.zip
TINY	reset	0.6046	tapas_wikisql_sqa_inter_masklm_tiny_reset.zip

TABFACT

Trained from Mask LM, intermediate data.

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.8101	tapas_tabfact_inter_masklm_large.zip
LARGE	reset	0.8159	tapas_tabfact_inter_masklm_large_reset.zip
BASE	noreset	0.7856	tapas_tabfact_inter_masklm_base.zip
BASE	reset	0.7918	tapas_tabfact_inter_masklm_base_reset.zip
MEDIUM	noreset	0.7585	tapas_tabfact_inter_masklm_medium.zip
MEDIUM	reset	0.7587	tapas_tabfact_inter_masklm_medium_reset.zip
SMALL	noreset	0.7321	tapas_tabfact_inter_masklm_small.zip
SMALL	reset	0.7346	tapas_tabfact_inter_masklm_small_reset.zip
MINI	noreset	0.6166	tapas_tabfact_inter_masklm_mini.zip
MINI	reset	0.6845	tapas_tabfact_inter_masklm_mini_reset.zip
TINY	noreset	0.5425	tapas_tabfact_inter_masklm_tiny.zip
TINY	reset	0.5528	tapas_tabfact_inter_masklm_tiny_reset.zip

SQA

Trained from Mask LM, intermediate data.

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.7223	tapas_sqa_inter_masklm_large.zip
LARGE	reset	0.7289	tapas_sqa_inter_masklm_large_reset.zip
BASE	noreset	0.6737	tapas_sqa_inter_masklm_base.zip
BASE	reset	0.6874	tapas_sqa_inter_masklm_base_reset.zip
MEDIUM	noreset	0.6464	tapas_sqa_inter_masklm_medium.zip
MEDIUM	reset	0.6561	tapas_sqa_inter_masklm_medium_reset.zip
SMALL	noreset	0.5876	tapas_sqa_inter_masklm_small.zip
SMALL	reset	0.6155	tapas_sqa_inter_masklm_small_reset.zip
MINI	noreset	0.4574	tapas_sqa_inter_masklm_mini.zip
MINI	reset	0.5148	tapas_sqa_inter_masklm_mini_reset.zip
TINY	noreset	0.2004	tapas_sqa_inter_masklm_tiny.zip
TINY	reset	0.2375	tapas_sqa_inter_masklm_tiny_reset.zip

INTERMEDIATE

Trained from Mask LM.

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.9309	tapas_inter_masklm_large.zip
LARGE	reset	0.9317	tapas_inter_masklm_large_reset.zip
BASE	noreset	0.9134	tapas_inter_masklm_base.zip
BASE	reset	0.9163	tapas_inter_masklm_base_reset.zip
MEDIUM	noreset	0.8988	tapas_inter_masklm_medium.zip
MEDIUM	reset	0.9005	tapas_inter_masklm_medium_reset.zip
SMALL	noreset	0.8788	tapas_inter_masklm_small.zip
SMALL	reset	0.8798	tapas_inter_masklm_small_reset.zip
MINI	noreset	0.8218	tapas_inter_masklm_mini.zip
MINI	reset	0.8333	tapas_inter_masklm_mini_reset.zip
TINY	noreset	0.6359	tapas_inter_masklm_tiny.zip
TINY	reset	0.6615	tapas_inter_masklm_tiny_reset.zip

Small Models & position index reset (2020/08/08)

Based on the pre-trained checkpoints available at the BERT github page. See the page or the paper for detailed information on the model dimensions.

Reset refers to whether the parameter reset_position_index_per_cell was set to true or false during training. In general it's recommended to set it to true.
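
To make the difference concrete, here is a minimal sketch of the two indexing schemes. It is an illustration only, not the TAPAS implementation: the function names and the string cell labels ("Q" for question tokens, "A1" for the cell in column A, row 1, and so on) are hypothetical, and the sketch assumes that each token is tagged with the cell it belongs to.

```python
from typing import Hashable, List, Sequence


def absolute_position_ids(cell_ids: Sequence[Hashable]) -> List[int]:
    """Number the tokens 0, 1, 2, ... across the whole sequence."""
    return list(range(len(cell_ids)))


def reset_position_ids_per_cell(cell_ids: Sequence[Hashable]) -> List[int]:
    """Restart the position counter at 0 whenever the cell label changes."""
    positions: List[int] = []
    previous = object()  # sentinel that never equals a real label
    index = 0
    for cell in cell_ids:
        if cell != previous:
            index = 0
            previous = cell
        positions.append(index)
        index += 1
    return positions


# Three question tokens, then cells A1 (2 tokens), B1 (1 token), A2 (3 tokens).
cells = ["Q", "Q", "Q", "A1", "A1", "B1", "A2", "A2", "A2"]
print(absolute_position_ids(cells))        # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(reset_position_ids_per_cell(cells))  # [0, 1, 2, 0, 1, 0, 0, 1, 2]
```

With the reset scheme, the largest position index is bounded by the longest cell rather than by the full sequence length.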

The accuracy depends on the respective task. It's denotation accuracy for WTQ and WIKISQL, average position accuracy with gold labels for the previous answers for SQA and Mask-LM accuracy for Mask-LM.

The models were trained in a chain as indicated by the model name. For example, sqa_masklm means the model was first trained on the Mask-LM task and then on SQA. No distillation was performed.
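
The naming scheme can be decoded mechanically. The helper below is a hypothetical sketch (parse_checkpoint_name is not part of the repository) that assumes every name has the shape tapas_<tasks>_<size>[_reset], as in the tables on this page, and reads the task list from right to left to recover the training order.

```python
from typing import Dict, List, Union

SIZES = {"large", "base", "medium", "small", "mini", "tiny"}


def parse_checkpoint_name(name: str) -> Dict[str, Union[str, bool, List[str]]]:
    """Split a checkpoint name into size, reset flag and training order."""
    stem = name[: -len(".zip")] if name.endswith(".zip") else name
    parts = stem.split("_")
    if parts[0] != "tapas":
        raise ValueError(f"unexpected prefix in {name!r}")
    parts = parts[1:]
    reset = bool(parts) and parts[-1] == "reset"
    if reset:
        parts = parts[:-1]
    if len(parts) < 2 or parts[-1] not in SIZES:
        raise ValueError(f"cannot find tasks and size in {name!r}")
    size, tasks = parts[-1], parts[:-1]
    # The rightmost task was trained first, so reverse the list.
    return {"size": size, "reset": reset, "training_order": tasks[::-1]}


info = parse_checkpoint_name("tapas_wtq_wikisql_sqa_inter_masklm_large_reset.zip")
print(info["training_order"])  # ['masklm', 'inter', 'sqa', 'wikisql', 'wtq']
```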

WTQ

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.4822	tapas_wtq_wikisql_sqa_masklm_large.zip
LARGE	reset	0.4952	tapas_wtq_wikisql_sqa_masklm_large_reset.zip
BASE	noreset	0.4288	tapas_wtq_wikisql_sqa_masklm_base.zip
BASE	reset	0.4433	tapas_wtq_wikisql_sqa_masklm_base_reset.zip
MEDIUM	noreset	0.4158	tapas_wtq_wikisql_sqa_masklm_medium.zip
MEDIUM	reset	0.4097	tapas_wtq_wikisql_sqa_masklm_medium_reset.zip
SMALL	noreset	0.3267	tapas_wtq_wikisql_sqa_masklm_small.zip
SMALL	reset	0.3670	tapas_wtq_wikisql_sqa_masklm_small_reset.zip
MINI	noreset	0.2275	tapas_wtq_wikisql_sqa_masklm_mini.zip
MINI	reset	0.2409	tapas_wtq_wikisql_sqa_masklm_mini_reset.zip
TINY	noreset	0.0901	tapas_wtq_wikisql_sqa_masklm_tiny.zip
TINY	reset	0.0947	tapas_wtq_wikisql_sqa_masklm_tiny_reset.zip

WIKISQL

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.8862	tapas_wikisql_sqa_masklm_large.zip
LARGE	reset	0.8917	tapas_wikisql_sqa_masklm_large_reset.zip
BASE	noreset	0.8772	tapas_wikisql_sqa_masklm_base.zip
BASE	reset	0.8809	tapas_wikisql_sqa_masklm_base_reset.zip
MEDIUM	noreset	0.8687	tapas_wikisql_sqa_masklm_medium.zip
MEDIUM	reset	0.8736	tapas_wikisql_sqa_masklm_medium_reset.zip
SMALL	noreset	0.8285	tapas_wikisql_sqa_masklm_small.zip
SMALL	reset	0.8550	tapas_wikisql_sqa_masklm_small_reset.zip
MINI	noreset	0.7672	tapas_wikisql_sqa_masklm_mini.zip
MINI	reset	0.7944	tapas_wikisql_sqa_masklm_mini_reset.zip
TINY	noreset	0.3237	tapas_wikisql_sqa_masklm_tiny.zip
TINY	reset	0.3608	tapas_wikisql_sqa_masklm_tiny_reset.zip

SQA

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.7002	tapas_sqa_masklm_large.zip
LARGE	reset	0.7130	tapas_sqa_masklm_large_reset.zip
BASE	noreset	0.6393	tapas_sqa_masklm_base.zip
BASE	reset	0.6689	tapas_sqa_masklm_base_reset.zip
MEDIUM	noreset	0.6026	tapas_sqa_masklm_medium.zip
MEDIUM	reset	0.6141	tapas_sqa_masklm_medium_reset.zip
SMALL	noreset	0.4976	tapas_sqa_masklm_small.zip
SMALL	reset	0.5589	tapas_sqa_masklm_small_reset.zip
MINI	noreset	0.3779	tapas_sqa_masklm_mini.zip
MINI	reset	0.3687	tapas_sqa_masklm_mini_reset.zip
TINY	noreset	0.2013	tapas_sqa_masklm_tiny.zip
TINY	reset	0.2194	tapas_sqa_masklm_tiny_reset.zip

MASKLM

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.7513	tapas_masklm_large.zip
LARGE	reset	0.7528	tapas_masklm_large_reset.zip
BASE	noreset	0.7323	tapas_masklm_base.zip
BASE	reset	0.7335	tapas_masklm_base_reset.zip
MEDIUM	noreset	0.7059	tapas_masklm_medium.zip
MEDIUM	reset	0.7054	tapas_masklm_medium_reset.zip
SMALL	noreset	0.6818	tapas_masklm_small.zip
SMALL	reset	0.6856	tapas_masklm_small_reset.zip
MINI	noreset	0.6382	tapas_masklm_mini.zip
MINI	reset	0.6425	tapas_masklm_mini_reset.zip
TINY	noreset	0.4826	tapas_masklm_tiny.zip
TINY	reset	0.5282	tapas_masklm_tiny_reset.zip

Original Models

The pre-trained TAPAS checkpoints can be downloaded here:

The first two models are pre-trained on the Mask-LM task and the last two on the Mask-LM task first and SQA second.

Fine-Tuning Data

You also need to download the task data for the fine-tuning tasks:

Pre-Training

Note that you can skip pre-training and just use one of the pre-trained checkpoints provided above.

Information about the pre-training data can be found here.

The TF examples for pre-training can be created using Google Dataflow:

python3 setup.py sdist
python3 tapas/create_pretrain_examples_main.py \
  --input_file="gs://tapas_models/2020_05_11/interactions.txtpb.gz" \
  --vocab_file="gs://tapas_models/2020_05_11/vocab.txt" \
  --output_dir="gs://your_bucket/output" \
  --runner_type="DATAFLOW" \
  --gc_project="your-project" \
  --gc_region="us-west1" \
  --gc_job_name="create-pretrain" \
  --gc_staging_location="gs://your_bucket/staging" \
  --gc_temp_location="gs://your_bucket/tmp" \
  --extra_packages=dist/tapas-0.0.1.dev0.tar.gz

You can also run the pipeline locally but that will take a long time:

python3 tapas/create_pretrain_examples_main.py \
  --input_file="$data/interactions.txtpb.gz" \
  --output_dir="$data/" \
  --vocab_file="$data/vocab.txt" \
  --runner_type="DIRECT"

This will create two tfrecord files for training and testing. The pre-training can then be started with the command below. The init checkpoint should be a standard BERT checkpoint.

python3 tapas/experiments/tapas_pretraining_experiment.py \
  --eval_batch_size=32 \
  --train_batch_size=512 \
  --tpu_iterations_per_loop=5000 \
  --num_eval_steps=100 \
  --save_checkpoints_steps=5000 \
  --num_train_examples=512000000 \
  --max_seq_length=128 \
  --input_file_train="${data}/train.tfrecord" \
  --input_file_eval="${data}/test.tfrecord" \
  --init_checkpoint="${tapas_data_dir}/model.ckpt" \
  --bert_config_file="${tapas_data_dir}/bert_config.json" \
  --model_dir="..." \
  --compression_type="" \
  --do_train

Set compression_type to GZIP if the tfrecords are compressed. You can start a separate eval job by setting --nodo_train --do_eval.
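
If you are unsure whether a local tfrecord file is compressed, one way to check is to look at its first two bytes: gzip files start with the magic bytes 0x1f 0x8b. The helper below is a sketch under that assumption; guess_compression_type is a hypothetical name, not a TAPAS function, and it only works on local paths (not gs:// URLs).

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def guess_compression_type(path: str) -> str:
    """Return "GZIP" if the file starts with the gzip magic bytes, else ""."""
    with open(path, "rb") as handle:
        return "GZIP" if handle.read(2) == GZIP_MAGIC else ""


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.tfrecord")
    gzip_path = os.path.join(tmp_dir, "compressed.tfrecord")
    with open(plain_path, "wb") as handle:
        handle.write(b"\x00\x01\x02 not compressed")
    with gzip.open(gzip_path, "wb") as handle:
        handle.write(b"compressed payload")
    print(repr(guess_compression_type(plain_path)))  # ''
    print(repr(guess_compression_type(gzip_path)))   # 'GZIP'
```

The returned string can be passed as the value of --compression_type.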

Running a fine-tuning task

We need to create the TF examples before starting the training. For example, for SQA that would look like:

python3 tapas/run_task_main.py \
  --task="SQA" \
  --input_dir="${sqa_data_dir}" \
  --output_dir="${output_dir}" \
  --bert_vocab_file="${tapas_data_dir}/vocab.txt" \
  --mode="create_data"

Optionally, to handle big tables, we can add a --prune_columns flag to apply the HEM method described in section 3.3 of our paper, which discards some columns based on textual overlap with the sentence.
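
The general idea of overlap-based column filtering can be sketched as follows. This is a simplified illustration and not the HEM implementation behind --prune_columns: the tokenizer, the overlap score, the greedy token budget and the function name prune_columns are all assumptions made for the example.

```python
import re
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def prune_columns(question: str, table: Dict[str, List[str]], max_tokens: int) -> List[str]:
    """Keep the columns that overlap most with the question, within a token budget."""
    question_tokens = set(_tokens(question))
    scored = []
    for position, (header, cells) in enumerate(table.items()):
        column_tokens = _tokens(header) + [t for cell in cells for t in _tokens(cell)]
        overlap = len(question_tokens & set(column_tokens))
        scored.append((overlap, position, header, len(column_tokens)))
    # Highest overlap first; ties keep the original column order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    kept, used = [], 0
    for _, position, header, size in scored:
        if used + size <= max_tokens:  # skip columns that would exceed the budget
            kept.append((position, header))
            used += size
    # Return the surviving headers in their original table order.
    return [header for _, header in sorted(kept)]


table = {
    "Player": ["Ann Lee", "Bob Ray"],
    "Country": ["Spain", "Chile"],
    "Goals": ["12", "7"],
}
print(prune_columns("How many goals did Ann Lee score?", table, max_tokens=8))
# ['Player', 'Goals']
```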

Afterwards, training can be started by running:

python3 tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="${output_dir}" \
  --init_checkpoint="${tapas_data_dir}/model.ckpt" \
  --bert_config_file="${tapas_data_dir}/bert_config.json" \
  --mode="train" \
  --use_tpu

This will use the preset hyper-parameters set in hparam_utils.py.

It's recommended to start a separate eval job to continuously produce predictions for the checkpoints created by the training job. Alternatively, you can run the eval job after training to only get the final results.

python3 tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="${output_dir}" \
  --init_checkpoint="${tapas_data_dir}/model.ckpt" \
  --bert_config_file="${tapas_data_dir}/bert_config.json" \
  --mode="predict_and_evaluate"

Another tool to run experiments is tapas_classifier_experiment.py. It's more flexible than run_task_main.py but also requires setting all the hyper-parameters (via the respective command line flags).

Evaluation

Here we explain some details about different tasks.

SQA

By default, SQA will evaluate using the reference answers of the previous questions. The numbers in the paper (Table 5) are computed using the more realistic setup where the previous answers are model predictions. run_task_main.py will also output additional prediction files for this setup if run on GPU.

WTQ

For the official evaluation results one should convert the TAPAS predictions to the WTQ format and run the official evaluation script. This can be done using convert_predictions.py.

WikiSQL

As discussed in the paper, our code will compute evaluation metrics that deviate from the official evaluation script (Tables 3 and 10).

Hardware Requirements

TAPAS is essentially a BERT model and thus has the same requirements. This means that training the large model with 512 sequence length will require a TPU. You can use the option max_seq_length to create shorter sequences. This will reduce accuracy but also make the model trainable on GPUs. Another option is to reduce the batch size (train_batch_size), but this will likely also affect accuracy. We added an option, gradient_accumulation_steps, that allows you to split the gradient over multiple batches. Evaluation with the default test batch size (32) should be possible on GPU.
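
The reasoning behind gradient accumulation can be checked on a toy problem. The sketch below is not how TAPAS implements gradient_accumulation_steps; it is a pure-Python illustration, with a hypothetical one-parameter model y = w * x, of why averaging the gradients of equally sized micro-batches gives the same update as one large batch when the loss is a mean over examples.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y)


def mse_gradient(w: float, batch: Sequence[Example]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w: float, batch: Sequence[Example], accumulation_steps: int) -> float:
    """Average the gradients of equally sized micro-batches."""
    if len(batch) % accumulation_steps != 0:
        raise ValueError("batch size must be divisible by accumulation_steps")
    micro_size = len(batch) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        micro_batch = batch[step * micro_size:(step + 1) * micro_size]
        total += mse_gradient(w, micro_batch)  # only one micro-batch in memory at a time
    return total / accumulation_steps


batch: List[Example] = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
                        (5.0, 9.8), (6.0, 12.1), (7.0, 14.2), (8.0, 15.9)]
full = mse_gradient(0.5, batch)
split = accumulated_gradient(0.5, batch, accumulation_steps=4)
print(math.isclose(full, split))  # True
```

The trade-off is time: each update needs accumulation_steps forward and backward passes, but only one micro-batch has to fit in memory.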

How to cite TAPAS?

You can cite the ACL 2020 paper and the EMNLP 2020 Findings paper for the later work on pre-training objectives.

Disclaimer

This is not an official Google product.

Contact information

For help or issues, please submit a GitHub issue.

Comments

Error when create_data is used on WIKISQL

I'm trying to replicate the results (to ensure I'm setting everything up properly) using the WikiSQL dataset. I am using "tapas_wikisql_sqa_masklm_small_reset.zip" model. However when I try running:

!python tapas/tapas/run_task_main.py \
  --task="WIKISQL" \
  --input_dir="data/" \
  --output_dir="results/wsql/input_data" \
  --bert_vocab_file="tapas_model/bert_config.json" \
  --mode="create_data"

I encounter the following error:

I1005 02:55:27.819439 139825599944576 sqa_utils.py:102] Total	Valid	Failed	File
56355	55775	580	train.tsv
8421	8421	0	dev.tsv
15878	15878	0	test.tsv
Creating TF examples ...
I1005 02:55:36.143442 139825599944576 run_task_main.py:152] Creating TF examples ...
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tapas/scripts/prediction_utils.py:48: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
W1005 02:55:36.144381 139825599944576 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tapas/scripts/prediction_utils.py:48: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
I1005 02:55:36.159258 139825599944576 number_annotation_utils.py:149] Can't consolidate types: (None, text: "Current slogan"
) {0: [float_value: 2013.0
, date {
  year: 2013
}
], 1: [], 2: [], 3: [], 4: [], 5: [], 6: []} 1
Traceback (most recent call last):
  File "tapas/tapas/run_task_main.py", line 782, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tapas/tapas/run_task_main.py", line 743, in main
    output_dir=output_dir)
  File "tapas/tapas/run_task_main.py", line 178, in _create_all_examples
    test_mode=test_mode)
  File "tapas/tapas/run_task_main.py", line 231, in _create_examples
    examples.append(converter.convert(interaction, i))
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/tf_example_utils.py", line 1096, in convert
    drop_rows_to_fit=self._drop_rows_to_fit)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/tf_example_utils.py", line 1018, in _to_trimmed_features
    serialized_example.tokens, feature_dict, table=table, question=question)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/tf_example_utils.py", line 673, in _to_features
    input_ids = self._to_token_ids(tokens)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/tf_example_utils.py", line 656, in _to_token_ids
    return self._tokenizer.convert_tokens_to_ids(_get_pieces(tokens))
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/tf_example_utils.py", line 332, in convert_tokens_to_ids
    return self._wp_tokenizer.convert_tokens_to_ids(word_pieces)
  File "/usr/local/lib/python3.6/dist-packages/official/nlp/bert/tokenization.py", line 190, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/usr/local/lib/python3.6/dist-packages/official/nlp/bert/tokenization.py", line 150, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[CLS]'

Is this an issue with the WikiSQL data itself, or is it the .py utility within TAPAS?

opened by aminfardi 41

Slow performance

Hi,

Thanks for releasing this as open source. Wonderful concept and very different from other seq2seq or ln2sql-like approaches.

I am facing performance issues with the SQA prediction notebook (using SQA Large). It takes more than 60 seconds on a dual-GPU machine to evaluate the model and respond to a query. Is this normal? How can we improve the prediction time?

Thanks, Manish

opened by guptam 18

Multi-gpu issues (not utilizing >1 gpu?)

Hello,

I am trying to run this codebase on a single machine with eight GPUs. I have installed all through requirements.txt and prepped the data. When I run, I am only able to use train_batch_size=8 and notice that only one of the eight GPUs is utilized (the other 7 show ~300MB of data on device while the first GPU shows ~15GB). Additionally, while I can see this usage of the GPU(s) by the run script, I get an output message in the train log of: I0514 19:25:27.752172 139628342118208 tpu_estimator.py:2965] Running train on CPU, though I have been ignoring this for now. So I am trying to get the other seven GPUs in the loop so that I can train with train_batch_size=64.

I initially tried wrapping the optimization code in:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
   # rest here from bert/optimization.py

and I notice that the model is properly replicated across the eight GPUs, however I cannot expand my train_batch_size to any multiple larger than 8. I tried wrapping the dataset object, at the end of input_fn before returning, in strategy.experimental_distribute_dataset(ds) to see if it was a matter of not sending batches to each device. However, I ran into deeper errors that I am unfamiliar with when pursuing this route (if this is a preferable way to enable multi-GPU I could update this issue with stack traces I got after running with the aforementioned changes).

Before debugging further in this direction, I tried to step back to the outer run_task_main.py after reading that you can instead pass MirroredStrategy or CentralStorageStrategy objects directly into the RunConfig that goes into an Estimator. So I undid the aforementioned changes that I manually made in the lower levels (e.g. reset repo back to master) and added to:

run_config = tf.estimator.tpu.RunConfig(
  ...
  train_distribute=strategy,
  ...

However, I now run into the error:

Traceback (most recent call last):
  File "tapas/run_task_main.py", line 777, in <module>
    app.run(main)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tapas/run_task_main.py", line 762, in main
    loop_predict=FLAGS.loop_predict,
  File "tapas/run_task_main.py", line 440, in _train_and_predict
    max_steps=tapas_config.num_train_steps,
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1156, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1219, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1255, in _actual_train_model_distributed
    input_fn, ModeKeys.TRAIN, strategy)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1009, in _get_iterator_from_input_fn
    lambda input_context: self._call_input_fn(input_fn, mode,
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 774, in make_input_fn_iterator
    input_fn, replication_mode)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 406, in make_input_fn_iterator
    input_fn, replication_mode=replication_mode)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow/python/distribute/parameter_server_strategy.py", line 318, in _make_input_fn_iterator
    self._container_strategy())
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 550, in __init__
    result = input_fn(ctx)
  File "/mydata/repos/tapas2/venv/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1010, in <lambda>
    input_context))
TypeError: _call_input_fn() takes 3 positional arguments but 4 were given

which I suspect may have to do with the functools.partial wrap around input_fn, but I am having trouble understanding this or determining next steps (I am generally unfamiliar with Tensorflow as a library).

If anybody can help me with this it would be greatly appreciated. Thanks so much for the work and time!

opened by nmallinar 16

Reproducing WikiSQL and SQA numbers

I'm trying to produce the reported numbers in the README in Colab for WikiSQL (supervised and weakly supervised) and SQA.

For WikiSQL I am using tapas_wikisql_sqa_masklm_small_reset. For the weakly supervised I tried:

!python tapas/tapas/run_task_main.py \
  --task="WIKISQL" \
  --output_dir="gs://gs-colab-bucket/data" \
  --bert_config_file="data/wikisql/model/bert_config.json" \
  --bert_vocab_file="data/wikisql/model/vocab.txt" \
  --tpu_name='grpc://10.11.116.106:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

But results were nowhere close:

I1006 22:51:43.239191 140375104935808 calc_metrics_utils.py:414] denotation_accuracy=0.004988
dev denotation accuracy: 0.0050
I1006 22:51:43.456792 140375104935808 run_task_main.py:152] dev denotation accuracy: 0.0050
I1006 22:52:38.437489 140375104935808 calc_metrics_utils.py:414] denotation_accuracy=0.006361
test denotation accuracy: 0.0064
I1006 22:52:38.888067 140375104935808 run_task_main.py:152] test denotation accuracy: 0.0064

Trying WikiSQL supervised:


!python tapas/tapas/run_task_main.py \
  --task="WIKISQL_SUPERVISED" \
  --output_dir="gs://gs-colab-bucket/data_supervised" \
  --bert_config_file="data_supervised/wikisql_supervised/model/bert_config.json" \
  --bert_vocab_file="data_supervised/wikisql_supervised/model/vocab.txt" \
  --tpu_name='grpc://10.11.116.106:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

Resulted in much better numbers, but still not matching the README:

`tf.data.TFRecordDataset(path)`
I1010 04:55:43.283555 140631071238016 calc_metrics_utils.py:414] denotation_accuracy=0.749673
dev denotation accuracy: 0.7497
I1010 04:55:43.428153 140631071238016 run_task_main.py:152] dev denotation accuracy: 0.7497
I1010 04:56:35.534448 140631071238016 calc_metrics_utils.py:414] denotation_accuracy=0.736554
test denotation accuracy: 0.7366
I1010 04:56:35.813148 140631071238016 run_task_main.py:152] test denotation accuracy: 0.7366

Next I tried SQA using tapas_sqa_masklm_large_reset:

!python tapas/tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="gs://gs-colab-bucket/data_sqa" \
  --bert_config_file="gs://gs-colab-bucket/data_sqa/sqa/model/bert_config.json" \
  --bert_vocab_file="gs://gs-colab-bucket/data_sqa/sqa/model/vocab.txt" \
  --loop_predict="false" \
  --tpu_name='grpc://10.116.216.194:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

But I again get results 5% lower than README:

I1010 20:17:49.353913 140657230886784 calc_metrics_utils.py:414] denotation_accuracy=0.642384
dev denotation accuracy: 0.6424
I1010 20:17:49.409619 140657230886784 run_task_main.py:152] dev denotation accuracy: 0.6424
Warning: Can't evaluate for dev_seq because gs://gs-colab-bucket/data_sqa/sqa/model/random-split-1-dev_sequence.tsv doesn't exist.
W1010 20:17:49.546101 140657230886784 run_task_main.py:157] Can't evaluate for dev_seq because gs://gs-colab-bucket/data_sqa/sqa/model/random-split-1-dev_sequence.tsv doesn't exist.
I1010 20:18:00.948407 140657230886784 calc_metrics_utils.py:414] denotation_accuracy=0.652722
test denotation accuracy: 0.6527
I1010 20:18:01.029566 140657230886784 run_task_main.py:152] test denotation accuracy: 0.6527
Warning: Can't evaluate for test_seq because gs://gs-colab-bucket/data_sqa/sqa/model/test_sequence.tsv doesn't exist.

I see the warnings above regarding the sequence files, but that seems to have to do with the TPU:

Warning: Skipping SQA sequence evaluation because eval is running on TPU.

Update: also tried the WTQ dataset and tapas_wtq_wikisql_sqa_masklm_large_reset model:

!python tapas/tapas/run_task_main.py \
  --task="WTQ" \
  --output_dir="gs://gs-colab-bucket/data_wtq" \
  --bert_config_file="gs://gs-colab-bucket/data_wtq/wtq/model/bert_config.json" \
  --bert_vocab_file="gs://gs-colab-bucket/data_wtq/wtq/model/vocab.txt" \
  --loop_predict="false" \
  --tpu_name='grpc://10.92.172.218:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

Getting results 10% lower than the README:

`tf.data.TFRecordDataset(path)`
I1011 04:48:55.989768 140620040472448 calc_metrics_utils.py:414] denotation_accuracy=0.387544
dev denotation accuracy: 0.3875
I1011 04:48:56.042782 140620040472448 run_task_main.py:152] dev denotation accuracy: 0.3875
I1011 04:49:14.033151 140620040472448 calc_metrics_utils.py:414] denotation_accuracy=0.390654
test denotation accuracy: 0.3907
I1011 04:49:14.111141 140620040472448 run_task_main.py:152] test denotation accuracy: 0.3907

I'm wondering if anyone has been able to successfully replicate the reported numbers?

opened by aminfardi 15

Not able to run sample code

Hi, I am stuck at the result path in the sample code. It throws the following error: [Errno 2] No such file or directory: 'results/sqa/model/test_sequence.tsv'. I even checked the results folder and did not find the test_sequence.tsv file.

opened by panddu15 14

running Colab notebook

Hi all,

When trying to run the Colab notebook I get:

FileNotFoundError: [Errno 2] No such file or directory: 'results/sqa/model/test_sequence.tsv'

Also, when I tried to run the run_task_main.py on my local device I get ModuleNotFoundError: No module named 'tapas'

I've checked that all the folders contain an __init__.py file, and I'm not sure how to run it. Can you please advise?

opened by eliwilner 12

Accommodating large tables

I tried to run the prediction on a table of size 1.8M X 60 (using the sqa large model) in the example colab file. The following are the steps that I had taken:

Changed the following line in bert_config.json: max_position_embeddings: 2048

Changed max_seq_length: max_seq_length = 2048

However, it just returns the table without any predictions.

Could you please provide the changes that one should make to accommodate a table of such size. Thank you.

enhancement

opened by sbhttchryy 11

Trained Model on WikiTable Questions

Hi, can you share the trained model on WikiTableQuestion please?

I tried all the base models that you shared without doing any fine-tuning and got an error like this: tensorflow.python.framework.errors_impl.NotFoundError: Key column_output_bias not found in checkpoint. I guess the base models are not trained using WikiTableQuestions, right?

opened by mdmustafizurrahman 11

Pruning method in WTQ

Hello,

I am new to this topic and I'm currently trying to use the pruning/filtering method for long tables in the WTQ notebook. I tried using the flag --prune_columns in the prediction function, but it still gives me "Can't convert interaction: error: Sequence too long". What are the necessary steps to filter/prune long tables during prediction?

Thank you in advance.

opened by sophgit 9

Got AttributeError: 'GFile' object has no attribute 'readable'

Hello,

While creating model data for the WTQ task, I got AttributeError: 'GFile' object has no attribute 'readable'

COMMAND USED:
! python tapas/tapas/run_task_main.py \
  --task="WTQ" \
  --input_dir="{input_dir}" \
  --output_dir="{output_dir}" \
  --bert_vocab_file="/content/tapas_model/vocab.txt" \
  --mode="create_data"

ERROR :


`WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Creating interactions ...
I0928 05:29:55.827632 140189694424960 run_task_main.py:152] Creating interactions ...
I0928 05:29:55.827936 140189694424960 wtq_utils.py:152] Converting data from: training.tsv...
Traceback (most recent call last):
  File "tapas/tapas/run_task_main.py", line 782, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tapas/tapas/run_task_main.py", line 736, in main
    task_utils.create_interactions(task, FLAGS.input_dir, output_dir)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/task_utils.py", line 110, in create_interactions
    wtq_utils.convert(input_dir, output_dir)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/wtq_utils.py", line 241, in convert
    _convert_data(table_cache, input_dir, output_dir, train_file, version)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/wtq_utils.py", line 168, in _convert_data
    table = _read_wtq_table(input_dir, wtq_table_id)
  File "/usr/local/lib/python3.6/dist-packages/tapas/utils/wtq_utils.py", line 96, in _read_wtq_table
    dtype='str',
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1880, in __init__
    src = TextIOWrapper(src, encoding=encoding, newline="")
AttributeError: 'GFile' object has no attribute 'readable'`
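
The traceback shows a text wrapper being built around a file object that lacks a readable() method. One generic workaround for that class of problem is to read the whole file first and hand the parser an in-memory buffer instead. The sketch below illustrates this with the standard library only; ReadOnlyHandle is a hypothetical stand-in for such a file object, and to_text_buffer is an illustrative helper, not part of TAPAS, TensorFlow, or pandas.

```python
import csv
import io


class ReadOnlyHandle:
    """Hypothetical stand-in for a file object that offers read() but no readable()."""

    def __init__(self, text):
        self._text = text

    def read(self):
        return self._text


def to_text_buffer(handle):
    """Read everything from handle and return it as an io.StringIO buffer."""
    data = handle.read()
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return io.StringIO(data)


handle = ReadOnlyHandle("col_a,col_b\n1,2\n")
buffer = to_text_buffer(handle)
rows = list(csv.reader(buffer))
print(rows)
```

The same kind of buffer can be passed to pandas.read_csv, which accepts file-like objects. Whether this is the right fix here depends on the installed pandas and TensorFlow versions, which the report does not state.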

opened by sparshbhawsar 9

TF records for pretraining data

@muelletm @ebursztein Can you release the TFRecords data for pre-training? I want to develop a 6-layer pre-trained checkpoint for TAPAS.

I am currently running the pre-training data generation on my single-CPU machine, and it has been running for 3 days. Does it take this long on a CPU? Should it create any temporary files? I cannot see the train.tfrecords and test.tfrecords files even though it has been running for 3 days.

Can we run the pre-training data generation on a GPU? I tried to allocate the GPU, but the existing code does not seem to use it.

opened by mdmustafizurrahman 8
Does anyone know of a Tapas tokenizer that is written in Java, C, or Rust?

Basically, I'm trying to use Hugging Face's TAPAS transformer to make an open-domain Q&A bot, but I need to tokenize the strings and the table. I've been trying to use Deep Java Library to run it on Android, but I don't think I could do that without rewriting Hugging Face's tokenizer in Java from scratch. Does anyone know if there's a TAPAS model that runs on Android/Java?
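
For anyone considering a port, the core of table tokenization is flattening the question and the table into one token sequence with parallel segment, column, and row indices. The sketch below shows only that flattening step under simplifying assumptions: whitespace splitting instead of WordPiece, 1-based column ids, and row id 0 for the header. The function flatten_table is hypothetical, and a real tokenizer emits additional token type ids that are omitted here.

```python
def flatten_table(question, header, rows):
    """Flatten a question and a table into tokens with segment/column/row ids."""
    tokens, segment_ids, column_ids, row_ids = [], [], [], []

    def add(token, segment, column, row):
        tokens.append(token)
        segment_ids.append(segment)
        column_ids.append(column)
        row_ids.append(row)

    # Question part: segment 0, no column or row.
    add("[CLS]", 0, 0, 0)
    for word in question.split():
        add(word, 0, 0, 0)
    add("[SEP]", 0, 0, 0)

    # Header cells: segment 1, row 0.
    for col, name in enumerate(header, start=1):
        for word in name.split():
            add(word, 1, col, 0)

    # Data cells: segment 1, rows and columns counted from 1.
    for r, row in enumerate(rows, start=1):
        for col, cell in enumerate(row, start=1):
            for word in cell.split():
                add(word, 1, col, r)

    return {
        "tokens": tokens,
        "segment_ids": segment_ids,
        "column_ids": column_ids,
        "row_ids": row_ids,
    }


encoded = flatten_table("how many", ["a", "b"], [["1", "2"]])
print(encoded["tokens"])
```

Because the logic is plain loops over strings and lists, this part is straightforward to rewrite in Java, C, or Rust; the subword vocabulary lookup is the piece that needs a proper WordPiece implementation.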

opened by memetrusidovski 0
docs: demo, experiments and live inference API on Tiyaro
Hello Maintainer of Github repo google-research/tapas!

Thank you for your work on google-research/tapas. This GitHub project is interesting, and we think that it would be a great addition to make this work instantly discoverable & available as an API for all your users, to quickly try and use it in their applications.

The list of model card(s) covered by this PR are:

https://console.tiyaro.ai/explore/google-tapas-base-finetuned-wtq-3434648

https://console.tiyaro.ai/explore/google-tapas-large-finetuned-wtq-9758148

On Tiyaro, every model in google-research/tapas will get its own:

Dedicated model card (e.g. https://console.tiyaro.ai/explore/google-tapas-base-finetuned-wtq-3434648)

Model demo (e.g. https://console.tiyaro.ai/explore/google-tapas-base-finetuned-wtq-3434648/demo)

Unique Inference API (e.g. https://api.tiyaro.ai/explore/huggingface/1//google/tapas-base-finetuned-wtq)

Sample code snippets and swagger spec for the API

Users will also be able to compare your model with other models of similar types on various parameters using Tiyaro Experiments (https://tiyaro.ai/blog/ocr/)

I am from Tiyaro.ai (https://tiyaro.ai/). We are working on enabling developers to instantly evaluate, use and customize the world's best AI. We are constantly working on adding new features to Tiyaro EasyTrain, EasyServe & Experiments, to make the best use of your ML model, and making AI more accessible for anyone.

Sincerely, I-Jong Lin
opened by ijonglin 0
Wrong calculation in table

Hi,

For the table below, I searched for "total profit for delhi?". Delhi is not in the table, but the model calculates the sum of profit over all the rows.

This seems to happen for every table if the search keyword is not in the table. Please suggest how to proceed.

I also tried a similar query on Hugging Face.
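
One possible guard on the application side, sketched under the assumption that the filter value in the question should literally appear in some cell: check the question against the table before trusting an aggregated answer. The helper mentioned_cell_values is hypothetical and does not change how the model itself behaves.

```python
import re


def mentioned_cell_values(question, rows):
    """Return the cell values whose words all occur in the question."""
    question_tokens = set(re.findall(r"\w+", question.lower()))
    hits = set()
    for row in rows:
        for cell in row:
            cell_tokens = re.findall(r"\w+", str(cell).lower())
            if cell_tokens and all(t in question_tokens for t in cell_tokens):
                hits.add(str(cell))
    return hits


rows = [["Mumbai", "10"], ["Pune", "20"]]
print(mentioned_cell_values("total profit for delhi?", rows))  # set()
print(mentioned_cell_values("total profit for pune?", rows))   # {'Pune'}
```

An empty result suggests the question refers to something that is not in the table, in which case the application could decline to report a sum. This is a coarse heuristic and will miss paraphrases or partial matches.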

opened by tiwari93 0
Fix for issue #160

Split the cell containing the "clone and install" and "run import" code into two cells, and added a note reminding the user to restart the runtime.

Executing everything in a single cell does not give the user a chance to restart the runtime, which eventually results in a ModuleNotFoundError for imports of tapas.utils functions.

opened by bharatji30 0
Populate float_answer for Tapas Weak supervision for aggregation (WTQ). TypeError: Parameter to CopyFrom() must be instance of same class: expected language.tapas.Question got str.

I am trying to fine-tune TAPAS following the instructions here: https://huggingface.co/transformers/v4.3.0/model_doc/tapas.html#usage-fine-tuning (weak supervision for aggregation, WTQ), using the dataset from https://www.microsoft.com/en-us/download/details.aspx?id=54253 , which follows the required SQA format: TSV files with most of the named columns. However, there is no float_answer column. As mentioned in the documentation:

float_answer: the float answer to the question, if there is one (np.nan if there isn’t). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)

Since I am using WTQ, I need the float_answer column. I tried populating float_answer based on answer_text as suggested here, using https://github.com/google-research/tapas/blob/master/tapas/utils/interaction_utils_parser.py 's parse_question(table, question, mode) function. However, I am getting errors.

I copied everything from here and put these args: .

But, I get this error: TypeError: Parameter to CopyFrom() must be instance of same class: expected language.tapas.Question got str.

1) Can you please help me understand which args I should use, or how else I can populate float_answer?

I am using table_csv and the question, the answer to which is in the given table:

2) We have also tried simply adding a float_answer column with all values set to np.nan. That crashed, too.

encoding["float_answer"] = torch.tensor(float("nan"))

Is there a tutorial for WTQ fine-tuning? Thanks!
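
A minimal sketch of one way to derive a float_answer value from an answer_text list, assuming the rule is: a single answer that parses as a number becomes that number, and anything else becomes NaN. The helper parse_float_answer is hypothetical and much simpler than the parsing in interaction_utils_parser.py; it ignores dates, units, and other formats.

```python
import math


def parse_float_answer(answer_texts):
    """Return the numeric value of a single-answer list, or NaN otherwise."""
    if len(answer_texts) != 1:
        return math.nan
    # Drop thousands separators such as "1,234.5" before parsing.
    text = answer_texts[0].replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        return math.nan


print(parse_float_answer(["1,234.5"]))  # 1234.5
print(parse_float_answer(["Paris"]))    # nan
print(parse_float_answer(["1", "2"]))   # nan
```

The result can then be written into a float_answer column next to answer_text before the examples are encoded.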

opened by ayazhankadessova 0

Owner

Google Research

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and Twitter-Stanza p

84 Dec 20, 2022

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

83 Jan 9, 2023

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

44 Dec 31, 2022

Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Speaker-Embeddings-Correlation-Pooling This is the original implementation of the pooling method introduced in "Speaker embeddings by modeling channel

10 Apr 30, 2022

The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Graformer The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models Graformer (also named BridgeTransformer in t

22 Dec 14, 2022

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

3.2k Dec 31, 2022

Use PaddlePaddle to reproduce the paper：mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

MT5_paddle Use PaddlePaddle to reproduce the paper：mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer English | 简体中文 mT5: A Massively

2 Oct 17, 2021

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

9 Nov 7, 2022

Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

152 Sep 2, 2022

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

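Background, Installation, Quick start: (1) pre-trained models, (2) machine translation, (3) text classification. TenTrans advanced: 1. multilingual machine translation, 2. cross-lingual pre-training. Background: TrenTrans is a unified end-to-end multilingual, multi-task pre-training platform that supports multiple pre-training methods as well as sequence generation and natural language understanding tasks. Installation: git clone git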

Tencent Minority-Mandarin Translation Team

42 Dec 20, 2022

Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

14 Nov 2, 2022

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

148 Dec 26, 2022

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

237 Jan 2, 2023

PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

605 Jan 2, 2023

SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Introduction This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper. Chen, Jia, et al. "Axiomatically Re

17 Nov 9, 2022

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

109 Dec 2, 2022

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

Guide to using pre-trained large language models of source code

Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃