Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

Overview

PEGASUS library

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses the self-supervised objective Gap Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper can be found on arXiv and was accepted at ICML 2020.

If you use this code or these models, please cite the following paper:

@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Results update

We trained a PEGASUS model on both C4 and HugeNews with sampled gap sentence ratios and stochastically sampled important sentences. The updated results are reported in the table below.

dataset         C4                  HugeNews            Mixed & Stochastic
xsum            45.20/22.06/36.99   47.21/24.56/39.25   47.60/24.83/39.64
cnn_dailymail   43.90/21.20/40.76   44.17/21.47/41.11   44.16/21.56/41.30
newsroom        45.07/33.39/41.28   45.15/33.51/41.33   45.98/34.20/42.18
multi_news      46.74/17.95/24.26   47.52/18.72/24.91   47.65/18.75/24.95
gigaword        38.75/19.96/36.14   39.12/19.86/36.24   39.65/20.47/36.76
wikihow         43.07/19.70/34.79   41.35/18.51/33.42   46.39/22.12/38.41 *
reddit_tifu     26.54/8.94/21.64    26.63/9.01/21.60    27.99/9.81/22.94
big_patent      53.63/33.16/42.25   53.41/32.89/42.07   52.29/33.08/41.66 *
arxiv           44.70/17.27/25.80   44.67/17.18/25.73   44.21/16.95/25.67
pubmed          45.49/19.90/27.69   45.09/19.56/27.42   45.97/20.15/28.25
aeslc           37.69/21.85/36.84   37.40/21.22/36.45   37.68/21.25/36.51
billsum         57.20/39.56/45.80   57.31/40.19/45.82   59.67/41.58/47.59

The "Mixed & Stochastic" model has the following changes:

  • trained on both C4 and HugeNews (the dataset mixture is weighted by their number of examples).
  • trained for 1.5M steps instead of 500k (we observe slower convergence on pretraining perplexity).
  • the model uniformly samples a gap sentence ratio between 15% and 45%.
  • important sentences are sampled by applying 20% uniform noise to the importance scores.
  • the sentencepiece tokenizer is updated to be able to encode the newline character.
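
The two stochastic changes in the list above can be illustrated with a small sketch. This is not the repository's implementation (the real selection happens inside the pretraining parsing ops). The function name select_gap_sentences, the multiplicative form of the noise, and the toy importance scores are all assumptions made for illustration.

```python
import random


def select_gap_sentences(scores, rng, min_ratio=0.15, max_ratio=0.45, noise=0.2):
    """Pick the indices of sentences to mask as gap sentences.

    scores: one importance score per sentence (hypothetical values here).
    The ratio range and the 20% noise level follow the list above; applying
    the noise multiplicatively is an assumption of this sketch.
    """
    # Uniformly sample the gap sentence ratio for this document.
    ratio = rng.uniform(min_ratio, max_ratio)
    # Aim for at least one gap sentence per document.
    num_gaps = max(1, round(ratio * len(scores)))
    # Perturb every score by up to +/- 20% before ranking.
    noisy = [s * (1.0 + rng.uniform(-noise, noise)) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    # Return the selected sentence positions in document order.
    return sorted(ranked[:num_gaps])


rng = random.Random(0)
toy_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8, 0.05]
print(select_gap_sentences(toy_scores, rng))
```

With ten sentences the sketch selects between two and four of them, and sentences with close importance scores can swap ranks from one call to the next, which is the intended effect of the noise.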

(*) The numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data:

  • The wikihow dataset contains newline characters, which are useful for paragraph segmentation. The sentencepiece tokenizer of the C4 and HugeNews models does not encode newlines and loses this information.
  • We updated the BigPatent dataset to preserve casing, and some format cleaning was also changed; please refer to the change in TFDS.

Setup

create an instance on Google Cloud with GPU (optional)

Please create a project first, then create an instance:

gcloud compute instances create \
  ${VM_NAME} \
  --zone=${ZONE} \
  --machine-type=n1-highmem-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --boot-disk-size=500GB \
  --image-project=ml-images \
  --image-family=tf-1-15 \
  --maintenance-policy TERMINATE --restart-on-failure

install library and dependencies

Clone the library from GitHub and install the requirements.

git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt

Download the vocab and the pretrained and fine-tuned checkpoints of all experiments from Google Cloud.

Alternatively, in a terminal, follow the instructions to install gsutil, then run:

mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/

Finetuning on downstream datasets

on existing dataset

Finetune on an existing dataset aeslc.

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

If you would like to finetune on a subset of a dataset, please refer to the example of input pattern.

Evaluate on the finetuned dataset.

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Note that the above example uses a single GPU, so the batch_size is much smaller than the one used for the results reported in the paper.

add new finetuning dataset

Two types of dataset format are supported: TensorFlow Datasets (TFDS) or TFRecords.

This tutorial shows how to add a new dataset in TFDS. (The fine-tuning dataset is expected to be supervised; please provide supervised_keys in the dataset info.)

The TFRecords format requires each record to be a tf.Example with features {"inputs": tf.string, "targets": tf.string}.
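
As a rough sketch of what such a file could look like in practice, the snippet below writes text pairs as tf.Example records with the two string features named above. The helper names to_record and write_tfrecord, the toy data, and the output file name are hypothetical; only the "inputs" and "targets" feature names come from the description above.

```python
def to_record(inputs, targets):
    """Encode one example as the byte strings stored in the string features."""
    if not inputs or not targets:
        raise ValueError("both inputs and targets must be non-empty")
    return {"inputs": inputs.encode("utf-8"), "targets": targets.encode("utf-8")}


def write_tfrecord(pairs, path):
    """Write (inputs, targets) text pairs to a TFRecord file at `path`."""
    # Imported lazily so that to_record can be used without TensorFlow installed.
    import tensorflow as tf

    with tf.io.TFRecordWriter(path) as writer:
        for inputs, targets in pairs:
            feature = {
                key: tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
                for key, value in to_record(inputs, targets).items()
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


# Example call (hypothetical file name and toy data):
#   write_tfrecord([("A long input document.", "A short summary.")],
#                  "new_dataset_files.tfrecord-00000")
print(sorted(to_record("A long input document.", "A short summary.")))
```

Keeping the encoding step separate from the writer makes it easy to validate examples before any file is produced.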

For example, if you registered a TFDS dataset called new_tfds_dataset for training and evaluation, and have some files in TFRecord format called new_dataset_files.tfrecord* for testing, they can be registered in /pegasus/params/public_params.py:

@registry.register("new_params")
def my_param(param_overrides):
  return public_params.transformer_params(
      {
          "train_pattern": "tfds:new_tfds_dataset,train",
          "dev_pattern": "tfds:new_tfds_dataset,validation",
          "test_pattern": "tfrecord:new_dataset_files.tfrecord*",
          "max_input_len": 512,
          "max_output_len": 128,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)

Evaluation metrics.

Evaluation results can be found in model_dir. Summarization metrics are automatically calculated for each evaluation point.

  • ROUGE is the main metric for summarization quality.

  • BLEU is an alternative quality metric for language generation.

  • Extractive Fragments Coverage & Density are metrics that measure the abstractiveness of the summary.

  • Repetition Rates measure generation repetition failure modes.

  • Length statistics measure the length distribution of decodes compared to the gold summaries.
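
To make the abstractiveness metrics more concrete, here is a minimal sketch of extractive fragment coverage and density, assuming the greedy fragment definition from the Newsroom paper (Grusky et al., 2018). The function names and the whitespace tokenization are assumptions; the repository's own implementation may tokenize and normalize differently.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token sequences shared with the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        # Find the longest article span that matches the summary from position i.
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def coverage_and_density(article, summary):
    """Coverage: share of summary tokens inside fragments.
    Density: average squared fragment length per summary token."""
    if not summary:
        return 0.0, 0.0
    fragments = extractive_fragments(article, summary)
    coverage = sum(len(f) for f in fragments) / len(summary)
    density = sum(len(f) ** 2 for f in fragments) / len(summary)
    return coverage, density


article = "the cat sat on the mat".split()
summary = "a cat sat down".split()
print(coverage_and_density(article, summary))
```

Under this definition a fully copied summary has coverage 1.0 and a density equal to its length, while a summary that shares no tokens with the article scores 0.0 on both.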

Several types of output files can be found in model_dir:

  • text_metrics-*.txt: above metrics in text format. Each row contains metric name, 95% lower bound value, mean value, 95% upper bound value.
  • inputs-*.txt, targets-*.txt, predictions-*.txt: raw text files of model inputs/outputs.
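
The rows of a text_metrics file can be loaded with a few lines of Python. The sketch below assumes the four fields of each row are comma-separated, which is not stated above and should be checked against an actual file; the metric names and values in the sample are made up.

```python
def parse_text_metrics(lines):
    """Return {metric_name: (lower, mean, upper)} from rows of a metrics file.

    Assumes comma-separated rows: name, 95% lower bound, mean, 95% upper bound.
    """
    metrics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so that only the last three fields are parsed as numbers.
        name, low, mean, high = line.rsplit(",", 3)
        metrics[name] = (float(low), float(mean), float(high))
    return metrics


# Made-up rows for illustration; real files are read with open(path).
sample = ["rouge1-F,0.40,0.42,0.44", "rouge2-F,0.19,0.21,0.23"]
for name, (low, mean, high) in parse_text_metrics(sample).items():
    print(f"{name}: {mean:.2f} (95% bounds {low:.2f} to {high:.2f})")
```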

Pre-training

Pretraining (on C4 or any other corpus) requires a custom-built TensorFlow that includes ops for on-the-fly parsing, which process raw text documents into model input and target ids. Please refer to pegasus/ops/pretrain_parsing_ops.cc and pegasus/data/parsers.py for details.

Acknowledgements

Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich.

Comments
  • Extractive Prediction Instead of Abstractive Prediction

    Hi! I have tried to run the pre-trained model to test it on my dataset which consists of paragraphs as inputs and one line sentence as targets. The problem was when I saw the prediction it was extracted from the input instead of generating one as expected.

    kandarp@kandarp:~/Downloads/pegasus$ python3 pegasus/bin/evaluate.py --params=new_params --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 --model_dir=ckpt/pegasus_ckpt
    

    The new_params is the new .tfrecords dataset for testing.

    In the output, I am getting the following output:

    I0612 16:45:32.333093 140653653737856 text_eval.py:126] INPUTS: [0]:
    Live in the country and last three years longer than my city friends? Good news indeed, more backing for a lifestyle choice made half a lifetime ago when it seemed a good idea to exchange an Edinburgh terrace for a farm cottage. I knew it was a good idea because I had been there before. Born and reared on a farm I had been seduced for a few years by the idea of being a big shot who lived and worked in a city rather than only going for the day to wave at the buses. True, I was familiar with some of the minor disadvantages of country living such as an iffy private water supply sometimes infiltrated by a range of flora and fauna (including, on one memorable occasion, a dead lamb), the absence of central heating in farmhouses and cottages, and a single track farm road easily blocked by snow, broken-down machinery or escaped livestock. But there were many advantages as I told Liz back in the mid-Seventies. Town born and bred, eight months pregnant and exchanging a warm, substantial Corstorphine terrace for a windswept farm cottage on a much lower income, persuading her that country had it over town might have been difficult.
    I0612 16:45:32.334013 140653653737856 text_eval.py:126] TARGETS: Although there are many advantages of country living, it is still difficult to persuade a town- born and bred person to live in the country due to disadvantages and inconvenience of country living life.
    I0612 16:45:32.335105 140653653737856 text_eval.py:126] PREDICTIONS: Good news indeed, more backing for a lifestyle choice made half a lifetime ago when it seemed a good idea to exchange an Edinburgh terrace for a farm cottage.
    

    The prediction is the 2nd line of the Input.

    Is there a mistake by me or is it the problem of the model?

    opened by kandarpkakkad 38
  • How to generate abstractive summary

    Hi - I was able to generate an extractive summary by referring to this link, but I am stuck on how to generate an abstractive summary. I would like to try this on sample text in the Colab notebook below and then work on adding a separate dataset.

    https://colab.research.google.com/drive/1vRfhz_arrgnmbnLSr3c7YTGJfBH6TvGE?usp=sharing

    Please do suggest how to get abstractive summary.

    opened by chetanambi 15
  • Unable to run Readme example

    Hi! I run your work on google colab. At this step:

    !python3 pegasus/bin/train.py --params=aeslc_transformer \
    --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
    --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
    --model_dir=ckpt/pegasus_ckpt/aeslc

    The error comes up.

    Traceback (most recent call last):
      File "pegasus/bin/train.py", line 17, in <module>
        from pegasus.data import infeed
      File "/usr/local/lib/python3.6/dist-packages/pegasus/__init__.py", line 1, in <module>
        from pegasus.parser import *
      File "/usr/local/lib/python3.6/dist-packages/pegasus/parser.py", line 10, in <module>
        from pegasus.rules import _build_rule, ParseError, Lazy
      File "/usr/local/lib/python3.6/dist-packages/pegasus/rules.py", line 62
        print 'pegasus: {}\x1b[2;38;5;241menter {} -> {}\x1b[m'.format(depth, repr(char()), _name)
        ^
    SyntaxError: invalid syntax

    opened by matt9704 13
  • Test set output summaries for CNN/DM

    Hi Jingqing,

    I am looking to run some analysis on PEGASUS' output summaries for CNN/DM. Can I get hold of these in any way? (I know I could just run the model to produce them myself but thought I would ask if they are already available before doing this!)

    Great work on all this!

    All the best, Alex

    opened by alexgaskell10 11
  • CUDA out of memory

    I ran the fine-tuning scripts in a virtual environment and it worked. Later on, I created a new virtual environment, and when I run the model again the following error keeps popping up:

    RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.28 GiB already allocated; 4.55 MiB free; 1.28 GiB reserved in total by PyTorch)

    Note: batch size is 1.

    The fine-tuning script: https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3

    opened by karimfayed 10
  • Resource and time taken

    Hey, if I pass a TFRecord with just one example, with features inputs and targets, the evaluation runs, but after it finishes only the text_metrics-2-.dev.txt file is created; the predictions, targets and inputs text files are not. So I can't view the prediction generated by the model if I pass a single example in the TFRecord.

    If I pass a TFRecord with 9 examples, it uses the whole 60 GB of RAM on the instance and takes much longer to finish the task.

    Could you explain why this is happening?

    opened by rohithsiddhartha 10
  • Not able to run readme example

    Followed the steps mentioned in the ReadMe. Installations done. Trying to run following example =>

    python pegasus/bin/evaluate.py --params=aeslc_transformer --param_overrides=vocab_filename=./ckpt/pegasus_ckpt/aeslc/model.ckpt-32000.data-00000-of-00001,batch_size=1,beam_size=5,beam_alpha=0.6 --model_dir=./ckpt/pegasus_ckpt/aeslc
    

    Error/Logs

    WARNING:tensorflow:From pegasus/bin/evaluate.py:152: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.
    
    WARNING:tensorflow:From pegasus/bin/evaluate.py:153: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
    
    WARNING:tensorflow:From pegasus/bin/evaluate.py:85: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use standard file APIs to check for files with this prefix.
    W0623 13:05:34.706781 140089317664512 deprecation.py:323] From pegasus/bin/evaluate.py:85: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use standard file APIs to check for files with this prefix.
    WARNING:tensorflow:From /home/ubuntu/pegasus/pegasus/ops/public_parsing_ops.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
    
    W0623 13:05:34.710312 140089317664512 module_wrapper.py:139] From /home/ubuntu/pegasus/pegasus/ops/public_parsing_ops.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
    
    python: /sentencepiece/src/../third_party/protobuf-lite/google/protobuf/stubs/stringpiece.h:230: google::protobuf::StringPiece::StringPiece(const char*, google::protobuf::stringpiece_ssize_type): Assertion `len >= 0' failed.
    Fatal Python error: Aborted
    
    Current thread 0x00007f6916042700 (most recent call first):
      File "/root/anaconda3/lib/python3.7/site-packages/sentencepiece.py", line 75 in LoadFromSerializedProto
      File "/home/ubuntu/pegasus/pegasus/ops/public_parsing_ops.py", line 94 in __init__
      File "/home/ubuntu/pegasus/pegasus/ops/public_parsing_ops.py", line 75 in create_text_encoder
      File "/home/ubuntu/pegasus/pegasus/params/public_params.py", line 95 in transformer_params
      File "/home/ubuntu/pegasus/pegasus/params/public_params.py", line 162 in aeslc_transformer
      File "pegasus/bin/evaluate.py", line 110 in main
      File "/root/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250 in _run_main
      File "/root/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299 in run
      File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
      File "pegasus/bin/evaluate.py", line 153 in <module>
    Aborted (core dumped)
    

    System - Ubuntu 16.04
    Python - anaconda3, python 3.7.3
    GPU - 16gb Tesla T4
    Tensorflow pkgs

    mesh-tensorflow==0.1.13
    tensorflow==1.15.3
    tensorflow-datasets==3.1.0
    tensorflow-estimator==1.15.1
    tensorflow-gan==2.0.0
    tensorflow-gpu==1.15.2
    tensorflow-hub==0.8.0
    tensorflow-metadata==0.22.2
    tensorflow-probability==0.7.0
    tensorflow-text==1.15.0rc0
    
    opened by allhelllooz 9
  • Max input length for Reddit_tifu and xsum

    Hi, @JingqingZ

    According to the hyperparameter table reported in the paper, the max input tokens kept for the xsum and reddit datasets were 512, whereas in the dataset registry they are defined as 1024. Why so?

    Moreover, the checkpoints uploaded on gcloud were obtained using the hyperparameters (including the max token length) defined in the dataset registry, right? For some datasets, the train steps in the dataset registry differ from the checkpoint available on gcloud, but I guess that is the one with the best validation result.

    Please give me some insight into the max input length. I am quite confused about it.

    Edit: One more thing, why did you report RougeL-F for pubmed dataset and not the RougeLsum-F?

    opened by agenius5 8
  • Is it possible to fine-tune pegasus with wikihow dataset in google colaboratory?

    Hello. I'm using pegasus for my school work. I'd like to try to fine-tune pegasus with wikihow dataset.

    I've downloaded the wikihow csv data "wikihowAll.csv" and "wikihowSep.csv", then placed the files in the designated directory. After some processing, I was able to run train.py in the environment.

    But... within a few minutes, train.py stops due to OOM. So I tried to trim the csv data to 10000 rows, 1000 rows, 100 rows... They all failed due to OOM as well. I suspect it is not a csv problem.

    If you know how to fine-tune with wikihow datasets, please let me know!

    opened by Papillon6814 7
  • Input length greater than 1024

    Hi,

    I am trying to understand if I can have input lengths greater than 1024. Can I set the pubmed max_input_length > 1024 (in public_params) and then fine-tune the model on pubmed data? Currently the max_input_length for pubmed is 1024.

    I ask because I am trying to understand the 2nd line in the following paragraph from the paper :

    "CNN/DailyMail, Multi-News, arXiv, PubMed, BIGPATENT datasets contain input documents longer than the maximum input length (Linput = 512 tokens) in pretraining. This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUSLARGE beyond the input lengths observed in training up to Linput = 1024 tokens."

    Any inputs on this would be of great help. Thanks in advance!

    opened by Asayesha 7
  •  No matching distribution found for tensorflow-text==1.15.0rc0 - using pip latest

    Hi,

    I'm unable to install pegasus on OSX, using latest version of pip, it throws:

    ERROR: Could not find a version that satisfies the requirement tensorflow-text==1.15.0rc0 (from -r requirements.txt (line 7)) (from versions: 2.2.0, 2.2.1, 2.3.0rc1)
    ERROR: No matching distribution found for tensorflow-text==1.15.0rc0 (from -r requirements.txt (line 7))
    

    Tried to install it in a virtual env by downgrading pip to different versions, still no luck.

    Could someone please help me out? Thanks.

    opened by vedtam 7
  • Bump tensorflow from 1.15 to 2.9.3

    Bumps tensorflow from 1.15 to 2.9.3.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.9.3

    Release 2.9.3

    This release introduces several vulnerability fixes:

    TensorFlow 2.9.2

    Release 2.9.2

    This releases introduces several vulnerability fixes:

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.9.3

    This release introduces several vulnerability fixes:

    Release 2.8.4

    This release introduces several vulnerability fixes:

    ... (truncated)

    Commits
    • a5ed5f3 Merge pull request #58584 from tensorflow/vinila21-patch-2
    • 258f9a1 Update py_func.cc
    • cd27cfb Merge pull request #58580 from tensorflow-jenkins/version-numbers-2.9.3-24474
    • 3e75385 Update version numbers to 2.9.3
    • bc72c39 Merge pull request #58482 from tensorflow-jenkins/relnotes-2.9.3-25695
    • 3506c90 Update RELEASE.md
    • 8dcb48e Update RELEASE.md
    • 4f34ec8 Merge pull request #58576 from pak-laura/c2.99f03a9d3bafe902c1e6beb105b2f2417...
    • 6fc67e4 Replace CHECK with returning an InternalError on failing to create python tuple
    • 5dbe90a Merge pull request #58570 from tensorflow/r2.9-7b174a0f2e4
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • Bump tensorflow-gpu from 1.15.2 to 2.9.3

    Bumps tensorflow-gpu from 1.15.2 to 2.9.3.

    Release notes

    Sourced from tensorflow-gpu's releases.

    TensorFlow 2.9.3

    Release 2.9.3

    This release introduces several vulnerability fixes:

    TensorFlow 2.9.2

    Release 2.9.2

    This releases introduces several vulnerability fixes:

    ... (truncated)

    Changelog

    Sourced from tensorflow-gpu's changelog.

    Release 2.9.3

    This release introduces several vulnerability fixes:

    Release 2.8.4

    This release introduces several vulnerability fixes:

    ... (truncated)

    Commits
    • a5ed5f3 Merge pull request #58584 from tensorflow/vinila21-patch-2
    • 258f9a1 Update py_func.cc
    • cd27cfb Merge pull request #58580 from tensorflow-jenkins/version-numbers-2.9.3-24474
    • 3e75385 Update version numbers to 2.9.3
    • bc72c39 Merge pull request #58482 from tensorflow-jenkins/relnotes-2.9.3-25695
    • 3506c90 Update RELEASE.md
    • 8dcb48e Update RELEASE.md
    • 4f34ec8 Merge pull request #58576 from pak-laura/c2.99f03a9d3bafe902c1e6beb105b2f2417...
    • 6fc67e4 Replace CHECK with returning an InternalError on failing to create python tuple
    • 5dbe90a Merge pull request #58570 from tensorflow/r2.9-7b174a0f2e4
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • mask token id == 3 during pretraining?

    I noticed that kMaskWordTokenId (mask2 as defined in the paper) is 3 as defined below. https://github.com/google-research/pegasus/blob/main/pegasus/ops/pretrain_parsing_ops.cc#L69

    However, the id of token 'a' is also 3 in sentencepiece vocab from "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"

    @EKebriaei

    opened by whaleloops 0
  • how does fine tuning change the original transformer language model?

    What exactly happens to the token relationships behind the scenes so that it knows how to paraphrase never-before-seen sentences? How does it get from training data such as "I would like to go to the movies. The cinema seems like an ideal choice." to being able to paraphrase a sentence like "I would vote for this candidate for president."?

    opened by mishav78 0
  • unable to use pegasus-x checkpoints

    Hi,

    I tried to download the checkpoints listed in the README for the pegasus-x models, but they are not complete (there is only the .ckpt file, no .meta or .index file). I don't see them in the GCS bucket either. Would you upload usable checkpoints to the GCS bucket soon? Thank you!

    opened by yingrui-yang 0
Owner
Google Research