XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zihang Dai

Last update: Jan 7, 2023

Overview

Introduction

XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

For a detailed description of technical details and experimental results, please refer to our paper:

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

(*: equal contribution)

Preprint 2019

Release Notes

July 16, 2019: XLNet-Base.
June 19, 2019: initial release with XLNet-Large and code.

Results

As of June 19, 2019, XLNet outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 tasks. Below are some comparisons between XLNet-Large and BERT-Large, which have similar model sizes:

Results on Reading Comprehension

Model	RACE accuracy	SQuAD1.1 EM	SQuAD2.0 EM
BERT-Large	72.0	84.1	78.98
XLNet-Base			80.18
XLNet-Large	81.75	88.95	86.12

We use SQuAD dev results in the table to exclude other factors such as using additional training data or other data augmentation techniques. See SQuAD leaderboard for test numbers.

Results on Text Classification

Model	IMDB	Yelp-2	Yelp-5	DBpedia	Amazon-2	Amazon-5
BERT-Large	4.51	1.89	29.32	0.64	2.63	34.17
XLNet-Large	3.79	1.55	27.80	0.62	2.40	32.26

The above numbers are error rates.

Results on GLUE

Model	MNLI	QNLI	QQP	RTE	SST-2	MRPC	CoLA	STS-B
BERT-Large	86.6	92.3	91.3	70.4	93.2	88.0	60.6	90.0
XLNet-Base	86.8	91.7	91.4	74.0	94.7	88.2	60.2	89.5
XLNet-Large	89.8	93.9	91.8	83.8	95.6	89.2	63.6	91.8

We use single-task dev results in the table to exclude other factors such as multi-task learning or using ensembles.

Pre-trained models

Released Models

As of July 16, 2019, the following models have been made available:

XLNet-Large, Cased: 24-layer, 1024-hidden, 16-heads
XLNet-Base, Cased: 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).

We only release cased models for now because on the tasks we consider, we found: (1) for the base setting, cased and uncased models have similar performance; (2) for the large setting, cased models are a bit better in some tasks.

Each .zip file contains three items:

A TensorFlow checkpoint (xlnet_model.ckpt) containing the pre-trained weights (which is actually 3 files).
A Sentence Piece model (spiece.model) used for (de)tokenization.
A config file (xlnet_config.json) which specifies the hyperparameters of the model.

Future Release Plan

We also plan to continuously release more pretrained models under different settings, including:

A pretrained model that is finetuned on Wikipedia. This can be used for tasks with Wikipedia text such as SQuAD and HotpotQA.
Pretrained models with other hyperparameter configurations, targeting specific downstream tasks.
Pretrained models that benefit from new techniques.

Subscribing to XLNet on Google Groups

To receive notifications about updates, announcements and new releases, we recommend subscribing to XLNet on Google Groups.

Fine-tuning with XLNet

As of June 19, 2019, this code base has been tested with TensorFlow 1.13.1 under Python 2.

Memory Issue during Finetuning

Most of the SOTA results in our paper were produced on TPUs, which generally have more RAM than common GPUs. As a result, it is currently very difficult (costly) to reproduce most of the XLNet-Large SOTA results in the paper using GPUs with 12GB - 16GB of RAM, because a 16GB GPU can only hold a single sequence of length 512 for XLNet-Large. Therefore, a large number of GPUs (ranging from 32 to 128, equal to batch_size) is required to reproduce many results in the paper.
We are experimenting with gradient accumulation to potentially relieve the memory burden, which could be included in a near-future update.
Alternative methods of finetuning XLNet on constrained hardware have been presented in renatoviolin's repo, which obtained 86.24 F1 on SQuAD2.0 with an 8GB GPU.

Given the memory issue mentioned above, using the default finetuning scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single 16GB GPU with TensorFlow 1.13.1:

System	Seq Length	Max Batch Size
`XLNet-Base`	64	120
...	128	56
...	256	24
...	512	8
`XLNet-Large`	64	16
...	128	8
...	256	2
...	512	1

In most cases, it is possible to reduce the batch size train_batch_size or the maximum sequence length max_seq_length to fit in given hardware. The decrease in performance depends on the task and the available resources.
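The benchmark table above can be turned into a quick feasibility check. The following is a minimal sketch, not part of the XLNet code base: the helper name min_gpus is hypothetical, and it assumes plain data parallelism on 16GB GPUs with no gradient accumulation.

```python
# Maximum per-GPU batch sizes copied from the table above
# (single 16GB GPU, TensorFlow 1.13.1, default finetuning scripts).
MAX_BATCH_16GB = {
    ("XLNet-Base", 64): 120,
    ("XLNet-Base", 128): 56,
    ("XLNet-Base", 256): 24,
    ("XLNet-Base", 512): 8,
    ("XLNet-Large", 64): 16,
    ("XLNet-Large", 128): 8,
    ("XLNet-Large", 256): 2,
    ("XLNet-Large", 512): 1,
}


def min_gpus(model, seq_len, global_batch_size):
    """Estimate how many 16GB GPUs are needed for a global batch size.

    Hypothetical helper: assumes each GPU holds at most the benchmarked
    per-GPU batch and that gradient accumulation is not used.
    """
    per_gpu = MAX_BATCH_16GB[(model, seq_len)]
    # Ceiling division using integers only.
    return -(-global_batch_size // per_gpu)


if __name__ == "__main__":
    # XLNet-Large at length 512 fits one sequence per GPU.
    print(min_gpus("XLNet-Large", 512, 32))
    print(min_gpus("XLNet-Base", 512, 48))
```

With these numbers, a batch of 32 length-512 sequences for XLNet-Large needs one GPU per sequence, which matches the GPU counts quoted above.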

Text Classification/Regression

The code used to perform classification/regression finetuning is in run_classifier.py. It also contains examples for standard one-document classification, one-document regression, and document pair classification. Here, we provide two concrete examples of how run_classifier.py can be used.

From here on, we assume XLNet-Large and XLNet-Base have been downloaded to $LARGE_DIR and $BASE_DIR respectively.

(1) STS-B: sentence pair relevance regression (with GPUs)

Download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.

Perform multi-GPU (4 V100 GPUs) finetuning with XLNet-Large by running

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
  --do_train=True \
  --do_eval=False \
  --task_name=sts-b \
  --data_dir=${GLUE_DIR}/STS-B \
  --output_dir=proc_data/sts-b \
  --model_dir=exp/sts-b \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${LARGE_DIR}/xlnet_config.json \
  --init_checkpoint=${LARGE_DIR}/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=4 \
  --learning_rate=5e-5 \
  --train_steps=1200 \
  --warmup_steps=120 \
  --save_steps=600 \
  --is_regression=True

Evaluate the finetuning results with a single GPU by

CUDA_VISIBLE_DEVICES=0 python run_classifier.py \
  --do_train=False \
  --do_eval=True \
  --task_name=sts-b \
  --data_dir=${GLUE_DIR}/STS-B \
  --output_dir=proc_data/sts-b \
  --model_dir=exp/sts-b \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${LARGE_DIR}/xlnet_config.json \
  --max_seq_length=128 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --eval_all_ckpt=True \
  --is_regression=True

# Expected performance: "eval_pearsonr 0.916+ "

Notes:

In the context of GPU training, num_core_per_host denotes the number of GPUs to use.
In the multi-GPU setting, train_batch_size refers to the per-GPU batch size.
eval_all_ckpt allows one to evaluate all saved checkpoints (save frequency is controlled by save_steps) after training finishes and choose the best model based on dev performance.
data_dir and output_dir refer to the directories of the "raw data" and "preprocessed tfrecords" respectively, while model_dir is the working directory for saving checkpoints and tensorflow events. model_dir should be set to a separate folder from init_checkpoint.
To try out XLNet-base, one can simply set --train_batch_size=32 and --num_core_per_host=1, along with the corresponding changes in init_checkpoint and model_config_path.
For GPUs with smaller RAM, please proportionally decrease the train_batch_size and increase num_core_per_host to use the same training setting.
Important: we separate the training and evaluation into "two phases", as using multi GPUs to perform evaluation is tricky (one has to correctly separate the data across GPUs). To ensure correctness, we only support single-GPU evaluation for now.
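Since train_batch_size is the per-GPU batch size in the multi-GPU setting, the effective batch size is train_batch_size times num_core_per_host (8 x 4 = 32 in the STS-B command above). The sketch below uses a hypothetical helper, equivalent_settings, to list the splits that keep the same effective batch size when the per-GPU batch has to shrink.

```python
def equivalent_settings(effective_batch_size, max_per_gpu_batch):
    """List (train_batch_size, num_core_per_host) pairs with the same product.

    Hypothetical helper: max_per_gpu_batch is the largest per-GPU batch
    that fits in memory on the available hardware.
    """
    settings = []
    for per_gpu in range(1, max_per_gpu_batch + 1):
        if effective_batch_size % per_gpu == 0:
            settings.append((per_gpu, effective_batch_size // per_gpu))
    return settings


if __name__ == "__main__":
    # STS-B example: effective batch size 32, at most 8 sequences per GPU.
    for train_batch_size, num_core_per_host in equivalent_settings(32, 8):
        print("--train_batch_size={} --num_core_per_host={}".format(
            train_batch_size, num_core_per_host))
```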

(2) IMDB: movie review sentiment classification (with TPU V3-8)

Download and unpack the IMDB dataset by running

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz

Launch a Google cloud TPU V3-8 instance (see the Google Cloud TPU tutorial for how to set up Cloud TPUs).
Set up your Google storage bucket path $GS_ROOT and move the IMDB dataset and pretrained checkpoint into your Google storage.

Perform TPU finetuning with XLNet-Large by running

python run_classifier.py \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=True \
  --task_name=imdb \
  --data_dir=${IMDB_DIR} \
  --output_dir=${GS_ROOT}/proc_data/imdb \
  --model_dir=${GS_ROOT}/exp/imdb \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${GS_ROOT}/${LARGE_DIR}/xlnet_config.json \
  --init_checkpoint=${GS_ROOT}/${LARGE_DIR}/xlnet_model.ckpt \
  --max_seq_length=512 \
  --train_batch_size=32 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --learning_rate=2e-5 \
  --train_steps=4000 \
  --warmup_steps=500 \
  --save_steps=500 \
  --iterations=500

# Expected performance: "eval_accuracy 0.962+ "

Notes:

To obtain the SOTA on the IMDB dataset, using sequence length 512 is necessary. Therefore, we show how this can be done with a TPU V3-8.
Alternatively, one can use a sequence length smaller than 512, a smaller batch size, or switch to XLNet-base to train on GPUs. However, a drop in performance is expected.
Notice that the data_dir and spiece_model_file both use a local path rather than a Google Storage path. The reason is that data preprocessing is actually performed locally. Hence, using local paths leads to a faster preprocessing speed.
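As a rough illustration of the local-versus-Google-Storage split in the command above, the following sketch flags paths that look misplaced before a TPU run is launched. It is not part of run_classifier.py: the helper name and the grouping of flags are assumptions inferred from the example command.

```python
GCS_PREFIX = "gs://"

# Grouping inferred from the example command above (an assumption):
# preprocessing inputs stay local, everything else lives on Google Storage.
LOCAL_FLAGS = ("data_dir", "spiece_model_file")
REMOTE_FLAGS = ("output_dir", "model_dir", "model_config_path",
                "init_checkpoint")


def find_path_problems(flags):
    """Return messages for flag values that look misplaced (hypothetical)."""
    problems = []
    for name in LOCAL_FLAGS:
        if flags[name].startswith(GCS_PREFIX):
            problems.append(name + " should be a local path")
    for name in REMOTE_FLAGS:
        if not flags[name].startswith(GCS_PREFIX):
            problems.append(name + " should be a gs:// path")
    return problems


if __name__ == "__main__":
    example = {
        "data_dir": "aclImdb",
        "spiece_model_file": "gs://my-bucket/xlnet/spiece.model",
        "output_dir": "gs://my-bucket/proc_data/imdb",
        "model_dir": "gs://my-bucket/exp/imdb",
        "model_config_path": "gs://my-bucket/xlnet/xlnet_config.json",
        "init_checkpoint": "gs://my-bucket/xlnet/xlnet_model.ckpt",
    }
    print(find_path_problems(example))
```

The bucket name and paths in the usage example are placeholders.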

SQuAD2.0

The code for the SQuAD dataset is included in run_squad.py.

To run the code:

(1) Download the SQuAD2.0 dataset into $SQUAD_DIR by:

mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

(2) Perform data preprocessing using the script scripts/prepro_squad.sh.

This will take quite some time in order to accurately map character positions (raw data) to sentence piece positions (used for training).
For faster parallel preprocessing, please refer to the flags --num_proc and --proc_id in run_squad.py.
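One common way to split preprocessing across workers is strided sharding, where worker proc_id out of num_proc takes every num_proc-th example. The sketch below illustrates that idea only; whether run_squad.py shards in exactly this way is an assumption, and the function name is hypothetical.

```python
def shard_examples(examples, num_proc, proc_id):
    """Return the slice of examples handled by one worker (strided split)."""
    if not 0 <= proc_id < num_proc:
        raise ValueError("proc_id must satisfy 0 <= proc_id < num_proc")
    return examples[proc_id::num_proc]


if __name__ == "__main__":
    examples = ["q{}".format(i) for i in range(10)]
    for proc_id in range(3):
        print(proc_id, shard_examples(examples, num_proc=3, proc_id=proc_id))
```

Each example lands in exactly one shard, so the workers can run independently and their outputs can be combined afterwards.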

(3) Perform training and evaluation.

For the best performance, XLNet-Large uses sequence length 512 and batch size 48 for training.

As a result, reproducing the best result with GPUs is quite difficult.
For training with one TPU v3-8, one can simply run the script scripts/tpu_squad_large.sh after both the TPU and Google storage have been set up.
run_squad.py will automatically perform threshold searching on the SQuAD dev set and output the score. With scripts/tpu_squad_large.sh, the expected F1 score should be around 88.6 (median of our multiple runs).

Alternatively, one can use XLNet-Base with GPUs (e.g. three V100). One set of reasonable hyper-parameters can be found in the script scripts/gpu_squad_base.sh.
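For readers unfamiliar with threshold searching on SQuAD2.0, the sketch below shows the general idea: sweep a cut-off over the model's no-answer scores and keep the value that maximizes the dev score. It is a generic illustration with hypothetical names and inputs, not the code in run_squad.py, and it assumes the model always proposes a non-empty span for questions it chooses to answer.

```python
def find_best_threshold(no_answer_scores, answer_quality, has_answer):
    """Search the no-answer threshold that maximizes the dev score.

    no_answer_scores: qid -> score, higher means "more likely unanswerable".
    answer_quality:   qid -> EM or F1 in [0, 1] of the predicted span
                      (only read for answerable questions).
    has_answer:       qid -> True if the question is answerable.
    A question is answered when its score is <= threshold.
    """
    if not no_answer_scores:
        raise ValueError("need at least one question")
    qids = sorted(no_answer_scores, key=lambda q: no_answer_scores[q])
    # Threshold below every score: everything is predicted unanswerable,
    # so only the truly unanswerable questions are counted as correct.
    current = sum(1.0 for q in qids if not has_answer[q])
    best_total, best_threshold = current, float("-inf")
    for i, q in enumerate(qids):
        # Raising the threshold to this score makes the model answer q.
        if has_answer[q]:
            current += answer_quality[q]
        else:
            current -= 1.0
        # Only evaluate once all questions sharing this score are included.
        last_of_tie = (i + 1 == len(qids) or
                       no_answer_scores[qids[i + 1]] != no_answer_scores[q])
        if last_of_tie and current > best_total:
            best_total, best_threshold = current, no_answer_scores[q]
    return 100.0 * best_total / len(qids), best_threshold


if __name__ == "__main__":
    scores = {"a": 0.1, "b": 0.2, "c": 0.6, "d": 0.9, "e": 0.7}
    quality = {"a": 1.0, "b": 0.5, "e": 1.0}
    answerable = {"a": True, "b": True, "c": False, "d": False, "e": True}
    print(find_best_threshold(scores, quality, answerable))
```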

RACE reading comprehension

The code for the reading comprehension task RACE is included in run_race.py.

Notably, the average length of the passages in RACE is over 300 tokens (not pieces), which is significantly longer than other popular reading comprehension datasets such as SQuAD.
Also, many questions can be very difficult and require complex reasoning for machines to solve (see one example here).

To run the code:

(1) Download the RACE dataset from the official website and unpack the raw data to $RACE_DIR.

(2) Perform training and evaluation:

The SOTA performance (accuracy 81.75) of RACE is produced using XLNet-Large with sequence length 512 and batch size 32, which requires a large TPU v3-32 in the pod setting. Please refer to the script scripts/tpu_race_large_bsz32.sh for this setting.
Using XLNet-Large with sequence length 512 and batch size 8 on a TPU v3-8 can give you an accuracy of around 80.3 (see scripts/tpu_race_large_bsz8.sh).

Using Google Colab

An example of using Google Colab with GPUs has been provided. Note that since the hardware is constrained in the example, the results are worse than the best we can get. It mainly serves as an example and should be modified accordingly to maximize performance.

Custom Usage of XLNet

XLNet Abstraction

For finetuning, it is likely that you will be able to modify existing files such as run_classifier.py, run_squad.py and run_race.py for your task at hand. However, we also provide an abstraction of XLNet to enable more flexible usage. Below is an example:

import xlnet

# some code omitted here...
# initialize FLAGS
# initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

# XLNetConfig contains hyperparameters that are specific to a model checkpoint.
xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)

# RunConfig contains hyperparameters that could be different between pretraining and finetuning.
run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)

# Construct an XLNet model
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)

# Get a summary of the sequence using the last hidden state
summary = xlnet_model.get_pooled_out(summary_type="last")

# Get a sequence output
seq_out = xlnet_model.get_sequence_output()

# build your applications based on `summary` or `seq_out`

Tokenization

Below is an example of doing tokenization in XLNet:

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids

# some code omitted here...
# initialize FLAGS

text = "An input text string."

sp_model = spm.SentencePieceProcessor()
sp_model.Load(FLAGS.spiece_model_file)
text = preprocess_text(text, lower=FLAGS.uncased)
ids = encode_ids(sp_model, text)

Here, FLAGS.spiece_model_file is the SentencePiece model file in the same zip as the pretrained model, and FLAGS.uncased is a bool indicating whether to lowercase the text.
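To give a feel for what text cleanup before SentencePiece encoding can look like, here is a simplified, self-contained sketch. The function normalize_text and its exact steps are assumptions for illustration; the real preprocess_text in prepro_utils may behave differently.

```python
import unicodedata


def normalize_text(text, lower=False, strip_accents=True):
    """Illustrative cleanup before tokenization (hypothetical helper)."""
    # Collapse runs of whitespace and trim both ends.
    text = " ".join(text.strip().split())
    # Map LaTeX-style quotes to a plain double quote.
    text = text.replace("``", '"').replace("''", '"')
    if strip_accents:
        # Decompose characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
    if lower:
        text = text.lower()
    return text


if __name__ == "__main__":
    print(normalize_text(u"  H\u00e9llo   ``World''  ", lower=True))
```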

Pretraining with XLNet

Refer to train.py for pretraining on TPUs and train_gpu.py for pretraining on GPUs. First we need to preprocess the text data into tfrecords.

python data_utils.py \
	--bsz_per_host=32 \
	--num_core_per_host=16 \
	--seq_len=512 \
	--reuse_len=256 \
	--input_glob=*.txt \
	--save_dir=${SAVE_DIR} \
	--num_passes=20 \
	--bi_data=True \
	--sp_path=spiece.model \
	--mask_alpha=6 \
	--mask_beta=1 \
	--num_predict=85

where input_glob defines all input text files, save_dir is the output directory for tfrecords, and sp_path is a Sentence Piece model. Here is our script to train the Sentence Piece model:

spm_train \
	--input=$INPUT \
	--model_prefix=sp10m.cased.v3 \
	--vocab_size=32000 \
	--character_coverage=0.99995 \
	--model_type=unigram \
	--control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
	--user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
	--shuffle_input_sentence \
	--input_sentence_size=10000000

Special symbols are used, including control_symbols and user_defined_symbols. We use <eop> and <eod> to denote End of Paragraph and End of Document respectively.

The input text files to data_utils.py must use the following format:

Each line is a sentence.
An empty line means End of Document.
(Optional) If one also wants to model paragraph structures, <eop> can be inserted at the end of certain lines (without any space) to indicate that the corresponding sentence ends a paragraph.

For example, the text input file could be:

This is the first sentence.
This is the second sentence and also the end of the paragraph.<eop>
Another paragraph.

Another document starts here.
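The format rules above (one sentence per line, an empty line between documents, an optional <eop> marker at the end of a paragraph) can be produced from already-segmented text with a few lines of Python. This is a minimal sketch with a hypothetical helper name; sentence and paragraph segmentation are assumed to have been done beforehand.

```python
EOP = "<eop>"


def format_corpus(documents, mark_paragraphs=True):
    """Serialize documents into the text format described above.

    documents: list of documents, each a list of paragraphs,
               each paragraph a list of sentence strings.
    This sketch marks every paragraph end when mark_paragraphs is True.
    """
    blocks = []
    for paragraphs in documents:
        lines = []
        for sentences in paragraphs:
            for i, sentence in enumerate(sentences):
                if "\n" in sentence:
                    raise ValueError("a sentence must fit on one line")
                ends_paragraph = i == len(sentences) - 1
                suffix = EOP if mark_paragraphs and ends_paragraph else ""
                lines.append(sentence + suffix)
        blocks.append("\n".join(lines))
    # An empty line separates consecutive documents.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    docs = [
        [["This is the first sentence.", "This is the second sentence."],
         ["Another paragraph."]],
        [["Another document starts here."]],
    ]
    print(format_corpus(docs))
```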

After preprocessing, we are ready to pretrain an XLNet. Below are the hyperparameters used for pretraining XLNet-Large:

python train.py \
  --record_info_dir=$DATA/tfrecords \
  --train_batch_size=2048 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

where we only list the most important flags and the other flags could be adjusted based on specific use cases.
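Several of the listed values are related by simple arithmetic, which the sketch below spells out. It assumes that mask_alpha is the size of a token group and mask_beta the number of tokens predicted per group, so that num_predict is roughly seq_len * mask_beta / mask_alpha; that reading, and the helper names, are assumptions rather than something stated in this document.

```python
# Values copied from the XLNet-Large pretraining command above.
LARGE_FLAGS = {
    "seq_len": 512,
    "reuse_len": 256,
    "mask_alpha": 6,
    "mask_beta": 1,
    "num_predict": 85,
    "d_model": 1024,
    "n_head": 16,
    "d_head": 64,
    "d_inner": 4096,
}


def expected_num_predict(seq_len, mask_alpha, mask_beta):
    """Tokens to predict if mask_beta of every mask_alpha tokens are targets."""
    return seq_len * mask_beta // mask_alpha


def summarize(flags):
    """Derive a few quantities that relate the flags to each other."""
    return {
        "num_predict_estimate": expected_num_predict(
            flags["seq_len"], flags["mask_alpha"], flags["mask_beta"]),
        "heads_times_head_dim": flags["n_head"] * flags["d_head"],
        "inner_to_model_ratio": flags["d_inner"] // flags["d_model"],
    }


if __name__ == "__main__":
    print(summarize(LARGE_FLAGS))
```

For the values above, the estimate equals the num_predict flag, the attention heads together span d_model, and the feed-forward layer is four times wider than the model dimension.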

Comments

XLNET Base for Malay and Indonesian languages (not an issue)

Hi! This is not an issue, I just want to say XLNET is really great and I successfully pretrained XLNET from scratch for Malay and Indonesian languages. You can read the comparison and download the pretrained models here: https://github.com/huseinzol05/Malaya/tree/master/xlnet

I am planning to release XLNET Large for these languages!

opened by huseinzol05 23

Out of memory with TPU v3-8 when running tpu_squad_large.sh

Hello, thank you for the interesting paper and for releasing your code alongside the paper!

I am trying to train XLNet on SQuAD, but I am getting OOM errors when running scripts/tpu_squad_large.sh. This strikes me as odd, because you say in the README that you can run this script without issues. I have not modified the parameters of the script, except for specifying the necessary data/model directories.

For context, my setup is as follows. I spun up a TPU v3-8 using ctpu up in the us-central1-a region. I preprocessed the data as directed, using scripts/prepro_squad.sh, and moved to a Google Storage bucket in the same region as the TPU. I have model checkpoint folders both locally (for sentencepiece) and in the cloud (for loading the model).

I have worked with TPUs before, but only TPU-v2 (not v3); is there something I am doing incorrectly?

When I run scripts/tpu_squad_large.sh, loading and initialization work fine, but the script breaks with what I believe is a memory issue:

# ... normal tensorflow logs ...
I0621 17:53:36.702727 140612788254144 tpu_estimator.py:536] Enqueue next (1000) batch(es) of data to infeed.
I0621 17:53:36.703403 140612788254144 tpu_estimator.py:540] Dequeue next (1000) batch(es) of data from outfeed.
I0621 17:56:15.833373 140611248187136 error_handling.py:70] Error recorded from outfeed: Bad hardware status: 0x1
# ... stack trace ...
Status code: Resource exhausted [9x]
Compilation failure: Ran out of memory in memory space hbm. Used 20.90G of 16.00G hbm. Exceeded hbm capacity by 4.90G.
Total hbm usage >= 20.90G:
  reserved 528.00M
  program 20.38G
  arguments unknown size
Output size unknown.

Is there something I am doing incorrectly?

Also, have others managed to run scripts/tpu_squad_large.sh successfully (with batch size 48, etc.)?

opened by lukemelas 23

Pretraining loss is increasing

I am currently pretraining for the Malay language, using my own dataset collected from Wikipedia, social media and public news. Everything works, except that the loss is increasing:

I0704 18:35:04.634176 140514656708352 train_gpu.py:303] [500] | gnorm 1.29 lr 0.000249 | loss 7.61 | pplx 2012.05, bpc 10.9745
I0704 18:40:19.443927 140514656708352 train_gpu.py:303] [1000] | gnorm 1.14 lr 0.000249 | loss 7.49 | pplx 1798.12, bpc 10.8123
I0704 18:45:34.236683 140514656708352 train_gpu.py:303] [1500] | gnorm 1.20 lr 0.000248 | loss 7.52 | pplx 1843.51, bpc 10.8482
I0704 18:50:49.070508 140514656708352 train_gpu.py:303] [2000] | gnorm 1.16 lr 0.000248 | loss 7.53 | pplx 1855.24, bpc 10.8574
I0704 18:56:03.973169 140514656708352 train_gpu.py:303] [2500] | gnorm 0.80 lr 0.000247 | loss 7.50 | pplx 1809.21, bpc 10.8211
I0704 19:01:18.817846 140514656708352 train_gpu.py:303] [3000] | gnorm 0.68 lr 0.000246 | loss 7.47 | pplx 1751.64, bpc 10.7745
I0704 19:06:33.646725 140514656708352 train_gpu.py:303] [3500] | gnorm 0.68 lr 0.000246 | loss 7.50 | pplx 1813.33, bpc 10.8244
I0704 19:11:48.491064 140514656708352 train_gpu.py:303] [4000] | gnorm 0.63 lr 0.000245 | loss 7.48 | pplx 1765.44, bpc 10.7858
I0704 19:17:03.302957 140514656708352 train_gpu.py:303] [4500] | gnorm 0.54 lr 0.000244 | loss 7.40 | pplx 1643.27, bpc 10.6824
I0704 19:22:18.108561 140514656708352 train_gpu.py:303] [5000] | gnorm 0.43 lr 0.000244 | loss 7.48 | pplx 1768.99, bpc 10.7887
I0704 19:27:32.939702 140514656708352 train_gpu.py:303] [5500] | gnorm 0.52 lr 0.000243 | loss 7.41 | pplx 1647.01, bpc 10.6856
I0704 19:32:47.666982 140514656708352 train_gpu.py:303] [6000] | gnorm 0.58 lr 0.000243 | loss 7.44 | pplx 1700.44, bpc 10.7317
I0704 19:38:02.447965 140514656708352 train_gpu.py:303] [6500] | gnorm 0.47 lr 0.000242 | loss 7.42 | pplx 1669.42, bpc 10.7051
I0704 19:43:17.212873 140514656708352 train_gpu.py:303] [7000] | gnorm 0.56 lr 0.000241 | loss 7.43 | pplx 1692.20, bpc 10.7247
I0704 19:48:31.992203 140514656708352 train_gpu.py:303] [7500] | gnorm 0.54 lr 0.000241 | loss 7.47 | pplx 1759.98, bpc 10.7813
I0704 19:53:46.838080 140514656708352 train_gpu.py:303] [8000] | gnorm 0.40 lr 0.000240 | loss 7.42 | pplx 1675.03, bpc 10.7100
I0704 19:59:01.705397 140514656708352 train_gpu.py:303] [8500] | gnorm 0.60 lr 0.000239 | loss 7.45 | pplx 1713.91, bpc 10.7431
I0704 20:04:16.556568 140514656708352 train_gpu.py:303] [9000] | gnorm 0.31 lr 0.000239 | loss 7.45 | pplx 1717.98, bpc 10.7465
I0704 20:09:31.360584 140514656708352 train_gpu.py:303] [9500] | gnorm 0.31 lr 0.000238 | loss 7.42 | pplx 1667.16, bpc 10.7032
I0704 20:14:46.129139 140514656708352 train_gpu.py:303] [10000] | gnorm 0.32 lr 0.000238 | loss 7.41 | pplx 1658.86, bpc 10.6960
I0704 20:14:54.924502 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0704 20:20:09.735051 140514656708352 train_gpu.py:303] [10500] | gnorm 0.71 lr 0.000237 | loss 7.32 | pplx 1515.01, bpc 10.5651
I0704 20:25:24.431047 140514656708352 train_gpu.py:303] [11000] | gnorm 0.30 lr 0.000236 | loss 7.38 | pplx 1601.43, bpc 10.6451
I0704 20:30:39.190253 140514656708352 train_gpu.py:303] [11500] | gnorm 0.56 lr 0.000236 | loss 7.05 | pplx 1150.96, bpc 10.1686
I0704 20:35:54.004818 140514656708352 train_gpu.py:303] [12000] | gnorm 0.37 lr 0.000235 | loss 7.12 | pplx 1230.52, bpc 10.2651
I0704 20:41:08.760111 140514656708352 train_gpu.py:303] [12500] | gnorm 0.36 lr 0.000234 | loss 7.31 | pplx 1499.00, bpc 10.5498
I0704 20:46:23.480738 140514656708352 train_gpu.py:303] [13000] | gnorm 0.31 lr 0.000234 | loss 7.43 | pplx 1689.70, bpc 10.7226
I0704 20:51:38.286542 140514656708352 train_gpu.py:303] [13500] | gnorm 0.29 lr 0.000233 | loss 7.37 | pplx 1581.20, bpc 10.6268
I0704 20:56:53.045661 140514656708352 train_gpu.py:303] [14000] | gnorm 0.33 lr 0.000233 | loss 7.37 | pplx 1585.96, bpc 10.6311
I0704 21:02:07.842073 140514656708352 train_gpu.py:303] [14500] | gnorm 0.28 lr 0.000232 | loss 7.31 | pplx 1496.60, bpc 10.5475
I0704 21:07:22.611250 140514656708352 train_gpu.py:303] [15000] | gnorm 0.46 lr 0.000231 | loss 7.36 | pplx 1570.65, bpc 10.6171
I0704 21:12:37.345983 140514656708352 train_gpu.py:303] [15500] | gnorm 0.34 lr 0.000231 | loss 7.43 | pplx 1692.69, bpc 10.7251
I0704 21:17:52.026112 140514656708352 train_gpu.py:303] [16000] | gnorm 0.47 lr 0.000230 | loss 7.33 | pplx 1522.32, bpc 10.5721
I0704 21:23:06.814132 140514656708352 train_gpu.py:303] [16500] | gnorm 0.28 lr 0.000229 | loss 7.38 | pplx 1610.54, bpc 10.6533
I0704 21:28:21.642250 140514656708352 train_gpu.py:303] [17000] | gnorm 0.35 lr 0.000229 | loss 7.47 | pplx 1751.44, bpc 10.7743
I0704 21:33:36.417810 140514656708352 train_gpu.py:303] [17500] | gnorm 0.26 lr 0.000228 | loss 7.66 | pplx 2127.10, bpc 11.0547
I0704 21:38:51.233461 140514656708352 train_gpu.py:303] [18000] | gnorm 0.48 lr 0.000228 | loss 7.64 | pplx 2081.33, bpc 11.0233
I0704 21:44:06.010222 140514656708352 train_gpu.py:303] [18500] | gnorm 0.26 lr 0.000227 | loss 7.62 | pplx 2035.16, bpc 10.9909
I0704 21:49:20.783527 140514656708352 train_gpu.py:303] [19000] | gnorm 0.29 lr 0.000226 | loss 7.63 | pplx 2067.63, bpc 11.0138
I0704 21:54:35.625918 140514656708352 train_gpu.py:303] [19500] | gnorm 0.29 lr 0.000226 | loss 7.64 | pplx 2074.96, bpc 11.0189
I0704 21:59:50.491468 140514656708352 train_gpu.py:303] [20000] | gnorm 0.24 lr 0.000225 | loss 7.60 | pplx 2005.06, bpc 10.9694
I0704 21:59:58.156530 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0704 22:05:13.066222 140514656708352 train_gpu.py:303] [20500] | gnorm 0.37 lr 0.000224 | loss 7.61 | pplx 2023.57, bpc 10.9827
I0704 22:10:27.896057 140514656708352 train_gpu.py:303] [21000] | gnorm 0.23 lr 0.000224 | loss 7.59 | pplx 1972.14, bpc 10.9455
I0704 22:15:42.730550 140514656708352 train_gpu.py:303] [21500] | gnorm 0.25 lr 0.000223 | loss 7.64 | pplx 2081.29, bpc 11.0233
I0704 22:20:57.537832 140514656708352 train_gpu.py:303] [22000] | gnorm 0.28 lr 0.000223 | loss 7.65 | pplx 2098.58, bpc 11.0352
I0704 22:26:12.292067 140514656708352 train_gpu.py:303] [22500] | gnorm 0.28 lr 0.000222 | loss 7.60 | pplx 1996.29, bpc 10.9631
I0704 22:31:27.120922 140514656708352 train_gpu.py:303] [23000] | gnorm 0.31 lr 0.000221 | loss 7.60 | pplx 1990.51, bpc 10.9589
I0704 22:36:41.893491 140514656708352 train_gpu.py:303] [23500] | gnorm 0.36 lr 0.000221 | loss 7.63 | pplx 2064.10, bpc 11.0113
I0704 22:41:56.750755 140514656708352 train_gpu.py:303] [24000] | gnorm 0.24 lr 0.000220 | loss 7.61 | pplx 2026.16, bpc 10.9845
I0704 22:47:11.568091 140514656708352 train_gpu.py:303] [24500] | gnorm 0.33 lr 0.000219 | loss 7.61 | pplx 2012.64, bpc 10.9749
I0704 22:52:26.442100 140514656708352 train_gpu.py:303] [25000] | gnorm 0.44 lr 0.000219 | loss 7.63 | pplx 2068.76, bpc 11.0146
I0704 22:57:41.386162 140514656708352 train_gpu.py:303] [25500] | gnorm 0.24 lr 0.000218 | loss 7.61 | pplx 2014.27, bpc 10.9760
I0704 23:02:56.267807 140514656708352 train_gpu.py:303] [26000] | gnorm 0.22 lr 0.000218 | loss 7.72 | pplx 2255.77, bpc 11.1394
I0704 23:08:11.103214 140514656708352 train_gpu.py:303] [26500] | gnorm 0.27 lr 0.000217 | loss 7.93 | pplx 2782.29, bpc 11.4421
I0704 23:13:25.904230 140514656708352 train_gpu.py:303] [27000] | gnorm 0.23 lr 0.000216 | loss 7.96 | pplx 2875.61, bpc 11.4897
I0704 23:18:40.707365 140514656708352 train_gpu.py:303] [27500] | gnorm 0.35 lr 0.000216 | loss 7.92 | pplx 2748.53, bpc 11.4244
I0704 23:23:55.549267 140514656708352 train_gpu.py:303] [28000] | gnorm 0.21 lr 0.000215 | loss 7.89 | pplx 2671.65, bpc 11.3835
I0704 23:29:10.405974 140514656708352 train_gpu.py:303] [28500] | gnorm 0.39 lr 0.000215 | loss 7.90 | pplx 2684.09, bpc 11.3902
I0704 23:34:25.217416 140514656708352 train_gpu.py:303] [29000] | gnorm 0.39 lr 0.000214 | loss 8.00 | pplx 2978.12, bpc 11.5402
I0704 23:39:40.074820 140514656708352 train_gpu.py:303] [29500] | gnorm 0.31 lr 0.000213 | loss 7.96 | pplx 2865.29, bpc 11.4845
I0704 23:44:54.826035 140514656708352 train_gpu.py:303] [30000] | gnorm 0.26 lr 0.000213 | loss 7.98 | pplx 2925.37, bpc 11.5144
I0704 23:45:02.488555 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0704 23:50:17.176898 140514656708352 train_gpu.py:303] [30500] | gnorm 0.29 lr 0.000212 | loss 7.96 | pplx 2876.71, bpc 11.4902
I0704 23:55:31.884260 140514656708352 train_gpu.py:303] [31000] | gnorm 0.27 lr 0.000211 | loss 7.96 | pplx 2860.68, bpc 11.4821
I0705 00:00:46.648993 140514656708352 train_gpu.py:303] [31500] | gnorm 0.23 lr 0.000211 | loss 7.97 | pplx 2885.09, bpc 11.4944
I0705 00:06:01.459876 140514656708352 train_gpu.py:303] [32000] | gnorm 0.23 lr 0.000210 | loss 7.97 | pplx 2878.87, bpc 11.4913
I0705 00:11:16.248872 140514656708352 train_gpu.py:303] [32500] | gnorm 0.30 lr 0.000210 | loss 7.95 | pplx 2824.14, bpc 11.4636
I0705 00:16:31.057889 140514656708352 train_gpu.py:303] [33000] | gnorm 0.67 lr 0.000209 | loss 7.95 | pplx 2843.30, bpc 11.4734
I0705 00:21:45.813891 140514656708352 train_gpu.py:303] [33500] | gnorm 0.30 lr 0.000208 | loss 7.93 | pplx 2791.75, bpc 11.4470
I0705 00:27:00.544517 140514656708352 train_gpu.py:303] [34000] | gnorm 0.26 lr 0.000208 | loss 7.91 | pplx 2724.91, bpc 11.4120
I0705 00:32:15.378443 140514656708352 train_gpu.py:303] [34500] | gnorm 0.30 lr 0.000207 | loss 7.90 | pplx 2710.08, bpc 11.4041
I0705 00:37:30.203745 140514656708352 train_gpu.py:303] [35000] | gnorm 0.27 lr 0.000206 | loss 7.91 | pplx 2728.60, bpc 11.4139
I0705 00:42:45.133219 140514656708352 train_gpu.py:303] [35500] | gnorm 0.33 lr 0.000206 | loss 7.94 | pplx 2819.00, bpc 11.4610
I0705 00:47:59.879519 140514656708352 train_gpu.py:303] [36000] | gnorm 0.34 lr 0.000205 | loss 7.87 | pplx 2624.20, bpc 11.3577
I0705 00:53:14.583731 140514656708352 train_gpu.py:303] [36500] | gnorm 0.40 lr 0.000205 | loss 7.90 | pplx 2705.18, bpc 11.4015
I0705 00:58:29.223032 140514656708352 train_gpu.py:303] [37000] | gnorm 0.27 lr 0.000204 | loss 7.88 | pplx 2646.44, bpc 11.3698
I0705 01:03:43.887945 140514656708352 train_gpu.py:303] [37500] | gnorm 0.45 lr 0.000203 | loss 7.89 | pplx 2657.59, bpc 11.3759
I0705 01:08:58.620047 140514656708352 train_gpu.py:303] [38000] | gnorm 0.60 lr 0.000203 | loss 7.90 | pplx 2688.38, bpc 11.3925
I0705 01:14:13.428936 140514656708352 train_gpu.py:303] [38500] | gnorm 0.21 lr 0.000202 | loss 7.86 | pplx 2599.27, bpc 11.3439
I0705 01:19:28.241355 140514656708352 train_gpu.py:303] [39000] | gnorm 0.22 lr 0.000201 | loss 7.91 | pplx 2726.29, bpc 11.4127
I0705 01:24:43.089514 140514656708352 train_gpu.py:303] [39500] | gnorm 0.24 lr 0.000201 | loss 7.90 | pplx 2688.34, bpc 11.3925
I0705 01:29:57.828010 140514656708352 train_gpu.py:303] [40000] | gnorm 0.24 lr 0.000200 | loss 7.97 | pplx 2895.85, bpc 11.4998
I0705 01:30:05.489835 140514656708352 train_gpu.py:309] Model saved in path: output-model/model.ckpt
I0705 01:35:20.313300 140514656708352 train_gpu.py:303] [40500] | gnorm 0.48 lr 0.000200 | loss 7.97 | pplx 2893.86, bpc 11.4988
I0705 01:40:35.002449 140514656708352 train_gpu.py:303] [41000] | gnorm 0.27 lr 0.000199 | loss 7.89 | pplx 2680.88, bpc 11.3885
I0705 01:45:49.796684 140514656708352 train_gpu.py:303] [41500] | gnorm 0.31 lr 0.000198 | loss 7.89 | pplx 2676.68, bpc 11.3862
I0705 01:51:04.649299 140514656708352 train_gpu.py:303] [42000] | gnorm 0.30 lr 0.000198 | loss 7.91 | pplx 2726.73, bpc 11.4130
I0705 01:56:19.464629 140514656708352 train_gpu.py:303] [42500] | gnorm 0.60 lr 0.000197 | loss 7.83 | pplx 2525.32, bpc 11.3023
I0705 02:01:34.198068 140514656708352 train_gpu.py:303] [43000] | gnorm 0.38 lr 0.000196 | loss 7.93 | pplx 2770.81, bpc 11.4361
I0705 02:06:48.962699 140514656708352 train_gpu.py:303] [43500] | gnorm 0.60 lr 0.000196 | loss 7.93 | pplx 2783.54, bpc 11.4427

The first 12k steps are fine; after that, the loss increases. Is this normal or not?

opened by huseinzol05 18

How to export?

I needed to know how to write the serving function to export the trained xlnet model. I have this right now:

def serving_input_fn():
    with tf.variable_scope("model"):
        feature_spec = {
            "input_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
            "input_mask": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
            "segment_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
            "label_ids": tf.FixedLenFeature([], tf.int64),
        }
        serialized_tf_example = tf.placeholder(dtype=tf.string, shape=[None], name='input_example_tensor')
        receiver_tensors = {'examples': serialized_tf_example}
        features = tf.parse_example(serialized_tf_example, feature_spec)
        return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

EXPORT_DIR = 'gs://{}/export/{}'.format(BUCKET, TASK_VERSION)
estimator._export_to_tpu = False  # this is important
path = estimator.export_savedmodel(EXPORT_DIR, serving_input_fn)

This throws errors. Please note: this is the function that I used for BERT, and as I am no expert in TensorFlow, I don't understand why it won't work. It throws a type mismatch error.

opened by jinamshah 11

Can't load model in GCS directly

When I wanted to run the model on TPU, I used "gs://..." replace the ${LARGE_DIR}. But it turns out the IOError. Traceback (most recent call last): File "run_classifier.py", line 903, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "run_classifier.py", line 722, in main sp.Load(FLAGS.spiece_model_file) File "/usr/local/lib/python2.7/dist-packages/sentencepiece.py", line 118, in Load return _sentencepiece.SentencePieceProcessor_Load(self, filename) IOError: Not found: "gs://ykproject/pre-trained/xlnet_cased_L-24_H-1024_A-16/spiece.model": No such file or directory Error #2

Does this mean sp.Load() doesn't support loading a file from GCS, so I should change the code? Or is there something else I should do?
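One possible workaround, sketched here under the assumption that `sp.Load()` only reads from the local filesystem (which is what the traceback suggests), is to copy the model file to a local temporary path first. The helper `localize_model_file` and its `read_bytes` hook are hypothetical and not part of XLNet or SentencePiece; in a TensorFlow 1.x environment the hook could be something like `lambda p: tf.gfile.GFile(p, "rb").read()`.

```python
import os
import tempfile


def localize_model_file(path, read_bytes):
    """Return a local path with the contents of `path`.

    `read_bytes` is a caller-supplied function (hypothetical hook) that
    returns the bytes stored at a remote path such as "gs://...".
    Local paths are returned unchanged.
    """
    if not path.startswith("gs://"):
        return path
    data = read_bytes(path)
    # Keep the original file name as a suffix so the copy is recognizable.
    fd, local_path = tempfile.mkstemp(suffix="-" + os.path.basename(path))
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return local_path


# Example with a fake reader standing in for a real GCS client.
local = localize_model_file("gs://some-bucket/spiece.model",
                            lambda p: b"fake model bytes")
print(local)
```

The returned path can then be passed to `sp.Load()` in place of the `gs://` path.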

opened by yokinglou 11
Error when I train with train_gpu.py. (tf_records)

Hi, thanks for your contribution.

I was trying to preprocess my own data and train on my own GPU machine. After I created the tfrecords files from the wiki data and tried to run training, it failed with "TypeError: filenames must be a tf.data.Dataset of tf.string elements."

It looks like either a simple directory issue or the tfrecords were not created correctly, since the log shows that the number of record info paths is zero.

Does anyone have a similar issue? I'd appreciate any help.

I0629 08:26:34.599769 4393649600 tf_logging.py:115] n_token 32000
I0629 08:26:34.600133 4393649600 tf_logging.py:115] Use the following tfrecord dirs: ['data_out3/tfrecords']
I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0
I0629 08:26:34.601068 4393649600 tf_logging.py:115] [Dir 0] Number of chosen batches: 0
I0629 08:26:34.601134 4393649600 tf_logging.py:115] [Dir 0] Number of chosen files: 0
I0629 08:26:34.601197 4393649600 tf_logging.py:115] []
I0629 08:26:34.601253 4393649600 tf_logging.py:115] Total number of batches: 0
I0629 08:26:34.601778 4393649600 tf_logging.py:115] Total number of files: 0
I0629 08:26:34.601840 4393649600 tf_logging.py:115] []
I0629 08:26:34.601900 4393649600 tf_logging.py:115] num of batches 0
I0629 08:26:34.601970 4393649600 tf_logging.py:115] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.app.run()
  File "/Users/user/tf110/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/Users/user/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/Users/user/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/Users/user/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/Users/user/tf110/lib/python3.6/site-packages/tensorflow/python/data/ops/readers.py", line 194, in __init__
    "filenames must be a tf.data.Dataset of tf.string elements.")
TypeError: filenames must be a tf.data.Dataset of tf.string elements.

opened by rainmaker712 10
Error when pretraining XLNet with train_gpu.py

I have successfully run

spm_train \
  --input=$INPUT \
  --model_prefix=sp10m.cased.v3 \
  --vocab_size=32000 \
  --character_coverage=0.99995 \
  --model_type=unigram \
  --control_symbols=<cls>,<sep>,<pad>,<mask>,<eod> \
  --user_defined_symbols=<eop>,.,(,),",-,–,£,€ \
  --shuffle_input_sentence \
  --input_sentence_size=10000000

and

python data_utils.py \
  --bsz_per_host=32 \
  --num_core_per_host=16 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob=*.txt \
  --save_dir=${SAVE_DIR} \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=spiece.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

Now, when I run train_gpu.py:

sudo python3 train_gpu.py \
  --record_info_dir=/home/ubuntu/xlnet/training/tfrecords \
  --train_batch_size=2048 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --model_dir=/home/ubuntu/axalimodeli

I get the following error:

/usr/local/lib/python3.5/dist-packages/tensorflow-plugins
/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow-plugins
/usr/lib/python3/dist-packages/tensorflow-plugins
/usr/lib/python3.5/dist-packages/tensorflow-plugins
I0702 11:32:34.041983 139935611332352 train_gpu.py:319] n_token 32000
I0702 11:32:34.042275 139935611332352 data_utils.py:795] Use the following tfrecord dirs: ['/home/ubuntu/xlnet/training/tfrecords']
I0702 11:32:34.042413 139935611332352 data_utils.py:799] [0] Record glob: /home/ubuntu/xlnet/training/tfrecords/record_info-train-*.bsz-2048.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
I0702 11:32:34.042960 139935611332352 data_utils.py:803] [0] Num of record info path: 0
I0702 11:32:34.043075 139935611332352 data_utils.py:836] [Dir 0] Number of chosen batches: 0
I0702 11:32:34.043182 139935611332352 data_utils.py:838] [Dir 0] Number of chosen files: 0
I0702 11:32:34.043281 139935611332352 data_utils.py:839] []
I0702 11:32:34.043379 139935611332352 data_utils.py:846] Total number of batches: 0
I0702 11:32:34.043897 139935611332352 data_utils.py:848] Total number of files: 0
I0702 11:32:34.044010 139935611332352 data_utils.py:849] []
I0702 11:32:34.044113 139935611332352 train_gpu.py:204] num of batches 0
I0702 11:32:34.044229 139935611332352 data_utils.py:555] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/home/ubuntu/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/home/ubuntu/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/home/ubuntu/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 50, in _create_or_validate_filenames_dataset
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

Can you help me solve this? data_utils.py wrote the tfrecords files to the directory '/home/ubuntu/xlnet/training/tfrecords'.
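Both logs above report `Num of record info path: 0`, meaning that nothing matched the printed record glob. The glob embeds the batch size, sequence length, reuse length and masking settings, so a mismatch between preprocessing and training flags would leave it empty. In this report, for instance, data_utils.py was run with `--bsz_per_host=32`, while the glob looks for `bsz-2048`. The sketch below is one way to check this before launching training. It assumes the file-name format is exactly the one printed in the log, and the helper names `record_info_glob` and `check_record_info` are hypothetical, not part of the XLNet code.

```python
import glob
import os


def record_info_glob(record_dir, bsz, seq_len, reuse_len, mask_alpha,
                     mask_beta, num_predict, direction="bi", split="train"):
    """Rebuild the record-info glob in the format shown in the log output.

    The format string is copied from the log line, not from the XLNet
    source, so treat it as an assumption.
    """
    name = ("record_info-{}-*.bsz-{}.seqlen-{}.reuse-{}.{}"
            ".alpha-{}.beta-{}.fnp-{}.json").format(
                split, bsz, seq_len, reuse_len, direction,
                mask_alpha, mask_beta, num_predict)
    return os.path.join(record_dir, name)


def check_record_info(record_dir, **flags):
    """Return files matching the expected glob; report what exists otherwise."""
    pattern = record_info_glob(record_dir, **flags)
    matches = sorted(glob.glob(pattern))
    if not matches:
        present = sorted(glob.glob(os.path.join(record_dir, "record_info-*.json")))
        print("No files match:", pattern)
        print("record_info files actually present:", present)
    return matches


# Values taken from the train_gpu.py command in this report.
check_record_info("/home/ubuntu/xlnet/training/tfrecords", bsz=2048,
                  seq_len=512, reuse_len=256, mask_alpha=6, mask_beta=1,
                  num_predict=85)
```

If the files that are present carry a different `bsz-` value, or live in a different directory, then aligning the training flags with the preprocessing flags (or the other way round) is the first thing to try.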

opened by 3NFBAGDU 9
IMDB classification task - Colab notebook (GPU)

I have just successfully trained the IMDB classification task in a Colab notebook.

If this is of interest to anyone, the notebook code is here: https://github.com/CharlieBickerton/xlnet/blob/master/XLNet_imdb_GPU.ipynb

I've only just finished so it's not as clean as it could be.

opened by CharlieBickerton 9
Data Format for sentence matching (How to Prepare the data?)

If we want to use XLNet for a text classification task where each example is a sentence and its label, how can we do that? I see that the IMDB task is also classification, but it uses a different data format.

How to Prepare the data?

opened by hana9090 8

Preprocess data for SQuAD 1.1

First: Thanks for the great work and publishing all resources!

Question: Is there a way to preprocess the SQuAD 1.1 training data using run_squad.py?

I am using the prepro_squad.sh script with the v1.1 squad file as input and get the following error, because is_impossible keys are non-existent in this dataset:

Traceback (most recent call last):
  File "run_squad.py", line 1310, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_squad.py", line 1158, in main
    preprocess()
  File "run_squad.py", line 1129, in preprocess
    train_examples = read_squad_examples(FLAGS.train_file, is_training=True)
  File "run_squad.py", line 251, in read_squad_examples
    is_impossible = qa["is_impossible"]
KeyError: 'is_impossible'

I couldn't find a flag I could set to turn this off in the run_squad code.
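One possible workaround, given here as a sketch and not as the maintainers' recommended approach, is to add the missing key to the v1.1 file before preprocessing. SQuAD 1.1 contains only answerable questions, so `is_impossible` can be set to `False` everywhere. The function names and the file names in the usage comment are hypothetical.

```python
import io
import json


def add_is_impossible(squad):
    """Add "is_impossible": False to every question that lacks the key.

    `squad` is the parsed SQuAD JSON (a dict with a top-level "data" list).
    The dict is modified in place and also returned.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.setdefault("is_impossible", False)
    return squad


def convert_file(src_path, dst_path):
    """Read a SQuAD 1.1 file and write a copy with the extra key added."""
    with io.open(src_path, "r", encoding="utf-8") as f:
        squad = json.load(f)
    with io.open(dst_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(add_is_impossible(squad), ensure_ascii=False))


# Usage (file names are placeholders):
# convert_file("train-v1.1.json", "train-v1.1.with_flag.json")
```

The converted file can then be passed as the training file to the preprocessing script. An alternative with the same effect would be to change the failing line to read the key with a default value.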

opened by volker42maru 8

TypeError: Expected binary or unicode string, got None

Hi, did you try predicting with the CoLA processor? I am using the CoLA processor class (same as in BERT). It works fine for training and validation, but test prediction fails with the following error: TypeError: Expected binary or unicode string, got None

I am using the following command to predict: --do_train=False --do_eval=False --eval_split=test --do_predict=True --task_name=cola

What could be the problem?

Thanks

opened by aisheh90 7
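[Huawei] 2012 Lab - Project Cooperation, Exchange and Job Invitation - Zihang Dai

Hello Mr. Zihang, I am Chunli from the Central Media Technology Institute of Huawei 2012 Labs. We are currently doing research and technical exploration on topics such as CV, CG, AutoML and NLP, including but not limited to the following directions. AR & VR: CG (e.g. character animation, rendering, physics engines), 3D reconstruction, VPS, SLAM, intelligent interaction, virtual-real fusion, etc. Audio: music and sound effects (AI composition, music enhancement, 3D spatial audio, emotional synthesis, etc.), intelligent speech, sound pickup enhancement, etc. Video: autonomous driving, video surveillance, video communication, entertainment video, education and healthcare, AI acceleration, etc. Photography: computational photography, AI image processing, image generation, digital humans, etc. I look forward to having another open, in-depth conversation with you. My contact details are: email: chenchunli1@h-partners.com; phone: 17710876257; WeChat: 1274634225 (I am Xiaobao).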

opened by HanLu1226 0
run error about "InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[12,512,64], b.shape=[12,64,512], m=512, n=512, k=64, batch_size=12"
This is my run script, new_gpu_squad_bash.sh (modified from gpu_squad_base.sh):

# local path
SQUAD_DIR=../SQUAD
INIT_CKPT_DIR=../xlnet_cased_L-12_H-768_A-12
PROC_DATA_DIR=proc_data/squad
MODEL_DIR=experiment/squad_new_gpu

# Use 3 GPUs, each with 8 seqlen-512 samples

python ../run_squad.py \
  --use_tpu=False \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
  --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
  --output_dir=${PROC_DATA_DIR} \
  --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR} \
  --train_file=${SQUAD_DIR}/small_train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
  --uncased=False \
  --max_seq_length=512 \
  --do_train=True \
  --train_batch_size=1 \
  --do_predict=True \
  --predict_batch_size=1 \
  --learning_rate=2e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=1000 \
  --train_steps=12000 \
  --warmup_steps=1000 \
  $@

The run command is: CUDA_VISIBLE_DEVICES=0 bash new_gpu_squad_bash.sh

There should be enough GPU memory: ![2022-08-18 15-03-03 screenshot](https://user-images.githubusercontent.com/59367257/185326924-a7572b68-c36f-4096-aaa6-3f2be35fbb26.png)

However, the program reports an error:

(tensorflow_gpu_1_13) zaisen_ye@ubuntu-DeepLearning-2602056:/data/zaisen_ye/xlnet-master/scripts$ CUDA_VISIBLE_DEVICES=0 bash new_gpu_squad_bash.sh
2022-08-18 15:00:46.353987: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2022-08-18 15:00:47.049036: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557600f5b2d0 executing computations on platform CUDA. Devices:
2022-08-18 15:00:47.049082: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti, Compute Capability 8.6
2022-08-18 15:00:47.051712: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2022-08-18 15:00:47.054733: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557600fd8c20 executing computations on platform Host. Devices:
2022-08-18 15:00:47.054752: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2022-08-18 15:00:47.054866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.665
pciBusID: 0000:4f:00.0
totalMemory: 11.77GiB freeMemory: 11.53GiB
2022-08-18 15:00:47.054879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2022-08-18 15:00:47.057695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-08-18 15:00:47.057725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2022-08-18 15:00:47.057734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2022-08-18 15:00:47.057839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11215 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:4f:00.0, compute capability: 8.6)
INFO:tensorflow:Single device mode.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md

https://github.com/tensorflow/addons

If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa1087c9210>, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=1, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_tf_random_seed': None, '_device_fn': None, '_cluster': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_session_config': allow_soft_placement: true , '_global_id_in_cluster': 0, '_is_chief': True, '_protocol': None, '_save_checkpoints_steps': 1000, '_experimental_distribute': None, '_save_summary_steps': 100, '_model_dir': 'experiment/squad_new_gpu', '_master': ''}
WARNING:tensorflow:Estimator's model_fn (<function model_fn at 0x7f9fa03be3d0>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Input tfrecord file glob proc_data/squad/spiece.model..slen-512.qlen-64.train.tf_record
INFO:tensorflow:Find 1 input paths ['proc_data/squad/spiece.model.0.slen-512.qlen-64.train.tf_record']
WARNING:tensorflow:From /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From ../run_squad.py:1019: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.experimental.map_and_batch(...).
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type <dtype: 'float32'>
WARNING:tensorflow:From /data/zaisen_ye/xlnet-master/modeling.py:534: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From /data/zaisen_ye/xlnet-master/modeling.py:67: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
INFO:tensorflow:#params: 119082242
WARNING:tensorflow:From /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
WARNING:tensorflow:From /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.0422 to layer-0 grad of model/transformer/layer_0/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.0563 to layer-1 grad of model/transformer/layer_1/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.0751 to layer-2 grad of model/transformer/layer_2/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.1001 to layer-3 grad of model/transformer/layer_3/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.1335 to layer-4 grad of model/transformer/layer_4/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.1780 to layer-5 grad of model/transformer/layer_5/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.2373 to layer-6 grad of model/transformer/layer_6/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.3164 to layer-7 grad of model/transformer/layer_7/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.4219 to layer-8 grad of model/transformer/layer_8/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.5625 to layer-9 grad of model/transformer/layer_9/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 0.7500 to layer-10 grad of model/transformer/layer_10/ff/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/q/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/k/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/v/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/r/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/o/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/LayerNorm/beta:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/rel_attn/LayerNorm/gamma:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/ff/layer_1/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/ff/layer_1/bias:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/ff/layer_2/kernel:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/ff/layer_2/bias:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/ff/LayerNorm/beta:0
INFO:tensorflow:Apply mult 1.0000 to layer-11 grad of model/transformer/layer_11/ff/LayerNorm/gamma:0
INFO:tensorflow:Initialize from the ckpt ../xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt
INFO:tensorflow:*** Global Variables ****
INFO:tensorflow: name = model/transformer/r_w_bias:0, shape = (12, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/r_r_bias:0, shape = (12, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/word_embedding/lookup_table:0, shape = (32000, 768), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/r_s_bias:0, shape = (12, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/seg_embed:0, shape = (12, 2, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_0/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_1/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_2/rel_attn/LayerNorm/gamma:0, shape = (768,), 
INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_2/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_2/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_2/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_2/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_2/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_2/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_3/ff/LayerNorm/gamma:0, shape = 
(768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_4/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = 
model/transformer/layer_5/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_5/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT 
INFO:tensorflow: name = model/transformer/layer_6/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_6/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_7/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/r/kernel:0, shape = (768, 
12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_8/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT INFO:tensorflow: name = 
model/transformer/layer_9/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_9/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT INFO:tensorflow: name = model/transformer/layer_10/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_10/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/q/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/k/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/v/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/r/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/o/kernel:0, shape = (768, 12, 64), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/rel_attn/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/ff/layer_1/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/ff/layer_1/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/ff/layer_2/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/ff/layer_2/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/ff/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = model/transformer/layer_11/ff/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = start_logits/dense/kernel:0, shape = (768, 1)
INFO:tensorflow: name = start_logits/dense/bias:0, shape = (1,)
INFO:tensorflow: name = end_logits/dense_0/kernel:0, shape = (1536, 768)
INFO:tensorflow: name = end_logits/dense_0/bias:0, shape = (768,)
INFO:tensorflow: name = end_logits/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow: name = end_logits/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow: name = end_logits/dense_1/kernel:0, shape = (768, 1)
INFO:tensorflow: name = end_logits/dense_1/bias:0, shape = (1,)
INFO:tensorflow: name = answer_class/dense_0/kernel:0, shape = (1536, 768)
INFO:tensorflow: name = answer_class/dense_0/bias:0, shape = (768,)
INFO:tensorflow: name = answer_class/dense_1/kernel:0, shape = (768, 1)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2022-08-18 15:01:06.954873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2022-08-18 15:01:06.954938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-08-18 15:01:06.954945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2022-08-18 15:01:06.954951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2022-08-18 15:01:06.955044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11215 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:4f:00.0, compute capability: 8.6)
WARNING:tensorflow:From /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from experiment/squad_new_gpu/model.ckpt-0
WARNING:tensorflow:From /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into experiment/squad_new_gpu/model.ckpt.
2022-08-18 15:05:02.150154: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2022-08-18 15:05:17.835563: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_EXECUTION_FAILED
2022-08-18 15:05:17.835625: E tensorflow/stream_executor/cuda/cuda_blas.cc:2620] Internal: failed BLAS call, see log for details
2022-08-18 15:05:17.835644: I tensorflow/stream_executor/stream.cc:5027] [stream=0x557603b483d0,impl=0x557603b391d0] did not memcpy host-to-device; source: 0x7f9cb801b970
2022-08-18 15:05:17.835677: E tensorflow/stream_executor/cuda/cuda_blas.cc:2620] Internal: failed to copy memory from host to device in CUDABlas::DoBlasGemmBatched
Traceback (most recent call last):
  File "../run_squad.py", line 1317, in <module>
    tf.app.run()
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../run_squad.py", line 1216, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[12,512,64], b.shape=[12,64,512], m=512, n=512, k=64, batch_size=12
  [[node model/transformer/layer_0/rel_attn/einsum_4/MatMul (defined at /data/zaisen_ye/xlnet-master/modeling.py:133) ]]
  [[node add_1 (defined at ../run_squad.py:1088) ]]

Caused by op u'model/transformer/layer_0/rel_attn/einsum_4/MatMul', defined at:
  File "../run_squad.py", line 1317, in <module>
    tf.app.run()
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../run_squad.py", line 1216, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "../run_squad.py", line 1033, in model_fn
    outputs = function_builder.get_qa_outputs(FLAGS, features, is_training)
  File "/data/zaisen_ye/xlnet-master/function_builder.py", line 230, in get_qa_outputs
    input_mask=inp_mask)
  File "/data/zaisen_ye/xlnet-master/xlnet.py", line 222, in __init__
    ) = modeling.transformer_xl(**tfm_args)
  File "/data/zaisen_ye/xlnet-master/modeling.py", line 628, in transformer_xl
    reuse=reuse)
  File "/data/zaisen_ye/xlnet-master/modeling.py", line 309, in rel_multihead_attn
    r_r_bias, r_s_bias, attn_mask, dropatt, is_training, scale)
  File "/data/zaisen_ye/xlnet-master/modeling.py", line 133, in rel_attn_core
    ac = tf.einsum('ibnd,jbnd->ijbn', q_head + r_w_bias, k_head_h)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/ops/special_math_ops.py", line 262, in einsum
    axes_to_sum)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/ops/special_math_ops.py", line 394, in _einsum_reduction
    product = math_ops.matmul(t0, t1)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2417, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1423, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[12,512,64], b.shape=[12,64,512], m=512, n=512, k=64, batch_size=12
  [[node model/transformer/layer_0/rel_attn/einsum_4/MatMul (defined at /data/zaisen_ye/xlnet-master/modeling.py:133) ]]
  [[node add_1 (defined at ../run_squad.py:1088) ]]
`

The packages used in the program are as follows: (tensorflow-1.13.1, sentencepiece-0.1.91, cudatoolkit-10.0.130, cudnn-7.3.1)

(tensorflow_gpu_1_13) zaisen_ye@ubuntu-DeepLearning-2602056:/data/zaisen_ye/xlnet-master/scripts$ conda list
# packages in environment at /home/zaisen_ye/.conda/envs/tensorflow_gpu_1_13:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
_tflow_select 2.1.0 gpu
absl-py 0.15.0 pyhd3eb1b0_0
astor 0.8.1 pypi_0 pypi
backports 1.1 pyhd3eb1b0_0
backports-weakref 1.0.post1 pypi_0 pypi
backports.weakref 1.0.post1 py_1
blas 1.0 mkl
c-ares 1.18.1 h7f8727e_0
ca-certificates 2022.07.19 h06a4308_0
certifi 2020.6.20 pyhd3eb1b0_3
cudatoolkit 10.0.130 0
cudnn 7.3.1 cuda10.0_0
cupti 10.0.130 0
enum34 1.1.10 pypi_0 pypi
funcsigs 1.0.2 pypi_0 pypi
futures 3.3.0 py27_0
gast 0.5.3 pyhd3eb1b0_0
grpcio 1.41.1 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
hdf5 1.10.4 hb1b8bf9_0
intel-openmp 2022.0.1 h06a4308_3633
keras-applications 1.0.8 py_1
keras-preprocessing 1.1.2 pypi_0 pypi
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 7.5.0 ha8ba4b0_17
libgfortran4 7.5.0 ha8ba4b0_17
libgomp 11.2.0 h1234567_1
libprotobuf 3.11.2 hd408876_0
libstdcxx-ng 11.2.0 h1234567_1
linecache2 1.0.0 py_1
markdown 3.1.1 py27_0
mkl 2020.2 256
mkl-service 2.3.0 py27he904b0f_0
mkl_fft 1.0.15 py27ha843d7b_0
mkl_random 1.1.0 py27hd6b4f25_0
mock 3.0.5 py27_0
ncurses 6.3 h5eee18b_3
numpy 1.16.6 pypi_0 pypi
numpy-base 1.16.6 py27hde5b4d6_0
openssl 1.1.1q h7f8727e_0
pip 19.3.1 py27_0
protobuf 3.17.3 pypi_0 pypi
python 2.7.18 ha1903f6_2
readline 8.1.2 h7f8727e_1
scipy 1.2.1 py27h7c811a0_0
sentencepiece 0.1.91 pypi_0 pypi
setuptools 44.0.0 py27_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.2 h5082296_0
tensorboard 1.13.1 py27hf484d3e_0
tensorflow 1.13.1 gpu_py27hcb41dfa_0
tensorflow-estimator 1.13.0 py_0
tensorflow-gpu 1.13.1 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
traceback2 1.4.0 py27_0
unittest2 1.1.0 py27_0
werkzeug 1.0.1 pyhd3eb1b0_0
wheel 0.37.1 pyhd3eb1b0_0
zlib 1.2.12 h7f8727e_2

PS: I know my problem is similar to this question: https://stackoverflow.com/questions/43990046/tensorflow-blas-gemm-launch-failed, but it has not been solved there, and I'm not sure that question is clear enough or describes exactly the same problem as mine, so I'm posting it with my own error message. I think this problem is different from https://stackoverflow.com/questions/50911052/tensorflow-matmul-blas-xgemmbatched-launch-failed. @zihangdai
opened by ccutyear 1
Tokens and values

I am using XLNet on an Old English corpus, and I cannot find a way to stop XLNet from treating "+" as a separate word.

E.g., I have: Adrianus cw+a+d to Ritheus .

It splits the word cw+a+d into 5 tokens ("cw", "+", "a", "+", "d") instead of keeping it as one.

I looked into sentencepiece but could not find anything. Can someone please help me?

opened by Dhurim 0
Update data_utils.py

Hello, this issue explains why I submitted this PR. I sincerely hope my PR helps you. If you think it is useful, I hope you can merge it. Thank you, my friend.

opened by DLPerf 1
Performance issue in data_utils.py (by P3)

Hello! I've found a performance issue in data_utils.py: dataset = dataset.batch(bsz_per_core, drop_remainder=True) (line 573) should be called before dataset = dataset.cache().map(parser).repeat() (line 572), which could make your program more efficient.

Here is the TensorFlow documentation that supports this.

Besides, you need to check whether the function parser called in dataset = dataset.cache().map(parser).repeat() is affected, so that the changed code works properly. For example, if parser takes input of shape (x, y, z) before the fix, it would require input of shape (batch_size, x, y, z) after the fix.
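The effect on the parser's input can be illustrated without TensorFlow. The following is a minimal, framework-free sketch: batch, parse_one, and parse_batch are hypothetical helpers written for this illustration (they are not functions from data_utils.py), and the comma-separated strings merely stand in for serialized examples. It shows that moving the batching step in front of the map step requires a parser that accepts a whole batch, while the final result stays the same.

```python
from itertools import islice


def batch(records, batch_size, drop_remainder=True):
    """Group an iterable into lists of batch_size items (hypothetical helper)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, batch_size))
        if len(chunk) < batch_size:
            # Emit the short final chunk only when the remainder is kept.
            if chunk and not drop_remainder:
                yield chunk
            return
        yield chunk


def parse_one(record):
    """Per-example parser: one record in, one row out."""
    return [int(x) for x in record.split(",")]


def parse_batch(records):
    """Batched parser: a list of records in, one row per record out."""
    return [parse_one(r) for r in records]


records = ["1,2,3", "4,5,6", "7,8,9", "10,11,12", "13,14,15"]

# Original order: map, then batch (the parser sees one record at a time).
map_then_batch = list(batch(map(parse_one, records), 2))

# Proposed order: batch, then map (the parser sees a whole batch at a time).
batch_then_map = [parse_batch(chunk) for chunk in batch(records, 2)]

print(map_then_batch == batch_then_map)
```

In this toy version parse_batch simply loops over parse_one, so it gains no speed; in a real input pipeline the benefit would come from a parser that processes the whole batch in one vectorized call.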

Looking forward to your reply. By the way, I would be glad to create a PR to fix it if you are too busy.

opened by DLPerf 1
Performance issues in the program

Hello, I found a performance issue in the definition of parse_files_to_dataset in zihangdai/xlnet/blob/master/data_utils.py: dataset = dataset.cache().map was called without num_parallel_calls. I think adding it would increase the efficiency of your program.

Here is the TensorFlow documentation that supports this.
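The idea behind a parallel map can be sketched with the Python standard library alone. This is an analogy, not tf.data code: it swaps in concurrent.futures.ThreadPoolExecutor to show several records being parsed concurrently while the output order is preserved. The parser function, the sleep that stands in for parsing cost, and the worker count of 4 are all assumptions made for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parser(record):
    """Hypothetical parser; the sleep stands in for per-record parsing cost."""
    time.sleep(0.01)
    return record.upper()


records = ["a", "b", "c", "d"]

# Sequential map: records are parsed one after another.
sequential = [parser(r) for r in records]

# Parallel map: up to 4 records are parsed at once.
# Executor.map returns results in input order, like an ordered dataset map.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(parser, records))

print(parallel == sequential)
```

How much a parallel map helps depends on how expensive the parser is relative to the rest of the input pipeline, so the worker count is worth measuring rather than guessing.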

Looking forward to your reply. By the way, I would be glad to create a PR to fix it if you are too busy.

opened by DLPerf 0

Owner

Zihang Dai

Sentence Embeddings with BERT & XLNet

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

A combination of autoregressors and autoencoders using XLNet for sentiment analysis

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Natural language Understanding Toolkit

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

Watson Natural Language Understanding and Knowledge Studio

PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

PyTorch Implementation of "Non-Autoregressive Neural Machine Translation"

Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

Learning to Rewrite for Non-Autoregressive Neural Machine Translation

PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework