LongScientificFormer
To encode texts longer than 512 tokens (for example, 800 tokens), set max_pos to 800 during both preprocessing and training.
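For example, assuming the scripts expose this as a -max_pos command-line flag (a PreSumm-style convention, not confirmed here), it would be appended to both the Step 7 formatting command and the training command:
python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -max_pos 800 -n_cpus 40 -log_file ../logs/build_bert_files.log
python train.py -ext_dropout 0.1 -lr 2e-3 -max_pos 800 -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000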
Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py).
Data Preparation
Step 1. Download the processed data
Put all files into the raw_data directory.
Step 2. Download Stanford CoreNLP
We will need Stanford CoreNLP to tokenize the data. Download it from https://stanfordnlp.github.io/CoreNLP/ and unzip it. Then add the following command to your bash_profile:
export CLASSPATH=/path/to/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar
replacing /path/to/ with the path to where you saved the stanford-corenlp-4.2.2 directory.
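To verify that the jar is on the CLASSPATH, you can run a quick tokenizer check; it should print the tokenized words to the terminal:
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer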
Step 3. Extract sections from GROBID XML files
python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log
Step 4. Extract text from TIKA XML files
python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log
Step 5. Tokenize texts from papers and slides using Stanford CoreNLP
python preprocess.py -mode tokenize -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log
Step 6. Extract source, section, and target from tokenized files
python preprocess.py -mode clean_paper_jsons -save_path ../json_data/ -n_cpus 10 -log_file ../logs/build_json.log
Step 7. Generate BERT .pt files from source, sections and targets
python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -n_cpus 40 -log_file ../logs/build_bert_files.log
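To sanity-check the output, you can load one of the generated shards in Python. This is a minimal sketch that assumes each .pt file is a PyTorch-serialized list of example dicts (a PreSumm-style layout, not confirmed here); the actual keys may differ:
import glob
import torch
# Assumption: each shard is a list of example dicts; adjust if the on-disk format differs.
paths = sorted(glob.glob('../bert_data/*.pt'))
examples = torch.load(paths[0])
print(paths[0], '->', len(examples), 'examples')
print('keys of first example:', list(examples[0].keys()))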
Model Training
First run: for the first training run, use a single GPU so the code can download the BERT model (pass -visible_gpus -1). After the download completes, you can kill the process and rerun the code with multiple GPUs.
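For example, the first run can reuse the training command below with -visible_gpus -1:
python train.py -ext_dropout 0.1 -lr 2e-3 -visible_gpus -1 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000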
Train
python train.py -ext_dropout 0.1 -lr 2e-3 -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
To continue training from a checkpoint
python train.py -ext_dropout 0.1 -lr 2e-3 -train_from ../models/model_step_99000.pt -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
Test
python train.py -mode test -test_batch_size 1 -bert_data_path ../bert_data -log_file ../logs/ext_bert_test -test_from ../models/model_step_99000.pt -model_path ../models -sep_optim true -use_interval true -visible_gpus 1,2,3 -alpha 0.95 -result_path ../results/ext