Regression Transformer
Codebase to experiment with a hybrid Transformer that combines conditional sequence generation with regression.
Development setup
conda env create -f conda.yml
conda activate terminator
pip install -e .
Generate some data
Example data for QED can be generated using scripts/generate_example_data.py.
python scripts/generate_example_data.py examples/example.smi examples/qed_property_example.txt
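The generated file presumably pairs each molecule with its QED value in the property-prefixed format used by the tokenizer example below (e.g. <qed>0.3936|CBr). As a rough illustration of that line format only (not the script itself), such lines could be written with RDKit; the SMILES list and output path here are made up:

# Illustration of the <qed>value|SMILES line format; not scripts/generate_example_data.py.
from rdkit import Chem
from rdkit.Chem.QED import qed

smiles_list = ["CBr", "CCO", "c1ccccc1"]  # hypothetical input molecules
with open("examples/qed_sketch.txt", "w") as f:
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip unparsable SMILES
        f.write(f"<qed>{qed(mol):.4f}|{smiles}\n")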
If you need to create a new vocabulary for a dataset, you can use scripts/create_vocabulary.py; it will also automatically add some special tokens at the top of your vocabulary file.
python scripts/create_vocabulary.py examples/qed_property_example.txt examples/vocab.txt
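Since the special tokens are added at the top of the file, a quick sanity check is to print the first few vocabulary entries (plain Python, just for inspection):

# Print the first vocabulary entries; the special tokens should appear at the top.
with open("examples/vocab.txt") as f:
    for line_number, token in enumerate(f):
        if line_number >= 10:
            break
        print(line_number, token.rstrip("\n"))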
At this point, the folder containing the vocabulary file can be used to load a tokenizer compatible with any ExpressionBertTokenizer:
>>> from terminator.tokenization import ExpressionBertTokenizer
>>> tokenizer = ExpressionBertTokenizer.from_pretrained('examples')
>>> text = '<qed>0.3936|CBr'
>>> tokens = tokenizer.tokenize(text)
>>> print(tokens)
['<qed>', '_0_0_', '_._', '_3_-1_', '_9_-2_', '_3_-3_', '_6_-4_', '|', 'C', 'Br']
>>> token_indexes = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
>>> print(token_indexes)
[16, 17, 18, 28, 45, 34, 35, 19, 15, 63]
>>> tokenizer.build_inputs_with_special_tokens(token_indexes)
[12, 16, 17, 18, 28, 45, 34, 35, 19, 15, 63, 13]
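The numerical tokens follow the pattern _d_p_, i.e. digit d at decimal place p (so _3_-1_ is a 3 in the first position after the decimal point). As a plain-Python illustration (not a library call), the property value can be reconstructed from those tokens:

# Rebuild the float 0.3936 from the digit/decimal-place tokens shown above.
digit_tokens = ['_0_0_', '_3_-1_', '_9_-2_', '_3_-3_', '_6_-4_']
value = sum(int(t.split('_')[1]) * 10 ** int(t.split('_')[2]) for t in digit_tokens)
print(value)  # ~0.3936, up to floating-point rounding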
Prepare some train/eval data line by line:
head -n 900 examples/qed_property_example.txt > examples/train.txt
tail -n +901 examples/qed_property_example.txt > examples/eval.txt
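The two commands above simply use the first 900 lines for training and the rest for evaluation. If a shuffled split is preferred, here is a minimal Python sketch (file names as above; the 90/10 ratio is chosen arbitrarily):

# Optional: shuffled 90/10 train/eval split instead of the head/tail split above.
import random

with open("examples/qed_property_example.txt") as f:
    lines = f.readlines()
random.seed(42)
random.shuffle(lines)
cut = int(0.9 * len(lines))
with open("examples/train.txt", "w") as f:
    f.writelines(lines[:cut])
with open("examples/eval.txt", "w") as f:
    f.writelines(lines[cut:])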
Launch the training:
python scripts/run_language_modeling.py --output_dir examples/models/xlnet_selfies \
--config_name configs/xlnet_selfies.json --tokenizer_name ./examples/vocab.txt \
--do_train --do_eval --learning_rate 1e-4 --num_train_epochs 5 --save_total_limit 2 \
--save_steps 500 --per_gpu_train_batch_size 16 --evaluate_during_training --eval_data_file ./examples/eval.txt \
--train_data_file ./examples/train.txt --line_by_line --block_size 510 --seed 42 --logging_steps 250
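Once training finishes, the checkpoint in --output_dir can be reloaded for a quick check. A minimal sketch, assuming the script writes a standard Hugging Face checkpoint (weights, config and tokenizer files) to that directory and that the XLNet language-modeling head from transformers is the right model class:

# Sketch: reload the trained checkpoint and tokenize an annotated example.
from transformers import XLNetLMHeadModel
from terminator.tokenization import ExpressionBertTokenizer

model_dir = "examples/models/xlnet_selfies"  # the --output_dir used above
tokenizer = ExpressionBertTokenizer.from_pretrained(model_dir)
model = XLNetLMHeadModel.from_pretrained(model_dir)  # assumed model class

ids = tokenizer.build_inputs_with_special_tokens(
    tokenizer.convert_tokens_to_ids(tokenizer.tokenize("<qed>0.3936|CBr"))
)
print(len(ids), "input ids; model has", model.config.n_layer, "layers")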
Example model configurations (number of heads, layers, etc.) can be found in the configs folder.
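For instance, a config can be loaded and inspected through the transformers API (a sketch; the field values depend on the chosen file):

# Sketch: inspect an example XLNet configuration from the configs folder.
from transformers import XLNetConfig

config = XLNetConfig.from_json_file("configs/xlnet_selfies.json")
print(config.n_layer, "layers,", config.n_head, "attention heads")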