Transformer-PyTorch
A PyTorch implementation of the Transformer from the paper Attention Is All You Need, in both the Post-LN (Post-LayerNorm) and Pre-LN (Pre-LayerNorm) variants.
Pre-LN applies LayerNorm to the input of each sublayer, whereas Post-LN applies it after the residual connection. The architecture proposed in the paper is Post-LN; however, the official implementation was later changed to the Pre-LN version. Experiments show that the Pre-LN Transformer converges faster, needs no learning-rate warm-up, and is less sensitive to hyperparameters. For more detail on the difference, see the paper On Layer Normalization in the Transformer Architecture.
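For illustration, here is a minimal sketch of the two placements. The `pre_lnorm` flag mirrors the flag used in this repo, but the module itself is a simplified assumption rather than the repo's exact code:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Wraps a sublayer (attention or feed-forward) in a residual
    connection, with LayerNorm placed according to pre_lnorm."""

    def __init__(self, d_model, dropout=0.1, pre_lnorm=True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.pre_lnorm = pre_lnorm

    def forward(self, x, sublayer):
        if self.pre_lnorm:
            # Pre-LN: normalize the sublayer input; the residual path
            # stays an identity mapping.
            return x + self.dropout(sublayer(self.norm(x)))
        # Post-LN: normalize after the residual addition (as in the paper).
        return self.norm(x + self.dropout(sublayer(x)))
```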
A star would be much appreciated if you like this repo!
Dataset
The small English-German dataset from the WMT 2016 multimodal task (Multi30k), available through torchtext.
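A sketch of how this dataset can be loaded, assuming the legacy torchtext API (pre-0.9, matching the PyTorch >= 1.2.0 requirement below); the special tokens and `min_freq` are illustrative choices, not necessarily the repo's:

```python
from torchtext.data import Field, BucketIterator
from torchtext.datasets import Multi30k

# spaCy tokenizers; the "de" and "en" spaCy models must be installed.
SRC = Field(tokenize="spacy", tokenizer_language="de",
            init_token="<sos>", eos_token="<eos>", lower=True)
TRG = Field(tokenize="spacy", tokenizer_language="en",
            init_token="<sos>", eos_token="<eos>", lower=True)

# Downloads the WMT 2016 multimodal data on first use.
train_data, valid_data, test_data = Multi30k.splits(
    exts=(".de", ".en"), fields=(SRC, TRG))

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=128)
```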
Prerequisites
- Python 3
- PyTorch >= 1.2.0
- torchtext
- spacy
- nltk
- tqdm
Implementation Notes
- Beam search is not supported.
- Label smoothing is not implemented.
- Byte-pair encoding (BPE) is not used.
Usage
- Run `transformer.ipynb` to download the dataset and train the model.
- Set the flag `pre_lnorm` to choose between the Pre-LN and Post-LN variants (see the sketch after this list).
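To compare the two placements outside the notebook, the `SublayerConnection` sketch above can be exercised directly (again, purely illustrative):

```python
import torch

x = torch.randn(2, 5, 512)  # (batch, seq_len, d_model)
ff = torch.nn.Sequential(torch.nn.Linear(512, 2048),
                         torch.nn.ReLU(),
                         torch.nn.Linear(2048, 512))

pre = SublayerConnection(d_model=512, pre_lnorm=True)    # Pre-LN placement
post = SublayerConnection(d_model=512, pre_lnorm=False)  # Post-LN placement
print(pre(x, ff).shape, post(x, ff).shape)               # both: (2, 5, 512)
```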
Evaluation
- Parameter settings
- hidden size (d_model): 512
- feed-forward size: 2048
- attention heads: 8
- layers: 6
- warm-up steps: 2000
- batch size: 128
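The warm-up value plugs into the learning-rate schedule from the paper, `d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)`; whether this repo follows the formula exactly is an assumption, but it is the standard choice:

```python
import torch

def noam_lr(step, d_model=512, warmup=2000):
    """Linear warm-up for `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(512, 512)  # stand-in; use the actual Transformer here

# Adam settings from the paper; base lr 1.0 so LambdaLR yields noam_lr itself.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
```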
Generated Examples
Here's an example from the test data:
- source
eine frau verwendet eine bohrmaschine während ein mann sie fotografiert .
- gold
a woman uses a drill while another man takes her picture .
- inference
a woman uses an electric drill as a man takes a picture .
TODO
- Label smoothing
- Attention visualization