The calculation of the loss function is deceptive: it lets the model fit very quickly and appear to produce very good translations. Unfortunately this is not real, and the model cannot spell out the German sentence by itself. It can only complete an i-token German sentence when it is given the first (i-1) gold tokens, which means it cannot generate a whole sentence from just a start-of-sentence tag.
It seems reasonable enough to use this loss to train the network, but unreasonable to use it to assess the network's translation ability, though I have yet to train this network to its full potential.
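To make the pitfall concrete, here is a tiny sketch of what that per-step evaluation actually asks of the model (the token ids and special-token ids are made up for illustration):

import torch

SOS, EOS = 1, 2                                  # assumed special-token ids, for illustration only
german = torch.tensor([[SOS, 11, 42, 7, EOS]])   # made-up ids standing in for "<sos> Ich bin müde <eos>"
for i in range(german.shape[1] - 1):
    prefix = german[:, :i+1]                     # the gold prefix of length i+1
    target = german[:, i+1]                      # the single next token the model has to guess
    # The model never has to recover from its own earlier mistakes, which is
    # why the loss can look very good without any real generation ability.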
###################
## Original leave-last-token-out decoder.
## Not sure what the exact error in this calculation is,
## but maybe it is because the model sees the mask token directly?
##
# Output German, one token at a time
all_outs = torch.tensor([], requires_grad=True).to(device)
for i in range(item["german"].shape[1] - 1):
    out = model(item["german"][:, :i+1])          # feed the gold prefix of length i+1
    all_outs = torch.cat((all_outs, out), dim=1)  # collect the per-step predictions
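If the problem really is that the model sees tokens it should not, the standard fix in a transformer decoder is a causal (subsequent-position) mask. Whether model() already applies one internally is not visible from this snippet, so the following is only a sketch of the usual pattern:

import torch

def causal_mask(seq_len, device):
    # True above the diagonal means "future position: do not attend to it"
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1)

# Inside the attention layer, masked positions would get -inf before the softmax, e.g.:
# attention_weights = attention_weights.masked_fill(causal_mask(L, device), float("-inf"))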
# ###################
# My variation of the leave-last-token-out decoder, used at training
# output_vocab_size = german_vocab_len
g = item["german"].shape
x = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)   # starts as all zeros; gold tokens are revealed one position per step
all_outs = torch.tensor([], requires_grad=True).to(device)
for i in range(item["german"].shape[1] - 1):
    xx = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
    out = model(x)                                # predict before revealing the next gold token
    xx[:, i:i+1] = item["german"][:, i:i+1]       # copy in the gold token at position i
    x = x + xx                                    # reveal it for the following step
    all_outs = torch.cat((all_outs, out), dim=1)
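For completeness, this is roughly how all_outs would feed a cross-entropy loss. It is only a sketch, not the actual training loop: it assumes all_outs ends up as [batch, seq_len-1, vocab], that each step's output lines up with the next gold German token, and that 0 is the padding id.

import torch.nn.functional as F

targets = item["german"][:, 1:]                       # assumed alignment: step i predicts token i+1
loss = F.cross_entropy(
    all_outs.reshape(-1, all_outs.size(-1)),          # [batch * (seq_len-1), vocab]
    targets.reshape(-1),                              # [batch * (seq_len-1)]
    ignore_index=0,                                   # assuming 0 is the padding/mask id
)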
# ###################
# My variation of a beam-search-style decoder (strictly speaking it is greedy,
# i.e. beam width 1: each step commits to the single most likely token)
model.encode(item["english"][:, 1:-1])
g = item["german"].shape
x = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
all_outs = torch.tensor([], requires_grad=True).to(device)
for i in range(item["german"].shape[1] - 1):
    out = model(x)                                # predict from the model's own previous guesses
    x[:, i:i+1] = out.argmax(axis=-1)             # write the predicted token back into the input
    all_outs = torch.cat((all_outs, out), dim=1)
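The same loop can also run without the gold German entirely, which is exactly the ability questioned at the top of this section. This is a sketch under a few assumptions: a start-of-sentence id sos_id, a chosen max_len, and model(x) returning next-token logits exactly as in the loop above.

max_len = 50                                          # assumed maximum output length
sos_id = 1                                            # assumed id of the start-of-sentence tag
model.encode(item["english"][:, 1:-1])
x = torch.zeros([item["english"].shape[0], max_len], dtype=torch.long).to(device)
x[:, 0] = sos_id
for i in range(1, max_len):
    out = model(x)                                    # assumed to return next-token logits, as above
    x[:, i:i+1] = out.argmax(axis=-1)                 # greedily commit to the most likely token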
I found this glitch while fiddling with the attention layer at its core: zeroing out the attention weights did no harm to the performance of a last-token-only model. In
sub_layers.py:
attention_weights = F.softmax(attention_weights, dim=2)
attention_weights = attention_weights * 0.   ## Try this!
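For context, those two lines would sit inside something like the following scaled dot-product attention. This is a guess at the surrounding code in sub_layers.py; only the two lines quoted above are from the actual file.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch, seq_len, d_k] (assumed shapes)
    d_k = q.size(-1)
    attention_weights = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    attention_weights = F.softmax(attention_weights, dim=2)
    attention_weights = attention_weights * 0.   # the experiment: every position attends to nothing
    return torch.matmul(attention_weights, v)    # all zeros, so any remaining performance cannot come from attention

If the model still scores well with the attention output forced to zero, whatever it relies on presumably flows through the residual and feed-forward paths alone, which would support the suspicion that the last-token-only evaluation is simply too easy.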