Overview

Simple Transformer

An implementation of the "Attention is all you need" paper without extra bells and whistles or difficult syntax.

Note: The only extras added are Dropout regularization in some layers and an option to use the GPU.

Install

python -m pip install -r requirements.txt

Toy data

python train_toy_data.py
Results before training and after 100 epochs are shown as images in the repository.

English -> German Europarl dataset

python train_translate.py

Training on a small subset of 1,000 sentences (included in this repo); the result is shown as an image in the repository.
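
For orientation, here is a minimal usage sketch based on the TransformerTranslator interface quoted in the comments below; the import path, hyper-parameters, and tensor shapes are illustrative assumptions rather than values taken from this repo.

    import torch
    # Assumed import path; the class is quoted in the comments below.
    from model import TransformerTranslator

    # Illustrative hyper-parameters; the repo's defaults may differ.
    model = TransformerTranslator(embed_dim=64, num_blocks=2, num_heads=4,
                                  vocab_size=10000, CUDA=False)

    # The interface is two-step: encode the source sentence once, then call
    # the model on a (partial) target sentence to get next-token scores.
    english_ids = torch.randint(0, 10000, (1, 12))  # dummy English token ids
    german_ids = torch.randint(0, 10000, (1, 12))   # dummy German token ids
    model.encode(english_ids)
    scores = model(german_ids)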

You might also like...
Official repository of the paper Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision

This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"

Differentiable Volumetric Rendering Paper | Supplementary | Spotlight Video | Blog Entry | Presentation | Interactive Slides | Project Page This repos

Unofficial & improved implementation of NeRF--: Neural Radiance Fields Without Known Camera Parameters

[Unofficial code-base] NeRF--: Neural Radiance Fields Without Known Camera Parameters [ Project | Paper | Official code base ] ⬅️ Thanks the original

[BMVC2021] The official implementation of "DomainMix: Learning Generalizable Person Re-Identification Without Human Annotations"

DomainMix [BMVC2021] The official implementation of "DomainMix: Learning Generalizable Person Re-Identification Without Human Annotations" [paper] [de

BBB streaming without Xorg and Pulseaudio and Chromium and other nonsense (heavily WIP)

BBB Streamer NG? Makes a conference like this... ...streamable like this! I also recorded a small video showing the basic features: https://www.youtub

The LaTeX and Python code for generating the paper, experiments' results and visualizations reported in each paper is available (whenever possible) in the paper's directory

This repository contains the software implementation of most algorithms used or developed in my research. The LaTeX and Python code for generating the

a delightful machine learning tool that allows you to train, test and use models without writing code

igel A delightful machine learning tool that allows you to train/fit, test and use models without writing code Note I'm also working on a GUI desktop

A framework for joint super-resolution and image synthesis, without requiring real training data

SynthSR This repository contains code to train a Convolutional Neural Network (CNN) for Super-resolution (SR), or joint SR and data synthesis. The met

Image morphing without reference points by applying warp maps and optimizing over them.

Differentiable Morphing Image morphing without reference points by applying warp maps and optimizing over them. Differentiable Morphing is machine lea

Comments
  • Fix loss function

    The calculation of the loss function is flawed and allows the model to fit very quickly and produce seemingly very good translations. Unfortunately this is not real, and the model is not able to spell out the German sentence by itself. It can only complete an i-length German sentence if you give it the first (i-1) tokens, which means it cannot generate the whole sentence from a start-of-sentence tag.

    It seems reasonable enough to use this loss to train the network, but unreasonable to use it to assess translation ability, though I have yet to train this network to its full capacity.

    
    
    ###################
    ## Original leave-last-token-out decoder.
    ## Not sure what the exact error in this calculation is,
    ## but maybe it is because the model sees the mask token directly?
    ##
    # Output German, one token at a time
    all_outs = torch.tensor([], requires_grad=True).to(device)
    for i in range(item["german"].shape[1] - 1):
        out = model(item["german"][:, :i + 1])
        all_outs = torch.cat((all_outs, out), dim=1)
    
    
    # ###################
    # My variation of the leave-last-token-out decoder, used at training
    # output_vocab_size = german_vocab_len

    g = item["german"].shape
    x = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
    all_outs = torch.tensor([], requires_grad=True).to(device)
    for i in range(item["german"].shape[1] - 1):
        xx = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
        out = model(x)
        xx[:, i:i + 1] = item["german"][:, i:i + 1]
        x = x + xx
        all_outs = torch.cat((all_outs, out), dim=1)
    
    # ###################
    # My variation of a beam-search-style decoder (effectively greedy: argmax at each step)
    model.encode(item["english"][:, 1:-1])
    g = item["german"].shape
    x = torch.zeros([g[0], g[1]], dtype=torch.long).to(device)
    all_outs = torch.tensor([], requires_grad=True).to(device)
    for i in range(item["german"].shape[1] - 1):
        out = model(x)
        x[:, i:i + 1] = out.argmax(axis=-1)
        all_outs = torch.cat((all_outs, out), dim=1)
    
    

    I found this glitch when fiddling with the attention layer at its core: zeroing the attention weights did no harm to the performance of a last-token-only model, in

    sub_layers.py

    attention_weights = F.softmax(attention_weights, dim=2)
    attention_weights = attention_weights * 0.   ## Try this!
    
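    For comparison, here is a minimal free-running greedy decoder (a sketch, not code from this repo): the model receives only a start-of-sentence token and must extend its own argmax predictions, which is the setting described above where it fails. greedy_decode, sos_id and max_len are placeholder names, and the next-token output shape is assumed from the snippets above.

    import torch

    def greedy_decode(model, english_ids, sos_id, max_len, device):
        # Encode the source once (slicing off <sos>/<eos>, as in the snippet above).
        model.encode(english_ids[:, 1:-1])
        batch = english_ids.shape[0]
        # Start from only the start-of-sentence token.
        tokens = torch.full((batch, 1), sos_id, dtype=torch.long, device=device)
        for _ in range(max_len - 1):
            out = model(tokens)              # assumed shape: (batch, 1, vocab)
            next_token = out.argmax(dim=-1)  # (batch, 1), greedy pick
            tokens = torch.cat((tokens, next_token), dim=1)
        return tokens

    Scoring the output of such a loop (for example with BLEU against the reference) would measure actual generation rather than teacher-forced completion.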
    opened by shouldsee 8
  • Thanks for sharing, but questions remain

    Dear devs,

    I find this repo simple and smooth to run. However, I am confused about why you used the same embedding for the input and the output.

    Specifically, both def forward and def encode use the same reference to self.embedding. This looks weird, doesn't it? Shouldn't the source language use a different embedding from the destination language?

    
    import torch
    import torch.nn as nn
    # Embeddings, Encoder and Decoder are defined elsewhere in this repo.

    class TransformerTranslator(nn.Module):
        def __init__(self, embed_dim, num_blocks, num_heads, vocab_size, CUDA=False):
            super(TransformerTranslator, self).__init__()
            self.embedding = Embeddings(vocab_size, embed_dim, CUDA=CUDA)
            self.encoder = Encoder(embed_dim, num_heads, num_blocks, CUDA=CUDA)
            self.decoder = Decoder(embed_dim, num_heads, num_blocks, vocab_size, CUDA=CUDA)
            self.encoded = False
            self.device = torch.device('cuda:0' if CUDA else 'cpu')

        def encode(self, input_sequence):
            # The shared embedding table is used here for the source sequence...
            embedding = self.embedding(input_sequence).to(self.device)
            self.encode_out = self.encoder(embedding)
            self.encoded = True

        def forward(self, output_sequence):
            if self.encoded == False:
                print("ERROR::TransformerTranslator:: MUST ENCODE FIRST.")
                return output_sequence
            else:
                # ...and here again for the target sequence.
                embedding = self.embedding(output_sequence)
                return self.decoder(self.encode_out, embedding)
    
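    For reference, one common alternative, sketched here rather than taken from the repo, is to keep two separate embedding tables, one per language (the class name and src_vocab_size/tgt_vocab_size are placeholders; torch, nn, Embeddings, Encoder and Decoder are as in the quoted class above):

    class TwoEmbeddingTranslator(nn.Module):
        def __init__(self, embed_dim, num_blocks, num_heads,
                     src_vocab_size, tgt_vocab_size, CUDA=False):
            super().__init__()
            # Separate tables: one embedding per language instead of a shared one.
            self.src_embedding = Embeddings(src_vocab_size, embed_dim, CUDA=CUDA)
            self.tgt_embedding = Embeddings(tgt_vocab_size, embed_dim, CUDA=CUDA)
            self.encoder = Encoder(embed_dim, num_heads, num_blocks, CUDA=CUDA)
            self.decoder = Decoder(embed_dim, num_heads, num_blocks, tgt_vocab_size, CUDA=CUDA)
            self.encoded = False
            self.device = torch.device('cuda:0' if CUDA else 'cpu')

        def encode(self, input_sequence):
            embedding = self.src_embedding(input_sequence).to(self.device)
            self.encode_out = self.encoder(embedding)
            self.encoded = True

        def forward(self, output_sequence):
            if not self.encoded:
                raise RuntimeError("TwoEmbeddingTranslator: must call encode() first.")
            embedding = self.tgt_embedding(output_sequence)
            return self.decoder(self.encode_out, embedding)

    That said, sharing a single embedding table is also a legitimate design when both languages are tokenized into one joint vocabulary; the original "Attention is all you need" paper ties the encoder and decoder embedding weights this way, so whether this is a problem depends on how the vocabulary is built here.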
    opened by shouldsee 2
Implementation of Vaswani, Ashish, et al. "Attention is all you need."

Attention Is All You Need Paper Implementation This is my from-scratch implementation of the original transformer architecture from the following pape

Brando Koch 195 Dec 30, 2022
TensorFlow implementation of "Attention is all you need (Transformer)"

[TensorFlow 2] Attention is all you need (Transformer) TensorFlow implementation of "Attention is all you need (Transformer)" Dataset The MNIST datase

YeongHyeon Park 4 Jan 5, 2022
Code and data to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021

Mozhdeh Gheini 16 Jul 16, 2022
[ACM MM 2021] Yes, "Attention is All You Need", for Exemplar based Colorization

Transformer for Image Colorization This is an implementation for Yes, "Attention Is All You Need", for Exemplar based Colorization, and the current soft

Wang Yin 30 Dec 7, 2022
TLDR; Train custom adaptive filter optimizers without hand tuning or extra labels.

AutoDSP TLDR; Train custom adaptive filter optimizers without hand tuning or extra labels. About Adaptive filtering algorithms are commonplace in sign

Jonah Casebeer 48 Sep 19, 2022
Enigma-Plus - Python based Enigma machine simulator with some extra features

Enigma-Plus Python based Enigma machine simulator with some extra features Examp

null 1 Jan 5, 2022
Syntax-Aware Action Targeting for Video Captioning

Syntax-Aware Action Targeting for Video Captioning Code for SAAT from "Syntax-Aware Action Targeting for Video Captioning" (Accepted to CVPR 2020). Th

null 59 Oct 13, 2022
Code for the ICML 2021 paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

ViLT Code for the paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" Install pip install -r requirements.txt pip

Wonjae Kim 922 Jan 1, 2023