# ReLA (Rectified Linear Attention) Transformer
Implementation of a Transformer using ReLA (Rectified Linear Attention). It will also contain an attempt to fold the feedforward network into the ReLA layer as memory keys / values, as proposed in the All-Attention paper, a suggestion made by Charles Foster.
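
The idea from the cited paper (arXiv 2104.07012) is to replace the softmax in attention with a ReLU, giving sparse, non-negative, unnormalized attention weights, and to keep training stable by re-normalizing (and gating) the attention output. Below is a minimal, single-head sketch of that mechanism; the class name `ReLASketch`, the use of `LayerNorm` in place of the paper's RMS norm, and the gating details are illustrative assumptions, not this repository's exact implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ReLASketch(nn.Module):
    def __init__(self, dim, dim_head = 64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim, dim_head, bias = False)
        self.to_k = nn.Linear(dim, dim_head, bias = False)
        self.to_v = nn.Linear(dim, dim_head, bias = False)
        # the paper re-normalizes the attention output (RMS norm) and gates it
        # to compensate for the missing softmax normalization; LayerNorm without
        # affine parameters is used here purely for brevity
        self.norm = nn.LayerNorm(dim_head, elementwise_affine = False)
        self.to_gate = nn.Linear(dim, dim_head)
        self.to_out = nn.Linear(dim_head, dim, bias = False)

    def forward(self, x, causal = True):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale

        if causal:
            i, j = sim.shape[-2:]
            mask = torch.ones(i, j, device = x.device).triu(1).bool()
            sim = sim.masked_fill(mask, 0.)  # ReLU maps 0 to 0, so future positions drop out

        # the core of ReLA: ReLU instead of softmax -> sparse, non-negative,
        # unnormalized attention weights
        attn = F.relu(sim)

        out = torch.einsum('b i j, b j d -> b i d', attn, v)
        out = self.norm(out) * self.to_gate(x).sigmoid()
        return self.to_out(out)

# quick shape check
attn_layer = ReLASketch(dim = 512)
tokens = torch.randn(1, 1024, 512)
out = attn_layer(tokens)  # (1, 1024, 512)
```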
## Install

```bash
$ pip install rela-transformer
```
## Usage

```python
import torch
from rela_transformer.rela_transformer import ReLATransformer

model = ReLATransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 8,
    max_seq_len = 1024,
    dim_head = 64,
    heads = 8,
    causal = True
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)
```
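
With `causal = True`, the logits can be trained as a standard next-token language model. A minimal sketch of that objective, continuing from the snippet above (the bundled `train.py` may differ in its details):

```python
import torch.nn.functional as F

# shift by one position so each token predicts the following one
labels = x[:, 1:]          # (1, 1023)
preds  = logits[:, :-1]    # (1, 1023, 20000)

loss = F.cross_entropy(
    preds.reshape(-1, preds.shape[-1]),  # (1023, 20000)
    labels.reshape(-1)                   # (1023,)
)
loss.backward()
```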
## Enwik8

To train on enwik8, run:

```bash
$ python train.py
```
## Citations

```bibtex
@misc{zhang2021sparse,
    title         = {Sparse Attention with Linear Units},
    author        = {Biao Zhang and Ivan Titov and Rico Sennrich},
    year          = {2021},
    eprint        = {2104.07012},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```