PyTorch library for fast transformer implementations

Idiap Research Institute

Last update: Dec 30, 2022


Overview

Fast Transformers

Transformers are very successful models that achieve state-of-the-art performance in many natural language tasks. However, it is very difficult to scale them to long sequences because the cost of self-attention grows quadratically with the sequence length.
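To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. The function names and the multiply-add counts are illustrative assumptions (one attention head, constants and memory traffic ignored), not measurements from this library, so real speedups will be smaller than the ratios it prints.

```python
def softmax_attention_cost(n, d):
    """Rough multiply-add count for one head: Q K^T (n*n*d) plus A V (n*n*d)."""
    return 2 * n * n * d


def linear_attention_cost(n, d):
    """Rough multiply-add count for one head: K^T V (n*d*d) plus Q (K^T V) (n*d*d)."""
    return 2 * n * d * d


# The ratio grows linearly with the sequence length n for a fixed head size d.
for n in (1000, 4000, 16000):
    ratio = softmax_attention_cost(n, 64) / linear_attention_cost(n, 64)
    print(f"N={n:>6}: softmax/linear cost ratio = {ratio:.1f}")
```

Under this simplified model the ratio is just N / D, which is why the gap widens as sequences get longer.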

This library was developed for our research on fast attention for transformers. You can find a list of our papers in the docs as well as related papers and papers that we have implemented.

Quick-start

The following code builds a transformer with softmax attention and one with linear attention and compares the time required by each to encode a sequence with 1000 elements.

import torch
from fast_transformers.builders import TransformerEncoderBuilder

# Create the builder for our transformers
builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=8,
    n_heads=8,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=1024
)

# Build a transformer with softmax attention
builder.attention_type = "full"
softmax_model = builder.get()

# Build a transformer with linear attention
builder.attention_type = "linear"
linear_model = builder.get()

# Construct the dummy input
X = torch.rand(10, 1000, 8*64)

# Prepare everything for CUDA
X = X.cuda()
softmax_model.cuda()
softmax_model.eval()
linear_model.cuda()
linear_model.eval()

# Warmup the GPU
with torch.no_grad():
    softmax_model(X)
    linear_model(X)
torch.cuda.synchronize()

# Measure the execution time
softmax_start = torch.cuda.Event(enable_timing=True)
softmax_end = torch.cuda.Event(enable_timing=True)
linear_start = torch.cuda.Event(enable_timing=True)
linear_end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    softmax_start.record()
    y = softmax_model(X)
    softmax_end.record()
    torch.cuda.synchronize()
    print("Softmax: ", softmax_start.elapsed_time(softmax_end), "ms")
    # Softmax: 144 ms (on a GTX1080Ti)

with torch.no_grad():
    linear_start.record()
    y = linear_model(X)
    linear_end.record()
    torch.cuda.synchronize()
    print("Linear: ", linear_start.elapsed_time(linear_end), "ms")
    # Linear: 68 ms (on a GTX1080Ti)

Dependencies & Installation

The fast transformers library has the following dependencies:

PyTorch
C++ toolchain
CUDA toolchain (if you want to compile for GPUs)

For most machines installation should be as simple as:

pip install --user pytorch-fast-transformers

Note: macOS users should ensure they have llvm and libomp installed. Using the homebrew package manager, this can be accomplished by running brew install llvm libomp.

Documentation

There exists a dedicated documentation site but you are also encouraged to read the source code.

Research

Ours

To read about the theory behind some attention implementations in this library we encourage you to follow our research.

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2006.16236)
Fast Transformers with Clustered Attention (2007.04825)

If you found our research helpful or influential please consider citing

@inproceedings{katharopoulos_et_al_2020,
    author = {Katharopoulos, A. and Vyas, A. and Pappas, N. and Fleuret, F.},
    title = {Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
    booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
    year = {2020}
}

@inproceedings{vyas_et_al_2020,
    author={Vyas, A. and Katharopoulos, A. and Fleuret, F.},
    title={Fast Transformers with Clustered Attention},
    booktitle = {Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)},
    year={2020}
}

By others

Efficient Attention: Attention with Linear Complexities (1812.01243)
Linformer: Self-Attention with Linear Complexity (2006.04768)
Reformer: The Efficient Transformer (2001.04451)

Support, License and Copyright

This software is distributed with the MIT license which pretty much means that you can use it however you want and for whatever reason you want. All the information regarding support, copyright and the license can be found in the LICENSE file in the repository.

Comments

What is the best way to perform recurrent sampling while training?

In general, I want to have a teacher-forcing pass and a self-generated (free-running generative) pass, a.k.a. professor forcing.

For now, it looks like I need to merge FullAttention, RecurrentFullAttention and RecurrentCrossFullAttention into one class and use it with flags like recurrent=True, and do the same for the layers and the encoder/decoder classes. That seems inconvenient. Am I right, or is there a better way?

opened by hadaev8 27

Encoder-decoder setup?

Thanks for all the work!

Is there any way to use this library for a task that would typically require an encoder-decoder architecture, like machine translation?

I see the BERT example in the docs, but no mention of a decoder anywhere.

Thanks again :)
enhancement

opened by ghost 17

Implementation of random Fourier features

We should implement some of the RFF approaches of https://arxiv.org/abs/2009.14794 .

They can be used directly as a feature map with the LinearAttention implementation.
enhancement

opened by angeloskath 11
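As a rough illustration of the idea, the sketch below builds positive random features whose dot products approximate the softmax kernel exp(x . y). This assumes the positive-feature variant is what the issue has in mind. The function name positive_random_features and the plain-list representation are hypothetical, and nothing here is wired into the library's feature map interface.

```python
import math
import random


def positive_random_features(x, projections):
    """Map x to m positive features whose dot products approximate exp(x . y)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
        for w in projections
    ]


rng = random.Random(0)
dim, n_features = 3, 4000
# Rows of the projection matrix are drawn from a standard normal distribution.
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]

x = [0.3, -0.2, 0.5]
y = [0.1, 0.4, -0.3]

exact = math.exp(sum(a * b for a, b in zip(x, y)))
approx = sum(
    a * b
    for a, b in zip(
        positive_random_features(x, projections),
        positive_random_features(y, projections),
    )
)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because every feature is strictly positive, the normalizer used by linear attention stays positive as well, which is one reason this variant is attractive as a feature map.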

Error with recurrent attention ValueError: too many values to unpack (expected 2)

Colab Link: https://colab.research.google.com/drive/1mYTh4MO_Tg6LBrhhVQUd81R92UNE56F7?authuser=1#scrollTo=cflC2xVxKb5M&line=8&uniqifier=1

Full trace:

<ipython-input-20-cd7d3f9fcf71> in forward(self, batch)
     59         src = self.encoder(batch['inp'])
     60         src = self.pos_encoder(src)
---> 61         src = self.transformer_encoder(src)
     62 
     63         trg = self.decoder(batch['out'][:,:-1])

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/content/fast-transformers/fast_transformers/recurrent/transformers.py in forward(self, x, state, memory)
    131         # Apply all the transformers
    132         for i, layer in enumerate(self.layers):
--> 133             x, s = layer(x, state[i])
    134             state[i] = s
    135 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/content/fast-transformers/fast_transformers/recurrent/transformers.py in forward(self, x, state, memory)
     77 
     78         # Run the self attention and add it to the input
---> 79         x2, state = self.attention(x, x, x, state)
     80         x = x + self.dropout(x2)
     81 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/content/fast-transformers/fast_transformers/recurrent/attention/self_attention/attention_layer.py in forward(self, query, key, value, state, memory)
     83 
     84         # Reshape them into many heads and compute the attention
---> 85         N, D = query.shape
     86         H = self.n_heads
     87         new_value, state = self.inner_attention(

ValueError: too many values to unpack (expected 2)

opened by hadaev8 11
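The traceback shows the recurrent attention layer unpacking query.shape into (N, D). This suggests the recurrent modules expect one time step of shape (batch, features) per call, together with a state, rather than a whole (batch, length, features) sequence. The sketch below illustrates that calling pattern with plain Python lists. toy_recurrent_layer is a hypothetical stand-in, not the library's layer.

```python
def split_into_steps(batch):
    """Turn a (N, L, D) nested list into L slices of shape (N, D)."""
    seq_len = len(batch[0])
    return [[sample[t] for sample in batch] for t in range(seq_len)]


def toy_recurrent_layer(x_t, state):
    """Hypothetical stand-in for a recurrent model: a running sum over time."""
    if state is None:
        state = [[0.0] * len(row) for row in x_t]
    state = [
        [s + v for s, v in zip(s_row, x_row)]
        for s_row, x_row in zip(state, x_t)
    ]
    return state, state


# Batch of N=2 sequences, L=3 steps, D=2 features.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]

state = None
outputs = []
for x_t in split_into_steps(batch):
    # x_t has shape (N, D); the state carries information across steps.
    y_t, state = toy_recurrent_layer(x_t, state)
    outputs.append(y_t)
```

With a real recurrent model the loop has the same shape: slice out one step, pass it in with the previous state, and keep the returned state for the next step.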

Huggingface Bert vs. Fast Transformer full attention

First of all, thank you for this amazing work!

In my research I am comparing different encoders for relation extraction. I noticed that the transformer implementation of this repo with full attention performs worse (in terms of F1 score) than the Hugging Face BERT implementation. I use a Hugging Face BERT without pretrained weights, so my expectation is that the following setup should perform the same as that untrained BERT.

TransformerEncoderBuilder.from_kwargs(
    n_layers=12,
    n_heads=12,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=3072,
    attention_type="full",
    activation="gelu"
).get()

Is my expectation correct? Why does it perform worse?

opened by lipmem 9

Memory usage: native PyTorch vs. "full"-Attention

Hello,

I wanted to leave some of my own observations here regarding memory consumption (which is often a critical factor). They might be of interest to others who want to benchmark their implementation.

The fast-transformers implementation of full self-attention uses around 35% more GPU memory and is slightly slower than the native PyTorch implementation. Note that this holds for my specific setup, and that I ran only a limited number of test runs (4 each), which I report here. I only discovered this because my initial configuration/implementation in PyTorch did fit into memory.

Both models use some embedding beforehand and differ only in the TransformerEncoderLayer / TransformerEncoderBuilder. I did not construct a minimal example; I just exchanged the modules in my workflow to test different implementations.

The following numbers belong to this specific configuration:

Architecture: encoder only
Attention mask: causal (upper triangular)
Number of layers: 8
Embedding dimension: 64
Number of heads: 4
Feed-forward dimension: 4 * 64
Max sequence length: 4096
Batch size: 1
GPU: single RTX 2080 Ti

Peak memory usage in each run:
native PyTorch: 6152 - 6200 MB
fast-transformers: 8312 - 8454 MB

Computation time per epoch in each run:
native PyTorch: 9min 9s - 9min 33s
fast-transformers: 10min 18s - 10min 48s

The same configuration with 16 layers does fit into the GPU (~11GB) using native PyTorch and throws an OOM with fast-transformers. I suppose this is not an important issue as long as both implementations provide similar results (I might test that on my specific setup in the next couple of days, too), since the focus of the library lies on efficient implementations.

opened by GregorKobsik 9

CUDA problems in causal linear product

Hi, my machine has 4 GPUs. When I use GPU 1 (the default GPU is 0), I found that the CUDA code is still executed on GPU 0. Also, the code cannot run when I use multiple GPUs at once: there is an out-of-memory error.
bug

opened by xyltt 8

RuntimeError: CUDA error: invalid argument when running tests/attention/test_improved_clustered_transformer_gpu.py

I have changed some hyperparameters of test_improved_clustered_transformer_gpu.py as shown in the following figure. When 'input length' is 475 and 'd_model' is larger than 1540, the script fails with "RuntimeError: CUDA error: invalid argument." Could you tell me why this happens?

opened by justimyhxu 8

No module named 'fast_transformers.causal_product.causal_product_cpu' (solved: needed to add CUDA to the PATH)

Hi there,

I am having some trouble using this library. I cloned this repo (July 19th) and ran the setup file, the setup ran but now I am getting this error (the same error occurs with pip install):

  File "/usr/local/lib/python3.6/dist-packages/fast_transformers/builders/__init__.py", line 29, in <module>
    from .transformer_encoder_builder import TransformerEncoderBuilder
  File "/usr/local/lib/python3.6/dist-packages/fast_transformers/builders/transformer_encoder_builder.py", line 31, in <module>
    from ..attention import AttentionLayer, FullAttention, \
  File "/usr/local/lib/python3.6/dist-packages/fast_transformers/attention/__init__.py", line 13, in <module>
    from .causal_linear_attention import CausalLinearAttention
  File "/usr/local/lib/python3.6/dist-packages/fast_transformers/attention/causal_linear_attention.py", line 12, in <module>
    from fast_transformers.causal_product import causal_dot_product 
  File "/usr/local/lib/python3.6/dist-packages/fast_transformers/causal_product/__init__.py", line 9, in <module>
    from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu, \
ModuleNotFoundError: No module named 'fast_transformers.causal_product.causal_product_cpu'

When I comment out the import of this file (above), I get an import error on the hashing files instead, so I think the issue is with these CUDA files. I am using Ubuntu 18.04 and PyTorch 1.5.1 with CUDA 10.2. However, using the exact same setup procedure on Google Colab, I have no issues. Colab uses PyTorch 1.5.1 but CUDA 10.1.

Could the CUDA version difference be the issue?

Thanks :)

opened by ghost 8
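Since the title reports that adding CUDA to the PATH solved the problem, a quick check like the following can show whether the CUDA compiler is visible before building. This is only a diagnostic sketch. The environment variable names it inspects are an assumption about what the build looks at, and the function name is hypothetical.

```python
import os
import shutil


def cuda_toolchain_report():
    """Collect the environment details a CUDA extension build usually depends on."""
    return {
        "nvcc": shutil.which("nvcc"),  # None when nvcc is not on PATH
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
    }


report = cuda_toolchain_report()
for name, value in report.items():
    print(f"{name}: {value if value else 'not found'}")
```

If nvcc is reported as not found, adding the CUDA bin directory to PATH and reinstalling the package is the first thing to try.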

Tips and tricks for training linear_att

Hello,

I have migrated your linear_attention.py to be compatible with huggingface. I have also modified the masking part to use the LengthMask.

The thing is that the model is very brittle and tends to diverge. It is very sensitive to hyper-parameters and initialization.

Do you have some tips and tricks for training the linear attention?

Thanks!

class LinearAttention(nn.Module):
    """Implement unmasked attention using dot product of feature maps in
    O(N D^2) complexity.
    Given the queries, keys and values as Q, K, V instead of computing
        V' = softmax(Q.mm(K.t()), dim=-1).mm(V),
    we make use of a feature map function Φ(.) and perform the following
    computation
        V' = normalize(Φ(Q).mm(Φ(K).t())).mm(V).
    The above can be computed in O(N D^2) complexity where D is the
    dimensionality of Q, K and V and N is the sequence length. Depending on the
    feature map, however, the complexity of the attention might be limited.
    Arguments
    ---------
        config: model configuration object providing true_hidden_size,
                hidden_size, num_attention_heads and
                use_bottleneck_attention
        feature_map: callable, a callable that applies the feature map to the
                     last dimension of a tensor (default: elu(x)+1)
        eps: float, a small number to ensure the numerical stability of the
             denominator (default: 1e-4)
    """

    def __init__(self, config, feature_map=None, eps=1e-4):
        super(LinearAttention, self).__init__()
        self.feature_map = (
            feature_map(config.true_hidden_size) if feature_map else
            elu_feature_map(config.true_hidden_size)
        )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.true_hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.eps = eps
        self.query_projection = nn.Linear(config.true_hidden_size, self.all_head_size)
        self.key_projection = nn.Linear(config.true_hidden_size, self.all_head_size)
        self.value_projection = nn.Linear(config.true_hidden_size if config.use_bottleneck_attention else config.hidden_size, self.all_head_size)
        self.out_projection = nn.Linear(config.true_hidden_size, config.true_hidden_size)
        self.n_heads = config.num_attention_heads
    
    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, queries, keys, values, attn_mask, query_lengths,
                key_lengths):

        N, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads

        # Project the queries/keys/values
        queries = self.query_projection(queries).view(N, L, H, -1)
        keys = self.key_projection(keys).view(N, S, H, -1)
        values = self.value_projection(values).view(N, S, H, -1)

        # Apply the feature map to the queries and keys
        # self.feature_map.new_feature_map(queries.device)
        Q = self.feature_map.forward_queries(queries)
        K = self.feature_map.forward_keys(keys)

        # Apply the key padding mask and make sure that the attn_mask is
        # all_ones
        if not attn_mask.all_ones:
            raise RuntimeError(("LinearAttention does not support arbitrary "
                                "attention masks"))
        K = K * key_lengths.float_matrix[:, :, None, None]

        # Compute the KV matrix, namely the dot product of keys and values so
        # that we never explicitly compute the attention matrix and thus
        # decrease the complexity
        KV = torch.einsum("nshd,nshm->nhmd", K, values)

        # Compute the normalizer
        Z = 1 / (torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1)) + self.eps)

        # Finally compute and return the new values
        V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z).contiguous().view(N, L, -1)

        return self.out_projection(V)

def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        output_hidden_states=None,
        output_attentions=None,
        return_dict=None,
        output_layers=None,
        regression=False,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        N = input_shape[0]
        L = input_shape[1]
        if input_ids is not None:
            x = input_ids
        elif inputs_embeds is not None:
            x = inputs_embeds
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        extended_attention_mask = FullMask(L, device=x.device)
        # extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 # ¿?
        head_mask = LengthMask(x.new_full((N,), L, dtype=torch.int64))

opened by gaceladri 7

windows installation error linking local_product_cuda.cu

I've been trying to install on windows using pip and it looks like I'm almost there. I get through compiling everything and then I get an error when trying to complete linking of local-product-cuda.

System: Win 10, cuda 10.2.89 , pytorch 1.6, python 3.8

traceback: local_product_cuda.cu C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\user\Anaconda3\envs\testenv\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib/x64" /LIBPATH:C:\Users\user\Anaconda3\envs\testenv\libs /LIBPATH:C:\Users\user\Anaconda3\envs\testenv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17763.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17763.0\um\x64" c10.lib torch.lib torch_cpu.lib torch_python.lib cudart.lib c10_cuda.lib torch_cuda.lib /EXPORT:PyInit_local_product_cuda C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product/local_product_cuda.obj /OUT:build\lib.win-amd64-3.8\fast_transformers\local_product\local_product_cuda.cp38-win_amd64.pyd /IMPLIB:C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product\local_product_cuda.cp38-win_amd64.lib

Creating library C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product\local_product_cuda.cp38-win_amd64.lib and object C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product\local_product_cuda.cp38-win_amd64.exp

local_product_cuda.obj : error LNK2001: unresolved external symbol "public: long * __cdecl at::Tensor::data_ptr(void)const " (??$data_ptr@J@Tensor@at@@QEBAPEAJXZ)

build\lib.win-amd64-3.8\fast_transformers\local_product\local_product_cuda.cp38-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals

error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\link.exe' failed with exit status 1120

I've been investigating for quite a few hours now, but I can't figure out why I'm getting a linking error. From searching the error it seems like it's some issue with the .lib or function definition not being accessible to the .obj, but it seems like both the .lib and .obj are being created, and I'm assuming all definitions are wrapped into the pip bundle if others are able to install. I wanted to post here in case it is an issue with the dependencies somewhere or something getting messed up with windows. Anyone else having this problem or have an idea where to start in solving it?

Thanks!

opened by lm-b 7

Understanding how to define key, query and value for the cross attention calculation

Hello,

I have a problem understanding how I can use this library to implement cross attention.

For instance, if tensor x = torch.rand(100, 14, 64) is the key, tensor y = torch.rand(100, 11, 64) is the value and tensor z = torch.rand(100, 14, 1) is the query, how can I use TransformerDecoderBuilder to compute the cross attention for this example?

Here is how I built the encoder and decoder classes:

import math
from collections import OrderedDict

import torch
from torch import nn

import fast_transformers.masking
from fast_transformers.builders import TransformerEncoderBuilder, TransformerDecoderBuilder


class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout_prob=0.0, series_dimensions=1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout_prob)
        self.d_model = d_model
        self.max_len = max_len
        self.series_dimensions = series_dimensions
        
        if self.series_dimensions == 1:
            if d_model % 2 != 0:
                raise ValueError("Cannot use sin/cos positional encoding with "
                                 "odd dim (got dim={:d})".format(d_model))
            pe = torch.zeros(self.max_len, d_model).float()
            pe.requires_grad = False
            position = torch.arange(0, self.max_len, dtype=torch.float).unsqueeze(1)
            div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
        elif self.series_dimensions > 1:
            if d_model % 4 != 0:
                raise ValueError("Cannot use 2D sin/cos positional encoding with "
                                 "a dim not divisible by 4 (got dim={:d})".format(d_model))
            height = self.series_dimensions
            width = self.max_len
            pe = torch.zeros(d_model, height, width).float()
            pe.requires_grad = False
            # Each dimension use half of d_model
            d_model = int(d_model / 2)
            div_term = torch.exp(torch.arange(0., d_model, 2) * -(math.log(10000.0) / d_model))
            pos_w = torch.arange(0., width).unsqueeze(1)
            pos_h = torch.arange(0., height).unsqueeze(1)
            pe[0:d_model:2, :, :] = torch.sin(pos_w * div_term).transpose(0, 1).unsqueeze(1).repeat(1, height, 1)
            pe[1:d_model:2, :, :] = torch.cos(pos_w * div_term).transpose(0, 1).unsqueeze(1).repeat(1, height, 1)
            pe[d_model::2, :, :] = torch.sin(pos_h * div_term).transpose(0, 1).unsqueeze(2).repeat(1, 1, width)
            pe[d_model + 1::2, :, :] = torch.cos(pos_h * div_term).transpose(0, 1).unsqueeze(2).repeat(1, 1, width)
            pe = pe.view(2*d_model, height * width, -1).squeeze(-1) # Flattening it back to 1D series
            pe = pe.transpose(0, 1)
            
        pe = pe.unsqueeze(0) # Extending it by an extra leading dim for the batches
        self.register_buffer('pe', pe)

    # Expecting a flattened (1D) series
    def forward(self, x):
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


class LinearTransformerCausalEncoder(torch.nn.Module):
    def __init__(self, input_features, output_features, hidden_dim, sequence_length, 
                 attention_type='causal-linear', n_layers=2, n_heads=4,
                 dropout=0.1, softmax_temp=None, activation_fn="gelu",
                 attention_dropout=0.1,
                ):
        super(LinearTransformerCausalEncoder, self).__init__()
        #
        self.d_model=hidden_dim*n_heads
        #
        self.pos_embedding = PositionalEncoding(
                                               max_len=sequence_length,
                                               d_model=self.d_model, #hidden_dim*n_heads      
                                               )
        self.value_embedding = nn.Linear(
            input_features,
            self.d_model
        )
        self.builder_dict = OrderedDict({
            "attention_type": attention_type,
            "n_layers": n_layers,
            "n_heads": n_heads,
            "feed_forward_dimensions": self.d_model*2,
            "query_dimensions": hidden_dim,
            "value_dimensions": hidden_dim,
            "dropout": dropout,
            "softmax_temp": softmax_temp,
            "activation" : activation_fn,
            "attention_dropout": attention_dropout,
        })
        self.transformer = TransformerEncoderBuilder.from_dictionary(
            self.builder_dict,
            strict=True
        ).get()
        hidden_size = n_heads*hidden_dim
        ##
        self.predictor = torch.nn.Linear(
            hidden_size,
            output_features
        )
    def forward(self, x):
        # x: [batch_size, input_dim, sequence_length]
        x = x.permute(0,2,1)
        x = self.value_embedding(x) # x: [batch size, sequence_length, n_heads * hidden_dim]
        x = self.pos_embedding(x) # x: [batch size, sequence_length, n_heads * hidden_dim]
        triangular_mask = fast_transformers.masking.TriangularCausalMask(x.size(1), device=x.device) # triangular_mask: [sequence_length, sequence_length]
        y_hat = self.transformer(x, attn_mask=triangular_mask) # y_hat: [batch size, sequence_length, n_heads * hidden_dim]
        y_hat = self.predictor(y_hat) # y_hat: [batch size, sequence_length, output_size]
        return y_hat.permute(0,2,1)   # y_hat: [batch size, output_size, sequence_length]

class LinearTransformerCausalDecoder(torch.nn.Module):
    def __init__(self, output_features, hidden_dim, sequence_length, 
                 attention_type='causal-linear', n_layers=2, n_heads=4,
                 d_query=32, dropout=0.1, softmax_temp=None,activation_fn="gelu",
                 attention_dropout=0.1,):
        super(LinearTransformerCausalDecoder, self).__init__()
        self.d_model=hidden_dim*n_heads
        self.pos_embedding = PositionalEncoding(
             max_len=sequence_length,
            d_model=self.d_model, #hidden_dim*n_heads
           
        )
    
        self.value_embedding = torch.nn.Linear(
            output_features,
            self.d_model
        )
        self.builder_dict = OrderedDict({
            "cross_attention_type":attention_type,
            "self_attention_type":attention_type,
            "n_layers": n_layers,
            "n_heads": n_heads,
            "feed_forward_dimensions": self.d_model*2,
            "query_dimensions": hidden_dim,
            "value_dimensions": hidden_dim,
            "dropout": dropout,
            "softmax_temp": softmax_temp,
            "activation" : activation_fn,
            "attention_dropout": attention_dropout,
        })
        self.transformer = TransformerDecoderBuilder.from_dictionary(
            self.builder_dict,
            strict=True
        ).get()
        hidden_size = n_heads*hidden_dim
        
        self.predictor = torch.nn.Linear(
            hidden_size,
            output_features
        )
    def forward(self, target, memory, len_mask=None):
        x = target.permute(0,2,1) # x: [batch_size, sequence_length, input_dim]
        x = self.value_embedding(x) # x: [batch size, sequence_length, n_heads * hidden_dim]
        x = self.pos_embedding(x) # x: [batch size, sequence_length, n_heads * hidden_dim]
        triangular_mask = fast_transformers.masking.TriangularCausalMask(x.size(1), device=x.device) # triangular_mask: [sequence_length, sequence_length]
        # Pass the masks by keyword and forward the caller's len_mask instead of discarding it
        y_hat = self.transformer(x, memory, x_mask=triangular_mask, x_length_mask=len_mask) # y_hat: [batch size, sequence_length, n_heads * hidden_dim]
        y_hat = self.predictor(y_hat) # y_hat: [batch size, sequence_length, output_size]
        return y_hat.permute(0,2,1)   # y_hat: [batch size, output_size, sequence_length]

x = torch.rand([100, 14, 64])

I have difficulty understanding how I can use LinearTransformerCausalDecoder to compute cross attention. I would appreciate it if anyone could clarify what the key, query and value are in this example. Thanks.
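For orientation, here is a minimal pure-Python sketch of softmax cross attention for a single sample. It is a generic illustration, not the library's implementation. It shows the shape constraints any cross attention has to satisfy: queries come from the target sequence (length L), keys and values both come from the memory and must share the same length S, and queries and keys must share the same feature size. In the example from the question, the key and value lengths (14 and 11) and the query and key feature sizes (1 and 64) do not match, so the tensors would need to be projected or rearranged first.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Softmax cross attention for one sample.

    queries: L x E, keys: S x E, values: S x M  ->  output: L x M
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length S")
    if len(queries[0]) != len(keys[0]):
        raise ValueError("queries and keys must share the feature size E")
    scale = 1.0 / math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        output.append([
            sum(w * v[m] for w, v in zip(weights, values))
            for m in range(len(values[0]))
        ])
    return output


# L=2 target positions attend over S=3 memory positions.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = cross_attention(queries, keys, values)
```

In the decoder class above, the target passed as x plays the role of the queries, and the memory argument supplies both keys and values.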

opened by neuronphysics 0

Cuda version

Is this package tested on more recent cuda and pytorch versions?

My code calls fast_transformers.causal_product, which is actually the only function I call from this package.

I set up this package with the latest PyTorch 1.13.0+cuda11.6 and get NaN errors during training. This, however, doesn't happen with the older PyTorch 1.7.1+cuda11.0.

opened by jiaji-huang 1

Can't officially save Linear Attention model

I tried (on Ubuntu) to torch.save (1.1.0) a model using Linear Attention (0.4.0) and got the following serialization error: PicklingError: Can't pickle <function <lambda> at 0x7fa4f10120e0>: attribute lookup <lambda> on fast_transformers.feature_maps.base failed

Any solution? Should I PR?
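One common workaround, sketched here with a hypothetical ToyModel rather than the library's classes, is to serialize only the parameters and rebuild the model from code when loading. In PyTorch the analogous approach is usually to save model.state_dict() instead of the whole module. Whether that is sufficient for a given setup is an assumption to verify.

```python
import pickle


class ToyModel:
    """Hypothetical stand-in for a model that stores a lambda as an attribute."""

    def __init__(self):
        self.weights = {"w": [0.1, 0.2, 0.3], "b": [0.0]}
        # Lambdas have no importable name, so pickle cannot serialize them.
        self.feature_map = lambda x: x + 1


def can_pickle(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


model = ToyModel()
print("lambda picklable:", can_pickle(model.feature_map))

# Serialize only the parameters, then rebuild the model from code and restore them.
payload = pickle.dumps(model.weights)
restored = ToyModel()
restored.weights = pickle.loads(payload)
```

The lambda is recreated by the constructor on load, so it never has to pass through pickle.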

opened by maulberto3 1

Runtime error on causal_product_cpu on GCC/G++ 11

I've built pytorch-fast-transformers on Ubuntu 21.10 with CUDA 11.6 and GCC/G++ 11. The build worked fine. On:

import fast_transformers.causal_product.causal_product_cpu

from an init file, it throws the following error:

[...]
  File "python3.8/site-packages/fast_transformers/builders/__init__.py", line 42, in <module>
    from ..attention import
  File "python3.8/site-packages/fast_transformers/attention/__init__.py", line 13, in <module>
    from .causal_linear_attention import CausalLinearAttention
  File "python3.8/site-packages/fast_transformers/attention/causal_linear_attention.py", line 15, in <module>
    from ..causal_product import causal_dot_product
  File "python3.8/site-packages/fast_transformers/causal_product/__init__.py", line 9, in <module>
    from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu,
ImportError: python3.8/site-packages/fast_transformers/causal_product/causal_product_cpu.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

After I used update-alternatives to point to GCC/G++ 10, the runtime error was gone.

More versions, in verbose format. Due to the update-alternatives, I'm calling g++-11 and gcc-11 specifically, they were the default in Ubuntu 21.10:

✗ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

✗ gcc-11 --version
gcc-11 (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

✗ g++-11 --version
g++-11 (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

opened by lsisoft 3

How is the causal mask constructed when training in batch mode with linear causal attention?

Hi! I have a few questions about the difference in models.

I understand how the recursive model is set up, it is described in the publication. But how is effective model learning achieved in batch fashion? As far as I understand, because we never explicitly calculate the attention matrix we can't just apply a triangular mask. How does this work then? Is it just iterative as in the recursive model, but implemented on cuda? Is it easily parallelizable as 3 matrix multiplications (like in full attention)?

Thanks!
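
As a rough illustration of the question above: with a kernel feature map, the causally masked result can be obtained from running (prefix) sums over the keys and values, so the N x N attention matrix never needs to be built. The sketch below is plain Python written for this page, not the library's CUDA kernel; it assumes the elu(x) + 1 feature map from the linear attention paper, and the function names, sizes, and random inputs are made up for the example. It checks that the prefix-sum form matches an explicit lower-triangular mask.

```python
import math
import random


def phi(x):
    """Feature map elu(x) + 1, applied element-wise (always positive)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_masked(Q, K, V):
    """Quadratic reference: full similarity matrix with entries j > i masked out."""
    n, m = len(Q), len(V[0])
    out = []
    for i in range(n):
        fq = phi(Q[i])
        scores = [dot(fq, phi(K[j])) if j <= i else 0.0 for j in range(n)]
        norm = sum(scores)
        out.append([sum(scores[j] * V[j][c] for j in range(n)) / norm
                    for c in range(m)])
    return out


def causal_prefix_sum(Q, K, V):
    """Linear-time form: keep running sums S = sum phi(k) v^T and z = sum phi(k)."""
    d, m = len(K[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]
    z = [0.0] * d
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(m):
                S[a][c] += fk[a] * v[c]
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / norm
                    for c in range(m)])
    return out


random.seed(0)
N, D, M = 6, 4, 3  # sequence length, query/key size, value size
Q = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
K = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
V = [[random.uniform(-1, 1) for _ in range(M)] for _ in range(N)]

masked = causal_masked(Q, K, V)
prefix = causal_prefix_sum(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(masked, prefix) for a, b in zip(ra, rb))
print("max difference:", max_diff)
```

The prefix sums are sequential over positions but independent across batch elements, heads, and feature dimensions, which leaves room for parallelism there. How the library's kernels actually organize this work is not shown here.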

opened by Howuhh 0

Owner

Idiap Research Institute


ilpyt: imitation learning library with modular, baseline implementations in Pytorch

ilpyt The imitation learning toolbox (ilpyt) contains modular implementations of common deep imitation learning algorithms in PyTorch, with unified in

11 Nov 17, 2022

A lightweight library to compare different PyTorch implementations of the same network architecture.

TorchBug is a lightweight library designed to compare two PyTorch implementations of the same network architecture. It allows you to count, and compar

5 Jan 2, 2023

A library for Deep Learning Implementations and utils

deeply A Deep Learning library Table of Contents Features Quick Start Usage License Features Python 2.7+ and Python 3.4+ compatible. Quick Start $ pip

1 Dec 12, 2022

Super-Fast-Adversarial-Training - A PyTorch Implementation code for developing super fast adversarial training

Super-Fast-Adversarial-Training This is a PyTorch Implementation code for develo

26 Dec 2, 2022

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Vision Transformer for Fast and Efficient Scene Text Recognition (ICDAR 2021) ViTSTR is a simple single-stage model that uses a pre-trained Vision Tra

198 Dec 27, 2022

Implementation of Fast Transformer in Pytorch

Fast Transformer - Pytorch Implementation of Fast Transformer in Pytorch. This only work as an encoder. Yannic video AI Epiphany Install $ pip install

167 Dec 27, 2022

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Transformer in Transformer Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image c

272 Dec 23, 2022

Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

ImageProcessingTransformer Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

61 Jan 1, 2023

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

12.6k Jan 9, 2023

Transformer - Transformer in PyTorch

Transformer Progress: Embeddings and PositionalEncoding with example. MultiHeadAttent

1 Jan 6, 2022

TorchMetrics is a collection of 25+ PyTorch metrics implementations and an easy-to-use API to create custom metrics.

Machine learning metrics for distributed, scalable PyTorch applications.

1.2k Jan 6, 2023

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.

4.7k Jan 1, 2023

PyTorch implementations of the paper: "Learning Independent Instance Maps for Crowd Localization"

IIM - Crowd Localization This repo is the official implementation of paper: Learning Independent Instance Maps for Crowd Localization. The code is dev

91 Nov 10, 2022

PyTorch implementations of Top-N recommendation, collaborative filtering recommenders.

129 Dec 22, 2022

PyTorch implementations of deep reinforcement learning algorithms and environments

Deep Reinforcement Learning Algorithms with PyTorch This repository contains PyTorch implementations of deep reinforcement learning algorithms and env

4.7k Jan 4, 2023

Annotated, understandable, and visually interpretable PyTorch implementations of: VAE, BIRVAE, NSGAN, MMGAN, WGAN, WGANGP, LSGAN, DRAGAN, BEGAN, RaGAN, InfoGAN, fGAN, FisherGAN

Overview PyTorch 0.4.1 | Python 3.6.5 Annotated implementations with comparative introductions for minimax, non-saturating, wasserstein, wasserstein g

471 Dec 16, 2022

PyTorch implementations of algorithms for density estimation

pytorch-flows A PyTorch implementations of Masked Autoregressive Flow and some other invertible transformations from Glow: Generative Flow with Invert

546 Dec 5, 2022

PyTorch implementations of Generative Adversarial Networks.

This repository has gone stale as I unfortunately do not have the time to maintain it anymore. If you would like to continue the development of it as

13.4k Jan 8, 2023

Pytorch implementations of popular off-policy multi-agent reinforcement learning algorithms, including QMix, VDN, MADDPG, and MATD3.

Off-Policy Multi-Agent Reinforcement Learning (MARL) Algorithms This repository contains implementations of various off-policy multi-agent reinforceme

183 Dec 28, 2022