PyTorch library for fast transformer implementations

Overview

Fast Transformers

Transformers are very successful models that achieve state-of-the-art performance in many natural language tasks. However, it is very difficult to scale them to long sequences due to the quadratic scaling of self-attention.

This library was developed for our research on fast attention for transformers. You can find a list of our papers in the docs as well as related papers and papers that we have implemented.

Quick-start

The following code builds a transformer with softmax attention and one with linear attention and compares the time required by each to encode a sequence with 1000 elements.

import torch
from fast_transformers.builders import TransformerEncoderBuilder

# Create the builder for our transformers
builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=8,
    n_heads=8,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=1024
)

# Build a transformer with softmax attention
builder.attention_type = "full"
softmax_model = builder.get()

# Build a transformer with linear attention
builder.attention_type = "linear"
linear_model = builder.get()

# Construct the dummy input
X = torch.rand(10, 1000, 8*64)

# Prepare everything for CUDA
X = X.cuda()
softmax_model.cuda()
softmax_model.eval()
linear_model.cuda()
linear_model.eval()

# Warmup the GPU
with torch.no_grad():
    softmax_model(X)
    linear_model(X)
torch.cuda.synchronize()

# Measure the execution time
softmax_start = torch.cuda.Event(enable_timing=True)
softmax_end = torch.cuda.Event(enable_timing=True)
linear_start = torch.cuda.Event(enable_timing=True)
linear_end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    softmax_start.record()
    y = softmax_model(X)
    softmax_end.record()
    torch.cuda.synchronize()
    print("Softmax: ", softmax_start.elapsed_time(softmax_end), "ms")
    # Softmax: 144 ms (on a GTX1080Ti)

with torch.no_grad():
    linear_start.record()
    y = linear_model(X)
    linear_end.record()
    torch.cuda.synchronize()
    print("Linear: ", linear_start.elapsed_time(linear_end), "ms")
    # Linear: 68 ms (on a GTX1080Ti)

Dependencies & Installation

The fast transformers library has the following dependencies:

  • PyTorch
  • C++ toolchain
  • CUDA toolchain (if you want to compile for GPUs)

For most machines installation should be as simple as:

pip install --user pytorch-fast-transformers

Note: macOS users should ensure they have llvm and libomp installed. Using the homebrew package manager, this can be accomplished by running brew install llvm libomp.
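
If the installation appears to succeed but you want to confirm that the compiled C++/CUDA extensions were actually built, a quick sanity check (a sketch, not an official test) is to import one of them directly; an ImportError here usually points to a toolchain or PATH problem:

# Sanity-check sketch: import the compiled causal product kernel.
# An ImportError/ModuleNotFoundError here means the C++/CUDA extensions were not built.
from fast_transformers.causal_product import causal_dot_product
print("fast_transformers compiled extensions imported successfully")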

Documentation

There is a dedicated documentation site, but you are also encouraged to read the source code.

Research

Ours

To read about the theory behind some attention implementations in this library we encourage you to follow our research.

  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2006.16236)
  • Fast Transformers with Clustered Attention (2007.04825)

If you found our research helpful or influential, please consider citing:

@inproceedings{katharopoulos_et_al_2020,
    author = {Katharopoulos, A. and Vyas, A. and Pappas, N. and Fleuret, F.},
    title = {Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
    booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
    year = {2020}
}

@inproceedings{vyas_et_al_2020,
    author = {Vyas, A. and Katharopoulos, A. and Fleuret, F.},
    title = {Fast Transformers with Clustered Attention},
    booktitle = {Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)},
    year = {2020}
}

By others

  • Efficient Attention: Attention with Linear Complexities (1812.01243)
  • Linformer: Self-Attention with Linear Complexity (2006.04768)
  • Reformer: The Efficient Transformer (2001.04451)

Support, License and Copyright

This software is distributed with the MIT license which pretty much means that you can use it however you want and for whatever reason you want. All the information regarding support, copyright and the license can be found in the LICENSE file in the repository.

Comments
  • What is best way to perform recurrent sampling while training?

    What is best way to perform recurrent sampling while training?

    In general, I want to have a teacher-forcing pass and a self-generated (free-running, generative) pass, aka professor forcing.

    For now, it looks like I need to merge FullAttention, RecurrentFullAttention and RecurrentCrossFullAttention into one class and use it with a flag like recurrent=True, and do the same for the layer and encoder/decoder classes. That seems inconvenient. Am I right, or is there a better way? (See the recurrent sampling sketch below.)

    opened by hadaev8 27
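
    For reference, a minimal sketch of step-by-step (recurrent) inference, assuming fast_transformers.builders.RecurrentEncoderBuilder and a forward signature of model(x_t, state) that consumes one time step of shape (N, D) and returns (y_t, state); exact names and signatures may differ between versions:

    import torch
    from fast_transformers.builders import RecurrentEncoderBuilder

    # Build a recurrent encoder (assumed kwargs mirror TransformerEncoderBuilder).
    model = RecurrentEncoderBuilder.from_kwargs(
        attention_type="full",
        n_layers=4,
        n_heads=4,
        query_dimensions=64,
        value_dimensions=64,
        feed_forward_dimensions=512
    ).get()

    N, D = 2, 4 * 64        # batch size and model dimension (n_heads * value_dimensions)
    x_t = torch.rand(N, D)  # one time step, no sequence dimension
    state = None
    for _ in range(10):
        y_t, state = model(x_t, state)
        x_t = y_t           # free-running pass feeds the output back; teacher forcing would use the ground truth
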
  • Encoder-decoder setup?

    Encoder-decoder setup?

    Thanks for all the work!

    Is there any way to use this library for a task that would typically require an encoder-decoder architecture, like machine translation? (A minimal encoder-decoder sketch follows below.)

    I see the BERT example in the docs, but no mention of a decoder anywhere.

    Thanks again :)

    enhancement 
    opened by ghost 17
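
    A minimal encoder-decoder sketch based on the builders that appear on this page (TransformerEncoderBuilder and TransformerDecoderBuilder); the decoder's positional forward arguments are assumed from the usage shown in a later comment and may differ in your version:

    import torch
    from fast_transformers.builders import TransformerEncoderBuilder, TransformerDecoderBuilder
    from fast_transformers.masking import TriangularCausalMask

    encoder = TransformerEncoderBuilder.from_kwargs(
        attention_type="full", n_layers=2, n_heads=4,
        query_dimensions=64, value_dimensions=64, feed_forward_dimensions=512
    ).get()

    decoder = TransformerDecoderBuilder.from_kwargs(
        self_attention_type="full", cross_attention_type="full",
        n_layers=2, n_heads=4,
        query_dimensions=64, value_dimensions=64, feed_forward_dimensions=512
    ).get()

    src = torch.rand(8, 50, 4 * 64)                # source sequence
    tgt = torch.rand(8, 20, 4 * 64)                # shifted target sequence
    memory = encoder(src)                          # encode the source
    causal_mask = TriangularCausalMask(tgt.shape[1])
    out = decoder(tgt, memory, causal_mask)        # self-attention on tgt, cross-attention over memory
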
  • Implementation of random Fourier features

    Implementation of random Fourier features

    We should implement some of the RFF approaches of https://arxiv.org/abs/2009.14794 .

    They can be used directly as a feature map with the LinearAttention implementation. (A generic random-features sketch follows below.)

    enhancement 
    opened by angeloskath 11
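
    A generic, library-agnostic sketch of trigonometric random Fourier features in the spirit of the referenced paper; this is not the library's implementation, just an illustration of the kind of callable that could serve as a feature map for LinearAttention (the actual feature-map interface may differ):

    import math
    import torch

    class RandomFourierFeatures(torch.nn.Module):
        # Map x to [cos(x W), sin(x W)] / sqrt(m), a Rahimi-Recht style approximation of an RBF kernel.
        def __init__(self, query_dims, n_features=256, softmax_temp=None):
            super().__init__()
            self.softmax_temp = softmax_temp or 1.0 / math.sqrt(query_dims)
            self.register_buffer("omega", torch.randn(query_dims, n_features))

        def forward(self, x):
            # x: (..., query_dims) -> phi(x): (..., 2 * n_features)
            u = (x * math.sqrt(self.softmax_temp)) @ self.omega
            phi = torch.cat([torch.cos(u), torch.sin(u)], dim=-1)
            return phi * (self.omega.shape[1] ** -0.5)
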
  • Error with recurrent attention ValueError: too many values to unpack (expected 2)

    Error with recurrent attention ValueError: too many values to unpack (expected 2)

    Colab Link: https://colab.research.google.com/drive/1mYTh4MO_Tg6LBrhhVQUd81R92UNE56F7?authuser=1#scrollTo=cflC2xVxKb5M&line=8&uniqifier=1

    Full trace:

    <ipython-input-20-cd7d3f9fcf71> in forward(self, batch)
         59         src = self.encoder(batch['inp'])
         60         src = self.pos_encoder(src)
    ---> 61         src = self.transformer_encoder(src)
         62 
         63         trg = self.decoder(batch['out'][:,:-1])
    
    /usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    /content/fast-transformers/fast_transformers/recurrent/transformers.py in forward(self, x, state, memory)
        131         # Apply all the transformers
        132         for i, layer in enumerate(self.layers):
    --> 133             x, s = layer(x, state[i])
        134             state[i] = s
        135 
    
    /usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    /content/fast-transformers/fast_transformers/recurrent/transformers.py in forward(self, x, state, memory)
         77 
         78         # Run the self attention and add it to the input
    ---> 79         x2, state = self.attention(x, x, x, state)
         80         x = x + self.dropout(x2)
         81 
    
    /usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    /content/fast-transformers/fast_transformers/recurrent/attention/self_attention/attention_layer.py in forward(self, query, key, value, state, memory)
         83 
         84         # Reshape them into many heads and compute the attention
    ---> 85         N, D = query.shape
         86         H = self.n_heads
         87         new_value, state = self.inner_attention(
    
    ValueError: too many values to unpack (expected 2)
    
    opened by hadaev8 11
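
    For context, the traceback above fails at N, D = query.shape, which suggests the recurrent modules consume one time step at a time (a 2D tensor of shape (batch, features)) rather than a full (batch, length, features) sequence; the non-recurrent TransformerEncoder is the one that takes whole sequences. A hedged sketch of feeding a batched sequence step by step (recurrent_encoder is a placeholder for a model built with the recurrent builders):

    import torch

    x = torch.rand(16, 120, 256)                          # (N, L, D) placeholder batch
    state, outputs = None, []
    for t in range(x.shape[1]):
        y_t, state = recurrent_encoder(x[:, t], state)    # x[:, t] has shape (N, D)
        outputs.append(y_t)
    y = torch.stack(outputs, dim=1)                       # back to (N, L, D)
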
  • Huggingface Bert vs. Fast Transformer full attention

    Huggingface Bert vs. Fast Transformer full attention

    First of all thank you for this amazing work!

    In my research I am comparing different encoders for relation extraction. What I noticed is that the transformer implementation of this repo with full attention performs worse (in terms of F1 score) than the huggingface bert implementation. I use an unpretrained huggingface bert. My expectation is that this setup should perform the same as an untrained bert from huggingface.

    TransformerEncoderBuilder.from_kwargs(
                n_layers=12,
                n_heads=12,
                query_dimensions=64,
                value_dimensions=64,
                feed_forward_dimensions=3072,
                attention_type="full",
                activation="gelu"
            ).get()
    

    Is my expectation correct? Why does it perform worse?

    opened by lipmem 9
  • Memory usage: native PyTorch vs. "full"-Attention

    Memory usage: native PyTorch vs. "full"-Attention

    Hello,

    I wanted to leave some observations of mine here regarding memory consumption (which is often a critical factor). It might be of some interest for others who want to benchmark their implementations.

    The fast-transformers implementation of full self-attention uses around 35% more GPU memory and is slightly slower than the native PyTorch implementation. I would like to note that this holds for my specific setup and that I ran only a limited number of test runs (4 each), which I report here. I only discovered this because my initial configuration/implementation in PyTorch did fit into memory.

    Both models use some embedding beforehand and differ only in the TransformerEncoderLayer / TransformerEncoderBuilder. I did not construct a minimal example, just exchanged the modules in my workflow to test the different implementations.

    The following numbers belong to this specific configuration:

    Architecture: encoder only
    Attention mask: causal masked (upper triangle)
    Layer number: 8
    Embedding dimension: 64
    Number of heads: 4
    Feed-forward dimension: 4 * 64
    Max sequence length: 4096
    Batch size: 1
    GPU: single RTX 2080 Ti

    Peak memory usage in each run: native PyTorch: 6152 - 6200 MB; fast-transformers: 8312 - 8454 MB

    Computation time per epoch in each run: native PyTorch: 9min 9s - 9min 33s; fast-transformers: 10min 18s - 10min 48s

    The same configuration with 16 layers does fit into the GPU (~11 GB) using native PyTorch but throws an OOM with fast-transformers. I suppose this is not an important issue as long as both implementations provide similar results (I might test that on my specific setup in the next couple of days, too), since the focus of the library lies on the efficient implementations.

    opened by GregorKobsik 9
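
    For anyone reproducing numbers like the above, peak GPU memory per run can be measured with PyTorch's built-in counters (standard torch.cuda API; model and batch below are placeholders):

    import torch

    torch.cuda.reset_peak_memory_stats()
    out = model(batch)                        # placeholder forward pass
    out.sum().backward()                      # include backward if training memory is what matters
    torch.cuda.synchronize()
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"peak GPU memory: {peak_mb:.0f} MB")
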
  • CUDA problems in causal linear product

    CUDA problems in causal linear product

    Hi, my machine has 4 GPUs, but when I use GPU 1 (where the default GPU is 0), I found that the CUDA code is still computed on GPU 0. Also, the code cannot run when I use multiple GPUs at once; there is an out-of-memory error. (A possible device-selection workaround is sketched below.)

    bug 
    opened by xyltt 8
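
    A possible workaround (not a fix for the underlying kernel issue, and only a sketch with placeholder names) is to make the intended GPU the current device before running the model, so that custom kernels launched on the current device pick the right one:

    import torch

    device = torch.device("cuda:1")
    torch.cuda.set_device(device)   # make cuda:1 the current device for kernel launches
    model = model.to(device)        # placeholder model
    x = x.to(device)                # placeholder input
    y = model(x)
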
  • RuntimeError: CUDA error: invalid argument when running tests/attention/test_improved_clustered_transformer_gpu.py

    RuntimeError: CUDA error: invalid argument when running tests/attention/test_improved_clustered_transformer_gpu.py

    I have changed some hyperparameters of test_improved_clustered_transformer_gpu.py (see the attached image). When 'input length' is 475 and 'd_model' is larger than 1540, the script hits "RuntimeError: CUDA error: invalid argument". Could you tell me why this happens?

    opened by justimyhxu 8
  • No module named 'fast_transformers.causal_product.causal_product_cpu' (solved: needed to add CUDA to the PATH)

    No module named 'fast_transformers.causal_product.causal_product_cpu' (solved: needed to add CUDA to the PATH)

    Hi there,

    I am having some trouble using this library. I cloned this repo (July 19th) and ran the setup file; the setup ran, but now I am getting this error (the same error occurs with pip install):

      File "/usr/local/lib/python3.6/dist-packages/fast_transformers/builders/__init__.py", line 29, in <module>
        from .transformer_encoder_builder import TransformerEncoderBuilder
      File "/usr/local/lib/python3.6/dist-packages/fast_transformers/builders/transformer_encoder_builder.py", line 31, in <module>
        from ..attention import AttentionLayer, FullAttention, \
      File "/usr/local/lib/python3.6/dist-packages/fast_transformers/attention/__init__.py", line 13, in <module>
        from .causal_linear_attention import CausalLinearAttention
      File "/usr/local/lib/python3.6/dist-packages/fast_transformers/attention/causal_linear_attention.py", line 12, in <module>
        from fast_transformers.causal_product import causal_dot_product 
      File "/usr/local/lib/python3.6/dist-No module named 'fast_transformers.causal_product.causal_product_cpu'packages/fast_transformers/causal_product/__init__.py", line 9, in <module>
        from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu, \
    ModuleNotFoundError: 
    

    When I comment out the import of this file (above), I get an import error on the hashing files instead, so I think the issue is these CUDA files. I am using Ubuntu 18.04 and PyTorch 1.5.1 with CUDA 10.2. However, using the exact same setup procedure on Google Colab, I have no issues - Colab uses PyTorch 1.5.1 but CUDA 10.1.

    Could the CUDA version difference be the issue?

    Thanks :)

    opened by ghost 8
  • Tips and tricks for training linear_att

    Tips and tricks for training linear_att

    Hello,

    I have migrated your linear_attention.py to be compatible with huggingface. I have also modified the masking part to build the LengthMask.

    The thing is that the model is very brittle and tends to diverge. It is very sensitive to hyper-parameters and initialization.

    Do you have some tips and tricks to train the linear_attention?

    Thanks!

    class LinearAttention(nn.Module):
        """Implement unmasked attention using dot product of feature maps in
        O(N D^2) complexity.
        Given the queries, keys and values as Q, K, V instead of computing
            V' = softmax(Q.mm(K.t()), dim=-1).mm(V),
        we make use of a feature map function Φ(.) and perform the following
        computation
            V' = normalize(Φ(Q).mm(Φ(K).t())).mm(V).
        The above can be computed in O(N D^2) complexity where D is the
        dimensionality of Q, K and V and N is the sequence length. Depending on the
        feature map, however, the complexity of the attention might be limited.
        Arguments
        ---------
            feature_map: callable, a callable that applies the feature map to the
                         last dimension of a tensor (default: elu(x)+1)
            eps: float, a small number to ensure the numerical stability of the
                 denominator (default: 1e-6)
            event_dispatcher: str or EventDispatcher instance to be used by this
                              module for dispatching events (default: the default
                              global dispatcher)
        """
    
        def __init__(self, config, feature_map=None, eps=1e-4):
            super(LinearAttention, self).__init__()
            self.feature_map = (
                feature_map(config.true_hidden_size) if feature_map else
                elu_feature_map(config.true_hidden_size)
            )
    
            self.num_attention_heads = config.num_attention_heads
            self.attention_head_size = int(config.true_hidden_size / config.num_attention_heads)
            self.all_head_size = self.num_attention_heads * self.attention_head_size
    
            self.eps = eps
            self.query_projection = nn.Linear(config.true_hidden_size, self.all_head_size)
            self.key_projection = nn.Linear(config.true_hidden_size, self.all_head_size)
            self.value_projection = nn.Linear(config.true_hidden_size if config.use_bottleneck_attention else config.hidden_size, self.all_head_size)
            self.out_projection = nn.Linear(config.true_hidden_size, config.true_hidden_size)
            self.n_heads = config.num_attention_heads
        
        def transpose_for_scores(self, x):
            new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
            x = x.view(*new_x_shape)
            return x.permute(0, 2, 1, 3)
    
        def forward(self, queries, keys, values, attn_mask, query_lengths,
                    key_lengths):
    
            N, L, _ = queries.shape
            _, S, _ = keys.shape
            H = self.n_heads
    
            # Project the queries/keys/values
            queries = self.query_projection(queries).view(N, L, H, -1)
            keys = self.key_projection(keys).view(N, S, H, -1)
            values = self.value_projection(values).view(N, S, H, -1)
    
            # Apply the feature map to the queries and keys
            # self.feature_map.new_feature_map(queries.device)
            Q = self.feature_map.forward_queries(queries)
            K = self.feature_map.forward_keys(keys)
    
            # Apply the key padding mask and make sure that the attn_mask is
            # all_ones
            if not attn_mask.all_ones:
                raise RuntimeError(("LinearAttention does not support arbitrary "
                                    "attention masks"))
            K = K * key_lengths.float_matrix[:, :, None, None]
    
            # Compute the KV matrix, namely the dot product of keys and values so
            # that we never explicitly compute the attention matrix and thus
            # decrease the complexity
            KV = torch.einsum("nshd,nshm->nhmd", K, values)
    
            # Compute the normalizer
            Z = 1 / (torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1)) + self.eps)
    
            # Finally compute and return the new values
            V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z).contiguous().view(N, L, -1)
    
            return self.out_projection(V)
    
    def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            output_hidden_states=None,
            output_attentions=None,
            return_dict=None,
            output_layers=None,
            regression=False,
        ):
            output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
            output_hidden_states = (
                output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
            )
            return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    
            if input_ids is not None and inputs_embeds is not None:
                raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
            elif input_ids is not None:
                input_shape = input_ids.size()
            elif inputs_embeds is not None:
                input_shape = inputs_embeds.size()[:-1]
            else:
                raise ValueError("You have to specify either input_ids or inputs_embeds")
    
            device = input_ids.device if input_ids is not None else inputs_embeds.device
    
            N = input_shape[0]
            L = input_shape[1]
            if input_ids is not None:
                x = input_ids
            elif inputs_embeds is not None:
                x = inputs_embeds
            else:
                raise ValueError("You have to specify either input_ids or inputs_embeds")
            extended_attention_mask = FullMask(L, device=x.device)
            # extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 # ¿?
            head_mask = LengthMask(x.new_full((N,), L, dtype=torch.int64))
    
    
    opened by gaceladri 7
  • windows installation error linking local_product_cuda.cu

    windows installation error linking local_product_cuda.cu

    I've been trying to install on Windows using pip and it looks like I'm almost there. I get through compiling everything and then I get an error when trying to complete linking of local_product_cuda.

    System: Win 10, cuda 10.2.89 , pytorch 1.6, python 3.8

    traceback: local_product_cuda.cu C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\user\Anaconda3\envs\testenv\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib/x64" /LIBPATH:C:\Users\user\Anaconda3\envs\testenv\libs /LIBPATH:C:\Users\user\Anaconda3\envs\testenv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17763.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17763.0\um\x64" c10.lib torch.lib torch_cpu.lib torch_python.lib cudart.lib c10_cuda.lib torch_cuda.lib /EXPORT:PyInit_local_product_cuda C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product/local_product_cuda.obj /OUT:build\lib.win-amd64-3.8\fast_transformers\local_product\local_product_cuda.cp38-win_amd64.pyd /IMPLIB:C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product\local_product_cuda.cp38-win_amd64.lib

    Creating library C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product\local_product_cuda.cp38-win_amd64.lib and object C:\Users\user\AppData\Local\Temp\pip-install-x6t631um\pytorch-fast-transformers\build\temp.win-amd64-3.8\Release\fast_transformers/local_product\local_product_cuda.cp38-win_amd64.exp

    local_product_cuda.obj : error LNK2001: unresolved external symbol "public: long * __cdecl at::Tensor::data_ptr(void)const " (??$data_ptr@J@Tensor@at@@QEBAPEAJXZ)

    build\lib.win-amd64-3.8\fast_transformers\local_product\local_product_cuda.cp38- win_amd64.pyd : fatal error LNK1120: 1 unresolved externals

    error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\link.exe' failed with exit status 1120

    I've been investigating for quite a few hours now, but I can't figure out why I'm getting a linking error. From searching the error it seems like it's some issue with the .lib or function definition not being accessible to the .obj, but it seems like both the .lib and .obj are being created, and I'm assuming all definitions are wrapped into the pip bundle if others are able to install. I wanted to post here in case it is an issue with the dependencies somewhere or something getting messed up with windows. Anyone else having this problem or have an idea where to start in solving it?

    Thanks!

    opened by lm-b 7
  • Understanding how to define key, query and value for the cross attention calculation

    Understanding how to define key, query and value for the cross attention calculation

    Hello,

    I have problem understanding how I can use this library to implement cross attention

    for instance, if tensor x = torch.rand(100, 14, 64) is the key, tensor y = torch.rand(100, 11, 64) is the value and tensor z = torch.rand(100, 14, 1) is the query, how can I use TransformerDecoderBuilder to compute the cross attention for this example?

    Here is how I built the encoder and decoder classes:

    import math
    import torch
    import torch.nn as nn
    import fast_transformers
    from fast_transformers.builders import TransformerEncoderBuilder, TransformerDecoderBuilder
    from collections import OrderedDict
    
    
    class PositionalEncoding(nn.Module):
        def __init__(self, max_len, d_model, dropout_prob=0.0, series_dimensions=1):
            global pe
            super().__init__()
            self.dropout = nn.Dropout(p=dropout_prob)
            self.d_model = d_model
            self.max_len = max_len
            self.series_dimensions = series_dimensions
            
            if self.series_dimensions == 1:
                if d_model % 2 != 0:
                    raise ValueError("Cannot use sin/cos positional encoding with "
                                     "odd dim (got dim={:d})".format(d_model))
                pe = torch.zeros(self.max_len, d_model).float()
                pe.require_grad = False
                position = torch.arange(0, self.max_len, dtype=torch.float).unsqueeze(1)
                div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
                pe[:, 0::2] = torch.sin(position * div_term)
                pe[:, 1::2] = torch.cos(position * div_term)
            elif self.series_dimensions > 1:
                if d_model % 4 != 0:
                    raise ValueError("Cannot use sin/cos positional encoding with "
                                     "odd dim (got dim={:d})".format(d_model))
                height = self.series_dimensions
                width = self.max_len
                pe = torch.zeros(d_model, height, width).float()
                pe.require_grad = False
                # Each dimension use half of d_model
                d_model = int(d_model / 2)
                div_term = torch.exp(torch.arange(0., d_model, 2) * -(math.log(10000.0) / d_model))
                pos_w = torch.arange(0., width).unsqueeze(1)
                pos_h = torch.arange(0., height).unsqueeze(1)
                pe[0:d_model:2, :, :] = torch.sin(pos_w * div_term).transpose(0, 1).unsqueeze(1).repeat(1, height, 1)
                pe[1:d_model:2, :, :] = torch.cos(pos_w * div_term).transpose(0, 1).unsqueeze(1).repeat(1, height, 1)
                pe[d_model::2, :, :] = torch.sin(pos_h * div_term).transpose(0, 1).unsqueeze(2).repeat(1, 1, width)
                pe[d_model + 1::2, :, :] = torch.cos(pos_h * div_term).transpose(0, 1).unsqueeze(2).repeat(1, 1, width)
                pe = pe.view(2*d_model, height * width, -1).squeeze(-1) # Flattening it back to 1D series
                pe = pe.transpose(0, 1)
                
            pe = pe.unsqueeze(0) # Extending it by an extra leading dim for the batches
            self.register_buffer('pe', pe)
    
        # Expecting a flattened (1D) series
        def forward(self, x):
            x = x + self.pe[:, :x.size(1), :]
            return self.dropout(x)
    
    
    class LinearTransformerCausalEncoder(torch.nn.Module):
        def __init__(self, input_features, output_features, hidden_dim, sequence_length, 
                     attention_type='causal-linear', n_layers=2, n_heads=4,
                     dropout=0.1, softmax_temp=None, activation_fn="gelu",
                     attention_dropout=0.1,
                    ):
            super(LinearTransformerCausalEncoder, self).__init__()
            #
            self.d_model=hidden_dim*n_heads
            #
            self.pos_embedding = PositionalEncoding(
                                                   max_len=sequence_length,
                                                   d_model=self.d_model, #hidden_dim*n_heads      
                                                   )
            self.value_embedding = nn.Linear(
                input_features,
                self.d_model
            )
            self.builder_dict = OrderedDict({
                "attention_type": attention_type,
                "n_layers": n_layers,
                "n_heads": n_heads,
                "feed_forward_dimensions": self.d_model*2,
                "query_dimensions": hidden_dim,
                "value_dimensions": hidden_dim,
                "dropout": dropout,
                "softmax_temp": softmax_temp,
                "activation" : activation_fn,
                "attention_dropout": attention_dropout,
            })
            self.transformer = TransformerEncoderBuilder.from_dictionary(
                self.builder_dict,
                strict=True
            ).get()
            hidden_size = n_heads*hidden_dim
            ##
            self.predictor = torch.nn.Linear(
                hidden_size,
                output_features
            )
        def forward(self, x):
            # x: [batch_size, input_dim, sequence_length]
            x = x.permute(0,2,1)
            x = self.value_embedding(x) # x: [batch size, sequence_length, n_heads* hiden_size]
            x = self.pos_embedding(x) # x: [batch size, sequence_length, n_heads* hiden_size]
            triangular_mask = fast_transformers.masking.TriangularCausalMask(x.size(1), device=x.device) # triangular_mask: [ sequence_length,  sequence_length]       
            y_hat = self.transformer(x, attn_mask=triangular_mask) # y_hat: [batch size, sequence_length, n_heads* hiden_size]     
            y_hat = self.predictor(y_hat) # y_hat: [batch size, sequence_length, output_size]
            return y_hat.permute(0,2,1)   # y_hat: [batch size, output_size, sequence_length]
    
    class LinearTransformerCausalDecoder(torch.nn.Module):
        def __init__(self, output_features, hidden_dim, sequence_length, 
                     attention_type='causal-linear', n_layers=2, n_heads=4,
                     d_query=32, dropout=0.1, softmax_temp=None,activation_fn="gelu",
                     attention_dropout=0.1,):
            super(LinearTransformerCausalDecoder, self).__init__()
            self.d_model=hidden_dim*n_heads
            self.pos_embedding = PositionalEncoding(
                 max_len=sequence_length,
                d_model=self.d_model, #hidden_dim*n_heads
               
            )
        
            self.value_embedding = torch.nn.Linear(
                output_features,
                self.d_model
            )
            self.builder_dict = OrderedDict({
                "cross_attention_type":attention_type,
                "self_attention_type":attention_type,
                "n_layers": n_layers,
                "n_heads": n_heads,
                "feed_forward_dimensions": self.d_model*2,
                "query_dimensions": hidden_dim,
                "value_dimensions": hidden_dim,
                "dropout": dropout,
                "softmax_temp": softmax_temp,
                "activation" : activation_fn,
                "attention_dropout": attention_dropout,
            })
            self.transformer = TransformerDecoderBuilder.from_dictionary(
                self.builder_dict,
                strict=True
            ).get()
            hidden_size = n_heads*hidden_dim
            
            self.predictor = torch.nn.Linear(
                hidden_size,
                output_features
            )
        def forward(self, target, memory, len_mask=None):
            
            x = target.permute(0,2,1) # x: [batch_size, sequence_length, input_dim]
            x = self.value_embedding(x) # x: [batch size, sequence_length, n_heads* hiden_size]
            x = self.pos_embedding(x) # x: [batch size, sequence_length, n_heads* hiden_size]
            triangular_mask = fast_transformers.masking.TriangularCausalMask(x.size(1), device=x.device) # triangular_mask: [ sequence_length,  sequence_length]       
            y_hat = self.transformer(x, memory, triangular_mask, len_mask=None) # y_hat: [batch size, sequence_length, n_heads* hiden_size]   
            y_hat = self.predictor(y_hat) # y_hat: [batch size, sequence_length, output_size]
        return y_hat.permute(0,2,1)   # y_hat: [batch size, output_size, sequence_length]
    

    I have difficulty understanding how I can use LinearTransformerCausalDecoder to compute the cross attention. I would appreciate it if anyone could clarify it for this example's key, query and value. Thanks. (A minimal cross-attention sketch follows below.)

    opened by neuronphysics 0
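
    For the question above, a hedged sketch of driving a decoder for cross attention: the target (query) sequence is passed as the decoder input and the encoder output as memory, and the decoder's cross-attention layers derive keys and values from memory internally. Here encoder and decoder are placeholders for builder-produced modules (or the classes defined above), and the linear projections only bring the example tensors to a common model dimension:

    import torch
    import torch.nn as nn

    d_model = 4 * 64                      # illustrative: n_heads * value_dimensions
    z = torch.rand(100, 14, 1)            # query-side sequence
    y = torch.rand(100, 11, 64)           # key/value-side sequence

    q_proj = nn.Linear(1, d_model)        # lift the 1-dim query features to d_model
    kv_proj = nn.Linear(64, d_model)      # lift the key/value features to d_model

    memory = encoder(kv_proj(y))          # placeholder encoder: (100, 11, d_model)
    out = decoder(q_proj(z), memory)      # cross attention: queries from z, keys/values from memory
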
  • Cuda version

    Cuda version

    Is this package tested on more recent cuda and pytorch versions?

    My code calls fast_transformers.causal_product, which is actually the only function I call from this package.

    I set up this package with the latest pytorch 1.13.0+cuda11.6 and get NaN errors during training. This, however, doesn't happen with the older pytorch 1.7.1+cuda11.0.

    opened by jiaji-huang 1
  • Can't officially save Linear Attention model

    Can't officially save Linear Attention model

    Tried (on Ubuntu) to torch.save (1.1.0) a model using Linear Attention (0.4.0) and got the following serialization error: PicklingError: Can't pickle <function <lambda> at 0x7fa4f10120e0>: attribute lookup <lambda> on fast_transformers.feature_maps.base failed

    Any solution? Should I open a PR? (A state_dict workaround is sketched below.)

    opened by maulberto3 1
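
    A common workaround for pickling errors caused by lambda attributes (standard PyTorch practice, not specific to this library) is to save and load the state dict rather than the whole module; build_model below is a placeholder for however the model was constructed:

    import torch

    torch.save(model.state_dict(), "linear_attention_model.pt")      # weights only, no lambdas pickled

    # Later / elsewhere: rebuild the same architecture, then load the weights.
    model = build_model()                                             # placeholder constructor
    model.load_state_dict(torch.load("linear_attention_model.pt"))
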
  • Runtime error on causal_product_cpu on GCC/G++ 11

    Runtime error on causal_product_cpu on GCC/G++ 11

    I've built pytorch-fast-transformers on Ubuntu 21.10, CUDA 11.6, GCC/G++ 11. The build worked fine. On:

    import fast_transformers.causal_product.causal_product_cpu

    from an __init__ file, it throws the following error:

    [...]
      File "python3.8/site-packages/fast_transformers/builders/__init__.py", line 42, in <module>
        from ..attention import
      File "python3.8/site-packages/fast_transformers/attention/__init__.py", line 13, in <module>
        from .causal_linear_attention import CausalLinearAttention
      File "python3.8/site-packages/fast_transformers/attention/causal_linear_attention.py", line 15, in <module>
        from ..causal_product import causal_dot_product
      File "python3.8/site-packages/fast_transformers/causal_product/__init__.py", line 9, in <module>
        from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu,
    ImportError: python3.8/site-packages/fast_transformers/causal_product/causal_product_cpu.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

    After using update-alternatives to point to GCC/G++ 10, the runtime error is gone.

    More version info, in verbose format. Due to the update-alternatives, I'm calling g++-11 and gcc-11 specifically; they were the default in Ubuntu 21.10:

    ✗ nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2021 NVIDIA Corporation Built on Fri_Dec_17_18:16:03_PST_2021 Cuda compilation tools, release 11.6, V11.6.55 Build cuda_11.6.r11.6/compiler.30794723_0

    ✗ gcc-11 --version gcc-11 (Ubuntu 11.2.0-7ubuntu2) 11.2.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    ✗ g++-11 --version g++-11 (Ubuntu 11.2.0-7ubuntu2) 11.2.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE

    opened by lsisoft 3
  • How is the causal mask constructed when training the batch model with linear causal attention?

    How is the causal mask constructed when training the batch model with linear causal attention?

    Hi! I have a few questions about the difference in models.

    I understand how the recurrent model is set up; it is described in the publication. But how is efficient training achieved in batch fashion? As far as I understand, because we never explicitly calculate the attention matrix, we can't just apply a triangular mask. How does this work then? Is it just iterative as in the recurrent model, but implemented in CUDA? Is it easily parallelizable as 3 matrix multiplications (like in full attention)? (A reference sketch of the batched causal computation follows below.)

    Thanks!

    opened by Howuhh 0
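
    For reference, a naive sketch of the causal numerator and denominator from the linear attention paper, computed with cumulative sums over the sequence so that no N x N attention matrix or triangular mask is ever materialized. This is not the library's CUDA kernel (which avoids storing the running outer products); Q and K here are assumed to be already feature-mapped:

    import torch

    def causal_linear_attention_reference(Q, K, V, eps=1e-6):
        # Q, K: (N, L, H, D) feature-mapped queries/keys; V: (N, L, H, M) values.
        # S_i = sum_{j<=i} K_j V_j^T and z_i = sum_{j<=i} K_j, built with cumulative sums.
        KV = torch.einsum("nlhd,nlhm->nlhdm", K, V).cumsum(dim=1)   # running sum of outer products
        Z = K.cumsum(dim=1)                                         # running sum of keys
        num = torch.einsum("nlhd,nlhdm->nlhm", Q, KV)
        den = torch.einsum("nlhd,nlhd->nlh", Q, Z).clamp(min=eps)
        return num / den.unsqueeze(-1)
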
Owner

Idiap Research Institute