Tutorials and implementations for "Self-normalizing networks"

Tutorials and implementations for "Self-normalizing networks" (SNNs) as suggested by Klambauer et al. (arXiv pre-print).


  • See the environment file for the full list of prerequisites. Tutorial implementations use Tensorflow > 2.0 (Keras) or Pytorch, but versions for Tensorflow 1.x users based on the deprecated tf.contrib module (with a separate environment file) are also available.

Note for Tensorflow >= 1.4 users

Tensorflow >= 1.4 already has the functions tf.nn.selu and tf.contrib.nn.alpha_dropout, which implement the SELU activation function and the suggested dropout version.
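For readers who want to see what the SELU activation computes independently of any framework, here is a minimal standard-library sketch. The function name selu and the constant names are illustrative choices, not part of this repository; the two constants are the usual values for the fixed point with zero mean and unit variance.

```python
import math

# Standard SELU constants (fixed point: mean 0, variance 1).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: float) -> float:
    """Scaled exponential linear unit for a single float (illustrative helper)."""
    if x > 0.0:
        return SELU_SCALE * x
    # expm1(x) = exp(x) - 1, computed accurately for small |x|.
    return SELU_SCALE * SELU_ALPHA * math.expm1(x)


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0):
        print(value, selu(value))
```

For large negative inputs the function saturates at -SELU_SCALE * SELU_ALPHA, which is the value alpha dropout uses for dropped units.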

Note for Tensorflow >= 2.0 users

Tensorflow 2.3 already provides the SELU activation function through the high-level Keras framework as tf.keras.activations.selu. It must be combined with the initializer tf.keras.initializers.LecunNormal; the corresponding dropout version is tf.keras.layers.AlphaDropout.
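To illustrate what the alpha dropout variant does, the following is a framework-free sketch under the assumption that the inputs have zero mean and unit variance. Dropped units are set to the SELU saturation value, and an affine correction a * x + b restores mean 0 and variance 1. The function name alpha_dropout and its signature are hypothetical and not the Keras API.

```python
import random

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805
# Value SELU approaches for very negative inputs; dropped units are set to it.
ALPHA_PRIME = -SELU_SCALE * SELU_ALPHA


def alpha_dropout(values, keep_prob, rng):
    """Training-time alpha dropout on a list of floats (illustrative sketch)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    drop_prob = 1.0 - keep_prob
    # Affine correction so that zero-mean, unit-variance inputs keep
    # mean 0 and variance 1 after units have been dropped.
    a = (keep_prob + ALPHA_PRIME ** 2 * keep_prob * drop_prob) ** -0.5
    b = -a * drop_prob * ALPHA_PRIME
    out = []
    for v in values:
        kept = v if rng.random() < keep_prob else ALPHA_PRIME
        out.append(a * kept + b)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(10000)]
    ys = alpha_dropout(xs, keep_prob=0.9, rng=rng)
    print(sum(ys) / len(ys))
```

Because the correction is applied during training, this formulation needs no extra rescaling at prediction time; the framework layers act as the identity outside training.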

Note for Pytorch users

Pytorch versions >= 0.2 feature torch.nn.SELU and torch.nn.AlphaDropout. They must be combined with the correct initializer, namely torch.nn.init.kaiming_normal_(parameter, mode='fan_in', nonlinearity='linear'), as this is identical to LeCun initialisation (mode='fan_in') with a gain of 1 (nonlinearity='linear').
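The equivalence stated above comes down to the standard deviation used to draw the weights: gain / sqrt(fan_in), with a gain of 1 for the 'linear' setting. A small standard-library sketch of that rule follows; the helper names lecun_normal_std and sample_weights are hypothetical and not part of Pytorch.

```python
import math
import random


def lecun_normal_std(fan_in: int, gain: float = 1.0) -> float:
    """Standard deviation of LeCun normal initialisation (gain 1 by default)."""
    return gain / math.sqrt(fan_in)


def sample_weights(fan_in: int, rng: random.Random):
    """Draw the incoming weights of one unit from N(0, 1 / fan_in)."""
    std = lecun_normal_std(fan_in)
    return [rng.gauss(0.0, std) for _ in range(fan_in)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = sample_weights(4096, rng)
    # For independent inputs with unit variance, the variance of the
    # pre-activation equals the sum of squared weights, which is close to 1.
    print(sum(w * w for w in weights))
```

With this scaling, a unit that receives independent unit-variance inputs produces a pre-activation with variance close to 1, which is the regime the SELU fixed point assumes.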


Tensorflow 1.x

  • Multilayer Perceptron on MNIST (notebook)
  • Convolutional Neural Network on MNIST (notebook)
  • Convolutional Neural Network on CIFAR10 (notebook)

Tensorflow 2.x (Keras)


  • Multilayer Perceptron on MNIST (notebook)
  • Convolutional Neural Network on MNIST (notebook)
  • Convolutional Neural Network on CIFAR10 (notebook)

Further material

Design novel SELU functions (Tensorflow 1.x)

  • How to obtain the SELU parameters alpha and lambda for arbitrary fixed points (notebook)
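As a rough illustration of the idea, the sketch below covers only the standard fixed point (mean 0, variance 1): alpha and lambda follow from requiring that a standard normal input keeps zero mean and unit second moment after the activation. It is not the notebook's code, the function name is hypothetical, and other fixed points would need a numerical solver.

```python
import math


def selu_parameters_for_standard_fixed_point():
    """Return (alpha, scale) for the fixed point mean 0, variance 1."""
    # Partial moments of a standard normal variable x on each half-line.
    pos_mean = 1.0 / math.sqrt(2.0 * math.pi)  # E[x; x > 0]
    pos_second = 0.5                           # E[x^2; x > 0]
    neg_mass = 0.5                             # P(x <= 0)
    # E[exp(x); x <= 0] and E[exp(2x); x <= 0]
    neg_exp = 0.5 * math.exp(0.5) * math.erfc(1.0 / math.sqrt(2.0))
    neg_exp2 = 0.5 * math.exp(2.0) * math.erfc(math.sqrt(2.0))

    # Zero mean: pos_mean + alpha * (neg_exp - neg_mass) = 0
    alpha = pos_mean / (neg_mass - neg_exp)

    # Unit second moment:
    # scale^2 * (pos_second + alpha^2 * E[(exp(x) - 1)^2; x <= 0]) = 1
    neg_second = neg_exp2 - 2.0 * neg_exp + neg_mass
    scale = 1.0 / math.sqrt(pos_second + alpha ** 2 * neg_second)
    return alpha, scale


if __name__ == "__main__":
    alpha, scale = selu_parameters_for_standard_fixed_point()
    print(alpha, scale)
```

The two returned values should agree with the constants commonly quoted for SELU, roughly 1.6733 for alpha and 1.0507 for lambda.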

Basic python functions to implement SNNs (Tensorflow 1.x)

are provided as code chunks here: selu.py

Notebooks and code to produce Figure 1 (Tensorflow 1.x)

are provided here: Figure1, builds on top of the biutils package.

Calculations and numeric checks of the theorems (Mathematica)

are provided as Mathematica notebooks here:

UCI, Tox21 and HTRU2 data sets

  • Is SeLU alone having positive impact on accuracy?

    The MNIST and CIFAR-10 tutorials use SELU together with alpha dropout, and the reported result is that the SNN outperforms ReLU- and ELU-based models. MNIST models such as LeNet can reach quite good accuracy without dropout or batch normalization, so my question is whether, according to your observations, SELU alone (no dropout and no batch norm) increases accuracy. In other words, I have a basic CNN that works on MNIST (convolutions, ReLU, fully connected layers and softmax); assuming that weight initialization and input normalization are done correctly, can I expect increased accuracy?

    opened by jczaja 6
  • Can someone help me with creating the csv from sdf with exactly the same number of features?

    I used the skchem pipeline to extract meaningful features from the SDF training file mentioned on the official Tox21 challenge website, but I am not able to get the 801 features that are in the zipped CSV file. Can someone help me with the Python code for that? My aim is to experiment with the architecture and the SELU technique, not to get into domain-specific feature extraction details.

    opened by pdcoded 5
  • How should we handle skip connections properly?

    Could you please give me some hints?

    Thank you

    opened by qbx2 3
  • Categorical and continuous variables preprocessing

    With the UCI data, how did you preprocess the categorical and continuous variables?

    Did you enforce a min/max or did you just standardize the continuous variables? And for the categorical, did you use one-hot/dummy coding or standardized them?

    Edit: Also, what batch size did you use? Did it depend on the sample size?


    opened by AlexiaJM 3
  • SELU values for a truncated normal distribution

    https://github.com/bioinf-jku/SNNs/blob/f992b229795712a54c67266995d8ea522cd10770/selu.py#L31 and many other examples (e.g. Keras) do an additional trick where samples are resampled if they're not within two standard deviations of the mean. I'm curious how much of an effect this truncation has on the fixed-point derivation. Are the fixed points analytically identical for a normal distribution and a truncated normal distribution?

    I read in the paper that "Uniform and truncated Gaussian distributions with these moments led to networks with similar behavior." but this feels unsatisfactory to me. Maybe a small discrepancy becomes really problematic for deeper networks? This aligns with my experience that it's still beneficial to have batchnorm/layernorm with SELU.

    opened by carlthome 2
  • cnn_graph


    Thank you for sharing the code. I have successfully applied them (SELU and alpha_dropout) building a pure fully connected network (7 layers) for a regression problem (the R2 between the predicted and the observed variable is greater than 0.99!).

    Right now I'm trying to replace the RELU in cnn_graph with SELU. Unlike the standard cnn, cnn_graph performs the convolution on the graph Fourier transformed inputs (a recursive process involves multiple matrix multiplications between the layer-specific graph Laplacian and the layer inputs). The original normalized input is shifted to some unknown distribution by the graph Fourier transform. Therefore, I don't know how to apply the SELU even on the first cnn_graph layer. Could you give me some suggestions on this?

    Besides, can I use some normalization on the output of cnn_graph before feeding them into fully connected network using SELU as activation function?


    opened by maosi-chen 2
  • batch normalization

    @bioinf-jku, thank you for your nice work! I am new to deep learning and have some simple questions. Since the net in your test code is not very deep, it makes no big difference to add batch normalization layers after each convolution layer. But if the net is very deep, is it necessary to add a batch normalization layer after each convolution layer? Or is there no need to do so, since the SELU activation function has the ability to normalize the layer inputs? Thank you in advance!

    opened by zhly0 2
  • Questions on the self-normalizing property

    I think the proposed SELU is a powerful non-linearity for MLPs. The self-normalizing property comes from the derivation of the forward propagation. This property can be confirmed by the following code.

    import torch
    f = torch.nn.functional.selu
    x = torch.randn(1024, 1024) * 456 + 123
    lin = torch.nn.Linear(1024, 1024, bias=False)
    _ = torch.nn.init.kaiming_normal_(lin.weight, nonlinearity="linear")
    with torch.no_grad():
        for i in range(100): x = f(lin(x))
    print(f"mean = {x.mean()}") # 0.00253
    print(f"var = {(x ** 2).mean()}") # 1.05135

    However, the self-normalizing property only holds for the forward pass; it does not hold for the backward pass (is that right?). Noisy gradients will definitely be harmful for the learning process. The proposed SELU is based on ELU, which is based on a selective preference. I wonder if there could exist a more general non-linearity that has self-normalizing properties for both the forward and backward propagations. If it is possible, how could we find it?

    opened by Karbo123 1
  • Effect of bias in linear layers

    I've been experimenting with SELUs, and found they provide an improvement in terms of computation time during training with respect to batchnorm, thank you for your work.

    I just have a question regarding the effect of bias in linear layers. As I understand it, every neuron should have mean zero in order to stay in the self-regularizing zone, but the bias shifts precisely that mean. In my experiments, however, I didn't see much of an effect from either removing or adding biases. I see that bias is used in the tutorial notebook, and I wonder whether you've considered the issue.

    opened by ptrcarta 1
  • information about step (1) in selu.py

    Hi, thank you for the paper and the code. Could you please tell us how you scale the inputs to zero mean and unit variance in step (1) in selu.py?

    Thank you

    opened by AzizCode92 1
  • Alpha dropouts

    Just a question: when applying alpha dropout in the prediction phase (not in training), do you scale by p, where p is the probability of a unit being kept? https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout

    opened by edubois 1
Institute of Bioinformatics, Johannes Kepler University Linz
