Official Repository for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]

Overview


Mish: Self Regularized
Non-Monotonic Activation Function

BMVC 2020 (Official Paper)



Notes: (Click to expand)
  • A considerably faster CUDA-based version can be found here - Mish CUDA (all credits to Thomas Brandon)
  • Memory Efficient Experimental version of Mish can be found here
  • Faster variants for Mish and H-Mish by Yashas Samaga can be found here - ConvolutionBuildingBlocks
  • Alternative (experimental improved) variant of H-Mish developed by Páll Haraldsson can be found here - H-Mish (Available in Julia)
  • Variance based initialization method for Mish (experimental) by Federico Andres Lois can be found here - Mish_init
Changelogs/ Updates: (Click to expand)

News/ Media Coverage:

   

  • (02/2020): Talk on Mish and Non-Linear Dynamics at Sicara is out now. Watch on:

   

  • (07/2020): CROWN: A comparison of morphology for Mish, Swish and ReLU produced in collaboration with Javier Ideami. Watch on:

   

   

  • (12/2020): Talk on From Smooth Activations to Robustness to Catastrophic Forgetting at Weights & Biases Salon is out now. Watch on:

   


MILA/ CIFAR 2020 DLRLSS (Click on arrow to view)

Contents: (Click to expand)
  1. Mish
    a. Loss landscape
  2. ImageNet Scores
  3. MS-COCO
  4. Variation of Parameter Comparison
    a. MNIST
    b. CIFAR10
  5. Significance Level
  6. Results
    a. Summary of Results (Vision Tasks)
    b. Summary of Results (Language Tasks)
  7. Try It!
  8. Future Work
  9. Acknowledgements
  10. Cite this work

Mish:

Mish is defined as f(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x)).
The minimum of f(x) is observed to be ≈ -0.30884 at x ≈ -1.1924.
Mish has a parametric order of continuity of C^∞.

Derivative of Mish with respect to Swish and Δ(x) preconditioning:

f'(x) = sech²(softplus(x)) · x·σ(x) + tanh(softplus(x))

Further simplifying:

f'(x) = Δ(x) · swish(x) + f(x)/x

Alternative derivative form:

f'(x) = (e^x · ω) / δ²

where:

ω = 4(x + 1) + 4e^(2x) + e^(3x) + e^x(4x + 6),  δ = 2e^x + e^(2x) + 2,  Δ(x) = sech²(softplus(x))

We hypothesize that Δ(x) acts as a pre-conditioner, making the gradient smoother. Further details are provided in the paper.
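
As a quick sanity check of the definition, the derivative form and the stated minimum, here is a minimal numerical sketch (NumPy/SciPy; an illustration, not part of the repository):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def softplus(x):
        return np.logaddexp(0.0, x)        # numerically stable ln(1 + e^x)

    def mish(x):
        return x * np.tanh(softplus(x))

    def mish_grad(x):
        # f'(x) = Δ(x) * swish(x) + f(x)/x, with Δ(x) = sech^2(softplus(x))
        sp = softplus(x)
        delta = 1.0 / np.cosh(sp) ** 2
        swish = x / (1.0 + np.exp(-x))
        return delta * swish + np.tanh(sp)

    res = minimize_scalar(mish, bounds=(-5.0, 0.0), method="bounded")
    print(res.x, res.fun)                  # ≈ -1.1924, ≈ -0.30884
    print(mish_grad(res.x))                # ≈ 0 at the minimum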

Loss Landscape:

To visit the interactive Loss Landscape visualizer, click here.

Loss landscape visualizations for a ResNet-20 on CIFAR-10 using ReLU, Mish and Swish (from left to right), trained for 200 epochs:


Mish provides much better accuracy, overall lower loss, and a smoother, better-conditioned, easier-to-optimize loss landscape compared to both Swish and ReLU. For all loss landscape visualizations, please visit this readme.

We also investigate the output landscape of randomly initialized neural networks as shown below. Mish has a much smoother profile than ReLU.
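
The sketch below illustrates the kind of output-landscape visualization described above: a small randomly initialized MLP maps a 2D coordinate grid to a scalar, and the resulting surface is plotted for ReLU vs. Mish. The architecture, width and grid range here are assumptions for illustration, not the paper's exact setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import matplotlib.pyplot as plt

    class Mish(nn.Module):
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))

    def random_net(act_cls, dims=(2, 128, 128, 128, 1)):
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), act_cls()]
        return nn.Sequential(*layers[:-1])  # drop the activation after the scalar output

    lin = torch.linspace(-3, 3, 200)
    grid = torch.cartesian_prod(lin, lin)   # all (x, y) coordinates, shape (200*200, 2)

    for name, act_cls in [("ReLU", nn.ReLU), ("Mish", Mish)]:
        torch.manual_seed(0)                # same random weights for both activations
        with torch.no_grad():
            z = random_net(act_cls)(grid).reshape(200, 200)
        plt.figure()
        plt.imshow(z.numpy(), extent=[-3, 3, -3, 3])
        plt.title(f"Output landscape ({name})")
    plt.show()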

ImageNet Scores:


For installing the DarkNet framework, please refer to darknet (AlexeyAB)

For PyTorch based ImageNet scores, please refer to this readme

Network Activation Top-1 Accuracy Top-5 Accuracy cfg Weights Hardware
ResNet-50 Mish 74.244% 92.406% cfg weights AWS p3.16x large, 8 Tesla V100
DarkNet-53 Mish 77.01% 93.75% cfg weights AWS p3.16x large, 8 Tesla V100
DenseNet-201 Mish 76.584% 93.47% cfg weights AWS p3.16x large, 8 Tesla V100
ResNext-50 Mish 77.182% 93.318% cfg weights AWS p3.16x large, 8 Tesla V100
Network Activation Top-1 Accuracy Top-5 Accuracy
CSPResNet-50 Leaky ReLU 77.1% 94.1%
CSPResNet-50 Mish 78.1% 94.2%
Pelee Net Leaky ReLU 70.7% 90%
Pelee Net Mish 71.4% 90.4%
Pelee Net Swish 71.5% 90.7%
CSPPelee Net Leaky ReLU 70.9% 90.2%
CSPPelee Net Mish 71.2% 90.3%

Results on CSPResNext-50:

MixUp CutMix Mosaic Blur Label Smoothing Leaky ReLU Swish Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 77.9%(=) 94%(=)
✔️ ✔️ 77.2%(-) 94%(=)
✔️ ✔️ 78%(+) 94.3%(+)
✔️ ✔️ 78.1%(+) 94.5%(+)
✔️ ✔️ 77.5%(-) 93.8%(-)
✔️ ✔️ 78.1%(+) 94.4%(+)
✔️ 64.5%(-) 86%(-)
✔️ 78.9%(+) 94.5%(+)
✔️ ✔️ ✔️ ✔️ 78.5%(+) 94.8%(+)
✔️ ✔️ ✔️ ✔️ 79.8%(+) 95.2%(+) cfg weights

Results on CSPResNet-50:

CutMix Mosaic Label Smoothing Leaky ReLU Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 76.6%(=) 93.3%(=)
✔️ ✔️ ✔️ ✔️ 77.1%(+) 94.1%(+)
✔️ ✔️ ✔️ ✔️ 78.1%(+) 94.2%(+) cfg weights

Results on CSPDarkNet-53:

CutMix Mosaic Label Smoothing Leaky ReLU Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 77.2%(=) 93.6%(=)
✔️ ✔️ ✔️ ✔️ 77.8%(+) 94.4%(+)
✔️ ✔️ ✔️ ✔️ 78.7%(+) 94.8%(+) cfg weights

Results on SpineNet-49:

CutMix Mosaic Label Smoothing ReLU Swish Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 77%(=) 93.3%(=) - -
✔️ ✔️ 78.1%(+) 94%(+) - -
✔️ ✔️ ✔️ ✔️ 78.3%(+) 94.6%(+) - -

MS-COCO:


For PyTorch based MS-COCO scores, please refer to this readme

Model Mish AP50...95 mAP50 CPU - 90 Watt - FP32 (Intel Core i7-6700K, 4GHz, 8 logical cores) OpenCV-DLIE, FPS VPU-2 Watt- FP16 (Intel MyriadX) OpenCV-DLIE, FPS GPU-175 Watt- FP32/16 (Nvidia GeForce RTX 2070) DarkNet-cuDNN, FPS
CSPDarkNet-53 (512 x 512) 42.4% 64.5% 3.5 1.23 43
CSPDarkNet-53 (512 x 512) ✔️ 43% 64.9% - - 41
CSPDarkNet-53 (608 x 608) ✔️ 43.5% 65.7% - - 26
Architecture Mish CutMix Mosaic Label Smoothing Size AP AP50 AP75
CSPResNext50-PANet-SPP 512 x 512 42.4% 64.4% 45.9%
CSPResNext50-PANet-SPP ✔️ ✔️ ✔️ 512 x 512 42.3% 64.3% 45.7%
CSPResNext50-PANet-SPP ✔️ ✔️ ✔️ ✔️ 512 x 512 42.3% 64.2% 45.8%
CSPDarkNet53-PANet-SPP ✔️ ✔️ ✔️ 512 x 512 42.4% 64.5% 46%
CSPDarkNet53-PANet-SPP ✔️ ✔️ ✔️ ✔️ 512 x 512 43% 64.9% 46.5%

Credits to AlexeyAB, Wong Kin-Yiu and Glenn Jocher for all the help with benchmarking MS-COCO and ImageNet.

Variation of Parameter Comparison:

MNIST:

To observe how increasing the depth of a network while keeping all other parameters constant affects test accuracy, fully connected networks of varying depths, each layer having 500 neurons, were trained on MNIST. Residual connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization, along with a dropout of 25%. The networks were optimized using SGD with a batch size of 128, and, for a fair comparison, the same learning rate was maintained for each activation function. In these experiments, all three activations maintained nearly the same test accuracy for a 15-layer network. Increasing the number of layers beyond 15 resulted in a sharp decrease in test accuracy for Swish and ReLU, whereas Mish outperformed both in large networks where optimization becomes difficult.
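
A hedged sketch of the depth experiment described above is given below. The stated settings (500-unit layers, BatchNorm, 25% dropout, SGD, batch size 128) come from the text; everything else (class names, learning rate, how MNIST is loaded and trained) is an assumption for illustration, not the paper's released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mish(nn.Module):
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))

    def make_mlp(depth, act_cls, width=500, n_classes=10):
        layers, in_dim = [nn.Flatten()], 28 * 28
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.BatchNorm1d(width), act_cls(), nn.Dropout(0.25)]
            in_dim = width
        layers.append(nn.Linear(in_dim, n_classes))
        return nn.Sequential(*layers)

    # Same optimizer settings for every activation so that depth is the only variable.
    for act_cls in (nn.ReLU, nn.SiLU, Mish):          # nn.SiLU is Swish with beta = 1
        model = make_mlp(depth=20, act_cls=act_cls)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        # ... train on MNIST with batch size 128 and compare test accuracy ...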

The consistency of Mish providing better test top-1 accuracy compared to Swish and ReLU was also observed by increasing the batch size for a ResNet v2-20 on CIFAR-10 trained for 50 epochs, while keeping all other network parameters constant for a fair comparison.

Gaussian noise with varying standard deviation was added to the input for MNIST classification using a simple conv net, to observe the trend in decreasing test top-1 accuracy for Mish and compare it to that of ReLU and Swish. Mish mostly maintained a consistent lead over Swish and ReLU (lower than ReLU in just 1 instance and lower than Swish in 3 instances), as shown below. The trend for test loss was observed following the same procedure. (Mish has better loss than both Swish and ReLU except in 1 instance.)

CIFAR10:

Significance Level:

The P-values were computed for different activation functions in comparison to Mish, in terms of top-1 test accuracy of a SqueezeNet model on CIFAR-10 trained for 50 epochs, over 23 runs, using the Adam optimizer with a learning rate of 0.001 and a batch size of 128. Mish beats most of the activation functions at a high significance level across the 23 runs; in particular, it beats ReLU at a high significance of P < 0.0001. Mish also had a comparatively lower standard deviation across the 23 runs, which demonstrates its consistency of performance.
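
A hedged sketch (not the repository's stats notebook): a two-sample Welch t-test over per-run top-1 accuracies is one standard way to obtain P-values like those tabulated below. The arrays here are synthetic placeholders generated from the table's mean and standard deviation; substitute the actual 23 logged accuracies per activation.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    acc_mish = rng.normal(87.48, 0.3967, size=23)   # placeholder for 23 Mish runs
    acc_relu = rng.normal(86.66, 0.5840, size=23)   # placeholder for 23 ReLU runs

    t, p = stats.ttest_ind(acc_mish, acc_relu, equal_var=False)   # Welch's t-test
    pooled_sd = np.sqrt((acc_mish.var(ddof=1) + acc_relu.var(ddof=1)) / 2)
    cohens_d = (acc_mish.mean() - acc_relu.mean()) / pooled_sd
    print(f"P = {p:.4g}, Cohen's d = {cohens_d:.3f}")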

Activation Function Mean Accuracy Mean Loss Standard Deviation of Accuracy P-value Cohen's d Score 95% CI
Mish 87.48% 4.13% 0.3967 - - -
Swish-1 87.32% 4.22% 0.414 P = 0.1973 0.386 -0.3975 to 0.0844
E-Swish (β=1.75) 87.49% 4.156% 0.411 P = 0.9075 0.034444 -0.2261 to 0.2539
GELU 87.37% 4.339% 0.472 P = 0.4003 0.250468 -0.3682 to 0.1499
ReLU 86.66% 4.398% 0.584 P < 0.0001 1.645536 -1.1179 to -0.5247
ELU(α=1.0) 86.41% 4.211% 0.3371 P < 0.0001 2.918232 -1.2931 to -0.8556
Leaky ReLU(α=0.3) 86.85% 4.112% 0.4569 P < 0.0001 1.47632 -0.8860 to -0.3774
RReLU 86.87% 4.138% 0.4478 P < 0.0001 1.444091 -0.8623 to -0.3595
SELU 83.91% 4.831% 0.5995 P < 0.0001 7.020812 -3.8713 to -3.2670
SoftPlus(β = 1) 83.004% 5.546% 1.4015 P < 0.0001 4.345453 -4.7778 to -4.1735
HardShrink(λ = 0.5) 75.03% 7.231% 0.98345 P < 0.0001 16.601747 -12.8948 to -12.0035
Hardtanh 82.78% 5.209% 0.4491 P < 0.0001 11.093842 -4.9522 to -4.4486
LogSigmoid 81.98% 5.705% 1.6751 P < 0.0001 4.517156 -6.2221 to -4.7753
PReLU 85.66% 5.101% 2.2406 P = 0.0004 1.128135 -2.7715 to -0.8590
ReLU6 86.75% 4.355% 0.4501 P < 0.0001 1.711482 -0.9782 to -0.4740
CELU(α=1.0) 86.23% 4.243% 0.50941 P < 0.0001 2.741669 -1.5231 to -0.9804
Sigmoid 74.82% 8.127% 5.7662 P < 0.0001 3.098289 -15.0915 to -10.2337
Softshrink(λ = 0.5) 82.35% 5.4915% 0.71959 P < 0.0001 8.830541 -5.4762 to -4.7856
Tanhshrink 82.35% 5.446% 0.94508 P < 0.0001 7.083564 -5.5646 to -4.7032
Tanh 83.15% 5.161% 0.6887 P < 0.0001 7.700198 -4.6618 to -3.9938
Softsign 82.66% 5.258% 0.6697 P < 0.0001 8.761157 -5.1493 to -4.4951
Aria-2(β = 1, α=1.5) 81.31% 6.0021% 2.35475 P < 0.0001 3.655362 -7.1757 to -5.1687
Bent's Identity 85.03% 4.531% 0.60404 P < 0.0001 4.80211 -2.7576 to -2.1502
SQNL 83.44% 5.015% 0.46819 P < 0.0001 9.317237 -4.3009 to -3.7852
ELisH 87.38% 4.288% 0.47731 P = 0.4283 0.235784 -0.3643 to 0.1573
Hard ELisH 85.89% 4.431% 0.62245 P < 0.0001 3.048849 -1.9015 to -1.2811
SReLU 85.05% 4.541% 0.5826 P < 0.0001 4.883831 -2.7306 to -2.1381
ISRU (α=1.0) 86.85% 4.669% 0.1106 P < 0.0001 5.302987 -4.4855 to -3.5815
Flatten T-Swish 86.93% 4.459% 0.40047 P < 0.0001 1.378742 -0.7865 to -0.3127
SineReLU (ε = 0.001) 86.48% 4.396% 0.88062 P < 0.0001 1.461675 -1.4041 to -0.5924
Weighted Tanh (Weight = 1.7145) 80.66% 5.985% 1.19868 P < 0.0001 7.638298 -7.3502 to -6.2890
LeCun's Tanh 82.72% 5.322% 0.58256 P < 0.0001 9.551812 -5.0566 to -4.4642
Soft Clipping (α=0.5) 55.21% 18.518% 10.831994 P < 0.0001 4.210373 -36.8255 to -27.7154
ISRLU (α=1.0) 86.69% 4.231% 0.5788 P < 0.0001 1.572874 -1.0753 to -0.4856

Values are rounded, which might cause slight deviations in the statistical values reproduced from these tests.

Results:


News: Ajay Arasanipalai recently submitted a benchmark for CIFAR-10 training to the Stanford DAWN benchmark using a custom ResNet-9 + Mish, which achieved 94.05% accuracy in just 10.7 seconds over 14 epochs on the HAL Computing Cluster. This is currently the fastest training of CIFAR-10 on 4 GPUs and the 2nd fastest training of CIFAR-10 overall in the world.

Summary of Results (Vision Tasks):

Comparison is done based on the highest-priority metric: top-1 accuracy for image classification, and the loss metric for generative networks and image segmentation. Therefore, for the latter, Mish > Baseline indicates better (lower) loss, and vice versa. For embeddings, the AUC metric is considered.

Activation Function Mish > Baseline Model Mish < Baseline Model
ReLU 55 20
Swish-1 53 22
SELU 26 1
Sigmoid 24 0
TanH 24 0
HardShrink(λ = 0.5) 23 0
Tanhshrink 23 0
PReLU(Default Parameters) 23 2
Softsign 22 1
Softshrink (λ = 0.5) 22 1
Hardtanh 21 2
ELU(α=1.0) 21 7
LogSigmoid 20 4
GELU 19 3
E-Swish (β=1.75) 19 7
CELU(α=1.0) 18 5
SoftPlus(β = 1) 17 7
Leaky ReLU(α=0.3) 17 8
Aria-2(β = 1, α=1.5) 16 2
ReLU6 16 8
SQNL 13 1
Weighted TanH (Weight = 1.7145) 12 1
RReLU 12 11
ISRU (α=1.0) 11 1
Le Cun's TanH 10 2
Bent's Identity 10 5
Hard ELisH 9 1
Flatten T-Swish 9 3
Soft Clipping (α=0.5) 9 3
SineReLU (ε = 0.001) 9 4
ISRLU (α=1.0) 9 4
ELisH 7 3
SReLU 7 6
Hard Sigmoid 1 0
Thresholded ReLU(θ=1.0) 1 0

Summary of Results (Language Tasks):

Comparison is done based on the best metric score (Test accuracy) across 3 runs.

Activation Function Mish > Baseline Model Mish < Baseline Model
Penalized TanH 5 0
ELU 5 0
Sigmoid 5 0
SReLU 4 0
TanH 4 1
Swish 3 2
ReLU 2 3
Leaky ReLU 2 3
GELU 1 2

Try It!

Torch DarkNet Julia FastAI TensorFlow Keras CUDA
Source Source Source Source Source Source Source
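
A minimal PyTorch sketch of Mish for quick experimentation (note that recent PyTorch versions, 1.9 and later, also ship a built-in torch.nn.Mish):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mish(nn.Module):
        """Mish: x * tanh(softplus(x))."""
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))

    x = torch.randn(4)
    print(Mish()(x))
    print(nn.Mish()(x))   # built-in equivalent in PyTorch >= 1.9
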
Future Work: (Click to view)
  • Comparison of Convergence Rates.
  • Normalizing constant for Mish to eliminate the use of Batch Norm.
  • Regularizing effect of the first derivative of Mish with respect to Swish.
Acknowledgments: (Click to expand)

Thanks to all the people who have helped and supported me massively through this project who include:

  1. Sparsha Mishra
  2. Alexandra Deis
  3. Alexey Bochkovskiy
  4. Chien-Yao Wang
  5. Thomas Brandon
  6. Less Wright
  7. Manjunath Bhat
  8. Ajay Uppili Arasanipalai
  9. Federico Lois
  10. Javier Ideami
  11. Ioannis Anifantakis
  12. George Christopoulos
  13. Miklos Toth

And many more, including the Fast AI community, Weights and Biases community, TensorFlow Addons team, SpaCy/Thinc team, Sicara team, and the Udacity scholarships team, to name a few. Apologies if I missed anyone.

Cite this work:

@article{misra2019mish,
  title={Mish: A self regularized non-monotonic neural activation function},
  author={Misra, Diganta},
  journal={arXiv preprint arXiv:1908.08681},
  year={2019}
}
Comments
  • Equivalent, faster (?) formulation

    Hello, thanks for the great work. Using the exponential identity for tanh, you can remove two of the transcendental operations (exp, log) and get what, hopefully, should be a faster implementation.

    Since

    \tanh(x) = (e^{2x} - 1) / (e^{2x} + 1)
    

    you can express Mish as:

    y = e^x
    mish(x) = x y (y + 2) / (y^2 + 2 y + 2)
    

    or equivalently (to avoid overflow when x is large)

    y = e^-x
    mish(x) = x (1 + 2 y) / (1 + 2 y + 2 y^2) 
    

    NB: With a little tweak, there is an interesting connection to the GELU approximated with a logistic distribution ("Logistic Error Linear Unit"?) (i.e. Swish)

    x \tanh(0.5 \log( 1 + e^x) ) = x \sigma(x - \log 2)
    

    c.f. the approximation x \sigma(1.702 x) from the GELU paper.

    enhancement 
    opened by tmassingham-ont 34
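
    A quick numerical check (an editorial sketch, not part of the issue above) that the reformulation proposed there matches the reference x * tanh(softplus(x)) definition of Mish:

    import numpy as np

    x = np.linspace(-20.0, 20.0, 2001)
    ref = x * np.tanh(np.logaddexp(0.0, x))            # x * tanh(ln(1 + e^x))
    y = np.exp(-x)
    alt = x * (1 + 2 * y) / (1 + 2 * y + 2 * y * y)    # overflow-safe variant from the issue
    print(np.max(np.abs(ref - alt)))                   # ~ 0, up to floating-point error
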
  • Mish and alternatives, including my own

    First, congratulations on your Mish paper being accepted.

    I've been thinking for a long time about activation functions, how they can be improved, and which properties are most important, also inspired by your paper (I noticed the days-old update) and, e.g., the recent SharkFin (my own unpublished idea has some similarities).

    I was just thinking if I could ask you some questions, I'm not sure if this is the right place.

    It seems like almost any function will do (all except polynomials are proven to work for shallow wide networks, and that restriction is eliminated by deep narrow networks):

    Universal Approximation with Deep Narrow Networks https://arxiv.org/pdf/1905.08539.pdf

    we show that the class of neural networks of arbitrary depth, width n+m+2, and activation function ρ, is dense [..] This covers every activation function possible to use in practice, and also includes polynomial activation functions [..]

    We refer to these as enhanced neurons. [..]

    4.2. Square Model Lemma 4.3 One layer of two enhanced neurons, with square activation function, may exactly represent the multiplication function (x,y)→xy on R^2 [..]

    Remark 4.7 Lemma 4.5 is key to the proof of Proposition 4.6. It was fortunate that the reciprocal function may be approximated by a network of width two - note that even if Proposition 4.6 were already known, it would have required a network of width three. It remains unclear whether an arbitrary-depth network of width two, with square activation function, is dense in C(K). [..]

    Remark 4.8 Note that allowing a single extra neuron in each layer would remove the need for the trick with the reciprocal, as it would allow [..] Doing so would dramatically reduce the depth of the network. We are thus paying a heavy price in depth in order to reduce the width by a single neuron.

    [I've yet to read much further, but this seems very important.]

    So my reading is that anything but the identity function can work as an activation function (when there is more than one hidden layer), and a network no wider than 4 (or 5, better, optimal?) can approximate all four elementary arithmetic operations. One could also approximate e.g. sine and exponential with such a narrow network (via the Fourier theorem), I think.

    Have you looked at Capsule networks, and deep variant? https://arxiv.org/pdf/1904.09546.pdf

    My main worry is that by thinking about better activation functions (or whether there can be one best), I'm wasting my time, with them and/or (traditional) backpropagation going away, with the thousand-brains theory and more. Capsule networks seem similar, with a voting mechanism. They at least have ReLU in the first layer (I didn't look at it in more detail).

    Have you looked at BERT and variants? I assume they could use your functions, or do you know if there are exceptions, making GELU better for them? I'm thinking it's maybe just ignorance (or authors extending, want to change one thing at a time):

    https://arxiv.org/pdf/1909.11942.pdf

    The backbone of the ALBERT architecture is similar to BERT in that it uses a transformer encoder (Vaswani et al., 2017) with GELU nonlinearities

    The Reversible Residual Network: Backpropagation Without Storing Activations https://arxiv.org/pdf/1707.04585.pdf

    opened by PallHaraldsson 11
  • More comparison with existing methods?

    Just wondering if all Activation Functions have been addressed in the ReadME.

    • https://www.semanticscholar.org/paper/A-comparative-performance-analysis-of-different-in-Farzad-Mashayekhi/bcfdfe54796c501a90c3b353661a19e9c161d2c8/figure/0
    • https://www.semanticscholar.org/paper/Activation-Functions-for-Generalized-Learning-A-Villmann-Ravichandran/04d54996bcbe44b3547da889d7eab8aab3660990/figure/0
    • https://www.semanticscholar.org/paper/Comparison-of-new-activation-functions-in-neural-Gomes-Ludermir/9b37079041bdaca4248ab4f62f1a63013a50f067/figure/1
    • https://www.semanticscholar.org/paper/Searching-for-Activation-Functions-Ramachandran-Zoph/c8c4ab59ac29973a00df4e5c8df3773a3c59995a/figure/2
    question 
    opened by DonaldTsang 10
  • {NameError}name 'uses_learning_phase' is not defined

    When using in an RNN, I get {NameError}name 'uses_learning_phase' is not defined when applying a Dense layer on the output of an LSTM.

    Tensorflow 1.14

    bug help wanted Keras/ Tensorflow 
    opened by valendin 10
  • Why does mish return significantly worse result than elu?

    I've tested mish vs elu on a simple feedforward network with L2 and dropout and got the attached result.

    You can check it with Colab Notebook, press ctrl+f9.

    question 
    opened by qo4on 9
  • Need help with Mish-Metal

    I read your research paper on Mish, and from the graphs I saw, it is clearly the best activation function.

    You had two different implementations of Mish with widely different computational overheads. There were standard Mish and Mish-CUDA, where Mish-CUDA has almost zero computational cost. I'm optimizing DL4S for Metal, and looking to optimize Mish so that it can be the top-recommended function. However, I need to know how you implemented Mish-CUDA.

    Could you please explain how Mish-CUDA was faster than the standard Mish, and how I might implement it in a generic GPGPU context? I have a lot of experience with assembly-level optimization and I know the differences between Apple GPUs/Metal and Nvidia GPUs/CUDA well, if that helps with your explanation.

    help wanted question 
    opened by philipturner 7
  • Small rant on the inertia of AI research

    Hi! This is not an issue per se and can be closed.

    First of all, thank you for advancing progress in deep learning.

    I'm just a random guy who wants to implement an AGI (lol) and, like many NLP engineers, I need HIGHLY accurate neural networks for fundamental NLP tasks (e.g. POS tagging, NER, dependency parsing, coreference resolution, WSD, etc.). They are all not very accurate (often sub-95% F1 score) and their errors add up.

    Such limitations make NLP not yet suitable for many things. This is why improving the state of the art (which can be observed on paperswithcode.com) is a crucial priority for academics.

    Effectively, many researchers have smart ideas to improve the state of the art and often slightly improve it by taking a "standard neural network" for the task and mixing in their new fancy idea.

    I talk from knowledge; I've read most papers from the state-of-the-art leaderboards of most fundamental NLP tasks. Almost always they have this common baseline + one idea, theirs. The common baseline sometimes slowly evolves (e.g. now it's often a pre-trained model (say BERT) + fine-tuning + their idea).

    Sorry to say, but "this" seems absurd to me, where "this" means the fact that, by far, most researchers work in isolation, not integrating others' ideas (or only with very slow inertia). I would have wished that the state of the art in one NLP task were a combination of, e.g., 50 innovative and complementary ideas from researchers. You are researchers; do you have an idea why that is the case? If someone actually tried to merge all good complementary and compatible ideas, would they have the best, unmatchable state of the art? Why don't Facebook Research, Microsoft and Google try the low-hanging fruit: in addition to producing X new shiny ideas per month, actually try to merge them in a coherent, synergetic manner? I would like you to tell me what you think of this major issue that slows AI progress.

    As an example of such inertia, let's talk about Swish, Mish or RAdam: those things are incredibly easy to try, to see "hey, does it give my neural network free accuracy gains?" Yet hardly any paper on state-of-the-art leaderboards has tried Swish, Mish or RAdam despite them being so simple to try (you don't need to change the neural network), not even the pre-trained models on which so many papers depend (I opened issues for each of them).

    Thank you for reading.

    opened by LifeIsStrange 7
  • Computational cost of Mish vs GELU vs Swish

    What is the computational cost (CPU/GPU cycles per node) of Mish vs GELU vs Swish? If it is possible to reduce the CPU/GPU cycles of the computation through simplification, it can save both time and energy, leading to lower greenhouse gas emissions. Of course convergence is important; ReLU isn't that computationally heavy, but it is weak, and right now the second derivatives of both GELU and Swish are symmetric.

    enhancement question 
    opened by DonaldTsang 6
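
    Relating to the cost question above, a rough micro-benchmark sketch (an editorial illustration; the numbers depend entirely on hardware, tensor shape and framework version) that times forward + backward for Mish, GELU and SiLU/Swish in PyTorch:

    import time
    import torch
    import torch.nn.functional as F

    def mish(x):
        return x * torch.tanh(F.softplus(x))

    def bench(fn, x, iters=100):
        for _ in range(5):                      # warm-up
            fn(x).sum().backward()
            x.grad = None
        if x.is_cuda:
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x).sum().backward()
            x.grad = None
        if x.is_cuda:
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(16, 64, 56, 56, device=device, requires_grad=True)
    for name, fn in [("mish", mish), ("gelu", F.gelu), ("silu/swish", F.silu)]:
        print(name, f"{bench(fn, x) * 1e3:.2f} ms/iter")
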
  • PyTorch Mish - 1.5x slower training, 2.9X more memory usage vs LeakyReLU(0.1)

    Hi, thanks for this interesting new activation function. I've tested it with YOLOv3-SPP on a V100 from https://github.com/ultralytics/yolov3 and have mixed feedback. The performance improves slightly, but the training time is much slower and the GPU memory requirements are much higher vs LeakyReLU(0.1). Any suggestions on how to improve speed/memory in PyTorch? Thanks!

    From https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-554784601:

    |  | mAP@0.5 | mAP@0.5:0.95 | GPU memory | Epoch time |
    | --- | --- | --- | --- | --- |
    | LeakyReLU(0.1) | 48.9 | 29.6 | 4.0G | 31min |
    | Mish() | 50.9 | 31.2 | 11.1G | 46min |

    class Swish(nn.Module):
        def __init__(self):
            super(Swish, self).__init__()
    
        def forward(self, x):
            return x.mul_(torch.sigmoid(x))
    
    
    class Mish(nn.Module):  # https://github.com/digantamisra98/Mish
        def __init__(self):
            super().__init__()
    
        def forward(self, x):
            return x.mul_(F.softplus(x).tanh())
    
    question 
    opened by glenn-jocher 6
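
    One common way to address the memory overhead reported above is a custom autograd Function that saves only the input and recomputes the intermediate terms in backward. The sketch below is an editorial illustration in that spirit (similar in intent to the memory-efficient and CUDA variants linked in the Notes section), not the repository's own implementation:

    import torch
    import torch.nn.functional as F

    class MishFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x * torch.tanh(F.softplus(x))

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            tanh_sp = torch.tanh(F.softplus(x))
            # f'(x) = tanh(softplus(x)) + x * sigmoid(x) * sech^2(softplus(x))
            grad = tanh_sp + x * torch.sigmoid(x) * (1 - tanh_sp * tanh_sp)
            return grad_output * grad

    class MemoryEfficientMish(torch.nn.Module):
        def forward(self, x):
            return MishFunction.apply(x)
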
  • When trying to change cifar-10 to cifar-100(In cifar-10-senet-18-mish) it raises RuntimeError: CUDA error: device-side assert triggered.

    When trying to change cifar-10 to cifar-100 (in cifar-10-senet-18-mish), I try to change the code in the second cell like this:

    def get_training_dataloader(train_transform, batch_size=128, num_workers=0, shuffle=True):
        """Return the training dataloader.
        Args:
            train_transform: transforms for the train dataset
            path: path to the cifar100 training python dataset
            batch_size: dataloader batch size
            num_workers: dataloader num_workers
            shuffle: whether to shuffle
        Returns:
            train_data_loader: torch dataloader object
        """
        transform_train = train_transform
        cifar100_training = torchvision.datasets.CIFAR100(root='.', train=True, download=True, transform=transform_train)
        cifar100_training_loader = DataLoader(
            cifar100_training, shuffle=shuffle, num_workers=num_workers, batch_size=batch_size)

        return cifar100_training_loader

    # define test dataloader
    def get_testing_dataloader(test_transform, batch_size=128, num_workers=0, shuffle=True):
        """Return the test dataloader.
        Args:
            test_transform: transforms for the test dataset
            path: path to the cifar100 test python dataset
            batch_size: dataloader batch size
            num_workers: dataloader num_workers
            shuffle: whether to shuffle
        Returns:
            cifar100_test_loader: torch dataloader object
        """
        transform_test = test_transform
        cifar100_test = torchvision.datasets.CIFAR100(root='.', train=False, download=True, transform=transform_test)
        cifar100_test_loader = DataLoader(
            cifar100_test, shuffle=shuffle, num_workers=num_workers, batch_size=batch_size)

        return cifar100_test_loader

    However, it raises RuntimeError: CUDA error: device-side assert triggered. Could you help me with how to do cifar-100?

    Besides, I'd like to know how to use stats.ipynb...Thanks a lot!!!

    @digantamisra98

    bug torch 
    opened by SkeletonOne 6
  • implementing pytorch mish inplace

    Hi, do you have any tips on how to implement Mish as an inplace operation in PyTorch? Could this work? Is there a way of making the F.softplus() operation also inplace?

    class Mish_(nn.Module):
        def __init__(self):
            super().__init__()
            
        def forward(self, x):
            softplus_res = F.softplus(x)
            torch.tanh_(softplus_res)
            return x * softplus_res
    
    torch 
    opened by jpcenteno80 6
  • Correct gain value during kaiming weight initialization

    Hello! Great work on this activation function! I've been using it in some of my projects with great success.

    I want to let you know I found what the gain should be set at during kaiming weight initialization for Mish.

    I found this experimentally using this code:

    import torch
    import torch.nn.functional as F
    import pandas as pd
    import numpy as np
    
    device = 'cpu'
    
    def mish(x):
        return x * (torch.tanh(F.softplus(x)))
    
    aa = []
    bb = []
    for n in range(100):
        with torch.no_grad():
            a = torch.randn(5000, 5000, device=device)
            b = a
            x = 0.0 + 0.00001 * n
            for i in range(10):
                l = torch.nn.Linear(5000, 5000, bias=False).to(device)
                torch.nn.init.kaiming_uniform_(l.weight, a=x)
                b = mish(l(b))
            aa.append(b.std().item())
            bb.append(x)
            print(x)
            print (f"in: {a.std().item():.8f}, out: {b.std().item():.8f}")
    pd.DataFrame(data=aa, index=bb).plot(figsize=(20,8))
    

    which was talked about here. The "a" hyperparameter for init.kaiming_uniform_ is not actually the gain but the negative slope of a leaky relu, so really I experimentally found the equivalent negative slope of mish for kaiming_uniform_ init. The actual gain is found internally by math.sqrt(2.0 / (1 + a ** 2)).

    This is an example of the code output. I found through repeated experiments that 0.0003 results in the most consistently efficient throughput through the network, so it is almost the zero slope of relu but not quite. a=0 did produce okay results, as did say 0.001, but the best averaged over many runs is 0.0003. This is important because for deep networks the pytorch default value of sqrt(5) for initializing conv layers is not a good default value if using mish.

    I now use something like

    for m in self.modules():
        if isinstance(m, (nn.Conv1d, nn.Linear)):
            torch.nn.init.kaiming_uniform_(m.weight, a=0.0003)
    
    enhancement 
    opened by evanatyourservice 5