# deep-significance: Easy and Better Significance Testing for Deep Neural Networks

## Contents

- ⁉️ Why?
- 📥 Installation
- 🔖 Examples
  - Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks
  - Scenario 1: Comparing multiple runs of two models
  - Scenario 2: Comparing multiple runs across datasets
  - Scenario 3: Comparing sample-level scores
  - Scenario 4: Comparing more than two models
  - Other features
  - General recommendations & other notes
- 🎓 Cite
- 🏅 Acknowledgements
- 📚 Bibliography
### ⁉️ Why?
Although Deep Learning has undergone spectacular growth over the last decade, a large portion of experimental evidence is not supported by statistical hypothesis tests. Instead, conclusions are often drawn based on single performance scores.

This is problematic: Neural networks display highly non-convex loss surfaces (Li et al., 2018), and their performance depends on the specific hyperparameters that were chosen, as well as on stochastic factors like Dropout masks, making comparisons between architectures more difficult. Based on comparing only (the mean of) a few scores, we often cannot conclude that one model type or algorithm is better than another. This endangers progress in the field, as seeming success due to random chance might lead practitioners astray.
For instance, a recent study in Natural Language Processing by Narang et al. (2021) found that many modifications proposed for transformers do not actually improve performance. Similar issues are known to plague other fields such as Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017) as well.
To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing:
- Statistical Significance tests such as Almost Stochastic Order (Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989).
- Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936).
All functions are fully tested and compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For usage examples, consult the documentation here or the scenarios in the Examples section.
### 📥 Installation
The package can simply be installed using pip by running

```bash
pip3 install deepsig
```
Another option is to clone the repository and install the package locally:

```bash
git clone https://github.com/Kaleidophon/deep-significance.git
cd deep-significance
pip3 install -e .
```

Warning: Installed this way, imports will fail if the cloned repository is moved.
### 🔖 Examples
tl;dr: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower `eps_min`, the more confident the result.
In the following, I will lay out four scenarios that describe common use cases for ML practitioners and show how to apply the methods implemented in this package accordingly. For an introduction to statistical hypothesis testing, please refer to resources such as this blog post for a general overview or Dror et al. (2018) for an NLP-specific point of view.
In general, in statistical significance testing, we usually compare two algorithms A and B on a dataset X using some evaluation metric $\mathcal{M}$ (we assume that higher is better). The difference between the two algorithms on the data is then defined as

$$\delta(X) = \mathcal{M}(A, X) - \mathcal{M}(B, X),$$

where $\delta(X)$ is our test statistic. We then test the following null hypothesis:

$$H_0: \delta(X) \le 0$$

Thus, we assume our algorithm A to be equally as good as or worse than algorithm B, and we reject the null hypothesis if A is better than B (what we actually would like to see). Most statistical significance tests operate using p-values, which define the probability that, under the null hypothesis, the difference $\delta(X)$ expected by the test is larger than or equal to the observed difference $\delta_\text{obs}$ (that is, for a one-sided test, i.e. we assume A to be better than B):

$$p = P(\delta(X) \ge \delta_\text{obs} \mid H_0)$$

We can interpret this equation as follows: Assuming that A is not better than B, the test assumes a corresponding distribution of differences that $\delta(X)$ is drawn from. How does our actually observed difference $\delta_\text{obs}$ fit in there? This is what the p-value expresses: If this probability is high, $\delta_\text{obs}$ is in line with what we expected under the null hypothesis, so we conclude that A is not better than B. If the probability is low, that means that $\delta_\text{obs}$ is quite unlikely under the null hypothesis and that the reverse case is more likely, i.e. that the observed difference is likely larger than what we would expect under the null hypothesis, and we conclude that A is indeed better than B. Note that the p-value does not express whether the null hypothesis is true.

To decide when we trust A to be better than B, we set a threshold that determines when the p-value is small enough for us to reject the null hypothesis; this is called the significance level $\alpha$, and it is often set to 0.05.
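To make this concrete, here is a minimal sketch of the p-value workflow, using the permutation test that ships with this package (introduced in the Other features section below); the scores and the choice of significance level are made up for illustration.

```python
import numpy as np
from deepsig import permutation_test

# Simulated scores for two algorithms; higher is better
scores_a = np.random.normal(loc=0.5, size=20)
scores_b = np.random.normal(loc=0, size=20)

alpha = 0.05  # Significance level
p_value = permutation_test(scores_a, scores_b)

if p_value < alpha:
    print("Reject H0: A is likely better than B.")
else:
    print("Cannot reject H0 at this significance level.")
```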
#### Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks
Deep neural networks are highly non-linear models, and their performance is highly dependent on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide whether model A is better than model B. In fact, even aggregating more statistics like the standard deviation, minimum or maximum might not be enough to make a decision. For this reason, Dror et al. (2019) introduced Almost Stochastic Order (ASO), a test to compare two score distributions.
It builds on the concept of stochastic order: We can compare two distributions and declare one as stochastically dominant by comparing their cumulative distribution functions:
Here, the CDF of A is given in red and the CDF of B in green. If the CDF of A is lower than that of B for every $x$, we know that algorithm A scores higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). For this reason, Dror et al. (2019) consider the notion of almost stochastic dominance, quantifying the extent to which stochastic order is being violated (the red area).
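As an illustration of the underlying idea (plain NumPy, not how `aso()` actually computes its result), one can build empirical CDFs from two score samples and check where stochastic order is violated:

```python
import numpy as np

# Simulated score samples for two algorithms
rng = np.random.default_rng(0)
scores_a = rng.normal(loc=0.5, scale=1.0, size=1000)
scores_b = rng.normal(loc=0.0, scale=1.5, size=1000)

# Evaluate both empirical CDFs on a common grid
grid = np.linspace(
    min(scores_a.min(), scores_b.min()), max(scores_a.max(), scores_b.max()), 500
)
cdf_a = np.array([np.mean(scores_a <= x) for x in grid])
cdf_b = np.array([np.mean(scores_b <= x) for x in grid])

# A is stochastically dominant if its CDF lies nowhere above B's CDF
print("Stochastic order holds:", bool(np.all(cdf_a <= cdf_b)))
# Fraction of the grid on which the order is violated
print("Violation fraction:", np.mean(cdf_a > cdf_b))
```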
ASO returns a value $\epsilon_\text{min}$, which expresses the amount of violation of stochastic order. If $\epsilon_\text{min} < 0.5$, A is stochastically dominant over B in more cases than vice versa, and the corresponding algorithm can be declared superior. We can also interpret $\epsilon_\text{min}$ as a confidence score: the lower it is, the more sure we can be that A is better than B. Note: ASO does not compute p-values. Instead, the null hypothesis is formulated as

$$H_0: \epsilon_\text{min} \ge 0.5$$

If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5. Furthermore, the significance level $\alpha$ is passed as an input argument when running ASO and actively influences the resulting $\epsilon_\text{min}$.
#### Scenario 1 - Comparing multiple runs of two models
In the simplest scenario, we have retrieved a set of scores from a model A and a baseline B on a dataset, stemming from various model runs with different seeds. We can now simply apply the ASO test:
```python
import numpy as np
from deepsig import aso

# Simulate scores
N = 5  # Number of random seeds
scores_a = np.random.normal(loc=0.9, scale=0.8, size=N)
scores_b = np.random.normal(loc=0, scale=1, size=N)

min_eps = aso(scores_a, scores_b)  # min_eps = 0.0, so A is better
```
ASO does not make any assumptions about the distributions of the scores. This means that we can apply it to any kind of test metric. The more scores from model runs are supplied, the more reliable the test becomes.
#### Scenario 2 - Comparing multiple runs across datasets
When comparing models across datasets, we formulate one null hypothesis per dataset. However, we have to make sure not to fall prey to the multiple comparisons problem: In short, the more comparisons between A and B we conduct, the more likely we are to reject a null hypothesis accidentally. That is why we have to adjust our significance threshold $\alpha$ accordingly by dividing it by the number of comparisons, which corresponds to the Bonferroni correction (Bonferroni, 1936):
```python
import numpy as np
from deepsig import aso

# Simulate scores for three datasets
M = 3  # Number of datasets
N = 5  # Number of random seeds
scores_a = [np.random.normal(loc=0.3, scale=0.8, size=N) for _ in range(M)]
scores_b = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)]

# epsilon_min values with Bonferroni correction
eps_min = [aso(a, b, confidence_level=0.05 / M) for a, b in zip(scores_a, scores_b)]
# eps_min = [0.1565800030782686, 1, 0.0]
```
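The resulting list can then be turned into one accept/reject decision per dataset, for instance with the rejection threshold of 0.5 from above (a lower threshold can be chosen for more conservative decisions):

```python
# For the example eps_min values above, this yields [True, False, True]
rejected = [e < 0.5 for e in eps_min]
```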
#### Scenario 3 - Comparing sample-level scores
In the previous examples, we have assumed that we compare two algorithms A and B based on their performance per run, i.e. we run each algorithm once per random seed and obtain exactly one score on our test set. In some cases, however, we would like to compare two algorithms based on scores for every point in the test set. If we only use one seed per model, this case is equivalent to scenario 1. But what if we also want to use multiple seeds per model?
In this scenario, we can do pair-wise comparisons of the score distributions between A and B and use the Bonferroni correction accordingly:
```python
from itertools import product

import numpy as np
from deepsig import aso

# Simulate sample-level scores for multiple random seeds
M = 40  # Number of data points
N = 3  # Number of random seeds
scores_a = [np.random.normal(loc=0.3, scale=0.8, size=M) for _ in range(N)]
scores_b = [np.random.normal(loc=0, scale=1, size=M) for _ in range(N)]
pairs = list(product(scores_a, scores_b))

# epsilon_min values with Bonferroni correction
eps_min = [aso(a, b, confidence_level=0.05 / len(pairs)) for a, b in pairs]
```
#### Scenario 4 - Comparing more than two models
Similarly, when comparing multiple models (now again on a per-seed basis), we can use an approach similar to the previous example. For instance, for three models, we can create a $3 \times 3$ matrix and fill the entries with the corresponding $\epsilon_\text{min}$ values. The diagonal will naturally always be 1, but we can also restrict ourselves to only filling out one half of the matrix by making use of the following property of ASO:

$$\epsilon_\text{min}(A, B) = 1 - \epsilon_\text{min}(B, A)$$

Note: While this is an appealing shortcut, it has been observed during testing that, due to the random element of bootstrap iterations, this property might not always hold exactly; the difference between the two quantities can become noticeable when the score distributions of A and B are very similar.
The corresponding code can then look something like this:
```python
import numpy as np
from deepsig import aso

N = 5  # Number of random seeds
M = 3  # Number of different models / algorithms
num_comparisons = M * (M - 1) / 2
eps_min = np.eye(M)  # M x M matrix with ones on the diagonal

scores_a = [np.random.normal(loc=0.3, scale=0.8, size=N) for _ in range(M)]
scores_b = [np.random.normal(loc=0, scale=1, size=N) for _ in range(M)]

for i in range(M):
    for j in range(i + 1, M):
        e_min = aso(scores_a[i], scores_b[j], confidence_level=0.05 / num_comparisons)
        eps_min[i, j] = e_min
        eps_min[j, i] = 1 - e_min

# eps_min =
# [[1.         1.         0.96926677]
#  [0.         1.         0.71251641]
#  [0.03073323 0.28748359 1.        ]]
```
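To get a feeling for how well the symmetry property above holds in practice, one can simply run ASO in both directions on the same pair of score samples and compare the two quantities; the scores below are again simulated.

```python
import numpy as np
from deepsig import aso

a = np.random.normal(loc=0.2, size=5)
b = np.random.normal(size=5)

# The two values should be roughly equal, up to bootstrap noise
print(aso(a, b), 1 - aso(b, a))
```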
### ✨ Other features
#### 🚀 For the impatient: ASO with multi-threading

Waiting for all the bootstrap iterations to finish can feel tedious, especially when doing many comparisons. Therefore, ASO supports multithreading using joblib via the `num_jobs` argument.
```python
from deepsig import aso
import numpy as np
from timeit import timeit

a = np.random.normal(size=5)
b = np.random.normal(size=5)

print(timeit(lambda: aso(a, b, num_jobs=1, show_progress=False), number=5))  # 146.6909574989986
print(timeit(lambda: aso(a, b, num_jobs=4, show_progress=False), number=5))  # 50.416724971000804
```
#### 🔌 Compatibility with PyTorch, Tensorflow, Jax & Numpy
All tests implemented in this package can also take PyTorch / Tensorflow tensors and Jax or NumPy arrays as arguments:
```python
from deepsig import aso
import torch

a = torch.randn(5, 1)
b = torch.randn(5, 1)

aso(a, b)  # It just works!
```
#### 🎲 Permutation and bootstrap test

Should you be suspicious of ASO and want to revert to the good old faithful tests, this package also implements the paired bootstrap as well as the permutation-randomization test. Note that, as discussed in the next section, these tests have less statistical power than ASO. Furthermore, a function for the Bonferroni correction using p-values is also available via `from deepsig import bonferroni_correction`.
```python
import numpy as np
from deepsig import bootstrap_test, permutation_test

a = np.random.normal(loc=0.8, size=10)
b = np.random.normal(size=10)

print(permutation_test(a, b))  # 0.16183816183816183
print(bootstrap_test(a, b))  # 0.103
```
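For completeness, here is a sketch of how these p-value-based tests could be combined with the Bonferroni correction across several datasets; it assumes that `bonferroni_correction()` takes a sequence of p-values and returns the corrected values (please consult the documentation for the exact signature).

```python
import numpy as np
from deepsig import bonferroni_correction, permutation_test

M = 3  # Number of datasets
scores_a = [np.random.normal(loc=0.8, size=10) for _ in range(M)]
scores_b = [np.random.normal(size=10) for _ in range(M)]

# One p-value per dataset, then correct for multiple comparisons
# (assumed usage: a sequence of p-values in, corrected p-values out)
p_values = [permutation_test(a, b) for a, b in zip(scores_a, scores_b)]
corrected = bonferroni_correction(p_values)
print(corrected)
```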
### General recommendations & other notes
- Naturally, the CDFs built from `scores_a` and `scores_b` can only be approximations of the true distributions. Therefore, as many scores as possible should be collected, especially if the variance between runs is high. If only one run is available, comparing sample-wise score distributions as in scenario 3 can be an option, but comparing multiple runs will always be preferable.
- `num_samples` and `num_bootstrap_iterations` can be reduced to increase the speed of `aso()`. However, this is not recommended, as the result of the test will also become less accurate. Technically, $\epsilon_\text{min}$ is an upper bound that becomes tighter with the number of samples and bootstrap iterations (del Barrio et al., 2017). Thus, increasing the number of jobs with `num_jobs` instead is always preferred.
- Bootstrap and permutation-randomization are both non-parametric tests, i.e. they don't make any assumptions about the distribution of our test metric. Nevertheless, they differ in their statistical power, which is defined as the probability that the null hypothesis is rejected given that there is a difference between A and B. In other words, the more powerful a test, the less conservative it is and the more it is able to pick up on smaller differences between A and B. Therefore, if the distribution is known or can be confirmed with normality tests (e.g. Anderson-Darling or Shapiro-Wilk), a parametric test like Student's or Welch's t-test is preferable to bootstrap or permutation-randomization (see the sketch below). However, because these tests are in turn less applicable in a Deep Learning setting, for the reasons elaborated on in Why?, ASO is still a better choice.
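The parametric route mentioned in the last point could look like the following sketch, which uses SciPy (not part of this package) and assumes a recent SciPy version (>= 1.6 for the `alternative` argument):

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind

a = np.random.normal(loc=0.5, size=30)
b = np.random.normal(size=30)

# Shapiro-Wilk: small p-values indicate deviations from normality
if shapiro(a).pvalue > 0.05 and shapiro(b).pvalue > 0.05:
    # Welch's t-test (unequal variances), one-sided: is A better than B?
    print(ttest_ind(a, b, equal_var=False, alternative="greater").pvalue)
else:
    print("Normality is questionable; fall back to ASO or a non-parametric test.")
```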
### 🎓 Cite

If you use the ASO test via `aso()`, please cite the original work:
```bibtex
@inproceedings{dror2019deep,
  author    = {Rotem Dror and
               Segev Shlomov and
               Roi Reichart},
  editor    = {Anna Korhonen and
               David R. Traum and
               Llu{\'{\i}}s M{\`{a}}rquez},
  title     = {Deep Dominance - How to Properly Compare Deep Neural Models},
  booktitle = {Proceedings of the 57th Conference of the Association for Computational
               Linguistics, {ACL} 2019, Florence, Italy, July 28- August 2, 2019,
               Volume 1: Long Papers},
  pages     = {2773--2785},
  publisher = {Association for Computational Linguistics},
  year      = {2019},
  url       = {https://doi.org/10.18653/v1/p19-1266},
  doi       = {10.18653/v1/p19-1266},
  timestamp = {Tue, 28 Jan 2020 10:27:52 +0100},
}
```
When using this package in general, please cite the following:
```bibtex
@software{dennis_ulmer_2021_4638709,
  author    = {Dennis Ulmer},
  title     = {{deep-significance: Easy and Better Significance
                Testing for Deep Neural Networks}},
  month     = mar,
  year      = 2021,
  note      = {https://github.com/Kaleidophon/deep-significance},
  publisher = {Zenodo},
  version   = {v1.0.0a},
  doi       = {10.5281/zenodo.4638709},
  url       = {https://doi.org/10.5281/zenodo.4638709}
}
```
### 🏅 Acknowledgements
This package was created out of discussions of the NLPnorth group at the IT University of Copenhagen, whose members I want to thank for their feedback. The code in this repository is based in multiple places on several of Rotem Dror's repositories, namely this, this and this one. Thanks also go out to her personally for being available to answer questions and provide feedback on the implementation and documentation of this package.
The commit message template used in this project can be found here. The inline LaTeX equations were rendered using readme2latex.
### 📚 Bibliography
Del Barrio, Eustasio, Juan A. Cuesta-Albertos, and Carlos Matrán. "An optimal transportation approach for assessing almost stochastic order." The Mathematics of the Uncertain. Springer, Cham, 2018. 33-44.
Bonferroni, Carlo. "Teoria statistica delle classi e calcolo delle probabilita." Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8 (1936): 3-62.
Borji, Ali. "Negative results in computer vision: A perspective." Image and Vision Computing 69 (2018): 1-8.
Dror, Rotem, et al. "The hitchhiker’s guide to testing statistical significance in natural language processing." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
Dror, Rotem, Shlomov, Segev, and Reichart, Roi. "Deep dominance-how to properly compare deep neural models." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Efron, Bradley, and Robert J. Tibshirani. "An introduction to the bootstrap." CRC press, 1994.
Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.
Li, Hao, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018: 6391-6401.
Narang, Sharan, et al. "Do Transformer Modifications Transfer Across Implementations and Applications?." arXiv preprint arXiv:2102.11972 (2021).
Noreen, Eric W. "Computer intensive methods for hypothesis testing: An introduction." Wiley, New York (1989).