General tricks that may help you find bad, or noisy, labels in your dataset

vincent d warmerdam

Last update: Dec 26, 2022


Overview

doubtlab

A lab for bad labels.

Warning: still in progress.

This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that this repository makes it easier for folks to quickly check their own datasets before they invest too much time and compute on gridsearch.

Install

You can install the tool via pip.

python -m pip install doubtlab

Quickstart

Doubtlab allows you to define "reasons" for a row of data to deserve another look. These reasons can form a pipeline which can be used to retrieve a sorted list of examples worth checking again.

from doubtlab import DoubtLab
from doubtlab.reasons import ProbaReason, WrongPredictionReason

# Let's say we have some model already
model.fit(X, y)

# Next we can add the reasons for doubt. In this case we're saying
# that examples deserve another look if the associated proba values
# are low or if the model output doesn't match the associated label.
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model)
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtLab(**reasons)

# Get the predicates, or reasoning, behind the order
predicates = doubt.get_predicates(X, y)
# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)
# Get the (X, y) candidates worth checking again
X_check, y_check = doubt.get_candidates(X, y)
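
To make the idea of a reason pipeline more concrete, here is a minimal sketch in plain Python. It is not doubtlab's implementation: it assumes that each reason yields one doubt score per row and that rows are ranked by their total score. The names `low_proba_reason`, `wrong_pred_reason` and `rank_by_doubt` are hypothetical.

```python
def low_proba_reason(probas, max_proba=0.55):
    """Flag rows whose highest class probability is at or below max_proba."""
    return [1.0 if max(row) <= max_proba else 0.0 for row in probas]


def wrong_pred_reason(probas, y):
    """Flag rows where the most probable class differs from the given label."""
    preds = [row.index(max(row)) for row in probas]
    return [1.0 if pred != label else 0.0 for pred, label in zip(preds, y)]


def rank_by_doubt(score_columns):
    """Sum the scores per row; return indices with any doubt, most doubtful first."""
    totals = [sum(scores) for scores in zip(*score_columns)]
    flagged = [i for i, total in enumerate(totals) if total > 0]
    return sorted(flagged, key=lambda i: -totals[i])


# Toy class probabilities for four rows, with integer class labels.
probas = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.45, 0.55]]
y = [0, 1, 0, 1]
indices = rank_by_doubt([low_proba_reason(probas), wrong_pred_reason(probas, y)])
print(indices)
```

Row 1 is flagged by both reasons, so it is listed first; row 0 raises no doubt and is left out.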

Features

The library implements many "reasons" for doubt.

ProbaReason: assign doubt when a model's confidence values are low
RandomReason: assign doubt randomly, just to be sure
LongConfidenceReason: assign doubt when a wrong class gains too much confidence
ShortConfidenceReason: assign doubt when the correct class gains too little confidence
DisagreeReason: assign doubt when two models disagree on a prediction
CleanLabReason: assign doubt according to cleanlab
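
The difference between the "long" and "short" confidence reasons listed above can be sketched in plain Python. This is only an illustration of the two descriptions, assuming integer class labels that index into each row of probabilities; the function names and the threshold of 0.4 are hypothetical and are not taken from the library.

```python
def long_confidence(probas, y, threshold=0.4):
    """Doubt when any class other than the label gets more than `threshold`."""
    flags = []
    for row, label in zip(probas, y):
        wrong = [p for cls, p in enumerate(row) if cls != label]
        flags.append(1.0 if max(wrong) > threshold else 0.0)
    return flags


def short_confidence(probas, y, threshold=0.4):
    """Doubt when the labelled class gets less than `threshold`."""
    return [1.0 if row[label] < threshold else 0.0 for row, label in zip(probas, y)]


probas = [[0.7, 0.2, 0.1], [0.3, 0.35, 0.35], [0.1, 0.45, 0.45]]
y = [0, 0, 1]
long_flags = long_confidence(probas, y)
short_flags = short_confidence(probas, y)
```

In this toy data the two reasons flag different rows: the second row has a weak correct class without any strong wrong class, while the third row has a wrong class that is as confident as the label.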

Related Projects

The cleanlab project was an inspiration for this one. They have a great heuristic for bad label detection but I wanted to have a library that implements many. Be sure to check out their work on the labelerrors.com project.
My employer, Rasa, has always had a focus on data quality. Some of that attitude is bound to have seeped in here. Be sure to check out Rasa X if you're working on virtual assistants.

Comments

`QuantileDifferenceReason` and `StandardDeviationReason`
Hey! I was wondering whether it would make sense to add two more reasons for regression tasks, namely something like HighLeveragePointReason and HighStudentizedResidualReason.

Citing Wikipedia:

Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables (link)

A studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. [...] This is an important technique in the detection of outliers. (link)
opened by FBruzzesi 31
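
The two quantities quoted in the issue above can be computed by hand for a simple linear regression with one feature. The sketch below is an illustration in plain Python, not code from doubtlab; `leverage_and_studentized` is a hypothetical helper and it uses the internally studentized residual.

```python
import math


def leverage_and_studentized(x, y):
    """Return (leverages, internally studentized residuals) for y ~ a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage of point i in simple linear regression: 1/n + (x_i - mean)^2 / Sxx
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    # Residual standard error with n - 2 degrees of freedom
    sigma = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    studentized = [
        e / (sigma * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)
    ]
    return leverages, studentized


# The last point lies far away from the others in x, so its leverage is high.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 12.0]
leverages, studentized = leverage_and_studentized(x, y)
```

A reason built on these values could flag rows whose leverage or absolute studentized residual exceeds a chosen cut-off; which cut-off to use is left open here.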
Doubt Reason Based on Entropy

If a machine learning model is very "confident" then the proba scores will have low entropy. The most uncertain outcome is a uniform distribution, which has the highest entropy. Therefore, it could be sensible to add entropy as a reason for doubt.

opened by koaning 10
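
As a rough illustration of the entropy idea in the issue above, the following plain-Python sketch computes the Shannon entropy of each row of class probabilities and flags rows above a cut-off. The function names and the threshold of 0.5 nats are assumptions made for this example, not the library's API or defaults.

```python
import math


def entropy(probas_row):
    """Shannon entropy (in nats) of one row of class probabilities."""
    return -sum(p * math.log(p) for p in probas_row if p > 0)


def entropy_reason(probas, threshold=0.5):
    """Doubt rows whose predictive entropy exceeds `threshold`."""
    return [1.0 if entropy(row) > threshold else 0.0 for row in probas]


# A confident row, a uniform row and something in between.
probas = [[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3], [0.6, 0.3, 0.1]]
flags = entropy_reason(probas)
```

The uniform row reaches the maximum possible entropy for three classes, log(3), while the confident row stays close to zero.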
Add staticmethods to reasons to prevent re-compute.
I really like the current design with reasons just being function calls.

However, when working with large datasets or in use cases where you already have the predictions of a model, I wonder if you have thought about letting users pass either a sklearn model or the pre-computed probas (for those Reasons where it makes sense). For threshold-based reasons and large datasets this could save some time and compute, allow for faster iteration, and it would open up the possibility of using other models beyond sklearn.

I understand that the design wouldn't be as clean as it is right now, and might cause misalignments if users don't send the correct shapes/positions, but I wonder if you have considered this (or any other way to pass pre-computed predictions).

Just to illustrate what I mean (sorry about the dirty-pseudo code):

import numpy as np


class ProbaReason:
    def __init__(self, model=None, probas=None, max_proba=0.55):
        # At least one of `model` or `probas` has to be provided.
        if model is None and probas is None:
            raise ValueError("You should at least pass a model or probas")
        self.model = model
        self.probas = probas
        self.max_proba = max_proba

    def __call__(self, X, y=None):
        # Prefer the pre-computed probas; otherwise ask the model for them.
        probas = self.probas if self.probas is not None else self.model.predict_proba(X)
        result = probas.max(axis=1) <= self.max_proba
        return result.astype(np.float16)
opened by dvsrepo 9
"Fair" Sorting

Suppose there are 5 reasons for doubt, 4 of which overlap a lot. Then we may end up in a situation where we ignore a reason. That could be bad ... maybe it's worth exploring voting systems a bit to figure out alternative sorting methods.

opened by koaning 7
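
One possible alternative to sorting by total score, sketched here only to illustrate the concern in the issue above, is a round-robin over the reasons so that each reason gets a turn at nominating a row. This is a hypothetical approach in plain Python, not something the library is stated to implement.

```python
from itertools import zip_longest


def round_robin_order(flags_by_reason):
    """Interleave the rows flagged by each reason so every reason gets a turn."""
    queues = [
        [i for i, flag in enumerate(flags) if flag]
        for flags in flags_by_reason.values()
    ]
    order, seen = [], set()
    for batch in zip_longest(*queues):
        for idx in batch:
            # Skip padding from shorter queues and rows that were already picked.
            if idx is not None and idx not in seen:
                seen.add(idx)
                order.append(idx)
    return order


# Four heavily overlapping reasons and one reason that only flags row 4.
flags_by_reason = {
    "a": [1, 1, 1, 0, 0],
    "b": [1, 1, 1, 0, 0],
    "c": [1, 1, 1, 0, 0],
    "d": [1, 1, 0, 0, 0],
    "e": [0, 0, 0, 0, 1],
}
order = round_robin_order(flags_by_reason)
```

Sorting by the sum of flags would place row 4 last, behind every row the overlapping reasons agree on; the round-robin moves it to second place.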
Add example to docs that shows lambda X, y: y.isna()

Hey! First of all: this is a very cool project ;) I have been thinking about potential new "reasons" to doubt and I personally often look into predictions generated by a model whenever the data instance had missing values (and part of the model-pipeline imputes them)... So I wonder if it would be useful to have a FillNaNReason (or something similar) based, for example, on the MissingIndicator transformer.

opened by juanitorduz 4
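
The issue title above suggests that a plain callable taking `(X, y)` can act as a reason. Under that assumption, a missing-value check could look like the sketch below, written with plain Python lists instead of pandas so it runs on its own; all function names are hypothetical.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_value_reason(X, y=None):
    """Doubt every row of X that contains at least one missing feature value."""
    return [1.0 if any(is_missing(v) for v in row) else 0.0 for row in X]


def missing_label_reason(X, y):
    """Plain-Python analogue of `lambda X, y: y.isna()` for a list of labels."""
    return [1.0 if is_missing(label) else 0.0 for label in y]


X = [[1.0, 2.0], [float("nan"), 0.5], [3.0, None]]
y = ["a", None, "b"]
feature_flags = missing_value_reason(X)
label_flags = missing_label_reason(X, y)
```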
added conda-install-option and badges to readme
This closes #14: doubtlab can now be installed with conda from conda-forge channel.

[x] Created conda-forge/doubtlab-feedstock to make doubtlab available on conda-forge channel.

[x] Added conda install option to readme.

[x] Added the following badges to readme.
opened by sugatoray 4
Added a LICENSE
Hi @koaning,

I am assuming MIT License is okay for this repository. If you think otherwise, please feel free to make changes in the PR accordingly.

[x] Added an MIT License

[x] ~~Added a Citation file~~ Removed the citation file and updated the name of the PR. - ~~If you have an orcid, please consider adding it to the citation.cff file.~~
opened by sugatoray 4
Add a conda installation option using conda-forge channel

I have already started this one. Will push a PR once the conda installation option is available.

See: Adding doubtlab from PyPI to conda-forge channel.

@koaning As the primary maintainer of this repo, would you like to be listed as one of the maintainers of doubtlab on conda-forge channel? Please let me know, I will add your name as another maintainer of conda-forge/doubtlab-feedstock, once it is accepted.

opened by sugatoray 3
Doubt about MarginConfidenceReason :-)

Hi Vincent,

Nice library! As mentioned a while ago on Twitter I'm doing a review to understand and compare different approaches to find label errors.

I'm playing with the AG News dataset, which we know contains a lot of errors from our own previous experiments with Rubrix (using the training loss and using cleanlab).

While playing with the different reasons, I'm having difficulty understanding the reasoning behind the MarginConfidenceReason. As far as I can tell, if the model is doubting, the margin between the top two predicted labels should be small, and that could point to an ambiguous example and/or a label error. If I read the code and description correctly, MarginConfidenceReason is doing the opposite, so I'd love to know the reasoning behind this to make sure I'm not missing something.

For context, using the MarginConfidenceReason with the AG News training set yields almost the entire dataset (117788 examples for the default threshold of 0.2, and 112995 for threshold=0.5). I guess this could become useful when there's overlap with other reasons, but I want to make sure about the reasoning :-).

opened by dvsrepo 2
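
The reading described in the issue above, where a small gap between the two most probable classes signals doubt, can be sketched as follows. This is an illustration in plain Python with hypothetical function names, not the library's implementation; the threshold of 0.2 is the value mentioned in the issue.

```python
def top_two_margin(row):
    """Difference between the largest and second-largest class probability."""
    first, second = sorted(row, reverse=True)[:2]
    return first - second


def small_margin_reason(probas, threshold=0.2):
    """Doubt rows where the top two classes are within `threshold` of each other."""
    return [1.0 if top_two_margin(row) <= threshold else 0.0 for row in probas]


probas = [[0.9, 0.05, 0.05], [0.45, 0.40, 0.15], [0.6, 0.3, 0.1]]
flags = small_margin_reason(probas)
```

Only the second row is flagged, because its two leading classes are nearly tied.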
updated docs: installation and badges
Updated docs:

[x] updated installation (with conda)

[x] ~~added badges from readme~~

@koaning I am not sure if you would prefer to include the badges in the docs (website). If you don't, please feel free to remove them.

UPDATE: removed badges from the docs (docs/index.md).
opened by sugatoray 2
Issue with cleanlab upgrading to v2

Issue

Environment details

Temporary fix

pip install "cleanlab==1.0.0"

More permanent fix

Pin the cleanlab dependency to "cleanlab<2.0.0"

More more permanent fix

They've made some changes to their API

Let me know if you'd like me to make a PR

Thanks for a great package @koaning 😄

opened by duarteocarmo 1
Consider a fairlearn demo.

When two models disagree something interesting might be happening. But that'll only happen if you have two models that are actually different.

What if you have one model that's better at accuracy and another one that's better at fairness?

Maybe these labels deserve more attention too.

opened by koaning 0
Assign Doubt for Dissimilarity from Labelled Set

Suppose that y can contain NaN values if they aren't labeled. In that case, we may want to favor a subset of these NaN values. In particular: if they differ substantially from the already labeled datapoints.

The idea here is that we may be able to sample more diverse datapoints.

opened by koaning 10
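
A minimal sketch of the dissimilarity idea in the issue above: give every unlabelled row a score equal to its distance to the nearest labelled row, so that rows far from anything already labelled come first. The example is plain Python, uses `None` rather than NaN to mark missing labels, assumes at least one labelled row exists, and the name `dissimilarity_scores` is hypothetical.

```python
import math


def dissimilarity_scores(X, y):
    """For unlabelled rows, return the distance to the nearest labelled row; else 0."""
    labelled = [row for row, label in zip(X, y) if label is not None]
    scores = []
    for row, label in zip(X, y):
        if label is not None:
            scores.append(0.0)
        else:
            # Euclidean distance to the closest row that already has a label.
            scores.append(min(math.dist(row, other) for other in labelled))
    return scores


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y = ["a", "b", None, None]
scores = dissimilarity_scores(X, y)
```

The last row is far from both labelled points and therefore gets the highest score, which would make it the first candidate for labelling.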
Does it make sense to add an ensemble for spaCy?

This seems to be a like-able method of dealing with text outside the realm of scikit-learn. But I prefer to delay this feature until I really understand the use-case. For anything related to entities we cannot use sklearn, but tags/classes should work fine as-is.

opened by koaning 1

Releases (0.2.4)

0.2.4 (Nov 25, 2022)

Support added for false-positive, false-negative reason via WrongPredictionReason.

0.2.3 (Apr 22, 2022)

Supports CleanLab v2.

0.2.2 (Apr 14, 2022)

Sped up confidence-based reasons by vectorization. Added more priority on proba based methods on docs.

0.1.5 (Dec 21, 2021)

Added a StandardizedErrorReason https://github.com/koaning/doubtlab/pull/29. Thanks @FBruzzesi!

0.1.4 (Dec 15, 2021)

Added a reason based on entropy and from_proba-style methods.

0.1.3 (Dec 7, 2021)

Fixed a bug related to the margin reason. Details found here: https://github.com/koaning/doubtlab/pull/22

0.1.2 (Nov 23, 2021)

Added the MarginConfidenceReason.

0.1.0 (Nov 9, 2021)

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].

