Metric learning algorithms in Python

Overview


metric-learn: Metric Learning in Python

metric-learn contains efficient Python implementations of several popular supervised and weakly-supervised metric learning algorithms. As part of scikit-learn-contrib, the API of metric-learn is compatible with scikit-learn, the leading library for machine learning in Python. This allows you to use all the scikit-learn routines (for pipelining, model selection, etc.) with metric learning algorithms through a unified interface.
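A minimal sketch of this interoperability: a metric learner used as a transformer inside a standard scikit-learn pipeline and scored with cross-validation (dataset and hyperparameters here are illustrative, not a recommendation).

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    from metric_learn import NCA

    X, y = load_iris(return_X_y=True)

    # NCA learns a linear transformation; k-NN then classifies in that space
    pipe = make_pipeline(NCA(max_iter=100), KNeighborsClassifier(n_neighbors=3))
    print(cross_val_score(pipe, X, y, cv=5).mean())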

Algorithms

  • Large Margin Nearest Neighbor (LMNN)
  • Information Theoretic Metric Learning (ITML)
  • Sparse Determinant Metric Learning (SDML)
  • Least Squares Metric Learning (LSML)
  • Sparse Compositional Metric Learning (SCML)
  • Neighborhood Components Analysis (NCA)
  • Local Fisher Discriminant Analysis (LFDA)
  • Relative Components Analysis (RCA)
  • Metric Learning for Kernel Regression (MLKR)
  • Mahalanobis Metric for Clustering (MMC)

Dependencies

  • Python 3.6+ (the last version supporting Python 2 and Python 3.5 was v0.5.0)
  • numpy, scipy, scikit-learn>=0.20.3

Optional dependencies

  • For SDML, installing skggm allows the algorithm to solve problematic cases (install from commit a0ed406). Run pip install 'git+https://github.com/skggm/skggm.git@a0ed406586c4364ea3297a658f415e13b5cbdaf8' to install the required version of skggm from GitHub.
  • For running the examples only: matplotlib

Installation/Setup

  • If you use Anaconda: conda install -c conda-forge metric-learn.
  • To install from PyPI: pip install metric-learn.
  • For a manual install of the latest code, download the source repository and run python setup.py install. You may then run pytest test to run all tests (you will need to have the pytest package installed).

Usage

See the Sphinx documentation for full details on installation, the API, usage, and examples.
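For a quick start, here is a hedged sketch of a typical supervised workflow (parameter names follow the v0.6 API; check the documentation for your installed version):

    from sklearn.datasets import load_iris

    from metric_learn import LMNN

    X, y = load_iris(return_X_y=True)

    lmnn = LMNN(k=5, learn_rate=1e-6)    # k target neighbors per point
    lmnn.fit(X, y)

    X_embedded = lmnn.transform(X)       # points mapped to the learned space
    M = lmnn.get_mahalanobis_matrix()    # the learned Mahalanobis matrix
    dist = lmnn.get_metric()             # a callable usable by scikit-learn
    print(dist(X[0], X[1]))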

Citation

If you use metric-learn in a scientific publication, we would appreciate citations to the following paper:

metric-learn: Metric Learning Algorithms in Python, de Vazelhes et al., Journal of Machine Learning Research, 21(138):1-6, 2020.

Bibtex entry:

@article{metric-learn,
  title = {metric-learn: {M}etric {L}earning {A}lgorithms in {P}ython},
  author = {{de Vazelhes}, William and {Carey}, CJ and {Tang}, Yuan and
            {Vauquier}, Nathalie and {Bellet}, Aur{\'e}lien},
  journal = {Journal of Machine Learning Research},
  year = {2020},
  volume = {21},
  number = {138},
  pages = {1--6}
}
Comments
  • [MRG] Address comments for sklearn-contrib integration

    [MRG] Address comments for sklearn-contrib integration

    Hi, we've made a request for inclusion in scikit-learn-contrib; this PR intends to address the comments from the issue: https://github.com/scikit-learn-contrib/scikit-learn-contrib/issues/40

    TODO:

    • [x] Fix flake8 errors (there remain some errors due to unused imports in metric_learn/__init__.py, but I guess these are needed, right? Also, inverse_covariance.quic is imported but unused, but this is expected since it's only imported to define the variable HAS_SKGGM; I don't know if there's another way to bypass this). Here is the flake8 log after the fixes:
    ./test/metric_learn_test.py:17:3: F401 'inverse_covariance.quic' imported but unused
    ./metric_learn/__init__.py:3:1: F401 '.constraints.Constraints' imported but unused
    ./metric_learn/__init__.py:4:1: F401 '.covariance.Covariance' imported but unused
    ./metric_learn/__init__.py:5:1: F401 '.itml.ITML' imported but unused
    ./metric_learn/__init__.py:5:1: F401 '.itml.ITML_Supervised' imported but unused
    ./metric_learn/__init__.py:6:1: F401 '.lmnn.LMNN' imported but unused
    ./metric_learn/__init__.py:7:1: F401 '.lsml.LSML' imported but unused
    ./metric_learn/__init__.py:7:1: F401 '.lsml.LSML_Supervised' imported but unused
    ./metric_learn/__init__.py:8:1: F401 '.sdml.SDML_Supervised' imported but unused
    ./metric_learn/__init__.py:8:1: F401 '.sdml.SDML' imported but unused
    ./metric_learn/__init__.py:9:1: F401 '.nca.NCA' imported but unused
    ./metric_learn/__init__.py:10:1: F401 '.lfda.LFDA' imported but unused
    ./metric_learn/__init__.py:11:1: F401 '.rca.RCA' imported but unused
    ./metric_learn/__init__.py:11:1: F401 '.rca.RCA_Supervised' imported but unused
    ./metric_learn/__init__.py:12:1: F401 '.mlkr.MLKR' imported but unused
    ./metric_learn/__init__.py:13:1: F401 '.mmc.MMC_Supervised' imported but unused
    ./metric_learn/__init__.py:13:1: F401 '.mmc.MMC' imported but unused
    ./metric_learn/__init__.py:15:1: F401 '._version.__version__' imported but unused
    

    Note that I ignored some errors: E111 (indentation is not a multiple of four) and E114 (indentation is not a multiple of four (comment))

    • [x] Put Python 3.7 in the CI tests
    opened by wdevazelhes 36
  • [MRG] Refactor the metric() method

    [MRG] Refactor the metric() method

    Fixes #147

    TODO:

    • [x] Add some tests
    • [x] Add references to the right parts of documentation (like Mahalanobis Distances) in the docstrings (if possible)
    • [x] Emphasize a bit more the difference and links between this and score_pairs in the docstring
    • [x] Be careful that it should work on 1D arrays
    • [x] Be careful that it should not return a float if given 2D arrays
    • [x] Remove useless np.atleast2d (those in transformer_from_metric and those just before returning the transformer_)
    opened by wdevazelhes 33
  • [MRG] Enhance documentation

    [MRG] Enhance documentation

    This PR enhances the documentation by fixing reported issues with the docs and adding other improvements.

    Fixes #155 Fixes #149 Fixes #150 Fixes #135

    TODO:

    • [x] Add link to docstring in documentation's titles of algos (fixes #155)
    • [x] Fix #149
    • [x] Fix #150
    • [x] Put the description of algorithms that have a supervised version in the non-supervised version (and not at the top of the page). This way the user doesn't have to scroll up every time after following a link. We could also make separate pages for algorithms and their supervised versions
    • [x] Check that links still work (e.g. the link MMC points to in MMC's docstring; I had a problem with that, and MahalanobisMixin was not working either)
    • [x] Check that no num_dims remains (they should all be changed into n_components; I saw one in MLKR, for instance)
    • [x] Solve the TODOs inside the .rst files
    • [x] Put a list of the methods in the docstring page (automatically), like in scikit-learn. This will make it easier to find methods without having to scroll down every time
    • Ensure that the doctest work -> postponed, already opened in #156
    • [x] some arguments are not documented, ensure that they are all documented
    • Maybe add metric_learn.constraints.positive_negative_pairs docstring -> postponed, discussed in #227
    • [x] Address https://github.com/metric-learn/metric-learn/pull/208#pullrequestreview-248843755
    opened by wdevazelhes 28
  • [MRG] Create new Mahalanobis mixin

    [MRG] Create new Mahalanobis mixin

    This PR creates a new Mahalanobis mixin (cf. issue https://github.com/metric-learn/metric-learn/issues/91), which is a common interface for all algorithms that learn a Mahalanobis-type (pseudo) distance, of the form (x - x')^T M (x - x') (right now all algorithms are of this form, but there might be others in the future).

    This interface will enforce that an attribute metric_ exists, add documentation for it in the docstrings of child classes, and allow us to factorize the computation of embed functions (similar to what is done now with transform) and the score_pairs function (these functions will come in later PRs, so right now this mixin seems a bit artificial, but that is temporary).

    I also used the opportunity of this PR to improve the way the metric_ attribute is returned, checking that the matrix indeed exists (i.e. it has been explicitly initialized or the estimator has been fitted) and raising a warning otherwise. Don't hesitate to comment on this last part, or to tell me if it should belong in a separate PR.
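    To make the interface concrete, here is a small illustrative sketch (not the mixin code itself) of the (pseudo) distance it factorizes:

      import numpy as np

      def mahalanobis_distance(x, x_prime, M):
          """Pseudo-distance sqrt((x - x')^T M (x - x')) induced by a
          symmetric PSD matrix M (what the metric_ attribute will hold)."""
          diff = np.asarray(x) - np.asarray(x_prime)
          return np.sqrt(diff.dot(M).dot(diff))

      # with M = I this reduces to the plain Euclidean distance
      print(mahalanobis_distance([0., 0.], [3., 4.], np.eye(2)))  # 5.0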

    TODO:

    • [x] Create the class
    • [x] Make current algorithms inherit from it
    • [x] Use this opportunity to improve the metric_ property
    • [x] Maybe add some more tests
    • [X] Fix docstrings to render nicely (as if metric_ was a regular Attribute of the class) done by copying, see https://github.com/metric-learn/metric-learn/pull/96#issuecomment-415036218
    • [x] Check arrays at predict, embed, etc., to only allow arrays of the right shape => EDIT: full checking will be done in the "preprocessor" PR

    EDIT: Initially we were also thinking of adding an ExplicitMixin for metric learning algorithms that have a way to embed points in a space where the metric is the learned one. Since all algorithms are of this form for now, we will not implement it, but rather implement all the functions in MahalanobisMixin (see https://github.com/metric-learn/metric-learn/pull/95#issuecomment-394689505)

    • [ ] Add embed function => EDIT: for now we will let only a transform function (see https://github.com/metric-learn/metric-learn/pull/96#issuecomment-407118297)
    • [x] Add score_pairs function
    opened by wdevazelhes 26
  • [MRG] Add preprocessor option

    [MRG] Add preprocessor option

    This PR adds a preprocessor argument to the initialization of weakly supervised algorithms: an option that allows them to accept two types of input, either 3D arrays of tuples of points as before, or 2D arrays of indices/identifiers of points. In the latter case, the preprocessor given as input to the metric learner allows it to retrieve points from identifiers. The preprocessor is basically a callable that is called on a list of identifiers (or just on one identifier? we'll have to decide) and returns the points associated with these identifiers. If instead of a callable the user provides an array X, it will automatically create the "indexing" preprocessor (i.e. a preprocessor such that preprocessor(id) returns X[id]). We could also imagine other shortcuts like this (for instance, the user could provide a string naming a root folder containing images, creating a preprocessor such that preprocessor("img2.png") returns the vector associated with the image located at "rootfolder/img2.png"). A sketch of the two formats follows below.
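    A hedged sketch of the two input formats described above (argument name as proposed in this PR; data is illustrative):

      import numpy as np

      from metric_learn import MMC

      rng = np.random.RandomState(0)
      X = rng.randn(20, 3)                                    # the actual points

      pairs_idx = np.array([[0, 1], [2, 3], [4, 5], [6, 7]])  # 2D array of indices
      y_pairs = np.array([1, 1, -1, -1])                      # +1: similar, -1: dissimilar

      # passing X creates the "indexing" preprocessor: preprocessor(id) -> X[id]
      mmc = MMC(preprocessor=X)
      mmc.fit(pairs_idx, y_pairs)

      # equivalently, a 3D array of the points themselves, no preprocessor needed
      pairs = X[pairs_idx]                                    # shape (4, 2, 3)
      MMC().fit(pairs, y_pairs)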

    Note: This PR branched from MahalanobisMixin PR, so as soon as MahalanobisMixin is merged the diff should become more readable.

    TODO:

    • [x] Add comments and more info to this PR
    • [x] Add tests: in progress: add tests to check that the output format of check_input is as specified
    • [ ] Add documentation -> probably in another PR
    • [x] Make it simpler with unified check_input function
    • [x] Refactor the check_input function and its tests to be cleaner
    • [x] Write docstrings
    • [x] Refactor tests to have the list of metric learners at one place
    • [x] Fix the linalg bug
    opened by wdevazelhes 25
  • Push a new release to PyPI

    Push a new release to PyPI

    There are a lot of good changes since v0.3, so I think we're almost ready to release v0.4.

    Once most (hopefully all) of the remaining items on the milestone are finished, we should tag a commit on Github and push the new builds to PyPI.

    @terrytangyuan: I'll update this issue when we're good to go.

    opened by perimosocordiae 25
  • [MRG+1] Threshold for pairs learners

    [MRG+1] Threshold for pairs learners

    This PR fits a threshold for tuples learners to allow predict (and scoring) on pairs.

    Fixes #131 Fixes #165

    Finally, we decided that it would be good to have at least a minimal implementation of threshold calibration inside metric-learn, so that _PairsClassifiers can have a threshold, and hence a predict, directly out of the box, without the need for a MetaEstimator. A MetaEstimator could however be used outside the algorithm for more precise threshold calibration (with cross-validation).

    The following features should be implemented:

    • [x] We should have two methods for _PairsClassifiers: set_threshold() and calibrate_threshold(validation_set, method='max_tpr', args={'min_tnr': 0.1}). set_threshold will set the threshold to a hard value, and calibrate_threshold will take a validation set and a method ('accuracy', 'f1', ...) and find the threshold that optimizes the given metric on the validation set. We went with the same syntax as scikit-learn's PR (see the sketch after this list).

    • [ ] At fit time, we should either have a simple rule to set a threshold (for instance the median of distances, or the mean between the positive pairs' mean distance and the negative pairs' mean distance), or we should call calibrate_threshold(trainset), and in that case also raise a warning at the end saying that the threshold has been fitted on the trainset, so we should check scikit-learn's calibration utilities for a calibration less prone to overfitting. Also in this case we could allow arguments in fit like fit(pairs, y, threshold_method='max_tpr', threshold_args={'min_tnr': 0.1}).

    • [x] The following scores should be implemented:

      • 'accuracy'
      • 'f_beta'
      • 'max_tpr'
      • 'max_tnr'

    See scikit-learn's calibration PR (https://github.com/scikit-learn/scikit-learn/pull/10117) for more details, and its rendered documentation here: https://35753-843222-gh.circle-artifacts.com/0/doc/modules/calibration.html
    • [x] For some estimators for which a natural threshold exists (like ITML: the mean between the lower threshold and the higher threshold), we should use this threshold

    • [x] Decide what to do by default: rule-of-thumb scoring or calibration on the trainset?
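    A hedged sketch of the resulting calibration API on a fitted pairs learner (data is illustrative; method names as eventually merged, set_threshold and calibrate_threshold):

      import numpy as np

      from metric_learn import ITML

      rng = np.random.RandomState(0)
      X = rng.randn(20, 3)
      pairs = np.stack([X[::2], X[1::2]], axis=1)   # shape (10, 2, 3)
      y = np.array([1, -1] * 5)                     # +1: similar, -1: dissimilar

      itml = ITML().fit(pairs, y)

      itml.set_threshold(0.5)                       # hard threshold
      # calibrate on a validation set (here the training pairs, for brevity)
      itml.calibrate_threshold(pairs, y, strategy='f_beta', beta=1.)
      print(itml.predict(pairs))                    # +1 / -1 per pair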

    Questions:

    • Should we do the same thing for QuadrupletsLearners? (The natural threshold there is 0, so I don't think it would make as much sense, and users would maybe rather use the future meta-estimator from scikit-learn's PR https://github.com/scikit-learn/scikit-learn/pull/10117.) But since we will already have it implemented for pairs, and maybe for coherence, we could also have it for quadruplets learners

    TODO:

    • [x] Implement tests that check that we can use custom scores in cross val (cf #165). This needs to have a predict hence is related to this PR.
    • [x] Implement API tests (that the behaviour is as expected)
    • [x] Implement numerical tests (on examples where we know the f_1 score, etc.)
    • [x] Implement the actual method (cf. features above)
    • [x] Add this in the doc: also talk in the doc about the CalibratedClassifierCV
    • [x] Think about which scorings make sense with quadruplets and what impact this has on the code
    • [x] Use/Adapt CalibratedClassifierCV or an equivalent for quadruplets
    • [x] Maybe test that CalibratedClassifierCV returns coherent values (e.g. that pairs with predict_proba = 0.8 indeed succeed 80% of the time). CalibratedClassifier's behaviour should be tested in scikit-learn; in metric-learn we should just test the API (but on a small draft example I tested it for ITML and it worked more or less)
    • [ ] Add a case (and a test case) where the best accuracy is achieved by rejecting all points (so threshold = best score + 1). See if this applies to other strategies too
    opened by wdevazelhes 24
  • [MRG] FIX: make proposal for sdml formulation

    [MRG] FIX: make proposal for sdml formulation

    Digging into SDML's code and paper, I don't understand some parts of the implementation. I think it should be as proposed in this PR; tell me if I'm wrong.

    Looking at this paper about SDML (https://icml.cc/Conferences/2009/papers/46.pdf) and this paper about Graphical Lasso (http://statweb.stanford.edu/~tibs/ftp/graph.pdf), it seems that in SDML we want to optimize equation 8, which can indeed be done with Graphical Lasso according to the Graphical Lasso paper. In fact, equation 8 in SDML's paper is the same as equation 1 in the Graphical Lasso paper (up to a minus sign). The following variables are equivalent:

    | SDML's paper | Graphical Lasso's paper |
    |--------------|-------------------------|
    | M            | theta                   |
    | P            | S                       |

    where, in SDML's paper, P = M_0^-1 + eta * X.L.X^T.

    But note that in SDML's paper, M_0^-1 is the inverse of the a priori Mahalanobis matrix M_0, which can indeed be initialized to the inverse of the covariance matrix. In that case, M_0^-1 is the inverse of the inverse of the covariance matrix, hence the covariance matrix itself.

    So we should just compute P = empirical covariance matrix + self.balance_param * loss_matrix and run Graphical Lasso on this (and not: invert the covariance matrix, set P = this inverse + self.balance_param * loss_matrix, invert the result, and run Graphical Lasso on that, as is done currently).
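    A hedged numpy/scikit-learn sketch of the proposed computation (illustrative, not the actual metric-learn code; skggm's quic plays the role of graphical_lasso when P is not SPD):

      import numpy as np
      from sklearn.covariance import graphical_lasso

      def sdml_proposal(X, loss_matrix, balance_param, alpha):
          # P = empirical covariance + balance_param * loss matrix
          # (no inversions, as proposed above)
          P = np.cov(X, rowvar=False) + balance_param * loss_matrix
          # graphical lasso returns a sparse estimate of the precision
          # matrix, i.e. the learned Mahalanobis matrix M (theta in the
          # Graphical Lasso paper); it may fail if P is not SPD
          _, M = graphical_lasso(P, alpha=alpha)
          return M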

    And in both cases, we want to evaluate the sparse inverse of M/theta, so things are OK for that

    Also, I didn't get the hack to ensure positive semidefiniteness; doesn't it change the result?

    This PR's modification does not fix the issues we had (e.g. plot_sandwich.py does not give better results with it), so maybe let's not merge it until we have a complete fix for SDML.

    There are other things that could explain why SDML doesn't work, like the choice of the optimization parameter alpha, and the fact that graph_lasso itself sometimes seems to work badly (see https://github.com/scikit-learn/scikit-learn/issues/6887 and https://github.com/scikit-learn/scikit-learn/issues/11417).

    So first tell me whether you agree with this modification (not about merging it or not, just whether it's the right formula), so that we can then look elsewhere to figure out how to fix SDML's error.

    TODO:

    • [x] Put a message just before merge, in the release draft, to announce that skggm is needed for SDML
    • [x] replace sklearn's pinvh (deprecated) by scipy's pinvh
    • [x] deal with the 1D input case for sdml
    • [x] Add a small test on the non SPD case that skggm can solve
    • [x] Make travis indeed install skggm if the version allows it
    opened by wdevazelhes 24
  • Fix covariance initialization when matrix is not invertible

    Fix covariance initialization when matrix is not invertible

    This PR fixes #276, an issue that arises when covariance initialization is used with an algorithm that doesn't require a PD matrix. The computed matrix can be singular and consequently not invertible, leading to unhelpful warnings. This is fixed by using a pseudo-inverse when a PD matrix is not required.

    opened by grudloff 23
  • [MRG] Export notebook to gallery

    [MRG] Export notebook to gallery

    Fixes #141 #153

    Hi, I've just converted @bhargavvader's notebook from #27 into a sphinx-gallery file (with this snippet: https://gist.github.com/chsasank/7218ca16f8d022e02a9c0deb94a310fe). This way it will appear nicely in the documentation, and it also lets us check that every algorithm works fine. There are a few things to change to make the PR mergeable (to compile the doc, you need sphinx-gallery):

    • [x] As discussed with @bellet, the iris dataset is maybe not the most expressive dataset for metric learning; we might want to find a dataset where classes are even more mixed and where metric learning gives a clearly advantageous separation
    • [x] Some parts seem to be broken (see the logo "broken"), I need to see why
    • [x] On my computer, the outline of the notebook appears in the left sidebar; we might not want that (we might want to see only two tabs on the left sidebar, because we have two examples in metric-learn/examples, and not tens of tabs)
    • [x] Some examples seem not to work super well in terms of separation, I need to see why
    opened by wdevazelhes 21
  • [MRG+2] Update repo to work with both new and old scikit-learn

    [MRG+2] Update repo to work with both new and old scikit-learn

    Fixes #311

    • I added the workaround suggested by @bellet here: https://github.com/scikit-learn-contrib/metric-learn/issues/311#issuecomment-804229956, so that imports work both in sklearn <0.24 and >= 0.24 (EDIT: actually the old import strategy was already deprecated in 0.22, so I compared the version to 0.22)
    • as well as an additional travis job to test it (I chose the one with python 3.6 since it's the oldest python, so I thought that if something goes wrong it might be on this one, and also with skggm, since I'm also thinking something has more chance of going wrong when using another package... I could have added all the combinations (python 3.6 + 3.7, with/without skggm), but then the travis test suite might take quite some time)
    opened by wdevazelhes 18
  • Adjustment of the validation of the number of target neighbors

    Adjustment of the validation of the number of target neighbors

    Before the actual optimization process, it is checked whether the parameters are valid. In lines 175-177 it is checked whether the chosen k is valid in the context of the training data. According to the definition of LMNN by Weinberger et al., each class must have at least k+1 elements, so that there are at least k target neighbors for each data point. In the implementation, however, it is only checked whether self.n_neighbors <= required_k (in fact the code checks the opposite in order to throw an error), where required_k is the number of elements of the smallest class. This check accepts a choice of k for a class that has exactly k elements, which shouldn't be the case, and it leads to a point of such a small class being selected as its own target neighbor.

    For the determination of the target neighbors, a distance matrix of all points within the class is computed, and to prevent a point from being recognized as its own nearest neighbor, the diagonal of this matrix is set to infinity. If a class has only k elements, all elements of the class are chosen as target neighbors, including the current point itself (even though it has a distance of infinity to itself according to the distance matrix). As a result, each point of such a class effectively has one target neighbor fewer than points in classes with more training data, which can have unintended influences on the final transformation depending on the dataset used.

    To prevent this, it is sufficient to adjust the validation so that self.n_neighbors < required_k must hold.
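    A minimal sketch of the adjusted check (names are illustrative, not the actual metric-learn internals):

      import numpy as np

      def check_n_neighbors(y, k):
          # each class needs at least k + 1 members so that every point
          # has k target neighbors besides itself
          required_k = np.unique(y, return_counts=True)[1].min()
          if not k < required_k:
              raise ValueError(
                  'not enough elements in the smallest class (%d) for k=%d '
                  'target neighbors; each class needs at least k + 1 elements'
                  % (required_k, k))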

    opened by JanekBlankenburg 1
  • Refactor LMNN as a triplets learner

    Refactor LMNN as a triplets learner

    Addresses my request in #210.

    1. Introduces a _BaseLMNN/LMNN class to operate on triplets, s.t. d(triplets[i, 0], triplets[i, 1]) < d(triplets[i, 0], triplets[i, 2]) - the same setup as SCML.
    • From this definition of triplets, create a 'label mask': an nxn matrix with mask[i, j] = 1 and mask[i, k] = -1 for the triplet (i, j, k) (else 0).
    • This simply reformulates loss_grad / _find_impostors to operate on this label mask: impostors that violate the large margin are detected by evaluating the squared distances of the entries implied by -1 values in the mask.
    • The desired parameter k can be inferred from the triplets by counting unique occurrences of genuine and impostor pairs.
    2. Renames LMNN to LMNN_Supervised.
    opened by zdk123 0
  • SCML: Add warm_start  parameter

    SCML: Add warm_start parameter

    opened by maxi-marufo 2
  • [WIP] Add model selection example with LFW dataset and KNN task

    [WIP] Add model selection example with LFW dataset and KNN task

    I created a model selection example for supervised Mahalanobis learners, to show the effectiveness of the learned linear transformation.

    I use a "large" dataset from sklearn: the Labeled Faces in the Wild (LFW) people dataset (classification). It's a bit more complex than using iris, and for the same reason I use PCA to reduce dimensionality.

    The usual pipeline would be PCA -> Classifier, but in this case we try PCA -> Metric learner -> Classifier and compare how precision, recall and f1 scores vary with respect to the first scenario, which I call the baseline.

    To compare models, I fixed the final classifier to be a KNeighborsClassifier.

    In general, all supervised learners are able to outperform the baseline.

    I think this example can be useful to users, because it's hard to know beforehand which model will perform best on a given dataset.

    Note: the models' parameters are not tuned; this example acts as a "final" comparison between models.

    opened by mvargas33 0
  • [DOC] [WIP] Developers documentation page

    [DOC] [WIP] Developers documentation page

    @bellet @perimosocordiae @terrytangyuan @wdevazelhes

    Motivated by #259, I made these docs for new developers, as a guide on how to contribute to the package. I followed the scikit-learn guideline here, but after talking with @bellet, it's better to keep things simple in terms of governance, for instance.

    I also considered the comments at #205 and #13.

    I also propose that for API or major changes, an MLEP (Metric Learning Enhancement Proposal) document should be required, with GitHub Discussions being the place to post and review it, because huge API changes are sometimes linked to more than one PR. Take the OASIS discussion at #336 as a very simple and informal MLEP. (In general, I took this idea from sklearn.)

    The main sections are:

    • Contributing: Introduction, values, general guideline, how to contribute code (PR process), how to test, how to compile the docs.
    • Metric learn governance: Roles, decision-making process, in general. How to proceed with API changes (MLEP)
    • Implement a new algorithm: Criteria of selection, and how to proceed.
    • API Structure: A quick review of the API, so that devs know which classes to inherit from, which methods to implement, and where.
    • MLEP Template: The template to be used in Github discussion for major changes.

    What is left:

    • How to make a release
    • How to publish the docs (the gh-page branch thing)
    • How to publish updates on PyPI and Conda.

    And because this has not been discussed in the past, take this draft for what it is: a draft, especially the governance part and the MLEP part. Maybe something much simpler is enough, like the GitHub discussion #336 that I made, but with a general template.

    Best! 😁

    PS: I'm testing CSS usage in some parts; ignore it

    opened by mvargas33 1
  • 3. [WIP] OASIS algorithm implementation

    3. [WIP] OASIS algorithm implementation

    Hi!

    I am currently implementing the OASIS algorithm, and I'm opening this PR to make the implementation transparent while working on it. Any discussion, questions or comments are very welcome.

    This PR is under the WIP (Work In Progress) tag because, as of now, I have a draft implementation of the algorithm outside the package itself: a file in the root directory, with a test file in the root as well.

    Over the coming days I will move the algorithm to the metric_learn folder to make it compatible with the current API, and the same for the tests.

    Current testing only checks that nothing is broken; I'll add some tests on KNN tasks to verify that the algorithm performs better, at least on a handmade toy example.

    This PR depends on the Bilinear PR #329 being accepted beforehand.

    opened by mvargas33 1
Releases(v0.6.2)
  • v0.6.2(Jul 2, 2020)

  • v0.6.1(Jul 2, 2020)

  • v0.6.0(Jul 1, 2020)

    This release features various fixes and improvements, as well as a new triplet-based algorithm, SCML (see http://researchers.lille.inria.fr/abellet/papers/aaai14.pdf), and an associated Triplets API. Triplet-based metric learning algorithms are used in settings where we have an "anchor" sample that we want to be closer to a "positive" sample than to a "negative" sample. Consistent with related packages like scikit-learn, we have also dropped support for Python 2 and Python 3.5.
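    A minimal sketch of what the Triplets API consumes (values are illustrative; see the SCML documentation for a complete example):

      import numpy as np

      # shape (n_triplets, 3, n_features):
      # triplets[i] = (anchor, positive, negative)
      triplets = np.array([[[1.2, 7.5], [1.3, 1.5], [6.4, 2.6]],
                           [[1.3, 4.5], [3.2, 4.6], [6.2, 5.5]]])

      # a triplets learner such as metric_learn.SCML is then fit with
      # fit(triplets) and predicts +1 for a triplet when the anchor ends up
      # closer to the positive than to the negative under the learned metric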

    New algorithms

    • Add Sparse Compositional Metric Learning (SCML) (#278)

    General updates on the package

    • Drop support for python 2 and python 3.5 (#291)
    • Add the Triplets API (#279)
    • Solve issues in the documentation (#265, #266, #271, #274, #280)
    • Allow installation from conda (#283)
    • Fix covariance initialization when matrix is not invertible (#277)
    • Add more robust checks that an estimator is fitted (#267)

    Improvements to existing algorithms

    • Improve LMNN's verbose output (#253)
    • Fix chunk generation in RCA (#254, #263)
  • v0.5.0(Jul 18, 2019)

    This is a major release in which the API (in particular for weakly-supervised algorithms) was largely refurbished to make it more unified and more compatible with scikit-learn. Note that for this reason you might encounter a significant number of DeprecationWarnings and ChangedBehaviourWarnings; these warnings will disappear in version 0.6.0. The changes are summarized below:

    • All algorithms:

      • Uniformize initialization for all algorithms: all algorithms that have a 'prior' or an 'init' parameter can now set it in a unified way, choosing among (more) different options ('identity', 'random', etc.) (#195)
      • Rename num_dims to n_components for algorithms that have such a parameter. (#193)
      • The metric() method has been renamed to get_mahalanobis_matrix (#152)
      • You can now use the function score_pairs to score a bunch of pairs of points (returning the distance between them), or get_metric to get a metric function that can be plugged into scikit-learn estimators like any scipy distance.
    • Weakly supervised algorithms

      • major API changes (#139, #217, #220, #197, #168) allowing greater compatibility with scikit-learn routines:
        • in order to fit weakly supervised algorithms, users now have to provide 3d arrays of tuples (and possibly an array of labels y). For pairs learners, instead of X and [a, b, c, d] as before, one provides an array pairs such that pairs[i] is the positive pair (X[a[k]], X[b[k]]) if y[i] == 1, and the negative pair (X[c[k]], X[d[k]]) otherwise, where k is some integer (you can obtain such a representation by stacking a and b horizontally, then c and d, stacking these vertically, and taking X[this array of indices]; see the sketch after this list). For quadruplets learners, the input has the same form, except that there is no need for y and the 3d array contains 4-tuples instead of 2-tuples. The first two elements of each quadruplet are the ones we want to be more similar to each other than the last two.
        • Alternatively, a "preprocessor" can be used, if users instead want to give tuples of indices and not tuples of plain points, for less redundant manipulation of data. Custom preprocessor can be easily written for advanced use (e.g., to load and encode images from file paths).
        • You can also use predict on a given pair or quadruplet, i.e. predict whether the pair is similar or not, or, in the case of quadruplets, whether a given new quadruplet is in the right ordering or not
        • For pairs, this prediction depends on a threshold that can be set with set_threshold and calibrated on some data with calibrate_threshold.
        • For pairs, a default score is defined, which is the AUC (Area under the ROC Curve). For quadruplets, the default score is the accuracy (proportion of quadruplets given in the right order).
        • All of the above allows the algorithms to be compatible with scikit-learn for cross-validation, grid-search etc...
        • For more information about these changes, see the new documentation
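        A hedged sketch of that conversion from the old (X, [a, b, c, d]) format to the new 3D pairs input:

          import numpy as np

          X = np.arange(20.).reshape(10, 2)           # 10 points in 2D
          a, b = np.array([0, 1]), np.array([2, 3])   # similar pairs (a[k], b[k])
          c, d = np.array([4, 5]), np.array([6, 7])   # dissimilar pairs (c[k], d[k])

          # stack a and b horizontally, then c and d, stack these vertically,
          # and index X with the result
          idx = np.vstack([np.stack([a, b], axis=1), np.stack([c, d], axis=1)])
          pairs = X[idx]                              # 3D array of shape (4, 2, 2)
          y = np.concatenate([np.ones(len(a)), -np.ones(len(c))])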
    • Supervised algorithms

      • deprecation of the num_labeled parameter (#119)
      • ITML_Supervised's bounds must now be set at init and not at fit anymore (#163)
      • deprecation of use_pca in LMNN (#231).
      • the random seed for generating constraints has now to be put at initialization rather than fit time (#224).
      • removed preprocessing the data for RCA (#194).
      • removed shogun dependency for LMNN (#216).
    • Improved documentation:

      • mathematical formulation of algorithms (#178)
      • general introduction to metric learning, use cases, different problem formulations (#145)
      • description of the API in the user guide (#208 and #229)
    • Bug fixes:

      • scikit-learn's fix https://github.com/scikit-learn/scikit-learn/pull/13276 fixed SDML when the matrix to reconstruct is PSD, and the use of skggm fixed it in cases where the matrix is not PSD but we can still converge. The use of skggm is now recommended (i.e. we recommend installing skggm to use SDML).
      • For all the algorithms that had a parameter num_dims (renamed to n_components, see above), it will now be checked to be between 1 and n_features, with n_features the number of dimensions of the input space
      • LMNN did not update impostors at each iteration, which could result in problematic cases. Impostors are now recomputed at each iteration, which solves these problems (#228).
      • The pseudo-inverse is now used in Covariance instead of the plain inverse, which makes Covariance work even when the covariance matrix is not invertible (e.g. if the data lies in a lower-dimensional subspace) (#206)
      • There was an error in #101 that caused LMNN to return a wrong gradient (one dot product with L was missing). This has been fixed in #201.
  • v0.4.0(Sep 5, 2018)

    • Two newly introduced algorithms:
      • MLKR (Metric Learning for Kernel Regression)
      • MMC (Mahalanobis Metric for Clustering)
    • Improved documentation and examples
    • Performance improvements
    • Minor bug fixes
  • v0.3.0(Jul 13, 2016)

    Constraints are now managed with a unified interface (metric_learn.Constraints), which makes it easy to generate various input formats from (possibly) partial label information.

  • v0.2.1(May 16, 2016)

  • v0.2.0(Nov 7, 2015)

  • v0.1.1(Oct 7, 2015)

    This minor release adds two new methods:

    • Local Fisher Discriminant Analysis (LFDA)
    • Relative Components Analysis (RCA)

    The performance of the non-Shogun LMNN implementation has also been improved, and it should now consume less memory.

    This release also includes the new Sphinx documentation and improved docstrings for many of the classes and methods.

  • v0.1.0(Sep 16, 2015)
