Dimensionality reduction in very large datasets using Siamese Networks

beringresearch

Last update: Jan 1, 2023

Overview

ivis

Implementation of the ivis algorithm as described in the paper Structure-preserving visualisation of high dimensional single-cell datasets. Ivis is designed to reduce the dimensionality of very large datasets using a Siamese neural network trained on triplets. Both unsupervised and supervised modes are supported.
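
To make "trained on triplets" concrete, here is a minimal sketch of a standard triplet margin loss in plain Python. It illustrates the general idea only and is not ivis's actual loss implementation; the function names and the margin value are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


# Toy embeddings: an anchor, one of its neighbours, and a distant point.
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]

# The negative is already far from the anchor, so nothing is penalised.
loss_easy = triplet_margin_loss(anchor, positive, negative)

# With the roles swapped, the "positive" is farther away than the "negative".
loss_hard = triplet_margin_loss(anchor, negative, positive)
```

Minimising a loss of this shape pulls neighbouring points together in the embedding and pushes non-neighbours apart.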

Installation

Ivis runs on top of TensorFlow. To install the latest ivis release from PyPI with the CPU TensorFlow package, run:

# TensorFlow 2 packages require a pip version >19.0.
pip install --upgrade pip

pip install ivis[cpu]

If you have CUDA installed and want ivis to use the tensorflow-gpu package, run:

pip install ivis[gpu]

The development version can be installed directly from GitHub:

git clone https://github.com/beringresearch/ivis
cd ivis
pip install -e '.[cpu]'

The following optional dependencies are needed if using the visualization callbacks while training the Ivis model:

matplotlib
seaborn

Upgrading

The ivis Python package is updated frequently! To upgrade, run:

pip install ivis --upgrade

Features

Scalable: ivis is fast and easily extends to millions of observations and thousands of features.
Versatile: numpy arrays, sparse matrices, and hdf5 files are supported out of the box. Additionally, both categorical and continuous features are handled well, making it easy to apply ivis to heterogeneous problems including clustering and anomaly detection.
Accurate: ivis excels at preserving both local and global features of a dataset. Often, ivis performs better at preserving global structure of the data than t-SNE, making it easy to visualise and interpret high-dimensional datasets.
Generalisable: ivis supports addition of new data points to original embeddings via a transform method, making it easy to incorporate ivis into standard sklearn Pipelines.

And many more! See the ivis README for the latest additions and examples.

Examples

from ivis import Ivis
from sklearn.preprocessing import MinMaxScaler
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
X_scaled = MinMaxScaler().fit_transform(X)

model = Ivis(embedding_dims=2, k=15)

embeddings = model.fit_transform(X_scaled)

Comments

Bug with index.build(ntrees)

Hello,

I'm trying to run the ivis examples (both the simple iris one and the mnist one), and I keep getting this error whenever model fitting is called (running this on Debian). Any thoughts?

In [7]: embeddings = ivis.fit_transform(mnist.data)

Error truncating file: Invalid argument
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-7-d5f1692c2b85> in <module>
----> 1 embeddings = ivis.fit_transform(mnist.data)

/opt/conda/envs/ivisumap/lib/python3.7/site-packages/ivis/ivis.py in fit_transform(self, X, Y, shuffle_mode)
    289         """
    290
--> 291         self.fit(X, Y, shuffle_mode)
    292         return self.transform(X)
    293

/opt/conda/envs/ivisumap/lib/python3.7/site-packages/ivis/ivis.py in fit(self, X, Y, shuffle_mode)
    269         """
    270
--> 271         self._fit(X, Y, shuffle_mode)
    272         return self
    273

/opt/conda/envs/ivisumap/lib/python3.7/site-packages/ivis/ivis.py in _fit(self, X, Y, shuffle_mode)
    146                 print('Building KNN index')
    147             build_annoy_index(X, self.annoy_index_path,
--> 148                               ntrees=self.ntrees, verbose=self.verbose)
    149
    150         datagen = generator_from_index(X, Y,

/opt/conda/envs/ivisumap/lib/python3.7/site-packages/ivis/data/knn.py in build_annoy_index(X, path, ntrees, verbose)
     28
     29     # Build n trees
---> 30     index.build(ntrees)
     31     if platform.system() == 'Windows':
     32         index.save(path)

Exception: Invalid argument

opened by sadatnfs 15

Windows compatibility?

Really excited to compare Ivis to UMAP on a project I am currently working on.

The server I have access to is a Windows 10 machine, with a Python 3.7 Anaconda environment.

Following the install instructions and trying to run the MNIST example, I am seeing the following error: TypeError: can't pickle annoy.Annoy objects
enhancement help wanted

opened by paul-harambee 13

Ivis seems to provoke errors when composing a sklearn.pipeline.Pipeline passed to sklearn.model_selection.GridSearchCV and executed in parallel

The problem

I noticed that errors are thrown when Ivis is a step in a sklearn.pipeline.Pipeline that is passed to sklearn.model_selection.GridSearchCV to fine-tune hyper-parameters across all estimators/transformers, and GridSearchCV has n_jobs=-1 (i.e., when executions within GridSearchCV are parallel). This does not happen when n_jobs=1 (i.e., when the executions within GridSearchCV are sequential).

Since Pipeline regulates the n_jobs parameter globally and does not support parallelizing only specific steps, this problem forces the global use of n_jobs=1, which noticeably slows down the fine-tuning process by underusing the computational power of the machine on which the script runs (even in parts where n_jobs=-1 would work).

Environment

A virtual environment was created specifically for this repository, in which all modules listed in requirements.txt were installed. My setup runs an up-to-date version of Windows 10 (no WSL).

Runtime

python=3.8.4

Relevant modules

ivis=2.0.3
tensorflow=2.5.0

Minimal reproducible example

Code

if __name__ == "__main__":
    import tempfile
    import ivis

    from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing
    from os import environ

    environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

    X, y = datasets.load_iris(return_X_y=True)

    pipeline_with_ivis = pipeline.Pipeline([
        ("normalize", preprocessing.MinMaxScaler()),
        ("project", ivis.Ivis()),
        ("classify", ensemble.RandomForestClassifier()),
    ], memory=tempfile.mkdtemp())

    parameter_grid = {
        "project__k": (15,),
        "project__verbose": (True,),

        "classify__random_state": (2021,)
    }

    grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10, n_jobs=-1,
                                               return_train_score=True, verbose=3).fit(X, y)

Error

<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 212, in extract_knn
    process.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\backend\process.py", line 39, in _Popen
    return Popen(process_obj)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\backend\popen_loky_win32.py", line 70, in __init__
    child_env.update(process_obj.env)
AttributeError: 'KnnWorker' object has no attribute 'env'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 591, in __call__
    return self._cached_call(args, kwargs)[0]
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 534, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 761, in call
    output = self.func(*args, **kwargs)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 350, in fit_transform
    self.fit(X, Y, shuffle_mode)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 328, in fit
    self._fit(X, Y, shuffle_mode)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 190, in _fit
    self.neighbour_matrix = AnnoyKnnMatrix.build(X, path=self.annoy_index_path,
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 63, in build
    return cls(index, X.shape, path, k, search_k, precompute, include_distances, verbose)
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 48, in __init__
    self.precomputed_neighbours = self.get_neighbour_indices()
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 96, in get_neighbour_indices
    return extract_knn(
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 236, in extract_knn
    process.terminate()
  File "C:\Python38\lib\multiprocessing\process.py", line 133, in terminate
    self._popen.terminate()
AttributeError: 'NoneType' object has no attribute 'terminate'
  warnings.warn("Estimator fit failed. The score on this train-test"

[...]

<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_search.py:922: UserWarning: One or more of the test scores are non-finite: [nan]
  warnings.warn(

Discussion

By coding and playing with the example above, I came to understand that, since sklearn uses joblib and ivis uses multiprocessing, these modules might not be playing well with each other for some reason.

I would discard the understanding that nested estimators/transformers with parallel routines would be the problem: estimators like sklearn.ensemble.RandomForestClassifier can be set to have n_jobs=-1 without problem within the Pipeline passed to GridSearchCV.

I am particularly affected by this issue because I want to employ ivis in projects that involve hyper-parameter fine-tuning using cross-validation via GridSearchCV with concurrent executions. I attempted to diagnose the problem, but to no avail, which is why I bring this issue to your attention.

Observation: another part of this problem is a design choice that does not adhere to the sklearn API guidelines; I propose and detail a solution in #95. That issue does not cause the aforementioned error, but might cause other errors that could affect the same use scenario (Pipeline in GridSearchCV running in parallel).
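
A note on the second traceback in the error output above: it is raised while cleaning up after the first failure, because terminate() is called on a worker whose start() never completed. The following is a minimal standard-library sketch, not ivis's actual code, of cleanup that tolerates such workers; terminate_started and _noop are hypothetical names used only for this illustration.

```python
import multiprocessing


def _noop():
    """Stand-in for a real worker's job."""


def terminate_started(processes):
    """Terminate only the workers that are actually running.

    Returns the number of processes that were terminated.
    """
    terminated = 0
    for process in processes:
        # is_alive() is False for a Process whose start() was never called
        # or did not complete, so terminate() is skipped for it.
        if process.is_alive():
            process.terminate()
            terminated += 1
    return terminated


# A worker that was created but never started, as happens when start() raises.
never_started = multiprocessing.Process(target=_noop)
n_terminated = terminate_started([never_started])
```

This would only remove the secondary AttributeError; the underlying failure in process.start() under the loky backend would still need a separate fix.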

opened by imatheussm 10

attempt to apply non-function

I want to install ivis in R, but I get the error shown in the title. My computer runs Windows, so I installed conda before running the code. Can anyone help me solve this problem? Thank you!

library(reticulate)
devtools::install_github("beringresearch/ivis/R-package")
library(ivis)
model <- ivis(k = 3)
Error in ivis_object$Ivis(embedding_dims = embedding_dims, k = k, distance = distance, :
  attempt to apply non-function

opened by Feifei0511 9

Issue installing and running ivis R package in RStudio

Hello,

For JOSS review.

The installation instructions fail when run in the RStudio environment:

> devtools::install_github("beringresearch/ivis/R-package", force=TRUE)
Downloading GitHub repo beringresearch/ivis@master
✔  checking for file ‘/private/var/folders/cp/8rn2cs_x79zcbp_yb75ychg80000gq/T/Rtmpud6pnU/remotesbe4d59017fdb/beringresearch-ivis-bbccdb7/R-package/DESCRIPTION’ ...
─  preparing ‘ivis’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘ivis_1.1.3.tar.gz’
   
* installing *source* package ‘ivis’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘ivis’:
 .onLoad failed in loadNamespace() for 'ivis', details:
  call: path.expand(path)
  error: invalid 'path' argument
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/ivis’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/ivis’
Error: Failed to install 'ivis' from GitHub:
  (converted from warning) installation of package ‘/var/folders/cp/8rn2cs_x79zcbp_yb75ychg80000gq/T//Rtmpud6pnU/filebe4d71713083/ivis_1.1.3.tar.gz’ had non-zero exit status

However, it does work fine when run in the console (Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64):

> devtools::install_github("beringresearch/ivis/R-package", force=TRUE)
Downloading GitHub repo beringresearch/ivis@master
   checking for file ‘/private/var/folders/cp/8rn2cs_x79zcbp_yb75ychg80000gq/T/Rtmpvj2CT3/remotesc3827327cfb8/beringresearch-ivis-bbccdb7/R-package/DESCRIPTION’✔  checking for file ‘/private/var/folders/cp/8rn2cs_x79zcbp_yb75ychg80000gq/T/Rtmpvj2CT3/remotesc3827327cfb8/beringresearch-ivis-bbccdb7/R-package/DESCRIPTION’
─  preparing ‘ivis’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘ivis_1.1.3.tar.gz’
   
* installing *source* package ‘ivis’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (ivis)

Moreover, the ivis package (installed from the terminal) can be loaded from an R console in a terminal, but throws the following error when loaded in RStudio

> library(ivis)
Error: package or namespace load failed for ‘ivis’:
 .onLoad failed in loadNamespace() for 'ivis', details:
  call: path.expand(path)
  error: invalid 'path' argument

This is most likely due to conda not being on the PATH in RStudio:

# RStudio
> system("echo $PATH")
/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/local/ncbi/igblast/bin:/Library/TeX/texbin:/opt/X11/bin:/opt/local/bin
# Console
> system("echo $PATH")
/Users/kevin/miniconda3/bin:/Users/kevin/miniconda3/condabin:/usr/local/opt/imagemagick@6/bin:/Users/kevin/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/ncbi/igblast/bin:/Library/TeX/texbin:/opt/X11/bin

Is there a recommended way to set up an environment to run ivis in RStudio, or are users only expected to run it from a terminal R console?

Thanks!

opened by kevinrue 6

Enable registration or passing of a custom triplet loss function

In Python, Ivis.__init__ accepts a distance: str keyword argument, which sets from a dictionary a predefined triplet loss function for that distance metric. Currently, one of the ways to provide a custom distance function is to monkeypatch the ivis.nn.losses.get_loss_functions. Other ways to accomplish the same are even messier from the perspectives of usage and implementation.

The nature of dimensionality reduction, especially when dealing with one-hot-encoded categorical features, sometimes requires custom ways to calculate loss. Under the hood, ivis has the ability to enable custom loss functions, but any such offerings need to be implemented in a clean and API-idiomatic manner.

A custom distance function requires its own triplet loss implementation. Ivis.__init__ could support an additional keyword argument (e.g. triplet_loss: Callable[..., ...] = ...) for users to be able to pass their own.

Alternatively, it could simply be passed inside the existing distance kwarg, with its signature changing to distance: Union[str, Callable[..., ...]].

Another way would be to make the losses dictionary built by ivis.nn.losses.get_loss_functions a module-level loss function registrar.

Additionally, docs and examples need to be updated on how to correctly implement a custom loss function. With all currently available distance metrics, the triplet loss implementation follows a very similar pattern, and should not be too daunting to attempt to implement.
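
The last two proposals (a distance argument that accepts either a string or a callable, and a module-level registrar) could be combined roughly as sketched below. This is a hypothetical illustration in plain Python, not ivis's API: register_loss, resolve_loss, the _LOSSES dictionary and the toy scalar loss are all assumptions, and real ivis losses operate on tensors rather than floats.

```python
from typing import Callable, Dict, Union

# Hypothetical signature: (anchor, positive, negative) -> loss value.
TripletLoss = Callable[[float, float, float], float]

# Module-level registry mapping a distance name to its triplet loss.
_LOSSES: Dict[str, TripletLoss] = {}


def register_loss(name: str) -> Callable[[TripletLoss], TripletLoss]:
    """Decorator that adds a loss function to the module-level registry."""
    def decorator(fn: TripletLoss) -> TripletLoss:
        _LOSSES[name] = fn
        return fn
    return decorator


def resolve_loss(distance: Union[str, TripletLoss]) -> TripletLoss:
    """Accept either a registered name or a user-supplied callable."""
    if callable(distance):
        return distance
    try:
        return _LOSSES[distance]
    except KeyError:
        known = sorted(_LOSSES)
        raise ValueError(f"Unknown distance {distance!r}; registered: {known}") from None


@register_loss("abs_margin")
def abs_margin_loss(anchor: float, positive: float, negative: float) -> float:
    """Toy 1-D triplet loss with a margin of 1, for illustration only."""
    return max(abs(anchor - positive) - abs(anchor - negative) + 1.0, 0.0)


by_name = resolve_loss("abs_margin")             # looked up in the registry
by_callable = resolve_loss(lambda a, p, n: 0.0)  # passed through unchanged
```

With a resolver like this, existing string-based calls would keep working, while users with a custom distance could either register it once or pass the callable directly.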

opened by mihajenko 5

Add a vignette to the R package

Hello,

For JOSS review.

Is your feature request related to a problem? Please describe.

The R package lacks documentation of an application to a real-life dataset.

Describe the solution you'd like

Please add a vignette in the R package demonstrating at least an example application to a single-cell dataset. Basically, the equivalent of the scanpy workflow here.

A convenient way to use the pbmc3k dataset for demonstration purposes is the Bioconductor TENxPBMCData package.

Suggested code:

library(TENxPBMCData)
tenx_pbmc3k <- TENxPBMCData(dataset = "pbmc3k")

Ideally, consider using the vignette (or a separate one) to also give an introduction to the functionality of the R package. It is not necessary to duplicate information already described in the documentation of the Python package (DRY principle); you may simply include a link to the main page.

Describe alternatives you've considered

A working example of an R workflow could also be included in the documentation of the Python package, although this is probably unnecessarily difficult to maintain. Ideally, that example would be run and tested for every new release of the Python and R source code.

Additional context

Once you have an R vignette written, you should also consider using pkgdown to automatically create a GitHub website including the full package documentation.

opened by kevinrue 5

Extremely slow extraction of KNN neighbours on 100k samples

I'm using ivis[cpu] on a dataset of about 100k samples with around 200k sparse features. My training dataset is stored in an h5 file and I use the following code to fit and transform the dataset:

with h5py.File(filename, 'r') as f:
    X = f['data']
    Y = pd.Categorical(meta_df["label"]).codes
    model = Ivis(epochs=5, k=15)
    model.fit(X, Y, shuffle_mode='batch')  # Shuffle batches when using h5 files
    embeddings = model.transform(X)

However, it takes so long:

Building KNN index
100%|██████████| 105942/105942 [55:07<00:00, 32.03it/s]
Extracting KNN neighbours
0%| | 262/105942 [7:16:38<2935:20:19, 99.99s/it]

2935 hours!! Am I missing something, or is this expected? Should I switch to GPU?

By the way, I'm using a Google Colab system with 8 CPU cores, 50 GB RAM, and an SSD disk.

opened by adavoudi 4

How to get stable results?

Hello Folks,

thank you for all the work on this lib. I have a question about reproducibility: Is there a way to set a random seed or random state and get stable results?

I'm trying to achieve this with:

import random
import numpy

random.seed(42)
numpy.random.seed(42)

I'm aware that these are not thread-safe, which may be why the results are not reproducible. Anyway, is there any way to enforce reproducibility?
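
Since ivis trains a TensorFlow model, seeding only random and numpy probably leaves TensorFlow's own generator unseeded. Below is a sketch that seeds every library it can import; seed_everything is a hypothetical helper name, not part of ivis. Even with all seeds set, runs may still differ, because the approximate nearest-neighbour index and multi-threaded training can introduce further nondeterminism (an assumption, not something confirmed in this thread).

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed every random number generator available in this environment."""
    # Only affects hash randomisation in Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow is not installed


# Re-seeding makes the standard-library generator repeat its sequence.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
```

The helper would need to be called before the Ivis model is constructed and fitted for the seeds to have any effect on training.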
opened by rsarai 4

model_save: optimizer is not compatible with pickle

When I call save_model after fitting a supervised Ivis instance, I get an error. It looks like some part of the optimizer cannot be pickled.

Replicate:

import ivis
i = ivis.Ivis(embedding_dims=10, n_epochs_without_progress=5)
i.fit(X, y)
i.save_model("model.ivis")

Traceback (most recent call last):
  File "src/ivis_persist.py", line 69, in <module>
    ivises[output].save_model(f"models/{output}.ivis")
  File "/Users/pbaumgartner/anaconda3/envs/env/lib/python3.7/site-packages/ivis/ivis.py", line 404, in save_model
    pkl.dump(self.model_.optimizer, f)
AttributeError: Can't pickle local object 'make_gradient_clipnorm_fn.<locals>.<lambda>'

System Info: Running ivis==2.0.0 on macOS with python 3.7.

bug

opened by pmbaumgartner 4

R pkg fit() call finishes but subprocess doesn't terminate
This model consistently feels like a magic trick, thanks for contributing!

Bug

I'm running the ivis R package (v1.7.1) (more system details below). I can get model$fit() and model$transform() working just fine and producing substantive results. However, when the R process finishes and returns the fitted model, I'm seeing continued sky-high system usage. The R process calling ivis is definitely completed and back to a command prompt, but in htop I can see the RStudio GUI process (parent of the rsession process) occupying at least 2 full cores. Some process further down is not stopping when the R process gets the returned value. (Restarting the R session does kill it.)

I don't understand enough of the ivis-through-reticulate toolchain to provide more helpful diagnostics in this first report, but happy to run experiments and document further.

Environment

ivis R package(v1.7.1), installed from Github (beringresearch/ivis@56a8479) 14 Apr 2020

reticulate (v1.15), 2020-04-02 CRAN (R 3.6.2)

R 3.6.2 on MacOS 10.14.6 (18G4032)

platform       x86_64-apple-darwin15.6.0
arch           x86_64
os             darwin15.6.0
system         x86_64, darwin15.6.0
status
major          3
minor          6.2
year           2019
month          12
day            12
svn rev        77560
language       R
version.string R version 3.6.2 (2019-12-12)
nickname       Dark and Stormy Night
opened by sheffe 4

InternalError: Graph execution error:

Hello, I want to use ivis to analyse my scRNA-seq data.

Here is my code:

def getReduction(X):
    #X = PCA(n_components=4, copy=True, random_state=1).fit_transform(X)
    from ivis import Ivis
    model = Ivis(embedding_dims=4, k=15)
    X = model.fit_transform(X)
    print(X.shape)
    return X

but I got some errors:

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 multi_train_x = getReduction(train_x)

Input In [8], in getReduction(X)
      3 from ivis import Ivis
      4 model = Ivis(embedding_dims=6, k=15)
----> 5 X = model.fit_transform(X)
      6 print(X.shape)
      7 return X

File /opt/conda/lib/python3.8/site-packages/ivis/ivis.py:368, in Ivis.fit_transform(self, X, Y, shuffle_mode)
    349 def fit_transform(self, X, Y=None, shuffle_mode=True):
    350     """Fit to data then transform
    351 
    352     Parameters
   (...)
    365         Embedding of the data in low-dimensional space.
    366     """
--> 368     self.fit(X, Y, shuffle_mode)
    369     return self.transform(X)

File /opt/conda/lib/python3.8/site-packages/ivis/ivis.py:346, in Ivis.fit(self, X, Y, shuffle_mode)
    328 def fit(self, X, Y=None, shuffle_mode=True):
    329     """Fit an ivis model.
    330 
    331     Parameters
   (...)
    343         Returns estimator instance.
    344     """
--> 346     self._fit(X, Y, shuffle_mode)
    347     return self

File /opt/conda/lib/python3.8/site-packages/ivis/ivis.py:318, in Ivis._fit(self, X, Y, shuffle_mode)
    315 if self.verbose > 0:
    316     print('Training neural network')
--> 318 hist = self.model_.fit(
    319     datagen,
    320     epochs=self.epochs,
    321     callbacks=self.callbacks_ + [EarlyStopping(monitor='loss',
    322                                                patience=self.n_epochs_without_progress)],
    323     shuffle=shuffle_mode,
    324     steps_per_epoch=int(np.ceil(X.shape[0] / self.batch_size)),
    325     verbose=self.verbose)
    326 self.loss_history_ += hist.history['loss']

File /opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     65 except Exception as e:  # pylint: disable=broad-except
     66   filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67   raise e.with_traceback(filtered_tb) from None
     68 finally:
     69   del filtered_tb

File /opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52 try:
     53   ctx.ensure_initialized()
---> 54   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                       inputs, attrs, num_outputs)
     56 except core._NotOkStatusException as e:
     57   if name is not None:

InternalError: Graph execution error:

Detected at node 'model_1/model/dense/MatMul' defined at (most recent call last):
    File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/opt/conda/lib/python3.8/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 846, in launch_instance
      app.start()
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 712, in start
      self.io_loop.start()
    File "/opt/conda/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
      self.asyncio_loop.run_forever()
    File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
      self._run_once()
    File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
      handle._run()
    File "/opt/conda/lib/python3.8/asyncio/events.py", line 81, in _run
      self._context.run(self._callback, *self._args)
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 504, in dispatch_queue
      await self.process_one()
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 493, in process_one
      await dispatch(*args)
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 400, in dispatch_shell
      await result
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 724, in execute_request
      reply_content = await reply_content
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 383, in do_execute
      res = shell.run_cell(
    File "/opt/conda/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 528, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2880, in run_cell
      result = self._run_cell(
    File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2935, in _run_cell
      return runner(coro)
    File "/opt/conda/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3134, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3337, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3397, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/tmp/ipykernel_1917/2291785529.py", line 1, in <cell line: 1>
      multi_train_x = getReduction(train_x)
    File "/tmp/ipykernel_1917/2290316524.py", line 5, in getReduction
      X = model.fit_transform(X)
    File "/opt/conda/lib/python3.8/site-packages/ivis/ivis.py", line 368, in fit_transform
      self.fit(X, Y, shuffle_mode)
    File "/opt/conda/lib/python3.8/site-packages/ivis/ivis.py", line 346, in fit
      self._fit(X, Y, shuffle_mode)
    File "/opt/conda/lib/python3.8/site-packages/ivis/ivis.py", line 318, in _fit
      hist = self.model_.fit(
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 889, in train_step
      y_pred = self(x, training=True)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 490, in __call__
      return super().__call__(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/functional.py", line 458, in call
      return self._run_internal_graph(
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/functional.py", line 596, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/training.py", line 490, in __call__
      return super().__call__(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/functional.py", line 458, in call
      return self._run_internal_graph(
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/functional.py", line 596, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/keras/layers/core/dense.py", line 221, in call
      outputs = tf.matmul(a=inputs, b=self.kernel)
Node: 'model_1/model/dense/MatMul'
Attempting to perform BLAS operation using StreamExecutor without BLAS support
	 [[{{node model_1/model/dense/MatMul}}]] [Op:__inference_train_function_1703]

Thanks !!!

opened by bitcometz 3

Add conda-forge package

In addition to the pypi package, please add a conda-forge package (https://conda-forge.org).

I can give support if needed.

You can easily create a boilerplate conda recipe with grayskull (starting from the pypi package): https://github.com/conda-incubator/grayskull (note: the "annoy" package is called "python-annoy" in conda-forge).

opened by candalfigomoro 0

Distance-weighted random sampling of non-neighbor negatives

Not a fully-baked feature request, just a directional hunch. I've found the conclusions from this paper Sampling Matters in Deep Embedding Learning pretty intuitive -- (1) the method for choosing negative samples is critical to the overall embedding, maybe more than the specific loss function, and (2) a distance-weighted sampling of negatives had some nice properties during training and better results compared to uniform random sampling or oversampling hard cases.

I'm brand-new to Annoy, not confident on the implementation details or performance changes here, but I suspect that the prebuilt index could be used for both positive and negative sampling. An example: the current approach draws random negatives in sequence and chooses the first index not in a neighbor list. A distance-weighted approach for choosing a negative for each triplet might work like this:

Draw a random set of candidate negatives

Drop any candidate negatives already in the neighbor list

Choose from the remaining set of candidates with probabilities proportional to f(dist(i, j)), where f(dist) could be just 1/dist, 1/sqrt(dist), etc.

Annoy gives us the dist(i, j) without much of a performance hit. Weighted choice of the candidate negatives puts a (tunable) thumb on the scale for triplets that contain closer/harder-negative matches.

This idea probably does increase some hyperparameter selection headaches. I think the impactful choices here are the size of the initial set of candidate negatives and (especially) f(dist).
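
The three steps above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not code from ivis: the function name `sample_negative`, its parameters, and the brute-force Euclidean distances are all assumptions made for the example (in ivis the distances would presumably come from the prebuilt Annoy index instead).

```python
import numpy as np

def sample_negative(X, anchor, neighbors, n_candidates=20,
                    f=lambda d: 1.0 / d, rng=None):
    """Pick one negative index for `anchor`, favouring closer candidates.

    Hypothetical helper: `f` maps distance to an unnormalised weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: draw a random set of candidate negatives.
    size = min(n_candidates, X.shape[0])
    candidates = rng.choice(X.shape[0], size=size, replace=False)
    # Step 2: drop the anchor itself and anything already in the neighbor list.
    excluded = np.append(np.asarray(neighbors, dtype=int), anchor)
    candidates = candidates[~np.isin(candidates, excluded)]
    if candidates.size == 0:
        raise ValueError("no valid candidates; increase n_candidates")
    # Step 3: weight the survivors by f(distance to anchor) and sample one.
    dists = np.linalg.norm(X[candidates] - X[anchor], axis=1)
    weights = f(np.maximum(dists, 1e-12))  # guard against zero distance
    return int(rng.choice(candidates, p=weights / weights.sum()))

# Example: one negative for point 0, whose neighbors are points 1, 2 and 3.
X_demo = np.random.default_rng(0).normal(size=(200, 5))
negative = sample_negative(X_demo, anchor=0, neighbors=[1, 2, 3])
```

Swapping `f` for `lambda d: 1.0 / np.sqrt(d)` softens the preference for hard negatives, and `n_candidates` controls how hard the hardest available negative can be; these are the two knobs mentioned above.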
opened by sheffe 2

Custom generator for training on out-of-memory datasets

In https://bering-ivis.readthedocs.io/en/latest/oom_datasets.html, for out-of-memory datasets, you say to train on h5 files that exist on disk.

In my case, I can't use h5 files, but I could use a custom generator which yields numpy array batched data.

Is there a way to provide batched data through a custom generator function? Something like keras' fit_generator.

Thank you
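
For concreteness, the kind of generator meant here might look like the sketch below. It says nothing about what ivis accepts: `batch_generator` and the `load_chunk` callback are hypothetical names, and the in-memory array only stands in for a data source that cannot be exported to h5.

```python
import numpy as np

def batch_generator(load_chunk, n_rows, batch_size):
    """Yield consecutive (rows, n_features) float32 arrays, one batch at a time.

    `load_chunk(start, stop)` is a hypothetical callback that fetches rows
    [start, stop) from wherever the data lives (database, remote store, ...).
    """
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        yield np.asarray(load_chunk(start, stop), dtype=np.float32)

# Stand-in data source: slices of an in-memory array.
source = np.arange(50, dtype=np.float32).reshape(10, 5)
for batch in batch_generator(lambda a, b: source[a:b], n_rows=10, batch_size=4):
    print(batch.shape)
```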

opened by candalfigomoro 5

Releases(2.08)

2.08(Nov 4, 2022)

2.07(Mar 10, 2022)
Added ability to save/load ivis models that have not been trained. This also fixes an issue when using GridSearchCV in conjunction with ivis

Bugfix for triplet generator when used in conjunction with a dataset exposing the custom get_triplet_data method

2.06(Oct 17, 2021)
New features:

ivis models are now serializable via pickle/dill/joblib. Thanks to @imatheussm for his contributions toward this.

The save_model method now accepts an optional "save_format" argument. Setting it to "tfs" will export ivis models in the TensorFlow SavedModel format, which integrates well with other TensorFlow libraries.

2.0.5rc1(Jun 4, 2021)
KNN retrieval made more efficient by switching from multi-processing to multi-threading. Memory savings depend on OS and core count.

Fixed issue where saved ivis models would attempt to load the index at the path they were saved with - this can't be relied on when the index is temporary and deleted after use.

Fixed issue where Annoy Index metric parameter was not passed to an index that was loaded from disk.

A few other things changed, including better error handling, cleaner code, and support for saving AnnoyKnnMatrix via pickle.

2.0.5(Jul 13, 2021)
Highlights:

Improved training speed for numpy array inputs thanks to a faster triplet generator.

Batched retrieval capabilities that make ivis much faster when training on out-of-memory data that is retrieved in parallel.

Improved performance when using Ivis with precompute=False option by using multi-threading when retrieving batches of KNN on-demand.

Added deprecation notices for minor upcoming changes to API for consistency and adherence to sklearn API.

2.0.3(May 26, 2021)
Improved memory utilization during KNN retrieval

AnnoyIndex is now removed from disk after running Ivis

2.02(Apr 15, 2021)
Minor release

Fixes zero chunk error #90

2.0.1(Jan 6, 2021)
Minor release addressing:

TensorFlow 2.4 model save compatibility (#82)

Training/Inference batch size concordance

2.0.0(Dec 8, 2020)
Major ivis release!

Version 2.0 features:

Unsupervised, semi-supervised, and fully supervised dimensionality reduction

Support for arbitrary datasets:

N-dimensional arrays

Image files on disk

Custom data connectors

In- and out-of-memory data ingestion

Resumable training

Arbitrary neural network backbones

Customizable neighbour retrieval

Callbacks and Tensorboard integration

1.8.4(Nov 2, 2020)
Added support for TensorFlow 2.3.0

Visualise embeddings using EmbeddingsExplorer class through datashader in Jupyter notebooks

1.8.3(Oct 28, 2020)

Improved handling of n-dimensional arrays and HDF5 files
1.8.2(Oct 28, 2020)

Support for dimensionality reduction of arbitrary n-dimensional arrays.
1.8.1(Jun 11, 2020)

Compatibility fixes with tensorflow ≥1.13.1
1.8.0(May 13, 2020)
Introducing neighbour_matrix parameter for provision of arbitrary KNNs.

Transition to tf.Datasets, improving memory efficiency and overall stability

1.7.2(Apr 8, 2020)

1.7.0(Jan 7, 2020)

This release addresses #50 by making it easy to alternate between CPU- and GPU-enabled TensorFlow backends.
1.6.0(Oct 29, 2019)
Major features:

Support for semi-supervised dimensionality reduction

Switch from using fit_generator to fit for training the Keras model

Address eager execution issues with TF 2.0

User-configurable on-disk-building of Annoy index.

Tidy handling of interrupted multi-thread processes

Minor features:

Tests for semi-supervised DR

Improved input validation

Better hyper parameter validation

Slight changes to default hyperparameters

Bug fixes

1.5.3(Oct 3, 2019)
Control eager execution

R package updates and improvements

Save ivis object with a custom model

Bug squashes and performance improvements

1.5.0(Oct 1, 2019)

Transition ivis to TensorFlow 2.0
1.4.1(Sep 5, 2019)

Added support for supervised multi-label dimensionality reduction.
1.4.0(Aug 19, 2019)
A number of major additions:

Support for both classification- and regression-type supervision

Access to all Keras losses for supervised dimensionality reduction

Bug fixes and performance improvements

1.3.0(Aug 6, 2019)
This release introduces a number of new features into ivis:

Windows support

Code changes to support ivis on Python2

R package received a major facelift - with big thanks to JOSS reviewers

Added cosine distance metric in triplet loss function

Minor bug fixes and performance improvements

1.2.4(Aug 5, 2019)

Appropriate contribution assignments.
1.2.3-joss(Aug 5, 2019)

ivis release following feedback from JOSS review.
1.2.3(Jul 4, 2019)

Support for sparse matrices in supervised mode

Bug fixes
1.2.2(Jul 2, 2019)

Added callbacks and sanity checks during module imports
1.2.1(Jul 2, 2019)

Bug fixes and cleanup
1.2.0(Jul 2, 2019)
Supervised mode added to ivis. Additional features:

Add classification_weight parameter to allow users to tune balance between classification vs. triplet loss.

Add Ivis callbacks module for ivis-specific callbacks such as checkpointing during training. Ivis object code changed to deal with provided callbacks.

Tensorboard callbacks

Sparse matrix support in supervised mode

1.1.5(Jun 25, 2019)

Significant improvement in processing speed for both precompute=True and precompute=False option using Keras Sequence generator. Addresses #21 .
1.1.4(Jun 20, 2019)

Bug fixes

PyPi release

Owner

beringresearch
