Data imputations library to preprocess datasets with missing data

Elton Law

Last update: Dec 5, 2022

Overview

Impyute

Impyute is a library of missing data imputation algorithms. It was designed to be lightweight. Here's a sneak peek at what impyute can do.

>>> import numpy as np
>>> n = 5
>>> arr = np.random.uniform(high=6, size=(n, n))
>>> for _ in range(3):
...     arr[np.random.randint(n), np.random.randint(n)] = np.nan
...
>>> print(arr)
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, np.nan],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, np.nan, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, np.nan, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])
>>> import impyute as impy
>>> print(impy.mean(arr))
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, 3.7122365],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, 1.99128649, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, 3.08994336, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])

Feature Support

Imputation of Cross Sectional Data
- K-Nearest Neighbours
- Multivariate Imputation by Chained Equations
- Expectation Maximization
- Mean Imputation
- Mode Imputation
- Median Imputation
- Random Imputation

Imputation of Time Series Data
- Last Observation Carried Forward
- Moving Window
- Autoregressive Integrated Moving Average (WIP)

Diagnostic Tools
- Loggers
- Distribution of Null Values
- Comparison of imputations
- Little's MCAR Test (WIP)

Versions

Currently tested on Python 2.7, 3.4, 3.5, 3.6 and 3.7.

Installation

To install impyute, run the following:

$ pip install impyute

Or to get the most current version:

$ git clone https://github.com/eltonlaw/impyute
$ cd impyute
$ python setup.py install

Documentation

Documentation is available here: http://impyute.readthedocs.io/

How to Contribute

Check out CONTRIBUTING

Comments

Multivariate Imputation by Chained Equations is going to return only mean value of the Column.

Issue: In the module "impyute.mice" we expected each column's imputed values to come from the linear equation that converged for that column. Instead, we only get the mean values of the column.

Reason: The loop is entered only if the condition below is satisfied. A glitch in that condition makes us skip the loop and return the mean-imputed data set straight away.

Condition Failing:

converged = [False] * len(null_xyv)
while all(converged):
    .....................
    ....................

Resolution:

converged = [False] * len(null_xyv)
while not(all(converged)):
    .....................
    ....................
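A minimal sketch of why the original condition skips the loop. The helper `count_iterations` and its stand-in update step are hypothetical and are not part of impyute; they only illustrate how `all()` behaves on a list of `False` values.

```python
def count_iterations(condition, n=3, max_iter=10):
    """Count how many times a loop guarded by `condition` runs."""
    converged = [False] * n
    iterations = 0
    while condition(converged) and iterations < max_iter:
        # Stand-in for one regression update: mark one more column as converged.
        converged[iterations % n] = True
        iterations += 1
    return iterations

# all([False, False, False]) is False, so the body never executes.
original = count_iterations(lambda c: all(c))
# not all(...) is True until every entry has converged.
fixed = count_iterations(lambda c: not all(c))
print(original, fixed)
```

With the original condition the loop body never runs, which matches the report that only the initial mean-imputed values are returned.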

opened by PavanTejaDokku 5

Name change request: mice

Dear Elton,

Thanks for your effort to implement an algorithm for imputing multivariate data.

I’d like to request a name change of your impyute.imputation.cs.mice procedure. The documentation of this procedure says that it implements Multivariate Imputation by Chained Equations (MICE) from my JSS 2011 paper. However, this documentation is not accurate since your procedure does not implement the MICE algorithm. It differs in important respects from my method:

Your procedure provides a single imputation, whereas MICE is a procedure for generating multiple imputations;

Your procedure imputes the “best” (predicted) value, while the MICE algorithm always adds noise;

Your procedure uses linear regression, whereas the MICE algorithm is open to any type of imputation model;

Your procedure uses different convergence criteria.

These differences have profound methodological implications. Advertising your procedure as “MICE” will create confusion among analysts, who might be led to believe that they are doing MICE when in fact they are not.

Your procedure is an implementation of Buck’s method published in 1960 (described in more detail in Little & Rubin 2002), so I would suggest that you could perhaps rename to “buck”?

With regards,
Stef van Buuren

Priority: Critical

opened by stefvanbuuren 3

Updated fast_knn.py to avoid division by 0

The fast_knn function is numerically unstable when Shepard's weight function is used and a distance is 0. I added a small constant to the distances in fast_knn to avoid division by 0.
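The instability can be reproduced in isolation. The sketch below is not the library's code: `shepard_weights` and its `eps` parameter are illustrative names, and the constant 1e-3 is the one used in the snippet quoted in a later issue.

```python
import numpy as np

def shepard_weights(distances, power=2, eps=0.0):
    """Inverse distance weights, normalised to sum to 1."""
    d = np.asarray(distances, dtype=float) + eps
    inverse = 1.0 / np.power(d, power)
    return inverse / np.sum(inverse)

# A zero distance gives 1/0 = inf, and inf/inf = nan after normalisation.
with np.errstate(divide="ignore", invalid="ignore"):
    unstable = shepard_weights([0.0, 1.0, 2.0])

# Shifting every distance by a small constant keeps the weights finite.
stable = shepard_weights([0.0, 1.0, 2.0], eps=1e-3)
print(unstable, stable)
```

The shifted version still gives almost all of the weight to the zero-distance neighbour, which is the intended behaviour of inverse distance weighting.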

opened by tahmidmehdi 2

Which is the similarity function in KNN imputation?

Hi!

I know that there are many different functions for calculating the similarity between incomplete vectors, but which one is implemented in the package? I have not found any specification or citation for KNN in the documentation.

opened by aldoconcha 2

Different results on using function fast_knn and the function's content

I was trying to understand how the function fast_knn works, so I executed it line by line. Here it is:

import numpy as np
from scipy.spatial import KDTree
from impyute import mean  # initial imputation function used below
def shepards(distances, power=2):
    return to_percentage(1/np.power(distances, power))

def to_percentage(vec):
    return vec/np.sum(vec)

data_temp = np.arange(25).reshape((5, 5)).astype(float)  # np.float was removed in NumPy 1.24
data_temp[0][2] =  np.nan
k=4
eps=0
p=2
distance_upper_bound=np.inf
leafsize=10
idw_fn=shepards
init_impute_fn=mean

nan_xy = np.argwhere(np.isnan(data_temp))
data_temp_c = init_impute_fn(data_temp)
kdtree = KDTree(data_temp_c, leafsize=leafsize)
for x_i, y_i in nan_xy:
    distances, indices = kdtree.query(data_temp_c[x_i], k=k+1, eps=eps,
                                      p=p, distance_upper_bound=distance_upper_bound)
    # Will always return itself in the first index. Delete it.
    distances, indices = distances[1:], indices[1:]
    # Add small constant to distances to avoid division by 0
    distances += 1e-3
    weights = idw_fn(distances)
    # Assign missing value the weighted average of `k` nearest neighbours
    data_temp[x_i][y_i] = np.dot(weights, [data_temp_c[ind][y_i] for ind in indices])
data_temp

This outputs:

array([[ 0.        ,  1.        , 10.06569379,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])

whereas calling the function directly gives a different output. The code:

data_temp = np.arange(25).reshape((5, 5)).astype(float)
data_temp[0][2] =  np.nan
fast_knn(data_temp, k=4)

and the output

array([[ 0.        ,  1.        , 16.78451885,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])

opened by aadarshsingh191198 1

Enhance locf to (a) optionally allow entire row/column to be NaN (b) optionally not perform look forward

(a) Real-world data can occasionally have all data for a specific row/column missing. (b) In processing time series, we know only about the past, not the future.

Example call: impyute.imputation.ts.locf(p000008, axis=1, entire_set_nan_ok=True, no_look_forward=True)

Example code after modification - apologies, I've not done pull requests before :-)

import numpy as np
from impyute.ops import matrix
from impyute.ops import wrapper
from impyute.ops import error

@wrapper.wrappers
@wrapper.checks
def locf(data, axis=0, no_look_forward=False, entire_set_nan_ok=False):
    """ Last Observation Carried Forward

    For each set of missing indices, use the value of one row before(same
    column). In the case that the missing value is the first row, look one
    row ahead instead. If this next row is also NaN, look to the next row.
    Repeat until you find a row in this column that's not NaN. All the rows
    before will be filled with this value.

    Parameters
    ----------
    data: numpy.ndarray
        Data to impute.
    axis: boolean (optional)
        0 if time series is in row format (Ex. data[0][:] is 1st data point).
        1 if time series is in col format (Ex. data[:][0] is 1st data point).
    no_look_forward: boolean (optional). Default=False
        False  if NaN in first row, try to impute by looking ahead in next row.
        True   do not impute in first row, even if NaN is present there.
                    Result may contain NaN in first row.
    entire_set_nan_ok: boolean (optional). Default=False
        False  if entire column is NaN, raise exception.
        True   if entire column is NaN, ignore.
                    Result may contain NaN in entire column.

    Returns
    -------
    numpy.ndarray
        Imputed data.

    """
    if axis == 0:
        data = np.transpose(data)
    elif axis == 1:
        pass
    else:
        raise error.BadInputError("Error: Axis value is invalid, please use either 0 (row format) or 1 (column format)")

    nan_xy = matrix.nan_indices(data)
    # print(nan_xy)
    for x_i, y_i in nan_xy:
        # no_look_forward=True means do not impute first set with values from farther down
        # meant to be used in situations where index is Time, so we would not know what will happen in the future
        # Simplest scenario, look one row back
        # print(f'{x_i}', end=' ')
        if x_i-1 > -1:
            data[x_i][y_i] = data[x_i-1][y_i]

        # Look n rows forward
        elif not no_look_forward:
            x_residuals = np.shape(data)[0]-x_i-1  # n datapoints left
            val_found = False
            for i in range(1, x_residuals):
                if not np.isnan(data[x_i+i][y_i]):
                    val_found = True
                    break
            if val_found:
                # pylint: disable=undefined-loop-variable
                for x_nan in range(i):
                    data[x_i+x_nan][y_i] = data[x_i+i][y_i]
        else:
            if entire_set_nan_ok:
                pass
            else:
                raise Exception("Error: Entire Column is NaN")
    return data
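For comparison, the "past only" behaviour requested in (b) can be sketched without any impyute internals. The function name `locf_past_only` is hypothetical, and the sketch assumes that rows are time steps and columns are variables, which may differ from the library's axis convention.

```python
import numpy as np

def locf_past_only(data):
    """Carry the last observation forward down each column.

    Leading NaNs (and columns that are entirely NaN) are left as NaN,
    because filling them would require looking into the future.
    """
    data = np.array(data, dtype=float)  # work on a copy
    for col in range(data.shape[1]):
        last = np.nan
        for row in range(data.shape[0]):
            if np.isnan(data[row, col]):
                data[row, col] = last
            else:
                last = data[row, col]
    return data

example = np.array([[np.nan, 1.0, np.nan],
                    [2.0, np.nan, np.nan],
                    [np.nan, 3.0, np.nan]])
filled = locf_past_only(example)
print(filled)
```

In this example the first column keeps its leading NaN, the second column is filled from earlier rows, and the all-NaN third column is returned unchanged instead of raising.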

opened by gkovaig 1

Ddfg add randc function
This pull request is for addressing #67

Add a function randc() to randomly generate a data frame of categorical data, whose values are alphabetic characters. Extra character combinations are generated once the 26 characters are used up. (If numbers are desired, just leave a comment and I can update it.)

Update the Corruptor class to accept an extra attribute dtype with default value np.float, so the Corruptor class can generate datasets in other dtypes, like np.string.

Add test cases for the randc() function: one for the BadInputError, a second to check that the dataset has the desired number of categories, and a third to check that the dataset has the desired shape.
opened by xyz8983 1

fast_knn, moving_window and locf are returning data without imputation for univariate time series

Data looks as below

tsNH4_na.head()

| index | ds | y |
| :--- | :---: | :---: |
| 2010-11-30 16:10:00 | 2010-11-30 16:10:00 | 13.714667 |
| 2010-11-30 16:20:00 | 2010-11-30 16:20:00 | NaN |
| 2010-11-30 16:30:00 | 2010-11-30 16:30:00 | 14.630500 |
| 2010-11-30 16:40:00 | 2010-11-30 16:40:00 | 16.385333 |
| 2010-11-30 16:50:00 | 2010-11-30 16:50:00 | 15.992667 |

Including ds raises BadInputError: Data is not float, so I tried with the single variable y only.

np.isnan(impy.imputation.ts.moving_window(np.array(tsNH4_na[["y"]]),func = np.mean,errors='raise',nindex=0,wsize=10)).sum()

833

imput = impy.fast_knn(tsNH4_na[['y']], k=2)
np.isnan(imput).sum()

833

Unimputed data also have 833 missing points.

opened by kumarh22 1

fast_knn: the nearest neighbor gets the lowest weight

Hi Elton,

Thank you for implementing this library, it's so convenient! I found your library from the link below. https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

When I was using fast_knn, I found that nearer neighbors get lower weights when the weighted average of the k nearest neighbors is computed.

In the example you provided,

fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0.        ,  1.        , 10.08608891,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
       [10.        , 11.        , 12.        , 13.        , 14.        ],
       [15.        , 16.        , 17.        , 18.        , 19.        ],
       [20.        , 21.        , 22.        , 23.        , 24.        ]])

In this example, 10.086 is imputed by the kNN algorithm. We get the 2 nearest neighbors using Euclidean distance. Treating the first row as a "point", the nearest neighbor is the second "point" (the second row), and the second nearest neighbor is the third "point" (the third row). The distance between the first point and the second point (the nearest neighbor) is 12.5; the distance between the first point and the third point (the second nearest neighbor) is 20.156. So this is how 10.086 comes about:

10.086 = 7 * 12.5/(12.5 + 20.156) + 12 * 20.156/(12.5 + 20.156)

The weight for each point is proportional to its distance, so the nearer the point, the smaller the distance and the lower the weight, which is the opposite of what it should be.

In a nutshell, I believe the nearest neighbor should have the highest weight, in this example, the imputed value should be close to 7 instead of 12 (the average of 7 and 12 is 9.5 for reference).
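The arithmetic above can be checked directly. The sketch below uses only the two neighbour values and distances quoted in this issue. The inverse-distance variant is one possible correction (inverse distance with power 1), chosen here for illustration, and is not necessarily what the library intends.

```python
import numpy as np

neighbour_values = np.array([7.0, 12.0])   # column value in rows 2 and 3
distances = np.array([12.5, 20.156])       # distances quoted in the issue

# Weights proportional to distance: the farther neighbour dominates.
proportional = distances / distances.sum()
reported = float(np.dot(proportional, neighbour_values))

# Weights proportional to 1/distance: the nearer neighbour dominates.
inverse = (1.0 / distances) / (1.0 / distances).sum()
expected = float(np.dot(inverse, neighbour_values))

print(round(reported, 3), round(expected, 3))
```

The distance-proportional weights reproduce the reported 10.086, while the inverse-distance weights give a value below the plain average of 9.5, that is, closer to the nearest neighbour's 7.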

Thanks. Best, Minjie

opened by MinjieSh 1

#43, Multivariate Imputation by Chained Equations is going to return only mean value of the Column

Modified the code as per the comments. Thanks for the feedback; it was a nice learning experience. Attached the unit test document for reference. Feedback is always welcome.

opened by PavanTejaDokku 1

Parsing requirements error from upgrade to pip 10

Running make test during the installation of impyute errors out:

...
Step 8/10 : WORKDIR /impyute
Removing intermediate container 59c6df997518
 ---> 457da5768f03
Step 9/10 : RUN pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .
 ---> Running in 7dd589eb1d87
Obtaining file:///impyute
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/impyute/setup.py", line 4, in <module>
        from pip.req import parse_requirements
    ImportError: No module named req
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /impyute/
The command '/bin/sh -c pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .' returned a non-zero code: 1
make: *** [test] Error 1
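The traceback points at `from pip.req import parse_requirements`, an internal pip module that is no longer importable as of pip 10. One common workaround is to read the requirements file directly in setup.py. The sketch below assumes a plain `requirements.txt` with one requirement per line; the helper name `read_requirements` and the file contents are illustrative and are not the project's actual fix.

```python
import os
import tempfile

def read_requirements(path):
    """Return requirement strings, skipping blank lines and comments.

    This simple reader does not handle pip options such as `-r` or `-e`.
    """
    with open(path) as handle:
        lines = [line.strip() for line in handle]
    return [line for line in lines if line and not line.startswith("#")]

# Demonstration with a temporary file standing in for requirements.txt.
with tempfile.TemporaryDirectory() as tmp:
    req_path = os.path.join(tmp, "requirements.txt")
    with open(req_path, "w") as handle:
        handle.write("# core dependencies\nnumpy\nscipy\n\n")
    install_requires = read_requirements(req_path)

print(install_requires)
```

The resulting list could then be passed as `install_requires` to `setup()`, which avoids importing anything from pip.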

opened by eltonlaw 1

About em problem

I randomly created NaNs in my data set, and I wanted to compare the performance of the EM methods in SPSS and impyute. I got:

spss_em
MSE_spss: 22.177916455492653
r_spss: 0.721709731654166

impyute_em
MSE_impyute: 289.1830722478248
r_impyute: 0.002467765572835078

The EM from impyute does not seem to work very well, and I do not know why.

opened by ROOKLO 3

suggestion: split between train/test set, allow training and loading of imputation statistics models

In research, scientific integrity plays a very important part. One can publish very good papers by playing tricks between the train and test sets to get good results, but such results can never be applied in real life, because those tricks simply do not work in real-life applications.

Thank you very much for creating a wonderful framework for missing value imputation! However, your framework does not provide a way to apply imputation statistics trained on one dataset to another dataset. I would greatly appreciate it if you could add this.

For backward compatibility, you can create an optional kwarg called model for every function such as impy.mean, impy.mode, etc. When calling the function, model=None by default; if you pass model=True, the function returns a tuple consisting of both the imputed data and the imputation statistics object; if you pass model=<imputation-statistics-object>, the function applies the trained imputation statistics to impute the data. In that way, existing code is not affected.
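A rough sketch of how the proposed `model` kwarg could behave for mean imputation. This is not impyute's API: `mean_impute` is a hypothetical stand-in, and the "imputation statistics object" is assumed here to be a plain array of column means.

```python
import numpy as np

def mean_impute(data, model=None):
    """Column-mean imputation with an optional, reusable statistics object.

    model=None    -> return imputed data only (current behaviour)
    model=True    -> return (imputed data, column means)
    model=<array> -> impute using the supplied column means
    """
    data = np.array(data, dtype=float)  # work on a copy
    if model is None or model is True:
        stats = np.nanmean(data, axis=0)
    else:
        stats = np.asarray(model, dtype=float)
    rows, cols = np.where(np.isnan(data))
    data[rows, cols] = stats[cols]
    if model is True:
        return data, stats
    return data

train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
test = np.array([[np.nan, 5.0], [7.0, np.nan]])

train_imputed, column_means = mean_impute(train, model=True)
test_imputed = mean_impute(test, model=column_means)
print(column_means, test_imputed)
```

Here the test set is filled with the training means (2.0 and 20.0) rather than its own, which is the train/test separation the suggestion asks for.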

opened by xuancong84 3

Definition of Mice method

Hi,

I was reading the Impyute documentation at this link. The documentation says that the mice method is defined in impyute.imputations.cs.mice, but I cannot find the method there. I want to check the various variables used in the method. Could you please direct me to the method definition?

opened by loneharoon 1

Data type checks

Hi!

The check linked below makes sure the input data is of type np.float, but it fails when the input data is np.float32 or np.float64.

https://github.com/eltonlaw/impyute/blob/aadda08d8b221d7b6e2f387051bc2a3903e1b0b8/impyute/util/checks.py#L52

Would you rather cast to np.float32 by default?
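One way to accept every floating-point width without casting is to test the dtype against NumPy's abstract `np.floating` type. The helper name `is_float_array` is hypothetical; this is a sketch of the idea, not the check the library uses.

```python
import numpy as np

def is_float_array(data):
    """True for float16/float32/float64 arrays, False for other dtypes."""
    return bool(np.issubdtype(np.asarray(data).dtype, np.floating))

results = {
    "float32": is_float_array(np.zeros(3, dtype=np.float32)),
    "float64": is_float_array(np.zeros(3, dtype=np.float64)),
    "int64": is_float_array(np.zeros(3, dtype=np.int64)),
}
print(results)
```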

opened by e3vela 0

Owner

Elton Law

GitHub http://impyute.readthedocs.io/

