Data imputations library to preprocess datasets with missing data

Overview
Build status: https://travis-ci.org/eltonlaw/impyute

Impyute

Impyute is a library of missing-data imputation algorithms. It was designed to be lightweight; here's a sneak peek at what impyute can do.

>>> import numpy as np
>>> n = 5
>>> arr = np.random.uniform(high=6, size=(n, n))
>>> for _ in range(3):
...     arr[np.random.randint(n), np.random.randint(n)] = np.nan
>>> arr
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834,        nan],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036,        nan, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541,        nan, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])
>>> import impyute as impy
>>> impy.mean(arr)
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, 3.7122365],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, 1.99128649, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, 3.08994336, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])

Feature Support

  • Imputation of Cross Sectional Data
    • K-Nearest Neighbours
    • Multivariate Imputation by Chained Equations
    • Expectation Maximization
    • Mean Imputation
    • Mode Imputation
    • Median Imputation
    • Random Imputation
  • Imputation of Time Series Data
    • Last Observation Carried Forward
    • Moving Window
    • Autoregressive Integrated Moving Average (WIP)
  • Diagnostic Tools
    • Loggers
    • Distribution of Null Values
    • Comparison of imputations
    • Little's MCAR Test (WIP)
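
A rough sketch of how a couple of these are called, following the impy.mean pattern above and the impy.imputation.ts namespace that appears in the issues below (a sketch only; exact signatures may vary between versions):

>>> import numpy as np
>>> import impyute as impy
>>> arr = np.array([[1.0, 2.0, np.nan],
...                 [4.0, np.nan, 6.0],
...                 [7.0, 8.0, 9.0]])
>>> filled_knn = impy.fast_knn(arr, k=2)                # cross sectional: k-nearest neighbours
>>> filled_locf = impy.imputation.ts.locf(arr, axis=0)  # time series: last observation carried forward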

Versions

Currently tested on Python 2.7, 3.4, 3.5, 3.6 and 3.7.

Installation

To install impyute, run the following:

$ pip install impyute

Or to get the most current version:

$ git clone https://github.com/eltonlaw/impyute
$ cd impyute
$ python setup.py install

Documentation

Documentation is available here: http://impyute.readthedocs.io/

How to Contribute

Check out CONTRIBUTING

Comments
  • Multivariate Imputation by Chained Equations is going to return only mean value of the Column.

    Issue: In the module impyute.mice we expect the imputed value for each column to come from the linear equation converged for that column. Instead, we are getting only the mean values of the column.

    Reason: The imputation loop is only entered when the condition below is satisfied, but there is a glitch in the condition that causes the loop to be skipped, so the mean-imputed data set is returned straight away.

    Condition failing:

    converged = [False] * len(null_xyv)
    while all(converged):
        ...

    Resolution:

    converged = [False] * len(null_xyv)
    while not all(converged):
        ...
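
    The gist of the bug can be checked in plain Python, independent of the impyute source:

    converged = [False] * 3
    print(all(converged))      # False, so `while all(converged):` never enters the loop body
    print(not all(converged))  # True, so `while not all(converged):` keeps iterating until convergence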

    opened by PavanTejaDokku 5
  • Name change request: mice

    Dear Elton,

    Thanks for your effort to implement an algorithm for imputing multivariate data.

    I’d like to request a name change of your impyute.imputation.cs.mice procedure. The documentation of this procedure says that it implements Multivariate Imputation by Chained Equations (MICE) from my JSS 2011 paper. However, this documentation is not accurate since your procedure does not implement the MICE algorithm. It differs in important respects from my method:

    • Your procedure provides a single imputation, whereas MICE is a procedure for generating multiple imputations;
    • Your procedure imputes the “best” (predicted) value, while the MICE algorithm always adds noise;
    • Your procedure uses linear regression, whereas the MICE algorithm is open to any type of imputation model;
    • Your procedure uses different convergence criteria.

    These differences have profound methodological implications. Advertising your procedure as “MICE” will create confusion among analysts, who might be led to believe that they are doing MICE when in fact they are not.

    Your procedure is an implementation of Buck’s method published in 1960 (described in more detail in Little & Rubin 2002), so I would suggest that you could perhaps rename to “buck”?

    With regards, Stef van Buuren

    Priority: Critical 
    opened by stefvanbuuren 3
  • Updated fast_knn.py to avoid division by 0

    The fast_knn function has stability issues when using Shepard's weight function and some distances are 0. In the fast_knn function, I added a small constant to the distances to avoid division by 0.
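
    The idea of the fix, sketched outside the library (the 1e-3 constant mirrors the walkthrough in the issue below; the exact value used inside fast_knn is an assumption):

    import numpy as np

    def shepards(distances, power=2):
        # Inverse-distance (Shepard's) weights, normalised to sum to 1
        weights = 1 / np.power(distances, power)
        return weights / weights.sum()

    distances = np.array([0.0, 1.2, 3.4])
    # shepards(distances) divides by zero for the first entry;
    # adding a small constant keeps the weights finite
    print(shepards(distances + 1e-3))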

    opened by tahmidmehdi 2
  • Which is the similarity function in KNN imputation?

    Hi!

    I know that there are many different functions to calculate similarity between incomplete vectors, but which one is implemented in the package? I have not found any specification or citation about KNN in the documentation.

    opened by aldoconcha 2
  • Different results on using function fast_knn and the function's content

    I was trying to understand how the function fast_knn works, so I executed it line by line. Here it is:

    import numpy as np
    from scipy.spatial import KDTree
    from impyute import mean  # initial fill function used below

    def shepards(distances, power=2):
        return to_percentage(1/np.power(distances, power))
    
    def to_percentage(vec):
        return vec/np.sum(vec)
    
    data_temp = np.arange(25).reshape((5, 5)).astype(float)
    data_temp[0][2] =  np.nan
    k=4
    eps=0
    p=2
    distance_upper_bound=np.inf
    leafsize=10
    idw_fn=shepards
    init_impute_fn=mean
    
    nan_xy = np.argwhere(np.isnan(data_temp))
    data_temp_c = init_impute_fn(data_temp)
    kdtree = KDTree(data_temp_c, leafsize=leafsize)
    for x_i, y_i in nan_xy:
        distances, indices = kdtree.query(data_temp_c[x_i], k=k+1, eps=eps,
                                          p=p, distance_upper_bound=distance_upper_bound)
        # Will always return itself in the first index. Delete it.
        distances, indices = distances[1:], indices[1:]
        # Add small constant to distances to avoid division by 0
        distances += 1e-3
        weights = idw_fn(distances)
        # Assign missing value the weighted average of `k` nearest neighbours
        data_temp[x_i][y_i] = np.dot(weights, [data_temp_c[ind][y_i] for ind in indices])
    data_temp
    

    This outputs:

    array([[ 0.        ,  1.        , 10.06569379,  3.        ,  4.        ],
           [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
           [10.        , 11.        , 12.        , 13.        , 14.        ],
           [15.        , 16.        , 17.        , 18.        , 19.        ],
           [20.        , 21.        , 22.        , 23.        , 24.        ]])
    

    whereas the function itself gives a different output. The code:

    from impyute import fast_knn

    data_temp = np.arange(25).reshape((5, 5)).astype(float)
    data_temp[0][2] = np.nan
    fast_knn(data_temp, k=4)
    

    and the output

    array([[ 0.        ,  1.        , 16.78451885,  3.        ,  4.        ],
           [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
           [10.        , 11.        , 12.        , 13.        , 14.        ],
           [15.        , 16.        , 17.        , 18.        , 19.        ],
           [20.        , 21.        , 22.        , 23.        , 24.        ]])
    
    opened by aadarshsingh191198 1
  • Enhance locf to (a) optionally allow entire row/column to be NaN (b) optionally not perform look forward

    (a) Real-world data can occasionally have all data for a specific row/column missing. (b) In processing time series, we know only about the past, not the future.

    Example call: impyute.imputation.ts.locf(p000008, axis=1, entire_set_nan_ok=True, no_look_forward=True)

    Example code after modification - apologies, I've not done pull requests before :-)

    import numpy as np
    from impyute.ops import matrix
    from impyute.ops import wrapper
    from impyute.ops import error

    @wrapper.wrappers
    @wrapper.checks
    def locf(data, axis=0, no_look_forward=False, entire_set_nan_ok=False):
    """ Last Observation Carried Forward

    For each set of missing indices, use the value of one row before(same
    column). In the case that the missing value is the first row, look one
    row ahead instead. If this next row is also NaN, look to the next row.
    Repeat until you find a row in this column that's not NaN. All the rows
    before will be filled with this value.
    
    Parameters
    ----------
    data: numpy.ndarray
        Data to impute.
    axis: boolean (optional)
        0 if time series is in row format (Ex. data[0][:] is 1st data point).
        1 if time series is in col format (Ex. data[:][0] is 1st data point).
    no_look_forward boolean (optional). Default=False
        False  if NaN in first row, try to impute by looking ahead in next row.
        True   do not impute in first row, even if NaN is present there.
                    Result may contain NaN in first row.
    entire_set_nan_ok boolean (optional) Default=False
        False  if entire column is NaN, raise exception.
        True   if entire column is NaN, ignore.
                    Result may contain NaN in entire column.
    
    Returns
    -------
    numpy.ndarray
        Imputed data.
    
    """
    if axis == 0:
        data = np.transpose(data)
    elif axis == 1:
        pass
    else:
        raise error.BadInputError("Error: Axis value is invalid, please use either 0 (row format) or 1 (column format)")
    
    nan_xy = matrix.nan_indices(data)
    # print(nan_xy)
    for x_i, y_i in nan_xy:
        # no_look_forward=True means do not impute the first set with values from farther down;
        # meant for situations where the index is time, so we would not know what happens in the future
        # Simplest scenario, look one row back
        # print(f'{x_i}', end=' ')
        if x_i-1 > -1:
            data[x_i][y_i] = data[x_i-1][y_i]
    
        # Look n rows forward
        elif not no_look_forward:
            x_residuals = np.shape(data)[0]-x_i-1  # n datapoints left
            val_found = False
            for i in range(1, x_residuals):
                if not np.isnan(data[x_i+i][y_i]):
                    val_found = True
                    break
            if val_found:
                # pylint: disable=undefined-loop-variable
                for x_nan in range(i):
                    data[x_i+x_nan][y_i] = data[x_i+i][y_i]
        else:
            if entire_set_nan_ok:
                pass
            else:
                raise Exception("Error: Entire Column is NaN")
    return data
    
    opened by gkovaig 1
  • Ddfg add randc function

    This pull request is for addressing #67

    1. Add a function randc() to randomly generate a data frame with categorical data, consisting of alphabetic characters. Extra character combinations are generated when the 26 letters are used up. (If numbers are desired, just leave a comment and I can update it.) A rough sketch follows below.
    2. Update the Corruptor class to accept an extra attribute dtype with the default value np.float, so the Corruptor class can generate datasets of other dtypes, like np.string.
    3. Add test cases for the randc() function: one for the BadInputError test, a second for testing that the number of categories in the dataset is as desired, and a third for testing that the shape of the dataset is as desired.
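
    A rough sketch of what such a generator might look like (hypothetical names and behaviour; the actual PR may differ):

    import itertools
    import string
    import numpy as np

    def randc(nlevels=5, shape=(5, 5)):
        """Random array of categorical labels: single letters first,
        then letter combinations once the 26 characters are used up."""
        letters = string.ascii_uppercase
        labels, length = [], 1
        while len(labels) < nlevels:
            labels += ["".join(combo) for combo in itertools.product(letters, repeat=length)]
            length += 1
        return np.random.choice(labels[:nlevels], size=shape)

    print(randc(nlevels=30, shape=(3, 3)))
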
    opened by xyz8983 1
  • fast_knn, moving_window and locf are returning data without imputation for univariate time series

    The data looks as below:

    tsNH4_na.head()

    | index               | ds                  | y         |
    | :------------------ | :-----------------: | :-------: |
    | 2010-11-30 16:10:00 | 2010-11-30 16:10:00 | 13.714667 |
    | 2010-11-30 16:20:00 | 2010-11-30 16:20:00 | NaN       |
    | 2010-11-30 16:30:00 | 2010-11-30 16:30:00 | 14.630500 |
    | 2010-11-30 16:40:00 | 2010-11-30 16:40:00 | 16.385333 |
    | 2010-11-30 16:50:00 | 2010-11-30 16:50:00 | 15.992667 |

    Including ds gives the error BadInputError: Data is not float, so I tried with just the single variable y.

    np.isnan(impy.imputation.ts.moving_window(np.array(tsNH4_na[["y"]]), func=np.mean, errors='raise', nindex=0, wsize=10)).sum()

    833

    imput = impy.fast_knn(tsNH4_na[['y']], k=2)
    np.isnan(imput).sum()

    833

    The unimputed data also has 833 missing points.

    opened by kumarh22 1
  • fast_knn: the nearest neighbor gets the lowest weight

    Hi Elton,

    Thank you for implementing this library, it's so convenient! I found your library from the link below. https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

    When I was using fast_knn, I found that nearer neighbors get lower weights when computing the weighted average of the k nearest neighbors.

    In the example you provided,

    fast_knn(data, k=2)  # Weighted average of nearest 2 neighbours
    array([[ 0.        ,  1.        , 10.08608891,  3.        ,  4.        ],
           [ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ],
           [10.        , 11.        , 12.        , 13.        , 14.        ],
           [15.        , 16.        , 17.        , 18.        , 19.        ],
           [20.        , 21.        , 22.        , 23.        , 24.        ]])

    In this example, 10.086 is imputed according to the kNN algorithm. We get the 2 nearest neighbors using Euclidean distance; treating the first row as a "point", the nearest neighbor is the second "point" (second row), and the second nearest neighbor is the third "point" (third row). The distance between the first point and the second point (nearest neighbor) is 12.5, and the distance between the first point and the third point (second nearest neighbor) is 20.156. So this is where 10.086 comes from: 10.086 = 7 * 12.5/(12.5 + 20.156) + 12 * 20.156/(12.5 + 20.156). The weight for each point is calculated from its distance, so the nearer the point, the smaller the distance and the lower the weight, which should be the opposite.

    In a nutshell, I believe the nearest neighbor should have the highest weight, in this example, the imputed value should be close to 7 instead of 12 (the average of 7 and 12 is 9.5 for reference).

    Thanks. Best, Minjie
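
    A quick numeric check of that argument (plain numpy, independent of impyute):

    import numpy as np

    values = np.array([7.0, 12.0])        # column values of the 2 nearest neighbours
    distances = np.array([12.5, 20.156])  # distances quoted in the example

    # Weights proportional to distance (what the current output implies)
    w_prop = distances / distances.sum()
    print(np.dot(w_prop, values))         # ~10.09, pulled towards the farther neighbour

    # Inverse-distance (Shepard's) weights: the nearer neighbour counts more
    w_inv = (1 / distances) / (1 / distances).sum()
    print(np.dot(w_inv, values))          # ~8.92, pulled towards the nearer neighbour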

    opened by MinjieSh 1
  • #43,  Multivariate Imputation by Chained Equations is going to return only mean value of the Column

    Modified the code as per the comments. Thanks for the feedback, it was a nice learning experience. I attached the unit test document for reference; feedback is always welcome.

    opened by PavanTejaDokku 1
  • Parsing requirements error from upgrade to pip 10

    Running make test during the installation of impyute errors out:

    ...
    Step 8/10 : WORKDIR /impyute
    Removing intermediate container 59c6df997518
     ---> 457da5768f03
    Step 9/10 : RUN pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .
     ---> Running in 7dd589eb1d87
    Obtaining file:///impyute
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/impyute/setup.py", line 4, in <module>
            from pip.req import parse_requirements
        ImportError: No module named req
        
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /impyute/
    The command '/bin/sh -c pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .' returned a non-zero code: 1
    make: *** [test] Error 1
    
    opened by eltonlaw 1
  • About em problem

    I created NaNs in my data set randomly, and I want to compare the performance of the EM methods in SPSS and impyute. I got:

    spss_em: MSE_spss: 22.177916455492653, r_spss: 0.721709731654166
    impyute_em: MSE_impyute: 289.1830722478248, r_impyute: 0.002467765572835078

    The EM from impyute does not seem to work very well, and I do not know why.

    opened by ROOKLO 3
  • suggestion: split between train/test set, allow training and loading of imputation statistics models

    In research, scientific integrity plays a very important part. One can publish very good papers by playing tricks between the train and test set in order to get good results, but such results can never be applied in real life, because those tricks simply do not work in real-life applications.

    Thank you very much for creating a wonderful framework for missing value imputation! However, your framework does not provide a way to apply imputation statistics trained on one dataset to another dataset. I would greatly appreciate it if you could add this.

    For backward compatibility, you could add an optional kwarg called model to every function such as impy.mean, impy.mode, etc. By default model=None; if you pass model=True, the function returns a tuple containing both the imputed data and the imputation-statistics object; if you pass model=<imputation-statistics-object>, the function applies the trained imputation statistics to impute the data. That way, all existing code is unaffected. (A rough sketch of this interface follows below.)
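
    A minimal sketch of that interface, using a standalone mean imputer for illustration (mean_with_model and its model handling are hypothetical, not part of impyute):

    import numpy as np

    def mean_with_model(data, model=None):
        """Column-mean imputation with an optional train/apply split (illustrative only)."""
        data = data.copy()
        if model is None or model is True:
            col_means = np.nanmean(data, axis=0)   # fit imputation statistics on this dataset
        else:
            col_means = model                      # reuse statistics fitted on another dataset
        nan_rows, nan_cols = np.where(np.isnan(data))
        data[nan_rows, nan_cols] = col_means[nan_cols]
        return (data, col_means) if model is True else data

    train = np.array([[1.0, np.nan], [3.0, 4.0]])
    test = np.array([[np.nan, 10.0]])
    train_imputed, stats = mean_with_model(train, model=True)  # fit on the training set
    test_imputed = mean_with_model(test, model=stats)          # apply the same statistics to the test set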

    opened by xuancong84 3
  • Definition of Mice method

    Hi,

    I was reading the Impyute documentation at this link. The documentation mentions that the mice method is defined in impyute.imputations.cs.mice, but I can't find the method in the mentioned directory. I want to check the various variables used in the method. Could you please direct me to the method definition?

    opened by loneharoon 1
  • Data type checks

    Hi!

    This check makes sure the input data is of type np.float, but it fails when the input data is np.float32 or np.float64.

    https://github.com/eltonlaw/impyute/blob/aadda08d8b221d7b6e2f387051bc2a3903e1b0b8/impyute/util/checks.py#L52

    Would you rather cast to np.float32 by default?
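
    One possible relaxation of that check, as a sketch (not the current impyute code):

    import numpy as np

    def check_float_dtype(data):
        # Accept any floating dtype (float16/32/64) rather than requiring one exact type
        if not np.issubdtype(data.dtype, np.floating):
            raise ValueError("Data is not float")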

    opened by e3vela 0