DataPrep — The easiest way to prepare data in Python

SFU Database Group

Last update: Dec 27, 2022

Overview

Documentation | Discord | Forum

DataPrep lets you prepare your data using a single library with a few lines of code.

Currently, you can use DataPrep to:

Collect data from common data sources (through dataprep.connector)
Do your exploratory data analysis (through dataprep.eda)
Clean and standardize data (through dataprep.clean)
...more modules are coming

Releases

Repo	Version	Downloads
PyPI
conda-forge

Installation

pip install -U dataprep

EDA

DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

Create Profile Reports, Fast

You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

10X Faster: DataPrep.EDA can be 10X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()

Click here to see the generated report of the above code.

Click here to see the benchmark result.

Try DataPrep.EDA Online: DataPrep.EDA Demo in Colab

Innovative System Design

DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.

Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularity with a single function call. All needed visualizations will be automatically and intelligently generated for you.
Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
How-to Guide: A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.

Learn DataPrep.EDA in 2 minutes:

Click here to check all the supported tasks.

Check plot, plot_correlation, plot_missing and create_report to see how each function works.

Clean

DataPrep.Clean contains simple functions designed for cleaning and validating data in a DataFrame. It provides

A Unified API: each function follows the syntax clean_{type}(df, 'column name') (see an example below).
Speed: the computations are parallelized using Dask. It can clean 50K rows per second on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).
Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning.
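
The throughput figure above converts to wall-clock time with simple arithmetic. The sketch below only restates that calculation. The function name `estimated_seconds` is illustrative, and the 50K rows per second rate is the figure quoted above, not a new measurement.

```python
def estimated_seconds(n_rows: int, rows_per_second: int = 50_000) -> float:
    """Estimate cleaning time from a fixed throughput (illustrative only)."""
    return n_rows / rows_per_second


print(estimated_seconds(1_000_000))  # 20.0
```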

The following example shows how to clean and standardize a column of country names.

from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3              tr          Turkey
4               NA            NaN

Type validation is also supported:

from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool

Check clean_headers, clean_country, clean_date, clean_duplication, clean_email, clean_lat_long, clean_ip, clean_phone, clean_text, clean_url, clean_address and clean_df to see how each function works.

Connector

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!

Connector offers several benefits:

A unified API: You can fetch data from dozens of popular websites with one or two lines of code.
Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.
Speed: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.
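
To make the pagination and concurrency ideas concrete, here is a minimal standard-library sketch of the general technique: split a requested result count into pages and keep only a bounded number of requests in flight. It is not Connector's implementation. `fetch_page`, `fetch_all`, and `PAGE_SIZE` are hypothetical names, the "API" is simulated in memory, and rate limiting is not modeled.

```python
import asyncio
from typing import Dict, List

PAGE_SIZE = 100  # hypothetical per-request limit of the imaginary API


async def fetch_page(offset: int, limit: int) -> List[Dict[str, int]]:
    """Stand-in for one HTTP request; returns `limit` fake records."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [{"id": i} for i in range(offset, offset + limit)]


async def fetch_all(count: int, concurrency: int) -> List[Dict[str, int]]:
    """Fetch `count` records page by page with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(offset: int) -> List[Dict[str, int]]:
        limit = min(PAGE_SIZE, count - offset)  # the last page may be shorter
        async with semaphore:
            return await fetch_page(offset, limit)

    pages = await asyncio.gather(*(bounded(o) for o in range(0, count, PAGE_SIZE)))
    # gather preserves the order of its arguments, so records stay sorted by offset
    return [record for page in pages for record in page]


records = asyncio.run(fetch_all(count=250, concurrency=5))
print(len(records))  # 250
```

The caller states only how many results are wanted and how many requests may run at once, which mirrors the role the _count and _concurrency arguments play in the description above.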

How to fetch all publications of Andrew Y. Ng?

from dataprep.connector import connect
conn_dblp = connect("dblp", _concurrency = 5)
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)

Here, you can find detailed Examples.

Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.

Documentation

The following documentation can give you an impression of what DataPrep can do:

Contribute

There are many ways to contribute to DataPrep.

Submit bugs and help us verify fixes as they are checked in.
Review the source code changes.
Engage with other DataPrep users and developers on StackOverflow.
Help each other in the DataPrep Community Discord and Forum.
Contribute bug fixes.
Provide use cases and write down your user experience.

Please take a look at our wiki for development documentation!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

Pandas Profiling

Inspired the report functionality and insights provided in dataprep.eda.
missingno

Inspired the missing value analysis in dataprep.eda.

Citing DataPrep

If you use DataPrep, please consider citing the following paper:

Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.

BibTeX entry:

@inproceedings{dataprepeda2021,
  author    = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},
  title     = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},
  booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},
  year      = {2021}
}

Comments

plot(df, x) descriptive statistics

Task: add the column statistics from pandas-profiling to plot(df, x).

My proposal is to add another tab with the statistics like we have a tab for each visualization. Or is it important to see the statistics beside a visualization? I think if we add a tab it could be created with Bokeh Div, (or Paragraph or PreText) which allows formatted text (example), but there are perhaps other possibilities.

Do we also want to add full dataset statistics like pandas-profiling? Perhaps we could add these at the top of the plot(df) output.

I think it's important to consider the time required to compute the statistics. We can consider dask.compute(statistics), but should verify this is the most efficient approach. It would be ideal to perform minimal passes over the dataset in order to compute all the statistics.
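
As an illustration of the "minimal passes" idea, the sketch below computes several descriptive statistics in a single pass using Welford's online algorithm for the variance. It is a standard-library example of the general technique, not a proposal for how dataprep or Dask should implement it. The function name `one_pass_stats` is hypothetical.

```python
import math
from typing import Dict, Iterable


def one_pass_stats(values: Iterable[float]) -> Dict[str, float]:
    """Compute count, min, max, mean and sample std in a single pass (Welford's method)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lo, hi = math.inf, -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        lo, hi = min(lo, x), max(hi, x)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else float("nan")
    return {"count": count, "min": lo, "max": hi, "mean": mean, "std": std}


stats = one_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats)
```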
type: enhancement module: EDA

opened by brandonlockhart 24
feat(eda): add stat. in plot_missing
Description

Fixes #367 - EDA.plot_missing: enrich with stat. I implemented it only for the plot_missing(df) case.

How Has This Been Tested?

manually

Snapshots:

Checklist:

[x] My code follows the style guidelines of this project

[x] I have already squashed the commits and make the commit message conform to the project standard.

[x] I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.

[x] I have performed a self-review of my own code

[ ] I have commented my code, particularly in hard-to-understand areas

[ ] I have made corresponding changes to the documentation

[ ] My changes generate no new warnings

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] New and existing unit tests pass locally with my changes

[x] Any dependent changes have been merged and published in downstream modules
opened by yuzhenmao 19
support of time series
This issue is about the rough idea to support time series in dataprep.eda.

Essentially, datetime can be regarded as a numeric type, since it can be converted to a timestamp (float) via datetime.timestamp() or pd.to_numeric(). Hence, we could do the following as initial support for time series.

1. Identify the columns with datetime64 type in the dataframe.

2. plot(df) & plot(df, x): handle a time series column like a numeric column, which can be binned. When showing the ticks of a time series column, show the datetime string via a function like datetime.strftime(). An example output is https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html

3. plot(df, x, y): when x is a datetime column and y is a numeric column, replace the scatter plot with a line chart, which shows how y changes with x. For all other cases, apply the processing in step 2.

4. plot_correlation: we could ignore the datetime column as pandas does, or transform the datetime column to a numeric column via pd.to_numeric() and then apply the original processing.

5. plot_missing: apply processing similar to step 2.
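
The binning idea for datetime columns can be sketched with the standard library alone: convert each value to a float timestamp, bin as for any numeric column, then label the bins with formatted datetime strings. This is only an illustration of the proposal, not dataprep code. The function name `bin_datetimes` and the sample dates are made up, and UTC-aware datetimes are used so the result does not depend on the local time zone.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def bin_datetimes(values: List[datetime], n_bins: int) -> List[Tuple[str, int]]:
    """Histogram datetimes by treating them as float timestamps."""
    stamps = [v.timestamp() for v in values]
    lo, hi = min(stamps), max(stamps)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for s in stamps:
        idx = min(int((s - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    # Label each bin with the datetime string of its left edge
    labels = [
        datetime.fromtimestamp(lo + i * width, tz=timezone.utc).strftime("%Y-%m-%d")
        for i in range(n_bins)
    ]
    return list(zip(labels, counts))


days = [datetime(2020, 1, d, tzinfo=timezone.utc) for d in (1, 2, 3, 8, 9)]
print(bin_datetimes(days, n_bins=2))
# [('2020-01-01', 3), ('2020-01-05', 2)]
```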

type: enhancement module: EDA
opened by jinglinpeng 14
Conda Installation of the dataprep AI is not supported
Conda Installation for the data prep AI is not supported.

$ conda install dataprep
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

dataprep

Current channels:

https://repo.anaconda.com/pkgs/main/linux-64

https://repo.anaconda.com/pkgs/main/noarch

https://repo.anaconda.com/pkgs/r/linux-64

https://repo.anaconda.com/pkgs/r/noarch

https://conda.anaconda.org/conda-forge/linux-64

https://conda.anaconda.org/conda-forge/noarch

To search for alternate channels that may provide the conda package you're looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.
type: enhancement module: EDA
opened by abhisheksundarraman 13

Error concerning scipy.stats.stats when creating a report

Describe the bug I get the following error when I try to run the example of creating a report:

error happended in column:PassengerId
Traceback (most recent call last):
  File "", line 1, in
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/init.py", line 68, in create_report
    "components": format_report(df, cfg, mode, progress),
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 76, in format_report
    comps = format_basic(edaframe, cfg)
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 274, in format_basic
    data, completions = basic_computations(df, cfg)
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 383, in basic_computations
    variables_data = compute_variables(df, cfg)
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 318, in compute_variables
    data[col] = cont_comps(df.frame[col], cfg)
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/distribution/compute/univariate.py", line 200, in cont_comps
    data["chisq"] = chisquare(data["hist"][0])
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dask/array/stats.py", line 136, in chisquare
    return power_divergence(f_obs, f_exp=f_exp, ddof=ddof, axis=axis, lambda="pearson")
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dask/array/stats.py", line 144, in power_divergence
    if lambda not in scipy.stats.stats._power_div_lambda_names:
  File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/scipy/stats/stats.py", line 54, in getattr
    raise AttributeError(
AttributeError: scipy.stats.stats is deprecated and has no attribute _power_div_lambda_names. Try looking in scipy.stats instead.

To Reproduce

from dataprep.datasets import load_dataset
df = load_dataset("titanic")
from dataprep.eda import create_report
report = create_report(df)

Expected behavior To get the EDA report.

Desktop (please complete the following information):

OS: Ubuntu 20.04.4 LTS
Platform [Python script]
Platform Version [PyCharm 2021.3.2 (Community Edition)]
Python Version [3.8.12]
Dataprep Version [0.4.2]

Additional context I have tested in a fresh conda env with pip install dataprep. Here are the packages installed:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
aiohttp                   3.8.1                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
argon2-cffi               21.3.0                   pypi_0    pypi
argon2-cffi-bindings      21.2.0                   pypi_0    pypi
asttokens                 2.0.5                    pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
attrs                     21.4.0                   pypi_0    pypi
backcall                  0.2.0                    pypi_0    pypi
bleach                    4.1.0                    pypi_0    pypi
bokeh                     2.4.2                    pypi_0    pypi
ca-certificates           2021.10.26           h06a4308_2  
certifi                   2021.10.8        py38h06a4308_2  
cffi                      1.15.0                   pypi_0    pypi
charset-normalizer        2.0.12                   pypi_0    pypi
click                     8.0.4                    pypi_0    pypi
cloudpickle               2.0.0                    pypi_0    pypi
cycler                    0.11.0                   pypi_0    pypi
dask                      2021.12.0                pypi_0    pypi
dataprep                  0.4.2                    pypi_0    pypi
debugpy                   1.5.1                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
defusedxml                0.7.1                    pypi_0    pypi
entrypoints               0.4                      pypi_0    pypi
executing                 0.8.2                    pypi_0    pypi
flask                     2.0.3                    pypi_0    pypi
flask-cors                3.0.10                   pypi_0    pypi
fonttools                 4.29.1                   pypi_0    pypi
frozenlist                1.3.0                    pypi_0    pypi
fsspec                    2022.2.0                 pypi_0    pypi
idna                      3.3                      pypi_0    pypi
importlib-resources       5.4.0                    pypi_0    pypi
ipykernel                 6.9.1                    pypi_0    pypi
ipython                   8.1.0                    pypi_0    pypi
ipython-genutils          0.2.0                    pypi_0    pypi
ipywidgets                7.6.5                    pypi_0    pypi
itsdangerous              2.1.0                    pypi_0    pypi
jedi                      0.18.1                   pypi_0    pypi
jinja2                    3.0.3                    pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
jsonpath-ng               1.5.3                    pypi_0    pypi
jsonschema                4.4.0                    pypi_0    pypi
jupyter-client            7.1.2                    pypi_0    pypi
jupyter-core              4.9.2                    pypi_0    pypi
jupyterlab-pygments       0.1.2                    pypi_0    pypi
jupyterlab-widgets        1.0.2                    pypi_0    pypi
kiwisolver                1.3.2                    pypi_0    pypi
ld_impl_linux-64          2.35.1               h7274673_9  
levenshtein               0.16.0                   pypi_0    pypi
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgomp                   9.3.0               h5101ec6_17  
libstdcxx-ng              9.3.0               hd4cf53a_17  
locket                    0.2.1                    pypi_0    pypi
markupsafe                2.1.0                    pypi_0    pypi
matplotlib                3.5.1                    pypi_0    pypi
matplotlib-inline         0.1.3                    pypi_0    pypi
metaphone                 0.6                      pypi_0    pypi
mistune                   0.8.4                    pypi_0    pypi
multidict                 6.0.2                    pypi_0    pypi
nbclient                  0.5.11                   pypi_0    pypi
nbconvert                 6.4.2                    pypi_0    pypi
nbformat                  5.1.3                    pypi_0    pypi
ncurses                   6.3                  h7f8727e_2  
nest-asyncio              1.5.4                    pypi_0    pypi
nltk                      3.7                      pypi_0    pypi
notebook                  6.4.8                    pypi_0    pypi
numpy                     1.22.2                   pypi_0    pypi
openssl                   1.1.1m               h7f8727e_0  
packaging                 21.3                     pypi_0    pypi
pandas                    1.4.1                    pypi_0    pypi
pandocfilters             1.5.0                    pypi_0    pypi
parso                     0.8.3                    pypi_0    pypi
partd                     1.2.0                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pickleshare               0.7.5                    pypi_0    pypi
pillow                    9.0.1                    pypi_0    pypi
pip                       21.2.4           py38h06a4308_0  
ply                       3.11                     pypi_0    pypi
prometheus-client         0.13.1                   pypi_0    pypi
prompt-toolkit            3.0.28                   pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pycparser                 2.21                     pypi_0    pypi
pydantic                  1.9.0                    pypi_0    pypi
pygments                  2.11.2                   pypi_0    pypi
pyparsing                 3.0.7                    pypi_0    pypi
pyrsistent                0.18.1                   pypi_0    pypi
python                    3.8.12               h12debd9_0  
python-crfsuite           0.9.7                    pypi_0    pypi
python-dateutil           2.8.2                    pypi_0    pypi
python-stdnum             1.17                     pypi_0    pypi
pytz                      2021.3                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
pyzmq                     22.3.0                   pypi_0    pypi
rapidfuzz                 1.8.3                    pypi_0    pypi
readline                  8.1.2                h7f8727e_1  
regex                     2021.11.10               pypi_0    pypi
scipy                     1.8.0                    pypi_0    pypi
send2trash                1.8.0                    pypi_0    pypi
setuptools                58.0.4           py38h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.37.2               hc218d9a_0  
stack-data                0.2.0                    pypi_0    pypi
terminado                 0.13.1                   pypi_0    pypi
testpath                  0.6.0                    pypi_0    pypi
tk                        8.6.11               h1ccaba5_0  
toolz                     0.11.2                   pypi_0    pypi
tornado                   6.1                      pypi_0    pypi
tqdm                      4.62.3                   pypi_0    pypi
traitlets                 5.1.1                    pypi_0    pypi
typing-extensions         4.1.1                    pypi_0    pypi
varname                   0.8.1                    pypi_0    pypi
wcwidth                   0.2.5                    pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
werkzeug                  2.0.3                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0  
widgetsnbextension        3.5.2                    pypi_0    pypi
wordcloud                 1.8.1                    pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
yarl                      1.7.2                    pypi_0    pypi
zipp                      3.7.0                    pypi_0    pypi
zlib                      1.2.11               h7f8727e_4

type: bug

opened by mina-marmpena 12

eda.create_report: page design prototype

We have added stats info to our plot function, so now we can use all that information to generate an HTML page for our users.

I prototyped this layout without adding any real plots, so we can change the design easily. Every element maps 1:1 to the definitions in our current code, so I believe this gives a better idea of what the page would look like. The page width is 1920px.

I will put the prototype here if anyone needs a more detailed inspection. Let me know if you have any suggestions. @jnwang @jinglinpeng @dovahcrow @Waterpine @brandonlockhart https://www.figma.com/file/txfQwkocxBOFOilPvaI9MC/Untitled?node-id=0%3A1
type: enhancement module: EDA

opened by eutialia 12

create_report() crashes on a dataframe with a constant numeric column

create_report() crashes when the dataframe contains a column of all constant numeric values. Interestingly, I didn't see the failure when I used plot(), plot_correlation(), or plot_missing(), so maybe they have logic to adapt to that condition.

Below is a simple repro case.

Repro:

dp.create_report(pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]}))

Error trace:

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-65-6594a8550d86> in <module>
----> 1 dp.create_report(pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]}))

c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\create_report\__init__.py in create_report(df, title, mode, progress)
     52         "resources": INLINE.render(),
     53         "title": title,
---> 54         "components": format_report(df, mode, progress),
     55     }
     56     template_base = ENV_LOADER.get_template("base.html")

c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\create_report\formatter.py in format_report(df, mode, progress)
     61         df = string_dtype_to_object(df)
     62         if mode == "basic":
---> 63             comps = format_basic(df)
     64         # elif mode == "full":
     65         #     comps = format_full(df)

c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\create_report\formatter.py in format_basic(df)
     97             category=RuntimeWarning,
     98         )
---> 99         (data,) = dask.compute(data)
    100 
    101     # results dictionary

c:\Miniconda\envs\python3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    450         postcomputes.append(x.__dask_postcompute__())
    451 
--> 452     results = schedule(dsk, keys, **kwargs)
    453     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    454 

c:\Miniconda\envs\python3\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     82         get_id=_thread_get_id,
     83         pack_exception=pack_exception,
---> 84         **kwargs
     85     )
     86 

c:\Miniconda\envs\python3\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    484                         _execute_task(task, data)  # Re-execute locally
    485                     else:
--> 486                         raise_exception(exc, tb)
    487                 res, worker_id = loads(res_info)
    488                 state["cache"][key] = res

c:\Miniconda\envs\python3\lib\site-packages\dask\local.py in reraise(exc, tb)
    314     if exc.__traceback__ is not tb:
    315         raise exc.with_traceback(tb)
--> 316     raise exc
    317 
    318 

c:\Miniconda\envs\python3\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    220     try:
    221         task, data = loads(task_info)
--> 222         result = _execute_task(task, data)
    223         id = get_id()
    224         result = dumps((result, id))

c:\Miniconda\envs\python3\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\distribution\compute\common.py in gaussian_kde(arr)
    230 def gaussian_kde(arr: np.ndarray) -> Tuple[float, float]:
    231     """Delayed version of scipy gaussian_kde."""
--> 232     return cast(Tuple[np.ndarray, np.ndarray], gaussian_kde_(arr))
    233 
    234 

c:\Miniconda\envs\python3\lib\site-packages\scipy\stats\kde.py in __init__(self, dataset, bw_method, weights)
    204             self._neff = 1/sum(self._weights**2)
    205 
--> 206         self.set_bandwidth(bw_method=bw_method)
    207 
    208     def evaluate(self, points):

c:\Miniconda\envs\python3\lib\site-packages\scipy\stats\kde.py in set_bandwidth(self, bw_method)
    554             raise ValueError(msg)
    555 
--> 556         self._compute_covariance()
    557 
    558     def _compute_covariance(self):

c:\Miniconda\envs\python3\lib\site-packages\scipy\stats\kde.py in _compute_covariance(self)
    566                                                bias=False,
    567                                                aweights=self.weights))
--> 568             self._data_inv_cov = linalg.inv(self._data_covariance)
    569 
    570         self.covariance = self._data_covariance * self.factor**2

c:\Miniconda\envs\python3\lib\site-packages\scipy\linalg\basic.py in inv(a, overwrite_a, check_finite)
    975         inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
    976     if info > 0:
--> 977         raise LinAlgError("singular matrix")
    978     if info < 0:
    979         raise ValueError('illegal value in %d-th argument of internal '

LinAlgError: singular matrix
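
The trace ends where the KDE inverts the data covariance. For a constant column the variance is zero, so that matrix cannot be inverted. One possible guard, sketched here with the standard library only, is to check the variance before attempting a KDE. The helper name `kde_is_defined` is hypothetical and this is not necessarily how dataprep handles the case.

```python
import statistics
from typing import Sequence


def kde_is_defined(values: Sequence[float]) -> bool:
    """Return True if a Gaussian KDE can be fitted, i.e. the sample variance is nonzero."""
    if len(values) < 2:
        return False
    return statistics.variance(values) > 0


print(kde_is_defined([1, 2, 3]))  # True
print(kde_is_defined([1, 1, 1]))  # False
```

A caller could skip the density curve for columns where this returns False and still render the remaining statistics.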

type: bug module: EDA

opened by dhuntley1023 11

plot(df) and plot_correlation(df) fail when data has 'list' columns

When running plot(df) and plot_correlation(df) on the following dataframe, since the author column is a list, both plot and plot_correlation failed.

For plot(), the reported error is TypeError: unhashable type: 'list'

For plot_correlation(df), the reported error is AssertionError: No numerical columns found
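
The first error can be reproduced without dataprep: counting distinct values requires hashing, and Python lists are not hashable. The sketch below shows the failure and one possible workaround, converting lists to tuples before counting. The sample author lists are made up (the original dataframe is not reproduced here), and `to_hashable` is a hypothetical helper, not part of dataprep.

```python
from collections import Counter
from typing import Any, Hashable, List


def to_hashable(value: Any) -> Hashable:
    """Convert (possibly nested) lists to tuples so the value can be hashed."""
    if isinstance(value, list):
        return tuple(to_hashable(v) for v in value)
    return value


authors: List[List[str]] = [["A. Smith", "B. Lee"], ["C. Wu"], ["A. Smith", "B. Lee"]]

try:
    Counter(authors)  # counting needs hashing, which lists do not support
except TypeError as exc:
    print(exc)  # unhashable type: 'list'

counts = Counter(to_hashable(a) for a in authors)
print(counts.most_common(1))
```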
type: bug module: EDA

opened by jnwang 11
eda.plot: support text analysis
The issue is related to the task of supporting text analysis. The initial idea is that:

Have a type system, and support a type named 'text'.

plot(df): for text column, we show the word cloud.

plot(df, x): for text column, we show some information related to the text, such as the length distribution, the most common words and the statistics of Lowercase/Uppercase Letter.
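
The per-column text information mentioned for plot(df, x) can be sketched in a few lines of standard-library Python. This only illustrates which quantities are meant (length distribution, most common words, letter-case counts). The function name `text_summary`, the simple tokenizer, and the sample strings are assumptions, not a proposed API.

```python
import re
from collections import Counter
from typing import Dict, List


def text_summary(texts: List[str], top_k: int = 3) -> Dict[str, object]:
    """Summarize a text column: lengths, most common words, letter-case counts."""
    # Naive tokenizer: lowercase, then keep runs of letters and apostrophes
    words = [w for t in texts for w in re.findall(r"[a-z']+", t.lower())]
    return {
        "lengths": [len(t) for t in texts],
        "top_words": Counter(words).most_common(top_k),
        "uppercase": sum(c.isupper() for t in texts for c in t),
        "lowercase": sum(c.islower() for t in texts for c in t),
    }


summary = text_summary(["The cat sat", "the Cat ran", "A dog"])
print(summary)
```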

For now, I think we should start with the type system, which is needed for multiple tasks. Please let me know what you think about the overall design. If we all agree, I will open another issue about the type system. @jnwang @dovahcrow @brandonlockhart @Waterpine @dylanzxc .
type: enhancement Epic module: EDA
opened by jinglinpeng 10
data_connector: Fetch all publications of one specific author

Suppose a user wants to fetch all publications of one specific author (e.g., Jian Pei). Dataprep.data_connector cannot meet her needs. For example, the first paper is not written by Jian Pei, but it was returned since the author list contains the keywords Jian and Pei.

A user can get all publications of Jian Pei through this API: https://dblp.org/search/publ/api?q=author%3AJian_Pei%3A
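
For reference, the URL above can be built programmatically with the standard library. This sketch only reproduces the query string shown above for an arbitrary author name. `author_query_url` is a hypothetical helper, and the name-to-underscore convention is inferred from that single example URL.

```python
from urllib.parse import urlencode

DBLP_PUBL_API = "https://dblp.org/search/publ/api"


def author_query_url(author: str) -> str:
    """Build a DBLP publication-search URL restricted to one author."""
    # Spaces become underscores and the name is wrapped as author:<name>:
    query = f"author:{author.replace(' ', '_')}:"
    return f"{DBLP_PUBL_API}?{urlencode({'q': query})}"


print(author_query_url("Jian Pei"))
# https://dblp.org/search/publ/api?q=author%3AJian_Pei%3A
```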

Please consider supporting this feature.
type: enhancement module: Connector

opened by jnwang 10
feat(clean): add clean_df function
Description

clean_df function: conduct a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame. Closes #503.

How Has This Been Tested?

I have tested this function using a few real-world datasets. I will also add my test function later.

Checklist:

[x] My code follows the style guidelines of this project

[x] I have already squashed the commits and make the commit message conform to the project standard.

[x] I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.

[x] I have performed a self-review of my own code

[x] I have commented my code, particularly in hard-to-understand areas

[x] I have made corresponding changes to the documentation

[ ] My changes generate no new warnings

[x] I have added tests that prove my fix is effective or that my feature works

[x] New and existing unit tests pass locally with my changes

[ ] Any dependent changes have been merged and published in downstream modules
opened by AndyWangSFU 9
Add the option to pass a target variable when creating the EDA report
I am often interested in understanding the relationship between one specific column (my target variable) and the others.

It would be nice if we could pass a target variable when creating the EDA report: i.e. eda.create_report(target="has_survived").

Then all variable plots would be crossed with this target variable. You already have this functionality in the eda.plot(df, target) function from the docs.

This is a functionality that I like to use in the sweetviz library:

sweetviz.analyze(source, target_feat)
type: enhancement
opened by rluthi 0
Need data-type for each column in create_report function.
Hi all, I currently need a data-type (dtype) parameter in create_report(), like the one in the plot() function. This parameter would let me generate a report with numerical/categorical features without affecting "Distinct Count".

The image below was automatically generated by create_report. However, my expected output is numerical stats and visualization.

My expected feature:

dttype = {c: "Continuous" for c in dataframe.columns}
create_report(dataframe, dtype=dttype)

Is there any solution to my problem? Please help. Thanks
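
One way such a per-column override could behave is sketched below: a toy heuristic infers a type for every column, and any entry in a user-supplied dtype mapping takes precedence. This is a hypothetical illustration using only the standard library, not DataPrep's type inference. The names infer_type, resolve_types, the max_categories threshold, and the "Categorical" label are assumptions.

```python
def infer_type(values, max_categories=10):
    """Toy heuristic: numeric columns with few distinct values are categorical."""
    non_null = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if numeric and len(set(non_null)) > max_categories:
        return "Continuous"
    return "Categorical"


def resolve_types(columns, dtype=None):
    """Use the user-supplied type when given, otherwise fall back to inference."""
    dtype = dtype or {}
    return {
        name: dtype.get(name, infer_type(values))
        for name, values in columns.items()
    }


# Made-up columns: 'rating' is numeric but has only three distinct values.
columns = {"rating": [1, 2, 3, 2, 1], "city": ["A", "B", "A", "C", "B"]}

inferred = resolve_types(columns)
overridden = resolve_types(columns, dtype={"rating": "Continuous"})
print(inferred)
print(overridden)
```

In this sketch the low-cardinality numeric column is inferred as categorical, which mirrors the situation described in the issue, and the override forces it to be treated as continuous.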
type: enhancement
opened by anthng 2
EDA plot() not showing properly in VScode
Describe the bug
EDA plot() does not display properly in VS Code.

To Reproduce
Steps to reproduce the behavior:

Open a new Jupyter notebook

Import: from dataprep.eda import plot, plot_correlation, create_report, plot_missing

Call .head() on any test dataset so that the first DataFrame rows are shown as output

In the next cell, call the plot() function from dataprep

Near the end of execution, the function adds padding to the left of all output cells, and it is impossible to scroll to the right to see the full output

Or:

import numpy as np
import pandas as pd
import datetime
from datetime import date
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler, normalize
from sklearn import metrics
from sklearn.mixture import GaussianMixture
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('marketing_campaign.csv', header=0, sep=';')
from dataprep.eda import plot, plot_correlation, create_report, plot_missing
plot(data)

Expected behavior
I expected the output to be displayed normally on the screen.

Screenshots

Images:

Desktop (please complete the following information):

OS: Windows 11

Browser: None

Platform: VSCode

Platform Version: 1.74.1

Python Version: 3.10.9

Dataprep Version: 0.4.4

Additional context
type: bug
opened by ldavidr3 1
Make the documentation more lightweight and readable
DataPrep is a great tool! And the automatic EDA module is probably the best open-source option out there.

However, I find it difficult to understand and learn your package from the documentation, which makes it feel more complicated than it actually is.

Some ideas for improvement:

Highlight the two-line EDA report creation (only advanced users need the details)

Separate documentation from case studies

Organise the sub-sections in the documentation of the Clean module

type: enhancement
opened by rluthi 3
build(deps): bump express from 4.17.1 to 4.18.2 in /dataprep/clean/gui/clean_frontend
Bumps express from 4.17.1 to 4.18.2.

Release notes

Sourced from express's releases.

4.18.2

Fix regression routing a large stack in a single route

deps: [email protected]

deps: [email protected]

perf: remove unnecessary object clone

deps: [email protected]

4.18.1

Fix hanging on large stack of sync routes

4.18.0

Add "root" option to res.download

Allow options without filename in res.download

Deprecate string and non-integer arguments to res.status

Fix behavior of null/undefined as maxAge in res.cookie

Fix handling very large stacks of sync middleware

Ignore Object.prototype values in settings through app.set/app.get

Invoke default with same arguments as types in res.format

Support proper 205 responses using res.send

Use http-errors for res.format error

deps: [email protected]

Fix error message for json parse whitespace in strict

Fix internal error when inflated body exceeds limit

Prevent loss of async hooks context

Prevent hanging when request already read

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

Add priority option

Fix expires option to reject invalid dates

deps: [email protected]

Replace internal eval usage with Function constructor

Use instance methods on process to check for listeners

deps: [email protected]

Remove set content headers that break response

deps: [email protected]

deps: [email protected]

deps: [email protected]

Prevent loss of async hooks context

deps: [email protected]

deps: [email protected]

Fix emitted 416 error missing headers property

Limit the headers removed for 304 response

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

... (truncated)

Changelog

Sourced from express's changelog.

4.18.2 / 2022-10-08

Fix regression routing a large stack in a single route

deps: [email protected]

deps: [email protected]

perf: remove unnecessary object clone

deps: [email protected]

4.18.1 / 2022-04-29

Fix hanging on large stack of sync routes

4.18.0 / 2022-04-25

Add "root" option to res.download

Allow options without filename in res.download

Deprecate string and non-integer arguments to res.status

Fix behavior of null/undefined as maxAge in res.cookie

Fix handling very large stacks of sync middleware

Ignore Object.prototype values in settings through app.set/app.get

Invoke default with same arguments as types in res.format

Support proper 205 responses using res.send

Use http-errors for res.format error

deps: [email protected]

Fix error message for json parse whitespace in strict

Fix internal error when inflated body exceeds limit

Prevent loss of async hooks context

Prevent hanging when request already read

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

deps: [email protected]

Add priority option

Fix expires option to reject invalid dates

deps: [email protected]

Replace internal eval usage with Function constructor

Use instance methods on process to check for listeners

deps: [email protected]

Remove set content headers that break response

deps: [email protected]

deps: [email protected]

deps: [email protected]

Prevent loss of async hooks context

deps: [email protected]

deps: [email protected]

... (truncated)

Commits

8368dc1 4.18.2

61f4049 docs: replace Freenode with Libera Chat

bb7907b build: [email protected]

f56ce73 build: [email protected]

24b3dc5 deps: [email protected]

689d175 deps: [email protected]

340be0f build: [email protected]

33e8dc3 docs: use Node.js name style

644f646 build: [email protected]

ecd7572 build: [email protected]

Additional commits viewable in compare view


dependencies javascript
opened by dependabot[bot] 1
build(deps): bump certifi from 2022.6.15 to 2022.12.7
Bumps certifi from 2022.6.15 to 2022.12.7.

Commits

9e9e840 2022.12.07

b81bdb2 2022.09.24

939a28f 2022.09.14

aca828a 2022.06.15.2

de0eae1 Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ...

b8eb5e9 2022.06.15.1

47fb7ab Fix deprecation warning on Python 3.11 (#199)

b0b48e0 fixes #198 -- update link in license

See full diff in compare view


dependencies python
opened by dependabot[bot] 1

Releases(v0.4.4-alpha.1)

v0.4.4-alpha.1(Jul 8, 2022)
Bugfixes 🐛

eda.create-db-report: add missing style files from previously ignored by gitignore (75361915)

eda: jinja2.markup import broken with 3.1 (b9b60a0a)

eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)

eda: report for empty df (485e58d3)

eda: plot_diff when columns are not aligned (7e53dbf6)

eda: scipy version issue (8798a146)

eda: na column name when upgrade dask (43fdd1a6)

eda: pd grouper issue when upgrade dask (761c4455)

clean: delete abundant print (0e072a80)

eda.plot: fix display issue in notebook (6ed13b09)

eda.plot: fix pagination styling issues (8396f2d9)

eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)

eda: interaction error in report for cat-only df (e60239a0)

eda: fix cat-cat error (94f70ef6)

eda: fix stat layout issue (5bb535d7)

eda.create_report: fix display issue in notebook (487659fd)

clean: remove usaddress library (c192ab43)

clean: fix the bug of am, pm (4c3b2312)

clean: fix the bug of am, pm (caf2b372)

eda: fixed issue where plots weren't rendering twice (fd3fd573)

eda: wordcloud setting in terminal (00901699)

Features ✨

eda: added sorting feature for create_diff_report (8b187a6c)

eda: add running total for time series test (d0940726)

eda: add create_db_report submodule (9784cceb)

eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)

eda.create_report: add sort by approximate unique (5738db2a)

eda: add sort variables by alphabetical and missing (fb93493a)

clean: New version of GUI (6828807b)

eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

eda: add tests for intermediate compute functions (700add77)

Documentation 📃

clean: revise __init__.py (02ede811)

clean: add doc of clean GUI (5e2f38ac)

eda.plot: add pagination for plot (c4cd4b97)

eda.create_report: remove old doc file (e1153cb1)

eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)

eda: add doc for getting imdt result (6fbcfe4c)

eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Andrey Pham <[email protected]> (First time contributor) ⭐️

Bowen0729 <[email protected]> (First time contributor) ⭐️

Danrui Qi <[email protected]> (First time contributor) ⭐️

Danrui QI <[email protected]> (First time contributor) ⭐️

dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> (First time contributor) ⭐️

Devin <[email protected]> (First time contributor) ⭐️

Devin Lu <[email protected]>

Grey Murav <[email protected]> (First time contributor) ⭐️

henryye <[email protected]> (First time contributor) ⭐️

Jinglin Peng <[email protected]>

jwa345 <[email protected]> (First time contributor) ⭐️

qidanrui <[email protected]>

Weiyuan Wu <[email protected]>

🎉🎉 Thank you! 🎉🎉
Source code(tar.gz)
Source code(zip)
dataprep-0.4.4a1-py3-none-any.whl(9.13 MB)
dataprep-0.4.4a1.tar.gz(8.66 MB)
v0.4.4(Jul 15, 2022)
Bugfixes 🐛

eda: type error for npartitions (57db1ede)

eda.create-db-report: remove pystache dependency and replace it with jinja2 (676fff1a)

eda.create-db-report: add missing style files from previously ignored by gitignore (75361915)

eda: jinja2.markup import broken with 3.1 (b9b60a0a)

eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)

eda: report for empty df (485e58d3)

eda: plot_diff when columns are not aligned (7e53dbf6)

eda: scipy version issue (8798a146)

eda: na column name when upgrade dask (43fdd1a6)

eda: pd grouper issue when upgrade dask (761c4455)

clean: delete abundant print (0e072a80)

eda.plot: fix display issue in notebook (6ed13b09)

eda.plot: fix pagination styling issues (8396f2d9)

eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)

eda: interaction error in report for cat-only df (e60239a0)

eda: fix cat-cat error (94f70ef6)

eda: fix stat layout issue (5bb535d7)

eda.create_report: fix display issue in notebook (487659fd)

clean: remove usaddress library (c192ab43)

clean: fix the bug of am, pm (4c3b2312)

clean: fix the bug of am, pm (caf2b372)

eda: fixed issue where plots weren't rendering twice (fd3fd573)

eda: wordcloud setting in terminal (00901699)

Features ✨

clean: add updated version of rapidfuzz and python-crfsuite (59f35066)

eda.create-db-report: add save report functionality (2fb16ad6)

eda: add get_db_names (a7bf8206)

eda: added sorting feature for create_diff_report (8b187a6c)

eda: add running total for time series test (d0940726)

eda: add create_db_report submodule (9784cceb)

eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)

eda.create_report: add sort by approximate unique (5738db2a)

eda: add sort variables by alphabetical and missing (fb93493a)

clean: New version of GUI (6828807b)

eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

eda: add test for npartition type error (5affd75a)

eda: add tests for intermediate compute functions (700add77)

Documentation 📃

eda: add the use-case of dataprep.eda for spark dataframe with ray (4bf14e7c)

clean: revise __init__.py (02ede811)

clean: add doc of clean GUI (5e2f38ac)

eda.plot: add pagination for plot (c4cd4b97)

eda.create_report: remove old doc file (e1153cb1)

eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)

eda: add doc for getting imdt result (6fbcfe4c)

eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Andrey Pham <[email protected]> (First time contributor) ⭐️

astellarius <[email protected]> (First time contributor) ⭐️

Bowen0729 <[email protected]> (First time contributor) ⭐️

Danrui Qi <[email protected]> (First time contributor) ⭐️

Danrui QI <[email protected]> (First time contributor) ⭐️

dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> (First time contributor) ⭐️

Devin <[email protected]> (First time contributor) ⭐️

Devin Lu <[email protected]>

Grey Murav <[email protected]> (First time contributor) ⭐️

henryye <[email protected]> (First time contributor) ⭐️

Jinglin Peng <[email protected]>

jwa345 <[email protected]> (First time contributor) ⭐️

qidanrui <[email protected]>

Sultan Orazbayev <[email protected]> (First time contributor) ⭐️

Weiyuan Wu <[email protected]>

🎉🎉 Thank you! 🎉🎉
Source code(tar.gz)
Source code(zip)
dataprep-0.4.4-py3-none-any.whl(9.13 MB)
dataprep-0.4.4.tar.gz(8.66 MB)
v0.4.3(Mar 31, 2022)
Bugfixes 🐛

eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)

eda: report for empty df (485e58d3)

eda: plot_diff when columns are not aligned (7e53dbf6)

eda: scipy version issue (8798a146)

eda: na column name when upgrade dask (43fdd1a6)

eda: pd grouper issue when upgrade dask (761c4455)

clean: delete abundant print (0e072a80)

eda.plot: fix display issue in notebook (6ed13b09)

eda.plot: fix pagination styling issues (8396f2d9)

eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)

eda: interaction error in report for cat-only df (e60239a0)

eda: fix cat-cat error (94f70ef6)

eda: fix stat layout issue (5bb535d7)

eda.create_report: fix display issue in notebook (487659fd)

clean: remove usaddress library (c192ab43)

clean: fix the bug of am, pm (4c3b2312)

clean: fix the bug of am, pm (caf2b372)

eda: fixed issue where plots weren't rendering twice (fd3fd573)

eda: wordcloud setting in terminal (00901699)

Features ✨

eda: added sorting feature for create_diff_report (8b187a6c)

eda: add running total for time series test (d0940726)

eda: add create_db_report submodule (9784cceb)

eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)

eda.create_report: add sort by approximate unique (5738db2a)

eda: add sort variables by alphabetical and missing (fb93493a)

clean: New version of GUI (6828807b)

eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

eda: add tests for intermediate compute functions (700add77)

Documentation 📃

clean: revise __init__.py (02ede811)

clean: add doc of clean GUI (5e2f38ac)

eda.plot: add pagination for plot (c4cd4b97)

eda.create_report: remove old doc file (e1153cb1)

eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)

eda: add doc for getting imdt result (6fbcfe4c)

eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Andrey Pham <[email protected]> (First time contributor) ⭐️

Bowen0729 <[email protected]> (First time contributor) ⭐️

Danrui Qi <[email protected]> (First time contributor) ⭐️

Danrui QI <[email protected]> (First time contributor) ⭐️

dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> (First time contributor) ⭐️

Devin <[email protected]> (First time contributor) ⭐️

Devin Lu <[email protected]>

Grey Murav <[email protected]> (First time contributor) ⭐️

henryye <[email protected]> (First time contributor) ⭐️

Jinglin Peng <[email protected]>

jwa345 <[email protected]> (First time contributor) ⭐️

qidanrui <[email protected]>

Weiyuan Wu <[email protected]>

🎉🎉 Thank you! 🎉🎉
Source code(tar.gz)
Source code(zip)
dataprep-0.4.3-py3-none-any.whl(9.02 MB)
dataprep-0.4.3.tar.gz(8.57 MB)
v0.4.2(Feb 21, 2022)
Bugfixes 🐛

eda: na column name when upgrade dask (43fdd1a6)

eda: pd grouper issue when upgrade dask (761c4455)

clean: delete abundant print (0e072a80)

eda.plot: fix display issue in notebook (6ed13b09)

eda.plot: fix pagination styling issues (8396f2d9)

eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)

eda: interaction error in report for cat-only df (e60239a0)

eda: fix cat-cat error (94f70ef6)

eda: fix stat layout issue (5bb535d7)

eda.create_report: fix display issue in notebook (487659fd)

clean: remove usaddress library (c192ab43)

clean: fix the bug of am, pm (4c3b2312)

clean: fix the bug of am, pm (caf2b372)

eda: fixed issue where plots weren't rendering twice (fd3fd573)

eda: wordcloud setting in terminal (00901699)

Features ✨

eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)

eda.create_report: add sort by approximate unique (5738db2a)

eda: add sort variables by alphabetical and missing (fb93493a)

clean: New version of GUI (6828807b)

eda: enriched show details tab by adding plots and overview statistics (eeb210db)

Code Quality + Testing 💯

eda: add tests for intermediate compute functions (700add77)

Documentation 📃

clean: add doc of clean GUI (5e2f38ac)

eda.plot: add pagination for plot (c4cd4b97)

eda.create_report: remove old doc file (e1153cb1)

eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)

eda: add doc for getting imdt result (6fbcfe4c)

eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Andrey Pham <[email protected]> (First time contributor) ⭐️

Bowen0729 <[email protected]> (First time contributor) ⭐️

dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> (First time contributor) ⭐️

Devin <[email protected]> (First time contributor) ⭐️

Devin Lu <[email protected]>

Grey Murav <[email protected]> (First time contributor) ⭐️

henryye <[email protected]> (First time contributor) ⭐️

Jinglin Peng <[email protected]>

jwa345 <[email protected]> (First time contributor) ⭐️

qidanrui <[email protected]>

Weiyuan Wu <[email protected]>

🎉🎉 Thank you! 🎉🎉
Source code(tar.gz)
Source code(zip)
dataprep-0.4.2-py3-none-any.whl(3.52 MB)
dataprep-0.4.2.tar.gz(3.14 MB)
v0.4.1(Nov 25, 2021)
v0.4.1

Bugfixes 🐛

eda: stat layout in plot (946319f7)

eda: fix display in plot(df) (c11bb94c)

eda: report for pandas extension type (2cbb3873)

eda: fix saving imdt as json file (5ee6529f)

Features ✨

clean: Add wiki and simple GUI (7f4ab12a)

eda: added overview and variables section for create_diff_report (dc4cf7da)

eda: add categorical interaction in create_report (7f13cd57)

Code Quality + Testing 💯

eda: added basic automated tests (3a0653e0)

Documentation 📃

eda: link create_diff_report to intro (05d9850b)

eda: added docs for create_diff_report (d8fc9d4b)

eda: enrich parameters in report (3d0a148a)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Devin Lu <[email protected]>

Jinglin Peng <[email protected]>

qidanrui <[email protected]>

waterpine <[email protected]>

Weiyuan Wu <[email protected]>

Xiaoying Wang <[email protected]> (First time contributor) ⭐️

Xiaoying Wang <[email protected]>

🎉🎉 Thank you! 🎉🎉
Source code(tar.gz)
Source code(zip)
dataprep-0.4.1-py3-none-any.whl(3.37 MB)
dataprep-0.4.1.tar.gz(3.00 MB)
v0.4.0(Oct 26, 2021)
v0.4.0

Bugfixes 🐛

eda: fix string type (b7e3321f)

eda: fix value table display (57281bc2)

eda: remove imdt output from plot (5c227e15)

eda: adjusted save report method to accept one parameter (4ceefcc1)

eda: clean config code and fix scatter sample param (8ab27f92)

plot_diff: fix ci issue (44ce81cf)

clean: clean_duplication issue 646 (ca9f7085)

eda: fix category type error (9750694a)

Features ✨

eda: refactored code and added density parameter to plot_diff(df) (323ae6b0)

eda: save imdt as json file (78673867)

connector: integrate connectorx into connector (106457e3, a64e3563, 9f89d3bf)

clean: add clean_ml function (909cd196)

clean: add multiple clean functions for number types (3c05be58)

eda.diff: add plot_diff([df1..dfn], continuous) (3bfb4f57)

clean: support conversion into packed binary format in clean_ip (7e30f93f, 37a83b03)

Code Quality + Testing 💯

eda: add densify test and doc for diff (f8d2054d)

eda: add test for config (ab3172f5)

Performance 🚀

clean: update documentation of clean_duplication (50f90fa9)

Documentation 📃

clean: change the introduction (862b4478)

eda: change eda colab position (ce25b17d, d00b0bd5)

clean: add documentation for multiple clean functions for number types (732480f1)

clean: add documentation for clean_ml function (0c139db6)

eda: scatter.sample_rate added to documentation (549b3193)

eda: fix plot show (0b40a40f)

readme: add benchmark link (e807f798)

readme: small text change on clean and connector (e193a6a7)

readme: fix titanic link (29cc06cc)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Devin Lu <[email protected]> (First time contributor) ⭐️

dylanzxc <[email protected]>

Jinglin Peng <[email protected]>

Noir Tree <[email protected]> (First time contributor) ⭐️

pwwang <[email protected]> (First time contributor) ⭐️

qidanrui <[email protected]>

sahmad11 <[email protected]> (First time contributor) ⭐️

waterpine <[email protected]>

Weiyuan Wu <[email protected]>

Xiaoying Wang <[email protected]> (First time contributor) ⭐️

🎉🎉 Thank you! 🎉🎉
Source code(tar.gz)
Source code(zip)
dataprep-0.4.0-py3-none-any.whl(2.00 MB)
dataprep-0.4.0.tar.gz(1.65 MB)
v0.3.0(May 20, 2021)
v0.3.0

Bugfixes 🐛

eda: fix long name in missing heatmap (f6cc399e)

connector: fix bug in url_path_params (c95a7ff1)

eda: fix NA and int viz issue in plot_diff (ef36d5ac)

eda: fix missing for SmallCard and DateTime type (201e487b)

eda: fix create_report for dask csv (93e85673)

clean: fix mixed-up formats of date in one column (e2956956)

eda: fixed uncaught dtype and long var names (24f0295e)

eda: fix correlation of num columns with small distinct values (9959b78a)

eda: fix issue with dataframe of one column (910bb71a)

eda: add geopoint in type count (94cbca23)

eda: fixed uncaught dtype exceptions (d301eb75)

eda: fix str transform with small distinct as categorical (65e7f907)

eda: fix na values display issue (1ce5775e)

eda: keep na when preprocess df (17d82191)

clean: fix returned df_clean in clean_dupl (180e6ad2)

clean: escape apostrophes in code exported by clean_dupl (e6ea7e97)

eda: fixed endless loop and UI issues (69779cd6)

eda: fix insight error (9ad4e26b)

eda: suppress warnings for missing and report (df2a1e70)

eda: fix insights of plot_correlation (f0ca5f41)

eda: suppress warnings of progress bar and dask (ca8da4e1)

eda.create_report: fix constant column error (160844ad)

docs: fix docs of clean_df (38dd4b2a)

clean: remove unneeded replace in clean_dupl (51c02cdd)

eda: fixed bugs come with random generated datasets (53ecf76c)

eda: fix bugs in log transformation (209d7d0c)

eda: fixed and optimized css layouts (58e1b18f)

clean: fix bug in validate_country (28068d46)

eda: fix column name and index related issues (40a89b91)

eda: variables can be none (325b0904)

connector: path to new config repo (59603e5b)

clean: lat_long regex not match a date format (49d3d227)

eda.distribution: highlight variable names (998b1762)

eda: fix the error of numerical cell in object column (91c4f9df)

eda.distribution: box plot with object dtype (a37e9f21)

clean: add comma after street suffix or name (e7655db9)

clean: cast values as str in validate funcs (8e1b459a)

Features ✨

clean: tuple of input formats for clean_country() (6bc65513)

clean: add clean_text function (55d3ae95)

eda: change color of geo map (1dbcddbf)

clean: add clean_currency function (deb55938)

clean: add clean_df() function (b750284f)

type: detect column as categorical for small unique values (4696e598)

eda: add geo_plot function (bbe64ec2)

eda: create_report UI improvement (c849b013)

eda: added new function plot_diff (79523c30)

connector: allow parameters appear in url path (5adaf301)

eda: value frequency table (bc37b794)

eda: create_report UI improvement (72a0ca95)

clean: add clean_duplication() function (98ff38d0)

clean: support letters in clean_phone (25d163b3)

eda: specify colors in plot(df), plot(df, x) (33fa36ea)

connector: add functionality that lists supported websites (88187e18)

clean: add clean_address function (e839ecd3)

clean: add clean_headers function (40742a19)

eda: parameter management and how-to guide (d2e8b10a)

clean: add clean_date function (6aa6410e)

create_report: add tabs for correlation and missing (6dc568b5)

Code Quality + Testing 💯

eda: add test for geo point (943033a6)

eda: add dataset test for report (0de5208b)

eda: add test of random df (68239f03)

clean: add tests for clean_duplication() (a4b9d32b)

eda: add random data generator (e83f95b3)

clean: add tests for clean_headers (0aca076e)

eda: add test case of object column with numerical cell (57839841)

clean: add tests for clean_date and validate_date (812dbb8d)

Performance 🚀

eda: optimize df preprocess and performance of create_report (e7eb182f)

clean: update documentation of clean_date (c540fcc7)

clean: improve performance of clean_duplication (8fda37e8)

eda: use approximate nunique (60300644)

clean: improve the performance of clean_email() (176382bc)

clean: improve performance of clean_date (854329ba)

Documentation 📃

readme: update video, paper and titanic report for eda (1126dea8)

eda: replace x, y, z with col1, col2, col3 (57f65b30)

clean: add documentation for clean_text (65436b06)

eda: add documentation for insights (1e4659be)

clean: add documentation for clean_df() (4ecf0d71)

eda: update user guide's datasets (2428f98e)

eda: add documentation for geo plot (3558257c)

clean: add user guide for clean_duplication (d834e857)

clean: fix clean documentation (e3bed2ba)

connector: revision (23085dd3)

clean: add documentation for clean_date function (d445f36a)

connector: add info docs (cb8cb5c5)

connector: add config file section (f55226ea)

connector: adding a process overview via DBLP section (5794d6c8)

connector: remove stale rst files (433fdfe4)

connector: convert pagination section from rst to ipynb (e4b9ba0c)

connector: convert authorization section from rst to ipynb (d25af473)

connector: change the pointer in index file from connector.rst to introduction.ipynb (218e41c6)

connector: rewrite introduction and form doc structure (6a876937)

connector: update API reference doc (9bed1694)

clean: improve DataPrep.Clean ReadMe (a0bc96b0)

eda: update legacy documentations for eda (8f948e05)

clean: add documentation for clean_address (4061fca3)

clean: add documentation for clean_headers (7a9d519c)

clean: add links from user guide to api ref (182b5254)

clean: Docstrings for phone and email (47f1e33d)

datasets: add introduction for datasets (83d42cee)

clean: add API reference (68182f6a)

clean: add documentation for clean_ip function (9da3ed1e)

connector: add query() section (c904d1fc)

connector: add connect() section (bff842ed)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

andy (First time contributor) ⭐️

AndyWangSFU (First time contributor) ⭐️

atol

Brandon Lockhart

dylanzxc

eutialia

Jinglin Peng

jinglinpeng

Lakshay-sethi (First time contributor) ⭐️

nzrymiak

peiwangdb

peterirani (First time contributor) ⭐️

qidanrui (First time contributor) ⭐️

ryanwdale

waterpine

Weiyuan Wu

Yi Xie

yuzhenmao

yxie66

zhixuan_chi

🎉🎉 Thank you! 🎉🎉
dataprep-0.3.0-py3-none-any.whl(1.69 MB)
dataprep-0.3.0.tar.gz(1.59 MB)
v0.2.15(Jan 6, 2021)
Bugfixes 🐛

eda: add test to plot_missing (303a13e6)

eda: fix plot_missing when the data size is small (9e59aa00)

eda: set encoding to udf when file is opened (f43c1aa2)

clean: split parameter for clean_phone (f9bb1003)

connector: config manager checks _meta.json (5c2278de)

eda.create_report: univar datetime analysis (4632852a)

eda.report: encoding and show issue (721ae7be)

Features ✨

datasets: add load_dataset and get_dataset_names (2b9e1f95)

connector: allow using config from other branches (276afff3)

connector: from_key parameter validation (bd89ef29)

clean: add clean_ip function (3b232708)

connector: improve info (2a175a82)

eda: enrich plot_correlation (29c444e2)

clean: implement clean_phone for Canadian/US formats (45d43682)

eda: modify doc of plot_missing (489c9220)

clean: add errors parameter, enhance report for clean_url (aa7ec9cb)

clean: add clean_url function (2894d0a0)

eda: add stat. in plot_missing (0f44f153)

connector: adding validation for auth params (0a7c712d)

eda: convert all plot functions to new UI (36f8fa3e)

connector: update info function documentation (7b6ae530)

connector: create display dataframe function (9767cf47)

Code Quality + Testing 💯

clean: add tests for clean_ip and validate_ip (fc156829)

clean: add tests for clean_url (452dbe8f)

clean: add tests for clean_phone (fcf73106)

clean: add tests for clean_email() (fdd02c62)

clean: add tests for clean_country() (8a593fa6)

clean: add tests for clean_lat_long (aea26025)

Performance 🚀

clean: improve the performance of the clean subpackage (c7c787bd)

Documentation 📃

README: add link to each section (b687076a)

README: polish EDA section (fd5ef8c4)

clean: add documentation for clean_url (bf937f9d)

clean: add documentation for clean_phone (8165a428)

readme: fix the broken image (12e1fa16)

readme: add introduction for dataprep.clean (3710037d)

clean: add docs for clean_country (21639814)

eda: modify doc for plot_correlation (b6b377c9)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

atol

Brandon Lockhart

eutialia

Jinglin Peng

jinglinpeng (First time contributor) ⭐️

Juan Ospina (First time contributor) ⭐️

nzrymiak (First time contributor) ⭐️

pallavib

peiwangdb

Peshotan Irani

peterirani

ryanwdale

waterpine

Weiyuan Wu

Yi Xie

yuzhenmao (First time contributor) ⭐️

yxie66

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.15-py3-none-any.whl(188.63 KB)
dataprep-0.2.15.tar.gz(149.16 KB)
v0.2.14(Oct 22, 2020)
Bugfixes 🐛

eda.plot_missing: new label texts and color mapping (71a95f91)

connector: add missing authdef (8b274b92)

eda.create_report: handle unhashable dtypes (77437491)

Features ✨

connector: remove jsonschema dependency (6f07faf9)

connector: don't support xml website anymore (fa173a06)

connector: simplify generator, add connect (a96d9b3c)

clean: implement clean_country function (5dea1bde)

connector: do not update local config if it already exists (cd675f30)

eda: Redesigned layout for plot_missing (c85eaa5d)

connector: add generator UI (4d1e9004)

Performance 🚀

eda: optimize plot_missing and plot_corr (b46036dc)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

eutialia

Jinglin Peng

Pallavi Bharadwaj

pallavib

ryanwdale (First time contributor) ⭐️

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.14-py3-none-any.whl(151.39 KB)
dataprep-0.2.14.tar.gz(118.94 KB)
v0.2.13(Oct 1, 2020)
Bugfixes 🐛

eda: change dtype 'string' to 'object' (8ddddbcf)

eda: remove unnecessary compute (98c4ab0c)

connector: wrong calculation for pagination (516038b9)

eda.data_array: handle empty df correctly (97db86d7)

eda.distribution: fix pie chart insight (d3564a6f)

eda.distribution: delay scipy computations (89fafaec)

eda.correlation: wrong mask calculation (8ebe9cc0)

eda.plot: fixed wordcloud, all nan column (ce762d55)

Features ✨

connector: implement authorization code (e6838ca1)

connector: full text search _q to be a universal parameter (947584ab)

cleaning: add clean_email() function (4658a208)

connector: implement generator (7a93ea0e)

connector: add token based pagination (5ec6e00c)

connector: implement page pagination (02c93b4e)

connector: implement header authentication (d879c207)

connector: use pydantic for schema (dff08442)

connector: rename pagination types (500ce130)

cleaning: add report parameter for clean_lat_long (f0af6212)

connector: Parameter check when calling query() (0db7a16b)

eda: support series as the input (bad6a873)

eda.plot: Redesigned layout for plot(df, x) (04c7fd55)

cleaning: clean latitude, longitude coordinates (93927a98)

eda.report: allow disabling the progress bar (2a90f7f3)

eda.correlation: move nan corr values to the bottom (4bba52e0)

eda: add progress bar for dask local scheduler (e13257c8)

eda.plot: increase # of bins and ngroups (f78cfaef)

Performance 🚀

eda.plot: changed drop_null to dropna (0a7fe56d)

eda.missing: use DataArray (fb69ea1b)

eda.plot: optimize bivariate computations (031748e9)

eda: improve progress bar performance (64be8895)

eda.correlation: increase the performance (3575aac4)

eda.correlation: performance tuning (68471e50)

Documentation 📃

cleaning: add documentation for clean_email() (5bc37706)

cleaning: update clean_lat_long docs (d698a10e)

cleaning: add documentation for clean_lat_long (eaba8c71)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

atol (First time contributor) ⭐️

Brandon Lockhart

eutialia

Jinglin Peng

jospina (First time contributor) ⭐️

Pallavi Bharadwaj

pallavib (First time contributor) ⭐️

peiwangdb

rwdale (First time contributor) ⭐️

Weiyuan Wu

Yi Xie (First time contributor) ⭐️

yuzhenmao (First time contributor) ⭐️

yxie66

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.13-py3-none-any.whl(138.07 KB)
dataprep-0.2.13.tar.gz(106.97 KB)
v0.2.12(Aug 25, 2020)
Bugfixes 🐛

eda.create_report: optional dependency on ipython (75542cda)

Features ✨

eda.plot: add plot(df, x) insights (090d2f33)

connector: early return when df is empty for multi page (99fd5164)

eda.plot: Redesigned layout for plot(df) (5baebcb2)

eda.plot: Add auto insights (f176e9f8)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

eutialia (First time contributor) ⭐️

pei wang

peterirani

Weiyuan Wu

zhixuan_chi

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.12-py3-none-any.whl(109.29 KB)
dataprep-0.2.12.tar.gz(88.79 KB)
v0.2.11(Aug 10, 2020)
Bugfixes 🐛

eda: fix holoview palette deprecation warning (b5b27d36)

eda.correlation: truncate axis tick values (ed6ef8c6)

Features ✨

eda.create_report: new report object for saving and showing report (66fefd59)

eda.plot_missing: add dendrogram for plot_missing (1e11d5c5)

eda.config: add config class (f2bc8c50)

Performance 🚀

eda.plot: optimize plot() by tweaking dask (54b6f667)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

eutialia

Jinglin Peng

kla55

Pallavi Bharadwaj

peiwangdb

Peshotan Irani

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.11-py3-none-any.whl(92.87 KB)
dataprep-0.2.11.tar.gz(79.18 KB)
v0.2.10(Jul 26, 2020)
Bugfixes 🐛

eda.create_report: updated key name (80405159)

Features ✨

eda: add show_browser function (04cf7306)

eda: add a drop_null function (11acb3e0)

Code Quality + Testing 💯

eda.create_report: added test script for create_report function (c88a6eb2)

Performance 🚀

eda: optimize plot_missing performance by tweaking dask (d7669779)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

eutialia

Jinglin Peng

peterirani (First time contributor) ⭐️

Weiyuan Wu

zhixuan_chi

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.10-py3-none-any.whl(85.06 KB)
dataprep-0.2.10.tar.gz(72.17 KB)
v0.2.9(Jul 12, 2020)
Bugfixes 🐛

eda: deal with no missing df for plot_missing (9d4d39b6)

eda.plot: kde yaxis tick locations (7c48fe14)

eda: display inside Google Colab (e61d16cf)

Features ✨

data_connector: adding support for authorization type TokenParam (c59480c8)

Code Quality + Testing 💯

data_connector: add test for QueryParam auth (818891e1)

Documentation 📃

eda: update plot_missing doc (4ee03afa)

data_connector: Example notebook for YouTube usage (5ae4283b)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

eutialia

Jinglin Peng

Korakot Chaovavanich (First time contributor) ⭐️

Pallavi Bharadwaj

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.9-py3-none-any.whl(82.41 KB)
dataprep-0.2.9.tar.gz(70.63 KB)
v0.2.8(Jul 4, 2020)
Bugfixes 🐛

eda.plot: fix wordcloud and change the stats layout (7fe25c74)

Documentation 📃

data_connector_example: Example to fetch and analyze tweets (e1d97d7a)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Peshotan Irani (First time contributor) ⭐️

Weiyuan Wu

zhixuan_chi

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.8-py3-none-any.whl(73.39 KB)
dataprep-0.2.8.tar.gz(64.01 KB)
v0.2.7(Jun 29, 2020)
Bugfixes 🐛

eda: fix the plot not showing up (b72770a6)

Features ✨

eda.plot: support text analysis (21ef6293)

Documentation 📃

data_connector: Twitter usage example notebook (a698811f)

data_connector: documentation update (8977f757)

data_connector: documentation update (8fc30940)

data_connector: documentation update (0bd70b7a)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Jinglin Peng

kla55

Pallavi Bharadwaj (First time contributor) ⭐️

peiwangdb

root

Sanjana12111994 (First time contributor) ⭐️

Weiyuan Wu

zhixuan_chi (First time contributor) ⭐️

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.7-py3-none-any.whl(73.33 KB)
dataprep-0.2.7.tar.gz(63.98 KB)
v0.2.6(Jun 1, 2020)
Bugfixes 🐛

eda.basic: fixed kde calculation (a6e58e94)

Features ✨

eda: notebook with code from dvsp blog post (bfd06b89)

eda.basic: added time series plots, support (0d10ebbb)

Code Quality + Testing 💯

data_connector: add integration test (95e52996)

Documentation 📃

data_connector: use remote yelp config in the docstring (d67bb70b)

data_connector: improved docstring (c363ac26)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

pei wang

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
v0.2.5(May 5, 2020)
Bugfixes 🐛

eda.basic: fixed kde calculation (a6e58e94)

Features ✨

eda: notebook with code from dvsp blog post (bfd06b89)

eda.basic: added time series plots, support (0d10ebbb)

Code Quality + Testing 💯

data_connector: add integration test (95e52996)

Documentation 📃

data_connector: use remote yelp config in the docstring (d67bb70b)

data_connector: improved docstring (c363ac26)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Brandon Lockhart

pei wang

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.5-py3-none-any.whl(57.25 KB)
dataprep-0.2.5.tar.gz(47.32 KB)
v0.2.4(Apr 18, 2020)
Bugfixes 🐛

eda: report permission error on windows (f6e2f24f)

dataprep.eda: fix bug for stacked_viz (52cc35b8)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.4-py3-none-any.whl(53.24 KB)
dataprep-0.2.4.tar.gz(43.10 KB)
v0.2.3(Apr 12, 2020)
Documentation 📃

readme: fix the fig (4d04ea78)

readme: make examples link to documentation (cfb1e39d)

readme: add links and tooltips for the plots (13c7ab10)

Contributors this release 🏆

The following users contributed code to DataPrep since the last release.

Jinglin Peng

Weiyuan Wu

🎉🎉 Thank you! 🎉🎉
dataprep-0.2.3-py3-none-any.whl(53.14 KB)
dataprep-0.2.3.tar.gz(42.98 KB)
v0.2.2(Apr 4, 2020)
Fix

Constrain the x-ticks of boxplots (issue #104), and polish the x/y axes (9f9235e4ff432b627cf1319be1cbb97c688c3f1f)

Documentation

Polish readme (0585f906c04fb7f9a67efb7c4415eb8943299c01)

Add two examples (issue #75 and #78) and embed interactive fig. to doc (5bac8d5a2bbc96ee8d5a1ce3b520a929b3b6e3fe)

Add sphinx documentation (97f34ffacec2752ac23f0ed2a9aa1f5b76dbca15)

dataprep-0.2.2-py3-none-any.whl(52.94 KB)
dataprep-0.2.2.tar.gz(42.72 KB)
v0.2.1(Mar 20, 2020)
This is a hotfix release.

Fix

No pyproject in the installed library (20cc7b866f479eabc22a7f192ef75bf3ea0ac11d)

dataprep-0.2.1-py3-none-any.whl(52.17 KB)
dataprep-0.2.1.tar.gz(41.81 KB)
v0.2.0(Mar 20, 2020)
This release includes some new features and lots of bug fixes.

Feature

Implement report (ac13bf898a2927e0fa838037de1772e3f5f60b74)

Implement dc.show_schema() (487a14fb4b14ed6aead0f724a03dfc2666d3ce4f)

Implement dc.info (50e3e18d4423833bf70e5fae72ba6037c5b0958e)

Support template (8a3a4a370aa796775fdb007b9df17b84fd53d282)

Fix

Remove_if_empty on template (443097273fe72aa1f7a5280c527e92b89c69eeb9)

Fix parameter names (ef0119f498af10f3d5662edafcbafc92ea5c2567)

Improved plot(df) efficiency (1ecd54376bcef015d5091a9a6a2bef1f7ed172b0)

Fixed xtic rounding (697496e8d9687c8f675eca88c0af83aa6d995beb)

Fix scatter and top-k nan (6a1bdb7780afe62b152997a75715b21dc1cb018a)

Fixed xtics for histograms (1f13d1b51ae8e1ad50e72829575a4bc27c4cc591)

plot_correlation only supports numerical data (c06726063a1dc7910c11111a421c521fc481d0bf)

It works for the columns with missing values (5329c54cc9112a1d011955518e61a729f8e28007)

Make the tooltip style align with plot(df) (33e5403fb828b924a9cbacc0d2395e3f1a5eaa87)

Documentation

Add documentation (4f342ce2890529e617284abc54c3d84cd6d4f6c2)

dataprep-0.2.0-py3-none-any.whl(52.31 KB)
dataprep-0.2.0.tar.gz(41.91 KB)