DataPrep — The easiest way to prepare data in Python

Overview

Documentation | Discord | Forum

DataPrep lets you prepare your data using a single library with a few lines of code.

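Currently, you can use DataPrep to:

  • Collect data from common data sources (through dataprep.connector)
  • Do your exploratory data analysis (through dataprep.eda)
  • Clean and standardize data (through dataprep.clean)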

Releases

DataPrep is released on PyPI and conda-forge.

Installation

pip install -U dataprep

EDA

DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

Create Profile Reports, Fast

You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

  • 10X Faster: DataPrep.EDA can be 10X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
  • Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
  • Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()

Click here to see the generated report of the above code.

Click here to see the benchmark result.

Try DataPrep.EDA Online: DataPrep.EDA Demo in Colab

Innovative System Design

DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.

  • Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularity with a single function call. All needed visualizations will be automatically and intelligently generated for you.
  • Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
  • How-to Guide: A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.

Click here to check all the supported tasks.

Check plot, plot_correlation, plot_missing and create_report to see how each function works.
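
As a rough illustration of the task-centric design, the sketch below calls the same functions at different granularities. The helper name run_eda_tasks and its column arguments are illustrative and not part of DataPrep; the import is deferred so that defining the helper does not require DataPrep to be installed.

```python
def run_eda_tasks(df, column_x, column_y):
    """Run several EDA tasks on a DataFrame, from coarse to fine granularity.

    `run_eda_tasks` is an illustrative helper, not part of DataPrep.
    """
    # Imported lazily so that defining the helper does not require DataPrep.
    from dataprep.eda import plot, plot_correlation, plot_missing

    return {
        "overview": plot(df),                       # every column at a glance
        "univariate": plot(df, column_x),           # one column in detail
        "bivariate": plot(df, column_x, column_y),  # relationship between two columns
        "correlation": plot_correlation(df),        # correlations between columns
        "missing": plot_missing(df),                # missing-value patterns
    }
```

For the titanic DataFrame loaded earlier, a call such as run_eda_tasks(df, "Age", "Fare") would produce one result per task, assuming the dataset has columns with those names.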

Clean

DataPrep.Clean contains simple functions designed for cleaning and validating data in a DataFrame. It provides

  • A Unified API: each function follows the syntax clean_{type}(df, 'column name') (see an example below).
  • Speed: the computations are parallelized using Dask. It can clean 50K rows per second on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).
  • Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning.

The following example shows how to clean and standardize a column of country names.

from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3              tr          Turkey
4               NA            NaN

Type validation is also supported:

from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool

Check clean_headers, clean_country, clean_date, clean_duplication, clean_email, clean_lat_long, clean_ip, clean_phone, clean_text, clean_url, clean_address and clean_df to see how each function works.

Connector

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!

Connector offers several benefits:

  • A unified API: You can fetch data from dozens of popular websites with one or two lines of code.
  • Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.
  • Speed: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.
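
To make the pagination and concurrency ideas concrete, here is a minimal sketch that uses only the standard library. It is not Connector's implementation: fetch_page is a hypothetical stand-in for one HTTP request, PAGE_SIZE and TOTAL_RECORDS are made-up values, and rate limiting is left out.

```python
import asyncio

PAGE_SIZE = 100          # hypothetical page size of the fake API
TOTAL_RECORDS = 1_000    # hypothetical size of the full result set


async def fetch_page(page: int) -> list[int]:
    """Stand-in for one HTTP request; returns the records on `page`."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    start = page * PAGE_SIZE
    return list(range(start, min(start + PAGE_SIZE, TOTAL_RECORDS)))


async def query(count: int, concurrency: int = 1) -> list[int]:
    """Fetch up to `count` records, at most `concurrency` pages at a time."""
    limiter = asyncio.Semaphore(concurrency)

    async def limited_fetch(page: int) -> list[int]:
        async with limiter:  # caps the number of requests in flight
            return await fetch_page(page)

    pages_needed = -(-count // PAGE_SIZE)  # ceiling division
    pages = await asyncio.gather(*(limited_fetch(p) for p in range(pages_needed)))
    records = [record for page in pages for record in page]
    return records[:count]  # trim the last page to the requested count


records = asyncio.run(query(count=250, concurrency=5))
print(len(records))
```

The caller states only how many records it wants and how many requests may run at once; the page arithmetic stays inside query, which is the division of labor that _count and _concurrency expose.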

How to fetch all publications of Andrew Y. Ng?

from dataprep.connector import connect
conn_dblp = connect("dblp", _concurrency = 5)
# Top-level await works in Jupyter/IPython; in a script, run this inside an async function.
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)

Here, you can find detailed Examples.

Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.

Documentation

The following documentation can give you an impression of what DataPrep can do:

Contribute

There are many ways to contribute to DataPrep.

  • Submit bugs and help us verify fixes as they are checked in.
  • Review the source code changes.
  • Engage with other DataPrep users and developers on StackOverflow.
  • Help each other in the DataPrep Community Discord and Forum.
  • Twitter
  • Contribute bug fixes.
  • Provide use cases and write down your user experience.

Please take a look at our wiki for development documentation!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

  • Pandas Profiling

    Inspired the report functionality and insights provided in dataprep.eda.

  • missingno

    Inspired the missing value analysis in dataprep.eda.

Citing DataPrep

If you use DataPrep, please consider citing the following paper:

Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.

BibTeX entry:

@inproceedings{dataprepeda2021,
  author    = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},
  title     = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},
  booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},
  year      = {2021}
}
Comments
  • plot(df, x) descriptive statistics

    Task: add the column statistics from pandas-profiling to plot(df, x).

    My proposal is to add another tab with the statistics, just as we have a tab for each visualization. Or is it important to see the statistics beside a visualization? I think if we add a tab, it could be created with a Bokeh Div (or Paragraph or PreText), which allows formatted text (example), but there are perhaps other possibilities.

    Do we also want to add full dataset statistics like pandas-profiling? Perhaps we could add these at the top of the plot(df) output.

    I think it's important to consider the time required to compute the statistics. We can consider dask.compute(statistics), but should verify this is the most efficient approach. It would be ideal to perform minimal passes over the dataset in order to compute all the statistics.
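
One way to keep the number of passes low is a streaming (Welford-style) update, sketched below with the standard library only. This illustrates the single-pass idea and is not the approach DataPrep adopted; one_pass_stats is a hypothetical name.

```python
import math


def one_pass_stats(values):
    """Compute count, mean, std, min and max in a single pass (Welford's method)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    minimum = math.inf
    maximum = -math.inf
    for value in values:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        minimum = min(minimum, value)
        maximum = max(maximum, value)
    # Sample standard deviation; undefined for fewer than two values.
    std = math.sqrt(m2 / (count - 1)) if count > 1 else float("nan")
    return {"count": count, "mean": mean, "std": std, "min": minimum, "max": maximum}


# An iterator can be consumed only once, so this really is a single pass.
stats = one_pass_stats(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(stats)
```

Because each statistic is updated from the same loop iteration, the data is read once; the same pattern extends to per-partition partial results that are merged afterwards.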

    type: enhancement module: EDA 
    opened by brandonlockhart 24
  • feat(eda): add stat. in plot_missing

    Description

    Fixes #367 - EDA.plot_missing: enrich with stat. I implemented it only for the plot_missing(df) case.

    How Has This Been Tested?

    manually

    Snapshots:

    image

    Checklist:

    • [x] My code follows the style guidelines of this project
    • [x] I have already squashed the commits and made the commit message conform to the project standard.
    • [x] I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] My changes generate no new warnings
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [x] Any dependent changes have been merged and published in downstream modules
    opened by yuzhenmao 19
  • support of time series

    This issue outlines a rough idea for supporting time series in dataprep.eda.

    Essentially, datetime can be regarded as a numeric type: it can be converted to a timestamp (float) via datetime.timestamp() or pd.to_numeric(). Hence, we could do the following as initial support for time series.

    1. Identify the columns with datetime64 type in the dataframe.
    2. plot(df) & plot(df, x): handle a time series column like a numeric column, which can be binned. When showing the ticks of a time series column, show the datetime string via a function like datetime.strftime(). An example output is https://pandas.pydata.org/pandas-docs/version/0.13/visualization.html
    3. plot(df, x, y): when x is a datetime column and y is a numeric column, replace the scatter plot with a line chart, which shows how y changes with x. For all other cases, apply the processing from step 2.
    4. plot_correlation: we could ignore the datetime column as pandas does, or transform the datetime column to a numeric column via pd.to_numeric() and then apply the original processing.
    5. plot_missing: apply processing similar to step 2.
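
The binning and tick-labelling idea in step 2 can be sketched with the standard library alone. This is an illustration under simplifying assumptions (timezone-aware datetimes, equal-width bins); bin_datetimes is a hypothetical helper, not part of dataprep.eda.

```python
from datetime import datetime, timezone


def bin_datetimes(stamps, bins):
    """Histogram datetimes by treating them as numeric timestamps."""
    seconds = [s.timestamp() for s in stamps]
    low, high = min(seconds), max(seconds)
    width = (high - low) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for value in seconds:
        # The maximum value sits on the last edge, so clamp it into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    # Tick labels: convert each bin's left edge back to a readable date string.
    ticks = [
        datetime.fromtimestamp(low + i * width, tz=timezone.utc).strftime("%Y-%m-%d")
        for i in range(bins)
    ]
    return ticks, counts


stamps = [datetime(2020, 1, day, tzinfo=timezone.utc) for day in (1, 2, 3, 9, 10, 11)]
ticks, counts = bin_datetimes(stamps, bins=2)
print(ticks, counts)  # ['2020-01-01', '2020-01-06'] [3, 3]
```

The numeric view does the counting and the datetime view is used only for labels, which is why the existing numeric code path could be reused.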
    type: enhancement module: EDA 
    opened by jinglinpeng 14
  • Conda Installation of the dataprep AI is not supported

    Conda Installation for the data prep AI is not supported.

    $ conda install dataprep
    Collecting package metadata (current_repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Collecting package metadata (repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.

    PackagesNotFoundError: The following packages are not available from current channels:

    • dataprep

    Current channels:

    • https://repo.anaconda.com/pkgs/main/linux-64
    • https://repo.anaconda.com/pkgs/main/noarch
    • https://repo.anaconda.com/pkgs/r/linux-64
    • https://repo.anaconda.com/pkgs/r/noarch
    • https://conda.anaconda.org/conda-forge/linux-64
    • https://conda.anaconda.org/conda-forge/noarch

    To search for alternate channels that may provide the conda package you're looking for, navigate to

    https://anaconda.org
    

    and use the search bar at the top of the page.

    type: enhancement module: EDA 
    opened by abhisheksundarraman 13
  • Error concerning scipy.stats.stats when creating a report

    Describe the bug

    I get the following error when I try to run the example of creating a report:

    error happended in column:PassengerId
    Traceback (most recent call last):
      File "", line 1, in
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/__init__.py", line 68, in create_report
        "components": format_report(df, cfg, mode, progress),
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 76, in format_report
        comps = format_basic(edaframe, cfg)
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 274, in format_basic
        data, completions = basic_computations(df, cfg)
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 383, in basic_computations
        variables_data = compute_variables(df, cfg)
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/create_report/formatter.py", line 318, in compute_variables
        data[col] = cont_comps(df.frame[col], cfg)
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dataprep/eda/distribution/compute/univariate.py", line 200, in cont_comps
        data["chisq"] = chisquare(data["hist"][0])
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dask/array/stats.py", line 136, in chisquare
        return power_divergence(f_obs, f_exp=f_exp, ddof=ddof, axis=axis, lambda_="pearson")
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/dask/array/stats.py", line 144, in power_divergence
        if lambda_ not in scipy.stats.stats._power_div_lambda_names:
      File "/home/user/anaconda3/envs/test-data-prep/lib/python3.8/site-packages/scipy/stats/stats.py", line 54, in __getattr__
        raise AttributeError(
    AttributeError: scipy.stats.stats is deprecated and has no attribute _power_div_lambda_names. Try looking in scipy.stats instead.

    To Reproduce

    from dataprep.datasets import load_dataset
    df = load_dataset("titanic")
    from dataprep.eda import create_report
    report = create_report(df)
    

    Expected behavior

    To get the EDA report.

    Desktop (please complete the following information):

    • OS: Ubuntu 20.04.4 LTS
    • Platform [Python script]
    • Platform Version [PyCharm 2021.3.2 (Community Edition)]
    • Python Version [3.8.12]
    • Dataprep Version [0.4.2]

    Additional context

    I have tested in a fresh conda env with pip install dataprep. Here are the packages installed:

    # Name                    Version                   Build  Channel
    _libgcc_mutex             0.1                        main  
    _openmp_mutex             4.5                       1_gnu  
    aiohttp                   3.8.1                    pypi_0    pypi
    aiosignal                 1.2.0                    pypi_0    pypi
    argon2-cffi               21.3.0                   pypi_0    pypi
    argon2-cffi-bindings      21.2.0                   pypi_0    pypi
    asttokens                 2.0.5                    pypi_0    pypi
    async-timeout             4.0.2                    pypi_0    pypi
    attrs                     21.4.0                   pypi_0    pypi
    backcall                  0.2.0                    pypi_0    pypi
    bleach                    4.1.0                    pypi_0    pypi
    bokeh                     2.4.2                    pypi_0    pypi
    ca-certificates           2021.10.26           h06a4308_2  
    certifi                   2021.10.8        py38h06a4308_2  
    cffi                      1.15.0                   pypi_0    pypi
    charset-normalizer        2.0.12                   pypi_0    pypi
    click                     8.0.4                    pypi_0    pypi
    cloudpickle               2.0.0                    pypi_0    pypi
    cycler                    0.11.0                   pypi_0    pypi
    dask                      2021.12.0                pypi_0    pypi
    dataprep                  0.4.2                    pypi_0    pypi
    debugpy                   1.5.1                    pypi_0    pypi
    decorator                 5.1.1                    pypi_0    pypi
    defusedxml                0.7.1                    pypi_0    pypi
    entrypoints               0.4                      pypi_0    pypi
    executing                 0.8.2                    pypi_0    pypi
    flask                     2.0.3                    pypi_0    pypi
    flask-cors                3.0.10                   pypi_0    pypi
    fonttools                 4.29.1                   pypi_0    pypi
    frozenlist                1.3.0                    pypi_0    pypi
    fsspec                    2022.2.0                 pypi_0    pypi
    idna                      3.3                      pypi_0    pypi
    importlib-resources       5.4.0                    pypi_0    pypi
    ipykernel                 6.9.1                    pypi_0    pypi
    ipython                   8.1.0                    pypi_0    pypi
    ipython-genutils          0.2.0                    pypi_0    pypi
    ipywidgets                7.6.5                    pypi_0    pypi
    itsdangerous              2.1.0                    pypi_0    pypi
    jedi                      0.18.1                   pypi_0    pypi
    jinja2                    3.0.3                    pypi_0    pypi
    joblib                    1.1.0                    pypi_0    pypi
    jsonpath-ng               1.5.3                    pypi_0    pypi
    jsonschema                4.4.0                    pypi_0    pypi
    jupyter-client            7.1.2                    pypi_0    pypi
    jupyter-core              4.9.2                    pypi_0    pypi
    jupyterlab-pygments       0.1.2                    pypi_0    pypi
    jupyterlab-widgets        1.0.2                    pypi_0    pypi
    kiwisolver                1.3.2                    pypi_0    pypi
    ld_impl_linux-64          2.35.1               h7274673_9  
    levenshtein               0.16.0                   pypi_0    pypi
    libffi                    3.3                  he6710b0_2  
    libgcc-ng                 9.3.0               h5101ec6_17  
    libgomp                   9.3.0               h5101ec6_17  
    libstdcxx-ng              9.3.0               hd4cf53a_17  
    locket                    0.2.1                    pypi_0    pypi
    markupsafe                2.1.0                    pypi_0    pypi
    matplotlib                3.5.1                    pypi_0    pypi
    matplotlib-inline         0.1.3                    pypi_0    pypi
    metaphone                 0.6                      pypi_0    pypi
    mistune                   0.8.4                    pypi_0    pypi
    multidict                 6.0.2                    pypi_0    pypi
    nbclient                  0.5.11                   pypi_0    pypi
    nbconvert                 6.4.2                    pypi_0    pypi
    nbformat                  5.1.3                    pypi_0    pypi
    ncurses                   6.3                  h7f8727e_2  
    nest-asyncio              1.5.4                    pypi_0    pypi
    nltk                      3.7                      pypi_0    pypi
    notebook                  6.4.8                    pypi_0    pypi
    numpy                     1.22.2                   pypi_0    pypi
    openssl                   1.1.1m               h7f8727e_0  
    packaging                 21.3                     pypi_0    pypi
    pandas                    1.4.1                    pypi_0    pypi
    pandocfilters             1.5.0                    pypi_0    pypi
    parso                     0.8.3                    pypi_0    pypi
    partd                     1.2.0                    pypi_0    pypi
    pexpect                   4.8.0                    pypi_0    pypi
    pickleshare               0.7.5                    pypi_0    pypi
    pillow                    9.0.1                    pypi_0    pypi
    pip                       21.2.4           py38h06a4308_0  
    ply                       3.11                     pypi_0    pypi
    prometheus-client         0.13.1                   pypi_0    pypi
    prompt-toolkit            3.0.28                   pypi_0    pypi
    ptyprocess                0.7.0                    pypi_0    pypi
    pure-eval                 0.2.2                    pypi_0    pypi
    pycparser                 2.21                     pypi_0    pypi
    pydantic                  1.9.0                    pypi_0    pypi
    pygments                  2.11.2                   pypi_0    pypi
    pyparsing                 3.0.7                    pypi_0    pypi
    pyrsistent                0.18.1                   pypi_0    pypi
    python                    3.8.12               h12debd9_0  
    python-crfsuite           0.9.7                    pypi_0    pypi
    python-dateutil           2.8.2                    pypi_0    pypi
    python-stdnum             1.17                     pypi_0    pypi
    pytz                      2021.3                   pypi_0    pypi
    pyyaml                    6.0                      pypi_0    pypi
    pyzmq                     22.3.0                   pypi_0    pypi
    rapidfuzz                 1.8.3                    pypi_0    pypi
    readline                  8.1.2                h7f8727e_1  
    regex                     2021.11.10               pypi_0    pypi
    scipy                     1.8.0                    pypi_0    pypi
    send2trash                1.8.0                    pypi_0    pypi
    setuptools                58.0.4           py38h06a4308_0  
    six                       1.16.0                   pypi_0    pypi
    sqlite                    3.37.2               hc218d9a_0  
    stack-data                0.2.0                    pypi_0    pypi
    terminado                 0.13.1                   pypi_0    pypi
    testpath                  0.6.0                    pypi_0    pypi
    tk                        8.6.11               h1ccaba5_0  
    toolz                     0.11.2                   pypi_0    pypi
    tornado                   6.1                      pypi_0    pypi
    tqdm                      4.62.3                   pypi_0    pypi
    traitlets                 5.1.1                    pypi_0    pypi
    typing-extensions         4.1.1                    pypi_0    pypi
    varname                   0.8.1                    pypi_0    pypi
    wcwidth                   0.2.5                    pypi_0    pypi
    webencodings              0.5.1                    pypi_0    pypi
    werkzeug                  2.0.3                    pypi_0    pypi
    wheel                     0.37.1             pyhd3eb1b0_0  
    widgetsnbextension        3.5.2                    pypi_0    pypi
    wordcloud                 1.8.1                    pypi_0    pypi
    xz                        5.2.5                h7b6447c_0  
    yarl                      1.7.2                    pypi_0    pypi
    zipp                      3.7.0                    pypi_0    pypi
    zlib                      1.2.11               h7f8727e_4 
    
    type: bug 
    opened by mina-marmpena 12
  • eda.create_report: page design prototype

    We have added stats info to our plot function, so we can now use all that information to generate an HTML page for our users.

    I prototyped this layout without adding any actual plots, so we can change the design easily. Every element maps 1:1 to the definitions in our current code, which I believe gives a better idea of what this webpage would look like. The page width is 1920px.

    [Screenshot: page layout prototype]

    The prototype is available here for anyone who needs a more detailed inspection: https://www.figma.com/file/txfQwkocxBOFOilPvaI9MC/Untitled?node-id=0%3A1

    Let me know if you have any suggestions. @jnwang @jinglinpeng @dovahcrow @Waterpine @brandonlockhart

    type: enhancement module: EDA 
    opened by eutialia 12
  • create_report() crashes on a dataframe with a constant numeric column

    create_report() crashes when the dataframe contains a column of all constant numeric values. Interestingly, I didn't see the failure when I used plot(), plot_correlation(), or plot_missing(), so maybe they have logic to adapt to that condition.

    Below is a simple repro case.

    Repro:

    dp.create_report(pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]}))
    

    Error trace:

    ---------------------------------------------------------------------------
    LinAlgError                               Traceback (most recent call last)
    <ipython-input-65-6594a8550d86> in <module>
    ----> 1 dp.create_report(pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]}))
    
    c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\create_report\__init__.py in create_report(df, title, mode, progress)
         52         "resources": INLINE.render(),
         53         "title": title,
    ---> 54         "components": format_report(df, mode, progress),
         55     }
         56     template_base = ENV_LOADER.get_template("base.html")
    
    c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\create_report\formatter.py in format_report(df, mode, progress)
         61         df = string_dtype_to_object(df)
         62         if mode == "basic":
    ---> 63             comps = format_basic(df)
         64         # elif mode == "full":
         65         #     comps = format_full(df)
    
    c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\create_report\formatter.py in format_basic(df)
         97             category=RuntimeWarning,
         98         )
    ---> 99         (data,) = dask.compute(data)
        100 
        101     # results dictionary
    
    c:\Miniconda\envs\python3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
        450         postcomputes.append(x.__dask_postcompute__())
        451 
    --> 452     results = schedule(dsk, keys, **kwargs)
        453     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
        454 
    
    c:\Miniconda\envs\python3\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
         82         get_id=_thread_get_id,
         83         pack_exception=pack_exception,
    ---> 84         **kwargs
         85     )
         86 
    
    c:\Miniconda\envs\python3\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
        484                         _execute_task(task, data)  # Re-execute locally
        485                     else:
    --> 486                         raise_exception(exc, tb)
        487                 res, worker_id = loads(res_info)
        488                 state["cache"][key] = res
    
    c:\Miniconda\envs\python3\lib\site-packages\dask\local.py in reraise(exc, tb)
        314     if exc.__traceback__ is not tb:
        315         raise exc.with_traceback(tb)
    --> 316     raise exc
        317 
        318 
    
    c:\Miniconda\envs\python3\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
        220     try:
        221         task, data = loads(task_info)
    --> 222         result = _execute_task(task, data)
        223         id = get_id()
        224         result = dumps((result, id))
    
    c:\Miniconda\envs\python3\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
        119         # temporaries by their reference count and can execute certain
        120         # operations in-place.
    --> 121         return func(*(_execute_task(a, cache) for a in args))
        122     elif not ishashable(arg):
        123         return arg
    
    c:\Miniconda\envs\python3\lib\site-packages\dataprep\eda\distribution\compute\common.py in gaussian_kde(arr)
        230 def gaussian_kde(arr: np.ndarray) -> Tuple[float, float]:
        231     """Delayed version of scipy gaussian_kde."""
    --> 232     return cast(Tuple[np.ndarray, np.ndarray], gaussian_kde_(arr))
        233 
        234 
    
    c:\Miniconda\envs\python3\lib\site-packages\scipy\stats\kde.py in __init__(self, dataset, bw_method, weights)
        204             self._neff = 1/sum(self._weights**2)
        205 
    --> 206         self.set_bandwidth(bw_method=bw_method)
        207 
        208     def evaluate(self, points):
    
    c:\Miniconda\envs\python3\lib\site-packages\scipy\stats\kde.py in set_bandwidth(self, bw_method)
        554             raise ValueError(msg)
        555 
    --> 556         self._compute_covariance()
        557 
        558     def _compute_covariance(self):
    
    c:\Miniconda\envs\python3\lib\site-packages\scipy\stats\kde.py in _compute_covariance(self)
        566                                                bias=False,
        567                                                aweights=self.weights))
    --> 568             self._data_inv_cov = linalg.inv(self._data_covariance)
        569 
        570         self.covariance = self._data_covariance * self.factor**2
    
    c:\Miniconda\envs\python3\lib\site-packages\scipy\linalg\basic.py in inv(a, overwrite_a, check_finite)
        975         inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
        976     if info > 0:
    --> 977         raise LinAlgError("singular matrix")
        978     if info < 0:
        979         raise ValueError('illegal value in %d-th argument of internal '
    
    LinAlgError: singular matrix
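
The trace ends in the covariance inversion inside gaussian_kde, which fails because a constant column has zero variance. A possible guard is sketched below; it is an assumption about how such a check could look, not the fix that DataPrep adopted. is_constant, density_or_none and placeholder_kde are hypothetical names, and placeholder_kde merely stands in for a real density estimator.

```python
def is_constant(values, tolerance=0.0):
    """Return True when every value equals the first one (within `tolerance`)."""
    values = list(values)
    return all(abs(v - values[0]) <= tolerance for v in values)


def density_or_none(values, estimate_density):
    """Run `estimate_density` only when the column has nonzero spread."""
    if is_constant(values):
        return None  # a KDE is undefined here: the covariance matrix is singular
    return estimate_density(values)


def placeholder_kde(values):
    """Stand-in for a real estimator such as scipy.stats.gaussian_kde."""
    return f"KDE over {len(values)} points"


columns = {"A": [1, 2, 3], "B": [1, 1, 1]}
results = {name: density_or_none(col, placeholder_kde) for name, col in columns.items()}
print(results)  # {'A': 'KDE over 3 points', 'B': None}
```

With the repro dataframe, column B is skipped instead of raising, and the report code can decide how to render a column that has no density estimate.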
    
    
    type: bug module: EDA 
    opened by dhuntley1023 11
  • plot(df) and plot_correlation(df) fail when data has 'list' columns

    Running plot(df) and plot_correlation(df) on the following dataframe fails because the author column contains lists.

    For plot(df), the reported error is TypeError: unhashable type: 'list'

    For plot_correlation(df), the reported error is AssertionError: No numerical columns found

    [Screenshot: the dataframe]

    type: bug module: EDA 
    opened by jnwang 11
  • eda.plot: support text analysis

    The issue is related to the task of supporting text analysis. The initial idea is that:

    1. Have a type system, and support a type named 'text'.

    2. plot(df): for a text column, show a word cloud.

    3. plot(df, x): for a text column, show information about the text, such as the length distribution, the most common words, and lowercase/uppercase letter statistics.
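
The statistics named in item 3 can be sketched with the standard library as follows. This only illustrates the kind of summary proposed; text_column_stats is a hypothetical helper, and splitting words on whitespace is a simplifying assumption.

```python
from collections import Counter


def text_column_stats(texts, top_n=3):
    """Summarize a text column: length distribution, common words, letter case."""
    lengths = Counter(len(text) for text in texts)
    # Words are split on whitespace and lowercased before counting.
    words = Counter(word.lower() for text in texts for word in text.split())
    letters = [ch for text in texts for ch in text if ch.isalpha()]
    return {
        "length_distribution": dict(lengths),
        "most_common_words": words.most_common(top_n),
        "uppercase": sum(ch.isupper() for ch in letters),
        "lowercase": sum(ch.islower() for ch in letters),
    }


stats = text_column_stats(["Data prep", "data Cleaning", "EDA"], top_n=1)
print(stats)
```

The word counts are also what a word cloud for plot(df) would need, so both views could share one computation.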

    For now, I think we should start with the type system, which is needed for multiple tasks. Please let me know what you think about the overall design. If we agree, I will open another issue about the type system. @jnwang @dovahcrow @brandonlockhart @Waterpine @dylanzxc

    type: enhancement Epic module: EDA 
    opened by jinglinpeng 10
  • data_connector: Fetch all publications of one specific author

    Suppose a user wants to fetch all publications of one specific author (e.g., Jian Pei). Dataprep.data_connector cannot meet her needs. For example, the first paper is not written by Jian Pei, but it was returned since the author list contains the keywords Jian and Pei.

    [Screenshot: returned publications]

    A user can get all publications of Jian Pei through this API: https://dblp.org/search/publ/api?q=author%3AJian_Pei%3A
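
A small sketch of how such a URL could be built for an arbitrary author. The pattern (spaces replaced by underscores, wrapped as author:...:) is inferred from the single URL above and is an assumption for other names; dblp_author_query_url is a hypothetical helper.

```python
from urllib.parse import quote


def dblp_author_query_url(author: str) -> str:
    """Build a dblp publication-search URL restricted to one exact author."""
    # Pattern inferred from the example URL: author:First_Last:
    term = "author:" + author.replace(" ", "_") + ":"
    # safe="" makes quote() percent-encode the colons as %3A.
    return "https://dblp.org/search/publ/api?q=" + quote(term, safe="")


url = dblp_author_query_url("Jian Pei")
print(url)  # https://dblp.org/search/publ/api?q=author%3AJian_Pei%3A
```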

    Please consider supporting this feature.

    type: enhancement module: Connector 
    opened by jnwang 10
  • feat(clean): add clean_df function

    Description

    clean_df function: conduct a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame. Closes #503.

    How Has This Been Tested?

    I have tested this function using a few real-world datasets. I will also add my test function later.

    Checklist:

    • [x] My code follows the style guidelines of this project
    • [x] I have already squashed the commits and made the commit message conform to the project standard.
    • [x] I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
    • [x] I have performed a self-review of my own code
    • [x] I have commented my code, particularly in hard-to-understand areas
    • [x] I have made corresponding changes to the documentation
    • [ ] My changes generate no new warnings
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by AndyWangSFU 9
  • Add the option to pass a target variable when creating the EDA report

    I am often interested in understanding the relationship between one specific column (my target variable) and the others.

    It would be nice if we could pass a target variable when creating the EDA report: i.e. eda.create_report(target="has_survived").

    Then all variable plots would be crossed with this target variable. You already have this functionality in the eda.plot(df, target) function from the docs.

    This is a functionality that I like to use in the sweetviz library:

    sweetviz.analyze(source, target_feat)
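
Until such an option exists, a similar effect can be approximated by calling the two-column form of plot once per column. The sketch below is a workaround under assumptions, not a DataPrep feature: the helper names `target_pairs` and `plot_against_target` and the column names `age` and `fare` are hypothetical, and dataprep is imported lazily so the pairing logic can be run without it.

```python
def target_pairs(columns, target):
    """Return (column, target) pairs for every column except the target."""
    return [(col, target) for col in columns if col != target]


def plot_against_target(df, target):
    """Call plot(df, col, target) once per non-target column (hypothetical helper)."""
    from dataprep.eda import plot  # lazy import; requires dataprep to be installed

    return [plot(df, col, tgt) for col, tgt in target_pairs(list(df.columns), target)]


pairs = target_pairs(["age", "fare", "has_survived"], "has_survived")
print(pairs)
```

Each call produces a separate bivariate view, so this does not give the single combined report that the issue asks for.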
    
    type: enhancement 
    opened by rluthi 0
  • Need data-type for each column in create_report function.

    Hi all. Currently, I need a data-type (dtype) parameter in create_report(), like the one in the plot() function. This parameter would let me generate a report with numerical/categorical features without affecting "Distinct Count".

    The image below was automatically generated by create_report. However, my expected output is numerical stats and visualizations.

    My expected feature:

    dttype = {c: "Continuous" for c in dataframe.columns}
    create_report(dataframe, dtype=dttype)
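
As a stopgap, the same mapping can be passed to plot(), which the issue notes already has a data-type parameter. The sketch below is written under that assumption: the helper names `continuous_dtypes` and `plot_as_continuous` are hypothetical, and whether create_report accepts `dtype` is exactly what this issue leaves open.

```python
def continuous_dtypes(columns):
    """Map each listed column to the "Continuous" type name used in the issue."""
    return {col: "Continuous" for col in columns}


def plot_as_continuous(df, columns):
    """Plot df while treating the listed columns as continuous (hypothetical helper)."""
    from dataprep.eda import plot  # lazy import; requires dataprep to be installed

    # Assumption: plot() accepts a per-column dtype mapping, as the issue describes.
    return plot(df, dtype=continuous_dtypes(columns))


dtype_map = continuous_dtypes(["a", "b"])
print(dtype_map)
```

Listing the columns explicitly avoids forcing a numeric type onto columns that hold free text.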
    

    If there is any solution to my problem, please let me know. Thanks.

    type: enhancement 
    opened by anthng 2
  • EDA plot() not showing properly in VScode

    Describe the bug: EDA plot() does not display properly in VS Code.

    To Reproduce Steps to reproduce the behavior:

    1. Open new Jupyter notebook
    2. Import from dataprep.eda import plot, plot_correlation, create_report, plot_missing
    3. Call the .head() method on any test dataset so that the first DataFrame rows are shown as output
    4. In the next cell, call the plot() function from dataprep
    5. Near the end of execution, the function adds padding to the left of all output cells, and it is impossible to scroll right to see the full output

    Or:

    
    import numpy as np
    import pandas as pd
    import datetime
    from datetime import date
    import matplotlib
    import seaborn as sns
    import matplotlib.pyplot as plt
    import plotly.graph_objects as go
    from sklearn.preprocessing import StandardScaler, normalize
    from sklearn import metrics
    from sklearn.mixture import GaussianMixture
    from mlxtend.frequent_patterns import apriori
    from mlxtend.frequent_patterns import association_rules
    import warnings
    warnings.filterwarnings('ignore')
    data=pd.read_csv('marketing_campaign.csv',header=0,sep=';')
    
    from dataprep.eda import plot, plot_correlation, create_report, plot_missing
    plot(data)
    

    Expected behavior: I expected the output to be displayed normally on the screen.

    Desktop (please complete the following information):

    • OS: Windows 11
    • Browser: None
    • Platform: VSCode
    • Platform Version: 1.74.1
    • Python Version: 3.10.9
    • Dataprep Version: 0.4.4

    type: bug 
    opened by ldavidr3 1
  • Make the documentation more lightweight and readable

    DataPrep is a great tool! And the automatic EDA module is probably the best open-source option out there.

    However, I find the package difficult to understand and learn from the documentation. The documentation makes it feel more complicated than it actually is.

    Some ideas for improvement:

    • Highlight the two-line EDA report creation (only advanced users need the details)
    • Separate the documentation from the case studies
    • Organise the sub-sections in the docs of the Clean module
    type: enhancement 
    opened by rluthi 3
  • build(deps): bump express from 4.17.1 to 4.18.2 in /dataprep/clean/gui/clean_frontend

    Bumps express from 4.17.1 to 4.18.2.

    Release notes

    Sourced from express's releases.

    4.18.2

    4.18.1

    • Fix hanging on large stack of sync routes

    4.18.0

    ... (truncated)

    Changelog

    Sourced from express's changelog.

    4.18.2 / 2022-10-08

    4.18.1 / 2022-04-29

    • Fix hanging on large stack of sync routes

    4.18.0 / 2022-04-25

    ... (truncated)

    dependencies javascript 
    opened by dependabot[bot] 1
  • build(deps): bump certifi from 2022.6.15 to 2022.12.7

    Bumps certifi from 2022.6.15 to 2022.12.7.

    dependencies python 
    opened by dependabot[bot] 1
Releases(v0.4.4-alpha.1)
  • v0.4.4-alpha.1(Jul 8, 2022)

    Bugfixes 🐛

    • eda.create-db-report: add missing style files from previously ignored by gitignore (75361915)
    • eda: jinja2.markup import broken with 3.1 (b9b60a0a)
    • eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)
    • eda: report for empty df (485e58d3)
    • eda: plot_diff when columns are not aligned (7e53dbf6)
    • eda: scipy version issue (8798a146)
    • eda: na column name when upgrade dask (43fdd1a6)
    • eda: pd grouper issue when upgrade dask (761c4455)
    • clean: delete abundant print (0e072a80)
    • eda.plot: fix display issue in notebook (6ed13b09)
    • eda.plot: fix pagination styling issues (8396f2d9)
    • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
    • eda: interaction error in report for cat-only df (e60239a0)
    • eda: fix cat-cat error (94f70ef6)
    • eda: fix stat layout issue (5bb535d7)
    • eda.create_report: fix display issue in notebook (487659fd)
    • clean: remove usaddress library (c192ab43)
    • clean: fix the bug of am, pm (4c3b2312)
    • clean: fix the bug of am, pm (caf2b372)
    • eda: fixed issue where plots weren't rendering twice (fd3fd573)
    • eda: wordcloud setting in terminal (00901699)

    Features ✨

    • eda: added sorting feature for create_diff_report (8b187a6c)
    • eda: add running total for time series test (d0940726)
    • eda: add create_db_report submodule (9784cceb)
    • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
    • eda.create_report: add sort by approximate unique (5738db2a)
    • eda: add sort variables by alphabetical and missing (fb93493a)
    • clean: New version of GUI (6828807b)
    • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

    Code Quality + Testing 💯

    • eda: add tests for intermediate compute functions (700add77)

    Documentation 📃

    • clean: revise _init.py (02ede811)
    • clean: add doc of clean GUI (5e2f38ac)
    • eda.plot: add pagination for plot (c4cd4b97)
    • eda.create_report: remove old doc file (e1153cb1)
    • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
    • eda: add doc for getting imdt result (6fbcfe4c)
    • eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.4.4a1-py3-none-any.whl(9.13 MB)
    dataprep-0.4.4a1.tar.gz(8.66 MB)
  • v0.4.4(Jul 15, 2022)

    Bugfixes 🐛

    • eda: type error for npartitions (57db1ede)
    • eda.create-db-report: remove pystache dependency and replace it with jinja2 (676fff1a)
    • eda.create-db-report: add missing style files from previously ignored by gitignore (75361915)
    • eda: jinja2.markup import broken with 3.1 (b9b60a0a)
    • eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)
    • eda: report for empty df (485e58d3)
    • eda: plot_diff when columns are not aligned (7e53dbf6)
    • eda: scipy version issue (8798a146)
    • eda: na column name when upgrade dask (43fdd1a6)
    • eda: pd grouper issue when upgrade dask (761c4455)
    • clean: delete abundant print (0e072a80)
    • eda.plot: fix display issue in notebook (6ed13b09)
    • eda.plot: fix pagination styling issues (8396f2d9)
    • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
    • eda: interaction error in report for cat-only df (e60239a0)
    • eda: fix cat-cat error (94f70ef6)
    • eda: fix stat layout issue (5bb535d7)
    • eda.create_report: fix display issue in notebook (487659fd)
    • clean: remove usaddress library (c192ab43)
    • clean: fix the bug of am, pm (4c3b2312)
    • clean: fix the bug of am, pm (caf2b372)
    • eda: fixed issue where plots weren't rendering twice (fd3fd573)
    • eda: wordcloud setting in terminal (00901699)

    Features ✨

    • clean: add updated version of rapidfuzz and python-crfsuite (59f35066)
    • eda.create-db-report: add save report functionality (2fb16ad6)
    • eda: add get_db_names (a7bf8206)
    • eda: added sorting feature for create_diff_report (8b187a6c)
    • eda: add running total for time series test (d0940726)
    • eda: add create_db_report submodule (9784cceb)
    • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
    • eda.create_report: add sort by approximate unique (5738db2a)
    • eda: add sort variables by alphabetical and missing (fb93493a)
    • clean: New version of GUI (6828807b)
    • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

    Code Quality + Testing 💯

    • eda: add test for npartition type error (5affd75a)
    • eda: add tests for intermediate compute functions (700add77)

    Documentation 📃

    • eda: add the use-case of dataprep.eda for spark dataframe with ray (4bf14e7c)
    • clean: revise _init.py (02ede811)
    • clean: add doc of clean GUI (5e2f38ac)
    • eda.plot: add pagination for plot (c4cd4b97)
    • eda.create_report: remove old doc file (e1153cb1)
    • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
    • eda: add doc for getting imdt result (6fbcfe4c)
    • eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.4.4-py3-none-any.whl(9.13 MB)
    dataprep-0.4.4.tar.gz(8.66 MB)
  • v0.4.3(Mar 31, 2022)

    Bugfixes 🐛

    • eda: fixed create_report browser sort rendering issue, returned context values directly instead of selecting by css class (331a9644)
    • eda: report for empty df (485e58d3)
    • eda: plot_diff when columns are not aligned (7e53dbf6)
    • eda: scipy version issue (8798a146)
    • eda: na column name when upgrade dask (43fdd1a6)
    • eda: pd grouper issue when upgrade dask (761c4455)
    • clean: delete abundant print (0e072a80)
    • eda.plot: fix display issue in notebook (6ed13b09)
    • eda.plot: fix pagination styling issues (8396f2d9)
    • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
    • eda: interaction error in report for cat-only df (e60239a0)
    • eda: fix cat-cat error (94f70ef6)
    • eda: fix stat layout issue (5bb535d7)
    • eda.create_report: fix display issue in notebook (487659fd)
    • clean: remove usaddress library (c192ab43)
    • clean: fix the bug of am, pm (4c3b2312)
    • clean: fix the bug of am, pm (caf2b372)
    • eda: fixed issue where plots weren't rendering twice (fd3fd573)
    • eda: wordcloud setting in terminal (00901699)

    Features ✨

    • eda: added sorting feature for create_diff_report (8b187a6c)
    • eda: add running total for time series test (d0940726)
    • eda: add create_db_report submodule (9784cceb)
    • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
    • eda.create_report: add sort by approximate unique (5738db2a)
    • eda: add sort variables by alphabetical and missing (fb93493a)
    • clean: New version of GUI (6828807b)
    • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

    Code Quality + Testing 💯

    • eda: add tests for intermediate compute functions (700add77)

    Documentation 📃

    • clean: revise _init.py (02ede811)
    • clean: add doc of clean GUI (5e2f38ac)
    • eda.plot: add pagination for plot (c4cd4b97)
    • eda.create_report: remove old doc file (e1153cb1)
    • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
    • eda: add doc for getting imdt result (6fbcfe4c)
    • eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.4.3-py3-none-any.whl(9.02 MB)
    dataprep-0.4.3.tar.gz(8.57 MB)
  • v0.4.2(Feb 21, 2022)

    Bugfixes 🐛

    • eda: na column name when upgrade dask (43fdd1a6)
    • eda: pd grouper issue when upgrade dask (761c4455)
    • clean: delete abundant print (0e072a80)
    • eda.plot: fix display issue in notebook (6ed13b09)
    • eda.plot: fix pagination styling issues (8396f2d9)
    • eda: restyled plots into same row, set height + width of plots to be same (c6ffcd4d)
    • eda: interaction error in report for cat-only df (e60239a0)
    • eda: fix cat-cat error (94f70ef6)
    • eda: fix stat layout issue (5bb535d7)
    • eda.create_report: fix display issue in notebook (487659fd)
    • clean: remove usaddress library (c192ab43)
    • clean: fix the bug of am, pm (4c3b2312)
    • clean: fix the bug of am, pm (caf2b372)
    • eda: fixed issue where plots weren't rendering twice (fd3fd573)
    • eda: wordcloud setting in terminal (00901699)

    Features ✨

    • eda.plot: add pagination threshold and add auto jump in pagination navigation (cfdd0dec)
    • eda.create_report: add sort by approximate unique (5738db2a)
    • eda: add sort variables by alphabetical and missing (fb93493a)
    • clean: New version of GUI (6828807b)
    • eda: enriched show details tab by adding plots and overview statistics (eeb210db)

    Code Quality + Testing 💯

    • eda: add tests for intermediate compute functions (700add77)

    Documentation 📃

    • clean: add doc of clean GUI (5e2f38ac)
    • eda.plot: add pagination for plot (c4cd4b97)
    • eda.create_report: remove old doc file (e1153cb1)
    • eda.create_report: convert rst docs file to ipynb and add additional docs for variables sort (bf39a568)
    • eda: add doc for getting imdt result (6fbcfe4c)
    • eda: add the doc of run dataprep.eda on Hadoop yarn (628686d5)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.4.2-py3-none-any.whl(3.52 MB)
    dataprep-0.4.2.tar.gz(3.14 MB)
  • v0.4.1(Nov 25, 2021)

    v0.4.1

    Bugfixes 🐛

    • eda: stat layout in plot (946319f7)
    • eda: fix display in plot(df) (c11bb94c)
    • eda: report for pandas extension type (2cbb3873)
    • eda: fix saving imdt as json file (5ee6529f)

    Features ✨

    • clean: Add wiki and simple GUI (7f4ab12a)
    • eda: added overview and variables section for create_diff_report (dc4cf7da)
    • eda: add categorical interaction in create_report (7f13cd57)

    Code Quality + Testing 💯

    • eda: added basic automated tests (3a0653e0)

    Documentation 📃

    • eda: link create_diff_report to intro (05d9850b)
    • eda: added docs for create_diff_report (d8fc9d4b)
    • eda: enrich parameters in report (3d0a148a)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.4.1-py3-none-any.whl(3.37 MB)
    dataprep-0.4.1.tar.gz(3.00 MB)
  • v0.4.0(Oct 26, 2021)

    v0.4.0

    Bugfixes 🐛

    • eda: fix string type (b7e3321f)
    • eda: fix value table display (57281bc2)
    • eda: remove imdt output from plot (5c227e15)
    • eda: adjusted save report method to accept one parameter (4ceefcc1)
    • eda: clean config code and fix scatter sample param (8ab27f92)
    • plot_diff: fix ci issue (44ce81cf)
    • clean: clean_duplication issue 646 (ca9f7085)
    • eda: fix category type error (9750694a)

    Features ✨

    • eda: refactored code and added density parameter to plot_diff(df) (323ae6b0)
    • eda: save imdt as json file (78673867)
    • connector: integrate connectorx into connector (106457e3, a64e3563, 9f89d3bf)
    • clean: add clean_ml function (909cd196)
    • clean: add multiple clean functions for number types (3c05be58)
    • eda.diff: add plot_diff([df1..dfn], continuous) (3bfb4f57)
    • clean: support conversion into packed binary format in clean_ip (7e30f93f, 37a83b03)

    Code Quality + Testing 💯

    • eda: add densify test and doc for diff (f8d2054d)
    • eda: add test for config (ab3172f5)

    Performance 🚀

    • clean: update documentation of clean_duplication (50f90fa9)

    Documentation 📃

    • clean: change the introduction (862b4478)
    • eda: change eda colab position (ce25b17d, d00b0bd5)
    • clean: add documentation for multiple clean functions for number types (732480f1)
    • clean: add documentation for clean_ml function (0c139db6)
    • eda: scatter.sample_rate added to documentation (549b3193)
    • eda: fix plot show (0b40a40f)
    • readme: add benchmark link (e807f798)
    • readme: small text change on clean and connector (e193a6a7)
    • readme: fix titanic link (29cc06cc)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.4.0-py3-none-any.whl(2.00 MB)
    dataprep-0.4.0.tar.gz(1.65 MB)
  • v0.3.0(May 20, 2021)

    v0.3.0

    Bugfixes 🐛

    • eda: fix long name in missing heatmap (f6cc399e)
    • connector: fix bug in url_path_params (c95a7ff1)
    • eda: fix NA and int viz issue in plot_diff (ef36d5ac)
    • eda: fix missing for SmallCard and DateTime type (201e487b)
    • eda: fix create_report for dask csv (93e85673)
    • clean: fix mixed-up formats of date in one column (e2956956)
    • eda: fixed uncaught dtype and long var names (24f0295e)
    • eda: fix correlation of num columns with small distinct values (9959b78a)
    • eda: fix issue with dataframe of one column (910bb71a)
    • eda: add geopoint in type count (94cbca23)
    • eda: fixed uncaught dtype exceptions (d301eb75)
    • eda: fix str transform with small distinct as categorical (65e7f907)
    • eda: fix na values display issue (1ce5775e)
    • eda: keep na when preprocess df (17d82191)
    • clean: fix returned df_clean in clean_dupl (180e6ad2)
    • clean: escape apostrophes in code exported by clean_dupl (e6ea7e97)
    • eda: fixed endless loop and UI issues (69779cd6)
    • eda: fix insight error (9ad4e26b)
    • eda: suppress warnings for missing and report (df2a1e70)
    • eda: fix insights of plot_correlation (f0ca5f41)
    • eda: suppress warnings of progress bar and dask (ca8da4e1)
    • eda.create_report: fix constant column error (160844ad)
    • docs: fix docs of clean_df (38dd4b2a)
    • clean: remove unneeded replace in clean_dupl (51c02cdd)
    • eda: fixed bugs come with random generated datasets (53ecf76c)
    • eda: fix bugs in log transformation (209d7d0c)
    • eda: fixed and optimized css layouts (58e1b18f)
    • clean: fix bug in validate_country (28068d46)
    • eda: fix column name and index related issues (40a89b91)
    • eda: variables can be none (325b0904)
    • connector: path to new config repo (59603e5b)
    • clean: lat_long regex not match a date format (49d3d227)
    • eda.distribution: highlight variable names (998b1762)
    • eda: fix the error of numerical cell in object column (91c4f9df)
    • eda.distribution: box plot with object dtype (a37e9f21)
    • clean: add comma after street suffix or name (e7655db9)
    • clean: cast values as str in validate funcs (8e1b459a)

    Features ✨

    • clean: tuple of input formats for clean_country() (6bc65513)
    • clean: add clean_text function (55d3ae95)
    • eda: change color of geo map (1dbcddbf)
    • clean: add clean_currency function (deb55938)
    • clean: add clean_df() function (b750284f)
    • type: detect column as categorical for small unique values (4696e598)
    • eda: add geo_plot function (bbe64ec2)
    • eda: create_report UI improvement (c849b013)
    • eda: added new function plot_diff (79523c30)
    • connector: allow parameters appear in url path (5adaf301)
    • eda: value frequency table (bc37b794)
    • eda: create_report UI improvement (72a0ca95)
    • clean: add clean_duplication() function (98ff38d0)
    • clean: support letters in clean_phone (25d163b3)
    • eda: specify colors in plot(df), plot(df, x) (33fa36ea)
    • connector: add functionality that lists supported websites (88187e18)
    • clean: add clean_address function (e839ecd3)
    • clean: add clean_headers function (40742a19)
    • eda: parameter management and how-to guide (d2e8b10a)
    • clean: add clean_date function (6aa6410e)
    • create_report: add tabs for correlation and missing (6dc568b5)

    Code Quality + Testing 💯

    • eda: add test for geo point (943033a6)
    • eda: add dataset test for report (0de5208b)
    • eda: add test of random df (68239f03)
    • clean: add tests for clean_duplication() (a4b9d32b)
    • eda: add random data generator (e83f95b3)
    • clean: add tests for clean_headers (0aca076e)
    • eda: add test case of object column with numerical cell (57839841)
    • clean: add tests for clean_date and validate_date (812dbb8d)

    Performance 🚀

    • eda: optimize df preprocess and performance of create_report (e7eb182f)
    • clean: update documentation of clean_date (c540fcc7)
    • clean: improve performance of clean_duplication (8fda37e8)
    • eda: use approximate nunique (60300644)
    • clean: improve the performance of clean_email() (176382bc)
    • clean: improve performance of clean_date (854329ba)

    Documentation 📃

    • readme: update video, paper and titanic report for eda (1126dea8)
    • eda: replace x, y, z with col1, col2, col3 (57f65b30)
    • clean: add documentation for clean_text (65436b06)
    • eda: add documentation for insights (1e4659be)
    • clean: add documentation for clean_df() (4ecf0d71)
    • eda: update user guide's datasets (2428f98e)
    • eda: add documentation for geo plot (3558257c)
    • clean: add user guide for clean_duplication (d834e857)
    • clean: fix clean documentation (e3bed2ba)
    • connector: revision (23085dd3)
    • clean: add documentation for clean_date function (d445f36a)
    • connector: add info docs (cb8cb5c5)
    • connector: add config file section (f55226ea)
    • connector: adding a process overview via DBLP section (5794d6c8)
    • connector: remove stale rst files (433fdfe4)
    • connector: convert pagination section from rst to ipynb (e4b9ba0c)
    • connector: convert authorization section from rst to ipynb (d25af473)
    • connector: change the pointer in index file from connector.rst to introduction.ipynb (218e41c6)
    • connector: rewrite introduction and form doc structure (6a876937)
    • connector: update API reference doc (9bed1694)
    • clean: improve DataPrep.Clean ReadMe (a0bc96b0)
    • eda: update legacy documentations for eda (8f948e05)
    • clean: add documentation for clean_address (4061fca3)
    • clean: add documentation for clean_headers (7a9d519c)
    • clean: add links from user guide to api ref (182b5254)
    • clean: Docstrings for phone and email (47f1e33d)
    • datasets: add introduction for datasets (83d42cee)
    • clean: add API reference (68182f6a)
    • clean: add documentation for clean_ip function (9da3ed1e)
    • connector: add query() section (c904d1fc)
    • connector: add connect() section (bff842ed)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.3.0-py3-none-any.whl(1.69 MB)
    dataprep-0.3.0.tar.gz(1.59 MB)
  • v0.2.15(Jan 6, 2021)

    Bugfixes 🐛

    • eda: add test to plot_missing (303a13e6)
    • eda: when data size is small using plot_missing (9e59aa00)
    • eda: set encoding to udf when file is opened (f43c1aa2)
    • clean: split parameter for clean_phone (f9bb1003)
    • connector: config manager checks _meta.json (5c2278de)
    • eda.create_report: univar datetime analysis (4632852a)
    • eda.report: encoding and show issue (721ae7be)

    Features ✨

    • datasets: add load_dataset and get_dataset_names (2b9e1f95)
    • connector: allow using config from other branches (276afff3)
    • connector: from_key parameter validation (bd89ef29)
    • clean: add clean_ip function (3b232708)
    • connector: improve info (2a175a82)
    • eda: enrich plot_correlation (29c444e2)
    • clean: implement clean_phone for Canadian/US formats (45d43682)
    • eda: modify doc of plot_missing (489c9220)
    • clean: add errors parameter, enhance report for clean_url (aa7ec9cb)
    • clean: add clean_url function (2894d0a0)
    • eda: add stat. in plot_missing (0f44f153)
    • connector: adding validation for auth params (0a7c712d)
    • eda: convert all plot functions to new UI (36f8fa3e)
    • connector: update info function documentation (7b6ae530)
    • connector: create display dataframe function (9767cf47)

    Code Quality + Testing 💯

    • clean: add tests for clean_ip and validate_ip (fc156829)
    • clean: add tests for clean_url (452dbe8f)
    • clean: add tests for clean_phone (fcf73106)
    • clean: add tests for clean_email() (fdd02c62)
    • clean: add tests for clean_country() (8a593fa6)
    • clean: add tests for clean_lat_long (aea26025)

    Performance 🚀

    • clean: improve the performance of the clean subpackage (c7c787bd)

    Documentation 📃

    • README: add link to each section (b687076a)
    • README: polish EDA section (fd5ef8c4)
    • clean: add documentation for clean_url (bf937f9d)
    • clean: add documentation for clean_phone (8165a428)
    • readme: fix the broken image (12e1fa16)
    • readme: add introduction for dataprep.clean (3710037d)
    • clean: add docs for clean_country (21639814)
    • eda: modify doc for plot_correlation (b6b377c9)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.15-py3-none-any.whl(188.63 KB)
    dataprep-0.2.15.tar.gz(149.16 KB)
  • v0.2.14(Oct 22, 2020)

    Bugfixes 🐛

    • eda.plot_missing: new label texts and color mapping (71a95f91)
    • connector: add missing authdef (8b274b92)
    • eda.create_report: handle unhashable dtypes (77437491)

    Features ✨

    • connector: remove jsonschema dependency (6f07faf9)
    • connector: don't support xml website anymore (fa173a06)
    • connector: simplify generator, add connect (a96d9b3c)
    • clean: implement clean_country function (5dea1bde)
    • connector: do not update local config if it already exists (cd675f30)
    • eda: Redesigned layout for plot_missing (c85eaa5d)
    • connector: add generator UI (4d1e9004)

    Performance 🚀

    • eda: optimize plot_missing and plot_corr (b46036dc)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.14-py3-none-any.whl(151.39 KB)
    dataprep-0.2.14.tar.gz(118.94 KB)
  • v0.2.13(Oct 1, 2020)

    Bugfixes 🐛

    • eda: change dtype 'string' to 'object' (8ddddbcf)
    • eda: remove unnecessary compute (98c4ab0c)
    • connector: wrong calculation for pagination (516038b9)
    • eda.data_array: handle empty df correctly (97db86d7)
    • eda.distribution: fix pie chart insight (d3564a6f)
    • eda.distribution: delay scipy computations (89fafaec)
    • eda.correlation: wrong mask calculation (8ebe9cc0)
    • eda.plot: fixed wordcloud, all nan column (ce762d55)

    Features ✨

    • connector: implement authorization code (e6838ca1)
    • connector: full text search _q to be a universal parameter (947584ab)
    • cleaning: add clean_email() function (4658a208)
    • connector: implement generator (7a93ea0e)
    • connector: add token based pagination (5ec6e00c)
    • connector: implement page pagination (02c93b4e)
    • connector: implement header authentication (d879c207)
    • connector: use pydantic for schema (dff08442)
    • connector: rename pagination types (500ce130)
    • cleaning: add report parameter for clean_lat_long (f0af6212)
    • connector: Parameter check when calling query() (0db7a16b)
    • eda: support series as the input (bad6a873)
    • eda.plot: Redesigned layout for plot(df, x) (04c7fd55)
    • cleaning: clean latitude, longitude coordinates (93927a98)
    • eda.report: allow disabling the progress bar (2a90f7f3)
    • eda.correlation: move nan corr values to the bottom (4bba52e0)
    • eda: add progress bar for dask local scheduler (e13257c8)
    • eda.plot: increase # of bins and ngroups (f78cfaef)

    Performance 🚀

    • eda.plot: changed drop_null to dropna (0a7fe56d)
    • eda.missing: use DataArray (fb69ea1b)
    • eda.plot: optimize bivariate computations (031748e9)
    • eda: improve progress bar performance (64be8895)
    • eda.correlation: increase the performance (3575aac4)
    • eda.correlation: performance tuning (68471e50)

    Documentation 📃

    • cleaning: add documentation for clean_email() (5bc37706)
    • cleaning: update clean_lat_long docs (d698a10e)
    • cleaning: add documentation for clean_lat_long (eaba8c71)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.13-py3-none-any.whl(138.07 KB)
    dataprep-0.2.13.tar.gz(106.97 KB)
  • v0.2.12(Aug 25, 2020)

    Bugfixes 🐛

    • eda.create_report: optional dependency on ipython (75542cda)

    Features ✨

    • eda.plot: add plot(df, x) insights (090d2f33)
    • connector: early return when df is empty for multi page (99fd5164)
    • eda.plot: Redesigned layout for plot(df) (5baebcb2)
    • eda.plot: Add auto insights (f176e9f8)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.12-py3-none-any.whl(109.29 KB)
    dataprep-0.2.12.tar.gz(88.79 KB)
  • v0.2.11(Aug 10, 2020)

    Bugfixes 🐛

    • eda: fix holoview palette deprecation warning (b5b27d36)
    • eda.correlation: truncate axis tick values (ed6ef8c6)

    Features ✨

    • eda.create_report: new report object for saving and showing report (66fefd59)
    • eda.plot_missing: add dendrogram for plot_missing (1e11d5c5)
    • eda.config: add config class (f2bc8c50)

    Performance 🚀

    • eda.plot: optimize plot() by tweaking dask (54b6f667)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.11-py3-none-any.whl(92.87 KB)
    dataprep-0.2.11.tar.gz(79.18 KB)
  • v0.2.10(Jul 26, 2020)

    Bugfixes 🐛

    • eda.create_report: updated key name (80405159)

    Features ✨

    • eda: add show_browser function (04cf7306)
    • eda: add a drop_null function (11acb3e0)

    Code Quality + Testing 💯

    • eda.create_report: added test script for create_report function (c88a6eb2)

    Performance 🚀

    • eda: optimize plot_missing performance by tweaking dask (d7669779)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.10-py3-none-any.whl(85.06 KB)
    dataprep-0.2.10.tar.gz(72.17 KB)
  • v0.2.9(Jul 12, 2020)

    Bugfixes 🐛

    • eda: deal with no missing df for plot_missing (9d4d39b6)
    • eda.plot: kde yaxis tick locations (7c48fe14)
    • eda: display inside Google Colab (e61d16cf)

    Features ✨

    • data_connector: adding support for authorization type TokenParam (c59480c8)

    Code Quality + Testing 💯

    • data_connector: add test for QueryParam auth (818891e1)

    Documentation 📃

    • eda: update plot_missing doc (4ee03afa)
    • data_connector: Example notebook for YouTube usage (5ae4283b)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.9-py3-none-any.whl(82.41 KB)
    dataprep-0.2.9.tar.gz(70.63 KB)
  • v0.2.8(Jul 4, 2020)

    Bugfixes 🐛

    • eda.plot: fix wordcloud and change the stats layout (7fe25c74)

    Documentation 📃

    • data_connector_example: Example to fetch and analyze tweets (e1d97d7a)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.8-py3-none-any.whl(73.39 KB)
    dataprep-0.2.8.tar.gz(64.01 KB)
  • v0.2.7(Jun 29, 2020)

    Bugfixes 🐛

    • eda: fix the plot doesn't show up (b72770a6)

    Features ✨

    • eda.plot: support text analysis (21ef6293)

    Documentation 📃

    • data_connector: Twitter usage example notebook (a698811f)
    • data_connector: documentation update (8977f757)
    • data_connector: documentation update (8fc30940)
    • data_connector: documentation update (0bd70b7a)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.7-py3-none-any.whl(73.33 KB)
    dataprep-0.2.7.tar.gz(63.98 KB)
  • v0.2.6(Jun 1, 2020)

    Bugfixes 🐛

    • eda.basic: fixed kde calculation (a6e58e94)

    Features ✨

    • eda: notebook with code from dvsp blog post (bfd06b89)
    • eda.basic: added time series plots, support (0d10ebbb)

    Code Quality + Testing 💯

    • data_connector: add integration test (95e52996)

    Documentation 📃

    • data_connector: use remote yelp config in the docstring (d67bb70b)
    • data_connector: improved docstring (c363ac26)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
  • v0.2.5(May 5, 2020)

    Bugfixes 🐛

    • eda.basic: fixed kde calculation (a6e58e94)

    Features ✨

    • eda: notebook with code from dvsp blog post (bfd06b89)
    • eda.basic: added time series plots, support (0d10ebbb)

    Code Quality + Testing 💯

    • data_connector: add integration test (95e52996)

    Documentation 📃

    • data_connector: use remote yelp config in the docstring (d67bb70b)
    • data_connector: improved docstring (c363ac26)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.5-py3-none-any.whl(57.25 KB)
    dataprep-0.2.5.tar.gz(47.32 KB)
  • v0.2.4(Apr 18, 2020)

  • v0.2.3(Apr 12, 2020)

    Documentation 📃

    • readme: fix the fig (4d04ea78)
    • readme: make examples link to documentation (cfb1e39d)
    • readme: add links and tooltips for the plots (13c7ab10)

    Contributors this release 🏆

    The following users contributed code to DataPrep since the last release.

    🎉🎉 Thank you! 🎉🎉

    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.3-py3-none-any.whl(53.14 KB)
    dataprep-0.2.3.tar.gz(42.98 KB)
  • v0.2.2(Apr 4, 2020)

  • v0.2.1(Mar 20, 2020)

  • v0.2.0(Mar 20, 2020)

    This release includes some new features and lots of bug fixes.

    Feature

    • Implement report (ac13bf898a2927e0fa838037de1772e3f5f60b74)
    • Implement dc.show_schema() (487a14fb4b14ed6aead0f724a03dfc2666d3ce4f)
    • Implement dc.info (50e3e18d4423833bf70e5fae72ba6037c5b0958e)
    • Support template (8a3a4a370aa796775fdb007b9df17b84fd53d282)

    Fix

    • Remove_if_empty on template (443097273fe72aa1f7a5280c527e92b89c69eeb9)
    • Fix parameter names (ef0119f498af10f3d5662edafcbafc92ea5c2567)
    • Improved plot(df) efficiency (1ecd54376bcef015d5091a9a6a2bef1f7ed172b0)
    • Fixed xtic rounding (697496e8d9687c8f675eca88c0af83aa6d995beb)
    • Fix scatter and top-k nan (6a1bdb7780afe62b152997a75715b21dc1cb018a)
    • Fixed xtics for histograms (1f13d1b51ae8e1ad50e72829575a4bc27c4cc591)
    • Plot_correlation only supports numerical data (c06726063a1dc7910c11111a421c521fc481d0bf)
    • It works for the columns with missing values (5329c54cc9112a1d011955518e61a729f8e28007)
    • Make the tooltip style align with plot(df) (33e5403fb828b924a9cbacc0d2395e3f1a5eaa87)

    Documentation

    • Add documentation (4f342ce2890529e617284abc54c3d84cd6d4f6c2)
    Source code(tar.gz)
    Source code(zip)
    dataprep-0.2.0-py3-none-any.whl(52.31 KB)
    dataprep-0.2.0.tar.gz(41.91 KB)
Owner
SFU Database Group