Automatically Visualize any dataset, any size with a single line of code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.

Overview

AutoViz

AutoViz performs automatic visualization of any dataset with one line. Give any input file (CSV, txt or json) and AutoViz will visualize it.

Install

Prerequisites

Before installing AutoViz, it is best to create a new environment and install the required dependencies into it.

To install from PyPI:

conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # with conda older than 4.4: `source activate <your_env_name>` (Linux/macOS) or `activate <your_env_name>` (Windows)
pip install autoviz

To install from source:

cd <AutoViz_Destination>
git clone git@github.com:AutoViML/AutoViz.git
# or download and unzip https://github.com/AutoViML/AutoViz/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # with conda older than 4.4: `source activate <your_env_name>` (Linux/macOS) or `activate <your_env_name>` (Windows)
cd AutoViz
pip install -r requirements.txt

Usage

Read this Medium article to know how to use AutoViz.

In the AutoViz directory, open a Jupyter Notebook and use these lines to import and instantiate the library:

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()

Load a dataset (any CSV or text file) into a Pandas dataframe, or give the path and filename of the file you want to visualize. If the data is already in a dataframe and there is no file, set the filename argument to "" (an empty string).

Call AutoViz using the filename (or dataframe) along with the separator and the name of the target variable in the input. AutoViz will do the rest. You will see charts and plots on your screen.

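filename = ""  # path to a CSV or text file; leave as "" when passing a dataframe via dfte
sep = ","      # column separator used in the file
dft = AV.AutoViz(
    filename,
    sep=sep,
    depVar="",     # name of the target variable, or "" if there is none
    dfte=None,     # pandas dataframe to plot when no filename is given
    header=0,
    verbose=0,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
)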

AV.AutoViz is the main plotting function in AV.
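
When the data is already in memory, the same call can take a dataframe instead of a file. The following is a minimal sketch, assuming autoviz is installed. `visualize_dataframe` is a hypothetical wrapper name (it is not part of AutoViz), and the keyword values simply mirror the ones in the call above.

```python
def visualize_dataframe(df, target=""):
    """Plot an in-memory pandas dataframe with AutoViz.

    `visualize_dataframe` is a hypothetical helper, not part of AutoViz.
    """
    # Imported lazily so this function can be defined even where autoviz is missing.
    from autoviz.AutoViz_Class import AutoViz_Class

    AV = AutoViz_Class()
    return AV.AutoViz(
        "",             # no file: the data comes from dfte
        sep=",",
        depVar=target,  # "" means there is no target variable
        dfte=df,
        header=0,
        verbose=0,
        lowess=False,
        chart_format="svg",
        max_rows_analyzed=150000,
        max_cols_analyzed=30,
    )
```

In a notebook, `dft = visualize_dataframe(df, target="price")` would plot a dataframe named `df` against a target column called `price` (both names are placeholders).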

Notes:

  • AutoViz will visualize a file of any size using a statistically valid sample.
  • A comma is assumed as the default separator in the file, but you can change it.
  • The first row is assumed to be the header, but you can change it.
  • verbose option
    • if 0, prints minimal information and displays charts in your notebook
    • if 1, prints extra information and also displays charts in your notebook
    • if 2, does not display any charts; it simply saves them on your local machine under the AutoViz_Plots directory
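
With verbose=2 nothing appears on screen, so it can help to check what was actually written to disk. The sketch below uses only the standard library. `list_saved_charts` is a hypothetical helper, not an AutoViz function, and because the layout inside AutoViz_Plots is not documented here, it assumes nothing about subfolders and simply searches recursively.

```python
from pathlib import Path


def list_saved_charts(plots_dir="AutoViz_Plots", chart_format="svg"):
    """Return chart files of the given format found anywhere under plots_dir."""
    root = Path(plots_dir)
    if not root.is_dir():
        # Nothing has been saved yet (or the directory lives somewhere else).
        return []
    suffix = "." + chart_format.lower()
    # Compare suffixes case-insensitively so "chart.SVG" also matches "svg".
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )
```

Calling `list_saved_charts()` after a verbose=2 run returns the saved SVG paths relative to the current directory; an empty list means no charts of that format were found there.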

API

Arguments

  • filename - the path and name of the file to load. Pass an empty string ("") if the data is already in a dataframe, and pass that dataframe through dfte. Otherwise, give the file name and leave dfte empty. Only one of the two is needed to load the dataset.
  • sep - the separator used in the file. It can be a comma, semicolon, tab or any other value that separates the columns in your file.
  • depVar - the target variable in your dataset. Leave it as an empty string if your data has no target variable.
  • dfte - the input pandas dataframe, if you want to plot charts from a dataframe rather than a file. In that case, leave filename as an empty string.
  • header - the row number of the header row in your file. If it is the first row, this must be zero.
  • verbose - accepts 3 values: 0, 1 or 2. With 0, you get all charts but limited information. With 1, you get all charts and more information. With 2, you will not see any charts; they are generated quietly and saved in the AutoViz_Plots directory, which is created under your current directory. Delete this folder periodically, otherwise charts will pile up there if you use verbose=2 often.
  • lowess - shows regression lines for each pair of continuous variable and target variable. It works well for small datasets; don't use it for large datasets (over 100,000 rows).
  • chart_format - SVG, PNG or JPG. Charts are generated and saved in this format when you use verbose=2, which is useful for generating charts and using them later.
  • max_rows_analyzed - limits the number of rows used to display charts. If you have a very large dataset with millions of rows, use this option to limit the time it takes to generate charts. AutoViz will take a statistically valid sample.
  • max_cols_analyzed - limits the number of continuous variables that are analyzed.
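
The constraints described in this list can be checked before calling AutoViz. This is a minimal sketch, not part of the AutoViz API: `check_autoviz_args` and its `n_rows` parameter are hypothetical names, and the rules are only the ones stated above (one data source, three verbose values, three chart formats, and the 100,000-row guidance for lowess).

```python
VALID_VERBOSE = (0, 1, 2)
VALID_CHART_FORMATS = ("svg", "png", "jpg")


def check_autoviz_args(filename="", dfte=None, verbose=0,
                       chart_format="svg", lowess=False, n_rows=None):
    """Return a list of problems; an empty list means the arguments look fine."""
    problems = []
    has_file = bool(filename)
    # Treat both None and "" as "no dataframe given".
    has_frame = dfte is not None and not (isinstance(dfte, str) and dfte == "")
    if has_file == has_frame:
        problems.append("give exactly one of filename or dfte")
    if verbose not in VALID_VERBOSE:
        problems.append("verbose must be 0, 1 or 2")
    if str(chart_format).lower() not in VALID_CHART_FORMATS:
        problems.append("chart_format must be svg, png or jpg")
    # n_rows is optional: the lowess check only runs when the row count is known.
    if lowess and n_rows is not None and n_rows > 100_000:
        problems.append("lowess is not recommended above 100,000 rows")
    return problems
```

For example, `check_autoviz_args(filename="", dfte=None)` reports that no data source was given, which is one way to end up with a call that cannot load anything.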

Maintainers

Contributing

See the contributing file!

PRs accepted.

License

Apache License, Version 2.0

DISCLAIMER

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

Issues
  • Project logo [help wanted]

    If anyone with design sensibilities sees this: we are open to changing the project logo.

    We like the pandas logo for example https://github.com/pandas-dev/pandas

    opened by morenoh149 18
  • Verbose = 2 does not save images

    Running verbose = 2 does not save the images anywhere! Can you please see this issue?

    opened by Metalkiler 10
  • How do we see output using a script file a terminal?

    Hi AutoViML,

    Firstly, congratulations and thanks for this wonderful package. It works perfectly fine with Jupyter notebooks, but how do I use it from an IDE, let's say Spyder?

    Thanks in advance. Mohit

    opened by bansalism2 9
  • "Not able to read or load file. Please check your inputs and try again..."

    Hello Ram, when I run the code on my dataset, dft = av.AutoViz('', sep, target, df), I get this error: "Not able to read or load file. Please check your inputs and try again..."

    what could the issue be?

    opened by isioma42 8
  • some variables in data removed automatically

    Hi, my input CSV contains 20 variables. During preprocessing, AutoViz removed all the important columns. May I know the reason? Note: the removed columns contain full data, without null values.

    opened by sivanagendra123 5
  • exporting the report

    Projects similar to AutoViz are Sweetviz and Pandas Profiling.

    They could export the report as a HTML file. I wonder if this library also has this function?

    opened by sunset1234321 4
  • Misplaced graph x ylabel

    Hi Ram, I have tried this package and found a potential bug. When I run AV.AutoViz('', ',', 'target', df), the x and y labels of each graph are swapped (the x label is placed where the y label should be, and vice versa). I have tried two datasets and it happened with both. Please look into this and see whether it is a bug or I did something wrong. Thanks! Jeff

    opened by HiIamJeff 4
  • Installation instructions and sample code not working

    You need from autoviz import ... in the sample code. Preferably you should give a sample that can be just copy&pasted and run, and provide pictures of how it looks, so that one could evaluate whether to install this instead of the many other plotting libraries.

    The dependencies are extremely heavy. Is it absolutely necessary to install Jupyter? Something inside also depends on sklearn, which was not included in pip deps.

    As for CSV reading; if you are not able to autodetect/guess separators and date formats, do not bother "including" it in your library. It is just two lines of code to first load the data with pandas and then use another library for plotting, and in most cases one needs to do something in between anyway (data preprocessing).

    An ideal plotting library would have an API like this:

    from fictionalplot import Figure  # if possible, keep it to just one simple import
    
    fig = Figure()   # Internally holds graphics context, Qt window, websocket to browser or whatever
    fig.plot(df)  # display the graph and return instantly, try to auto-guess suitable format based on df
    

    If using a Qt window, spawn a new process that does not terminate when the Python program ends, and that is automatically shared by all figures of all running programs (don't block execution of the program like Matplotlib does). If using Notebook/browser, you don't need separate process because browser already does that.

    For true interactive plots (e.g. receive user input on scaling changes to recalculate new data in Python), use async/await to avoid blocking Python from executing while waiting for user input (but stay away from import asyncio which is utter crap -- instead use trio if you must).

    Good luck with your plotting library. We could certainly use some good options (I am not entirely happy with either Matplotlib nor Plotly, and everything else is just bad).

    opened by Tronic 3
  • AutoViz Crashes with the Error

    When I try to apply AutoViz to analyzing the data of one of the competitions on Kaggle (namely, https://www.kaggle.com/c/lish-moa/data), it crashes.

    Below is the error trap I get

    Imported AutoViz_Class version: 0.0.68. Call using: 
        from autoviz.AutoViz_Class import AutoViz_Class
        AV = AutoViz_Class()
        AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0,
                                lowess=False,chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30)
                
    To remove previous versions, perform 'pip uninstall autoviz'
    Shape of your Data Set: (21948, 876)
    Classifying variables in data set...
        875 Predictors classified...
            This does not include the Target column(s)
        2 variables removed since they were ID or low-information variables
        List of variables removed: ['sig_id', 'cp_type']
    Since Number of Rows in data 21948 exceeds maximum, randomly sampling 2500 rows for EDA...
    872 numeric variables in data exceeds limit, taking top 40 variables
    Number of numeric variables = 872
        Number of variables removed due to high correlation = 227 
        Adding 1 categorical variables to reduced numeric variables  of 645
    Selected No. of variables = 646 
    Finding Important Features...
    Not able to read or load file. Please check your inputs and try again...
    

    My code to reproduce the problem is provided in https://gist.github.com/gvyshnya/7644fd77567051203ad96d95fbc7ef2a

    I run that code on my local machine (not in a Kaggle kernel). The above-mentioned code expects the data files from the competition to be placed in data subfolder (relative to the folder where you place the python script with the code).

    Below are the key details about my OS and Python Environment

    • Windows 10
    • Python 3.7 in Anaconda
    • AutoViz_Class version: 0.0.68

    The trace from pd.show_versions(as_json=False) on my machine is provided below, just in case

    INSTALLED VERSIONS
    ------------------
    commit           : 2a7d3326dee660824a8433ffd01065f8ac37f7d6
    python           : 3.7.0.final.0
    python-bits      : 64
    OS               : Windows
    OS-release       : 10
    Version          : 10.0.18362
    machine          : AMD64
    processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
    byteorder        : little
    LC_ALL           : None
    LANG             : None
    LOCALE           : None.None
    
    pandas           : 1.1.2
    numpy            : 1.19.2
    pytz             : 2018.5
    dateutil         : 2.7.3
    pip              : 20.1
    setuptools       : 49.2.0
    Cython           : 0.28.5
    pytest           : 5.3.2
    hypothesis       : None
    sphinx           : 1.7.9
    blosc            : None
    feather          : None
    xlsxwriter       : 1.1.0
    lxml.etree       : 4.2.5
    html5lib         : 1.0.1
    pymysql          : None
    psycopg2         : None
    jinja2           : 2.11.1
    IPython          : 6.5.0
    pandas_datareader: None
    bs4              : 4.6.3
    bottleneck       : 1.2.1
    fsspec           : None
    fastparquet      : None
    gcsfs            : None
    matplotlib       : 3.1.2
    numexpr          : 2.6.8
    odfpy            : None
    openpyxl         : 2.5.6
    pandas_gbq       : 0.12.0
    pyarrow          : None
    pytables         : None
    pyxlsb           : None
    s3fs             : None
    scipy            : 1.4.1
    sqlalchemy       : 1.2.11
    tables           : 3.4.4
    tabulate         : 0.8.2
    xarray           : None
    xlrd             : 1.1.0
    xlwt             : 1.3.0
    numba            : 0.48.0
    
    opened by gvyshnya 3
  • AutoViz misidentifies my dependent variable as a categorical variable, which is in fact a continuous variable.

    My dependent variable is the Loneliness scale score, within the range of 1 to 4.

    When I run the basic AutoViz() call below, I do not get any results regarding my dependent variable.

    report_AV = AV.AutoViz('', dfte=data)

    When I run the call with the depVar argument below, I get results that appear to treat my dependent variable as a categorical variable. This makes the result useless for my research.

    report_AV = AV.AutoViz('', dfte=data, depVar='Loneliness')

    I've checked with the datatypes of my dataframe, and my dependent variable column's datatype is float64.

    Is there any way to solve this issue?

    opened by yoonwonj 3
  • Frequency distribution plot of target column is wrongly interpreted

    Hi, when I run AutoViz on the steel classification dataset, the frequency distribution plot for the target column is shown incorrectly.

    opened by Bhavana165 1
  • Normed Histogram plot with negative y value?

    Hi, the plots I get all have negative y values. How should I interpret this?

    I think the following code generates the plots:

    sns.distplot(dft.loc[dft[dep]==target_var][each_conti],bins=binsize, ax= ax1, label=target_var, hist=False, kde=True, color=color2)
    legend_flag += 1

    opened by shuaiwang88 1
  • The title part at the top of the output image is cut off

    When using verbose=2 to output an SVG or PNG file, the title at the top of the image is cut off. There seems to be a problem with the height setting; please check.

    opened by beobest2 2
  • JSON file example

    Is there an example that uses a JSON file with AutoViz? I seem to be running into errors. It would be good to have a sample file that works.

    opened by kishan19 2
  • Use black formatter

    Would you welcome a PR adding black formatting to the project? https://github.com/psf/black

    opened by morenoh149 3
Owner
AutoViz and Auto_ViML
Automated Machine Learning: Build Variant Interpretable Machine Learning models. Project Created by Ram Seshadri.
Matplotlib tutorial for beginner

matplotlib is probably the single most used Python package for 2D-graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. We are going to explore matplotlib in interactive mode covering most common cases.

Nicolas P. Rougier 2.1k Oct 15, 2021
Python scripts to manage Chia plots and drive space, providing full reports. Also monitors the number of chia coins you have.

Chia Plot, Drive Manager & Coin Monitor (V0.5 - April 20th, 2021) Multi Server Chia Plot and Drive Management Solution Be sure to ⭐ my repo so you can

null 350 Oct 15, 2021
Bokeh Plotting Backend for Pandas and GeoPandas

Pandas-Bokeh provides a Bokeh plotting backend for Pandas, GeoPandas and Pyspark DataFrames, similar to the already existing Visualization feature of

Patrik Hlobil 717 Oct 13, 2021
A Python package that provides evaluation and visualization tools for the DexYCB dataset

DexYCB Toolkit DexYCB Toolkit is a Python package that provides evaluation and visualization tools for the DexYCB dataset. The dataset and results wer

NVIDIA Research Projects 69 Oct 18, 2021
Plot, scatter plots and histograms in the terminal using braille dots

Plot, scatter plots and histograms in the terminal using braille dots, with (almost) no dependencies. Plot with color or make complex figures - similar to a very small sibling to matplotlib. Or use the canvas to plot dots and lines yourself.

Tammo Ippen 128 Oct 21, 2021
Handout for the tutorial "Creating publication-quality figures with matplotlib"

Handout for the tutorial "Creating publication-quality figures with matplotlib"

JB Mouret 1.7k Oct 20, 2021
Visualize and compare datasets, target values and associations, with one line of code.

In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code! Sweetviz is an open-source Python library that generat

Francois Bertrand 1.8k Oct 21, 2021
Flexitext is a Python library that makes it easier to draw text with multiple styles in Matplotlib

Flexitext is a Python library that makes it easier to draw text with multiple styles in Matplotlib

Tomás Capretto 54 Oct 16, 2021
📊📈 Serves up Pandas dataframes via the Django REST Framework for use in client-side (i.e. d3.js) visualizations and offline analysis (e.g. Excel)

📊📈 Serves up Pandas dataframes via the Django REST Framework for use in client-side (i.e. d3.js) visualizations and offline analysis (e.g. Excel)

wq framework 1.1k Oct 24, 2021
Pyan3 - Offline call graph generator for Python 3

Pyan takes one or more Python source files, performs a (rather superficial) static analysis, and constructs a directed graph of the objects in the combined source, and how they define or use each other. The graph can be output for rendering by GraphViz or yEd.

Juha Jeronen 144 Oct 15, 2021
basemap - Plot on map projections (with coastlines and political boundaries) using matplotlib.

Basemap Plot on map projections (with coastlines and political boundaries) using matplotlib. ⚠️ Warning: this package is being deprecated in favour of

Matplotlib Developers 636 Oct 15, 2021
Debugging, monitoring and visualization for Python Machine Learning and Data Science

Welcome to TensorWatch TensorWatch is a debugging and visualization tool designed for data science, deep learning and reinforcement learning from Micr

Microsoft 3.2k Oct 22, 2021
A D3.js plugin that produces flame graphs from hierarchical data.

d3-flame-graph A D3.js plugin that produces flame graphs from hierarchical data. If you don't know what flame graphs are, check Brendan Gregg's post.

Martin Spier 633 Oct 22, 2021
Simple Inkscape Scripting

Simple Inkscape Scripting Description In the Inkscape vector-drawing program, how would you go about drawing 100 diamonds, each with a random color an

Scott Pakin 30 Oct 26, 2021
Tools for exploratory data analysis in Python

Dora Exploratory data analysis toolkit for Python. Contents Summary Setup Usage Reading Data & Configuration Cleaning Feature Selection & Extraction V

Nathan Epstein 572 Oct 12, 2021
A simple Python tool to explore your object detection dataset

A simple tool to explore your object detection dataset. The goal of this library is to provide simple and intuitive visualizations of your dataset and automatically find the best parameters for generating a specific grid of anchors that can fit your data characteristics

GRADIANT - Centro Tecnolóxico de Telecomunicacións de Galicia 81 Oct 9, 2021
Uniform Manifold Approximation and Projection

UMAP Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, bu

Leland McInnes 5.1k Oct 23, 2021