Explorative Data Analysis Guidelines

Florian Rohrer

Last update: Dec 26, 2022

Explorative Data Analysis

Get the data into a usable format.
Find out whether the subsequent predictive modeling phase is likely to succeed.

Combine everything into a single big table
- Convert files to .csv
- Merge files
- Fix encoding issues
- Clean column names (English, no whitespace, no special characters)
- Are there duplicate columns?
- Fix datatypes (datetime, int, float, string)
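
The steps above can be sketched with pandas roughly as follows. This is only an illustration under assumptions: the two in-memory CSV "files", the column names, and the `clean_name` helper are all hypothetical and not part of the original checklist.

```python
import io
import re

import pandas as pd

# Two small in-memory CSV "files" stand in for real exports (hypothetical data).
orders_csv = io.StringIO("Order ID,Order Date,Amount ($)\n1,2022-01-03,10.5\n2,2022-01-04,7\n")
customers_csv = io.StringIO("Order ID,Customer Name\n1,Ann\n2,Bob\n")


def clean_name(name: str) -> str:
    """Replace runs of non-alphanumeric characters with '_', trim, lower-case."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


# For real files, pass a path and, if needed, an explicit encoding,
# e.g. pd.read_csv(path, encoding="latin-1").
orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Clean column names before merging so the join key matches.
orders.columns = [clean_name(c) for c in orders.columns]
customers.columns = [clean_name(c) for c in customers.columns]

# Merge everything into a single table on the shared key.
table = orders.merge(customers, on="order_id", how="left")

# Are there duplicate column names?
duplicate_columns = table.columns[table.columns.duplicated()].tolist()

# Fix datatypes explicitly instead of relying on inference.
table["order_date"] = pd.to_datetime(table["order_date"])
table["amount"] = table["amount"].astype(float)
table["customer_name"] = table["customer_name"].astype("string")
```

A left join keeps every row of the first table even when the second has no match, which makes missing matches visible as missing values later on.
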
Look at the raw data
- Sort data
- Filter data by various criteria
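
As a small sketch of sorting and filtering, assuming a toy table with hypothetical columns `city`, `price`, and `sold_at`:

```python
import pandas as pd

# Toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "city": ["Bern", "Basel", "Bern", "Zurich"],
    "price": [12.0, 250.0, 8.5, 40.0],
    "sold_at": pd.to_datetime(["2022-03-01", "2022-01-15", "2022-02-10", "2022-03-20"]),
})

# Sort so that the extremes appear at the top.
by_price = df.sort_values("price", ascending=False)

# Filter with a boolean mask ...
bern_only = df[df["city"] == "Bern"]

# ... or with query() for simple conditions.
cheap = df.query("price < 50")

# Combine several criteria with & (and) or | (or).
recent_cheap = df[(df["price"] < 50) & (df["sold_at"] >= "2022-03-01")]
```
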
Investigation
- Nonsensical observations/artifacts?
- Coding of categorical features?
- Missing values?
- Outliers?
- Constant values (= zero importance)?
- Low importance features?
- Collinear, correlated or otherwise dependent features?
- Highly skewed features?
- Irrelevant features?
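
Several of these questions can be answered with a few pandas calls. The sketch below is one possible approach, not a definitive one: the toy data, the 0.95 correlation threshold, and the 1.5 x IQR outlier rule in the hypothetical `iqr_outliers` helper are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately planted issues.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],      # one extreme value
    "b": [2.0, 4.0, 6.0, 8.0, 200.0],      # exactly 2 * a, so perfectly correlated
    "c": [5, 5, 5, 5, 5],                  # constant
    "d": [1.0, np.nan, 3.0, np.nan, 5.0],  # 40% missing
})

# Missing values: share of missing entries per column.
missing_share = df.isna().mean()

# Constant values: columns with at most one distinct value.
constant_columns = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Correlated features: absolute Pearson correlation, upper triangle only,
# so that each pair is reported once.
numeric = df.drop(columns=constant_columns)
corr = numeric.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.95  # threshold is an assumption
]

# Highly skewed features: sample skewness per column.
skewness = numeric.skew()


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


outlier_mask = iqr_outliers(df["a"])
```
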
Univariate Analysis
- Look at mean, median, min, max, std, iqr, quantiles (1%, 5%, 25%, 50%, 75%, 95%, 99%)
- Draw boxplots, histograms
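
The summary statistics listed above can be obtained in one call with `describe`. A minimal sketch on toy data (the series 1..100 is made up); the plotting calls are left as comments because they additionally require matplotlib:

```python
import pandas as pd

# Toy data: the numbers 1..100.
values = pd.Series(range(1, 101), name="x", dtype=float)

# count, mean, std, min, max plus the requested quantiles.
percentiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]
summary = values.describe(percentiles=percentiles)

# The interquartile range is not part of describe(), so add it by hand.
summary["iqr"] = summary["75%"] - summary["25%"]

# With matplotlib installed, the plots are one-liners:
# values.plot.box()
# values.plot.hist(bins=20)
```
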
Multivariate Analysis
- Draw scatter plots
- Create correlation matrix
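
A correlation matrix is a single call in pandas. The sketch below, on made-up data, also computes the correlation of the ranks (which is the Spearman rank correlation) as an assumption-labeled extra, since Pearson correlation only captures linear relationships:

```python
import pandas as pd

# Toy data; y is linear in x, z is monotonic but non-linear in x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [1, 8, 27, 64, 125],
})

# Pearson correlation matrix (the default of DataFrame.corr).
pearson = df.corr()

# Pearson correlation of the ranks, i.e. Spearman rank correlation.
rank_corr = df.rank().corr()

# With matplotlib installed, pairwise scatter plots are available via:
# pd.plotting.scatter_matrix(df)
```

Here `x` and `z` have a rank correlation of 1 but a Pearson correlation below 1, which is a hint to look at the scatter plot rather than trust a single number.
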
Time Series? -> Plot variables over time
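
For time series, a datetime index makes plotting and aggregating over time straightforward. A minimal sketch with hypothetical column names `timestamp` and `value`; the window and resampling interval are arbitrary choices:

```python
import pandas as pd

# Toy daily series.
ts = (
    pd.DataFrame({
        "timestamp": pd.date_range("2022-01-01", periods=6, freq="D"),
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    })
    .set_index("timestamp")
    .sort_index()
)

# Smooth with a rolling mean before plotting noisy data.
rolling_mean = ts["value"].rolling(window=3).mean()

# Aggregate to a coarser interval (here: two-day means).
two_day_mean = ts["value"].resample("2D").mean()

# With matplotlib installed:
# ts["value"].plot()
```
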
Fixing issues
- Impute missing values (mode, median, mean)
- Remove variables that have too many missing values
- Remove observations that have too many missing values
- Select appropriate time slice
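
One way to apply these fixes with pandas is sketched below. The toy data and the 50% thresholds for "too many" missing values are assumptions; appropriate thresholds depend on the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with a datetime index.
df = pd.DataFrame(
    {
        "temp": [20.0, np.nan, 22.0, 24.0, 21.0],
        "color": ["red", "red", None, "blue", "red"],
        "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
    },
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

# Remove variables (columns) with more than 50% missing values.
keep_columns = df.columns[df.isna().mean() <= 0.5]

# Remove observations (rows) with more than 50% missing values,
# counted over the columns that are kept.
keep_rows = df[keep_columns].isna().mean(axis=1) <= 0.5

cleaned = df.loc[keep_rows, keep_columns].copy()

# Impute: median for a numeric column, mode for a categorical one.
cleaned["temp"] = cleaned["temp"].fillna(cleaned["temp"].median())
cleaned["color"] = cleaned["color"].fillna(cleaned["color"].mode().iloc[0])

# Select a time slice (both endpoints are included for a datetime index).
time_slice = cleaned.loc["2022-01-02":"2022-01-04"]
```
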
Preparation
- Clip values that are too small/too large
- Scale to [0,1] (min-max), standardize (mean=0, std=1), or use robust / quantile scaling
- One-hot encoding, Label Encoding (0,1,2,3)
- Create log-transformed versions for highly skewed variables
- Create binned versions for variables
- Combine categories for highly skewed categorical variables
- Create sum/difference/product/quotient of variables
- Create polynomial features
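
Most of the preparation steps above can be written in plain pandas and NumPy. The following is a sketch under assumptions: the toy columns, the 5%/95% clipping quantiles, and the bin edges are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; "income" is highly skewed.
df = pd.DataFrame({
    "income": [1_000.0, 2_000.0, 3_000.0, 4_000.0, 1_000_000.0],
    "age": [18, 35, 52, 70, 41],
    "size": ["S", "M", "L", "M", "S"],
})

# Clip values that are too small or too large (here: 5% and 95% quantiles).
low, high = df["income"].quantile([0.05, 0.95])
df["income_clipped"] = df["income"].clip(lower=low, upper=high)

# Min-max scaling to [0, 1] and standardization to mean=0, std=1.
age = df["age"]
df["age_minmax"] = (age - age.min()) / (age.max() - age.min())
df["age_std"] = (age - age.mean()) / age.std(ddof=0)

# Log-transformed version of a skewed variable (log1p also handles zeros).
df["income_log"] = np.log1p(df["income"])

# Binned version of a variable.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"])

# One-hot encoding and label encoding.
# Note: the integer codes follow alphabetical category order (L=0, M=1, S=2),
# not the natural size order.
one_hot = pd.get_dummies(df["size"], prefix="size")
df["size_code"] = df["size"].astype("category").cat.codes

# Quotient of two variables and a simple polynomial feature.
df["income_per_age"] = df["income"] / df["age"]
df["age_squared"] = df["age"] ** 2
```

Scaling parameters such as the minimum, maximum, mean, and clipping bounds should be computed on the training data only and then reused on validation and test data, otherwise information leaks into the evaluation.
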

Comments

Minor adjustments before being released in Devchecklists.

Hey Florian! First of all, thanks for submitting your Explorative Data Analysis checklist!

We would love to release it on Devchecklists. Before that, could you please take a look at our guide for submitting new checklists? Basically, you will need to rename your file to checklist-en.

We are still working on some minor improvements before officially launching it, so any feedback is appreciated. Let me know if you need more information.

opened by felipefarias
