Mining the Stack Overflow Developer Survey
A prototype data mining application that compares the accuracy of decision tree and random forest regression models for predicting the annual compensation of tech workers in the US and Europe.
Objectives
Usage
To run, download the repository and execute the file main.py in the src directory with your Python interpreter. For example: python3 main.py
Dependencies
- Python 3.8.1 or later
- pandas 1.3.4 or later
- Matplotlib 3.4.3 or later
- NumPy 1.21.0 or later
- scikit-learn 1.0.1 or later
Methodology
Preprocessing
The original data set provided by Stack Overflow contained 48 attribute columns and 83,439 records. Because the data set is so large, we narrowed our focus to a subset of it. During preprocessing of the original data file, we discarded any records for respondents not employed full-time in the technology industry. Any record missing the country, converted annual salary, or years coded was also discarded, as these attributes are vital to our models. We also dropped the open-ended columns from the original data set. The remaining records were exported to two output CSV files: United States records in one file and records from European countries in the other. Data from all other countries were discarded.

Once we had the two cleaned files, we applied additional preprocessing. Any remaining missing nominal attributes were replaced with 'NA'. Two special cases existed in the columns for years coded and years coded professionally: most values were numeric, but some contained the strings 'Less than one year' or 'More than 50 years'. These were replaced with 0 and 50, respectively, to keep the columns numerical. With these preprocessing steps complete, the data files are ready to be processed to generate the models.
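The snippet below is a minimal sketch of these preprocessing steps using pandas. The column names (Employment, MainBranch, Country, ConvertedCompYearly, YearsCode, YearsCodePro) follow the public 2021 survey schema, and the filter values and the list of European countries are illustrative assumptions; the project's own script may differ in these details.

```python
import pandas as pd

# Load the raw survey export (column names assume the 2021 public schema).
df = pd.read_csv("survey_results_public.csv")

# Keep only respondents employed full-time in the tech industry
# (illustrative filter values; adjust to match the project's criteria).
df = df[(df["Employment"] == "Employed full-time")
        & (df["MainBranch"] == "I am a developer by profession")].copy()

# Drop records missing the attributes the models depend on.
df = df.dropna(subset=["Country", "ConvertedCompYearly", "YearsCode"])

# Keep the coding-experience columns numeric by mapping the two string cases.
experience_map = {"Less than 1 year": 0, "More than 50 years": 50}
for col in ["YearsCode", "YearsCodePro"]:
    df[col] = pd.to_numeric(df[col].replace(experience_map), errors="coerce")

# Replace any remaining missing nominal attributes with the string 'NA'.
nominal_cols = df.select_dtypes(include="object").columns
df[nominal_cols] = df[nominal_cols].fillna("NA")

# Split into US and European records; all other countries are discarded.
european_countries = {"Germany", "France", "Netherlands", "Poland", "Spain",
                      "Sweden", "United Kingdom of Great Britain and Northern Ireland"}  # illustrative subset
df[df["Country"] == "United States of America"].to_csv("us_data.csv", index=False)
df[df["Country"].isin(european_countries)].to_csv("eu_data.csv", index=False)
```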
Models
We evaluated a variety of data mining models and algorithms to find those best suited to our data set and objectives. Because our goal is to predict a numerical value for annual salary, we needed a regression model. We chose regression models for decision trees and random forests so that we could compare the accuracy of a single decision tree with that of a random forest, which is an ensemble of many decision trees. The results are detailed in the results and analysis section. The implementation details of each model are below.
Decision tree model
We selected the DecisionTreeRegressor model from the scikit-learn machine learning package. To obtain the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy for validation. The parameter we varied was the maximum depth of the tree. Additional factors that affect the model are the train/test split and the number of cross-validation folds; we used 20% of the data for testing, 80% for training, and 10-fold cross-validation. Out of every combination we tried, a maximum depth of ADD RES HERE produced the most accurate decision tree model, with an accuracy of ADD RES HERE. The program outputs the tree itself; several statistics of the model, such as R-squared, mean absolute error, and mean squared error; and the ten attributes with the largest weight in determining the result. With the best model selected, we validated it against the testing data set. These model generation steps were performed for both the US data and the European data.
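A minimal sketch of this selection procedure is shown below. The input file, the target column ConvertedCompYearly, the one-hot encoding of nominal attributes, and the candidate depth values are illustrative assumptions rather than the project's exact configuration.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("us_data.csv")

# One-hot encode nominal attributes; the target is the converted annual salary.
X = pd.get_dummies(df.drop(columns=["ConvertedCompYearly"]))
y = df["ConvertedCompYearly"]

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Sweep the maximum depth with 10-fold cross-validation and keep the best tree.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": range(2, 21)},  # illustrative candidate depths
    cv=10,
    scoring="r2",
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_

# Validate the selected model against the held-out test set.
predictions = best_tree.predict(X_test)
print("Best max_depth:", search.best_params_["max_depth"])
print("R-squared:", r2_score(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))
print("MSE:", mean_squared_error(y_test, predictions))

# Ten attributes with the largest weight in determining the result.
importances = pd.Series(best_tree.feature_importances_, index=X.columns)
print(importances.nlargest(10))
```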
Random forest model
We selected the RandomForestRegressor model from the scikit-learn machine learning package. To obtain the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy for validation. The parameters we varied were the number of trees in the forest and the maximum depth of each tree. Additional factors that affect the model are the train/test split and the number of cross-validation folds; we used 20% of the data for testing, 80% for training, and 10-fold cross-validation. Out of every combination we tried, ADD RES HERE trees with a maximum depth of ADD RES HERE produced the most accurate random forest model, with an accuracy of ADD RES HERE. The program outputs several statistics of the model, such as R-squared, mean absolute error, and mean squared error, and the ten attributes with the largest weight in determining the result. With the best model selected, we validated it against the testing data set. These model generation steps were performed for both the US data and the European data.
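The random forest selection follows the same pattern as the decision tree; the sketch below again uses illustrative column names and candidate parameter values rather than the project's exact grid.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("us_data.csv")
X = pd.get_dummies(df.drop(columns=["ConvertedCompYearly"]))
y = df["ConvertedCompYearly"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune both the number of trees and the maximum depth with 10-fold cross-validation.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [50, 100, 200],  # illustrative candidate values
        "max_depth": [5, 10, 15, 20],
    },
    cv=10,
    scoring="r2",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out R-squared:", search.best_estimator_.score(X_test, y_test))
```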