Mining the Stack Overflow Developer Survey

Overview

A prototype data mining application to compare the accuracy of decision tree and random forest regression models to predict annual compensation of tech workers in the US and Europe.

Objectives

Usage

To run the application, download the repository and execute main.py in the src directory with your Python interpreter. For example: python3 main.py.

Dependencies

  • Python 3.8.1 or later
  • pandas 1.3.4 or later
  • matplotlib 3.4.3 or later
  • numpy 1.21.0 or later
  • scikit-learn (imported as sklearn) 1.0.1 or later
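
To see which versions are installed before running the application, a small check along these lines may help. It is only a sketch: the MINIMUM table restates the list above, scikit-learn is the distribution name under which the sklearn package is installed, and the helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions restated from the dependency list above.
MINIMUM = {
    "pandas": "1.3.4",
    "matplotlib": "3.4.3",
    "numpy": "1.21.0",
    "scikit-learn": "1.0.1",
}


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(MINIMUM)
for name, minimum in MINIMUM.items():
    print(f"{name}: installed={report[name]}, required>={minimum}")
```

The script only reports versions and does not compare them, so the output still needs to be read against the required minimums.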

Methodology

Preprocessing

The original data set provided by Stack Overflow contains 48 attribute columns and 83,439 records. Because the data set is large, we narrowed our focus to a subset of it. During preprocessing of the original file, we discarded every record whose respondent was not employed full-time in the technology industry. We also discarded any record missing the country, converted annual salary, or years coded, since these attributes are vital to our model, and we dropped some of the open-ended columns.

The records that met these requirements were exported to two CSV files: one for the United States and one for European countries. Data from all other countries was discarded.

Once we had the two cleaned files, we applied additional preprocessing. Any remaining missing values in nominal attributes were replaced with 'NA'. The columns for years coded and years coded professionally needed special handling: most values were numeric, but some were the strings 'Less than one year' and 'More than 50 years'. These strings were replaced with 0 and 50, respectively, to keep the columns numerical. With these steps complete, the data files were ready to be used to generate the models.
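
The preprocessing steps described above can be sketched with pandas roughly as follows. This is an illustration under assumptions, not the project's actual code: the column names (Employment, Country, ConvertedCompYearly, YearsCode, YearsCodePro, EdLevel), the exact label strings, the short EUROPE set, and the helper names clean and split_regions are all hypothetical and should be adjusted to the real survey file.

```python
import pandas as pd

# Assumed column names and labels; adjust them to the actual survey file.
REQUIRED = ["Country", "ConvertedCompYearly", "YearsCode"]
YEAR_COLUMNS = ["YearsCode", "YearsCodePro"]
YEAR_LABELS = {"Less than 1 year": 0, "More than 50 years": 50}
US = "United States of America"
EUROPE = {"Germany", "France", "Spain"}  # hypothetical, incomplete list


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Keep full-time records with the required fields and normalize values."""
    df = df[df["Employment"] == "Employed full-time"]
    df = df.dropna(subset=REQUIRED).copy()
    for col in YEAR_COLUMNS:
        # Map the two special strings to numbers, then coerce to numeric.
        mapped = df[col].map(lambda v: YEAR_LABELS.get(v, v))
        df[col] = pd.to_numeric(mapped)
    # Fill remaining gaps in nominal (non-numeric) columns with 'NA'.
    nominal = df.select_dtypes(exclude="number").columns
    df[nominal] = df[nominal].fillna("NA")
    return df


def split_regions(df: pd.DataFrame):
    """Return (US records, European records); other countries are dropped."""
    return df[df["Country"] == US], df[df["Country"].isin(EUROPE)]


# Tiny made-up frame standing in for the survey data.
raw = pd.DataFrame({
    "Employment": ["Employed full-time", "Student, full-time",
                   "Employed full-time", "Employed full-time"],
    "Country": [US, US, "Germany", "Brazil"],
    "ConvertedCompYearly": [120000.0, None, 65000.0, 30000.0],
    "YearsCode": ["Less than 1 year", "5", "More than 50 years", "12"],
    "YearsCodePro": [None, "2", "30", "8"],
    "EdLevel": ["Bachelor", None, None, "Master"],
})
us, europe = split_regions(clean(raw))
```

In a real run, each of the two frames would then be written out with DataFrame.to_csv to produce the two cleaned files.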

Models

We evaluated a variety of data mining models and algorithms to find the ones best suited to our data set and objectives. Because our goal is to predict a numerical value (annual salary), we needed regression models. We chose the regression variants of decision trees and random forests so that we could compare the accuracy of a single decision tree with that of a random forest, which combines a number of trees. The outcome is detailed in the Results and Analysis section. The implementation details of each model follow.

Decision tree model

We selected the DecisionTreeRegressor model from the scikit-learn machine learning package. To get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameter we varied was the maximum depth of the tree. Other factors that affect the model are the testing split percentage and the number of cross-validation folds. For our models, we used 20% of the data for testing and 80% for training, with 10 cross-validation folds. Out of every combination we tried, we found that a maximum depth of ADD RES HERE resulted in the most accurate decision tree model. The accuracy of the model was ADD RES HERE. The model outputs the tree itself, several statistics such as R-squared, mean absolute error, and mean squared error, and the ten attributes that carry the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These model generation steps were carried out for both the US data and the European data.
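
A minimal sketch of this procedure is shown below. It assumes GridSearchCV is used for the depth search and R-squared as the selection score, neither of which is stated above; the synthetic data and the candidate depths are placeholders, not the survey data or the values actually tried.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded survey features and salaries.
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(300, 3))
y = 40000 + 2500 * X[:, 0] + rng.normal(0, 5000, size=300)

# 80% training, 20% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try several maximum depths with 10-fold cross-validation on the training set.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},  # placeholder candidates
    cv=10,
    scoring="r2",
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_

# Validate the selected model against the held-out testing data.
predictions = best_tree.predict(X_test)
tree_metrics = {
    "r2": r2_score(y_test, predictions),
    "mae": mean_absolute_error(y_test, predictions),
    "mse": mean_squared_error(y_test, predictions),
}

# Indices of the (up to) ten most important features, largest first.
top_features = np.argsort(best_tree.feature_importances_)[::-1][:10]
print(search.best_params_, tree_metrics, top_features)
```

To print the fitted tree as text, sklearn.tree.export_text(best_tree) is one option.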

Random forest model

We selected the RandomForestRegressor model from the scikit-learn machine learning package. To get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameters we varied were the number of trees in the forest and the maximum depth of each tree. Other factors that affect the model are the testing split percentage and the number of cross-validation folds. For our models, we used 20% of the data for testing and 80% for training, with 10 cross-validation folds. Out of every combination we tried, we found that ADD RES HERE trees in the forest with a maximum depth of ADD RES HERE resulted in the most accurate random forest model. The accuracy of the model was ADD RES HERE. The model outputs the tree itself, several statistics such as R-squared, mean absolute error, and mean squared error, and the ten attributes that carry the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These model generation steps were carried out for both the US data and the European data.
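
The search over forest size and depth might look like the following sketch. It loops over the combinations explicitly and scores each with cross_val_score; the synthetic data, the candidate values, and the use of R-squared as the selection score are assumptions for illustration only.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the encoded survey features and salaries.
rng = np.random.default_rng(1)
X = rng.uniform(0, 40, size=(300, 3))
y = 40000 + 2500 * X[:, 0] + 800 * X[:, 1] + rng.normal(0, 5000, size=300)

# 80% training, 20% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Score every (number of trees, maximum depth) combination with 10 folds.
results = {}
for n_trees, depth in product([10, 50], [4, 8]):  # placeholder candidates
    forest = RandomForestRegressor(
        n_estimators=n_trees, max_depth=depth, random_state=1
    )
    scores = cross_val_score(forest, X_train, y_train, cv=10, scoring="r2")
    results[(n_trees, depth)] = scores.mean()

# Refit the best combination on all training data, then validate it.
best_n_trees, best_depth = max(results, key=results.get)
best_forest = RandomForestRegressor(
    n_estimators=best_n_trees, max_depth=best_depth, random_state=1
).fit(X_train, y_train)
test_r2 = best_forest.score(X_test, y_test)  # R-squared on the testing set
print(best_n_trees, best_depth, round(test_r2, 3))
```

The individual trees of the fitted forest are available in best_forest.estimators_, and best_forest.feature_importances_ gives the attribute weights averaged over those trees.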

Results and Analysis

Authors
