Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

Overview

WeRateDogs Twitter Data from 2015 to 2017

Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

Table of Contents

  1. Introduction
  2. Project Overview
  3. Requirements
  4. Project Movitivation
  5. Key Files
  6. Results
  7. Licensing, Authors, and Acknowledgements

1. Introduction

Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, then cleaned it. This is called data wrangling. I documented my wrangling efforts in a Jupyter Notebook, then showcased them through analyses and visualizations using Python and its libraries.

The dataset that I wrangled (and analyzing and visualizing) was the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

WRD_twitter_banner

2. Project Overview

Tasks in this project were as follows:

  • Step 1: Gathering data
  • Step 2: Assessing data
  • Step 3: Cleaning data
  • Step 4: Storing data
  • Step 5: Analyzing, and visualizing data
  • Step 6: Reporting
    • My data wrangling efforts
    • My data analyses and visualizations

3. Requirements

This project was created in a Jupyter Notebook made available via Anaconda and written in python.\ The following versions of languages and libraries were used in creating this project:

  • python==2.7.18
  • ipython==7.31.0
  • matplotlib==3.5.1
  • numpy==1.22.0
  • pandas==1.3.5
  • requests==2.27.1
  • scipy==1.7.3
  • seaborn==0.11.2
  • tweepy==4.4.0

4. Project Motivation

The goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

The overall purpose of this Udacity project was to refine our data wrangling skills with secondary importance on delivering multiple polished visualzations and tell a story or solve a problem. In other words, the journey was more important than the destination.

5. Key Files

  • twitter_archive_enhanced.csv
    The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which Udacity used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only those tweets with ratings were filtered. The data was extracted programmatically by Udacity, but the data was left messy on purpose. The ratings aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. I had to assess and clean these columns to use them for analysis and visualization.

  • tweet_json.txt
    Resulting data queried using Twitter's API. It was necessary to gather the retweet count and favorite count which were omitted from the basic twitter_archive_enhanced.csv.

  • image-predictions.tsv
    Udacity ran every image in the WeRateDogs Twitter archive was through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

  • wrangle_act.ipynb
    This contains the bulk of the project. This notebook contains all code for gathering, assessing, cleaning, analyzing, and visualizing data.

  • wrangle_report.pdf
    This was a report for documenting the data wrangling process: gather, assess, and clean.

  • act_report.pdf
    Documentation of analysis and insights

  • twitter_archive_master.csv
    Cleaned and merged dataset containing data from the 3 source data sets

6. Results

As said in the project motivation, the data wrangling process itself was more relevant than uncovering insights. At any rate, I was able to answer the following 4 questions:

  1. What is the most retweeted tweet?
    From the data I had from 2015 to 2017, this gem was the most retweeted tweet.
  2. What is the most common rating?
    12/10
  3. What are the most common breeds found by the neural network?
    The top 5, from less to most common, were Pug, Chihuahua, Welsh Corgi, Labrador Retriever, then finally Golden Retriever.
  4. What is the average retweet count for each rating?
    Screen Shot 2022-01-11 at 21 22 39
    I saw a general positive correlation between dog rating and retweet count (i.e. popularity). 13/10 and 14/10 tweets had the most retweets on average. Further details of the results can be seen in the act_report.pdf file.

7. Licensing, Authors, and Acknowledgements

All data provided and sourced by Udacity.

You might also like...
Techdegree Data Analysis Project 2

Basketball Team Stats Tool In this project you will be writing a program that reads from the "constants" data (PLAYERS and TEAMS) in constants.py. Thi

Python Project on Pro Data Analysis Track

Udacity-BikeShare-Project: Python Project on Pro Data Analysis Track Basic Data Exploration with pandas on Bikeshare Data Basic Udacity project using

In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

 Project under the certification
Project under the certification "Data Analysis with Python" on FreeCodeCamp

Sea Level Predictor Assignment You will anaylize a dataset of the global average sea level change since 1880. You will use the data to predict the sea

MS in Data Science capstone project. Studying attacks on autonomous vehicles.
MS in Data Science capstone project. Studying attacks on autonomous vehicles.

Surveying Attack Models for CAVs Guide to Installing CARLA and Collecting Data Our project focuses on surveying attack models for Connveced Autonomous

Pyspark project that able to do joins on the spark data frames.

SPARK JOINS This project is to perform inner, all outer joins and semi joins. create_df.py: load_data.py : helps to put data into Spark data frames. d

A Streamlit web-app for a data-science project that aims to evaluate if the answer to a question is helpful.

How useful is the aswer? A Streamlit web-app for a data-science project that aims to evaluate if the answer to a question is helpful. If you want to l

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

This mini project showcase how to build and debug Apache Spark application using Python
This mini project showcase how to build and debug Apache Spark application using Python

Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark application on Spark container

Owner
Keenan Cooper
Keenan Cooper
Uses MIT/MEDSL, New York Times, and US Census datasources to analyze per-county COVID-19 deaths.

Covid County Executive summary Setup Install miniconda, then in the command line, run conda create -n covid-county conda activate covid-county conda i

Ahmed Fasih 1 Dec 22, 2021
follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

Yin-Chiuan Chen 2 May 2, 2022
The repo for mlbtradetrees.com. Analyze any trade in baseball history!

The repo for mlbtradetrees.com. Analyze any trade in baseball history!

null 7 Nov 20, 2022
MDAnalysis is a Python library to analyze molecular dynamics simulations.

MDAnalysis Repository README [*] MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale,

MDAnalysis 933 Dec 28, 2022
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Mohammed Hassan 13 Mar 31, 2022
A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

Unnikrishnan 2 Dec 12, 2021
A real data analysis and modeling project - restaurant inspections

A real data analysis and modeling project - restaurant inspections Jafar Pourbemany 9/27/2021 This project represents data analysis and modeling of re

Jafar Pourbemany 2 Aug 21, 2022
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 2, 2022
Project: Netflix Data Analysis and Visualization with Python

Project: Netflix Data Analysis and Visualization with Python Table of Contents General Info Installation Demo Usage and Main Functionalities Contribut

Kathrin Hälbich 2 Feb 13, 2022
Renato 214 Jan 2, 2023