# End-to-end Data Science project
This repo contains the notebooks, code, and additional material used in ITI's workshop. The goal of the sessions was to illustrate the end-to-end process of a real data science project.
## Additional material
In addition to the notebooks and code, the following material is also available:
- Video recordings of the sessions are uploaded to YouTube
- Slide decks are also added to this repo here
## Problem statement
Our (fictional) client is an IT educational institute. They have reached out to us with the following: “IT jobs and technologies keep evolving quickly. This makes our field one of the most interesting out there. But on the other hand, such fast development confuses our students. They do not know which skills they need to learn for which job. ‘Do I need to learn C++ to be a Data Scientist?’ ‘Do DevOps and System admins use the same technologies?’ ‘I really like JavaScript; can I use it in Data Analytics?’ Those are some of the questions that our students ask. Could you please develop a data-driven solution for our students to answer such questions? They mostly want to understand the relationships between the jobs and the technologies.”
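Before any modeling, a simple co-occurrence table already goes a long way toward answering the client's question. Here is a minimal pandas sketch, assuming a hypothetical `survey.csv` with one row per respondent, a `job_title` column, and a semicolon-separated `technologies` column (the file and column names are illustrative assumptions, not the workshop's actual data):

```python
import pandas as pd

# Hypothetical survey export: one row per respondent; 'technologies'
# holds a semicolon-separated list of tools the respondent uses.
df = pd.read_csv("survey.csv")

# Explode the multi-valued column so each (job, technology) pair is one row.
pairs = (
    df.assign(technology=df["technologies"].str.split(";"))
      .explode("technology")
)

# Co-occurrence counts: which technologies show up under which job titles?
cooc = pd.crosstab(pairs["job_title"], pairs["technology"])

# Example question: what do Data Scientists report most often?
print(cooc.loc["Data Scientist"].sort_values(ascending=False).head(10))
```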
## Level guide
| | Basic | Intermediate | Advanced |
| --- | --- | --- | --- |
| Business case | Decide on the KPIs that you will positively influence | Calculate the expected financial returns | |
| Data collection | Decide on and collect a suitable data source for your business case | Decide on, collect, and connect multiple data sources for better performance | |
| Legal review | Get basic information about the local data privacy law | Study the local data privacy law | |
| Cookiecutter | Create the standard directory structure | | |
| Git | Use Git's GUI to track the master branch | Use Git's CLI to track a dev branch and merge back to master | Decide on a branching strategy and resolve merge conflicts |
| Environments | Install Python packages using conda | Create a dedicated conda environment | Share your environment and install it on a different machine |
| Data cleaning | Use basic statistics to filter out nonsensical entries (sketched below) | Use advanced statistics and unsupervised learning to filter out nonsensical entries | Calculate a 'sanity probability value' for each data point and use it later as a sample weight |
| Descriptive analytics | Calculate summary statistics to provide data insights | Produce visualizations to provide deeper understanding | Apply unsupervised learning to provide even deeper understanding |
| Predictive analytics | Create a single baseline model (sketched below) | Create multiple hyperparameter-tuned models and benchmark their performance | Combine the chosen models via an ensemble and provide prediction confidence |
| Prescriptive analytics | Recommend the action that the user should take | | |
| Software engineering | Refactor your notebooks into simple Python scripts | Create a production OOP class for predictions (sketched below) | Expose your model through an API |
| MLOps | Export and load models from pickle files (sketched below) | Track your models using MLflow | Create and run a Docker image for your project |
| Product | Create a web app / GUI to expose the prediction functionality (sketched below) | Add the relevant historical insights, predictions, and optimization results | Collect users' feedback and retrain your model accordingly |
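To make some of these rows concrete, a few hedged Python sketches follow. First, data cleaning at the Basic level: filtering nonsensical entries with simple statistics. This assumes a hypothetical `years_experience` column; the 0-60 bounds and the 1.5 × IQR rule are illustrative choices, not workshop requirements:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name, as above

# Rule-based sanity check: negative experience or implausibly long careers.
df = df[df["years_experience"].between(0, 60)]

# Statistical filter: drop points outside 1.5 * IQR of the column.
q1, q3 = df["years_experience"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["years_experience"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```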
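For predictive analytics, the progression is from a single baseline model to hyperparameter-tuned models benchmarked against it. A scikit-learn sketch; the synthetic data stands in for the real job/technology features, and the model and grid choices are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in for the real features (e.g. technologies used) and labels (job titles).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Basic: a baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

# Intermediate: a hyperparameter-tuned model, benchmarked on the same split.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_train, y_train)
print("tuned accuracy:", grid.score(X_test, y_test))
```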
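The Software engineering row moves from scripts to a production class and finally an API. A sketch of the Intermediate step, an OOP wrapper around a trained model; the class and method names are illustrative, not taken from the workshop code:

```python
import pickle

class JobTechPredictor:
    """Wraps a trained model behind a stable, notebook-free interface."""

    def __init__(self, model_path: str):
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def predict(self, features):
        """Return the predicted job title(s) for a batch of feature rows."""
        return self.model.predict(features)

# Usage (the Advanced level would expose this through an API,
# e.g. with Flask or FastAPI):
# predictor = JobTechPredictor("model.pkl")
# predictor.predict(new_rows)
```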
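The MLOps row starts with plain pickle files. A minimal sketch of exporting and reloading a fitted model; note that pickles should only ever be loaded from trusted sources:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
model = LogisticRegression(max_iter=1000).fit(X, y)

# Export the fitted model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back in another process or script.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

assert (restored.predict(X) == model.predict(X)).all()
```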
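Finally, for the Product row, a tiny GUI can sit directly on top of the predictor. A sketch assuming Streamlit (an assumption; the workshop may use any web framework) and the hypothetical predictor and co-occurrence table from the earlier sketches:

```python
import streamlit as st

st.title("Which technologies fit which job?")

# Hypothetical job list; in practice this would come from the survey data.
job = st.selectbox("Pick a job title", ["Data Scientist", "DevOps", "Sysadmin"])

# In a real app, look up the technologies associated with `job` from the
# trained model or the co-occurrence table and display them here.
st.write(f"Top technologies for {job} would be shown here.")
```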