Code for the Higgs Boson Machine Learning Challenge organised by CERN & EPFL

Overview

A method to solve the Higgs boson challenge using Least Squares - Novae

This project is the Project 1 of EPFL CS-433 Machine Learning. The project is the same as the Higgs Boson Machine Learning Challenge posted on Kaggle. The dataset and the detailed description can also be found in the GitHub repository of the course.

Team name: Novae

Team members: Giacomo Orsi, Vittorio Rossi, Chun-Tso Tsai

About the Project

The task of this project is to train a model based on the provided train.csv to have the best prediction on the data given in test.csv or any other general case.

We built our model for the problem using regularized linear regression after applying some data cleaning and features engineering techniques. A report describing our approach and our results can be found in the file report.pdf. In the end, we obtained an accuracy of 0.836 and an F1 score of 0.751 on the test.csv dataset.

Instructions

  • The project runs under Python 3.8 and requires NumPy=1.19.
  • Please make sure to place train.csv and test.csv inside the data folder. Those files can be downloaded here.
  • Go to the script/ folder and execute run.py. A model will be trained with the given hyper-parameters and predictions for the test dataset will be outputed in the file out.csv.

Modules

implementations.py

Contains the implementations of different learning algorithms. Including

  • Least squares linear regression
    • least_squares: Direct computation from linear equations.
    • least_squares_GD: Gradient descent.
    • least_squares_SGD: Stochastic gradient descent.
    • ridge_regression: Regularized linear regression from direct computation.
  • Logistic regression
    • logistic_regression: Gradient descent
    • reg_logistic_regression: Gradient descent with regularization.

There are also some helper functions in this file to facilitate the above functions.

data_processing.py

Calls the following files to process the data.

  • data_cleaning.py: Contains functions used to
    1. Categorize data into subgroups.
    2. Replace missing values with the median.
    3. Standardize the features.
  • feature_engineering.py: Contains functions used to generate our interpretable features.

run.py

Generates the submission .csv file based on the data of test.csv stored in the folder data/. Our optimized model is also defined in this file.

Some helper Functions

  • models.py: Create the models for predicting the labels for new data points without true labels.
  • expansions.py: Contains a function to apply polynomial expansion to our features to add extra degrees of freedom for our models.
  • proj1_helpers.py: Contains functions which loads the .csv files as training or testing data, and create the .csv file for submission.
  • cross_validation.py: Contains a function to build the index for k-fold cross_validation.
  • disk_helper.py: Save/load the NumPy array to disk for further usage. Useful for saving hyper-parameters when trying a long training process.

Notebook

It is possible to use the Jupyter notebook project_notebook.ipynb located in the scripts folder to train the best hyper-parameters for the model. In the notebook it is possible to cross-validate a logistic and a least square regression model over given lambdas and degrees.

You might also like...
My personal code and solution to the Synacor Challenge from 2012 OSCON.

Synacor OSCON Challenge Solution (2012) This repository contains my code and solution to solve the Synacor OSCON 2012 Challenge. If you are interested

Rank 3 : Source code for OPPO 6G Data Generation Challenge
Rank 3 : Source code for OPPO 6G Data Generation Challenge

OPPO 6G Data Generation with an E2E Framework Homepage of OPPO 6G Data Generation Challenge Datasets H1_32T4R.mat H2_32T4R.mat Please put the original

CityLearn Challenge Multi-Agent Reinforcement Learning for Intelligent Energy Management, 2020, PikaPika team

Citylearn Challenge This is the PyTorch implementation for PikaPika team, CityLearn Challenge Multi-Agent Reinforcement Learning for Intelligent Energ

DTCN SMP Challenge - Sequential prediction learning framework and algorithm
DTCN SMP Challenge - Sequential prediction learning framework and algorithm

DTCN This is the implementation of our paper "Sequential Prediction of Social Me

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and m

The comma.ai Calibration Challenge!

Welcome to the comma.ai Calibration Challenge! Your goal is to predict the direction of travel (in camera frame) from provided dashcam video. This rep

CVPR 2021 Challenge on Super-Resolution Space
CVPR 2021 Challenge on Super-Resolution Space

Learning the Super-Resolution Space Challenge NTIRE 2021 at CVPR Learning the Super-Resolution Space challenge is held as a part of the 6th edition of

AI grand challenge 2020 Repo (Speech Recognition Track)
AI grand challenge 2020 Repo (Speech Recognition Track)

KorBERT를 활용한 한국어 텍스트 기반 위협 상황인지(2020 인공지능 그랜드 챌린지) 본 프로젝트는 ETRI에서 제공된 한국어 korBERT 모델을 활용하여 폭력 기반 한국어 텍스트를 분류하는 다양한 분류 모델들을 제공합니다. 본 개발자들이 참여한 2020 인공지

The G|oogl|e challenge for Quantum Coalition Hackathon 2021

Qchack 2021 Google Challenge This is a challenge for the brave 2021 qchack.io participants. Instructions Hello, intrepid qchacker, welcome to the G|o

Owner
Giacomo Orsi
CS Student at EPFL. Previously at University of Bologna
Giacomo Orsi
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

Machine Learning From Scratch About Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The purpose

Erik Linder-Norén 21.8k Jan 9, 2023
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

This is the Vowpal Wabbit fast online learning code. Why Vowpal Wabbit? Vowpal Wabbit is a machine learning system which pushes the frontier of machin

Vowpal Wabbit 8.1k Jan 6, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Dec 30, 2022
Scripts of Machine Learning Algorithms from Scratch. Implementations of machine learning models and algorithms using nothing but NumPy with a focus on accessibility. Aims to cover everything from basic to advance.

Algo-ScriptML Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The goal of this project is not t

Algo Phantoms 81 Nov 26, 2022
This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Developed By Google!

Machine Learning Hand Detector This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Dev

Popstar Idhant 3 Feb 25, 2022
Code for 1st place solution in Sleep AI Challenge SNU Hospital

Sleep AI Challenge SNU Hospital 2021 Code for 1st place solution for Sleep AI Challenge (Note that the code is not fully organized) Refer to the notio

Saewon Yang 13 Jan 3, 2022
This repo contains the official code of our work SAM-SLR which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.

Skeleton Aware Multi-modal Sign Language Recognition By Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li and Yun Fu. Smile Lab @ Northeastern

Isen (Songyao Jiang) 128 Dec 8, 2022
Code for the Image similarity challenge.

ISC 2021 This repository contains code for the Image Similarity Challenge 2021. Getting started The docs subdirectory has step-by-step instructions on

Facebook Research 173 Dec 12, 2022
Tracking code for the winner of track 1 in the MMP-Tracking Challenge at ICCV 2021 Workshop.

Tracking Code for the winner of track1 in MMP-Trakcing challenge This repository contains our tracking code for the Multi-camera Multiple People Track

DamoCV 29 Nov 13, 2022