A method to solve the Higgs boson challenge using Least Squares - Novae
This is Project 1 of the EPFL CS-433 Machine Learning course. The task is the same as the Higgs Boson Machine Learning Challenge posted on Kaggle. The dataset and a detailed description can also be found in the GitHub repository of the course.
Team name: Novae
Team members: Giacomo Orsi, Vittorio Rossi, Chun-Tso Tsai
About the Project
The task of this project is to train a model on the provided train.csv that gives the best predictions on the data in test.csv, or on any other general case.
We built our model using regularized linear regression, after applying some data cleaning and feature engineering techniques. A report describing our approach and our results can be found in the file report.pdf. In the end, we obtained an accuracy of 0.836 and an F1 score of 0.751 on the test.csv dataset.
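For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how accuracy and F1 score can be computed with NumPy. The function name `accuracy_and_f1` and the choice of `1` as the positive (signal) label are assumptions made for illustration; they are not taken from the project code.

```python
import numpy as np


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for two label vectors.

    Hypothetical helper: `positive` is the label counted as the
    positive class (assumed here to be 1).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Fraction of predictions that match the true labels.
    accuracy = np.mean(y_true == y_pred)

    # Counts needed for F1 = 2*TP / (2*TP + FP + FN).
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))

    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return float(accuracy), float(f1)
```

F1 ignores true negatives, which is why it can be noticeably lower than accuracy when the positive class is the minority.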
Instructions
- The project runs under Python 3.8 and requires NumPy=1.19.
- Please make sure to place train.csv and test.csv inside the data folder. Those files can be downloaded here.
- Go to the script/ folder and execute run.py. A model will be trained with the given hyper-parameters, and predictions for the test dataset will be written to the file out.csv.
Modules
implementations.py
Contains the implementations of different learning algorithms, including:
- Least squares linear regression
  - least_squares: Direct computation from linear equations.
  - least_squares_GD: Gradient descent.
  - least_squares_SGD: Stochastic gradient descent.
  - ridge_regression: Regularized linear regression from direct computation.
- Logistic regression
  - logistic_regression: Gradient descent.
  - reg_logistic_regression: Gradient descent with regularization.
There are also some helper functions in this file to facilitate the above functions.
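As an illustration of two of the algorithms listed above, here is a minimal sketch of closed-form ridge regression and of regularized logistic regression trained by gradient descent. The function names, argument order, the `2 * N * lambda` scaling of the penalty, and the `{0, 1}` label encoding for the logistic model are assumptions for this sketch and may differ from what implementations.py actually does.

```python
import numpy as np


def ridge_sketch(y, tx, lambda_):
    """Closed-form ridge regression (hypothetical signature).

    Minimizes 1/(2N) * ||y - tx @ w||^2 + lambda_ * ||w||^2, whose
    normal equations are (tx^T tx + 2 N lambda_ I) w = tx^T y.
    """
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    b = tx.T @ y
    # Solve the linear system instead of inverting the matrix.
    return np.linalg.solve(a, b)


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))


def reg_logistic_gd_sketch(y, tx, lambda_, initial_w, max_iters, gamma):
    """Gradient descent on mean log-loss + lambda_ * ||w||^2.

    Assumes labels y are encoded as 0 or 1 (an assumption of this sketch).
    """
    w = np.array(initial_w, dtype=float)
    n = len(y)
    for _ in range(max_iters):
        # Gradient of the mean negative log-likelihood plus the L2 penalty.
        grad = tx.T @ (sigmoid(tx @ w) - y) / n + 2 * lambda_ * w
        w = w - gamma * grad
    return w
```

Setting `lambda_` to 0 in `ridge_sketch` reduces it to ordinary least squares, provided `tx.T @ tx` is invertible.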
data_processing.py
Calls the following files to process the data.
- data_cleaning.py: Contains functions used to
  - Categorize data into subgroups.
  - Replace missing values with the median.
  - Standardize the features.
- feature_engineering.py: Contains functions used to generate our interpretable features.
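The cleaning steps listed above can be sketched as follows. This is an illustrative version only: the function names are hypothetical, the sentinel value `-999.0` for missing entries is an assumption based on the Kaggle Higgs dataset convention, and grouping rows by the values of a single categorical column is an assumed way of forming subgroups.

```python
import numpy as np

MISSING = -999.0  # assumed sentinel for missing values


def split_by_category(x, column):
    """Map each distinct value of a categorical column to its row indices."""
    values = np.unique(x[:, column])
    return {v: np.where(x[:, column] == v)[0] for v in values}


def replace_missing_with_median(x, missing=MISSING):
    """Replace sentinel entries in each column with that column's median."""
    x = np.array(x, dtype=float)  # np.array copies, so the input is untouched
    for j in range(x.shape[1]):
        col = x[:, j]  # view into x, so assignments below modify x
        mask = col == missing
        # Skip columns with no missing values or no valid values.
        if mask.any() and not mask.all():
            col[mask] = np.median(col[~mask])
    return x


def standardize(x, mean=None, std=None):
    """Scale columns to zero mean and unit variance.

    Pass the training mean and std to apply the same transform to test data.
    """
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero
    return (x - mean) / std, mean, std
```

Reusing the training `mean` and `std` on the test set keeps both sets on the same scale.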
run.py
Generates the submission .csv file based on the data of test.csv stored in the folder data/. Our optimized model is also defined in this file.
Some helper Functions
- models.py: Creates the models used to predict labels for new data points without true labels.
- expansions.py: Contains a function to apply polynomial expansion to our features to add extra degrees of freedom for our models.
- proj1_helpers.py: Contains functions that load the .csv files as training or testing data, and create the .csv file for submission.
- cross_validation.py: Contains a function to build the indices for k-fold cross-validation.
- disk_helper.py: Saves and loads NumPy arrays to and from disk for later use. Useful for saving hyper-parameters during a long training process.
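To make the polynomial expansion and k-fold index helpers more concrete, here is a minimal sketch of both. The names `build_poly_sketch` and `build_k_indices_sketch`, the absence of cross terms in the expansion, and the use of a seeded random permutation are assumptions for illustration, not the actual contents of expansions.py or cross_validation.py.

```python
import numpy as np


def build_poly_sketch(x, degree):
    """Expand features to [1, x, x^2, ..., x^degree] (no cross terms).

    x has shape (n_samples, n_features); the result has
    1 + n_features * degree columns.
    """
    n = x.shape[0]
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([np.ones((n, 1))] + powers)


def build_k_indices_sketch(n_samples, k_fold, seed=1):
    """Return an array of shape (k_fold, n_samples // k_fold) of row indices.

    Rows are disjoint folds drawn from a random permutation; leftover
    samples (when n_samples is not divisible by k_fold) are dropped.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    fold_size = n_samples // k_fold
    return np.array(
        [perm[i * fold_size:(i + 1) * fold_size] for i in range(k_fold)]
    )
```

Each row of the index array can serve as the validation fold in turn, with the remaining rows forming the training set.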
Notebook
The Jupyter notebook project_notebook.ipynb, located in the scripts folder, can be used to search for the best hyper-parameters for the model. In the notebook it is possible to cross-validate a logistic regression model and a least squares regression model over given lambdas and degrees.
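The kind of search described above can be sketched as a grid search over lambdas and polynomial degrees with k-fold cross-validation. The sketch below uses ridge regression only, assumes labels encoded as -1 and 1, and selects by validation accuracy; all function names and these choices are assumptions for illustration and do not reproduce the notebook.

```python
import numpy as np


def poly(x, degree):
    """Feature expansion [1, x, ..., x^degree] without cross terms."""
    return np.hstack([np.ones((x.shape[0], 1))]
                     + [x ** d for d in range(1, degree + 1)])


def ridge(y, tx, lam):
    """Closed-form ridge: (tx^T tx + 2 N lam I) w = tx^T y."""
    n, d = tx.shape
    return np.linalg.solve(tx.T @ tx + 2 * n * lam * np.eye(d), tx.T @ y)


def cv_accuracy(y, x, k_indices, lam, degree):
    """Mean validation accuracy over all folds for one (lam, degree) pair."""
    accs = []
    for k in range(len(k_indices)):
        test_idx = k_indices[k]
        train_idx = np.delete(k_indices, k, axis=0).ravel()
        w = ridge(y[train_idx], poly(x[train_idx], degree), lam)
        # Labels are assumed to be -1 / 1, so threshold the output at 0.
        pred = np.where(poly(x[test_idx], degree) @ w >= 0, 1, -1)
        accs.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accs))


def grid_search(y, x, lambdas, degrees, k_fold=4, seed=1):
    """Return (best_lambda, best_degree, best_accuracy) over the grid."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    size = len(y) // k_fold
    k_indices = np.array([perm[i * size:(i + 1) * size]
                          for i in range(k_fold)])
    best = (None, None, -1.0)
    for degree in degrees:
        for lam in lambdas:
            acc = cv_accuracy(y, x, k_indices, lam, degree)
            if acc > best[2]:
                best = (lam, degree, acc)
    return best
```

A call such as `grid_search(y, x, np.logspace(-6, 0, 7), range(1, 6))` would evaluate 35 combinations; the same loop structure applies to a logistic model by swapping the training and prediction steps.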