English_proficiency_prediction_NLP
The aim of this task is to predict someone's English proficiency based on a text input.
Using the The NICT JLE Corpus available here : https://alaginrc.nict.go.jp/nict_jle/index_E.html
The source of the corpus data is the transcripts of the audio-recorded speech samples of 1,281 participants (1.2 million words, 300 hours in total) of English oral proficiency interview test. Each participant got a SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.
The goal is to build a machine learning algorithm for predicting the SST score of each participant based on their transcript.
Steps:
1 - Pre-process the dataset: extract the participant transcript (all tags). Inside participant transcript, you can remove all other tags and extract only English words.
2 - Process the dataset: extract features with the Bag of Word (BoW) technique
3 - Train a classifier to predict the SST score
4 - Compute the accuracy of your system (the number of participant classified correctly) and plot the confusion matrix.
5 - Try to improve your system (for example you can try to use GloVe instead of BoW).