KNN With Spark
An implementation of the k-nearest neighbors (KNN) classifier in PySpark. The classifier was evaluated on two UCI datasets: Iris (https://archive.ics.uci.edu/ml/datasets/iris) and Fertility (https://archive.ics.uci.edu/ml/datasets/Fertility). The features were first normalized, also in PySpark, and Euclidean distance was used as the distance metric. The optimal k found for both datasets was 5. Test accuracy was 97% on the Iris dataset and 88% on the Fertility dataset.
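
The three steps described above (normalize, measure Euclidean distance, vote among the k nearest neighbors) can be sketched as follows. This is a minimal illustration, not the repository's code: it is written in plain Python rather than PySpark so that it runs without a Spark session, it assumes min-max scaling because the normalization method is not stated, and the function names and toy data are hypothetical. In a PySpark version, per-row functions like these would typically be applied with `RDD.map`.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Scale every feature column to [0, 1] (assumed normalization scheme)."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so map it to 0.0 to avoid dividing by zero.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters plus one query point.
features = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
            [5.0, 5.0], [5.2, 4.8], [4.9, 5.1],
            [1.1, 1.0]]  # the last row is the query
labels = ["a", "a", "a", "b", "b", "b"]

scaled = min_max_normalize(features)
train_x, query = scaled[:6], scaled[6]
prediction = knn_predict(train_x, labels, query, k=3)
print(prediction)
```

The toy example uses k=3 only because it has six training points; the value k=5 reported above came from the experiments on the two datasets.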