Machine Learning & Data Mining | CSE 450

02 Prove: Assignment - kNN Classifier

Overview

Add to your experiment shell from the previous assignment. Implement a new algorithm, k-Nearest Neighbors, that can be configured for any size of neighborhood, k. To classify an instance, the algorithm should identify its k nearest neighbors and use their labels to determine the target classification.

For this assignment, you only need to run the algorithm on the Iris dataset; next week, we will work with more interesting/complex data. Also next week, you will need to account for the fact that different attributes have different ranges/scales. This week, you can implement the algorithm without scaling/normalizing the data.

Implementation

The k-NN classifier will follow the same basic structure as the HardCodedClassifier you implemented last week. A minimal sketch of one possible implementation appears after the numbered steps below.

  1. The fit method:

    def fit(self, data, targets):
    

    Should store the training data and associated targets as member variables.

  2. You'll also need a calc_distance method:

    def calc_distance(self, x1, x2):
    

That takes two samples and returns the Euclidean distance between them.

    Since the Iris dataset has four features (sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW)), the Euclidean distance between samples 1 and 2 would be equal to:

    $$ \sqrt{(SL_{2} - SL_{1})^2 + (SW_{2} - SW_{1})^2 + (PL_{2} - PL_{1})^2 + (PW_{2} - PW_{1})^2} $$

  3. Finally, your predict method:

    def predict(self, data, k):
    

    Should take the list of $n$ test samples and a value for $k$, and return a list of $n$ predictions generated by the k-Nearest Neighbors algorithm.

    (i.e., if your test data has three rows, then your list of predictions should contain three predictions.)

    For each sample in the test data, the predict method needs to use the calc_distance method to calculate the distance between that sample and every sample in the training data.

    Then, it should sort those distances and determine the class that occurs most often among the k closest cases. (You may find Python's collections.Counter class useful for this step, particularly its most_common method.)

  4. As with the HardCodedClassifier, your code should use the k-NN Classifier to generate a list of predictions for the test data, then compare those predictions with the actual classes and calculate the accuracy.
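
Putting these pieces together, one possible shape for the whole class is sketched below. This is a minimal sketch under a few assumptions of our own (the class name KNNClassifier and the use of NumPy for the distance math and sorting are our choices, not requirements), and ties among the k nearest labels are broken arbitrarily by Counter:

    import numpy as np
    from collections import Counter

    class KNNClassifier:
        def fit(self, data, targets):
            # Store the training data and targets as member variables.
            self.data = np.asarray(data)
            self.targets = np.asarray(targets)

        def calc_distance(self, x1, x2):
            # Euclidean distance between two samples.
            return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

        def predict(self, data, k):
            predictions = []
            for sample in data:
                # Distance from this sample to every training sample.
                distances = [self.calc_distance(sample, train_sample)
                             for train_sample in self.data]
                # Indices of the k smallest distances.
                nearest = np.argsort(distances)[:k]
                # Most common class among the k nearest neighbors.
                most_common = Counter(self.targets[nearest]).most_common(1)
                predictions.append(most_common[0][0])
            return predictions

As in step 4, the accuracy is then just the fraction of predictions that match the actual classes, e.g. np.mean(np.asarray(predictions) == test_target).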

Experimentation

Because we are not using a very complex dataset this week, you don't need to do as much experimenting as you might otherwise, but you should:

  1. Experiment with different values of $k$.

  2. Compare your results to an existing implementation of k-Nearest Neighbors. There are several options for this; one is to use scikit-learn's implementation. You should look up its documentation, but in short, it can be used as easily as the block below (a sketch comparing it with your own classifier for several values of $k$ follows this list):

    
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Load the Iris dataset and split it into training and testing
    # sets (one simple way to do this).
    iris = load_iris()
    train_data, test_data, train_target, test_target = train_test_split(
        iris.data, iris.target, test_size=0.3)

    classifier = KNeighborsClassifier(n_neighbors=3)
    classifier.fit(train_data, train_target)
    predictions = classifier.predict(test_data)
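
One way to cover both experimentation points at once is to loop over several values of k and print the accuracy of your own classifier and scikit-learn's side by side. The sketch below assumes the KNNClassifier class from the implementation sketch above, plus the train/test variables defined in the previous block:

    from sklearn.metrics import accuracy_score

    my_classifier = KNNClassifier()
    my_classifier.fit(train_data, train_target)

    for k in [1, 3, 5, 7, 9]:
        # Predictions from your implementation.
        my_predictions = my_classifier.predict(test_data, k)

        # Predictions from scikit-learn's implementation.
        sk_classifier = KNeighborsClassifier(n_neighbors=k)
        sk_classifier.fit(train_data, train_target)
        sk_predictions = sk_classifier.predict(test_data)

        print("k={}: mine={:.3f}, sklearn={:.3f}".format(
            k,
            accuracy_score(test_target, my_predictions),
            accuracy_score(test_target, sk_predictions)))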
    

Requirements

As always, you are encouraged to go above and beyond and take initiative in your learning. As described in the syllabus, meeting the minimum standard requirements can qualify you for 93%, but going above and beyond is necessary to get 100%. The following are the expectations for a minimum standard, with a few suggestions for going above and beyond.

Minimum Standard Requirements

  1. Implement basic kNN algorithm

  2. Be able to load and use the Iris dataset

  3. Basic experimentation:

    • Play with different values of $k$

    • Compare to an existing implementation

Some opportunities to go above and beyond:

Submission

When complete, you need to upload two things (possibly more than two files) to I-Learn:

  1. Download the assignment submission form, answer its questions, and upload the completed form to I-Learn.

  2. You will then need to submit your source code.

    If you used a Jupyter Notebook, you should not submit it directly. Instead, upload it to a GitHub repository and submit a link to that file.

    If you used a Jupyter Notebook in Google Colaboratory, you can save a copy directly from there to GitHub (Click File -> Save a copy in GitHub...).

    Alternatively, if you used a Jupyter Notebook in Anaconda, do not upload the .ipynb file directly; instead, please export the file to HTML (Click File -> Download as... -> HTML). Then, upload the HTML file to I-Learn.

    Finally, if you used a regular .py source file, you can submit that (or a link to it on GitHub) directly to I-Learn. If you used Google Colaboratory, you can save the notebook as a .py file by clicking File -> Download as .py.

References

1. Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals Of Eugenics, 7(2), 179-188. doi:10.1111/j.1469-1809.1936.tb02137.x.