02 Prove: Assignment - kNN Classifier
Overview
Add to your experiment shell from the previous assignment. Implement a new algorithm, k-Nearest Neighbors, that can be configured for any size of neighborhood, $k$. To classify an instance, the algorithm should identify its $k$ nearest neighbors and use their labels to determine the target classification.
For this assignment, you only need to run the algorithm on the Iris dataset; next week, we will work with more interesting/complex data. Also next week, you will need to account for the fact that the different attributes have different ranges/scales. This week, you can implement the algorithm without scaling/normalizing the data.
Objectives
Implementation
The k-NN classifier will follow the same basic structure as the HardCodedClassifier you implemented last week.
- The `fit` method, `def fit(self, data, targets):`, should store the training data and associated targets as member variables.
- You'll also need a `calc_distance` method, `def calc_distance(self, x1, x2):`, that takes two samples and returns the Euclidean distance between them. Since the Iris dataset has four features (petal width (PW) and height (PH), and sepal width (SW) and height (SH)), the Euclidean distance between samples 1 and 2 would be equal to:

  $$ \sqrt{(PW_{2} - PW_{1})^2 + (PH_{2} - PH_{1})^2 + (SW_{2} - SW_{1})^2 + (SH_{2} - SH_{1})^2} $$
- Finally, your `predict` method, `def predict(self, data, k):`, should take the list of $n$ test samples and a value for $k$, and return a list of $n$ predictions generated by the k-Nearest Neighbors algorithm (i.e., if your test data had three rows, then your list of predictions should contain three predictions). For each sample in the test data, the `predict` method needs to use the `calc_distance` method to calculate the distance between that sample and every sample in the training data. Then, it should sort those distances and determine the class that occurs most often among the $k$ closest cases. (You may find Python's `Counter` class from the `collections` module useful for this step, particularly its `most_common` method.)
- As with the `HardCodedClassifier`, your code should use the k-NN classifier to generate a list of predictions for the test data, then compare those predictions with the actual classes and calculate the accuracy. (A sketch of one way these pieces can fit together follows this list.)
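To make the expected structure concrete, here is a minimal sketch of how these pieces could fit together. The class name `KNNClassifier` and the use of numpy are assumptions for illustration; any equivalent approach is fine:

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """Sketch of a k-NN classifier with the same interface as HardCodedClassifier."""

    def fit(self, data, targets):
        # Store the training data and associated targets as member variables.
        self.data = np.asarray(data)
        self.targets = np.asarray(targets)

    def calc_distance(self, x1, x2):
        # Euclidean distance between two samples.
        return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

    def predict(self, data, k):
        predictions = []
        for sample in data:
            # Distance from this test sample to every training sample.
            distances = [self.calc_distance(sample, row) for row in self.data]
            # Indices of the k smallest distances.
            nearest = np.argsort(distances)[:k]
            # Majority vote among the k nearest neighbors.
            labels = [self.targets[i] for i in nearest]
            predictions.append(Counter(labels).most_common(1)[0][0])
        return predictions
```

With this in place, accuracy can be computed as in the previous assignment, e.g. `sum(p == a for p, a in zip(predictions, test_target)) / len(test_target)`.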
Experimentation
Because we are not using a very complex dataset this week, you don't need to do as much experimenting as you might otherwise, but you should:
Experiment with different values of $k$.
Compare your results to an existing implementation of k-Nearest Neighbors. There are several different options for this. One option is to use scikit-learn's implementation. You should look up more documentation, but in short, it can be used as easily as:
```python
from sklearn.neighbors import KNeighborsClassifier

# ... code here to load a training and testing set ...

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_data, train_target)
predictions = classifier.predict(test_data)
```
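One way to organize both experiments is sketched below, under the assumptions that you load and split Iris with scikit-learn and that your own classifier matches the `KNNClassifier` sketch above: loop over several values of $k$ and print your accuracy next to scikit-learn's.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset and hold out 30% of it for testing.
iris = load_iris()
train_data, test_data, train_target, test_target = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

for k in (1, 3, 5, 7, 9):
    # Your implementation (see the sketch in the Implementation section).
    mine = KNNClassifier()
    mine.fit(train_data, train_target)
    my_preds = mine.predict(test_data, k)
    my_acc = sum(p == a for p, a in zip(my_preds, test_target)) / len(test_target)

    # scikit-learn's implementation, for comparison.
    sk = KNeighborsClassifier(n_neighbors=k)
    sk.fit(train_data, train_target)
    sk_acc = sk.score(test_data, test_target)

    print(f"k={k}: yours={my_acc:.3f}  sklearn={sk_acc:.3f}")
```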
Requirements
As always, you are encouraged to go above and beyond and take initiative in your learning. As described in the syllabus, meeting the minimum standard requirements can qualify you for 93%, but going above and beyond is necessary to get 100%. The following are the expectations for a minimum standard, with a few suggestions for going above and beyond.
Minimum Standard Requirements
Implement basic kNN algorithm
Be able to load and use the Iris dataset
Basic experimentation:
Play with different values of $k$
Compare to an existing implementation
Some opportunities to go above and beyond:
KD-Tree (for faster neighbor lookups; see the sketch after this list)
Experiment with several more datasets
Handle non-numeric data
Any other ideas you have
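If you pursue the KD-tree idea, a reasonable starting point is to study an existing tree before building your own. As a sketch (assuming the `train_data`/`test_data` arrays from the earlier example), scikit-learn's `KDTree` can answer nearest-neighbor queries without scanning every training sample:

```python
from sklearn.neighbors import KDTree

# Build the tree once over the training samples.
tree = KDTree(train_data)

# For each test sample, find the indices of its 3 nearest training samples.
dist, ind = tree.query(test_data, k=3)

# ind[i] holds the row indices in train_data closest to test_data[i];
# a majority vote over train_target[ind[i]] yields the prediction.
```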
Submission
When complete, you need to upload two things (possibly more than two files) to I-Learn:
- Download the assignment submission form, answer its questions, and upload the completed form to I-Learn.
- You will then need to submit your source code.
  If you used a Jupyter Notebook, you should not submit that directly. Instead, upload it to a GitHub repository and submit a link to that file.
  - If you used a Jupyter Notebook in Google Colaboratory, you can save a copy directly from there to GitHub (click `File -> Save a copy in GitHub...`).
  - Alternatively, if you used a Jupyter Notebook in Anaconda, do not upload the .ipynb file directly; instead, please export the file to HTML (click `File -> Download as... -> HTML`). Then, upload the HTML file to I-Learn.
  - Finally, if you used a regular `.py` source file, you can submit that (or a link to it from GitHub) directly to I-Learn. If you used Google Colaboratory, you can save the notebook as a `.py` file by clicking `File -> Download as .py`.