Machine Learning & Data Mining | CSE 450

03 Prove : Assignment

Using kNN with (slightly) More Interesting Data

Objective

Be able to use a machine learning algorithm with more interesting data than a built-in, all-numeric dataset.

Instructions

Building on the code you have written the past two weeks, your assignment this week is to read in several different datasets of various types and experiment with kNN on these datasets. In particular, you need to demonstrate your ability to do the following:

  1. Read data from text files (e.g., comma- or space-delimited)

  2. Handle non-numeric data

  3. Handle missing data

  4. Use kNN for regression

Library Implementations

While you are welcome to use your own implementation of kNN, for this assignment you may find it easier to use an existing library implementation. The recommended option is the scikit-learn implementation, which can be used as follows:


from sklearn.neighbors import KNeighborsClassifier

# ...
# ... code here to load a training and testing set
# ...

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_data, train_target)
predictions = classifier.predict(test_data)

There is also a regression version of this algorithm available:


from sklearn.neighbors import KNeighborsRegressor

# ...
# ... code here to load a training and testing set
# ...

regr = KNeighborsRegressor(n_neighbors=3)
regr.fit(train_data, train_target)
predictions = regr.predict(test_data)
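To show the regressor end to end, here is a minimal sketch that fits it on synthetic data (a noisy sine curve standing in for a real dataset) and evaluates the predictions with mean squared error. The data itself is made up for illustration; substitute the dataset you load.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for a real regression dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

train_data, test_data, train_target, test_target = train_test_split(
    X, y, test_size=0.25, random_state=0)

regr = KNeighborsRegressor(n_neighbors=3)
regr.fit(train_data, train_target)
predictions = regr.predict(test_data)

# For regression, accuracy doesn't apply; use an error metric instead.
mse = mean_squared_error(test_target, predictions)
```

Note that a regressor predicts continuous values, so it is evaluated with an error metric (like MSE) rather than classification accuracy.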

Pre-processing

To help work with these datasets, you should use pandas, a popular data science library designed for exactly this kind of pre-processing. Then, when the dataset is prepared, you'll need to convert it to a plain NumPy array (rather than a pandas DataFrame) to use with your prediction algorithms.
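As one possible sketch of this workflow, the snippet below reads a small inline CSV (standing in for a real data file; the column names are made up), fills a missing numeric value, one-hot encodes a text column, and converts the result to NumPy arrays:

```python
import io
import numpy as np
import pandas as pd

# Inline sample standing in for a real comma-delimited file;
# in practice you would pass a filename to pd.read_csv instead.
raw = io.StringIO(
    "age,color,target\n"
    "25,red,1\n"
    ",blue,0\n"
    "31,red,1\n"
)
df = pd.read_csv(raw)

# Handle missing numeric data, e.g. by filling with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Handle non-numeric data, e.g. with one-hot encoding.
df = pd.get_dummies(df, columns=["color"])

# Convert to plain NumPy arrays for use with the prediction algorithms.
targets = df["target"].to_numpy()
data = df.drop(columns=["target"]).to_numpy()
```

Filling with the mean and one-hot encoding are just two common choices; dropping rows (`df.dropna()`) or mapping ordered categories to integers may be more appropriate depending on the dataset.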

Hard-coded or General-purpose?

One question that always comes up when doing this kind of work is how general-purpose your code should be versus how tailored it should be to a particular dataset. This is a difficult question to answer, and the right balance is usually somewhere in between.

Two principles will likely apply:

  1. We should make it as general-purpose as we can, but

  2. Almost all pre-processing code I have seen is messy and feels hardcoded to that dataset

In the real world, most companies have very specific datasets that they care about. They aren't interested in making their code work on hundreds of other datasets; they just want their own data in place. With that in mind, our goal should be to decouple the pre-processing logic from our algorithms as much as possible.

To this end, you should create a separate function to load each dataset. It can (and will) do as many messy things as you'd like, but when it is done, it should return two NumPy arrays, one for the data, and one for the targets.
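One way to structure such a loader is sketched below. The file contents, column names, and encodings here are hypothetical (an inline string stands in for a real file so the sketch is self-contained); the point is the shape of the function, which keeps the messy, dataset-specific work inside and returns two NumPy arrays:

```python
import io
import numpy as np
import pandas as pd

def load_car_data():
    """Dataset-specific loader: all the messy pre-processing for this
    one (hypothetical) file lives here.
    Returns (data, targets) as plain NumPy arrays."""
    # In practice: df = pd.read_csv("car.data", ...)
    raw = io.StringIO("doors,safety,label\n2,low,bad\n4,high,good\n")
    df = pd.read_csv(raw)
    # Map ordered text categories to integers.
    df["safety"] = df["safety"].map({"low": 0, "med": 1, "high": 2})
    targets = df["label"].map({"bad": 0, "good": 1}).to_numpy()
    data = df.drop(columns=["label"]).to_numpy(dtype=float)
    return data, targets

data, targets = load_car_data()
```

With one such function per dataset, the experiment code that follows can treat every dataset identically: call the loader, then hand the arrays to the classifier.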

Experiment Guidelines

Please use the following datasets:

For each dataset, read it in, handle missing values, text data, etc., appropriately, and then use a kNN classifier with varying values for k.
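The experiment loop might look like the sketch below. It uses scikit-learn's built-in Iris data purely as a stand-in so the example runs on its own; substitute the arrays returned by your own loader functions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here only as a placeholder dataset.
data, targets = load_iris(return_X_y=True)
train_data, test_data, train_target, test_target = train_test_split(
    data, targets, test_size=0.3, random_state=42)

# Try several values of k and record test accuracy for each.
results = {}
for k in (1, 3, 5, 7, 9):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(train_data, train_target)
    results[k] = classifier.score(test_data, test_target)
    print(f"k={k}: accuracy={results[k]:.2f}")
```

Comparing the accuracies across values of k is the "basic experimentation" the requirements below ask for.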

Requirements

As always, you are encouraged to go above and beyond and take initiative in your learning. As described in the syllabus, meeting the minimum standard requirements can qualify you for 93%, but going above and beyond is necessary to get 100%. The following are the expectations for a minimum standard, with a few suggestions for going above and beyond.

Minimum Standard Requirements

  1. Read data from text files

  2. Appropriately handle non-numeric data.

  3. Appropriately handle missing data.

  4. Use kNN for classification and regression.

  5. Basic experimentation on the provided datasets.

Some opportunities to go above and beyond:

Submission

When complete, you need to upload two things (possibly more than two files) to I-Learn:

  1. Download the assignment submission form, answer its questions, and upload the completed form to I-Learn.

  2. Submit your source code to I-Learn. If you would prefer, you can post your source code to a Git repository and provide a link in your submission form.