Machine Learning & Data Mining | CSE 450

03 Prove : Assignment

Using kNN with (slightly) More Interesting Data

Objective

Be able to use a machine learning algorithm with more interesting data than a built-in, all-numeric dataset.

Instructions

Building on the code you have written the past two weeks, your assignment this week is to read in several different datasets of various types and experiment with kNN on these datasets. In particular, you need to demonstrate your ability to do the following:

  1. Read data from text files (e.g., comma- or space-delimited)

  2. Handle non-numeric data

  3. Handle missing data

  4. Use kNN for regression

Library Implementations

While you are welcome to use your own implementation of kNN, for this assignment you may find it easier to use an existing library implementation. The recommended option is the scikit-learn implementation, which can be used as follows:


from sklearn.neighbors import KNeighborsClassifier

# ...
# ... code here to load a training and testing set
# ...

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_data, train_target)
predictions = classifier.predict(test_data)

There is also a regression version of this algorithm available:


from sklearn.neighbors import KNeighborsRegressor

# ...
# ... code here to load a training and testing set
# ...

regr = KNeighborsRegressor(n_neighbors=3)
regr.fit(train_data, train_target)
predictions = regr.predict(test_data)
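To show the regressor end to end, here is a minimal sketch that fits it on synthetic data (a noisy sine curve standing in for a real dataset) and evaluates the predictions with mean squared error. The data itself is made up for illustration; substitute the dataset you load.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for a real regression dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

train_data, test_data, train_target, test_target = train_test_split(
    X, y, test_size=0.25, random_state=0)

regr = KNeighborsRegressor(n_neighbors=3)
regr.fit(train_data, train_target)
predictions = regr.predict(test_data)

# For regression, accuracy doesn't apply; use an error metric instead.
mse = mean_squared_error(test_target, predictions)
```

Note that a regressor predicts continuous values, so it is evaluated with an error metric (like MSE) rather than classification accuracy.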

Pre-processing

To help work with these datasets, you should use pandas, a popular data science library designed for exactly this kind of pre-processing. Then, when the dataset is prepared, you'll need to convert it to a plain NumPy array (rather than a pandas DataFrame) to use with your prediction algorithms.
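As one possible sketch of this workflow, the snippet below reads a small inline CSV (standing in for a real data file; the column names are made up), fills a missing numeric value, one-hot encodes a text column, and converts the result to NumPy arrays:

```python
import io
import numpy as np
import pandas as pd

# Inline sample standing in for a real comma-delimited file;
# in practice you would pass a filename to pd.read_csv instead.
raw = io.StringIO(
    "age,color,target\n"
    "25,red,1\n"
    ",blue,0\n"
    "31,red,1\n"
)
df = pd.read_csv(raw)

# Handle missing numeric data, e.g. by filling with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Handle non-numeric data, e.g. with one-hot encoding.
df = pd.get_dummies(df, columns=["color"])

# Convert to plain NumPy arrays for use with the prediction algorithms.
targets = df["target"].to_numpy()
data = df.drop(columns=["target"]).to_numpy()
```

Filling with the mean and one-hot encoding are just two common choices; dropping rows (`df.dropna()`) or mapping ordered categories to integers may be more appropriate depending on the dataset.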

Hard-coded or General-purpose?

One question that always comes up when doing this kind of work is how general-purpose your code should be versus how tailored it should be to a particular dataset. This is a difficult question to answer, and the right balance is usually somewhere in between.

Two principles will likely apply:

  1. We should make it as general-purpose as we can, but

  2. Almost all pre-processing code I have seen is messy and feels hardcoded to that dataset

In the real world, most companies have very specific datasets that they care about. They aren't interested in making their code work on hundreds of other datasets; they just want their own data in place. With that in mind, our goal should be to decouple the pre-processing logic from our algorithms as much as possible.

To this end, you should create a separate function to load each dataset. It can (and will) do as many messy things as you'd like, but when it is done, it should return two NumPy arrays, one for the data, and one for the targets.
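One way to structure such a loader is sketched below. The file contents, column names, and encodings here are hypothetical (an inline string stands in for a real file so the sketch is self-contained); the point is the shape of the function, which keeps the messy, dataset-specific work inside and returns two NumPy arrays:

```python
import io
import numpy as np
import pandas as pd

def load_car_data():
    """Dataset-specific loader: all the messy pre-processing for this
    one (hypothetical) file lives here.
    Returns (data, targets) as plain NumPy arrays."""
    # In practice: df = pd.read_csv("car.data", ...)
    raw = io.StringIO("doors,safety,label\n2,low,bad\n4,high,good\n")
    df = pd.read_csv(raw)
    # Map ordered text categories to integers.
    df["safety"] = df["safety"].map({"low": 0, "med": 1, "high": 2})
    targets = df["label"].map({"bad": 0, "good": 1}).to_numpy()
    data = df.drop(columns=["label"]).to_numpy(dtype=float)
    return data, targets

data, targets = load_car_data()
```

With one such function per dataset, the experiment code that follows can treat every dataset identically: call the loader, then hand the arrays to the classifier.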

Experiment Guidelines

Please use the following datasets:

For each dataset, read it in, handle missing values, text data, etc., appropriately, and then use a kNN classifier with varying values for k.
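The experiment loop might look like the sketch below. It uses scikit-learn's built-in Iris data purely as a stand-in so the example runs on its own; substitute the arrays returned by your own loader functions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here only as a placeholder dataset.
data, targets = load_iris(return_X_y=True)
train_data, test_data, train_target, test_target = train_test_split(
    data, targets, test_size=0.3, random_state=42)

# Try several values of k and record test accuracy for each.
results = {}
for k in (1, 3, 5, 7, 9):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(train_data, train_target)
    results[k] = classifier.score(test_data, test_target)
    print(f"k={k}: accuracy={results[k]:.2f}")
```

Comparing the accuracies across values of k is the "basic experimentation" the requirements below ask for.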

Requirements

As always, you are encouraged to go above and beyond and take initiative in your learning. As described in the syllabus, meeting the minimum standard requirements can qualify you for 93%, but going above and beyond is necessary to get 100%. The following are the expectations for a minimum standard, with a few suggestions for going above and beyond.

Minimum Standard Requirements

  1. Read data from text files

  2. Appropriately handle non-numeric data.

  3. Appropriately handle missing data.

  4. Use kNN for classification and regression.

  5. Basic experimentation on the provided datasets.

Some opportunities to go above and beyond:

Submission

When complete, you need to upload two things (possibly more than two files) to I-Learn:

  1. Download the assignment submission form, answer its questions, and upload the completed form to I-Learn.

  2. Submit your source code to I-Learn. If you would prefer, you can post your source code to a Git repository and provide a link in your submission form.