Machine Learning & Data Mining | CSE 450

03 Teach : Activity - Data Manipulation

Overview

In this activity, you will work together to practice some of the data preprocessing techniques you read about.

Instructions

For this activity, you will work with a census dataset from the UCI Machine Learning Repository. The dataset contains a number of attributes, some numeric and some categorical. The task is to predict the last column, which indicates whether the person made more or less than $50K.

Core Requirements

  1. Download the census dataset. Install Pandas and use it to read in the data.

    Set the column names to something descriptive/appropriate.

    Ensure that everything is working by printing out some summaries of the data.

  2. Appropriately handle the missing data.

  3. Convert the categorical attributes into some form of numeric attributes.
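
As a starting point, the three core steps might be sketched as follows. This is a minimal illustration rather than the instructor's solution: the rows in RAW are made up, the column names are hypothetical, and the sketch assumes the file has no header row, separates values with a comma and a space, and marks unknown values with "?". Check each of these assumptions against the file you actually download.

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded file (assumed layout: no header
# row, ", " between values, "?" for unknown values). With the real data, pass
# the file path to read_csv instead of io.StringIO(RAW).
RAW = """\
39, State-gov, Bachelors, 40, United-States, <=50K
50, Self-emp, Bachelors, 13, United-States, <=50K
38, Private, HS-grad, 40, ?, <=50K
53, ?, 11th, 40, United-States, >50K
28, Private, Masters, 45, Cuba, >50K
"""

# Hypothetical descriptive names; the real file has more columns than this.
COLUMNS = ["age", "workclass", "education", "hours_per_week",
           "native_country", "income"]

# Step 1: read the data and name the columns.
df = pd.read_csv(
    io.StringIO(RAW),
    header=None,            # the first line is data, not column names
    names=COLUMNS,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat "?" as missing
)

# Quick summaries to confirm the load worked.
print(df.head())
print(df.describe())
print(df.isna().sum())      # missing values per column

# Step 2: one way to handle missing data is to give unknown categorical values
# their own label. Dropping the rows with df.dropna() is another option.
categorical = ["workclass", "education", "native_country"]
df[categorical] = df[categorical].fillna("Unknown")

# Step 3: encode the target as 0/1 and one-hot encode the categorical features.
target = (df["income"] == ">50K").astype(int)
features = pd.get_dummies(df.drop(columns="income"), columns=categorical)

print(features.head())
```

One-hot encoding is only one choice. For attributes with a natural order, such as education level, mapping each category to an integer rank may suit some classifiers better.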

Stretch Challenges

  1. Convert the dataset into a NumPy array, and use an sklearn classifier to classify it.

  2. Normalize the numeric attributes using z-score normalization.

  3. Use 10-fold cross-validation to estimate the accuracy of your predictions.
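
The stretch challenges could be approached along these lines. This is a hedged sketch on synthetic data, not the census file: the frame, its column names, and the labelling rule are invented for illustration, and k-nearest neighbors is just one of many sklearn classifiers you could pick. The z-score of a value x is (x - mean) / standard deviation, computed per column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an already cleaned and encoded dataset (hypothetical
# columns; replace with your own preprocessed DataFrame).
rng = np.random.default_rng(0)
n = 200
frame = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "hours_per_week": rng.integers(10, 60, size=n),
    "is_private": rng.integers(0, 2, size=n),   # an already encoded 0/1 column
})
# Invented labelling rule so the classifier has a pattern to learn.
frame["income_high"] = ((frame["age"] + frame["hours_per_week"]) > 80).astype(int)

# Stretch 1: convert to NumPy arrays.
X = frame.drop(columns="income_high").to_numpy(dtype=float)
y = frame["income_high"].to_numpy()

# Stretch 2: z-score the numeric columns by hand: (x - mean) / std per column.
numeric_idx = [0, 1]
numeric = X[:, numeric_idx]
z = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)

# The same scaling placed inside a pipeline, so that during cross-validation
# the mean and standard deviation come from the training folds only.
scale_numeric = ColumnTransformer(
    [("zscore", StandardScaler(), numeric_idx)],
    remainder="passthrough",    # leave the 0/1 column untouched
)
model = make_pipeline(scale_numeric, KNeighborsClassifier(n_neighbors=5))

# Stretch 3: 10-fold cross-validation; one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=10)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Scaling inside the pipeline, rather than once on the whole array, keeps information from each held-out fold from leaking into the normalization statistics.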

Instructor Help

Please do not open the instructor code until you have worked on this assignment for the full class period. At that point, if you are still struggling with any of these sections, you are welcome to use this code to guide you through the remaining sections:

Submission

When complete, please report on your progress in the associated I-Learn Quiz.