OO Programming and Data Structures | CS 241

12 Prepare: Checkpoint A

Overview

After completing (or while completing) the preparation material for this week, complete the following exercise.

This checkpoint is intended to help you practice the mechanics of working with the Pandas library.

Instructions

For this checkpoint, we will use the same census data file that we used before. This originally comes from the UCI data repository, but you can download it directly here.

This dataset contains about 32,000 records from the 1994 census. Here is an example of the first few lines:


39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K

Programming Assignment

Your assignment this week is to use Pandas to load this dataset and then to filter it to find the median age.

Helpful hints:

Installing Libraries

To use the Pandas library, you'll need to install it. This can be done via the command line using either pip3 or pip:


pip3 install pandas

If pip3 is not recognized as a program, you could try pip:


pip install pandas

If it is still not recognized as a program, you likely need to add the directory where these utilities reside to your system's PATH environment variable.

If the install starts to run but then fails with an error about administrative permissions, you'll need to run it with elevated permissions. On Windows, run the Command Prompt program as Administrator: search for "Command" in the Windows search box, and when the program appears, right-click it and select Run as Administrator. On a Mac, use "sudo" like so:


sudo pip3 install pandas

Once you have it installed, you'll want to use the same process to install matplotlib:


pip3 install matplotlib

Loading a dataset:

To load a CSV file in Pandas, use the pandas.read_csv() function:


import pandas

census_data = pandas.read_csv("census.csv")
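One thing worth checking with a file like this one: by default, read_csv() treats the first line as a header row. The following is a minimal sketch of the difference, assuming an in-memory copy of three of the sample rows shown earlier (wrapped in io.StringIO) as a stand-in for census.csv. The variable names are illustrative only.

```python
import io
import pandas

# Three sample rows copied from the listing above; a stand-in for census.csv.
SAMPLE_ROWS = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
)

# Default behavior: the first line is consumed as column names,
# so one record is silently lost.
guessed_header = pandas.read_csv(io.StringIO(SAMPLE_ROWS))

# header=None: every line is treated as data, and the columns are
# labeled with the integers 0, 1, 2, ...
no_header = pandas.read_csv(io.StringIO(SAMPLE_ROWS), header=None)

print(len(guessed_header))    # number of records with the default setting
print(len(no_header))         # number of records with header=None
print(no_header[0].tolist())  # the age column, addressed by position
```

With the real file, the same header=None argument can be passed alongside the file name.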

Finding a median:

Finding means and medians in Pandas is straightforward. If you have a numeric column called income and want its median, you can compute it like this:


mean_income = my_data.income.mean()
median_income = my_data.income.median()
max_income = my_data.income.max()  # avoid naming a variable "max"; that hides Python's built-in max()

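In the case of this census dataset, the file does not have a header row, so the columns don't have names. By default, read_csv() treats the first line of the file as the header, so pass header=None when loading; otherwise the first record is consumed as column names. You can then either supply the names yourself, or refer to a column by its integer position (in this case age is column 0), like so:


my_data = pandas.read_csv("census.csv", header=None)
max_age = my_data[0].max()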
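The other option mentioned above, supplying names, might look like the following sketch. It again uses an in-memory copy of the five sample rows in place of census.csv. The column names are an assumption: they follow the attribute list commonly published with the UCI Adult dataset, adapted to valid Python identifiers, and are not specified by this assignment. The skipinitialspace=True argument strips the space that follows each comma, so that text values compare cleanly.

```python
import io
import pandas

# Five sample rows copied from the listing above; a stand-in for census.csv.
SAMPLE_ROWS = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
"""

# Assumed names (not given in the assignment), one per column in the file.
COLUMN_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

census_data = pandas.read_csv(
    io.StringIO(SAMPLE_ROWS),
    header=None,            # the file has no header row
    names=COLUMN_NAMES,     # so supply the names ourselves
    skipinitialspace=True,  # drop the space after each comma
)

# Median age over all rows.
median_age = census_data.age.median()

# Filter first, then take the median of the remaining rows.
private_only = census_data[census_data.workclass == "Private"]
median_age_private = private_only.age.median()

print(median_age)
print(median_age_private)
```

Filtering with a boolean condition inside the square brackets returns a new DataFrame, so the same median() call works on the filtered result.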

Submission

When you have completed this work, report your progress on the associated I-Learn quiz.