OO Programming and Data Structures | CS 241

09 Prove : Homework - Data Structures

Hashing and Hash Tables

Outcomes

At the end of this study, successful students will be able to:

  1. Articulate the strengths and weaknesses of Hashing and Hash Tables.

  2. Use Hash Tables (sets and dictionaries) in Python to solve problems.

Overview

Hashing and Hash Tables turn out to be some of the most useful tools from data structures.

Preparation Material

Please read the following:

Using Hash Tables in Python

In Python, the built-in Sets and Dictionaries are each implemented using Hash Tables. (They could also be implemented using Balanced Binary Search Trees with O(log n) insertion, removal, and lookup time.)

Sets

A Set is a useful data structure when you are interested in knowing whether something is or is not a member of a collection of items, but are not concerned with maintaining a particular order.

For example, we may have a set of participants who are registered for a event, and we want to know if someone has registered or not, but the order they registered is not important.

For more details about how to create and use sets, please read the brief section on Sets from the official Python documentation.

Dictionaries

Dictionaries are also commonly referred to as Tables or Maps. They provide a mapping from a key to a value. For example, consider a dictionary that contains the phone numbers of our friends:


# The dictionary can be created like so:
phone_numbers = {}

# New numbers can be added like this:
phone_numbers["Jenny"] = "(555) 867-5309"

# Phone numbers can be looked up, like this:
print(phone_numbers["Jenny"])

Notice that in a way, this is just like using an array or a list, except that the index is now a meaningful string (or another type of object, if we would like). Also, notice, that if a second number were assigned to the same key, it would not store both, but rather would overwrite the value:


# First add one number
phone_numbers["Jenny"] = "(555) 867-5309"

# Now set a different number for Jenny
phone_numbers["Jenny"] = "(555) 123-4567"

# In this case, the second number has replaced the first one
# Thus, printing it out, will show "(555) 123-4567"

print(phone_numbers["Jenny"]) # This shows only the second value, "(555) 123-4567"

As another example, we could have a dictionary that contains the state capital for each state. This could be declared as used as follows:


capitals = {}
capitals["Idaho"] = "Boise"
capitals["Utah"] = "Salt Lake City"
capitals["California"] = "Sacramento"
# ... other states here ...

state = input("Enter a state name: ")
capital = capitals[state]

print("The capital of {} is {}".format(state, capital))

Obviously, we would want to do additional things here, such as check to see that the state existed, before printing it out. And, we may even want to loop through all of the states and print out their capitals. These things and more can all be done with dictionaries. For more details on using dictionaries, please read the brief section on Dictionaries from the official Python documentation.

Homework Assignment

Write a program to read through a census file and keep track of the number of people in different education levels.

For this assignment, you will work with the adult dataset from the UCI data repository. You can download the dataset directly here. It is also available in the CS department Linux system at: /home/cs241/datasets/census.csv

This dataset contains about 32,000 records from the 1994 census. Here is an example of the first few lines:


39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K

The 4th column (at index 3, if you are working with a 0-based index), contains the education level of the user. The possible values for this are:


Preschool
1st-4th
5th-6th
7th-8th
9th
10th
11th
12th
HS-grad
Some-college
Assoc-voc
Assoc-acdm
Bachelors
Prof-school
Masters
Doctorate

Your program should count up the number of records in each of the above categories and then display the totals of each (they do not have to be ordered in a certain manner in your display).

You should not use a separate variable for each one, with an if statement like this:


# DON'T DO IT LIKE THIS

if level == "Preschool":
    preschool_count += 1
elif level == "1st-4th":
    first_count += 1
...

Instead, you should use a dictionary to store the values:

Instructions

Follow these steps to guide you through the process:

  1. First, download the dataset from the link above. Make sure to save it to the same folder that has your Python program. If you are using PyCharm, you can drag and drop it into the table of contents pane in PyCharm.

  2. Create a dictionary to hold the education level counts.

  3. Write code that opens a file, reads through it line by line, and then splits the line into separate parts, based on the comma (",").

  4. Get the education level of each record from the 4th column (index 3).

  5. If the education level is not present in your dictionary, add it with a count of 1. Otherwise, if the education level is already in your dictionary set it to be the old value + 1.

  6. After looping through the whole file and counting up each value, loop through each key in the dictionary and print out the key followed by the value associated with that key.

Expected Output


 5355 -- Bachelors
10501 -- HS-grad
 1175 -- 11th
 1723 -- Masters
  514 -- 9th
 7291 -- Some-college
 1067 -- Assoc-acdm
 1382 -- Assoc-voc
  646 -- 7th-8th
  413 -- Doctorate
  576 -- Prof-school
  333 -- 5th-6th
  933 -- 10th
  168 -- 1st-4th
   51 -- Preschool
  433 -- 12th

Your formatting / ordering does not have to match this exactly.

Hints

You can find out if a given key is in a dictionary with code like: (Assuming that friend is the key you want to look for, and phone_numbers is the dictionary.)


if friend in phone_numbers:
    ...

# or to see if it's not there
if friend not in phone_numbers:
    ...

You can get a collection of all the keys of a dictionary with the .keys() function, like: phone_numbers.keys() and then you can loop through them with code like:


for friend in phone_numbers.keys():
    ...

Testing your program

An automated testBed script not available for your program. Instead please manually test your program and ensure that it works correctly.

Submission

You are encouraged to work with others on the programming and the quiz.

When you have a good understanding of this data structure and have completed the programming project, take the accompanying I-Learn Quiz. It has questions about how the data structure works, and also, has places for your to report on the parts of the programming assignment that you completed.