Machine Learning & Data Mining | CSE 450

01 Prepare : Programming Tools

Python

This semester we will be using Python 3 for our coding exercises. Python has become very popular in both industry and academia and is well-suited for data analysis.

This is not a "coding" course per se, and at this point in your education it is assumed that you can come up to speed on a new programming language on your own. Still, here are a few suggestions that may help:

Learn the Basics of Python

This introduction assumes you have programmed in another language previously, and already understand fundamental programming concepts such as variables, functions, parameter passing, return values, boolean logic, arrays, and loops.

If you don't understand these things, the following resources may not help you very much. Note that many of these resources say the same things in different ways. Since everyone learns differently, try out each of them and see which one resonates with you.

The following are written tutorials designed for experienced programmers:

If you prefer an interactive tutorial that is more hand-holdy, you might prefer one of these:

For a more in-depth interactive tutorial, How to Think Like a Computer Scientist is also very good.

Learn to be "Pythonic"

You can also learn a lot about how to do things the right/easier way in Python by learning about "Python Idioms":
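As a small taste of what such idioms look like, here is a minimal sketch of four common ones. The names and scores are made-up sample data, and this is only a selection, not a complete list of idioms.

```python
# Hypothetical sample data, used only for illustration.
names = ["Ada", "Grace", "Alan"]
scores = [91, 88, 79]

# Idiom 1: loop over items directly; use enumerate() when you also need a counter.
for position, name in enumerate(names, start=1):
    print(position, name)

# Idiom 2: zip() walks through parallel lists together.
score_by_name = dict(zip(names, scores))

# Idiom 3: a list comprehension builds a filtered list in one expression.
passing = [name for name, score in zip(names, scores) if score >= 80]

# Idiom 4: tuple unpacking swaps two values without a temporary variable.
first, second = names[0], names[1]
first, second = second, first
```

Each of these replaces a longer index-based loop of the kind you might write in C++ or Java.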

If you already know Python 2, you should know that all of the code examples in this class will be done using Python 3. You can see a list of differences here.
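A few of the differences you are most likely to run into can be sketched as follows. This is a brief illustration and not the full list; the variable names are arbitrary.

```python
# In Python 3, print is a function, so the parentheses are required.
print("Hello", "World", sep=", ")

# Dividing two integers with / gives a float in Python 3.
# (Python 2 would have truncated 7 / 2 to 3.)
true_quotient = 7 / 2

# Use // when you want floor division.
floor_quotient = 7 // 2

# range() produces values lazily instead of building a list up front.
first_squares = [n * n for n in range(5)]

# Strings are Unicode text; encode() converts them to bytes explicitly.
greeting = "h\u00e9llo"
encoded = greeting.encode("utf-8")
```

Here `greeting` holds five characters, while `encoded` holds six bytes because the accented letter takes two bytes in UTF-8.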

Libraries common to data science

There are a handful of Python libraries (SciPy, NumPy, pandas, and matplotlib) that we'll be using repeatedly in this course. While you don't need to go through in-depth tutorials on all of them immediately, it would be good to become familiar with their capabilities. You'll find many good resources and reference guides online for these libraries.
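To give a feel for the style of these libraries, here is a minimal sketch using NumPy only, assuming it is installed (it is included in Anaconda and in Google Colaboratory). The sample values are made up for illustration.

```python
import numpy as np

# A NumPy array holds numeric data and supports whole-array arithmetic.
values = np.array([2.0, 4.0, 6.0, 8.0])

# Operations apply to every element at once; no explicit loop is needed.
doubled = values * 2

# Common statistics are available as methods.
mean_value = values.mean()

# A boolean mask selects only the elements that meet a condition.
above_mean = values[values > mean_value]

print(doubled, mean_value, above_mean)
```

The other libraries follow a similar pattern: pandas builds labeled tables on top of arrays like these, and matplotlib plots them.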

Setting up a Python environment for Data Science

There are many different options for downloading and running Python and for authoring and debugging Python programs. It is your choice what tool set and configuration to use; however, the following recommendations have worked well for students in the past:

Google Colaboratory

By far the easiest way to get up and running is to use Google's cloud-based Colaboratory environment.

This option allows you to create, upload, edit, run, and share Jupyter notebooks in your browser with no local setup required. (Most modern browsers should work, but Chrome is recommended.)

This video demonstrates how to work with Google Colaboratory, including options for submitting assignments.

Local Development

If you prefer to have a local execution environment, you will need to install the following:

You can use any Python editor. The Spyder IDE is designed for data science work (and has a style similar to RStudio, if you are familiar with that). Others, such as PyCharm, also work well for data science but are designed with more general software development in mind.

The easiest way to get Spyder, Python 3, and the needed libraries is to download and use the Anaconda distribution, which includes all of these items. You can find links and instructions for downloading and installing it on the Spyder homepage, https://www.spyder-ide.org/ .

How to Find Help When You're Stuck

Let's say that you're trying to complete an assignment that asks you to load a file of numbers, store them in a list, then calculate the average of all the numbers in the list.

The first step would be to break this down into small steps:

# 1. Open a file and read each item
# 2. Convert text to a number
# 3. Store the number in a list
# 4. Loop through the list and add all the values.
# 5. Calculate the average

Let's pretend you know how to do part 1, but not part 2:

# 1. Open a file and read each item
# Using "with" closes the file automatically when the block ends.
with open("/Users/lfalin/Desktop/test.txt", 'r') as theFile:
    for line in theFile:
        print(line, end='')

# 2. Convert text to a number
# 3. Store the number in a list
# 4. Loop through the list and add all the values.
# 5. Calculate the average

You could conceivably try googling for "Calculate average of numbers in a text file" and find a bunch of code that would work. However, in that case you would learn very little. In addition, using such code would be considered plagiarism.

Instead, you can search for just how to do step 2: python how to convert text to a number.

Looking at the first result, we find a link to StackOverflow that demonstrates how to use the float() and int() functions. You can now incorporate that knowledge into your own work:

# 1. Open a file and read each item
# Using "with" closes the file automatically when the block ends.
with open("/Users/lfalin/Desktop/test.txt", 'r') as theFile:
    for line in theFile:
        # 2. Convert text to a number
        number = float(line)
        print(number)
# 3. Store the number in a list
# 4. Loop through the list and add all the values.
# 5. Calculate the average
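As an aside on step 2 only, the sketch below shows how int() and float() behave on a few kinds of input. The helper name to_number is hypothetical, introduced here just for illustration, and returning None for bad input is one possible choice rather than a required approach.

```python
# int() parses whole numbers; float() parses numbers with a decimal point.
whole = int("42")
decimal = float("3.14")

# float() ignores surrounding whitespace, including the newline at the
# end of a line read from a file.
from_line = float("  7.5\n")


def to_number(text):
    """Hypothetical helper: return text as a float, or None if it is not a number."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for input such as "abc" or an empty string.
        return None


print(whole, decimal, from_line, to_number("abc"))
```

Knowing that a blank or malformed line raises ValueError makes it easier to decide how your own program should react to unexpected data.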

Note what does and does not count as plagiarism when it comes to programming assignments. In the plagiarism case, we're searching for code that solves the whole problem and using that code directly (often without understanding what it does).

In the second case, we've done our own work to break the problem down into discrete steps and we're trying to see how to complete a specific algorithmic step in a particular language.

Aside from the tutorials linked above and the official Python documentation and tutorial, the best place to find answers to Python questions (or programming questions in general) is StackOverflow.

Good Programming Practice

While it will not be specifically enforced, as in any programming course, you are strongly encouraged to use good style, code conventions, and programming practice.

The Google Python Style Guide is a good basis for your code. (Be aware that the textbook author does not adhere to these principles.)

Finally, as a heads-up, be aware that any statements you put directly at the top level of your Python files are run whenever that file is imported into another file, which can cause unwanted side effects. Thus, if you are writing functions that may be included in other programs, good programming practice is to use a main function as follows. This ensures that your main function is run if your file is invoked directly and not if it is imported by another module.


def main():
    print("Hello World")

if __name__ == "__main__":
    # This will only be run if this file is the one that was run directly.
    main()
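A slightly fuller sketch of the same pattern shows why it helps. The function name greet is hypothetical and chosen only for illustration.

```python
def greet(name):
    """Build a greeting; other modules can import and reuse this function."""
    return "Hello, " + name + "!"


def main():
    # Code that should run only when this file is executed directly.
    print(greet("World"))


if __name__ == "__main__":
    # __name__ equals "__main__" only when this file is the one being run.
    # When the file is imported, __name__ is the module's name instead,
    # so main() is skipped and nothing is printed.
    main()
```

Another file can then import greet from this module and call it without triggering the print in main().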

R

The other tool that has become very popular in the data mining world is R. It is an open source statistical package that excels at math, statistics, and graphing. We will use R in conjunction with Python and other tools throughout the course. We will discuss it more later on, and you can wait to install it until that point.