Context

A course for upper-year, non CS students emphasizing utility
Course teaches Hello World — Machine Learning
At this stage they have seen:
- Variables & Types
- Loops
- Conditionals
- Functions
- Lists & Numpy Arrays
- Dictionaries
- Basic Searching and Sorting
- File I/O
- Pandas
- Data Visualization
They do NOT know about Machine Learning yet (next class)
This would be one of the lectures near the very end of the semester
- Typically, these upper-year students are taking the class for exactly these skills (Data Science)
They are familiar with the website delivery (done to mimic Python docs)
- More notes than slides
- Discover ideas as opposed to throwing them at the students
They are used to:
- Lecture activities
- Interpreter
- Notebook style programming
- Pencil & paper
- Using libraries/packages
This is not a class like software carpentry

Data Science¶

I remember being amused the first time I heard “Data Science”
It’s not that well defined to be honest
- There are other buzz words that float around with Data Science too
Basically, using math, stats, algorithms, visualization, machine learning, and other forms of analytics to get information from data
Some have even said that it’s kinda’ like a different paradigm for science
- I have no well formulated hypothesis, but I do have data… I wonder if there are some relationships here?

Data Science is not another name for:

Statistics
Analytics
AI
Machine Learning
Deep Learning

Warning¶

We’re about to jump about 2.5ish years ahead in your CS education
Normally, you’d learn a whole bunch more CS. Both theoretical and applied
Have some statics and other math classes

If we wanted to do this right, we’d need to learn about:

Complexity theory
Advanced algorithms & Data structures
Linear Algebra
Multivariable calculus
Multivariate statistics (lots of stats, actually)
Even more stats
MORE STATS!
Signal Processing
Information Theory
…
…
…
Data Science

But that’d take too long, so…

We’re going to skip straight to the last step

Seriously?

Yes
Data Science is now too important for me not to show it to you
Further, doing it right is subjective
- For our purposes, we don’t need to be experts in calculus, algebra, statistics, etc. in order to make use of the techniques

What you can expect:

A very superficial introduction to Data Science
- An example of how I would get some data and start playing around with it to see what I can do
You’ll have some ideas about how to apply specific techniques and what they can tell you about data
In order to avoid getting bogged down in detail, I’m going to play fast and loose with some definitions and concepts
- Sorry (or not, depending on your perspective)
You’ll be able to turn your science up to an 11!

Centres for Vampire Control¶

If you would like to follow along, please follow this link: Colab Notebook

There is currently a huge vampire plague that’s quite problematic
You have been recruited by the United Nations to join the international Centres for Vampire Control and Prevention (CVC)
It is VERY difficult to identify if a subject is a vampire or not and it requires a lot of expensive testing that takes a long time
The CVC wants to know if it’s possible to identify vampires base on easy to measure features about the subjects
- Height (cm)
- Weight (kg)
- How averse they are to wooden stakes
- If they currently have garlic breath
- How reflective they are in a mirror
- How shiny/sparkly they are
The CVC has worked very hard (and spent 100s of millions of dollars) to gather data from 2000 subjects that they have also identified if they are a vampire or not
- The confidential data
- The last column is 0 if they are human, and 1 if they are a vampire
- Upload this to your Colab Notebook if you would like to follow along

Note

In this case our life is easy. We have a clear goal and not too too much data to be overwhelming. Many times life is not this simple.

Playing with Data¶

Note

I have broken these course notes down into steps, but this should not suggest that these are the standard steps one would always take.

Step 0.0: Look at the data!

Ok, csv in tabular format

Each row is a subject
Each column is a feature

Step 0.1: Get some imports to get things going

# Important Imports
import csv
import matplotlib.pyplot as plt
import numpy
import pandas
import scipy
import scipy.stats
import seaborn

At this stage you should be familiar-ish with what these are
seaborn is a new one that we will be using here, but it’s just a handy tool that helps make fancy plots
Each of these is one of the highly used tools of a Data Scientist

Step 0.2: Load up the data

# Loading up Some Data

# Some constants to make life easier later
FILE_NAME = 'CVC_data.csv'
LABELS = ['height (cm)', 'weight (kg)', 'stake aversion', 'garlic breath', 'reflectance', 'shinyness', 'IS_VAMPIRE']

oFile = csv.reader(open(FILE_NAME, 'r'))
data = numpy.array(list(oFile))

# Remove the subject name
data = data[:,1:]

# We can be lazy and just make everything a float
data = data.astype(float)


# If we want to do it all in one line of code
#data = numpy.array(list(csv.reader(open(FILE_NAME, 'r'))))[:,1:].astype(float)

# Putting it into a pandas dataframe
data = pandas.DataFrame(data, columns = LABELS)

We don’t really need to use pandas here
- We could easily just use numpy to do everything
BUT, pandas does provide some easy to use functions that will save some time

Step 1: Get some simple summary statistics

# Summarize ALL the Data
data.describe()

Question

Can you notice anything interesting here based on the summary statistics?

Not much here, but the mean of 0.407 can tell us one thing I suppose
- Huzzah, we learned something from the data!

Perhaps if we break the data down into their respective classifications

# Select only the rows where they are known humans
dataHuman = data[data['IS_VAMPIRE'] < 0.5]
dataHuman.describe()

# Select only the rows where they are known vampires
dataVampire = data[data['IS_VAMPIRE'] > 0.5]
dataVampire.describe()

We already found some differences it seems!
Difficult to REALLY say as these are just means/averages
Also, this is a summary statistic of many samples
- Like, how helpful is this if I gave you a new subject and told you their garlic breath was measured as a 0.38?
We’re not done, but we definitely have something!
- All by just getting some simple summary statistics
We could have done this with numpy, but pandas makes this easy for us

Step 2: Start Visualizing the Data

# Create a nice pair plot for all the data
seaborn.pairplot(data)

Don’t panic, it’s just a Pair Plot

I know there is a lot of information here
It’s actually simple to unpack this
Basically, we’re plotting each feature against every other feature in scatter plots
Along the diagonal, when we have a feature compared to itself, we simply have the histogram/distribution of data
- It’s cool and useful to think of the histograms and scatter plots together
Seaborn makes this easy
- Could have done with matplotlib, but this saves us some time

Question

Do you notice anything interesting here?

Let’s try seeing if there are any simple linear relationships between the features/dimensions

# Linear Correlations
cor = numpy.corrcoef(data.T)
plt.matshow(cor)
plt.xticks(range(len(LABELS)), LABELS, rotation=90)
plt.yticks(range(len(LABELS)), LABELS)
plt.colorbar()

Question

Does this matrix line up with observations we made with the pair plot?

Warning

Linear correlation is great and all, but it doesn’t tell us everything because sometimes the feature might have nonlinear relationships

That pair plot was cool, but it’s hard to tell what’s really going on

Simple solution, add another dimension to the visualization.
- Colour is a dimension too you know!

# Pair plot again, but let's add a new dimension (colouU)
seaborn.pairplot(data, hue='IS_VAMPIRE')

Wow, this makes it so much clearer!

In fact, it is making it pretty obvious which features seem to matter
- It’s basically jumping off the screen screaming at you
It even made the stake aversion and garlic breath features’ histograms so much clearer
- They CLEARLY look as if they are different distributions

Let’s clean up the pair plot a little more by removing the features that seem to not be too helpful

# Pair plot again, but let's add colour AND narrow it down to what seems to be the three key features
seaborn.pairplot(data, vars = ['stake aversion', 'garlic breath', 'reflectance'], hue='IS_VAMPIRE')

It doesn’t get any better than this!!!!

Classifying Data¶

Step 1: Classification with One Dimension

Question

Based on the above image, if I asked you to pick one feature to help us classify/predict if a subject was a vampire or not, which would you pick?

# To me it really looks like Reflectange would be good.
plt.hist(dataHuman['reflectance'], color = 'b', alpha = 0.5)
plt.hist(dataVampire['reflectance'], color = 'r', alpha = 0.5)

To me these look like they are obviously different distributions, but let’s geek out and be super sure with a t-test

# Check what the p-val is
print(scipy.stats.ttest_ind(dataHuman['reflectance'], dataVampire['reflectance']))

If you run the above code, you get a p-value of 0.0

Obviously not really 0, but python gave up and just said it’s virtually 0

Everything at this stage is telling us that we are likely good to pick reflectance as a feature to help classify/predict if a subject is a vampire

Question

Based on the histograms, where would you pick the cutoff based on an eyeball test?

# Can change for fun
CUTOFF = 0.75

# ACTUAL
y = data['IS_VAMPIRE']

# My prediction based on my cutoff
y_hat = []
for d in data['reflectance']:
  if d < CUTOFF:
    y_hat.append(1.0) # is a vampire
  else:
    y_hat.append(0.0) # is a human

The above code just goes through each data point and checks if the reflectance is above or below our cutoff

How accurate am I?

# Compare each y and y_hat
correctCount = 0
for i in range(len(y)):
    if y[i] == y_hat[i]:
        correctCount += 1

print('Accuracy: ' + str(correctCount/len(y)))
print('Number Wrong: ' + str(len(y) - correctCount))

If you run the above code, you get:

Accuracy: 0.997

Number Wrong: 6

WOW! That’s actually amazing!

We want a better view of the accuracy though, so let’s go with:

True Positive

True Negative

False Positive

False Negative

# Compare each y and y_hat
# Assume 1 for IS_VAMPIRE is a *positive*
TP = 0
FP = 0
TN = 0
FN = 0
for i in range(len(y)):
    if y[i] == 1.0 and y_hat[i] == 1.0:
        TP += 1
    elif y[i] == 0.0 and y_hat[i] == 1.0:
        FP += 1
    elif y[i] == 1.0 and y_hat[i] == 0.0:
        FN += 1
    elif y[i] == 0.0 and y_hat[i] == 0.0:
        TN += 1
    else:
        print('Something went wrong')

print('True Positive Rate: ' + str(TP/(TP + FN)))
print('True Negative Rate: ' + str(TN/(TN + FP)))
print('False Positive Rate: ' + str(FP/(FP+TN)))
print('False Negative Rate: ' + str(FN/(TP+FN)))

If you run the above code, you get:

True Positive Rate: 0.9963144963144963

True Negative Rate: 0.9974704890387859

False Positive Rate: 0.002529510961214165

False Negative Rate: 0.0036855036855036856

If our rule is If their reflectance is less than 0.75, then they are a vampire, that’s brilliant

You can basically not do better than that in terms of the simplicity of a classifier
- We love when our rule/function/classifier/model is easy to explain

Step 2: Classification with Two Dimension

Obviously we’re happy with our simple rule, but can we do better by including a new dimension?

Question

Based on the pair plot (same as before), if you wanted to include two dimensions for classification, which would you pick?

Don’t think about a single point anymore (like with the histograms/distributions), think line in 2D space

HINT: One will likely be reflectance

# Reflectanve vs. Stake Aversion looks good
plt.scatter(dataHuman['reflectance'], dataHuman['stake aversion'], color = 'b', alpha = 0.5)
plt.scatter(dataVampire['reflectance'], dataVampire['stake aversion'], color = 'r', alpha = 0.5)

Question

If you could draw a straight line, how accurate do you think you could get?

Looks like I could get perfect, or perhaps 1 false negative.

I wonder if there is a handy way to easily find the line?

Step 3: Classification with THREE Dimension

# 3D Plot
# I'm sorry, but this is somewhat *magic code*
# Sorry, I know... :(
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(dataHuman['reflectance'], dataHuman['stake aversion'], dataHuman['garlic breath'], color = 'b', alpha = 0.5)
ax.scatter3D(dataVampire['reflectance'], dataVampire['stake aversion'], dataVampire['garlic breath'], color = 'r', alpha = 0.5)

ax.view_init(40, 60)

Imagine creating a plane to split this data now… I’m betting we could probably get 100%

I wonder if there is a handy way to easily find the plane?

Step 4: Classification with Four+ Dimension

Well, we’re out of useful ones here, but there is nothing stopping us from using higher dimensions if we had them

Just a little hard to visualize

Dimensionality¶

If we only have the one dimension, how many data points do we need to fill our domain?

It’s a continuous value, but it looks like it has two decimal points

Looks like the low for reflectance is roughly -0.35 and high of 1.05

Total of roughly 140 values

If we have the two dimension, how many data points do we need to fill our domain?

140 from reflectance

Stake has a low of -0.53 and a high of 1.33, for a total of 186

That means we need 140 * 186, or 26,040 data points to fill that space

If we have the three dimension, how many data points do we need to fill our domain?

26,040 from reflectance and stake aversion

152 for garlic

3,958,080‬ points needed to fill that space

Obviously the observations could be for the same values
And not all the areas of the space would necessarily be occupied
BUT, it does show that we need more and more data the more features/dimensions we want to use

Drawing the Straight Line?¶

Question

Before we drew a straight line and got like 0 or maybe 1 error

If I asked you to draw a curved line, do you think you could get 100% for sure

I suppose we don’t even need lines…

K-Nearest Neighbours

If it looks like a dog and barks like a dog, it’s probably a dog

Question

We will cover K-Nearest Neighbours in our Machine Learning lecture, but for now, here’s a preview

from sklearn.neighbors import KNeighborsClassifier

# Select the 2Ds we liked the most (ignore 3rd because this is easier)
X = data[['stake aversion', 'reflectance']]

# Consider the 2 closest neighbours
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X, y)
print(knn.score(X, y))

0.9985

# Consider the 1 closest neighbours
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X, y)
print(knn.score(X, y))

1.0

Question

Why am I getting 100%?

Can you think of any problems here?

Warning

This is actually a HUGE problem with what I did

I checked how well my classifier worked on the data it was trained on/fit to

I have no idea how well this generalizes

This is also probably overfit

This is not to say that we did not learn something valuable, but that we need to be careful about our conclusions at this stage

Question

Any ideas on how to fix this big problem?

Understand that this is just a light introduction
We will learn a lot more about Machine Learning next class
- But that will also be a light intro like today’s lecture
If you have a scratch to itch in the meantime, check out Google’s Machine Learning Crash Course
- You have all the necessary skills to have at it
- It’s a little Artificial Neural Network heavy, but it gets the point across

Next Class: Machine Learning

Table of Contents

Previous topic

This Page

Data Science¶

Warning¶

Centres for Vampire Control¶

Playing with Data¶

Classifying Data¶

Dimensionality¶

Drawing the Straight Line?¶