Problem Statement

Given experimental data on the bounce height of differently sized bouncy balls, find a line that best fits the size-vs-bounce-height data points.

Background:

Linear Regression is when you have a group of points on a graph and you find a line that approximately follows the trend of those points. A good Linear Regression algorithm minimizes the error, that is, the distance from each point to the line. The line with the least error is the line that fits the data best. We call this the line of best fit.
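
For intuition, here is a tiny sketch (toy points and hand-picked lines, not part of the project itself): the line whose total vertical distance to the points is smaller fits them better.

points = [(0, 1), (1, 3), (2, 5)]   # these toy points lie exactly on y = 2x + 1

def total_distance(m, b, pts):
    # sum of vertical distances from each point to the line y = m*x + b
    return sum(abs(y - (m * x + b)) for x, y in pts)

print(total_distance(2, 1, points))  # 0 -> this line passes through every point
print(total_distance(1, 1, points))  # 3 -> this line misses, so it fits worse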

Approach

We will approximate the line of best fit through brute-force linear regression: simply trying different m and b values in y = m*x + b and recording the pair that results in the least error.

Note: We recognize that this is a pretty mediocre approach from a data science perspective. The purpose of this project has more to do with practicing a basic data science workflow, as well as Python / Jupyter proficiency.

Part 1: Calculating Error

The line we will end up with will have a formula that looks like:

y = m*x + b

m is the slope of the line and b is the intercept, where the line crosses the y-axis.

get_y() takes in m, b, and x. It should return what the y value would be for that x on that line.

In [1]:
def get_y(m, b, x):
  return m * x + b

print(get_y(1, 0, 7) == 7)   # should print True
print(get_y(5, 10, 3) == 25) # should print True

We want to try a bunch of different m values and b values to see which line produces the least error. To calculate error between a point and a line, we will define a function calculate_error(). It will take in m, b, and an [x, y] point called point, and return the distance between the line and the point.

Steps:

  1. Get the x-value from the point and store it in a variable called x_point
  2. Get the y-value from the point and store it in a variable called y_point
  3. Use get_y() to get the y-value that x_point would be on the line
  4. Find the difference between the y-value from get_y() and y_point
  5. Return the absolute value of that difference (you can use the built-in function abs() to do this)

This absolute difference is the vertical distance between the point and the line, and it is what we treat as the error between the line y = m*x + b and the given point.

In [2]:
def calculate_error(m, b, point):
    x_point = point[0]
    y_point = point[1]
    return abs(y_point - get_y(m, b, x_point))

Tests of calculate_error():

In [3]:
#this is a line that looks like y = x, so (3, 3) should lie on it. thus, error should be 0:
print(calculate_error(1, 0, (3, 3)))
#the point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))
#the point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))
#the point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))

# summary: should output 0, 1, 1, 5
0
1
1
5

Here is our dataset. It is in the form (ball_diameter_in_cm, bounce_height_in_m)

In [4]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

The first datapoint, (1, 2), means that the 1cm bouncy ball bounced 2 meters. The 4cm bouncy ball bounced 4 meters.

As we try to fit a line to this data, we will need a function calculate_all_error, which takes m and b that describe a line, and points, a set of data like the example above.

calculate_all_error will iterate through each point in points and calculate the error from that point to the line (using calculate_error). It will keep a running total of the error, and then return that total after the loop.

In [5]:
def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

Tests of calculate_all_error():

In [6]:
#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))

#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))

#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))


#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))

# summary: should output 0, 4, 4, 18
0
4
4
18

It looks like we now have a function that can take in a line and a set of data and return how much error that line produces when we try to fit it to the data.

Our next step is to find the m and b that minimize this error and thus fit the data best.

Part 2: Try a bunch of slopes and intercepts

Our linear regression approach will be trial and error. We will try a bunch of different slopes (m values) and a bunch of different intercepts (b values) and see which one produces the smallest error value for the dataset.

Let's create a list of possible m values, called possible_ms, that goes from -10 to 10 inclusive, in increments of 0.1.

In [7]:
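# range() only accepts integer steps, so we build the integers -100..100 and divide each by 10 to get increments of 0.1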
possible_ms = [x/10 for x in range(-100,101)]

Now, let's make a list of possible_bs that goes from -20 to 20 inclusive, in steps of 0.1:

In [8]:
possible_bs = [x/10 for x in range(-200, 201)]

We are going to find the smallest error. First, we will make every possible y = m*x + b line by pairing all of the possible ms with all of the possible bs. Then, we will see which y = m*x + b line produces the smallest total error with the set of data stored in datapoints.

First, we'll create the variables that we'll be optimizing:

  • smallest_error — this should start at infinity (float("inf")) so that any error we get at first will be smaller than our value of smallest_error
  • best_m — we can start this at 0
  • best_b — we can start this at 0

We want to:

  • Iterate through each element m in possible_ms
  • For every m value, take every b value in possible_bs
  • If the value returned from calculate_all_error for this m value, this b value, and datapoints is less than our current smallest_error, set best_m and best_b to these values and set smallest_error to this error.

By the end of these nested loops, the smallest_error should hold the smallest error we have found, and best_m and best_b should be the values that produced that smallest error value.
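
For a rough sense of the search size (a back-of-the-envelope check, not part of the original write-up): there are 201 possible slopes and 401 possible intercepts, so the nested loops score roughly 80,000 candidate lines, each against the 5 data points, which is small enough for brute force to finish quickly.

print(len(possible_ms) * len(possible_bs))  # 201 * 401 = 80601 candidate lines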

In [9]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
smallest_error = float("inf")
best_m = 0
best_b = 0

for m in possible_ms:
    for b in possible_bs:
        error = calculate_all_error(m, b, datapoints)
        if error < smallest_error:
            best_m = m
            best_b = b
            smallest_error = error
            
print(best_m, best_b, smallest_error)

# with the given dataset, we should get 0.4, 1.6, 5.0
0.4 1.6 5.0

Part 3: What does our model predict?

Now we have seen that for this set of observations on the bouncy balls, the line that fits the data best has an m of 0.4 and a b of 1.6:

y = 0.4*x + 1.6

This line produced a total error of 5.

Using this m and this b, let's see what our line predicts the bounce height of a ball with a diameter of 6cm to be.

  • m = 0.4
  • b = 1.6
  • x = 6
In [10]:
get_y(0.4, 1.6, 6)
# 4 expected
Out[10]:
4.0

Our model predicts that the 6cm ball will bounce 4m.
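
If we wanted predictions for a few more diameters, the same get_y() call applies. The diameters below are hypothetical, chosen purely to illustrate reusing the fitted line.

for diameter in [7, 8, 9]:  # hypothetical diameters, not in the original data
    print(diameter, round(get_y(best_m, best_b, diameter), 2))  # roughly 4.4, 4.8, 5.2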