Given experimental data on the bounce height of differently sized bouncy balls, find a line that best fits the size-vs-bounce-height data points.
Background:
Linear Regression is when you have a group of points on a graph, and you find a line that approximately follows that group of points. A good linear regression algorithm minimizes the error: the total distance from the points to the line. The line with the least error is the line that fits the data best. We call this the line of best fit.
We will approximate the line of best fit through brute-force linear regression: simply trying different m and b values in y = m*x + b, and recording those that result in the least error.
Note: We recognize that this is a pretty mediocre approach from a data science perspective. The purpose of this project has more to do with practicing a basic data science workflow, along with Python/Jupyter proficiency.
The line we will end up with will have a formula that looks like:
y = m*x + b
m is the slope of the line and b is the intercept, where the line crosses the y-axis.
get_y() takes in m, b, and x. It should return what the y value would be for that x on that line.
def get_y(m, b, x):
    return m * x + b
print(get_y(1, 0, 7) == 7)   # should print True
print(get_y(5, 10, 3) == 25) # should print True
We want to try a bunch of different m values and b values to see which line produces the least error. To calculate the error between a point and a line, we will define a function calculate_error(). It will take in m, b, and an [x, y] point called point, and return the distance between the line and the point.
Steps:
1. Get the x-value of the point and store it as x_point.
2. Get the y-value of the point and store it as y_point.
3. Use get_y() to find the y-value that x_point would have on the line.
4. Take the difference between the value returned by get_y() and y_point.
5. Return the absolute value of that difference (you can use the built-in function abs() to do this).
The distance represents the error between the line y = m*x + b and the point given.
def calculate_error(m, b, point):
    x_point = point[0]
    y_point = point[1]
    return abs(y_point - get_y(m, b, x_point))
Tests of calculate_error():
#this is a line that looks like y = x, so (3, 3) should lie on it. thus, error should be 0:
print(calculate_error(1, 0, (3, 3)))
#the point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))
#the point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))
#the point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))
# summary: should output 0, 1, 1, 5
Here is our dataset. It is in the form (ball_diameter_in_cm, bounce_height_in_m):
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
The first datapoint, (1, 2), means that the 1 cm bouncy ball bounced 2 meters. The 4 cm bouncy ball bounced 4 meters.
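If you want to see the data before fitting anything, a quick scatter plot helps. This is an optional sketch that assumes matplotlib is installed; nothing later depends on it.
# optional: visualize the raw data (assumes matplotlib is available)
import matplotlib.pyplot as plt

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
diameters = [point[0] for point in datapoints]
heights = [point[1] for point in datapoints]

plt.scatter(diameters, heights)
plt.xlabel("ball diameter (cm)")
plt.ylabel("bounce height (m)")
plt.show()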
As we try to fit a line to this data, we will need a function calculate_all_error(), which takes m and b that describe a line, and points, a set of data like the example above.
calculate_all_error() will iterate through each point in points and calculate the error from that point to the line (using calculate_error()). It will keep a running total of the error, and then return that total after the loop.
def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error
Tests of calculate_all_error():
#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))
#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))
#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))
#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))
# summary: should output 0, 4, 4, 18
It looks like we now have a function that can take in a line and a set of data and return how much error that line produces when we try to fit it to the data.
Our next step is to find the m and b that minimize this error, and thus fit the data best.
Our linear regression approach will be trial and error. We will try a bunch of different slopes (m values) and a bunch of different intercepts (b values) and see which combination produces the smallest total error for the dataset.
Let's create a list of possible m values, possible_ms, that goes from -10 to 10 inclusive, in increments of 0.1.
possible_ms = [x/10 for x in range(-100,101)]
Now, let's make a list of possible b values, possible_bs, that goes from -20 to 20 inclusive, in steps of 0.1:
possible_bs = [x/10 for x in range(-200, 201)]
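A quick, optional sanity check that both lists cover the intended ranges (building them from integers and dividing by 10 avoids floating-point drift from repeatedly adding 0.1):
# each list should span its range inclusively
print(len(possible_ms), possible_ms[0], possible_ms[-1])  # 201 -10.0 10.0
print(len(possible_bs), possible_bs[0], possible_bs[-1])  # 401 -20.0 20.0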
We are going to find the smallest error. First, we will make every possible y = m*x + b line by pairing all of the possible m values with all of the possible b values. Then, we will see which y = m*x + b line produces the smallest total error with the set of data stored in datapoints.
First, we'll create the variables that we'll be optimizing:
- smallest_error: this should start at infinity (float("inf")) so that any error we get at first will be smaller than our value of smallest_error
- best_m: we can start this at 0
- best_b: we can start this at 0
We want to:
- Iterate through each m value in possible_ms.
- For every m value, take every b value in possible_bs.
- Call calculate_all_error on this m value, this b value, and datapoints.
- If that error is less than our current smallest_error, set best_m and best_b to these values, and set smallest_error to this error.
By the end of these nested loops, smallest_error should hold the smallest error we have found, and best_m and best_b should be the values that produced that smallest error.
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
smallest_error = float("inf")
best_m = 0
best_b = 0

for m in possible_ms:
    for b in possible_bs:
        error = calculate_all_error(m, b, datapoints)
        if error < smallest_error:
            best_m = m
            best_b = b
            smallest_error = error

print(best_m, best_b, smallest_error)
# with the given dataset, we should get 0.4, 1.6, 5.0
Now we have seen that for this set of observations on the bouncy balls, the line that fits the data best has an m of 0.4 and a b of 1.6:
y = 0.4x + 1.6
This line produced a total error of 5.
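As an optional check, we can look at the per-point errors for this line and confirm they add up to the total we found (rounding, since floating-point arithmetic can introduce tiny offsets):
# per-point errors for the best-fit line y = 0.4x + 1.6
for point in datapoints:
    print(point, calculate_error(0.4, 1.6, point))
# the per-point errors are roughly 0, 2.4, 1.2, 0.8, and 0.6
print(round(calculate_all_error(0.4, 1.6, datapoints), 2))  # 5.0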
Using this m and this b, let's see what our line predicts the bounce height of a ball with a diameter of 6 cm to be.
get_y(0.4, 1.6, 6)
# 4 expected
Our model predicts that a 6 cm ball will bounce about 4 m.
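Using the same line, we could sketch predictions for a few larger, hypothetical ball sizes. Keep in mind these diameters lie outside the range of our data, so this is an extrapolation:
# predicted bounce heights (in m) for a few hypothetical diameters (in cm), rounded for readability
for diameter in [6, 7, 8, 9, 10]:
    print(diameter, round(get_y(0.4, 1.6, diameter), 1))
# prints 4.0, 4.4, 4.8, 5.2, 5.6 for diameters 6 through 10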