Linear Regression Machine Learning (tutorial)

[BLANK_AUDIO] Let’s see, let’s see everybody. I’ll minimize myself. I don’t need to see myself. I need to see you guys. I need to see you guys. Hello world, it’s Siraj! I am hyped for this live session. Yo, class is in session everybody. We are about to do some math this
live session, so I’m super excited. But first of all,
let me take some roll call, all right? So, let’s see, Collin,
Brandon, Nil, David, Dakosh, Sebastian, Raj, Spencer,
Naresh, Niko, Clement, hi, guys! Michael, Benjamin, all right. So, that was roll call. Welcome to this live session for
the deep learning, intro to deep learning course. Okay. This is going to be so awesome. Because I have been
waiting to do some math. Guess what guys. Guess what. I bought this pad to write some math on. Okay. I’ve never used this before so,
I’m super excited for this. I’m going to show you guys the math behind linear regression. By the end of this video, you guys are going to know, like the back of your
hand, how to do linear regression. That includes gradient descent. And guess what? We use gradient descent
all over the place in machine learning. Don’t worry if you don’t know what
that is, I’m going to show it to you. Okay? So, we’re going to deep dive into this. So, we’re going to start
off with a five minute Q&A, like always, and I think we’ve got some
Udacity peeps in the house as well. Drew. Nico, and Max, who’s the other instructor for the course. So, I think you’re here shout out,
say something so people know who you are and, so I’m going to
do my five minute Q&A, like always, and I’m going to answer all the questions
related to me, and my everything, but if you have any Udacity specific
questions, they will answer those, okay? So, let’s start up with
a five minute Q&A, and then we’re going to get right
into the code and that, okay? Do I have to know about
partial derivatives? We are going to do a partial derivative,
but I’ll show you how that works. [BLANK_AUDIO] I had to cut off cutie
pie to catch this. Wow I’m honored, I’m honored. Hey baby girl,
let me see that regression. All right, so that’s not a question. Let’s get some real questions in there,
some quality questions. All right. [BLANK_AUDIO] Would you want to check out
my Vive AI Assistant demo? Sure, yes, post a GitHub link in
the comments of one of my videos, I read all my comments,
I answer all my comments. See, I’m not fake,
you know what I’m saying? I answer all my comments. I’m here for you guys, have I
enrolled in- is calculus required for linear regression? Yes. A little bit of calculus, but
I’m going to go through that. Don’t be afraid by the word calculus. This is actually very intuitive. [BLANK_AUDIO] Can you mention some details
about the upcoming as well, looking to predict the genre from- [BLANK_AUDIO] All right. [BLANK_AUDIO] What basic maths will be needed? You’ll need to know basic algebra, okay? And then we’re going to learn
the calculus necessary to do this in this video, okay? Are GANs the future? Yes. I mean, the ideas behind generative
models in general are really exciting, because you can generate
things that don’t exist. And that has a lot of potential for
art and culture. GANs can change culture, right? We can generate music. We can generate art. We can generate paintings in
ways that humans couldn’t. Best book to understand math behind ML? Machine Learning:
A Probabilistic Perspective. That’s a pretty good one. Just mastering or coding too? Both. Mostly coding. Linear regression versus
other classifiers like SGDC. Linear regression is definitely easier. What is no free lunch? It’s a theorem, the no free lunch
theorem at a very high level it’s like, well you can’t make assumptions. You can’t make assumptions whenever
you are doing anything related to proving something. Just, when will you do NLP? Yo, I’m going to do so
much NLP in this course. I can’t wait for NLP, it’s coming up. Will you cover GANs? I kind of want to just do GANs
right now, you know what I mean? I’m super excited for
GANs, I will do GANs. [BLANK_AUDIO] I will give an intuition why to
do gradient descent over, yes, I will explain that. Linear algebra is the way to go? Yes. What’s the difference between
scikit-learn and TFLearn? Scikit-learn and TFLearn,
great question. So, TFLearn is a high level
wrapper on top of TensorFlow. It’s very similar looking
to scikit-learn, but TFLearn is only focused on
deep neural networks. Scikit-learn uses support vector machines
and all sorts of other machine learning models. Whereas TFLearn has the same brevity, but
it focuses only on deep neural networks. Do you prefer Weka? No. No, when will you start
working on Anaconda? I mean, I’ll most likely start using
Docker to contain those things. All right, rap for 50K subs,
let me rap for 50K subs. And this time,
I’m going to play an instrumental. I’m not going to just rap with that kind
of instrumental, you know what I mean? Don’t be discouraged, rap,
hip hop instrumental on YouTube, whatever it starts playing. Someone say a keyword and
then we’re going to get started. Triumph hip hop instrumental,
what is this about. Let’s go, play it. All right, let me just unplug my mic, so you guys can see this,
where’s the music. I’m going to say something,
you know what I’m saying? [MUSIC] 50k subs. I got 50k subs, man,
my mind is so fresh. I’m looking at this coffee mess,
looks like the best. I got caffeine on my mind,
it takes me so high through the sky. I got a USB 4, my my. I’m going to be writing math today,
like it’s all mine. Online, I see you man, it’s all fine. It’s all writing piece
of equations online. I see you coming back like threw
me your progression, wait. So that was it for the rap. Okay, so, that’s it for the rap. So, now we’re going to get
started with the code, okay? So, let’s go ahead and do this. I’m going to start screen sharing,
and then we’re going to get started. All right, here we go. [SOUND] Here we go, Google Hangouts. All right, and
what does Hangouts want to do? Hangouts wants to screen share. Hangouts wants to screen share. Your entire screen. Share. All right, so I’ll minimize this,
and minimize, and then I’ll move this out of the way,
so I can see what you guys are doing. Okay, and we’re going to code this baby. Okay, I am in the corner here, let me
make sure that what you guys are seeing is what I want you to see. [BLANK_AUDIO] Yes, what you guys are seeing is exactly
what I want you to see, perfect. All right. [BLANK_AUDIO] Okay, so here we go. [BLANK_AUDIO] Here’s what we’re going to do guys,
let me make this statement [SOUND]. This is big enough right? So in this lesson,
we’re going to do linear regression. And what is linear regression, right? So linear regression, in this case, and
let me make sure everything’s working. Everybody’s here, live chat’s working,
live video is not working. Okay, so here’s how it goes. So we’re going to do this, okay? So this is going to be
called linear regression. This is linear regression and
let me just show you guys. The best way to explain it is
to show it through visuals, so I’ll show it through visuals,
what exactly we’re going to be doing. And to show you visually,
I will give you a link to this, and I will just show it right here. This is what’s happening. So we have a set of points, and these
points are the test scores of students, and the amount of hours studied, okay? So this is what it looks like. So this, right on the right, this graph
here, this set of points is the data set. The x values are the amount
of hours they studied and the y values are the test
scores they got. Okay. And intuitively, to us, there must be some kind of
correlation between these two values. But we want to prove this
programmatically, we want to prove this, I’m sorry, mathematically, we want to
prove that there is a relationship. And how do we prove that
there is a relationship? We draw a line of best fit. So how do we know what that line of best
fit is, or that linear regression is? Well we don’t know, we don’t know. We have to find that, and the way we’re going to find the line
of best fit is using gradient descent. And that process,
that training process looks like this. We’re going to draw a random line,
compute the error for that line. And I’ll talk about how we’re
going to compute that error. And that error value is going to say
how well-fit is this line to the data? And then based on that error,
it’s going to act as a compass. It’s going to tell us, well, how best should you redraw the line
to be closer to the line of best fit. And we’ll keep doing that. So, it’ll be like draw a line, compute
error, draw a line, compute error, until eventually the line that we draw is the
optimal line that we should draw, okay? So, that’s at a very high level. But now I’m going to
go into the code and we’re going to talk
about this in detail. All right, so
lets go ahead and start it. [BLANK_AUDIO] So to start off, to start off I’m
going to write my main function, okay? So let me move all this stuff out of the
way, so I’ll get right into the code, all right? I’ll get right into the code. And guys, if people have questions and
I’m not able to answer them because I’m busy doing something,
please help me answer questions. I very much appreciate it. I very much appreciate it. Okay? So let me just start off by
writing the main function. What does the main function do? That’s where the meat of the code goes. Right, okay so in the main function,
we’ll write a run function, which is where we’re going to
store all of our logic. Okay, so let’s write up a run function. So the run function is a chance for us
to show what we’re doing at high levels, at a high level. So step one, is collect our data, right? Always in machine learning,
we want to collect our data. So we’ll get our data points. And what we’re going to do, how are we
going to collect our data, right? Well, to collect our data, we have to
import the one library that we’re using. I know guys,
we’re using a single library. And that library is NumPy, all right? And we’re going to use this little
symbol that means we don’t have to continually say NumPy whenever we
call its method or its functions. Okay, so what is the function
we’re going to use for NumPy? So the function we’re going to use for
NumPy, I’m sorry, right, main,
thank you, main, good call. So, the function we’re going to use for
NumPy is genfromtxt(). And what this is going to do, is it’s going to get the data
point from our data file. And let me show you guys
the data file as well. But basically we’re going to
separate it by the commas. Okay, and
we’re going to get those points. So what does this,
what does this data look like? Well, let me pull up terminal, and show you guys exactly what
this data looks like. So it looks like this. Okay? So let me zoom in on this. Zoom, way more. 200 zoom. So these are just the hours studied,
on the left side, and then the test scores for a bunch of students, for
an intro to computer science class. Okay? The hours studied and
the test scores they got. Okay, so
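A minimal sketch of this loading step, assuming a comma-separated file with hours studied in the first column and test scores in the second. The in-memory sample and its values are made up for illustration; in the session the data comes from a file on disk.

```python
import io

import numpy as np

# Hypothetical stand-in for the data file: each row is "hours studied,test score".
# These three rows are invented for illustration only.
sample_csv = io.StringIO("32.5,31.7\n53.4,68.8\n61.5,62.6\n")

# genfromtxt splits each line on the delimiter and converts each field to a float.
points = np.genfromtxt(sample_csv, delimiter=",")

print(points.shape)  # (3, 2): three students, two columns
x = points[:, 0]     # hours studied
y = points[:, 1]     # test scores
```

Passing a file path instead of the `StringIO` object should work the same way.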
that’s what we’re going to pull. That’s our data set. That’s what we’re going to
pull into our points variable. So, points is going to contain
a bunch of xy value pairs. Where x is the amount of hours
studied and y is the test score. Okay? And it’s separated by the comma. Okay, so that’s step one. We’ve done that, and genfromtxt is
essentially running two main loops. The first loop converts each line
of the file to a sequence of strings. And the second one is converting each
string to the appropriate data type. Okay, so that’s step one. Now, step two is to define
our hyperparameters. Okay, in machine learning, we have
what are called hyperparameters. These are tuning knobs for our model. They are basically the parameters
that define how our model is analyzing certain data. How fast it’s spinning through the data. What operations it’s performing on the data. There’s a whole bunch
of hyperparameters. Thank you for the feedback. There’s a whole bunch
of hyperparameters and what we’re going to use
is the learning rate. Now the learning rate is used
a lot in machine learning, and it basically defines how
fast should our model converge? Convergence means when you
get the optimal result, the optimal model,
the line of best fit, in our case. That is convergence. So how fast should we converge? You might be thinking, well, shouldn’t
the learning rate just be a million, if you want to converge super fast? Well, no. Like all hyper-parameters,
it’s a balance, okay? So if the learning rate is too small,
we’re going to get slow convergence. But if it’s too big, then our error
function might not decrease, okay? So it might not converge. So, that’s our first hyper-parameter. Our next hyper-parameter is going
to be the initial value for b, and the initial value for m. And what is b and m? Well what we’re going to do, is we’re
going to calculate the slope, right? So this looks like a y equals mx plus b,
and so this is why I said we only
need to know basic algebra. This is the formula,
this is the slope-intercept formula, okay? All lines follow this formula. So, m is the slope, b is the
y-intercept, x and y are the points. Okay, so that’s the line, okay? So, these are our initial b and m values,
our initial y-intercept and our initial slope. They’re going to start off as 0, okay? So, and then the last hyperparameter
is going to be the number of iterations. How much do we want to train this model? Well, we have a very,
very small data set. There’s only 100 points, okay. And for that, we’re not going to need to iterate
a million times or 100,000 times. We’re just going to iterate 1,000 times. Okay? So those are our hyperparameters, and now step three is going to be to
train our model. Okay, so the first step is going to be to show
the “Starting gradient descent” message, okay? At b equals,
what is the starting value? It’s going to be placeholder zero, right? And then m is going to be the starting
slope, for that we’ll say placeholder one. And this is just for
us to see the difference here, okay? [BLANK_AUDIO] All right, .format(initial_b,
initial_m). And so, what’s happening here? [BLANK_AUDIO] compute_error_for_line_given_points. So, all right, let me just write
this out and I’ll explain. initial_b, initial_m,
and then the points. Okay, so what’s happening here? Let’s go over what I just wrote here. So, in this line, we’re going to show
the starting b value, the starting m value, so what is our starting
y-intercept, what is our starting slope. And what is our starting error? And I’m going to show you how we’re
going to calculate that error. And to get that error,
given our b and m values, we have this function here called
compute_error_for_line_given_points. It’s going to take the b,
m and the points, and it’s going to compute the error for
that and it’s going to output that. So, that’s going to be
our starting point, okay? And then, now, we’re going to actually
perform our gradient descent, and it’s going to give us
the optimal b and the optimal slope. I’m sorry, it’s going to give us the
optimal slope and the optimal y-intercept. So, for gradient descent,
we’re going to call this method, the gradient_descent_runner,
given our points, given an initial b value, an
initial m value, given our learning rate, so this is where we’re going to
use all of those hyperparameters, right? Because this is where
we’re training our model. So, number of iterations. Those are all the things we need for
this, okay?, and we’re going to define this
function in a second. We’re going to go deep dive and
define these functions. Okay, so then after we train our model,
well now we can just print it out, right? So, let me just copy and paste this. So, now, this is not our starting point,
this is now our ending gradient, ending point. So, this is our ending point, where b is
placeholder one, m is two and then error is three. And these numbers just define what we’re going to see at the end: the number of iterations, then b, and then, [BLANK_AUDIO] m, and then computing the error
for the line given points, given the final b, the final m value,
and then our points. Okay, so. [BLANK_AUDIO] Okay, so that is high level,
what’s happening here? So, all I did was I just
printed out the initial b and m value, which is nothing,
and then the error, and then I ran gradient descent,
and then I printed out the final values. So, I’m about to do this now. Okay, so we haven’t actually done this,
now we’re going to do it. So, the first thing I’m
going to talk about is how we’re going to compute that error. Let’s write that first function. What was that first function called? It was called
compute_error_for_line_given_points. Okay, so and the data set I’m
going to provide that as well, but let’s go ahead and
write up this method, okay? So, this is the first step. We’re going to write up this method, compute_error_for_line_given_points. Okay, I’m so
excited to show you guys this, because I get to use my math pad for
a second. Okay, so let me write this out,
okay?, hold on. Okay, here we go. So, let me write this out. Okay, so we’ve got a line here. Man, what a great line that is. Okay, so this is our plot, okay? And, so
we’ve got a bunch of data points here. We’ve got a bunch of data points. Right, they’re all over the place, and what we are going to do is draw
a random line through the data. We don’t know the line of best fit, so we are going to draw a random
line through the data. And then, we are going to compute
the error of that line, so that error will tell us
how good our line is. Okay, so
how do we know how good our line is? Well, what we’re going to do is, we’re
going to go through every single point, and we’re going to calculate
the vertical distance from each point in our data to the line. Okay, so all of these distances,
all of these distances, distance one, distance two, distance three, distance
four, distance five, distance six and then you probably have more data
points down here, these distances, the distance to this line. And, so we’re going to take all those
distances and we want to sum them. And, so let me show you the equation for
that, okay? So, rather than actually
writing out this equation, like really sloppily, I’m going to
show it to you using this, okay? So, okay. So, this is the equation. So, let me explain what this is. So, we got all those distances,
right?, we got all those distances. We’re going to sum those
distances together and then get the average of that. But guess what, we’re not just
going to sum those values alone, we’re going to square those values. And why are we squaring those values? We’re squaring those values
because we want each one, first of all, to be positive, and it doesn’t really
matter what the actual value is. It’s more about the magnitude
of those values, right? And we want to minimize
that magnitude over time. So, this is the equation for that. Okay, so let me explain what
the hell this is, okay? So, we’re computing the error. We are computing the error
of our line given m and b. So, given m and b we are going to
compute the error of our line. M is our slope and b is our y intercept. So, this E, looking thing,
is called sigma notation. It’s a little weird,
giving you guys a little refresher here. This E thing, we’re going to see it a lot in machine
learning, it’s called sigma notation. And basically it’s a way of describing, calculating the sum of a set of values,
all right? So, the sum of a set of values,
which is what we’re doing. We’re calculating the sum over a set
of points, so the starting point is where i equals 1, and the ending
point, N, is the number of points. Okay, so for every point, you want to
calculate the difference in y values. So, it’s y-(mx+b). And why do we say (mx+b)? Because in the slope equation,
y equals (mx+b), right? So, it’s y-(mx+b),
which essentially boils down to the actual y minus the y on the line. So, it’s that difference, squared. And then we’re doing that for
every single point. And, so we’re going to add
all of those squared differences together. Okay, and then get the average. And, so that’s why there’s a 1/N. Because we’re going to
get the average of that. And that value is the error. Okay?, so at high level,
that is what that is. So, now let’s programmatically
write this out, okay? So, we’re going to start by initializing
the error, initialize it at zero. Okay?, so our total error at
the start is just going to be zero. There’s not anything that’s- [BLANK_AUDIO] We don’t have an error yet, okay? So, then for every point, so for
i in range of starting at zero, and then going for
the length of the points, right? So all of our data points, so for
every data point that we have. We’re going to say,
let’s get the x value, so x = points[i, 0]. And then we’re going to
get that y value, right? So, get the y value, right? So, I’m just basically
programmatically showing what I just talked about mathematically. Right? So, we’ve got the x value,
we’ve got the y value. And we want to compute that distance,
right? We’re going to do this
every single time. [BLANK_AUDIO] Then get the difference. [BLANK_AUDIO] Square it, and then add it to the total. Okay, so
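Put together, the loop being written here might look like the sketch below. It follows the formula shown on screen, error = (1/N) * sum of (y - (m*x + b))^2 over all points. The snake_case variable names and the three sample points are assumptions for illustration, not the session’s own data.

```python
import numpy as np


def compute_error_for_line_given_points(b, m, points):
    """Mean of the squared vertical distances between the points and y = m*x + b."""
    total_error = 0.0
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # Difference between the actual y and the y on the line, squared.
        total_error += (y - (m * x + b)) ** 2
    # Divide by the number of points to get the average.
    return total_error / float(len(points))


# Made-up points that lie exactly on the line y = 2x + 1.
points = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])

print(compute_error_for_line_given_points(1.0, 2.0, points))  # perfect fit, error is 0.0
print(compute_error_for_line_given_points(0.0, 0.0, points))  # (1 + 9 + 25) / 3
```

A line that passes through every point gives an error of zero, and the error grows as the line moves away from the data, which is what makes it usable as a compass.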
here’s the actual equation, right? So, we’re going to do plus equal,
because it’s a summation, and we’re going to programmatically show what I
just talked about right here, right? y-(mx+b) squared. Okay? And we’re going to get the sum of that. So, y-(m * x + b) squared, okay? And we’re going to do that for
every point, so this whole iteration loop right here, is that equation,
okay?, minus the average part. So, that’s going to give
us the total value. The last part is to average it. So, we’ll take totalError
/ float(len(points)). So, we want it to be a float value. [BLANK_AUDIO] And that is the equation. That is the equation right there. Okay, so and then get the average. Get the average. [BLANK_AUDIO] So, this ten line
function just described, what I talked about right here
in this math equation, okay? We sum all the distances between all
those points, as I showed right here. We squared them, we summed
them all up, and then we got the average. And that is our error. Okay? And we’re calculating that
because we want a measure for us,
something to minimize over time. Right? Something to minimize every
time we redraw our line, we want to minimize this error. Because this error basically is
a signal, it’s a compass for us. It’s telling us,
this is how bad your line is. It needs to get better. You need to make me smaller. I’m really big right now,
make me smaller. And that’s what gradient descent does. That’s what gradient descent does. And I’m going to explain how
gradient descent works in a second. But that’s that error function, right? Okay? What was the second
function we wrote? It was called gradient descent runner. So, this is our actual
gradient descent function. So, now let’s write this out. Okay? This is our second of
three methods, before we’re done. So, gradient_descent_runner. So, given a set of points,
given a starting value for b, given a starting value for m, given our learning rates and
given our number of iterations. We’re going to use all of these things
to run gradient descent. We’re going to use every single thing. Okay? [BLANK_AUDIO] Okay. So, let’s get that starting b and
m value, okay? So, the starting value for
b, we’re going to set to b. And the starting value for
m, we’re going to set to m. Okay? Simple enough. And now,
we’re going to perform gradient descent. What is gradient descent? I cannot wait to explain
gradient descent, guys. I found the perfect analogy for gradient
descent, and I’m really excited. Okay, before I explain that, let’s just perform it. I can erase
this, because the actual math is going to start in the last function
that I’m about to write. So, for
every single iteration that we define, we’re going to perform what’s
called gradient descent. So, we’re going to update b and
m with the new, more accurate b and m by performing this gradient step, okay? So, b and m, we’re going to get back b
and m by performing this gradient step. As I already explained,
this is where the math is happening. Given our current b, our current m, given the array
of points that we have. And then finally given
the learning rate. We’re going to calculate
that final value of b and m. And guess what? Once this gradient descent is done, we’re going to return that optimal b and
m, right? And, so that’s what we talked
about at the starting part, right? We returned that optimal b and
m value from the gradient descent,
and then we printed it out, because that optimal b and
m value gave us a line of best fit. We plug them into the y =
mx + b formula. It gave us the line of best fit. So, now we’re going to write
out the gradient step. And this is gradient
mother f-ing descent. Okay, so
this is how it’s going to go down, okay? Here’s how it’s going to go down,
step_gradient. So, I’m just going to say, it’s time for
the magic, the magic, the greatest, the greatest, okay? So, that’s how excited I am,
I just wrote the greatest twice. Okay, [LAUGH]. So, given our current b and
m values, the points, and the learningRate. And this actually isn’t going to
help with that, so I’ll delete that. So, here’s our learningRate, okay? Let’s perform gradient descent. So, okay, what is gradient descent? Okay, so let me show you guys this. [SOUND]
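The gradient_descent_runner loop described above can be sketched as follows. The body of step_gradient here is an assumption based on the standard gradient descent update for mean squared error (move b and m against their gradients, scaled by the learning rate). The sample points, the learning rate of 0.05, and the 2,000 iterations are made-up values chosen so the toy example converges.

```python
import numpy as np


def step_gradient(b_current, m_current, points, learning_rate):
    """One gradient step: return updated (b, m) for the line y = m*x + b."""
    b_gradient = 0.0
    m_gradient = 0.0
    n = float(len(points))
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        residual = y - (m_current * x + b_current)
        # Partial derivatives of the mean squared error, summed over all points.
        b_gradient += -(2 / n) * residual
        m_gradient += -(2 / n) * x * residual
    # Assumed update rule: step downhill, scaled by the learning rate.
    new_b = b_current - learning_rate * b_gradient
    new_m = m_current - learning_rate * m_gradient
    return new_b, new_m


def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    """Repeat the gradient step num_iterations times and return the final (b, m)."""
    b = starting_b
    m = starting_m
    for _ in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return b, m


# Made-up points that lie exactly on y = 2x + 1, so the target is b = 1, m = 2.
points = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
b, m = gradient_descent_runner(points, 0.0, 0.0, learning_rate=0.05, num_iterations=2000)
print(b, m)  # should be very close to 1 and 2
```

Keeping the single step in its own function means the runner is just a loop, which matches the split between the two functions in the session.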
How best do I describe this? So, we have. [BLANK_AUDIO] Let me just show you this image. This is going to help a lot. [BLANK_AUDIO] Okay, so this is a graph. So, let’s just look at the graph,
I mean it’s the same graph. It’s looking at it from
two different angles. It’s the same graph, okay? So, let’s look at the one on the left,
just to pick one. It’s the same graph though. We have a bunch of y values,
sorry a bunch of b values, and a bunch of m values. And then we have that error, right? That error that I just talked about,
right? So, given the 2D grid of b and m, given
every single y-intercept we could have, given every single m
value we could have, what is the error? Okay, so for every y-intercept and
slope pair, what is the error? And, so we will find this is
a three dimensional graph. This is a three dimensional graph. Because the error value, it’s kind
of like, it starts up high, and then it approaches what’s called
the local minimum in our case. A local minimum, which is the smallest
point, that point at the very bottom, that is where
we’re trying to get to. Okay so. Given a set of y-intercepts,
and given a set of slopes. Possible y-intercepts and possible slopes, we want to compute
the error for those two things. And if we were to graph the relationship
between these three things, it would look like this. Now, it tends to always
look very similar to this. In more complex cases we’d have many
minima, we’d have many little valleys. But what we’re trying to do is get that
point, where the error is smallest. And, so how do we get that point
where the error is smallest? Well, we’re going to perform
what’s called gradient descent to get that smallest point. That value, smallest point. And a great analogy for this is a bowl. So, let me just search bowl, okay? It’s kind of like a bowl. It’s like we drop a ball into a bowl,
and we want to find that point, where the ball stops,
that endpoint, the lowest point. That b, m value is our optimal
line of best fit value. Okay? And the way we’re going to
get that is gradient descent. We’re going to descend, right?,
we’re descending down the bowl using the gradient, and
gradient is another word for slope. We’re going to descend down that bowl
until we get, through iteration, that lowest point. And gradient descent is used. Everywhere in machine learning. Okay? It is like the optimization method for
deep neural networks. It’s not that apparent right now. But know this. Know and understand gradient descent
like the back of your hands, because it is going to be very
useful in the future, okay? So. I don’t know why I’m
doing that equation. That was unnecessary. That was the equation for the sum of squared errors that we just
talked about, sum of squared distances. So, how are we going to
calculate that gradient descent? Well, now let’s actually do it. So, [BLANK_AUDIO] For our step gradient function, we’ll start off with an initial
gradient value for b and for m. So, b_gradient is going to be zero and m_gradient
is going to be zero as well, okay? These are the starting points for
our gradients and gradient means slope. And, so the gradient is going
to act like a compass, and it’s going to always point down hill,
so this is what I mean by, once we calculate that error,
it’s going to act as a compass for us. It’s going to tell us. Where we should be going? What direction we should be going? How we should next redraw our lines. So for- [BLANK_AUDIO] Okay, someone asked why is
the lowest point the best? The lowest point is the best, because
it is where our error is the smallest. And when our error is the smallest, that’s when we have
the line of best fit. When the error is smallest,
that b and m value, those two, what we plug into our slope equation, is
going to give us the line of best fit. So, that’s why we’re
calculating the error, okay? So. [BLANK_AUDIO] So, for i in range(0, len(points)). [BLANK_AUDIO] Okay, so what we’re going to do is we’re
going to iterate through every single point on our scatter plot. Okay, so every single data point that
we have, we’re going to collect it. Okay, so we’re going to say, okay,
what is, so for example, our first point, right? First point,
which gives us an x value and a y value. X value and y value. So, let me also write out
a little comment for this. Starting points for our gradients, okay? [BLANK_AUDIO] Now, we’re going to get the direction
with respect to b and m. Now, this is the last part, but
it’s a very, very important part. And this is where calculus
comes into play, okay? So, I’m going to talk about
how we’re doing this. Okay, so let me talk about
what we’re about to do. So, what we’re going to do, is so,
given for every single point, for every single point that we have, we’re going to calculate what’s
called the partial derivative, okay? It’s called the partial
derivative with respect to b and with respect to m, okay? And what that’s going to do, is it’s
going to give us a direction to go for both the b value and the m value, right? So, remember, in this graph,
we want a direction, right? We want to be going down the gradient. And, so on this left hand side
you see this gradient search. The m values and the b values are
increasing in the direction that they should be, because gradient descent
is essentially a search policy. It’s a search policy. We’re trying to find
that minimum error value. Okay? And what we’re going to do to get that, is we’re going to compute the partial
derivative with respect to b and m. Okay, let me show you the equation for
the partial derivative, okay? The partial derivative is
going to be right here. [BLANK_AUDIO] So, this is what the partial
derivative does. The partial derivative, we call it partial, because it’s not
telling us the whole story, right? We say, it’s partial, because we’re
calculating it for both b and m. There are two different derivatives. And, so
it’s going to give us the tangent line. So, it’s going to give us this
line as you see right here, right? See this line,
that line is our direction. And we’re going to use it to
update our b and m values. Okay? So, that’s what that is. And let me also show you the equation
for the partial derivative, because we’re about to write it out. So, here’s what the equation for the
partial derivative with respect to m and b looks like. Okay? They’re two different equations, right? So, let’s talk about the one on top. So, this little curvy thing
that you see up here, that just signifies that this
is a partial derivative. That’s that signifier that
this is a partial derivative. Now, we talked about sigma notation,
right?, because it’s a summation of values,
right? And that’s what we’re doing. We’re summing the partial derivative for
all of our points, okay? For all of them to compute
that gradient value, okay? And the partial derivative with respect
to m and b is going to look like this. So, let’s write this out, okay? So, the b gradient, so
it’s going to give us two values. So, the b gradient is
going to be plus equals. And then what was it? Let me look at the equation again. 2 over N, so
negative 2 over N, all right? Thanks, good vibes. And then it was y minus, right? And these are the equations,
they are laws. They are beautiful laws,
that always stay the same. And they give us a way of understanding the direction that we want to move in. Okay, so, b_current. Okay so. All right, so then we’ll do the same
thing, and what was the second equation? It looked pretty much the same,
except it doesn’t have this x, right? The second one doesn’t have this x,
right? So, we’ll say, but it does have this 2 over N. It does have this 2 over N,
and then it does have, let’s see, let’s have this x. It does have (y - ((m_current * x) + b_current)), okay? Okay, so now, we’ve computed
our partial derivatives, right? So, let me one more time show you guys. It’s giving us directions to go for
both b and m. And remember, they’re partial. It’s not telling us the whole story,
it’s telling us what direction should we go for b, and
what direction should we go for m? And it’s going to tell us the direction,
remember the bowl, to get to that bottom point, where that error is
the smallest right here, okay? So, right here where my mouse is,
that point is what we want to get to, and that’s what the partial
derivative is going to help us with. So, once we’ve computed
the partial derivatives, the sum of them with respect to b and m, now we’re going to update our b and
m values, right? So, we’re going to use that
to update our b and m values. And guess what? This is our last step. This is our last step using
this partial derivative. Using our partial derivatives,
right, plural? There are two of them. So, that’s going to give us a new
value for b and m, our updated b and m value. So, we have our current value for
b, whatever it is, that we fed into the step gradient function
that keeps updating every time. And this is where our learning_rate
comes into play, okay? This is why our learning rate is so
important, because it defines the rate at which we’re updating our b and
m values, right? So, remember that 0.0001, right? And then also our m_current, minus learning_rate times the m gradient. Okay, and
then it’ll return those values. And we’re doing this every time, right? This is new b, and new m,
they our final b and m. It’s a step function, where we’re
doing this every iteration, right? We’re doing this for
the number of iterations we had 1000. But it’s going to return a new b and
m value every time. And guess what guys? That’s it for our code. That was it, so
let’s go over what we’ve done. Okay, but actually let me check for
errors, right? [BLANK_AUDIO] Let me check for errors, and
then I’m going to answer more questions, because I really want to make sure you
guys understand how this works, okay? So, let me demo this. So, python... Oh, N
is not defined. Okay, right, guess what. I didn’t define N. N is the number of points. Length of points. Okay? So, let’s go. Learning rate is not defined. Where? Where is learning rate not defined? Learning rate is not defined. Wait a second. Yeah, right. Learning rate, right. Okay, what else is bad? I’ve got an overflow in double scalars. 14, y minus... Uh-huh, uh-huh, uh-huh. Okay, so. What’s going on here? Okay, let’s save this. So yeah, it printed out the final, okay
so it got our final value right here. And if we wanted to,
let’s see, hold on a second. If we wanted to,
we got our backup here just in case. So right? So let me blow this up. Like way, way up. Let me just separate it. So this is what our output’s
going to look like. Right. So boom! Just like that. That’s how fast it trained,
in milliseconds. Why? Because our data set is so small. Okay, the data set was so small. Alright, so. That’s what happened, and
after a thousand iterations, we got the optimal b and m values. So, right, we started out with b and
m at 0, and we calculated the error for our random line that we drew, and
it was huge. But, eventually, after running
gradient descent we got the optimal b, the optimal m, and
the lowest error point, which is at the smallest
point in the bowl. And to do that we used gradient
descent with respect to b and m. Okay, so let me go over one last time
every single thing that we’ve just done, just to really go over it, and then we’ll
do my last five minute Q&A, okay? So we started out by collecting
our data set, right. Our data set was a collection
of test scores and the amount of hours studied, right. The x y value the test scores and the amount of hours studied
a two variable data set. Then we define our type of parameters
for our linear regression. Our learning weight, which talks
about how fast we should learn, our initial BNM values for
the slope equation: y=mx+b. The number of iterations, 1,000,
because our data set is pretty small. And then we ran gradient descent. So, what did gradient descent look like? Well for every iteration, for a thousand iterations, we computed the
gradients with respect to both b and m. And we did that constantly,
until we got that optimal b and m value. That gives us that line of best fit. Now, how did we compute the gradients? To do that, we said, okay, we’ll have a starting point
of 0 for both of those gradients. Remember, gradient is just
another word for slope. And then we said, okay so for every single point in our scatter plot,
for our data, we’ll compute the partial derivative
with respect to both b and m. And those two values are going
to give us a direction, a sense of direction of
where we want to go. How do we get to that lowest
point in that bowl, right? That three dimensional graph,
that lowest point. And we used the learning rate to determine
how fast we want to update our b and m values; we got the difference
between the current value, and what we had before, and we return that. So for every point, we did that for
a thousand iterations, okay? And that’s what gave us the output and it looks like, visually,
it looks like this. Right? It’s like up, up, up, up, up,
up, up, up, up, up, up, up. It’s kind of like Wheel of Fortune,
right? It starts off fast, and it gets slower
and slower as it approaches convergence, the word we use when we have the optimal
line of best fit, convergence. See, let me do it one more time. Up, just like that, okay? So that was that, and now I’m going to screen share and
do a last five minute Q and A. Alright, stop screen share. Hi everybody, okay,
let me bring you guys back on screen, do my last five minute Q and
A, ask me anything and yeah. How’s it going everybody? Any questions? I’m open to questions. Where did I use NumPy? It’s at the very top. So, right, what’s the practical
use of linear regression? Great question. Any time we want to find
the relationship between two different variables. And then in more complex
cases there could be more. But, we want to prove mathematically. Right? Math is all about proving things
in a way that is unfalsifiable, that no one can say,
hey, that’s not true. Well I can prove it mathematically. So it’s a way to show the relationship
between two value pairs. So maybe housing prices,
and the time of year right? What is the real estate
market going to look like? Any time intuitively you think there
is a relationship, you can prove it with linear regression, but
really I did this to show gradient descent. That optimization process is very
popular in deep learning, and we’re going to use that in our deep
neural networks in the rest of the course, okay? And, why TensorFlow, because of Google? Because it is the best deep learning
library that is out there right now. That’s why. And, of course it would be,
because Google knows what they’re doing. They handle billions and
billions of queries every day. They have to be able to do
machine learning at scale. And, problems, they solve problems that
no one else has even thought of solving. And all of those solutions
are found in TensorFlow. For machine learning or
please think of the eye doctor. You can create a classifier to
classify between different types of disorder that you see in an x-ray. That’s going to augment doctors at
first, but eventually replace them. How about fitting a quadratic
curve instead of a straight line? We could do that as well. I’m going to provide the data set and
the code. I can talk slower, sure. How to find the optimal learning rate? That’s a great question. There are several methods of doing that,
but that’s great intuition. Sometimes we can use machine learning to
find the optimal hyper-parameters, so it’s kind of like machine learning for
machine learning, but we’ll talk about that later. This is the first course
where he uses calculus? I’ll do more of that in the future,
I’m going to keep doing calculus, okay? Two more questions then
we’re good to go, two more. How would you recommend me
to start machine learning? Watch this series. And watch my Learn Python for
Data Science series, watch my Intro to TensorFlow series, watch my
Machine Learning for Hackers series. Watch my videos. Why is your Udacity course so expensive? I didn’t decide the price, guys. I tried to get it low. It’s whatever. You get paid graders for that, okay? And grading is not cheap,
okay, human graders. But look, all the videos are going to be
released here on my channel all right. So I’m here for you guys, okay? I’m trying to grow my brand. I’m trying to grow myself,
Siraj Raval, okay? This is the end, okay? So that’s it for the questions. And all right, so for now, I’ve gotta shoot a scene for my next video. What? Yeah, so, thanks for watching. Love you guys. I’ll post the link in the comments
right when I’m done alright? The video description. I’ll post the GitHub link, and
then the data set, everything. So do go to the description
within the hour, okay? Bye! Okay.
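
The steps recapped in the session (sum the partial derivatives over every point, update b and m using the learning rate, repeat for 1,000 iterations) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the code typed in the video: the function names `compute_error`, `step_gradient`, and `gradient_descent_runner` are illustrative, and the data is a hypothetical synthetic stand-in for the hours-studied and test-score file, which is not reproduced here. The learning rate of 0.0001, the zero starting values, and the 1,000 iterations are the values mentioned in the session.

```python
import numpy as np


def compute_error(b, m, points):
    """Mean squared error of the line y = m*x + b over all points."""
    x, y = points[:, 0], points[:, 1]
    return float(np.mean((y - (m * x + b)) ** 2))


def step_gradient(b_current, m_current, points, learning_rate):
    """One gradient descent step: sum both partial derivatives, then update."""
    b_gradient = 0.0
    m_gradient = 0.0
    N = float(len(points))  # N is the number of points
    for x, y in points:
        residual = y - ((m_current * x) + b_current)
        b_gradient += -(2 / N) * residual       # partial derivative w.r.t. b
        m_gradient += -(2 / N) * x * residual   # partial derivative w.r.t. m
    # Move against the gradient, scaled by the learning rate.
    new_b = b_current - (learning_rate * b_gradient)
    new_m = m_current - (learning_rate * m_gradient)
    return new_b, new_m


def gradient_descent_runner(points, starting_b, starting_m,
                            learning_rate, num_iterations):
    """Repeat the step function for a fixed number of iterations."""
    b, m = starting_b, starting_m
    for _ in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return b, m


# Hypothetical stand-in data: 100 (x, y) pairs scattered around a line.
rng = np.random.default_rng(0)
x_values = rng.uniform(20, 70, size=100)
y_values = 1.5 * x_values + 5 + rng.normal(0, 5, size=100)
points = np.column_stack([x_values, y_values])

initial_error = compute_error(0.0, 0.0, points)
b, m = gradient_descent_runner(points, 0.0, 0.0, 0.0001, 1000)
final_error = compute_error(b, m, points)

print("Starting at b = 0, m = 0, error = {0}".format(initial_error))
print("After 1000 iterations b = {0}, m = {1}, error = {2}".format(b, m, final_error))
```

Each call to `step_gradient` returns a new pair of values, so the runner simply feeds the result of one step into the next.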

100 thoughts on “Linear Regression Machine Learning (tutorial)”

  • Hey Man!!.. You are doing a GRRRRRRRRRRRRRRRRRR888888888888888888888 JOBB!!!!!!!!…. All these Stuffs For Free……HATSS OFF!! I'm a CSE undergrad. from India & I'll praise u more on next comments :P…. Power to You Bro!!! btw… do you live in India??
    P.S. I haven't commented ever on Youtube… Its very Hard for any1 to get me commenting….. Not sayin that m a gr8 person or so.. But ur Gr8 Work made me do this… Awesome Man Awesome!!!!!

  • Dear Siraj,
    Why is it necessary to compute the partial derivate of the error function no. of points times?
    It has been used in step_gradient function.

  • Does the product moment correlation coefficient 'r' have to be greater than a certain value? In order to actually find the line of regression?

  • I don't understand the details, but it's nice to see the general 5-step format of creating Python code from linear regression. Thank you for showing how real-world data is used in programming, and for explaining how it's a smaller piece of building steps to the outcome, so eventually the outcome can just be used in machine and deep learning (in other videos). Thanks Siraj!

  • Hi Siraj, this was amazing, thank you!

    I have a doubt about partial derivatives. At 35:30, the formula doesn't include the summation of 2/N, right? (Since its before the Sigma notation)

    But at 37:42, you seem to have summed 2/N in both. So is there something I'm missing?
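
On the question above about where the 2/N factor sits: a constant factor can be written either in front of a sum or inside every term of it, and the result is the same. A short derivation sketch, assuming the error being minimized is the mean squared error over N points (the on-screen formulas are not reproduced in this transcript, so the exact notation is an assumption):

```latex
E(m, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)^2

\frac{\partial E}{\partial m}
  = \frac{2}{N} \sum_{i=1}^{N} -x_i \bigl(y_i - (m x_i + b)\bigr)
  = \sum_{i=1}^{N} -\frac{2}{N}\, x_i \bigl(y_i - (m x_i + b)\bigr)

\frac{\partial E}{\partial b}
  = \frac{2}{N} \sum_{i=1}^{N} -\bigl(y_i - (m x_i + b)\bigr)
  = \sum_{i=1}^{N} -\frac{2}{N} \bigl(y_i - (m x_i + b)\bigr)
```

The right-hand forms correspond to adding a term of -(2/N) times the residual on every pass through a loop over the points, which is what accumulating with `+=` does.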

  • i tried to run it on python3 and received a lot of syntax errors, someone can help me?
    apparently, its all about the print function, but i dont get it since i pick the code from here

  • Hi Siraj, lovin your videos so far. Very informative, direct to the point and fun to watch.

    I'd like to ask something though which I'm having hard time composing the right question so I tried to rephrase it 3x, I hope you get what I'm asking.
    1. What did we prove for finding the line of best fit for the data in the demo?
    2. I mean how would you explain the result of the training process?
    3. How do you explain the relationship of amount of hour study vs the test score?

    That is something that is not clear to me.

    Siraj you are the best man teaching ML, AI n stuff practically I have come across…Please do not stop at any point no matter what …You are inspiration for people trying to learn these things ..
    Hey siraj if you read this please reply comment !

  • I guess I must not know Python syntax well enough (I've used it, but am no expert at it yet). In the print 'starting gradient descent…' line, what are the {0}, {1}, {2}, and where are they coming from? I think they might be arguments going into the function, but I don't see any arguments in the run() function signature.
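
On the `{0}`, `{1}`, `{2}` question above: these are positional placeholders for Python's `str.format` method, and they are filled by the arguments passed to `.format(...)`, not by the arguments of `run()`. A minimal sketch, where the variable names and the error value are hypothetical and chosen only for illustration; it uses the `print(...)` function form, which is what Python 3 requires:

```python
initial_b = 0
initial_m = 0
error = 5565.1  # hypothetical number, only to have something to print

# {0} is replaced by the first argument of format(), {1} by the second, {2} by the third.
message = "Starting gradient descent at b = {0}, m = {1}, error = {2}".format(
    initial_b, initial_m, error
)
print(message)
```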


  • For some reason, the code seems a lot easier to read to me than the equation. Combined with the intuition offered, it's no problem. However, those print lines seem quite cryptic.

  • We should make some kind of simulation with a few dots on the graph, where you can move the line around by hand and watch the iterations of the sum equation with the totals on the side.

  • It might not be immediately obvious what that graph was showing. If we filled out the graph of the error based on the simulation, then it would be so intuitive that people could see it.

  • Hey Siraj you are really an inspiration.
    Can you please guide me how can we use deep belief networks for regression problems? As most of the examples given online are for classification-/mnist data sets ,

  • Nice course Siraj u explain hard topics fast and make it sound easy + u give a practical demo but to learn ML, I feel we use your course as a summary or a recap to a course by Andrew NG, i know its boring there but that should be a pace to learn something new. In here every second of have ton of info

  • Siraj , First of all this is dope, Amazed how you do it ?and What all things you have gone through for it .Finally, I am really great fan of your work man keep making such great content.

  • I made my own linear regression algorithm, and found that just calculating the partial derivative by the slope between two very close points worked well. I'm not sure exactly what your equation does, but doesn't it use the same logic? After all, we don't have the literal equation to take the derivative of.

  • Hi Siraj, for linear regression what will happen if we pass on the initial m amd b values from OLS regression and then apply gradient descent keeping the learning rate as in the example, aklso how do we narrow in on the value of learning rate and number of iterations

  • How accurate is this? I run the same data through excel and this what I got, y = 1.322x+7.991. but in the code you demonstrated, those values are varying so much. How can b value vary so much? I something I'm missing here? someone, please help me to understand this.
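
One possible explanation for a mismatch like the one described above is that 1,000 iterations at a learning rate of 0.0001 may not be enough for the intercept b to settle, even when the slope m is already close. The sketch below compares gradient descent against a closed-form least squares fit from `np.polyfit`. The data is a hypothetical noise-free line, not the data set from the video, and the helper name `gradient_descent` is an assumption.

```python
import numpy as np

# Hypothetical noise-free data on the line y = 1.5x + 5.
x = np.linspace(20, 70, 50)
y = 1.5 * x + 5

# Closed-form least squares; for degree 1, np.polyfit returns [slope, intercept].
slope_ols, intercept_ols = np.polyfit(x, y, 1)


def gradient_descent(x, y, learning_rate, num_iterations):
    """Plain gradient descent on mean squared error, starting from b = 0, m = 0."""
    b, m = 0.0, 0.0
    n = len(x)
    for _ in range(num_iterations):
        residual = y - (m * x + b)
        b_gradient = -(2 / n) * np.sum(residual)
        m_gradient = -(2 / n) * np.sum(x * residual)
        b -= learning_rate * b_gradient
        m -= learning_rate * m_gradient
    return b, m


b_gd, m_gd = gradient_descent(x, y, 0.0001, 1000)

print("closed form:      m = {0}, b = {1}".format(slope_ols, intercept_ols))
print("gradient descent: m = {0}, b = {1}".format(m_gd, b_gd))
```

On this toy data the slope from gradient descent lands near the closed-form slope, while the intercept is still far from it after 1,000 iterations; running many more iterations moves it closer.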

  • Hello to you Siraj, I appreciate enormously what you do, I would like to know the knowledge to acquire to be able to follow a course on deep learning

  • Thanks siraj for every thing. Could you provide some more insights to how to determine correct learning rate & Number of iterations. that will really helpful & you are awesome.

  • This is definitely the right implementation of gradient descent. However, you didn't include any vectorization in your implementation, which is crucial for optimal numerical calculations. Therefore, I don't agree with this being "the right way" of doing linear regression with gradient descent.

    Thanks for taking the time to show another way of doing gradient descent in python. The video is ok.

    P.S.: I didn't see the "perfect analogy." I just saw an average explanation. But is good to see that you love yourself, lol. Very respectable. Regards.

  • I am working on regression and tried to code your lecture example. but i found the following error, i couldn't solve it. C:Python36pro>
    File "", line 43

    SyntaxError: unexpected EOF while parsing , Nothing is written on line 43, my code ends on line 42

  • Why is it ok to literally do a video based on Matt Nedrich's article ? It's clearly obvious you were looking at his article on your second monitor as you were typing the code and when you talked.

  • How do we visualize the scenario where the problem has more than one feature for example in the problem you stated, the number of features mentioned is 1,i.e. the number of hours a student studies with the corresponding y values(the marks scored by the student). What if we add another feature say #Number_of_books_referred, how do we visualize this scenario?
    So now value of y will be depended on #Hours_Studied and #Number_of_books_referred. Can we still visualize this in 2d?

  • ##this is data of dataframe
    ## 2104 399900
    ##0 1600 329900
    ##1 2400 369000
    ##2 1416 232000
    ##3 3000 539900
    ##4 1985 299900
    ##5 1534 314900
    ##6 1427 198999
    ##7 1380 212000
    ##8 1494 242500
    ##9 1940 239999
    ##10 2000 347000
    ##11 1890 329999
    ##12 4478 699900
    ##13 1268 259900
    ##14 2300 449900
    ##15 1320 299900
    ##16 1236 199900
    ##17 2609 499998
    ##18 3031 599000
    ##19 1767 252900
    ##20 1888 255000
    ##21 1604 242900
    ##22 1962 259900
    ##23 3890 573900
    ##24 1100 249900
    ##25 1458 464500
    ##26 2526 469000
    ##27 2200 475000
    ##28 2637 299900
    ##29 1839 349900
    ##30 1000 169900
    ##31 2040 314900
    ##32 3137 579900
    ##33 1811 285900
    ##34 1437 249900
    ##35 1239 229900
    ##36 2132 345000
    ##37 4215 549000
    ##38 2162 287000
    ##39 1664 368500
    ##40 2238 329900
    ##41 2567 314000
    ##42 1200 299000
    ##43 852 179900
    ##44 1852 299900
    ##45 1203 239500

    import numpy as np
    import pandas as pd

    print("length of data {0}".format(m))
    def error_calc(df,slope,intercept):
    for i in range(0,m):
    return (err/float((2*m)))
    def gradient(df,init_slope,init_intercept,learningrate):
    for i in range(0,m):
    return [new_slope,new_intercept]

    def run_gradient(df,init_slope,init_intercept,learningrate,n):
    for i in range(0,n):
    print("error = ",err)
    return [slope,intercept]

    def run():
    while more_run>0:
    n=int(input("enter the number of iterations"))
    print("initial values slope= {0}, gradient={1},error={2}".format(slope,intercept,error_calc(df,slope,intercept)))
    print('final values slope= {0}, gradient={1},error={2}'.format(slope,intercept,error_calc(df,slope,intercept)))
    while predict >0:
    predict=int(input("enter greater if want to predict"))
    size_house=int(input("enter the size"))
    print("price of house of size = {0} will be = {1}".format(size_house,pred_price))
    more_run=int(input("enter greater than 0 for moreiteration"))
    if __name__=='__main__':

    i think i have made some mistake ,and not able to get correct result. would you please have a look at it,that will be very helpfull.

    thanks in advance.

  • The negative sign is missing at line 45 for m_gradient which is why the error is going to inf. Both the partial equations have negative signs in front of them.

  • Dude you are awesome mad impulsive person. You are totally in depth with what ever you are doing on that moment. I love that.

  • I wrote a blog on understanding linear regression using just SymPy, a symbolic mathematics library in python. It was shared on Hacker News, where it garnered a fair bit of attention. Here is a link to the blog:

    Please leave your comments on the page via Disqus.

  • I will be really thankful if u can make few videos on realistic data and few kaggle problems….. Because iris and other stuff have only limited things and features. PLEASE please PLEASE

  • The error function is being used just to print in this case?

    I expected to multiply it, by the learning rate, then by the gradient to update weights.

  • The chance of you responding to this comment = the chance of a random person /7billions of existing human check the infos in the 2nd and above Google search result tabs.

  • Hi Siraj I am new to the world of computer science.
    i just wanted to ask u that instead of using the algorithm of gradient descent i.e doing the partial derivative of error function w.r.t B and M, instead can't we store error value generated at every B and M value and than sort out point of minimum error and hence get optimum B and M value.

  • I came here after the first two weeks of Andrew Ng's machine learning course. It's soo cool to see something you have been learning about for two weeks to happen. Can't wait to implement it myself with my own data sets.

  • when i set initial b and m equal to 1 each, the final b and m are different . why does it happen?

  • at 38:50 when we calculate the gradient: why do we subtract it from our current_b and current_m? I mean why are we not adding for example?
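
On the question above about subtracting rather than adding: the gradient points in the direction in which the error increases, so stepping against it lowers the error. A minimal sketch on a toy one-parameter error function (the function, starting point, and learning rate are all assumptions for illustration, not values from the video):

```python
def error(m):
    """Toy error function with its minimum at m = 3."""
    return (m - 3.0) ** 2


def gradient(m):
    """Derivative of the toy error function with respect to m."""
    return 2.0 * (m - 3.0)


m = 0.0
learning_rate = 0.1

minus_step = m - learning_rate * gradient(m)  # step against the gradient
plus_step = m + learning_rate * gradient(m)   # step along the gradient

print("error at start:        ", error(m))
print("error after subtracting:", error(minus_step))
print("error after adding:     ", error(plus_step))
```

Subtracting moves m toward the minimum and the error drops; adding moves m away from it and the error grows.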

  • It was a great tutorial by you siraj…very helpful.
    I am getting this error.can someone kindly help?……' index 2 is out of bounds for axis 1 with size 2'.

  • as Fabian Becker said,
    Please don't say you've found the optimal m/b. You could be hitting local minima or simply not have done enough iterations. Gradient descent is very vulnerable to fitness landscapes that are non-linear.
    is there any algo that can help me reaching the global optima

  • what is the industry method of implementing linear regression ? or else every time we have code like above to realize the linear regression?

  • I know it's been a long time since this video was uploaded, but i really want to know something.

    If the correlation of the data is negative (negative m, decreasing slope), should this still work?
    I tried it and my slope isn't fitting the data points at all, my m value is positive, it's like the slope was inverted.

    Sorry if I couldn't express myself, English is not my first language.
    By the way, great video, learned a lot in 40 minutes.

  • This is amazing, that Siraj for great tutorial
    I've one question, after calculation gradient des, and mean square error, I plotted m, y, and cost by matplotlib, it worked but is there any better library to plot graphs. MAtplotlib is not the easiest one to understand from its documentation.

  • Hey Siraj, I know this video is quite old, but I have a question. How did you solve this error:

    RuntimeWarning: overflow encountered in double_scalars
    totalError += (y - (m * x + b))**2

    In the video it occurs at about 41:11.
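
An overflow warning like the one quoted above usually means b and m are growing without bound instead of settling. Two plausible causes, offered as assumptions rather than a diagnosis of any particular script, are a flipped sign in the update and a learning rate that is too large for the scale of the data. The sketch below illustrates the second cause on hypothetical data; the helper name `run_gradient_descent` is an assumption.

```python
import numpy as np

# Hypothetical noise-free data on the line y = 1.5x + 5.
x = np.linspace(20, 70, 50)
y = 1.5 * x + 5


def run_gradient_descent(x, y, learning_rate, num_iterations=1000):
    """Return (b, m, final mean squared error) after plain gradient descent."""
    b, m = 0.0, 0.0
    n = len(x)
    # Silence NumPy's overflow/invalid warnings so the divergence shows up in the result.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(num_iterations):
            residual = y - (m * x + b)
            b_gradient = -(2 / n) * np.sum(residual)
            m_gradient = -(2 / n) * np.sum(x * residual)
            b -= learning_rate * b_gradient
            m -= learning_rate * m_gradient
        final_error = np.mean((y - (m * x + b)) ** 2)
    return b, m, final_error


_, _, small_rate_error = run_gradient_descent(x, y, 0.0001)
_, _, large_rate_error = run_gradient_descent(x, y, 0.01)

print("learning rate 0.0001 -> error", small_rate_error)
print("learning rate 0.01   -> error", large_rate_error)
```

With the smaller rate the error stays finite; with the larger one every step overshoots by more than the last, and the values end up as `inf` or `nan`.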

  • yeahhh… u look cute, i like to watch your video but (apologize) your hand movements makes it irritating could u please…… no offense……

  • I m getting runtime warning : overflow encountered in double_scalars for calculating new_b and new_m and values are giving nan value as output… can you help me with this ?

  • at 41:21, was ran to get the output. original python file is showing error as infinity with overflow encountered in double_scalar. How to solve the runtime warning and get correct error, otherthan inf value ?

  • Check out my implementation of gradient descent in python for multivariate as well as univariate linear regression. Please star the repository if you like it.
    Kudos to Siraj Sir for giving me inspiration to extend this optimisation algorithm to work with multiple features in your data.

  • Isn't "error rate" 112 kind of high? I thought the error rate was supposed to be as small as possible?

  • Siraj, I like your video…
    but the gradient and the Y intercept do not make sense when I tried plotting them on graph

  • This was a really helpful video!! Knowing how to implement gradient descent from scratch is one of the most fundamental things for neural nets.

  • Why do we use gradient descent to get the for minimum error cant we just store the errors and the values and find where the error is minimum ?

  • Hey Siraj, I have a question regarding the formula you used for calculating least squares (I doubt this question will be answered, but i'll give it a go):
    Question :
    Why are you squaring the difference when you could've just taken the absolute value of each consecutive term? Is the use of squaring more prevalent due to the fact it makes distinguishing outliers obvious (since if the squared result is above a certain threshold, it shall be considered massive)?

    Also, is this squaring and not taking absolute value related to variance and standard deviation, because the same concept holds true there (though formula might be different, but it's quite similar if you think about it)?

    Would love to hear back from you, however hard it may be.

    Qasim Wani

  • In the coursera course of Andrew Ng in calculating the gradient descent there is no minus sign is taken and here there is a neg sign of the partial I can't get it why the neg sign is needed ,the gradient itself would be negative..please clarify

  • Thank you Siraj for helping me understand Linear regression. I have question for Siraj and everyone. please do well to answer me. thanks in advance.
    I want to perform a logistic regression. I was asked to use state and political party and vote gotten as my independent variable and make a prediction whether a political party wins or loses. I have 36 states in my country and i want to use 3 dominant parties i want to use as a case study. my problem is how the layout of these data will be; I am unable to resolve party been in a separate column unless I take one political party and take one state and do the prediction explicitly and then move on to another.
    Please i really needs you guys help to resolve these issue. Thanks in advance.
