Lecture 17 – Three Learning Principles


ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we talked about radial basis
functions, and the functional form of the hypothesis in that model
is the superposition of a bunch of Gaussians, centered around mu_k. And we had two models, or two versions
of that model, one of them where the centers are fewer than the number of
data points, which is the most common one, in which case we need to come up
with the value of the centers, mu_k, and learn the values of w_k. And it turned out to be a very simple
algorithm in that case, where you use unsupervised learning to get the mu_k’s,
the centers, by clustering the input points without reference to
the label that they have. And after you do that, it becomes a very
simple linear model where you get the w_k’s, the parameters, using
the usual pseudo-inverse. And in the other case, where we used
as many centers as there are data points, and the centers were
the data points, there was obviously no first step. And in that case, in order to get the
w_k, we actually used the real inverse rather than the pseudo-inverse.
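As an aside– this is an illustration added here, not code from the lecture– a minimal sketch of that two-step procedure might look as follows, assuming Lloyd's algorithm for the unsupervised clustering step and a hand-picked Gaussian width gamma:

import numpy as np

def train_rbf(X, y, K, gamma=1.0, n_iters=20, seed=0):
    # Step 1 (unsupervised): choose K centers mu_k by Lloyd's algorithm,
    # clustering the inputs without ever looking at the labels y.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        for k in range(K):
            if np.any(nearest == k):
                mu[k] = X[nearest == k].mean(axis=0)
    # Step 2 (supervised): with the centers fixed, the model is linear in w,
    # so solve for the weights with the usual pseudo-inverse.
    Phi = np.exp(-gamma * np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) ** 2)
    Phi = np.hstack([np.ones((len(X), 1)), Phi])   # bias term
    w = np.linalg.pinv(Phi) @ y
    return mu, w

def predict_rbf(X, mu, w, gamma=1.0):
    Phi = np.exp(-gamma * np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) ** 2)
    return np.hstack([np.ones((len(X), 1)), Phi]) @ w
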
One of the interests of radial basis functions– they are very popular functions to use in machine learning,
but one of the most important features about them is how they relate to so
many aspects of machine learning. So I’d like to go through this, because
it’s actually very instructive and it puts together some of
the notions we had. So let me magnify this a bit. Radial basis functions have this as
the building block, the Gaussian, and they are related to
nearest neighbor. In the case of nearest neighbor, you
have a data point, one of your points in the training set, and it influences
the region around it. So everything in the region around it in
the input space inherits the label of that point, until you get to a point
which is closer to another data point, and then you switch to that point. So you can think now of RBF
as a soft version of that. The point affects the points around
it, but it’s not black and white. It’s not full effect and
then zero effect. It’s gradually diminishing effect. It’s also related to neural networks,
thinking of this as the activation in the hidden layer, as we saw last time. And the activation for the
neural networks in the hidden layer was a sigmoid. And the main conceptual difference
between the two in this case is that this is local. It takes care of one
region of the space at a time, whereas this is global. That thing affects points regardless
of the value of the signal, and you get the effect of a function by getting
the differences between these different sigmoids. Then we had the relationship to SVM,
which is very easy because in the case of SVM, we had
an outright RBF kernel. So there was simply a very easy way to
compare them because they use the same kernel, except that there were
many interesting differences. For example, when we use the RBF, we
cluster the points, we determine the centers according to an unsupervised
learning criterion. And in the case of SVM, the centers,
if you’re going to call them that, happen to be the support vectors in
which the output is very much consulted in deciding what these
support vectors are. And the support vectors happen to be
around the separating boundary, whereas the centers here happen to be
all over the input space, in order to represent different clusters
of the inputs. The two remaining relations as far as
RBF are concerned are regularization and unsupervised learning. Unsupervised learning is easy, because
that is the utility we had in order to cluster the points and
find the centers. So you look at the points, and then
you try to find the representative center for them such that when you put
a radial basis function around that point, it captures the contribution of
those points, and then more or less dies out, or at least is not as
effective when it goes far away, and this is another center
that does the same. The interesting aspect was
regularization because, at face value, it seems a completely
different concept. RBF is a model. Regularization
is a method that we apply on top of any model. But it turns out that RBF’s were derived
in the first place in function approximation using just a consideration
of regularization. So you have a bunch of points, you want
to interpolate and extrapolate them, and you don’t want the
curve to be too wiggly. So you capture a smoothness criterion
using a function of derivatives, and then when you solve for them, you find
that the interpolation is done by Gaussians, which gives you the RBF’s. So this is what this model does. Today, we’re going to switch gears
completely and in a very pleasant way. If you think about it, we have gone
through lots of math, and lots of algorithms, and lots of homework, and
all of that, and I think we paid our dues and we earned the ability to
do some philosophy, if you will. So we’re going to look at learning
principles without a very strong appeal to math, because we have a very strong math
foundation to stand on already. And we’ll try to understand the concepts,
and relate these concepts as they appear in machine learning, because
they also appear in other fields in science in general, and
they are fascinating concepts in their own right. And when we put them in the context of
machine learning, they assume a real meaning and a real understanding that
will help us understand the principles in general. So the three principles, the usual
label for them is Occam’s razor, sampling bias, and data snooping. And you may be familiar with some of
them, and we have already alluded to data snooping in one of the lectures. And if you look at them, Occam’s
razor relates to the model. Both of these guys relate to the data. One of them has to do with collecting
the data, and the other one has to do with handling the data. And we’ll take them one at a time, and
see what they are about and how they apply to machine learning and so on. So let’s start with Occam’s razor. There is a recurring theme in machine
learning, and in science, and in life in general that less is more. Simpler is better, and so on. And there are so many manifestations of
that, and I just chose one of the most famous quotes. I put “quote”
between quotes because it’s not really a quote. He didn’t say that in so many words, but
at least, that’s what people keep quoting Einstein as saying. And it says that an explanation of
the data– so you are running an experiment, you collect the data,
and you want to make an explanation of the data. The explanation could be E equals
M C squared, or something else. So you are trying to find an explanation
of the data, and here is a condition about what
the explanation should be like. It should be as simple as possible,
but no simpler. Very wise words. As simple as possible, that’s
the Occam’s razor part. No simpler, because now you
are violating the data. You have to be able to
explain the data. So this is the rule. And that quote, in one manifestation or
another, has occurred in history. Isaac Newton has something that is
similar, and there are a bunch of others, but I'm going to quote the one that
survived the test of time, which is Occam’s razor. So let’s first explain
what the razor is. Well, a razor is this. You have to write “Occam” on it in
order to become Occam’s razor! And the idea here is symbolic. So the notion of the razor
is the following. You have an explanation of the
data, and you have your razor. So what you do, you keep trimming the
explanation to the bare minimum that is still consistent with the data, and
when you arrive at that, then you have the best possible explanation. And it’s attributed to William of Occam
in the 14th century, so it goes back quite a bit. What we would like to do, we’d like
to state the principle of Occam’s razor, and then zoom in, in order
to make it concrete. Rather than just a nice thing to have,
we’d like to really understand what is going on. So let’s look at the statement. The statement, in English, not in
mathematics, says that the simplest model that fits the data
is also the most plausible. And we put it in a box, because
it’s important. So, first thing to realize about this
statement is that it is neither precise nor self-evident. It’s not precise, because I really
don’t know what simplest means. We need to pin that down. Right? I know that the simplest model is nice,
but I’m saying something more than just nice. I’m saying it’s most plausible. It is the most likely to be true
for explaining the data. That is a statement, and you actually
need to argue why this is true. It’s not wishful thinking that we just
use the simple, and things will be fine. There is something said here. So there are two questions to answer,
in order to make this concrete. The two questions are, the first one
is, what does it mean for a model to be simple? It turns out to be a complex question,
but we will see that it’s actually manageable in very concrete terms. The second question is, how do we
know that this is the case? How do we know that simpler is better,
in terms of performance? So we’ll take one question
at a time, and address it. First question, simple
means exactly what? Now, you look at the literature and
complexity is all over the place. It's a very appealing concept with a very
big variety of definitions, but the definitions basically belong
to two categories. When you measure the complexity,
there are basically two types of measures of complexity. And my goal here is to be able to
convince you that they actually are talking about more or less the same
thing, in spite of being inherently different conceptually. The first one is a complexity of
an object, in our case, a hypothesis h or the final hypothesis g. That is one object, and we can say that
this is a complex hypothesis or a simple hypothesis. The other set of definitions
have to do with the complexity of a set of objects. In our case, the hypothesis set. We say
that this is a complex hypothesis set, complex model, and so on. And we did have concretely a measure of
complexity of small h and a measure of complexity of big H, and if you
remember, we actually used the same symbol for them. It was Omega. Omega here was the penalty for
model complexity when we did the VC analysis, and Omega here was
the regularization term. This is the one we add in the augmented
error, in order to capture the complexity of what we end up with. So we already have a feel that there is
some kind of correspondence, and if you look at the different definitions
outside, there are many definitions of the complexity of an object, and
I’m going to give you two from different worlds. One of them is MDL, stands for
Minimum Description Length. And the other one, which is simple,
is the order of a polynomial. Let me take the minimum
description length. So the idea is that I give you an object
and you try to specify the object, and you try to specify it
with as few bits as possible. The fewer the bits you can get
away with, the simpler the object in your mind. So the measure of complexity here is
how few bits can I get away with, in specifying that object? And let’s take just an example, in order
to be able to relate to that. Let’s say I’m looking at an integer that
happens to be a million digits, a million decimal digits. Huge numbers, any numbers. Now, I’m trying to find the complexity
of individual numbers of that length. There will be different complexities. So let me give you one number which is,
let’s say, 10 to the million minus 1, in order to make
it a million digits. So let’s say 10 to the
million minus 1. Now, 10 to the million minus 1 is
9999…9, the digit 9 repeated a million times, right? In spite of the fact that this is
a million in length, it is a simple object because you were able to
describe it as “10 to the million minus 1”. That is not a very long
description, right? And therefore, because you managed to
get a short description, the object is simple in your mind. This is very much related to Kolmogorov complexity. The only difference between Kolmogorov
complexity and minimum description length is that minimum
description length is more friendly. It doesn’t depend on computability
and other issues. But this is the notion. And you can see that when we describe
the complexity of an object, that complexity is an intrinsic
property of the object. Order of a polynomial is
simpler to understand. I tell you there is a 17th-order
polynomial versus a 100th-order polynomial, and you already can see that
the object is more complex when you have a higher order. And indeed, this was our definition of
the complexity of the target, if you recall, when we were running the
experiments of deterministic noise. In that case, we needed to generate
target functions of different complexity, and the way we did it, we
just increased the order of the polynomial as our measure of the
complexity of that object. Now we come to the complexity
of a class of objects. Well, there are notions running around
that actually define that, and I’m going to quote two of
them, very famous. The entropy is one, and the one
we are most familiar with, which is the VC dimension. Now, these guys apply
to a set of objects. For example, the entropy. You run an experiment, you consider
all possible outcomes of the experiment, the probabilities that go
with them, and you find one collective function that captures
the probability, sum of p logarithm of 1 over p, and that becomes your entropy and
that describes the disorder, the complexity, whatever you want, of the
class of objects, each outcome being one object. In the case of the VC dimension, it
applies directly to the notion we are most familiar with. It applies to a hypothesis set, and it
looks at the hypothesis set as a whole, and produces one number that describes
the diversity of that hypothesis set. And the diversity in that case
we measure as the complexity. So if you look at one object from that
set, and you look at this measure of complexity, now that measure of
complexity is extrinsic with respect to that object. It depends on what other guys
belong to the same category. That’s how I measure the complexity of
it, whereas in the first one, I didn’t want to be a member of anything. I just looked at that object, and tried
to find an intrinsic property of that object that captures the complexity. So these are the two categories
you will find in the literature. Now, when we think of simple as far as
Occam’s razor, as far as different quotes are concerned, we are thinking
of a single object. I tell you E equals M C squared, or I looked
at the board, P V equals n R T, and that is a simple statement. You don’t look at what other
alternatives were there to explain the data. You just look at that object
intrinsically, and that is what you think of as the measure of complexity. When you do the math in order to prove
Occam’s razor in one version or another, the complexity you are using is
actually the complexity of the set of objects. And we have seen that already. We looked at the VC dimension, for
example, in order to prove something of an Occam’s nature in this course
already, and that captured the complexity of a set of objects. So this is a little bit worrying,
because the intuitive concept is one thing, and the mathematical
proofs deal with another. But the good news is that the complexity
of an object and the complexity of a set of objects, as we
described in this slide, are very much related, almost identical. And here is the link between them: counting. Couldn’t be simpler. Here is the idea. Let’s say we are using the minimum
description length, which is very popular and versatile. So it takes l bits to specify
a particular object, h. I’m taking the objects here to be h,
because I’m in machine learning. The objects are hypotheses,
so I use that. Now, the measure of complexity in this
term is that the complexity of this fellow is l bits, because
that is my definition. Now, this implies something. This implies that if I look at all the
guys that are similar to this object in terms of complexity, they also happen
to have l bits worth of minimum description. How many of them are there? Well, 2^l, right? And now you can look at the set of all
similar objects, and you call it H, and you have one of 2^l as
the description of an object here, and you can take the “1 of 2^l” as
the description of the complexity of that set. So now we are establishing
something in our mind. Something is being complex in its
own right, when it’s one of many. Something is simple in its own
right, when it’s one of few. That is the link that makes us able
to use this side for the proofs, and make a claim on this side. It is not an exact correspondence,
but it is an overwhelmingly valid correspondence.
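In symbols, the counting link just described is simply (with \ell the description length used above):

\[
\text{complexity}(h) = \ell \text{ bits}
\;\Longleftrightarrow\;
h \text{ is one of } |\mathcal{H}| = 2^{\ell} \text{ objects of that description length}
\;\Longleftrightarrow\;
\text{complexity}(\mathcal{H}) = \log_2 |\mathcal{H}| = \ell .
\]
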
Now these are with bits, and I can pin it down exactly. How about real-valued parameters? Let's look at our 17th-order polynomial. You can look at a 17th-order polynomial,
and you can see that because it’s 17th order, it goes up
and down and up and down, and that looks complex. But also, because if it’s a 17th order
polynomial, it’s one of many, in the realm of infinity in this case, because
having 17 parameters to choose makes me able to choose
a whole bunch of guys that belong to the same category. So the class of 17th-order polynomials
is big, and therefore, it’s not only that the individual is complex,
the set is also complex. There are exceptions to this rule,
and one notable exception was a deliberate exception. And we wanted something that looks
complex, so that it does our job of fitting, but is one of few. And therefore, we are not going to pay
the full price for it being complex, and that was our good old friend SVM. Remember this fellow? This looks complex all right, but it’s
actually not really complex because it’s defined only by very
few support vectors. And therefore in spite of the fact that
it looks complex, it’s really one of few, and that is what we achieve
by the support vector machines. Now, let us take this in our mind,
that we are going to use the complexity of an object as the same as
the complexity of the set of objects that the object naturally belongs to,
and we will see some ramifications. So now I’m going to give you the
first puzzle of the lecture. There are 5 puzzles in this lecture,
so you need to pay attention, and each puzzle makes a point. And the first one has to do
with this complexity, so let’s look at the puzzle. The puzzle has to do with a football
oracle, someone who can predict football games perfectly. You watch Monday night football, you
want to know the result, and something happens Monday morning. You get a letter in the mail. You open the letter. Hi. Today, the home team will win.
Or, the home team will lose. You don’t make much of it, just
some character sent something. It’s not a big deal. You watch the game, and
it’s a good call. OK, interesting. 50%, lucky. Next Monday, another letter,
another prediction. And the funny thing is that he predicted
either the home team will win or not, and it was very long odds. Everybody thought the
other way around. And at the end of the game, the guy was
right, and the guy was right for 5 weeks in a row. Now you are really very curious, and you
are eagerly waiting in the 6th week in the morning of Monday
to see where the letter is. You have a perfect record. Now comes the letter. The letter says: you want
more predictions? Pay me $50. Very simple question: Should you pay? The question is easily answered, because
nowadays the scams are so many that the default is, I just don't
look at anything. But there must be something to it, and I really want to pin down what is
it, because that is the message we are carrying out. So the idea here is that no, you
shouldn’t, and the guy is really not predicting anything. And the reason for that
is the following. He’s not sending letters to you only. He’s sending letters to 32 people. In the first game, for half of them, he
said that the home team will lose, and for the other half, he said the
home team will win. Now, because he did that, he is sure
that some of the guys will get the correct answer. So the game is played, and
the home team loses. So in the second week, he goes for the
guys where he was right, and sends half of them that the home team will
lose, and the other half, the home team will win. Now, he had plans to send the other guys
as well something similar, except that it’s hopeless now because he
already lost with them, so they're not going to pay him the $50. So, just for the record, this is
what would have been sent. There are no letters sent here, but he
would have gone zero one, zero one. And he waits for the game, and
out comes: the home team won. So you can see who he’s going to
send letters to now, right? The other guys are a lost cause. This would have been sent
to them, but that’s OK. And he waits, and what
happens this time? The home team lost. And therefore, here is
your next letter. Home team won. Here is
your next letter. Only two people are surviving
from this thing. And here is the result,
the home team won. Now at that point, the guy
sent how many letters? 32 plus 16 plus 8 plus 4 plus 2 plus 1,
so about 64– 63 to be exact. The postage on that, writing the letter,
he probably spent $30 on that. And he’s charging you, the lucky
guy out of the 32, $50. That’s a money making proposition. Very nice, and it’s understood
and illegal, by the way! But the interesting thing here is to
understand, why is this related to what we’ve just talked about? You thought the prediction ability
was great because you only saw your letters. There is one hypothesis, and
it got it right perfectly. The problem is that actually, the
hypothesis set is very complex, and therefore the prediction
value is meaningless. You just didn’t know. You didn’t see the hypothesis set. So now we understand what is the
complexity of an object. Now we go to the question,
why is simpler better? So the first thing to understand is that
we are not saying that simpler is more elegant. Simpler is more elegant, but this is
not the statement of Occam’s razor. Occam’s razor is stating that simpler
will have better out-of-sample performance. That’s a concrete statement. In all honesty, if Occam said that you
take the more complex guy and it will give you better out-of-sample
error, I will take the more complex one, thank you. I am after performance. I’m not after elegance here. It’s nice that the elegant guy happens
also to be better, but we need to establish that it is actually better. And there is a basic argument. It
manifests itself in many ways, and we have already run one in this
course during the theory. And you put some assumptions, and
there’s a formal proof under idealized conditions of the following. Instead of going through any formal
proofs– there is quite a variety of them– I am extracting the crux of the proof. What is the point being made? And I'm going to relate it to the
proof that we ourselves ran. So here are the high-level steps. There are fewer simple hypotheses
than complex ones. That is what we established from
the definition of complexity. And in our case, that was captured
by the growth function. You probably have forgotten
what this is, long ago. This was taking N points, finding what
your hypothesis set can generate in terms of different patterns on those
N points, which we call dichotomies. So if it can generate everything like
the postal guy, then it’s a huge hypothesis set. If it can generate few of them, then
it’s a simple hypothesis, and it’s measured by that growth function, and
that resulted in the VC dimension. Remember all of that? So now, fine. Fewer simple hypotheses
than complex ones. OK, then what? The next thing is because there are
fewer ones, it is less likely to fit a given data set. That is, you have N points, and
you’re going to generate labels. Let’s say you generate them at random,
and you ask yourself, what are the chances that my hypothesis
set will fit? Well, if it has few of those guys,
obviously that goes down, and the probability, if you take it uniformly,
simply would be the growth function divided by 2^N. If my growth function is polynomial, then
very quickly, the probability of fitting a given data
set is very small.
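In the notation of the course, with m_H(N) the growth function, that statement reads:

\[
\mathbb{P}\big[\text{some } h \in \mathcal{H} \text{ fits } N \text{ random } \pm 1 \text{ labels perfectly}\big]
\;\le\; \frac{m_{\mathcal{H}}(N)}{2^{N}},
\]

which goes to zero quickly when m_H(N) is polynomial in N.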
OK, fine, I can buy that. So now that's nice, but you want to convince me now that simpler is better in fit. Here, you told me that I cannot fit. So what is the point? The punchline in all of those is that if
something is less likely, then when it does happen, it’s more significant. And there are many manifestations of
this, even when you define the entropy that I alluded to. A probability of an event is p. What is the information associated
with that particular point? The smaller the probability, the bigger
the information, the bigger the surprise when it happens. And indeed, you define the term
as being logarithm 1 over p. So if p is very small, tons
of bits of information. If something half the time will happen,
half the time will not happen, it's just 1 bit. It's not a big deal.
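In symbols, the two quantities being described are

\[
I(p) = \log_2 \frac{1}{p},
\qquad
H = \sum_i p_i \log_2 \frac{1}{p_i},
\]

so an event with p = 1/2 carries exactly 1 bit, and the surprise blows up as p goes to 0.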
And, looking back at the postal scam, the only difference between someone believing in the scam and
someone having the big picture is the fact that the growth function, from your
point of view when you received the letters, was 1. You thought you were the only person.
Here is one hypothesis, and you got it right, and you gave a lot of
value for that because this is unlikely to happen. On the other hand, the reality of it is
that the growth function is actually 2^N and this is certain to happen,
so when it happens, it’s meaningless. Let’s look at a scientific experiment,
where a fit is meaningless. So you are running an experiment, or you
ask people to run an experiment, to establish whether conductivity of
a particular metal is linear in the temperature. I can design an experiment for that. So you go and you ask two scientists to
conduct experiments, and they go, and they come back with
the following results. Here is the first scientist. Took the metal, but they had a dinner
appointment, so they were in a hurry, so they got 2 points and drew
the line and gave you this. The second guy had a supper appointment,
so had more time to do it, so did it 3 times,
and then the line. I have a very specific question, which
is: what evidence do they provide for the hypothesis that conductivity is
indeed linear in the temperature? What is clear without thinking too much
is that this guy provided more evidence than this guy. It is interesting to realize that this
guy provided nil, none, nada. Why is that? Because obviously, 2 points can
always be connected by a line. So the notion that goes with this
is called falsifiability. If your data has no chance of falsifying
your assertion, then by the same token, it does not provide any
evidence for that assertion. You have to have a chance of falsifying
your assertion, in order to be able to draw the evidence. This is called the axiom of
non-falsifiability, and in some sense, it’s equivalent to the arguments
we have done so far. And in our terms, the linear model is
just way too complex for the size of the data set, which is 2, to
be able to generalize at all. And therefore, there is
no evidence here. In this case, this guy could
have been falsified if the red point came here. Therefore, he actually provides
evidence. This is the point. This guy could not have
been falsified. So now we go to the next notion, which
is sampling bias. It’s a very interesting notion, and it’s tricky. And by the way, if you look at all of
these principles, it’s not like they’re just concepts, and nice,
and relate to other fields. They also provide you with red flags
when you’re doing machine learning. For example, when you use Occam’s
razor, what does it mean? It means that beware of fitting the data
with complex models, because it looks great in sample and you are very
encouraged, and when you go out of sample, you know what happens. You know all too well by the
theory we have done. Similarly, when we talk about sampling
bias and later, data snooping, there are traps that we need to avoid when
we practice machine learning. So let’s look at sampling bias,
and we start with a puzzle. Here is the puzzle. It has to do with the presidential
election, not this one. But in 1948, this was the first
presidential election after World War II, which was a big deal, and the two
people who ran were Truman, who was President at the time, and
he ran against Dewey. And it was very close–
people would take opinion polls, and it was not clear who was going to win. So now, one newspaper ran a phone poll,
and what they did is ask people how they actually voted. So this is not before the election
asking, what do you think? This is the night of the election,
after the election closed, they actually called people picked at random
at their home, asked them: who did you vote for? Black and white. Dewey or Truman, et cetera? They collected the thing, and they
applied some statistical thing or Hoeffding or some other quantity,
and came to the conclusion that Dewey had won decisively. Decisively doesn't mean he won by 60%. Decisively means that he won
above the error bar. The probability that the opposite
is true is diminishingly small. And the result was so obvious that they
decided to be the first to break the news, and they printed their
newspaper declaring "Dewey Defeats Truman". Great. OK, so Dewey won. What happens when someone
wins an election? They have a victory rally. So let’s look at the victory rally. One problem. Victory rally was Truman, and you can
see the big smile on the guy’s face. So what happened? Well, polls are polls and there
is always a probability, and this and that. No, that's not the issue here– that's the key point. So don't blame delta for it. Delta? What was delta again? We've been doing techniques
for a while. I forgot all about the theory. So let’s remind you what delta was. We were talking about the discrepancy
between in-sample, the poll, out-of-sample, the general population, the
actual vote, and we were asking ourselves, what is the probability
that this will be bigger than something, such that the
result is flipped? You thought it was Dewey winning,
and it turned out to be Truman. And that turned out to be less than or
equal to delta, and delta is expressed in terms of epsilon, N, and whatnot.
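For reference, for a single poll of size N, the Hoeffding bound from earlier in the course has the form

\[
\mathbb{P}\big[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\big] \;\le\; 2\,e^{-2\epsilon^{2}N} \;=\; \delta,
\]

so for any fixed epsilon, delta dies off exponentially with the sample size.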
So in principle, it is possible, although not very probable, that the newspaper was just incredibly unlucky. Now, the statement is
very interesting. No, the newspaper was not unlucky. If they did the poll again and again and
again, with 10 times the sample, or 100 times the sample, they will
get exactly the same thing. OK?! So what is the problem? The problem is the following. There is a bias in the poll they
conducted and it is because of a rather laughable reason. In 1948, phones were expensive. That means that households that had
phones tended to be richer, and richer people at that point favored Dewey more
than the general population did. So there was a sampling bias. It was always the case that the
population they were asking actually favored Dewey. The sample was very reflective of the
general population, of that mini general population. The problem is that, that general
population is not the overall general population. And that brings us to the statement
of the sampling bias principle. It says that if the data is sampled in
a biased way, then learning will produce a similarly biased outcome. Learning is not an oracle, not
like the football oracle. Learning sees the world through
the data you give it. I’m a learning algorithm,
here is the data. You give me skewed data, I’m going
to give you a skewed hypothesis. I’m doing my job. I’m trying to fit the data. So this is always the case, and then
you realize that there is always a problem in terms of making sure that
the data is actually representative of what you want. So again, we put this in a box. That’s the second principle,
so it’s important. And let’s look at a practical
example in learning. In financial forecasting, people use
machine learning a lot, and sometimes when you look at the markets, the
markets are completely crazy. A rumor comes out and the market
goes this way, et cetera. And you are a technical person, you
are trying to find an intrinsic pattern in the time series. So you decide, I’m going to use the
normal conditions of the market. So I’m going to take periods of the
market where the market was normal, and then there is actually a pattern
when people buy, buy, buy, and sell, sell, sell, something happens, or
whatever you are going to discover using your linear regression or
other learning algorithm. And you do this. And then you deploy it, and when you
test it, you test it in the real market, and realize that now
there is a sampling bias. In spite of the fact that you were very
happy in-sample, you actually forgot a part of the market, and you
don’t know whether that part will be terrible for you, great for
you, or neutral for you. You just don’t know. That’s what sampling bias does. The newspaper could have done this poll
and, by sheer luck, the general population thinks the same of
Truman and Dewey as the small sample they talked to, in which case the
result would have come out right and they would have never discovered
that they made a mistake. So sampling bias makes you vulnerable,
at the mercy of the part that you didn’t touch. In this case, you didn’t touch the
market in certain conditions, and if it does happen, all bets are off. One way to deal with sampling bias
is matching the distributions. It’s a very interesting technique, and
it’s actually applied in practice. I’m going to mention that. So what is the idea? The idea is that you have a distribution
on the input space, in your mind, and there was one assumption
in Hoeffding and VC inequality and all of that. They didn’t make too many assumptions,
but one assumption they certainly made is that you pick the points for training
from the same distribution you pick for testing. That was the only thing
that they require. So when you have sampling
bias, that is violated. And therefore, you try to say
I don’t have the same distribution. I have data picked from some
distribution, and I’m going to deliver the hypothesis to the customer, and
they’re going to test it in other conditions. What do I do? What you do, you try to
match the distributions. You don’t reach for the distributions
and match them. You do something that will effectively
make them match. And you look at this, and let’s
say that this is the training distribution, and the test distribution
is off a little bit. This is a probability density function. Both of them are Gaussian. One of them is off and with
a different sigma. So what you do, if you have access to those– if
someone tells you what the distributions are and then gives you
a sample, there is a way by either giving different weights for the
training data, or re-sampling the training data, to get another set as
if it was pulled from the other distribution. It’s a fairly simple method. Very seldom that you actually have the
explicit knowledge of the probability distributions, so it’s not that useful in
practice, but in principle, you can see that it can be done. And the price you pay for it is
that you had 100 examples. When you are done with this scaling and
re-sampling or whatever method you use, the effective size now is 90. So you lose a little bit in terms of
the independence of the points, and therefore, you get effectively
a smaller sample because of it. But at least, you deal with the
sampling bias that you wanted to deal with.
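Here is a small sketch of that idea– an illustration, not the lecture's recipe– assuming, unrealistically, that both densities are known Gaussians. The weights are p_test(x)/p_train(x), and the effective sample size shows the price being described next.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=100)   # drawn from the training distribution
p_train = norm(loc=0.0, scale=1.0)                    # training density (assumed known)
p_test  = norm(loc=0.5, scale=1.2)                    # test density, slightly off (assumed known)

# Weight each training point by how much more likely it is under the test distribution.
w = p_test.pdf(x_train) / p_train.pdf(x_train)
w /= w.sum()

# Effective sample size: 100 weighted points behave like fewer independent points.
n_eff = 1.0 / np.sum(w ** 2)
print(f"effective sample size ~ {n_eff:.0f} out of {len(x_train)}")

# Either feed w to a learning algorithm that accepts per-example weights,
# or re-sample the training set according to w, as if it came from the test distribution.
x_matched = rng.choice(x_train, size=len(x_train), replace=True, p=w)
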
don’t know the distribution, there are ways to try to infer
the distribution that work. But it doesn’t work if there is a region
in the input space where the probability is zero for training– nothing
will be sampled from that part– but you are going to test on it. There is a probability of getting
a point there, very much like guys without a phone. That happened to have zero probability
in the sample, but they don’t have zero probability in the
general population. And in that case, there is nothing that
can be done in terms of matching, because obviously you don’t know
what happened in that part. On the other hand, in many other cases,
there is a simple procedure, which is actually very
useful in practice. If you look at, for example, the Netflix
competition, one of the things you realize is that I have the
data set, it’s a huge data set, 100 million points. And then I’m going to test your
hypothesis on the final guys, the final ratings. So it’s a much smaller set. And the interesting aspect about it is
that if you look at the distribution of the general ratings, the 100 million,
it really is different from the distribution of these guys. Therefore, the question came up, can I
do something during the training such that I make the 100 million look as if
they were pulled from the distribution of the last guys? Very interesting question, has a very
concrete answer, and the 100 million become 10 million, not that you are
throwing away points, but you are weighting them such that when you are
done, they effectively look like a smaller set. But then you are actually matched to
that, and you can get a dividend in performance. So there is a cure for sampling bias in
certain cases, and there is no cure in other cases, in which all you can
do is admit that you don’t know how your system will perform in the
parts that were not sampled. That would be fatal if you are doing
a presidential poll, but may not be as fatal when you are doing machine
learning, because all you are going to do, you are going to warn against
using this system within that particular sub-domain. Third puzzle, try to detect
sampling bias here. Credit approval. We have seen that before. That’s
a running example in the course, so let me remind you what that was. The bank wants to approve
credit automatically. It goes for the historical records of
customers who applied before, and they were given credit cards, so you have
a benefit of, let’s say, 3 or 4 years worth of credit behavior. And you look back at their inputs, and
the inputs in those cases were simply the information they provided at the
time they applied for credit, because this is the information that will
be available from a new customer. And you get something like that.
This is the application. You also have the output, which is
simply– you go back and see whatever the credit behavior is and you ask
yourself, did they make money for me? Because it’s not only credit
worthiness, that you are a reliable person. It’s also that some people who are
flirting with disaster are very profitable for the bank, because they max
out and they pay this ridiculous percentage, so they make a lot of money
as long as they don’t default. Once they default, it’s a problem. So there’s a question of just,
did you make profit or not? That’s a question. And I’m going to approve future
customers if I expect that they will make profit for me. That’s the deal. Where is the sampling bias? We probably alluded to it
in one form or another. The problem is that you’re using
historical data of customers you approved, because these are the
only ones you actually have credit behavior on. So the guys who applied, and
you rejected them, are not part of this sample. And when you are done, you are
going to have a system that applies to a new applicant. You do not know a priori whether that
applicant will be approved or not, according to your old criteria. So it could belong to the population
that was never part of your training sample. Now, this is one case where the sampling
bias is not that terrible in terms of effect, not in terms of
characterizing what is going on. You have a part of the population, and
they have zero probability in terms of training, and nonzero probability
in terms of testing. It’s good, old-fashioned
sampling bias. But the point is that banks tend to be
a bit aggressive in providing credit because, as I mentioned, the borderline
guys are very profitable. So you don’t want to just be
conservative and cut them off, because you’re going to be losing revenue. Because of this, the boundary that you
are talking about is pretty much represented by the guys
you already accepted. You already made mistakes
in what you accepted. So when you get that boundary, the
chances are the guys you missed out will be deep on one side. You got all the support vectors,
if you want, so the interior points don’t matter. They matter a little bit, but
actually, that system with the sampling bias does pretty
well on future guys. And the evidence when you reject someone– how
do you know that it was good that you rejected them? They apply somewhere else, and they make
the other guy lose money, so you realize that your decision was good. So you can verify, if you have
a consortium of banks, whether actually that sampling bias here has an impact,
or doesn’t have an impact. Final topic, data snooping,
the sweetest of all. Well, it’s the sweetest because it
is so tricky, and manifests itself in so many ways. Let me first state the principle. The principle says, if a data set has
affected any step of the learning process, then the ability of the same
data set to assess the outcome has been compromised. Very simply stated. The principle doesn’t forbid
you from doing anything. You can do whatever you want. Just realize that if you use
a particular data set, whether it’s the whole, or a subset or whatever, use it
to navigate into– I’m going to do this, I’m going to choose this model,
I’m going to choose this lambda, I’m going to do this, I’m going to
reject this, whatever it is. You made a decision, then when you
have an outcome from the learning process and you use the same data set
that affected the choice of that, the ability to fairly assess the performance
of the outcome has been compromised by the fact that this was
chosen according to the data set. I think this is completely understood by
us, having gone through the course. We put it in a box, and then we make
the statement that this is the most common trap for practitioners,
by and large. I’ve dealt with Wall Street firms
quite a bit in my career, and there are lots of people who are using
machine learning, and it is rather incredible how they manage
to data-snoop. And there is a good reason for it,
because when you data-snoop, you end up with better performance, you think,
because that’s why you snooped. I looked at the data, I
chose a better model. The other guy didn’t look at the data,
and they are struggling with the model, and they are not getting the
same in-sample, and I am ahead. It looks very tempting to do. And it’s not just looking at the data. The problem is that there are many ways
to fall into the trap, and they are all happy ways. So if you think of it as landmines,
it is actually happy landmines. You very cheerfully step on the mine,
because you think you are doing well. So you need to be very careful. And because it has different
manifestations, what I’m going to do now, I’m going to go through examples of
data snooping. Some of them we have seen before, and some
of them we haven’t. And then you will get the idea. What
should I avoid, and what kind of discipline or compensation should I have,
in order to be able not to suffer from the consequences
of data snooping? So the first way of data snooping,
we have seen before, is looking at the data. So I’m borrowing something
from our experience. Remember the nonlinear transform? Yeah. So you have a data set like this, and
let’s say you didn’t even look at the data and you decided that, I’m going
to use a 2nd-order transform. So this is the transform, you
take a full 2nd order. You apply it, and you look at the
outcome, and this is good. I managed to get zero in-sample error. What is the price I’m paying
for generalization? One, two, three, four, five, six. That’s an estimate for the VC dimension,
so that’s the compromise between this six and however
many points, et cetera. So you realize, I fit the data
well but I don’t like the fact that it’s six. I don’t have too many points, so my
handle on generalization is not good. So let me try to do better,
at least in your mind. So what you do is say, wait a minute,
I didn’t need all of these guys. I could have gone with just this guy,
knowing that this is the origin. All you need to do is just x_1
squared and x_2 squared. This is just a circle centered
at the origin. Why do I need the other funny stuff? This would be if I’m going
for a more elaborate set. So now one, two, three, now I have VC
dimension of three, so I’m better. Of course, we know better, but
I’m just playing along. And then you get carried away and
say, I can even do this. It’s not an ellipse, it’s a circle, so
I can just add up x_1 squared and x_2 squared as one coordinate,
and then I have two. And you see what the problem
is, and the problem is what we mentioned before. What you are really doing, you are
a learning algorithm in your own right, but free of charge. That’s the problem. You are looking at the data, and you are
zooming in, and you’re zooming in. You’re learning. You’re learning. You are narrowing down the hypotheses, and
then leaving the final learning algorithm just to get you the radius. Yeah, big deal. Well, the problem is that you are
charging now for a VC dimension of two, which is the last part of the
learning cost, which is choosing the coefficients here. But you didn’t charge for the fact that
you are a learning algorithm, and you took the data into consideration,
and you kept zooming in from a bigger hypothesis set. You didn’t charge for the full
VC dimension of that.
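To make the bookkeeping concrete, the three feature transforms in this story are

\[
\Phi_2(\mathbf{x}) = (1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2),
\qquad
(1,\; x_1^2,\; x_2^2),
\qquad
(1,\; x_1^2 + x_2^2),
\]

with apparent VC dimensions 6, 3, and 2– but the honest charge is for the whole set of hypotheses your eyes searched through to arrive at the last one, not for its two parameters.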
Now, it is very important to realize that the problem here is that the snooping involves the data set. Because what happens when you
look at the data set? You are vulnerable to designing your
model, or your choices in the learning, according to the idiosyncrasies
of the data set. And therefore, you may be doing well on
that data set, but you don’t know whether you will be doing in another,
independently generated data set from the same distribution, which would be
your out-of-sample, so that’s the key. On the other hand, you are completely
allowed, encouraged, ordered to look at all other information related to your
target function and input space, except for the realization of the data
set that you are going to use for training, unless you are going
to charge accordingly. So here is the deal. Someone comes in, I ask him, how
many inputs do you have? What is the range of the inputs? How did you measure the inputs? Are they physically correlated? Do you know of any properties
that I can apply? Is it monotonic in this? All of this is completely valid and
completely important for you in order to zoom in correctly, because right
now, you are not using the data. You are not subject to
overfitting the data. You are using properties of the target
function and the input space proper, and therefore improving your chances
of picking a correct model. The problem starts when you look
at the data set and not charge accordingly, very specifically. Here is another puzzle. This one is financial forecasting.
Befitting. So right now, there will be data
snooping somewhere here, and you need to look out for it. In this case, this is a real
situation with real data. You are predicting the exchange rate
between the US dollar and the British pound. So you have eight years worth of daily
trading, where you just simply take the change from day to day. And eight years would be
about 2,000 points. There are about 250 trading days
per year, at least when the data was collected. And what you are planning
to do is the following. You look here. Let me magnify it. This is your input for the prediction,
and this is your output. So r is the rate. So you don’t look at the rate in the
absolute, you look at delta rate, the difference between the rate today
and the rate yesterday. That’s what you’re trying to predict. You’re asking yourself whether
it’s going up or down every day, and by how much. So you get delta, and you get delta for
the 20 days before, hoping that a particular pattern of up and down in
the exchange rate will make it more likely that today’s change, which hasn’t
happened yet– you are deciding to either buy or sell at the open– whether this will be positive
or negative and by how much. So if you make a certain prediction,
then you can obviously capitalize on that, and make predictions
according to that. And if you are right more often than
not, you will be making money because you are losing less often than
winning if you have the right objective function. So this is the case. What happens
here is that now you have the 2,000 points, so for every day, there
is a change, delta r. And what you do first, you normalize
the data to zero mean and unit variance. And then after that, you have
this array of 2,000 points. You create a training set and a test set. So the training set in this case, you
take 1,500 points, 1,500 days. So every day now, you take the day, and
you take the previous 20 days as their input. That becomes your training. And for the test, you picked it at
random, not the last ones, just to make sure that there is no funny stuff,
change in this or that. You just want to see if something is
inherent, so just to be on the safe side, you did it randomly. And then you take 500 points
in order to test on. So right now, out of the 2,000 array of
points, you have a big array of 20 points input, one output, 20 points
input, one output, 1,500 of those. And on the other side on the test, 20
points input, one output, 20 inputs, one output, 500 of those. This is for the test. That’s the game. So you go on with the training. You train your system on the training
set, and to make sure, because you heard of data snooping,
these guys are under lock and key. You didn't look at the
data at any point. You just carried all of this
automatically, and then when you are done and you froze the final hypothesis,
you open the safe, you get the test data, and you
see how you did. And this is how you did. You train only on D_train, you test on
D_test, and this is what you get. I’m not saying how often you got it
right, but I’m actually saying that you put a trade according to the
prediction, and I’m asking you how much money you made. So for the 500 points, sometimes you
win, sometimes you lose, but you win more often than you lose,
which is good. And at the end of two years worth–
that’s what 500 days would be– you would have made a respectable 22%
unleveraged, so that’s pretty good. So you are very happy, and now having
done that, you go to the bank and tell them I have this great
prediction system. Here is the system. I'm going to sell it to you, and I
guarantee that it will be– you do the error bars and whatever. And they go, and they go live, and they
lose money, and they sue you, and all of that. So you ask yourself,
what went wrong? What went wrong is that
there is snooping. And what’s interesting is, where
exactly is the snooping? So there are many candidates: the random split, the
fact that I used inputs that happen to be outputs for other days? No, no, that's legitimate.
remarkably subtle, to the level where you can fall into that very, very
easily, and here is where the snooping happened. The snooping happened
when you normalized. What? I had the daily rates, right? 2,000 of them. I have the change. All of that is legitimate. Now, I slipped a fast one by you– I hope I did– when I told you, first you
normalize this to zero mean and unit variance. It looked like an innocent step, because
you get them to a nice numerical range, and some methods will actually
ask you to please put the data normalized, because it’s sensitive to
the dynamic range of the data. The problem is that I did this
before I separated the training from the testing. So I took into consideration the mean
and variance of the test set. That extremely slight snooping into
what’s supposed to be the test set, supposed not to affect anything, has
affected me, but by just a mean and– How could it possibly
make a difference? Well, if you didn’t do that,
you split the data first. You took the training set only,
and you did the normalization. And whatever the mu and sigma squared
that did the normalization for the training set, you took them frozen and
applied them to the test set so that they live in the same range of values. And you did the training now and
the test without any snooping.
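A minimal sketch of that snoop-free order of operations (illustrative shapes and names only, not the actual data or code from the experiment):

import numpy as np

def split_then_normalize(delta_r, n_train=1500, seed=0):
    # delta_r: array of roughly 2,000 daily changes in the exchange rate.
    rng = np.random.default_rng(seed)

    # Build the examples: 20 previous days as input, today's change as output.
    X = np.array([delta_r[t - 20:t] for t in range(20, len(delta_r))])
    y = delta_r[20:]

    # Split FIRST, before computing any statistic of the data.
    idx = rng.permutation(len(y))
    train, test = idx[:n_train], idx[n_train:]
    X_train, y_train = X[train], y[train]
    X_test,  y_test  = X[test],  y[test]

    # Normalization parameters come from the training set only...
    mu, sigma = X_train.mean(), X_train.std()
    X_train = (X_train - mu) / sigma
    # ...and are then applied, frozen, to the test set.
    X_test = (X_test - mu) / sigma
    return X_train, y_train, X_test, y_test
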
Under those conditions, this is what you would have gotten. So no wonder you lost money. All the money you made is because you
sniffed on the average of the out-of-sample. And the average matters, because if you
think about it, let’s say that the US dollar had a trend of going up. That will affect the mean,
but you don’t know that– at least, you don’t know it for the
out-of-sample unless you got something out-of-sample. So I’m not saying normalization
is a bad idea. Normalization is a super idea. Just make sure that whatever parameters
you use for normalization are extracted exclusively from what
you call a training set, and then you are safe. Otherwise, you will be getting
something that you are not entitled to get. Easy to think about, if you are actually
thinking: I’m going to deploy this system. I don’t have the test set. So if you don’t have the test set, you
cannot possibly use those points in order to normalize. So use only things that you will
actually be able to use when you deploy the system. In this case, you have
only the training. Now, the third manifestation of
data snooping comes from the reuse of the data set. That is also very common. So what you do, I give you
a homework problem. Oh, I am very excited about
neural networks. Let me try neural networks. Oops, they didn’t work. I heard support vector
machines are better. Let me try them. Yeah, I did, but it was
the wrong kernel. Let me use the RBF kernel. Oh, maybe I’m just using too
sophisticated a model. Let me go back to the linear models,
and just use a nonlinear transformation. And eventually, using the same
data set, you will succeed. And the best way to describe it is
a very nice quote in machine learning. It says, “If you torture the data long
enough, it will confess", but, in exactly the same way, the confession would
mean nothing in this case. So the problem here is that when you
do this, you are increasing the VC dimension without realizing it. I used neural networks and it didn’t
work, and then I used support vector machines with this and that. Guess what is the final model
you used in order to learn? The union of all of the above. It’s just that some of them
happened to be rejected by the learning algorithm. That’s fine, but this is
the resource you had. So you think of the VC dimension, and
the VC dimension is of the total learning model.
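In other words, the model whose generalization should be accounted for is

\[
\mathcal{H}_{\text{used}} = \mathcal{H}_{\text{neural nets}} \cup \mathcal{H}_{\text{SVM}} \cup \mathcal{H}_{\text{RBF kernel}} \cup \mathcal{H}_{\text{linear}+\Phi},
\qquad
d_{\text{VC}}(\mathcal{H}_{\text{used}}) \;\ge\; \max_{k}\, d_{\text{VC}}(\mathcal{H}_{k}),
\]

not the VC dimension of whichever member happened to survive.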
Again, as we will see, there will be remedies for data snooping, and there is a question of– it's not like I
have to try a system, and when I fail, I just quit. That’s not what is being said. It’s just asking you to account
for what you have been doing. Don’t be fooled into thinking that I
can do whatever, and then the final guy that I use with a very simple model,
after all the wisdom that I accumulated from the data, is the VC
dimension that I’m going to charge. That just doesn’t work. The interesting thing is that this could
happen, not because you used the data, but because others
used the data. Oh my God, it’s really terrible here. Here’s the deal. You decide to try your methods
on some data set. So you go to one of the data sets
available on the internet, let’s say for heart attacks or something, and
you say, I am very aware of data snooping, right? I’m not going to look at the data, I’m
not going to normalize using the data. I’m going to get the data, and put them
in a safe, and close the safe, and I will just do my homework
before I even touch the data. And your homework is in the form of
reading papers about other people who used the data set. You want to get the wisdom, so
you use this, and you find that people realize that Boltzmann machines
don’t work in this case. The best kernel for the SVM happens
to be polynomial of order three, whatever it is. So you collect it, and you look
at it, and then you have your own arsenal of things. So as a starting point, you put
a baseline based on the experience you got, and you say that I’m
going now to modify it. Now you open the safe
and get the data. Now you realize what happened. You didn’t look at the data, but you
used something that was affected by the data, through the work of others. So in that case, don't be surprised if, when all you did was determine a couple of parameters, which is the only thing you added to the deal, you got a great performance. And you say, I have two parameters,
VC dimension is 2, I have 7,000 points. I must be doing great out of sample,
and you go out of sample, and it doesn’t happen. Doesn’t happen because actually, it’s
not the two parameters, it's all the decisions that led to that model.
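As a rough plug-in, assuming the VC bound of the form used earlier in the course, E_out <= E_in + sqrt((8/N) ln(4((2N)^d_vc + 1)/delta)), and with d_vc = 50 and delta = 0.05 made up purely for illustration:

```python
# Sketch: the error bar for "2 parameters" versus a much larger effective VC dimension.
import math

def vc_error_bar(n, dvc, delta=0.05):
    # Generalization error bar; taking the log of the integer avoids float overflow.
    return math.sqrt(8.0 / n * (math.log(4 * ((2 * n) ** dvc + 1)) - math.log(delta)))

N = 7000
print(f"d_vc = 2  : error bar ~ {vc_error_bar(N, 2):.3f}")    # looks reassuringly small
print(f"d_vc = 50 : error bar ~ {vc_error_bar(N, 50):.3f}")   # a hypothetical effective d_vc
```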
And the key problem in all of those is always to remember that you are matching a particular
data set too well. You are now married to that data set. You kept trying things, et cetera, and
after a while, you know exactly what to do with this data set. If someone comes and generates another
data set from the same distribution, and you look at it, it will look
completely foreign to you. What happens? It used to be that whenever these two
points are close, there is always a point in the same line far away. That’s obviously an idiosyncrasy
of the thing. Now you give me a data set
that doesn’t have that. That must be generated from
a different distribution. No, it’s generated from
the same distribution. You just got too much into this data
set, to the level where you are starting to fit funny stuff,
fitting the noise. There are two remedies for data
snooping, and I’m going to do this, and then give you the final
puzzle, and call it a day. You avoid it, or you account for it. That’s it. So avoiding it is interesting. It really requires strict discipline. So I’ll tell you a story
from my own lab. We were working on a problem, and
performance was very critical, and we were very excited about what we were getting: all the ingredients that make you go for data snooping. You just want to push it a little bit. We realized that this is the case,
so we had that discipline that we’ll take the data– the first thing we did, we sampled
points at random, put them in the safe, and then the rest of the guys
you can use for your training, validation, whatever you want. So at some point, one of my colleagues
who was working on the problem declared that he already had
the final hypothesis ready. It was a neural network at that point. So now I was the safe keeper, so now
I’m supposed to give them the test points, in order to see what
the performance is like. I smelled a rat, so what I decided, I
asked them, could you please send me the weights of the final hypothesis
before I send you the data set? That was the requirement, because
now it’s completely clear. He’s committed to one
final hypothesis. If I send him the data set and he says
it performed great, I can verify that because he has already sent me that. It’s a question of causality
in this case.
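One way to enforce that causality, sketched here with hypothetical weights and a hash-based commitment (my own illustration, not something from the lecture): the modeler publishes a digest of the frozen hypothesis before the test set is released, so the evaluated hypothesis can later be verified to be the committed one.

```python
# Sketch: commit to the final hypothesis before seeing the test set.
import hashlib
import json

def commit(weights):
    """Hash the frozen weights; this digest is shared before the test data is released."""
    blob = json.dumps(weights, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def verify(weights, digest):
    """After testing, confirm the evaluated hypothesis matches the earlier commitment."""
    return commit(weights) == digest

final_weights = {"w": [0.12, -0.7, 1.3], "b": 0.05}   # hypothetical network weights
digest = commit(final_weights)          # sent to the safe keeper first
# ... test data released, performance measured ...
assert verify(final_weights, digest)    # same hypothesis, no last-minute swap
```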
And the problem is that it is not that difficult to fall into this: here is the data set, and what you
really had, you had the candidate, but you had three other guys that
are in the running. And then you look at the data, and you
decide, maybe I pick one of the others that were in the running, et cetera. You can do very little about it. And in particular, in financial applications this is extremely vulnerable, because the data are so noisy. It is very easy, when you fit the noise even a little bit, to get much better apparent performance than you will ever get from the pattern, so you had better be extremely careful. And therefore, you need a discipline that really is completely waterproof, guaranteeing that you did not data-snoop. Accounting for data snooping is not
that bad, because we already did a theory, and when we have a finite number
of hypotheses we are choosing from for validation, we know
the level of contamination. Even if it’s an infinite one,
we have the VC dimension. We had very nice guidelines to tell us
how much contamination happened. The most vulnerable part is looking at
the data, because it’s very difficult to model yourself and say, what is the
hypothesis set that I explored, in order to come up with that model
by looking at the data? So because accounting is very difficult,
that’s why I keep raising a flag about looking at the data. But if you can account, by all means,
that’s all you need to do. Look at the data all you want, just
charge accordingly, and you will be completely safe as far as machine
learning is concerned. Final puzzle, and we call it a day. And we are still in data snooping,
so maybe this has to do with data snooping, but it also has to do
with sampling bias, so it’s an interesting puzzle. This is a case where you are testing
the long-term performance of a famous strategy in trading, which
is called “buy and hold”. What does it mean? You buy and hold. You don’t keep– I’m going to sell
today, because it’s going down. No, you just buy, and sit
on it, forget about it. It’s like a pension plan or something. And five years later, you look
at it and see what happens. So you want to see how much money
you make out of this. So what you do is you decide to
use 50 years' worth of data. That's usually a good span for a professional life, so it will cover how much money you make from the time you start contributing until the time you retire. So here is the way you do the test. You want the test to be as broad as
possible, so you go for the S&P 500. You take all currently traded
stocks, the 500 of them. And then you go back, and you assume that
you strictly applied a buy and hold for all of them. So don’t be tempted to say that I’m
going now to modify it, because this guy crashed at some point, so if I sold
and then bought again, I would make more money. No, no, no. It’s buy and hold we are testing. That was frozen. So you do this, and then you compute,
and you find that you will make a fantastic profit. And you compute: if I do this now, while young in my career, and apply it, by the time I retire I will have a couple of yachts. It's a wonderful thing. Can you see the problem? You are very well trained now,
so you can detect it. The problem is there is a sampling bias,
formally speaking, because you looked only at the currently traded stocks. That obviously excludes the guys that
were there and took a dive, and that obviously puts you at
a very unfair advantage.
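A small simulation, my own and not from the lecture, shows how strong that advantage is: with yearly returns that have no true edge at all, averaging only over the surviving stocks makes buy and hold look profitable.

```python
# Sketch: survivorship bias in the buy-and-hold back-test.
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_years = 500, 50

# Hypothetical yearly log-returns with zero drift: buy and hold has no true edge here.
log_returns = rng.normal(loc=0.0, scale=0.3, size=(n_stocks, n_years))
cumulative = np.cumsum(log_returns, axis=1)

# A stock "survives" to the present only if it never fell below 20% of its start value.
survived = cumulative.min(axis=1) > np.log(0.2)

print("mean 50-year log-return, all stocks:     ", cumulative[:, -1].mean())        # ~ 0
print("mean 50-year log-return, survivors only: ", cumulative[survived, -1].mean()) # clearly > 0
print("fraction of stocks that survived:        ", survived.mean())
```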
And it's interesting that people do treat this not as a sampling bias but as data snooping, in spite of the
fact that it doesn’t fit our definition of data snooping. It does fit the definition of snooping,
because you looked at the future when you are here. It’s as if you are looking 50 years from
now, and someone tells you which stocks will be traded at that point. So that’s not allowed. But nonetheless, some people will
treat this as data snooping. In our context, this is formally just
sampling bias, and sampling bias that happens to be created or caused
by a form of snooping. I will stop here, and we will take
questions after a short break. Let's start the Q&A. MODERATOR: In the last homework, where people were using LIBSVM, it was emphasized that the data should be scaled. Why did we not discuss this in the course? PROFESSOR: There are
many things I did not discuss in the course. I had a budget of 18 lectures,
and I chose what I consider to be the most important. There is a question of input data
processing, and there is a question not only of normalization, it’s also
a question of de-correlation of inputs and whatnot, which is
a practical matter. And the fact that I did not cover
something doesn’t mean that it’s not important. It just means that it’s a constrained
optimization problem, and you have the solution, and I have to have
a feasible solution. So that’s what I have. I think we have an in-house question. STUDENT: Thanks. Professor, you mentioned that if you
reuse the same data set to compare between different models, it’s
a form of data snooping. So how do we know what form
of model is better? PROFESSOR: The part of it which
is formally data snooping is the part where you used the failure of the
previous model to direct you to the choice of the new model, without
accounting for the VC dimension of having done that. So effectively, it’s not you that looked
at the data, but the previous model looked at the data and made
a decision, and you didn’t charge for it. So that is the data-snooping
aspect of it. If you did this as a formal hierarchy, it would be a different story. You start out: here is the data set, I don't look at it. I'm going to start with support vector
machines with RBF, and then if I fail, I’m going to do this, et cetera. And given that this is my hierarchy,
the effective VC dimension is whatever, this is completely
legitimate. The snooping part is using the data for
something without accounting for it– in this case, using the data for
rejecting certain models and directing yourself to other models. STUDENT: Yes. So by accounting for the data snooping,
do you mean you consider the effective VC dimension of your entire
model, and use a much larger data set for your entire model? PROFESSOR: You’ll get the VC dimension,
so if the VC dimension is so big that the current amount of data won't give you any generalization, the
conclusion is that I won’t be able to generalize unless I get more data,
which is what you’re suggesting. So the basic thing is that you are going
to learn, and you are going to finally hand a hypothesis to someone. What do you expect in terms
of performance? Data snooping makes you much more
optimistic than you should be, because you didn't charge for things that
you should have charged for. That’s the only statement being made. STUDENT: Is there a possibility that
data snooping will make you pessimistic, will make you
more conservative? PROFESSOR: I can probably construct
deliberate scenarios under which this is the case, but in all the
problems that I have seen, people are always eager to get good performance. That is the inherent bias, and that is
what directs you toward something optimistic, because you do something
that gets you smaller in-sample error, and you think now that this in-sample
error is relevant, but you didn’t account for what it cost you to
get to that in-sample error. So it’s always in the optimistic
direction. STUDENT: Yes. Thank you. PROFESSOR: Sure. MODERATOR: Assuming that there is
sampling bias, can you discuss how can you get around it? PROFESSOR: So we discussed
it a little bit. If there is a sampling bias, if you
know the distributions, you can– let me look at the– so in this case, let’s say that I
give you these distributions. What this means, you generated the data
according to the blue curve, and therefore, you will get
some data here. So what is clear, for example, is that
the data that correspond to the center of the red curve, which is the
test, are under-represented in the training set. And on the other hand, the data that
are here are over-represented: the blue curve is much bigger there, so it will give you plenty of samples, while it will hardly ever be the case that you get such a sample from the test distribution. So what you do, you devise a way of scaling,
or giving importance– not scaling the y value, just scaling
the emphasis of the examples– such that you compensate for this
discrepancy, as if the data were coming from the test distribution, and there are some re-sampling methods that achieve the same effect. So this is one approach.
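A minimal sketch of that first approach, assuming for illustration that the two densities are known (in practice they would have to be estimated): each training example is weighted by the ratio of the test density to the training density, and a weighted fit then behaves as if the data had come from the test distribution.

```python
# Sketch: importance-weighting the training examples to compensate for sampling bias.
import numpy as np

rng = np.random.default_rng(0)

def p_train(x):                      # assumed training input density (the "blue" curve)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_test(x):                       # assumed test input density (the shifted "red" curve)
    return np.exp(-0.5 * (x - 1.5)**2) / np.sqrt(2 * np.pi)

x_train = rng.normal(0.0, 1.0, size=500)
y_train = np.sin(2 * x_train) + 0.1 * rng.normal(size=500)   # hypothetical target + noise

# Importance weights: emphasize training points that the test distribution favors.
w = p_test(x_train) / p_train(x_train)

# Weighted least-squares fit of a line, one concrete way of "scaling the emphasis".
X = np.column_stack([np.ones_like(x_train), x_train])
A = X.T @ (w[:, None] * X)
b = X.T @ (w * y_train)
print(np.linalg.solve(A, b))         # intercept and slope under the re-weighted data
```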
The other approach, which is in the absence of those distributions, is to look at the input space in terms
of coordinates. Let’s say that with the case of the
Netflix, you look at, for example, users rated a certain
number of movies. Some of them are heavy users, and
some of them are light users. So you put how many movies a user rated,
and you try to see that in the training and in the test, you have
an equivalent distribution as far as the number of ratings is concerned. And you look for another coordinate and
a third coordinate, and you try to match these coordinates. This is an attempt to basically take
a peek at the distribution, the real distributions that we don’t know, in
terms of the realization along coordinates that we can relate to. So there are some methods to do that.
Basically, you are compensating by doing something to the training set you
have, to make it look more like it was coming from the test distribution. MODERATOR: Is there any counter
example to Occam’s razor? PROFESSOR: Is there– MODERATOR: Counter example
to Occam’s razor or not? PROFESSOR: It’s statistically
speaking in what we– I can take a case where I violate
the marriage between the complexity of an object and the
complexity of the set that the object belongs to. So I can take one hypothesis which is
extremely sophisticated in terms of the minimum description length or the
order of the polynomial, but it happens to be the only hypothesis
in my hypothesis set. Now, if this happens to be close to
your target function, you will be doing great, in spite of
the fact that it’s complex. So I can create things where I start
violating certain things like that. But in the absence of further
knowledge, and in very concrete statistical terms, Occam’s
razor holds. So the idea is that when you use
something simpler, on average, you will be getting a better performance. That's the conclusion here. MODERATOR: Speaking specifically about applications in computer vision, where the idea of sampling bias comes to mind,
is there any particular method used there to correct this, or just
any of the things we discussed? PROFESSOR: I think it’s the same
as discussed, just applied to the domain. Sometimes the method becomes very
particular when you look at what type of features you extract in
a particular domain, and therefore, it gets modified
in that way. But the principle of it is that you take
the data points from your sample, and either give them different weights or re-sample them, such that you replicate what would have happened
if you were sampling from the test distribution. MODERATOR: I think that’s it. PROFESSOR: Very good.
We’ll see you on Thursday.
