ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we talked about radial basis

functions, and the functional form of the hypothesis in that model

is the superposition of a bunch of Gaussians, centered around mu_k. And we had two models, or two versions

of that model, one of them where the centers are fewer than the number of

data points, which is the most common one, in which case we need to come up

with the value of the centers, mu_k, and learn the values of w_k. And it turned out to be a very simple

algorithm in that case, where you use unsupervised learning to get the mu_k’s,

the centers, by clustering the input points without reference to

the label that they have. And after you do that, it becomes a very

simple linear model where you get the w_k’s, the parameters, using

the usual pseudo-inverse. And in the other case, where we used

as many centers as there are data points, and the centers were

the data points, there was obviously no first step. And in that case, in order to get the

w_k, we actually used the real inverse rather than the pseudo-inverse. Radial basis

functions are very popular functions to use in machine learning,

but one of the most important features about them is how they relate to so

many aspects of machine learning. So I’d like to go through this, because

it’s actually very instructive and it puts together some of

the notions we had. So let me magnify this a bit. Radial basis functions have this as

the building block, the Gaussian, and they are related to

nearest neighbor. In the case of nearest neighbor, you

have a data point, one of your points in the training set, and it influences

the region around it. So everything in the region around it in

the input space inherits the label of that point, until you get to a point

which is closer to another data point, and then you switch to that point. So you can think now of RBF

as a soft version of that. The point affects the points around

it, but it’s not black and white. It’s not full effect and

then zero effect. It’s gradually diminishing effect. It’s also related to neural networks,

thinking of this as the activation in the hidden layer, as we saw last time. And the activation for the

neural networks in the hidden layer was a sigmoid. And the main conceptual difference

between the two in this case is that this is local. It takes care of one

region of the space at a time, whereas this is global. That thing affects points regardless

of the value of the signal, and you get the effect of a function by getting

the differences between these different sigmoids. Then we had the relationship to SVM,

which is very easy because in the case of SVM, we had

an outright RBF kernel. So there was simply a very easy way to

compare them because they use the same kernel, except that there were

many interesting differences. For example, when we use the RBF, we

cluster the points, we determine the centers according to an unsupervised

learning criterion. And in the case of SVM, the centers,

if you’re going to call them that, happen to be the support vectors in

which the output is very much consulted in deciding what these

support vectors are. And the support vectors happen to be

around the separating boundary, whereas the centers here happen to be

all over the input space, in order to represent different clusters

of the inputs. The two remaining relations as far as

RBF is concerned are regularization and unsupervised learning. Unsupervised learning is easy, because

that is the utility we had in order to cluster the points and

find the centers. So you look at the points, and then

you try to find the representative center for them such that when you put

a radial basis function around that point, it captures the contribution of

those points, and then more or less dies out, or at least is not as

effective when it goes far away, and this is another center

that does the same. The interesting aspect was

regularization because, at face value, it seems a completely

different concept. RBF is a model. Regularization

is a method that we apply on top of any model. But it turns out that RBF’s were derived

in the first place in function approximation using just a consideration

of regularization. So you have a bunch of points, you want

to interpolate and extrapolate them, and you don’t want the

curve to be too wiggly. So you capture a smoothness criterion

using a function of derivatives, and then when you solve for them, you find

that the interpolation is done by Gaussians, which gives you the RBF’s. So this is what this model does. Today, we’re going to switch gears

completely and in a very pleasant way. If you think about it, we have gone

through lots of math, and lots of algorithms, and lots of homework, and

all of that, and I think we paid our dues and we earned the ability to

do some philosophy, if you will. So we’re going to look at learning

principles without very strong appeal to math, because we have a very strong math

foundation to stand on already. And we’ll try to understand the concepts,

and relate these concepts as they appear in machine learning, because

they also appear in other fields in science in general, and

they are fascinating concepts in their own right. And when we put them in the context of

machine learning, they assume a real meaning and a real understanding that

will help us understand the principles in general. So the three principles, the usual

label for them is Occam’s razor, sampling bias, and data snooping. And you may be familiar with some of

them, and we have already alluded to data snooping in one of the lectures. And if you look at them, Occam’s

razor relates to the model. Both of these guys relate to the data. One of them has to do with collecting

the data, and the other one has to do with handling the data. And we’ll take them one at a time, and

see what they are about and how they apply to machine learning and so on. So let’s start with Occam’s razor. There is a recurring theme in machine

learning, and in science, and in life in general that less is more. Simpler is better, and so on. And there are so many manifestations of

that, and I just chose one of the most famous quotes. I put “quote”

between quotes because it’s not really a quote. He didn’t say that in so many words, but

at least, that’s what people keep quoting Einstein as saying. And it says that an explanation of

the data– so you are running an experiment, you collect the data,

and you want to make an explanation of the data. The explanation could be E equals

M C squared, or something else. So you are trying to find an explanation

of the data, and here is a condition about what

the explanation should be like. It should be as simple as possible,

but no simpler. Very wise words. As simple as possible, that’s

the Occam’s razor part. No simpler, because now you

are violating the data. You have to be able to

explain the data. So this is the rule. And that quote, in one manifestation or

another, has occurred in history. Isaac Newton has something that is

similar, and a bunch of them, but I’m going to quote the one that

survived the test of time, which is Occam’s razor. So let’s first explain

what the razor is. Well, a razor is this. You have to write “Occam” on it in

order to become Occam’s razor! And the idea here is symbolic. So the notion of the razor

is the following. You have an explanation of the

data, and you have your razor. So what you do, you keep trimming the

explanation to the bare minimum that is still consistent with the data, and

when you arrive at that, then you have the best possible explanation. And it’s attributed to William of Occam

in the 14th century, so it goes back quite a bit. What we would like to do, we’d like

to state the principle of Occam’s razor, and then zoom in, in order

to make it concrete. Rather than just a nice thing to have,

we’d like to really understand what is going on. So let’s look at the statement. The statement, in English, not in

mathematics, says that the simplest model that fits the data

is also the most plausible. And we put it in a box, because

it’s important. So, first thing to realize about this

statement is that it is neither precise nor self-evident. It’s not precise, because I really

don’t know what simplest means. We need to pin that down. Right? I know that the simplest model is nice,

but I’m saying something more than just nice. I’m saying it’s most plausible. It is the most likely to be true

for explaining the data. That is a statement, and you actually

need to argue why this is true. It’s not wishful thinking that we just

use the simple, and things will be fine. There is something said here. So there are two questions to answer,

in order to make this concrete. The two questions are, the first one

is, what does it mean for a model to be simple? It turns out to be a complex question,

but we will see that it’s actually manageable in very concrete terms. The second question is, how do we

know that this is the case? How do we know that simpler is better,

in terms of performance? So we’ll take one question

at a time, and address it. First question, simple

means exactly what? Now, you look at the literature and

complexity is all over the place. It’s a very appealing concept with a very

big variety of definitions, but the definitions basically belong

to two categories. When you measure the complexity,

there are basically two types of measures of complexity. And my goal here is to be able to

convince you that they actually are talking about more or less the same

thing, in spite of being inherently different conceptually. The first one is a complexity of

an object, in our case, a hypothesis h or the final hypothesis g. That is one object, and we can say that

this is a complex hypothesis or a simple hypothesis. The other set of definitions

have to do with the complexity of a set of objects. In our case, the hypothesis set. We say

that this is a complex hypothesis set, complex model, and so on. And we did have concretely a measure of

complexity of small h and a measure of complexity of big H, and if you

remember, we actually used the same symbol for them. It was Omega. Omega here was the penalty for

model complexity when we did the VC analysis, and Omega here was

the regularization term. This is the one we add in the augmented

error, in order to capture the complexity of what we end up with. So we already have a feel that there is

some kind of correspondence, and if you look at the different definitions

outside, there are many definitions of the complexity of an object, and

I’m going to give you two from different worlds. One of them is MDL, which stands for

Minimum Description Length. And the other one, which is simple,

is the order of a polynomial. Let me take the minimum

description length. So the idea is that I give you an object

and you try to specify the object, and you try to specify it

with as few bits as possible. The fewer the bits you can get

away with, the simpler the object in your mind. So the measure of complexity here is

how few bits can I get away with, in specifying that object? And let’s take just an example, in order

to be able to relate to that. Let’s say I’m looking at an integer that

happens to be a million digits, a million decimal digits. Huge numbers, any numbers. Now, I’m trying to find the complexity

of individual numbers of that length. There will be different complexities. So let me give you one number which is,

let’s say, 10 to the million minus 1, in order to make

it a million digits. So let’s say 10 to the

million minus 1. Now, 10 to the million minus 1 is

99999999, a million times, right? In spite of the fact that this is

a million in length, it is a simple object because you were able to

describe it as “10 to the million minus 1”. That is not a very long

description, right? And therefore, because you managed to

get a short description, the object is simple in your mind. This is very much related to Kolmogorov complexity. The only difference between Kolmogorov

complexity and minimum description length is that minimum

description length is more friendly. It doesn’t depend on computability

and other issues. But this is the notion. And you can see that when we describe

the complexity of an object, that complexity is an intrinsic

property of the object. Order of a polynomial is

simpler to understand. I tell you there is a 17th-order

polynomial versus a 100th-order polynomial, and you already can see that

the object is more complex when you have a higher order. And indeed, this was our definition of

the complexity of the target, if you recall, when we were running the

experiments of deterministic noise. In that case, we needed to generate

target functions of different complexity, and the way we did it, we

just increased the order of the polynomial as our measure of the

complexity of that object. Now we come to the complexity

of a class of objects. Well, there are notions running around

that actually define that, and I’m going to quote two of

them, very famous. The entropy is one, and the one

we are most familiar with, which is the VC dimension. Now, these guys apply

to a set of objects. For example, the entropy. You run an experiment, you consider

all possible outcomes of the experiment, the probabilities that go

with them, and you find one collective function that captures

the probability, sum of p logarithm of 1 over p, and that becomes your entropy and

that describes the disorder, the complexity, whatever you want, of the

class of objects, each outcome being one object. In the case of the VC dimension, it

applies directly to the notion we are most familiar with. It applies to a hypothesis set, and it

looks at the hypothesis set as a whole, and produces one number that describes

the diversity of that hypothesis set. And the diversity in that case

we measure as the complexity. So if you look at one object from that

set, and you look at this measure of complexity, now that measure of

complexity is extrinsic with respect to that object. It depends on what other guys

belong to the same category. That’s how I measure the complexity of

it, whereas in the first one, I didn’t want to be a member of anything. I just looked at that object, and tried

to find an intrinsic property of that object that captures the complexity. So these are the two categories

you will find in the literature. Now, when we think of simple as far as

Occam’s razor, as far as different quotes are concerned, we are thinking

of a single object. I tell you E equals M C squared, or I looked

at the board, P V equals n R T, and that is a simple statement. You don’t look at what other

alternatives were there to explain the data. You just look at that object

intrinsically, and that is what you think of as the measure of complexity. When you do the math in order to prove

Occam’s razor in one version or another, the complexity you are using is

actually the complexity of the set of objects. And we have seen that already. We looked at the VC dimension, for

example, in order to prove something of an Occam’s nature in this course

already, and that captured the complexity of a set of objects. So this is a little bit worrying,

because the intuitive concept is one thing, and the mathematical

proofs deal with another. But the good news is that the complexity

of an object and the complexity of a set of objects, as we

described in this slide, are very much related, almost identical. And here is the link between them: counting. Couldn’t be simpler. Here is the idea. Let’s say we are using the minimum

description length, which is very popular and versatile. So it takes l bits to specify

a particular object, h. I’m taking the objects here to be h,

because I’m in machine learning. The objects are hypotheses,

so I use that. Now, the measure of complexity in these

terms is that the complexity of this fellow is l bits, because

that is my definition. Now, this implies something. This implies that if I look at all the

guys that are similar to this object in terms of complexity, they also happen

to have l bits worth of minimum description. How many of them are there? Well, 2^l, right? And now you can look at the set of all

similar objects, and you call it H, and you have one of 2^l as

the description of an object here, and you can take the “1 of 2^l” as

the description of the complexity of that set. So now we are establishing

something in our mind. Something is being complex in its

own right, when it’s one of many. Something is simple in its own

right, when it’s one of few. That is the link that makes us able

to use this side for the proofs, and make a claim on this side. It is not an exact correspondence,

but it is an overwhelmingly valid correspondence. Now these are with bits, and

I can pin it down exactly. How about real-valued parameters? Let’s look at our 17th-order polynomial. You can look at a 17th-order polynomial,

and you can see that because it’s 17th order, it goes up

and down and up and down, and that looks complex. But also, because it’s a 17th-order

polynomial, it’s one of many, in the realm of infinity in this case, because

having 18 coefficients to choose makes me able to choose

a whole bunch of guys that belong to the same category. So the class of 17th-order polynomials

is big, and therefore, it’s not only that the individual is complex,

the set is also complex. There are exceptions to this rule,

and one notable exception was a deliberate exception. And we wanted something that looks

complex, so that it does our job of fitting, but is one of few. And therefore, we are not going to pay

the full price for it being complex, and that was our good old friend SVM. Remember this fellow? This looks complex all right, but it’s

actually not really complex because it’s defined only by very

few support vectors. And therefore in spite of the fact that

it looks complex, it’s really one of few, and that is what we achieve

by the support vector machines. Now, let us take this in our mind,

that we are going to use the complexity of an object as the same as

the complexity of the set of objects that the object naturally belongs to,

and we will see some ramifications. So now I’m going to give you the

first puzzle of the lecture. There are 5 puzzles in this lecture,

so you need to pay attention, and each puzzle makes a point. And the first one has to do

with this complexity, so let’s look at the puzzle. The puzzle has to do with a football

oracle, someone who can predict football games perfectly. You watch Monday night football, you

want to know the result, and something happens Monday morning. You get a letter in the mail. You open the letter. Hi. Today, the home team will win.

Or, the home team will lose. You don’t make much of it, just

some character sent something. It’s not a big deal. You watch the game, and

it’s a good call. OK, interesting. 50%, lucky. Next Monday, another letter,

another prediction. And the funny thing is that he predicted

either the home team will win or not, and it was very long odds. Everybody thought the

other way around. And at the end of the game, the guy was

right, and the guy was right for 5 weeks in a row. Now you are really very curious, and you

are eagerly waiting in the 6th week in the morning of Monday

to see where the letter is. You have a perfect record. Now comes the letter. The letter says: you want

more predictions? Pay me $50. Very simple question: Should you pay? The question is easily answered, because

now the scams are so many that, by default, I just don’t

look at anything. There must be something to it. But I really want to pin down what

it is, because that is the message we are carrying out. So the idea here is that no, you

shouldn’t, and the guy is really not predicting anything. And the reason for that

is the following. He’s not sending letters to you only. He’s sending letters to 32 people. In the first game, for half of them, he

said that the home team will lose. For the other half, he said the

home team will win. Now, because he did that, he is sure

that some of the guys will get the correct answer. So the game is played, and

the home team loses. So in the second week, he goes for the

guys where he was right, and sends half of them that the home team will

lose, and the other half, the home team will win. Now, he had plans to send the other guys

as well something similar, except that it’s hopeless now because he

already lost with them, so they’re not going to pay him the $50. So just for the memory, this is

what would have been sent. There are no letters sent here, but he

would have gone zero one, zero one. And he waits for the game, and

out comes: the home team won. So you can see who he’s going to

send letters to now, right? The other guys are a lost cause. This would have been sent

to them, but that’s OK. And he waits, and what

happens this time? The home team lost. And therefore, here is

your next letter. Home team won. Here is

your next letter. Only two people are surviving

from this thing. And here is the result,

the home team won. Now at that point, the guy

sent how many letters? 32 plus 16 plus 8 plus 4 plus 2, plus the one asking for $50,

so about 64, 63 to be exact. The postage on that, writing the letter,

he probably spent $30 on that. And he’s charging you, the lucky

guy out of the 32, $50. That’s a money making proposition. Very nice, and it’s understood

and illegal, by the way! But the interesting thing here is to

understand, why is this related to what we’ve just talked about? You thought the prediction ability

was great because you only saw your letters. There is one hypothesis, and

it got it right perfectly. The problem is that actually, the

hypothesis set is very complex, and therefore the prediction

value is meaningless. You just didn’t know. You didn’t see the hypothesis set. So now we understand what is the

complexity of an object. Now we go to the question,

why is simpler better? So the first thing to understand is that

we are not saying that simpler is more elegant. Simpler is more elegant, but this is

not the statement of Occam’s razor. Occam’s razor is stating that simpler

will have better out-of-sample performance. That’s a concrete statement. In all honesty, if Occam said that you

take the more complex guy and it will give you better out-of-sample

error, I will take the more complex one, thank you. I am after performance. I’m not after elegance here. It’s nice that the elegant guy happens

also to be better, but we need to establish that it is actually better. And there is a basic argument. It

manifests itself in many ways, and we have already run one in this

course during the theory. And you put some assumptions, and

there’s a formal proof under idealized conditions of the following. Instead of going through any formal

proofs– quite a variety of them, I am extracting the crux of the proof. What is the point being made? And I’m going to relate it to the

proof that we ourselves ran. So here are the high-level steps. There are fewer simple hypotheses

than complex ones. That is what we established from

the definition of complexity. And in our case, that was captured

by the growth function. You probably have forgotten

what this is, long ago. This was taking N points, finding what

your hypothesis set can generate in terms of different patterns on those

N points, which we call dichotomies. So if it can generate everything like

the postal guy, then it’s a huge hypothesis set. If it can generate few of them, then

it’s a simple hypothesis set, and it’s measured by that growth function, and

that resulted in the VC dimension. Remember all of that? So now, fine. Fewer simple hypotheses

than complex ones. OK, then what? The next thing is because there are

fewer ones, it is less likely to fit a given data set. That is, you have N points, and

you’re going to generate labels. Let’s say you generate them at random,

and you ask yourself, what are the chances that my hypothesis

set will fit? Well, if it has few of those guys,

obviously that goes down, and the probability, if you take it uniformly,

simply would be the growth function divided by 2^N. If my growth function is polynomial, then

very quickly, the probability of fitting a given data

set is very small. OK, fine, I can buy that. So now that’s nice, but you want

to convince me now that simpler is better in fit. Here, you told me that I cannot fit. So what is the point? The punchline in all of those is that if

something is less likely, then when it does happen, it’s more significant. And there are many manifestations of

this, even when you define the entropy that I alluded to. A probability of an event is p. What is the information associated

with that particular point? The smaller the probability, the bigger

the information, the bigger the surprise when it happens. And indeed, you define the term

as being logarithm 1 over p. So if p is very small, tons

of bits of information. If something half the time will happen,

half the time will not happen, it’s just 1 bit. It’s not a big deal. And, looking back at the postal

scam, the only difference between someone believing in the scam and

someone having the big picture is the fact that the growth function, from your

point of view when you received the letters, was 1. You thought you were the only person.
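A minimal sketch of the scam and of the "less likely means more significant" argument, in Python. The function names are illustrative and not from the lecture, the game outcomes are an arbitrary sequence (1 for a home win, 0 for a loss), and the probability of fitting assumes every label pattern on the N points is equally likely.

```python
import math
from itertools import product


def run_scam(outcomes):
    """Halve the mailing list each week, keeping only recipients who saw a correct call.

    Each recipient is identified by the sequence of predictions he would be sent,
    so the initial mailing list is all 2^N binary sequences of length N.
    """
    recipients = list(product([0, 1], repeat=len(outcomes)))
    letters_sent = 0
    for week, outcome in enumerate(outcomes):
        letters_sent += len(recipients)  # one prediction letter per remaining recipient
        recipients = [r for r in recipients if r[week] == outcome]
    return recipients, letters_sent


def prob_of_fit(num_dichotomies, n_points):
    """Chance that some hypothesis matches uniformly random labels on n_points."""
    return num_dichotomies / 2 ** n_points


def surprise_bits(p):
    """Information, in bits, carried by an event of probability p."""
    return math.log2(1 / p)


outcomes = (0, 1, 0, 1, 1)  # 5 weeks of results, chosen arbitrarily
survivors, letters = run_scam(outcomes)
print(len(survivors), letters)  # 1 survivor; 62 prediction letters (the payment request is one more)

# What the recipient believes: a single hypothesis, so a perfect record is surprising.
print(surprise_bits(prob_of_fit(1, 5)))   # 5.0 bits
# What actually happened: all 2^5 patterns were covered, so a perfect record is certain.
print(surprise_bits(prob_of_fit(32, 5)))  # 0.0 bits
```

Whatever the outcomes are, exactly one recipient ends up with a perfect record, which is why that record carries no information once the whole mailing list is taken into account.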

Here is one hypothesis, and you got it right, and you gave a lot of

value for that because this is unlikely to happen. On the other hand, the reality of it is

that the growth function is actually 2^N and this is certain to happen,

so when it happens, it’s meaningless. Let’s look at a scientific experiment,

where a fit is meaningless. So you are running an experiment, or you

ask people to run an experiment, to establish whether conductivity of

a particular metal is linear in the temperature. I can design an experiment for that. So you go and you ask two scientists to

conduct experiments, and they go, and they come back with

the following results. Here is the first scientist. Took the metal, but they had a dinner

appointment, so they were in a hurry, so they got 2 points and drew

the line and gave you this. The second guy had a supper appointment,

so had more time to do it, so took 3 points,

and then drew the line. I have a very specific question, which

is: what evidence do they provide for the hypothesis that conductivity is

indeed linear in the temperature? What is clear without thinking too much

is that this guy provided more evidence than this guy. It is interesting to realize that this

guy provided nil, none, nada. Why is that? Because obviously, 2 points can

always be connected by a line. So the notion that goes with this

is called falsifiability. If your data has no chance of falsifying

your assertion, then by the same token, it does not provide any

evidence for that assertion. You have to have a chance of falsifying

your assertion, in order to be able to draw the evidence. This is called the axiom of

non-falsifiability, and in some sense, it’s equivalent to the arguments

we have done so far. And in our terms, the linear model is

just way too complex for the size of the data set, which is 2, to

be able to generalize at all. And therefore, there is

no evidence here. In this case, this guy could

have been falsified if the red point came here. Therefore, he actually provides

evidence. This is the point. This guy could not have

been falsified. So now we go to the next notion, which

is sampling bias. It’s a very interesting notion, and it’s tricky. And by the way, if you look at all of

these principles, it’s not like they’re just concepts, and nice,

and relate to other fields. They also provide you with red flags

when you’re doing machine learning. For example, when you use Occam’s

razor, what does it mean? It means: beware of fitting the data

with complex models, because it looks great in sample and you are very

encouraged, and when you go out of sample, you know what happens. You know all too well by the

theory we have done. Similarly, when we talk about sampling

bias and later, data snooping, there are traps that we need to avoid when

we practice machine learning. So let’s look at sampling bias,

and we start with a puzzle. Here is the puzzle. It has to do with the presidential

election, not this one. But in 1948, this was the first

presidential election after World War II, which was a big deal, and the two

people who ran were Truman, who was President at the time, and

Dewey. And it was very close. People took opinion polls, and

it was not clear who was going to win. So now, one newspaper ran a phone poll,

and what they did is ask people how they actually voted. So this is not before the election

asking, what do you think? This is the night of the election,

after the election closed, they actually called people picked at random

at their home, asked them: who did you vote for? Black and white. Dewey or Truman, et cetera? They collected the thing, and they

applied some statistical thing or Hoeffding or some other quantity,

and came with the conclusion that Dewey has won decisively. Decisively doesn’t mean he won by 60%. Decisively means that he won

above the error bar. The probability that the opposite

is true is diminishingly small. And the result was so obvious that they

decided to be the first to break the news, and they printed their

newspaper declaring: “Dewey Defeats Truman”. Great. OK, so Dewey won. What happens when someone

wins an election? They have a victory rally. So let’s look at the victory rally. One problem. Victory rally was Truman, and you can

see the big smile on the guy’s face. So what happened? Well, polls are polls and there

is always a probability, and this and that. No, that’s not the issue here. That’s the key. So don’t blame delta for it. delta? What was delta again? We’ve been doing techniques

for a while. I forgot all about the theory. So let’s remind you what delta was. We were talking about the discrepancy

between in-sample, the poll, and out-of-sample, the general population, the

actual vote, and we were asking ourselves, what is the probability

that this will be bigger than something, such that the

result is flipped? You thought it was Dewey winning,

and it turned out to be Truman. And that turned out to be less than or

equal to delta, and delta is expressed in terms of epsilon, N, and whatnot. So in principle, it is possible,

although not very probable, that the newspaper was just incredibly unlucky. Now, the statement is

very interesting. No, the newspaper was not unlucky. If they did the poll again and again and

again, with 10 times the sample, or 100 times the sample, they will

get exactly the same thing. OK?! So what is the problem? The problem is the following. There is a bias in the poll they

conducted and it is because of a rather laughable reason. In 1948, phones were expensive. That means that households that had

phones used to be richer, and richer people at that point favored Dewey more

than the general population did. So there was a sampling bias. It was always the case that the

population they were asking actually favored Dewey. The sample was very reflective of the

general population, of that mini general population. The problem is that, that general

population is not the overall general population. And that brings us to the statement

of the sampling bias principle. It says that if the data is sampled in

a biased way, then learning will produce a similarly biased outcome. Learning is not an oracle, not

like the football oracle. Learning sees the world through

the data you give it. I’m a learning algorithm,

here is the data. You give me skewed data, I’m going

to give you a skewed hypothesis. I’m doing my job. I’m trying to fit the data. So this is always the case, and then

you realize that there is always a problem in terms of making sure that

the data is actually representative of what you want. So again, we put this in a box. That’s the second principle,

so it’s important. And let’s look at a practical

example in learning. In financial forecasting, people use

machine learning a lot, and sometimes when you look at the markets, the

markets are completely crazy. A rumor comes out and the market

goes this way, et cetera. And you are a technical person, you

are trying to find an intrinsic pattern in the time series. So you decide, I’m going to use the

normal conditions of the market. So I’m going to take periods of the

market where the market was normal, and then there is actually a pattern

when people buy, buy, buy, and sell, sell, sell, something happens, or

whatever you are going to discover using your linear regression or

other learning algorithm. And you do this. And then you deploy it, and when you

test it, you test it in the real market, and realize that now

there is a sampling bias. In spite of the fact that you were very

happy in-sample, you actually forgot a part of the market, and you

don’t know whether that part will be terrible for you, great for

you, or neutral for you. You just don’t know. That’s what sampling bias does. The newspaper could have done this poll

and, by their sheer luck, the general population thinks the same of

Truman and Dewey as the small sample they talked about, in which case the

result would have come out and they would have never discovered

that they made a mistake. So sampling bias makes you vulnerable,

at the mercy of the part that you didn’t touch. In this case, you didn’t touch the

market in certain conditions, and if it does happen, all bets are off. One way to deal with sampling bias

is matching the distributions. It’s a very interesting technique, and

it’s actually applied in practice. I’m going to mention that. So what is the idea? The idea is that you have a distribution

on the input space, in your mind, and there was one assumption

in Hoeffding and VC inequality and all of that. They didn’t make too many assumptions,

but one assumption they certainly made is that you pick the points for training

from the same distribution you pick for testing. That was the only thing

that they require. So when you have sampling

bias, that is violated. And therefore, you try to say

I don’t have the same distribution. I have data picked from some

distribution, and I’m going to deliver the hypothesis to the customer, and

they’re going to test it in other conditions. What do I do? What you do, you try to

match the distributions. You don’t reach for the distributions

and match them. You do something that will effectively

make them match. And you look at this, and let’s

say that this is the training distribution, and the test distribution

is off a little bit. This is a probability density function. Both of them are Gaussian. One of them is off and with

a different sigma. So what you do, if you have access to those– if

someone tells you what the distributions are and then gives you

a sample, there is a way by either giving different weights for the

training data, or re-sampling the training data, to get another set as

if it was pulled from the other distribution. It’s a fairly simple method. Very seldom that you actually have the

explicit knowledge of the probability distributions, so it’s not that useful in

practice, but in principle, you can see that it can be done. And the price you pay for it is

that you had 100 examples. When you are done with this scaling and

re-sampling or whatever method you use, the effective size now is 90. So you lose a little bit in terms of

the independence of the points, and therefore, you get effectively

a smaller sample because of it. But at least, you deal with the

sampling bias that you wanted to deal with. Now, this method works, and even if you

don’t know the distribution, there are ways to try to infer

the distribution that work. But it doesn’t work if there is a region

in the input space where the probability is zero for training, nothing

will be sampled from that part, but you are going to test on it. There is a probability of getting

a point there, very much like guys without a phone. That happened to have zero probability

in the sample, but they don’t have zero probability in the

general population. And in that case, there is nothing that

can be done in terms of matching, because obviously you don’t know

what happened in that part. On the other hand, in many other cases,

there is a simple procedure, which is actually very

useful in practice. If you look at, for example, the Netflix

competition, one of the things you realize is that I have the

data set, it’s a huge data set, 100 million points. And then I’m going to test your

hypothesis on the final guys, the final ratings. So it’s a much smaller set. And the interesting aspect about it is

that if you look at the distribution of the general ratings, the 100 million,

it really is different from the distribution of these guys. Therefore, the question came up, can I

do something during the training such that I make the 100 million look as if

they were pulled from the distribution of the last guys? Very interesting question, has a very

concrete answer, and the 100 million become 10 million, not that you are

throwing away points, but you are weighting them such that when you are

done, they look like a smaller set. But then you are actually matched to

that, and you can get a dividend in performance. So there is a cure for sampling bias in

certain cases, and there is no cure in other cases, in which all you can

do is admit that you don’t know how your system will perform in the

parts that were not sampled. That would be fatal if you are doing

a presidential poll, but may not be as fatal when you are doing machine

learning, because all you are going to do, you are going to warn against

using this system within that particular sub-domain. Third puzzle, try to detect

sampling bias here. Credit approval. We have seen that before. That’s

a running example in the course, so let me remind you what that was. The bank wants to approve

credit automatically. It goes for the historical records of

customers who applied before, and they were given credit cards, so you have

a benefit of, let’s say, 3 or 4 years worth of credit behavior. And you look back at their inputs, and

the inputs in those cases were simply the information they provided at the

time they applied for credit, because this is the information that will

be available from a new customer. And you get something like that.

This is the application. You also have the output, which is

simply– you go back and see whatever the credit behavior is and you ask

yourself, did they make money for me? Because it’s not only credit

worthiness, that you are a reliable person. It’s also that some people who are

flirting with disaster are very profitable for the bank, because they max

out and they pay this ridiculous percentage, so they make a lot of money

as long as they don’t default. Once they default, it’s a problem. So there’s a question of just,

did you make profit or not? That’s a question. And I’m going to approve future

customers if I expect that they will make profit for me. That’s the deal. Where is the sampling bias? We probably alluded to it

in one form or another. The problem is that you’re using

historical data of customers you approved, because these are the

only ones you actually have credit behavior on. So the guys who applied, and

you rejected them, are not part of this sample. And when you are done, you are

going to have a system that applies to a new applicant. You do not know a priori whether that

applicant will be approved or not, according to your old criteria. So it could belong to the population

that was never part of your training sample. Now, this is one case where the sampling

bias is not that terrible in terms of effect, not in terms of

characterizing what is going on. You have a part of the population, and

they have zero probability in terms of training, and nonzero probability

in terms of testing. It’s good, old-fashioned

sampling bias. But the point is that banks tend to be

a bit aggressive in providing credit because, as I mentioned, the borderline

guys are very profitable. So you don’t want to just be

conservative and cut them off, because you’re going to be losing revenue. Because of this, the boundary that you

are talking about is pretty much represented by the guys

you already accepted. You already made mistakes

in what you accepted. So when you get that boundary, the

chances are the guys you missed out will be deep on one side. You got all the support vectors,

if you want, so the interior points don’t matter. They matter a little bit, but

actually, that system with the sampling bias does pretty

good on future guys. This is borne out by evidence. If you reject someone, how

do you know that the decision was good, given that you rejected them? They apply somewhere else, and they make

the other guy lose money, so you realize that your decision was good. So you can verify, if you have

a consortium of banks, whether actually that sampling bias here has an impact,

or doesn’t have an impact. Final topic, data snooping,

the sweetest of all. Well, it’s the sweetest because it

is so tricky, and manifests itself in so many ways. Let me first state the principle. The principle says, if a data set has

affected any step of the learning process, then the ability of the same

data set to assess the outcome has been compromised. Very simply stated. The principle doesn’t forbid

you from doing anything. You can do whatever you want. Just realize that if you use

a particular data set, whether it’s the whole, or a subset or whatever, use it

to navigate into– I’m going to do this, I’m going to choose this model,

I’m going to choose this lambda, I’m going to do this, I’m going to

reject this, whatever it is. You made a decision, then when you

have an outcome from the learning process and you use the same data set

that affected the choice of that, the ability to fairly assess the performance

of the outcome has been compromised by the fact that this was

chosen according to the data set. I think this is completely understood by

us, having gone through the course. We put it in a box, and then we make

the statement that this is the most common trap for practitioners,

by and large. I’ve dealt with Wall Street firms

quite a bit in my career, and there are lots of people who are using

machine learning, and it is rather incredible how they manage

to data-snoop. And there is a good reason for it,

because when you data-snoop, you end up with better performance, you think,

because that’s why you snooped. I looked at the data, I

chose a better model. The other guy didn’t look at the data,

and they are struggling with the model, and they are not getting the

same in-sample, and I am ahead. It looks very tempting to do. And it’s not just looking at the data. The problem is that there are many ways

to fall into the trap, and they are all happy ways. So if you think of it as landmines,

it is actually happy landmines. You very cheerfully step on the mine,

because you think you are doing well. So you need to be very careful. And because it has different

manifestations, what I’m going to do now, I’m going to go through examples of

data snooping. Some of them we have seen before, and some

of them we haven’t. And then you will get the idea. What

should I avoid, and what kind of discipline or compensation should I have,

in order to be able not to suffer from the consequences

of data snooping? So the first way of data snooping,

we have seen before, is looking at the data. So I’m borrowing something

from our experience. Remember the nonlinear transform? Yeah. So you have a data set like this, and

let’s say you didn’t even look at the data and you decided that, I’m going

to use a 2nd-order transform. So this is the transform, you

take a full 2nd order. You apply it, and you look at the

outcome, and this is good. I managed to get zero in-sample error. What is the price I’m paying

for generalization? One, two, three, four, five, six. That’s an estimate for the VC dimension,

so that’s the compromise between this six and however

many points, et cetera. So you realize, I fit the data

well but I don’t like the fact that it’s six. I don’t have too many points, so my

handle on generalization is not good. So let me try to do better,

at least in your mind. So what you do is say, wait a minute,

I didn’t need all of these guys. I could have gone with just this guy,

knowing that this is the origin. All you need to do is just x_1

squared and x_2 squared. This is just a circle centered

at the origin. Why do I need the other funny stuff? This would be if I’m going

for a more elaborate set. So now one, two, three, now I have VC

dimension of three, so I’m better. Of course, we know better, but

I’m just playing along. And then you get carried away and

say, I can even do this. It’s not an ellipse, it’s a circle, so

I can just add up x_1 squared and x_2 squared as one coordinate,

and then I have two. And you see what the problem

is, and the problem is what we mentioned before. What you are really doing, you are

a learning algorithm in your own right, but free of charge. That’s the problem. You are looking at the data, and you are

zooming in, and you’re zooming in. You’re learning. You’re learning. You are narrowing down the hypotheses, and

then leaving the final learning algorithm just to get you the radius. Yeah, big deal. Well, the problem is that you are

charging now for a VC dimension of two, which is the last part of the

learning cost, which is choosing the coefficients here. But you didn’t charge for the fact that

you are a learning algorithm, and you took the data into consideration,

and you kept zooming in from a bigger hypothesis set. You didn’t charge for the full

VC dimension of that. Now, it is very important to realize

that the problem here is that the snooping here involves the data set. Because what happens when you

look at the data set? You are vulnerable to designing your

model, or your choices in the learning, according to the idiosyncrasies

of the data set. And therefore, you may be doing well on

that data set, but you don’t know whether you will be doing well in another,

independently generated data set from the same distribution, which would be

your out-of-sample, so that’s the key. On the other hand, you are completely

allowed, encouraged, ordered to look at all other information related to your

target function and input space, except for the realization of the data

set that you are going to use for training, unless you are going

to charge accordingly. So here is the deal. Someone comes in, I ask him, how

many inputs do you have? What is the range of the inputs? How did you measure the inputs? Are they physically correlated? Do you know of any properties

that I can apply? Is it monotonic in this? All of this is completely valid and

completely important for you in order to zoom in correctly, because right

now, you are not using the data. You are not subject to

overfitting the data. You are using properties of the target

function and the input space proper, and therefore improving your chances

of picking a correct model. The problem starts when you look

at the data set and not charge accordingly, very specifically. Here is another puzzle. This one is financial forecasting.
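
Before that puzzle, the narrowing of the transform described above can be made concrete. The sketch below is not from the lecture: the synthetic circular data, the helper names (`phi_full`, `phi_axes`, `phi_radial`, `fit_and_score`), and the pseudo-inverse fit are all assumptions made for illustration. It builds the three transforms and counts the weights each one leaves to the final learning algorithm.

```python
import numpy as np

def phi_full(X):
    """Full 2nd-order transform: (1, x1, x2, x1^2, x1*x2, x2^2), 6 weights."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def phi_axes(X):
    """Reduced transform chosen after looking at the data: (1, x1^2, x2^2), 3 weights."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1**2, x2**2])

def phi_radial(X):
    """Further reduced transform: (1, x1^2 + x2^2), 2 weights."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1**2 + x2**2])

def fit_and_score(Z, y):
    """Fit weights with the pseudo-inverse and return (weights, in-sample error)."""
    w = np.linalg.pinv(Z) @ y
    e_in = np.mean(np.sign(Z @ w) != y)
    return w, e_in

# Hypothetical data set: labels come from a circle centered at the origin.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.6, 1.0, -1.0)

for name, phi in [("full", phi_full), ("axes", phi_axes), ("radial", phi_radial)]:
    Z = phi(X)
    w, e_in = fit_and_score(Z, y)
    # The weight count is only what the final algorithm is charged for.
    # It ignores the narrowing done by a person who looked at the data.
    print(f"{name}: {Z.shape[1]} weights, E_in = {e_in:.3f}")
```

The weight count drops from six to three to two, but the last two transforms were picked by looking at this particular data set. The honest charge is for the whole set that was explored, not for the two weights that remain.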

Befitting. So right now, there will be data

snooping somewhere here, and you need to look out for it. In this case, this is a real

situation with real data. You are predicting the exchange rate

between the US dollar versus the British pound. So you have eight years worth of daily

trading, where you just simply take the change from day to day. And eight years would be

about 2,000 points. There are about 250 trading days

per year, at least when the data was collected. And what you are planning

to do is the following. You look here. Let me magnify it. This is your input for the prediction,

and this is your output. So r is the rate. So you don’t look at the rate in the

absolute, you look at delta rate, the difference between the rate today

and the rate yesterday. That’s what you’re trying to predict. You’re asking yourself whether

it’s going up or down every day, and by how much. So you get delta, and you get delta for

the 20 days before, hoping that a particular pattern of up and down in

the exchange rate will make it more likely that today’s change, which hasn’t

happened yet– you are deciding to either buy or sell at the open– whether this will be positive

or negative and by how much. So if you make a certain prediction,

then you can obviously capitalize on that, and make predictions

according to that. And if you are right more often than

not, you will be making money because you are losing less often than

winning if you have the right objective function. So this is the case. What happens

here is that now you have the 2,000 points, so for every day, there

is a change, delta r. And what you do first, you normalize

the data to zero mean and unit variance. And then after that, you have

this array of 2,000 points. You create training set and test set. So the training set in this case, you

take 1,500 points, 1,500 days. So every day now, you take the day, and

you take the previous 20 days as their input. That becomes your training. And for the test, you picked it at

random, not the last ones, just to make sure that there is no funny stuff,

change in this or that. You just want to see if something is

inherent, so just to be on the safe side, you did it randomly. And then you take 500 points

in order to test on. So right now, out of the array of 2,000

points, you have a big array of 20 points input, one output, 20 points

input, one output, 1,500 of those. And on the other side on the test, 20

points input, one output, 20 inputs, one output, 500 of those. This is for the test. That’s the game. So you go on with the training. You train your system on the training

set, and to make sure, because you heard of data snooping,

these guys are in a lock. You didn’t look at the

data at any point. You just carried all of this

automatically, and then when you are done and you froze the final hypothesis,

you open the safe, you get the test data, and you

see how you did. And this is how you did. You train only on D_train, you test on

D_test, and this is what you get. I’m not saying how often you got it

right, but I’m actually saying that you put a trade according to the

prediction, and I’m asking you how much money you made. So for the 500 points, sometimes you

win, sometimes you lose, but you win more often than you lose,

which is good. And at the end of two years worth–

that’s what 500 days would be– you would have made a respectable 22%

unleveraged, so that’s pretty good. So you are very happy, and now having

done that, you go to the bank and tell them I have this great

prediction system. Here is the system. I’m going to sell it to you, and I

guarantee that it will be– you do the error bars and whatever. And they go, and they go live, and they

lose money, and they sue you, and all of that. So you ask yourself,

what went wrong? What went wrong is that

there is snooping. And what’s interesting is, where

exactly is the snooping? So there are many things: random, the

fact that I used inputs that happened to be outputs to the other guy? No, no, that’s legitimate. I’m just really getting the pattern. You just go around it, and it is really

remarkably subtle, to the level where you can fall into that very, very

easily, and here is where the snooping happened. The snooping happened

when you normalized. What? I had the daily rates, right? 2,000 of them. I have the change. All of that is legitimate. Now, I slipped a fast one by you– I hope I did– when I told you, first you

normalize this to zero mean and unit variance. It looked like an innocent step, because

you get them to a nice numerical range, and some methods will actually

ask you to please put the data normalized, because it’s sensitive to

the dynamic range of the data. The problem is that I did this

before I separated the training from the testing. So I took into consideration the mean

and variance of the test set. That extremely slight snooping into

what’s supposed to be the test set, supposed not to affect anything, has

affected me, but by just a mean and– How could it possibly

make a difference? Well, if you didn’t do that,

you split the data first. You took the training set only,

and you did the normalization. And whatever the mu and sigma squared

that did the normalization for the training set, you took them frozen and

applied them to the test set so that they live in the same range of values. And you did the training now and

the test without any snooping. Under those conditions, this is

what you would have gotten. So no wonder you lost money. All the money you made is because you

sniffed on the average of the out-of-sample. And the average matters, because if you

think about it, let’s say that the US dollar had a trend of going up. That will affect the mean,

but you don’t know that– at least, you don’t know it for the

out-of-sample unless you got something out-of-sample. So I’m not saying normalization

is a bad idea. Normalization is a super idea. Just make sure that whatever parameters

you use for normalization are extracted exclusively from what

you call a training set, and then you are safe. Otherwise, you will be getting

something that you are not entitled to get. Easy to think about, if you are actually

thinking: I’m going to deploy this system. I don’t have the test set. So if you don’t have the test set, you

cannot possibly use those points in order to normalize. So use only things that you will

actually be able to use when you deploy the system. In this case, you have

only the training. Now, the third manifestation of

data snooping comes from the reuse of the data set. That is also very common. So what you do, I give you

a homework problem. Oh, I am very excited about

neural networks. Let me try neural networks. Oops, they didn’t work. I heard support vector

machines are better. Let me try them. Yeah, I did, but it was

the wrong kernel. Let me use the RBF kernel. Oh, maybe I’m just using too

sophisticated a model. Let me go back to the linear models,

and just use a nonlinear transformation. And eventually, using the same

data set, you will succeed. And the best way to describe it is

a very nice quote in machine learning. It says, “If you torture the data long

enough, it will confess”, but exactly the same way that a confession would

mean nothing in this case. So the problem here is that when you

do this, you are increasing the VC dimension without realizing it. I used neural networks and it didn’t

work, and then I used support vector machines with this and that. Guess what is the final model

you used in order to learn? The union of all of the above. It’s just that some of them

happened to be rejected by the learning algorithm. That’s fine, but this is

the resource you had. So you think of the VC dimension, and

the VC dimension is of the total learning model. Again, as we will see, there will

be remedies for data snooping, and there is a question of– it’s not like I

have to try a system, and when I fail, I just quit. That’s not what is being said. It’s just asking you to account

for what you have been doing. Don’t be fooled into thinking that I

can do whatever, and then the final guy that I use with a very simple model,

after all the wisdom that I accumulated from the data, is the VC

dimension that I’m going to charge. That just doesn’t work. The interesting thing is that this could

happen, not because you used the data, but because others

used the data. Oh my God, it’s really terrible here. Here’s the deal. You decide to try your methods

on some data set. So you go to one of the data sets

available on the internet, let’s say for heart attacks or something, and

you say, I am very aware of data snooping, right? I’m not going to look at the data, I’m

not going to normalize using the data. I’m going to get the data, and put them

in a safe, and close the safe, and I will just do my homework

before I even touch the data. And your homework is in the form of

reading papers about other people who used the data set. You want to get the wisdom, so

you use this, and you find that people realize that Boltzmann machines

don’t work in this case. The best kernel for the SVM happens

to be polynomial of order three, whatever it is. So you collect it, and you look

at it, and then you have your own arsenal of things. So as a starting point, you put

a baseline based on the experience you got, and you say that I’m

going now to modify it. Now you open the safe

and get the data. Now you realize what happened. You didn’t look at the data, but you

used something that was affected by the data, through the work of others. So in that case, don’t be surprised that

if all you did was determine a couple of parameters, that’s the only

thing you added to the deal, and you got a great performance. And you say, I have two parameters,

VC dimension is 2, I have 7,000 points. I must be doing great out of sample,

and you go out of sample, and it doesn’t happen. Doesn’t happen because actually, it’s

not the two parameters, it’s all the decisions that led to that model. And the key problem in all of those

is always to remember that you are matching a particular

data set too well. You are now married to that data set. You kept trying things, et cetera, and

after a while, you know exactly what to do with this data set. If someone comes and generates another

data set from the same distribution, and you look at it, it will look

completely foreign to you. What happens? It used to be that whenever these two

points are close, there is always a point in the same line far away. That’s obviously an idiosyncrasy

of the thing. Now you give me a data set

that doesn’t have that. That must be generated from

a different distribution. No, it’s generated from

the same distribution. You just got too much into this data

set, to the level where you are starting to fit funny stuff,

fitting the noise. There are two remedies for data

snooping, and I’m going to do this, and then give you the final

puzzle, and call it a day. You avoid it, or you account for it. That’s it. So avoiding it is interesting. It really requires strict discipline. So I’ll tell you a story

from my own lab. We were working on a problem, and

performance was very critical, and we were very excited about what we are

having, all the ingredients that make you go for data snooping. You just want to push it a little bit. We realized that this is the case,

so we had that discipline that we’ll take the data– the first thing we did, we sampled

points at random, put them in the safe, and then the rest of the guys

you can use for your training, validation, whatever you want. So at some point, one of my colleagues

who was working on the problem declared that they already have

the final hypothesis ready. It was a neural network at that point. So now I was the safe keeper, so now

I’m supposed to give them the test points, in order to see what

the performance is like. I smelled a rat, so what I decided, I

asked them, could you please send me the weights of the final hypothesis

before I send you the data set? That was the requirement, because

now it’s completely clear. He’s committed to one

final hypothesis. If I send him the data set and he says

it performed great, I can verify that because he has already sent me that. It’s a question of causality

in this case. And the problem is that it is

not that difficult to come– Here is the data set, and what you

really had, you had the candidate, but you had three other guys that

are in the running. And then you look at the data, and you

decide, maybe I get one from the running, et cetera. You can do very little. And in particular, in financial

applications, it’s extremely vulnerable, because it’s so noisy. It is very easy when you fit the noise

a little bit, you will make much better performance than you will ever

get from the pattern, so you had better be extremely careful. And therefore, you have a discipline

that really is completely waterproof that you did not data-snoop. Accounting for data snooping is not

that bad, because we already did a theory, and when we have a finite number

of hypotheses we are choosing from for validation, we know

the level of contamination. Even if it’s an infinite one,

we have the VC dimension. We had very nice guidelines to tell us

how much contamination happened. The most vulnerable part is looking at

the data, because it’s very difficult to model yourself and say, what is the

hypothesis set that I explored, in order to come up with that model

by looking at the data? So because accounting is very difficult,

that’s why I keep raising a flag about looking at the data. But if you can account, by all means,

that’s all you need to do. Look at the data all you want, just

charge accordingly, and you will be completely safe as far as machine

learning is concerned. Final puzzle, and we call it a day. And we are still in data snooping,

so maybe this has to do with data snooping, but it also has to do

with sampling bias, so it’s an interesting puzzle. This is a case where you are testing

the long-term performance of a famous strategy in trading, which

is called “buy and hold”. What does it mean? You buy and hold. You don’t keep– I’m going to sell

today, because it’s going down. No, you just buy, and sit

on it, forget about it. It’s like a pension plan or something. And five years later, you look

at it and see what happens. So you want to see how much money

you make out of this. So what you do is you decide to

use 50 years worth of data. That’s usually a good life span in

a professional life, so that will cover how much money you make at the time

you retire, from the time you start contributing to it. So here is the way you do the test. You want the test to be as broad as

possible, so you go for the S&P 500. You take all currently traded

stocks, the 500 of them. And then you go back, and you assume that

you strictly applied a buy and hold for all of them. So don’t be tempted to say that I’m

going now to modify it, because this guy crashed at some point, so if I sold

and then bought again, I would make more money. No, no, no. It’s buy and hold we are testing. That was frozen. So you do this, and then you compute,

and you find that you will make fantastic profit. And you compute, if I do this– you are

now young in your career– and apply it, by the time I retire,

I will have a couple of yachts and I will do this. It’s a wonderful thing. Can you see the problem? You are very well trained now,

so you can detect it. The problem is there is a sampling bias,

formally speaking, because you looked at the currently traded stocks. That obviously excludes the guys that

were there and took a dive, and that obviously puts you at

a very unfair advantage. And it’s interesting that people do

treat this not as sampling bias but as data snooping, in spite of the

fact that it doesn’t fit our definition of data snooping. It does fit the definition of snooping,

because you looked at the future when you are here. It’s as if you are looking 50 years from

now, and someone tells you which stocks will be traded at that point. So that’s not allowed. But nonetheless, some people will

treat this as data snooping. In our context, this is formally just

sampling bias, and sampling bias that happens to be created or caused

by a form of snooping. I will stop here, and we will take

questions after a short break. Let’s start the Q&A. MODERATOR: In the last homework,

where people were using LIBSVM, it was emphasized that the data should be

scaled. Why did we not discuss this in the course? PROFESSOR: There are

many things I did not discuss in the course. I had a budget of 18 lectures,

and I chose what I consider to be the most important. There is a question of input data

processing, and there is a question not only of normalization, it’s also

a question of de-correlation of inputs and whatnot, which is

a practical matter. And the fact that I did not cover

something doesn’t mean that it’s not important. It just means that it’s a constrained

optimization problem, and you have the solution, and I have to have

a feasible solution. So that’s what I have. I think we have an in-house question. STUDENT: Thanks. Professor, you mentioned that if you

reuse the same data set to compare between different models, it’s

a form of data snooping. So how do we know what form

of model is better? PROFESSOR: The part of it which

is formally data snooping is the part where you used the failure of the

previous model to direct you to the choice of the new model, without

accounting for the VC dimension of having done that. So effectively, it’s not you that looked

at the data, but the previous model looked at the data and made

a decision, and you didn’t charge for it. So that is the data-snooping

aspect of it. Suppose you did this as a formal hierarchy instead. You start out, here is the data

set, I don’t look at it. I’m going to start with support vector

machines with RBF, and then if I fail, I’m going to do this, et cetera. And given that this is my hierarchy,

the effective VC dimension is whatever, this is completely

legitimate. The snooping part is using the data for

something without accounting for it– in this case, using the data for

rejecting certain models and directing yourself to other models. STUDENT: Yes. So by accounting for the data snooping,

do you mean you consider the effective VC dimension of your entire

model, and use a much larger data set for your entire model? PROFESSOR: You’ll get the VC dimension,

so if the VC dimension is so big that the current

amount of data won’t give you any generalization, the

conclusion is that I won’t be able to generalize unless I get more data,

which is what you’re suggesting. So the basic thing is that you are going

to learn, and you are going to finally hand a hypothesis to someone. What do you expect in terms

of performance? Data snooping makes you much more

optimistic than you should be, because you didn’t charge for things that

you should have charged for. That’s the only statement being made. STUDENT: Is there a possibility that

data snooping will make you pessimistic, will make you

more conservative? PROFESSOR: I can probably construct

deliberate scenarios under which this is the case, but in all the

problems that I have seen, people are always eager to get good performance. That is the inherent bias, and that is

what directs you toward something optimistic, because you do something

that gets you smaller in-sample error, and you think now that this in-sample

error is relevant, but you didn’t account for what it cost you to

get to that in-sample error. So it’s always in the optimistic

direction. STUDENT: Yes. Thank you. PROFESSOR: Sure. MODERATOR: Assuming that there is

sampling bias, can you discuss how you can get around it? PROFESSOR: So we discussed

it a little bit. If there is a sampling bias, if you

know the distributions, you can– let me look at the– so in this case, let’s say that I

give you these distributions. What this means, you generated the data

according to the blue curve, and therefore, you will get

some data here. So what is clear, for example, is that

the data that correspond to the center of the red curve, which is the

test, are under-represented in the training set. And on the other hand, the data that

are here are over-represented. The blue curve is much bigger, it

will give you some samples. It will hardly ever be the case

that you will get that sample from the test distribution. So what you do is devise a way of scaling,

or giving importance– not scaling the y value, just scaling

the emphasis of the examples– such that you compensate for this

discrepancy, as if you are coming from here, and there are some re-sampling

methods that achieve the same effect. So this is one approach. The other approach, which applies in the

absence of those distributions, is to look at the input space in terms

of coordinates. Let’s say that with the case of the

Netflix, you look at, for example, users rated a certain

number of movies. Some of them are heavy users, and

some of them are light users. So you put how many movies a user rated,

and you try to see that in the training and in the test, you have

an equivalent distribution as far as the number of ratings is concerned. And you look for another coordinate and

a third coordinate, and you try to match these coordinates. This is an attempt to basically take

a peek at the distribution, the real distributions that we don’t know, in

terms of the realization along coordinates that we can relate to. So there are some methods to do that.

Basically, you are compensating by doing something to the training set you

have, to make it look more like it was coming from the test distribution. MODERATOR: Is there any counter

example to Occam’s razor? PROFESSOR: Is there– MODERATOR: Counter example

to Occam’s razor or not? PROFESSOR: It’s statistically

speaking in what we– I can take a case where I violate

the marriage between the complexity of an object and the

complexity of the set that the object belongs to. So I can take one hypothesis which is

extremely sophisticated in terms of the minimum description length or the

order of the polynomial, but it happens to be the only hypothesis

in my hypothesis set. Now, if this happens to be close to

your target function, you will be doing great, in spite of

the fact that it’s complex. So I can create things where I start

violating certain things like that. But in the absence of further

knowledge, and in very concrete statistical terms, Occam’s

razor holds. So the idea is that when you use

something simpler, on average, you will be getting a better performance. That’s the conclusion here. MODERATOR: Specifically talking about

applications in computer vision, where the idea of sampling bias comes to mind,

is there any particular method used there to correct this, or just

any of the things we discussed? PROFESSOR: I think it’s the same

as discussed, just applied to the domain. Sometimes the method becomes very

particular when you look at what type of features you extract in

a particular domain, and therefore, it gets modified

in that way. But the principle of it is that you take

the data points from your sample, and give them either different weight

or different re-sampling, such that you replicate what would have happened

if you were sampling from the test distribution. MODERATOR: I think that’s it. PROFESSOR: Very good.

We’ll see you on Thursday.
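
As a small illustration of the buy-and-hold puzzle discussed before the Q&A, the sketch below simulates what happens when only the stocks still traded at the end of the period are counted. Everything in it is an assumption made for the demonstration: the random-walk model of yearly returns, the delisting threshold, the number of stocks, and the function name are hypothetical and are not taken from the lecture or from real market data.

```python
import math
import random


def simulate_buy_and_hold(n_stocks=2000, n_years=50, seed=0):
    """Compare the average final value over all stocks with that of survivors only."""
    rng = random.Random(seed)
    all_final = []
    survivor_final = []
    for _ in range(n_stocks):
        value = 1.0  # one unit invested at the start and never touched
        delisted = False
        for _ in range(n_years):
            # Assumed model: yearly log-return drawn from N(0, 0.2).
            value *= math.exp(rng.gauss(0.0, 0.2))
            if value < 0.2:  # hypothetical threshold: the stock "took a dive"
                delisted = True
                break  # the holding stays frozen at this low value
        all_final.append(value)
        if not delisted:
            survivor_final.append(value)
    all_mean = sum(all_final) / len(all_final)
    survivor_mean = sum(survivor_final) / len(survivor_final)
    return all_mean, survivor_mean, len(survivor_final)


all_mean, survivor_mean, n_survivors = simulate_buy_and_hold()
print(all_mean, survivor_mean, n_survivors)
```

In this toy model the survivors-only average comes out higher than the average over every stock that was bought, because each delisted stock ends below the threshold and is silently dropped from the sample. That is the sampling bias in the test: picking the currently traded stocks uses information that was not available at the start of the 50 years.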
