Regression Intro – Practical Machine Learning Tutorial with Python p.2


Alright, so now we are going to get started with setting up a simple linear regression example. The first thing we need to make sure we have is scikit-learn, pandas, and Quandl. So open up a terminal, command prompt, whatever, and run pip install sklearn, pip install quandl, and pip install pandas. Go ahead and pause the video to install those, and pick back up once you have them.

OK, once you have those, let's get started with a simple example. We are starting with regression, and the idea of regression is to take continuous data and figure out a best-fit line for that data. It basically boils down to trying to model your data, and the way we do that, at least with simple linear regression, is with a straight line. We will talk more about the equation of a line down the line, but as you might remember from school, it is y = mx + b. So if you have x, and you also have m and b, you can figure out what y is. The whole point of regression is to find out what m and b are.

A lot of people use regression with stock prices, so that's what we are going to do, at least in this one. The idea is that this is continuous data: you've got months and months of stock prices, and each price belongs to its own unique day. But all the data is one dataset together, as opposed to classification, where each group of data has its own unique label.

With machine learning, at least with supervised machine learning, everything boils down to features and labels. Features are like your attributes, or in this case, the continuous data. So let's get started, and we'll talk a little more about features as we go. First of all, let's import pandas as pd, and then we are going to import Quandl with a capital Q.
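As a rough illustration of what "finding m and b" means for the line y = mx + b mentioned above, here is a minimal sketch in plain Python. The video does not say at this point how the line is fitted; ordinary least squares is assumed here, and the function name fit_line and the sample points are made up for the example.

```python
def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    m = numerator / denominator
    # The best-fit line always passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(m, b)
```

Once m and b are known, predicting y for a new x is just m * x + b, which is the sense in which the line models the data.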
Then we say df, for dataframe, equals Quandl.get, and we put in the ticker. You can get this from Quandl: if you go to quandl.com, you can use the search to find things. If I search for Google stock, we can probably find it. There are all kinds of different datasets here, but we are looking simply for the free WIKI one. Here it is. When you pick a dataset, you can either download it here or, more importantly, here is the Quandl code, and if you click on Python, you get the exact statement to fetch it.

If you have an account, you can make basically unlimited requests for free data. We are not going to use an account or an auth token here. Without an account, I think you are limited to something like 50 calls a day. We are only going to use Quandl for a short while here, and then maybe later on, so you really don't need to create an account, but if you like Quandl, you might as well make one at some point.

So anyway, it's Quandl.get, and WIKI/GOOGL was the ticker there. Then let's print df.head() just so we can see what we are working with. You'll see that basically each column here is a feature: the open, high, low, and close are all features. In machine learning you can have all the features you want, but you want meaningful features, ones that actually have something to do with your data. Some people are pretty avid believers in ideas like pattern recognition with stock prices, and that might be you, but do you need every single one of these open, high, low, and close columns to do pattern recognition? No.
Also, you'll notice we've got open, high, low, close, and volume, and then the adjusted versions. Adjusted means adjusted for things like stock splits. For a stock split, say your company has 10 shares and each share is $1000, and you decide you want people to be able to buy shares of your company for less than $1000. So you might say, OK, bam, every share is now two shares, so we have 20 total shares and the share price is $500. Adjusted prices account for that, so it doesn't look like the stock price went from $1000 to $500. That's what adjusted is, and those are the columns we are going to use.

But again, each one of these is closely related to the others; the correlation between any two of these columns is super high. So would you use every one of these columns? Does the next one really bring that much meaningful data? No. One thing to always think about when you have features and labels, though, is the relationship between those columns. When we get into something like deep learning, and some of the other algorithms, they can start to discover relationships between attributes, but with regression, simply no. What you want to do is simplify your data as much as possible.
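The stock split arithmetic described above can be sketched in a few lines of plain Python. This is only an illustration of split adjustment under simple assumptions (a single 2-for-1 split, no dividends); the helper name back_adjust and the prices are made up, and this is not how Quandl computes its adjusted columns in detail.

```python
def back_adjust(prices, split_index, ratio):
    """Divide every price before split_index by ratio so the series is continuous."""
    return [p / ratio if i < split_index else p for i, p in enumerate(prices)]


# The company's total value is unchanged by the split: 10 shares at $1000, or 20 at $500.
value_before = 10 * 1000
value_after = 20 * 500

# Made-up closing prices; the 2-for-1 split takes effect at index 2.
raw_close = [1000.0, 1010.0, 505.0, 500.0]
adjusted_close = back_adjust(raw_close, split_index=2, ratio=2)
print(adjusted_close)
```

In the raw series the price appears to halve overnight; in the adjusted series the pre-split prices are rescaled, so that artificial drop disappears.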
You want as many meaningful features as you can get, but useless features, as we'll show through this series, can cause a lot of trouble for your machine learning classifiers, especially the simpler ones in supervised learning. Anyway, let's close out of this and grab some features. First we are going to pare this down: we say df equals df, and then we pass a list of all the columns we want to keep. So we take adjusted open, and then I'm just going to copy and paste that for adjusted high, low, close, and volume. Now we have just these columns, so we have recreated our dataframe to be only the adjusted open, high, low, close, and volume.

Like I was saying, some of these columns are relatively worthless on their own, but they do have some relationships. For example, what's interesting about high and low is that the margin between them tells us a little about volatility for the day. Also, the open price is the starting price for the day, and its relationship to the close price tells us whether the price went up or down, and by how much. So the relationship there is very valuable. But a simple linear regression is not going to seek out that relationship.
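The paring-down step described above might look like the following sketch. To keep it runnable without a Quandl download, it uses a tiny hand-made dataframe; the numbers are invented, and the column names are assumed to match the 'Adj. ...' names of the WIKI dataset.

```python
import pandas as pd

# Made-up rows standing in for the downloaded price table.
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0],
        "Close": [101.0, 103.0],
        "Adj. Open": [50.0, 51.0],
        "Adj. High": [52.0, 53.0],
        "Adj. Low": [49.0, 50.5],
        "Adj. Close": [50.5, 51.5],
        "Adj. Volume": [1000.0, 1200.0],
    }
)

# Double brackets: the inner list names the columns to keep,
# and the outer brackets index the dataframe with that list.
df = df[["Adj. Open", "Adj. High", "Adj. Low", "Adj. Close", "Adj. Volume"]]
print(df.head())
```

Indexing with a list returns a new dataframe containing only those columns, in the order given, which is why the unadjusted columns are gone afterwards.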
It's just going to work with whatever features you feed it. So what we need to do is define those special relationships and use them as our features, rather than the almost redundant prices that aren't going to give us anything else very useful. First let's do the high minus low percent, which is almost like the percent volatility. We define a new column and call it HL_percent. Percent change in this case would be the high minus the low, divided by the low, times 100. So for us it would be the df adjusted high minus the df adjusted close, divided by the df adjusted close, and then times 100. This happens on a per-row basis: it's just this column minus that column, divided by that column. You can multiply by 100 or not; the classifier is not going to care, we are just doing that for ourselves.

So that's the high minus low percent. Then we also want the daily percent change, the daily move. I'm going to copy that whole line, paste it, and call this one percent_change. It is pretty much the same thing, only we need to change some stuff. Normally, percent change is the new minus the old, divided by the old, times 100. So that would be adjusted close minus adjusted open, divided by... oh, I'm sorry, we did it the wrong way. It's divided by the old, so this would be the open, times 100. So that's percent change.
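The two derived columns described above can be sketched on a small hand-made dataframe. The prices are invented, and the column names HL_PCT, HL_range_PCT, and PCT_change are assumptions chosen for this example. Note that the expression as spoken uses the adjusted close where the verbal definition uses the low, so both variants are shown side by side.

```python
import pandas as pd

# Made-up adjusted prices for two days (not real market data).
df = pd.DataFrame(
    {
        "Adj. Open": [100.0, 102.0],
        "Adj. High": [104.0, 103.0],
        "Adj. Low": [96.0, 99.0],
        "Adj. Close": [102.0, 100.0],
    }
)

# High-low percent as spoken in the video: (high - close) / close * 100.
df["HL_PCT"] = (df["Adj. High"] - df["Adj. Close"]) / df["Adj. Close"] * 100.0

# The verbal definition instead: (high - low) / low * 100.
df["HL_range_PCT"] = (df["Adj. High"] - df["Adj. Low"]) / df["Adj. Low"] * 100.0

# Daily move: (new - old) / old * 100, with close as new and open as old.
df["PCT_change"] = (df["Adj. Close"] - df["Adj. Open"]) / df["Adj. Open"] * 100.0

print(df[["HL_PCT", "HL_range_PCT", "PCT_change"]])
```

Each arithmetic operation works row by row across whole columns, so no explicit loop is needed; a negative PCT_change simply means the day closed below its open.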
Actually, you could pass close here; again, the classifier doesn't really care as long as everything is normalized, so either way would have been fine, but this is the way you should actually do it. Anyway, once we have that data, we are going to define a new dataframe: df equals df[], and inside the brackets we now list only the columns we really care about. In our case those are the adjusted close, the high-low percent, the percent change, and then volume, which is also somewhat useful to have. Volume is basically how many trades occurred that day, so it is also related to volatility. You could make more features out of some relationship there, but we'll try to keep this pretty simple.

For now we'll just print df.head() to make sure everything worked out, and sure enough it did. We have all the numbers we are interested in, so we've got our features, and eventually one of these might actually wind up being our label. Think about that between now and the next tutorial: features are the attributes that make up the label, and the label is, hopefully, some sort of prediction into the future. So will the adjusted close column actually be a feature, or will it be a label as it stands right now? Think about that, and in the next tutorial we'll pick it up and start getting closer to actually making real predictions with this data. If you have any questions, comments, whatever, leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.

100 thoughts on “Regression Intro – Practical Machine Learning Tutorial with Python p.2”

  • If anyone's getting a "No module named 'Quandl'" error I fixed mine by changing all the 'Quandl's to lowercase.

  • Hello buddy…Awesome video..
    I have python 3.6 installed & PIP 19.0.2
    Could you please share the link to install sklearn, quandl & pandas…Thank you

  • Hi! I'm new to ML with python and I am also using IDLE from something just like a week after having used another python editor ( or IDE, I don't remember how they are called). May I ask you why at the start of the video the first thing you do is not to set up some sort of virtual environment? Sorry for what might be a silly question but I am new to IDLE as I have already said. Thanks and sorry for the bad grammar if i misspelled something as I am not english. Cheers

  • It says,
    df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100

    TypeError: list indices must be integers or slices, not str

    When I use same code.Please Help

  • sentdex, if I want to use high price and low price separately to train nn to predict future price (so highs and lows as 2 separate inputs) will your course get me to the point where I can do that? Also how do people normally deal with highs and lows, or do they just get the average and use that to train the nn?

  • For some reason, Quandl does not work on my Python. It shows: Quandl successfully installed but when I try to run it I get "no module called Quandl". Please, could somebody help?

  • Which IDE you are using with white background ?? Which category of Anaconda ?? I have seen jupitor but this is different from jupitor. Kindly mention the name of IDE

  • what is df exactly? a dictionary of lists, but how you are using it, i mean by this – df[['xyz', 'abc']]
    brackets in brackets
    you cant use keys list to redefine dictionary

  • Dear sentdex, disliked this video because the example of stocks you provided is in itself such a hard nut to crack. How come someone will understand Machine learning like this?

  • i have 3 colums in data shortVolume , shortexemptVolume,TotalVolume
    when i tried to get percentage of short this is error im getting
    Issue:
    df['pct_short'] = df['ShortVolume'] / df['TotalVloume'] * 100.0

    TypeError: list indices must be integers or slices, not str

  • hey man i facing ssl certificate error plz tell me how to resolve it i have used tons of ways . but failed

  • Can you please redo this video a lot has changed apparently and I get soooo many errors from the beginning of the tutorial

  • Hey I came from your python introduction/basics playlist and man ur amazing!!Thanks a lot!
    I wanted to know what are things I should know (like installing modules and all) to start this series…

  • for folks having trouble, you'll need to create an account on quandl to use the data. Once you create an account you get apikey. Run this command before importing data quandl.ApiConfig.api_key = "YOURAPIKEY"

  • When i try to install pandas in cmd i get a message that says "Requirement already satisfied" then i try to run the code in python and it throws me that is "no module named pandas" ¿What can i do?

  • import pandas as pandas

    import quandl

    df = quandl.get("WIKI/GOOGL")

    print(df.head())

    #quandl should be lowercase
    3 April 2019

  • !pip install sklearn

    !python -m pip install --upgrade pip

    !pip install --upgrade pandas

    !pip install Quandl

    import pandas as pd

    import quandl

    df = quandl.get('WIKI/GOOGL')

    df.head(15)

    df.tail()

    df.columns

    use_columns = ['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']

    df = df[use_columns]

    df.head()

    rename_cols = ['Adj_Open', 'Adj_High', 'Adj_Low', 'Adj_Close', 'Adj_Volume']

    df.columns = rename_cols

    df.columns

    df['HL_PCT'] = (df.Adj_High - df.Adj_Low)/df.Adj_Low*100.0

    df.head()

    df['PCT_change'] = (df.Adj_Close - df.Adj_Open)/df.Adj_Open*100.0

    df

  • For some reason I just cannot import quandl. I've tried lower case and upper case, and I'm still getting ModuleNotFoundError
    the pip install process went smoothly, I have no idea why I can't import it now. Has anyone had a similar problem or know how to fix this? Please help

  • Please tell me any one, which browser he is using to type, before python, and result is going to python editor.

  • "So the equation of a line… well, we'll talk about more down the line." Heh, nice pun man. Made me chuckle

  • these equations work fine but when I return my statements, all the columns for the PC variables are 0. Any suggestions?

  • guys , I have LimitExceededError, could you please send a download file, so I won't get that error, thanks

  • I feel like im really lost here when i watch these, should i watch your pandas/ other python library tutorials to get a better understanding of these functions you are implementing?

  • Quandl's a piece of shit. I ran it once then it told me I had exceeded my 50 a day. So I signed up for a free account but there's just more and more layers of nonsense once you have the free account. Not a good learning tool.

  • Hi @sendex, can you share the 'WIKI/GOOGL' dataframe somewhere in Google Drive maybe? I try to find it in Quandl and fail to do so. I am a newbie and really interested in learning python machine learning. I think the best way to learn is to follow tutorial one by one.

  • Hi Guys,

    I need your help!!

    WIKI databse is no longer updating stock prices. I dont know how Quandl API works. But i found a webpage which provided codes to import data from quandl

    import quandl

    quandl.ApiConfig.api_key = 'INSERT YOU API KEY HERE'

    data = quandl.get_table('WIKI/PRICES', ticker = ['AAPL', 'MSFT', 'WMT'],

    qopts = { 'columns': ['ticker', 'date', 'adj_close'] },

    date = { 'gte': '2015-12-31', 'lte': '2016-12-31' },

    paginate=True)

    data.head()

    I want to know which quandl datset is free so i can import data in python. I also want to know how python already knows my API key because even i dont know what my API is!! But as per the codeing above python is able to import data without me entering my API!!

  • i cant insstall quand1.."could not find a version that satisfies the requirement quand1" whats this error .. and how can i fix this??

  • If pip install is not working, go to This PC, then Properties, then Advanced system settings in the left corner, then Environment Variables under Advanced, and add a new Path entry: "C:\Users\yashc\AppData\Local\Programs\Python\Python37-32\Scripts" (the Scripts location)

  • OK, I dont know, but I am unable to get the full dataset, after 'open, high and low' there is 3 dots followed by "Adj. Low Adj. Close Adj. Volume" and there is not Adj. High, which causes KeyError or something.. If anybody by any mistake getting typeerror, then they can do this , "df = pd.DataFrame([['Adj.Open','Adj.High','Adj.Low','Adj.Close','Adj.Volume']])" which has solved the problem for me but not the KeyError error errrr… -_- otherwise good tutorial.. ty (pd is import pandas as pd)

  • i was somehow exceeding the 50 limit with out actually using it 50 times i looked at it and it is very hit and miss to get around this i saved "df" to a pickle file:
     
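    import pickle
    import quandl

    try:
        # Download the data and save it to a local pickle file.
        df = quandl.get('WIKI/GOOGL')
        with open("df", "wb") as pickle_out:
            pickle.dump(df, pickle_out)
    except Exception:
        # If the download fails (e.g. limit exceeded), load the saved copy.
        with open("df", "rb") as pickle_in:
            df = pickle.load(pickle_in)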

    so now you can bypass quandl.get and test as many times as you want i hope this helps

  • 7:32 : In creating features, shouldn't High-Low percentage be 'High - Low' instead of 'High - Close'. Just an observation.

  • is there any way to to quandl not found error , or any method to use dataset similar to this the other way ? because i am completely beginner to python and quandl is not found. I tried almost every way from internet. Please help @sentdex

  • File "<ipython-input-2-ae39b1471d40>", line 5

    df['label'] = df(forecast_col.shift(-forecasts_out))

    ^

    SyntaxError: invalid syntax

    anyone please help me on this???

  • this thing is not working, i tried lower case for quandl still giving me error, and this error taking about quandle.get…someone pls help me out

  • So is the point of having PCT_Change to add another pattern/feature for the model to take in as an input? Or am I missing something crucial…

  • I'm happy that people like you exist . In Indian engineering colleges we dont have courses on ML in the computer science stream . Thanks a lot !
