(siraj) Hello, world! It's Siraj, and we're going to make an app that reads an article of text and creates a one-sentence summary out of it using the power of natural language processing. Language is in many ways the seat of intelligence. It's the original communication protocol that we invented to describe all the incredibly complex processes happening in our neocortex. Do you ever feel like you're getting flooded with an increasing amount of articles and links and videos to choose from? As this data grows, the importance of semantic density does as well. How can you say the most important things in the shortest amount of time?
Having a generated summary lets you decide whether you want to deep dive further or not. And the better it gets, the more we'll be able to apply it to more complex language, like that in a scientific paper or even an entire book. The future of NLP is a very bright one. Interestingly enough, one of the earliest use cases for machine summarization was by the Canadian government in the early 90s, for a weather system they invented called FoG.
Instead of sifting through all the meteorological data they had access to manually, they let FoG read it and generate a weather forecast from it on a recurring basis. It had a set textual template, and it would fill in the values for the current weather given the data, something like this. It was just an experiment, but they found that sometimes people actually prefer the computer-generated forecasts to the human ones, partly because the generated ones use more consistent terminology.
A similar approach has been applied in fields with lots of data that needs human-readable summaries, like finance. And in medicine, summarizing a patient's medical data has proven to be a great decision support tool for doctors.
Most summarization tools in the past were extractive: they selected an existing subset of words or numbers from some data to create a summary. But you and I do something a little more complex than that. When we summarize, our brain builds an internal semantic representation of what we've just read, and from that, we can generate a summary. This is instead an abstractive method, and we can do this with deep learning. What can't we do with it?
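For contrast, here's a minimal sketch of what an extractive approach might look like: score each sentence by how frequent its words are and keep the top one. The scoring scheme and tokenization here are illustrative assumptions, not part of the model we're about to build.

```python
# Toy extractive summarizer: pick the sentence whose words are most frequent overall.
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freqs = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence by the average frequency of its words.
    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freqs[t] for t in tokens) / max(len(tokens), 1)
    return ' '.join(sorted(sentences, key=score, reverse=True)[:n_sentences])

print(extractive_summary("The storm hit the coast. The storm caused flooding. Officials issued warnings."))
```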
So let's build a text summarizer that can generate a headline from a short article using Keras. We're going to use this collection of news articles as our training data. We'll convert it to pickle format, which essentially means converting it into a raw byte stream. Pickling is a way of converting a Python object into a character stream, so we can easily reconstruct that object in another Python script. Modularity for the win. We're saving the data as a tuple with the heading, description, and keywords.
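As a rough sketch of that step, assuming the parsed articles are already in three parallel lists (the file name and the placeholder values here are just assumptions):

```python
import pickle

# heads, descs, keywords are parallel lists built while parsing the articles.
heads = ["example headline"]
descs = ["example article body ..."]
keywords = [["example", "tags"]]

# Serialize the tuple to a raw byte stream on disk...
with open('tokens.pkl', 'wb') as f:
    pickle.dump((heads, descs, keywords), f)

# ...and reconstruct the same objects in another Python script.
with open('tokens.pkl', 'rb') as f:
    heads, descs, keywords = pickle.load(f)
```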
The heading and description are the list of headings and their respective articles, in order. And the keywords are akin to tags, but we won't be using those in this example. We're going to first tokenize, or split up, the text into individual words, because that's the level we're going to deal with this data at. Our headline will be generated one word at a time. We want some way of representing these words numerically.
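One way to do that tokenization, sketched with the Keras Tokenizer (the vocabulary cap and variable names are assumptions, and `heads`/`descs` are the lists loaded from the pickle above):

```python
from keras.preprocessing.text import Tokenizer

# Fit a word-level tokenizer on the headlines and article bodies together.
tokenizer = Tokenizer(num_words=40000)  # cap the vocabulary; 40000 is an arbitrary choice
tokenizer.fit_on_texts(heads + descs)

# Each text becomes a sequence of integer word indices.
head_seqs = tokenizer.texts_to_sequences(heads)
desc_seqs = tokenizer.texts_to_sequences(descs)
vocab = tokenizer.word_index  # maps word -> integer index
```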
Bengio coined the term for this, word embeddings, back in 2003, but they were first made popular by a team of researchers at Google when they released word2vec, inspired by Boyz II Men. Just kidding. Word2vec is a two-layer neural net trained on a big unlabeled text corpus. It's a pre-trained model you can download. It takes a word as its input and produces a vector as its output, one vector per word.
Creating word vectors lets us analyze words mathematically. So these high-dimensional vectors represent words, and each dimension encodes a different property, like gender or title. The magnitude along each axis represents the relevance of that property to a word. So we could say king minus man plus woman equals queen. We can also find the similarity between words, which equates to distance.
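A quick sketch of both ideas using gensim's pre-trained vectors (the specific downloadable model named here is just one convenient option, not the vectors we use later):

```python
import gensim.downloader as api

# Download a small set of pre-trained GloVe vectors (50 dimensions).
model = api.load("glove-wiki-gigaword-50")

# Analogy arithmetic: king - man + woman is closest to queen.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Similarity between two words (cosine similarity; higher means closer).
print(model.similarity("king", "queen"))
```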
Word2vec offers a predictive approach to creating word vectors, but another approach is count-based. And a popular algorithm for that is GloVe, short for Global Vectors. It first constructs a large co-occurrence matrix of words by context. For each word, i.e. row, it will count how frequently it sees it in some context, which is the column (there's a tiny counting sketch just below). Since the number of contexts can be large, it factorizes the matrix to get a lower-dimensional matrix, which represents words by features. So each row has a feature representation for each word. And it's also trained on a large text corpus. Both perform similarly well, but GloVe trains a little faster, so we'll go with that.
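To make the co-occurrence idea concrete, here's a tiny sketch that counts, for each word, how often other words appear within a small window around it. The window size and toy corpus are made up; real GloVe training runs over billions of tokens.

```python
from collections import defaultdict

corpus = "the cat sat on the mat the cat likes the mat".split()
window = 2  # how many words on each side count as "context"

# cooc[word][context_word] = number of times context_word appears near word
cooc = defaultdict(lambda: defaultdict(int))
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[word][corpus[j]] += 1

print(dict(cooc["cat"]))  # context counts for the row "cat"
```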
We'll download the pre-trained GloVe word vectors from this link and save them to disk. Then we'll use them to initialize an embedding matrix with our tokenized vocabulary from our training data. We'll initialize it with random numbers, then copy all the GloVe weights of words that show up in our training vocabulary.
And for every word outside this embedding matrix, we'll find the closest word inside the matrix by measuring the cosine distance of GloVe vectors. Now we've got this matrix of word embeddings that we could do so many things with.
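Roughly, that nearest-neighbor lookup might look like this sketch, with made-up helper names, using cosine similarity over the GloVe vectors:

```python
import numpy as np

def closest_in_vocab_word(word, glove, vocab):
    """Map an out-of-vocabulary word to the most similar word that has a row in our matrix."""
    if word not in glove:
        return None  # nothing to compare without a GloVe vector for it
    v = glove[word]
    best_word, best_sim = None, -1.0
    for candidate in vocab:
        if candidate in glove:
            u = glove[candidate]
            sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            if sim > best_sim:
                best_word, best_sim = candidate, sim
    return best_word
```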
So how are we going to use these word embeddings to create a summary headline for a novel article we feed it? Let's back up for a second. Sutskever and his co-authors first introduced a neural architecture called sequence to sequence in 2014. That later inspired the Google Brain team to use it for text summarization successfully. It's called sequence to sequence because we are taking an input sequence and outputting not a single value, but a sequence as well.
[SINGING] We gonna encode, then we decode. We gonna encode, then we decode. When I feed it a book, it gets vectorized, and when I decode that, I'm mesmerized. So we use two recurrent networks, one for each sequence.
The first is the encoder network. It takes an input sequence and creates an encoded representation of it. The second is the decoder network. We feed it, as its input, that same encoded representation, and it will generate an output sequence by decoding it. There are different ways we can approach this architecture. One approach would be to let our encoder network learn these embeddings from scratch by feeding it our training data. But we're taking a less computationally expensive approach, because we already have learned embeddings from GloVe.
When we build our encoder LSTM network, we'll set those pre-trained embeddings as our first layer's weights. The embedding layer is meant to turn input integers into fixed-size vectors anyway. We've just given it a huge head start by doing this. And when we train this model, it will just fine-tune or improve the accuracy of our embeddings, as a supervised classification problem where the input data is our set of vocab words and the labels are their associated headline words. We'll minimize the cross-entropy loss using RMSprop.
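Here's a rough sketch of how that encoder setup might look in a Keras 2-style API; the layer sizes, input length, and the use of `weights=[embedding_matrix]` to seed the embedding layer are assumptions about the setup rather than the exact model from the video:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

maxlen = 50         # number of input word indices fed to the encoder (assumed)
hidden_units = 512  # LSTM state size (assumed)

model = Sequential()
# Seed the embedding layer with the GloVe-initialized matrix; keep it trainable
# so training fine-tunes the embeddings.
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen,
                    weights=[embedding_matrix], trainable=True))
model.add(LSTM(hidden_units, return_sequences=False))
# Predict the next headline word over the whole vocabulary.
model.add(Dense(vocab_size))
model.add(Activation('softmax'))

# Cross-entropy loss, optimized with RMSprop, as described above.
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```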
Now, for our decoder. Our decoder will generate headlines. It will have the same LSTM architecture as our encoder, and we'll initialize its weights using our same pre-trained GloVe embeddings. It will take as input the vector representation generated after feeding in the last word of the input text. So it will first generate its own representation using its embedding layer. And the next step is to convert this representation into a word, but there is actually one more step. We need a way to decide what part of the input we need to remember, like names and numbers. We talked about the importance of memory. That's why we use LSTM cells.
But another important aspect of learning theory is attention. Basically, what is the most relevant data to memorize? Our decoder will generate a word as its output, and that same word will be fed in as input when generating the next word, until we have a headline.
We use an attention mechanism when outputting each word in the decoder. For each output word, it computes a weight over each of the input words that determines how much attention should be paid to that input word. All the weights sum up to 1 and are used to compute a weighted average of the last hidden layers generated after processing each of the input words. We'll take that weighted average and input it into the softmax layer, along with the last hidden layer from the current step of the decoder.
So let's see what our model generates for this article after training. All right, we've got this headline generated beautifully. And let's do it once more for a different article. Couldn't have said it better myself. So, to break it down: we can easily use pre-trained word vectors from a model like GloVe to avoid having to create them ourselves.
To generate an output sequence of words given an input sequence of words, we use a neural encoder-decoder architecture. And adding an attention mechanism to our decoder helps it decide which is the most relevant token to focus on when generating new text. The winner of the coding challenge from the last video is Jie Xun See.
He wrote an AI composer in 100 lines of code. Last week's challenge was non-trivial, and he managed to get a working demo up. So definitely check out his repo. Wizard of the week. The coding challenge for this video is to use a sequence to sequence model with Keras to summarize a piece of text. Post your GitHub link in the comments and I'll announce the winner next video. Please subscribe for more programming videos, and for now, I've got to remember to pay attention. So thanks for watching.