How to Make a Text Summarizer – Intro to Deep Learning #10


(siraj) Hello, world! It's Siraj, and we're going to make an app that reads an article of text and creates a one-sentence summary of it using the power of natural language processing.

Language is in many ways the seat of intelligence. It's the original communication protocol we invented to describe all the incredibly complex processes happening in our neocortex. Do you ever feel like you're getting flooded with an increasing number of articles and links and videos to choose from? As this data grows, so does the importance of semantic density: how can you say the most important things in the shortest amount of time? Having a generated summary lets you decide whether you want to dive deeper or not. And the better it gets, the more we'll be able to apply it to more complex language, like that in a scientific paper or even an entire book. The future of NLP is a very bright one.
Interestingly enough, one of the earliest use cases for machine summarization was by the Canadian government in the early '90s, for a weather system they invented called FoG. Instead of manually sifting through all the meteorological data they had access to, they let FoG read it and generate a weather forecast from it on a recurring basis. It had a set textual template and it would fill in the values for the current weather given the data, something like this. It was just an experiment, but they found that people sometimes actually preferred the computer-generated forecasts to the human ones, partly because the generated ones used more consistent terminology.

A similar approach has been applied in fields with lots of data that needs human-readable summaries, like finance. And in medicine, summarizing a patient's medical data has proven to be a great decision-support tool for doctors.
Most summarization tools in the past were extractive: they selected an existing subset of words or numbers from some data to create a summary. But you and I do something a little more complex than that. When we summarize, our brain builds an internal semantic representation of what we've just read, and from that we can generate a summary. This is instead an abstractive method, and we can do it with deep learning. What can't we do with it?

So let's build a text summarizer that can generate a headline from a short article using Keras.
We're going to use this collection of news articles as our training data. We'll convert it to pickle format, which essentially means converting it into a raw byte stream. Pickling is a way of converting a Python object into a character stream so we can easily reconstruct that object in another Python script. Modularity for the win. We're saving the data as a tuple with the headings, descriptions, and keywords. The headings and descriptions are the lists of headlines and their respective articles, in order. The keywords are akin to tags, but we won't be using those in this example.
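As a rough sketch of that step (the file name tokens.pkl and the variable names here are placeholders, not necessarily what the video's repo uses), pickling and unpickling the tuple looks something like this:

    import pickle

    # Assumed structure: parallel lists of headlines, article bodies, and keyword lists.
    heads = ["example headline one", "example headline two"]
    descs = ["full text of the first article ...", "full text of the second article ..."]
    keywords = [["example", "tags"], ["more", "tags"]]

    # Serialize the tuple to a raw byte stream on disk.
    with open("tokens.pkl", "wb") as fp:
        pickle.dump((heads, descs, keywords), fp)

    # Any other script can reconstruct the exact same objects.
    with open("tokens.pkl", "rb") as fp:
        heads, descs, keywords = pickle.load(fp)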
We're going to first tokenize the text, splitting it up into individual words, because that's the level at which we're going to deal with this data. Our headline will be generated one word at a time.
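A minimal way to do that with Keras' built-in Tokenizer, assuming the heads and descs lists from the sketch above, might look like this:

    from keras.preprocessing.text import Tokenizer

    # Hypothetical setup: `heads` and `descs` are the headline and article lists loaded above.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(heads + descs)            # builds a word -> integer index over all text

    head_seqs = tokenizer.texts_to_sequences(heads)  # each headline becomes a list of word ids
    desc_seqs = tokenizer.texts_to_sequences(descs)
    vocab_size = len(tokenizer.word_index) + 1       # index 0 is reserved for padding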
We want some way of representing these words numerically. Bengio coined the term "word embeddings" for this back in 2003, but they were first made popular by a team of researchers at Google when they released word2vec, inspired by Boyz II Men. Just kidding. Word2vec is a two-layer neural net trained on a large text corpus. It's a pre-trained model you can download. It takes a word as its input and produces a vector as its output, one vector per word. Creating word vectors lets us analyze words mathematically. These high-dimensional vectors represent words, and each dimension encodes a different property, like gender or title. The magnitude along each axis represents the relevance of that property to a word. So we could say king minus man plus woman equals queen. We can also find the similarity between words, which equates to the distance between their vectors.
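If you want to play with that yourself, here's a small sketch using gensim's KeyedVectors, assuming you've downloaded the pre-trained Google News word2vec binary:

    from gensim.models import KeyedVectors

    # Assumes the pre-trained Google News word2vec binary has been downloaded locally.
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # "king" - "man" + "woman" lands near "queen" in the embedding space.
    print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

    # Similarity is just the cosine between the two 300-dimensional vectors.
    print(w2v.similarity("king", "queen"))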
Word2vec offers a predictive approach to creating word vectors, but another approach is count-based, and a popular algorithm for that is GloVe, short for Global Vectors. It first constructs a large co-occurrence matrix of words by context: for each word, i.e. each row, it counts how frequently that word appears in some context, which is the column. Since the number of contexts can be large, it factorizes the matrix to get a lower-dimensional matrix that represents words by features, so each row holds a feature representation for a word. It's also trained on a large text corpus. Both perform similarly well, but GloVe trains a little faster, so we'll go with that.
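Here's a toy illustration of the count-based idea, building word-by-context co-occurrence counts over a tiny made-up corpus; it's only meant to show what GloVe starts from, not how GloVe itself is trained:

    from collections import Counter

    # Toy corpus, already tokenized.
    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]

    window = 2                      # context = words within 2 positions of the target
    cooc = Counter()                # (target word, context word) -> count

    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    cooc[(word, sentence[j])] += 1

    # GloVe then factorizes this (huge, sparse) matrix into low-dimensional word vectors.
    print(cooc[("sat", "on")])      # 2: "sat" and "on" co-occur in both sentences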
We'll download the pre-trained GloVe word vectors from this link and save them to disk. Then we'll use them to initialize an embedding matrix over the tokenized vocabulary from our training data. We'll initialize it with random numbers, then copy in the GloVe weights for every word that shows up in our training vocabulary. And for every word outside this embedding matrix, we'll find the closest word inside the matrix by measuring the cosine distance of their GloVe vectors.
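A rough sketch of that initialization, reusing the hypothetical tokenizer and vocab_size names from earlier and assuming the 100-dimensional glove.6B.100d.txt file, could look like this:

    import numpy as np

    embedding_dim = 100

    # Parse the downloaded GloVe file into a word -> vector dictionary.
    glove = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Start from small random weights, then copy in GloVe vectors for words we know.
    # `tokenizer.word_index` is the vocabulary built earlier (word -> integer id).
    embedding_matrix = np.random.uniform(-0.05, 0.05, (vocab_size, embedding_dim))
    for word, idx in tokenizer.word_index.items():
        if word in glove:
            embedding_matrix[idx] = glove[word]

    # For a word outside our vocabulary, one option is to map it to the closest
    # in-vocabulary word by cosine similarity of their GloVe vectors.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))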
Now we've got this matrix of word embeddings that we could do so many things with. So how are we going to use these word embeddings to create a summary headline for a novel article we feed it? Let's back up for a second.
Sutskever and his co-authors first introduced a neural architecture called sequence to sequence in 2014, which later inspired the Google Brain team to use it for text summarization successfully. It's called sequence to sequence because we take an input sequence and output not a single value, but a sequence as well.
[SINGING] We gonna encode, then we decode. We gonna encode, then we decode. When I feed it a book, it gets vectorized, and when I decode that, I'm mesmerized.
So we use two recurrent networks, one for each sequence. The first is the encoder network: it takes an input sequence and creates an encoded representation of it. The second is the decoder network: we feed it that same encoded representation as its input, and it generates an output sequence by decoding it.
There are different ways we can approach this architecture. One approach would be to let our encoder network learn these embeddings from scratch by feeding it our training data. But we're taking a less computationally expensive approach, because we already have learned embeddings from GloVe. When we build our encoder LSTM network, we'll set those pre-trained embeddings as our first layer's weights. The embedding layer is meant to turn input integers into fixed-size vectors anyway; we've just given it a huge head start by doing this. And when we train this model, it will fine-tune, or improve the accuracy of, our embeddings as part of a supervised classification problem, where the input data is our set of vocab words and the labels are their associated headline words. We'll minimize the cross-entropy loss using RMSprop.
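Stripped down to the essentials, and leaving out everything the full model adds on top, the pre-trained embedding layer and the training setup might look roughly like this (maxlen and the layer sizes are just assumed values, not the ones from the video's repo):

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    maxlen = 50        # assumed maximum number of input tokens fed to the encoder

    model = Sequential()
    # First layer: word ids -> fixed-size vectors, initialized with the GloVe matrix
    # built above and fine-tuned during training.
    model.add(Embedding(vocab_size, embedding_dim,
                        input_length=maxlen,
                        weights=[embedding_matrix]))
    model.add(LSTM(512))
    # Predict the next headline word as a softmax over the vocabulary.
    model.add(Dense(vocab_size, activation="softmax"))

    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")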
Now, for our decoder. Our decoder will generate headlines. It will have the same LSTM architecture as our encoder, and we'll initialize its weights using the same pre-trained GloVe embeddings. It will take as input the vector representation generated after feeding in the last word of the input text, so it will first generate its own representation using its embedding layer. The next step is to convert this representation into a word, but there is actually one more step. We need a way to decide what part of the input we need to remember, like names and numbers. We've talked about the importance of memory; that's why we use LSTM cells. But another important aspect of learning theory is attention. Basically, what is the most relevant data to memorize?
Our decoder will generate a word as its output, and that same word will be fed in as input when generating the next word, until we have a headline.
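A bare-bones greedy decoding loop that captures this feed-the-output-back-in idea, assuming the model, tokenizer, and maxlen from the sketches above plus a made-up end-of-sequence id, could look like this; a real system would typically use beam search rather than always taking the single most likely word:

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences

    def generate_headline(model, tokenizer, desc_ids, max_words=15, eos_id=0):
        """Greedy decoding sketch: feed each predicted word back in as the next input.
        Assumes `model` maps a padded id sequence to a softmax over the vocabulary."""
        index_to_word = {i: w for w, i in tokenizer.word_index.items()}
        sequence = list(desc_ids)                         # start from the tokenized article
        headline = []
        for _ in range(max_words):
            x = pad_sequences([sequence], maxlen=maxlen)  # most recent `maxlen` tokens, padded
            probs = model.predict(x)[0]
            next_id = int(np.argmax(probs))               # greedily pick the most likely word
            if next_id == eos_id:                         # stop at the assumed end-of-sequence id
                break
            headline.append(index_to_word.get(next_id, "<unk>"))
            sequence.append(next_id)                      # feed the prediction back in as input
        return " ".join(headline)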
We use an attention mechanism when outputting each word in the decoder. For each output word, it computes a weight over each of the input words that determines how much attention should be paid to that input word. All the weights sum to 1 and are used to compute a weighted average of the last hidden layers generated after processing each of the input words. We take that weighted average and feed it into the softmax layer along with the last hidden layer from the current step of the decoder.
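Here's a toy NumPy version of that weighted average, using simple dot-product scores; the shapes and the scoring function are assumptions for illustration, not the exact mechanism from the video's code:

    import numpy as np

    def attention_context(encoder_states, decoder_state):
        """Toy dot-product attention: one context vector for the current decoder step.
        encoder_states: (input_length, hidden_size) hidden states, one per input word.
        decoder_state:  (hidden_size,) decoder hidden state at the current output step."""
        scores = encoder_states @ decoder_state           # one score per input word
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                          # softmax: the weights sum to 1
        context = weights @ encoder_states                # weighted average of encoder states
        return context, weights

    # The context vector is combined with the decoder state and fed to the softmax
    # layer that predicts the next headline word.
    encoder_states = np.random.randn(30, 128)             # 30 input words, 128-dim states
    decoder_state = np.random.randn(128)
    context, weights = attention_context(encoder_states, decoder_state)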
So let's see what our model generates for this article after training. All right, we've got this headline, generated beautifully. And let's do it once more for a different article. Couldn't have said it better myself.

So, to break it down: we can easily use pre-trained word vectors from a model like GloVe to avoid having to create them ourselves. To generate an output sequence of words given an input sequence of words, we use a neural encoder-decoder architecture. And adding an attention mechanism to our decoder helps it decide which token is the most relevant one to focus on when generating new text.
The winner of the coding challenge from the last video is Jie Xun See. He wrote an AI composer in 100 lines of code. Last week's challenge was non-trivial and he managed to get a working demo up, so definitely check out his repo. Wizard of the week.

The coding challenge for this video is to use a sequence-to-sequence model with Keras to summarize a piece of text. Post your GitHub link in the comments and I'll announce the winner next video. Please subscribe for more programming videos, and for now, I've got to remember to pay attention. So thanks for watching.

100 thoughts on “How to Make a Text Summarizer – Intro to Deep Learning #10”

  • Wow, great presentation! I've been working on something like this for a long time. Amazing to see you analyze the problem, and talk through a solution in just a few minutes.

    Now, what I think humanity really needs is to recognize claim statements and evidentiary statements linked to the claims, to create Argument instances, and summary representations of Arguments. We need this for machine reasoning. An argument/reason summary service would be very useful in many ways, including machine/human dialog, shared persistent argument web networks for ubiquitous understanding applications within and across knowledge domains. We should be referring to arguments already refined and elaborated, instead of recreating them from scratch all the time. This will allow us to save the work of others over time, and also to preserve the work of others in very accessible formats.

    Arguments are everywhere text is. Evaluating well-formed or properly structured arguments in software should be relatively easy if we can interactively(?) detect and summarize claims, evidence, and reasons in text. Can you make this happen?? A feedback loop with readers/writers is necessary to build the "truth data" set for a number of styles of writing, and to give people instant utility and to build a web of trust (probably block chain-based). A growing, animated web of arguments backed with real text and dialog would be a new step for humanity.

    Defining and handling context is a big problem. Definitions, Assumptions, Logical rules & fallacies, are included in the context problem. I think they are solved with protocols and amendments.

  • I have little knowledge of data mining. I am studying it now. I want to make a project that extracts the different languages from Hinglish text. How can I start? Please help me.

  • Hey Siraj!
    I really enjoy watching your videos!
    But can you make more videos about how to create your own datasets, and how to use them? This would be great, since in all your videos the datasets are already given and it's hard to apply the tutorials to your own ideas…
    Thanks!

  • Hello Siraj, I am a fan of yours. You really motivated me to dig deep into ML. I am currently working on a problem and need your suggestions on it: I have been given a dataset of company descriptions, and from those descriptions I have to find which companies provide consultancy and which do not. Please suggest approaches that will perform well on this problem:
    1. approaches for a labelled dataset
    2. approaches for an unlabelled dataset

  • Hey Siraj, I was also working on text summarization. Could you please suggest some free datasets for this and how to download them?

  • We are using the headlines for training. Without having them, can we build meaningful headlines from reading the descriptions alone?

  • Hey Siraj, I have a project for resume shortlisting from a bundle of PDF resumes. I want to do this with machine learning, but I don't have any concrete idea of how to do it. Can you provide some detailed information on it? It would be very helpful.

  • I am still a newbie and I am trying to follow your video tutorial, but you don't show anything that could actually run. For a background, instead of the notebook, you could just as well have the digit rain from The Matrix. No use whatsoever. Obviously the likes and view count are what's important on this channel.

  • This github repo is a direct and straightforward copy of https://github.com/udibr/headlines, same code, same comments, same notebook texts by the author of the paper: Generating News Headlines with Recurrent Neural Networks http://arxiv.org/abs/1512.01712
    The generated samples shown in the video and in the original github repo are obtained with a model trained on a dataset far bigger than the BBC news dataset: the English Gigaword dataset, containing 5.5M news articles with 236M words. Each model takes 4.5 days to train on a GTX 980 Ti GPU.

  • I'm trying to follow your tutorial here: https://github.com/llSourcell/How_to_make_a_text_summarizer in the Data section this is the function I used to process it https://gist.github.com/spicyramen/37f5b1e06529abd1cd653695b8802736

  • Amazing video, Siraj. What changes should be made if we run this code on the Amazon Fine Food Reviews dataset? Please reply ASAP.

  • so… you're a data scientist? .. ♬ ♬ That don't impress me much. So you've got the brains but.. do you got the touch ♬ ♬

  • did anyone else figure out where the postprocessing library comes from? It appears many others are stuck because of this as well.

  • DOES ANYONE HAVE THIS CODE IN THEIR GITHUB? There are lot of parts missing making the tutorial difficult.

  • Hi Siraj and hi guys,
    While trying to execute the following code with the Signal Media dataset:
    def get_vocab(lst):
        vocabcount = Counter(w for txt in lst for w in txt.split())
        vocab = map(lambda x: x[0], sorted(vocabcount.items(), key=lambda x: -x[1]))
        return vocab, vocabcount
    vocab, vocabcount = get_vocab(heads+desc)

    I am getting this error:
    AttributeError: 'list' object has no attribute 'split'
    Did anyone else face this issue? Can anyone help me with it?

  • Sir, I have been studying some material regarding the use of machine learning in mining bitcoins, i.e. predicting the correct nonce for a given block. I cannot find a satisfying answer by googling. Can you describe this? Thank you.

  • Please provide the code/folder you show in the video, with the *.pkl files, so we can reproduce it and learn/test locally.

  • Can you please make a video about keyword extraction from text? Would the same model shown in this video be valid for keyword extraction when trained with a different dataset?

  • Hey Siraj! If I wanted to integrate this into an app I am developing which takes in user inputs… Does the code need to be changed anywhere or should I use it as it is?

  • Just randomly asking about 2:30:
    with open('data/%s.pkl', 'rb') as fp:
    Does it work? I mean, can %s work in Python without the inserted value?

  • Interesting. Is there some model to make an abstractive summary of a whole text? I mean, not only output a headline, but output another text based on the original?

  • We are far away from getting accurate AI-based summaries in the real world. Available algorithms and models, including the one demonstrated in this video, are not adequate. One needs in-depth R&D to enhance the model for better accuracy. I tried a couple of existing solutions available on the market, but the quality of the output summaries is just below average, adding no value to business.

  • I want to replicate this, but for text in Spanish. Do I need a training set of articles in Spanish? Or, since it vectorizes the words, is there no problem if I use the original news database, which mostly contains articles written in English?

  • You're doing great, boss. Please, is there a way that ML can automatically classify text into actors, title, and body?

  • Extraordinary tutorial! Thank you very much. Can we extract specific strings like (name, date, value) from a PDF file? This would help me a lot. If we can, please suggest the relevant resources.

  • You are doing the community and the world a great service by putting out such stellar and useful content. 🙌🏾💯💯

  • How can I fix this problem that comes up while running the source code, like the one below?
    ModuleNotFoundError: No module named 'cPickle'

  • Hi Siraj bro,
    Let's assume we have a training dataset of 4 columns and 100 rows, Columns [1,2,3,4] = [ID, Difficulty (1 to 4), Cipher Text, Target], and a test dataset with no target. The entire text is encrypted using classical encryption methods; the ciphertext looks like "$hA10#dLjaU$*2!/[email protected]&t". What kind of machine learning or deep learning techniques are used for this? Is an RNN/LSTM used to train on the dataset, and if so, how? Is there any external source that covers this kind of dataset problem? Thank you, Siraj.

  • Hey, just a quick question: the post-processing module listed above–is it a custom module you coded, or is it available for download somewhere?

  • How can we use different text? As far as I understood, you used a dataset that is already defined. I am trying to summarize or find a headline from transcripts of conferences. How can I adapt this approach to that?

  • TemporalSummarization(S, C, q, ts, te)

    S  -> Participant system.
    C  -> Time-ordered corpus.
    q  -> Event keyword query.
    ts -> Event start time.
    te -> Event end time.

    1.  U ← {}
    2.  S.Initialize(q)
    3.  for d ∈ C
    4.  do
    5.      S.Process(d)
    6.      t ← d.Time()
    7.      if t ∈ [ts, te]
    8.      then
    9.          Ut ← S.Decide()
    10.         for u ∈ Ut
    11.         do
    12.             U.Append(u, t)
    13. return U

    Can someone help me write this as a program?

  • I just want the simple code which is used in the video. And did you write a function for get_glove_weights ?

  • Hey Siraj, I honestly loved your video! The thing is, my friends and I are implementing a project along similar lines. The issue is I tried training my own model but I can't get the loss below 7, and we have a demonstration of the same project in the next three or four days. It would mean a lot if you, or anyone else who has trained their model, could send the h5 file to me at my email [email protected]. Thank you again, really great work.

  • Can this be done with images? Like, take an image as input and produce an image as output? I've been stranded and chose to go into text summarizing to try to finish my project. I'm trying to make something that recreates the bottom of an image, and I KNOW I've seen it before but I can't find it.

  • OK, I'm entirely new to this. But you just used the text already available in the headlines. I want to generate new text as a summary of some whole text. Yes, you mentioned the encoder-decoder, but how do I do it? And can someone help me with the code?

  • Did you just time travel FROM the future?
    Articulation of such topics with humor?- – – mind blown.
    Amazing..!

  • I'm surprised you didn't use sumy to show how you can do text summarization in less than 10 lines of Python.

  • Suppose I have a directory of files and images and all my data on my own PC. By passing a text message like 'show me the salary report', it should search for the file name and display it directly.
    We have to train the model such that it understands the text message I pass.
    Please tell me how I can do that using NLP.
