Meta Learning

Can we learn to learn to learn? Hello world, it’s Siraj, and what if neural networks could learn how to learn?

The process of learning to learn, where a top-level AI optimizes a bottom-level AI, or several of them, is called meta-learning, and it’s currently a very popular topic in AI. The reason is that meta-level AI algorithms generally make AI systems learn faster, adapt to changes in their environments in a more robust way, and generalize to more tasks. They can be used to optimize a model’s architecture, its parameters, the type of data set it uses, or some combination of all of them.

If we look at the literature, there are some pretty hilariously named meta-learning papers that demonstrate these techniques, like “Learning to learn by gradient descent by gradient descent.” Gotta love it. And DARTS, the Differentiable Architecture Search algorithm.

But in this video, I wanna focus on a specific meta-learning
technique called neuroevolution. This is the process of using what’s called an evolutionary algorithm to learn neural architectures.

The reason this technique piqued my interest is that, just this year, Google published some research detailing their effort at using an evolutionary algorithm to learn the architecture of a neural image classifier, and it ended up becoming the state of the art. That was somewhat surprising to many in the research community, since evolutionary algorithms haven’t shown nearly as much promise for real-world use cases as supervised and unsupervised methods have so far. And don’t forget brute forcing!

Given large amounts of training data, neural networks can perform tasks that would be difficult for humans. But discovering the optimal architectures for these networks is non-trivial, and takes researchers a lot of trial and error. Image classification is a well-known problem in the community, as deep learning researchers established the state of the art a couple of years ago. Researchers worked hard on developing newer architectures that progressively brought the state of the art to newer levels year after year.

Ambitiously, Google decided to try an evolutionary algorithm to learn what a neural architecture would look like for image classification
instead of hand-designing it, and it outperformed the rest.

And it wasn’t just Google. Neuroevolution strategies are seeing more adoption as popular tech companies, like Uber, have started using them to help improve the performance of their products. Uber’s dispatch algorithm has to analyze thousands of features in real time to generate more than 30 million rider-driver match-pair predictions per minute, and neuroevolution helps them speed up this crucial process. They’ve got a great blog post on this as well that lists several examples.

So why apply evolution
to neural network design? Well, to quote the contemporary poet Marshall Mathers, we ain’t nothing but mammals, and nature demonstrates this.

When the evolutionary biologist Charles Darwin visited the Galapagos Islands in 1835, he noticed that some birds appeared to have evolved from a single ancestral flock. They shared common features, but were characterized by their unique beak forms, which sprang from their unique DNA. We can think of DNA as a meta-level construct. It’s a blueprint that guides the replication of cells, a long-term memory store that captures the instructions necessary to recreate biological systems, transcending any individual’s death.

Darwin’s hypothesis was that the isolation of each species to a different island caused this diversity. Eventually, he turned this hypothesis into his now famous theory of natural selection.

This process is algorithmic, and we can simulate it
on silicon processors by creating evolutionary algorithms.

An evolutionary algorithm creates a population of randomly generated members. Each of these members is represented by some algorithm. It could be any kind, not just a machine learning algorithm. Even blockchain? No.

Then, it will give each member a score based on an objective function. This objective is called the fitness function, and the score is a measure of how well a member did in relation to the goal. Once all members are scored, the algorithm will select the highest-scoring members by some pre-defined threshold and breed them to produce more members like them. Breeding involves some interpolation of each member’s features that is application-specific. In addition to breeding, we’ll mutate some members randomly to attempt to find even better candidates. The rest of the members die off in a very Darwinian, survival-of-the-fittest way.

This process repeats for as many iterations as necessary. Actually, in this context, we call them generations. In the end, the idea is that we’ll be left with the very best possible members of a population.

These steps are all inspired by Darwin’s theory of natural selection. We can think of evolutionary algorithms as optimizers, searching the possible space of solutions for the right one. They’re all a part of the
broader class of algorithms called evolutionary computation.

If we again look at the animal kingdom, we’ll observe that there is a complex interplay between two intertwined processes, inter-life learning and intra-life learning. We can think of inter-life learning as a process of evolution via natural selection. Traits, epigenetics, and microbiomes are passed on between animal generations. Intra-life learning relates to how an animal learns during its lifetime. This learning is conditioned on its interaction with the world: things like recognizing objects, learning to communicate, and walking.

Both of these natural approaches are mirrored in computer science. Evolutionary algorithms can be considered inter-life learning, whereas neural networks can be thought of as intra-life learning, or any gradient-based optimization strategy really, where specific experiences result in an update in behavior.

So, how do we perform neuroevolution using both of these
processes to complete a goal?

Let’s say we have a very simple, fully connected neural network. Our goal would be to find the best parameters for image classification. There are four main ones: the number of layers our network will have, the number of neurons in each layer, what the activation function will be, and what the optimization algorithm will be.

To start, we’ll initialize our neural network with random weights, but not just one neural network like we usually do. Let’s initialize several, each with a random choice of those four parameters, to create a population. We’ll need to train the weights of each network using an image data set, then benchmark how well it performs at classifying test data. We’ll use its classification accuracy on the test set as our fitness function.

If we sort all of our
networks by their accuracy, we can see which ones are the lowest performing and remove them. We’ll select the top-scoring networks to be a part of the next generation. We’ll also keep a few of the lower-scoring networks, since that can help us avoid getting stuck in a local maximum as we optimize. We can also randomly mutate some of our network parameters. Both of these methods are like an evolutionary way of preventing overfitting.

Now we’re going to breed our top picks. In our neural network case, we’ll create a new network, or child, by combining a random assortment of parameters from its parent networks. So, a child could have the same number of layers as one parent, and the rest of its parameters from another parent. Another child could have the opposite. This mirrors how biology works in real life, and helps our algorithm converge on an optimized network.

If we test out our algorithm and compare it to a brute-force search, we’ll find that our algorithm gives us the same result as brute force, but in seven hours of
training instead of 63. As the parameter complexity of the network increases, evolutionary algorithms can provide exponential speed-ups over brute force.

Google did this as well, but with lots more data and computing power. They used hundreds of GPUs and TPUs for days. They initialized 1000 identical convolutional neural networks with no hidden layers. Then, through the evolutionary process, networks with higher accuracies were selected as parents, copied and mutated to generate children, while the rest died out. The process progressively discovered better and better network architectures.

In a later experiment, they used a fixed stack of repeated
modules called cells. The number of cells stayed the same, but the architecture of each cell mutated over time.

They also decided to use a specific form of regularization to improve the network’s accuracy. Instead of letting the lowest-scoring networks die, they removed the oldest ones regardless of how well they scored. It ended up improving the accuracy because their networks didn’t utilize weight inheritance, so they all needed to train from scratch. This technique selects for networks that remain accurate when they are retrained, so only architectures that remain accurate through each generation survive in the long run, which means we’ll get networks that retrain really well.

They call their model AmoebaNet, and it’s the new state of the
art in image classification.

So what have we learned here? Meta-learning is the process of learning to learn, where an AI optimizes one or several other AIs. Evolutionary algorithms use concepts from the evolutionary process, like mutation and natural selection, to solve complex problems, and a meta-learning technique called neuroevolution uses evolutionary algorithms to optimize neural networks specifically.

Please subscribe for more programming videos, and for now, I’ve gotta find a gradient, so thanks for watching.
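
The evolutionary loop from the video (score every member, keep the top scorers plus a few lucky low scorers, mutate some, breed children from a random assortment of parent parameters) could be sketched in Python roughly like this. It is only an illustration under stated assumptions: the search space values, the function names, the `retain`, `lucky`, and `mutate_chance` rates, and above all `toy_fitness` are hypothetical. In a real run the fitness would come from training each network and measuring its test accuracy; here a made-up score stands in so the sketch runs instantly. Keeping the single best member unmutated is also an added choice, not something the video specifies.

```python
import random

# Hypothetical search space for the four parameters named in the video.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3, 4],
    "n_neurons": [64, 128, 256, 512],
    "activation": ["relu", "tanh", "sigmoid", "elu"],
    "optimizer": ["sgd", "adam", "rmsprop", "adagrad"],
}


def random_member(rng):
    """Create one member by picking each parameter at random."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}


def toy_fitness(member):
    """Stand-in for 'train the network and return its test accuracy'.

    Assumption: a made-up score in [0, 1] that counts how many parameters
    match one arbitrary target configuration.
    """
    target = {"n_layers": 3, "n_neurons": 256,
              "activation": "relu", "optimizer": "adam"}
    return sum(member[key] == target[key] for key in target) / len(target)


def breed(mother, father, rng):
    """Child takes each parameter from one parent, chosen at random."""
    return {key: rng.choice([mother[key], father[key]]) for key in SEARCH_SPACE}


def mutate(member, rng):
    """Return a copy with one randomly chosen parameter re-drawn."""
    key = rng.choice(list(SEARCH_SPACE))
    mutated = dict(member)
    mutated[key] = rng.choice(SEARCH_SPACE[key])
    return mutated


def evolve(population, fitness, rng, retain=0.4, lucky=0.1, mutate_chance=0.2):
    """Produce the next generation, keeping the population size constant."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_keep = max(2, int(len(ranked) * retain))
    parents = ranked[:n_keep]
    # Keep a few lower scorers by chance to preserve diversity.
    parents += [m for m in ranked[n_keep:] if rng.random() < lucky]
    # Mutate some parents, but leave the current best untouched (assumption).
    parents = [parents[0]] + [
        mutate(m, rng) if rng.random() < mutate_chance else m
        for m in parents[1:]
    ]
    # Everyone not selected dies off; children fill the freed slots.
    children = []
    while len(parents) + len(children) < len(population):
        mother, father = rng.sample(parents, 2)
        children.append(breed(mother, father, rng))
    return parents + children


def run(generations=10, size=20, seed=0):
    """Evolve a population and record the best fitness per generation."""
    rng = random.Random(seed)
    population = [random_member(rng) for _ in range(size)]
    history = []
    for _ in range(generations):
        history.append(max(toy_fitness(m) for m in population))
        population = evolve(population, toy_fitness, rng)
    return max(population, key=toy_fitness), history


if __name__ == "__main__":
    best, history = run()
    print(best, history)
```

Swapping `toy_fitness` for a function that actually builds, trains, and evaluates a network is where all the real cost lives, which is why selecting and breeding a small population can be so much cheaper than trying every combination.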

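The age-based removal described for AmoebaNet, where the oldest network is dropped regardless of its score, could look something like the following Python sketch. Everything here is an assumption for illustration: the function and parameter names are hypothetical, the parent is picked as the best of a small random sample (one common way to do selection, not confirmed by the video), and the toy problem of maximizing the number of ones in a bit list stands in for training and scoring real architectures.

```python
import collections
import random


def aging_evolution(random_member, mutate, fitness, rng,
                    population_size=20, sample_size=5, cycles=100):
    """Evolve by always removing the oldest member, not the worst one."""
    population = collections.deque()  # left end = oldest member
    history = []                      # every member ever evaluated

    # Start with a randomly generated, scored population.
    while len(population) < population_size:
        member = random_member(rng)
        scored = (member, fitness(member))
        population.append(scored)
        history.append(scored)

    for _cycle in range(cycles):
        # Assumption: the parent is the best of a small random sample.
        sample = rng.sample(list(population), sample_size)
        parent, _parent_score = max(sample, key=lambda pair: pair[1])
        # The child is a mutated copy, scored from scratch.
        child = mutate(parent, rng)
        scored = (child, fitness(child))
        population.append(scored)
        history.append(scored)
        # The oldest member dies, however well it scored.
        population.popleft()

    return max(history, key=lambda pair: pair[1]), population


# Hypothetical toy problem: maximize the number of ones in a bit list.
def random_bits(rng, n=16):
    return [rng.randint(0, 1) for _ in range(n)]


def flip_one_bit(bits, rng):
    child = list(bits)
    child[rng.randrange(len(child))] ^= 1
    return child


def count_ones(bits):
    return sum(bits)


if __name__ == "__main__":
    best, population = aging_evolution(
        random_bits, flip_one_bit, count_ones, random.Random(0))
    print(best)
```

Because even a high scorer is eventually removed, a good design only persists if its descendants keep scoring well when evaluated again, which matches the video's point about selecting for architectures that retrain well.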