Scalable and Robust Multi-Agent Reinforcement Learning

>>I’m very happy to
introduce Chris Amato. Chris is an assistant professor
at Northeastern University, and he has done a lot of really cool work on robotics
and some game-playing, and today he’s going to
talk about multi-agent RL.>>Thank you very much. Yeah, so as the title says, my talk is ‘Scalable and Robust Multi-Agent
Reinforcement Learning’. I want to thank in
particular two students. These are the students who mainly did the work that
I’m going to talk about. One is Shayegan and the
other one’s Yuchen. I’ll also say I don’t know where Sam went, but Sam led very nicely with the ending part of his talk into the stuff that I’m going to talk about here: I’m going to focus on the multi-agent learning stuff, multi-agent reinforcement learning. If he didn’t convey
it enough already, I think we all know that there are going to be multiple agents everywhere. There are already becoming lots of agents, robots or otherwise, that are helping us out in our daily lives, whether for drone delivery or autonomous cars. Typically we think of the autonomous car problem as a single-agent problem, but it’s going to be lots of autonomous cars, and they’re going to need to coordinate with each other in order to really navigate the streets and optimize where they should go. Things like UAV surveillance or even home robots. Nowadays home robots are becoming so sophisticated that they can coordinate with each other to make sure that your house is as clean as it can possibly be, for instance, right? So there are lots of these domains we can think about where we’re going to have multi-agent systems that need to learn, rather than just a single agent, as well as video games, which is also a cool application, I’ll just say. But in these multi-agent domains, we’re going to have lots
of different types of uncertainty which I think both Sheila and Sam set up
nicely before as well. We have the regular
outcome uncertainty that we have in a typical MDP, real-world domains always
have sensor uncertainty. So we’re going to have
partial observability often in these domains. Multi-agent domains often also
have communication uncertainty. So if I have home robots that are
made by different manufacturers, for instance so now I can
know who is to always be able to communicate with
each other or from doing search and rescue or
surveillance in dangerous domains, we don’t have perfect communication. They can’t communicate with each other to have a
centralized solution here. So there’s going to be
these three types of uncertainties that could exist in
any of these different domains. In some domains, some of them may not exist; in other domains, maybe all three of them will. So for these types of problems, the common representation for the cooperative case, for thinking about these different types of uncertainties (again, we can think about subclasses, or even one superclass, I guess), for modeling these cooperative agents is the decentralized partially observable Markov decision process. It’s decentralized because
now we have this set of agents that’s operating in
a decentralized fashion. It’s obviously partially observable because of the sensor uncertainty. So it’s just a
multi-agent extension of the MDP and the POMDP frameworks. So now we have this set of
agents here for instance the robots and at each step
they’re going to take some action, and it’s a cooperative problem, so there’s a single joint reward that’s generated for the team of agents. Each one of them, though, gets its own observation, and then they have to make decisions, choose what to do, based on those observation sequences that they’re going to get. So, a little more formally, again it’s an extension of the
MDP and the POMDP frameworks. So we have our set of agents and our set of states, and then each agent has a set
of actions that they can take. There’s a transition model that
depends on all the agents. So the world state depends on
what I do and what you do, right? The reward model depends on
again what I do and what you do. So it depends on everybody. There’s a set of observations that could be different for all the different agents, and the observation model again depends on possibly all the different agents and the world state, and then we have a discount factor. So all these functions depend on all the different agents.
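Just as an illustrative sketch (this container is mine, not from the talk), the Dec-POMDP tuple can be written down directly, with the transition, reward, and observation functions all taking the joint action:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[str, ...]

@dataclass
class DecPOMDP:
    """Minimal Dec-POMDP container; all names here are illustrative."""
    agents: List[int]                      # set of agents I
    states: List[str]                      # set of states S
    actions: Dict[int, List[str]]          # per-agent action sets A_i
    observations: Dict[int, List[str]]     # per-agent observation sets
    transition: Callable[[str, JointAction, str], float]  # T(s' | s, joint a)
    reward: Callable[[str, JointAction], float]           # single team reward
    obs_fn: Callable[[Tuple[str, ...], str, JointAction], float]  # O(joint o | s', joint a)
    gamma: float = 0.95                    # discount factor

# Toy two-agent instance: the team is rewarded only if both agents push.
toy = DecPOMDP(
    agents=[0, 1],
    states=["s0"],
    actions={0: ["push", "wait"], 1: ["push", "wait"]},
    observations={0: ["none"], 1: ["none"]},
    transition=lambda s, a, s2: 1.0,
    reward=lambda s, a: 1.0 if a == ("push", "push") else 0.0,
    obs_fn=lambda o, s2, a: 1.0,
)
```

The point is just that every model component is a function of the joint action, which is exactly why naive per-agent learning sees a non-stationary world.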
Solutions for these problems could be anything, but here we need to remember history, because again we’re in partially observable situations. So we want to map our different possible histories to the different actions we may want to choose, because we don’t necessarily have access to the true state of the world. So we can have lots of different
policy representations: we can have direct history-to-action mappings, tree-based representations, finite-state controller representations, or recurrent network representations. One of these representations is held by each of the different agents so that they can operate in a decentralized fashion in the domain itself. Then once we have
something like this, we can evaluate this set of policies, one for each of the different agents, using various types of Bellman equations. This is a Bellman equation for the tree or the finite-state controller representation, but you could also just think of V of h here as well, which depends on the immediate reward, the transition probability of the system, the observation probability of each agent, and then the next value that you can get after that, right? So the goal, just like in any MDP or POMDP problem, is to maximize expected reward over that finite or infinite horizon.
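For reference, a standard way to write that evaluation for joint finite-state controllers (the notation here is my sketch: agent $i$ in controller node $q_i$ takes $a_i = \lambda_i(q_i)$ and moves to node $q_i' = \delta_i(q_i, o_i)$) is:

```latex
V(s, \vec{q}) \;=\; R(s, \vec{a}) \;+\; \gamma \sum_{s'} T(s' \mid s, \vec{a}) \sum_{\vec{o}} O(\vec{o} \mid s', \vec{a}) \, V(s', \vec{q}\,')
```

The value of a history, V(h), mentioned above has the same shape, with joint histories in place of controller nodes.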
So if there are any questions about any of this stuff at any point, feel free to ask, and I will try to clarify things. Okay. So now, like I
was talking about, just to make things more realistic, we can think about all of these different types of uncertainty. So this model is very general: any cooperative multi-agent problem you can think of, we can represent using this framework. It’s a very common framework for planning and learning in multi-agent environments. The only more general framework
that is typically used, for the competitive (or possibly competitive) setting where each agent has its own reward function, is the partially observable stochastic game, which I’m not going to talk about today, although some of these methods do apply in that case. I’m happy to answer questions about it if people want to know, but I’m going to focus on the cooperative case. This generality of the representation means that we have to consider the partial observability and the other agents in the solutions to the problems. Okay. So that’s essentially what
I’m going to talk about today is, how do we learn solutions for
this model, the Dec-POMDP model? We want to learn solutions that are scalable to large domains, and then we also want to... remind me what time I’m supposed to finish? Sorry.>>Perfect. Thank you. Then, how do we integrate deep reinforcement
learning methods into the multi-agent reinforcement
learning domains, and how do we scale to large horizons as well? So we’ll talk about different
methods to do these things. So first, we’re going to focus
on decentralized learning. So like I said, using Dec-POMDP models is a common framework for multi-agent
reinforcement learning. Most of the methods though, do centralize learning for
decentralized execution. So all the learning is done
offline in a centralized fashion, so they can generate a set
of policies which can then be executed online in a
decentralized fashion. But this is problematic in
a couple of different ways. One, in order to really continue to learn online, you need to be decentralized, because execution is decentralized. So in order to learn online, you need to learn in a decentralized way, which means that each of the agents is continuing to learn while the other agents are also learning at the same time. This decentralized learning is potentially more scalable
as well because it means that there’s less information that all the different agents
have to keep track of. You keep track of your own history information separately, and just generate a policy directly from your own history information to the actions that you want to pick. So this is nice because it’s potentially more scalable, and we can directly apply, if we’d like, naively at least, single-agent RL methods to each of the different agents in the domain. But this is problematic because now the problem is non-stationary from the perspective
of a particular agent: I’m learning while at the same time you’re learning, so this is changing what it seems like the environment is doing. So we need to reason about this nonstationarity, or do something about it, to make these methods not quite so naive and work well in practice. So the first method that we came up with is a combination
of some of these ideas: using some ideas from multi-agent reinforcement learning and some ideas from deep reinforcement learning, combining them in a way that allows the method to perform well in these types of domains. So the basic idea here is, first, to use the idea of hysteresis, which was originally developed a
while ago for the non-deep case, the tabular case of multi-agent learning where we
now have two learning rates. So instead of just having
one learning rate, we have a learning rate that’s different depending on
what that TD error is. So for the negative case, when it seems like
something bad happened, we use a smaller learning rate. So that is, when
something bad happens, we’re assuming it’s because
the other agent was exploring, or the other agents are doing something stupid for some reason. So we want to discount that. We don’t want to learn from that
quite as much because that’s the random bad thing that happened that we don’t want
to really consider too much. So we use a smaller
learning rate there so we can be more optimistic, so that we can hopefully coordinate
on doing the good thing. So we want to learn
from these cases in which all the agents do the
right thing at the right time, and maybe not learn quite
as much from cases in which there’s random exploration
by any of the other agents, which is going to cause a
negative TD error in these cases. So we have these two
different learning rates for these different cases. So then, if we have a positive TD error, we’re going to change in the positive direction with the regular learning rate, and when we’re changing in the negative direction, we use the smaller learning rate. This can make us possibly a little bit too optimistic and can run into issues with stochasticity, but by adjusting these rates over time, we hope to be able to converge to a good value.
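As a rough sketch of that update (tabular, a single transition; the rates here are illustrative, with beta < alpha applied to negative TD errors):

```python
def hysteretic_q_update(Q, s, a, r, s_next,
                        alpha=0.1, beta=0.01, gamma=0.95):
    """One tabular hysteretic Q-learning update on a nested-list Q-table.

    alpha: regular learning rate, used when the TD error is positive.
    beta:  smaller rate used when the TD error is negative, so the agent
           stays optimistic when a bad outcome was likely caused by a
           teammate exploring rather than by its own action choice.
    """
    td = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += (alpha if td >= 0 else beta) * td
    return td
```

A good outcome moves the estimate at the full rate; an equally bad outcome moves it only a little, which is what makes the learner optimistic (and, as noted above, possibly too optimistic under stochasticity).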
Before we get there, though: for the deep case, we’re going to build on DQN; I think everybody in this audience knows what DQN is. Then a common thing that’s also done (I think Matthew is in here somewhere as well; he’s back there) is an extension to DQN, or any of these methods: stick a recurrent layer in there. So now this is going to
allow us to consider history and deal with this partial observability. So basically, the first piece, hysteresis, is going to help us deal with nonstationarity; DQN, or deep methods in general, is going to help us with scalability in the input space, so hopefully we can deal with larger observation spaces; and the recurrent layer, using DRQN, is going to help with the partial observability. One other thing: in these cases, when we’re training
these recurrent layers, we need to sample trajectories. Previous methods had not used the replay buffer prior to this, because it caused instability in learning: if each of the agents is pulling from a replay buffer separately, this causes the gradients to have issues and causes the learning to be unstable. So in this case, what we did was generate replay buffers that were synchronized across the agents. We can sample random seeds beforehand, and then during the learning phase we can index by the time, the episode, and the agent, so that when we’re sampling the trajectories from each of the different agents, they’re sampled from the same time steps, and we can be more robust in our learning. Again, we can do this in a decentralized way. This thing is called a concurrent experience replay trajectory, or CERT.
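A minimal sketch of that synchronization idea (the names and structure are mine, not the paper’s code): each agent keeps its own buffer of episodes, and a shared seed makes every agent draw the same (episode, start-time) indices, so the sampled traces line up across agents.

```python
import random

def sample_cert_minibatch(buffers, seed, batch_size, trace_len):
    """Sample concurrent trajectory slices from per-agent buffers.

    buffers: dict agent_id -> list of episodes, where each episode is a
    list of per-step experiences (one entry per primitive time step).
    All buffers hold the same episodes and time steps, so seeding a
    shared RNG makes every agent sample the SAME (episode, start-time)
    pairs, keeping the sampled traces aligned across agents.
    """
    rng = random.Random(seed)
    first_agent = next(iter(buffers))
    n_eps = len(buffers[first_agent])
    picks = []
    for _ in range(batch_size):
        ep = rng.randrange(n_eps)
        ep_len = len(buffers[first_agent][ep])
        t0 = rng.randrange(max(1, ep_len - trace_len + 1))
        picks.append((ep, t0))
    # Each agent extracts its own view of the same (episode, time) slices.
    return {
        agent: [episodes[ep][t0:t0 + trace_len] for ep, t0 in picks]
        for agent, episodes in buffers.items()
    }
```

Because every agent slices the same indices, the recurrent traces stay concurrent even though each agent only ever reads its own buffer.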
Okay. So using these methods, we could compare our method with the previous basic method: just using no hysteresis and no CERTs versus using hysteresis and the CERTs. What we essentially see in these cases is that our method is scalable to larger domains. So you see what happens here. So on top is the
previous methods, and this is a target capture domain. So we have two agents with partial observability as to the location of that other agent; that red target agent flickers. We need to try and
catch that other agent. If we’re in the same square, we get a reward of one. So in these cases, for small domains, they
get about the same value. So three-by-three, four-by-four,
they’re about the same. But then when you start moving
to five-by-five without using hysteresis and the CERTs,
learning becomes unstable. So it doesn’t do so well and then for six-by-six and seven-by-seven, the learning doesn’t really
happen for the other case. Whereas for our case, then
we’re able to learn for larger domains even
and get good values. We can actually scale to
larger agent sizes as well. I just didn’t include those results. We also in this paper had a version that tried to solve
multitask problems as well, but again, I’m not including those results either. That method works really well. It can solve those problems, but it has some issues scaling up
to problems with long horizons. For these long-horizon problems, especially with multiple agents, it’s very unlikely that those agents are going to coordinate on doing something that’s very far out into the future. So one thing that we can do, which I think Sheila mentioned as well, is use the idea of macro-actions. So in this particular problem, for instance (I should change these things), we’re trying to coordinate so that we deliver tools to these different people, which are a bunch of coauthors. If we need to think about it at a low level, for each of these agents, they might need to move here, then come back, then go over here and ask the PR2 for something, and continue on. The horizon of this problem is going to get quite long quite quickly. For this type of problem, there’s no coordination
at that level; for navigation, there’s very little coordination that needs to happen. So we can think of the problem where we break it up and have single-agent parts and multi-agent parts. For the single-agent parts, we can have a hierarchical kind of method where the single-agent part can be navigation, independent of the other agents. We can have low-level collision avoidance that they can learn, but really it’s just navigating from one location to another location. We don’t need to do multi-agent reinforcement learning at that level; we need to do multi-agent reinforcement learning at the level of how they coordinate, but not over how they navigate from point A to point B. So we can build macro-actions, which in our case are a type of option that we can have for the different agents, where we build those lower-level macro-actions and then do learning at the higher level over those independent single-agent macro-actions in order to scale to these large domains. Sometimes these are given
to us in advance, like navigation controllers for robots or grasping controllers for robots, these kinds of things. There are already very good solutions for those types of problems, and we can just use those directly; other times, we can talk about trying to learn those as well. We call this a macro-action Dec-POMDP, and it has slightly different notation, so I have a slide on that slightly different notation as well. So now a macro-action is M, and a high-level observation, a macro-observation, is Z, and then our policy representation will be over those Ms and Zs instead of the As and Os that we had in the previous case. Otherwise, the representation
will be essentially the same; the evaluation gets more complicated now, because it depends on time. I’m not going to go into this in too much detail, but I’m happy to talk about it if people want. Technically, the evaluation becomes a semi-Markov decision process, because we have to reason about how much time things take: a macro-action taken in a particular state s will complete after some time step k, leaving us in state s prime. So it becomes a semi-Markov process rather than a traditional Markov process. But this is the general model that you would get in this case.
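For reference, a standard semi-Markov backup over macro-actions, sketched in the talk’s notation ($m$ a macro-action, $k$ the number of primitive steps until it completes), looks like:

```latex
Q(s, m) \;=\; R(s, m) \;+\; \sum_{k=1}^{\infty} \sum_{s'} \gamma^{k} \, P(s', k \mid s, m) \, \max_{m'} Q(s', m')
```

Here $R(s, m)$ is the expected discounted reward accumulated while $m$ executes, and $P(s', k \mid s, m)$ is the probability that $m$ completes after $k$ steps ending in state $s'$, which is exactly the "will complete after time step k and will be in state s prime" reasoning described above.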
So now that this idea of macro-actions exists, how do we do learning for these cases? All the current deep multi-agent reinforcement learning methods assume synchronized primitive actions; none of them make use of these asynchronous macro-actions. So it isn’t clear how to incorporate this idea of asynchronous actions into the deep MARL methods. So let’s think about how
we’d like to do this. So the basic idea is that, in this case, we will make the assumption that we only get information at the macro-action level. At each time step, we get information about which macro-action each of the agents is executing (or, in the decentralized case, which macro-action that particular agent is executing), the particular macro-observation they see, and the joint reward. So we get this information every primitive time step. At the first time step, Agent 1 gets this information and Agent 2 gets this information: a particular macro-observation, the macro-action that it’s currently executing, and whether there’s a new observation, whether it’s terminated or not; each of the different agents gets this information. Sorry, I’m ignoring the people over here; I should point to stuff over there as well, wander that way. Then this continues. So at the next time step we’re
going to get that information, and at the next time step we get this information. Like I said, I’ll walk over here. So we’re accumulating this information per agent in terms of the rewards; that’s why the sum is happening. The sum is happening because, for each agent, you’re accumulating the reward while the macro-action is continuing, and then when the macro-action terminates, you no longer accumulate the reward and you start accumulating the next reward from the next time step. So we can generate this trajectory; these are the trajectories of the agents that are generated over time at the macro-action level, and we have all this information for these ten time steps.
So we can generate that, and then from there we can get what we call Mac-CERTs, just an extension of the original CERT idea. So in order to train
the recurrent network, we have to sample a sequence from these trajectories. So maybe we sample this particular set here, between three and eight, and then from there, each of the different agents will only have a particular view. Agent 1 will only have a view of its particular trajectory; Agent 2 will have its particular trajectory. Then in this case, we need to identify when the macro-actions change. We don’t need to have all the information about all of these time steps (we could, potentially), but we’re identifying when the macro-actions change. Then, in our case, we’re going to throw away the time info. We’re just going to compress them so that we’re ignoring the time info. You could imagine keeping the time information, but in our case we’re throwing it away. So now we just have the macro-action info for what happens within those time steps.
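A toy version of that per-agent compression (illustrative, not the paper’s code): accumulate the reward while a macro-action runs, emit one entry when it terminates, and drop the time indices.

```python
def squeeze_macro_trajectory(steps):
    """Compress a per-agent primitive-step trace to the macro-action level.

    steps: list of (macro_action, macro_obs, reward, terminated) tuples,
    one per primitive time step.  Rewards are accumulated while a
    macro-action runs; an entry is emitted when it terminates.
    Returns a list of (macro_action, macro_obs, summed_reward) entries
    with the time information thrown away.
    """
    out, acc = [], 0.0
    for m, z, r, done in steps:
        acc += r
        if done:  # macro-action finished: emit one compressed entry
            out.append((m, z, acc))
            acc = 0.0
    return out
```

The compressed sequence is what gets fed to the recurrent learner, so the network only ever sees one step per macro-action rather than one per primitive tick.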
So once we have this, we can essentially just throw it into the method that we talked about earlier. The Dec-HDRQN (there are too many letters there) is the method we talked about before: the hysteretic deep recurrent Q-network, which happens to also be decentralized. We can throw the macro-action-level information directly into that algorithm so that we can continue to learn in a decentralized way. These are all calculated in a decentralized way (I’m again ignoring the folks over there), so that we can get this information for Agent 1 and then this information for Agent 2. We can put those into the previous algorithm, and it gives us this particular loss function (this is the Double DQN version, but it doesn’t matter that much). So we can put it in the loss function in order to try and learn the Q-functions in that case. So this is the
decentralized version. The decentralized version is actually simpler than the centralized version. For the centralized version, we’re assuming now that we have perfect communication, and we’re going to just do centralized learning here. Centralized learning is potentially useful in the cases when we do in fact have full communication online as well as offline, or there are a bunch of methods that try to use the centralized values as an intermediate, to learn better decentralized values that can then be executed online. So in this case, we can do the buffer in a centralized way. We’re going to do the same thing as before, but here directly identifying when the macro-actions end, which we can do at the beginning or the end, but the buffer is the same. It’s just that we only have one joint reward that we’re accumulating here, rather than the different rewards for each agent, which we have in the decentralized case. The trick here is how we identify when a macro-action ends for any agent. This is the problem.
This is the problem. In the de-centralized case, it’s clear what it
means for termination, like if I terminates then I
stop accumulating my reward. But in the centralized case, it’s not clear because
it’s asynchronous. There’s really unlikely to be a chance when both agents
terminate at the same time, it’s always going to be one agent
terminate and one agent doesn’t. So you have to figure out
when you decide that, otherwise you have
just one big action. So you have to break it up into particular subsets and makes sure that you’re not over
counting in the reward. So then what we do
here is assume that a joint macro-action terminates when any agent’s macro-action terminates. So here Agent 2’s macro-action terminated, so this becomes one joint macro-action. Then after this, both of these agents terminated, so this is another chunk, right, and then on the next step Agent 2 terminated, so that’s another chunk. So we can break it up in that sort of way in order to deal with the asynchronous actions that we have here.
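That "a joint chunk ends whenever any agent’s macro-action terminates" rule can be sketched like this (toy code, with per-step termination flags per agent):

```python
def split_joint_chunks(term_flags):
    """Split primitive time steps into joint macro-action chunks.

    term_flags: list over time steps, each a tuple of booleans saying
    whether each agent's macro-action terminated at that step.  A joint
    chunk ends whenever ANY agent's macro-action terminates, so the
    joint reward accumulated per chunk is never double-counted.
    Returns a list of (start, end) index pairs, end inclusive.
    """
    chunks, start = [], 0
    for t, flags in enumerate(term_flags):
        if any(flags):  # some agent finished: close the chunk here
            chunks.append((start, t))
            start = t + 1
    return chunks
```

Each resulting chunk becomes one entry in the centralized buffer, with the joint reward summed over exactly the steps inside it.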
So we can generate that into a centralized buffer in this particular case, while we remove the time info, again just like we did before. Now we can learn in a centralized way using these centralized experiences; in this case, this is just the Double DRQN that we use for learning. There’s one other thing as well which we can use
in the centralized case, when we’re doing the argmax. What happens here is, remember that not all agents get to change their macro-actions at each of those steps, because what we know from these trajectories is that only some of the agents stopped, and therefore only some of the agents were able to change what they were doing; the problem is asynchronous. So the basic idea of this conditional target value prediction is that we fix the macro-actions for the agents that don’t change, and we only allow the argmax to go over the agents that do change. We only consider agents that are able to change their macro-action at that particular time step, based on the trajectory information that we have. There are lots of replay buffers, but the idea is relatively
straightforward I think. Okay. So once we have this idea, now we can talk about results
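A sketch of that conditional argmax over a small joint Q-table (shapes and names are illustrative; a deep version would apply the same masking to the network’s outputs):

```python
import itertools

def conditional_argmax(q_joint, prev_macro, can_change):
    """Pick the greedy joint macro-action, holding non-changing agents fixed.

    q_joint: dict mapping joint macro-action tuples -> Q-value.
    prev_macro: tuple of each agent's current macro-action.
    can_change: tuple of booleans; only these agents' entries may vary
    in the argmax (the others' macro-actions have not terminated yet).
    """
    n = len(prev_macro)
    all_actions = sorted({joint[i] for joint in q_joint for i in range(n)})
    # Candidate choices per agent: everything if it may change, else fixed.
    agent_choices = [
        all_actions if can_change[i] else [prev_macro[i]] for i in range(n)
    ]
    candidates = [j for j in itertools.product(*agent_choices) if j in q_joint]
    return max(candidates, key=lambda j: q_joint[j])
```

Fixing the non-terminating agents’ entries keeps the target value consistent with what could actually have happened at that step of the trajectory.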
and obviously we want to try and compare on the
previous domain as well. So this is the previous domain, the target capture domain. In this case, we didn’t learn the macro actions in
or the macro-observations; we set them for these problems. In this case it’s
pretty straightforward to set the original
actions to just up, down, left, right, and stay and the macro actions
in this case, you get an observation of the target, which flickers. So the macro-action is to move towards the last observed location of the target. Seems like a pretty straightforward
thing to do, right? So if we do that, in this
case the problem is so simple that for a four by four and a 10 by
10 version of the problem. So for the macro-action case, there’s very little learning that
even needs to be done because the macro-actions are pretty good
and the problems pretty simple. So in both of these cases, the primitive version and
the macro-action version, learn the same thing but
the macro-action version as you can imagine it’s going
to learn it much faster. But then we can think about
more complicated problems. This is a box pushing type problem where these are way points
that the agents can move to. We have these two robots. There’s a bigger box that they
can coordinate to push on or there’s these smaller boxes
that they can push independently. So they have to both push this together at the same time
in order to move the box, and they get a reward for moving
it to the goal at the top. So we can look at the
decentralized case here where we have the primitive-actions
and the macro-actions. For this case, again, in the macro-action case we can move to these
different way points, we can push these kind of things, and in the primitive case it’s up, down, turn, straight,
push, things like that. The values here are for the
10 by 10 and 30 by 30 case, where in the primitive-action version you see here that
the primitive-action learns a little bit but very slowly and does not
do very well at all. Whereas the macro-action version
can learn relatively fast, and then this is the optimal value that’s this dashed line up here. So it can pretty quickly get to the optimal
value for this problem. Then we can compare it
with a centralized method. As you would imagine the
centralized method is going to do better than a decentralized method
in this particular problem, because it has more information
coordinated explicitly. So the centralized method learns even faster than the decentralized
method for this problem. We can look at a more
complicated problem. This is the one-human version of the problem; we looked at a couple of different versions, but now this is a multi-robot problem, or at least it will be a multi-robot problem. Here we have a robot at a desk that can try to find objects that are on the desk. Then these are two delivery robots, similar to the picture that I showed earlier. There’s a human in a workshop here that’s doing some task, and we need to try to figure out which objects to get to the person at different times based on what their tasks are. So we need to monitor
them and bring them the right objects at
the right time in order to make sure that they can work efficiently and complete all
the tasks they need to do. Then we compared our
decentralized method and our centralized method
for this particular problem. The centralized method
could learn pretty quickly and converge to
a near optimal policy, while the decentralized method didn’t do so well on
this particular problem. But the reason why it didn’t do so well isn’t necessarily a
problem with the method itself, rather than a problem
with the problem itself. The problem is just really hard, and the information that you
get in the decentralized case is not enough to make it so
that you can coordinate well. In this particular case, the information that the Fetch robot had was just not enough, and the data that it got wasn’t enough for it to be able to figure out what the right objects were to give to the particular humans. So there are certainly problems for which there isn’t a good decentralized solution, whereas obviously the centralized solution will be better. Okay. So then we ran an extension of
our method on a real robot task. This is the real robot task, where we have our Fetch robot and our two TurtleBots. There’s a human worker over there who needs to be monitored. This is actually an extension of the method that I showed
on the previous slide. So it’s not quite the centralized
or decentralized method, it’s pretty much similar to
the centralized solution now, where they’re monitoring the
person and now bringing the person the first object or the first
tool that the human needs. The human’s trying to build this
table and needs the tape measure, and then the clamp,
and then the drill, in order to be able to build this table in the most efficient manner. This is obviously a simplified
version of the problem; you can imagine many humans and many more robots in a much more complicated version, but this stuff is hard
to get to work on a set of robots so you
want to start simple. So here you see that it gave
the person the tape measure, and now the next robots
getting the clamp. So it’ll bring the clamp to
the person while the fetch is giving the last tool to
the other robot there. So the clamp can get
brought to the human here, they can clamp the table in that case table is
sufficiently clamped I guess, and then they can finally have
the drill there and they can use the drill to finish drilling all the screws into the table in order to finish
the beautiful table.. So I think I’m quickly
running out of time, so I will skip this method, I think. I think I have five minutes left, so I’ll skip this method. It’s a non-deep method for trying to learn controller-based representations for the decentralized macro-action approach. So I’ll skip this even though it does have a cool robot video that I wanted to show. You can look on my web page for the robot video. I’ll just say, this is search and rescue, where there’s an aerial vehicle and
vehicles obviously, but the ground vehicles have to bring things to the people
and rescue the people. So they need to coordinate and
communicate with each other within a limited communication range in order to be able to figure
out where the people are, and get to them most effectively, and rescue them most effectively. So that’s a particular domain
that we are looking at there. Maybe for this audience
I don’t need this slide, but I like this slide. People often ask or often
try to use deep RL. Nowadays it’s quite popular for trying to solve many different problems, and it’s helpful in some ways, but it certainly doesn’t solve all the problems, especially in these partially observable multi-agent problems. So there are a number of big issues, some of which have been studied in the past and some of which there aren’t good solutions for; there’s a bunch of issues that still need to be solved to be able to deal with these large, interesting problems. One of which is centralized
versus decentralized learning. So the stuff that I was talking about today was the decentralized
learning case, which is the hardest case. It’s like all the agents
are acting online based on very limited
information that only they have. It’s quite hard to
learn in that case. So what are the best
methods that we can use? How do we use those signals in a better way, to do things that are more sophisticated than hysteresis, to be able to deal with the nonstationarity that we get when we’re doing decentralized learning? In the centralized learning case, what’s the best way of balancing the centralized information
in order to learn well? So there’s a bunch of methods
that are popular for using centralized value functions to help
learn decentralized solutions. But that can often be a bad thing. The centralized value function
can make you do stuff that you just can’t do in
the decentralized case. So we need to think more about what the best methods
are to use in that case as well. Obviously, again, the deep methods are popular but don’t always have very good sample efficiency. So how do we use them online in order to be able to learn quickly, and how do we deal with partial observability? Just using LSTMs in there is an easy solution, but it’s probably not the best solution. How can we better handle partial observability, take advantage of it, and come up with other structures in order to be able to learn well and deal with the partial observability that exists in these different domains? So I will conclude there by saying
reinforcement learning problems, one of the most general
representations is this Dec-POMDP, the decentralized POMDP, which considers the outcome, sensor, and communication uncertainty in these domains, so we can model any cooperative multi-agent coordination problem this way. So I talked about a
couple of different methods: one, just integrating deep learning with multi-agent reinforcement learning under this framework, and two, using macro-actions as an abstraction to improve scalability and the horizon here, using learning methods. So these methods also
apply in sub-classes as well, even in the super-class,
some of them. So if we don’t have
uncertainty or if we have different reward functions some
of these methods can still work. These are some methods that we’ve started to develop,
but like I said, there’s lots of cool open questions
that still need to be solved. I’m sure Sam and his team
will solve many of them, but there will still be
plenty open for other people. So I’m happy to talk about them as well afterwards
if people have questions.
