In the previous post we introduced the basics of reinforcement learning (RL) and the type of problem it can be applied to. The setting we discussed was limited in the sense that we were dealing with a single agent acting in a stationary environment. Now we will take it one step further and discuss **Multi-Agent Reinforcement Learning** (**MARL**). Here we deal with multiple explicitly modeled agents in the same environment; hence, from the perspective of each agent, all other agents are part of the environment. Since all agents learn over time and start to behave differently, the assumption of a stationary environment is violated.

# Recap: Single Agent RL

In the last post, we discussed fundamental concepts of RL, such as the **Markov Decision Process** (**MDP**). Here the agent moves from state $$s$$ to state $$s'$$ by choosing the action $$a$$ and receives a reward $$R(s, s', a)$$. From repeated interactions with the MDP, the agent learns the Q-function $$Q(s, a)$$, which measures the expected return of choosing action $$a$$ in state $$s$$. The Q-function forms the basis of the agent's decision making. This form of RL is called **Q-Learning**.

There is an alternative approach which we did not discuss, called actor-critic. While it is in principle a single-agent algorithm as well, it is used quite often for MARL, including in the research we will discuss below.

## Actor-critic

Let's recall the optimization problem that is to be solved with RL. Basically, we try to learn a policy $$\pi^{\ast}$$ that optimizes the expected discounted reward, i.e.

**[Loss]**

$$ \pi^{\ast} ={\mathrm{argmax}}_{\pi} \mathbb{E}_{P, \pi} \left[ \sum_{t=0}^T \gamma^t r_t \right] $$,

where $$r_t := R(s_t, s_{t+1}, a_t)$$ is the reward received at time step $$t$$ and $$\gamma$$ is the discount factor, a value between $$0$$ and $$1$$.

For Q-learning, we keep the policy fixed (e.g. $$\epsilon$$-greedy) and optimize the Q-function, thus learning the policy indirectly.
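To make this concrete, here is a minimal sketch of tabular Q-learning with an $$\epsilon$$-greedy policy. All names, the toy state/action spaces and the hyperparameters are illustrative assumptions, not taken from any particular implementation:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # With probability eps explore a random action, otherwise exploit Q.
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, n_actions, alpha=0.5, gamma=0.9):
    # Q-learning target: r + gamma * max_a' Q(s', a').
    best_next = max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Tiny illustration on a 2-state, 2-action problem.
Q = defaultdict(float)
q_update(Q, s=0, a=1, r=1.0, s_next=1, n_actions=2)
```

After this single update, $$Q(0, 1) = 0.5$$, so a greedy policy would already prefer action $$1$$ in state $$0$$.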

**Actor-critic** methods, like Q-Learning, also solve [Loss]. The main difference is that instead of fixing the policy and learning the Q-values, actor-critic methods employ a specific type of policy-gradient learning: the policy $$\pi_\theta$$ is parameterized with $$\theta$$ and the gradient of [Loss] is computed with respect to $$\theta$$. The actor acts according to $$\pi_\theta$$. In order to reduce the variance of the gradient, a critic is used, parameterized by $$\omega$$. While there are multiple possible candidates for the critic, a common choice is the advantage function

$$ A(s_t, a_t) = r_t + \gamma V_\omega(s_{t+1}) - V_\omega(s_t) $$

as it is used in Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C). $$V_\omega (s)$$ is the value function

$$ V_\omega (s) = \mathbb{E}_{P, \pi} \left[ \sum_{t=0}^T \gamma^t r_t | s_0 = s \right] $$

and, in layman's terms, quantifies how "good" it is to be in state $$s$$. The parameters of $$\pi_\theta$$ are optimized according to the gradient

$$ \nabla_\theta J(\theta) \sim \sum_{t=0}^T \nabla_\theta \log \left[ \pi_\theta(a_t|s_t) \right] A(s_t, a_t) $$

In summary, for actor-critic methods like A2C the actor, represented by the policy $$\pi_\theta$$, and a critic, here represented by the value function $$V_\omega$$, have to be trained by adjusting the parameters $$\theta$$ and $$\omega$$. As with Q-Learning, $$\pi_\theta$$ and $$V_\omega$$ can be approximated with Deep Neural Networks.
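As a rough illustration (not the exact A2C/A3C implementation), the advantage and the policy-gradient term for a simple softmax policy can be sketched as follows; the function names and the toy numbers are my own:

```python
import math

def advantage(r, v_s, v_next, gamma=0.99):
    # A(s, a) = r + gamma * V(s') - V(s), the critic's TD error.
    return r + gamma * v_next - v_s

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def actor_grad(logits, a, adv):
    # Gradient of log pi(a|s) * A for a softmax policy over the logits:
    # d/d logit_k log pi(a|s) = (1 if k == a else 0) - pi(k|s).
    pi = softmax(logits)
    return [((1.0 if k == a else 0.0) - pi[k]) * adv for k in range(len(logits))]

adv = advantage(r=1.0, v_s=0.0, v_next=0.0)  # positive: action was better than expected
g = actor_grad([0.0, 0.0], a=0, adv=adv)     # pushes up the probability of action 0
```

A positive advantage increases the logit of the chosen action and decreases the others, which is exactly the variance-reduced policy-gradient step described above.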

## Partially Observable Markov Decision Process (POMDP)

In the previous post of this series we introduced the MDP. For many real-world applications, we have to extend this concept a little bit. In an MDP the learning agent knows exactly which state it is in, i.e. the state $$s \in S$$ is fully observed. However, there might be some uncertainty in the observations of the state, e.g. because the agent only has limited information about the true state of the environment, as with a robot interacting with a complex outside world based on limited sensor inputs. This setting is modeled by a **Partially Observable Markov Decision Process** (**POMDP**).

A **POMDP** is a 6-tuple $$(S, A, R, P, O, T)$$, where $$S$$ is the set of states, $$A$$ the set of actions, $$R: S\times S \times A \rightarrow \mathbb{R}$$ a reward function, $$P: S \times S \times A \rightarrow \left[ 0, 1\right] $$ the conditional transition probability, $$O$$ the set of observations and $$T: S \times O \rightarrow \left[ 0,1 \right]$$ the conditional observation probability.

In a POMDP the environment's state $$s$$ is not presented to the agent; instead, an observation $$o \in O$$ is given with probability $$T(o|s)$$. Besides this, the POMDP works the same as the MDP: the agent is in state $$s \in S$$, but only observes $$o \in O$$. Based on that, it decides to apply action $$a \in A$$ and via the transition probability $$P$$ the agent moves to a new state $$s'$$. For this transition the agent is rewarded with $$R(s, s', a)$$. Note the key difference to the MDP: the agent chooses the action based on the observation $$o$$ and not based on the underlying state $$s$$.
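A single POMDP interaction step can be sketched as below. The dictionary-based representation of $$P$$ and $$T$$ and all names are illustrative assumptions, chosen so the toy example stays deterministic:

```python
import random

def pomdp_step(s, a, P, T, R):
    # Sample the next state s' ~ P(.|s, a), then an observation o ~ T(.|s'),
    # and compute the reward R(s, s', a). The agent only ever sees o, not s'.
    next_states, probs = zip(*P[(s, a)].items())
    s_next = random.choices(next_states, weights=probs)[0]
    obs, oprobs = zip(*T[s_next].items())
    o = random.choices(obs, weights=oprobs)[0]
    return s_next, o, R(s, s_next, a)

# Toy instance with one deterministic transition and observation.
P = {("s0", "a"): {"s1": 1.0}}
T = {"s1": {"o1": 1.0}}
s_next, o, r = pomdp_step("s0", "a", P, T, lambda s, sp, a: 1.0)
```

The agent's policy would only receive `o`, while the environment keeps `s_next` internal, which is the defining difference to the MDP.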

# Multi-agent Reinforcement Learning

Two of the main issues of MARL are non-stationarity and partial observability. Below, we highlight a few recent publications that address these problems.

MARL comes in multiple different flavors which are characterized by the relationship of the agents, see Zhang 2019. There is the **cooperative setting**, where the agents work together to solve a given task, the **competitive setting**, where the agents attempt to best each other in a game, and the **mixed setting**, where the agents' relationship is unrestricted. As of now, MARL is mostly investigated in the academic community and examples of real-world applications are scarce. However, at dida we got together with the German railway company Deutsche Bahn to research applications of MARL for capacity and traffic management. We found that this particular problem, as well as others of industrial interest, such as resource allocation and grid optimization, can be seen as instances of the cooperative MARL setting. Hence I give a brief overview of interesting approaches that make MARL a viable approach to many problems.

## Formal aspects

The problem setting of MARL can be formalized with the notion of the **stochastic game**.

An N-player **stochastic game** is the tuple $$(S, A_1, A_2, ... , A_N, R_1, R_2, ..., R_N, P)$$, where $$S$$ is the set of states, $$A_i$$ is the set of actions of agent $$i$$, $$R_i: S \times S \times A_1 \times ... \times A_N \rightarrow \mathbb{R}$$ the reward function of agent $$i$$ and $$P: S \times S \times A_1 \times ... \times A_N \rightarrow [0, 1]$$ is the conditional transition probability.

In state $$s \in S$$ each agent $$i$$ picks an action $$a_i \in A_i$$ and the game transitions into state $$s' \in S$$ with probability $$P(s'|s, a_1, ... , a_N)$$. As a consequence, each agent $$i$$ receives a reward $$R_i(s, s', a_1, ..., a_N)$$. Analogous to single agent RL, each agent $$i$$ is represented by its policy $$\pi_i: S \times A_i \rightarrow [0, 1]$$. The individual actions are often written as a joint action $$\pmb{a} = [a_1, ..., a_N]$$ and the policies are combined to policy $$\pmb{\pi}(\pmb{a} | s) = [\pi_1(a_1|s), ..., \pi_N(a_N|s)]$$.
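A single transition of such a stochastic game can be sketched as follows; the dictionary representation of $$P$$ and the toy cooperative reward are illustrative assumptions:

```python
import random

def game_step(s, joint_action, P, rewards):
    # joint_action is the tuple (a_1, ..., a_N); the transition probability
    # and every reward function R_i are conditioned on the full joint action.
    next_states, probs = zip(*P[(s, joint_action)].items())
    s_next = random.choices(next_states, weights=probs)[0]
    r = [R_i(s, s_next, joint_action) for R_i in rewards]
    return s_next, r

# Two agents with an identical reward function: a cooperative game.
P = {("s0", ("left", "right")): {"s1": 1.0}}
shared = lambda s, sp, a: 1.0
s_next, r = game_step("s0", ("left", "right"), P, [shared, shared])
```

Giving every agent the same reward function, as in this toy example, is one common way to obtain the cooperative setting.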

Each agent $$i$$ maintains its own value function $$v_i^{\pmb{\pi}}$$ under the joint policy

$$ v_i^{\pmb{\pi}}(s) = \mathbb{E}_{P, \pmb{\pi}} \left[ \sum_{t=0}^T \gamma^t r_{t,i} | s_0 = s\right] $$,

where $$r_{t,i} = R_i(s_t, s_{t+1}, a_{t,1}, ..., a_{t,N})$$. The Q-function is defined very similarly:

$$ Q_i^{\pmb{\pi}}(s, \pmb{a}) = \mathbb{E}_{P, \pmb{\pi}} \left[ \sum_{t=0}^T \gamma^t r_{t,i} | s_0 = s, \pmb{a}_0 = \pmb{a}\right] $$.

With these notations MARL looks fairly similar to the single agent case. Whether the setting is cooperative, competitive or mixed is determined by the individual reward functions.
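For intuition, $$v_i^{\pmb{\pi}}(s)$$ can be estimated from sampled trajectories by computing each agent's discounted return; this short sketch (with invented toy rewards) shows the computation:

```python
def discounted_return(rewards, gamma=0.9):
    # Monte-Carlo estimate of one agent's value: sum_t gamma^t * r_{t,i}
    # along a single sampled trajectory.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Per-agent reward sequences from one joint trajectory of a 2-agent game.
rewards_per_agent = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
returns = [discounted_return(r) for r in rewards_per_agent]
```

Note that each agent's return depends on the whole joint policy, since every reward $$r_{t,i}$$ was generated by the joint actions of all agents.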

## Decentralized Actors

In Lee 2019 the authors attempt to solve a cooperative MARL scenario with partial observability via a **decentralized imitation learning** approach. The key contributions are an improved training method for decentralized actors, using actor-critic RL with a centralized critic, and an improvement of the training data with examples sampled from a reference policy. As the critic is only relevant during training, at test time the agents are decentralized. The reference policy used for improving the training data is a traditional MARL solution. This makes it possible to harvest global information about the system to inform the training of independent actors with limited visibility.

## Learning efficient communication

In Kim 2019 the authors also use actor-critic Deep Reinforcement Learning to train decentralized actors with a centralized critic. They named their algorithm **SchedNet**. The problem was again fully cooperative and the actors had to learn efficient communication via a channel of limited bandwidth, i.e. only a limited number of agents were allowed to share their information at any given time. For this, the agents had to learn to assign weights to their own messages, and a scheduler dispatched the messages based on these weights. An agent consisted of three Neural Networks, one each for message generation, weight generation and action selection. The actions were determined by the received messages. The algorithm was tested in a predator-prey scenario and compared to algorithms without communication and with communication but without intelligent scheduling of the messages. SchedNet outperformed its competitors and only came in second to a set of agents whose communication bandwidth was unlimited. This shows that efficient communication can help decentralized actors solve multi-agent tasks much more effectively and that the relevant communication can be learned as well.
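The scheduling idea can be sketched, in simplified form, as selecting the $$k$$ agents with the highest self-assigned weights; this is a rough reading of the mechanism, not SchedNet's actual implementation:

```python
def schedule_topk(weights, k):
    # Pick the k agents with the highest self-assigned message weights;
    # only their messages are broadcast this step (limited bandwidth).
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

# Four agents, bandwidth for two messages per step.
selected = schedule_topk([0.1, 0.9, 0.4, 0.7], k=2)
```

Since the weights are themselves produced by a learned network, agents can learn to claim bandwidth exactly when their local information matters most.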

## Mean-field MARL

The more agents interact in an environment, the larger the space of possible interactions between agents becomes. This can hinder learning and limits the number of agents that can be deployed in the environment before training no longer converges. In Yang 2018 the authors tackle this issue by reducing the problem to MARL with only two agents by using a **mean field approximation**. Normally, the Q-function in MARL would depend on the joint action of all agents, a space that grows exponentially with the number of agents. This complexity can be reduced by only considering a limited number of agent interactions, e.g. the interactions between an agent and its nearest neighbors. For a large number of agents, this might still be too complex. Hence the authors reduce the problem to a two-agent interaction by only considering the interaction between an agent and the mean of its nearest neighbors. With this approach they manage to solve much larger problem settings than was possible with traditional MARL methods.
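The core of the mean field approximation can be sketched as replacing the neighbors' joint action by their empirical mean action; the one-hot encoding and all names here are illustrative:

```python
def mean_field_action(neighbor_actions, n_actions):
    # Replace the joint action of all neighbors by the empirical mean of
    # their one-hot encoded actions; the Q-function then only needs to
    # depend on (s, a_i, mean_action) instead of the full joint action.
    mean = [0.0] * n_actions
    for a in neighbor_actions:
        mean[a] += 1.0 / len(neighbor_actions)
    return mean

# Three neighbors choosing among 2 actions.
m = mean_field_action([0, 1, 1], n_actions=2)
```

The mean action has a fixed size regardless of the number of neighbors, which is what keeps the Q-function tractable as the agent count grows.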

# Conclusion

All in all, MARL is an up-and-coming field of research. While industrial applications are still very rare in practice, there are many possible applications under investigation. I hope this article gave a broad overview of the potentials and limitations of the approach, as well as inspiration for considering MARL as a possible solution approach.