Lecture 21: Sequential decision making (part 2): The algorithms

An introduction for max-entropy RL algorithms

Recap: Control as Inference

In Control setting:

In Inference setting:

In the classical deterministic RL setup, we have:

\begin{aligned} & V_\pi(s) := \mathbb{E}_\pi \displaystyle \sum_{k=0}^T \left[\gamma^k r_{t+k+1} \mid s_t = s \right]\\ & Q_\pi (s, a) := \mathbb{E}_\pi \displaystyle \sum_{k=0}^T \left[\gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right]\\ & \pi_*(a \mid s) := \delta(a = \underset{a}{\text{argmax}} Q_*(s,a))\\ \end{aligned}

Let \(V_t(s_t) = \log \beta_t(s_t), Q_t(s_t, a_t) = \log \beta_t(s_t, a_t), V(s_t) = \log \int exp(Q(s_t, a_t) + \log p(a_t \mid s_t)) da_t​\). Denote \(\tau = (s_1,a_1, …, s_T,a_T)​\) as the full trajectory. Denote $p(\tau) = p(\tau \mid \mathcal{O}_{1:T})​$. Running inference in this GM allows us to compute:

\begin{aligned} & p(\tau \mid \mathcal{O}_{1:T}) \propto p(s_t) \displaystyle \prod_{t=2}^T p(s_{t+1} \mid s_t, a_t) \times exp(\displaystyle \sum_{t=1}^T r(s_t, a_t)) \\ & p(a_t \mid s_t, \mathcal{O}_{1:T}) \propto exp(Q_t(s_t, a_t) - V_t(s_t))​ \\ \end{aligned}

From the perspective of control as inference, we optimize for the following objective:

\[-D_{KL}(\hat{p}(\tau) \mid\mid p(\tau)) = \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat p(s_t, a_t)}\left[r(s_t, a_t)\right] + \mathbb{E}_{s_t \sim \hat{p}(s_t)}\left[\mathcal{H}(\pi(a_t \mid s_t))\right]​\]

Types of RL Algorithms

The objective of RL learning is to find the optimal parameter s.t. maximize the expected reward.

\[\theta^*=\arg\max_\theta \mathbb{E}_{\tau\sim p(\tau)}\left[\sum_{t=1}^T r(s_t,a_t)\right]\]

Policy gradients

In policy gradient, we directly optimize the target expected reward w.r.t the policy $\pi_\theta$ itself.

\[\begin{aligned} J(\theta)&=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} r\left(s_{i, t}, a_{i, t}\right) \\ \nabla_{\theta} J(\theta) &=\nabla_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[r(\tau)]=\int r(\tau) \nabla_{\theta} p_{\theta}(\tau) d \tau=\int r(\tau) p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) d \tau \\ &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[r(\tau) \nabla_{\theta} \log p_{\theta}(\tau)\right] \end{aligned}\]

The log-derivative trick is applied to the above equation. \(\begin{array}{l}{\nabla_{\theta} \log p_{\theta}(\tau)=\nabla_{\theta}\left[\log p\left(s_{1}\right)+\sum_{t=1}^{T} \log p\left(s_{t+1} | s_{t}, a_{t}\right)+\log \pi_{\theta}\left(a_{t} | s_{t}\right)\right]} \\ {\nabla_{\theta} J(\theta)=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\right)\left(\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right)\right)\right]}\end{array}\)

The reinforce algorithm: \(\begin{array}{l}{\text { 1. sample }\left\{\tau_{i}\right\}_{i=1}^{N} \text { under } \pi_{\theta}\left(a_{t} | s_{t}\right)} \\ {\text { 2. } \hat{J}(\theta)=\sum_{i}\left(\sum_{t} \log \pi_{\theta}\left(a_{i, t} | s_{i, t}\right)\right)\left(\sum_{t} r\left(s_{i, t}, a_{i, t}\right)\right)} \\ {\text { 3. } \theta \leftarrow \theta+\alpha \nabla_{\theta} \hat{J}(\theta)}\end{array}\)


Q-learning does not explicitly optimize the policy \(\pi_\theta\); it optimize the estimation of the \(V,Q\) functions. The optimal policy can then be calculated by \(\pi^{\prime}\left(a_{t} | s_{t}\right)=\delta\left(a_{t}=\arg \max _{a}\left[Q_{\pi}\left(a, s_{t}\right)\right]\right)\)

Policy iteration via dynamic programming:

The approach still involves explicit optimization of \(\pi_\theta\). We can rewrite the iteration as:

\[\begin{array}{l}{\text { 1. set } Q(s, a) \leftarrow r(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim p\left(s^{\prime} | s, a\right)}\left[V\left(s^{\prime}\right)\right]} \\ {\text { 2. } \operatorname{set} V(s) \leftarrow \max _{a} Q(s, a)}\end{array}\]

Fitted Q-learning:

If the state space is high-dimensional or infinite, it is not feasible to represent \(Q, V\) in a tabular form. In this case, we use two parameterized functions \(Q_\phi, V_\phi\) to denote them. Then, we adopt fitted Q-iteration as stated in this paper:

\[\begin{array}{l}{\text { 1. set } y_{i} \leftarrow r\left(s_{i}, a_{i}\right)+\gamma \mathbb{E}_{s^{\prime} \sim p\left(s^{\prime} | s, a\right)}\left[V_{\phi}\left(s_{i}^{\prime}\right)\right]} \\ {\text { 2. } \operatorname{set} \phi \leftarrow \arg \min _{\phi} \sum_{i}\left\|Q_{\phi}\left(s_{i}, a_{i}\right)-y_{i}\right\|^{2}}\end{array}\]

Soft Policy Gradients

From the perspective of control as inference, we optimize for the following objective:

\begin{aligned} J(\theta) &= -D_{KL}(\hat{p}(\tau) || p(\tau)) \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p(s_t, a_t)}}[r(s_t, a_t)] + \mathbb{E}_{s_t \sim \hat{p}(s_t)}[\mathcal{H}(\pi(a_t|s_t))]\\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t) - \log \pi(a_t|s_t)]\\ \end{aligned}​

Now following the policy gradient method, such as REINFORCE, we just need to add a bonus entropy term to the rewards.

Soft Q-learning

Next, we connect the previous policy gradient to Q-learning. We can rewrite the policy gradient as follows:

\begin{aligned} J(\theta) &= -D_{KL}(\hat{p}(\tau) || p(\tau)) \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p(s_t, a_t)}}[r(s_t, a_t)] + \mathbb{E}_{s_t \sim \hat{p}(s_t)}[\mathcal{H}(\pi(a_t|s_t))]\\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t) - \log \pi(a_t|s_t)]\\ \end{aligned}

Note that

\begin{aligned} \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[\log \pi(a_t|s_t)] &= \int \nabla_\theta \big[ p(\tau) \sum_{t=1}^T \log \pi(a_t|s_t)\big] d\tau\\ &= \int \nabla_\theta p(\tau)\sum_{t=1}^T \log \pi(a_t|s_t) + p(\tau) \nabla_\theta\sum_{t=1}^T \log \pi(a_t|s_t) d\tau\\ &= \int p(\tau) \nabla_\theta \log p(\tau)\sum_{t=1}^T \log \pi(a_t|s_t) + p(\tau) \nabla_\theta \log p(\tau) d\tau\\ &= \int p(\tau) \nabla_\theta \log p(\tau)\Big [ \sum_{t=1}^T \log \pi(a_t|s_t) + 1 \Big] d\tau.\\ \end{aligned}

Recall from the previous lecture, \(\pi(a_t|s_t) = p(a_t | s_t, \mathcal{O}_{1:T}) = \exp(Q_\theta(s_t, a_t) - V_\theta(s_t))\)

\(V(s_t) = \log \int \exp(Q(s_t, a_t))da_t = \text{softmax}_{a_t} Q(s_t, a_t).\) Now combine these and rearrange the terms, we get:

\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t) - \log \pi(a_t|s_t)]\\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t|s_t) \Big[ r(s_t, a_t) + \big(\sum_{t'=t+1}^T r(s_{t'}, a_{t'}) - \log \pi_\theta(a_{t'}|s_{t'})\big) - \log \pi_\theta (a_t|s_t) - 1 \Big] \\ &= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Big(\nabla_\theta Q_\theta(s_t, a_t) - \nabla_\theta V_\theta(s_t)\Big) \Big[ r(s_t, a_t) + Q_\theta(s_{t'}, a_{t'}) - Q_\theta(s_t, a_t) + V(s_t) \Big] \\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta Q_\theta(s_t, a_t) \Big[ r(s_t, a_t) + \text{soft} \max_{a_{t'}} Q_\theta(s_{t'}, a_{t'}) - Q_\theta(s_t, a_t) \Big] \\ \end{aligned}

Now the soft Q-learning update is very similar to Q-learning: \(\theta \gets \theta + \alpha\nabla_\theta Q_\theta(s,a)(r(s,a) + \gamma V(s') - Q_\theta(s,a)),\) where \(V(s') = \text{soft}\max_{a'} Q_\theta(s', a') = \log \int \exp (Q_\theta(s', a')) da'.\) Additionally, we can set the temperature in softmax to control the tradeoff between entropy and rewards.

To summaize, there are a few benefits of soft optimality: