Some Gaps Between Reinforcement Learning Practice and Theory


I have recently been running into some theoretical problems during a reinforcement learning (RL) research project, and I started to realize how large the gap is between how people use RL algorithms and how people prove things about them. Although rarely mentioned in the literature, these gaps do exist and are worth thinking about a little more deeply, which motivated me to write some of my thoughts down.

Note: Some of the practical heuristics discussed here might already have theoretical justifications that I am not aware of, or might be bypassed by reasonable assumptions. Please correct me if I have misunderstood something or made a mistake.

Experience Replay Causes Bias

Off-policy learning is a large category of RL algorithms: instead of following the target policy, it allows the target policy to be learned from data generated by behavior policies. Experience replay is central to off-policy algorithms in deep RL. As the fundamental data-generating mechanism in off-policy deep RL, it has been shown to improve sample efficiency and stability by storing a fixed number of the most recently collected transitions and randomly replaying them for training.

One interesting problem with experience replay is that the stored transitions are inevitably dependent on the current/target policy or value function, because the policy/value-function parameters are in fact constructed from those transitions. In other words, all transitions are correlated. This violates the following conditional unbiasedness assumption used in stochastic policy gradient methods.

  • The estimated gradient is unbiased conditional on the history (filtration) $\mathcal{F}_k=\{\xi_1,\ldots,\xi_{k};\theta_0,\theta_1,\ldots,\theta_{k}\}$, i.e. $\mathbb{E}_\xi[g(\theta_k,\xi)\,|\,\mathcal{F}_{k}]=\nabla f(\theta_k)$, where $\mathcal{F}_{k}$ denotes the outcomes up to and including time $k$.

In classical stochastic gradient descent (SGD), the data are assumed to be sampled from an unknown population, which is also referred to as the behavior distribution in off-policy batch RL. However, when you have a growing experience replay buffer whose transitions are continuously collected by following past policies, you no longer have a fixed unknown behavior distribution but a changing, known one. This conditional dependence is the main concern that makes the theoretical study of experience replay difficult.
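
To make the dependence concrete, here is a minimal, purely illustrative Python sketch (a toy linear-Gaussian environment and policy made up for this post, not any specific algorithm): every transition placed in the buffer was generated by an earlier value of the very parameter that the replayed minibatch is later used to update.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

theta = 0.0                     # policy parameter: a ~ N(theta * s, 1)
s = 1.0
buffer = deque(maxlen=10_000)   # fixed-size FIFO replay buffer

for step in range(1_000):
    a = theta * s + rng.normal()                      # behavior depends on the current theta
    s_next = 0.9 * s + 0.1 * a + 0.1 * rng.normal()   # toy dynamics
    r = -(s_next ** 2)
    buffer.append((s, a, r, s_next))                  # stored transition "remembers" an old theta
    s = s_next

    if len(buffer) >= 64:
        batch = random.sample(buffer, k=64)           # uniform minibatch from the buffer
        # Naive likelihood-ratio gradient estimate on replayed data: conditional on the
        # history, this need not be unbiased for the gradient at the current theta,
        # because the replayed transitions were generated by past values of theta.
        grad = np.mean([(a_i - theta * s_i) * s_i * r_i for (s_i, a_i, r_i, _) in batch])
        theta += 1e-3 * grad
```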

Compatibility of the Critic Model

Compatibility is an important concept in actor-critic methods, which are a popular class of on-policy RL algorithms for computing an optimal policy. An actor-critic algorithm consists of two eponymous components: an actor, which adjusts the parameters of the stochastic policy by stochastic gradient ascent, and a critic, which estimates the action-value function to evaluate the actions chosen by the current policy. Since the true action-value function is unknown, a function approximator is used in its place, which introduces the concept of compatibility.
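
As a concrete (and purely illustrative) picture of these two components, here is a minimal one-step actor-critic sketch in Python on the same kind of toy linear-Gaussian problem as above; it is not any specific published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s, a):
    # hand-picked critic features (an illustrative, arbitrary choice)
    return np.array([s, a, s * a, 1.0])

theta = 0.0                  # actor parameter: pi_theta(a|s) = N(theta * s, 1)
w = np.zeros(4)              # critic parameter: Q_w(s, a) = w . features(s, a)
alpha_actor, alpha_critic, gamma = 1e-3, 1e-2, 0.99

s = 1.0
for step in range(1_000):
    a = theta * s + rng.normal()                      # actor: sample an action
    s_next = 0.9 * s + 0.1 * a + 0.1 * rng.normal()   # toy dynamics
    r = -(s_next ** 2)
    a_next = theta * s_next + rng.normal()

    # critic: one-step TD(0) update toward r + gamma * Q_w(s', a')
    td_err = r + gamma * w @ features(s_next, a_next) - w @ features(s, a)
    w += alpha_critic * td_err * features(s, a)

    # actor: stochastic gradient ascent using the critic's value estimate
    score = (a - theta * s) * s                       # grad_theta log N(a; theta*s, 1)
    theta += alpha_actor * score * (w @ features(s, a))

    s = s_next
```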

In general, substituting a function approximator for the true action-value (or value) function may introduce bias. However, the bias is zero if the function approximator is compatible, i.e.

  1. the function approximator is linear in the score function of the policy, and
  2. the parameters of the function approximator are chosen to minimize the mean-squared error between the approximate and the true action-value function (both conditions are written out below).
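
Written out explicitly, in the standard notation of the policy gradient theorem with compatible function approximation (which I assume here), the two conditions on a critic $Q^w$ read:

$$ Q^w(s,a) = \nabla_\theta \log \pi_\theta(a|s)^\top w, \qquad w = \arg\min_{w'} \mathbb{E}_{s\sim d^\pi,\,a\sim \pi_\theta}\!\left[\big(Q^{w'}(s,a)-Q^\pi(s,a)\big)^2\right] $$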

In practice, condition 2 is usually relaxed in favor of policy-evaluation algorithms that estimate the value function more efficiently by temporal-difference learning. However, condition 1 is rarely satisfied. This issue applies to almost all, if not all, modern deep policy optimization algorithms.

The good news is that people have recognized this issue, and some of the literature is working on relaxing the compatibility constraint for neural networks [1-3].

Discounted Objective

I recently revisited the proof of the policy gradient theorem while working on an actor-critic-related paper, which led me to the following issue.

After a quick literature review, I found that Russo Alessio mentioned the same issue in his blog [4], and Nota and Thomas [5] raised concerns about a similar issue by constructing a counterexample. In addition, Naik et al. [6] criticize the discounted objective as well, but in a different context. I personally have used the discounted objective a lot, yet still felt unclear about some of the concepts. Part of the reason comes from inaccurate or ambiguous descriptions; the other part, I think, comes from the fact that RL is an interdisciplinary area with contributions from optimization, operations research, statistics, and computer science. As a result, people use different terminologies and notations, which further exaggerates the ambiguity.

Now let’s briefly discuss the issue in the infinite-horizon MDP setting. For a detailed discussion, I recommend reading the blog post and literature mentioned above.

In the classic policy gradient theorem, the discounted policy gradient tells us how to modify the policy parameters $\theta$ in order to increase $J_\gamma(\theta)$, and is given by:

$$ \nabla J_\gamma(\theta) = \mathbb{E}_{s_t\sim d^\pi_\gamma,\, a_t\sim \pi_\theta(\cdot|s_t)}\left[ \nabla_\theta \log\pi_\theta(a_t|s_t)\, Q^\pi_\gamma(s_t,a_t)\right] $$

where $d^\pi_\gamma(s)=\int p(s_0)\sum^\infty_{t=0}\gamma^t p^\pi(s_t=s|s_0)\,d s_0$ is the discounted stationary state distribution (the discounted state-occupancy measure).
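
Equivalently, up to the normalizing factor $1-\gamma$ (which I ignore here), sampling $s\sim d^\pi_\gamma$ can be read as first drawing a random time index from a geometric distribution and then returning the state visited at that time:

$$ \Pr(t)=(1-\gamma)\gamma^{t}, \qquad s_0\sim p(s_0), \qquad s\sim p^\pi(s_{t}=\cdot\mid s_0) $$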

Problem: the samples used to update the policy should be distributed according to the discounted state distribution $d^\pi_\gamma(s)$; however, the samples actually used in most algorithms follow $s_{t+1}\sim p(s_{t+1}|s_t,a_t),\ a_t\sim \pi_\theta(\cdot|s_t)$. In the average-return case this is allowed, because we can reach the limiting distribution by continuously running the simulation, i.e. $d^\pi(s)=\lim_{t\rightarrow \infty}p(s_{t}=s|s_0)$. But in the discounted case, we cannot naively average samples from the replay buffer, because the discounted stationary state distribution treats observations sampled at different time steps differently.
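
To see the mismatch numerically, here is a small, purely illustrative Python sketch on a toy two-state Markov chain (made up for this post): the uniform average over a trajectory estimates the undiscounted stationary distribution, while the discounted policy gradient asks for states weighted by $(1-\gamma)\gamma^t$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# toy two-state Markov chain under a fixed policy (transition matrix is arbitrary)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

T = 10_000
traj = np.zeros(T, dtype=int)
for t in range(1, T):
    traj[t] = rng.choice(2, p=P[traj[t - 1]])

# what most implementations do: average uniformly over the collected states,
# which estimates the *undiscounted* stationary distribution d^pi(s)
d_undiscounted = np.bincount(traj, minlength=2) / T

# what the discounted policy gradient asks for: weight the state visited at
# time t by (1 - gamma) * gamma^t, i.e. the (normalized) distribution d^pi_gamma(s)
weights = (1 - gamma) * gamma ** np.arange(T)
d_discounted = np.array([weights[traj == s].sum() for s in (0, 1)])
d_discounted /= d_discounted.sum()   # renormalize the truncated horizon

print(d_undiscounted, d_discounted)  # the two distributions generally differ
```

With $\gamma$ close to 1 the two estimates get closer, but for any fixed $\gamma<1$ the discounted distribution puts extra weight on states visited early in the trajectory.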

References

  1. Wang, D., & Hu, M. (2021). Deep Deterministic Policy Gradient With Compatible Critic Network. IEEE Transactions on Neural Networks and Learning Systems.
  2. Diddigi, R. B., Jain, P., & Bhatnagar, S. (2021). Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm. arXiv preprint arXiv:2110.10017.
  3. Balduzzi, D., & Ghifary, M. (2015). Compatible value gradients for reinforcement learning of continuous deep policies. arXiv preprint arXiv:1509.03005.
  4. Russo, A. (2021). Why there is a problem with the Policy Gradient theorem in Deep Reinforcement Learning. Towards Data Science. https://towardsdatascience.com/why-there-is-a-problem-with-the-policy-gradient-theorem-in-deep-reinforcement-learning-958d845218f1
  5. Nota, C., & Thomas, P. S. (2020). Is the Policy Gradient a Gradient? AAMAS 2020.
  6. Naik, A., Shariff, R., Yasui, N., Yao, H., & Sutton, R. S. (2019). Discounted reinforcement learning is not an optimization problem. NeurIPS 2019.