32nd Annual POMS Conference (Talk 1)

Abstract

We extend the idea underlying the success of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for infinite-horizon Markov decision processes (MDPs). The existing GS-PG method was designed to learn from complete episodes, which limits its applicability in low-data environments and online process control. In this paper, we present a mixture likelihood ratio (MLR) based policy gradient estimator that can leverage information from historical state transitions generated under different behavioral policies. We propose a variance reduced experience replay (VRER) method that can intelligently select and reuse the most relevant transition observations, improve the accuracy of policy gradient estimation, and accelerate the learning of an optimal and robust policy. We then create an algorithm by incorporating VRER into state-of-the-art step-based policy optimization algorithms such as the actor-critic method and proximal policy optimization (PPO). Our empirical study demonstrates that the proposed policy gradient methodology can significantly outperform existing policy optimization approaches.
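To give a flavor of the estimator being described, here is a rough sketch of a mixture-likelihood-ratio weighted policy gradient; the notation is assumed for illustration and may differ from that used in the talk. Suppose the replay buffer B holds n state-action transitions generated under K behavioral policies with parameters \(\theta_1,\dots,\theta_K\), and \(\widehat{A}(s,a)\) denotes an advantage estimate. A mixture likelihood ratio estimator of the policy gradient at the current parameter \(\theta\) then takes the form

\[
\widehat{\nabla J}(\theta) \;=\; \frac{1}{n}\sum_{(s,a)\in B}
\frac{\pi_\theta(a \mid s)}{\frac{1}{K}\sum_{k=1}^{K}\pi_{\theta_k}(a \mid s)}
\,\nabla_\theta \log \pi_\theta(a \mid s)\,\widehat{A}(s,a).
\]

Weighting by the mixture of behavioral policy likelihoods in the denominator, rather than by a single behavioral policy, keeps the likelihood ratios more stable, which is the intuition behind the variance reduction that VRER exploits when deciding which historical transitions to reuse.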

Date
Apr 23, 2022 4:30 PM — 5:30 PM
Location
Virtual