Doesn't PPO, at least the vanilla variant, only work on-policy? That is, it learns from recently collected rollouts under the current policy, not from an experience replay buffer?
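
To make concrete what I mean by on-policy, here's a rough numpy sketch of the loop as I understand it (my own illustrative code, a toy contextual-bandit setup rather than a full MDP, so the environment, baseline, and hyperparameters are all made up): collect a fresh batch with the current policy, run a few epochs of clipped updates on that same batch, then throw it away — there's no persistent replay buffer being sampled from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "environment": observation is a scalar, two actions; reward is +1
# when the action matches the sign of the observation, else 0.
def step(obs, action):
    return float(action == (obs > 0))

# Softmax policy with linear logits per action: logits = w * obs + b
w = np.zeros(2)
b = np.zeros(2)

def probs(obs):
    logits = w * obs + b
    z = np.exp(logits - logits.max())
    return z / z.sum()

clip_eps = 0.2
lr = 0.1

for iteration in range(50):
    # 1) Collect a fresh batch with the CURRENT policy (the on-policy part).
    batch = []
    for _ in range(256):
        obs = rng.normal()
        p = probs(obs)
        a = rng.choice(2, p=p)
        r = step(obs, a)
        batch.append((obs, a, p[a], r))  # store the old prob for the ratio

    # Crude advantage estimate: reward minus the batch mean reward.
    mean_r = np.mean([r for *_, r in batch])

    # 2) A few epochs of updates reusing this SAME recent batch.
    for epoch in range(4):
        for i in rng.permutation(len(batch)):
            obs, a, old_p, r = batch[i]
            adv = r - mean_r
            p = probs(obs)
            ratio = p[a] / old_p
            # Clipped surrogate: the gradient is zero once the ratio has
            # moved past the clip range in the improving direction.
            if (adv > 0 and ratio > 1 + clip_eps) or (adv < 0 and ratio < 1 - clip_eps):
                continue
            # Gradient-ascent step on ratio * adv for a softmax-linear policy:
            # grad log pi(a|obs) wrt logits is (onehot - p).
            dlogits = np.eye(2)[a] - p
            w += lr * ratio * adv * dlogits * obs
            b += lr * ratio * adv * dlogits

    # 3) Discard the batch; the next iteration collects new data from scratch.
```

So the data being reused is only ever a few epochs old, which is what I understood "on-policy" to mean here, as opposed to, say, DQN-style sampling from a large replay buffer of old transitions.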