Description
❓ Question
I created a custom environment for chess. I want to train a MaskablePPO using imitation's behavior cloning for the initial weights, then run `.learn()` for continued training on the environment so that the policy starts to understand board states that behavior cloning never saw during training. Behavior cloning trains the policy fine, but when I get to `.learn()` it works until the policy encounters a board state it has never seen before. At that point the masked distribution assigns essentially the same near-zero probability to valid and invalid actions, and I get an error like:

```
but found invalid values:
tensor([[7.9758e-07, 7.9733e-07, 7.9821e-07, ..., 7.9767e-07, 7.9793e-07,
         7.9746e-07]])
```

Is there a way to overcome this issue on my end? For example, if the policy reaches a state it has never seen before, could I make all valid actions uniform in probability, since it doesn't know anything about that state yet? Or is this a deeper issue that hasn't been solved/thought of yet?
Checklist
- I have checked that there is no similar issue in the repo
- I have read the documentation
- If code there is, it is minimal and working
- If code there is, it is formatted using the markdown code blocks for both code and stack traces.