Description
❓ Question
I created a custom environment for chess. I want to train a MaskablePPO using imitation's behavior cloning for the initial weights, then run `.learn()` for continued training on the environment so that the policy starts to understand board states that behavior cloning never saw during training. Behavior cloning trains the policy fine, but when I get to `.learn()` it works until the policy encounters a board state it has never seen before. At that point the masked distribution assigns essentially the same near-zero probability to valid and invalid actions, and I get an error like:

```
but found invalid values:
tensor([[7.9758e-07, 7.9733e-07, 7.9821e-07, ..., 7.9767e-07, 7.9793e-07,
         7.9746e-07]])
```

Is there a way to overcome this issue on my end? For example, if the policy reaches a state it has never seen before, could I make all valid actions uniform in probability, since it doesn't know anything about that state yet? Or is this a deeper issue that hasn't been solved/thought of yet?
Checklist
- I have checked that there is no similar issue in the repo
- I have read the documentation
- If code there is, it is minimal and working
- If code there is, it is formatted using the markdown code blocks for both code and stack traces.