
[Question] Integrating Behavior Cloning With Maskable PPO #291

@kaihansen8

Description


❓ Question

I created a custom environment for chess. I want to pretrain a MaskablePPO policy with behavior cloning (from the imitation library) to get initial weights, then call .learn() for continued training on the environment so that it learns to handle board states that behavior cloning never saw during training. Behavior cloning trains the policy fine, but once I move on to .learn(), training works until the policy encounters a board state it has never seen before. At that point, applying the action mask produces an invalid probability distribution over the valid and invalid actions: "but found invalid values:
tensor([[7.9758e-07, 7.9733e-07, 7.9821e-07, ..., 7.9767e-07, 7.9793e-07,
7.9746e-07]])". Is there a way to overcome this issue on my end? For example, when the policy reaches a state it has never seen, could I make the distribution uniform over the valid actions, since it doesn't know them yet? Or is this a deeper issue that hasn't been solved/thought of yet?
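Not a maintainer, but the "found invalid values" message usually comes from PyTorch's Categorical distribution rejecting probabilities that no longer form a valid simplex after masking (near-zero or non-finite mass on every legal action). The uniform fallback suggested in the question can be sketched in plain NumPy; the function name, the `eps` threshold, and the fallback policy are my own illustration, not part of sb3-contrib:

```python
import numpy as np

def safe_masked_probs(probs, mask, eps=1e-12):
    """Renormalize a policy's action probabilities under an action mask.

    Invalid actions (mask == False) get probability 0. If the remaining
    probability mass is non-finite or effectively zero (e.g. the policy
    collapses on a board state it never saw during behavior cloning),
    fall back to a uniform distribution over the valid actions.
    """
    probs = np.where(mask, np.asarray(probs, dtype=np.float64), 0.0)
    total = probs.sum()
    if not np.isfinite(total) or total < eps:
        # Degenerate case: spread mass uniformly over legal actions only.
        probs = np.asarray(mask, dtype=np.float64)
        total = probs.sum()
    return probs / total

# Tiny-but-finite probabilities (like the error tensor) just get renormalized:
p = safe_masked_probs([7.9758e-07, 7.9733e-07, 7.9821e-07, 0.5],
                      [True, True, True, False])

# Non-finite probabilities trigger the uniform-over-valid-actions fallback:
q = safe_masked_probs([float("nan")] * 4, [True, False, True, False])
```

To use this inside MaskablePPO you would have to hook it into the policy's action-distribution code (or subclass the distribution), which is more invasive; the snippet only shows that the renormalize-or-uniform idea is cheap and well defined.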



Labels: custom gym env, question
