Objects in the grid are represented by integers 0 to N, where N is the number of object types in the environment. The observation is normalised by dividing each value in the grid by N, so the network receives inputs in [0, 1]. Every gridworld used in these experiments uses random goals: each new episode samples a goal position from the available set of goals.
By default the goal is shown on the observation, but the environment has a setting to hide it, in which case the agent has to search for it.
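As a concrete sketch of this normalisation and goal sampling (the object IDs, grid contents, and N below are illustrative, not taken from the actual environment):

```python
import numpy as np

# Hypothetical object IDs: 0 = empty, 1 = wall, 2 = agent, 3 = goal.
N = 3  # number of object types (illustrative)

grid = np.array([
    [1, 1, 1, 1],
    [1, 2, 0, 1],
    [1, 0, 3, 1],
    [1, 1, 1, 1],
])

# Divide by N so the network sees values in [0, 1].
obs = grid.astype(np.float32) / N

# Each new episode samples a goal from the available set (illustrative positions).
rng = np.random.default_rng(0)
available_goals = [(1, 2), (2, 1), (2, 2)]
goal = available_goals[rng.integers(len(available_goals))]
```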
We implemented three different ways to supply the goal to the agent, described below.
An example screenshot is shown on the right: the agent is the blue rectangle, the goal is pink, walls are grey, and passages are black.
We used RLlib to run the experiments on a cluster with 16 CPUs and 1 GPU per run, and tried multiple setups for goal representation. We used PPO in these experiments.
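A minimal sketch of how such a run could be launched with RLlib's Tune API; the environment id, worker count, and stopping criterion are illustrative placeholders, not the actual experiment configuration:

```python
# Sketch only: assumes Ray/RLlib is installed and "MyGridWorld-v0" is a
# registered environment. Resource settings mirror the 16 CPU / 1 GPU
# per-run setup described above.
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "MyGridWorld-v0",   # hypothetical environment id
        "num_workers": 15,         # leave one CPU for the driver
        "num_gpus": 1,
        "framework": "torch",
    },
    stop={"timesteps_total": 5_000_000},
)
```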
In this setup the goal was not shown on the observation; the agent had to use the goal channel to find it, or simply explore if no goal channel was present. With a single goal the agent learned an optimal policy, but as soon as multiple goals were present it learned to go to the nearest goal and wait there until the end of the episode. We ran both DQN and PPO for 1-5 million interactions.
PPO trained with “invisible goals”: the goal is not visible to the agent, so the agent constantly explores the area to find it.
Sometimes the agent goes around, skipping some grid cells.
Giving the agent the final observation as the goal, alongside the current observation, lets it learn a good policy.
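One way to implement this setup (a sketch; the actual stacking scheme used may differ) is to feed the goal observation to the network as an extra channel:

```python
import numpy as np

def make_input(obs, goal_obs):
    """Stack the current observation and the goal observation along a
    channel axis so the network sees both at once."""
    return np.stack([obs, goal_obs], axis=0)

# Illustrative 5x5 grid: the goal channel marks where the agent should end up.
obs = np.zeros((5, 5), dtype=np.float32)
goal_obs = np.zeros((5, 5), dtype=np.float32)
goal_obs[2, 3] = 1.0  # goal position marked in the goal channel

x = make_input(obs, goal_obs)  # shape (2, 5, 5)
```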
Finally, the goal is given as a one-hot encoding. This setup converges slightly faster than giving the goal as an observation.
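The one-hot variant can be sketched as follows, flattening the goal cell's index over the grid (the exact encoding used in the experiments may differ):

```python
import numpy as np

def one_hot_goal(goal_pos, grid_shape):
    """Encode a goal position as a one-hot vector over all grid cells."""
    vec = np.zeros(grid_shape[0] * grid_shape[1], dtype=np.float32)
    vec[goal_pos[0] * grid_shape[1] + goal_pos[1]] = 1.0
    return vec

# Illustrative: goal at row 2, column 3 of a 5x5 grid -> index 2*5 + 3 = 13.
g = one_hot_goal((2, 3), (5, 5))
```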
Starting in different locations -> does not solve the issue: goals not seen during training do not work once the agent is deployed.
Decouples the agent’s value function from the MDP’s dynamics. This helps generalisation to new tasks where the underlying dynamics do not change, only the reward function.
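This decoupling can be illustrated with a value function indexed by (state, goal) pairs; the tabular TD(0) version below is only a toy sketch, not the method used in the experiments:

```python
from collections import defaultdict

# Toy goal-conditioned value table: one value per (state, goal) pair.
# Because the goal is an input rather than baked into the function,
# the same table (or network) covers a new reward function simply by
# querying it with a new goal.
V = defaultdict(float)

def td_update(state, goal, reward, next_state, gamma=0.99, alpha=0.1):
    """One TD(0) backup on the goal-conditioned value function."""
    target = reward + gamma * V[(next_state, goal)]
    V[(state, goal)] += alpha * (target - V[(state, goal)])

# Illustrative transitions toward the goal cell (2, 2).
td_update(state=(0, 0), goal=(2, 2), reward=0.0, next_state=(0, 1))
td_update(state=(0, 1), goal=(2, 2), reward=1.0, next_state=(2, 2))
```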