Experiments with Gridworld

Setup

Objects in the grid are represented by integers from 0 to N, where N is the number of object types in the environment. The observation is normalised so that the network receives values in [0, 1], by dividing each value in the grid by N. Every gridworld used in these experiments uses random goals: each new episode samples a goal position from the available set of goals. By default the goal is shown on the observation, but the environment has a setting to hide it, in which case the agent has to search for it.
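For concreteness, a minimal sketch of how such an observation could be constructed; the object IDs, grid layout, and function name below are illustrative assumptions, not the exact implementation:

```python
import numpy as np

# Hypothetical object IDs: 0 = passage, 1 = wall, 2 = agent, 3 = goal.
# N = 3 is the largest ID, matching the 0..N encoding described above.
N = 3

def build_observation(grid: np.ndarray, show_goal: bool = True) -> np.ndarray:
    """Normalise an integer grid of object IDs into a float observation in [0, 1]."""
    obs = grid.astype(np.float32).copy()
    if not show_goal:
        # Hide the goal on the observation: the agent then has to search for it.
        obs[obs == 3] = 0.0
    return obs / N  # divide each value by the number of object types

# Example: a tiny 4x4 room with the agent at (1, 1) and the goal at (2, 2).
grid = np.ones((4, 4), dtype=np.int64)  # walls everywhere
grid[1:3, 1:3] = 0                      # carve out passages
grid[1, 1] = 2                          # agent
grid[2, 2] = 3                          # goal
print(build_observation(grid, show_goal=False))
```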
We implemented three different ways to supply the goal to the agent: showing the goal on the observation itself, providing the final (goal) observation alongside the current observation, and providing a one-hot encoding of the goal position.
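A rough sketch of how these three representations could be attached to the observation; the function names and the dict layout are ours, not the original code:

```python
import numpy as np

def goal_on_observation(obs: np.ndarray) -> np.ndarray:
    # (1) The goal is already rendered into the grid; nothing extra is added.
    return obs

def goal_as_final_observation(obs: np.ndarray, goal_obs: np.ndarray) -> np.ndarray:
    # (2) Stack the current observation and the goal observation as two channels.
    return np.stack([obs, goal_obs], axis=0)

def goal_as_one_hot(obs: np.ndarray, goal_index: int, num_goals: int) -> dict:
    # (3) Provide the observation together with a one-hot goal vector.
    one_hot = np.zeros(num_goals, dtype=np.float32)
    one_hot[goal_index] = 1.0
    return {"obs": obs.ravel(), "goal": one_hot}
```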

An example screenshot is shown on the right. The agent is represented by the blue rectangle, the goal is pink, walls are grey, and passages are black.

Experimental setup

We used RLlib to run the experiments on a cluster, with 16 CPUs and 1 GPU per run, and used PPO as the learning algorithm. We tried multiple setups for the goal representation.
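For reference, a hedged sketch of the kind of RLlib/Tune call used for such runs; the environment name, stopping criterion, and worker counts below are placeholder assumptions, not the exact configuration:

```python
import ray
from ray import tune

ray.init()

# Assumes a gridworld environment has been registered via
# ray.tune.registry.register_env under the (hypothetical) name "four_rooms".
tune.run(
    "PPO",
    stop={"timesteps_total": 5_000_000},  # 1-5 million interactions per run
    config={
        "env": "four_rooms",
        "num_workers": 15,  # roughly one rollout worker per CPU, 16 CPUs per run
        "num_gpus": 1,
        "framework": "torch",
    },
)
```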

Results

Gridworld observation with invisible goal

Agent trained with random goals

Baseline results

In this setup we did not show the goal on the observation; the agent had to use the goal channel to find it, or simply explore if no goal channel was present. We observed that with a single goal the agent learned an optimal policy, but as soon as multiple goals were present it learned to go to the nearest goal and wait there until the end of the episode. We ran both DQN and PPO for 1-5 million interactions.

PPO trained with “invisible goals”: since the goal is not visible to the agent, it constantly explores the area to find it.

PPO trained on 4 room domain

Sometimes the agent takes a roundabout path, skipping some cells.

PPO trained on 4 room domain

Giving the agent the final observation as the goal, alongside the current observation, lets it learn a good policy.

PPO trained on 4 room domain

Finally, the goal is given as a one-hot encoding. This setup converges slightly faster than giving the goal observation.

PPO trained on 4 room domain

Starting the agent in different locations does not solve the issue: goals not seen during training do not work once the agent is deployed.

Successor Features (Barreto et al., NIPS 2017)

Successor features decouple the agent’s value function from the MDP’s dynamics. This helps generalisation to new tasks where the underlying dynamics do not change and only the reward function does.
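Concretely, the decomposition in Barreto et al. (2017) assumes rewards are linear in a feature vector $\phi$, so that the action-value function factors into successor features $\psi^{\pi}$ and a task-specific weight vector $w$:

$$
r(s, a, s') = \phi(s, a, s')^{\top} w,
\qquad
\psi^{\pi}(s, a) = \mathbb{E}^{\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(S_t, A_t, S_{t+1}) \;\Big|\; S_0 = s,\ A_0 = a\Big],
\qquad
Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top} w .
$$

Since $\psi^{\pi}$ depends only on the dynamics and the policy, a new task with the same dynamics only requires a new $w$, which is why this helps when only the reward function changes.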