The objective of the project (Udacity Banana Navigation Project) is to design an agent that navigates a large, square world and collects bananas, using a simplified version of the Unity Banana environment. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of the agent is to collect as many yellow bananas as possible while avoiding blue bananas. The agent’s observation space is 37-dimensional, and its action space consists of 4 discrete actions (forward, backward, turn left, and turn right). The task is episodic, and to solve the environment the agent must achieve an average score of at least +13 over 100 consecutive episodes.
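The solve criterion above can be checked with a rolling window over episode scores. The following is a minimal sketch, not the submission's code; the function name `first_solved_episode` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # The criterion only applies once a full window of episodes exists.
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

In a training loop, the same deque would typically be updated once per episode, and training stopped as soon as the condition holds.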
The submission for the project is available at link.
The learning algorithm used is a Dueling Deep Q-Network. The feature layer takes the state
(size 37) as input and has 128 hidden units. The advantage and value streams have 128 hidden
units each. Each sequential model uses ReLU as its activation function.
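The architecture described above could look roughly like the following PyTorch sketch. It is an illustration under assumptions, not the submitted model: the class name `DuelingQNetwork`, the single-layer depth of each stream, and the mean-subtracted aggregation of value and advantage (the variant from the dueling network paper) are not stated in this report.

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Dueling Q-network: shared features, then separate value and advantage streams."""

    def __init__(self, state_size=37, action_size=4, hidden=128):
        super().__init__()
        # Shared feature layer: state -> 128 hidden units.
        self.feature = nn.Sequential(nn.Linear(state_size, hidden), nn.ReLU())
        # Value stream: 128 hidden units, then a single scalar V(s).
        self.value = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Advantage stream: 128 hidden units, then one A(s, a) per action.
        self.advantage = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, action_size)
        )

    def forward(self, state):
        x = self.feature(state)
        v = self.value(x)        # shape: (batch, 1)
        a = self.advantage(x)    # shape: (batch, action_size)
        # Assumed aggregation: subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```

For example, `DuelingQNetwork()(torch.zeros(5, 37))` returns a tensor of shape `(5, 4)`, one Q-value per action for each of the five states.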
“The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently. With every update of the Q values in the dueling architecture, the value stream V is updated – this contrasts with the updates in a single-stream architecture where only the value for one of the actions is updated, the values for all other actions remain untouched. This more frequent updating of the value stream in our approach allocates more resources to V, and thus allows for better approximation of the state values, which in turn need to be accurate for temporal difference-based methods like Q-learning to work” (Reference).
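The quoted argument can be read off the way the two streams are combined. The mean-subtracted aggregation from the dueling network paper is shown below; whether the submission uses exactly this variant is an assumption. Here $\theta$ denotes the shared feature parameters, and $\alpha$ and $\beta$ the advantage and value stream parameters.

```latex
\[
Q(s, a; \theta, \alpha, \beta)
  = V(s; \theta, \beta)
  + \left( A(s, a; \theta, \alpha)
  - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)
\]
```

Because $V(s; \theta, \beta)$ appears in $Q(s, a)$ for every action $a$, a temporal-difference update for any single action also adjusts the value stream, whereas a single-stream network only adjusts the output for the action taken.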
The hyperparameters are as follows:
5000: maximum number of training episodes
1000: maximum number of timesteps per episode
1.0: starting value of epsilon for epsilon-greedy action selection
0.01: minimum value of epsilon
0.995: multiplicative factor per episode for decreasing epsilon
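The epsilon values listed above define a multiplicative decay with a floor. The sketch below shows how such a schedule and the corresponding epsilon-greedy choice might be written; the names `epsilon_schedule`, `select_action`, `eps_start`, `eps_end`, and `eps_decay` are hypothetical and not taken from the submission.

```python
import random


def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Yield one epsilon per episode, decayed multiplicatively and clipped at eps_end."""
    eps = eps_start
    for _ in range(n_episodes):
        yield eps
        eps = max(eps_end, eps_decay * eps)


def select_action(q_values, eps, rng=random):
    """Epsilon-greedy: a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With these values, epsilon reaches its minimum of 0.01 after 919 decay steps, so exploration is close to its floor for most of a long training run.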
Plot of rewards per episode
Ideas for Future Work
Future work includes researching and implementing improvements to DQN such as prioritized
experience replay, noisy networks for exploration, Rainbow, quantile regression, and hierarchical