Reinforcement Learning: Continuous Control

Algorithm and Architecture

The DDPG algorithm uses two neural networks. The first network, the Actor, maps states to actions; the second, the Critic, maps state-action pairs to Q-values. The Actor produces an action given the current state of the environment, and the Critic then produces a TD error signal that drives learning in both the Actor and the Critic. This approach allows us to optimize a policy over a continuous action space in a deterministic fashion. Target networks are maintained for both the Actor and the Critic in order to avoid excessive correlation when computing the loss.
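As a minimal sketch of how the target networks are kept slowly tracking the learned networks (the function name and the interpolation factor `tau` are illustrative assumptions, not taken from the project code):

```python
import torch

def soft_update(local_model: torch.nn.Module, target_model: torch.nn.Module, tau: float) -> None:
    """Soft update: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```

Because `tau` is small, the target networks change slowly, which decorrelates the targets from the networks being updated.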

Similarly to the DQN algorithm, DDPG utilizes a technique known as Replay Memory: experiences obtained from interacting with the environment are first stored, then sampled at random and learned from. This further reduces correlation between samples and stabilizes the performance of the model.
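A minimal replay-memory sketch, assuming an illustrative `ReplayBuffer` class; the buffer size and batch size shown are placeholders, not the values used in the project:

```python
import random
from collections import deque, namedtuple

import numpy as np

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly at random."""

    def __init__(self, buffer_size: int = int(1e5), batch_size: int = 64, seed: int = 0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Store a single transition observed from the environment."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a random mini-batch, breaking the temporal correlation between consecutive steps."""
        batch = random.sample(self.memory, k=self.batch_size)
        states = np.vstack([e.state for e in batch])
        actions = np.vstack([e.action for e in batch])
        rewards = np.vstack([e.reward for e in batch])
        next_states = np.vstack([e.next_state for e in batch])
        dones = np.vstack([e.done for e in batch]).astype(np.uint8)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```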
To achieve our results, we used the following hyperparameters. The neural network consists of three linear layers, with an input size equal to the state space (37) and a final output corresponding to the number of available actions (4); in between, the hidden layer has 64 neurons. Finally, we used the ReLU activation function. A sketch of this structure is given under Actor - Critic below.

Code

Actor - Critic
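The original network code is not reproduced here. The following is a minimal PyTorch sketch consistent with the structure described above (three linear layers, 64-unit hidden size, ReLU activations, state size 37, action size 4); class and argument names are illustrative assumptions, and the tanh squashing on the Actor output is a standard DDPG choice assumed here rather than taken from the project code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action vector."""

    def __init__(self, state_size: int = 37, action_size: int = 4, hidden_units: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_units)
        self.fc2 = nn.Linear(hidden_units, hidden_units)
        self.fc3 = nn.Linear(hidden_units, action_size)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # squash each action dimension to [-1, 1]

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size: int = 37, action_size: int = 4, hidden_units: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_units)
        self.fc2 = nn.Linear(hidden_units + action_size, hidden_units)  # action joins at the second layer
        self.fc3 = nn.Linear(hidden_units, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)
```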

Agent
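Again as a hedged sketch rather than the project's actual implementation: the core of a DDPG agent's learning step, showing how the Critic's TD error drives both networks and how the target networks and soft updates fit in. It builds on the `Actor`, `Critic`, `ReplayBuffer`, and `soft_update` sketches above; the learning rate, `gamma`, and `tau` values are placeholders, and exploration noise and the action-selection method are omitted for brevity:

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

class DDPGAgent:
    """Learns a deterministic continuous-control policy with an Actor-Critic pair and target networks."""

    def __init__(self, state_size: int = 37, action_size: int = 4,
                 gamma: float = 0.99, tau: float = 1e-3, lr: float = 1e-3):
        self.gamma, self.tau = gamma, tau
        self.actor, self.actor_target = Actor(state_size, action_size), Actor(state_size, action_size)
        self.critic, self.critic_target = Critic(state_size, action_size), Critic(state_size, action_size)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.actor_opt = optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = optim.Adam(self.critic.parameters(), lr=lr)
        self.memory = ReplayBuffer()

    def learn(self):
        """One update from a random mini-batch of stored experiences."""
        if len(self.memory) < self.memory.batch_size:
            return  # not enough experiences stored yet
        states, actions, rewards, next_states, dones = (
            torch.from_numpy(x).float() for x in self.memory.sample())

        # Critic update: minimise the TD error against the target networks' estimate.
        with torch.no_grad():
            next_actions = self.actor_target(next_states)
            q_targets = rewards + self.gamma * self.critic_target(next_states, next_actions) * (1 - dones)
        critic_loss = F.mse_loss(self.critic(states, actions), q_targets)
        self.critic_opt.zero_grad()
        critic_loss.backward()
        self.critic_opt.step()

        # Actor update: follow the Critic's gradient towards actions with higher Q-values.
        actor_loss = -self.critic(states, self.actor(states)).mean()
        self.actor_opt.zero_grad()
        actor_loss.backward()
        self.actor_opt.step()

        # Slowly track the learned networks with the targets.
        soft_update(self.critic, self.critic_target, self.tau)
        soft_update(self.actor, self.actor_target, self.tau)
```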

DDPG
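The training driver is likewise only sketched here. It assumes a generic environment interface where `env.reset()` returns a state and `env.step(action)` returns `(next_state, reward, done)`, plus a hypothetical `agent.act(state)` method that adds exploration noise; the target average score of 30 is inferred from the "solved" message below:

```python
from collections import deque

import numpy as np

def ddpg(env, agent, n_episodes: int = 500, max_t: int = 1000, target_score: float = 30.0):
    """Generic DDPG training loop: interact, store, learn, and track a 100-episode moving average."""
    scores, scores_window = [], deque(maxlen=100)
    for i_episode in range(1, n_episodes + 1):
        state = env.reset()
        score = 0.0
        for _ in range(max_t):
            action = agent.act(state)                    # hypothetical: policy action plus exploration noise
            next_state, reward, done = env.step(action)  # assumed environment interface
            agent.memory.add(state, action, reward, next_state, done)
            agent.learn()
            state, score = next_state, score + reward
            if done:
                break
        scores.append(score)
        scores_window.append(score)
        print(f"Episode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}")
        if len(scores_window) == 100 and np.mean(scores_window) >= target_score:
            print(f"Environment solved in {i_episode} episodes!\tAverage Score: {np.mean(scores_window):.2f}")
            break
    return scores
```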

Environment solved in 267 episodes! Average Score: 31.02
Episode #    Average Score
100          3.26
200          16.60
267          31.02
[Figure: DDPG score graph]

Obstacles & Future improvements

The next step would involve modifying the algorithm with techniques used in the D4PG algorithm, as well as further experimentation with the model hyperparameters.

References