Algorithm and Architecture
The DDPG algorithm uses two neural networks. The first network, the Actor, maps states to actions; the second, the Critic, maps state-action pairs to Q-values. The Actor produces an action given the current state of the environment, and the Critic then produces a TD error signal that drives learning in both networks. This approach lets us optimize a policy over a continuous action space in a deterministic fashion. Target networks are used for both the Actor and the Critic in order to avoid excessive correlation when calculating the loss. Similarly to the DQN algorithm, DDPG also uses Replay Memory: experiences obtained from interacting with the environment are first stored, then sampled at random and learned from. This further minimizes correlation and stabilizes the performance of the model. To achieve our results, we used the following hyperparameters (a sketch of the soft-update step that uses TAU follows the list):

- BUFFER_SIZE = int(1e6)
- BATCH_SIZE = 128
- GAMMA = 0.99
- TAU = 1e-2
- LR_ACTOR = 1e-4
- LR_CRITIC = 1e-4
- WEIGHT_DECAY = 0
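
As a concrete illustration of how TAU ties the local and target networks together, here is a minimal sketch of the soft-update step. It is not the repository's exact code; `local_model` and `target_model` are placeholder names.

```python
import torch

def soft_update(local_model, target_model, tau=1e-2):
    """Blend local weights into the target network: theta_target <- tau*theta_local + (1 - tau)*theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```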
Code
Actor - Critic
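A minimal PyTorch sketch of the two networks described above. Layer sizes and the point at which the action is concatenated into the Critic are illustrative assumptions and may differ from the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a deterministic action in [-1, 1]."""
    def __init__(self, state_size, action_size, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.fc3 = nn.Linear(fc2, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_size, action_size, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1 + action_size, fc2)  # action concatenated after the first layer (an assumption)
        self.fc3 = nn.Linear(fc2, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)
```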
Agent
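The following is a sketch of the replay memory and the learning step, under the assumptions stated here rather than the project's exact code: the agent is assumed to expose `actor_local`, `actor_target`, `critic_local`, `critic_target`, and the two Adam optimizers (configured with LR_ACTOR, LR_CRITIC, and WEIGHT_DECAY from the list above), and the `soft_update` helper sketched earlier is reused.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly at random."""
    def __init__(self, buffer_size=int(1e6), batch_size=128):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, experience):
        # experience = (state, action, reward, next_state, done)
        self.memory.append(experience)

    def sample(self):
        batch = random.sample(self.memory, self.batch_size)
        return [torch.tensor(np.asarray(x), dtype=torch.float32) for x in zip(*batch)]

def learn(agent, experiences, gamma=0.99):
    """One DDPG learning step on a sampled batch (agent attribute names are assumptions)."""
    states, actions, rewards, next_states, dones = experiences

    # Critic update: minimize the TD error against the target networks.
    next_actions = agent.actor_target(next_states)
    q_targets = rewards.unsqueeze(1) + gamma * agent.critic_target(next_states, next_actions) * (1 - dones.unsqueeze(1))
    critic_loss = F.mse_loss(agent.critic_local(states, actions), q_targets.detach())
    agent.critic_optimizer.zero_grad()
    critic_loss.backward()
    agent.critic_optimizer.step()

    # Actor update: ascend the Critic's estimate of Q(s, mu(s)).
    actor_loss = -agent.critic_local(states, agent.actor_local(states)).mean()
    agent.actor_optimizer.zero_grad()
    actor_loss.backward()
    agent.actor_optimizer.step()

    # Soft-update both target networks (see the soft_update sketch above).
    soft_update(agent.critic_local, agent.critic_target, tau=1e-2)
    soft_update(agent.actor_local, agent.actor_target, tau=1e-2)
```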
DDPG
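A sketch of the outer training loop, assuming a Gym-style environment interface (`reset()` and `step()`), an `agent.step()` that stores the experience and triggers learning, and a solve threshold of +30 averaged over 100 episodes, consistent with the scores reported below.

```python
from collections import deque
import numpy as np

def ddpg(env, agent, n_episodes=300, max_t=1000, window=100):
    """Generic DDPG training loop: act, store the experience, and learn from random replay samples."""
    scores, scores_window = [], deque(maxlen=window)
    for i_episode in range(1, n_episodes + 1):
        state = env.reset()
        agent.reset()                       # e.g. reset the exploration-noise process (assumed method)
        score = 0
        for _ in range(max_t):
            action = agent.act(state)       # Actor output plus exploration noise
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)  # store and (periodically) learn
            state, score = next_state, score + reward
            if done:
                break
        scores.append(score)
        scores_window.append(score)
        print(f"Episode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}")
        if np.mean(scores_window) >= 30.0:
            print(f"Environment solved in {i_episode} episodes! Average Score: {np.mean(scores_window):.2f}")
            break
    return scores
```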
Environment solved in 267 episodes! Average Score: 31.02

Episode # | Average Score |
---|---|
100 | 3.26 |
200 | 16.60 |
267 | 31.02 |
Obstacles & Future improvements
The next step would involve modifying the algorithm with the techniques used in the D4PG algorithm, as well as further experimentation with the model hyperparameters.
References
- The original paper on DDPG, *Continuous control with deep reinforcement learning* (Lillicrap et al., 2015), can be found here: https://arxiv.org/abs/1509.02971