Algorithm and Architecture
The DDPG algorithm uses two neural networks. The first network, the Actor, maps states to actions; the second, the Critic, maps state-action pairs to Q-values. The Actor produces an action given the current state of the environment, and the Critic then produces a TD error signal that drives learning in both networks. This approach lets us optimize a policy over a continuous action space in a deterministic fashion. Target networks are used for both the Actor and the Critic in order to avoid excessive correlation when calculating the loss. Similarly to the DQN algorithm, DDPG also uses Replay Memory: experiences obtained from interacting with the environment are first stored, then sampled at random and learned from. This further minimizes correlation and stabilizes the performance of the model. To achieve our results, we used the following hyperparameters (a sketch of the soft-update step that uses TAU follows the list):

- BUFFER_SIZE = int(1e6)
- BATCH_SIZE = 128
- GAMMA = 0.99
- TAU = 1e-2
- LR_ACTOR = 1e-4
- LR_CRITIC = 1e-4
- WEIGHT_DECAY = 0
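
As a concrete illustration of how TAU ties the local and target networks together, here is a minimal sketch of the soft-update step. It is not the repository's exact code; `local_model` and `target_model` are placeholder names.

```python
import torch

def soft_update(local_model, target_model, tau=1e-2):
    """Blend local weights into the target network: theta_target <- tau*theta_local + (1 - tau)*theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```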
Code
Actor - Critic
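A minimal PyTorch sketch of the two networks described above. Layer sizes and the point at which the action is concatenated into the Critic are illustrative assumptions and may differ from the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a deterministic action in [-1, 1]."""
    def __init__(self, state_size, action_size, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.fc3 = nn.Linear(fc2, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_size, action_size, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1 + action_size, fc2)  # action concatenated after the first layer (an assumption)
        self.fc3 = nn.Linear(fc2, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)
```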
Agent
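The following is a sketch of the replay memory and the learning step, under the assumptions stated here rather than the project's exact code: the agent is assumed to expose `actor_local`, `actor_target`, `critic_local`, `critic_target`, and the two Adam optimizers (configured with LR_ACTOR, LR_CRITIC, and WEIGHT_DECAY from the list above), and the `soft_update` helper sketched earlier is reused.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly at random."""
    def __init__(self, buffer_size=int(1e6), batch_size=128):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, experience):
        # experience = (state, action, reward, next_state, done)
        self.memory.append(experience)

    def sample(self):
        batch = random.sample(self.memory, self.batch_size)
        return [torch.tensor(np.asarray(x), dtype=torch.float32) for x in zip(*batch)]

def learn(agent, experiences, gamma=0.99):
    """One DDPG learning step on a sampled batch (agent attribute names are assumptions)."""
    states, actions, rewards, next_states, dones = experiences

    # Critic update: minimize the TD error against the target networks.
    next_actions = agent.actor_target(next_states)
    q_targets = rewards.unsqueeze(1) + gamma * agent.critic_target(next_states, next_actions) * (1 - dones.unsqueeze(1))
    critic_loss = F.mse_loss(agent.critic_local(states, actions), q_targets.detach())
    agent.critic_optimizer.zero_grad()
    critic_loss.backward()
    agent.critic_optimizer.step()

    # Actor update: ascend the Critic's estimate of Q(s, mu(s)).
    actor_loss = -agent.critic_local(states, agent.actor_local(states)).mean()
    agent.actor_optimizer.zero_grad()
    actor_loss.backward()
    agent.actor_optimizer.step()

    # Soft-update both target networks (see the soft_update sketch above).
    soft_update(agent.critic_local, agent.critic_target, tau=1e-2)
    soft_update(agent.actor_local, agent.actor_target, tau=1e-2)
```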
DDPG
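A sketch of the outer training loop, assuming a Gym-style environment interface (`reset()` and `step()`), an `agent.step()` that stores the experience and triggers learning, and a solve threshold of +30 averaged over 100 episodes, consistent with the scores reported below.

```python
from collections import deque
import numpy as np

def ddpg(env, agent, n_episodes=300, max_t=1000, window=100):
    """Generic DDPG training loop: act, store the experience, and learn from random replay samples."""
    scores, scores_window = [], deque(maxlen=window)
    for i_episode in range(1, n_episodes + 1):
        state = env.reset()
        agent.reset()                       # e.g. reset the exploration-noise process (assumed method)
        score = 0
        for _ in range(max_t):
            action = agent.act(state)       # Actor output plus exploration noise
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)  # store and (periodically) learn
            state, score = next_state, score + reward
            if done:
                break
        scores.append(score)
        scores_window.append(score)
        print(f"Episode {i_episode}\tAverage Score: {np.mean(scores_window):.2f}")
        if np.mean(scores_window) >= 30.0:
            print(f"Environment solved in {i_episode} episodes! Average Score: {np.mean(scores_window):.2f}")
            break
    return scores
```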
Environment solved in 267 episodes! Average Score: 31.02

Episode # | Average Score |
---|---|
100 | 3.26 |
200 | 16.60 |
267 | 31.02 |
Obstacles & Future improvements
The next step would involve modifying the algorithm with the techniques used in the D4PG algorithm, as well as further experimentation with the model hyperparameters.
References
- The original paper on DDPG, *Continuous control with deep reinforcement learning* (Lillicrap et al., 2015), can be found here: https://arxiv.org/abs/1509.02971