Deep Reinforcement Learning Motion in Observation

Issue

I am trying to implement a DRL (Deep Reinforcement Learning) agent for self-driving vehicles. I am currently teaching my agent not to bump into other cars, using a simple camera. There are many ways to speed up the training, but currently I am focusing on adding a sense of motion to my observation.

Most sources on the Internet (including Google’s article about Atari games) say that the way to add motion to observations is to capture 3-4 frames instead of 1 and feed them to the Q-network as one observation. However, this isn’t very practical with camera data, because it requires a lot of computational power to train an agent. For example:

Suppose you use a grayscale camera with a resolution of 256×256 and a simple uniform replay memory that holds up to 20000 observations. Then the number of pixels stored in the memory is:

20000 (Samples) * 4 (Frames) * 256 (Width) * 256 (Height) ≈ 5.2 billion pixels, i.e. about 5.2 GB of physical RAM at 1 byte per pixel.
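
As a quick check of this figure, here is a minimal sketch of the arithmetic. The 1-byte and 4-byte cases are assumptions about how the pixels are stored (`uint8` versus `float32`); the question itself does not state a data type.

```python
# Replay memory size for stacked grayscale frames.
samples = 20_000        # replay memory capacity
frames = 4              # stacked frames per observation
width = height = 256    # camera resolution

pixels = samples * frames * width * height

bytes_uint8 = pixels * 1    # assumption: 1 byte per pixel (uint8)
bytes_float32 = pixels * 4  # assumption: 4 bytes per pixel (float32)

print(f"{bytes_uint8 / 1e9:.2f} GB as uint8")      # 5.24 GB
print(f"{bytes_float32 / 1e9:.2f} GB as float32")  # 20.97 GB
```

So the 5.2 GB estimate only holds if the frames are kept as 8-bit integers; storing them as 32-bit floats would need roughly four times as much.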

Also, suppose that you use a batch size of 64 observations to feed the agent, which contains a CNN with 32 filters in the 1st layer. Then you need:

64 (Batch Size) * 4 (Frames) * 256 (Width) * 256 (Height) * 32 (Filters) ≈ 0.5 GB of GPU memory.
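
A rough sketch of the same estimate with the assumptions made explicit. It assumes `float32` tensors and a first convolution with stride 1 and "same" padding, so that each filter produces a full 256×256 feature map; neither is stated in the question. Under those assumptions the 4 stacked frames are input channels, so they enlarge the input tensor and the kernel weights but not the number of output feature maps.

```python
# Rough GPU memory estimate for the first convolutional layer.
batch, frames, height, width, filters = 64, 4, 256, 256, 32
bytes_per_value = 4  # assumption: float32 tensors

# Input batch: one channel per stacked frame.
input_bytes = batch * frames * height * width * bytes_per_value

# First-layer activations: one feature map per filter
# (assumption: stride 1, "same" padding, so maps stay 256x256).
activation_bytes = batch * filters * height * width * bytes_per_value

print(f"input batch:        {input_bytes / 1e9:.2f} GB")       # 0.07 GB
print(f"layer-1 activations: {activation_bytes / 1e9:.2f} GB")  # 0.54 GB
```

The result lands close to the 0.5 GB above. A strided first convolution would shrink the activation figure considerably.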

This is an insane amount of data for the agent to process for 1 simple grayscale camera, just to add the sense of motion.

I was thinking of an alternative way of adding the sense of motion, but I can’t find anything about it on the Internet. Since we already know the speed of the vehicle, we could feed the agent:

• 1 Frame that contains the camera data.
• 1 Frame that contains the normalized value of the vehicle’s speed in the center of the image (e.g. reserve a 32×32 window in the center of the image that contains the normalized speed of the vehicle (0.0-1.0), while the remaining pixels have the value 0).

That way, we cut the size of the data in half compared to 4 stacked frames. Do you think this could be a good approach?
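
To make the idea concrete, here is a minimal NumPy sketch of such a two-frame observation. The function name `build_observation`, the `max_speed` parameter used for normalization, and the example values are hypothetical and only serve as illustration.

```python
import numpy as np

def build_observation(frame, speed, max_speed, window=32):
    """Stack a grayscale frame with a 'speed frame'.

    The speed frame is all zeros except for a window x window square in
    the center, which holds the speed normalized to the range 0.0-1.0.
    """
    height, width = frame.shape
    speed_norm = float(np.clip(speed / max_speed, 0.0, 1.0))

    speed_frame = np.zeros((height, width), dtype=np.float32)
    top = (height - window) // 2
    left = (width - window) // 2
    speed_frame[top:top + window, left:left + window] = speed_norm

    # Scale the camera pixels to 0.0-1.0 so both channels share a range.
    image = frame.astype(np.float32) / 255.0
    return np.stack([image, speed_frame], axis=-1)

# Hypothetical example: a random 256x256 frame at half of the maximum speed.
frame = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)
obs = build_observation(frame, speed=15.0, max_speed=30.0)
print(obs.shape)  # (256, 256, 2)
```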

Solution

I understand that you are worried about that huge amount of RAM.
The DQN papers usually use a huge amount of RAM as well. In the Nature paper about Atari games they even use about 9 GB of RAM!
https://www.nature.com/articles/nature14236
You could try to resize your images to a lower resolution, take 4 consecutive frames as you already explained, and store them only as integers (e.g. 8-bit) rather than floats to keep the memory use to an absolute minimum.
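
A minimal sketch of that suggestion, assuming NumPy only. The `FrameReplayBuffer` class, the `downscale` helper, and the downscaling factor of 4 are hypothetical choices for illustration, not something taken from the paper.

```python
import numpy as np

def downscale(frame, factor):
    """Shrink a grayscale frame by averaging factor x factor blocks."""
    h, w = frame.shape
    cropped = frame[:h - h % factor, :w - w % factor]
    blocks = cropped.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3)).astype(np.uint8)

class FrameReplayBuffer:
    """Uniform replay memory that keeps stacked frames as uint8."""

    def __init__(self, capacity, frame_shape, stack=4):
        self.frames = np.zeros((capacity, stack, *frame_shape), dtype=np.uint8)
        self.capacity = capacity
        self.size = 0
        self.pos = 0

    def add(self, stacked_frames):
        self.frames[self.pos] = stacked_frames
        self.pos = (self.pos + 1) % self.capacity  # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng):
        idx = rng.integers(0, self.size, size=batch_size)
        # Convert to float only for the sampled batch, not for the whole memory.
        return self.frames[idx].astype(np.float32) / 255.0

rng = np.random.default_rng(0)
buffer = FrameReplayBuffer(capacity=100, frame_shape=(64, 64))
for _ in range(20):
    raw = rng.integers(0, 256, size=(4, 256, 256), dtype=np.uint8)
    buffer.add(np.stack([downscale(f, 4) for f in raw]))

batch = buffer.sample(8, rng)
print(batch.shape, batch.dtype)  # (8, 4, 64, 64) float32
```

With 64×64 frames, a memory of 20000 observations would take 20000 * 4 * 64 * 64 bytes, which is about 0.33 GB instead of 5.2 GB.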

Sometimes I guess there is no way around it, but in your case you could try adding a second input layer with one node, which is fed the (normalized) speed of your vehicle.
You should be able to implement this with the functional API of Keras, where you aren’t limited to purely sequential layers.
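
A rough sketch of what such a two-input network could look like, assuming TensorFlow's bundled Keras. The layer sizes, the number of actions (`NUM_ACTIONS`), and the `build_q_network` name are placeholders I made up for illustration, so adjust them to your setup.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

NUM_ACTIONS = 5  # hypothetical number of discrete driving actions

def build_q_network(height=256, width=256, num_actions=NUM_ACTIONS):
    # Input 1: a single grayscale camera frame.
    image_in = layers.Input(shape=(height, width, 1), name="camera")
    # Input 2: one node holding the normalized speed (0.0-1.0).
    speed_in = layers.Input(shape=(1,), name="speed")

    # Convolutional stack for the image (layer sizes are placeholders).
    x = layers.Conv2D(32, 8, strides=4, activation="relu")(image_in)
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(x)
    x = layers.Flatten()(x)

    # Join the image features with the speed before the dense layers.
    merged = layers.Concatenate()([x, speed_in])
    hidden = layers.Dense(256, activation="relu")(merged)
    q_values = layers.Dense(num_actions, name="q_values")(hidden)

    return tf.keras.Model(inputs=[image_in, speed_in], outputs=q_values)

model = build_q_network()

# Dummy batch of 2 observations: frames scaled to 0.0-1.0 plus speeds.
frames = np.random.rand(2, 256, 256, 1).astype("float32")
speeds = np.array([[0.3], [0.8]], dtype="float32")
q = model([frames, speeds]).numpy()
print(q.shape)  # (2, 5)
```

The replay memory then only has to hold one frame and one float per observation instead of 4 frames.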