Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras

Issue

I am using TensorFlow distributed training with the following code:

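import os

# Select the GPUs before TensorFlow initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import tensorflow as tf

# Basic_Model and rms (the optimizer) are defined elsewhere in the script.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])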

The system has four 32 GB GPUs. The following is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   37C    P0    65W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0    40W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   33C    P0    40W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   39C    P0    41W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But when I run the script to create the model, I get the following error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]

A tensor of shape [131072, 65536] of type float needs 131072 * 65536 * 4 bytes, i.e. about 34.36 GB (32 GiB). There are four 32 GB GPUs, so why can't it be allocated?
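As a quick check of that arithmetic, here is a minimal sketch in plain Python. It assumes 4 bytes per float32 element and takes the 32480 MiB per-GPU capacity from the nvidia-smi output above; the variable names are only illustrative.

```python
# Size of a dense float32 tensor with the shape from the error message.
rows, cols = 131072, 65536
bytes_per_float32 = 4

size_bytes = rows * cols * bytes_per_float32
size_gb = size_bytes / 10**9    # decimal gigabytes
size_gib = size_bytes / 2**30   # binary gibibytes
size_mib = size_bytes / 2**20   # same unit nvidia-smi reports

# Per-GPU capacity as reported by nvidia-smi, in MiB.
gpu_capacity_mib = 32480

print(f"{size_bytes} bytes = {size_gb:.2f} GB = {size_gib:.0f} GiB")
print("fits on one GPU:", size_mib <= gpu_capacity_mib)
```

The tensor comes to exactly 32768 MiB, which is slightly more than the 32480 MiB a single card reports, before counting anything else the process needs on that GPU.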

Solution

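MirroredStrategy creates a copy of every variable created within its scope on each GPU. The tensor alone needs about 34.36 GB (32 GiB), which is more than a single 32 GB card can hold, so the allocation fails. You may be looking for something like tf.distribute.experimental.CentralStorageStrategy, which does not mirror the variables but keeps a single copy of them (on the CPU when more than one GPU is present). In terms of GPU memory, MirroredStrategy does not give you vram * num_of_gpu; in practice it gives you smallest_vram. In your case, Keras is working with 32 GB of memory per replica, not 32 * 4 = 128 GB.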

# Basic_Model and rms are defined elsewhere, as in the question.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = ...  # placeholder: build your tf.data.Dataset here
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
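The "smallest_vram" rule can be sketched as a back-of-the-envelope check in plain Python. This is only an illustration: the helper names mirrored_budget_mib and fits_mirrored are hypothetical, and the check ignores memory used by the framework, activations and optimizer state, so real headroom is smaller.

```python
def mirrored_budget_mib(gpu_memory_mib):
    """Upper bound on mirrored variable memory: the smallest GPU decides."""
    return min(gpu_memory_mib)


def fits_mirrored(tensor_mib, gpu_memory_mib):
    """True if a full copy of the tensor fits on every GPU (hypothetical helper)."""
    return tensor_mib <= mirrored_budget_mib(gpu_memory_mib)


# Capacities from the nvidia-smi output in the question, in MiB.
gpus = [32480, 32480, 32480, 32480]
tensor_mib = 131072 * 65536 * 4 / 2**20  # float32 tensor from the error

total_mib = sum(gpus)                    # what the question assumed is available
budget_mib = mirrored_budget_mib(gpus)   # what each mirrored copy actually gets

print("total across GPUs:", total_mib)
print("per-replica budget:", budget_mib)
print("tensor fits when mirrored:", fits_mirrored(tensor_mib, gpus))
```

Adding more GPUs raises the total but leaves the per-replica budget unchanged, which is why the allocation fails no matter how many 32 GB cards are visible.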

Example:

Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:

  • GPU0: [0, 1, 2, 3]
  • GPU1: [0, 1, 2, 3]
  • GPU2: [0, 1, 2, 3]
  • GPU3: [0, 1, 2, 3]

NOT

  • GPU0: [0]
  • GPU1: [1]
  • GPU2: [2]
  • GPU3: [3]

As you can see, MirroredStrategy requires every available device to hold a full copy of the data, so with this strategy you are limited by your smallest device.
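The two placements in the example can be mimicked with ordinary Python lists. This is a toy model, not TensorFlow code: mirror and shard are hypothetical helpers, and shard is shown only for contrast, since it is not what MirroredStrategy does.

```python
def mirror(tensor, num_devices):
    """Every device gets a full copy (the MirroredStrategy picture)."""
    return [list(tensor) for _ in range(num_devices)]


def shard(tensor, num_devices):
    """Each device gets one contiguous slice (shown only for contrast)."""
    chunk = -(-len(tensor) // num_devices)  # ceiling division
    return [list(tensor[i * chunk:(i + 1) * chunk]) for i in range(num_devices)]


tensor_a = [0, 1, 2, 3]
mirrored = mirror(tensor_a, 4)
sharded = shard(tensor_a, 4)

# Elements each device must hold under the two layouts.
per_device_mirrored = max(len(copy) for copy in mirrored)
per_device_sharded = max(len(piece) for piece in sharded)

for gpu, copy in enumerate(mirrored):
    print(f"GPU{gpu} (mirrored): {copy}")
for gpu, piece in enumerate(sharded):
    print(f"GPU{gpu} (sharded):  {piece}")
```

Under mirroring each device holds all four elements, while under sharding each would hold one; only the second layout would let the per-device requirement shrink as GPUs are added.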

Answered By – Djinn

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
