Issue
I am using TensorFlow distributed training with the following code:
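import os

# Set the GPU environment variables before TensorFlow initializes the devices.
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Basic_Model and rms are defined elsewhere in the script.
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])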
The system has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I get the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072,65536] of type float would allocate 131072 * 65536 * 4 bytes, i.e. about 34.36 GB (32 GiB). There are four 32 GB GPUs, so why is it not allocated?
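The arithmetic can be checked with a few lines of plain Python. This is only a sketch: the shape comes from the error message, and the per-GPU capacity of 32480 MiB is read from the nvidia-smi output above. The variable names are chosen for illustration.

```python
# Size of a float32 tensor with shape [131072, 65536], as in the error message.
shape = (131072, 65536)
bytes_per_float32 = 4

tensor_bytes = shape[0] * shape[1] * bytes_per_float32
tensor_gib = tensor_bytes / 2**30   # binary gigabytes
tensor_gb = tensor_bytes / 10**9    # decimal gigabytes

# Per-GPU capacity as reported by nvidia-smi (32480 MiB).
gpu_bytes = 32480 * 2**20

print(f"tensor:  {tensor_bytes} bytes = {tensor_gib:.2f} GiB = {tensor_gb:.2f} GB")
print(f"one GPU: {gpu_bytes / 2**30:.2f} GiB")
print("fits on one GPU:", tensor_bytes <= gpu_bytes)
```

The tensor alone is slightly larger than the memory of a single GPU, before counting anything else the model needs.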
Solution
MirroredStrategy creates a copy of every variable created within its scope on each GPU. The tensor alone is about 34.36 GB (32 GiB), which is more than a single 32 GB GPU can hold, so the allocation fails. You might be looking for something like tf.distribute.experimental.CentralStorageStrategy, which does not mirror variables across GPUs (with multiple GPUs it keeps them on the CPU). With MirroredStrategy, the usable GPU memory isn't vram * num_of_gpu; in practice it is smallest_vram. So in your case, Keras is working with 32 GB of memory per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = ...  # placeholder: some tf.data.Dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
- GPU0: [0, 1, 2, 3]
- GPU1: [0, 1, 2, 3]
- GPU2: [0, 1, 2, 3]
- GPU3: [0, 1, 2, 3]
NOT
- GPU0: [0]
- GPU1: [1]
- GPU2: [2]
- GPU3: [3]
As you can see, MirroredStrategy requires every available device to hold all of the variables, so you are limited by your smallest device when using this strategy.
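The same example can be sketched as a toy simulation in plain Python. It does not use TensorFlow; the helper names and the per-device capacity of 3 elements are hypothetical, chosen so that the total capacity (12) exceeds the tensor size (4) while a single device still cannot hold a full copy.

```python
def mirrored_placement(tensor, num_devices):
    """Every device gets a full copy, as MirroredStrategy does with variables."""
    return [list(tensor) for _ in range(num_devices)]


def sharded_placement(tensor, num_devices):
    """Every device gets one slice. This is NOT what MirroredStrategy does."""
    chunk = -(-len(tensor) // num_devices)  # ceiling division
    return [list(tensor[i * chunk:(i + 1) * chunk]) for i in range(num_devices)]


def fits(placement, capacities):
    """True if each device's share fits within that device's capacity."""
    return all(len(part) <= cap for part, cap in zip(placement, capacities))


tensor_a = [0, 1, 2, 3]
capacities = [3, 3, 3, 3]  # hypothetical capacity, in elements per device

mirrored = mirrored_placement(tensor_a, 4)
sharded = sharded_placement(tensor_a, 4)

print("mirrored:", mirrored, "fits:", fits(mirrored, capacities))
print("sharded: ", sharded, "fits:", fits(sharded, capacities))
```

The mirrored placement fails even though the devices together have room to spare, which mirrors the situation in the question: four 32 GB GPUs cannot hold four copies of a 32 GiB tensor.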
Answered by: Djinn
This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.