Model training with tf.data.Dataset and NumPy arrays yields different results

Issue

I use the Keras model-training API and observed different results when training the same model with NumPy arrays (x_train and y_train) versus with tf.data.Dataset.from_tensor_slices((x_train, y_train)). A minimal working example is shown below:

import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

n_examples, n_dims = (100, 10)
raw_dataset = np.random.randn(n_examples, n_dims)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(
            1024, activation="relu", use_bias=True
        ),
        tf.keras.layers.Dense(
            1, activation="linear", use_bias=True
        ),
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
)

x_train = raw_dataset[:, :-1]
y_train = raw_dataset[:, -1]
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))

n_epochs = 10
batch_size = 16

use_dataset = True
if use_dataset:
    model.fit(
        dataset.batch(batch_size=batch_size),
        epochs=n_epochs,
    )
else:
    model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=n_epochs,
    )

print("Evaluation:")
model.evaluate(x_train, y_train)
model.evaluate(dataset.batch(batch_size=batch_size))

If I run this code with use_dataset = True, the final performance is:

Evaluation:
4/4 [==============================] - 0s 825us/step - loss: 0.4132
7/7 [==============================] - 0s 701us/step - loss: 0.4132

If I run it with use_dataset = False, I get:

Evaluation:
4/4 [==============================] - 0s 855us/step - loss: 0.4219
7/7 [==============================] - 0s 808us/step - loss: 0.4219

I expected the two training loops to perform identically. Interestingly, the final performance is identical if I set batch_size = n_examples, so the difference seems to be related to the way batches are handled internally. Why is this happening? Is it a bug or a feature?

Solution

This behavior is not a bug; it is due to the default argument shuffle=True in model.fit(). The docs say the following about shuffle:

Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’). This argument is ignored when x is a generator or an object of tf.data.Dataset. ‘batch’ is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

So this argument is ignored when a tf.data.Dataset is passed: the dataset is not reshuffled before each epoch, whereas the NumPy arrays are. Here is the code to get the same results for both methods, with shuffling disabled in the array case:

import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

n_examples, n_dims = (100, 10)
raw_dataset = np.random.randn(n_examples, n_dims)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(
            1024, activation="relu", use_bias=True
        ),
        tf.keras.layers.Dense(
            1, activation="linear", use_bias=True
        ),
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
)

x_train = raw_dataset[:, :-1]
y_train = raw_dataset[:, -1]
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))

n_epochs = 10
batch_size = 16

use_dataset = False
if use_dataset:
    model.fit(
        dataset.batch(batch_size=batch_size),
        epochs=n_epochs,
    )
else:
    model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        shuffle=False,  # disable per-epoch shuffling to match the Dataset path
        epochs=n_epochs,
    )

print("Evaluation:")
model.evaluate(x_train, y_train)
model.evaluate(dataset.batch(batch_size=batch_size))
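Alternatively, you can keep shuffling and make the tf.data.Dataset pipeline reshuffle itself each epoch instead. A minimal sketch (the losses will not be bitwise-identical to the NumPy path, since tf.data draws from its own random stream, but both approaches then see a freshly shuffled epoch):

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

n_examples, n_dims = (100, 10)
raw = np.random.randn(n_examples, n_dims)
x_train, y_train = raw[:, :-1], raw[:, -1]

# Shuffle the full dataset before batching; reshuffle_each_iteration=True
# (the default) reshuffles at the start of every epoch, mirroring what
# model.fit() does internally for NumPy inputs.
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=n_examples, reshuffle_each_iteration=True)
    .batch(16)
)
```

Setting buffer_size to the full dataset size gives a uniform shuffle; a smaller buffer only shuffles locally.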

Answered By – AloneTogether

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
