Why does shuffling sequences of data in a tf.data dataset affect the order of sequences differently between model.fit and model.predict?

Issue

I am training an LSTM deep learning model with time series sequences and labels.

I generate two TensorFlow datasets, "train_data" and "test_data":

train_data = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=total_window_size,
    sequence_stride=1,
    batch_size=batch_size,
    shuffle=is_shuffle).map(split_window).prefetch(tf.data.AUTOTUNE)

I then train the model with these datasets:

model.fit(train_data, epochs=epochs, validation_data=test_data, callbacks=callbacks)

And then run predictions to obtain the predicted values:

train_labels = np.concatenate([y for x, y in train_data], axis=0)
train_predictions = model.predict(train_data)
test_labels = np.concatenate([y for x, y in test_data], axis=0)
test_predictions = model.predict(test_data)

Here is my question: When I plot the train/test label data against the predicted values I get the following plot when I do not shuffle the sequences in the dataset building step:

[Image: plot of labels against predicted values, without shuffling]

Here is the output with shuffling:

[Image: plot of labels against predicted values, with shuffling]

Question: Why is this the case? I use the exact same source dataset for training and prediction, so I would expect the shuffled order to be the same in both cases. Is there a chance that TensorFlow shuffles the data twice randomly, once during training and another time for predictions? I tried to supply a shuffle seed, but that did not change things either.

Solution

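The dataset gets reshuffled every time you iterate over it. The order you get from your list comprehension is therefore not the order that model.predict sees on its own pass over the dataset, so the labels and predictions no longer line up. If you don’t want that, shuffle the dataset with: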

shuffle(buffer_size=BUFFER_SIZE, reshuffle_each_iteration=False)
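A minimal sketch of how this might be wired into the pipeline from the question. As far as I can tell, timeseries_dataset_from_array has no reshuffle_each_iteration argument, so the sketch builds the dataset with shuffle=False and applies tf.data.Dataset.shuffle manually. The toy data, the body of split_window, BUFFER_SIZE and the seed are illustrative assumptions, not values from the question.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the real series: 20 time steps, 1 feature (illustrative only).
data = np.arange(20, dtype=np.float32).reshape(-1, 1)
total_window_size = 5
batch_size = 4
BUFFER_SIZE = 100  # assumed >= number of windows, so the shuffle covers all of them


def split_window(window):
    # Hypothetical split: all but the last step are inputs, the last step is the label.
    return window[:, :-1, :], window[:, -1, :]


# Build the windows without the built-in shuffle.
windows = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=total_window_size,
    sequence_stride=1,
    batch_size=batch_size,
    shuffle=False)

train_data = (
    windows
    .unbatch()  # back to single windows, so sequences (not whole batches) get shuffled
    .shuffle(buffer_size=BUFFER_SIZE, seed=0, reshuffle_each_iteration=False)
    .batch(batch_size)
    .map(split_window)
    .prefetch(tf.data.AUTOTUNE))

# Two independent passes over the dataset now yield the labels in the same order.
first_pass = np.concatenate([y.numpy() for x, y in train_data], axis=0)
second_pass = np.concatenate([y.numpy() for x, y in train_data], axis=0)
print(np.array_equal(first_pass, second_pass))
```

Note the trade-off: with reshuffle_each_iteration=False, model.fit also sees the same order in every epoch. If per-epoch reshuffling matters for training, another option is to keep the shuffled dataset for fit and build a second, unshuffled dataset from the same data for extracting labels and calling predict.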

Answered By – Nicolas Gervais

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
