How to split model.fit to continue training over multiple days

Issue

The TensorFlow model uses the following code for training:

model.fit(train_dataset,
          steps_per_epoch=10000,
          validation_data=test_dataset,
          epochs=20000
         )

Here steps_per_epoch is 10000 and epochs is 20000.

Is it possible to split the training over multiple days, for example:

day 1:

model.fit(..., steps_per_epoch=10000, ..., epochs=10, ....)
model.fit(..., steps_per_epoch=10000, ..., epochs=20, ....)
model.fit(..., steps_per_epoch=10000, ..., epochs=30, ....)

day 2:

model.fit(..., steps_per_epoch=10000, ..., epochs=100, ....)

day 3:

model.fit(..., steps_per_epoch=10000, ..., epochs=5, ....)

day (n):

model.fit(..., steps_per_epoch=10000, ..., epochs=n, ....)

The expected total across all days is 20000 epochs:

20000 = day1 + day2 + day3 + ... + dayn

Can I simply stop model.fit and start it again on another day?

Is that equivalent to running once with epochs=20000?

Solution

You can save your model to disk at the end of each day (here as a pickle file via joblib), then load it the next day and continue training:
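As a variant not in the original answer: if a day's session might be interrupted before the manual save at the end, a `tf.keras.callbacks.ModelCheckpoint` callback writes the model to disk after every epoch automatically, so at most one epoch of progress is lost. A minimal sketch (the tiny model, the random data, and the filename `checkpoint.keras` are illustrative; the `.keras` format requires a recent TensorFlow):

```python
import numpy as np
import tensorflow as tf

# Tiny illustrative model and data.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
x = np.random.rand(8, 4).astype('float32')
y = np.random.randint(0, 2, size=(8,))

# Save the full model after every epoch; an interruption loses at most one epoch.
ckpt = tf.keras.callbacks.ModelCheckpoint('checkpoint.keras', save_freq='epoch')
model.fit(x, y, epochs=2, verbose=0, callbacks=[ckpt])
```

On the next day, `tf.keras.models.load_model('checkpoint.keras')` restores the most recently completed epoch.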

Training the model on day_1:

import tensorflow_datasets as tfds
import tensorflow as tf
import joblib

train, test = tfds.load(
    'fashion_mnist',
    shuffle_files=True,
    as_supervised=True,
    split=['train', 'test']
)

train = train.repeat(15).batch(64).prefetch(tf.data.AUTOTUNE)
test = test.batch(64).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(128, (3,3), activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))            
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Dense(10, activation='softmax'))  # softmax: classes are mutually exclusive
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer='adam', metrics=['accuracy'])
model.summary()

# The dataset is already batched, so no batch_size argument is passed to fit.
model.fit(train, steps_per_epoch=150, epochs=3, verbose=1)
model.evaluate(test, verbose=1)
joblib.dump(model, 'model_day_1.pkl')

Output after day_1:

Epoch 1/3
150/150 [==============================] - 7s 17ms/step - loss: 23.0504 - accuracy: 0.5786
Epoch 2/3
150/150 [==============================] - 2s 16ms/step - loss: 0.9366 - accuracy: 0.7208
Epoch 3/3
150/150 [==============================] - 3s 17ms/step - loss: 0.7321 - accuracy: 0.7682
157/157 [==============================] - 1s 8ms/step - loss: 0.4627 - accuracy: 0.8405
INFO:tensorflow:Assets written to: ram://***/assets
INFO:tensorflow:Assets written to: ram://***/assets
['model_day_1.pkl']
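The joblib approach works because Keras models are picklable, but the native Keras save format is the more commonly recommended way to checkpoint between sessions; it stores the architecture, weights, and optimizer state, so training resumes where it left off. A hedged sketch of the same day-1/day-2 pattern (the tiny model, the random data, and the filename `model_day_1.keras` are illustrative; older TensorFlow versions use the HDF5 or SavedModel formats instead):

```python
import numpy as np
import tensorflow as tf

# Tiny illustrative model and data.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
x = np.random.rand(8, 4).astype('float32')
y = np.random.randint(0, 2, size=(8,))
model.fit(x, y, epochs=1, verbose=0)

# End of the session: save the full model (architecture, weights, optimizer state).
model.save('model_day_1.keras')

# Next session: load and continue training from the same state.
model = tf.keras.models.load_model('model_day_1.keras')
model.fit(x, y, epochs=1, verbose=0)
```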

Load the model on day_2 and continue training:

model = joblib.load("/content/model_day_1.pkl")
model.fit(train, steps_per_epoch=150, epochs=3, verbose=1)
model.evaluate(test, verbose=1)
joblib.dump(model, 'model_day_2.pkl')

Output after day_2:

Epoch 1/3
150/150 [==============================] - 3s 17ms/step - loss: 0.6288 - accuracy: 0.7981
Epoch 2/3
150/150 [==============================] - 2s 16ms/step - loss: 0.5290 - accuracy: 0.8222
Epoch 3/3
150/150 [==============================] - 2s 16ms/step - loss: 0.5124 - accuracy: 0.8272
157/157 [==============================] - 1s 5ms/step - loss: 0.4131 - accuracy: 0.8598
INFO:tensorflow:Assets written to: ram://***/assets
INFO:tensorflow:Assets written to: ram://***/assets
['model_day_2.pkl']

Load the model on day_3 and continue training:

model = joblib.load("/content/model_day_2.pkl")
model.fit(train, steps_per_epoch=150, epochs=3, verbose=1)
model.evaluate(test, verbose=1)
joblib.dump(model, 'model_day_3.pkl')

Output after day_3:

Epoch 1/3
150/150 [==============================] - 3s 17ms/step - loss: 0.4579 - accuracy: 0.8498
Epoch 2/3
150/150 [==============================] - 2s 17ms/step - loss: 0.4078 - accuracy: 0.8589
Epoch 3/3
150/150 [==============================] - 2s 16ms/step - loss: 0.4073 - accuracy: 0.8560
157/157 [==============================] - 1s 5ms/step - loss: 0.3997 - accuracy: 0.8603
INFO:tensorflow:Assets written to: ram://***/assets
INFO:tensorflow:Assets written to: ram://***/assets
['model_day_3.pkl']
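One detail the split runs above do not handle: each model.fit call restarts its epoch counter at 1. If you want the daily runs to number their epochs continuously, which matters for epoch-indexed callbacks and learning-rate schedules, Keras provides the initial_epoch parameter. A minimal sketch (the tiny model and random data are illustrative; note that epochs is the index of the final epoch, not an additional count):

```python
import numpy as np
import tensorflow as tf

# Tiny illustrative model and data.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
x = np.random.rand(8, 4).astype('float32')
y = np.random.randint(0, 2, size=(8,))

# Day 1: runs epochs 0, 1, 2.
h1 = model.fit(x, y, epochs=3, verbose=0)

# Day 2: resume counting at epoch 3, stopping after epoch 5.
h2 = model.fit(x, y, initial_epoch=3, epochs=6, verbose=0)
```

After the second call, h2.epoch holds the indices [3, 4, 5], so the two runs together cover epochs 0 through 5 as one continuous schedule.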

Answered By – I'mahdi

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
