Restarting learning process from scratch for best results?

Issue

I am rather new to deep learning, and have just started tinkering with some small simple models, in this case a small unet, basically copy-pasted from: https://github.com/zhixuhao/unet/blob/master/trainUnet.ipynb

What I find is that in most of my runs, I get to a certain loss level, and from then on it does not want to converge further. However, every once in a while, when I restart the learning process from scratch, it suddenly converges to a loss around 1000 times lower than the plateau mentioned above… The final model is excellent – no complaints there – but does everyone have to restart training this many times?

I understand that this is probably due to the random initialization of the model's weights. I have increased the learning rate and decreased the batch size to try to escape local minima, but it does not seem to help much.
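To make the restart effect concrete, here is a toy NumPy sketch of what I mean (not my actual U-Net – just plain gradient descent on a 1-D loss that has one shallow and one deep minimum), showing how different random initializations land in very different minima:

```python
import numpy as np

def loss(w):
    # Toy non-convex loss: deep global minimum at w = 2 (loss 0),
    # shallower local minimum near w = -1.85 (loss around 7.7).
    return (w - 2) ** 2 * (w + 2) ** 2 + 0.5 * (w - 2) ** 2

def grad(w, eps=1e-5):
    # Numerical gradient via central finite differences.
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

def train(w0, lr=0.005, steps=500):
    # Plain gradient descent from initialization w0.
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

rng = np.random.default_rng(0)
best_w, best_loss = None, float("inf")
for restart in range(20):
    w0 = rng.uniform(-3, 3)   # fresh random initialization each restart
    w = train(w0)
    final = loss(w)
    if final < best_loss:     # keep the best run, like re-training from scratch
        best_w, best_loss = w, final
```

Depending on which side of the landscape the initialization falls, a run ends up in either the shallow or the deep valley – which is basically what my real training runs seem to do.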

Is restarting the model over and over again common practice?

Solution

It is pretty normal to see a small amount of variance between runs no matter how long a model is trained, though not at the magnitude you are seeing.

Is the decrease in loss actually reflected in the test-set accuracy? Loss can be a useful measurement, but at least in my experience, loss and accuracy (or whatever metric you are interested in) are often only loosely correlated. I have observed that unusually high training accuracies/low training losses often indicate a model that generalizes poorly.
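As a concrete illustration of loss and accuracy decoupling, here is a small NumPy sketch (the prediction values are made up for illustration) where two sets of predictions have identical accuracy but very different cross-entropy losses, because one model is confidently wrong on a single example:

```python
import numpy as np

def bce(y, p):
    # Mean binary cross-entropy, with clipping for numerical safety.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def accuracy(y, p):
    # Fraction of examples where the thresholded prediction matches the label.
    return float(np.mean((p > 0.5) == (y == 1)))

y = np.array([1, 1, 1, 0, 0])

# Both models misclassify the same single example...
p_calm = np.array([0.7, 0.7, 0.4, 0.3, 0.3])       # mildly wrong on it
p_sure = np.array([0.99, 0.99, 0.01, 0.01, 0.01])  # confidently wrong on it

acc_calm, acc_sure = accuracy(y, p_calm), accuracy(y, p_sure)
loss_calm, loss_sure = bce(y, p_calm), bce(y, p_sure)
```

Both models score 80% accuracy, but the confidently-wrong one has roughly double the loss – which is why a large drop in loss does not automatically mean a better metric.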

Loss landscapes are not always smooth on the way to the global minimum; it's possible yours has two distinct valleys in it. This paper by Hao Li et al. is a really interesting read on the topic:
Visualizing the Loss Landscape of Neural Nets

[Figure: loss landscape visualizations from Li et al.]

To summarize, feature re-use (e.g. skip connections) and regularization can greatly help create a smooth gradient towards the minimum.
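For instance, L2 regularization ("weight decay") just adds a `lam * w` term to the gradient step, pulling weights toward zero; a minimal NumPy sketch, with illustrative values for `lam` and the learning rate:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)           # toy weight vector
grad_data = rng.normal(size=5)   # stand-in for the data-loss gradient
lam, lr = 1e-2, 0.1              # penalty strength and learning rate (illustrative)

# Regularized gradient step: the data gradient plus lam * w.
# The extra term shrinks the weights each step, which tends to
# smooth the effective loss surface the optimizer sees.
w_new = w - lr * (grad_data + lam * w)
```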

You might also want to look into a learning-rate policy to try to get your model into regions where the loss landscape is smoother. I would recommend the One-Cycle Policy by Leslie Smith. The general idea is to ramp up the learning rate and decrease momentum to move the model into the region of the global minimum (skipping over local minima along the way), then decrease the learning rate to let the model settle into the base of that minimum.
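A minimal sketch of such a one-cycle schedule (the specific endpoint values and the cosine interpolation here are my own illustrative choices – Smith's paper also describes linear variants):

```python
import math

def one_cycle(step, total_steps, lr_start=0.004, lr_max=0.1,
              mom_max=0.95, mom_min=0.85, warmup_frac=0.3):
    """One-cycle schedule: ramp the learning rate up while momentum
    ramps down, then anneal the learning rate towards zero while
    momentum recovers. Cosine interpolation between the endpoints."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warm-up phase: LR rises lr_start -> lr_max, momentum falls
        # mom_max -> mom_min.
        t = step / warmup_steps
        lr = lr_start + (lr_max - lr_start) * 0.5 * (1 - math.cos(math.pi * t))
        mom = mom_max + (mom_min - mom_max) * 0.5 * (1 - math.cos(math.pi * t))
    else:
        # Annealing phase: LR falls lr_max -> 0, momentum rises back
        # mom_min -> mom_max.
        t = (step - warmup_steps) / (total_steps - warmup_steps)
        lr = lr_max * 0.5 * (1 + math.cos(math.pi * t))
        mom = mom_min + (mom_max - mom_min) * 0.5 * (1 - math.cos(math.pi * t))
    return lr, mom
```

In Keras you could apply something like this per batch from a custom `Callback` that sets the optimizer's learning rate each step; the exact wiring depends on your Keras version.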

Answered By – Lukeyb

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
