I saw an example of a TensorFlow model where both sigmoid and ReLU are used in a binary task (spam or not spam), and I don't really understand why both are used, why there are two activation layers, and also why there are two dropout layers. Can someone explain? Here is the schema:
tensorflow_model():
    Input
    Embedding
    LSTM
    Dense
    Activation('relu')
    Dropout
    Dense
    Activation('sigmoid')
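For reference, that schema corresponds to something like the following `tf.keras` model. The vocabulary size, embedding and layer dimensions, sequence length, and dropout rate below are assumptions, not values from the original model:

```python
import tensorflow as tf

# A sketch of the schema above; all sizes are assumed for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # assumed vocab/dim
    tf.keras.layers.LSTM(64),                                   # assumed units
    tf.keras.layers.Dense(64),
    tf.keras.layers.Activation('relu'),       # hidden activation
    tf.keras.layers.Dropout(0.5),             # assumed dropout rate
    tf.keras.layers.Dense(1),
    tf.keras.layers.Activation('sigmoid'),    # output in (0, 1) for binary classification
])
```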
ReLU is probably the most commonly used as the activation for hidden layers in deep neural networks for two reasons:
- It mitigates the vanishing gradient problem: its gradient is either 0 or 1. When many such gradients are multiplied together through the layers, the product is still 0 or 1 rather than shrinking toward zero (sigmoid's gradient, by contrast, is at most 0.25, so repeated products vanish quickly).
- It is cheap to compute: max(0, x) is much faster than the exponentials required by sigmoid or other more complex activations.
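Both points above can be seen directly from the definitions; here is a minimal plain-Python sketch:

```python
import math

def relu(x):
    # max(0, x): a single comparison, no transcendental functions,
    # and its derivative is exactly 0 (for x < 0) or 1 (for x > 0).
    return max(0.0, x)

def sigmoid(x):
    # Requires evaluating an exponential, and squashes output into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```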
Now, the reason sigmoid is used at the end is the nature of the problem. In binary classification, we want to predict the probability of the positive class (in your case, spam vs. not spam). Sigmoid conveniently gives us values in the range 0 to 1, which we can interpret as probabilities. If we used ReLU, the output could be anything from 0 to infinity, making it much harder to tell what those values mean.
Thus, for a classification problem like this, you will often see networks like so:
Input
# hidden layers
Dense, ReLU
Dense, ReLU
...
Dense, ReLU
# output layer
Dense, Sigmoid
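The pattern above can be sketched as a plain forward pass in NumPy, which makes the role of each activation explicit. All layer sizes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 10 input features, two hidden layers of 16 units.
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(x):
    h = relu(x @ W1 + b1)        # hidden layer: ReLU
    h = relu(h @ W2 + b2)        # hidden layer: ReLU
    return sigmoid(h @ W3 + b3)  # output layer: sigmoid -> probability in (0, 1)

x = rng.normal(size=(4, 10))     # a batch of 4 examples
p = forward(x)                   # shape (4, 1), each value strictly in (0, 1)
```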
The hidden layers can optionally include additional components, such as dropout or batch normalization, which serve purposes like preventing overfitting and speeding up training.
Answered By – jr15