Issue
My dataset is a network traffic dataset on which we do binary classification; the number of features in the data is 25.
This is the Transformer model:
embed_dim = 25 # Embedding size for each token
num_heads = 2 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer
inputs = layers.Input(shape=(25,1,))
transformer_block = TransformerBlock(25, num_heads, ff_dim)
x = transformer_block(inputs)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
outputs = layers.Dense(1, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=32, epochs=50, validation_data=(x_test, y_test)
)
But the accuracy is extremely poor and does not improve over epochs:
Epoch 1/50
1421/1421 [==============================] - 9s 6ms/step - loss: 0.5215 - accuracy: 0.1192 - val_loss: 0.4167 - val_accuracy: 0.1173
Solution
Overall, you should be able to get to 100% (train) accuracy as long as the data is not contradictory. Arguably the best strategy is to get there before worrying about generalisation (test error). For this specific case:
- the final activation should be sigmoid; with a single output unit, softmax is the constant function f(x) = exp(x) / exp(x) = 1
- there is no need for dropout (it will only make training accuracy lower)
- global pooling can remove important information; replace it with a Dense layer for the time being
- normalise your data; your features span quite wide ranges, which can cause training to struggle to converge
- consider lowering your learning rate, as it will make it easier to overfit to the training data (a sketch with all of these changes applied follows this list)
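A minimal sketch applying the fixes above, assuming TransformerBlock is the custom layer from the question and that x_train / x_test are NumPy arrays of shape (n, 25, 1); reading "replace pooling with a Dense layer" as Flatten followed by Dense is my interpretation, not necessarily the author's:

from tensorflow import keras
from tensorflow.keras import layers

# Standardise features; fit the statistics on the training set only.
mean = x_train.mean(axis=0, keepdims=True)
std = x_train.std(axis=0, keepdims=True) + 1e-8
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std

inputs = layers.Input(shape=(25, 1))
x = TransformerBlock(25, num_heads, ff_dim)(inputs)   # the question's custom block
x = layers.Flatten()(x)                               # keep every position; no pooling, no dropout
x = layers.Dense(20, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)    # sigmoid, not softmax, for one unit

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # lowered learning rate
              loss="binary_crossentropy", metrics=["accuracy"])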
If all of the above fail, just increase the size of the model, as the "20-25" sizes used in your layers might simply not be big enough. Neural networks need quite a lot of redundancy to learn properly.
Personally, I would also replace the whole model with a plain MLP and verify that everything works. I am not sure why a transformer would be the model of choice here, and doing so will let you check whether the issue lies with the chosen model or with the code.
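For instance, a minimal MLP baseline sketch on the flattened 25 features (the layer widths here are arbitrary assumptions):

mlp = keras.Sequential([
    layers.Input(shape=(25,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
mlp.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
mlp.fit(x_train.reshape(len(x_train), 25), y_train, batch_size=32, epochs=50,
        validation_data=(x_test.reshape(len(x_test), 25), y_test))

If the MLP trains fine but the transformer does not, the problem is in the transformer code rather than the data.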
Finally, make sure that 100% accuracy is actually obtainable: take your training data and check whether any two datapoints have exactly the same features but different labels. If there are none, you should be able to get to 100% accuracy; it is just a matter of getting the hyperparameters right.
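One simple way to run that check, assuming x_train and y_train are NumPy arrays:

import numpy as np

# Hash each feature vector and remember its label; flag any vector
# that reappears with a different label.
labels_seen = {}
contradictions = 0
for features, label in zip(x_train.reshape(len(x_train), -1), np.ravel(y_train)):
    key = features.tobytes()
    if key in labels_seen and labels_seen[key] != label:
        contradictions += 1
    labels_seen[key] = label
print("contradictory duplicate datapoints:", contradictions)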
Answered By: lejlot
This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.