Keras: Siamese Model with VGG16 is stuck at 50% accuracy

Issue

I'm trying to create a Siamese model with Keras that learns to recognize differences between Mel-Spectrograms.
The dataset I'm using is the ESC-50 dataset.
I split it into training files (40 classes with 40 files each) and test files (5 classes with 40 files each).
I generate positive and negative pairs.
I compute Mel-Spectrograms with 64 Mel bands, so each Mel-Spectrogram has the shape (64,626).
For example, the arrays feat_train_1 and feat_train_2 have the shape (3200,64,626).
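
The pair-generation helper itself is not shown in the post. As a rough illustration only, the sketch below assumes one positive and one negative pair per file, which would be consistent with the array sizes mentioned (1600 training files giving 3200 pairs, 200 test files giving 400 pairs). The function name generate_pairs_sketch and the seed parameter are hypothetical, not part of the original script.

```python
import numpy as np

def generate_pairs_sketch(labels, indices, seed=0):
    """Hypothetical stand-in for generate_pairs: one positive and one
    negative pair per file, with pair labels alternating 1, 0, 1, 0, ..."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    indices = np.asarray(indices)
    pairs, pair_labels = [], []
    for i in indices:
        # Candidates from the same class (excluding the file itself)
        same = indices[(labels[indices] == labels[i]) & (indices != i)]
        # Candidates from any other class
        diff = indices[labels[indices] != labels[i]]
        pairs.append((i, rng.choice(same)))
        pair_labels.append(1)
        pairs.append((i, rng.choice(diff)))
        pair_labels.append(0)
    return np.array(pairs), np.array(pair_labels)

# Toy labels: 5 classes with 40 files each, like the test split described above.
labels = np.repeat(np.arange(5), 40)
idx = np.arange(len(labels))
pair_idx, pair_labels = generate_pairs_sketch(labels, idx)
print(pair_idx.shape, pair_labels.mean())
```

With a scheme like this the classes are balanced, so a model that gives the same answer for every pair would score exactly 50% accuracy.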

Mel-Spectrograms:

[Image: Mel-Spectrograms]

In this picture, the first two spectrograms are feat_train_1[i] and feat_train_2[i].

pair_labels_train[i]=1 (positive pair)

The 3rd and 4th spectrograms are feat_train_1[i+1] and feat_train_2[i+1] with pair_labels_train[i+1]=0.

I then expand the feature arrays with a channel dimension and broadcast them to 3 channels.
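
As a small sketch of this step, the channel expansion can also be done entirely in NumPy. The batch size of 4 and the float32 dtype are assumptions for illustration; the script in the post uses float64 arrays and tf.broadcast_to.

```python
import numpy as np

# Small stand-in batch; the arrays in the post have shape (3200, 64, 626).
feat = np.random.rand(4, 64, 626).astype(np.float32)

feat = np.expand_dims(feat, -1)         # add a channel axis: (4, 64, 626, 1)
feat_rgb = np.repeat(feat, 3, axis=-1)  # copy it to 3 channels: (4, 64, 626, 3)
print(feat_rgb.shape)

# Size in bytes of the full 3-channel training array when stored as float64
bytes_float64 = 3200 * 64 * 626 * 3 * 8
print(bytes_float64)
```

The float64 size computed here equals the 3076915200-byte allocation reported in the TensorFlow memory warnings in the output, so storing the features as float32 would roughly halve that footprint.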

I'm using the VGG16 network to extract embeddings from the features.
The Euclidean distance between the two embeddings is then calculated.
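
The distance helper (utiltiy_functions.eucl_distance) is not included in the post, so the following is only a NumPy sketch of what such a function presumably computes. The name eucl_distance_sketch and the eps floor are assumptions; inside a Lambda layer the same arithmetic would have to be written with tensor operations.

```python
import numpy as np

def eucl_distance_sketch(emb_a, emb_b, eps=1e-7):
    # Row-wise Euclidean distance, kept as a column so the result has
    # shape (batch, 1), matching the Lambda output in the model summary.
    sq = np.sum((emb_a - emb_b) ** 2, axis=1, keepdims=True)
    # Floor the squared distance so the square root never sees exactly zero.
    return np.sqrt(np.maximum(sq, eps))

a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[3.0, 4.0], [1.0, 1.0]])
d = eucl_distance_sketch(a, b)
print(d.shape)
print(d)
```

The small floor is a common precaution because the gradient of the square root is unbounded at zero, which matters when both branches produce identical embeddings.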

The problem is that the accuracy (as well as val_accuracy) is stuck at 50% while the loss slowly decreases.
You can see the whole script here:

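#Imports assumed by the script (not shown in the original post)
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import utiltiy_functions  #the author's own helper module, spelled as in the post

BATCH_SIZE = 20
EPOCHS = 10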

#Load the audio files and split them
audio_data, labels = utiltiy_functions.read_audio_files('esc-50-master/audio_conv', 'esc-50-master/meta')

idx_training, idx_test, idx_eval = utiltiy_functions.split_data(audio_data, labels)


pair_idx_train , pair_labels_train = utiltiy_functions.generate_pairs(labels, idx_training)
pair_idx_test , pair_labels_test = utiltiy_functions.generate_pairs(labels, idx_test)
pair_idx_eval , pair_labels_eval = utiltiy_functions.generate_pairs(labels, idx_eval)

audio_data_train_1 = audio_data[pair_idx_train[:,0]] 
audio_data_train_2 = audio_data[pair_idx_train[:,1]] 

audio_data_test_1 = audio_data[pair_idx_test[:,0]]
audio_data_test_2 = audio_data[pair_idx_test[:,1]]

audio_data_eval_1 = audio_data[pair_idx_eval[:,0]]
audio_data_eval_2 = audio_data[pair_idx_eval[:,1]] 



#Calculate Features and reshape

def get_librosa_melspecs(audio_array, name):

    melspecs = np.zeros((audio_array.shape[0],64,626))

    for i,audio in enumerate(audio_array):
        mel = librosa.feature.melspectrogram(y=audio, n_mels=64, n_fft = 1024, hop_length=128, sr=16000)
        mel[mel!=0] = np.log(mel[mel!=0])
        #melnormalized = librosa.util.normalize(mellog)

        melspecs[i]=mel

    np.save(name, melspecs)
    return melspecs


feat_test_1 = get_librosa_melspecs(audio_data_test_1, "features_vgg_test1.npy")
feat_test_2 = get_librosa_melspecs(audio_data_test_2, "features_vgg_test2.npy")
feat_train_1 = get_librosa_melspecs(audio_data_train_1, "features_vgg_train1.npy")
feat_train_2 = get_librosa_melspecs(audio_data_train_2, "features_vgg_train2.npy")

feat_test_1 = np.expand_dims(feat_test_1, 3)
feat_test_2 = np.expand_dims(feat_test_2, 3)
feat_train_1 = np.expand_dims(feat_train_1, 3)
feat_train_2 = np.expand_dims(feat_train_2, 3)

feat_test_1 = tf.reshape(tf.broadcast_to(feat_test_1, (400,64,626,3)), (400,64,626,3))
feat_test_2 = tf.reshape(tf.broadcast_to(feat_test_2, (400,64,626,3)), (400,64,626,3))
feat_train_1 = tf.reshape(tf.broadcast_to(feat_train_1, (3200,64,626,3)), (3200,64,626,3))
feat_train_2 = tf.reshape(tf.broadcast_to(feat_train_2, (3200,64,626,3)), (3200,64,626,3))


#Build siamese net

#inputs
feat_1 = Input(shape=(64,626,3))
feat_2 = Input(shape=(64,626,3))

#vgg16
model_vgg = VGG16(weights="imagenet", include_top=False, input_shape=(64,626,3))

for layer in model_vgg.layers:
    layer.trainable = True

pre_emb1 = model_vgg(feat_1)
pre_emb2 = model_vgg(feat_2)

#flatten and dense layers
flatten = Flatten()
dense_1 = Dense(4096, activation="relu")
dense_2 = Dense(4096, activation="relu")
dense_3 = Dense(512, activation="relu")

flatten1 = flatten(pre_emb1)
flatten2 = flatten(pre_emb2)

dense1_1 = dense_1(flatten1)
dense2_1 = dense_1(flatten2)

dense1_2 = dense_2(dense1_1)
dense2_2 = dense_2(dense2_1)

dense1_3 = dense_3(dense1_2)
dense2_3 = dense_3(dense2_2)

#Distance
distance = Lambda(utiltiy_functions.eucl_distance)([dense1_3, dense2_3])

#Output Layer
outputs = Dense(1, activation="sigmoid")(distance)

#model definition
model = Model(inputs=[feat_1, feat_2], outputs=outputs)

print(model.summary())

#compile
opt = Adam(learning_rate=0.001)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])


early_stopping = EarlyStopping(monitor='val_loss', patience=3, mode='auto', restore_best_weights=True)

#Train the model
print("Siamesisches Model trainieren.\n")
model.fit(
        [feat_train_1[:], feat_train_2[:]], pair_labels_train[:],
        validation_data=([feat_test_1[:], feat_test_2[:]], pair_labels_test[:]),
        batch_size=BATCH_SIZE, 
        epochs=EPOCHS,
        shuffle=True,
        callbacks=[early_stopping]
        )


model.save_weights("siamese_weights.h5")

Output:

2022-07-01 12:33:55.261913: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-07-01 12:33:55.262999: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WAV-Dateien einlesen...

WAV-Dateien splitten...

Trainings-Paare bilden...

Test-Paare bilden...

Evaluierungs-Paare bilden...

Berechnete Features aus Dateien laden...

2022-07-01 12:34:56.888225: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-07-01 12:34:56.895846: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2022-07-01 12:34:56.896817: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublasLt64_11.dll'; dlerror: cublasLt64_11.dll not found
2022-07-01 12:34:56.897344: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2022-07-01 12:34:56.898131: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2022-07-01 12:34:56.898859: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found
2022-07-01 12:34:56.900598: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2022-07-01 12:34:56.903538: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2022-07-01 12:34:56.904142: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-07-01 12:34:56.940548: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-01 12:34:57.018051: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 384614400 exceeds 10% of free system memory.
2022-07-01 12:34:57.230756: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 384614400 exceeds 10% of free system memory.
2022-07-01 12:34:57.592132: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 3076915200 exceeds 10% of free system memory.
2022-07-01 12:35:06.909754: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 3076915200 exceeds 10% of free system memory.
Siamesisches Netzwerk erstellen...

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 64, 626, 3)]      0         
                                                                 
 block1_conv1 (Conv2D)       (None, 64, 626, 64)       1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 64, 626, 64)       36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 32, 313, 64)       0         
                                                                 
 block2_conv1 (Conv2D)       (None, 32, 313, 128)      73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 32, 313, 128)      147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 16, 156, 128)      0         
                                                                 
 block3_conv1 (Conv2D)       (None, 16, 156, 256)      295168    
                                                                 
 block3_conv2 (Conv2D)       (None, 16, 156, 256)      590080    
                                                                 
 block3_conv3 (Conv2D)       (None, 16, 156, 256)      590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, 8, 78, 256)        0         
                                                                 
 block4_conv1 (Conv2D)       (None, 8, 78, 512)        1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, 8, 78, 512)        2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, 8, 78, 512)        2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, 4, 39, 512)        0         
                                                                 
 block5_conv1 (Conv2D)       (None, 4, 39, 512)        2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 4, 39, 512)        2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 4, 39, 512)        2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 2, 19, 512)        0         
                                                                 
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
None
2022-07-01 12:35:22.586993: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 318767104 exceeds 10% of free system memory.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, 64, 626, 3)  0           []                               
                                ]                                                                 
                                                                                                  
 input_2 (InputLayer)           [(None, 64, 626, 3)  0           []                               
                                ]                                                                 
                                                                                                  
 vgg16 (Functional)             (None, 2, 19, 512)   14714688    ['input_1[0][0]',                
                                                                  'input_2[0][0]']                
                                                                                                  
 flatten (Flatten)              (None, 19456)        0           ['vgg16[0][0]',                  
                                                                  'vgg16[1][0]']                  
                                                                                                  
 dense (Dense)                  (None, 4096)         79695872    ['flatten[0][0]',                
                                                                  'flatten[1][0]']                
                                                                                                  
 dense_1 (Dense)                (None, 4096)         16781312    ['dense[0][0]',                  
                                                                  'dense[1][0]']                  
                                                                                                  
 dense_2 (Dense)                (None, 512)          2097664     ['dense_1[0][0]',                
                                                                  'dense_1[1][0]']                
                                                                                                  
 lambda (Lambda)                (None, 1)            0           ['dense_2[0][0]',                
                                                                  'dense_2[1][0]']                
                                                                                                  
 dense_3 (Dense)                (None, 1)            2           ['lambda[0][0]']                 
                                                                                                  
==================================================================================================
Total params: 113,289,538
Trainable params: 113,289,538
Non-trainable params: 0
__________________________________________________________________________________________________
None
Siamesisches Netzwerk traineren...

Siamesisches Model trainieren.

Epoch 1/10

  1/160 [..............................] - ETA: 1:43:33 - loss: 1.6736 - accuracy: 0.5000
156/160 [============================>.] - ETA: 2:02 - loss: 0.7549 - accuracy: 0.4622
157/160 [============================>.] - ETA: 1:31 - loss: 0.7550 - accuracy: 0.4608
158/160 [============================>.] - ETA: 1:01 - loss: 0.7546 - accuracy: 0.4604
159/160 [============================>.] - ETA: 30s - loss: 0.7542 - accuracy: 0.4613 
160/160 [==============================] - ETA: 0s - loss: 0.7538 - accuracy: 0.4619 
160/160 [==============================] - 5059s 32s/step - loss: 0.7538 - accuracy: 0.4619 - val_loss: 0.7172 - val_accuracy: 0.4725
Epoch 2/10

  1/160 [..............................] - ETA: 1:20:48 - loss: 0.7224 - accuracy: 0.4500
  2/160 [..............................] - ETA: 1:19:53 - loss: 0.7171 - accuracy: 0.4500
  3/160 [..............................] - ETA: 1:19:26 - loss: 0.7145 - accuracy: 0.4500
  4/160 [..............................] - ETA: 1:19:04 - loss: 0.7090 - accuracy: 0.4875
  5/160 [..............................] - ETA: 1:18:32 - loss: 0.7086 - accuracy: 0.4600
  6/160 [>.............................] - ETA: 1:18:34 - loss: 0.7055 - accuracy: 0.4750
155/160 [============================>.] - ETA: 2:33 - loss: 0.7006 - accuracy: 0.4677
156/160 [============================>.] - ETA: 2:02 - loss: 0.7005 - accuracy: 0.4683
157/160 [============================>.] - ETA: 1:32 - loss: 0.7005 - accuracy: 0.4688
158/160 [============================>.] - ETA: 1:01 - loss: 0.7004 - accuracy: 0.4690
159/160 [============================>.] - ETA: 30s - loss: 0.7004 - accuracy: 0.4682 
160/160 [==============================] - ETA: 0s - loss: 0.7003 - accuracy: 0.4694 
160/160 [==============================] - 5075s 32s/step - loss: 0.7003 - accuracy: 0.4694 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/10

  1/160 [..............................] - ETA: 1:21:04 - loss: 0.6919 - accuracy: 0.5500
  2/160 [..............................] - ETA: 1:20:37 - loss: 0.6933 - accuracy: 0.5000
  3/160 [..............................] - ETA: 1:19:54 - loss: 0.6932 - accuracy: 0.5000
  4/160 [..............................] - ETA: 1:19:19 - loss: 0.6940 - accuracy: 0.4750
  5/160 [..............................] - ETA: 1:18:47 - loss: 0.6935 - accuracy: 0.4900
  6/160 [>.............................] - ETA: 1:18:13 - loss: 0.6932 - accuracy: 0.5000
 62/160 [==========>...................] - ETA: 49:52 - loss: 0.6936 - accuracy: 0.4839
 63/160 [==========>...................] - ETA: 49:21 - loss: 0.6937 - accuracy: 0.4825
 64/160 [===========>..................] - ETA: 48:50 - loss: 0.6936 - accuracy: 0.4844
 65/160 [===========>..................] - ETA: 48:20 - loss: 0.6936 - accuracy: 0.4854
 66/160 [===========>..................] - ETA: 47:49 - loss: 0.6936 - accuracy: 0.4848
 67/160 [===========>..................] - ETA: 47:18 - loss: 0.6936 - accuracy: 0.4836
 68/160 [===========>..................] - ETA: 46:48 - loss: 0.6936 - accuracy: 0.4838
 69/160 [===========>..................] - ETA: 46:17 - loss: 0.6936 - accuracy: 0.4855

The accuracy won't change. The loss slowly decreases. I've tried training it for several hours.
I've already tried using different losses, such as contrastive loss, and different networks, such as MobileNet or VGGish.
It's always stuck at 50%.
I hope you can help me. Since this is my first post here, feel free to ask more questions.
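
One observation on the log above: the loss settles near 0.693, which is close to ln 2. The short calculation below shows that this is the binary cross-entropy of a prediction of exactly 0.5, whatever the true label is. This is only a reading of the numbers, under the assumption that the network ends up emitting roughly 0.5 for every pair; the helper name binary_crossentropy is local to this sketch.

```python
import math

def binary_crossentropy(y_true, p):
    # Cross-entropy of a single prediction p for a label y_true in {0, 1}
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

loss_pos = binary_crossentropy(1, 0.5)
loss_neg = binary_crossentropy(0, 0.5)
print(round(loss_pos, 4), round(loss_neg, 4), round(math.log(2), 4))
```

A loss that drifts down toward this value while accuracy stays at chance would fit a model that is becoming less confident rather than more discriminative.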

Solution

I was able to fix this by changing the activation function of the output layer from sigmoid to relu:

#Output Layer
outputs = Dense(1, activation="relu")(distance)
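
To make the difference between the two activations concrete, here is a small NumPy sketch applied to a scalar distance. The weight w and bias b are made-up values standing in for the learned parameters of the final Dense(1) layer, not values from the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical weight and bias of the final Dense(1) layer; real values are learned.
w, b = -1.5, 2.0
distance = np.array([0.0, 0.5, 1.0, 2.0, 4.0])

z = w * distance + b
out_sigmoid = sigmoid(z)
out_relu = relu(z)
print("sigmoid:", out_sigmoid)
print("relu:   ", out_relu)
```

The sigmoid output always stays strictly between 0 and 1, while the relu output is exactly 0 for negative inputs and unbounded above. Since relu outputs are no longer probabilities, it may be worth checking how the chosen loss and the accuracy metric treat values above 1.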

Answered By – logame

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
