How to create a dataset for image classification


I trained a model using images I gathered from the web. Then, when inferences were made using images newly collected from the web, performance was poor.

I am wondering how I can improve my dataset using misclassified images. Can I add all the misclassified images to the training dataset? And then do I have to collect new images?

I added some of the misclassified images to the training dataset, and the performance evaluation got better.


It would help if you could provide more information on how you trained your model and on your network architecture.

However, here are some general guidelines:

  • You can try to diversify the images in your training set by, yes, adding new images. The more varied the examples you give your network, the higher the chance that they will resemble the images you want predictions for.
  • Do data augmentation. It is pretty straightforward and usually improves accuracy quite a bit. You can have a look at this TensorFlow tutorial on data augmentation. If you don't know what data augmentation is, it is basically a technique that applies minor changes to your images, for example rotating an image slightly, resizing it, etc. This way the model learns to recognize your images even with slight changes, which usually makes it more robust to new images.
  • You could consider transfer learning. The main idea is to take a model that has already been trained on a huge dataset and fine-tune it on your specific problem. The tutorial I linked shows the typical transfer learning workflow: taking a model pretrained on the ImageNet dataset (the huge dataset) and retraining it on the Kaggle "cats vs dogs" classification dataset (a smaller dataset, like the one you might have).
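
To make the data augmentation point a bit more concrete, here is a minimal sketch in plain NumPy rather than the TensorFlow layers the tutorial uses. The function name `augment`, the `max_shift` parameter and the 0.8 to 1.2 brightness range are my own illustrative choices, and I am assuming images are `uint8` arrays of shape height x width x channels. It only does a random flip, a small shift and a brightness change; small-angle rotation or resizing would need an image library, so they are left out here.

```python
import numpy as np


def augment(image, rng, max_shift=4):
    """Return a randomly perturbed copy of an H x W x C uint8 image."""
    out = image.copy()

    # Random horizontal flip with 50% probability.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Random shift: reflect-pad the borders, then crop a window of the
    # original size at a random offset of up to max_shift pixels.
    h, w = out.shape[:2]
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(out, pad, mode="reflect")
    top = int(rng.integers(0, 2 * max_shift + 1))
    left = int(rng.integers(0, 2 * max_shift + 1))
    out = padded[top:top + h, left:left + w, :]

    # Random brightness change, clipped back to the valid 0..255 range.
    factor = rng.uniform(0.8, 1.2)
    out = np.clip(out.astype(np.float32) * factor, 0, 255)
    return out.astype(image.dtype)


# Usage: build several augmented variants of one (here random) image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)
```

In practice you would apply something like this on the fly to each training batch, so the network rarely sees exactly the same pixels twice, and leave the validation and test images untouched.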

Answered By – claudia

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
