Why does model.fit work without separating features and labels, while model.evaluate fails on the same kind of dataset?


I am building a DistilBERT model for the IMDB dataset, where each text is classified as either positive or negative. In my code I first tokenised the ‘text’ column:

from datasets import load_dataset
imdb = load_dataset("imdb")

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

After this I added padding and converted the dataset to a TF dataset.

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

import tensorflow as tf
tf_train_set = tokenized_imdb["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    # remaining arguments are not shown in the post
)

tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    # remaining arguments are not shown in the post
)
from transformers import create_optimizer

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Then I try to measure the baseline accuracy of the pre-trained model on the same dataset that will be used for fine-tuning, i.e., the model’s accuracy before it is fine-tuned for the downstream task.

base_model_result = model.evaluate(x=tf_validation_set)


This is where I get the error:

AttributeError: 'NoneType' object has no attribute 'shape'

This suggests that I need to provide the feature and label values separately.
However, the same approach works fine for model.fit:

results = model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)

Now, if I need to split tf_validation_set, how can I do that and make model.evaluate work?

To resolve this, I tried converting the TF dataset to lists and separating the features from the label column like so:

X = list(map(lambda x: [x['input_ids'],x['attention_mask']], tf_validation_set))
y = list(map(lambda x: x['labels'], tf_validation_set))

base_loss, base_acc = model.evaluate(X,y,verbose=1) 

print('Base accuracy:', base_acc)

But here I get the following error:

ValueError: Data cardinality is ambiguous:Make sure all arrays contain the same number of samples.

How can I fix this?


According to the Keras documentation for fit()/evaluate(), the input data x could be:

  • A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).
  • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).
  • A dict mapping input names to the corresponding array/tensors, if the model has named inputs.
  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights).

The input data (the parameter x) passed to fit()/evaluate() here is a tf.data dataset, so it should yield tuples of either (inputs, targets) or (inputs, targets, sample_weights).
If x is a tf.data dataset instance, y should not be specified, since the targets are obtained from x. For more information, see the description of the x and y arguments in the tf.keras.Model documentation.
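
To make the (inputs, targets) shape concrete, here is a minimal sketch of splitting one dict-style batch into such a tuple. It uses plain Python lists as stand-ins for tensors so it runs without TensorFlow. The helper name split_features_and_labels and the toy values are made up for illustration, and the label key "labels" is an assumption taken from the x['labels'] access in the question.

```python
# Sketch only: plain Python lists stand in for the tensors in each batch.
def split_features_and_labels(batch, label_key="labels"):
    """Turn one dict batch into an (inputs, targets) tuple."""
    # Everything except the label column is treated as model input.
    features = {name: value for name, value in batch.items() if name != label_key}
    return features, batch[label_key]


# Two toy batches shaped like padded, collated batches (values are invented).
toy_batches = [
    {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]], "labels": [1]},
    {"input_ids": [[101, 102, 0]], "attention_mask": [[1, 1, 0]], "labels": [0]},
]

pairs = [split_features_and_labels(batch) for batch in toy_batches]
inputs, targets = pairs[0]
print(sorted(inputs))  # ['attention_mask', 'input_ids']
print(targets)         # [1]

# With TensorFlow, the same function could be mapped over the dataset, and
# evaluate() would then be called without y (assuming a compiled model):
#   eval_set = tf_validation_set.map(split_features_and_labels)
#   model.evaluate(eval_set)
```

Mapping over the dataset keeps the batches as they are, which avoids the mismatched list lengths that come from converting the whole dataset to Python lists.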

Your code works fine in Colab; please find the gist here. Thank you!

Answered By – Tfer3

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
