TensorFlow issue when tokenizing sentences

Issue

I followed a tutorial about tokenizing sentences using Tensorflow, here’s the code I’m trying:

from tensorflow.keras.preprocessing.text import Tokenizer #API for tokenization

t = Tokenizer(num_words=4) #meant to keep only the most important (frequent) words
listofsentences=['Apples are fruits', 'An orange is a tasty fruit', 'Fruits are tasty!']
t.fit_on_texts(listofsentences) #processes words

print(t.word_index)
print(t.texts_to_sequences(listofsentences)) #arranges tokens, returns nested list

The first print statement shows a dictionary as expected:

{'are': 1, 'fruits': 2, 'tasty': 3, 'apples': 4, 'an': 5, 'orange': 6, 'is': 7, 'a': 8, 'fruit': 9}

But the last line outputs a list that is missing many of the words:

[[1, 2], [3], [2, 1, 3]]

Please let me know what I’m doing wrong and how to get the expected list:

[[4, 1, 2], [5, 6, 7, 8, 3, 9], [2, 1, 3]]

Solution

The `num_words=4` argument is the cause: `texts_to_sequences` only keeps words whose index is strictly below `num_words`, i.e. the three most frequent words ('are', 'fruits', 'tasty'). Note that `fit_on_texts` still builds the full `word_index` regardless of `num_words`, which is why the first print statement looks complete. To keep an unlimited number of tokens, use:

t = Tokenizer(num_words=None)

Output:

{'are': 1, 'fruits': 2, 'tasty': 3, 'apples': 4, 'an': 5, 'orange': 6, 'is': 7, 'a': 8, 'fruit': 9}
[[4, 1, 2], [5, 6, 7, 8, 3, 9], [2, 1, 3]]
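To make the filtering behavior concrete without needing TensorFlow installed, here is a minimal plain-Python sketch that mimics what `Tokenizer` does (the helper names `fit_word_index` and `texts_to_sequences_py` are my own, not part of any library): more frequent words get lower indices, and with `num_words=N` only indices below `N` survive in the sequences.

```python
from collections import Counter


def fit_word_index(sentences):
    # Lowercase, strip trailing punctuation, and count word frequencies,
    # roughly mimicking Tokenizer.fit_on_texts.
    words = [w.strip("!.,?").lower() for s in sentences for w in s.split()]
    counts = Counter(words)
    # More frequent words get lower indices, starting at 1;
    # ties keep first-seen order (most_common uses a stable sort).
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}


def texts_to_sequences_py(sentences, word_index, num_words=None):
    # Like Tokenizer.texts_to_sequences: with num_words=N, only words
    # whose index is strictly below N survive; None keeps everything.
    seqs = []
    for s in sentences:
        tokens = [word_index[w.strip("!.,?").lower()] for w in s.split()]
        if num_words is not None:
            tokens = [t for t in tokens if t < num_words]
        seqs.append(tokens)
    return seqs


sentences = ['Apples are fruits', 'An orange is a tasty fruit', 'Fruits are tasty!']
wi = fit_word_index(sentences)
print(texts_to_sequences_py(sentences, wi, num_words=4))
# → [[1, 2], [3], [2, 1, 3]]  (indices >= 4 are dropped)
print(texts_to_sequences_py(sentences, wi, num_words=None))
# → [[4, 1, 2], [5, 6, 7, 8, 3, 9], [2, 1, 3]]
```

This reproduces both outputs from the question and the answer, which shows the truncation comes from the `num_words` cutoff and not from `fit_on_texts`.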

Answered By – Markus

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
