Issue
I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
"This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word_Index:
print(tokenizer.index_word[8]) ===> 'ab'
print(tokenizer.index_word[9]) ===> 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer also creates tokens based on the "." in this case. I am passing split=" " to the Tokenizer, so I expect the following output:
[[1, 7, 8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
In other words, I want the tokenizer to create words based on spaces (" ") only, not on any special characters. How do I make the tokenizer create tokens only on spaces?
Solution
The Tokenizer takes another argument called filters, which defaults to all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.
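For example, a quick check (assuming the standalone helper text_to_word_sequence that tensorflow.keras.preprocessing.text also exports) shows that the default filters already turn the "." into a split point:

from tensorflow.keras.preprocessing.text import text_to_word_sequence

# '.' is part of the default filters, so it is replaced by the split string first
print(text_to_word_sequence("sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"))
# -> ['sample', 'person', 'ab', 'cdefghijklmnopqrstuvwxyz']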
If you look at the source code of the Tokenizer, specifically at the method fit_on_texts, you will see that it uses the function text_to_word_sequence, which receives the filters characters and treats them the same as the split string it also receives:
def text_to_word_sequence(...):
    ...
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)
    seq = text.split(split)
    return [i for i in seq if i]
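A minimal sketch of what those lines do with the default arguments (plain Python, using str.maketrans purely for illustration):

# the default filters, copied from the Tokenizer signature
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
split = " "
text = "ab.cdefghijklmnopqrstuvwxyz"

# every filter character is mapped to the split string ...
translate_map = str.maketrans({c: split for c in filters})
# ... so the '.' becomes a space before the final split happens
print(text.translate(translate_map).split(split))
# -> ['ab', 'cdefghijklmnopqrstuvwxyz']

This is why 'ab' and 'cdefghijklmnopqrstuvwxyz' end up as two separate tokens in the question.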
So, in order to split on nothing but the specified split string, just pass an empty string to the filters argument.
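Applied to the example from the question, that looks like the sketch below; the exact indices depend on the fitted vocabulary, but with these two sentences it should reproduce the output the question expects:

from tensorflow.keras.preprocessing.text import Tokenizer

sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]

# filters="" means no character is replaced, so only the split string separates tokens
tokenizer = Tokenizer(num_words=200, split=" ", filters="")
tokenizer.fit_on_texts(sample_text)

print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# expected: [[1, 7, 8]]
print(tokenizer.index_word[8])
# expected: ab.cdefghijklmnopqrstuvwxyz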
Answered By – ronpi
This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.