Using tf.keras.layers.Embedding for categorical variables in a regression problem

Issue

Using the iris dataset as a hypothetical hello world example:

import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df['iris_class'] = pd.Series(iris['target'], name = 'target_values')
df['iris_class_name'] = df['iris_class'].replace([0,1,2], ['iris-' + species for species in iris['target_names'].tolist()])
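# regex=True is required on pandas >= 2.0, where str.replace treats the pattern as a literal by default
df.columns = df.columns.str.replace("[() ]", "", regex=True)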

print(df.head())

Say I want to use tf.keras.layers.Embedding instead of one-hot/dummy encoding as part of an ANN for regression, e.g.:

iris_class_name + sepalwidthcm + petallengthcm -> sepallengthcm

where sepallengthcm is the dependent variable. I came across this:

city_lookup = tf.keras.layers.StringLookup(vocabulary = city_vocabulary, mask_token = None)
city_embedding = tf.keras.Sequential([
    city_lookup,
    tf.keras.layers.Embedding(len(city_vocabulary) + 1, embedding_dimension)
  ], "city_embedding")
  
city = features["city"]
city_embedding_output = city_embedding(city)

but I am not sure exactly how to use it in my case. Any pointers are very much welcome. Thanks!

Solution

You can map each iris_class_name to an n-dimensional vector representation and then concatenate those vectors with the continuous features:

import pandas as pd
from sklearn import datasets
import numpy as np
import tensorflow as tf

iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df['iris_class'] = pd.Series(iris['target'], name = 'target_values')
df['iris_class_name'] = df['iris_class'].replace([0,1,2], ['iris-' + species for species in iris['target_names'].tolist()])
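# regex=True is required on pandas >= 2.0, where str.replace treats the pattern as a literal by default
df.columns = df.columns.str.replace("[() ]", "", regex=True)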

vocab = df['iris_class_name'].unique()
embedding_dimension = 10
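# With mask_token=None, index 0 is reserved for out-of-vocabulary strings,
# so the embedding table needs len(vocab) + 1 rows.
lookup = tf.keras.layers.StringLookup(vocabulary = vocab, mask_token = None)
embedding = tf.keras.Sequential([
    lookup,
    tf.keras.layers.Embedding(len(vocab) + 1, embedding_dimension)
  ])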
  
names = df['iris_class_name'].to_numpy()
embedding_output = embedding(names)

features = np.concatenate((embedding_output, df[['sepalwidthcm', 'petallengthcm']].to_numpy()), axis=-1)

print(features.shape)  # (150, 12)
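
Note that calling the Embedding layer directly, as above, returns its randomly initialised vectors; nothing has been trained yet. If you want the vectors to be learned as part of the regression, one option is to put the Embedding layer inside the model. The following is only a rough sketch of that idea, not a tuned solution. The layer sizes, embedding_dimension = 2, the number of epochs and the input names are my own assumptions, and I map the class names to integer ids with np.unique rather than StringLookup just to keep the model inputs numeric:

```python
import numpy as np
import tensorflow as tf
from sklearn import datasets

iris = datasets.load_iris()
X = iris['data']  # columns: sepal length, sepal width, petal length, petal width
names = np.array(['iris-' + s for s in iris['target_names']])[iris['target']]

# Map every class name to an integer id (0, 1, 2); no out-of-vocabulary slot is needed here
vocab, class_ids = np.unique(names, return_inverse=True)
class_ids = class_ids.reshape(-1, 1).astype('int32')

numeric = X[:, [1, 2]].astype('float32')  # sepal width, petal length
target = X[:, 0].astype('float32')        # sepal length (dependent variable)

embedding_dimension = 2  # assumption: a small size for only 3 categories

# Two inputs: the categorical id and the continuous features
class_in = tf.keras.Input(shape=(1,), dtype='int32', name='iris_class_id')
numeric_in = tf.keras.Input(shape=(2,), name='numeric')

# (batch, 1) -> (batch, 1, embedding_dimension) -> (batch, embedding_dimension)
embedded = tf.keras.layers.Embedding(len(vocab), embedding_dimension,
                                     name='class_embedding')(class_in)
embedded = tf.keras.layers.Flatten()(embedded)

# Concatenate the learned vectors with the continuous features and regress
x = tf.keras.layers.Concatenate()([embedded, numeric_in])
x = tf.keras.layers.Dense(16, activation='relu')(x)
out = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model([class_in, numeric_in], out)
model.compile(optimizer='adam', loss='mse')
history = model.fit([class_ids, numeric], target, epochs=5, batch_size=16, verbose=0)

predictions = model.predict([class_ids, numeric], verbose=0)
learned_vectors = model.get_layer('class_embedding').get_weights()[0]
print(predictions.shape, learned_vectors.shape)
```

Here the embedding weights are updated by backpropagation together with the Dense layers, so learned_vectors holds one trained vector per iris class after fitting.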

Since there are only 3 unique iris class names, you could also build an integer-to-vector dictionary by hand instead of using an Embedding layer, but it is up to you.
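
A minimal sketch of that manual alternative, using only NumPy. The random vectors, the seed and the three example rows are illustrative values I made up for the sketch; the vectors are fixed, so unlike an Embedding layer inside a model they are never trained:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dimension = 10

# One fixed vector per integer class (0, 1, 2)
int_to_vector = {i: rng.normal(size=embedding_dimension) for i in range(3)}

# Illustrative mini-batch: class ids plus two continuous features per row
classes = np.array([0, 2, 0])
numeric = np.array([[3.5, 1.4],
                    [3.0, 5.1],
                    [3.2, 1.3]])

# Look up the vector for each row and append the continuous features
vectors = np.stack([int_to_vector[int(c)] for c in classes])
features = np.concatenate((vectors, numeric), axis=-1)

print(features.shape)
```

Rows that share a class get the same vector, which is all the dictionary lookup does.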

Answered By – AloneTogether

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
