Issue
I have a Pandas DataFrame read from a CSV file in which some columns hold string values and therefore have the object dtype. Because they are categorical, I convert them to the category dtype and then to their integer representation before fitting a random forest regressor:
    for col in df_raw.select_dtypes(include='object'):
        df_raw[col] = df_raw[col].astype('category')
        df_raw[col] = df_raw[col].cat.codes  # not 'category' dtype anymore
The problem is that once I do this, the dtype is immediately converted to int and I lose the cat information, which I need later.
For example, after the first line in the loop I can run df_raw[col].cat and get the indexed categories as expected. But once the second line is executed, the column dtype changes to int8 and I get the error:
    Can only use .cat accessor with a 'category' dtype
which, in a way, makes perfect sense, since its dtype is int8.
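The behaviour described above can be reproduced with a small sketch. The toy column `color` and its values are made up for illustration; only the two conversion steps come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for df_raw.
df_raw = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
col = "color"

# Step 1: after converting to 'category', the .cat accessor is available.
df_raw[col] = df_raw[col].astype("category")
categories_before = list(df_raw[col].cat.categories)

# Step 2: overwriting the column with its codes leaves plain integers.
df_raw[col] = df_raw[col].cat.codes
dtype_after = str(df_raw[col].dtype)

# The accessor is gone now; pandas raises AttributeError on access.
try:
    df_raw[col].cat
    cat_still_available = True
except AttributeError:
    cat_still_available = False

print(categories_before, dtype_after, cat_still_available)
```

The category labels are lost at step 2 because the assignment replaces the column rather than adding to it.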
Is it possible to preserve the category encoding information in the same DataFrame and at the same time have integer encodings in place to fit the regressor? How?
Solution
1. Simple idea
Why not use a derived column when fitting the regressor, e.g.:
    df_raw[col + '_calculated'] = df_raw[col].cat.codes
This way you have both: a categorical column col, which leaves the feature unchanged, and a "calculated" column with the ints needed for further processing.
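A fuller sketch of the derived-column idea might look like the following. The column names (`color`, `size`, `price`) and the toy values are assumptions for illustration; the `_calculated` suffix is taken from the snippet above.

```python
import pandas as pd

# Hypothetical toy data; dtype=object mimics string columns read from a CSV.
df_raw = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red", "green"], dtype=object),
    "size": pd.Series(["S", "L", "M", "S"], dtype=object),
    "price": [1.0, 2.5, 1.2, 3.0],
})

# Collect the column names first, because the loop adds new columns.
cat_cols = list(df_raw.select_dtypes(include="object").columns)
for col in cat_cols:
    df_raw[col] = df_raw[col].astype("category")         # keeps .cat info
    df_raw[col + "_calculated"] = df_raw[col].cat.codes  # integer encoding

# Integer-encoded features that could be handed to a regressor's fit method.
code_cols = [c + "_calculated" for c in cat_cols]
X = df_raw[code_cols]

# The original labels can still be recovered from the codes.
codes = df_raw["color_calculated"].to_numpy()
labels = df_raw["color"].cat.categories.take(codes)
```

The cost is one extra integer column per categorical feature, which is usually small compared with the original string data.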
2. More clever approach
Another approach is to wrap the DataFrame before passing it to the fit method, so that the regressor accesses .cat.codes instead of the categorical values directly:
    def access_wrapper(dframe, col):
        yield from dframe[col].cat.codes

    fit(..., access_wrapper(df, col))
This way you do not modify the DataFrame at all and do not copy the values from df[col], at the expense of calling dframe[col].cat.codes on each access to the values (which should be fairly quick).
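As far as I know, scikit-learn estimators expect array-like input rather than a generator, so a variation on this idea is to build the integer view on demand, just before fitting. The sketch below is not the wrapper from the answer: `codes_view` is a hypothetical helper, the toy data is made up, and unlike the generator it does create a temporary copy of the frame.

```python
import pandas as pd

def codes_view(dframe):
    """Return a new frame with category columns replaced by their codes.

    Hypothetical helper; the input frame itself is left unchanged.
    """
    cat_cols = dframe.select_dtypes(include="category").columns
    return dframe.assign(**{c: dframe[c].cat.codes for c in cat_cols})

# Hypothetical toy data with one categorical and one numeric column.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red", "green"]),
    "weight": [1.0, 2.5, 1.2, 3.0],
})

X = codes_view(df)  # integer-encoded copy, built only when needed
# regressor.fit(X, y)  # e.g. a scikit-learn RandomForestRegressor; y not shown
```

After the call, `df["color"]` still has the category dtype, so `.cat` remains usable on the original frame.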
Answered By – sophros
This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.