Issue
I have a movie dataset with a column that lists each movie's genres, like this:
title genres
t1 ['Drama', 'Science Fiction', 'War']
t2 ['Action', 'Crime']
I want to encode them to be:
title  Drama  Science Fiction  War  Action  Crime
t1     1      1                1    0       0
t2     0      0                0    1       1
I have tried MultiLabelBinarizer, but the output came out to be:
, A D F S W a c d e i m n o r t u v
0 1 1 1 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1
1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0
How can I solve this problem? Is there another way for me to achieve this?
Any help would be greatly appreciated.
Solution
Assuming this is your df:
title genres
0 t1 [Drama, Science Fiction, War]
1 t2 [Action, Crime]
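You can explode the lists into one row per genre, one-hot encode them, and then sum per title:
import ast
import pandas as pd

# Edit: if df.genres holds string representations of lists (e.g. "['Action', 'Crime']"),
# parse them into real lists first. ast.literal_eval only accepts Python literals,
# so it is a safer replacement for eval here.
df.genres = df.genres.apply(ast.literal_eval)
# One row per (title, genre) pair
exploded_df = df.explode(column='genres')
# One-hot encode the genres, then collapse back to one row per title
pd.get_dummies(exploded_df, columns=['genres']).groupby('title', as_index=False).sum()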
# output
title genres_Action genres_Crime genres_Drama genres_Science Fiction genres_War
0 t1 0 0 1 1 1
1 t2 1 1 0 0 0
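As a possible alternative, MultiLabelBinarizer itself can produce the layout asked for in the question once each cell is a real list. The single-character columns in the question suggest the genres were stored as strings, which MultiLabelBinarizer iterates character by character. The sketch below assumes that is the case and rebuilds a small sample frame for illustration; the names `mlb`, `encoded` and `result` are made up for this example.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data mirroring the question; genres are assumed to be stored as strings.
df = pd.DataFrame({
    "title": ["t1", "t2"],
    "genres": ["['Drama', 'Science Fiction', 'War']", "['Action', 'Crime']"],
})

# Parse each string into a real Python list (literal_eval only accepts literals).
df["genres"] = df["genres"].apply(ast.literal_eval)

# Fit on the lists of labels; classes_ holds the sorted genre names.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["genres"]),
    columns=mlb.classes_,
    index=df.index,
)

# Put the title back next to the indicator columns.
result = pd.concat([df[["title"]], encoded], axis=1)
print(result)
```

The indicator columns come out in alphabetical order (Action, Crime, Drama, Science Fiction, War) without the genres_ prefix, so reorder them afterwards if the exact column order from the question matters.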
Answered By – Ricardo
This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.