Encoding for movie genres in dataset

Issue

I have a movie dataset that contains a column that lists the movie’s genre as such:

title    genres          
t1       ['Drama', 'Science Fiction', 'War']
t2       ['Action', 'Crime']

I want to encode them to be:

title  Drama  Science Fiction  War  Action  Crime
t1     1      1                1    0       0
t2     0      0                0    1       1

I have tried MultiLabelBinarizer, but the output came out to be:

    ,   A   D   F   S   W   a   c   d   e   i   m   n   o   r   t   u   v
0   1   1   1   0   1   1   0   0   1   1   1   1   0   1   1   1   1   1   1
1   1   1   0   1   1   1   1   1   1   0   1   1   1   1   1   1   1   0   0

How can I solve this problem? Is there another way for me to achieve this?

Any help would be greatly appreciated.

Solution

Considering this is your df:

    title   genres
0   t1  [Drama, Science Fiction, War]
1   t2  [Action, Crime]

You should do something like this:

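import ast
import pandas as pd

# If df.genres holds strings that look like lists (e.g. after reading a CSV),
# parse them into real lists first; skip this line if they are already lists.
# ast.literal_eval only evaluates Python literals, so it is safer than eval.
df.genres = df.genres.apply(ast.literal_eval)

# One row per (title, genre) pair, then one-hot encode and sum back per title.
exploded_df = df.explode(column='genres')
pd.get_dummies(exploded_df, columns=['genres']).groupby('title', as_index=False).sum()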

# output
  title genres_Action   genres_Crime    genres_Drama    genres_Science Fiction  genres_War
0   t1  0               0               1               1                       1
1   t2  1               1               0               0                       0
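
As for the MultiLabelBinarizer attempt: the single-character columns (',', 'A', 'D', ...) suggest that each genres cell was a string rather than a list, so the binarizer iterated over its characters. If that is the case, parsing the strings first should make MultiLabelBinarizer work as well. A minimal sketch, assuming the column holds list-like strings (the sample DataFrame below is made up to mirror the question):

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data: genres stored as strings, as after reading a CSV.
df = pd.DataFrame({
    "title": ["t1", "t2"],
    "genres": ["['Drama', 'Science Fiction', 'War']", "['Action', 'Crime']"],
})

# Turn each string into a real Python list (literals only, no code execution).
df["genres"] = df["genres"].apply(ast.literal_eval)

# Each row is now a list of labels, so every genre becomes one column.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["genres"]),
    columns=mlb.classes_,
    index=df.index,
)

result = pd.concat([df[["title"]], encoded], axis=1)
print(result)
```

Here the column names are the plain genre labels from mlb.classes_, without the genres_ prefix that get_dummies adds.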

Answered By – Ricardo

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
