Python: Extract Specific words/texts from a column

Issue

I am trying to pull some particular texts from another column in a dataset. The sample data is like this:

sample_data = pd.DataFrame({'Pet_Id': ['1', '2', '3', '4', '5'],
                            'Status': ['Sold', 'In store', 'Sold', 'In store', 'Sold'],
                            'Comment': ['6-12months - Dog -Labrador', 'Cat - BM-born 2019', 'Parrot - 12-24 months',
                                        'Over2Y - Dog', 'Cat']})

The output is as below:

  Pet_Id    Status                     Comment
0      1      Sold  6-12months - Dog -Labrador
1      2  In store          Cat - BM-born 2019
2      3      Sold       Parrot - 12-24 months
3      4  In store                Over2Y - Dog
4      5      Sold                         Cat

I wish to only get the animal type from the "Comment" column, in this case only "Dog", "Cat" or "Parrot" and put the values into a new column "Type", so the expected output is as below:

  Pet_Id    Status                     Comment    Type
0      1      Sold  6-12months - Dog -Labrador     Dog
1      2  In store          Cat - BM-born 2019     Cat
2      3      Sold       Parrot - 12-24 months  Parrot
3      4  In store                Over2Y - Dog     Dog
4      5      Sold                         Cat     Cat

I tried to use dict to map but not working well as there are quite a few records in the actual data, so would you please give some help?
Many thanks

Solution

Use str.extractall:

animals = ['Dog', 'Cat', 'Parrot']
pattern = fr"({'|'.join(animals)})"
sample_data['Type'] = sample_data['Comment'].str.extractall(pattern).values

Output:

>>> sample_data
  Pet_Id    Status                     Comment    Type
0      1      Sold  6-12months - Dog -Labrador     Dog
1      2  In store          Cat - BM-born 2019     Cat
2      3      Sold       Parrot - 12-24 months  Parrot
3      4  In store                Over2Y - Dog     Dog
4      5      Sold                         Cat     Cat

>>> print(pattern)
(Dog|Cat|Parrot)

Build the regex: fr"({'|'.join(animals)})

  • f for f-strings and r for raw-strings
  • (...) is for regex to capture group
  • ‘|’.join(animals) create a string where each item are separated by ‘|’

The regex captures one animal of the list.

Answered By – Corralien

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published