Extract words from a column according to a range defined in a second column


I have a data frame with text in one column and, in another column, a tuple that specifies a window of interest. Consider this example:

import pandas as pd

df = pd.DataFrame(columns=['date', 'name', 'text', 'tuple'],
                  data = [['2011-01-01', "Peter",    "Das ist nicht vielversprechend.",            (101, 0, 3)],
                          ['2012-01-01', "Michelle", "Du bist nicht misstrauisch.",                (101, 1, 3)],
                          ['2013-01-01', "Michelle", "Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
                          ['2014-01-01', "Peter",    "Ich bin sehr nervös.",                       (101, 1, 3)]])

Ignoring the first entry of the tuple, I would like to extract the word span defined by elements 1 and 2 of the tuple (zero-indexed, with the end index exclusive) from the text column and add it as a new column (words_of_interest).

For example, the first row should yield words 0-2 (up to but excluding word number 3).

Expected output:

"Das ist nicht",
"bist nicht",
"Das",
"bin sehr"

I have tried several variations of .astype(str).str.split().str[i] for the strings and .str.get(1) for the span, to no avail. Can someone help me?

Thanks in advance!


One approach:

df["result"] = [" ".join(text.split()[start:end]) for text, (_, start, end) in zip(df["text"], df["tuple"])]


         date      name  ...        tuple         result
0  2011-01-01     Peter  ...  (101, 0, 3)  Das ist nicht
1  2012-01-01  Michelle  ...  (101, 1, 3)     bist nicht
2  2013-01-01  Michelle  ...  (101, 0, 1)            Das
3  2014-01-01     Peter  ...  (101, 1, 3)       bin sehr

[4 rows x 5 columns]
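To see what the comprehension does for each row, the per-row step can be pulled out into a small function. This is only a sketch: extract_span is a hypothetical helper name, not something from pandas, and it assumes every span is a 3-tuple of (identifier, start, end).

```python
def extract_span(text, span):
    # span is (identifier, start, end); the identifier is ignored
    _, start, end = span
    words = text.split()               # split on any whitespace
    return " ".join(words[start:end])  # end is exclusive, as in any Python slice


print(extract_span("Das ist nicht vielversprechend.", (101, 0, 3)))  # Das ist nicht
print(extract_span("Ich bin sehr nervös.", (101, 1, 3)))             # bin sehr
# A span that runs past the last word is truncated instead of raising an error
print(extract_span("Ich bin sehr nervös.", (101, 2, 10)))            # sehr nervös.
```

Because the slice end is exclusive, (101, 0, 3) gives words 0, 1 and 2, which matches the behaviour asked for in the question. Punctuation stays attached to its word, since split() only separates on whitespace.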

If there are NaN values in the tuple column, you could do:

import numpy as np

tuples = [(None, None, None) if pd.isna(v) else v for v in df["tuple"]]
df["result"] = [" ".join(text.split()[start:end]) if start is not None else np.nan
                for text, (_, start, end) in zip(df["text"], tuples)]
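A self-contained way to try the missing-value case is sketched below. The two-row frame and the helper name span_or_nan are made up for illustration, and the sketch assumes that any cell that is not a tuple (NaN, None) should produce NaN in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text": ["Das ist nicht vielversprechend.", "Du bist nicht misstrauisch."],
    "tuple": [(101, 0, 3), np.nan],  # the second row has no span
})


def span_or_nan(text, span):
    # Anything that is not a tuple (e.g. NaN or None) is treated as missing
    if not isinstance(span, tuple):
        return np.nan
    _, start, end = span
    return " ".join(text.split()[start:end])


df["result"] = [span_or_nan(t, s) for t, s in zip(df["text"], df["tuple"])]
print(df)
```

Checking for a tuple rather than for NaN keeps the unpacking safe whatever kind of placeholder ends up in the column.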

Answered By – Dani Mesejo

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
