Issue
I have a DataFrame with text in one column and, in another column, a tuple that specifies a window of interest. Consider this example:
import pandas as pd
df = pd.DataFrame(
    columns=['date', 'name', 'text', 'tuple'],
    data=[
        ['2011-01-01', "Peter", "Das ist nicht vielversprechend.", (101, 0, 3)],
        ['2012-01-01', "Michelle", "Du bist nicht misstrauisch.", (101, 1, 3)],
        ['2013-01-01', "Michelle", "Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
        ['2014-01-01', "Peter", "Ich bin sehr nervös.", (101, 1, 3)],
    ],
)
Ignoring the first entry of the tuple, I would like to extract from the column text the word span defined by elements 1 and 2 of the tuple (zero-indexed, with the end index exclusive) and add it as a new column (words_of_interest).
For example, the first row should yield words 0-2 (up to and excluding word number 3).
Expected output:
"Das ist nicht"
"bist nicht"
"Das"
"bin sehr"
I have tried several variations of .astype(str).str.split().str[i] for the strings and .str.get(1) for the span, to no avail. Can someone help me?
Thanks in advance!
Solution
One approach:
df["result"] = [" ".join(text.split()[start:end]) for text, (_, start, end) in zip(df["text"], df["tuple"])]
print(df)
Output
date name ... tuple result
0 2011-01-01 Peter ... (101, 0, 3) Das ist nicht
1 2012-01-01 Michelle ... (101, 1, 3) bist nicht
2 2013-01-01 Michelle ... (101, 0, 1) Das
3 2014-01-01 Peter ... (101, 1, 3) bin sehr
[4 rows x 5 columns]
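The same slicing logic can also be pulled into a small named helper, which may be easier to read and test than the one-liner. This is only a sketch: extract_span is a hypothetical name chosen for illustration, and the result is written to the words_of_interest column the question asks for.

```python
import pandas as pd

df = pd.DataFrame(
    columns=["date", "name", "text", "tuple"],
    data=[
        ["2011-01-01", "Peter", "Das ist nicht vielversprechend.", (101, 0, 3)],
        ["2012-01-01", "Michelle", "Du bist nicht misstrauisch.", (101, 1, 3)],
        ["2013-01-01", "Michelle", "Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
        ["2014-01-01", "Peter", "Ich bin sehr nervös.", (101, 1, 3)],
    ],
)


def extract_span(text, span):
    """Hypothetical helper: join the whitespace-separated words start..end-1."""
    _, start, end = span  # the first tuple element is ignored
    return " ".join(text.split()[start:end])


# Pair each text with its tuple and slice the word list accordingly.
df["words_of_interest"] = [
    extract_span(text, span) for text, span in zip(df["text"], df["tuple"])
]
print(df["words_of_interest"].tolist())
# ['Das ist nicht', 'bist nicht', 'Das', 'bin sehr']
```

Note that text.split() splits on whitespace only, so punctuation stays attached to the neighbouring word (for example "nervös.").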
UPDATE
If there are NaN values in the column tuple, you could do:
import numpy as np
# Replace missing tuples with a placeholder so they can still be unpacked.
tuples = [(None, None, None) if pd.isna(v) else v for v in df["tuple"]]
df["result"] = [
    " ".join(text.split()[start:end]) if start is not None else np.nan
    for text, (_, start, end) in zip(df["text"], tuples)
]
print(df)
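A variant of the missing-value handling, sketched under the assumption that every valid entry in the column is a Python tuple: instead of substituting a placeholder, check the type directly and return NaN for anything else. The name extract_span_or_nan and the two-row frame are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the second row has no tuple.
df = pd.DataFrame(
    {
        "text": ["Das ist nicht vielversprechend.", "Du bist nicht misstrauisch."],
        "tuple": [(101, 0, 3), np.nan],
    }
)


def extract_span_or_nan(text, span):
    """Hypothetical helper: return the word span, or NaN if span is not a tuple."""
    if not isinstance(span, tuple):
        return np.nan
    _, start, end = span  # the first tuple element is ignored
    return " ".join(text.split()[start:end])


df["result"] = [
    extract_span_or_nan(text, span) for text, span in zip(df["text"], df["tuple"])
]
print(df)
```

The type check avoids unpacking a placeholder tuple, at the cost of silently mapping any non-tuple value (not just NaN) to NaN.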
Answered By – Dani Mesejo
This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.