Find previous row with the same ID and condition on column value to assign weight

Issue

I have the following dataframe:

    id      date        status
--------------------------------
0    2     2019-02-04   reserve    
1    2     2020-01-01   reserve    
2    2     2020-01-02   sold      
3    3     2020-01-03   reserve    
4    4     2020-01-03   booked     
5    3     2020-02-05   reserve    
6    4     2020-02-06   sold       
7    3     2020-02-07   sold

I need to add weight for each row with the following condtion:

If row has status == sold then weight is 1.
For the rest, I have to do the following: for the df[i] find previous row with the same id and status and based on day diffrence apply weight like so (1 / days difference).

The desired dataframe is:

    id      date        status    weight
----------------------------------------
0    2     2019-02-04   reserve     0 
1    2     2020-01-01   reserve   0.003 
2    2     2020-01-02   sold        1
3    3     2020-01-03   reserve     0 
4    4     2020-01-03   booked      0 
5    3     2020-02-05   reserve   0.030  
6    4     2020-02-06   sold        1 
7    3     2020-02-07   sold        1

Here’s my attempt, for now I can’t deal with finding previous row w/ condition:

df.date = pd.to_datetime(df.date)
df.sort_values(['date', 'id'], inplace=True)
df = df.reset_index(drop='index')

df['weight'] = np.where(df['status'] == 'sold', 1, 0)
df = df.reset_index(drop='index')
df1 = df[~(df.status == 'sold')]
retain_index = list(df1.index.values)
df1 = df1.reset_index(drop='index')

df1['diff_in_dates'] = df1.groupby('flat_id')['date'].diff().dt.days.fillna(0)

for index, row in df1.iterrows():
    if index == 0:
        pass
    else:
        if df1.loc[index, 'weight'] == 0:
            if (df1.loc[index, 'status'] == df1.loc[index - 1, 'status']) & \
                                    (df1.loc[index, 'flat_id'] == df1.loc[index - 1, 'flat_id']):
                df1.loc[index, 'weight'] = 1 / df1.loc[index, 'diff_in_dates']

df1.weight.replace([np.inf, -np.inf], np.nan, inplace=True)
df1.weight = df1.weight.fillna(0)

#Here, in df1 I have weight = 0 for df[5] instead of 0.030

df = pd.DataFrame(df.append(df1, ignore_index = False)

Solution

You can use a GroupBy.diff difference, extract the days and get the inverse, then use numpy.where to chose between the sold/non-sold status to know how to fill the default:

s = df.groupby(['id', 'status'])['date'].diff().dt.days.rdiv(1)

import numpy as np
df['weight'] = np.where(df['status'].eq('sold'), 1, s.fillna(0))

output:

   id       date   status    weight
0   2 2019-02-04  reserve  0.000000
1   2 2020-01-01  reserve  0.003021
2   2 2020-01-02     sold  1.000000
3   3 2020-01-03  reserve  0.000000
4   4 2020-01-03   booked  0.000000
5   3 2020-02-05  reserve  0.030303
6   4 2020-02-06     sold  1.000000
7   3 2020-02-07     sold  1.000000

Answered By – mozway

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published