Issue
So I have 2 dataframes that I’m joining by their redefined index which is the number we use to identify the study, when I’m joining them they look like this:
df1 (contains all study numbers):
Index | State | PS |
---|---|---|
1001 | CA | 0 |
1002 | NY | 0 |
1003 | NJ | 1 |
df2 (does not contain all study numbers and contains duplicates):
Index | Study |
---|---|
1001 | Active |
1002 | Active |
1002 | Closed |
I currently have df1 = df1.join(df2) which outputs:
Index | State | PS | Study |
---|---|---|---|
1001 | CA | 0 | Active |
1002 | NY | 0 | Active |
1002 | NY | 0 | Closed |
1003 | NJ | 1 |
In this example df, I’d like only the first instance of 1002 in df2 to be merged with df1. Assuming it has something to do with ‘how’ or ‘on’, but I don’t understand the documentation well enough as I am pretty new to Pandas. Thanks!
Desired output is:
Index | State | PS | Study |
---|---|---|---|
1001 | CA | 0 | Active |
1002 | NY | 0 | Active |
1003 | NJ | 1 |
Solution
Try using drop_duplicates
with keep="first"
since it is sorted from newest to oldest. Then you merge on the key Index
df2 = df2.drop_duplicates(subset="Index", keep="first")
df = pd.merge(df1, df2, on="Index", how="left")
Answered By – Raymond Toh
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0