Pandas Join creates unwanted duplicate, only want first instance

Issue

So I have 2 dataframes that I’m joining by their redefined index which is the number we use to identify the study, when I’m joining them they look like this:

df1 (contains all study numbers):

Index State PS
1001 CA 0
1002 NY 0
1003 NJ 1

df2 (does not contain all study numbers and contains duplicates):

Index Study
1001 Active
1002 Active
1002 Closed

I currently have df1 = df1.join(df2) which outputs:

Index State PS Study
1001 CA 0 Active
1002 NY 0 Active
1002 NY 0 Closed
1003 NJ 1

In this example df, I’d like only the first instance of 1002 in df2 to be merged with df1. Assuming it has something to do with ‘how’ or ‘on’, but I don’t understand the documentation well enough as I am pretty new to Pandas. Thanks!
Desired output is:

Index State PS Study
1001 CA 0 Active
1002 NY 0 Active
1003 NJ 1

Solution

Try using drop_duplicates with keep="first" since it is sorted from newest to oldest. Then you merge on the key Index

df2 = df2.drop_duplicates(subset="Index", keep="first")
df = pd.merge(df1, df2, on="Index", how="left")

Answered By – Raymond Toh

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published