Pandas Multi Index Dataframe Transform Raw Score into Quantile by Date and add Percentile Column

Issue

I have a pandas multi index that looks like this:

enter image description here

I would like to transform the Raw_Score into a percentile, where the comparison is against the day. So the top score of the day gets a 100, and the bottom score of the day gets at zero, with the other scores proportionately aligned. I would like to do this at the "all data", "Sector", or "Region" level. The dataframe would have a new column called "Adjusted_Score". I think I have to use groupby and quantile, but I am lost. Can someone point me in the right direction?

NOTE: Revised to show working code

import pandas as pd
import numpy as np

from datetime import datetime
from numpy import random

# ----------------------------------
# Set up a sample dataframe
# ----------------------------------

def create_df(num):
  # create empty data frame in pandas
  df = pd.DataFrame()

  # add a range of dates
  dates = pd.date_range(start="2022-04-01",end="2022-06-05").to_pydatetime().tolist()
  df['Dates'] = dates

  # generate a random 3 digit Value
  #for i in range(len(dates)):
  df['Raw_Score'] = np.random.randint(-999, 999, size=len(df))

  # generate a random 2 digit Sector between 10 and 15
  #for i in range(len(dates)):
  df['Sector'] = np.random.randint(10, 20, size=len(df))

  # give it a unique value
  df['Region'] = num

  return df

# make a big df
big_df = []

for num in range(10):
  df = create_df(num)
  df = pd.DataFrame(df)

  while num == 0:
    big_df = df.copy()
    num = num + 1
  else:
    big_df = pd.concat((big_df, df), axis=0)


df = big_df.copy()
df.reset_index(inplace=True)
df = df.drop_duplicates()

df.set_index(["Region", "Sector", "Dates"], inplace=True)

# ----------------------------------
# Actual Problem Below
# ----------------------------------

# Step 1: Getting rank with groupby:
df['rank'] = df.groupby('Dates')[['Raw_Score']].apply(lambda g: g.rank(method='first'))

# Step 2: Obtain the max value:
df['max'] = df.groupby('Dates')[['rank']].transform(lambda g: g.max())

# Step 3:Calculate percentile by n/N*100:
df['percentile'] = df['rank'] / df['max'] * 100

df

Thank you!

Solution

Sample code:

# Generate sample data:
df = pd.DataFrame({
    'var': ['a']*10 + ['b']*10,
    'val': np.random.permutation(20)})

# Step 1: Getting rank with groupby:
df['rank'] = df.groupby('var')[['val']].apply(lambda g: g.rank(method='first'))

# Step 2: Obtain the max value:
df['max'] = df.groupby('var')[['rank']].transform(lambda g: g.max())

# Step 3:Calculate percentile by n/N*100:
df['percentile'] = df['rank'] / df['max'] * 100

If your dataset is large and you want to speed up the process: Step 2 simply:

subsample = df.groupby('var')[['rank']].max()
df = pd.merge(df, subsample, on=['var'], how='left')

For some reason, merge in Pandas works much faster than transform method

3 level of multi-index follows the same code above

Answered By – PTQuoc

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published