Issue
I have a tensor ‘input_sentence_embed’ with shape torch.Size([1, 768]).
There is a dataframe ‘matched_df’ that looks like this:
INCIDENT_NUMBER enc_rep
0 INC000030884498 [[tensor(-0.2556), tensor(0.0188), tensor(0.02...
1 INC000029956111 [[tensor(-0.3115), tensor(0.2535), tensor(0.20..
2 INC000029555353 [[tensor(-0.3082), tensor(0.2814), tensor(0.24...
3 INC000029555338 [[tensor(-0.2759), tensor(0.2604), tensor(0.21...
Each tensor element in the dataframe has this shape:
matched_df['enc_rep'].iloc[0].size()
torch.Size([1, 768])
I want to find the Euclidean / cosine similarity between ‘input_sentence_embed’ and each row of ‘matched_df’ efficiently.
If they were scalar values, I could easily have broadcast ‘input_sentence_embed’ as a new column in ‘matched_df’ and then computed the cosine similarity between the two columns.
I am struggling with two problems:
- How to broadcast ‘input_sentence_embed’ as a new column to the ‘matched_df’
- How to find cosine similarity between tensors stored in two columns
Maybe someone can also suggest an easier method to achieve the end goal: efficiently finding the similarity between one tensor and all the tensors stored in a dataframe column.
Solution
Input data:
import pandas as pd
import numpy as np
from torch import tensor
match_df = pd.DataFrame({
    'INCIDENT_NUMBER': ['INC000030884498',
                        'INC000029956111',
                        'INC000029555353',
                        'INC000029555338'],
    'enc_rep': [[[tensor(0.2971), tensor(0.4831), tensor(0.8239), tensor(0.2048)]],
                [[tensor(0.3481), tensor(0.8104), tensor(0.2879), tensor(0.9747)]],
                [[tensor(0.2210), tensor(0.3478), tensor(0.2619), tensor(0.2429)]],
                [[tensor(0.2951), tensor(0.6698), tensor(0.9654), tensor(0.5733)]]],
})
input_sentence_embed = [[tensor(0.0590), tensor(0.3919), tensor(0.7821), tensor(0.1967)]]
- How to broadcast ‘input_sentence_embed’ as a new column of ‘matched_df’
match_df["input_sentence_embed"] = [input_sentence_embed] * len(match_df)
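If the cells hold real torch tensors of shape [1, d], as in the question, the extra column may not be needed at all: PyTorch can broadcast the single query against a stacked matrix. The following is a minimal sketch under that assumption, using toy 4-dimensional tensors. The list `enc_reps` is a hypothetical stand-in for something like `matched_df['enc_rep'].tolist()`.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for matched_df['enc_rep'].tolist():
# each item is a tensor of shape [1, 4] (the question's tensors are [1, 768]).
enc_reps = [
    torch.tensor([[0.2971, 0.4831, 0.8239, 0.2048]]),
    torch.tensor([[0.3481, 0.8104, 0.2879, 0.9747]]),
]
input_sentence_embed = torch.tensor([[0.0590, 0.3919, 0.7821, 0.1967]])

# Concatenate the per-row [1, d] tensors into a single [N, d] matrix.
matrix = torch.cat(enc_reps, dim=0)

# The [1, d] query broadcasts against [N, d]; the result has shape [N],
# one similarity per dataframe row.
cosine = F.cosine_similarity(input_sentence_embed, matrix, dim=1)
```

The result could then be written back with something like `matched_df['cosine_similarity'] = cosine.numpy()`, so only one pass over the column is needed.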
- How to find cosine similarity between tensors stored in two columns
a = np.vstack(match_df["enc_rep"])   # shape (4, 4): one row per incident
b = np.hstack(input_sentence_embed)  # shape (4,): the query vector
# Divide by each row's own norm (axis=1), not the norm of the whole matrix
match_df["cosine_similarity"] = a.dot(b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))
Output result:
INCIDENT_NUMBER enc_rep input_sentence_embed cosine_similarity
0 INC000030884498 [[tensor(0.2971), tensor(0.4831), tensor(0.823... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.971748
1 INC000029956111 [[tensor(0.3481), tensor(0.8104), tensor(0.287... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.624403
2 INC000029555353 [[tensor(0.2210), tensor(0.3478), tensor(0.261... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.820259
3 INC000029555338 [[tensor(0.2951), tensor(0.6698), tensor(0.965... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.952969
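The question also asks about Euclidean distance, which can be computed with the same stacked-matrix idea. This is a minimal sketch, assuming plain floats in place of the tensors (values copied from the sample data); the names `a`, `b` and `euclidean_distance` are only illustrative.

```python
import numpy as np

# Plain floats stand in for the tensors; one row per incident.
a = np.array([[0.2971, 0.4831, 0.8239, 0.2048],
              [0.3481, 0.8104, 0.2879, 0.9747],
              [0.2210, 0.3478, 0.2619, 0.2429],
              [0.2951, 0.6698, 0.9654, 0.5733]])
b = np.array([0.0590, 0.3919, 0.7821, 0.1967])  # the query vector

# b (shape (4,)) broadcasts across every row of a (shape (4, 4));
# axis=1 gives one distance per row.
euclidean_distance = np.linalg.norm(a - b, axis=1)
```

Unlike cosine similarity, a smaller value here means a closer match, so the result would be sorted in ascending order when ranking incidents.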
Answered By – Corralien
This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.