## Issue

I have the following dataframe:

```
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
```

I am trying to do a statistic that contains the number of questions answered by each group (it doesn’t matter how many times), the total number of questions, the number of questions answered by both groups and the ratio between mutual questions and total questions:

```
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions Ratio
g1 g2 3 1 4 1 0.25
```

What I did was actually counting unique values "manually" with `groupby`

, `nunique`

etc, because it’s simple when the `group`

column has only two distinct values. However, **is it possible to compare pairs of two groups and build dataframes with one row like above if my dataframe has many groups**? Example:

```
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
f g3 True abc
f g3 True www
g g3 False xyz
h g4 True www
h g4 True qqq
i g4 False xyz
```

So I would want to compare `g1`

with `g2`

, `g1`

with `g3`

, `g1`

with `g4`

, `g2`

with `g3`

and `g3`

with `g4`

.

```
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions Ratio
g1 g2 3 1 4 1 0.25
g1 g3 3 3 4 2 0.50
g1 g4 3 3 4 1 0.25
g2 g3 1 3 4 1 0.25
g2 g4 1 3 4 1 0.25
g3 g4 3 3 4 2 0.50
```

## Solution

Here’s an attempt. First, find the set of questions answered for each group:

```
>>> g = df.groupby('group')['question'].apply(set)
>>> g
group
g1 {xyz, abc, def}
g2 {xyz}
g3 {xyz, abc, www}
g4 {qqq, xyz, www}
Name: question, dtype: object
```

Then, use `itertools`

to get the combinations you want, computing the indices for each desired column:

```
import itertools
g = df.groupby('group')['question'].apply(set) # as above
rows = []
total = len(df.question.unique())
for a, b in itertools.combinations(g.index, 2):
rows.append({'Team1':a,
'Team2':b,
'NumberQuestionsT1':len(g[a]),
'NumberQuestionsT2':len(g[b]),
'AllQuestions': total,
'CommonQuestions': len(g[a] & g[b]),
'Ratio': len(g[a] & g[b]) / total})
output = pd.DataFrame(rows)
```

Giving:

```
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions \
0 g1 g2 3 1 5
1 g1 g3 3 3 5
2 g1 g4 3 3 5
3 g2 g3 1 3 5
4 g2 g4 1 3 5
5 g3 g4 3 3 5
CommonQuestions Ratio
0 1 0.2
1 2 0.4
2 1 0.2
3 1 0.2
4 1 0.2
5 2 0.4
```

I’ve replace the `AllQuestions`

calculation with the one you commented. Now, it seems like the difference is that your example input has 5 question types (`abc`

, `def`

, `qqq`

, `www`

, `xyz`

)?

I couldn’t think of a way to do this without `itertools`

, and I am also using a for loop for the initial determination of every `set`

of questions. So I reckon there could be a significantly faster method, but this may be sufficient given the size of your data.

Answered By – Tom

**This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 **