# Calculating average based on distinct ID while preserving all the data in a table?

## Issue

If I have data like so:

``````+------+----+-------+-------+
| year | id | value | group |
+------+----+-------+-------+
| 2019 |  1 |    10 | A     |
| 2019 |  1 |    10 | B     |
| 2019 |  2 |    20 | A     |
| 2019 |  3 |    30 | A     |
| 2019 |  2 |    20 | B     |
| 2020 |  1 |     5 | A     |
| 2020 |  1 |     5 | B     |
| 2020 |  2 |    10 | A     |
| 2020 |  3 |    15 | A     |
| 2020 |  2 |    10 | B     |
+------+----+-------+-------+
``````

Is there a way to calculate the average `value` based on the distinct `id` while preserving all the data?

I need to do this because I will also have `WHERE` clause(s) to filter other columns in the table, but I also need to get an overall view of the data in the case the `WHERE` clause(s) are not added (these WHERE filters will be added by an automated software in the OUTERMOST query which I can’t control).

The `group` column is an example.

For the above example, the results should be:

`Overall --> 20 for 2019 and 10 for 2020`

`WHERE group = 'A'` –> 20 for 2019 and 10 for 2020

`WHERE group = 'B'` –> 15 for 2019 and 7.5 for 2020

I tried to do the following:

``````SELECT
year,
AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
WHERE group = 'A' -- this clause may or may not exist
GROUP BY year
``````

Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be `NULL` and thus it would not be counted into the calculation, but unfortunately I can’t put analytical functions inside `aggregate` functions.

## Solution

While the data model is inappropriate and not normalized (you are storing values redundantly), the real problem is the late automated SQL injection (the optionally added where clause).

When a where clause gets added to your query, everything is fine, because the where clause properly restricts the rows to take into consideration (group A or B). When no where clause gets added, however, you would have to work on an aggregated data set (distinct year/id rows). The latter means an aggreation on an aggregation, which can be done with a subquery as was shown by DineshDB in an earlier answer. But here you have the problem that the where clause must work on the intermediate result (the subquery) and you say that your software adds the where clause to the main query instead.

The surprising solution to this is making this three aggregations. In below query I am mixing `MAX` (first aggregation), `AVG OVER` (second aggregation), and `DISTINCT` (third aggregation) and the three can happily co-exist in one query. No subquery is needed.

``````SELECT DISTINCT
year,
AVG(MAX(value)) OVER (PARTITION BY year)
FROM yourtable
WHERE `group` = ... -- optional where clause
GROUP BY year, id
ORDER BY year;
``````