Calculating average based on distinct ID while preserving all the data in a table?

Issue

If I have data like so:

+------+----+-------+-------+
| year | id | value | group |
+------+----+-------+-------+
| 2019 |  1 |    10 | A     |
| 2019 |  1 |    10 | B     |
| 2019 |  2 |    20 | A     |
| 2019 |  3 |    30 | A     |
| 2019 |  2 |    20 | B     |
| 2020 |  1 |     5 | A     |
| 2020 |  1 |     5 | B     |
| 2020 |  2 |    10 | A     |
| 2020 |  3 |    15 | A     |
| 2020 |  2 |    10 | B     |
+------+----+-------+-------+

Is there a way to calculate the average value based on the distinct id while preserving all the data?

I need to do this because I will also have WHERE clause(s) to filter other columns in the table, but I also need to get an overall view of the data in the case the WHERE clause(s) are not added (these WHERE filters will be added by an automated software in the OUTERMOST query which I can’t control).

The group column is an example.

For the above example, the results should be:

Overall --> 20 for 2019 and 10 for 2020

WHERE group = 'A' –> 20 for 2019 and 10 for 2020

WHERE group = 'B' –> 15 for 2019 and 7.5 for 2020

I tried to do the following:

SELECT 
  year,
  AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
WHERE group = 'A' -- this clause may or may not exist
GROUP BY year

Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can’t put analytical functions inside aggregate functions.

Solution

While the data model is inappropriate and not normalized (you are storing values redundantly), the real problem is the late automated SQL injection (the optionally added where clause).

When a where clause gets added to your query, everything is fine, because the where clause properly restricts the rows to take into consideration (group A or B). When no where clause gets added, however, you would have to work on an aggregated data set (distinct year/id rows). The latter means an aggreation on an aggregation, which can be done with a subquery as was shown by DineshDB in an earlier answer. But here you have the problem that the where clause must work on the intermediate result (the subquery) and you say that your software adds the where clause to the main query instead.

The surprising solution to this is making this three aggregations. In below query I am mixing MAX (first aggregation), AVG OVER (second aggregation), and DISTINCT (third aggregation) and the three can happily co-exist in one query. No subquery is needed.

SELECT DISTINCT
  year,
  AVG(MAX(value)) OVER (PARTITION BY year)
FROM yourtable
WHERE `group` = ... -- optional where clause
GROUP BY year, id
ORDER BY year;

Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=973ae4f260597392c55f260d3c260084

Answered By – Thorsten Kettner

This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply

(*) Required, Your email will not be published