If I have data like so:
+------+----+-------+-------+ | year | id | value | group | +------+----+-------+-------+ | 2019 | 1 | 10 | A | | 2019 | 1 | 10 | B | | 2019 | 2 | 20 | A | | 2019 | 3 | 30 | A | | 2019 | 2 | 20 | B | | 2020 | 1 | 5 | A | | 2020 | 1 | 5 | B | | 2020 | 2 | 10 | A | | 2020 | 3 | 15 | A | | 2020 | 2 | 10 | B | +------+----+-------+-------+
Is there a way to calculate the average
value based on the distinct
id while preserving all the data?
I need to do this because I will also have
WHERE clause(s) to filter other columns in the table, but I also need to get an overall view of the data in the case the
WHERE clause(s) are not added (these WHERE filters will be added by an automated software in the OUTERMOST query which I can’t control).
group column is an example.
For the above example, the results should be:
Overall --> 20 for 2019 and 10 for 2020
WHERE group = 'A' –> 20 for 2019 and 10 for 2020
WHERE group = 'B' –> 15 for 2019 and 7.5 for 2020
I tried to do the following:
SELECT year, AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg FROM table WHERE group = 'A' -- this clause may or may not exist GROUP BY year
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be
NULL and thus it would not be counted into the calculation, but unfortunately I can’t put analytical functions inside
While the data model is inappropriate and not normalized (you are storing values redundantly), the real problem is the late automated SQL injection (the optionally added where clause).
When a where clause gets added to your query, everything is fine, because the where clause properly restricts the rows to take into consideration (group A or B). When no where clause gets added, however, you would have to work on an aggregated data set (distinct year/id rows). The latter means an aggreation on an aggregation, which can be done with a subquery as was shown by DineshDB in an earlier answer. But here you have the problem that the where clause must work on the intermediate result (the subquery) and you say that your software adds the where clause to the main query instead.
The surprising solution to this is making this three aggregations. In below query I am mixing
MAX (first aggregation),
AVG OVER (second aggregation), and
DISTINCT (third aggregation) and the three can happily co-exist in one query. No subquery is needed.
SELECT DISTINCT year, AVG(MAX(value)) OVER (PARTITION BY year) FROM yourtable WHERE `group` = ... -- optional where clause GROUP BY year, id ORDER BY year;
Answered By – Thorsten Kettner