Issue
If I have data like so:
+------+----+-------+-------+
| year | id | value | group |
+------+----+-------+-------+
| 2019 | 1 | 10 | A |
| 2019 | 1 | 10 | B |
| 2019 | 2 | 20 | A |
| 2019 | 3 | 30 | A |
| 2019 | 2 | 20 | B |
| 2020 | 1 | 5 | A |
| 2020 | 1 | 5 | B |
| 2020 | 2 | 10 | A |
| 2020 | 3 | 15 | A |
| 2020 | 2 | 10 | B |
+------+----+-------+-------+
Is there a way to calculate the average value
based on the distinct id
while preserving all the data?
I need to do this because I will also have WHERE
clause(s) to filter other columns in the table, but I also need to get an overall view of the data in the case the WHERE
clause(s) are not added (these WHERE filters will be added by an automated software in the OUTERMOST query which I can’t control).
The group
column is an example.
For the above example, the results should be:
Overall --> 20 for 2019 and 10 for 2020
WHERE group = 'A'
–> 20 for 2019 and 10 for 2020
WHERE group = 'B'
–> 15 for 2019 and 7.5 for 2020
I tried to do the following:
SELECT
year,
AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
WHERE group = 'A' -- this clause may or may not exist
GROUP BY year
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL
and thus it would not be counted into the calculation, but unfortunately I can’t put analytical functions inside aggregate
functions.
Solution
While the data model is inappropriate and not normalized (you are storing values redundantly), the real problem is the late automated SQL injection (the optionally added where clause).
When a where clause gets added to your query, everything is fine, because the where clause properly restricts the rows to take into consideration (group A or B). When no where clause gets added, however, you would have to work on an aggregated data set (distinct year/id rows). The latter means an aggreation on an aggregation, which can be done with a subquery as was shown by DineshDB in an earlier answer. But here you have the problem that the where clause must work on the intermediate result (the subquery) and you say that your software adds the where clause to the main query instead.
The surprising solution to this is making this three aggregations. In below query I am mixing MAX
(first aggregation), AVG OVER
(second aggregation), and DISTINCT
(third aggregation) and the three can happily co-exist in one query. No subquery is needed.
SELECT DISTINCT
year,
AVG(MAX(value)) OVER (PARTITION BY year)
FROM yourtable
WHERE `group` = ... -- optional where clause
GROUP BY year, id
ORDER BY year;
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=973ae4f260597392c55f260d3c260084
Answered By – Thorsten Kettner
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0