Planner Fails to Optimize Continuous Aggregate Query That Ignores Aggregated Columns #1867
I have also seen this in TimescaleDB 1.6.1. From the definition of the continuous aggregate view, it seems like the continuous aggregate hypertable can contain partial aggregates that need to be finalized at query time; see the call to `finalize_agg` in the view definition. I have, however, not seen this being the case in reality.
I tried to understand why the continuous aggregates were designed the way they are and found the following explanations. From https://github.com/timescale/timescaledb/blob/master/tsl/src/continuous_aggs/Readme.md:
From https://github.com/timescale/timescaledb/blob/master/tsl/src/partialize_finalize.c:
and
So if I understand this correctly, the design is supposed to:
Regarding the possibility of reusing the data for different rollup resolutions, I can see how that is supposed to work. However, you are also adding support for defining multiple continuous aggregates on a single hypertable (#1869), so is this still a valid reason?

Regarding parallelizing the calculation of aggregates over multiple nodes, you would still need to wait to store the calculated aggregates until all nodes are done, or you would compute incorrect final aggregates at query time. So you could just as well store the final aggregate at that point.

Regarding improving performance, the whole point of using a continuous aggregate is to pay the computational cost up front so that the cost at query time becomes lower. To me it seems like the current design leaves too much computation to be done at query time, and that it would be better to pay this additional cost when materializing instead.

Does my reasoning make sense, or have I missed some important aspect of the continuous aggregates? It's very likely, since TimescaleDB is quite new to me :)
I was also wondering why the aggregated data is not stored directly. Two disadvantages of storing partial aggregates are:
@NomAnor You gave the reason: to calculate aggregates like "avg". You can't aggregate two "average" values together to get a new correct "average" the way you can with "max" or "sum"; you need the sum and the count. By keeping a compact partial value, you can merge multiple rows of the continuous aggregate table, or rows of the CA table and the real-time table, and still get a correct value. It's not possible otherwise, and it's rather clever that they made this work.
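A quick worked illustration of why avg partials must carry sum and count rather than the finished average (plain SQL with made-up numbers):

```sql
-- Bucket A holds values (1, 2, 3): avg = 2,  sum = 6,  count = 3
-- Bucket B holds values (10):      avg = 10, sum = 10, count = 1
SELECT (2 + 10) / 2.0       AS avg_of_avgs,  -- 6.0: wrong
       (6 + 10) / (3 + 1.0) AS merged_avg;   -- 4.0: correct
```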
I understand that you can't generally combine finalized aggregates. It would be nice if there were an option to disable partial aggregates when you know you don't need them.
But this is not how it works in reality. Since the CA view is grouping by the time column, the
An alternative solution is to document how to properly calculate an average, and let the user create their own sum and count columns, e.g. something like the sketch below.
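The original example wasn't preserved here; a minimal sketch of the idea, assuming a hypothetical hypertable `conditions(time timestamptz, value double precision)`. (This uses the 1.x `CREATE VIEW ... WITH (timescaledb.continuous)` form; newer versions use `CREATE MATERIALIZED VIEW` instead.)

```sql
-- Store explicit sum and count columns so a correct average can be
-- derived at any rollup resolution, without partial-aggregate state.
CREATE VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       sum(value)   AS value_sum,
       count(value) AS value_count
FROM conditions
GROUP BY bucket;

-- Re-bucket to a day and compute a correct average from the stored columns:
SELECT time_bucket('1 day', bucket) AS day,
       sum(value_sum) / sum(value_count) AS avg_value
FROM conditions_hourly
GROUP BY day;
```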
That was actually something I was experimenting with: aggregating multiples of buckets together. The only way to do that was to have my own count, sum and sumxx columns and do the calculations myself. Thinking about it, here is what I would like the planner to do when I write a select query on a continuous aggregate view (if it is actually possible to implement):
Does this make sense? I think I'm bad at explaining.
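A hedged sketch of the hand-rolled approach just described, assuming hypothetical columns `cnt`, `s`, and `sxx` (count, sum, sum of squares) on an hourly CA named `readings_hourly`:

```sql
-- Re-bucket hourly hand-rolled partial columns to daily, deriving the
-- average and the population variance (E[X^2] - E[X]^2) from count,
-- sum and sum-of-squares:
SELECT time_bucket('1 day', bucket)                AS day,
       sum(s) / sum(cnt)                           AS avg_value,
       sum(sxx) / sum(cnt) - (sum(s) / sum(cnt))^2 AS variance
FROM readings_hourly
GROUP BY day;
```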
The current implementation has deficiencies, one of which was why this bug was filed. (I also find the aggregate discussion amusing considering this bug makes no use of or reference to any aggregates!)

@NomAnor, the existence of the group-by in the CA view could be because they are allowing for the possibility that multiple rows for the same key may exist in the underlying materialized hypertable. For example, forcing multiple update runs before a bucket is "closed" might write a new row covering just the new data for each run. Then you would have multiple rows for a single key and bucket. The group-by and partial aggregate would collapse these transparently and you'd get the expected result. I'm not sure if this is a feature or a bug.

As for the rebucketing described by @netrounds-fredrik and @NomAnor, the real question is what sort of query rewrites can be implemented under the hood. I agree that using the current CA view does not get you the desired result, but the use of the partial aggregate column in the materialized hypertable means that it IS possible to write the query you want right now (again, roughly as @NomAnor proposed). That is, you can write your rebucketing query against the underlying materialized hypertable and call the finalize function yourself; you don't need additional columns. I've done it and it works fine, but of course you expose yourself to implementation details subject to change, use at your own risk, etc. etc.

As to whether rewrite support is on the Timescale roadmap, I don't know; I assumed it was. The purpose of this bug was to point out a different use case (no aggregate column at all) that also needs to be taken into consideration when such query rewriting is implemented.
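A hedged sketch of that direct approach. The internal schema objects, the generated partial-state column name (`agg_3_3`), the `bucket` column, and the materialized hypertable name are all installation-specific assumptions here; the exact `finalize_agg` argument list is best copied from the CA view's own definition (e.g. `\d+ <view>` in psql).

```sql
-- Re-bucket hourly avg partials to 6-hour buckets by merging the stored
-- partial state and finalizing it ourselves. This is internal API and
-- may change between versions; use at your own risk.
SELECT time_bucket('6 hours', bucket) AS bucket_6h,
       _timescaledb_internal.finalize_agg(
           'avg(double precision)',   -- aggregate being finalized
           NULL, NULL, NULL,          -- collation / input-type metadata
           agg_3_3,                   -- generated partial-state column
           NULL::double precision)    AS avg_value
FROM _timescaledb_internal._materialized_hypertable_2
GROUP BY bucket_6h;
```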
Rewriting the query like you describe would only work when using an aggregation function that matches the one used to calculate the partial aggregates. You would, for example, not be able to calculate the max of a column whose partial aggregates were computed for avg. Also, as I wrote before, it seems like the
Sure, the scope of the discussion in this issue has widened a bit to cover other problems with continuous aggregates. But I think it's good to look at it from a wider perspective, since it's all related.
@louisth The purpose of continuous aggs is to speed up queries that would usually aggregate all the data if you were to run them against the hypertable. It is not intended to speed up the query you have.
@gayyappan, any thoughts on the problems with the actual use cases for continuous aggregates that have also been discussed here? Should I create a separate issue instead?
@gayyappan I disagree. I am using the continuous aggregate to speed up queries that would usually aggregate all the data. I'm speeding up the aggregate "distinct", or group-by: I am looking for what values exist during certain time ranges. (I didn't put a time range on this simple repro, but I have one in production.) The CA collapses lots of points into just one per time bucket, and it's much faster to query the bucketed data than the whole table; that's the point. And conceptually it works well, as was demonstrated in the second part of the repro above.

The problem is that the view is not acting like the table it is pretending to be, and there is no supported workaround or toolkit to get the desired behavior. Also, the problem is not that I'm ignoring a column with an aggregation like avg or max. I'm ignoring columns that have the group-by aggregation, and yet the planner is still grouping by those columns first (and slowly too, for some reason). Any regrouping on the CA would have this problem, whether there's an avg column or not.

If you are arguing that CAs do not support any modification of the original query except changing the time range, then the bug is that this limitation needs to be made very clear in the documentation, because it is not mentioned anywhere right now!
That's also one of my use cases. I have an aggregate view that has no aggregate functions and is used to check which time buckets have data rows with a certain column value. That way I only need to check those time buckets against the data table (if the approximate result is not enough).
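A hedged sketch of that pattern, with hypothetical names (`events(time, status, ...)`); the 1.x view syntax is used here, as above:

```sql
-- A continuous aggregate with no aggregate functions at all, used as a
-- coarse index of which (bucket, status) pairs contain any rows.
CREATE VIEW events_buckets
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       status
FROM events
GROUP BY bucket, status;

-- Narrow down to candidate buckets first, then query the raw table
-- for those ranges only if the approximate answer is not enough:
SELECT bucket FROM events_buckets WHERE status = 'error';
```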
@netrounds-fredrik As you point out, finalize_agg is not parallelized now. Parallelization support is on the continuous aggregate roadmap.
Relevant system information:
Describe the bug
A simple query (a select and group-by on one column) took 57x longer to run against the CA (45031.102 ms) than to run against the CA's underlying materialized hypertable and the real-time hypertable and merge the results together (788.596 ms).
To Reproduce
Takes 45s.
Takes 0.8s.
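The original repro queries were not preserved in this copy; below is a hedged sketch of the two query shapes being compared, using hypothetical table and view names (`readings`, `readings_summary`) and internal object names that vary per installation:

```sql
-- Slow (~45 s): regrouping on the continuous aggregate view.
SELECT device_id
FROM readings_summary
GROUP BY device_id;

-- Fast (~0.8 s): the manually unrolled equivalent, merging the CA's
-- materialized hypertable with the not-yet-materialized raw data.
SELECT device_id
FROM _timescaledb_internal._materialized_hypertable_2
GROUP BY device_id
UNION
SELECT device_id
FROM readings
WHERE time > '2020-04-01'  -- placeholder for the materialization threshold
GROUP BY device_id;
```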
Expected behavior
Querying against the continuous aggregate should take about the same time as the manually unrolled query.
Actual behavior
Querying against the continuous aggregate takes over fifty times longer than the manually unrolled query.