feat: Improve initial loading of persons page #29331
base: master
Conversation
PR Summary
Speed up person loading with optimized query execution
This PR introduces a significant performance optimization for loading the persons page by implementing a new query strategy that avoids unnecessary aggregation of all persons when only a limited number of results are needed.
- Added a `LimitByExpr` class to the HogQL AST and modified the grammar to properly support `LIMIT BY` clauses in ClickHouse SQL
- Implemented a `PersonsArgMaxVersion.V3` optimization that pushes limit, offset, and ordering into inner queries for more efficient execution
- Fixed SQL clause ordering in printer.py so that `LIMIT BY` appears before `LIMIT`, matching ClickHouse syntax
- Updated the visitor pattern implementation to properly handle the new `LimitByExpr` nodes throughout the codebase
- Added comprehensive tests to verify the optimization works correctly with various query patterns
28 file(s) reviewed, 4 comment(s)
Greptile
```python
inner_select.offset = clone_expr(node.offset, clear_locations=True, clear_types=True)
# Note: we actually change the outer node's behaviour here. This only works
# because the if statement above checks that the query is a 'simple' query
# (i.e. no joins, not inside a subquery). It would break if we tried this
# with, e.g., events.
node.offset = None
```
style: Setting `node.offset` to `None` could have unexpected side effects if this code is ever reused in a context where the outer query's offset is still needed after this function returns.
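One way to sidestep the risk this comment describes, sketched here as a hypothetical alternative (not the PR's actual code, and using a toy dict AST), is to return an optimized copy instead of mutating the caller's node:

```python
import copy

def optimized_copy(node: dict) -> dict:
    """Hypothetical alternative: deep-copy before mutating, so the
    caller's query node keeps its original offset."""
    node = copy.deepcopy(node)
    node.setdefault("inner", {})["offset"] = node.get("offset")
    node["offset"] = None
    return node
```

The trade-off is an extra deep copy of the whole AST on every query, which is presumably why the PR mutates in place behind the "simple query" guard.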
```python
inner_select_where.append(
    ast.CompareOperation(
        left=ast.Field(chain=["where_optimization", "id"]),
        right=parse_select("SELECT id FROM raw_persons as limit_delete_optimization WHERE is_deleted = 1"),
        op=ast.CompareOperationOp.NotIn,
    )
)
```
style: Using a subquery here could be inefficient if there are many deleted persons. Consider using a join or an `EXISTS` clause instead.
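For intuition, the `NOT IN (subquery)` filter above has the same semantics as a hash anti-join; here it is emulated in plain Python (illustrative only, names made up):

```python
def filter_deleted(persons, deleted_rows):
    """Keep persons whose id is NOT IN the set of deleted ids.

    Building the set once and probing it per row is the hash anti-join
    the review comment alludes to; re-running a correlated subquery per
    row would be the inefficient form.
    """
    deleted_ids = {row["id"] for row in deleted_rows}
    return [p for p in persons if p["id"] not in deleted_ids]
```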
```python
@snapshot_clickhouse_queries
def test_limit_and_order_by(self):
    response = execute_hogql_query(
        parse_select("select id, properties.$some_prop, created_at from persons ORDER BY created_at limit 3"),
        self.team,
    )
    assert response.clickhouse
    self.assertIn("where_optimization", response.clickhouse)
    assert [x[0] for x in response.results] == [
        self.first_person.uuid,
        self.second_person.uuid,
        self.third_person.uuid,
    ]
```
```python
response = execute_hogql_query(
    parse_select("select id, properties.$some_prop from persons ORDER BY created_at limit 2, 1"),
    self.team,
)
assert [x[0] for x in response.results] == [self.second_person.uuid, self.third_person.uuid]
```
```python
_create_event(event="$pageview", distinct_id="1", team=self.team)
_create_event(event="$pageview", distinct_id="2", team=self.team)
_create_event(event="$pageview", distinct_id="3", team=self.team)
response = execute_hogql_query(
    parse_select(
        "select id, persons.properties.$some_prop from events left join persons ON (events.person_id=persons.id) where persons.properties.$some_prop != 'something' limit 1"
    ),
    self.team,
)
assert len(response.results) == 1
```
logic: The test verifies that the optimization works correctly with ORDER BY and LIMIT clauses, but the second assertion looks wrong: it expects both the second_person and third_person UUIDs, yet the query uses `limit 2, 1`, which should return only one result (offset 2, limit 1). The assertion should probably check for just one UUID, not two.
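For reference, ClickHouse treats `LIMIT n, m` as `LIMIT m OFFSET n` (skip n rows, return up to m), which is the reading this comment assumes; emulated over a Python list:

```python
def limit_n_m(rows, n, m):
    """Emulate ClickHouse `LIMIT n, m`: skip the first n rows, then
    return up to m rows (equivalent to LIMIT m OFFSET n)."""
    return rows[n:n + m]

# Under this reading, `limit 2, 1` over three ordered persons yields
# only the third one.
```

If HogQL's own parser interprets the clause differently from ClickHouse, that would explain the test's two-row expectation; the snapshot of the generated SQL should settle which reading applies.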
```python
_create_event(event="$pageview", distinct_id="1", team=self.team)
_create_event(event="$pageview", distinct_id="2", team=self.team)
_create_event(event="$pageview", distinct_id="3", team=self.team)
```
style: These three lines create duplicate events with the same parameters as lines 70-72. This could lead to confusion about the test data state. Consider consolidating these event creations or adding a comment explaining why they're duplicated.
Size Change: 0 B. Total Size: 9.73 MB.
Problem
Don't merge until #29319 is merged
Loading the persons page is slow. This PR speeds it up by not aggregating all persons just to return 100 results.
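The "aggregating all persons" refers to the argMax-by-version deduplication of the persons table. ClickHouse's `LIMIT n BY key` clause, which this PR adds to HogQL's AST, lets the query keep only a bounded number of rows per person within an already-limited slice instead. Its semantics, emulated in Python (illustrative, not PostHog code):

```python
def limit_by(rows, n, key):
    """Emulate ClickHouse `LIMIT n BY key`: keep at most n rows per
    distinct key value, preserving input order (a preceding ORDER BY
    can put the newest version of each person first)."""
    seen = {}
    kept = []
    for row in rows:
        k = key(row)
        seen[k] = seen.get(k, 0) + 1
        if seen[k] <= n:
            kept.append(row)
    return kept
```

With `n = 1` and rows ordered by descending version, this keeps exactly the latest row per person id, which is what the argMax aggregation computed over the full table before.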
Changes
Does this work well for both Cloud and self-hosted?
How did you test this code?