-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(row-counts-for-saved-queries): Temporal row counts for saved queries #29164
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR adds row counting functionality for materialized saved queries in the data warehouse, enabling proper billing and accurate reporting of data size.
- Added
count_delta_table_rows
function inrun_workflow.py
to efficiently count rows in Delta tables using batched processing - Implemented
update_table_row_count
to update therow_count
field in the correspondingDataWarehouseTable
model - Modified
materialize_model
to call these functions after materializing data - Added test assertions in
test_run_workflow.py
to verify row counts are correctly updated after materialization - Included error handling with proper logging to ensure robustness in production environments
2 file(s) reviewed, no comment(s)
Edit PR Review Bot Settings | Greptile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. as a reminder the billing system works by pulling "rows_synced" from external data job which means we aggregate the total rows synced across all syncs. An analogy with materialization would be saving the row count/row size each sync vs just the rows in the table.
but, as mentioned, we should rethiink what we're counting in the first place
Yep, I just wanted to get this row updated before tackling the bigger problem... I wasn't even sure if we wanted to bill in the same way in terms of rows, but that's a larger discussion... |
Problem
Row counts are not updated and we don't bill for rows that are materialized in views... The first step is to have methods that count the records written via delta for our data-modeling-run.
Rows now updated:


Changes
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Works on scheduled queries, tested on 72k events in test env. Seems to work ok and update.