Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(row-counts-for-saved-queries): Temporal row counts for saved queries #29164

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

phixMe
Copy link
Contributor

@phixMe phixMe commented Feb 25, 2025

Problem

Row counts are not updated and we don't bill for rows that are materialized in views... The first step is to have methods that count the records written via delta for our data-modeling-run.

Rows now updated:
image
image

Changes

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Works on scheduled queries, tested on 72k events in test env. Seems to work ok and update.

@phixMe phixMe marked this pull request as ready for review February 25, 2025 18:40
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR adds row counting functionality for materialized saved queries in the data warehouse, enabling proper billing and accurate reporting of data size.

  • Added count_delta_table_rows function in run_workflow.py to efficiently count rows in Delta tables using batched processing
  • Implemented update_table_row_count to update the row_count field in the corresponding DataWarehouseTable model
  • Modified materialize_model to call these functions after materializing data
  • Added test assertions in test_run_workflow.py to verify row counts are correctly updated after materialization
  • Included error handling with proper logging to ensure robustness in production environments

2 file(s) reviewed, no comment(s)
Edit PR Review Bot Settings | Greptile

Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. as a reminder the billing system works by pulling "rows_synced" from external data job which means we aggregate the total rows synced across all syncs. An analogy with materialization would be saving the row count/row size each sync vs just the rows in the table.

but, as mentioned, we should rethiink what we're counting in the first place

@phixMe
Copy link
Contributor Author

phixMe commented Feb 25, 2025

looks good. as a reminder the billing system works by pulling "rows_synced" from external data job which means we aggregate the total rows synced across all syncs. An analogy with materialization would be saving the row count/row size each sync vs just the rows in the table.

but, as mentioned, we should rethiink what we're counting in the first place

Yep, I just wanted to get this row updated before tackling the bigger problem... I wasn't even sure if we wanted to bill in the same way in terms of rows, but that's a larger discussion...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants