-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove separate entity counting step at end of copy #5819
base: master
Are you sure you want to change the base?
Conversation
With recent changes, BatchCopy has become unnecessary
We used to count entities after all the data had been copied, but that is a very slow operation for large subgraphs; that in turn can lead to bad side-effects like connection timeouts and the copy ultimately failing. We now keep track of the number of current entity versions that we copy with each batch and calculate the entity count incrementally
We used to count entities, but we know the result is 0 because all the tables we are counting are empty
fn walk_ast<'b>(&'b self, mut out: AstPass<'_, 'b, Pg>) -> QueryResult<()> { | ||
// Generate a query | ||
// with copy_cte as ( {copy} ) | ||
// select count(*) from copy_cte where {block_range_current} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would better to have a test for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are tests in store/test-store/tests/postgres/graft.rs
that run these queries by copying/grafting test subgraphs
|
||
let count = count.unwrap_or(0); | ||
|
||
deployment::update_entity_count(conn, &self.dst_site, count)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand this correctly, I thought that count_current would run the count for the batch with the goal of adding it to previous result. Otherwise it would be counting all at the end as it was previously, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to count_current
wraps the CopyEntityBatchQuery
so that the overall query returns how many current versions were copied, i.e., we end up running something like
with copy_cte(current) as (
insert into dst(...) select * from src where ... returning block_range @> int32::MAX)
select count(*) from copy_cte where current
For the select
, copy_cte
is a table that has one boolean column, which is true
if the row we just inserted is a current version.
The query does two things at once: (1) it copies a batch (the inner insert
statement) and (2) returns how many rows that were copied are considered 'current', i.e. entity versions that should be counted for the entity count of the deployment. Line 488 then takes that count and adds it to the entity count for the deployment.
With this change, we never run a count query just by itself.
We used to count all entities in the destination after all data had been copied; for large subgraphs, that can be a very slow operation, and in some cases makes the copy operation fail because the underlying fdw connection times out.
With this PR, we now update the entity count with every batch we copy.