Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove separate entity counting step at end of copy #5819

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

lutter
Copy link
Collaborator

@lutter lutter commented Feb 12, 2025

We used to count all entities in the destination after all data had been copied; for large subgraphs, that can be a very slow operation, and in some cases makes the copy operation fail because the underlying fdw connection times out.

With this PR, we now update the entity count with every batch we copy.

With recent changes, BatchCopy has become unnecessary
We used to count entities after all the data had been copied, but that is a
very slow operation for large subgraphs; that in turn can lead to bad
side-effects like connection timeouts and the copy ultimately failing.

We now keep track of the number of current entity versions that we copy
with each batch and calculate the entity count incrementally
We used to count entities, but we know the result is 0 because all the
tables we are counting are empty
@fordN fordN requested review from zorancv and mangas and removed request for zorancv February 13, 2025 17:01
fn walk_ast<'b>(&'b self, mut out: AstPass<'_, 'b, Pg>) -> QueryResult<()> {
// Generate a query
// with copy_cte as ( {copy} )
// select count(*) from copy_cte where {block_range_current}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would better to have a test for this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests in store/test-store/tests/postgres/graft.rs that run these queries by copying/grafting test subgraphs


let count = count.unwrap_or(0);

deployment::update_entity_count(conn, &self.dst_site, count)?;
Copy link
Contributor

@mangas mangas Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand this correctly, I thought that count_current would run the count for the batch with the goal of adding it to previous result. Otherwise it would be counting all at the end as it was previously, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to count_current wraps the CopyEntityBatchQuery so that the overall query returns how many current versions were copied, i.e., we end up running something like

with copy_cte(current) as (
  insert into dst(...) select * from src where ... returning block_range @> int32::MAX)
select count(*) from copy_cte where current

For the select, copy_cte is a table that has one boolean column, which is true if the row we just inserted is a current version.

The query does two things at once: (1) it copies a batch (the inner insert statement) and (2) returns how many rows that were copied are considered 'current', i.e. entity versions that should be counted for the entity count of the deployment. Line 488 then takes that count and adds it to the entity count for the deployment.

With this change, we never run a count query just by itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants