attempt to fix cff duplicates #497
base: main
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.

@@ Coverage Diff @@
## main #497 +/- ##
==========================================
- Coverage 97.26% 97.25% -0.01%
==========================================
Files 412 413 +1
Lines 34424 34554 +130
==========================================
+ Hits 33481 33607 +126
- Misses 943 947 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
Codecov Report

All modified and coverable lines are covered by tests ✅
Changes have been made to critical files, which contain lines commonly executed in production. Learn more

✅ All tests successful. No failed tests found.

Additional details and impacted files

@@ Coverage Diff @@
## main #497 +/- ##
==========================================
- Coverage 97.28% 97.28% -0.01%
==========================================
Files 443 444 +1
Lines 35153 35283 +130
==========================================
+ Hits 34200 34326 +126
- Misses 953 957 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1 file with indirect coverage changes
Thanks for the detailed explanation in the PR. Very helpful.
After reading through that and the proposed solution it seems that you are trying to fix the effects of the issue, not the cause (the cause being that 2 different tasks might be doing the same thing at the same time).
Wouldn't it be preferable to fix the cause of the problem?
It saves us from having to do extra work (saving sessions and then deleting them).
I would actually go as far as suggesting that we remove the pre_process_upload task completely. I don't think we are getting any serious benefit from it at this time...
@giovanni-guidini I'm not as versed in the preprocess task to know the implications of deleting it altogether; for example, what happens to people that use the preprocess CLI command (create-report)? Do we know how many people use this, plus the origins of the task? While I think this is 99% the cause, I'd like to see if it's indeed the case in prod, so that's why I thought of trying this out first. Other things we could do:
@adrian-codecov The pre-process upload task simply initializes the report that was created (carryforward is part of the initialization), but as we've seen that is also done in the upload task. To which you might be wondering: "what's the point of a redundant task?" Anyway, if we remove it, the upload task would just take care of the report initialization. It might increase the time to notify by a little bit, but that would be the extent of the impact, I think.

Using the same lock would indeed solve it, but having more tasks fighting for the same lock would increase coordination overhead (and we have a clear way to have fewer tasks fighting for locks). I don't see how increasing the timeout would help with the issue.
About this bit:

That makes sense, but the code as is seems quite "final" and "permanent". Assuming this solution works to fix the carryforward issues we see (which it seems that it would; the change itself seems fine*), what would be the next steps to solve the cause? (Of course I might be missing a very obvious "this is urgent, we need to patch this ASAP and worry about not making it happen again later", which would be very valid.)

*The sessions that belong to a report are also saved in the
if not cffs_and_depths_in_db:
    return

# Gets first 'parent_depth' it finds, as all cffs should be carried forward
Shouldn't it get the minimum 'parent_depth' instead of the first?
The next four lines do that: it goes through the Report object, which contains the uploads that will be carried forward, and determines the depth they have. I'm initializing carryforward_report_depth with the arbitrary depth of 1 in case line 1241 doesn't run, so the variable has a non-null value. I update it shortly after, in line 1243, if a depth for the carryforward reports is found. Does that make sense?
well... no. The question is whether the code should get the smallest 'parent_depth' from all the sessions in the report. Both the comment and the code are getting the 1st parent_depth that appears. Those values can be different depending on the order of the sessions that were carried forward, and you made no claims about the order of the sessions.

Asked differently: is the 1st session that appears always the smallest one?
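The difference between "first" and "minimum" can be made concrete with a short sketch (the session order and depth values here are hypothetical, just to show that the two readings can disagree):

```python
# Hypothetical sessions as (session_id, parent_depth) pairs. Nothing
# guarantees the report iterates them sorted by depth.
sessions = [("s1", 2), ("s2", 1), ("s3", 2)]

# "First" semantics: whatever depth happens to appear first.
first_depth = sessions[0][1]

# "Minimum" semantics: the smallest depth across all sessions.
min_depth = min(depth for _, depth in sessions)

# With this ordering the two answers differ (2 vs 1), which is the
# reviewer's point: "first" is order-dependent.
assert first_depth != min_depth
```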
@@ -60,6 +62,8 @@
from services.repository import get_repo_provider_service
from services.yaml.reader import get_paths_from_flags, read_yaml_field

MIN_PARENT_COMMIT_DEPTH = 1
Will this ever not be 1?
This value itself will always be 1, but the parent depth a commit chooses is determined by this:

worker/services/report/__init__.py (lines 648 to 649 in f12fe8f):

def get_appropriate_commit_to_carryforward_from(
    self, commit: Commit, max_parenthood_deepness: int = 10
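That bounded walk up the parent chain can be sketched roughly like this (FakeCommit and the suitability check are hypothetical stand-ins inferred from the discussion, not the worker's actual implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeCommit:
    commitid: str
    state: str          # e.g. "complete" or "pending"
    has_report: bool
    parent: Optional["FakeCommit"] = None

def find_carryforward_parent(commit, max_parenthood_deepness=10):
    """Walk up the parent chain, skipping unsuitable commits (e.g. still
    "pending" or without a report), up to a bounded depth.
    Returns (parent, depth) or (None, None) if nothing suitable is found."""
    parent, depth = commit.parent, 1
    while parent is not None and depth <= max_parenthood_deepness:
        if parent.state == "complete" and parent.has_report:
            return parent, depth
        parent, depth = parent.parent, depth + 1
    return None, None

# Chain A <- B <- C where B is still processing: C falls back to A at depth 2.
a = FakeCommit("A", "complete", True)
b = FakeCommit("B", "pending", False, parent=a)
c = FakeCommit("C", "complete", False, parent=b)
```

This also shows why the chosen depth is 1 only in the happy path: whenever an intermediate parent is skipped, the carried-forward sessions come from depth 2 or more.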
commit=commit.commitid,
),
)
# Potentially make task here - this would make the preprocess task likely timeout
The upload task would be likely to timeout here too? Because that's a way bigger problem
It could, depending on the amount of things to delete, hence the ask here to see if it would make sense to create a separate deletion task. I could do that if we think it makes more sense.
I think that would be a good idea, yes.
Especially if it only changes the database and not the Report
object it would make a lot of sense to move this to a different task
[ps.: "I" can't speak for the team, obviously. You might want to gather more opinions]
for parent_depth, upload_ids in cffs_and_depths_in_db.items():
    if parent_depth > carryforward_report_depth:
        log.info(
            "Deleting upload from DB",
[nit] Big chance that this will spam our logs. Might be better to accumulate the number of uploads deleted and just make a single log afterwards...
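The accumulate-then-log-once suggestion could look roughly like this (the function is a hypothetical sketch that mirrors the diff's variable names, not the actual change):

```python
import logging

log = logging.getLogger(__name__)

def collect_stale_upload_ids(cffs_and_depths_in_db, carryforward_report_depth):
    """Gather every stale upload id first, then emit a single summary
    log line instead of one log per deleted upload."""
    stale_ids = []
    for parent_depth, upload_ids in cffs_and_depths_in_db.items():
        if parent_depth > carryforward_report_depth:
            stale_ids.extend(upload_ids)
    if stale_ids:
        # One log entry regardless of how many uploads are deleted.
        log.info(
            "Deleting uploads from DB",
            extra=dict(deleted_count=len(stale_ids)),
        )
    return stale_ids
```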
db_session.query(Upload).filter(Upload.id_.in_(upload_ids)).delete(
    synchronize_session=False
)
db_session.commit()
Shouldn't we use flush instead of commit? Like we do in the sync_repos and sync_teams tasks? I don't really know the difference, tho. Just going for the consistency.
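For context on the difference: in SQLAlchemy, flush() sends the pending SQL to the database inside the still-open transaction (so a later rollback can undo it), while commit() flushes and then ends the transaction, making the changes permanent. The same distinction can be sketched in raw SQL with the stdlib sqlite3 module (an analogy only, not the worker's code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploads (id INTEGER PRIMARY KEY)")
conn.commit()

# Analogous to flush(): the INSERT runs inside the open transaction and
# is visible to queries on this connection, but is not yet durable.
conn.execute("INSERT INTO uploads VALUES (1)")
assert conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0] == 1
conn.rollback()  # still undoable
assert conn.execute("SELECT COUNT(*) FROM uploads").fetchone()[0] == 0

# Analogous to commit(): the change becomes permanent.
conn.execute("INSERT INTO uploads VALUES (2)")
conn.commit()
conn.rollback()  # no effect now; the transaction already ended
```

In a task, committing mid-way means a later failure can no longer roll the whole task back as one unit, which is presumably why sync_repos and sync_teams prefer flush.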
Change isn't too bad; the length is mostly from tests, but it's important to understand the problem. More is described here as well: https://github.com/codecov/internal-issues/issues/478. These are the changes in summary, but the proper explanation is below:
- session_extras for uploads

We've been seeing some very inconsistent and strange behaviour where customers using carryforward flags (cffs) see a disproportionate growth of cffs over time. It doesn't happen on every commit, nor is there a specific pattern to it, but it is confirmed. A little on carryforward uploads.
What are cffs? It's a feature we offer where customers can persist upload coverage across commits without having to explicitly upload on every commit - customers specify this in their codecov yaml. Behind the scenes, we look for a commit's parent commit and "copy paste" the relevant upload information into the current commit.
The problem we're seeing is that certain commits carry forward uploads from two different parent commits; this should never happen (you'll see me use "upload" and "session" interchangeably here). How does this happen? See this example:
Example
Imagine you have a chain of commits, A -> B -> C, where C is the current commit, B is C's parent, and A is B's parent (C's grandparent). In normal circumstances, if a customer has cffs on for, say, a "unit" flag and they don't provide an upload with the "unit" flag in B and C, B would carry forward the upload from A, and C would carry forward the upload from B.
While we have a parent_commit field for the Commit model, we have a dedicated function to know which parent a commit should carry forward from. This function will find the closest suitable parent commit (normally its immediate parent) and carry forward reports from it. This function is needed because sometimes a commit's direct parent might not have a valid report or one at all, the commit might be in an error state, there might not be a direct parent at all, etc. I won't go into too much detail on the different states (you can read the function if you're curious), but I want to focus on the scenario where a parent_commit is processing, which is when it is in "pending" state. A parent commit that is processing isn't a suitable candidate to carry forward a report from, because for all we know it doesn't have a valid report, or it can error, or other causes, so we advance to its parent and use that as a candidate. In the example above, C would look at B, which is processing, so C would carry forward from A.

This is so far expected and normal. The funky part occurs if you upload two or more times for commit C: one upload while B is processing, and one after B has successfully completed. In the eyes of C, it would carry forward from A the first time and from B the second time, and if you consider A to have "N" uploads to carry forward and B to have "M" uploads to carry forward, then C would end up with "M + N" carryforward uploads. This is wrong, and C should only have "N" uploads. To take the example even further, if D eventually happens after C, and it carries the uploads from C, it would end up with "M + N" uploads; hence you see this "duplicate" effect every now and then. -- end of example
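The "M + N" outcome can be simulated with a toy model (the names, counts, and carryforward function here are hypothetical illustrations, not the worker's actual code):

```python
# Toy model of the race described above. Commit chain: A -> B -> C.
# A has N uploads; B (once it finishes processing) has M uploads.
N, M = 3, 3
uploads = {
    "A": [f"A-upload-{i}" for i in range(N)],
    "B": [f"B-upload-{i}" for i in range(M)],
}

c_report = []  # sessions accumulated on commit C

def carryforward(parent: str) -> None:
    # Buggy behaviour: each call appends sessions; nothing removes the
    # carryforwards taken earlier from an older ancestor.
    c_report.extend(f"cff:{u}" for u in uploads[parent])

carryforward("A")  # 1st upload for C: B still "pending", so fall back to A
carryforward("B")  # 2nd upload for C: B has finished processing by now

# C ends up with M + N carryforward sessions instead of just one parent's.
assert len(c_report) == M + N
```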
Now, if you're thinking "why doesn't this happen all the time, since many people upload more than once per commit?", you're thinking in the right vein. That's because, for this weird bug to happen, you need to upload multiple times very fast, and those uploads need to be very big. A commit doesn't stay processing for very long, so the window for two different uploads to occur, one while a commit is processing and one when it's not, is slim. It is not impossible, but it's very unlikely. On top of that, two different uploads won't run this logic, because it needs to go through this check first, which is true when there isn't a report object and false after that is populated (so technically, 2 different uploads should never get here, because the second run will evaluate false and keep going). And thirdly, upload tasks have a lock for a task + repo + commit combo, so two uploads for the same commit couldn't be racing for the same function.

So, how does this actually happen? It's because there are two tasks that could run into this functionality: the upload and preprocessing tasks. These two can theoretically call the logic that chooses the commit's parent for cffs, one choosing A as the "parent" and the other choosing B as the parent.

So if we piece everything together: if you run the preprocess + upload tasks in quick succession for a commit with a lot of uploads, there is a chance that they choose different parents.

So what am I doing? (finally) I'm writing some logic at "upload insertion time" to determine if there are already cffs from an older parent, and delete those if they exist. In the example above, starting from C, I would be deleting the cffs from A when I see cffs coming from B.
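The "delete cffs from an older parent" step can be sketched as a pure function over the depth map (the names mirror the diff's variables, but this is a simplified, hypothetical version of the change, not the PR's actual code):

```python
def prune_stale_cffs(
    cffs_and_depths_in_db: dict[int, list[int]],
    carryforward_report_depth: int,
) -> list[int]:
    """Return the upload ids whose parent_depth is larger (i.e. taken
    from a more distant ancestor) than the depth of the parent the
    current carryforward came from."""
    return [
        upload_id
        for parent_depth, upload_ids in cffs_and_depths_in_db.items()
        if parent_depth > carryforward_report_depth
        for upload_id in upload_ids
    ]

# C carried from B (depth 1) this time; the cffs previously taken from
# A (depth 2) are stale and should be deleted.
assert prune_stale_cffs({1: [10, 11], 2: [7, 8]}, 1) == [7, 8]
```

The returned ids would then feed the bulk delete shown in the diff (the query/filter/delete on Upload).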
Hopefully you enjoyed the read!
Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.