-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decide How to handle Transaction Submission Timeout
#558
Comments
Let's align on what happened here. The first submission timed out and the second submission for the same transaction (same sequence number) returned a Can you link to the relevant parts of the code that handle this case? Don't we already check if a transaction was submitted in the past by going through all historic transactions of an account? Or not in this case? I think the best-practice way to deal with these 504 submission errors returned by Stellar horizon servers is to try and resubmit until it gets accepted. |
Not when we're submitting it on-the-go; meaning it's NOT from the periodic resubmission. (It won't pass through any logic in
We are. There's a delay in-between though. Probably because of it, another (successful) transaction took the same sequence number. Once resubmission is triggered, the sequence number is already invalid. Also, this might have happen because the 2 CI tests ran on parallel:
Although it's hard to trace this. I searched for sequence 49065706389788; it only appeared in the job Run Tests for Vault. |
I see, that makes sense. What do you think would be the best approach to handle this @b-yap? |
This has only happened one time; I would like to observe if it happens again.
Default value of max_backoff_delay is:
which is huge. |
You are right, that's probably a bit too much and I'm also fine if we reduce it. |
The envelope will be removed (be it successful or not):
Retry is not possible. |
Right but when the vault restarts, it will probably see the redeem request still pending and then retry the whole redeem process again. In this case, this is also not problematic except that for the longer waiting time of the user for the redeem. However, let's say your assumption is correct and this occurred because 2 CIs were running in parallel, then we are very unlikely to encounter the same issue in production as long as only one vault client is running per vault (and the Stellar account of the vault is not re-used twice) which is something we explicitly mention in the vault operator documentation. So let's agree that we change the default max backoff and keep the rest as is, or what do you think @b-yap? |
@ebma can you please share a brief about what is the problem and should we prioritise it? |
I changed the description of the ticket to summarize the conclusion based on the comments. We don't need to prioritize it as it's an edge case that should not happen in practice. |
Context
It seems like resubmissions of Stellar transactions sometimes failed due to race conditions. This was probably because the same Stellar account was used by two vault clients (or in this case, two integration test jobs running at the same time) thus the sequence number increased during the resubmissions.
From the failed CI:
After returning the same transaction, there must have been race conditions, as the next error that appeared:
Discuss how to handle this:
/transactions
;Transaction Submission Timeout
s;TODO / Conclusion (based on comments)
DEFAULT_MAX_BACKOFF_DELAY_IN_SECS
from 60 to 10 seconds to make it less likely that this race condition is introduced.This is not a 100% failsafe mechanism. However, we already generally assume that every vault is only run once, so the vault operator can circumvent this edge case simply by following the guidelines we put in the documentation.
The text was updated successfully, but these errors were encountered: