
Decide How to handle Transaction Submission Timeout #558

Open
b-yap opened this issue Oct 4, 2024 · 9 comments

@b-yap
Contributor

b-yap commented Oct 4, 2024

Context

It seems like resubmissions of Stellar transactions sometimes fail due to race conditions. This is probably because the same Stellar account was used by two vault clients (or, in this case, two integration test jobs running at the same time), so the sequence number increased during the resubmissions.

From the failed CI:

ERROR transfer_stellar_asset{request_type=Redeem request_id=0xb4e5d0da87b0b1ad3cf0ed0295721d23be3dce617cb6667c5c41afbf82df62fb}: wallet::horizon::responses: Response returned an error: HorizonSubmissionError { title: "Transaction Submission Timeout", status: 504, reason: "Your transaction submission request has timed out. This does not necessarily mean the submission has failed. Before resubmitting, please use the transaction hash provided in extras.hash to poll the GET /transactions endpoint for sometime and check if it was included in a ledger.", result_code_op: [], envelope_xdr: None }

After resubmitting the same transaction, there must have been a race condition, as the next error shows:

ERROR transfer_stellar_asset{request_type=Redeem request_id=0xb4e5d0da87b0b1ad3cf0ed0295721d23be3dce617cb6667c5c41afbf82df62fb}: wallet::horizon::responses: Response returned an error: HorizonSubmissionError { title: "Transaction Failed", status: 400, reason: "tx_bad_seq", result_code_op: [], envelope_xdr: Some("AAAAAgAAAABIAqJ3gBZp7ZPNNB6lqFRtrHwsVORcDOsAk9uqkBeoEwAAAGQAACygAAABHAAAAAAAAAABAAAAHERCOWR0ODFLZUxzVUhOVzRFU0dCTHlCNHl6akIAAAABAAAAAQAAAABIAqJ3gBZp7ZPNNB6lqFRtrHwsVORcDOsAk9uqkBeoEwAAAAEAAAAAvLWEHHR8B5HLUxEsvLtwpqTLhqOB93KKa2mitHztRHAAAAABVVNEQwAAAAAU0ZYxsDcX2auaNm4QMh7iZucux2yrYZDwoTNtSCKfiwAAAAAAABODAAAAAAAAAAGQF6gTAAAAQAgHmNMxHyjTF1s4R4vIieggkkHbHCYY69Ccgd+TwouhD0laCxZOhF+MJ6UiF5XdWWoaPk75PAXw7tBKOc6Skw0=") }

Discuss how to handle this:

  • Follow the instructions about polling /transactions (see the polling sketch below);
  • Do not perform resubmission for Transaction Submission Timeouts;
  • Stay as is; no code changes
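
To illustrate the polling option, here is a minimal sketch. It assumes a plain HTTP GET against Horizon's GET /transactions/{hash} endpoint; the helper names, the horizon_base parameter, and the use of reqwest/tokio are illustrative assumptions, not the vault's actual wallet code.

use std::time::Duration;

/// Returns true if Horizon reports the transaction as included in a ledger.
/// Horizon answers 404 for a hash that is not (yet) in any ledger.
async fn tx_included(horizon_base: &str, tx_hash: &str) -> Result<bool, reqwest::Error> {
    let url = format!("{horizon_base}/transactions/{tx_hash}");
    let response = reqwest::get(url).await?;
    Ok(response.status().is_success())
}

/// Poll for a while, as the 504 message suggests, before deciding to resubmit.
async fn wait_for_inclusion(horizon_base: &str, tx_hash: &str, attempts: u32) -> bool {
    for _ in 0..attempts {
        if tx_included(horizon_base, tx_hash).await.unwrap_or(false) {
            return true;
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
    false
}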

TODO / Conclusion (based on comments)

  • Reduce the DEFAULT_MAX_BACKOFF_DELAY_IN_SECS from 60 to 10 seconds to make this race condition less likely (a sketch of the change follows below).

This is not a 100% failsafe mechanism. However, we already generally assume that every vault is only run once, so the vault operator can circumvent this edge case simply by following the guidelines we put in the documentation.
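
For concreteness, the proposed change amounts to lowering the constant; only the value changes, and the exact location is wherever it is currently defined in the wallet crate:

pub(crate) const DEFAULT_MAX_BACKOFF_DELAY_IN_SECS: u16 = 10; // previously 60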

@ebma
Member

ebma commented Oct 4, 2024

Let's align on what happened here. The first submission timed out, and the second submission of the same transaction (same sequence number) returned tx_bad_seq because the first one had already been accepted.

Can you link to the relevant parts of the code that handle this case? Don't we already check if a transaction was submitted in the past by going through all historic transactions of an account? Or not in this case?

I think the best-practice way to deal with these 504 submission errors returned by Stellar horizon servers is to try and resubmit until it gets accepted.
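
As a rough sketch of how that decision could look, assuming an error type shaped like the HorizonSubmissionError seen in the logs above (the struct and helper below are illustrative, not the wallet crate's actual types):

/// Shaped like the error seen in the logs above (illustrative only).
struct HorizonSubmissionError {
    title: String,
    status: u16,
    reason: String,
}

/// Whether it makes sense to retry submitting the same envelope.
fn should_resubmit(err: &HorizonSubmissionError) -> bool {
    match err.status {
        // 504 "Transaction Submission Timeout": the transaction may still end up
        // in a ledger, so poll /transactions for the hash first and resubmit
        // only if it never appears.
        504 => true,
        // 400-level failures such as tx_bad_seq are definitive for this envelope;
        // resubmitting the identical envelope cannot succeed.
        _ => false,
    }
}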

@b-yap
Contributor Author

b-yap commented Oct 7, 2024

Don't we already check if a transaction was submitted in the past by going through all historic transactions of an account

Not when we're submitting it on the go, meaning it's NOT from the periodic resubmission. (It won't pass through any logic in clients/wallet/src/resubmissions.rs.)

I think the best-practice way to deal with these 504 submission errors returned by Stellar horizon servers is to try and resubmit until it gets accepted.

We are. There's a delay in between, though. Probably because of it, another (successful) transaction took the same sequence number, so by the time the resubmission is triggered, the sequence number is already invalid.

Also, this might have happened because the two CI jobs ran in parallel:

  • Run Tests for Pallets and Clients
  • Run Tests for Vault

It's hard to trace, though; I searched for sequence 49065706389788 and it only appeared in the Run Tests for Vault job.
I've added links to the relevant code in the description.

@ebma
Member

ebma commented Oct 7, 2024

Probably because of it, another (successful) transaction took the same sequence number. Once resubmission is triggered, the sequence number is already invalid.

I see, that makes sense. What do you think would be the best approach to handle this @b-yap?

@b-yap
Contributor Author

b-yap commented Oct 8, 2024

This has only happened once; I would like to observe whether it happens again.
If it does, I think one non-drastic solution is to reduce the delay. Right now it's set to 10 seconds MULTIPLIED by the delay counter:

let sleep_duration = backoff_delay_counter * BASE_BACKOFF_DELAY_IN_SECS;


The default value of max_backoff_delay is:

pub(crate) const DEFAULT_MAX_BACKOFF_DELAY_IN_SECS: u16 = 60;

max_backoff_delay: Self::DEFAULT_MAX_BACKOFF_DELAY_IN_SECS,

which is huge.
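
To make the numbers concrete, here is a small sketch of that arithmetic, assuming the computed delay is capped at max_backoff_delay (the capping is an assumption; only the multiplication appears in the snippet above):

const BASE_BACKOFF_DELAY_IN_SECS: u16 = 10;

fn sleep_secs(backoff_delay_counter: u16, max_backoff_delay: u16) -> u16 {
    (backoff_delay_counter * BASE_BACKOFF_DELAY_IN_SECS).min(max_backoff_delay)
}

fn main() {
    // With the current default of 60: 10, 20, 30, 40, 50, 60, 60, ...
    // With a default of 10: the retry delay stays at 10 seconds on every attempt.
    for counter in 1..=7u16 {
        println!(
            "attempt {counter}: max 60 -> {}s, max 10 -> {}s",
            sleep_secs(counter, 60),
            sleep_secs(counter, 10)
        );
    }
}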

@ebma
Member

ebma commented Oct 8, 2024

You are right, that's probably a bit too much, and I'm also fine with reducing it. 10 sounds good to me. This makes the error you describe in the ticket less likely to occur, but it could still happen. Correct me if I'm wrong, but when this happens, the vault will not attempt to retry the execution of the redeem request until it restarts, correct?

@b-yap
Contributor Author

b-yap commented Oct 8, 2024

the vault will not attempt to retry the execution of the redeem request until it restarts, correct?

The envelope will be removed (whether the submission succeeded or not):

let _ = self.remove_tx_envelope_from_cache(&envelope);

Retry is not possible.

@ebma
Member

ebma commented Oct 8, 2024

Right, but when the vault restarts, it will probably see the redeem request still pending and then retry the whole redeem process again. In that case this is also not problematic, except for the longer time the user has to wait for the redeem.

However, let's say your assumption is correct and this occurred because two CI jobs were running in parallel. Then we are very unlikely to encounter the same issue in production as long as only one vault client is running per vault (and the Stellar account of the vault is not re-used twice), which is something we explicitly mention in the vault operator documentation.

So let's agree that we change the default max backoff and keep the rest as is, or what do you think @b-yap?

@prayagd
Collaborator

prayagd commented Nov 6, 2024

@ebma can you please share a brief summary of what the problem is and whether we should prioritise it?

@ebma
Member

ebma commented Nov 6, 2024

I changed the description of the ticket to summarize the conclusion based on the comments. We don't need to prioritize it, as it's an edge case that should not happen in practice.
