
Decide How to handle Transaction Submission Timeout #558

Open
b-yap opened this issue Oct 4, 2024 · 9 comments

@b-yap
Contributor

b-yap commented Oct 4, 2024

Context

It seems like resubmissions of Stellar transactions sometimes fail due to race conditions. This is probably because the same Stellar account was used by two vault clients (or, in this case, two integration test jobs running at the same time), so the sequence number increased during the resubmissions.

From the failed CI:

ERROR transfer_stellar_asset{request_type=Redeem request_id=0xb4e5d0da87b0b1ad3cf0ed0295721d23be3dce617cb6667c5c41afbf82df62fb}: wallet::horizon::responses: Response returned an error: HorizonSubmissionError { title: "Transaction Submission Timeout", status: 504, reason: "Your transaction submission request has timed out. This does not necessarily mean the submission has failed. Before resubmitting, please use the transaction hash provided in extras.hash to poll the GET /transactions endpoint for sometime and check if it was included in a ledger.", result_code_op: [], envelope_xdr: None }

After resubmitting the same transaction, there must have been a race condition, as the next error shows:

ERROR transfer_stellar_asset{request_type=Redeem request_id=0xb4e5d0da87b0b1ad3cf0ed0295721d23be3dce617cb6667c5c41afbf82df62fb}: wallet::horizon::responses: Response returned an error: HorizonSubmissionError { title: "Transaction Failed", status: 400, reason: "tx_bad_seq", result_code_op: [], envelope_xdr: Some("AAAAAgAAAABIAqJ3gBZp7ZPNNB6lqFRtrHwsVORcDOsAk9uqkBeoEwAAAGQAACygAAABHAAAAAAAAAABAAAAHERCOWR0ODFLZUxzVUhOVzRFU0dCTHlCNHl6akIAAAABAAAAAQAAAABIAqJ3gBZp7ZPNNB6lqFRtrHwsVORcDOsAk9uqkBeoEwAAAAEAAAAAvLWEHHR8B5HLUxEsvLtwpqTLhqOB93KKa2mitHztRHAAAAABVVNEQwAAAAAU0ZYxsDcX2auaNm4QMh7iZucux2yrYZDwoTNtSCKfiwAAAAAAABODAAAAAAAAAAGQF6gTAAAAQAgHmNMxHyjTF1s4R4vIieggkkHbHCYY69Ccgd+TwouhD0laCxZOhF+MJ6UiF5XdWWoaPk75PAXw7tBKOc6Skw0=") }

Discuss how to handle this:

  • Follow the instructions about polling /transactions (see the polling sketch below);
  • Do not perform resubmission for Transaction Submission Timeouts;
  • Stay as is; no code changes
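
To illustrate the polling option, here is a minimal sketch. It assumes a plain HTTP GET against Horizon's GET /transactions/{hash} endpoint; the helper names, the horizon_base parameter, and the use of reqwest/tokio are illustrative assumptions, not the vault's actual wallet code.

use std::time::Duration;

/// Returns true if Horizon reports the transaction as included in a ledger.
/// Horizon answers 404 for a hash that is not (yet) in any ledger.
async fn tx_included(horizon_base: &str, tx_hash: &str) -> Result<bool, reqwest::Error> {
    let url = format!("{horizon_base}/transactions/{tx_hash}");
    let response = reqwest::get(url).await?;
    Ok(response.status().is_success())
}

/// Poll for a while, as the 504 message suggests, before deciding to resubmit.
async fn wait_for_inclusion(horizon_base: &str, tx_hash: &str, attempts: u32) -> bool {
    for _ in 0..attempts {
        if tx_included(horizon_base, tx_hash).await.unwrap_or(false) {
            return true;
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
    false
}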

TODO / Conclusion (based on comments)

  • Reduce the DEFAULT_MAX_BACKOFF_DELAY_IN_SECS from 60 to 10 seconds to make this race condition less likely (a sketch of the change follows below).

This is not a 100% failsafe mechanism. However, we already generally assume that every vault is only run once, so the vault operator can circumvent this edge case simply by following the guidelines we put in the documentation.
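
For concreteness, the proposed change amounts to lowering the constant; only the value changes, and the exact location is wherever it is currently defined in the wallet crate:

pub(crate) const DEFAULT_MAX_BACKOFF_DELAY_IN_SECS: u16 = 10; // previously 60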

@ebma
Member

ebma commented Oct 4, 2024

Let's align on what happened here. The first submission timed out, and the second submission of the same transaction (same sequence number) returned tx_bad_seq because the first one had already been accepted.

Can you link to the relevant parts of the code that handle this case? Don't we already check if a transaction was submitted in the past by going through all historic transactions of an account? Or not in this case?

I think the best-practice way to deal with these 504 submission errors returned by Stellar horizon servers is to try and resubmit until it gets accepted.
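
As a rough sketch of how that decision could look, assuming an error type shaped like the HorizonSubmissionError seen in the logs above (the struct and helper below are illustrative, not the wallet crate's actual types):

/// Shaped like the error seen in the logs above (illustrative only).
struct HorizonSubmissionError {
    title: String,
    status: u16,
    reason: String,
}

/// Whether it makes sense to retry submitting the same envelope.
fn should_resubmit(err: &HorizonSubmissionError) -> bool {
    match err.status {
        // 504 "Transaction Submission Timeout": the transaction may still end up
        // in a ledger, so poll /transactions for the hash first and resubmit
        // only if it never appears.
        504 => true,
        // 400-level failures such as tx_bad_seq are definitive for this envelope;
        // resubmitting the identical envelope cannot succeed.
        _ => false,
    }
}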

@b-yap
Contributor Author

b-yap commented Oct 7, 2024

Don't we already check if a transaction was submitted in the past by going through all historic transactions of an account

Not when we're submitting it on the go, meaning it's NOT from the periodic resubmission. (It won't pass through any logic in clients/wallet/src/resubmissions.rs.)

I think the best-practice way to deal with these 504 submission errors returned by Stellar horizon servers is to try and resubmit until it gets accepted.

We are. There's a delay in between, though. Probably because of it, another (successful) transaction took the same sequence number, so by the time the resubmission is triggered, the sequence number is already invalid.

Also, this might have happened because the two CI jobs ran in parallel:

  • Run Tests for Pallets and Clients
  • Run Tests for Vault

It's hard to trace, though; I searched for sequence 49065706389788 and it only appeared in the Run Tests for Vault job.
I've added links to the relevant code in the description.

@ebma
Member

ebma commented Oct 7, 2024

Probably because of it, another (successful) transaction took the same sequence number. Once resubmission is triggered, the sequence number is already invalid.

I see, that makes sense. What do you think would be the best approach to handle this @b-yap?

@b-yap
Contributor Author

b-yap commented Oct 8, 2024

This has only happened once; I would like to observe whether it happens again.
If it does, I think one non-drastic solution is to reduce the delay. Right now it's set to 10 seconds MULTIPLIED by the delay counter:

let sleep_duration = backoff_delay_counter * BASE_BACKOFF_DELAY_IN_SECS;


The default value of max_backoff_delay is:

pub(crate) const DEFAULT_MAX_BACKOFF_DELAY_IN_SECS: u16 = 60;

max_backoff_delay: Self::DEFAULT_MAX_BACKOFF_DELAY_IN_SECS,

which is huge.
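
To make the numbers concrete, here is a small sketch of that arithmetic, assuming the computed delay is capped at max_backoff_delay (the capping is an assumption; only the multiplication appears in the snippet above):

const BASE_BACKOFF_DELAY_IN_SECS: u16 = 10;

fn sleep_secs(backoff_delay_counter: u16, max_backoff_delay: u16) -> u16 {
    (backoff_delay_counter * BASE_BACKOFF_DELAY_IN_SECS).min(max_backoff_delay)
}

fn main() {
    // With the current default of 60: 10, 20, 30, 40, 50, 60, 60, ...
    // With a default of 10: the retry delay stays at 10 seconds on every attempt.
    for counter in 1..=7u16 {
        println!(
            "attempt {counter}: max 60 -> {}s, max 10 -> {}s",
            sleep_secs(counter, 60),
            sleep_secs(counter, 10)
        );
    }
}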

@ebma
Member

ebma commented Oct 8, 2024

You are right, that's probably a bit too much, and I'm also fine with reducing it. 10 sounds good to me. This makes the error you describe in the ticket less likely to occur, but it could still happen. Correct me if I'm wrong, but when this happens, the vault will not attempt to retry the execution of the redeem request until it restarts, correct?

@b-yap
Contributor Author

b-yap commented Oct 8, 2024

the vault will not attempt to retry the execution of the redeem request until it restarts, correct?

The envelope will be removed (whether the submission succeeded or not):

let _ = self.remove_tx_envelope_from_cache(&envelope);

Retry is not possible.

@ebma
Member

ebma commented Oct 8, 2024

Right, but when the vault restarts, it will probably see the redeem request still pending and then retry the whole redeem process again. In that case this is also not problematic, except for the longer time the user has to wait for the redeem.

However, let's say your assumption is correct and this occurred because two CI jobs were running in parallel. Then we are very unlikely to encounter the same issue in production as long as only one vault client is running per vault (and the Stellar account of the vault is not re-used twice), which is something we explicitly mention in the vault operator documentation.

So let's agree that we change the default max backoff and keep the rest as is, or what do you think @b-yap?

@prayagd
Collaborator

prayagd commented Nov 6, 2024

@ebma can you please share a brief summary of what the problem is and whether we should prioritise it?

@ebma
Member

ebma commented Nov 6, 2024

I changed the description of the ticket to summarize the conclusion based on the comments. We don't need to prioritize it, as it's an edge case that should not happen in practice.
