Introduce artificial random storage service failures in storage-service for simulations. #3304

Open
wants to merge 3 commits into
base: main

MathieuDutSik
Contributor

Motivation

We have had problems with ScyllaDB as the storage backend. Being able to reproduce such problems is useful for uncovering unexpected failure modes.

Proposal

This PR introduces a scheme for injecting random read and/or write errors into the storage service client. Reads and writes have separate error probabilities. The error injection scheme is deterministic, so that failing runs are as reproducible as possible.
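
As an illustration of the idea only (not the code in this PR), deterministic error injection with separate read and write probabilities could be sketched as below. The names `FaultInjector`, `StorageError`, `check_read`, and `check_write` are hypothetical, and SplitMix64 is used here merely as a small seedable generator; the PR may use a different one.

```rust
/// Small seedable PRNG (SplitMix64). Assumption: stands in for whatever
/// generator the real implementation uses.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical error type for injected failures.
#[derive(Debug, PartialEq)]
enum StorageError {
    ArtificialRead,
    ArtificialWrite,
}

/// Hypothetical injector: one seed, one probability per operation kind.
struct FaultInjector {
    rng: SplitMix64,
    read_error_probability: f64,
    write_error_probability: f64,
}

impl FaultInjector {
    fn new(seed: u64, read_error_probability: f64, write_error_probability: f64) -> Self {
        Self {
            rng: SplitMix64::new(seed),
            read_error_probability,
            write_error_probability,
        }
    }

    /// Call before a read; fails with the configured read probability.
    fn check_read(&mut self) -> Result<(), StorageError> {
        if self.rng.next_f64() < self.read_error_probability {
            Err(StorageError::ArtificialRead)
        } else {
            Ok(())
        }
    }

    /// Call before a write; fails with the configured write probability.
    fn check_write(&mut self) -> Result<(), StorageError> {
        if self.rng.next_f64() < self.write_error_probability {
            Err(StorageError::ArtificialWrite)
        } else {
            Ok(())
        }
    }
}
```

With a fixed seed, the same sequence of reads and writes yields the same sequence of injected errors, which is what makes a failing run replayable.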

Test Plan

The test did surface some problems. Whether they are genuine bugs or caused by something else remains to be determined.

The commands to reproduce one such problem are the following:

  • Running the storage-service:

     cargo run --release -p linera-storage-service -- memory --endpoint $LINERA_STORAGE_SERVICE

  • Running the validators:

     cargo build --features artificial_random_read_error,storage-service,remote-net
     rm -rf /tmp/WORK && mkdir /tmp/WORK
     cargo run \
        --features artificial_random_read_error \
        --bin linera \
        -- net up \
        --storage service:tcp:localhost:1235:table \
        --policy-config devnet \
        --path /tmp/WORK \
        --validators 4 --shards 4

  • Running the faucet:

     export LINERA_WALLET=/tmp/WORK/wallet_0.json
     export LINERA_STORAGE=rocksdb:/tmp/WORK/client_0.db
     cargo run --features artificial_random_read_error,storage-service,remote-net --bin linera -- faucet --amount 1000 --port 8079

  • Running the tests:

     export LINERA_FAUCET_URL=http://localhost:8079
     cargo test test_wasm_end_to_end_amm::remote_net_grpc --features artificial_random_read_error,remote-net

This produces the following error:

2025-02-12T13:33:31.404232Z ERROR linera: Error is Failed to close chain

Caused by:
    chain client error: Failed to communicate with a quorum of validators: Worker error: Storage operation error in service: An artificial read error occurred

Release Plan

Follow the normal release plan.

Links

None

@MathieuDutSik MathieuDutSik marked this pull request as ready for review February 12, 2025 17:20
@MathieuDutSik MathieuDutSik requested a review from ma2bd February 12, 2025 17:20
@MathieuDutSik MathieuDutSik changed the title from "Introduce random storage service problem in storage-service." to "Introduce artificial random storage service failures in storage-service for simulations." Feb 12, 2025
@@ -85,7 +85,6 @@ pub mod metrics;
mod graphql;

/// Functions for random generation
#[cfg(with_testing)]

If you must have this, please add #[doc(hidden)]
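
For reference, a minimal sketch of what that suggestion could look like. The module name `random` and the helper `seeded_value` are assumptions made for illustration; the diff hunk above does not show the actual module declaration.

```rust
/// Functions for random generation
#[doc(hidden)] // still public and usable, but omitted from the rendered rustdoc
pub mod random {
    /// Hypothetical helper, present only so the module has some content.
    pub fn seeded_value(seed: u64) -> u64 {
        seed.wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407)
    }
}
```

The attribute only affects generated documentation; other crates can still call `random::seeded_value`.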

Contributor

@ma2bd ma2bd left a comment

I understand this PR is a quick experiment but if we want to make this feature official, error injection should probably be controlled dynamically.

  • Features can be used to make certain options available but features alone should not modify runtime behaviors.
  • Server-side error injection could be easier to control than client-side. One could create a new GRPC command for instance. I can also see the benefit of client-side error injection, though.
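
One possible shape for dynamically controlled injection, sketched under assumptions: the type `ErrorInjectionConfig` and its methods are hypothetical, rates are expressed in parts per million, and the setters would be called from whatever control path is chosen (for example, a handler for a new command on the server side). Injection is off until a rate is set at runtime.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const PPM_MAX: u32 = 1_000_000;

/// Hypothetical runtime-adjustable failure rates, in parts per million.
pub struct ErrorInjectionConfig {
    read_ppm: AtomicU32,
    write_ppm: AtomicU32,
}

impl ErrorInjectionConfig {
    /// Default state: no errors are injected.
    pub const fn disabled() -> Self {
        Self {
            read_ppm: AtomicU32::new(0),
            write_ppm: AtomicU32::new(0),
        }
    }

    /// Change the read failure rate while the process is running.
    pub fn set_read_ppm(&self, ppm: u32) {
        self.read_ppm.store(ppm.min(PPM_MAX), Ordering::Relaxed);
    }

    /// Change the write failure rate while the process is running.
    pub fn set_write_ppm(&self, ppm: u32) {
        self.write_ppm.store(ppm.min(PPM_MAX), Ordering::Relaxed);
    }

    /// `sample` is a uniform value in 0..1_000_000 drawn by the caller's RNG.
    pub fn should_fail_read(&self, sample: u32) -> bool {
        sample < self.read_ppm.load(Ordering::Relaxed)
    }

    /// Same as `should_fail_read`, but for writes.
    pub fn should_fail_write(&self, sample: u32) -> bool {
        sample < self.write_ppm.load(Ordering::Relaxed)
    }
}
```

A Cargo feature could still gate whether this type is compiled in at all, while the rates themselves stay a runtime decision.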
