RTCache is prone to lookup storms on failure #474

Open
theduke opened this issue Nov 17, 2024 · 0 comments

theduke commented Nov 17, 2024

Describe the bug

Note: whether this should be considered a bug or merely undocumented behaviour is debatable.

Currently, RTCache::get will always retry on lock timeouts and on error results.
The retry happens unconditionally, without any lock-based coordination between the waiting callers.

This means that if many get() calls are started while a lookup is in flight and that lookup fails, each waiter starts its own fallback request, leading to a lookup storm.

I find this quite unexpected and potentially very problematic in production scenarios, and IMO it should be fixed.

At the very least, the behaviour should be documented in the doc comments.

If there is interest in fixing this, I'm happy to work on a PR.

A simple fix for the error-return case would be to reuse the same lock-based coordination so that only one additional lookup is performed on behalf of all requesters (a sketch follows below).

This could be considered a breaking change though, because:

  • The error returned to waiters on a shared failure would no longer be equivalent to the original Lookup error, although you probably shouldn't rely on error message strings for anything anyway...
  • The lock_timeout would effectively be applied twice, unless custom logic reduced the second timeout by the time the first attempt took.

An alternative could be adding a new method with different behaviour.
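To make the idea concrete, here is a minimal, self-contained sketch of the "one shared fallback lookup" pattern. It is not based on the actual RTCache internals; the SharedRetry type and get_or_lookup method are hypothetical, and a real fix would hook into the existing per-key locking instead.

    use std::future::Future;
    use std::sync::Arc;
    use tokio::sync::Mutex;

    /// Hypothetical per-key coordinator: the first caller performs the
    /// fallback lookup, every later caller reuses the stored result.
    struct SharedRetry<T: Clone> {
        slot: Mutex<Option<Result<T, String>>>,
    }

    impl<T: Clone> SharedRetry<T> {
        fn new() -> Arc<Self> {
            Arc::new(Self { slot: Mutex::new(None) })
        }

        async fn get_or_lookup<F, Fut>(&self, lookup: F) -> Result<T, String>
        where
            F: FnOnce() -> Fut,
            Fut: Future<Output = Result<T, String>>,
        {
            let mut slot = self.slot.lock().await;
            if let Some(res) = slot.as_ref() {
                // A previous caller already did the fallback lookup; share
                // its result. Note the error is a clone, not the original
                // Lookup error (the breaking-change caveat above).
                return res.clone();
            }
            let res = lookup().await;
            *slot = Some(res.clone());
            res
        }
    }

With something like this in place, the 100 waiters from the reproduction below would trigger at most one fallback lookup, at the cost of the caveats listed above. The sketch only covers a single retry round; a real implementation would also need to decide when a stored error expires.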

PS: the behaviour for lock timeouts is trickier. There the current implementation is more sensible, since it avoids getting stuck behind a long-running lookup task, but it could also be improved by making the lookup task abortable.
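
For illustration, here is a minimal sketch of what an abortable lookup could look like, again not tied to the RTCache internals: the lookup runs on its own task so a caller that hits the lock timeout can abort it rather than leave it running in the background. do_lookup and lookup_with_abort are hypothetical stand-ins, not pingora API.

    use std::time::Duration;

    // Hypothetical stand-in for a slow Lookup::lookup call.
    async fn do_lookup(_key: usize) -> Result<bool, String> {
        tokio::time::sleep(Duration::from_secs(3)).await;
        Err("not implemented".into())
    }

    async fn lookup_with_abort(key: usize, lock_timeout: Duration) -> Result<bool, String> {
        let handle = tokio::spawn(do_lookup(key));
        let abort = handle.abort_handle();
        match tokio::time::timeout(lock_timeout, handle).await {
            // The lookup finished in time; pass its result (Ok or Err) through.
            Ok(Ok(result)) => result,
            // The spawned task panicked or was cancelled.
            Ok(Err(join_err)) => Err(format!("lookup task failed: {join_err}")),
            // Timed out: abort the in-flight lookup instead of letting it linger.
            Err(_elapsed) => {
                abort.abort();
                Err("lookup timed out".into())
            }
        }
    }

Aborting is only safe if the lookup is cancellation-safe, so this would probably have to be opt-in.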

Reproduction

This simple test shows the issue: the code below leads to 100 separate lookups, where only two (one original and one fallback) would be expected.

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::time::Duration;

    // `RTCache` and `Lookup` come from pingora-memory-cache's read-through cache.
    #[tokio::test(flavor = "multi_thread")]
    async fn test_pingora_cache_storm2() {
        /// A lookup that counts how often it is called and always fails.
        #[derive(Clone, Debug, Default)]
        struct MockLookup {
            counter: Arc<AtomicUsize>,
        }

        #[async_trait::async_trait]
        impl Lookup<usize, bool, Self> for MockLookup {
            async fn lookup(
                _key: &usize,
                extra: Option<&Self>,
            ) -> Result<(bool, Option<Duration>), Box<dyn std::error::Error + Send + Sync>>
            {
                let state = extra.unwrap();
                state.counter.fetch_add(1, Ordering::SeqCst);
                // Simulate a slow backend, then fail.
                tokio::time::sleep(Duration::from_secs(3)).await;
                Err("not implemented".into())
            }
        }

        let lookup = MockLookup::default();

        let cache = Arc::new(RTCache::<usize, bool, MockLookup, MockLookup>::new(
            10_000, None, None,
        ));

        // Start 100 concurrent get()s for the same key while the first lookup is running.
        let mut set = tokio::task::JoinSet::new();
        for _index in 0..100 {
            let cache = cache.clone();
            let lookup = lookup.clone();
            set.spawn(async move {
                let _result = cache.get(&0, None, Some(&lookup)).await;
            });
        }
        set.join_all().await;

        let lookups = lookup.counter.load(Ordering::SeqCst);
        // This assertion fails today: all 100 waiters perform their own fallback lookup.
        assert_eq!(
            lookups, 2,
            "only 2 lookups should have been performed, one original and one fallback"
        );
    }
eaufavor added the enhancement and Performance labels Nov 18, 2024