Describe the bug
Note: Whether this should be considered a bug or just undocumented behaviour is debatable.
Currently `RTCache::get` will always retry on lock timeouts and on error results. The retry happens unconditionally, without any lock-based coordination. This means that if a lot of `get()`s are started while a lookup is running and that lookup fails, a lot of fallback requests are started, leading to a lookup storm.
I find this quite unexpected and potentially very problematic in production scenarios, and IMO it should be fixed. At the very least, the behaviour should be documented in doc comments.
If there is interest in fixing this, I'm happy to work on a PR.
A simple fix for the error return case would be to re-use the same lock-based logic so that only one additional lookup is performed for all waiting requesters (see the sketch after the list below).
This could be considered a breaking change though, because:
- The error returned on a shared failure would no longer be equivalent to the `Lookup` error, although you probably shouldn't rely on error message strings for anything...
- The `lock_timeout` would be respected twice, unless there is custom logic to reduce the second timeout by the time the first attempt took.
An alternative could be adding a new method with different behaviour.
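For concreteness, here is a minimal sketch of the coalescing idea, independent of RTCache's internals and assuming a tokio runtime. The `SingleFlight` type and its `run` method are hypothetical names for illustration, not pingora API:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::hash::Hash;
use std::sync::Mutex;

use tokio::sync::broadcast;

// Hypothetical helper: concurrent callers for the same key share one
// lookup instead of each starting their own fallback.
struct SingleFlight<K, V> {
    in_flight: Mutex<HashMap<K, broadcast::Sender<V>>>,
}

impl<K: Eq + Hash + Clone, V: Clone + Send + 'static> SingleFlight<K, V> {
    fn new() -> Self {
        Self { in_flight: Mutex::new(HashMap::new()) }
    }

    async fn run<F, Fut>(&self, key: K, lookup: F) -> V
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = V>,
    {
        let waiter = {
            let mut map = self.in_flight.lock().unwrap();
            match map.get(&key) {
                // A lookup for this key is already running: subscribe to it.
                Some(tx) => Some(tx.subscribe()),
                // We are first: register a channel for others to wait on.
                None => {
                    let (tx, _) = broadcast::channel(1);
                    map.insert(key.clone(), tx);
                    None
                }
            }
        };
        if let Some(mut rx) = waiter {
            match rx.recv().await {
                Ok(value) => return value,
                // Leader vanished without sending; degrade to our own lookup.
                Err(_) => return lookup().await,
            }
        }
        // Leader path: perform the single lookup and broadcast the result.
        let value = lookup().await;
        if let Some(tx) = self.in_flight.lock().unwrap().remove(&key) {
            let _ = tx.send(value.clone()); // ok if nobody is listening
        }
        value
    }
}
```

Note that the shared value must be `Clone`. That is exactly why a shared failure could not hand back the original `Box<dyn Error>` unchanged: the error would have to be wrapped (e.g. in an `Arc`) or replaced, hence the first caveat above.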
PS: The behaviour for lock timeouts is trickier; there the current implementation is more sensible, since retrying avoids getting stuck behind a long-running task. It could still be improved by making the lookup task abortable (see the second sketch below).
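As an illustration of the abortable-lookup idea: spawn the lookup as a task and abort it once the caller's timeout elapses. This is only a sketch; `lookup_with_abort` and the durations are placeholders:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Placeholder: run a (slow) lookup as an abortable task so a lock timeout
// can cancel it instead of leaving it running unsupervised.
async fn lookup_with_abort() -> Result<u32, &'static str> {
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(3)).await; // stand-in lookup
        42u32
    });
    let abort = handle.abort_handle();
    match timeout(Duration::from_millis(100), handle).await {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(_join_error)) => Err("lookup task panicked or was aborted"),
        Err(_elapsed) => {
            abort.abort(); // cancel the stuck lookup rather than leak it
            Err("lookup timed out")
        }
    }
}
```

With something like this, a waiter that hits its lock timeout could cancel the stuck lookup instead of racing a second one against it.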
Reproduction
This simple test shows the issue. The code below leads to 100 separate lookups instead of the expected 2.
```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

use pingora_memory_cache::{Lookup, RTCache};

#[tokio::test(flavor = "multi_thread")]
async fn test_pingora_cache_storm2() {
    #[derive(Clone, Debug, Default)]
    struct MockLookup {
        counter: Arc<AtomicUsize>,
    }

    #[async_trait::async_trait]
    impl Lookup<usize, bool, Self> for MockLookup {
        async fn lookup(
            _key: &usize,
            extra: Option<&Self>,
        ) -> Result<(bool, Option<Duration>), Box<dyn std::error::Error + Send + Sync>>
        {
            let state = extra.unwrap();
            // Count every lookup so the test can check how many were started.
            state.counter.fetch_add(1, Ordering::SeqCst);
            // Slow lookup that always fails.
            tokio::time::sleep(Duration::from_secs(3)).await;
            Err("not implemented".into())
        }
    }

    let lookup = MockLookup::default();
    let cache = Arc::new(RTCache::<usize, bool, MockLookup, MockLookup>::new(
        10_000, None, None,
    ));

    // Start 100 concurrent get()s for the same key while the lookup is running.
    let mut set = tokio::task::JoinSet::new();
    for _index in 0..100 {
        let cache = cache.clone();
        let lookup = lookup.clone();
        set.spawn(async move {
            let _result = cache.get(&0, None, Some(&lookup)).await;
        });
    }
    set.join_all().await;

    // Fails today: the counter reaches 100 because every waiter retries.
    let lookups = lookup.counter.load(Ordering::SeqCst);
    assert_eq!(
        lookups, 2,
        "only 2 lookups should have been performed, one original and one fallback"
    );
}
```