Describe the bug
Note: Whether this should be considered a bug or just undocumented behaviour is debatable.
Currently `RTCache::get` will always retry on lock timeouts and on error results. The retry happens unconditionally, without any lock-based coordination. This means that if a lot of `get()`s are started while a lookup is running and that lookup fails, a lot of fallback requests are started, leading to a lookup storm.
I find this quite unexpected and potentially very problematic in production scenarios, and IMO it should be fixed. At the very least, the behaviour should be documented in doc comments.
If there is interest in fixing this, I'm happy to work on a PR.
A simple fix for the error return case would be to re-use the same lock-based logic so that only one additional lookup is performed for all waiting requesters (see the sketch after the list below).
This could be considered a breaking change though, because:
- The error returned on a shared failure would no longer be equivalent to the `Lookup` error, although you probably shouldn't rely on error message strings for anything...
- The `lock_timeout` would be respected twice, unless there is custom logic to reduce the second timeout by the time the first attempt took.
An alternative could be adding a new method with different behaviour.
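For concreteness, here is a minimal sketch of the coalescing idea, independent of RTCache's internals and assuming a tokio runtime. The `SingleFlight` type and its `run` method are hypothetical names for illustration, not pingora API:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::hash::Hash;
use std::sync::Mutex;

use tokio::sync::broadcast;

// Hypothetical helper: concurrent callers for the same key share one
// lookup instead of each starting their own fallback.
struct SingleFlight<K, V> {
    in_flight: Mutex<HashMap<K, broadcast::Sender<V>>>,
}

impl<K: Eq + Hash + Clone, V: Clone + Send + 'static> SingleFlight<K, V> {
    fn new() -> Self {
        Self { in_flight: Mutex::new(HashMap::new()) }
    }

    async fn run<F, Fut>(&self, key: K, lookup: F) -> V
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = V>,
    {
        let waiter = {
            let mut map = self.in_flight.lock().unwrap();
            match map.get(&key) {
                // A lookup for this key is already running: subscribe to it.
                Some(tx) => Some(tx.subscribe()),
                // We are first: register a channel for others to wait on.
                None => {
                    let (tx, _) = broadcast::channel(1);
                    map.insert(key.clone(), tx);
                    None
                }
            }
        };
        if let Some(mut rx) = waiter {
            match rx.recv().await {
                Ok(value) => return value,
                // Leader vanished without sending; degrade to our own lookup.
                Err(_) => return lookup().await,
            }
        }
        // Leader path: perform the single lookup and broadcast the result.
        let value = lookup().await;
        if let Some(tx) = self.in_flight.lock().unwrap().remove(&key) {
            let _ = tx.send(value.clone()); // ok if nobody is listening
        }
        value
    }
}
```

Note that the shared value must be `Clone`. That is exactly why a shared failure could not hand back the original `Box<dyn Error>` unchanged: the error would have to be wrapped (e.g. in an `Arc`) or replaced, hence the first caveat above.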
PS: The behaviour for lock timeouts is trickier; there the current implementation is more sensible, since retrying avoids getting stuck behind a long-running task. It could still be improved by making the lookup task abortable (see the second sketch below).
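As an illustration of the abortable-lookup idea: spawn the lookup as a task and abort it once the caller's timeout elapses. This is only a sketch; `lookup_with_abort` and the durations are placeholders:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Placeholder: run a (slow) lookup as an abortable task so a lock timeout
// can cancel it instead of leaving it running unsupervised.
async fn lookup_with_abort() -> Result<u32, &'static str> {
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(3)).await; // stand-in lookup
        42u32
    });
    let abort = handle.abort_handle();
    match timeout(Duration::from_millis(100), handle).await {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(_join_error)) => Err("lookup task panicked or was aborted"),
        Err(_elapsed) => {
            abort.abort(); // cancel the stuck lookup rather than leak it
            Err("lookup timed out")
        }
    }
}
```

With something like this, a waiter that hits its lock timeout could cancel the stuck lookup instead of racing a second one against it.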
Reproduction
This simple test shows the issue. The code below leads to 100 separate lookups instead of the expected 2.
```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

use pingora_memory_cache::{Lookup, RTCache};

#[tokio::test(flavor = "multi_thread")]
async fn test_pingora_cache_storm2() {
    #[derive(Clone, Debug, Default)]
    struct MockLookup {
        counter: Arc<AtomicUsize>,
    }

    #[async_trait::async_trait]
    impl Lookup<usize, bool, Self> for MockLookup {
        async fn lookup(
            _key: &usize,
            extra: Option<&Self>,
        ) -> Result<(bool, Option<Duration>), Box<dyn std::error::Error + Send + Sync>>
        {
            let state = extra.unwrap();
            // Count every lookup so the test can check how many were started.
            state.counter.fetch_add(1, Ordering::SeqCst);
            // Slow lookup that always fails.
            tokio::time::sleep(Duration::from_secs(3)).await;
            Err("not implemented".into())
        }
    }

    let lookup = MockLookup::default();
    let cache = Arc::new(RTCache::<usize, bool, MockLookup, MockLookup>::new(
        10_000, None, None,
    ));

    // Start 100 concurrent get()s for the same key while the lookup is running.
    let mut set = tokio::task::JoinSet::new();
    for _index in 0..100 {
        let cache = cache.clone();
        let lookup = lookup.clone();
        set.spawn(async move {
            let _result = cache.get(&0, None, Some(&lookup)).await;
        });
    }
    set.join_all().await;

    // Fails today: the counter reaches 100 because every waiter retries.
    let lookups = lookup.counter.load(Ordering::SeqCst);
    assert_eq!(
        lookups, 2,
        "only 2 lookups should have been performed, one original and one fallback"
    );
}
```