
Implementation of CPU throttling #1689

Open · wants to merge 1 commit into base: unstable

Conversation

@lschmidtcavalcante-sc commented Feb 7, 2025

Proposed implementation of CPU throttling (still a work in progress).

The implementation adds a new config, cpu_overload_protection_threshold: if set to 0, CPU throttling is disabled; otherwise, a percentage of requests (cpu_overload_protection_throttling_percentage) is throttled whenever CPU usage is above that threshold.

The percentage of requests to be throttled (answered with a THROTTLED error) is adjusted every 10 seconds, as sketched below:

  • If CPU usage is above the threshold, double the number of requests to be throttled (starting at 1);
  • Otherwise, halve the number of requests to be throttled (eventually reaching zero).
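
Roughly, the periodic adjustment looks like this (a minimal sketch only; the stand-in variables and the current_cpu_percent() helper are placeholders, not necessarily the exact identifiers in this PR):

```c
/* Stand-ins mirroring the configs described above. */
static int cpu_overload_protection_threshold;             /* 0 = feature disabled */
static int cpu_overload_protection_throttling_percentage; /* share of requests to throttle */

int current_cpu_percent(void); /* hypothetical helper: recent CPU usage in percent */

/* Invoked from a 10-second timer (e.g. from serverCron). */
static void cpu_throttle_adjust(void) {
    if (cpu_overload_protection_threshold == 0) return;

    if (current_cpu_percent() > cpu_overload_protection_threshold) {
        /* Over the threshold: double the throttled share, starting at 1, capped at 100. */
        if (cpu_overload_protection_throttling_percentage == 0)
            cpu_overload_protection_throttling_percentage = 1;
        else if (cpu_overload_protection_throttling_percentage < 100)
            cpu_overload_protection_throttling_percentage *= 2;
        if (cpu_overload_protection_throttling_percentage > 100)
            cpu_overload_protection_throttling_percentage = 100;
    } else {
        /* Under the threshold: halve it, eventually reaching zero again. */
        cpu_overload_protection_throttling_percentage /= 2;
    }
}
```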

Note that some commands are never throttled (see is_client_excluded_from_cpu_throttling and is_cmd_excluded_from_cpu_throttling); they are mostly commands related to intra-cluster communication. After all, we want to remain able to scale up the cluster to move away from this state.
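
A sketch of the per-request decision, assuming the exclusion helpers named above (should_throttle_request() and the THROTTLED reply text are illustrative placeholders, not the PR's actual code):

```c
#include <stdlib.h> /* rand */

static int should_throttle_request(client *c) {
    if (cpu_overload_protection_throttling_percentage == 0) return 0; /* nothing to throttle */
    if (is_client_excluded_from_cpu_throttling(c)) return 0;          /* e.g. cluster peers */
    if (is_cmd_excluded_from_cpu_throttling(c->cmd)) return 0;        /* e.g. cluster commands */

    /* Throttle roughly that percentage of the remaining requests. */
    return (rand() % 100) < cpu_overload_protection_throttling_percentage;
}

/* In the command dispatch path (illustrative):
 *   if (should_throttle_request(c)) {
 *       addReplyError(c, "THROTTLED server is shedding load due to high CPU usage");
 *       return C_OK;
 *   }
 */
```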

Issue #1688

Signed-off-by: Lucas Schmidt Cavalcante <[email protected]>
@PingXie (Member) commented Feb 8, 2025

Thanks for the contribution, @lschmidtcavalcante-sc!

This is a very valid problem to solve for clusters, but I don't think we should apply back-pressure based on CPU usage, which doesn't scale linearly with the workload level. The main challenges that I can think of, ranked by their impact and likelihood, are:

  1. Long-running Lua scripts block the main thread.

Today we can only safely time out a read-only Lua script, but as soon as the script modifies a key, it has to run to completion. Aborting a write script safely would require applying changes in an all-or-none way.

Along the same lines, long-running non-script commands (with O(N) complexity) also risk starving the cluster bus, leading to premature node failures. We could consider a generic solution to either time them out or time-slice their execution with the cluster bus processing. However, the latter could break consistency, since the cluster bus can churn the topology mid-execution.

  2. An excessive number of active clients delays cluster connection processing.

We could explore limiting the epoll processing batch size and additionally prioritizing the processing of the cluster connections. So rather than competing against 10,000 active clients, the cluster connections would share the CPU time with, say, 1,000 clients in every batch.
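
A rough sketch of that idea, under stated assumptions (the conn type and the is_cluster_conn() / handle_conn_event() helpers are invented for illustration, and 1,000 is just the example batch size from this comment):

```c
#define CLIENT_EVENTS_PER_BATCH 1000

typedef struct conn conn;
int is_cluster_conn(conn *c);    /* placeholder: true for cluster-bus links */
void handle_conn_event(conn *c); /* placeholder: read/parse/execute for this connection */

void process_ready_conns(conn **ready, int numready) {
    int clients_handled = 0;

    /* First pass: always service cluster connections, so the cluster bus
     * never starves behind tens of thousands of active clients. */
    for (int i = 0; i < numready; i++)
        if (is_cluster_conn(ready[i])) handle_conn_event(ready[i]);

    /* Second pass: handle at most CLIENT_EVENTS_PER_BATCH regular clients;
     * the rest stay ready and are picked up on the next event-loop iteration. */
    for (int i = 0; i < numready && clients_handled < CLIENT_EVENTS_PER_BATCH; i++) {
        if (!is_cluster_conn(ready[i])) {
            handle_conn_event(ready[i]);
            clients_handled++;
        }
    }
}
```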

On a related note, I do think back-pressuring on memory stress is more valuable to reduce the OOM risk.

@lschmidtcavalcante-sc (Author) commented Feb 10, 2025

Hey @PingXie, let me give you more context so we are addressing the same issue. At Snap we have observed that, from time to time, a team launches a new feature that dramatically increases the traffic to a given cluster but forgets to scale it up; if CPU usage then remains high for a sustained period, some nodes are lost.

It is still early to tell how effective CPU throttling has been for Snap (also, we are still using KeyDB), but our internal testing indicated that clusters under heavy load with CPU throttling enabled did not lose nodes, giving the on-call extra time to take action. To address the comment at #1688 (comment) by @xbasel: our internal testing did not assume any handling from the clients.

@artikell (Contributor) commented Feb 14, 2025

I have the following questions:

  • What should be done if the throttled requests have fully occupied the CPU?
  • Is adding a new error message unfriendly to users? Why not consider blocking the requests instead, like isPausedActions does?
  • In the avalanche scenario, is it very likely that the CPU will be fully occupied continuously until the service becomes unavailable?

@lschmidtcavalcante-sc (Author) commented Feb 14, 2025

@artikell, what do you mean by throttled requests having fully occupied the CPU? Regarding the introduction of a new error code, we intend to re-use the one from #1672, and I don't think we had any issues at Snap introducing a new error code (@ovaiskhansc can correct me). We also don't control which client people use to access our clusters, nor does this solution expect any handling from the client side.

Keep in mind that this is a feature behind a flag that I expect to be disabled by default.
