
Implementation of CPU throttling #1689

Open · wants to merge 1 commit into base: unstable

Conversation

@lschmidtcavalcante-sc commented Feb 7, 2025

Proposed implementation of CPU throttling (still a work in progress).

The implementation adds a new config, cpu_overload_protection_threshold: if set to 0, CPU throttling is disabled; otherwise, a percentage of requests (cpu_overload_protection_throttling_percentage) is throttled whenever CPU usage is above that threshold.

The percentage of requests to be throttled (answered with a THROTTLED error) is adjusted every 10 seconds, as sketched below:

  • If CPU usage is above the threshold, double the number of requests to be throttled (starting at 1);
  • Otherwise, halve the number of requests to be throttled (eventually reaching zero).
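
Roughly, the periodic adjustment looks like this (a minimal sketch only; the stand-in variables and the current_cpu_percent() helper are placeholders, not necessarily the exact identifiers in this PR):

```c
/* Stand-ins mirroring the configs described above. */
static int cpu_overload_protection_threshold;             /* 0 = feature disabled */
static int cpu_overload_protection_throttling_percentage; /* share of requests to throttle */

int current_cpu_percent(void); /* hypothetical helper: recent CPU usage in percent */

/* Invoked from a 10-second timer (e.g. from serverCron). */
static void cpu_throttle_adjust(void) {
    if (cpu_overload_protection_threshold == 0) return;

    if (current_cpu_percent() > cpu_overload_protection_threshold) {
        /* Over the threshold: double the throttled share, starting at 1, capped at 100. */
        if (cpu_overload_protection_throttling_percentage == 0)
            cpu_overload_protection_throttling_percentage = 1;
        else if (cpu_overload_protection_throttling_percentage < 100)
            cpu_overload_protection_throttling_percentage *= 2;
        if (cpu_overload_protection_throttling_percentage > 100)
            cpu_overload_protection_throttling_percentage = 100;
    } else {
        /* Under the threshold: halve it, eventually reaching zero again. */
        cpu_overload_protection_throttling_percentage /= 2;
    }
}
```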

Note that some commands are never throttled (see is_client_excluded_from_cpu_throttling and is_cmd_excluded_from_cpu_throttling); they are mostly commands related to intra-cluster communication. After all, we want to remain able to scale up the cluster to move away from this state.
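
A sketch of the per-request decision, assuming the exclusion helpers named above (should_throttle_request() and the THROTTLED reply text are illustrative placeholders, not the PR's actual code):

```c
#include <stdlib.h> /* rand */

static int should_throttle_request(client *c) {
    if (cpu_overload_protection_throttling_percentage == 0) return 0; /* nothing to throttle */
    if (is_client_excluded_from_cpu_throttling(c)) return 0;          /* e.g. cluster peers */
    if (is_cmd_excluded_from_cpu_throttling(c->cmd)) return 0;        /* e.g. cluster commands */

    /* Throttle roughly that percentage of the remaining requests. */
    return (rand() % 100) < cpu_overload_protection_throttling_percentage;
}

/* In the command dispatch path (illustrative):
 *   if (should_throttle_request(c)) {
 *       addReplyError(c, "THROTTLED server is shedding load due to high CPU usage");
 *       return C_OK;
 *   }
 */
```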

Issue #1688

Signed-off-by: Lucas Schmidt Cavalcante <[email protected]>
@PingXie (Member) commented Feb 8, 2025

Thanks for the contribution, @lschmidtcavalcante-sc!

This is a very valid problem to solve for clusters, but I don't think we should apply back-pressure based on CPU usage, which doesn't scale linearly with the workload level. The main challenges that I can think of, ranked by their impact and likelihood, are:

  1. Long-running Lua scripts block the main thread.

Today we can only safely time out a read-only Lua script, but as soon as the script modifies a key, it has to run to completion. Aborting a write script safely would require applying changes in an all-or-none way.

Along the same lines, long-running non-script commands (with O(N) complexity) also risk starving the cluster bus, leading to premature node failures. We could consider a generic solution to either time them out or time-slice their execution with the cluster bus processing. However, the latter could break consistency, since the cluster bus can churn the topology mid-execution.

  2. An excessive number of active clients delays cluster connection processing.

We could explore limiting the epoll processing batch size and additionally prioritizing the processing of the cluster connections. So rather than competing against 10,000 active clients, the cluster connections would share the CPU time with, say, 1,000 clients in every batch.
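
A rough sketch of that idea, under stated assumptions (the conn type and the is_cluster_conn() / handle_conn_event() helpers are invented for illustration, and 1,000 is just the example batch size from this comment):

```c
#define CLIENT_EVENTS_PER_BATCH 1000

typedef struct conn conn;
int is_cluster_conn(conn *c);    /* placeholder: true for cluster-bus links */
void handle_conn_event(conn *c); /* placeholder: read/parse/execute for this connection */

void process_ready_conns(conn **ready, int numready) {
    int clients_handled = 0;

    /* First pass: always service cluster connections, so the cluster bus
     * never starves behind tens of thousands of active clients. */
    for (int i = 0; i < numready; i++)
        if (is_cluster_conn(ready[i])) handle_conn_event(ready[i]);

    /* Second pass: handle at most CLIENT_EVENTS_PER_BATCH regular clients;
     * the rest stay ready and are picked up on the next event-loop iteration. */
    for (int i = 0; i < numready && clients_handled < CLIENT_EVENTS_PER_BATCH; i++) {
        if (!is_cluster_conn(ready[i])) {
            handle_conn_event(ready[i]);
            clients_handled++;
        }
    }
}
```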

On a related note, I do think back-pressuring on memory stress is more valuable to reduce the OOM risk.

@lschmidtcavalcante-sc (Author) commented Feb 10, 2025

Hey @PingXie, let me give you more context so we are addressing the same issue. At Snap we have observed that, from time to time, a team launches a new feature that dramatically increases the traffic to a given cluster but forgets to scale it up; if CPU usage then remains high for a sustained period, some nodes are lost.

It is still early to tell how effective CPU throttling has been for Snap (also, we are still using KeyDB), but our internal testing indicated that clusters under heavy load with CPU throttling enabled did not lose nodes, giving the on-call extra time to take action. To address the comment at #1688 (comment) by @xbasel: our internal testing did not assume any handling from the clients.

@artikell (Contributor) commented Feb 14, 2025

I have the following questions:

  • What should be done if the throttled requests have fully occupied the CPU?
  • Is adding a new error message unfriendly to users? Why not consider blocking the requests instead, like isPausedActions does?
  • In the avalanche scenario, is it very likely that the CPU will be fully occupied continuously until the service becomes unavailable?

@lschmidtcavalcante-sc (Author) commented Feb 14, 2025

@artikell, what do you mean by throttled requests having fully occupied the CPU? Regarding the introduction of a new error code, we intend to re-use the one from #1672, and I don't think we had any issues at Snap introducing a new error code (@ovaiskhansc can correct me). We also don't control which client people use to access our clusters, nor does this solution expect any handling from the client side.

Keep in mind that this is a feature behind a flag that I expect to be disabled by default.
