Circuit breaker #266

Kamil-Lontkowski · 2025-01-09T14:34:02Z

This draft implements CricuitBreaker with features based on that are provided in breaker from resilience4j.

Those features, for count based window (last n operations):

defining threshold for failures
defining threshold for slow operations, and threshold for what is considered slow
minimum number of operations until thresholds are calculated
HalfOpen state that allows number of operations to pass, after that it is calculated if those operations where below threshold so we can close breaker or go back to open.
time after which breaker goes from Open to HalfOpen

Things that are not implemented in here but are in resilience4j.

Timeout for HalfOpen state, after which breaker goes back to Open
Ability to turn of automatic transition from Open to HalfOpen state

Kamil-Lontkowski · 2025-01-09T14:48:57Z

For now this is incomplete since I have some questions.

Right now I am not keeping results of calls made in HalfOpen state separate. Maybe it should be, because different number of operations can push rate more or less depending on the difference.
Should I add those 2 missing features from resilience4j? Flag seems necessary, since we should not worry about background thread. But HalfOpen timeout might be useful, but it will complicate a bit process of registering operation started in HalfOpen state, so we don't end up with wrong metrics.
Right now there is only runOrDrop operation defined. Since resilience4j always throw exception I am not sure what other interfaces would make sense.
Would it make sense to have ability to completely wipe out current state of circuit breaker?

Right now only now only count based sliding window is implemented but I wanted to discuss those questions right away.

With time based sliding windows I think that background loop evicting older entries from queue similar to SlidingWindow RateLimiterAlgorithm should be sufficient.

core/src/main/scala/ox/resilience/CircuitBreaker.scala

core/src/main/scala/ox/resilience/AtomicCircularBuffer.scala

core/src/main/scala/ox/resilience/CircuitBreaker.scala

adamw · 2025-01-10T11:36:48Z

As for the questions:

hm I don't know - how do circuit breakers work normally? maybe you could find some articles describing typical designs, and link to them here or in the comments; I think we should aim for whatever is "industry standard"
again, that's a question of how circuit breakers work in general. Intuitively, the half open state should transition to open/closed after certain conditions are met, but I'm sure there's plenty of edge-cases to consider
yeah, runOrDrop is fine
I think any use-case for wiping can be served by simply creating a new CB, so let's omit wiping for now

Kamil-Lontkowski · 2025-01-13T17:33:18Z

Answer based on what is available in pekko/akka, monix, rezilience (zio). circuit (cats-effect) and polly(C#)

resilience4j provide some more configuration in this regard than other libs. I think it would be best to calculate metrics for different states separately. It is more intuitive how rates are calculated. Other libs allow just for one call and if it succeeds it closes.
This also seems like more flexibility on resilience4j side.

Monix, circuit and Pekko/Akka works exactly the same. They count failures (or slow calls) in a row not a rate based on window. Then wait before going to halfOpen and then deciding based on one operation result. Plus the wait duration before transitioning to halfOpen is configured as backoff.

rezilience provides maxFailures in a row just like monix and also count based sliding window. It also supports different schedules for waiting before going to halfOpen state. It also allows only one call to decide if it goes back to open or close.

Polly is little different but only in few cases. It provides threshold rates for sampling window of some duration. As I understand it in effect means sliding window(But maybe it is simpler and works just like fixed window). It also support minimum number of calls to be able trip in a sample. It also allows for dynamically determining break duration before switching to halfOpen.It also provide ability to set state manually and reading current state through CircuitBreakerStateProvider.

Zio, resilience4j and rezilience also provides ability to consume different events like state changes or metrics.

core/src/main/scala/ox/resilience/CircuitBreaker.scala

core/src/test/scala/ox/resilience/CircuitBreakerTest.scala

adamw · 2025-01-14T16:25:37Z

Answer based on what is available in pekko/akka, monix, rezilience (zio). circuit (cats-effect) and polly(C#)

Good analysis, thanks :) So basing on that, what design would you propose? What would be the configuration options, and the algorithm of transferring between closed/ho/open states?

Not sure if we need both count-based and windowed variants - isn't the count-based variant a windowed variant, but with window duration = Inf?

Kamil-Lontkowski · 2025-01-14T17:21:03Z

There is difference that if we would just treat count based as sliding window with Inf we would always have to count all results. Window size defines how many n last operation we want to include in metrics. I wanted to move all state machine logic to base trait and only difference between implementations would be how we calculate metrics.

If we leave both variants we can have all functionalities (maybe apart from ability to consume events). Giving proper arguments we can mimic pekko and monix behavior exactly. I am only debating if we would want to support any Infinite schedule when it comes to those durations, but I would want to have proper implementation of all other functionalities before that, then see if it fits.

adamw · 2025-01-15T08:36:31Z

But in a count-based approach, you're counting all results anyway?

Kamil-Lontkowski · 2025-01-15T11:40:58Z

Yeah, but callResults is a very basic implementation of CircularBuffer so we count only on max n call results and don't hold in memory more results than we need. The writeIndex is increment during registration of result.

Kamil-Lontkowski · 2025-01-16T15:33:08Z

I changed calculating state change to be pure function. Added docs in code and some tests for the state machine. I will add more tests especially testing if schedules and timeout for state changes work properly. I also need to add docs, I think in this case some kind of diagram would be helpful to understand how this work.

Kamil-Lontkowski · 2025-01-16T15:34:37Z

Also seems like CI is broken because of deprecation of actions/upload-artifact: v3

adamw · 2025-01-18T08:59:41Z

Also seems like CI is broken because of deprecation of actions/upload-artifact: v3

Let's fix it then, in a separate PR then :)

core/src/main/scala/ox/resilience/CircuitBreaker.scala

core/src/main/scala/ox/resilience/CircuitBreakerStateMachine.scala

core/src/main/scala/ox/resilience/AtomicCircularBuffer.scala

core/src/test/scala/ox/resilience/CircuitBreakerTest.scala

core/src/main/scala/ox/resilience/CircuitBreakerStateMachine.scala

adamw · 2025-01-22T08:01:42Z

I added some comments with some questions to consider what the behaviour should be in case of concurrently running operations. I think it would be good to describe the algorithm (in a comment), detailing exactly when state transitions occur, depending on configuration, and what happens if there are other operations who report their results later. We should should be ready for hundreds/thousands of such concurrently running operations reporting failures. So before fixing this (and other) tests, maybe let's consider some possibilities as to how to best fix the problem.

Kamil-Lontkowski · 2025-01-22T09:18:49Z

Regarding the failing test, the assumption is wrong. It should not go back to closed after the halfOpenTimeoutDuration passes. This timeout is guarding slow operations, so if we have halfOpen state and operations are slow we open again if it exceeds timeout. Only way to go from HalfOpen to Closed is to complete numberOfCallsInHalfOpenState within timeout duration and without exceeding any threshold.

But I think in case of many concurrent operations we have cases where some operation could break statistics. For example operation started in Closed state that finishes in HalfOpen state will not count towards finished operations in HalfOpen state but will be included in rates. I think we can guard against it by checking if AcquireResult has the same state that we are in currently. It will not work for cases where operation last for example whole cycle of changing states. But it can also be done - just add since timestamp in Open state and make distinction based on that. So it will just ignore results that were started in different state. All statistics are cleared on change state anyway so we just need to not count the ones started in different state.

adamw · 2025-01-22T09:21:55Z

@Kamil-Lontkowski ok, can you then first fix my test case (fixing the assertions/ checking that it eventually does transit to HO), and add a test case for the scenario you describe? So that we first have a failing test case, and then we can proceed to fixing the problem

adamw · 2025-01-22T09:22:29Z

And even more so, a description of when state transitions occur is crucial :) (for understanding & using the CB)

Kamil-Lontkowski · 2025-01-23T08:40:59Z

For now the part in docs looks like that

Maybe these colors are a bit much, but since I had mistaken Closed and Open state many times I wanted to make the Closed pop as the one allowing operations.
This was made with mermaidjs

stateDiagram
  direction LR
  classDef red fill:red
  classDef green fill:green
  classDef yellow fill:yellow

  Closed --> Open: 1
  Closed --> HalfOpen: 2
  Open --> HalfOpen: 3
  HalfOpen --> Open: 4
  HalfOpen --> Closed: 5

  class Closed green
  class Open red
  class HalfOpen yellow

Kamil-Lontkowski commented Jan 9, 2025

View reviewed changes