Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probing Dapr's /healthz endpoint throws sidecar error #8472

Open
rochabr opened this issue Feb 7, 2025 · 3 comments
Open

Probing Dapr's /healthz endpoint throws sidecar error #8472

rochabr opened this issue Feb 7, 2025 · 3 comments
Labels
kind/bug Something isn't working

Comments

@rochabr
Copy link

rochabr commented Feb 7, 2025

In what area(s)?

/area runtime

What version of Dapr?

CLI version: 1.15.0-rc.5
Runtime version: 1.15.0-rc.10

Expected Behavior

In a dotnet application we are using Dapr's /healthz endpoint as a startup probe to check if the sidecar was ready before the app became functional.

Sample code:

[HttpGet("/healthz")]
public async Task<IActionResult> IsHealthy()
{
    // Ensure dapr sidecar is running and healthy. If not, fail the health check and have the pod restarted.
    // This should prevent the case where the application container is running before dapr is installed in
    // the case of a gitops deploy.

    var response = await _httpClient.GetAsync($"http://localhost:{DaprHttpPort}/v1.0/healthz");

    return new StatusCodeResult((int)response.StatusCode);
}
startupProbe:
    httpGet:
      path: /healthz
      port: 80
    failureThreshold: 6
    periodSeconds: 10

Prior to 1.15, this worked well, but testing it on 1.15.0-rc.10, a particular workflow sidecar logs show:

level=error msg="Failed to receive scheduler hosts: rpc error: code = Canceled desc = context canceled" app_id=order-processor-workflow instance=order-processor-workflow-7f898f88d6-l48qc scope=dapr.runtime.scheduler type=log ver=1.15.0-rc.10

and

time="2025-02-06T00:00:02.68682165Z" level=info msg="dapr initialized. Status: Running. Init Elapsed 255ms" app_id=order-processor-workflow instance=order-processor-workflow-86b49d957b-qs7nk scope=dapr.runtime type=log ver=1.15.0-rc.10
time="2025-02-06T00:00:02.697086734Z" level=debug msg="api error: code = Internal desc = dapr is not ready" app_id=order-processor-workflow instance=order-processor-workflow-86b49d957b-qs7nk scope=dapr.runtime.http type=log ver=1.15.0-rc.10
time="2025-02-06T00:00:02.699513345Z" level=debug msg="App health probe successful: false" app_id=order-processor-workflow instance=order-processor-workflow-86b49d957b-qs7nk scope=dapr.apphealth type=log ver=1.15.0-rc.10

Removing that startup probe fixes the issue and the app works as expected.

It's important to note that other applications that are not workflows have the same startup probe and work properly.

Steps to Reproduce the Problem

A simple workflow application with a startup probe that points to the dapr healthz endpoint should throw this error.

@rochabr rochabr added the kind/bug Something isn't working label Feb 7, 2025
@alicejgibbons
Copy link
Contributor

After discussion, we should recommend that folks no longer add a health probe on application start-up that hits the Dapr healthz endpoint given that the SDK should check for this already. In 1.15 for actors and workflows, this check is already being made and this circular dependency is causing the failing app (with the code above).

As part of this we need to:

@WhitWaldo
Copy link
Contributor

After discussion, we should recommend that folks no longer add a health probe on application start-up that hits the Dapr healthz endpoint given that the SDK should check for this already. In 1.15 for actors and workflows, this check is already being made and this circular dependency is causing the failing app (with the code above).

As part of this we need to:

* [ ]  update the docs to account for this best practice guidance, specifically: https://docs.dapr.io/operations/resiliency/health-checks/sidecar-health/#outbound-health-endpoint.* [ ]  Update the SDKs to ensure that the `waitForSidecar/wait_until_ready` methods are listed as "deprecated" in the SDKs so that it prompts people to stop using them.
    
    * [ ]  Dotnet sdk: https://docs.dapr.io/developing-applications/sdks/dotnet/dotnet-client/#wait-for-sidecar* [ ]   Java sdk: https://docs.dapr.io/developing-applications/sdks/java/java-client/#wait-for-sidecar* [ ]  Python sdk: https://docs.dapr.io/developing-applications/sdks/python/python-client/#health-timeout* [ ]  JS SDK: https://github.com/dapr/js-sdk/blob/4189a3d2ad6897406abd766f4ccbf2300c8f8852/src/interfaces/Client/IClientHealth.ts#L14

My takeaway was slightly different, so this might be worth hashing out once more next week. What I took from the discussion is that the health endpoint isn't really intended for SDKs to use and instead intended for load balancer and K8s health probes instead. So the SDK itself shouldn't expose this method to developers to use (marking them as obsolete) and that even the SDK itself shouldn't bother doing a check because if the port is open to send the request to Daprd, the runtime is listening to requests and that it'll block requests until it's ready, then start responding to each. Thus, no health check needed.

@olitomlinson
Copy link

olitomlinson commented Feb 7, 2025

After discussion, we should recommend that folks no longer add a health probe on application start-up that hits the Dapr healthz endpoint given that the SDK should check for this already. In 1.15 for actors and workflows, this check is already being made and this circular dependency is causing the failing app (with the code above).

"we should recommend that folks no longer add a health probe on application start-up that hits the Dapr healthz endpoint"

Yes, even prior to 1.15 it seems pretty crazy that people were doing this, and demonstrates a lack of understanding.

However, I'm not clear how OPs original issue raised has anything to do with the WaitForSidecar SDK method which uses v1.0/healthz/outbound and NOT v1.0/healthz -- The two endpoints have very different purposes, and I'm not fully sure why WaitForSidecar is being implicated and deprecated as part of this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants