Include healthcheck logic for helper scripts running as sidecars #1842
base: alpha
Conversation
FWIW Here is my preview network pool and cncli containers showing healthy once the script was copied in and the healthcheck interval was reached:
Looks good!
- cp the script into the cncli sync, validate, leaderlog, pt-send-slots and pt-send-tip containers
- Execute the script with docker exec
- Monitor the containers until the healthcheck interval occurs and confirm that they are marked healthy
- Adjusted RETRIES
- Adjusted CPU_THRESHOLD
Further testing... I was able to test with higher CPU load after deleting the cncli db and re-syncing. Result:
Line 44 of healthcheck.sh: This seems to fix it...
With the above change, when cpu load is higher than CPU_THRESHOLD, this is the result:
Yeah, there are rare instances where the cncli percentage can be high, but this tends to be when resyncing the entire db and/or a cncli init is running. Occasionally, if there is an issue with the node process itself, like if it gets stuck in chainsync/blockfetch and never completes, I have also seen cncli get a high percentage, but otherwise it's quite rare to see it increase. I figured with mithril-signer or db-sync, it might be more useful.
@adamsthws Feel free to submit suggestions to adjust the Thanks for the testing.
Testing revealed that setting RETRIES=0 results in the script exiting 1 without running the loop... it would be preferable to run the loop once when RETRIES=0. Suggestion: modify the loop condition to handle RETRIES=0 by changing line 39 to the following:
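For illustration, a minimal sketch of the kind of change being described (the actual suggested line is not reproduced in this thread; the loop style and placeholder body are assumptions):

```bash
# Sketch only: run the CPU check at least once, even when RETRIES=0,
# by counting from 0 up to and including RETRIES.
for (( try=0; try<=RETRIES; try++ )); do
  # ... existing CPU_THRESHOLD check and 3 second sleep would go here ...
  :
done
```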
Or...
I started thinking about a cncli-specific check. The following function is an idea to check cncli status...
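The function itself was not captured in this thread; the sketch below is a hypothetical reconstruction of the idea, assuming cncli's status subcommand and typical guild-operators paths (both are assumptions and may differ from the original suggestion):

```bash
# Hypothetical sketch: 'cncli status' prints {"status":"ok"} when the cncli db
# is within tolerance of the node tip. Paths below are assumptions.
check_cncli_status() {
  local db="${CNODE_HOME}/guild-db/cncli/cncli.db"
  local genesis_dir="${CNODE_HOME}/files"
  if cncli status --db "${db}" \
      --byron-genesis "${genesis_dir}/byron-genesis.json" \
      --shelley-genesis "${genesis_dir}/shelley-genesis.json" 2>/dev/null |
      grep -q '"ok"'; then
    return 0  # considered in sync
  fi
  return 1    # errored or not in sync
}
```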
Perhaps it would be improved further by also checking whether sync is incrementing, so the healthcheck doesn't fail during initial sync. How would you feel about adding me as a commit co-author if you decide to use this?
@adamsthws I'm happy to make you a co-author even for something simple; for example, if you know how to submit a suggestion, go ahead and apply one for
In regard to the larger block for cncli checks, first, it is clear a lot of thought went into it. This portion:
Sleeps of 180 exceed the current timeout period of 100. Options:
With container settings of 3 retries and a 5 minute interval w/ 100 second timeout, it is 15 minutes from the last healthy response, or 10 minutes from the first unhealthy response, before the container exhausts retries and is marked unhealthy. I think this covers the two 180 second sleeps, even if the operator reduces the interval and timeouts when not running the node. Separately, conversations outside of this PR and thread have pointed to some of the logic used in KOIOS for db-sync, and that it could also be used for checking the sqlite DB for cncli.
I haven't examined what the common drift might be for a db-sync instance from the last block produced, and for cncli I suspect we could make it shorter than 1 hour. These are just my thoughts. If you think that I overlooked some aspect please don't hesitate to continue the discussion.
Thank you for the feedback. It seems the limitations of using 'cncli status' in the healthcheck are twofold... 1 - Instead, checking for cncli sync progress via the sqlite db would seem to make more sense. 2 -
I think it's reasonable to say 'cncli status' is not a suitable way to check cncli container health.
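As a rough illustration of the sqlite-based alternative mentioned above (not code from this PR; the table/column names assume cncli's default schema, and the db path, sleep and availability of the sqlite3 CLI are assumptions):

```bash
# Sketch: consider cncli healthy if its highest recorded slot advances over a
# short window, i.e. sync is progressing.
check_cncli_db_progress() {
  local db="${CNODE_HOME}/guild-db/cncli/cncli.db"
  local before after
  before=$(sqlite3 "${db}" 'SELECT MAX(slot_number) FROM chain;') || return 1
  sleep 60
  after=$(sqlite3 "${db}" 'SELECT MAX(slot_number) FROM chain;') || return 1
  [[ ${after} -gt ${before} ]]
}
```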
Here's a suggestion to check the health of cncli ptsendtip; I'd love your feedback...
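The suggested snippet is not reproduced above; as a placeholder, here is a rough sketch of the general approach discussed later in the thread (watching the ptsendtip process output for a PoolTool response). The process match, grep pattern and timeout are all assumptions:

```bash
# Sketch only: follow the ptsendtip process stdout via /proc and look for a
# PoolTool response within the timeout.
check_cncli_send_tip() {
  local pid timeout=60
  pid=$(pgrep -fn "cncli.*sendtip") || return 1
  if timeout "${timeout}" grep -q -m1 "Pooltool" "/proc/${pid}/fd/1"; then
    return 0  # a tip submission response was observed
  fi
  return 1
}
```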
Co-authored-by: Adam Matthews <[email protected]>
Yep, this is where I was originally taking "the easy way out" on the healthcheck, just checking whether the binary was running or not, and/or sleeping.
Force-pushed from 13d8a50 to f752b1c
@adamsthws OK commit pushed, sorry for the delay. Spent a little time considering if healthcheck.sh could remain in Thanks.
In my limited testing I have never found
If nothing is ever written to the file descriptor...
Do you recommend...
Yes, I have tested... Might the discrepancy between your results and mine be that:
Great! I'll try to review quickly but it may be a day or two. Thanks!
I haven't had a chance to do any testing but here is a quick initial look-over. Thanks!
Co-authored-by: Adam Matthews <[email protected]>
I suppose we could align with the original query tip (non-KOIOS) and go for 60 seconds. If it fails on the first pass, then there are 3 retries before the container is marked unhealthy anyway. Thoughts? Also, I realize in my prior explanation I clearly had the timing off. I should have said 20 total and 15 from the first failure, since it would default to 3 retries with the 5 minute intervals, not 3 total attempts. At least when operators do not tune the interval, retries, etc. at container creation.
Nope. This was a manual test inside my cncli sync container. So I manually checked the PID and used tmux, not the script at all. Because I wasn't using a valid API key I checked both fd 1 and 2 (and 0).
I'll chalk it up to some oddity with the manual test and using an XXXX value for the API ID. As long as you are seeing the output and it's working in your test, that is good enough for me. I need to create a new account at PT to replace my old one based on email address anyway. I'll try to retest once I have a valid setup and start using ptsendtip as a container again.
Separately, I've been considering for a while dropping retries (now HEALTHCHECK_RETRIES) from 20 to 0-3. I originally set it to 20 retries and (now HEALTHCHECK_RETRY_WAIT) 3 second sleeps so that the KOIOS method would at max run as long as the 60 seconds of a query tip without KOIOS. This was never really required, but it's been there ever since. I tested last night manually in a loop of 10 with a 20 second sleep; 10/10 returned in under 1 second, never needing a retry. 10/10 also incremented at least 1 block. I didn't bother to do multiple passes and calculate the ratio. However, given the retries at the container level, I am pretty confident we could drop this from 20 and improve the observability (slightly) by failing earlier. I'm interested in your feedback. What do you think about changing the default values of HEALTHCHECK_RETRIES so even KOIOS will fail earlier?
If it rarely (or never) fails and needs a retry, I would be in favor of removing the Koios retry loop from In which case, there might also be an argument to remove variables:
Well, not "never". There are some occasions where the node is ahead, like when a block is produced locally and healcheck is almost at the exact moment. Same in reverse, KOIOS being ahead, but rarely for more than 3 seconds. Since your open to even removing the retries altogether I'll create a POC and give it a full 100 runs using a current interval of 5 minutes. I'll track the total times:
I'll also try 4, 3 and 2 minute intervals. We could potentially improve the observability, reducing the time something is wrong before the status changes to unhealthy. As long as multiple systems are not sharing a single outbound IP for KOIOS queries I don't believe this should push the limits of the free tier.
I wonder if including an allowable drift for the koios check (similar to
I like the thought, but you might be correct that we just swap complexity. Instead of a single line for "healthy", we get something like this for a drift of 1:
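Something along these lines (an illustrative reconstruction, not the actual snippet; koios_block_no and node_block_no are assumed variable names):

```bash
# Allow the node to be at most 1 block behind or ahead of the koios tip.
if (( node_block_no >= koios_block_no - 1 && node_block_no <= koios_block_no + 1 )); then
  exit 0
fi
exit 1
```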
It's not terribly complex, but not the simplest either. It could be split up w/
@adamsthws So the results were pretty interesting. FWIW, I set up healthcheck_poc.sh so that it leverages the KOIOS_API_TOKEN to be sure I didn't get any 429s, as this testbed shares the WAN IP with multiple systems querying koios.

Tests

Results

300 second intervals

240 second intervals

180 second intervals

120 second intervals

Summary
If reducing this interval does not negatively impact any other healthchecks being added, and the healthchecks start only after the ledger has finished loading (in case of unclean shutdowns, ledger replays during major version changes, etc.), I see no drawbacks to improving the observability of the container health with shorter intervals. I can post the script to a gist if you want to replicate the tests yourself or test different networks.
100% - running your test on my side now. I'll let you know the results.
Test Results

Test: check_node()
Test: check_cncli_send_tip()
Probabilities

If a check has a probability of failing 10% of the time, the probability of that check failing three times in a row is 0.1% (or 1 in 1,000), assuming the events are independent, meaning each failure is not influenced by previous failures.
So with
- Increased accuracy of "Pooltool" log scrape pattern.
- Included ($pt_log_entry) in output for improved logging/debugging.
- Improved variable names for clarification.
Update healthcheck.sh
It's clear there is a need to account for your findings. I suspect the "middle ground" here may be adjusting the current HEALTHCHECK values, as well as providing documentation for optimizing based on the use case (ENTRYPOINT_PROCESS).

Thoughts

Probabilities

At a quick glance at your formula for probabilities I realized I misspoke in earlier statements 🤦🏼 I described one of our monitoring solutions' concept of "retries" before a service status changes to offline/unhealthy, instead of Docker's concept. Instead of 1 initial failure + X retries to change a status, for Docker it's just X retries, or "FailingStreak: X" ==
I'm a bit tired ATM, but unless my mental math is off, doesn't that bring us to unhealthy status once per 25 hours / 1.04 days (at a 300 interval), or once per 10 hours if we wanted to reduce to an interval of 120?

Node tests

Your node results are interesting. My tests were all in the US from various systems, but still with pretty low latency to a very high number of overall stake pools and total delegation. In regions with even higher overall latency to the majority of nodes/delegation (possibly AU?), I suspect we'd see even more failures.

Additional Data

It might be interesting to see the following output providing details about the consecutive events observed:
This is not required, just "interesting". If you decide to grab/provide this for review, or any other queries you might find interesting among the data, a gist link would be fine.

Send Tip Tests

Given your current results from timeout=60, I doubt that timeout=100 (or even timeout=120) would be sufficient to combat an event with such a high probability of occurring every few days (more than once an epoch, ~105 times annually).

Options
EDIT: I mentioned 105 times annually before I looked at the formulas and backtracked to write the intro. It would be off if my mental math wasn't, but that is quite possible since I didn't double check. I'll look at the numbers again today. In any case I still think adjustments are needed, even if it's only once every 3.47 days, or even if it was simply once a month.
Thanks for pointing that out... I was thinking about it in terms of Docker's behaviour - where 3 consecutive failures = unhealthy.
I re-calculated to double check and I believe my previous working was correct.
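(For reference, assuming an independent 10% per-check failure rate and Docker's threshold of 3 consecutive failures: 0.1³ = 0.001, i.e. roughly one unhealthy event per 1,000 checks. At a 300 second interval that is about 300,000 seconds ≈ 83 hours ≈ 3.5 days between unhealthy events; at a 120 second interval, about 33 hours.)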
That's a good point. FWIW I'm testing from the UK.
Unfortunately the container has been re-created since running your test and the results db has not persisted. Happy to re-run the test if you think it would be valuable?
I've tested with timeout=90 and am now awaiting the timeout=120 test to complete... Once we have these results it might give us a better idea of the values to test for in the next round (while testing all simultaneously). Here are the results of timeout=90...
No need. The parallel run with 60/90/120 would be the most useful, as we can see if timeout=60 results in the same level of failures as your first pass. If you can persist the DB on the parallel run it would be useful to investigate each consecutive failure. The comparison of each overlapping check for the different timeout values should be invaluable to determine if the adjustment really addresses the probabilities calculated previously, or if you just had a lucky run, which seeing the parallel timeout=60 will determine.
Great. I must have multiplied by 100 to think about the % on the 2 failure scenario, then just continued with that number when I added the third failure. Sleep is good, I should try it more often 😅.
Results of timeout=120:
Quick observation: I will now go on to test simultaneously to verify results, using the following timeout values: 60, 90, 100, 110, 120.
Send Tip

Fabulous. With confirmation from the parallel data showing 60 was close to or the same as your first pass, and being able to compare every other timeout in SQL during the same period, the choice should be quite simple.

Node

The only concerning thing remaining was your results from the check_node POC. The 13% result is much higher than I observed (worst at 5% w/ 0 consecutive). The probability increase is dramatic: 5% being 0.000125/0.012%, 10% (send tip) being 0.001/0.1%, and 13% being 0.002197/0.22%. Essentially 800% more likely than in any test I performed, and ~120% more likely than the send tip healthcheck failure ratio observed w/ timeout=60. Due to this I'm thinking about an adjustment to the logic for check_node using KOIOS.
I suspect either will significantly reduce the POC failure rates. Do you have a preference or other options to suggest? After your response I'll create an adjustment to the node healthcheck POC and run them in parallel for easy comparison. FWIW, I think moving this out of draft is quite close. I do need to come up with something for the mithril signer. I don't want that to delay this work getting reviewed, so if I don't find time to implement/test something before the next POC test finishes, I'll just add that to a list of other PRs I need to start work on. Thanks for all the time and energy you've spent collaborating on this!
I wonder if this large discrepancy was due to my not using
Great!
Thank you for saying so... It's been a pleasure working through it with you :)
Currently I think allowing a minor block drift is my preference. We reduce the queries to KOIOS, which shouldn't really be an issue unless multiple machines share an IP. But it also puts less pressure on KOIOS when aggregating all SPOs who use the Guild Operators container. It also aligns with how we (and others already) do the DB checks, where they expect some level of drift and even allow it to be pretty large. I'll create a POC test and run in parallel for a 1 block and a 2 block drift, as well as 1 retry and 2 retries. Short of a major difference I think a 1 (or 2) block drift would be the direction I'd lean. I'll let you know later today when the updated POC scripts are ready, in case you want to test on your end as well.
Apologies for the delay; I came down with a cold and decided to rest for the remainder of Friday before a long trip this week. While putting together the new POC scripts I did some extremely short interval testing (1 second between checks) to validate the script worked as intended. I ran into a very small number of cases where koios and the node were in sync, but 1 second later I was on the same block while the koios endpoint was behind.
The healthcheck_poc2.sh is available in this gist
Hey, I wanted to give you a quick update as I've not commented for a while - I've been quietly testing, I just didn't want you thinking I'd lost interest. While testing the function I also gathered results from running your healthcheck_poc2.sh. I've not had a chance to look at the results yet but will get those to you soon. Thanks for your patience and I hope you're feeling better now!
Great, I also got a bit busy with travel for work so haven't been as active as I initially planned on the poc2 phase.
👍🏼
Same, I ran what I described, but hadn't gotten back around to creating the ratios and examining probabilities. I hope to get a few minutes this evening to jump into it and update this thread.
Just started to feel better after the flights. At least my two weeks in the EU won't be suffering 😆. Thanks!
@adamsthws Apologies for the delays. I've updated the gist with results from my past tests. Given the potential for drift between the node and koios, at least based on the results I had, an allowed drift appears to be a more resilient option than including additional retries. I'm interested in the findings from your tests and if you observed similar results or had any differences that should be considered. Thanks
No need to apologise, I understand you've been travelling... I've been travelling this week too on a snowboarding trip to France :) Hope you're having fun in Europe!
Great, I've done the same.
It looks like our results are quite similar. I'd be happy to re-run the test with a larger number of checks if you think a larger sample size would provide any benefit. Thanks
I've added my results for my sendtip tests to the gist here
In the initial test with timeout=60 I measured a 9% failure rate. In the latest results (while testing differing timeout values simultaneously) there was an 8.4% failure rate with timeout=60, so the failure rate seems to be fairly consistent.
Given a decent overlap in results I'm inclined to push for an allowed drift. To account for the potential for AU to have a larger drift I'm thinking of 6 blocks for the moment, which should equate to around 2 minutes, and also cover when one koios endpoint is behind on sync more than others.
One thing in the results is confusing me:
If a single unhealthy status means 3 consecutive failures occurred in the test, and this test shows Unhealthy: 1, how is the probability 0%? Separately.
I think we have enough details to move forward with a final version and move this out of draft. However you want to include your work is fine with me (a PR on the upstream source, or anything else you prefer for including your commits). I'll confer with you before moving the PR out of draft so we're sure we didn't miss/overlook anything along the way. Thanks again, and I hope you're enjoying the snowboarding! I should be going again in a couple of weeks 🏂
Description
Enhances the healthcheck.sh script to work for checking permissions on sidecar containers (helper scripts) via the ENTRYPOINT_PROCESS.
Where should the reviewer start?

The healthcheck.sh script in /home/guild/.scripts/.

Testing different CPU Usage values

Adjust CPU_THRESHOLD (default 80%) at a value you want to mark a container unhealthy when it is exceeded.

Testing different amounts of retries (internal to healthcheck.sh script)

Adjust RETRIES (default 20) at a number of retries you want to perform if the CPU usage is above the CPU_THRESHOLD value before exiting non-zero. Currently there is a 3 second delay between checks, so 20 retries results in up to 60 seconds before the healthcheck will exit as unhealthy due to CPU load.
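For example, one way to try different values without rebuilding the image (the container name and values are illustrative, and this assumes the script reads both variables from the environment):

```bash
# Run the script in place with overridden thresholds and print its exit code.
docker exec -e CPU_THRESHOLD=50 -e RETRIES=5 cncli-sync /home/guild/.scripts/healthcheck.sh
echo "exit code: $?"
```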
Testing different healthcheck values (external to healthcheck.sh script).
The current HEALTHCHECK of the container image is:
Reducing the start period and intervals to something more appropriate for the sidecar script will result in a much shorter period to determine the sidecar container's health.
Make sure to keep the environment variable RETRIES * 3 < container healthcheck timeout to avoid marking the container unhealthy before the script returns during periods of high CPU load.
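For instance, the container-level values can be overridden at creation time with standard docker run flags; the values below are only illustrative and keep the script's internal retries (10 × 3 s = 30 s) below the 60 s timeout (other required container options are omitted):

```bash
# Illustrative override of healthcheck timings for a sidecar container.
docker run -d \
  -e RETRIES=10 \
  --health-interval 60s \
  --health-timeout 60s \
  --health-retries 3 \
  --health-start-period 2m \
  cardanocommunity/cardano-node
```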
Motivation and context
Issue #1841
Which issue it fixes?
Closes #1841
How has this been tested?
- docker cp the script into preview network cncli sync, validate and leaderlog containers and waiting until the interval runs the script
- docker exec to confirm it reports healthy
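Roughly, that manual flow looks like this (container name is illustrative):

```bash
# Copy the updated script in, run it once by hand, then check the status Docker
# reports after the next scheduled interval.
docker cp healthcheck.sh cncli-sync:/home/guild/.scripts/healthcheck.sh
docker exec cncli-sync /home/guild/.scripts/healthcheck.sh && echo "script exited healthy"
docker inspect --format '{{.State.Health.Status}}' cncli-sync
```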
There is a SLEEPING_SCRIPTS array which is used for validate and leaderlog to still check for the cncli binary, but not consider a sleep period for validate and leaderlog to be unhealthy. Not 100% sure this is the best approach, but with sleep periods being variable I felt it was likely an acceptable middle ground.
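A rough sketch of that idea (not the exact implementation; the array contents and process names are assumptions):

```bash
# For sidecars that intentionally sleep between runs (validate, leaderlog), only
# verify the cncli process or its sleep is present instead of failing the check.
SLEEPING_SCRIPTS=("cncli-validate" "cncli-leaderlog")

if [[ " ${SLEEPING_SCRIPTS[*]} " == *" ${ENTRYPOINT_PROCESS} "* ]]; then
  pgrep -f cncli >/dev/null && exit 0
  pgrep -x sleep >/dev/null && exit 0
  exit 1
fi
```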
Please do not hesitate to suggest an alternative approach to handling sleeping sidecars healthchecks if you think you have an improvement.
@adamsthws if you could please copy this into your sidecar containers (and your pool) and report back any results. I am marking this as a draft PR for the time being until testing is completed, after which if things look good I will mark it for review and get feedback from others.
Thanks