Include healthcheck logic for helper scripts running as sidecars #1842
base: alpha
Conversation
FWIW Here is my preview network pool and cncli containers showing healthy once the script was copied in and the healthcheck interval was reached:
Looks good!
- cp the script into the cncli sync, validate, leaderlog, pt-send-slots and pt-send-tip containers
- Execute the script with docker exec
- Monitor the containers until the healthcheck interval occurs and confirm that they are marked healthy
- Adjusted RETRIES
- Adjusted CPU_THRESHOLD
Further testing... I was able to test with higher CPU load after deleting the cncli db and re-syncing. Result:
Line 44 of healthcheck.sh: This seems to fix it...
With the above change, when cpu load is higher than CPU_THRESHOLD, this is the result:
Yeah, there are rare instances where the cncli percentage can be high, but this tends to be when resyncing the entire db and/or a cncli init is running. Occasionally, if there is an issue with the node process itself, like if it gets stuck in chainsync/blockfetch and never completes, I have also seen cncli get a high percentage, but otherwise it's quite rare to see it increase. I figured with mithril-signer or db-sync, it might be more useful.
@adamsthws Feel free to submit suggestions to adjust the Thanks for the testing.
Testing revealed that setting RETRIES=0 results in the script exiting 1 without running the loop... it would be preferable to run the loop once when RETRIES=0. Suggestion: modify the loop condition to handle RETRIES=0 by changing line 39 to the following:
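For illustration, a minimal sketch of the kind of change being described (the actual suggested line is not reproduced in this thread; the loop style and placeholder body are assumptions):

```bash
# Sketch only: run the CPU check at least once, even when RETRIES=0,
# by counting from 0 up to and including RETRIES.
for (( try=0; try<=RETRIES; try++ )); do
  # ... existing CPU_THRESHOLD check and 3 second sleep would go here ...
  :
done
```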
Or...
I started thinking about a cncli-specific check. The following function is an idea to check cncli status...
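The function itself was not captured in this thread; the sketch below is a hypothetical reconstruction of the idea, assuming cncli's status subcommand and typical guild-operators paths (both are assumptions and may differ from the original suggestion):

```bash
# Hypothetical sketch: 'cncli status' prints {"status":"ok"} when the cncli db
# is within tolerance of the node tip. Paths below are assumptions.
check_cncli_status() {
  local db="${CNODE_HOME}/guild-db/cncli/cncli.db"
  local genesis_dir="${CNODE_HOME}/files"
  if cncli status --db "${db}" \
      --byron-genesis "${genesis_dir}/byron-genesis.json" \
      --shelley-genesis "${genesis_dir}/shelley-genesis.json" 2>/dev/null |
      grep -q '"ok"'; then
    return 0  # considered in sync
  fi
  return 1    # errored or not in sync
}
```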
Perhaps it would be improved further by also checking whether sync is incrementing, so the healthcheck doesn't fail during initial sync. How would you feel about adding me as a commit co-author if you decide to use this?
@adamsthws I'm happy to make you a co-author even for something simple; for example, if you know how to submit a suggestion, go ahead and apply one for
In regard to the larger block for cncli checks, first, it is clear a lot of thought went into it. This portion:
Sleeps of 180 exceed the current timeout period of 100. Options:
With container settings of 3 retries and a 5 minute interval w/ 100 second timeout, it is 15 minutes from the last healthy response, or 10 minutes from the first unhealthy response, before the container exhausts retries and is marked unhealthy. I think this covers the two 180 second sleeps, even if the operator reduces the interval and timeouts when not running the node. Separately, conversations outside of this PR and thread have pointed to some of the logic used in KOIOS for db-sync, and that it could also be used for checking the sqlite DB for cncli.
I haven't examined what the common drift might be for a db-sync instance from the last block produced, and for cncli I suspect we could make it shorter than 1 hour. These are just my thoughts. If you think that I overlooked some aspect please don't hesitate to continue the discussion.
Thank you for the feedback. It seems the limitations of using 'cncli status' in the healthcheck are twofold... 1 - Instead, checking for cncli sync progress via the sqlite db would seem to make more sense. 2 -
I think it's reasonable to say 'cncli status' is not a suitable way to check cncli container health.
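As a rough illustration of the sqlite-based alternative mentioned above (not code from this PR; the table/column names assume cncli's default schema, and the db path, sleep and availability of the sqlite3 CLI are assumptions):

```bash
# Sketch: consider cncli healthy if its highest recorded slot advances over a
# short window, i.e. sync is progressing.
check_cncli_db_progress() {
  local db="${CNODE_HOME}/guild-db/cncli/cncli.db"
  local before after
  before=$(sqlite3 "${db}" 'SELECT MAX(slot_number) FROM chain;') || return 1
  sleep 60
  after=$(sqlite3 "${db}" 'SELECT MAX(slot_number) FROM chain;') || return 1
  [[ ${after} -gt ${before} ]]
}
```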
Here's a suggestion to check the health of cncli ptsendtip; I'd love your feedback...
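The suggested snippet is not reproduced above; as a placeholder, here is a rough sketch of the general approach discussed later in the thread (watching the ptsendtip process output for a PoolTool response). The process match, grep pattern and timeout are all assumptions:

```bash
# Sketch only: follow the ptsendtip process stdout via /proc and look for a
# PoolTool response within the timeout.
check_cncli_send_tip() {
  local pid timeout=60
  pid=$(pgrep -fn "cncli.*sendtip") || return 1
  if timeout "${timeout}" grep -q -m1 "Pooltool" "/proc/${pid}/fd/1"; then
    return 0  # a tip submission response was observed
  fi
  return 1
}
```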
Co-authored-by: Adam Matthews <[email protected]>
Yep, this is where I was originally taking "the easy way out" on the healthcheck, just checking whether the binary was running or not, and/or sleeping.
Force-pushed from 13d8a50 to f752b1c
@adamsthws OK commit pushed, sorry for the delay. Spent a little time considering if healthcheck.sh could remain in Thanks.
In my limited testing I have never found
If nothing is ever written to the file descriptor...
Do you recommend...
Yes, I have tested... Might the discrepancy between your results and mine be that:
Great! I'll try to review quickly but it may be a day or two. Thanks!
I haven't had a chance to do any testing but here is a quick initial look-over. Thanks!
Co-authored-by: Adam Matthews <[email protected]>
I suppose we could align with the original query tip (non-KOIOS) and go for 60 seconds. If it fails on the first pass, then there are 3 retries before the container is marked unhealthy anyway. Thoughts? Also, I realize in my prior explanation I clearly had the timing off. I should have said 20 total and 15 from the first failure, since it would default to 3 retries with the 5 minute intervals, not 3 total attempts. At least when operators do not tune the interval, retries, etc. at container creation.
Nope. This was a manual test inside my cncli sync container. So I manually checked the PID and used tmux, not the script at all. Because I wasn't using a valid API key I checked both fd 1 and 2 (and 0).
I'll chalk it up to some oddity with the manual test and using an XXXX value for the API ID. As long as you are seeing the output and it's working in your test, that is good enough for me. I need to create a new account at PT to replace my old one based on email address anyway. I'll try to retest once I have a valid setup and start using ptsendtip as a container again.
Separately, I've been considering for a while dropping retries (now HEALTHCHECK_RETRIES) from 20 to 0-3. I originally set it to 20 retries and (now HEALTHCHECK_RETRY_WAIT) 3 second sleeps so that the KOIOS method would at max run as long as the 60 seconds of a query tip without KOIOS. This was never really required, but it's been there ever since. I tested last night manually in a loop of 10 with a 20 second sleep; 10/10 returned in under 1 second, never needing a retry. 10/10 also incremented at least 1 block. I didn't bother to do multiple passes and calculate the ratio. However, given the retries at the container level, I am pretty confident we could drop this from 20 and improve the observability (slightly) by failing earlier. I'm interested in your feedback. What do you think about changing the default values of HEALTHCHECK_RETRIES so even KOIOS will fail earlier?
If it rarely (or never) fails and needs a retry, I would be in favor of removing the Koios retry loop from In which case, there might also be an argument to remove variables:
Well, not "never". There are some occasions where the node is ahead, like when a block is produced locally and healcheck is almost at the exact moment. Same in reverse, KOIOS being ahead, but rarely for more than 3 seconds. Since your open to even removing the retries altogether I'll create a POC and give it a full 100 runs using a current interval of 5 minutes. I'll track the total times:
I'll also try 4, 3 and 2 minute intervals. We could potentially improve the observability, reducing the time something is wrong before the status changes to unhealthy. As long as multiple systems are not sharing a single outbound IP for KOIOS queries I don't believe this should push the limits of the free tier.
I wonder if including an allowable drift for the koios check (similar to
I like the thought, but you might be correct that we just swap complexity. Instead of a single line for "healthy", we get something like this for a drift of 1:
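Something along these lines (an illustrative reconstruction, not the actual snippet; koios_block_no and node_block_no are assumed variable names):

```bash
# Allow the node to be at most 1 block behind or ahead of the koios tip.
if (( node_block_no >= koios_block_no - 1 && node_block_no <= koios_block_no + 1 )); then
  exit 0
fi
exit 1
```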
It's not terribly complex, but not the simplest either. It could be split up w/
@adamsthws So the results were pretty interesting. FWIW, I set up healthcheck_poc.sh so that it leverages the KOIOS_API_TOKEN to be sure I didn't get any 429s, as this testbed shares the WAN IP with multiple systems querying koios.

Tests

Results

300 second intervals

240 second intervals

180 second intervals

120 second intervals

Summary
If reducing this interval does not negatively impact any other healthchecks being added, and the healthchecks start only after the ledger has finished loading (in case of unclean shutdowns, ledger replays during major version changes, etc.), I see no drawbacks to improving the observability of the container health with shorter intervals. I can post the script to a gist if you want to replicate the tests yourself or test different networks.
100% - running your test on my side now. I'll let you know the results.
Test Results

Test: check_node()
Test: check_cncli_send_tip()
Probabilities

If a check has a probability of failing 10% of the time, the probability of that check failing three times in a row is 0.1% (or 1 in 1,000), assuming the events are independent, meaning each failure is not influenced by previous failures.
So with
- Increased accuracy of "Pooltool" log scrape pattern.
- Included ($pt_log_entry) in output for improved logging/debugging.
- Improved variable names for clarification.
Update healthcheck.sh
It's clear there is a need to account for your findings. I suspect the "middle ground" here may be adjusting the current HEALTHCHECK values, as well as providing documentation for optimizing based on the use case (ENTRYPOINT_PROCESS).

Thoughts

Probabilities

At a quick glance at your formula for probabilities I realized I misspoke in earlier statements 🤦🏼 I described one of our monitoring solutions' concept of "retries" before a service status changes to offline/unhealthy, instead of Docker's concept. Instead of 1 initial failure + X retries to change a status, for Docker it's just X retries, or "FailingStreak: X" ==
I'm a bit tired ATM, but unless my mental math is off, doesn't that bring us to unhealthy status once per 25 hours / 1.04 days (at a 300 interval), or once per 10 hours if we wanted to reduce to an interval of 120?

Node tests

Your node results are interesting. My tests were all in the US from various systems, but still with pretty low latency to a very high number of overall stake pools and total delegation. In regions with even higher overall latency to the majority of nodes/delegation (possibly AU?), I suspect we'd see even more failures.

Additional Data

It might be interesting to see the following output providing details about the consecutive events observed:
This is not required, just "interesting". If you decide to grab/provide this for review, or any other queries you might find interesting among the data, a gist link would be fine.

Send Tip Tests

Given your current results from timeout=60, I doubt that timeout=100 (or even timeout=120) would be sufficient to combat an event with such a high probability of occurring every few days (more than once an epoch, ~105 times annually).

Options
EDIT: I mentioned 105 times annually before I looked at the formulas and backtracked to write the intro. It would be off if my mental math wasn't, but that is quite possible since I didn't double check. I'll look at the numbers again today. In any case I still think adjustments are needed, even if it's only once every 3.47 days, or even if it was simply once a month.
Thanks for pointing that out... I was thinking about it in terms of Docker's behaviour - where 3 consecutive failures = unhealthy.
I re-calculated to double check and I believe my previous working was correct.
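(For reference, assuming an independent 10% per-check failure rate and Docker's threshold of 3 consecutive failures: 0.1³ = 0.001, i.e. roughly one unhealthy event per 1,000 checks. At a 300 second interval that is about 300,000 seconds ≈ 83 hours ≈ 3.5 days between unhealthy events; at a 120 second interval, about 33 hours.)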
That's a good point. FWIW I'm testing from the UK.
Unfortunately the container has been re-created since running your test and the results db has not persisted. Happy to re-run the test if you think it would be valuable?
I've tested with timeout=90 and am now awaiting the timeout=120 test to complete... Once we have these results it might give us a better idea of the values to test for in the next round (while testing all simultaneously). Here are the results of timeout=90...
No need. The parallel run with 60/90/120 would be the most useful, as we can see if timeout=60 results in the same level of failures as your first pass. If you can persist the DB on the parallel run it would be useful to investigate each consecutive failure. The comparison of each overlapping check for the different timeout values should be invaluable to determine if the adjustment really addresses the probabilities calculated previously, or if you just had a lucky run, which seeing the parallel timeout=60 will determine.
Great. I must have multiplied by 100 to think about the % on the 2 failure scenario, then just continued with that number when I added the third failure. Sleep is good, I should try it more often 😅.
Results of timeout=120:
Quick observation: I will now go on to test simultaneously to verify results, using the following timeout values: 60, 90, 100, 110, 120.
Send Tip

Fabulous. With confirmation from the parallel data showing 60 was close to or the same as your first pass, and being able to compare every other timeout in SQL during the same period, the choice should be quite simple.

Node

The only concerning thing remaining was your results from the check_node POC. The 13% result is much higher than I observed (worst at 5% w/ 0 consecutive). The probability increase is dramatic: 5% being 0.000125/0.012%, 10% (send tip) being 0.001/0.1%, and 13% being 0.002197/0.22%. Essentially 800% more likely than in any test I performed, and ~120% more likely than the send tip healthcheck failure ratio observed w/ timeout=60. Due to this I'm thinking about an adjustment to the logic for check_node using KOIOS.
I suspect either will significantly reduce the POC failure rates. Do you have a preference or other options to suggest? After your response I'll create an adjustment to the node healthcheck POC and run them in parallel for easy comparison. FWIW, I think moving this out of draft is quite close. I do need to come up with something for the mithril signer. I don't want that to delay this work getting reviewed, so if I don't find time to implement/test something before the next POC test finishes, I'll just add that to a list of other PRs I need to start work on. Thanks for all the time and energy you've spent collaborating on this!
I wonder if this large discrepancy was due to my not using
Great!
Thank you for saying so... It's been a pleasure working through it with you :)
Currently I think allowing a minor block drift is my preference. We reduce the queries to KOIOS, which shouldn't really be an issue unless multiple machines share an IP. But it also puts less pressure on KOIOS when aggregating all SPOs who use the Guild Operators container. It also aligns with how we (and others already) do the DB checks, where they expect some level of drift and even allow it to be pretty large. I'll create a POC test and run in parallel for a 1 block and a 2 block drift, as well as 1 retry and 2 retries. Short of a major difference I think a 1 (or 2) block drift would be the direction I'd lean. I'll let you know later today when the updated POC scripts are ready, in case you want to test on your end as well.
Apologies for the delay; I came down with a cold and decided to rest for the remainder of Friday before a long trip this week. While putting together the new POC scripts I did some extremely short interval testing (1 second between checks) to validate the script worked as intended. I ran into a very small number of cases where koios and the node were in sync, but 1 second later I was on the same block while the koios endpoint was behind.
The healthcheck_poc2.sh is available in this gist
Hey, I wanted to give you a quick update as I've not commented for a while - I've been quietly testing, I just didn't want you thinking I'd lost interest. While testing the function I also gathered results from running your healthcheck_poc2.sh. I've not had a chance to look at the results yet but will get those to you soon. Thanks for your patience and I hope you're feeling better now!
Great, I also got a bit busy with travel for work so haven't been as active as I initially planned on the poc2 phase.
👍🏼
Same, I ran what I described, but hadn't gotten back around to creating the ratios and examining probabilities. I hope to get a few minutes this evening to jump into it and update this thread.
Just started to feel better after the flights. At least my two weeks in the EU won't be suffering 😆. Thanks!
@adamsthws Apologies for the delays. I've updated the gist with results from my past tests. Given the potential for drift between the node and koios, at least based on the results I had, an allowed drift appears to be a more resilient option than including additional retries. I'm interested in the findings from your tests and if you observed similar results or had any differences that should be considered. Thanks
No need to apologise, I understand you've been travelling... I've been travelling this week too on a snowboarding trip to France :) Hope you're having fun in Europe!
Great, I've done the same.
It looks like our results are quite similar. I'd be happy to re-run the test with a larger number of checks if you think a larger sample size would provide any benefit. Thanks
I've added my results for my sendtip tests to the gist here
In the initial test with timeout=60 I measured a 9% failure rate. In the latest results (while testing differing timeout values simultaneously) there was an 8.4% failure rate with timeout=60, so the failure rate seems to be fairly consistent.
Given a decent overlap in results I'm inclined to push for an allowed drift. To account for the potential for AU to have a larger drift I'm thinking of 6 blocks for the moment, which should equate to around 2 minutes, and also cover when one koios endpoint is behind on sync more than others.
One thing in the results is confusing me:
If a single unhealthy status means 3 consecutive failures occurred in the test, and this test shows Unhealthy: 1, how is the probability 0%? Separately.
I think we have enough details to move forward with a final version and move this out of draft. However you want to include your work is fine with me (a PR on the upstream source, or anything else you prefer for including your commits). I'll confer with you before moving the PR out of draft so we're sure we didn't miss/overlook anything along the way. Thanks again, and I hope you're enjoying the snowboarding! I should be going again in a couple of weeks 🏂
Description
Enhances the healthcheck.sh script to work for checking permissions on sidecar containers (helper scripts) via the ENTRYPOINT_PROCESS.
Where should the reviewer start?

The healthcheck.sh script in /home/guild/.scripts/.

Testing different CPU Usage values

Adjust CPU_THRESHOLD (default 80%) at a value you want to mark a container unhealthy when it is exceeded.

Testing different amounts of retries (internal to healthcheck.sh script)

Adjust RETRIES (default 20) at a number of retries you want to perform if the CPU usage is above the CPU_THRESHOLD value before exiting non-zero. Currently there is a 3 second delay between checks, so 20 retries results in up to 60 seconds before the healthcheck will exit as unhealthy due to CPU load.
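For example, one way to try different values without rebuilding the image (the container name and values are illustrative, and this assumes the script reads both variables from the environment):

```bash
# Run the script in place with overridden thresholds and print its exit code.
docker exec -e CPU_THRESHOLD=50 -e RETRIES=5 cncli-sync /home/guild/.scripts/healthcheck.sh
echo "exit code: $?"
```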
Testing different healthcheck values (external to healthcheck.sh script).
The current HEALTHCHECK of the container image is:
Reducing the start period and intervals to something more appropriate for the sidecar script will result in a much shorter period to determine the sidecar container's health.
Make sure to keep the environment variable RETRIES * 3 < container healthcheck timeout to avoid marking the container unhealthy before the script returns during periods of high CPU load.
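For instance, the container-level values can be overridden at creation time with standard docker run flags; the values below are only illustrative and keep the script's internal retries (10 × 3 s = 30 s) below the 60 s timeout (other required container options are omitted):

```bash
# Illustrative override of healthcheck timings for a sidecar container.
docker run -d \
  -e RETRIES=10 \
  --health-interval 60s \
  --health-timeout 60s \
  --health-retries 3 \
  --health-start-period 2m \
  cardanocommunity/cardano-node
```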
Motivation and context
Issue #1841
Which issue it fixes?
Closes #1841
How has this been tested?
- docker cp the script into preview network cncli sync, validate and leaderlog containers and waiting until the interval runs the script
- docker exec to confirm it reports healthy
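Roughly, that manual flow looks like this (container name is illustrative):

```bash
# Copy the updated script in, run it once by hand, then check the status Docker
# reports after the next scheduled interval.
docker cp healthcheck.sh cncli-sync:/home/guild/.scripts/healthcheck.sh
docker exec cncli-sync /home/guild/.scripts/healthcheck.sh && echo "script exited healthy"
docker inspect --format '{{.State.Health.Status}}' cncli-sync
```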
There is a SLEEPING_SCRIPTS array which is used for validate and leaderlog to still check for the cncli binary, but not consider a sleep period for validate and leaderlog to be unhealthy. Not 100% sure this is the best approach, but with sleep periods being variable I felt it was likely an acceptable middle ground.
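A rough sketch of that idea (not the exact implementation; the array contents and process names are assumptions):

```bash
# For sidecars that intentionally sleep between runs (validate, leaderlog), only
# verify the cncli process or its sleep is present instead of failing the check.
SLEEPING_SCRIPTS=("cncli-validate" "cncli-leaderlog")

if [[ " ${SLEEPING_SCRIPTS[*]} " == *" ${ENTRYPOINT_PROCESS} "* ]]; then
  pgrep -f cncli >/dev/null && exit 0
  pgrep -x sleep >/dev/null && exit 0
  exit 1
fi
```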
Please do not hesitate to suggest an alternative approach to handling sleeping sidecars healthchecks if you think you have an improvement.
@adamsthws if you could please copy this into your sidecar containers (and your pool) and report back any results. I am marking this as a draft PR for the time being until testing is completed, after which if things look good I will mark it for review and get feedback from others.
Thanks