From 976fd2649df9206c61cd86b4cc97ddd95e9c59fd Mon Sep 17 00:00:00 2001 From: Gernot Feichter Date: Fri, 12 Jul 2024 10:29:54 +0200 Subject: [PATCH 1/5] feat: autorecover from stuck situations Signed-off-by: Gernot Feichter --- hips/hip-9999.md | 102 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 hips/hip-9999.md diff --git a/hips/hip-9999.md b/hips/hip-9999.md new file mode 100644 index 00000000..9603d7bf --- /dev/null +++ b/hips/hip-9999.md @@ -0,0 +1,102 @@ +--- +hip: 9999 +title: "Autorecover from stuck situations" +authors: [ "Gernot Feichter " ] +created: "2024-07-12" +type: "feature" +status: "draft" +helm-version: 3 +--- + +## Abstract + +The idea is to simplify the handling for both manual users and CI/CD pipelines, +to auto-recover from a state of stuck deployments, which is currently not possible unless users implement +boilerplate code around their helm invocations. + +## Motivation + +If a helm deployment fails, I want to be able to retry it, +ideally by running the same command again to keep things simple. + +There are two known situations how the user can run into such a situation where a retry will NOT work: +1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`. +2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`. + +Known Workarounds that should become OBSOLETE when this HIP is implemented to recover from such situations: +1. `kubectl delete secret ''.` (Not possible if you don't want to lose all history) +2. `helm delete` your release. (Not possible if you don't want to lose all history) +3. `helm rollback` your release. (Not possibly if it is the first installation) + +## Rationale + +The proposed solution uses a locking mechanism that is stored in k8s, such that all clients know whether a helm +release is locked by themselves or not and for how long the lock is valid. + +It uses existing helm parameters like the --timeout parameter (which defaults to 5m) to determine for how long a helm release +may be stuck in a pending state. + +## Specification + +The --timout parameter gets a deeper meaning. +Previously the --timout parameter only had an effect on the helm process running on the respective client. +After implementation, the --timout parameter will be stored in the helm release object (secret) in k8s and +have an indirect impact on possible parallel processes. + +`helm ls -a` shows two new columns, regular `helm ls` does NOT show those: +- LOCKED TILL + calculated by the helm client: k8s server time + timeout parameter value +- SESSION ID + Unique, random session id generated by the client + +Furthermore, if the helm client process gets killed (SIGTERM), it tries to clear the LOCKED TILL value, +SESSION ID and sets the release into a failed state before terminating in order to free the lock. + +## Backwards compatibility + +It is assumed that the helm release object as stored in k8s will not break +older clients if new fields are added while existing fields are untouched. + +Backwards compatibility will be tested during implementation! + +## Security implications + +The proposed solution should not have an impact on security. + +## How to teach this + +Since the way that helm is invoked is not altered, there will not be much to teach here. +The usage of the timeout parameter is encouraged, but since the default timeout is already 5m, not even that +needs to be encouraged. + +It should just reduce the amount of frustration when dealing with pending and failed helm releases. + +A retry of a failed command should just work (assuming the retry happens when no other client has a valid lock). + +## Reference implementation + +helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations + +acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations + +## Rejected ideas + +None + +## Open issues + +[] HIP status `accepted' + +[x] Reference implementation + +[x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) + +[] Test for kill scenario (forever stuck in pending) + +[] Backwards compatibility check (looking good already) + +## References + +https://github.com/helm/helm/issues/7476 +https://github.com/rancher/rancher/issues/44530 +https://github.com/helm/helm/issues/11863 From ada8829267cddfe0f1657777594e40d46ef2bfad Mon Sep 17 00:00:00 2001 From: gerrnot Date: Wed, 24 Jul 2024 20:12:38 +0200 Subject: [PATCH 2/5] Update hips/hip-9999.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com> Signed-off-by: gerrnot --- hips/hip-9999.md | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/hips/hip-9999.md b/hips/hip-9999.md index 9603d7bf..4b19944f 100644 --- a/hips/hip-9999.md +++ b/hips/hip-9999.md @@ -85,15 +85,11 @@ None ## Open issues -[] HIP status `accepted' - -[x] Reference implementation - -[x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) - -[] Test for kill scenario (forever stuck in pending) - -[] Backwards compatibility check (looking good already) +- [ ] HIP status `accepted' +- [x] Reference implementation +- [x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) +- [ ] Test for kill scenario (forever stuck in pending) +- [ ] Backwards compatibility check (looking good already) ## References From 03624ebb69644fdcacc7acd24298c41225cbc246 Mon Sep 17 00:00:00 2001 From: gerrnot Date: Wed, 24 Jul 2024 20:13:50 +0200 Subject: [PATCH 3/5] Update hips/hip-9999.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com> Signed-off-by: gerrnot --- hips/hip-9999.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/hips/hip-9999.md b/hips/hip-9999.md index 4b19944f..b3754339 100644 --- a/hips/hip-9999.md +++ b/hips/hip-9999.md @@ -90,9 +90,5 @@ None - [x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) - [ ] Test for kill scenario (forever stuck in pending) - [ ] Backwards compatibility check (looking good already) - -## References - -https://github.com/helm/helm/issues/7476 https://github.com/rancher/rancher/issues/44530 https://github.com/helm/helm/issues/11863 From 74c3e944e0bb5719de8caa76c68f2ad6dc04d23c Mon Sep 17 00:00:00 2001 From: gerrnot Date: Wed, 24 Jul 2024 20:15:17 +0200 Subject: [PATCH 4/5] Update hips/hip-9999.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com> Signed-off-by: gerrnot --- hips/hip-9999.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hips/hip-9999.md b/hips/hip-9999.md index b3754339..47e8bcf2 100644 --- a/hips/hip-9999.md +++ b/hips/hip-9999.md @@ -26,7 +26,7 @@ There are two known situations how the user can run into such a situation where Known Workarounds that should become OBSOLETE when this HIP is implemented to recover from such situations: 1. `kubectl delete secret ''.` (Not possible if you don't want to lose all history) 2. `helm delete` your release. (Not possible if you don't want to lose all history) -3. `helm rollback` your release. (Not possibly if it is the first installation) +3. `helm rollback` your release. (Not possible if it is the first installation) ## Rationale From 7b94ce834471d0445ff53cee4339c6169901e237 Mon Sep 17 00:00:00 2001 From: Gernot Feichter Date: Tue, 6 Aug 2024 11:21:40 +0200 Subject: [PATCH 5/5] feat: autorecover from stuck situations Signed-off-by: Gernot Feichter --- hips/hip-9999.md | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/hips/hip-9999.md b/hips/hip-9999.md index 47e8bcf2..fe34b3f9 100644 --- a/hips/hip-9999.md +++ b/hips/hip-9999.md @@ -44,10 +44,14 @@ After implementation, the --timout parameter will be stored in the helm release have an indirect impact on possible parallel processes. `helm ls -a` shows two new columns, regular `helm ls` does NOT show those: -- LOCKED TILL - calculated by the helm client: k8s server time + timeout parameter value -- SESSION ID - Unique, random session id generated by the client +- `LOCKED TILL` + datetime calculated by the helm client: current time + timeout parameter value + Originally k8s server time was intended as "current time", but since helm exclusively uses the + client time everywhere else, we do not change that via this HIP, such a refactoring would need + to be performed via a separate HIP against the entire codebase. +- `SESSION ID` + + Unique, random session id generated by the client. Furthermore, if the helm client process gets killed (SIGTERM), it tries to clear the LOCKED TILL value, SESSION ID and sets the release into a failed state before terminating in order to free the lock. @@ -88,7 +92,16 @@ None - [ ] HIP status `accepted' - [x] Reference implementation - [x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) -- [ ] Test for kill scenario (forever stuck in pending) -- [ ] Backwards compatibility check (looking good already) +- [x] Test for upgrading from pending state +- [x] Test for upgrading from failed state +- [ ] Decision: Helm ls -> which flag should show the new fields `LOCKED TILL` and `SESSION ID`? +- [ ] Decision: k8s Lease object vs helm relesae secret for storing the `LOCKED TILL` and `SESSION ID` +- [x] Backwards compatibility check (part of acceptance tests repo, looking good already, even when storing the state in the release object) + +## References + +https://github.com/helm/helm/issues/7476 + https://github.com/rancher/rancher/issues/44530 + https://github.com/helm/helm/issues/11863