operator request conversion-webhook error: x509: certificate signed by unknown authority #399

Open
qiaozhi199 opened this issue Feb 10, 2025 · 3 comments

@qiaozhi199

Describe the bug

I was running theia-cloud 0.10.0 and then upgraded to 1.0.0.
Some time after the upgrade, I could no longer log in to theia-cloud from my browser.
Checking the cluster, I found that the operator Pods fail to start.

$ kubectl get pods -n theia-cloud
operator-deployment-f476bcfb4-gtmwb 0/1 CrashLoopBackOff 2586 48d
operator-deployment-f476bcfb4-qtlqg 0/1 CrashLoopBackOff 2783 48d

operator Pod log:

$ kubectl logs -f -n theia-cloud operator-deployment-f476bcfb4-qtlqg
......
2025-02-05T07:47:15.006000578Z -706665172-pool-2-thread-1 WARN Runtime environment or build system does not support multi-release JARs. This will impact location-based features.
07:47:15.004 [-706665172-pool-2-thread-1] ERROR org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [init] Error while initializing workspace watch
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.222.0.1:443/apis/theia.cloud/v1beta5/namespaces/theia-cloud/workspaces. Message: conversion webhook for theia.cloud/v1beta4, Kind=Workspace failed: Post "https://conversion-webhook-service.theia-cloud.svc:443/convert/workspace?timeout=30s": x509: certificate signed by unknown authority. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=conversion webhook for theia.cloud/v1beta4, Kind=Workspace failed: Post "https://conversion-webhook-service.theia-cloud.svc:443/convert/workspace?timeout=30s": x509: certificate signed by unknown authority, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:451) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:419) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:98) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at org.eclipse.theia.cloud.common.k8s.client.ResourceClient.list(ResourceClient.java:87) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator.initWorkspacesAndWatchForChanges(BasicTheiaCloudOperator.java:111) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator.start(BasicTheiaCloudOperator.java:87) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at org.eclipse.theia.cloud.operator.LeaderElectionTheiaCloudOperatorLauncher.startOperatorAsLeader(LeaderElectionTheiaCloudOperatorLauncher.java:116) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at org.eclipse.theia.cloud.operator.LeaderElectionTheiaCloudOperatorLauncher.onStartLeading(LeaderElectionTheiaCloudOperatorLauncher.java:96) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks.onStartLeading(LeaderCallbacks.java:34) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.updateObserved(LeaderElector.java:269) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:250) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$4(LeaderElector.java:179) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:295) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.222.0.1:443/apis/theia.cloud/v1beta5/namespaces/theia-cloud/workspaces. Message: conversion webhook for theia.cloud/v1beta4, Kind=Workspace failed: Post "https://conversion-webhook-service.theia-cloud.svc:443/convert/workspace?timeout=30s": x509: certificate signed by unknown authority. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=conversion webhook for theia.cloud/v1beta4, Kind=Workspace failed: Post "https://conversion-webhook-service.theia-cloud.svc:443/convert/workspace?timeout=30s": x509: certificate signed by unknown authority, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(Unknown Source) ~[?:?]
at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:142) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(Unknown Source) ~[?:?]
at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:51) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(Unknown Source) ~[?:?]
at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:136) ~[defaultoperator-1.0.0-jar-with-dependencies.jar:?]
... 3 more

Certificate validation fails when the operator's request causes the API server to call the conversion-webhook service.
In theia-cloud 0.10.0 the Workspace CRD version is v1beta4; in theia-cloud 1.0.0 it is v1beta5, so Workspace resources need to be converted between versions.

The conversion-webhook Pod has not logged anything since January 28, 2025 08:37.

$ kubectl logs -f -n theia-cloud conversion-webhook-7bc8c68585-dgctl
......
2025-01-28 08:37:19,921 INFO [org.ecl.the.clo.con.ConversionEndpoint] (executor-thread-34) [convert/session] [d2bdca26-1a2c-442b-a58b-febca655fd66] Converting Session (version: 'theia.cloud/v1beta7') to version 'theia.cloud/v1beta8'
2025-01-28 08:37:19,923 INFO [org.ecl.the.clo.con.ConversionEndpoint] (executor-thread-34) [convert/session] [c3ade1e2-c8a7-477e-b783-b6a6dd00a197] Converting Session (version: 'theia.cloud/v1beta7') to version 'theia.cloud/v1beta8'
2025-01-28 08:37:19,925 INFO [org.ecl.the.clo.con.ConversionEndpoint] (executor-thread-34) [convert/session] [8bfb93dc-2716-4c6f-bf7b-c2884adcf369] Converting Session (version: 'theia.cloud/v1beta7') to version 'theia.cloud/v1beta8'

I also see that cert-manager updated the HTTPS certificate used by the conversion-webhook on January 28, 2025 at 08:37.

$ kubectl exec -n theia-cloud conversion-webhook-7bc8c68585-dgctl -it -- /bin/sh
# cd /etc/webhook/certs/
# ls -l
total 0
lrwxrwxrwx 1 root root 13 Dec 18 11:59 ca.crt -> ..data/ca.crt
lrwxrwxrwx 1 root root 14 Dec 18 11:59 tls.crt -> ..data/tls.crt
lrwxrwxrwx 1 root root 14 Dec 18 11:59 tls.key -> ..data/tls.key
# openssl x509 -in tls.crt -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
da:2c:bc:68:6b:0a:f0:e5:c4:2b:b9:18:d9:c8:dc:53
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=Theia Cloud CA
Validity
Not Before: Jan 28 08:37:40 2025 GMT
Not After : Apr 28 08:37:40 2025 GMT
Subject: CN=crd.conversion.cert
Subject Public Key Info:
......

I assume the conversion-webhook service does not reload the updated certificate automatically, which causes the error.

Expected behavior

The conversion-webhook converts CRD versions normally.
The operator Pod starts successfully.

Cluster provider

No response

Version

No response

Additional information

No response

qiaozhi199 added the bug label Feb 10, 2025
@jfaltermeier
Contributor

Hi, thanks for reporting. I believe this should work, we haven’t seen this issue on the Try Now production cluster.

The certificate for the webhook is self-signed since it’s only used internally within the cluster, so there shouldn’t be any issues during renewal.

Could you check the READY state of the certificate and the status/age of the conversion webhook?

kubectl get certificates -n your-theia-cloud-namespace
kubectl get pod -n your-theia-cloud-namespace

Does restarting/killing the conversion-webhook pod resolve the issue?
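
If a restart turns out to help, a rolling restart of the Deployment is one way to do it without deleting pods by hand. This is a minimal sketch, assuming the Deployment is named `conversion-webhook` (inferred from the pod name in the report, not confirmed) and the namespace is `theia-cloud`. The `restart_webhook` helper and its `DRY_RUN` switch are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: restart the conversion webhook so the new pod starts
# with the currently mounted certificate.
# The Deployment name is an assumption based on the pod name in this issue.
restart_webhook() {
    namespace="${1:-theia-cloud}"
    deployment="${2:-conversion-webhook}"
    cmd="kubectl rollout restart deployment/${deployment} -n ${namespace}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command, so it can be reviewed before running it.
        printf '%s\n' "$cmd"
    else
        $cmd
    fi
}
```

Calling `restart_webhook your-theia-cloud-namespace` with `DRY_RUN=1` set prints the `kubectl` command instead of executing it.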

@qiaozhi199
Author

qiaozhi199 commented Feb 10, 2025

Yes, restarting the conversion-webhook pod resolves the issue.
So I suspect that after the certificate expires, the new certificate generated by cert-manager is not reloaded by the conversion-webhook service.

The certificate generated by cert-manager is valid for 3 months. I am worried that the problem will occur again when the certificate expires in 3 months.
Do I need to restart the conversion-webhook pod every 3 months?

Or why not use HTTP instead of HTTPS inside the cluster? That would avoid the problem of SSL certificate expiration.
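
The three-month figure can be checked against the `Not Before` and `Not After` values in the openssl output from the original report. The sketch below computes the whole days between two UTC timestamps. It assumes a `date` that supports `-d` (GNU coreutils or BusyBox), and `days_between` is a hypothetical helper name.

```shell
#!/bin/sh
# Whole days between two UTC timestamps given as "YYYY-MM-DD hh:mm:ss".
# Assumes a `date` implementation that supports -d (GNU coreutils or BusyBox).
days_between() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( ($end - $start) / 86400 ))
}

# Validity window of the certificate shown in this issue.
days_between "2025-01-28 08:37:40" "2025-04-28 08:37:40"   # prints 90
```

Note that this is only the validity window; when cert-manager actually replaces the certificate depends on how the Certificate resource is configured.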

@jfaltermeier
Contributor

Thanks.

To summarize what I found out, the certificate is stored as a Kubernetes secret and loaded into the conversion-webhook pod as a volume mount. When the certificate changes, the mounted files update automatically:
https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod
https://github.com/eclipse-theia/theia-cloud-helm/blob/main/charts/theia-cloud-crds/templates/conversion-webhook-deployment.yaml

The conversion webhook consumes the certificate via these options:
https://github.com/eclipse-theia/theia-cloud/blob/main/java/conversion/org.eclipse.theia.cloud.conversion/src/main/resources/application.properties

It looks like we are missing this option to instruct Quarkus to reload the certificate periodically:
https://quarkus.io/guides/all-config#quarkus-vertx-http_quarkus-http-ssl-certificate-reload-period

In our production environments, the webhook likely never ran longer than three months without Kubernetes restarting it (due to node/cluster updates or scaling), so we never encountered this issue before.

We will fix it for the next release.
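
Until a fixed release is available, one possible stopgap is to supply the option as an environment variable, since Quarkus also reads configuration from the environment. This is an untested sketch resting on several assumptions: that the Quarkus version in the webhook image supports `quarkus.http.ssl.certificate.reload-period`, that the Deployment is named `conversion-webhook`, and that `1h` is an acceptable period. `quarkus_env_name` and `set_reload_period` are hypothetical helper names.

```shell
#!/bin/sh
# Map a property name to its environment-variable form
# (MicroProfile Config rule: non-alphanumerics become "_", then uppercase).
quarkus_env_name() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9' '_' | tr '[:lower:]' '[:upper:]'
}

# Hypothetical stopgap: set the reload period on the running Deployment.
# Deployment name, default namespace, and the 1h period are assumptions.
set_reload_period() {
    namespace="${1:-theia-cloud}"
    period="${2:-1h}"
    name=$(quarkus_env_name "quarkus.http.ssl.certificate.reload-period")
    kubectl set env deployment/conversion-webhook -n "$namespace" "${name}=${period}"
}
```

Changing the environment of a Deployment triggers a new rollout, and a later Helm upgrade may overwrite the setting, so this would only bridge the gap until the chart or `application.properties` carries the option itself.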

jfaltermeier self-assigned this Feb 28, 2025