Similar to kubermatic/machine-controller#1587, we were recently hit by two complete outages of Hetzner Cloud worker nodes again. We are on KubeOne 1.9.0 with OSM 1.6.0.
I found out exactly what happens, and it already begins with the deployment of the worker:
1. Hetzner deploys a machine with /etc/netplan/50-cloud-init.yaml present.
2. On first boot, cloud-init invokes netplan generate, which creates the /run/systemd/network/10-netplan-eth0.network and 10-netplan-eth0.link files, and systemd-networkd configures eth0 accordingly.
3. OSM runs the bootstrap script, which deletes /etc/netplan/50-cloud-init.yaml and disables cloud-init. Remember: the file is gone now!
4. The machine is rebooted and Ubuntu 24.04 runs /etc/systemd/system-generators/netplan early in the boot process. This essentially invokes netplan generate again. Since there is no /etc/netplan/50-cloud-init.yaml anymore, it also wipes /run/systemd/network/. I'm not sure whether this also happens on Ubuntu <= 22.04. (A quick way to observe steps 1-4 on a node is shown right after this list.)
5. The node proceeds to join the cluster and everything is fine.
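For anyone who wants to follow along, this is roughly what I'd expect to see on a freshly provisioned worker (paths are the ones from the steps above; exact file names may vary per image):

```bash
# Right after provisioning (steps 1-2): the cloud-init netplan config and the
# generated units are present.
ls /etc/netplan/            # 50-cloud-init.yaml
ls /run/systemd/network/    # 10-netplan-eth0.link  10-netplan-eth0.network

# After the OSM bootstrap and the first reboot (steps 3-4): the netplan config
# is gone and the generator has wiped the generated units.
ls /etc/netplan/            # 50-cloud-init.yaml no longer there
ls /run/systemd/network/    # (empty)
```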
As long as networking is not restarted, systemd-networkd keeps managing eth0, i.e. handling DHCP, the link state and so on.
But since Ubuntu does unattended upgrades by default, over time packages will be upgraded that invoke a systemctl restart systemd-networkd. From that point on, systemd-networkd no longer manages eth0, because the files were wiped in step 4. This is still not an issue as long as eth0 stays up.
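To tell whether a node has already gone through such a restart (and is therefore running without the generated units), comparing the daemon's start time with the boot time should be enough; a small sketch, assuming standard systemd and procps tooling:

```bash
# If systemd-networkd's start time is later than the boot time, the daemon has
# been restarted (e.g. by an unattended upgrade) after /run/systemd/network was wiped.
systemctl show systemd-networkd -p ActiveEnterTimestamp
uptime -s   # boot time, for comparison
```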
But recently Hetzner has also been showing some weird network quirks. Links go down from time to time:
Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Link DOWN
Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Lost carrier
That's when everything goes south. The link is not managed anymore and will not be brought up again; the node is gone. Control plane nodes and any other instances survive this because their link is recovered, just not the worker nodes.
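If a node ends up in this state, a rough emergency workaround (my own sketch, not part of the original setup; it assumes the interface is eth0 and uses plain DHCP) would be to hand the interface back to systemd-networkd from the out-of-band console:

```bash
# Drop-in .network unit so systemd-networkd manages eth0 again; /etc/systemd/network/
# survives reboots, unlike the wiped /run/systemd/network/.
cat > /etc/systemd/network/10-eth0.network <<'EOF'
[Match]
Name=eth0

[Network]
DHCP=yes
EOF
systemctl restart systemd-networkd
```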
Possible solutions are:
a) Do not delete /etc/netplan/50-cloud-init.yaml. Since cloud-init is disabled afterwards anyway, I see no problem in leaving it there, though I don't know your reason for removing it in the first place. Keeping the file would prevent any netplan generate run from wiping out the network config (see the sketch after this list).
b) Disable the systemd generator so it does not run netplan generate on boot. I'm not sure whether this causes other issues later on.
c) Disable unattended upgrades to prevent networking restarts. But I'd rather keep them and have a stable network config.
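To illustrate option a): as long as some netplan config for eth0 is present under /etc/netplan, the boot-time generator keeps regenerating the eth0 units instead of wiping them. The actual 50-cloud-init.yaml that cloud-init writes on Hetzner may look different, so treat this minimal stand-in as illustrative only:

```bash
# Minimal illustrative stand-in for /etc/netplan/50-cloud-init.yaml (the real
# cloud-init output may differ): with this present, `netplan generate` keeps
# producing the 10-netplan-eth0.* units on every boot instead of wiping them.
cat > /etc/netplan/50-cloud-init.yaml <<'EOF'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
EOF
netplan generate   # regenerates /run/systemd/network/10-netplan-eth0.*
```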
Please check your worker nodes for the existence of files in /run/systemd/network. If it is empty, you are most likely prone to outages.
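A quick way to check a node (assuming eth0 is the uplink):

```bash
# If this directory is empty, the netplan-generated units are gone and the next
# systemd-networkd restart will leave eth0 unmanaged.
ls /run/systemd/network/

# The setup state should read "configured"; "unmanaged" means a lost carrier
# will not be recovered.
networkctl status eth0
```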
@xrstf Is it a coincidence that you modified the title of the rotten kubermatic/machine-controller#1587 exactly at the time of our first outage? Maybe you noticed something similar?
I saw the ticket while looking through some board, and its typo had irked me for a long time. It's pure coincidence that I happened to edit it recently :) Even if it weren't, I would not publicly admit to my superpowers of remotely removing IPs from other people's servers.
@xrstf I couldn't find the ticket during the outage because I was searching for "loses" at the time. I just noticed you had changed the title shortly after and figured you had experienced something similar and renamed it for that reason. I didn't mean to make you responsible for our outage :)