Hetzner worker nodes losing IPv4 with Ubuntu 24.04 #437

Open
7oku opened this issue Feb 25, 2025 · 2 comments


7oku commented Feb 25, 2025

Similar to kubermatic/machine-controller#1587, we were recently hit by two complete outages of Hetzner Cloud worker nodes again. We are on KubeOne 1.9.0 with OSM 1.6.0.

I found out exactly what happens, and it already begins with the deployment of the worker:

  1. Hetzner deploys a machine with /etc/netplan/50-cloud-init.yaml present.
  2. On first boot, cloud-init invokes netplan generate, which creates /run/systemd/network/10-netplan-eth0.network and 10-netplan-eth0.link, and systemd-networkd configures eth0 accordingly.
  3. OSM runs the bootstrap script, which deletes /etc/netplan/50-cloud-init.yaml and disables cloud-init. Remember: the file is gone now!
  4. The machine is rebooted, and Ubuntu 24.04 runs /etc/systemd/system-generators/netplan early in the boot process. This essentially invokes netplan generate again, and since /etc/netplan/50-cloud-init.yaml no longer exists, it also wipes /run/systemd/network/ (I'm not sure whether this also happens on Ubuntu <= 22.04). A reproduction sketch follows after this list.
  5. The node proceeds to join the cluster and everything looks fine.
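The wipe from step 4 can be reproduced by hand, since the generator is essentially a wrapper around netplan generate. This is only a rough sketch (run as root, single eth0 as in step 2, and obviously not on a node you care about):

ls /run/systemd/network/            # 10-netplan-eth0.network and 10-netplan-eth0.link are present
rm /etc/netplan/50-cloud-init.yaml
netplan generate                    # nothing left to render ...
ls /run/systemd/network/            # ... so the previously generated files are gone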

As long as networking is not restarted, systemd-networkd keeps managing eth0, i.e. handling DHCP, the link state and so on.
But since Ubuntu runs unattended upgrades by default, sooner or later some package upgrade will trigger a systemctl restart systemd-networkd. From that point on, systemd-networkd no longer manages eth0, because the files were wiped in step 4. Even that is not a problem yet, as long as eth0 stays up.
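Whether a node has already reached that state can be checked roughly like this (assuming journald still holds the logs of the current boot):

journalctl -b -u systemd-networkd | grep -iE 'started|stopping'   # any networkd restart since boot?
networkctl list                                                   # "unmanaged" in the SETUP column for eth0 means networkd dropped it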

On top of that, Hetzner has recently shown some odd network quirks; links go down from time to time:

Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Link DOWN
Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Lost carrier

That's when everything goes south: the link is no longer managed, will not be brought up again, and the node is gone. Control-plane nodes and any other instances survive this because the link is recovered; the worker nodes do not.

Possible solutions are:

a) Do not delete /etc/netplan/50-cloud-init.yaml. Since cloud-init is disabled afterwards anyway, I see no problem in leaving it there (that said, I don't know your reason for removing it in the first place). Keeping the file would prevent any netplan generate run from wiping the network config.
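To illustrate: with the file kept, the boot-time generator simply re-renders the runtime units from step 2 instead of wiping them (sketch, run as root; file names depend on the interface name):

netplan generate                # renders /etc/netplan/*.yaml into /run/systemd/network/
ls /run/systemd/network/        # 10-netplan-eth0.network and 10-netplan-eth0.link are back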

b) Mask the systemd generator so that it does not run netplan generate on boot. I'm not sure whether this causes other issues later on:

ln -s /dev/null /etc/systemd/system-generators/netplan
systemctl daemon-reload

c) Disable unattended upgrades to prevent networking restarts. But I'd rather keep them and have a stable network config instead.
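For completeness, one common way to do that on Ubuntu (sketch; not what I'd prefer, see above):

dpkg-reconfigure -plow unattended-upgrades     # answer "No" to turn off automatic updates
# or set APT::Periodic::Unattended-Upgrade "0"; in /etc/apt/apt.conf.d/20auto-upgrades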

Please check your worker nodes for the existence of files in /run/systemd/network. If the directory is empty, you are most likely prone to outages.
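For example, with a hypothetical workers.txt listing the worker hostnames, something like this gives a quick overview:

# workers.txt is a made-up example file with one worker hostname/IP per line
for h in $(cat workers.txt); do
  echo "== $h"
  ssh "$h" 'ls /run/systemd/network/'
done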

@xrstf Is it a coincidence that you modified the title of the rotten kubermatic/machine-controller#1587 right at the time of our first outage? Did you perhaps notice something similar?


xrstf commented Feb 25, 2025

> @xrstf Is it a coincidence that you modified the title of the rotten kubermatic/machine-controller#1587 right at the time of our first outage? Did you perhaps notice something similar?

I saw the ticket while looking through some board, and its typo had irked me for a long time. It's pure coincidence that I happened to edit it recently :) Even if it weren't, I would not publicly admit to having the superpower of remotely removing IPs from other people's servers.


7oku commented Feb 25, 2025

@xrstf I couldn't find the ticket during the outage because I was searching for "loses" at the time. I just noticed you had changed the title shortly afterwards and assumed you had experienced something similar and renamed it because of that. I didn't mean to hold you responsible for our outage :)
