
k3s cluster and nat gateway #454

Open
mertcangokgoz opened this issue Sep 23, 2024 · 21 comments
@mertcangokgoz commented Sep 23, 2024

I am currently using a NAT gateway in my project. I need k3s, and I want my cluster to communicate only over private IPs, without any public IP addresses. I am using the debian-12 image for the cluster.

With this configuration, I expect the machines to reach the internet and the pods to come up. However, during installation I get output like the following, so I don't think the installation is completing successfully.

[screenshot of the installation output]
@vitobotta (Owner)

Hi, do you see the server(s) attached to the main-vpc-network network in the Hetzner Console? If yes, do they get an IP in that network?

@vitobotta (Owner)

Please SSH into one of the servers attached to the network and run

SUBNET="10.13.0.0/16"
SUBNET_PREFIX=$(echo $SUBNET | cut -d'/' -f1 | sed 's/\./\\./g' | sed 's/0$//')

echo $SUBNET_PREFIX 

Does it return the correct prefix?

Then run

ip -4 addr show | grep -q "inet $SUBNET_PREFIX" 

What does it return?
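For reference, here is a quick local illustration of what these two commands produce (runnable anywhere; the sample `inet` line is made up for the demo):

```shell
# What the prefix computation yields for the example subnet:
SUBNET="10.13.0.0/16"
SUBNET_PREFIX=$(echo "$SUBNET" | cut -d'/' -f1 | sed 's/\./\\./g' | sed 's/0$//')
echo "$SUBNET_PREFIX"   # prints: 10\.13\.0\.

# grep -q prints nothing; it only sets the exit status (0 = match found):
printf 'inet 10.13.0.5/16\n' | grep -q "inet $SUBNET_PREFIX" && echo "match found"
```

So on a healthy node the second command returns no output at all; only its exit status matters.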

@vitobotta (Owner)

My gut feeling is that there is something wrong with your post_create_commands.

Attach a temporary server to the same network, then SSH into it and, using /bin/sh rather than bash (cloud-init scripts must work in a plain sh shell), run your post-create commands to see whether they all complete successfully.
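One way to catch bash-isms before cloud-init trips over them: write the commands to a file and syntax-check and run it with plain sh (dash on Debian). The commands below are just placeholders for your own post-create commands:

```shell
# Write the commands to a file (placeholder content for the demo):
cat > /tmp/post_create_test.sh <<'EOF'
set -e
echo "post-create commands ran under sh"
EOF

# sh -n parses without executing, so bash-only syntax fails fast:
sh -n /tmp/post_create_test.sh && echo "syntax OK"

# Then actually run it under sh:
sh /tmp/post_create_test.sh
```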

@vitobotta (Owner)

What do you get with ip -4 addr show?

@vitobotta (Owner)

Can you try ip -4 addr show | grep "inet $SUBNET_PREFIX" without -q? I'm trying to replicate what happens during the installation.

@mertcangokgoz (Author)

@vitobotta

I changed the subnet and the problem disappeared (I don't know if it's related to how I split the subnets). I haven't included post_create_commands yet, but I now get the situation below. Is this coming from SSH?

[Instance blackhole-k3s-cluster-pool-small-static-pool-worker3] Waiting for successful ssh connectivity with instance blackhole-k3s-cluster-pool-small-static-pool-worker3...
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker2] Waiting for successful ssh connectivity with instance blackhole-k3s-cluster-pool-small-static-pool-worker2...
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker1] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker1 is now up.
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker1] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker1 created
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker3] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker3 is now up.
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker3] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker3 created
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker2] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker2 is now up.
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker2] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker2 created
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in 'raise<Tasker::Timeout>:NoReturn'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in 'Tasker@Tasker::Methods::timeout<Time::Span, &Proc(Nil)>:Nil'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in '~procProc(Nil)@src/cluster/create.cr:75'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in 'Fiber#run:(IO::FileDescriptor | Nil)'

@vitobotta (Owner)

Yeah that may be a problem with SSH, perhaps with the key. Can you try enabling the agent?

@mertcangokgoz (Author)

> Yeah that may be a problem with SSH, perhaps with the key. Can you try enabling the agent?

Are you talking about the use_agent setting? But the key I created has no passphrase.
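For reference, this is roughly what enabling the agent looks like in the cluster config (a sketch; the exact key nesting may vary between hetzner-k3s versions):

```yaml
networking:
  ssh:
    use_agent: true
```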

@vitobotta (Owner)

Another possibility is some issue with Debian, due to recent changes made to handle the new way newer versions of Ubuntu manage custom SSH ports. Can you try Ubuntu with the same configuration to see if that's the problem?

@mertcangokgoz (Author)

[screenshot]

Thank you very much for your help. I have one last question: all of the machines have internet access and my configuration is correct, but I get the following warning. Is this normal for the autoscaler?

[screenshot of the warning]

Apart from this, I get the following warning. Yes, the cluster is configured so that pods cannot be scheduled on the master nodes, but Hetzner did not schedule the pods (csi-controller etc.) on the other nodes; instead it spun up 3 new machines.

Is this a normal process?

@vitobotta (Owner)

It's not a warning :) It's just telling you that some pods were probably Pending due to lack of resources, so the cluster had to scale up. Did it add a new node?

@mertcangokgoz (Author)

[screenshot]

Yes, it added 3 nodes, but I can't see the added ones with the kubectl get nodes command; it shows I have 3 masters and 3 workers now.

@vitobotta (Owner)

What do you see in the autoscaler's logs?

@mertcangokgoz (Author)

I managed to get it running properly; I think I'll write a short article about it on my blog.

Thank you very much for your help.

I just want to ask one very small question:

  private_network:
    enabled: true
    subnet: 10.14.3.0/24
    existing_network_name: 'main-vpc-network'

Even though I configure it like this, why does it get an IP from 10.14.1.0/24?

@vitobotta (Owner)

Can you share the solution for posterity?

Can you also clarify the question? :p

@mertcangokgoz (Author)

The autoscaler stopped working even though I made no changes.

1. I can see the machine being created in the Hetzner Cloud panel.
2. I can see it getting a private IP address via DHCP.
3. It appears to start the installation.

Nothing happens after that, and I'm stuck because I don't have SSH access, so I can't see the logs. It seems the installation never completes properly; the node isn't even joined to the cluster.

The machine has only a private IP address behind the NAT gateway. Routing is fully set up, so there is no problem there either; I configured it according to the documentation.

How can I debug this?

@mertcangokgoz (Author)

I finally managed to solve the problem: because there is no public IP, the installations were left incomplete due to both routing and DNS problems.

I don't know how this happened, but I fixed it by manually editing the cloud-init config.

On machines behind a NAT gateway, the route and DNS configuration needs to run before everything else. Even if we put post_create_commands at the top of the config, it runs at the bottom.

{{ post_create_commands_str }}

I noticed that the commands added here were not placed at the top.

@vitobotta (Owner)

Sorry, I'm not following. Can you clarify what exactly fixed your problem and what changes you needed to make to hetzner-k3s to solve it? I could make a new release with your fixes, or you could open a PR if you're up for it. :)

@mertcangokgoz (Author)

In a k8s setup where there is no public network, the following is needed.

1. The network settings must be configured and the NAT gateway route set up:

  # Add network interface to route nat gateway
  - |
    cat <<'EOF' >> /etc/systemd/network/10-enp7s0.network
    [Match]
    Name=enp7s0
    
    [Network]
    DHCP=yes
    Gateway=10.144.0.1
    EOF
  # reload networkd
  - systemctl restart systemd-networkd
  # Configure systemd-resolved
  - systemctl enable systemd-resolved
  - systemctl start systemd-resolved
  # Set DNS
  - |
    cat <<'EOF' >> /etc/systemd/resolved.conf
    [Resolve]
    Cache=yes
    DNS=185.12.64.1 185.12.64.2
    FallbackDNS=1.1.1.1
    EOF
  - systemctl daemon-reload
  - systemctl restart systemd-resolved

2. Packages should not be installed with the packages: directive (they should be installed immediately after the cloud-init network settings).

So the cloud-init file has to look like this when public IPv4 and IPv6 are completely off:

#cloud-config
preserve_hostname: true

write_files:

- path: /etc/systemd/system/ssh.socket.d/listen.conf
  content: |
    [Socket]
    ListenStream=
    ListenStream=22

- path: /etc/configure-ssh.sh
  permissions: '0755'
  content: |
    if systemctl is-active ssh.socket > /dev/null 2>&1
    then
      # OpenSSH is using socket activation
      systemctl disable ssh
      systemctl daemon-reload
      systemctl restart ssh.socket
      systemctl stop ssh
    else
      # OpenSSH is not using socket activation
      sed -i 's/^#*Port .*/Port 22/' /etc/ssh/sshd_config
    fi
    systemctl restart ssh

runcmd:
- hostnamectl set-hostname $(curl http://169.254.169.254/hetzner/v1/metadata/hostname)
- update-crypto-policies --set DEFAULT:SHA1 || true
- /etc/configure-ssh.sh
- |
  cat <<'EOF' >> /etc/systemd/network/10-enp7s0.network
  [Match]
  Name=enp7s0
  
  [Network]
  DHCP=yes
  Gateway=10.144.0.1
  EOF
# reload networkd
- systemctl restart systemd-networkd
# Configure systemd-resolved
- systemctl enable systemd-resolved
- systemctl start systemd-resolved
# Set DNS
- |
  cat <<'EOF' >> /etc/systemd/resolved.conf
  [Resolve]
  Cache=yes
  DNS=185.12.64.1 185.12.64.2
  FallbackDNS=1.1.1.1
  EOF
- systemctl daemon-reload
- systemctl restart systemd-resolved
- apt-get update && apt-get install -y ifupdown net-tools
- echo "nameserver 8.8.8.8" > /etc/k8s-resolv.conf
- |
    touch /etc/initialized
    
    HOSTNAME=$(hostname -f)
    PUBLIC_IP=$(hostname -I | awk '{print $1}')
    
    if [ "true" = "true" ]; then
      echo "Using private network " > /var/log/hetzner-k3s.log
      SUBNET="10.144.1.0/24"
      SUBNET_PREFIX=$(echo $SUBNET | cut -d'/' -f1 | sed 's/\./\\./g' | sed 's/0$//')
      MAX_ATTEMPTS=30
      DELAY=10
      UP="false"
    
      for i in $(seq 1 $MAX_ATTEMPTS); do
        if ip -4 addr show | grep -q "inet $SUBNET_PREFIX"; then
          echo "Private network IP in subnet $SUBNET is up" 2>&1 | tee -a /var/log/hetzner-k3s.log
          UP="true"
          break
        fi
        echo "Waiting for private network IP in subnet $SUBNET to be available... (Attempt $i/$MAX_ATTEMPTS)" 2>&1 | tee -a /var/log/hetzner-k3s.log
        sleep $DELAY
      done
    
      if [ "$UP" = "false" ]; then
        echo "Timeout waiting for private network IP in subnet $SUBNET" 2>&1 | tee -a /var/log/hetzner-k3s.log
      fi
    
      PRIVATE_IP=$(ip route get 10.144.1.0 | awk -F"src " 'NR==1{split($2,a," ");print a[1]}')
      NETWORK_INTERFACE=" --flannel-iface=$(ip route get 10.144.1.0 | awk -F"dev " 'NR==1{split($2,a," ");print a[1]}') "
    else
      echo "Using public network " > /var/log/hetzner-k3s.log
      PRIVATE_IP="${PUBLIC_IP}"
      NETWORK_INTERFACE=" "
    fi
    
    mkdir -p /etc/rancher/k3s
    
    cat > /etc/rancher/k3s/registries.yaml <<EOF
    mirrors:
      "*":
    EOF
    
    curl -sfL https://get.k3s.io | K3S_TOKEN="REDACTED" INSTALL_K3S_VERSION="v1.31.1+k3s1" K3S_URL=https://10.144.1.16:6443 INSTALL_K3S_EXEC="agent \
    --node-name=$HOSTNAME  --kubelet-arg "cloud-provider=external"  --kubelet-arg "resolv-conf=/etc/k8s-resolv.conf"  \
    --node-ip=$PRIVATE_IP \
    --node-external-ip=$PUBLIC_IP \
    $NETWORK_INTERFACE " sh -
    
    echo true > /etc/initialized
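As a side note, the PRIVATE_IP and NETWORK_INTERFACE lines in the script above parse `ip route get` output with awk. The expressions can be sanity-checked locally against a sample line (the addresses and interface name here are made up):

```shell
# Hypothetical `ip route get 10.144.1.0` output, to check the awk parsing:
sample='10.144.1.0 via 10.144.0.1 dev enp7s0 src 10.144.1.5 uid 0'

# Extract the source address (the node's private IP):
echo "$sample" | awk -F"src " 'NR==1{split($2,a," ");print a[1]}'   # prints: 10.144.1.5

# Extract the interface name (used for --flannel-iface):
echo "$sample" | awk -F"dev " 'NR==1{split($2,a," ");print a[1]}'   # prints: enp7s0
```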

Unfortunately, I can't contribute to the project myself because I don't know the language it's written in :)

@vitobotta (Owner)

Thanks for clarifying! I see what you mean now. I will do some testing and see if I can release some changes that might help with this kind of setup in the next release.

@dyipon commented Oct 9, 2024

I can confirm that the solution works correctly when public IP addresses are completely disabled. The process is slow, taking around 6-7 minutes to create a small cluster, but it works as expected. I tested this without modifying the cloud-init configuration, using only the post-commands.
Thank you @mertcangokgoz
