Disk exhausted after upgrade 1.5.6-1.9.5 #24914

monwolf opened this issue Jan 22, 2025 · 0 comments

Hi,
After upgrading our nodes from 1.5.6 -> 1.9.5, we observed a difference in how storage resources are allocated.
We have two partitions on our hosts:

  • / for the OS
  • /var for the tasks

This is the free storage:

[screenshot: free disk space on / and /var]
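(For reference, the same layout can be checked from the shell; a minimal sketch using standard coreutils:)

```shell
# Show mount points and free space for the OS and task partitions
df -h / /var
```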

Our data_dir points to /var/nomad. Our config looks like:

region = "gine"
name = "ec2devusfarm02"
log_level = "DEBUG"
leave_on_interrupt = true
leave_on_terminate = true
data_dir = "/var/nomad/data"
bind_addr = "0.0.0.0"
disable_update_check = true
limits {
        https_handshake_timeout   = "10s"
        http_max_conns_per_client = 400
        rpc_handshake_timeout     = "10s"
        rpc_max_conns_per_client  = 400
}
advertise {
    http = "10.121.200.13:4646"
    rpc = "10.121.200.13:4647"
    serf = "10.121.200.13:4648"
}
tls {
  http = true
  rpc  = true
  cert_file = "/opt/nomad/ssl/server.pem"
  key_file = "/opt/nomad/ssl/server-key.pem"
  ca_file = "/opt/nomad/ssl/nomad-ca.pem"
  verify_server_hostname = true
  verify_https_client    = true
}
log_file = "/var/log/nomad/"
log_json = true
log_rotate_max_files = 7
consul {
    address = "127.0.0.1:8500"
    server_service_name = "nomad-server"
    client_service_name = "nomad-client"
    auto_advertise = true
    server_auto_join = true
    client_auto_join = true

    ssl = true
    ca_file = "/opt/consul/ssl/consul-ca.pem"
    cert_file = "/opt/consul/ssl/server.pem"
    key_file = "/opt/consul/ssl/server-key.pem"
    token = "xxxxx"
}
acl {
  enabled = true
}

vault {
    enabled = true
    address = "https://vault.legacy-dev.com:8200/"
    ca_file = "/opt/vault/ssl/vault-ca.pem"
    cert_file = "/opt/vault/ssl/client-vault.pem"
    key_file = "/opt/vault/ssl/client-vault-key.pem"
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  datadog_address = "localhost:8125"
  disable_hostname = true
  collection_interval = "10s"
}
datacenter = "farm"

client {
    enabled = true
    network_interface = "ens5"
    cni_path = "/opt/cni/bin"
    cni_config_dir = "/etc/cni/net.d/"
}

plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
    allow_privileged = true
    volumes {
      enabled = true
    }
  }
}

After the upgrade, we started to see "disk exhausted" errors when trying to schedule a job:

[screenshot: scheduler placement failure reporting exhausted disk]

But the node has enough free storage. If we look at nomad node status:

[screenshot: nomad node status showing allocated disk resources]

As you can see, Nomad now uses / instead of /var to calculate allocatable space, whereas 1.5.6 used /var. Yet in the unique attributes it is fingerprinting the right filesystem:

[screenshot: unique.storage fingerprint attributes pointing at /var]
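(These attributes can be pulled with the CLI; a sketch, where <node-id> is a placeholder for the client's node ID:)

```shell
# Dump all fingerprinted attributes for the client and keep the storage ones
nomad node status -verbose <node-id> | grep unique.storage
```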

How can I solve this? I didn't see anything related to it in the release notes.
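The only stop-gap I can think of would be overriding the fingerprinted values with the documented client options disk_total_mb / disk_free_mb; an untested sketch, with placeholder values standing in for /var's real capacity:

```hcl
client {
    enabled = true
    network_interface = "ens5"
    cni_path = "/opt/cni/bin"
    cni_config_dir = "/etc/cni/net.d/"

    # Untested workaround: force the scheduler to see /var's capacity
    # instead of /. Values below are placeholders in MB.
    disk_total_mb = 204800
    disk_free_mb  = 102400
}
```

But I'd rather understand why the scheduler and the fingerprinted volume disagree than hardcode the numbers.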
