Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Serial worker reloads can potentially block dynamic endpoint changes #12797

Open
mrparkers opened this issue Feb 4, 2025 · 2 comments
Open
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@mrparkers
Copy link

mrparkers commented Feb 4, 2025

What happened:

We have an ingress-nginx deployment that uses enable-serial-reloads to prevent too many nginx workers from spawning when many ingress changes are made within close proximity to one another. This configuration does not seem to have detailed documentation in the website docs, but you can see the docs in the code here:

// Defines whether multiple concurrent reloads of worker processes should occur.
// Set this to false to prevent more than n x 2 workers to exist at any time, to avoid potential OOM situations and high CPU load
// With this setting on false, configuration changes in the queue will be re-queued with an exponential backoff, until the number of worker process is the expected value.
// By default new worker processes are spawned every time there's a change that cannot be applied dynamically with no upper limit to the number of running workers
// http://nginx.org/en/docs/ngx_core_module.html#worker_processes
WorkerSerialReloads bool `json:"enable-serial-reloads,omitempty"`

I think the intention behind enable-serial-reloads is to prevent too many configuration reloads that require new nginx worker processes to be launched. However, this configuration option seems to also unintentionally prevent dynamic configuration reloads (reloads that do not require a new nginx worker process). This seems to be a bug, because it can lead to a situation where a particularly long nginx reload blocks all dynamic endpoint changes for the controller. If a deployment rollout happens during a reload that requires a new nginx worker process, the updated list of endpoints does not seem to be picked up by the controller until the reload is finished.

What you expected to happen:

If enable-serial-reloads is set to true, I expect reloads that require new nginx worker processes to be re-queued if another such reload is already in progress, but I expect for dynamic reloads to remain intact. Endpoint updates should still happen even during a particularly long reload.

Code path:

Given a pending change that requires a configuration reload, which also includes endpoint changes:

NGINX Ingress controller version (exec into the pod and run /nginx-ingress-controller --version): v1.12.0

Kubernetes version (use kubectl version):

Client Version: v1.31.3
Kustomize Version: v5.4.2
Server Version: v1.29.12-eks-2d5f260

Environment:

  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): Linux ip-10-2-52-194.us-west-2.compute.internal 5.10.223-212.873.amzn2.x86_64 #1 SMP Wed Aug 7 16:53:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    • Helm
@mrparkers mrparkers added the kind/bug Categorizes issue or PR as related to a bug. label Feb 4, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If Ingress contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Feb 4, 2025
@strongjz strongjz self-assigned this Feb 13, 2025
@strongjz
Copy link
Member

So the issue is here when the on update would fail and return an err.

What you're asking is that it continues and does the dynamic load.

I see skip dynamic configuration in the first template rendering in here, which makes sense

So it the order of operations would be

  1. check for first-time render
  2. check dynamic
  3. check nondynamic

Then, return whatever error happens.

Also, I'm unsure why the controller can retry a nondynamic reload like we do with dynamic since the nginx.go OnUpdate will stop the update during a reload.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
Development

No branches or pull requests

3 participants