With around 10,000 services in Eureka, controller and gateway CPU spikes to 100%+ and Prometheus cannot scrape gateway metrics #1615

Open
1 task done
lcfang opened this issue Dec 20, 2024 · 4 comments

Comments

@lcfang
Contributor

lcfang commented Dec 20, 2024

If you are reporting any crash or any potential security issue, do not
open an issue in this repo. Please report the issue via ASRC(Alibaba Security Response Center) where the issue will be triaged appropriately.

  • I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

When the number of services registered in Eureka exceeds 10,000, the gateway's CPU usage exceeds 100%.

Ⅱ. Describe what happened

If there is an exception, please attach the exception trace:
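Monitoring shows CPU usage of both the controller and the gateway reaching 100%. The controller restarted (error: https://github.com/alibaba/higress/issues/1536), so the gateway reported failed Prometheus scrapes as well as xDS disconnections. We tried raising the CPU of the controller and the gateway (pilot from 2 cores to 8, gateway from 4 cores to 12), but the problem persisted. We then stopped Prometheus scraping altogether, and CPU usage was still high.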

Ⅲ. Describe what you expected to happen

Normal CPU usage (no more than 50%?)

Ⅳ. How to reproduce it (as minimally and precisely as possible)

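  1. Deploy Higress v2.0.2 with Helm
  2. Enable Prometheus monitoring on the gateway
  3. Simulate registering 10,000 services in the Eureka registry
  4. Add the Eureka registry in the Higress console
  5. Prometheus scrapes fail, the gateway logs errors, and the controller restarts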
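
Step 3 could be scripted roughly as follows. This is only a sketch, not the reporter's actual setup: it assumes a Spring Cloud style Eureka server reachable at `http://localhost:8761/eureka`, and the app names (`MOCK-SVC-<n>`), IP address, and port are hypothetical placeholders. It uses Eureka's REST registration call (`POST /apps/{appID}`) with a minimal JSON instance payload.

```shell
#!/bin/sh
# Assumption: Eureka base URL; override via the EUREKA_URL environment variable.
EUREKA_URL="${EUREKA_URL:-http://localhost:8761/eureka}"

# Print a minimal instance registration payload for mock app number $1.
# All names and addresses here are hypothetical test values.
eureka_instance_json() {
  printf '{"instance":{"instanceId":"mock-%s","hostName":"mock-%s.local","app":"MOCK-SVC-%s","ipAddr":"10.0.0.1","vipAddress":"mock-svc-%s","status":"UP","port":{"$":8080,"@enabled":"true"},"dataCenterInfo":{"@class":"com.netflix.appinfo.InstanceInfo$DefaultDataCenterInfo","name":"MyOwn"}}}' "$1" "$1" "$1" "$1"
}

# Register $1 mock applications with one instance each; stop on the first failure.
register_mock_services() {
  i=1
  while [ "$i" -le "$1" ]; do
    eureka_instance_json "$i" | curl -sS -o /dev/null -X POST \
      -H 'Content-Type: application/json' --data-binary @- \
      "$EUREKA_URL/apps/MOCK-SVC-$i" || return 1
    i=$((i + 1))
  done
}
```

Calling `register_mock_services 10000` would create the 10,000 entries. Eureka evicts instances that stop sending heartbeats, so a longer-running test would also need lease renewals or real clients.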

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

  • Higress version: 2.0.2
  • OS : kylin V10 SP3
  • Others: k8s v1.28, deployed with Helm
@johnlanni
Collaborator

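CPU at 100% means the configuration thread is saturated. With this many services, computing CDS is CPU-intensive, but once the computation succeeds it is not repeated. A saturated configuration thread leaves the admin interface that serves Prometheus pending, and the health check depends on that interface, so I suggest raising the readinessProbe timeout.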
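
One way to raise the probe timeout on a running deployment is a JSON Patch via `kubectl patch`. The sketch below assumes the gateway runs as a Deployment named `higress-gateway` in the `higress-system` namespace and that the gateway is the first container in the pod; the 10-second value is only an example. The helper function names are hypothetical.

```shell
#!/bin/sh
# Build a JSON Patch that sets readinessProbe.timeoutSeconds on the first container.
# "add" replaces the value if the field already exists.
readiness_timeout_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":%d}]' "$1"
}

# Apply the patch: set_readiness_timeout <namespace> <deployment> <seconds>
set_readiness_timeout() {
  kubectl -n "$1" patch deployment "$2" --type=json -p "$(readiness_timeout_patch "$3")"
}
```

For example, `set_readiness_timeout higress-system higress-gateway 10`. A later `helm upgrade` may overwrite a manual patch, so a permanent change belongs in the chart values.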

@johnlanni
Collaborator

johnlanni commented Dec 23, 2024

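Also, judging from the logs, the connection to the controller keeps dropping while configuration is being pushed. Check whether the controller is being OOM-killed; you can raise its memory first.

Controller-side profiles can be collected as follows: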
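
A quick way to check for an OOM kill is to read each container's last termination reason from the pod status; Kubernetes reports `OOMKilled` there. The sketch below is illustrative: the function names are hypothetical, and the Deployment name `higress-controller` and the `4Gi` limit in the usage note are assumptions, not values from this thread.

```shell
#!/bin/sh
# Print "<container>\t<last termination reason>" for each container in a pod.
# Usage: last_termination_reasons <namespace> <pod>
last_termination_reasons() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Raise the memory limit on a deployment.
# Usage: raise_memory_limit <namespace> <deployment> <limit>
raise_memory_limit() {
  kubectl -n "$1" set resources deployment "$2" --limits="memory=$3"
}
```

For example, `last_termination_reasons higress-system higress-controller-xxxxxx-xxxxx`, followed by `raise_memory_limit higress-system higress-controller 4Gi` if the reason shown is `OOMKilled`.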

## Forward the controller's debug port to your local machine
kubectl port-forward pod/higress-controller-xxxxxx-xxxxx -n higress-system 15014:15014

## CPU profile (quote the URL so the shell does not interpret "?"):
go tool pprof "http://localhost:15014/debug/pprof/profile?seconds=20"

## Memory profile:
go tool pprof "http://localhost:15014/debug/pprof/heap?seconds=20"

The CPU and memory commands each generate a profile file; please share those files.

@lcfang
Contributor Author

lcfang commented Jan 9, 2025

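We have mostly pinned it down: 1. Because there are so many services, the Eureka service subscription periodically does a full refresh, and any service or configuration change also triggers a full configuration push; 2. Prometheus calling the gateway's stats endpoint to fetch metrics causes the CPU to spike.
Could on-demand configuration updates for the controller's Eureka integration be added in the next release? @johnlanni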
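
To gauge how expensive a single metrics scrape is, one could time a request against the gateway's Prometheus endpoint by hand. This is a sketch under assumptions: the function name is hypothetical, and the port (`15020`) and path (`/stats/prometheus`) in the usage note are the usual Istio-style defaults, which may differ in a given deployment.

```shell
#!/bin/sh
# Fetch a URL, discard the body, and print total time and bytes downloaded.
# Usage: time_scrape <url>
time_scrape() {
  curl -sS -o /dev/null -w '%{time_total}s %{size_download}B\n' "$1"
}
```

For example, run `kubectl port-forward pod/<gateway-pod> -n higress-system 15020:15020` in one terminal and `time_scrape http://localhost:15020/stats/prometheus` in another. A large response size or a long duration would be consistent with the scrape itself contributing to the load.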

@johnlanni
Collaborator

@lcfang Sure, the next release will add a variable to control this.
