When Eureka holds on the order of 10,000 services, controller and gateway CPU spikes to 100%+, and Prometheus cannot scrape gateway metrics #1615

lcfang · 2024-12-20T13:18:01Z

If you are reporting any crash or any potential security issue, do not
open an issue in this repo. Please report the issue via ASRC(Alibaba Security Response Center) where the issue will be triaged appropriately.

I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

When the number of services registered in Eureka exceeds 10,000, gateway CPU usage reaches 100%+.

Ⅱ. Describe what happened

If there is an exception, please attach the exception trace:
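Monitoring shows controller and gateway CPU usage reaching 100%. The controller restarted (error: https://github.com/alibaba/higress/issues/1536), so the gateway reported failed Prometheus scrapes as well as xDS disconnects. I tried raising the CPU of both the controller and the gateway (pilot from 2 cores to 8 cores, gateway from 4 cores to 12 cores), but the problem remained. I later stopped Prometheus scraping, and CPU was still high.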

Ⅲ. Describe what you expected to happen

Normal CPU usage (no more than 50%?)

Ⅳ. How to reproduce it (as minimally and precisely as possible)

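Deploy Higress v2.0.2 with Helm
Enable Prometheus monitoring on the gateway
Simulate registering 10,000 services in the Eureka registry
Add the Eureka registry in the Higress console
Prometheus scraping fails, the gateway reports errors, and the controller restarts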

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

Higress version: 2.0.2
OS : kylin V10 SP3
Others: k8s v1.28, deployed with Helm

johnlanni · 2024-12-23T02:02:38Z

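CPU at 100% means the configuration thread is saturated. With this many services, computing CDS is CPU-intensive, but once the computation succeeds it is not repeated. A saturated configuration thread leaves the admin interface that serves Prometheus pending, and the health check depends on that interface, so I suggest raising the readinessProbe timeout.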
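
One possible way to raise the readinessProbe timeout is a JSON Patch against the gateway Deployment. This is only a sketch: the Deployment name `higress-gateway`, the container index `0`, and the 10-second value are assumptions for illustration, and it presumes the container already defines a readinessProbe. The snippet prints the kubectl command for review rather than running it.

```shell
# Assumed names: adjust to match your installation.
NAMESPACE="higress-system"      # namespace used elsewhere in this thread
DEPLOYMENT="higress-gateway"    # assumption: name of the gateway Deployment
TIMEOUT_SECONDS=10              # assumption: illustrative value, not a recommendation

# Build a JSON Patch that sets timeoutSeconds on the first container's readinessProbe.
readiness_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":%d}]' "$1"
}

# Print the command so it can be reviewed before being run against a cluster.
echo "kubectl -n $NAMESPACE patch deployment $DEPLOYMENT --type=json -p '$(readiness_patch "$TIMEOUT_SECONDS")'"
```

A patch applied this way may be reverted by the next Helm upgrade, so a lasting change probably belongs in the Helm values used for the deployment.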

johnlanni · 2024-12-23T03:23:25Z

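Also, judging from the logs, the connection to the controller keeps dropping while configuration is being pushed. Check whether the controller was OOM-killed; as a first step you could give it more memory.

On the controller side, profiles can be collected as follows: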

## Forward the port to your local machine
kubectl port-forward pod/higress-controller-xxxxxx-xxxxx -n higress-system 15014:15014

## CPU profile:
	go tool pprof "http://localhost:15014/debug/pprof/profile?seconds=20"

## Memory profile:
	go tool pprof "http://localhost:15014/debug/pprof/heap?seconds=20"

The CPU and memory commands each generate a profile file; please share those files.

lcfang · 2025-01-09T02:01:57Z

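We have mostly pinned it down: 1. Because there are so many services, the Eureka service subscription periodically performs a full update, and any service or configuration change also triggers a full configuration push; 2. Prometheus calling the gateway's stats interface to fetch metrics drives CPU up.
Could the next release add on-demand updates for the controller's Eureka integration? @johnlanni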
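
To illustrate the kind of on-demand behaviour being asked for, here is a minimal sketch of change detection: fingerprint the service list on each poll and push only when the fingerprint differs. It is not how the Higress controller is implemented; the helper names `fingerprint` and `needs_push` and the sample data are hypothetical.

```shell
# Hypothetical helper: fingerprint a service list read from stdin,
# one instance per line. Sorting makes the result order-insensitive.
fingerprint() {
  LC_ALL=C sort | cksum
}

# Succeeds (exit 0) when the two snapshots differ, fails when they match.
needs_push() {
  old=$(printf '%s\n' "$1" | fingerprint)
  new=$(printf '%s\n' "$2" | fingerprint)
  [ "$old" != "$new" ]
}

# Made-up sample snapshots.
prev='svc-a 10.0.0.1:8080
svc-b 10.0.0.2:8080'
reordered='svc-b 10.0.0.2:8080
svc-a 10.0.0.1:8080'
changed='svc-a 10.0.0.1:8080
svc-b 10.0.0.3:8080'

# Same instances in a different order: no push needed.
if needs_push "$prev" "$reordered"; then echo "push"; else echo "skip"; fi
# One instance address changed: push.
if needs_push "$prev" "$changed"; then echo "push"; else echo "skip"; fi
```

With roughly 10,000 services, a check like this would at least avoid recomputing and pushing configuration on polls where nothing changed; pushing only the changed services would need a finer-grained diff.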

johnlanni · 2025-01-14T01:55:19Z

@lcfang Sure, the next release will add a variable to control this.
