support p5 instances with custom networking and second trunk interface #3189
Regarding this -
CNI is currently aware of only a single network card, i.e., network card 0, so CNI won't be aware of ENI availability on non-zero network cards.
Can you please clarify this? Do you mean ENIs are attached to the node and then you are enabling custom networking?
Between looking at the logs and the code, it looks like the CNI enumerated all interfaces and is looking for a specific interface. I think I understand what you mean, in that it is only ever going to auto-create a trunk on network card 0, and subsequently it would only ever add secondary IPs to a trunk on network card 0. When custom networking is enabled and we create the trunk interface by hand before we start the instance, the logs show it enumerating the interfaces using the metadata API and finding the trunk interface. However, it doesn't seem to be able to use this, and it fails with the error provided: that it has exceeded the ENI limit. This didn't make sense to me, because at this point it should only need to create secondary IPs to go with the trunk it found.
Just as a follow up, I was also able to manually attach a trunk interface to a test instance, on the second subnet, to network card 0. This sort of seems like it could work, but I still had some manual work to do. Configuration is as follows: network card 0: ipamd.log reports:
Then I see:
I then go into the console and add some secondary IPs to the trunk. I have not confirmed whether these pods egress/NAT from the primary ENI like we would want when going outside the pod network "subnet b". If they do, then I think the only blockers are 1) figuring out why it isn't auto-assigning IPs from the subnet defined in our ENIConfig.
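For reference, here is a minimal AWS CLI sketch of the manual steps described above, assuming a trunk ENI has already been created in the pod subnet; the instance and ENI IDs are placeholders, and the last command is the CLI equivalent of adding secondary IPs in the console:

```bash
# Attach the hand-made trunk ENI to network card 0 (device index 1,
# since the primary ENI occupies device index 0). IDs are placeholders.
aws ec2 attach-network-interface \
  --network-interface-id eni-0ddddddddddddddddd \
  --instance-id i-0aaaaaaaaaaaaaaaaa \
  --device-index 1 \
  --network-card-index 0

# Add secondary private IPs to the trunk (console equivalent).
aws ec2 assign-private-ip-addresses \
  --network-interface-id eni-0ddddddddddddddddd \
  --secondary-private-ip-address-count 10
```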
@mlsorensen - Thanks for the details. I will sync up internally with the account team to set up a call with you. We can go over this in detail.
Trunk ENIs are not managed by AWS VPC CNI, so you won't see secondary IPs being assigned to them by AWS VPC CNI. Trunk ENIs are created by an EKS control plane component, the VPC Resource Controller, for the SGPP (security groups per pod) feature - https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html. When the feature is enabled, the VPC Resource Controller allocates a single trunk ENI and multiple branch ENIs behind it (each instance type has an upper limit on branch ENIs). Pods using this feature consume one branch ENI per pod; p5.48xlarge can have 120 branch ENIs across all network cards. Ref - https://www.eksworkshop.com/docs/networking/vpc-cni/security-groups-for-pods/

AWS VPC CNI is only aware of a single network card. Since you have custom networking enabled, the primary ENI is not used for pod IPs, so the number of usable ENIs is reduced by 1. To improve pod launch latency, the CNI allocates a secondary ENI at startup when this feature is enabled - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L576-L581

If you are using the SGPP feature, I assume you have enabled this knob on the aws-node DaemonSet - https://github.com/aws/amazon-vpc-cni-k8s?tab=readme-ov-file#enable_pod_eni-v170. With this knob enabled and custom networking enabled, the CNI signals the VPC Resource Controller to attach a trunk ENI to the node (the VPC Resource Controller is also only aware of a single network card) - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L542-L562

So on network card 0, the primary ENI is not usable and a secondary ENI is allocated by the CNI at startup. There won't be any room for additional ENIs, including the trunk ENI, hence you see the max ENI limit reached.

Since you manually attached the trunk ENI, the CNI didn't delete it. Just to keep in mind (it might not be relevant here): if you attach any other secondary (non-trunk) ENI out of band, you will have to tag those ENIs - https://github.com/aws/amazon-vpc-cni-k8s/tree/master?tab=readme-ov-file#no-manage-tag - so that the CNI ignores them for pod IPs.

Regarding your other questions - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L2154-L2161
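For completeness, a hedged sketch of the two operational steps mentioned above, with a placeholder ENI ID: enabling the ENABLE_POD_ENI knob on the aws-node DaemonSet, and applying the no-manage tag (key as documented in the README link above) to a non-trunk ENI attached out of band so the CNI ignores it for pod IPs:

```bash
# Enable the SGPP knob on the aws-node DaemonSet.
kubectl set env daemonset/aws-node -n kube-system ENABLE_POD_ENI=true

# Tag an out-of-band, non-trunk secondary ENI so the CNI does not
# manage it for pod IPs. The ENI ID is a placeholder.
aws ec2 create-tags \
  --resources eni-0eeeeeeeeeeeeeeeee \
  --tags Key=node.k8s.amazonaws.com/no_manage,Value=true
```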
Yes, thank you. I will review this in further detail to ensure I understand. The confusing part for me is here
However, I can set up this trunk manually on card 0 for custom networking, and the node now has 33 ENIs. To me it would seem that the CNI just needs to start using this trunk and allocating secondary IPs for pods from this trunk's network. I don't know why it needs to attach any more ENIs at all, or why I can do this manually (which I am OK with) but the system can't.
Can you please share your CNI version? The CNI ignores the trunk ENI even if it is discovered, since it is managed only by the VPC Resource Controller. Is there any specific reason you want to attach the trunk ENI yourself rather than allowing the CNI to allocate an additional ENI, or the VPC Resource Controller to allocate the trunk ENI if the feature is enabled? In the CNI, these limits are fetched from this file - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/vpc/vpc_ip_resource_limit.go#L8005-L8006 - which is built internally from EC2 APIs - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/gen_vpc_ip_limits.go#L62 - when a new instance type is GA'ed.
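As an aside, the per-instance-type numbers that the generator script bakes into vpc_ip_resource_limit.go can also be inspected directly from the EC2 API; an illustrative query (field names per the EC2 DescribeInstanceTypes API):

```bash
# Instance-level ENI and network card limits for p5.48xlarge.
aws ec2 describe-instance-types \
  --instance-types p5.48xlarge \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces, MaximumNetworkCards]'

# Per-network-card ENI limits (card index, max ENIs on that card).
aws ec2 describe-instance-types \
  --instance-types p5.48xlarge \
  --query 'InstanceTypes[0].NetworkInfo.NetworkCards[*].[NetworkCardIndex, MaximumNetworkInterfaces]'
```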
Only because the CNI reports that it can't, due to the hard-coded ENI limit in the CNI code. I see on other instance types that we get a trunk ENI with custom networking enabled, so I was attempting to reproduce the same configuration on p5 by hand. Ideally the custom networking feature would function as it does for other instances, but I'm just trying to work around this for now and see how far we get and what is missing. I am currently using
Thanks for the clarification. The instance type supports only 2 ENIs per network card, so with custom networking the primary ENI is not used and a secondary ENI will be allocated on startup for pods. So there won't be a slot for the trunk ENI, hence you're hitting the ENI limit error. Actually, I missed this: you can have 33 ENIs, since the node supports a total of 64 ENIs across 32 network cards. In your scenario, you had one primary ENI and the trunk ENI on network card 0...
What I see on other instance types with custom networking is that it creates a trunk, uses that for the second subnet, and places pods on it. Now on p5, I am creating a trunk on the second subnet by hand, because it won't be created automatically due to the artificial limit of 32 ENIs. At a minimum, we just need the CNI to pick that trunk up and do IPAM, then have pods use it. I guess I am not understanding, because I'm almost there with a manual configuration, making the p5 behave like other instance types that support custom networking.
What would you like to be added:
Fix support for custom networking on p5 instances that contain 32 ENIs and some are EFA-only.
Currently, with p5 instances (e.g. p5.48xlarge) and 32 EFA ENIs, the primary network interface is used as a trunk, and pods get secondary IPs rather than individual ENIs. Configuration is:
This works fine, despite already being at our ENI limit of 32. The VPC CNI understands it only needs secondary IPs for Pods.
When custom networking is enabled, this fails because we have hit our instance limit on network interfaces.
{"level":"debug","ts":"2025-01-30T12:42:50.832Z","caller":"ipamd/ipamd.go:567","msg":"Skipping ENI allocation as the max ENI limit is already reached"}
However, we are able to do the following by setting up the trunk interface ourselves, calling
aws ec2 create-network-interface --interface-type=trunk
and then attaching it to an existing EFA-only card:

So the first issue is that the VPC CNI seems to artificially limit itself to creating trunk interfaces based on the max ENIs for the instance, when in fact it could add a trunk to one of the EFA-only cards as we have done manually.
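A hedged sketch of that manual step, with placeholder IDs; the device and network card indexes are only examples and need to be free slots on one of the EFA-only cards:

```bash
# Create the trunk ENI in the pod subnet (placeholder subnet/SG IDs).
aws ec2 create-network-interface \
  --subnet-id subnet-0bbbbbbbbbbbbbbbb \
  --groups sg-0ccccccccccccccccc \
  --interface-type trunk

# Attach it to one of the EFA-only network cards (example indexes).
aws ec2 attach-network-interface \
  --network-interface-id eni-0ddddddddddddddddd \
  --instance-id i-0aaaaaaaaaaaaaaaaa \
  --device-index 1 \
  --network-card-index 4
```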
When we set this node configuration up by hand and enable custom networking, we run into a second issue: it doesn't seem to be able to use "secondary IP mode", and it is possibly also not using this trunk even though it is on the subnet defined in the ENIConfig; instead it fails by detecting that we have reached max ENIs.
{"level":"warn","ts":"2025-01-31T20:22:48.548Z","caller":"ipamd/ipamd.go:825","msg":"Failed to allocate 49 IP addresses on an ENI: AllocENI: error attaching ENI: attachENI: failed to attach ENI: AttachmentLimitExceeded: Interface count 35 exceeds allowed limit for p5.48xlarge. ENI limits exceeded on following network cards: Network Card 0 (requested: 3, limit: 2)
In general, VPC CNI seems to be confused about how it is handling IP allocation on p5 instances, whether it needs an ENI, or when it can and can't add a trunk interface. We don't hit this ENI limit in the initial configuration where it seems to understand that it only needs secondary IPs.
This could also be related to #3094, as ENABLE_POD_ENI doesn't seem to do anything on these p5 instances; I had hoped it might circumvent the check for max ENIs when we don't need new ENIs.

Why is this needed:
With complex peering requirements we need to be able to separate node subnets from pod subnets. We have considered using an overlay for pods, which would introduce a major configuration change, and we will continue investigating that. The VPC CNI custom networking feature seems to do what we want on other instance types, aside from this limitation on p5 instances, where it seems to artificially block on the instance's max ENIs.