support p5 instances with custom networking and second trunk interface #3189

Open

mlsorensen opened this issue Feb 4, 2025 · 9 comments

mlsorensen commented Feb 4, 2025

What would you like to be added:

Fix support for custom networking on p5 instances that have 32 ENIs, some of which are EFA-only.

Currently, with p5 instances (e.g. p5.48xlarge) and 32 EFA ENIs, the primary network interface is used as a trunk, and pods get secondary IPs rather than individual ENIs. Configuration is:

Network card 0: EFA + ENA on device index 0, subnet "1", pods are secondary IPs on subnet "1" (Primary interface)
Network card 1 through 31: EFA-only on device index 1

This works fine, despite already being at our ENI limit of 32. The VPC CNI understands it only needs secondary IPs for Pods.

When custom networking is enabled, this fails because we have hit our instance limit on network interfaces.

{"level":"debug","ts":"2025-01-30T12:42:50.832Z","caller":"ipamd/ipamd.go:567","msg":"Skipping ENI allocation as the max ENI limit is already reached"}

However, we are able to do the following by creating the trunk interface with aws ec2 create-network-interface --interface-type=trunk and then attaching it to an existing EFA-only card (see the sketch after this list):

Network card 0: EFA + ENA on device index 0, subnet "1" (Primary interface)
Network card 1-3: EFA-only on device index 1
Network card 4: EFA + ENA on device index 0, subnet "2", pods are secondary IPs on subnet "2" (and would NAT out of primary interface per doc as needed)
Network card 5-31: EFA-only on device index 1
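
For reference, a rough sketch of that manual workaround in shell. The subnet, security group, and instance IDs are placeholders, and the device/card indexes simply mirror the layout above, so they may need adjusting for a given instance:

# Create a trunk ENI in pod subnet "2" (placeholder IDs).
TRUNK_ENI=$(aws ec2 create-network-interface \
  --subnet-id subnet-2222222222222222 \
  --groups sg-1111111111111111 \
  --interface-type trunk \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)

# Attach it to one of the network cards that previously carried only an EFA-only interface.
aws ec2 attach-network-interface \
  --network-interface-id "$TRUNK_ENI" \
  --instance-id i-0abc0abc0abc0abc0 \
  --network-card-index 4 \
  --device-index 0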

So the first issue is that the VPC-CNI seems to artificially limit itself to creating trunk interfaces based on the instance's max ENIs, when in fact it could add a trunk to one of the EFA-only cards, as we have done manually.

When we set this node configuration up by hand and enable custom networking, we run into a second issue: it doesn't seem to be able to use "secondary IP mode", and it possibly also isn't using this trunk even though it is on the subnet defined in the ENIConfig; instead it fails after detecting that we have reached max ENIs.

{"level":"warn","ts":"2025-01-31T20:22:48.548Z","caller":"ipamd/ipamd.go:825","msg":"Failed to allocate 49 IP addresses on an ENI: AllocENI: error attaching ENI: attachENI: failed to attach ENI: AttachmentLimitExceeded: Interface count 35 exceeds allowed limit for p5.48xlarge. ENI limits exceeded on following network cards: Network Card 0 (requested: 3, limit: 2)

In general, the VPC CNI seems to be confused about how it handles IP allocation on p5 instances: whether it needs an ENI, and when it can and can't add a trunk interface. We don't hit this ENI limit in the initial configuration, where it seems to understand that it only needs secondary IPs.

This could also be related to #3094 - ENABLE_POD_ENI doesn't seem to do anything on these p5 instances, whereas I had hoped it might bypass the max-ENI check when we don't need new ENIs.

Why is this needed:

With complex peering requirements, we need to be able to separate node subnets from pod subnets. We have considered using an overlay for pods, which would introduce a major configuration change, and will continue investigating that. The VPC CNI custom networking feature seems to do what we want on other instance types, aside from this limitation on p5 instances, where it seems to artificially block on the instance's max ENIs.
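
For context, this is roughly what the working custom networking setup looks like on those other instance types; a minimal sketch, assuming placeholder subnet/security group IDs and the usual zone-named ENIConfig convention:

# Enable custom networking on the VPC CNI (aws-node DaemonSet).
kubectl set env daemonset aws-node -n kube-system \
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true \
  ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

# One ENIConfig per AZ, pointing at the pod subnet (placeholder IDs).
cat <<'EOF' | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-west-2a
spec:
  subnet: subnet-2222222222222222
  securityGroups:
    - sg-1111111111111111
EOF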

@jayanthvn
Contributor

@mlsorensen

Regarding this -

So the first issue is that the VPC-CNI seems to artificially limit itself to creating trunk interfaces based on the instance's max ENIs, when in fact it could add a trunk to one of the EFA-only cards, as we have done manually.

The CNI is currently aware of only a single network card, i.e. network card 0, so it won't be aware of ENI availability on non-zero network cards.

When we set this node configuration up by hand and enable custom networking, we run into a second issue: it doesn't seem to be able to use "secondary IP mode", and it possibly also isn't using this trunk even though it is on the subnet defined in the ENIConfig; instead it fails after detecting that we have reached max ENIs.

Can you please clarify this? Do you mean ENIs are attached to the node and then you enable custom networking?

@mlsorensen
Author

Between looking at the logs and the code, it looks like the CNI enumerated all interfaces and is looking for an interface that is type=trunk.

I think I understand what you mean in that it is only ever going to auto-create a trunk on network card 0? And subsequently, it would only ever add secondary IPs to a trunk on network card 0?

When custom networking is enabled and we create the trunk interface by hand before we start the instance, according to the logs it seems to enumerate the interfaces using the metadata API and find the trunk interface. However, it doesn't seem to be able to use it, and fails with the error provided, saying that it has exceeded the ENI limit. This didn't make sense to me, because at this point it should only need to create secondary IPs to go with the trunk it found.

@mlsorensen
Author

Just as a follow-up, I was also able to manually attach a trunk interface to a test instance, on a second subnet, to network card 0. This seems like it could work, but I still had some manual work to do.

Configuration is as follows:

network card 0:
primary IP/ENI on "subnet a 10.0.128.0/17": eni-01bc0fc5ded807e44 10.0.131.156/17 06:39:ad:c0:55:fd
trunk ENI on "subnet b 10.0.16.0/20": eni-096c0243efe7df4ee 10.0.27.49/20 06:8a:f4:99:f1:d1

ipamd.log reports:

{"level":"debug","ts":"2025-02-08T00:15:56.312Z","caller":"awsutils/awsutils.go:567","msg":"Found ENI: eni-01bc0fc5ded807e44, MAC 06:39:ad:c0:55:fd, device 0"}
{"level":"debug","ts":"2025-02-08T00:15:56.314Z","caller":"awsutils/awsutils.go:567","msg":"Found IPv4 addresses associated with interface. This is not efa-only interface"}
...

"level":"debug","ts":"2025-02-08T00:15:56.351Z","caller":"awsutils/awsutils.go:567","msg":"Found ENI: eni-096c0243efe7df4ee, MAC 06:8a:f4:99:f1:d1, device 1"}
{"level":"debug","ts":"2025-02-08T00:15:56.353Z","caller":"awsutils/awsutils.go:567","msg":"Found IPv4 addresses associated with interface. This is not efa-only interface"}

Then I see:

"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.27.21/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.27.21/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.21.88/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.21.88/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.16.90/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.16.90/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.30.75/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.30.75/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.21.251/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.21.251/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.30.175/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.30.175/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.21.128/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.21.128/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 10.0.27.161/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 10.0.27.161/ffffffff"}
{"level":"debug","ts":"2025-02-08T00:38:06.263Z","caller":"datastore/data_store.go:607","msg":"AssignPodIPv4Address: ENI eni-096c0243efe7df4ee does not have available addresses"}

I then go into the console and add some secondary IPs to eni-096c0243efe7df4ee from 10.0.16.0/20. Now I do see some of these getting assigned to pods, and these pods come up on "subnet b" and can ping the gateway 10.0.16.1. Things are a bit messed up, though, because I have some secondary IPs assigned to pods on "subnet a" and some on "subnet b", and I had to do some manual work to get here.
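
(For reference, the CLI equivalent of that console step would be something like the following; the address count here is arbitrary.)

aws ec2 assign-private-ip-addresses \
  --network-interface-id eni-096c0243efe7df4ee \
  --secondary-private-ip-address-count 10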

I have not confirmed whether these pods egress/NAT from the primary ENI as we would want when going outside the pod network "subnet b". If they do, then I think the only blockers are:

1) figuring out why it isn't auto-assigning IPs from the subnet defined in our ENIConfig, and
2) figuring out how to get it to create the trunk on NIC 0 automatically rather than blocking on max ENIs.

@jayanthvn
Contributor

@mlsorensen - Thanks for the details. I will sync up internally with the account team to set up a call with you. We can go over this in detail.

because at this point it should only need to create secondary IPs to go with the trunk it found.

Trunk ENIs are not managed by the AWS VPC CNI, so you won't see secondary IPs being assigned to them by the CNI. Trunk ENIs are created by an EKS control plane component, the VPC Resource Controller, for the SGPP (security groups per pod) feature -

https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html.

When the feature is enabled, the VPC Resource Controller allocates a single trunk ENI and multiple branch ENIs behind that trunk ENI (each instance has an upper limit on branch ENIs). Pods using this feature consume one branch ENI per pod; p5.48xlarge can have 120 branch ENIs across all network cards -

https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go#L8288-L8298


Ref - https://www.eksworkshop.com/docs/networking/vpc-cni/security-groups-for-pods/

The AWS VPC CNI is only single-card aware. So with p5.48xlarge you get 1 network card and 2 ENIs, even though the capacity is 32 network cards with 2 ENIs per network card (64 ENIs).


Now, since you have custom networking enabled, the primary ENI is not used for pod IPs, so the number of usable ENIs is reduced by 1. To improve pod launch latency, the CNI allocates a secondary ENI at startup when this feature is enabled -

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L576-L581

If you are using the SGPP feature, I assume you have enabled this knob on the aws-node DS - https://github.com/aws/amazon-vpc-cni-k8s?tab=readme-ov-file#enable_pod_eni-v170
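
For completeness, a sketch of turning that knob on and verifying it, assuming kubectl access to the cluster:

kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true
kubectl describe daemonset aws-node -n kube-system | grep ENABLE_POD_ENI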

With this knob enabled and custom networking enabled, the CNI signals the VPC Resource Controller to attach a trunk ENI to the node (the VPC Resource Controller is also only aware of a single network card) -

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L542-L562

So on network card 0, the primary ENI is not usable and a secondary ENI is allocated by the CNI at startup. That leaves no room for additional ENIs, including a trunk ENI, which is why you see the max ENI limit reached.

Since you manually tried this - "I was also able to manually attach a trunk interface to a test instance, on a second subnet, to network card 0" - at node startup, it worked, but you would see CNI logs failing to allocate a secondary ENI because of the ENI limit breach.

Just to keep in mind (it might not be relevant here): since you attached a trunk ENI, the CNI didn't delete it. But if you attach any other secondary (non-trunk) ENI out of band, you will have to tag those ENIs - https://github.com/aws/amazon-vpc-cni-k8s/tree/master?tab=readme-ov-file#no-manage-tag - so that the CNI ignores them for pod IPs.
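
A sketch of that tagging step (placeholder ENI ID; tag key per the no-manage-tag README section linked above):

aws ec2 create-tags \
  --resources eni-0123456789abcdef0 \
  --tags Key=node.k8s.amazonaws.com/no_manage,Value=true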

Regarding your other question - "figuring out how to get it to create the trunk on NIC 0 automatically rather than blocking on max ENIs" -> Yes, we do reserve one ENI for the trunk when the SGPP feature is enabled, but because of the per-network-card ENI limit, the CNI couldn't reserve the trunk ENI slot for you -

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L2154-L2161

@mlsorensen
Author

mlsorensen commented Feb 8, 2025

Yes, thank you. I will review this in further detail to ensure I understand. The confusing part for me is "because of the per-network-card ENI limit, the CNI couldn't reserve the trunk ENI slot for you" - this matches what I see in the log when custom networking is enabled:

{"level":"debug","ts":"2025-02-08T17:54:58.219Z","caller":"ipamd/ipamd.go:567","msg":"Skipping ENI allocation as the max ENI limit is already reached"}
{"level":"error","ts":"2025-02-08T17:54:58.219Z","caller":"aws-k8s-agent/main.go:42","msg":"Initialization failure: Failed to attach any ENIs for custom networking"}

However, I can set up this trunk manually on card 0 for custom networking, and the node then has 33 ENIs. To me it would seem that the CNI just needs to start using this trunk and allocating secondary IPs for pods from this trunk's subnet. I don't know why it needs to attach any more ENIs at all, or why I can do this manually (which I am OK with) but the system can't.

@jayanthvn
Contributor

jayanthvn commented Feb 8, 2025

Can you please share your CNI version? The CNI ignores a trunk ENI even if it is discovered, since trunk ENIs are managed only by the VPC Resource Controller. Is there any specific reason you want to attach the trunk ENI yourself rather than allowing the CNI to allocate an additional ENI, or the VPC RC to allocate the trunk ENI if the feature is enabled?

In the CNI, these limits are fetched from this file - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/vpc/vpc_ip_resource_limit.go#L8005-L8006 - which is generated internally from EC2 APIs - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/gen_vpc_ip_limits.go#L62 - when a new instance type is GA'ed.
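
If it helps, you can also see the raw EC2 values those generated limits come from with something like:

aws ec2 describe-instance-types \
  --instance-types p5.48xlarge \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkCards,MaximumNetworkInterfaces,NetworkCards[0].MaximumNetworkInterfaces]'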

@mlsorensen
Author

mlsorensen commented Feb 10, 2025

Is there any specific reason you want to attach the trunk ENI yourself rather than allowing the CNI to allocate an additional ENI, or the VPC RC to allocate the trunk ENI if the feature is enabled?

Only because the CNI reports that it can't, due to the hard-coded ENI limit in the CNI code. I see on other instance types that we get a trunk ENI with custom networking enabled, so I was attempting to reproduce the same configuration on p5 by hand. Ideally the custom networking feature would function as it does for other instances, but I'm just trying to work around this for now, to see how far we get and what is missing.

I am currently using v1.19.2-eksbuild.1.

jayanthvn assigned jaydeokar and unassigned dshehbaj Feb 10, 2025
@jayanthvn
Contributor

jayanthvn commented Feb 10, 2025

Thanks for the clarification. The instance type supports only 2 ENIs per network card, so with custom networking the primary ENI is not used and a secondary ENI is allocated at startup for pods. That leaves no slot for the trunk ENI, hence you hit the ENI limit error.

Actually, I missed this: you can have 33 ENIs, since the node supports a total of 64 ENIs across 32 network cards. In your scenario, you had one primary ENI and the trunk ENI on network card 0...

@mlsorensen
Author

What I see on other instance types with custom networking is that it creates a trunk, uses that for the second subnet, and places pods on it.

Now, on p5, I am creating a trunk with the second subnet myself, because it won't be created automatically due to the artificial limit of 32 ENIs. At a minimum, we just need the CNI to pick that trunk up, do IPAM, and have pods use it.

I guess I am not understanding, because I'm almost there with a manual configuration, making the p5 behave like other instance types that support custom networking.
