Adding Efa device level metrics #277

Open
Aakash-Dantre wants to merge 3 commits into base: aws-cwa-dev

Description:

This PR adds the Elastic Network Interface (ENI) ID as a dimension on EFA (Elastic Fabric Adapter) metrics. Tagging each EFA device with its ENI ID improves observability by making it easier to correlate EFA metrics with the corresponding AWS network interface resources.

Implementation Details

The ENI ID is derived through the following process:

  1. For each EFA device, we extract its MAC address from the IPv6 GID (Global Identifier) using the following steps:

    • Read the GID from /sys/class/infiniband/<device>/ports/<port>/gids/0
    • Verify it's a link-local IPv6 address (fe80::/10)
    • Extract the interface identifier (last 64 bits)
    • Convert from modified EUI-64 format back to MAC address by:
      • Inverting the Universal/Local bit in the first octet
      • Removing the ff:fe middle bytes
      • Reconstructing the 6-byte MAC address
  2. Using the AWS EC2 metadata service, we map the MAC address to its corresponding ENI ID

  3. The ENI ID is then added as a dimension to all EFA metrics:

    • Node-level EFA metrics
    • Pod-level EFA metrics (when available)
    • Container-level EFA metrics (when available)
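Steps 1 and 2 above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this PR: the function names (`mac_from_gid`, `imds_get`, `eni_id_for_gid`) are hypothetical, and it assumes the sysfs GID file holds a colon-separated IPv6 string. The lookup uses the IMDSv2 token flow and the `network/interfaces/macs/<mac>/interface-id` metadata path, so it only works when run on an EC2 instance.

```python
import ipaddress
import urllib.request

IMDS = "http://169.254.169.254/latest"


def mac_from_gid(gid: str) -> str:
    """Recover a MAC address from a link-local IPv6 GID (modified EUI-64)."""
    addr = ipaddress.IPv6Address(gid.strip())
    if not addr.is_link_local:  # must fall inside fe80::/10
        raise ValueError(f"GID {gid!r} is not a link-local IPv6 address")
    iid = addr.packed[8:]  # interface identifier: the last 64 bits
    if iid[3:5] != b"\xff\xfe":
        raise ValueError(f"GID {gid!r} is not in modified EUI-64 format")
    # Flip the Universal/Local bit (0x02) and drop the ff:fe filler bytes.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


def imds_get(path: str, timeout: float = 2.0) -> str:
    """Fetch one metadata path using an IMDSv2 session token."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def eni_id_for_gid(gid: str, fetch=imds_get) -> str:
    """Map a device GID to its ENI ID; `fetch` is injectable for testing."""
    mac = mac_from_gid(gid)
    return fetch(f"network/interfaces/macs/{mac}/interface-id")
```

Passing a stub as `fetch` lets the MAC derivation and the metadata path be exercised without network access.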

Example Metric Output

ElasticNetworkInterfaceId appears in the following dimension sets:

    • Container-level metrics: ClusterName, ContainerName, ElasticNetworkInterfaceId, FullPodName, Namespace, PodName
    • Pod-level metrics: ClusterName, ElasticNetworkInterfaceId, FullPodName, Namespace, PodName
    • Node-level metrics: ClusterName, ElasticNetworkInterfaceId, InstanceId, NodeName

Testing:

EMF Output

{
    "AutoScalingGroupName": "eks-workers2-eeca6dae-a846-c51d-c535-db5e19d562e9",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "ElasticNetworkInterfaceId",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_efa_rx_dropped",
                    "Unit": "Count/Second"
                },
                {
                    "Name": "container_efa_tx_bytes",
                    "Unit": "Bytes/Second"
                },
                {
                    "Name": "container_efa_rdma_read_bytes",
                    "Unit": "Bytes/Second"
                },
                {
                    "Name": "container_efa_rdma_write_bytes",
                    "Unit": "Bytes/Second"
                },
                {
                    "Name": "container_efa_rdma_write_recv_bytes",
                    "Unit": "Bytes/Second"
                },
                {
                    "Name": "container_efa_rx_bytes",
                    "Unit": "Bytes/Second"
                }
            ]
        }
    ],
    "ClusterName": "efa-device-level-metrics-demo",
    "ContainerName": "efaburn",
    "EfaDevice": "efa_0",
    "ElasticNetworkInterfaceId": "eni-08dd34f290cc4331c",
    "FullPodName": "my-training-job-1-t9dgz",
    "InstanceId": "i-0d9b1eaefd29cc223",
    "InstanceType": "c6in.32xlarge",
    "Namespace": "default",
    "NodeName": "ip-192-168-121-229.us-east-2.compute.internal",
    "PodName": "my-training-job-1",
    "Timestamp": "1738852539339",
    "Type": "ContainerEFA",
    "Version": "0",
    "kubernetes": {
        "container_name": "efaburn",
        "containerd": {
            "container_id": "48de01f68be2afcceae15983968164c47cb724eda9633e9c910e8a205d1e2c11"
        },
        "host": "ip-192-168-121-229.us-east-2.compute.internal",
        "labels": {
            "controller-revision-hash": "7446db4d9c",
            "name": "efaburn",
            "pod-template-generation": "1"
        },
        "namespace_name": "default",
        "pod_name": "my-training-job-1-t9dgz",
        "pod_owners": [
            {
                "owner_kind": "DaemonSet",
                "owner_name": "my-training-job-1"
            }
        ]
    },
    "container_efa_rdma_read_bytes": 762499301376,
    "container_efa_rdma_write_bytes": 0,
    "container_efa_rdma_write_recv_bytes": 0,
    "container_efa_rx_bytes": 762530803280,
    "container_efa_rx_dropped": 0,
    "container_efa_tx_bytes": 48323760
}
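
A record like the one above can be sanity-checked with a few lines of Python. The sketch below is not part of this PR and its helper names are hypothetical. It relies on the EMF convention that every dimension name listed under `CloudWatchMetrics[].Dimensions` refers to a top-level member of the same record, and it reports whether any dimension set carries `ElasticNetworkInterfaceId`.

```python
def undeclared_dimensions(record: dict) -> set:
    """Dimension names referenced in the EMF metadata but absent as top-level keys."""
    missing = set()
    for directive in record.get("CloudWatchMetrics", []):
        for dim_set in directive.get("Dimensions", []):
            missing.update(name for name in dim_set if name not in record)
    return missing


def has_eni_dimension(record: dict) -> bool:
    """True if at least one dimension set includes ElasticNetworkInterfaceId."""
    return any(
        "ElasticNetworkInterfaceId" in dim_set
        for directive in record.get("CloudWatchMetrics", [])
        for dim_set in directive.get("Dimensions", [])
    )
```

For the output shown here, `undeclared_dimensions` should return an empty set and `has_eni_dimension` should return `True`, once the record is loaded with `json.loads`.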
