vHostUser interface Design Proposal #218

Status: Open (wants to merge 1 commit into base: main)
Changes: design-proposals/vhostuser/vhostuser-interface.md (new file, 166 additions)

# Overview
- User space networking is a feature in which the packet path goes from the VM directly to a data plane running in user space (for example OVS-DPDK), bypassing the kernel.

**Review comment (Member):**
Hi, any idea to make vhost-user interface available for block devices and qemu-storage-daemon as well?

**Reply:**
@alicefr and @albertofaria have investigated vhost-user-blk devices in the past separately from this PR. Perhaps they can share a proposal or the current state of the discussion with you.

- With the vHostUser interface, a KubeVirt VM can use DPDK's virtio functionality and avoid the kernel path for traffic when DPDK is involved, enabling a fast datapath and improving performance.

## Motivation
- Any KubeVirt VM uses Linux kernel interfaces to connect to its host, to reach out externally, or to communicate within the cluster. Current KubeVirt VMs only support the kernel data path, in which traffic must pass through the kernel of the node hosting the VM.

- As a result, when a host runs a DPDK dataplane in user space, traffic from a KubeVirt VM still goes through the kernel before reaching that dataplane. This longer path slows down traffic between the application and the dataplane even though both run in user space.

- In order to improve traffic performance and make use of DPDK, we need a different path that bypasses the kernel and reaches the dataplane directly from user space. This so-called fast path runs entirely within user space and allows fast packet processing.

## Goals
- A KubeVirt VM created with DPDK features should be able to use the fast datapath if the host supports the DPDK dataplane, allowing fast packet processing and improving traffic performance.

## Non Goals
- The Userspace CNI, together with the dataplane running in user space (e.g., DPDK/VPP), should ensure proper handling of mounting and unmounting the volumes. These changes are CNI specific and may vary from CNI to CNI.
- kubevirtci creates kernel-mode dataplane hosts for testing the VMs. It needs to be updated to support DPDK.

## Definition of Users
- Any user who has a Kubernetes cluster with a DPDK dataplane can use the interface/feature.
- The feature will be supported only if a multus userspace CNI is used, for instance the Userspace CNI from Intel.

## User Stories
- As a user/admin, I want to use the user space networking feature when I have a VM with DPDK virtio interfaces and the host has a DPDK dataplane and a userspace CNI with multus support, so that traffic bypasses the host kernel.

## Repos

- kubevirt/kubevirt
  - Introduction of vhostuser type interface.
- kubevirt/kubevirtci
  - Enable usage of DPDK on the setups created by kubevirtci.

# Design
- The design involves introducing a vHostUser interface type and the parameters required to support it. An EmptyDir volume (shared-dir) and a DownwardAPI volume (podinfo) are mounted on the virt-launcher pod to support the new interface.
- These mounts will allow the virt-launcher pod to create a VM with an additional interface that can be reached using the vhost socket referenced in the DownwardAPI. A minimal sketch of these volumes is shown below.
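
As a rough illustration only (not the exact implementation), the volumes and mounts that virt-controller would add to the virt-launcher pod spec could look like the sketch below. The volume names and mount paths are taken from this proposal, and the `compute` container name matches the default container shown in the Annex.

```yaml
# Sketch only: volume and mount entries virt-controller would add to the
# virt-launcher pod spec (names and paths taken from this proposal).
volumes:
- name: shared-dir            # shared with the CNI/dataplane; holds the vhost-user socket(s)
  emptyDir: {}
- name: podinfo               # exposes pod annotations so virt-launcher can find the socket name
  downwardAPI:
    items:
    - path: annotations
      fieldRef:
        fieldPath: metadata.annotations
containers:
- name: compute
  volumeMounts:
  - name: shared-dir
    mountPath: /var/lib/cni/usrcni
  - name: podinfo
    mountPath: /etc/podnetinfo
```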

- Creating the VMs will follow the same process, with a few changes highlighted below:
1. **Once the VM spec is created, virt-controller will add two new volumes to the virt-launcher pod.**\
a. **An EmptyDir volume named shared-dir, mounted at "/var/lib/cni/usrcni/" in the virt-launcher pod, is used to share the vhostuser socket file between the virt-launcher pod and the DPDK dataplane. This socket file acts as a UNIX socket between the VM interface and the dataplane running in user space.**\
b. **A DownwardAPI volume named podinfo, mounted at "/etc/podnetinfo/annotations", is used to share the vhostuser socket file name with virt-launcher via pod annotations. The downwardAPI is used here so the pod can learn socket details such as name and path, which are created by the CNI/kubemanager while bringing up the virt-launcher pod; this information is only available after the pod is created.**
2. **The CNI should mount the shared-dir to "/var/lib/cni/usrcni". The CNI can link the EmptyDir volume's host path `/var/lib/kubelet/pods/<podID>/volumes/kubernetes.io~empty-dir/shared-dir` to a custom path the CNI prefers. The CNI should create the sockets on the host in that custom path, one socket per interface.**
3. **The CNI should update the virt-launcher pod annotations with the vhost socket file name and details, which are exposed to the pod through the downwardAPI. The /etc/podnetinfo/annotations file in the virt-launcher pod will hold the socket(s) details.**
4. **virt-launcher reads the DownwardAPI volume, retrieves the vhost socket file name specified in the pod annotations (see the sketch after this list), and uses it while launching the VM through libvirtd.**
5. **virt-launcher is modified to skip establishing the networking between the VM and the virt-launcher pod using a bridge (refer to the KubeVirt Networking section in https://kubevirt.io/2018/KubeVirt-Network-Deep-Dive.html).**
6. **Instead of using the bridge through the launcher pod, the vHostUser interface of the VM will be directly connected to the DPDK dataplane using the vhost socket.**
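
For illustration, the vhost-user portion of the `k8s.v1.cni.cncf.io/network-status` annotation that virt-launcher parses is shown below, rendered as YAML for readability; the values are sample data copied from the full example in the Annex.

```yaml
# Sample values only (copied from the Annex example); the socket "path" is the
# file name the CNI created inside the shared-dir mount (/var/lib/cni/usrcni).
- name: kubevirttest/vn-blue
  interface: net1
  device-info:
    type: vhost-user
    vhost-user:
      mode: server
      path: 2ea1931c-2935-net1
```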

**Review comment:**
It sounds like `<source type='unix' path='/tmp/vhost1.sock' mode='server'/>` will be used so QEMU creates the UNIX domain socket, and then something must trigger DPDK to connect to this socket?

Which component sends a command to DPDK to establish the UNIX domain socket connection that QEMU is waiting for so the VM can start?

**Reply (Author):**
@stefanha, I didn't get your question.

**Review comment:**
The vhost-user protocol allows the frontend and backend to act as the client or server. This is configurable and QEMU supports both acting as a socket client and a server.

In your proposal the socket seems to be created by QEMU (the frontend) and DPDK (the backend) connects to the UNIX domain socket. (I say this because you wrote "The virt-launcher reads the DownwardAPI volume and retrieves the vhostsocket-file name specified in pod annotations and uses it to create a unix socket in the VM".)

My question is about the order in which QEMU creates the UNIX domain socket and DPDK connects to it:

How does DPDK know when it is time to connect to the UNIX domain socket since the socket is created after virt-launcher Pod launches?

Does DPDK need to attempt to reconnect in a loop or is there a step where DPDK is notified that it's time to connect?

**Author (@nvinnakota10), May 17, 2023:**
@stefanha, the UNIX socket is expected to be created on the host by the CNI in the custom path linked to the shared directory, and QEMU then connects to it. The socket(s) details will be passed to QEMU by virt-launcher while creating the VM; virt-launcher gets this information from the downwardAPI populated by the kubemanager.

I have rephrased to avoid confusion.

**Review comment:**
Does the NAD below need to be updated too?

        "host": {
                "engine": "ovs-dpdk",
                "iftype": "vhostuser",
                "netType": "bridge",
                "vhost": {
                        "mode": "client"
                },
                "bridge": {
                        "bridgeName": "br-dpdk0"
                }
        },
...
        "container": {
                "engine": "ovs-dpdk",
                "iftype": "vhostuser",
                "netType": "interface",
                "vhost": {
                        "mode": "server"
                }
        },

Maybe "mode" needs to be reversed for "host" and "container".

Do you intend to support both configurations, or just DPDK acting as the server and QEMU acting as the client?

**Review comment:**
So for passt with vhost-user (see also #218 (comment)) we would have, say, "iftype": "vhostuser" in the "container" object, but with different "engine" or "netType", and I guess something completely different in the "host" object...?


- Removing the VMs will follow the changes below:
1. **The CNI should delete the socket files of each interface. If 'n' interfaces are present, 'n' socket files will be deleted, allowing k8s to proceed with a clean pod deletion.**

No additional changes are required while deleting the VM; it will be deleted as a regular VM, and the CNI should handle the deletion of interfaces. In the case of vHostUser, the CNI should delete the socket files before k8s initiates pod deletion.

**Review comment (Member):**
I apologize if I insist on this point, but does kubelet request the deletion of the network interface from the CNI again if there is a failure? If so, then the cleanup shouldn't be a problem.

**Review comment (Contributor):**
AFAIU no; depending on the implementation, kubelet may not even be aware something failed.

The CNI spec in fact indicates a CNI DEL should not return an error:

> Plugins should generally complete a DEL action without error even if some resources are missing.

Some entity should reconcile the resources afterwards ensuring those are gone.

**Review comment (Member):**
Then, we might still have a problem here :/ @nvinnakota10 if you have a working prototype, I'd be really interested to know what happens if you try to kill the component doing the unmount and at the same time deleting the VMI/pod.

**Author (@nvinnakota10), May 24, 2023:**
@alicefr, I tried something similar to what you asked. I made changes to my CNI so that it won't execute the deletion of sockets; I just made sure CNI DEL won't do anything.

I tried deleting the VM using the kubectl delete command. The kubelet logs from journalctl showed that kubelet did unmount both volumes, shared-dir (emptyDir) and podinfo (downwardAPI). The virt-launcher deletion did not run into any issues.

Please find the logs related to the same:

    May 23 22:07:12 vihari-worker-1 kubelet[5747]: I0523 22:07:12.908533 5747 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume "shared-dir" (UniqueName: "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir") pod "virt-launcher-vm-single-virtio-sktwl" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") " pod="kubevirttest/virt-launcher-vm-single-virtio-sktwl"
    May 23 22:07:12 vihari-worker-1 kubelet[5747]: I0523 22:07:12.908556 5747 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume "podinfo" (UniqueName: "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo") pod "virt-launcher-vm-single-virtio-sktwl" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") " pod="kubevirttest/virt-launcher-vm-single-virtio-sktwl"

    ————————————
    May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.729817 5747 reconciler.go:211] "operationExecutor.UnmountVolume started for volume "podinfo" (UniqueName: "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") "

    May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.729871 5747 reconciler.go:211] "operationExecutor.UnmountVolume started for volume "shared-dir" (UniqueName: "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") "

    May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.737107 5747 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo" (OuterVolumeSpecName: "podinfo") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77"). InnerVolumeSpecName "podinfo". PluginName "kubernetes.io/downward-api", VolumeGidValue ""

    May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.772221 5747 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir" (OuterVolumeSpecName: "shared-dir") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77"). InnerVolumeSpecName "shared-dir". PluginName "kubernetes.io/empty-dir", VolumeGidValue ""

    May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.831658 5747 reconciler.go:399] "Volume detached for volume "podinfo" (UniqueName: "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo") on node "vihari-worker-1" DevicePath """

    May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.831680 5747 reconciler.go:399] "Volume detached for volume "shared-dir" (UniqueName: "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir") on node "vihari-worker-1" DevicePath """


- With the above process, a KubeVirt VM with a vHostUser interface can be created. All the steps highlighted above are the changes made as part of the design.

## API Examples

KubeVirt will always explicitly define the pod interface name for multus-cni. It will be computed from the VMI spec interface name to allow multiple connections to the same multus-provided network.

The vHostUser Interface will be defined in the VM spec as shown below:

```yaml
interfaces:
- name: vhost-user-vn-blue
  vhostuser: {}
  useVirtioTransitional: true
networks:
- name: vhost-user-vn-blue
  multus:
    networkName: vn-blue
```
No additional definition of volumes or annotations is required. The DownwardAPI will provide all the necessary annotations required for the vhostUser interface.

## Scalability
- There should not be any scalability issues, as the feature only adds an interface where needed.
- Regarding VM live migration, we need to check whether the created socket files can support VM migration.

## Update/Rollback Compatibility
- Should have no impact.

## Functional Testing Approach
Functional tests can:

- Create a VM with a vhostUser interface
- Create a VM with multiple interface types (bridge + vhostuser), for example as sketched below
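
For example, the multi-interface test could use a VMI spec fragment along the lines of the sketch below; the network names are hypothetical, and the `vhostuser` type is the one proposed in this document.

```yaml
# Sketch only: mixing an existing bridge-type interface with the proposed
# vhostuser type; network names are hypothetical.
domain:
  devices:
    interfaces:
    - name: default
      bridge: {}
    - name: vhost-user-vn-blue
      vhostuser: {}
networks:
- name: default
  pod: {}
- name: vhost-user-vn-blue
  multus:
    networkName: vn-blue
```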

# Implementation Phases
- The design will be implemented in two phases:
  - Add DPDK support in kubevirt/kubevirtci
    - Enable OVS with DPDK
    - Enable an option in gocli for DPDK support
    - Add the UserSpace CNI to the setup instead of or along with Calico
  - Add the vhostUser interface type in kubevirt/kubevirt
    - Add the vHostUser interface type
    - Create a pod template to support the interface
    - Create the appropriate libvirt domain XML elements
    - Add an E2E test to send traffic between the two VMs created

## Annex


A sample DownwardAPI annotations file might look like this:

```yaml
k8s.v1.cni.cncf.io/network-status="[{\n \"name\": \"k8s-kubemanager-kubernetes-CNI/default-podnetwork\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.244.2.4\"\n ],\n \"mac\": \"02:d7:b3:27:b3:b2\",\n \"default\": true,\n \"dns\": {}\n},{\n \"name\": \"kubevirttest/vn-blue\",\n \"interface\": \"net1\",\n \"ips\": [\n \"19.1.1.2\"\n ],\n \"mac\": \"02:c6:da:d3:ab:07\",\n \"dns\": {},\n \"device-info\": {\n \"type\": \"vhost-user\",\n \"vhost-user\": {\n \"mode\": \"server\",\n \"path\": \"2ea1931c-2935-net1\"\n }\n }\n}]"
k8s.v1.cni.cncf.io/networks="[{\"interface\":\"net1\",\"name\":\"vn-blue\",\"namespace\":\"kubevirttest\",\"cni-args\":{\"interface-type\":\"vhost-user-net\"}}]"
kubectl.kubernetes.io/default-container="compute"
kubernetes.io/config.seen="2023-05-16T16:50:04.224937236Z"
kubernetes.io/config.source="api"
kubevirt.io/domain="vm-single-virtio"
kubevirt.io/migrationTransportUnix="true"
post.hook.backup.velero.io/command="[\"/usr/bin/virt-freezer\", \"--unfreeze\", \"--name\", \"vm-single-virtio\", \"--namespace\", \"kubevirttest\"]"
post.hook.backup.velero.io/container="compute"
pre.hook.backup.velero.io/command="[\"/usr/bin/virt-freezer\", \"--freeze\", \"--name\", \"vm-single-virtio\", \"--namespace\", \"kubevirttest\"]"
pre.hook.backup.velero.io/container="compute"
```
The CNI should update /etc/podnetinfo/annotations with something similar in order to update the annotations of the virt-launcher pod.

The NAD can be generic and just used for defining the networks. Below is a NAD definition from the Userspace CNI based on OVS-DPDK by Intel.

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: userspace-ovs-net-1
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "userspace",
    "name": "userspace-ovs-net-1",
    "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
    "logFile": "/var/log/userspace-ovs-net-1-cni.log",
    "logLevel": "debug",
    "host": {
        "engine": "ovs-dpdk",
        "iftype": "vhostuser",
        "netType": "bridge",
        "vhost": {
            "mode": "client"
        },
        "bridge": {
            "bridgeName": "br-dpdk0"
        }
    },
    "container": {
        "engine": "ovs-dpdk",
        "iftype": "vhostuser",
        "netType": "interface",
        "vhost": {
            "mode": "server"
        }
    },
    "ipam": {
        "type": "host-local",
        "subnet": "10.56.217.0/24",
        "rangeStart": "10.56.217.131",
        "rangeEnd": "10.56.217.190",
        "routes": [
            {
                "dst": "0.0.0.0/0"
            }
        ],
        "gateway": "10.56.217.1"
    }
  }'
```

**Review comment (on the `"type": "userspace"` line):**
Why is it called "userspace" and not "vhost-user-net"? Is the idea to support other userspace network interfaces in the future?

**Review comment (Contributor):**
This must have the name of the CNI binary which will run on the host.

AFAIU, the "reference" CNI to use is the Userspace CNI, whose binary is named `userspace`: https://github.com/intel/userspace-cni-network-plugin#work-standalone

**Author (@nvinnakota10):**
@stefanha, yes, the type here points to the CNI binary name. Since the Userspace CNI by Intel is one of the open-source userspace CNIs, I mentioned it in the example.

**Review comment:**
@maiqueb @nvinnakota10
Thanks for sharing the link to the Userspace CNI plugin! It seems the Userspace CNI plugin doesn't define how Pods need to be configured; it's left up to the user to create volumes and choose paths, and I'm not sure it uses annotations and the downward API.

I think a standard interface should be defined by the Userspace CNI plugin, not in KubeVirt. Then other container workloads can benefit from a standard interface that doesn't require coming up with configuration (volumes, paths, etc.) from scratch.

The KubeVirt proposal would just be about implementing the Userspace CNI interface, not defining annotations, etc.

**Review comment:**
As a first step towards this, I suggest splitting this proposal into two parts:

1. The standard Userspace CNI plugin interface that allows any Pod to detect vhost-user-net devices and consume them. This will eventually go into the Userspace CNI plugin project (and other CNI plugins can implement a compatible interface).
2. The KubeVirt-specific part that describes how to implement the interface in virt-launcher, etc.

**Review comment:**
I raised this topic in the video call we had last week, but no one had a suggestion for how the Userspace CNI plugin (and third-party plugins that work along the same lines) could provide a standard interface. I am not familiar with CNI myself so I also don't have a way forward.

In light of this, I think it's okay to keep things as they are for now and consider this review comment resolved.
Binary file added: design-proposals/vhostuser/vhostuser.png