vHostUser interface Design Proposal #218
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hi @nvinnakota10. Thanks for your PR. I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from bee44b2 to af328f6 (compare).
@nvinnakota10 I'm still missing the scheduling part, or is it part of the CNI? My question is: will a VMI requesting this network interface be automatically scheduled on nodes where the CNI plugin is installed, or does the virt-launcher pod require additional information?
/cc
Hi @nvinnakota10,
Would it be possible to refer to the CNI providing this functionality and include how it works for Pods?
> - Creating the VMs will follow the same process with a few changes highlighted as below:
>   1. **Once the VM spec is created, virt-controller will add two new volumes to the virt-launcher pod.**
>      a. **EmptyDir volume named (shared-dir) is used to share the vhostuser socket file with the virt-launcher pod and dpdk dataplane.**
Can you follow up here and explain why this emptyDir is required? Can it be mounted on an arbitrary path?
The purpose of the emptyDir is to act as a socket directory in the Compute container and allow qemu to create a vhostuser socket in the directory /var/lib/kubelet/pods//volumes/kubernetes.io~empty-dir/shared-dir, which will be mounted by the CNI on the respective kubelet. The host path's mount can vary based on where the CNI wants to mount it.
Moreover, to have better transparency with k8s, emptyDir is more advantageous, as the cleanup will be done by k8s, i.e., if we delete the pod, the empty directory will also be deleted.
ref: https://kubernetes.io/docs/concepts/storage/volumes/#emptydir
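To make this concrete, here is a minimal sketch of what such a volume could look like in the virt-launcher pod spec; the volume name shared-dir comes from the proposal, while the image and the in-container mount path are placeholders for this example only:

```yaml
# Hypothetical excerpt of a virt-launcher pod spec (illustration only).
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-example                 # placeholder name
spec:
  containers:
    - name: compute
      image: quay.io/kubevirt/virt-launcher:example   # placeholder image
      volumeMounts:
        - name: shared-dir                    # emptyDir shared with the CNI / dataplane
          mountPath: /var/run/vhostuser       # assumed in-container socket directory
  volumes:
    - name: shared-dir
      emptyDir: {}                            # cleaned up by k8s when the pod is deleted
```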
This isn't always true, see the comments in #218 (comment)
Thanks for your effort in driving / helping move this forward by providing a design doc.
I didn't review it in depth; I hope I can find the time to provide a more comprehensive review.
```yaml
devices:
  interfaces:
    - name: vhost-user-vn-blue
      vhostuser: {}
```
This implies adding yet another binding mechanism.
We've tried not to do that in the past (just a heads up ...).
We are actually trying to decouple this from KubeVirt, thus allowing end users to provide (and use ...) their own bindings without moving the maintenance burden to KubeVirt.
@maiqueb, can you please give a few pointers on how this works? Thanks.
Sure.
- Update the KubeVirt API with this new interface type.
- AFAIU, nothing needs to be done for phase #1 (virt-handler performs some privileged operations on behalf of the launcher). Everything you need should be done by userspace-cni, right?
- The launcher domain manager must generate the interface domXML snippet for your particular interface. More stuff is done here (example for macvtap).
But as I've said in this comment, we're currently investigating how to allow users to extend the network binding options without requiring the core KubeVirt maintainers to maintain said bindings... We've recently discouraged users from posting new binding types ...
FYI.
@EdDev can provide more information; I just shipped around a few half-baked ideas on how this could be done, but there's nothing concrete yet.
@maiqueb, I believe I am not using a binding mechanism for the vhostuser interface, as I am not doing exactly what has been done for interfaces like bridge, masquerade, and PASST.
I am using PR kubevirt/kubevirt#3208 as a reference, and most of the changes are from it.
I have built kubevirt with the changes from the mentioned PR and made the necessary changes to make it work with the latest kubevirt, for releases 0.58, 0.59 and 0.48. I was able to bring up the interface without using the binding mechanism.
Please let me know if you think this is still not clear.
Hm, you may be right; it seems we also don't need to generate the interface domXML snippet (on the launcher pod):
https://github.com/kubevirt/kubevirt/pull/3208/files#diff-030aa8b7ff9bf4a511d1d5641c1dec04205a944137672be3a691f0b1b4068753R93
While pod networking configuration is not needed by virt-handler, the domain configuration is surely needed [1].
I am thinking that we can generalize these types of bindings, i.e. no-op bindings which do not touch the pod networking.
But we are still left with the logic of how to define the domain spec (libvirt domain config), including NUMA, hugepages, etc.
@EdDev, @maiqueb, sorry for the confusion. I am actually using the interface binding method. As I mentioned, I made a few changes and forgot I was using the interface binding method for the vhostuser interface.
kubevirt/kubevirt@release-0.59...Juniper:kubevirt:release-0.59
The above comparison shows the exact changes we are currently using. I have rechecked my changes, as I remember this binding method for the interface was added some time ago.
What would be your suggestion in this case?
Can you please confirm whether we can proceed with the interface binding method, or whether there is work in progress for the same, as mentioned in the comment:
#218 (comment)
Thanks
Force-pushed from a2d9889 to dc88895 (compare).
The scheduling part is done by k8s. In the VM spec, the user can specify custom labels and use them as node selectors for scheduling. Moreover, the user can specify the required huge pages and memory requests. This allows the user to label the nodes that can support the feature and lets k8s schedule accordingly based on the requirements.
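As an illustration of what that could look like on the VMI side, a minimal hypothetical snippet follows; the label key/value, page size, and memory request are placeholders rather than anything defined by the proposal:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vhostuser-vmi-example          # placeholder name
spec:
  nodeSelector:
    example.com/dpdk: "true"           # hypothetical label applied to DPDK-capable nodes
  domain:
    devices: {}                        # interfaces omitted for brevity
    memory:
      hugepages:
        pageSize: 1Gi                  # example huge page size
    resources:
      requests:
        memory: 4Gi                    # example memory request
```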
Sorry for the late feedback on this.
There is a lot to digest here and it will probably take me some time to learn all the details, sorry in advance.
As @maiqueb mentioned earlier, we are trying to find an elegant way to decouple the different network interface types as much as possible, to ease and distribute the maintenance effort.
This type seems not to require the first step, which configures the pod networking, only the second step, where there is a need to define the domain configuration to consume the endpoint (in this case a socket).
In the meantime, I think the discussion on this proposal can continue as usual.
> - Creating the VMs will follow the same process with a few changes highlighted as below:
>   1. **Once the VM spec is created, virt-controller will add two new volumes to the virt-launcher pod.**
>      a. **EmptyDir volume named (shared-dir) "/var/lib/cni/usrcni/" is used to share the vhostuser socket file with the virt-launcher pod and dpdk dataplane. This socket file acts as a UNIX socket between the VM interface and the dataplane running in the UserSpace.**
>      b. **DownwardAPI volume named (podinfo) is used to share the vhostuser socket file name with the virt-launcher via pod annotations.**
If the EmptyDir mounts the actual socket to a specific path, why do we need the DownwardAPI?
What information needs to arrive at the virt-launcher filesystem, and why can it not be passed through a field in the VMI (e.g. in the interface details)?
@EdDev, since both the kubemanager and the CNI need to use the same name for the socket file, I am using the downwardAPI to achieve this. This is just to make sure the pod creates a VM with a socket with the right name.
@EdDev, also the CNI/KubeManager will be creating the socket files while creating the virtio interfaces. So, to have consistency, we expect that the CNI uses a socket name representing poduid+intfname+id. This info will not be available from the VM YAML spec.
DownwardAPI will help us maintain consistency, as the CNI creates the sockets in a custom path, which will be shared with the launcher pod. This socket name will be passed to qemu.
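For illustration, a minimal sketch of how the annotations could be exposed to the launcher through a downwardAPI volume; the volume name podinfo comes from the proposal, while the mount path and file name are assumptions for this example:

```yaml
# Hypothetical excerpt of the virt-launcher pod spec (illustration only).
volumes:
  - name: podinfo
    downwardAPI:
      items:
        - path: annotations              # file exposing the pod annotations written by the CNI
          fieldRef:
            fieldPath: metadata.annotations
containers:
  - name: compute
    volumeMounts:
      - name: podinfo
        mountPath: /etc/podinfo          # assumed path read by virt-launcher
```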
> @EdDev, also the CNI/KubeManager will be creating the socket files while creating the virtio interfaces. So, to have consistency, we expect that the CNI uses a socket name representing poduid+intfname+id. This info will not be available from the VM YAML spec.
> DownwardAPI will help us maintain consistency, as the CNI creates the sockets in a custom path, which will be shared with the launcher pod. This socket name will be passed to qemu.
@nvinnakota10, if the socket name is built after `poduid+intfname+id`, live migration of the VM will fail, because a new pod will be created where the running VM will be transferred to, and qemu will be started with the same parameters, including the path/socketname.
But obviously `poduid` has changed and the CNI created a new socket named after `newpoduid+intfname+id`, and also uses this name to configure the vswitch port. qemu will then hang waiting for the socket `poduid+intfname+id` to be opened by the client (vswitch).
We did experience this issue with userspace CNI, and had to explicitly specify the socket name in the userspace CNI config in order to have a successful VM live migration:
"cniVersion": "0.3.1",
"type": "userspace",
"name": "userspace-vpp-250-1",
"KubeConfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
"SharedDir": "/var/lib/cni/usrcni/",
"logFile": "/var/log/userspace-vpp-250-1.log",
"logLevel": "debug",
"host": {
"engine": "vpp",
"iftype": "vhostuser",
"netType": "bridge",
"vhost": {
"mode": "client",
"socketfile": "net1"
},
"bridge": {
"bridgeName": "250"
}
}
The socket name `poduid+intfname+id` is the default one in userspace CNI. Maybe it is worth changing this default naming in the CNI (standard interface?), or changing the qemu parameters regarding the vhost-user-net socket name when live migrating a VM?
@TReZ42, yes, we will have to look at the migration part, as the new pod created might still have the same socket name, but the downwardAPI might create a new socket config based on the podUID. This might cause the VM to fail waiting for the socket.
I will check the case with multiple interfaces as well. I am not sure if the interface supports live migration. Even though it is a secondary interface, it can cause live migration to fail if not handled appropriately, as the socket names might change when the PodUID changes, and the VM needs to have the appropriate socket name.
@TReZ42, one possible way is hashing the combination of name+namespace of the created VM. But for this, can you please confirm whether we can live migrate VMs created using the VirtualMachineInstanceReplicaSet kind? How do we handle the names of the VMs when the VirtualMachineInstanceReplicaSet kind is used to create them?
@nvinnakota10, VMI names are created after the template name metadata if it exists, or the VirtualMachineInstanceReplicaSet name if not, appended with an id: `testvmi5wfxp`, `testvmibjkpz`.
We can also think about VirtualMachinePool: the VM and VMI names are created after the VirtualMachinePool name, appended with -0, -1, etc.: `test-vmp-0`, `test-vmp-1`.
We can live migrate each of the VMIs created by a VirtualMachineInstanceReplicaSet or a VirtualMachinePool, but not directly the VirtualMachineInstanceReplicaSet or the VirtualMachinePool itself.
The VMI name does not change after live migration, so using a hash of `vmi+namespace+intfname` as the socket name could be OK. But userspace CNI must also be adapted to use socket names according to this naming convention...
Don't you think it is possible to let the CNI define the socket name, and for live migration patch the domain of the target VMI with the new socket name?
@TReZ42, yes. This part needs to be handled by the CNI. For the changes I am planning to upstream for the vHostUser interface, KubeVirt needs no information on how the socket name is derived. But the CNI needs to make sure that the socket name stays consistent if live migration is needed.
Also, if we create VMs using replica sets, the hash appended at the end of the VMI name is immutable. So, even if we live migrate these VMs, they still have the same name in the VMI object, which shouldn't affect socket name consistency.
I was asking the question to make sure the approach makes sense, or whether I need to look at other possibilities for having the CNI maintain consistent names.
@nvinnakota10, if I understand correctly, you want the CNI to derive the socket name from the VM name (VMI, VMIRS, VMPool)? But is this info available to the CNI in the Downward API? Perhaps in annotations?
Is it not possible to change the target domain with the updated socket name, generated by the CNI as usual, the way you are doing in `createDomainInterfaces` for the vhostuser case?
Yes, I am using createDomainInterfaces. As soon as the CNI creates the socket name and interface name, it will store the annotations in the path mounted for the DownwardAPI in the launcher pod.
createDomainInterfaces will retrieve this information to create the socket using qemu args.
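Purely to illustrate the data flow (this is not the schema of any real CNI), the annotations seen through the downwardAPI mount might carry something along these lines; the annotation key and the fields are hypothetical placeholders:

```yaml
# Hypothetical pod annotations as projected into the downwardAPI file (illustration only).
metadata:
  annotations:
    example.cni/vhostuser-config: |      # placeholder annotation key
      [
        {
          "name": "vhost-user-vn-blue",
          "socketfile": "net1",
          "mode": "client"
        }
      ]
```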
Please use the term vhost-user-net to refer to the network device, since there are other vhost-user devices too (blk, fs, gpio, etc). KubeVirt may support other vhost-user devices in the future, so it helps to be clear about what is specific to vhost-user-net and what is generic vhost-user infrastructure that other devices will also use. Edit: To clarify what I mean, please change the title of this proposal to vhost-user-net so it's clear this vhost-user integration is specific to network interfaces.
> # Overview
> - User Space networking is a feature that involves the packet path from the VM to the data plane running in userspace, like OVS-DPDK etc., bypassing the kernel.
Hi, any idea how to make the vhost-user interface available for block devices and qemu-storage-daemon as well?
@alicefr and @albertofaria have investigated vhost-user-blk devices in the past separately from this PR. Perhaps they can share a proposal or the current state of the discussion with you.
Hey, @nvinnakota10 and all. I raised this also on the mailing list, but decided to duplicate it here. I am curious about the status of this PR. Any chance we can move this forward? What are the current blockers? Ping @maiqueb, @EdDev, @xpivarc, @alicefr.
@vasiliy-ul I might be wrong, but I think this is blocked by the fact that we would like to separate the network plugin externally, and we are missing the infrastructure for this. UPDATE: this might already be available from this PR: kubevirt/kubevirt#10284
@alicefr, thanks for the link 👍 I was not aware of that PR. I will take a closer look. I've seen this concern raised here in this PR, but from the conversations it is not clear if it's a blocker or not: #218 (comment) and #218 (review). So, does that mean that from now on this new API is the way to go if we need to introduce network bindings?
This was at least my understanding; @EdDev, could you please confirm this?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
The networking bindings were already refactored, and it is now possible for community-provided bindings to be provided out of (KubeVirt) tree. Should this proposal be revisited under these new terms?
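For reference, a rough sketch of how a vhostuser binding could be declared with the network binding plugin API from kubevirt/kubevirt#10284; the plugin name, sidecar image, and network names below are placeholders, not something this proposal defines:

```yaml
# Hypothetical registration of a vhostuser binding plugin in the KubeVirt CR.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    network:
      binding:
        vhostuser:
          sidecarImage: registry.example.com/vhostuser-binding:latest   # placeholder image
          networkAttachmentDefinition: default/userspace-net            # placeholder NAD
---
# Hypothetical VMI interface referencing the plugin by name.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vhostuser-example
spec:
  domain:
    devices:
      interfaces:
        - name: vhost-user-vn-blue
          binding:
            name: vhostuser
  networks:
    - name: vhost-user-vn-blue
      multus:
        networkName: userspace-net      # placeholder Multus network
```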
Which prompts an update from my side: we're about to merge vhost-user support in passt. I guess it's going to be transparent for KubeVirt, that is, libvirt will eventually pass different options to both passt (…). So, that support doesn't aim in any way at replacing this proposal, and probably it doesn't integrate with it at all, but please let me know if opinions differ.
Well, while it doesn't intend to replace this proposal, it surely brings forward a "cheaper" alternative. It would be interesting to see an "alternatives" section in this proposal focusing on that, @sbrivio-rh. Thanks for raising it.
@fabiand I would, at least, wait until we see the other proposal. Otherwise, @nvinnakota10, are you still working on it?
Fair point.
When are we expecting the new proposal?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
Hi all, we are still working on a proper implementation of the vhostuser network binding plugin and all the pieces around it.
Benoit. /remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
This is a design proposal for the vhostuser interface to support the UserSpace networking feature in KubeVirt VMs.
The document provides information on the proposed feature/interface, why it is needed, and how it can enable fast packet processing when DPDK is available.