
design-proposal: improved handling of the VM FW ID #347

Open · wants to merge 1 commit into base: main
Conversation

@dasionov dasionov commented Nov 6, 2024

What this PR does / why we need it:

This proposal introduces a mechanism to persist the FW ID of a Virtual Machine (VM) in KubeVirt. By storing the FW ID, we ensure that it remains consistent across VMI restarts, which is
crucial for applications and services that rely on the UUID for identification or licensing purposes.

relates-to: kubevirt/kubevirt#13156, kubevirt/kubevirt#13158

Special notes for your reviewer:

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

None

@kubevirt-bot kubevirt-bot added the dco-signoff: yes label (indicates the PR's author has DCO-signed all their commits) on Nov 6, 2024
@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch 2 times, most recently from 4257d15 to b7735b0 on November 6, 2024 14:20
Member

@dankenigsberg dankenigsberg left a comment

Thanks @dasionov for stepping up to create a design proposal. It would be wonderful if you could get this old issue sorted.

However, I think that you should discuss possible alternatives for a solution before deciding on one implementation (persisting the UUID in status).

Contributor

@iholder101 iholder101 left a comment

Thanks a lot @dasionov for picking this issue up! It's indeed important to solve and keeps coming back!

I very much agree with most of what @dankenigsberg commented above, especially regarding the need to provide a few alternative solutions. This way we can discuss which would be the best way moving forward.

Let me try to help you with that. Here are a few alternatives that I have in mind:

Keep the UUID in the VM's spec even if the user did not request it
Partially implemented here: kubevirt/kubevirt#13158.

Main idea: The UUID would be generated, as of now, for every VM once it boots. After generation, the VM's spec would be updated with the generated UUID.
Pros: Easy to implement, avoid adding new API fields, fully backward compatible.
Cons: Abusing the spec. The spec should only contain the user's desired state. In most situations, the VM owner does not care what the UUID is, hence it is not part of the desired state from the user's POV. This will make the VM's definition longer, harder to read, and harder to reason about.
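As a rough sketch of this alternative (assuming the existing spec.template.spec.firmware.uuid field and a hypothetical VM name), the manifest after the controller's write-back could look like:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-vm        # hypothetical VM name
spec:
  template:
    spec:
      firmware:
        # generated at first boot, then written back into the spec by the controller
        uuid: "123e4567-e89b-12d3-a456-426614174000"
```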

Generate a default random UUID via a webhook and set it to the spec
Partially implemented here: kubevirt/kubevirt#12085.

Main idea: If the user did not provide a desired UUID, a webhook will generate a random UUID and will set it to the VM's spec.
Pros: Same as the previous suggestion.
Cons: Same argument as above regarding abusing the spec. In addition, using webhooks for supplying defaults is a bad practice for many reasons; for example, it might hurt performance and scalability if a lot of VMs are started at the same time, since virt-api might become a bottleneck.

Create a new firmware UUID field under status

Main idea: Just as today, UUID would be generated during the VM's first boot. After it is generated, it would be saved to a dedicated firmware UUID status field.
Pros: No need to abuse spec and we can keep it clean.
Cons: Adding a new API field which is generally not valuable outside the scope of this issue. New API fields need to be added with extreme caution, as they are very hard to remove, might increase load and hurt performance, and make the API less readable and compact. (We can consider minimizing this con by setting the UUID in a VM condition as suggested here).
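For illustration, a hedged sketch of what such a status field might look like (the field name is assumed, matching the examples used elsewhere in this proposal):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
status:
  # populated once by the controller after the first boot, never modified afterwards
  firmwareUUID: "123e4567-e89b-12d3-a456-426614174000"
```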

Introduce a breaking change

Main idea: Just as today, the UUID would be generated during the VM's first boot, but it would be generated based on both name and namespace, which is a breaking change. Before doing so, we would warn users for a few versions (via logs/alerts/etc.) and possibly provide tooling to add the current firmware UUID to the VM's spec in order to protect the workload.
Pros: No need to abuse spec and we can keep it clean, very easy to implement.
Cons: Demands user intervention in order to avoid breaking current workloads.

Obviously, feel free to add more alternatives that you can think of, or express your opinion on the above alternatives.

Contributor

@Acedus Acedus left a comment

Thanks! This looks like a solid foundation. One concern I have is around VM snapshots—I think it's worth mentioning that even if the firmware UUID of an existing VM is set to persist, it won’t be retained during a restore. This could potentially cause functionality issues.

- Keeps the spec clean.
- Focuses on VM status for generated information.

**Cons:**
Contributor

A drawback here in my view is that an object's status is often dynamic and therefore somewhat fragile. This makes sense, as controllers manipulate the status field to represent the object's current state, while they rarely—if ever—modify the object's spec.

Contributor

A drawback here in my view is that an object's status is often dynamic and therefore somewhat fragile. This makes sense, as controllers manipulate the status field to represent the object's current state

It is a matter of how we implement the field and the controllers that would interact with it.
As I see it, this status field will only ever be assigned once and never modified again by any controller. In that scenario I don't think it would be highly dynamic or fragile.

while they rarely—if ever—modify the object's spec.

And for a good reason :)
Ideally, only a human should ever modify a VM's spec.

Member

Multiple k8s objects rely on their status fields to function. It's not that fragile and is not editable for users.

Member

With this option, IMO, the flow simplifies: on VM start, the VM controller stores either the user-specified spec.template.spec.firmware.uuid or vm.metadata.uid.
No webhooks are required.
We can also trigger events for running VMs that lack a firmware UUID in the status, prompting a restart to get a stable UUID across shutdowns, or just raise the RestartRequired condition.
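A sketch of the flow described above, with field names assumed and an example uid value: the controller copies the user-specified firmware uuid if present, otherwise falls back to the VM object's own metadata.uid:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  uid: "9f86d081-8292-4a4b-8d3c-1f0a2b3c4d5e"   # hypothetical example value
spec:
  template:
    spec:
      firmware: {}    # user did not specify a uuid
status:
  # copied from metadata.uid by the VM controller on start
  firmwareUUID: "9f86d081-8292-4a4b-8d3c-1f0a2b3c4d5e"
```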

Contributor

With this option, IMO, the flow simplifies: on VM start, the VM controller stores either the user-specified spec.template.spec.firmware.uuid or vm.metadata.uid.
No webhooks are required.
We can also trigger events for running VMs that lack a firmware UUID in the status, prompting a restart to get a stable UUID across shutdowns, or just raise the RestartRequired condition.

Just to ensure that I understood correctly. In this approach, existing VMs would not persist the firmware UUID specified in their VMI counterparts but rather set the status firmware UUID to (for example) vm.metadata.uid and raise the RestartRequired condition? This acknowledges that disruption (i.e. the firmware UUID changes post-restart, affecting anything that relies on it) is imminent, correct?

Contributor

This acknowledges that disruption (i.e. firmware UUID changes post-restart to affect anything that relies on it) is imminent, correct?

This disruption has to be prevented. Otherwise it will be a breaking change.
In order to prevent it I believe we'd have to apply a multi-phased approach:

  1. Continue with the current firmware UUID calculation, but store the UUID in status.
  2. As @vladikr suggested, use alerts / VM conditions in order to notify the users to restart VMs that do not have firmware UUID in their status.
  3. After a while, let the community know (i.e. via a mail to mailing list) that in the next version firmware UUID must reside in status in order to prevent breaking workloads.
  4. Switch to using vm.metadata.uid as the new firmware UUID (unless a UUID is already provided in the VM's status, therefore keeping backward compatibility).

Contributor

@Acedus Acedus Nov 19, 2024

As @vladikr suggested, use alerts / VM conditions in order to notify the users to restart VMs that do not have firmware UUID in their status.

So at least a restart is required to derive the firmware UUID from the VMI (including shutdown VMs) so that it would persist in status, correct?

Switch to using vm.metadata.uid as the new firmware UUID (unless a UUID is already provided in the VM's status, therefore keeping backward compatibility).

Assuming a scenario where n existing VMs in cluster(s) suffer from the issue of overlapping firmware UUIDs, what is the recommended approach for users to rectify the issue post-upgrade? Recreate the VMs? Change the spec.firmware.uuid to what appears in vm.metadata.uid and restart? Is user action mandatory to resolve the issue?

Contributor

So at least a restart is required to derive the firmware UUID from the VMI (including shutdown VMs) so that it would persist in status, correct?

Now that I'm re-thinking about it, a restart might not be necessary, as virt-controller could simply copy spec.firmware.uuid to the status field if it exists, and if not, recalculate the hash and assign it to status.

Assuming a scenario where n existing VMs in cluster(s) suffer from the issue of overlapping firmware UUIDs, what is the recommended approach for users to rectify the issue post-upgrade? Recreate the VMs? Change the spec.firmware.uuid to what appears in vm.metadata.uid and restart? Is user action mandatory to resolve the issue?

TBH I'm not sure that we can redeem this situation. If two VMs are already running with the same firmware UUID, I think one has to be re-created. In any case, I think it shouldn't be in the scope of this proposal which only aims to prevent this situation from happening in the future.

Contributor

TBH I'm not sure that we can redeem this situation. If two VMs are already running with the same firmware UUID, I think one has to be re-created. In any case, I think it shouldn't be in the scope of this proposal which only aims to prevent this situation from happening in the future.

SGTM, thanks.

@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 15c34f8 to c35b4a4 on November 7, 2024 13:45
@iholder101
Contributor

One concern I have is around VM snapshots—I think it's worth mentioning that even if the firmware UUID of an existing VM is set to persist, it won’t be retained during a restore.

@Acedus would you mind elaborating on this one? why won't it be retained during a restore?


Acedus commented Nov 7, 2024

One concern I have is around VM snapshots—I think it's worth mentioning that even if the firmware UUID of an existing VM is set to persist, it won’t be retained during a restore.

@Acedus would you mind elaborating on this one? why won't it be retained during a restore?

I'm referring to VM backups taken before the introduction of the changes described in this DP. Correct me if I'm wrong, but the VM snapshot content won't be updated as part of this change (e.g., whether through spec or status), and if it isn't, once the new UUID generation mechanism is introduced and a restore is attempted the UUID will change.

@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch 2 times, most recently from 3751565 to 7201c75 on November 7, 2024 14:36
firmwareUUID: "123e4567-e89b-12d3-a456-426614174000"
```

or persist via status condition
Member

I'd argue against a condition. Conditions are generally needed to manage the lifecycle or to be consumed by other controllers.

Member

conditions are not for persistence.

@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 7201c75 to 1279cbf on November 10, 2024 12:25
**Pros:**
- The upgrade-specific code can be safely removed after two or three releases.
- Simple logic post-upgrade, with a straightforward UUID assignment.
- Limited disturbance to GitOps workflows, affecting only pre-existing VMs with the current method.
Member

This is only true if gitops is setting metadata.uid, correct?


@fabiand @iholder101 What if the VM controller were to dynamically add this annotation during VM creation if vm.spec.template.spec.Firmware.UUID is not defined?

For example, when a VM is created (via GitOps or using oc/kubectl), the controller would set the following annotation:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    kubevirt.io/host-uuid: "<host-uuid>"
```

Does that align with your understanding? Thank you!

Member

@jangel97 what is the benefit of storing this in a new annotation, instead of an already-existing API?


I believe leveraging existing solutions would be the most effective approach. However, we should also consider scenarios where a system has a host UUID, and this breaking change could potentially alter it. In that case, if I understand correctly, wouldn't it be necessary to retain the "old" host UUID somewhere?

- May potentially bottleneck `virt-api`.

3. **Add a Firmware UUID Field to VM Status**
**Description:** Store the generated UUID within the VM’s status.
Member

Status is only live; it must not be used to store information.

That is, any information in the status is only relevant and "persisted" on the runtime platform.
Objects must not be created with a populated status (only an empty status is what we want: status: {}).

Contributor

Objects must not be created with a populated status (only an empty status is what we want: status: {})

A VM object won't be created with a populated status, but a VM that was created, booted, then shut off would have a non-empty status.

Am I missing anything?

Contributor

ping @fabiand :)
just want to understand if I get you correctly.

type: Ready
created: true
runStrategy: Once
firmwareUUID: "123e4567-e89b-12d3-a456-426614174000"
Member

.status is only for reporting, not for persistence.

Contributor

Can you elaborate?
How is spec more persistent than status? Aren't they both persisted in etcd?

Member

.status is only for reporting, not for persistence.

@fabiand I don't know where this is coming from. Numerous objects across k8s use the status exactly for that purpose, including objects in KubeVirt.

Member

The VM object, until it is deleted, would have a status; it simply stores the current state of the VM, while the spec holds the desired state.

Member

Conceptually, it is not what status is meant to show.

Status is meant to represent the current state. However, the FW UUID cannot be "read" from the current state of the VM, because it is stopped.
There is a name for persisting such information in the status: caching.
Status is not about caching; it is about representing the actual state. But here the suggestion is to store the actual state so it becomes the input for a future actual state (which flows in the opposite direction).

We actually had this implemented in the network interfaces status and it caused a mess: To calculate the new status on a reconcile cycle, the previous data was considered, potentially dragging incorrect state over cycles indefinitely. We have since dropped this behavior, calculating the interfaces status without depending on the previous state.
But this is still tricky business even today, because multiple components attempt to touch the same list.

Bottom line, while this could be a good solution, caching the data on the status can potentially bite back. There is logic that uses this status information to set the UUID on the next VM instantiation and an opposite logic that reads the VM instantiation spec (or status?) to reflect that actual back to the same status field. Sounds like a loop to me.

Member

Even if the VM is stopped, the status shows the current state, as this UUID is stored on the VM boot disk.
As long as the VM exists this is its current state, regardless of its phase.



@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 1279cbf to b79c2d0 on November 14, 2024 10:34

### 4. Introduce a Breaking Change to Ensure Universal Uniqueness

**Description:** Modify UUID generation to use a combination of VM name and namespace, ensuring unique UUIDs within the cluster.
Member

what about universal uniqueness? How would it be ensured?

Contributor

what about universal uniqueness? How would it be ensured?

Using both name + namespace as a hash will ensure uniqueness AFAICT

Member

Not universal uniqueness: a different VM on a different cluster would obtain the same UUID if it has the same name and namespace.

Contributor

@iholder101 iholder101 Nov 14, 2024

Ah, got you.
We can always add something "random" to the hash, like creation timestamp. WDYT?

Author

Would it be possible to clarify the impact of VMs in different clusters having the same UUID if they share the same name and namespace?
Are there specific cases where cross-cluster uniqueness is essential for the UUID to function correctly?

Contributor

I guess that @dankenigsberg is referring to migrating a VM from a different cluster, e.g. via importing/exporting disks.

Member

Are there specific cases where cross-cluster uniqueness is essential for the UUID to function correctly?

By definition, UUID has to be universally-unique. A lot of services would break if two different VMs present matching identifiers.


Author

@dasionov dasionov Nov 19, 2024

you mean the vm.metadata.uid?

2. **Existing VMs**:
- For VMs created before the upgrade, the VM controller will patch the `vm.spec.template.spec.firmware.uuid` field,
with a value calculated using the legacy logic (based on the VM name).
This ensures backward compatibility.
Member

What if we have an annotation that stores the legacy uuid and then patch the old VMs with the new uuid on upgrade? What are the blockers in terms of backward-compatibility?

Contributor

@dasionov After the upgrade, how would you know if a certain VM is from older versions or from current version?

Contributor

@acardace Good point, I think it will work only for running VMs. For those that are off, we probably won't know whether to apply the legacy calculation.

Author

@dasionov dasionov Feb 3, 2025

After the upgrade, how would you know if a certain VM is from older versions or from current version?

Every VM that does not have a persisted UUID in the spec should be considered old, as new VMs will receive a random UUID from the mutator webhook. Therefore, it does not matter whether the VMs were running or stopped before the upgrade.

What if we have an annotation that stores the legacy uuid and then patch the old VMs with the new uuid on upgrade

I have a vague memory that @EdDev was a little bit against an annotation-based solution for this specific bug.

@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 392bf06 to e21304d on February 3, 2025 09:30
Member

@dankenigsberg dankenigsberg left a comment

I like the current version of the proposal, with its simple logic and coverage for restoring old snapshots which do not have a uuid stored in spec or annotation.

@kubevirt-bot kubevirt-bot added the lgtm label (indicates that a PR is ready to be merged) on Feb 4, 2025

vladikr commented Feb 4, 2025

I like the current version of the proposal, with its simple logic and coverage for restoring old snapshots which do not have a uuid stored in spec or annotation.

I am uncomfortable with the presented approach for updating the existing VMs, for several reasons.

  1. Patching the VM spec post-admission is a bad practice that can lead to races if other controllers are involved. It is counterintuitive and non-gitops-friendly. It also creates a precedent.
  2. We will have to keep the bad code in the controller.
  3. It doesn't solve the problem for currently shutdown VMs.

Instead of this approach, I would rather provide a one-time job for admins to update the existing VM specs with a UUID generated based on the existing logic (if the UUID is missing from the spec).
I would also make sure that a RestartRequired condition is not set if the VMI has an identical UUID.


enp0s3 commented Feb 4, 2025

@vladikr Hi, IIUC the one-time job solution addresses the goal of making the UUID persistent, is this correct?
We also have the other goals:

## Goals
* Improve the uniqueness of the VMI firmware UUID to become independent of the VM name and namespace.
* Maintain UUID persistence with VM backup/restore.
* Maintain backward compatibility

Which goals would we like to achieve in this proposal?


dasionov commented Feb 4, 2025

I am uncomfortable with the presented approach for updating the existing VMs, for several reasons.

  1. Patching the VM spec post-admission is a bad practice that can lead to races if other controllers are involved. It is counterintuitive and non-gitops-friendly. It also creates a precedent.
  2. We will have to keep the bad code in the controller.
  3. It doesn't solve the problem for currently shutdown VMs.

By keeping the existing logic in the controller and ensuring that the UUID is persisted in the VM spec for 2-3 release cycles, we provide VM owners with sufficient time to start any shutdown machines and have their UUID recorded.

So your concern is valid only for clients who shut down their VMs and then upgraded beyond three versions (at which point we eventually remove the legacy logic). However, I believe we can reasonably define three years as the maximum support window for a VM in this context.

Instead of this approach, I would rather provide a one-time job for admins to update the existing VM specs with UUID generated based on the existing logic (if the UUID is missing from the spec) I would also make sure that a RestartRequired condition is not set if the VMI has an identical UUID.

I’m not opposed to your approach of using a one-time job to update existing VM specs with a UUID if it's missing. However, I am concerned that older backups, which lack a UUID, may fail when someone attempts to restore them. How do you propose handling that scenario?


vladikr commented Feb 4, 2025

I am uncomfortable with the presented approach for updating the existing VMs, for several reasons.

  1. Patching the VM spec post-admission is a bad practice that can lead to races if other controllers are involved. It is counterintuitive and non-gitops-friendly. It also creates a precedent.
  2. We will have to keep the bad code in the controller.
  3. It doesn't solve the problem for currently shutdown VMs.

By keeping the existing logic in the controller and ensuring that the UUID is persisted in the VM spec for 2-3 release cycles, we provide VM owners with sufficient time to start any shutdown machines and have their UUID recorded.

So your concern is valid only for clients who shut down their VMs and then upgraded beyond three versions (at which point we eventually remove the legacy logic). However, I believe we can reasonably define three years as the maximum support window for a VM in this context.

Different vendors can have different support windows. I think a job can allow users the flexibility they may need to support older versions.

Instead of this approach, I would rather provide a one-time job for admins to update the existing VM specs with UUID generated based on the existing logic (if the UUID is missing from the spec) I would also make sure that a RestartRequired condition is not set if the VMI has an identical UUID.

I’m not opposed to your approach of using a one-time job to update existing VM specs with a UUID if it's missing. However, I am concerned that older backups, which lack a UUID, may fail when someone attempts to restore them. How do you propose handling that scenario?

I think some manual intervention would be required, but also in the case when we remove the code from the controller.
There would need to be documentation describing this case. I think the flow could be as simple as restoring, removing the generated UUID, and running the job.


vladikr commented Feb 4, 2025

@vladikr Hi, IIUC the one-time job solution answers to the goal of making the UUID persistent, is this correct? We have also the other goals:

## Goals
* Improve the uniqueness of the VMI firmware UUID to become independent of the VM name and namespace.

That would still be handled by the mutating webhook.

  • Maintain UUID persistence with VM backup/restore.

The webhook would set the UUID on the VM spec.

  • Maintain backward compatibility

The change is still compatible + the job.


Which goals would we like to achieve in this proposal?

I think we should be able to target all of them.

@dankenigsberg
Member

I am uncomfortable with the presented approach for updating the existing VMs, for several reasons.

  1. Patching the VM spec post-admission is a bad practice that can lead to races if other controllers are involved. It is counterintuitive and non-gitops-friendly. It also creates a precedent.
  2. We will have to keep the bad code in the controller.
  3. It doesn't solve the problem for currently shutdown VMs.

By keeping the existing logic in the controller and ensuring that the UUID is persisted in the VM spec for 2-3 release cycles, we provide VM owners with sufficient time to start any shut-down machines and have their UUID recorded.

So your concern is valid only if we consider clients who have shut down their VMs and then upgraded beyond three versions (at which point we eventually remove the legacy logic). However, I believe we can reasonably define three years as the maximum support window for a VM in this context.

I don't understand the discussion about shut-down VMs. The proposed solution makes no differentiation between running and non-running VMs. All get their buggy UUID persisted.

Instead of this approach, I would rather provide a one-time job for admins to update the existing VM specs with a UUID generated based on the existing logic (if the UUID is missing from the spec). I would also make sure that a RestartRequired condition is not set if the VMI has an identical UUID.

I’m not opposed to your approach of using a one-time job to update existing VM specs with a UUID if it's missing. However, I am concerned that older backups, which lack a UUID, may fail when someone attempts to restore them. How do you propose handling that scenario?

@dasionov are we discussing here 3. Upgrade-Specific Persistence of Firmware UUID or a yet unlisted proposal?

@dasionov
Author

dasionov commented Feb 5, 2025

I don't understand the discussion about shut-down VMs. The proposed solution makes no differentiation between running and non-running VMs. All get their buggy UUID persisted.

If I understand correctly, what @vladikr is trying to say is that shut-down VMs won't be processed by the controller and won't get their UUID persisted until they are started. So technically a client can keep a VM shut down for more than 3 releases, by which time we will have removed the bad legacy logic, and the shut-down VM will then be started with no UUID at all.

@dasionov are we discussing here 3. Upgrade-Specific Persistence of Firmware UUID or a yet unlisted proposal?

It is an unlisted approach which is similar, but more flexible, I guess: creating an image and a job for admins to use once, whenever they want to fill in all the missing UUIDs for all the VMs in the cluster.

@dasionov
Author

dasionov commented Feb 5, 2025

  • Maintain UUID persistence with VM backup/restore.

The webhook would set the UUID on the VM spec.

Can you explain how? I thought you wanted to remove the legacy calculation completely. If the mutator assigns a new UUID, won't restores and backups break?

@dankenigsberg
Member

shut-down VMs won't be processed by the controller

@dasionov that is not clear from the text of your proposal.

For VMs created before the upgrade, the VM controller will patch the vm.spec.template.spec.firmware.uuid field,
with a value calculated using the legacy logic (based on the VM name).

I expect this to apply to all VMs, running or stopped.
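For readers following along, the legacy derivation being debated can be sketched as a name-based (version 5) UUID per RFC 4122. This is a stdlib-only sketch: the all-zero namespace constant is a placeholder rather than KubeVirt's actual fixed value, and `uuidV5`/`format` are hypothetical helper names.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// uuidV5 derives a deterministic, name-based (version 5) UUID from a
// namespace UUID and a name, per RFC 4122: SHA-1 over namespace+name,
// truncated to 16 bytes with the version/variant bits patched in.
func uuidV5(ns [16]byte, name string) [16]byte {
	h := sha1.New()
	h.Write(ns[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)
	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return u
}

// format renders the 16 bytes in the canonical 8-4-4-4-12 form.
func format(u [16]byte) string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	var ns [16]byte // placeholder namespace; KubeVirt uses its own fixed constant
	fmt.Println(format(uuidV5(ns, "my-vm")))
	// deterministic: the same name always yields the same UUID
	fmt.Println(format(uuidV5(ns, "my-vm")) == format(uuidV5(ns, "my-vm"))) // true
}
```

Because the result is a pure function of the VM name, recreating or restoring a VM with the same name reproduces the same UUID, while two identically named VMs in different namespaces collide; that coupling is what the proposal aims to remove.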

@dasionov
Author

dasionov commented Feb 5, 2025

I expect this to apply to all VMs, running or stopped.

@dankenigsberg @vladikr
After a deeper analysis—and with input from @xpivarc—I verified that our resync mechanism indeed re-enqueues every VM, regardless of whether it's stopped or running. This means that even VMs in a stopped state will eventually be processed by the controller, allowing their firmware UUID to be persisted if it's missing. - ref

@vladikr
Member

vladikr commented Feb 5, 2025

I expect this to apply to all VMs, running or stopped.

@dankenigsberg @vladikr After a deeper analysis—and with input from @xpivarc—I verified that our resync mechanism indeed re-enqueues every VM, regardless of whether it's stopped or running. This means that even VMs in a stopped state will eventually be processed by the controller, allowing their firmware UUID to be persisted if it's missing. - ref

Yes, this means you will relocate the current code, which is only triggered at VMI start, into the general sync loop.
This would also mean that if a user, for whatever reason, removes the UUID from the spec, the controller will add it back—something controversial, to say the least.
And then, after three releases, this behavior will change again...

From my point of view, having an external job that is controlled by an admin, executed at the desired time is better, cleaner, and less surprising.
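A minimal model of the one-time job's core decision might look like the following. The `vm` type and `legacyUUID` helper are hypothetical stand-ins for illustration only; the real job would list VirtualMachines via client-go and apply JSON patches.

```go
package main

import "fmt"

// vm is a trimmed-down, hypothetical model of a VirtualMachine; the real
// object carries the firmware UUID under spec.template.spec.firmware.uuid.
type vm struct {
	Name         string
	FirmwareUUID string
}

// legacyUUID stands in for the name-based derivation; placeholder only,
// not KubeVirt's real algorithm.
func legacyUUID(name string) string {
	return "legacy-" + name
}

// fillMissingUUIDs is the job's core decision: only VMs with an empty
// firmware UUID are patched; existing values are left untouched.
func fillMissingUUIDs(vms []vm) []vm {
	for i := range vms {
		if vms[i].FirmwareUUID == "" {
			vms[i].FirmwareUUID = legacyUUID(vms[i].Name)
		}
	}
	return vms
}

func main() {
	vms := []vm{{Name: "vm-a"}, {Name: "vm-b", FirmwareUUID: "keep-me"}}
	for _, v := range fillMissingUUIDs(vms) {
		fmt.Println(v.Name, v.FirmwareUUID)
	}
}
```

Keeping the patch idempotent (skip any VM that already has a UUID) is what lets an admin run the job at a time of their choosing, including after a restore.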

@vladikr
Member

vladikr commented Feb 5, 2025

We spoke offline, and @dasionov generously helped me realize something about the restore that I was missing before.
I'm still not happy about the proposed solution, but I'm not going to block this PR.

- Communicate changes through release notes and detailed documentation.

3. **GA Phase**
- In two versions, we can assume that all clusters have already been upgraded and the old code can be dropped.
Member


How will empty UUIDs be handled when the old code is dropped?

Author


I don’t see a scenario where a VM resource would be processed after the old code has been removed while still having an empty UUID in its spec.

Member


It is still an editable field, is it not? I see no mention of making the field immutable after it's set.

It may not be supported, but does the kubevirt operator keep people from upgrading from 3+ versions back? Even if it does, I know of users that don't use the operator.

Something is going to happen when we encounter an empty ID, and I am curious whether we know what. And if not, are we planning for this case?

Author


It is still an editable field, is it not? I see no mention of making the field immutable after it's set.

It may not be supported, but does the kubevirt operator keep people from upgrading from 3+ versions back? Even if it does, I know of users that don't use the operator.

Something is going to happen when we encounter an empty ID, and I am curious whether we know what. And if not, are we planning for this case?

Thank you for the detailed explanation—you raise a valid point once again.
We should definitely discuss this further before finalizing the proposal.

IMO, if a user removes the field for any reason, leaving it empty, it makes sense for the controller to generate a new random UUID, similar to how the mutator behaves on creation.

@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from e21304d to 3d75a3d Compare February 11, 2025 17:28
@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jean-edouard for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 3d75a3d to 5dc595c Compare February 13, 2025 12:15
@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 13, 2025
@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 5dc595c to 75a9308 Compare February 18, 2025 21:41
@kubevirt-bot

New changes are detected. LGTM label has been removed.

@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 18, 2025
This proposal introduces a mechanism to make the firmware UUID of a
Virtual Machine in KubeVirt universally unique.
It ensures the UUID remains consistent across VM restarts, preserving
stability and reliability.

Signed-off-by: Daniel Sionov <[email protected]>
@dasionov dasionov force-pushed the add_persist_vmi_firmware_uuid_in_vm_status_design_proposal branch from 75a9308 to 29c2c5f Compare February 19, 2025 09:44