OCPBUGS-37982: Bug fix: Reduce Frequency of Update Requests for Copied CSVs #3497

Open · wants to merge 4 commits into master

Conversation

@bentito (Contributor) commented Jan 22, 2025

Description of the change:

Please check out this doc scoping out the change: https://docs.google.com/document/d/1P4cSYEP05vDyuhBfilyuWgL5d5OOD5z7JlAyOxpPqps

This PR resurrects #3411 with the intent of fixing what that PR originally set out to fix. Follow-on work will address the problem it revealed with metadata.generation|resourceVersion, as described in the doc comment by @tmshort.

Motivation for the change:
[from the linked doc, "How Did We Get Here"]

  • Original Change for Memory Optimization: Sixteen months ago, we merged a PR in OLMv0 that converted the cache of copied ClusterServiceVersions (CSVs) to use PartialObjectMetadata types instead of caching the full CSV objects. This change was crucial for memory utilization performance gains, enabling OLM to run efficiently on MicroShift, a lightweight Kubernetes distribution designed for edge computing.
  • Limited Access to Spec/Status: By using PartialObjectMetadata, we only have access to the metadata of copied CSVs, not their spec or status fields. This means the operator lacks the information needed to compare the full content of the copied CSVs with the originals (a minimal sketch of what a metadata-only client returns follows this list).
  • Removal of “Hash and Compare” Logic: The change inadvertently removed a core piece of the “hash and compare” logic. Previously, the operator used annotations containing hashes of the non-status and status fields of the original CSV to determine if a copied CSV needed updating. These annotations were not set on the copied CSVs after the change.
  • Resulting in Excessive Updates: Without the ability to compare hashes, the operator began issuing updates for copied CSVs 100% of the time, regardless of whether they were in sync with the originals. This behavior introduced a high load on the Kubernetes API server, especially in environments with many namespaces and CSVs installed in AllNamespace mode. The increased load also led to higher audit log volumes, impacting users with increased logging costs.
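
To make the constraint concrete, here is a minimal sketch (not the OLM code itself) of what a metadata-only client returns for a copied CSV: only ObjectMeta is available, so spec and status cannot be compared against the original. The CSV GroupVersionResource is real; the namespace and object name are placeholders.

```go
// Sketch: fetching a copied CSV through the client-go metadata client.
// The result is a *metav1.PartialObjectMetadata: labels, annotations,
// resourceVersion, etc. are present, but there is no Spec or Status.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	csvGVR := schema.GroupVersionResource{
		Group:    "operators.coreos.com",
		Version:  "v1alpha1",
		Resource: "clusterserviceversions",
	}

	// Placeholder namespace/name, purely illustrative.
	objMeta, err := client.Resource(csvGVR).
		Namespace("some-namespace").
		Get(context.TODO(), "example-csv", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(objMeta.GetAnnotations())
}
```

This is why the fix relies on annotations: annotations live in ObjectMeta, so they remain visible even through a metadata-only cache.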

Architectural changes:

  • Reintroducing Annotations: A proposed fix PR adds back the annotations olm.operatorframework.io/nonStatusCopyHash and olm.operatorframework.io/statusCopyHash to the copied CSVs. These annotations store hashes of the non-status and status fields of the original CSV, respectively.
  • Reducing Unnecessary Updates: By comparing these hashes, the operator can determine if the copied CSVs are out of sync with the originals and only issue updates when necessary (see the sketch after this list). This reduces the frequency of update requests to the API server and lowers audit log volumes.
  • Uncovering a New Bug: However, reintroducing the hash comparison logic will uncover a bug due to the use of PartialObjectMetadata for caching copied CSVs. Since we only have access to metadata, if a user manually modifies the spec or status of a copied CSV without changing the hash annotations, the operator cannot detect the change. The operator would incorrectly assume the copied CSV is in sync with the original, leading to potential inconsistencies.
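
The comparison can be sketched roughly as follows. The annotation keys are the ones named above; the hashing helper and the decision function are illustrative, not the exact implementation in this PR.

```go
// Simplified sketch of the hash-and-compare logic for copied CSVs.
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

const (
	nonStatusCopyHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
	statusCopyHashAnnotation    = "olm.operatorframework.io/statusCopyHash"
)

// hashOf serializes a value and returns a short FNV-1a hash string.
func hashOf(v interface{}) (string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	h := fnv.New64a()
	h.Write(raw)
	return fmt.Sprintf("%x", h.Sum64()), nil
}

// needsUpdate compares hashes of the original CSV's non-status and status
// content against the annotations recorded on the copied CSV (which are
// available even from a PartialObjectMetadata cache). An update request is
// issued only when either hash has drifted.
func needsUpdate(originalNonStatus, originalStatus interface{}, copiedAnnotations map[string]string) (bool, error) {
	nonStatusHash, err := hashOf(originalNonStatus)
	if err != nil {
		return false, err
	}
	statusHash, err := hashOf(originalStatus)
	if err != nil {
		return false, err
	}
	return copiedAnnotations[nonStatusCopyHashAnnotation] != nonStatusHash ||
		copiedAnnotations[statusCopyHashAnnotation] != statusHash, nil
}

func main() {
	copied := map[string]string{
		nonStatusCopyHashAnnotation: "stale",
		statusCopyHashAnnotation:    "stale",
	}
	changed, _ := needsUpdate(
		map[string]string{"displayName": "Example"},
		map[string]string{"phase": "Succeeded"},
		copied,
	)
	fmt.Println("update needed:", changed) // true, because the stored hashes differ
}
```

Because the hashes live in annotations, the check works with the PartialObjectMetadata cache, and that is also what makes the bug above possible: a manual edit to a copied CSV's spec or status leaves the annotations untouched, so the comparison reports the copy as in sync.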

Testing remarks:

Aside from the expected changes around the inability to track copied CSV changes made by a user, we should be careful to test the following:

  • Cannot Revert Memory Optimization: Reverting the changes that introduced PartialObjectMetadata caching is not feasible. The memory optimization is critical for running OLM on MicroShift and supporting edge computing scenarios where resources are constrained.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

everettraven and others added 4 commits January 22, 2025 11:09
by adding annotations to copied CSVs that are populated with
hashes of the non-status fields and the status fields.

This seems to be how this was intended to work, but was not actually
working this way because the annotations never actually existed on the
copied CSV. This resulted in a hot loop of update requests being made
on all copied CSVs.

Signed-off-by: everettraven <[email protected]>