Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

many: deal with incorrect status from osbuild better #827

Open
wants to merge 3 commits into from

Conversation

mvo5
Copy link
Collaborator

@mvo5 mvo5 commented Feb 7, 2025

go.mod: upadate to latest images to get PR#1200

This commit updates the images library to pull in the fix [0] for
the overly long messages from osbuild. This should test the
(previously) failing test that runs the centos-9 ISO with an
attached terminal.

[0] osbuild/images#1200


test: run the centos-9 test with an attached terminal

This commit changes the centos-9 test to run with podman -t so
that we have a test-case that uses the terminal progress. This
is prompted by:
a) we have no integration test currently that uses the terminal
progress for the full build
b) a konflux failure/memory leak that showed because there the
test is run with -t


This commit ensures that we do restrict the memory of the bib
test conatiner to catch excessive memory usage. This is prompted
by a memory leak when dealing with unrecoverable status messages
that lead to failures in konflux.


This commit changes the progress parser (again) to deal with
errors from the osbuild json progress scanner. On errors we
will now exit right away and potentially kill osbuild but
provide an error message that hints how to workaround the
issue.

The original code assumed we get transient errors like json
decode issues. However it turns out that this is not always
the case, some errors from a bufio.Scanner are unrecoverable
(like ErrTooBig) and trying again just leads to an endless
loop. We can also not "break" wait for the build to finish
because that would appear like the progress is broken
forever and we would still have to report an error (just
much later).

defer func() {
// ensure osbuild is stopped even if we exit early
if cmd.Process != nil {
cmd.Process.Kill()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may need a bit more care, maybe only do it in hte error case (as the happy case does a cmd.Wait already so no killing needed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine as is but I would strongly recommend using PR_SET_PDEATHSIG if it's not already in use. That's a kernel-enforced way to lifecycle bind the child to the parent, such that if e.g. the parent dies due to SIGSEGV or whatever, the child gets killed too.

@mvo5 mvo5 marked this pull request as ready for review February 7, 2025 21:05
@mvo5 mvo5 force-pushed the fix-konflux-oom branch 3 times, most recently from 0f80abb to 837d891 Compare February 10, 2025 08:59
Copy link
Member

@ondrejbudai ondrejbudai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

PR_SET_PDEATHSIG feels like a nice follow-up, right, @mvo5 ?

@mvo5
Copy link
Collaborator Author

mvo5 commented Feb 11, 2025

Looks good, thanks!

PR_SET_PDEATHSIG feels like a nice follow-up, right, @mvo5 ?

Yeah, I think we can look into this in a followup, while writing this I realized its actually more complicated because killing osbuild hard will probably leave a bunch of (dandling) bind mounts around. So the current approach is to send a SIGINT and hope for the best, osbuild should cleanup after itself. I'm not sure we have good options here but fortunately the error condition (bad progress reported) is rare. The other thing we could try is to make sure that the osbuild status scanner always advances, then we could also keep trying until we get valid json again (replacing the current bufio scanner essentially, however in the general case when we never get valid output again we would still appear hung for an extended period of time, but maybe a worthwhle tradeoff over erroring?)

@mvo5 mvo5 requested a review from ondrejbudai February 13, 2025 09:23
Copy link
Member

@ondrejbudai ondrejbudai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

mvo5 added 3 commits February 13, 2025 11:19
This commit changes the progress parser (again) to deal with
errors from the osbuild json progress scanner. On errors we
will now exit right away and potentially kill osbuild but
provide an error message that hints how to workaround the
issue.

The original code assumed we get transient errors like json
decode issues. However it turns out that this is not always
the case, some errors from a bufio.Scanner are unrecoverable
(like ErrTooBig) and trying again just leads to an endless
loop. We can also not "break" and wait for the build to finish
because that would appear like the progress is broken
forever and we would still have to report an error (just
much later).
This commit ensures that we do restrict the memory of the bib
test conatiner to catch excessive memory usage. This is prompted
by a memory leak when dealing with unrecoverable status messages
that lead to failures in konflux.
This commit changes the centos-9 test to run with `podman -t` so
that we have a test-case that uses the `terminal` progress. This
is prompted by:
a) we have no integration test currently that uses the terminal
   progress for the full build
b) a konflux failure/memory leak that showed because there the
   test is run with `-t`
@ondrejbudai
Copy link
Member

fat-fingered the "update branch" button and it created a merge commit. Fixed that by a force-push.

@mvo5 mvo5 enabled auto-merge February 13, 2025 10:35
@mvo5 mvo5 disabled auto-merge February 13, 2025 10:37
Copy link
Member

@achilleas-k achilleas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Small error message comment.

// can also not (in the general case) recover as
// the underlying osbuildStatus.scanner maybe in
// an unrecoverable state (like ErrTooBig).
return fmt.Errorf(`error parsing osbuild status, please report a bug and try with "--progress=verbose": %w`, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we telling people to report a bug with the verbose output log? If so, the phrasing here doesn't really say that, I think.
Should we rephrase this to:

error parsing osbuild status. Consider reporting a bug (URL) with the output of the build using --progress=verbose

or something like that.

@mvo5 mvo5 enabled auto-merge February 13, 2025 11:51
@mvo5 mvo5 added this pull request to the merge queue Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants