
multiarch/qemu-user-static not working on 1.23 #4215

Open

gbucknel opened this issue Sep 26, 2024 · 14 comments · May be fixed by bottlerocket-os/bottlerocket-core-kit#206
Labels
status/needs-triage Pending triage or re-evaluation type/bug Something isn't working

Comments

@gbucknel

This is a pretty strange one, but I wanted to raise it in case someone else was hitting it.
In a CICD context, we build multiarch images on x64 by executing the multiarch/qemu-user-static image before trying to build arm64 and amd64 docker images.

We use this command:

podman run --authfile /run/containers/0/auth-ecr.json --rm --privileged multiarch/qemu-user-static --reset -p yes

As far as I know, this registers handlers for alternative executable formats with the kernel (via binfmt_misc), so QEMU knows when to step in and emulate the foreign binaries. This has been working for quite a while, but when 1.23 was pushed out, this process broke.
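
(For context, a minimal sketch of what the register script inside that image does, assuming the standard qemu-aarch64 magic/mask values; the real script loops over all architectures and derives the flags from its options:)

# Register a binfmt_misc handler for arm64 ELF binaries. The magic/mask
# match the aarch64 ELF header; the "F" flag (from "-p yes") makes the
# kernel open the interpreter immediately at registration time, so it
# still works inside containers that don't ship qemu-aarch64-static.
echo ':qemu-aarch64:M::\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-aarch64-static:F' \
  > /proc/sys/fs/binfmt_misc/register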

Expected results:
The image builds correctly.

Actual results:
The image build fails with an exec format error:

[2024-09-26T06:40:10.212Z] + buildah build --authfile /run/containers/0/auth-ecr.json --layers --format oci --pull --network=host --ulimit nofile=24000:24000 --squash --jobs 2 --platform=linux/amd64,linux/arm64 --manifest jobs-job-service:0.0.550-test-builds-gbucknel-b4-g2b0d41d .
[2024-09-26T06:40:23.636Z] process failed to start with error: fork/exec /bin/sh: exec format error
process exited with error: exec: not started
subprocess exited with status 1

[2024-09-26T06:40:30.904Z] Error: [linux/arm64]: building at STEP "RUN mkdir -p /usr/apps/service-config": exit status 1

script returned exit code 1

We were able to get everything working again by going back to 1.22.

Also, running the register script by hand in a sheltie session on the host seems to make arm64 builds work on 1.23 again, so I'd like to try doing the same thing in a bootstrap container.

Looking at the changelog, I'm a bit puzzled as to where the change could have been. I'll keep trying to figure it out; if anyone else has seen this, it would be really interesting to hear.

gbucknel added the status/needs-triage and type/bug labels on Sep 26, 2024
@gbucknel
Author

I tried using a bootstrap container; it wasn't any different from executing it in a normal container.
It looks like this is the issue:

https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.107

commit 53477032977930f459293b7c244c348d5667c574
Author: Christian Brauner <[email protected]>
Date:   Thu Oct 28 12:31:13 2021 +0200

    binfmt_misc: cleanup on filesystem umount

I guess because this is typically executed in a container, with this new kernel the binfmt entries are unmounted once the container exits, and thus the arm64 binaries I'm trying to run aren't recognized.

I guess if it were executed in some sort of non-containerized bootstrap script, the entries would persist. I need to figure out whether that would work.
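
(A quick way to check this from the host, e.g. in a sheltie session, assuming the handler name the multiarch image uses:)

# List registered binfmt handlers after the register container has exited.
ls /proc/sys/fs/binfmt_misc/
# Show the arm64 handler if it survived; on 1.23 it's gone.
cat /proc/sys/fs/binfmt_misc/qemu-aarch64 2>/dev/null || echo "qemu-aarch64 not registered"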

@bcressey
Contributor

Nice work tracking this down!

with this new kernel the binfmt entries are unmounted once the container exits, and thus the arm64 binaries I'm trying to run aren't recognized.

This seems like a pretty clear regression, and one that breaks a couple of the popular "fire and forget" solutions for deploying binfmt support.

I'll see about reporting this to LKML.

I guess if it were executed in some sort of non-containerized bootstrap script, the entries would persist. I need to figure out whether that would work.

I'll also try patching host-ctr to pass in the host's /proc/sys/fs/binfmt_misc so that bootstrap containers can populate that.

@bcressey
Contributor

I'll also try patching host-ctr to pass in the host's /proc/sys/fs/binfmt_misc so that bootstrap containers can populate that.

This doesn't work, since it runs afoul of the checkProcMount safety check:

[   74.666444] host-containers@admin[1381]: time="2024-09-27T17:33:30Z" level=fatal msg="failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting \"/proc/sys/fs/binfmt_misc\" to rootfs at \"/proc/sys/fs/binfmt_misc\": create mount destination for /proc/sys/fs/binfmt_misc mount: check proc-safety of /proc/sys/fs/binfmt_misc mount: \"/run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs/proc/sys/fs/binfmt_misc\" cannot be mounted because it is inside /proc: unknown"

@gbucknel
Author

Hi @bcressey, thanks for looking at this. I don't suppose there's a way to add a "non-containerized" init script to a Bottlerocket node without building your own image? I had a look but couldn't find anything. It makes sense that there isn't another way to do it.

@bcressey
Contributor

@gbucknel that's correct - containers are the only way to run custom code on Bottlerocket.

While poking at this, I noticed that the binfmt_misc filesystem isn't mounted on the host by default, because the binfmt feature is disabled. That's something that'd be good to fix, though I'm not sure if it would work around the new kernel's behavior on its own.
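
(For comparison, on a distro where systemd ships with the binfmt feature enabled, the mount and the registrations are handled by stock units; a quick way to inspect that machinery on such a host:)

# The automount unit mounts binfmt_misc at /proc/sys/fs/binfmt_misc on first access.
systemctl status proc-sys-fs-binfmt_misc.automount
# systemd-binfmt registers entries from /usr/lib/binfmt.d and /etc/binfmt.d at boot.
systemctl status systemd-binfmt.service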

@bcressey
Contributor

@gbucknel I was able to get the following bootstrap container working. The two challenges involved were that the host didn't have its own binfmt_misc mount already, and the SELinux labels for the qemu-*-static binaries were overly restrictive because they originated in a bootstrap container.

Dockerfile:

FROM multiarch/qemu-user-static
ADD binfmt-install ./
RUN chmod +x ./binfmt-install
ENTRYPOINT ["sh", "binfmt-install"]

binfmt-install:

#!/bin/bash
set -euxo pipefail
exec 1>&2

# Create the mount point on the host.
mkdir -p /.bottlerocket/rootfs/mnt/binfmt_misc

# Mount the binfmt_misc filesystem. It will propagate back to the host
# because this location is set up as an "rshared" mount.
mount binfmt_misc -t binfmt_misc /.bottlerocket/rootfs/mnt/binfmt_misc

# Bind mount the binfmt_misc filesystem to the expected location under
# /proc/sys/fs/binfmt_misc. Otherwise the QEMU script will mount a second
# copy.
mount --bind /.bottlerocket/rootfs/mnt/binfmt_misc /proc/sys/fs/binfmt_misc

# Because we're running as a bootstrap container, the QEMU binaries all
# have the "secret_t" SELinux label, which prevents unprivileged containers
# from mapping them into memory. Copy the binaries to a host path where they
# will have the "local_t" label instead, after removing any previous copies.
export QEMU_BIN_DIR=/.bottlerocket/rootfs/local/qemu/bin
mkdir -p "${QEMU_BIN_DIR}"
rm -f "${QEMU_BIN_DIR}"/qemu-*-static
cp /usr/bin/qemu-*-static "${QEMU_BIN_DIR}"

# Now run the registration script!
./register --reset -p yes
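
To wire this up, the image can be registered as a bootstrap container through the Bottlerocket API. A hypothetical example, assuming the image is pushed somewhere the node can pull from, with "binfmt" as an arbitrary container name:

# Replace the source with your own image URI.
apiclient set \
  bootstrap-containers.binfmt.source="<registry>/binfmt-install:latest" \
  bootstrap-containers.binfmt.mode="always" \
  bootstrap-containers.binfmt.essential=false

Mode "always" re-runs it on every boot, which is what you want here, since the binfmt entries don't persist across reboots.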

@gbucknel
Author

@bcressey !!! That's so cool, I'll try it out, thank you!

@bcressey
Contributor

Happy to help!

If it's easier to integrate, I expect it'd be possible to make this work in a k8s pod also, with a spec like this:

apiVersion: v1
kind: Pod
metadata:
  name: qemu-static
spec:
  volumes:
  - name: mnt-dir
    hostPath:
      path: /mnt
  - name: local-dir
    hostPath:
      path: /local
  containers:
  - name: qemu-static
    image: multiarch/qemu-user-static:latest
    command: [ "..." ]
    volumeMounts:
    # this provides an "rshared" mount to send mounts back to the host
    - mountPath: /.bottlerocket/rootfs/mnt
      name: mnt-dir
      mountPropagation: Bidirectional
    # this provides a mount with the "local_t" label for correct labeling
    - mountPath: /.bottlerocket/rootfs/local
      name: local-dir
    securityContext:
      privileged: true
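
(The elided command would presumably run the same binfmt-install script baked into the image. Note that mountPropagation: Bidirectional only works in a privileged container, which is why the securityContext is needed.)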

@gbucknel
Author

gbucknel commented Oct 1, 2024

@bcressey, thanks again. I've tested your container and it works well. I appreciate the pod YAML too, since I'm not sure a bootstrap container is the right approach; I had quite a few nodes silently fail while playing with this today.

Just wondering: if the binfmt feature were enabled in systemd for Bottlerocket, do you think this container wouldn't be required? Would that be a better solution? Perhaps not, because it isn't as secure?

I was testing with AL2023 yesterday just to try to reproduce the issue, and was surprised to find that it worked fine with the newer kernel and the multiarch container. Maybe that's because systemd is handling the state of the proc filesystem?
I'll close this issue if you think using the multiarch container is the right way forward here.

@bcressey
Contributor

bcressey commented Oct 1, 2024

Just wondering: if the binfmt feature were enabled in systemd for Bottlerocket, do you think this container wouldn't be required? Would that be a better solution? Perhaps not, because it isn't as secure?

@gbucknel I'm planning to enable that on the Bottlerocket side. I'm not sure whether it would make things work in your environment, but it's worth a try. It sounds like it might help, given that AL2023 is working.

@gbucknel
Author

gbucknel commented Oct 1, 2024

@bcressey oh that's awesome, let me know if I can test anything.

@gbucknel
Author

Hi @bcressey, I was wondering whether the change to systemd was targeted for 1.25? It doesn't seem like there's a change to the file you linked to.

@bcressey
Contributor

bcressey commented Oct 16, 2024

Hi @bcressey, I was wondering whether the change to systemd was targeted for 1.25? It doesn't seem like there's a change to the file you linked to.

@gbucknel It won't make the release train for 1.25; I need to finish writing a test for the SELinux policy changes, and also verify they are actually required.

Previously the host did not mount binfmt_misc at all, so writing to it required CAP_SYS_ADMIN in order to mount it. With that barrier gone, there needs to be some other check (beyond UID 0) or else the security posture will change.

For your binfmt_misc installer: can you confirm you're running it inside a privileged pod? The SELinux rule I've tested would effectively require that.

@gbucknel
Author

No problem, just thought I'd check.

At the moment, I run:

podman run --authfile /run/containers/0/auth-ecr.json --rm --privileged 307943323221.dkr.ecr.us-east-1.amazonaws.com/multiarch/qemu-user-static --reset -p yes

So yes, I am in a privileged pod. Thanks!
