Suggestion / Question: Docker inside Benchexec #993
There is a technical and a non-technical answer for that. The technical part:
The non-technical part: I think there are good arguments for tools (especially in academia) not to be distributed (solely) as container or VM images. The distributed image mixes the tool together with the base system, and it is not clear which parts would need to be extracted in order to run the tool on a different system, so in practice it becomes impossible to use the tool without the container or VM.

But not everyone might have the possibility to use a container. For example, using Docker requires full root access, so students working on their university machines cannot use it, and researchers wanting to run the tool on a computing cluster might have a job-distribution system that does not allow containers (StarExec!). And nobody knows what problems this could cause in 10 years, when somebody might want to reproduce results from current papers.

A better choice is to separate the base system from the actual tool and to document the tool's requirements. Container/VM images can still be used for the base system, and should be archived for reproducibility. But if the instructions for tool X are "Either use this provided base image or use Ubuntu 22.04 plus packages A, B, C, and then extract this ZIP archive with the tool in it.", then it becomes much easier for people who cannot or do not want to use a container to create an environment in which they can run the tool, even if they cannot ensure the exact package versions of the requirements.

A strategy like this is used by SV-COMP and Test-Comp, for example. They force a specific base system (the latest Ubuntu LTS), allow participants to define a list of additional Ubuntu packages that will be installed on top of the base system, and require the tool to be submitted as a ZIP archive that only needs to be extracted in order to start the tool on an appropriate base system. For ease of use, the base system with the additional packages is provided as a Docker image, so that it is still possible to run the tool in a container with a single command (after extraction of the ZIP).

Maybe these arguments convince you that using container images as a distribution mechanism is suboptimal anyway.
I like options 3 and 4 :)

Non-technical part: Let me provide my view on this. As context: I am part of several artifact evaluation committees in a particular area of verification. This doesn't give me absolute knowledge, of course; what follows is my perception of my sub-area. The current state (whether we like it or not) is that there has been a strong push for VM / Docker images in recent times (VMs five years ago, Docker nowadays). Most conferences even require one of the two (barring exceptional circumstances).
That is (in my eyes) the exact reason for the push for Docker images (ideally without requiring network access). The consensus is that it is more reasonable to assume that some form of Docker runtime will exist in 10 years than to have the exact glibc, the exact JVM version, or the exact Python version (all of which might not be backwards compatible, not just in theory but also in practice). Again, whether this is correct is surely up for debate, yet that is the reality of things in my field.
Indeed, and for the ACM reusable / available badges this is (more or less) required. But the situation is that many authors usually only aim for the functional badge (which does not even require any form of source code to be provided!), and thus for a concrete evaluation, authors typically submit images. (This also reduces the potential for errors on the side of the reviewer.) It also often happens that even if public GitHub sources are submitted, they are different from the tool submitted for evaluation (e.g. because the concrete models / benchmark scripts are not put in the repository (yet), or the code is not made public (yet), etc.). Also, note that the dependencies of two evaluated tools might be mutually exclusive (!), again not just in theory; this actually happens.
SV-COMP is much, much more standardized and "harsher" in that manner. I wish it were like that for other conferences, but realistically, most researchers are really bad at writing and maintaining tools, so by adding stricter requirements we would effectively just get fewer submissions, which would be counter-productive. In an ideal world, every paper would be required to submit an artifact, and acceptance of the paper would depend on the reproducibility of the results. The reality is that artifacts are mostly optional, and even if they produce completely wrong results, the papers just don't get the "Artifact Evaluation" badge, which is indistinguishable from not having submitted an artifact at all.

Just as an example: It is normal that submitted tools don't work properly (even when submitted as an image), and artifact evaluation proceeds in a two-phase process, where in the first phase authors can reply to questions by the reviewers on how to actually run the tool and fix setup problems (and this already is on the "strict" side; some evaluations have continuously open communication). It is (in general) also considered too strict to require that the artifact run without network access or be archived on persistent storage. (It is now becoming more and more standard that Zenodo, Figshare, etc. are a strict requirement, but still, this year I reviewed artifacts that were submitted as files on Dropbox, and this was deemed acceptable.)

A different way to view it: Most artifacts are prepared solely by PhD students, who often don't know much about software development practices, and most artifact reviews are also done by PhD students or post-docs. As such, having a setup that is as simple as possible (i.e., an image) is really helpful. (Again, recall that these observations are only based on my area.)
I do think that containers / VMs are optimal (better than sources) for a concrete evaluation: there can be no ambiguity about sub-versions of libraries, which can have a severe impact on runtime, and running the tools is much easier. I also agree that the distribution should be accompanied by a clear description of what is needed to replicate the results, plus public code. (Arguably, if you require that the

As a bottom line: Even if it is not ideal, making BenchExec easier to use for these cases could potentially attract more users, who might then get educated to do things right :) (I'm starting to be chair for some evaluation committees and could use this position to recommend BenchExec if I can expect people to use it.)
If there are adjustments to BenchExec necessary for this, they could likely be added. Also, if you encounter problems, do not hesitate to ask, maybe I can be of help. For option 4, I now recognize that a little bit more is necessary than described above, because BenchExec would also need to do more in the container preparation step, like adding

For the rest of the (somewhat off-topic) discussion, I do have the same experience as you, in particular that researchers struggle to provide usable tools. This is why I see it as a good thing if artifact-evaluation committees and competitions slightly push tool developers in the right direction. Of course, AECs should not exclude anybody, so allowing full VM images to be submitted after an individual request should be possible, but one could make a separate base image + tool archive the default for submitters, as TACAS has done for seven years now. And certainly it must be a requirement that tools are executable on at least one specific VM / container image, and that this image is archived permanently just like the rest of the submission. This eliminates all the doubt about version compatibilities, ambiguity, long-term availability, etc. I am only arguing that it would be nice to not have the tool executable only on that image. And asking submitters to invest a little bit of extra effort to keep base system and tool separate feels worthwhile.

Also note that one has to distinguish between competitions and artifacts here. Competitions need to be more strict anyway, because there are simply technical requirements such as that all tools need to be executable on the same host (you can't submit an ARM binary and hope to get a fair performance comparison against other tools with x86 binaries). But at the same time they can be more strict because they can also give the submitters more chances to resolve problems during a qualification phase where, if necessary, organizers and submitters look at problems together. SV-COMP even automates a lot of this by running sanity checks on submitted tools in a CI system. Similarly for forbidding network access: for performance competitions this is simply a requirement anyway.
Ouch :-( I hope that gets changed soon.
Kind of - Dockerfiles often download stuff from the internet, so in 10 years you might have the image and the Dockerfile, but no way to re-create the container from the Dockerfile.
Certainly! Please do so! ;-)
I know too little about containers, but I imagine it is similar to chroot + cgroups? I think creating special devices (
... which quite a few people perceive as too restrictive (I think TACAS requires tools to run on their VM, right?)
Aha! Yes, I was mainly talking about artifacts; sorry, I should have made that clearer. (This is also why I consider this issue less important for QComp concretely.) I do think that using BenchExec even for artifacts would be super cool (this is where I used it). So many experimental setups are so fundamentally flawed that (in my opinion) many papers' experimental conclusions are questionable at least.

The main point of this issue is to make BenchExec easier to use for artifact evaluation. Especially when advocating for it as AEC chair, it should be simple enough that a standard PhD student can be expected to get it to run for their tool. I think it is currently not in this state, but not too much is missing.

I now realize that my whole motivation is moot, since BenchExec can run inside Docker, so we can just say "hey, please include BenchExec in the image". And even if not, I think what you can always do is take any Docker image and create a new image from it with BenchExec added.

EDIT: Just to clarify where I was coming from: I was evaluating foreign tools that basically say "if you want to use our tool, here is the Docker image you should use" (e.g. https://hub.docker.com/r/movesrwth/storm). Now this image doesn't include BenchExec (understandably, it's the standard distribution of their tool), so I was "stuck". But I didn't think of simply deriving a new image.
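For concreteness, a minimal sketch of what such a derived image could look like, assuming the base image is Debian/Ubuntu-based and network access is available at build time; the tag name is made up, and the Storm image is only used because it is the example mentioned above.

# Build a new image on top of the tool authors' image and add BenchExec.
# Assumes an apt-based base image with network access during the build;
# the tag "storm-with-benchexec" is a made-up example.
docker build -t storm-with-benchexec - <<'EOF'
FROM movesrwth/storm:stable
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3 python3-pip \
 && pip3 install benchexec \
 && rm -rf /var/lib/apt/lists/*
EOF

Note that running BenchExec inside such a container still needs the additional docker run options described in the BenchExec documentation on containers; the derived image only solves the "BenchExec is not in the image" part.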
Indeed, this is why requiring the full
Yes, but during the container setup we do have special permissions inside the container. This is how we mount stuff and configure a network device in the container. I think we can also create (some) devices, and in the worst case we could bind-mount them from the host.
Sure! This is our goal.
We are always glad to hear feedback regarding what is missing here.
Yes, this is what we typically recommend when asked about BenchExec and containers.
If you already have Python in the container, just make the BenchExec .whl available inside the container.
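One way this could look in practice, as a hedged sketch: bind-mount a downloaded BenchExec wheel into the unmodified tool image and install it when the container starts. The wheel filename/version, mount path, and image name are placeholders, and the additional Docker options that BenchExec needs inside a container are omitted here.

# Install a bind-mounted BenchExec wheel at container start; the wheel
# filename/version and the paths are placeholders.
docker run --rm \
  -v "$PWD/BenchExec-3.18-py3-none-any.whl:/benchexec.whl:ro" \
  --entrypoint sh <image> \
  -c 'pip3 install /benchexec.whl && runexec --help'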
I think most / all of the points are in #994; this was more of a very application-specific situation.
Oh, right. This should even be moderately easy to automate with the Docker API (i.e., check if BenchExec is present and, if not, copy the .whl). Given that there is a simple way to handle this, I am fine with closing this issue.
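If one wanted to script that check, a rough equivalent using the Docker CLI instead of the Python Docker API could look like the following hedged sketch; the helper name is made up, and it assumes the image already contains Python 3 and pip, as discussed above.

# Hypothetical helper: print the name/ID of an image that has BenchExec
# installed, deriving a new image from the given one only if needed.
ensure_benchexec() {
  local image="$1"
  if docker run --rm --entrypoint python3 "$image" -c 'import benchexec' >/dev/null 2>&1; then
    echo "$image"    # BenchExec is already present, use the image as-is
  else
    # Build a derived image; "docker build -q" prints the ID of the result.
    docker build -q - <<EOF
FROM $image
RUN pip3 install benchexec
EOF
  fi
}

# Usage sketch: image=$(ensure_benchexec movesrwth/storm:stable)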
Ok. I would be highly interested in hearing about the result of this, i.e., which way you use in the end and how it is implemented. This can also be of interest for other users of BenchExec, so I would appreciate it if you report back (e.g., here).
Hi,
First: I didn't find anything on this in the issues / docs, but maybe I just didn't look deep enough :) Please just point me in the right direction if that is the case.
I was running experiments for QComp 2023 and other submissions, and the general approach there (which also seems to be usual for artifacts nowadays) is to package tools as Docker containers. This is particularly useful for big, compiled tools with lots of dependencies, such as Storm or PRISM. Now, since authors usually provide their own Docker container and this container usually does not include BenchExec, one cannot run BenchExec inside it. Instead (as I did), it seems reasonable to run a Docker container for each execution and apply CPU pinning etc., as sketched below. But this doesn't make use of BenchExec, which I would like to do.
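For context, that baseline might look roughly like the following; the core list, memory limit, and tool invocation are placeholders and not taken from the actual QComp setup.

# One fresh container per tool execution, pinned to fixed cores and with a
# memory cap; core list, limits, and command are placeholders.
docker run --rm \
  --cpuset-cpus=0-3 \
  --memory=8g --memory-swap=8g \
  <image> <tool> <benchmark-instance>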
So, the question is: Is there a reasonable way to use BenchExec on the host to run tools provided as arbitrary Docker containers (i.e., without control over what is contained in the container)? (Edit: It is reasonable to require that the images are x86 / x64 Linux-based.)
I am guessing that
docker run <image> <foo>
inside BenchExec would be overkill / something smarter could be done there. A silly but probably doable solution would be to have a statically compiled BenchExec binary that is copied into the Docker container and takes care of the sandboxing etc. But this assumes a bit about the architecture of the image, I guess. What would be really cool is to extract the volumes of the image as a basis for the overlayfs, but I have no idea if this is possible (see the extraction sketch below). I am willing to provide an implementation if pointed in the right direction; this is more about whether this is possible / feasible / sensible and which direction to take.
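As a hedged sketch of the extraction part: the full root filesystem of an image can at least be dumped to the host with standard Docker commands (this exports the whole filesystem, not only declared volumes; the container name and target directory are placeholders). Whether and how BenchExec's overlay could then be pointed at such a directory is exactly the open part of the question.

# Dump the root filesystem of an image into a host directory.
# "tool-rootfs" and "/tmp/tool-rootfs" are placeholder names.
docker create --name tool-rootfs <image>
mkdir -p /tmp/tool-rootfs
docker export tool-rootfs | tar -x -C /tmp/tool-rootfs
docker rm tool-rootfs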