Suggestion / Question: Docker inside Benchexec #993
There is a technical and a non-technical answer for that. The technical part:
The non-technical part: I think there are good arguments for tools (especially in academia) not to be distributed (solely) as container or VM images. The distributed image mixes the tool together with the base system, and it is not clear which parts would need to be extracted in order to run the tool on a different system, so in practice it becomes impossible to use the tool without the container or VM.

But not everyone might have the possibility to use a container. For example, using Docker requires full root access, so students working on their university machines cannot use it, and researchers wanting to run the tool on a computing cluster might have a job-distribution system that does not allow containers (StarExec!). And nobody knows what problems this could cause in 10 years, when somebody might want to reproduce results from current papers.

A better choice is to separate the base system from the actual tool and to document the tool's requirements. Container/VM images can still be used for the base system, and should be archived for reproducibility. But if the instructions for tool X are "Either use this provided base image or use Ubuntu 22.04 plus packages A, B, C, and then extract this ZIP archive with the tool in it.", then it becomes much easier for people who cannot or do not want to use a container to create an environment in which they can run the tool, even if they cannot ensure the exact package versions of the requirements.

A strategy like this is used by SV-COMP and Test-Comp, for example. They force a specific base system (the latest Ubuntu LTS), allow participants to define a list of additional Ubuntu packages that will be installed on top of the base system, and require the tool to be submitted as a ZIP archive that only needs to be extracted in order to start the tool on an appropriate base system. For ease of use, the base system with the additional packages is provided as a Docker image, so that it is still possible to run the tool in a container with a single command (after extraction of the ZIP).

Maybe these arguments convince you that using container images as a distribution mechanism is suboptimal anyway.
I like options 3 and 4 :)

Non-technical part: Let me provide my view on this. As context: I am part of several artifact evaluation committees in a particular area of verification. This doesn't give me absolute knowledge, of course; what follows is my perception of my sub-area. The current state (whether we like it or not) is that there has been a strong push for VM / Docker images in recent times (VMs five years ago, Docker nowadays). Most conferences even require one of the two (barring exceptional circumstances).
That is (in my eyes) the exact reason for the push for Docker images (ideally without requiring network access). The consensus is that it is more reasonable to assume that some form of Docker runtime will exist in 10 years than to have the exact glibc, the exact JVM version, or the exact Python version (all of which might not be backwards compatible, not just in theory but also in practice). Again, whether this is correct is surely up for debate, yet that is the reality of things in my field.
Indeed, and for the ACM reusable / available badges this is (more or less) required. But the situation is that many authors usually only aim for the functional badge (which does not even require any form of source code to be provided!), and thus for a concrete evaluation, authors typically submit images. (This also reduces the potential for errors on the side of the reviewer.) It also often happens that even if public GitHub sources are submitted, they are different from the tool submitted for evaluation (e.g. because the concrete models / benchmark scripts are not put in the repository (yet), or the code is not made public (yet), etc.). Also, note that the dependencies of two evaluated tools might be mutually exclusive (!), again not just in theory; this actually happens.
SV-COMP is much, much more standardized and "harsher" in that manner. I wish it were like that for other conferences, but realistically, most researchers are really bad at writing and maintaining tools, so by adding stricter requirements we would effectively just get fewer submissions, which would be counter-productive. In an ideal world, every paper would be required to submit an artifact, and acceptance of the paper would depend on the reproducibility of the results. The reality is that artifacts are mostly optional, and even if they produce completely wrong results, the papers just don't get the "Artifact Evaluation" badge, which is indistinguishable from not having submitted an artifact at all.

Just as an example: It is normal that submitted tools don't work properly (even when submitted as an image), and artifact evaluation proceeds in a two-phase process, where in the first phase authors can reply to questions by the reviewers on how to actually run the tool and fix setup problems (and this already is on the "strict" side; some evaluations have continuously open communication). It is (in general) also considered too strict to require that the artifact run without network access or be archived on persistent storage. (It is now becoming more and more standard that Zenodo, Figshare, etc. are a strict requirement, but still, this year I reviewed artifacts that were submitted as files on Dropbox, and this was deemed acceptable.)

A different way to view it: Most artifacts are prepared solely by PhD students, who often don't know much about software development practices, and most artifact reviews are also done by PhD students or post-docs. As such, having a setup that is as simple as possible (i.e., an image) is really helpful. (Again, recall that these observations are only based on my area.)
I do think that containers / VMs are optimal (better than sources) for a concrete evaluation: there can be no ambiguity about sub-versions of libraries, which can have a severe impact on runtime, and running the tools is much easier. I also agree that the distribution should be accompanied by a clear description of what is needed to replicate the results, plus public code. (Arguably, if you require that the

As a bottom line: Even if it is not ideal, making BenchExec easier to use for these cases could potentially attract more users, who might then get educated to do things right :) (I'm starting to be chair for some evaluation committees and could use this position to recommend BenchExec if I can expect people to use it.)
If there are adjustments to BenchExec necessary for this, they could likely be added. Also, if you encounter problems, do not hesitate to ask, maybe I can be of help. For option 4, I now recognize that a little bit more is necessary than described above, because BenchExec would also need to do more in the container preparation step, like adding

For the rest of the (somewhat off-topic) discussion, I do have the same experience as you, in particular that researchers struggle to provide usable tools. This is why I see it as a good thing if artifact-evaluation committees and competitions slightly push tool developers in the right direction. Of course, AECs should not exclude anybody, so allowing full VM images to be submitted after an individual request should be possible, but one could make a separate base image + tool archive the default for submitters, as TACAS has done for seven years now. And certainly it must be a requirement that tools are executable on at least one specific VM / container image, and that this image is archived permanently just like the rest of the submission. This eliminates all the doubt about version compatibilities, ambiguity, long-term availability, etc. I am only arguing that it would be nice to not have the tool executable only on that image. And asking submitters to invest a little bit of extra effort to keep base system and tool separate feels worthwhile.

Also note that one has to distinguish between competitions and artifacts here. Competitions need to be more strict anyway, because there are simply technical requirements such as that all tools need to be executable on the same host (you can't submit an ARM binary and hope to get a fair performance comparison against other tools with x86 binaries). But at the same time they can be more strict because they can also give the submitters more chances to resolve problems during a qualification phase where, if necessary, organizers and submitters look at problems together. SV-COMP even automates a lot of this by running sanity checks on submitted tools in a CI system. Similarly for forbidding network access: for performance competitions this is simply a requirement anyway.
Ouch :-( I hope that gets changed soon.
Kind of - Dockerfiles often download stuff from the internet, so in 10 years you might have the image and the Dockerfile, but no way to re-create the container from the Dockerfile.
Certainly! Please do so! ;-)
I know too little about containers, but I imagine it is similar to chroot + cgroups? I think creating special devices (
... which quite a few people perceive as too restrictive (I think TACAS requires tools to run on their VM, right?)
Aha! Yes, I was mainly talking about artifacts; sorry, I should have made that clearer. (This is also why I consider this issue less important for QComp concretely.) I do think that using BenchExec even for artifacts would be super cool (this is where I used it). So many experimental setups are so fundamentally flawed that (in my opinion) many papers' experimental conclusions are questionable at least.

The main point of this issue is to make BenchExec easier to use for artifact evaluation. Especially when advocating for it as AEC chair, it should be simple enough that a standard PhD student can be expected to get it to run for their tool. I think it is currently not in this state, but not too much is missing.

I now realize that my whole motivation is moot, since BenchExec can run inside Docker, so we can just say "hey, please include BenchExec in the image". And even if not, I think what you can always do is take any Docker image and create a new image from it with BenchExec added.

EDIT: Just to clarify where I was coming from: I was evaluating foreign tools that basically say "if you want to use our tool, here is the Docker image you should use" (e.g. https://hub.docker.com/r/movesrwth/storm). Now this image doesn't include BenchExec (understandably, it's the standard distribution of their tool), so I was "stuck". But I didn't think of simply deriving a new image.
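For concreteness, a minimal sketch of what such a derived image could look like, assuming the base image is Debian/Ubuntu-based and network access is available at build time; the tag name is made up, and the Storm image is only used because it is the example mentioned above.

# Build a new image on top of the tool authors' image and add BenchExec.
# Assumes an apt-based base image with network access during the build;
# the tag "storm-with-benchexec" is a made-up example.
docker build -t storm-with-benchexec - <<'EOF'
FROM movesrwth/storm:stable
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3 python3-pip \
 && pip3 install benchexec \
 && rm -rf /var/lib/apt/lists/*
EOF

Note that running BenchExec inside such a container still needs the additional docker run options described in the BenchExec documentation on containers; the derived image only solves the "BenchExec is not in the image" part.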
Indeed, this is why requiring the full
Yes, but during the container setup we do have special permissions inside the container. This is how we mount stuff and configure a network device in the container. I think we can also create (some) devices, and in the worst case we could bind-mount them from the host.
Sure! This is our goal.
We are always glad to hear feedback regarding what is missing here.
Yes, this is what we typically recommend when asked about BenchExec and containers.
If you already have Python in the container, just make the BenchExec .whl available inside the container.
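One way this could look in practice, as a hedged sketch: bind-mount a downloaded BenchExec wheel into the unmodified tool image and install it when the container starts. The wheel filename/version, mount path, and image name are placeholders, and the additional Docker options that BenchExec needs inside a container are omitted here.

# Install a bind-mounted BenchExec wheel at container start; the wheel
# filename/version and the paths are placeholders.
docker run --rm \
  -v "$PWD/BenchExec-3.18-py3-none-any.whl:/benchexec.whl:ro" \
  --entrypoint sh <image> \
  -c 'pip3 install /benchexec.whl && runexec --help'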
I think most / all of the points are in #994; this was more of a very application-specific situation.
Oh, right. This should even be moderately easy to automate with the Docker API (i.e., check if BenchExec is present and, if not, copy the .whl). Given that there is a simple way to handle this, I am fine with closing this issue.
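If one wanted to script that check, a rough equivalent using the Docker CLI instead of the Python Docker API could look like the following hedged sketch; the helper name is made up, and it assumes the image already contains Python 3 and pip, as discussed above.

# Hypothetical helper: print the name/ID of an image that has BenchExec
# installed, deriving a new image from the given one only if needed.
ensure_benchexec() {
  local image="$1"
  if docker run --rm --entrypoint python3 "$image" -c 'import benchexec' >/dev/null 2>&1; then
    echo "$image"    # BenchExec is already present, use the image as-is
  else
    # Build a derived image; "docker build -q" prints the ID of the result.
    docker build -q - <<EOF
FROM $image
RUN pip3 install benchexec
EOF
  fi
}

# Usage sketch: image=$(ensure_benchexec movesrwth/storm:stable)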
Ok. I would be highly interested in hearing about the result of this, i.e., which way you use in the end and how it is implemented. This can also be of interest for other users of BenchExec, so I would appreciate it if you report back (e.g., here).
Hi,
First: I didn't find anything on this in the issues / docs, but maybe I just didn't look deep enough :) Please just point me in the right direction if that is the case.
I was running experiments for QComp 2023 and other submissions, and the general approach there (which also seems to be usual for artifacts nowadays) is to package tools as Docker containers. This is particularly useful for big, compiled tools with lots of dependencies, such as Storm or PRISM. Now, since authors usually provide their own Docker container and this container usually does not include BenchExec, one cannot run BenchExec inside it. Instead (as I did), it seems reasonable to run a Docker container for each execution and apply CPU pinning etc., as sketched below. But this doesn't make use of BenchExec, which I would like to do.
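For context, that baseline might look roughly like the following; the core list, memory limit, and tool invocation are placeholders and not taken from the actual QComp setup.

# One fresh container per tool execution, pinned to fixed cores and with a
# memory cap; core list, limits, and command are placeholders.
docker run --rm \
  --cpuset-cpus=0-3 \
  --memory=8g --memory-swap=8g \
  <image> <tool> <benchmark-instance>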
So, the question is: Is there a reasonable way to use BenchExec on the host to run tools provided as arbitrary Docker containers (i.e., without control over what is contained in the container)? (Edit: It is reasonable to require that the images are x86 / x64 Linux-based.)
I am guessing that
docker run <image> <foo>
inside BenchExec would be overkill / something smarter could be done there. A silly but probably doable solution would be to have a statically compiled BenchExec binary that is copied into the Docker container and takes care of the sandboxing etc. But this assumes a bit about the architecture of the image, I guess. What would be really cool is to extract the volumes of the image as a basis for the overlayfs, but I have no idea if this is possible (see the extraction sketch below). I am willing to provide an implementation if pointed in the right direction; this is more about whether this is possible / feasible / sensible and which direction to take.
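As a hedged sketch of the extraction part: the full root filesystem of an image can at least be dumped to the host with standard Docker commands (this exports the whole filesystem, not only declared volumes; the container name and target directory are placeholders). Whether and how BenchExec's overlay could then be pointed at such a directory is exactly the open part of the question.

# Dump the root filesystem of an image into a host directory.
# "tool-rootfs" and "/tmp/tool-rootfs" are placeholder names.
docker create --name tool-rootfs <image>
mkdir -p /tmp/tool-rootfs
docker export tool-rootfs | tar -x -C /tmp/tool-rootfs
docker rm tool-rootfs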