Checkpoint feature tutorial does not work when ran with non-root user #2473

Closed
luispcunha opened this issue Sep 2, 2024 · 4 comments

@luispcunha

Version of Apptainer

$ apptainer --version
apptainer version 1.3.3

Expected behavior

Expected to be able to reproduce the checkpointing example in the documentation, running all Apptainer commands with a non-privileged user.

Actual behavior

After executing apptainer checkpoint instance server, the web server running in the instance crashes. Logs from the ~/.apptainer/instances/logs/{host_name}/{username}/server.err file:

127.0.0.1 - - [02/Sep/2024 10:28:27] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [02/Sep/2024 10:28:32] "GET / HTTP/1.1" 200 -

[2024-09-02T10:28:39.795, 41000, 41003, ERROR] at fileconnlist.cpp:428 in prepareShmList; REASON='JASSERT(fd != -1) failed'
     (strerror((*__errno_location ()))) = Read-only file system
     area.name = /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
    python3.10: Terminating...
    Backtrace:
        1 jassert_internal::JAssert::~JAssert() in /.singularity.d/libs/libdmtcp.so 0x7f2515e572f1
        2 dmtcp::FileConnList::prepareShmList() in /.singularity.d/libs/libdmtcp_ipc.so 0x7f25162c52de
        3 dmtcp_FileConnList_EventHook(eDmtcpEvent, _DmtcpEventData_t*) in /.singularity.d/libs/libdmtcp_ipc.so 0x7f25162c68f7
        4 dmtcp::PluginManager::eventHook(eDmtcpEvent, _DmtcpEventData_t*) in /.singularity.d/libs/libdmtcp.so 0x7f2515e26e57
        5 dmtcp::DmtcpWorker::preCheckpoint() in /.singularity.d/libs/libdmtcp.so 0x7f2515e1dff4
        6  in /.singularity.d/libs/libdmtcp.so 0x7f2515e2eab4
        7  in /.singularity.d/libs/libdmtcp.so 0x7f2515e30c66
        8  in /lib/x86_64-linux-gnu/libpthread.so.0 0x7f2515852fa3
        9 clone in /lib/x86_64-linux-gnu/libc.so.6 0x7f25155f506f
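
For triage, the key fields of a record like the one above (failing assertion, errno text, affected file) can be pulled out with a small parser. This is only a sketch: the header layout is inferred from the single record in this report, the names `parse_jassert_records`, `pid` and `tid` are hypothetical, and reading the two numeric fields as process and thread IDs is an assumption.

```python
import re

# Header layout inferred from the one record in this report (an assumption):
# [timestamp, id, id, LEVEL] at file:line in function; REASON='...'
HEADER = re.compile(
    r"\[(?P<timestamp>[^,\]]+), (?P<pid>\d+), (?P<tid>\d+), (?P<level>\w+)\] "
    r"at (?P<file>[^:\s]+):(?P<line>\d+) in (?P<function>\w+); "
    r"REASON='(?P<reason>[^']*)'"
)
# Indented "key = value" lines that directly follow a header.
DETAIL = re.compile(r"^\s+(?P<key>.+?) = (?P<value>.*)$")


def parse_jassert_records(log_text):
    """Return one dict per JASSERT header, with its key/value detail lines."""
    records = []
    current = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            current = header.groupdict()
            current["details"] = {}
            records.append(current)
            continue
        detail = DETAIL.match(line)
        if current is not None and detail:
            current["details"][detail.group("key")] = detail.group("value")
        else:
            # Any other line (access log, backtrace) ends the current record.
            current = None
    return records


SAMPLE = """127.0.0.1 - - [02/Sep/2024 10:28:32] "GET / HTTP/1.1" 200 -

[2024-09-02T10:28:39.795, 41000, 41003, ERROR] at fileconnlist.cpp:428 in prepareShmList; REASON='JASSERT(fd != -1) failed'
     (strerror((*__errno_location ()))) = Read-only file system
     area.name = /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
    python3.10: Terminating...
"""

records = parse_jassert_records(SAMPLE)
for record in records:
    print(record["function"], record["reason"], record["details"])
```

Run against server.err, this makes it easier to see which file triggered the assertion and with which errno text.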

Subsequent calls to apptainer checkpoint instance server print the following:

INFO:    Using checkpoint "example-checkpoint"
Error, computation not in running state.  Either a checkpoint is
 currently happening or there are no connected processes.

When the example is run as the "root" user, this error does not occur and I can reproduce the example, but the restart step does not work reliably (similar to the issue described here).

Steps to reproduce this behavior

Follow the instructions in the documentation as a non-root user.
DMTCP was built from source at tag 3.0.0 of the GitHub repo.

What OS/distro are you running

$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

How did you install Apptainer

wget https://github.com/apptainer/apptainer/releases/download/v1.3.3/apptainer_1.3.3_amd64.deb
sudo apt install -y ./apptainer_1.3.3_amd64.deb
@ikaneshiro
Contributor

Hello, looking to reproduce this. Did you build dmtcp with the --enable-static-libstdcxx flag?

@ikaneshiro self-assigned this Sep 3, 2024
@luispcunha
Author

Hello, looking to reproduce this. Did you build dmtcp with the --enable-static-libstdcxx flag?

Yes.
This is how I built it:

#!/bin/bash

VERSION=3.0.0

apt install git gcc g++ make -y
apt install python3 -y

git clone https://github.com/dmtcp/dmtcp
cd dmtcp
git checkout $VERSION

./configure --enable-static-libstdcxx

make
make check # Optional
make install

echo /usr/local/lib/dmtcp > /etc/ld.so.conf.d/dmtcp.conf
ldconfig

@JasonYangShadow
Member

Hmm, I cannot reproduce this issue. It looks like it is related to a permission issue, as shown in the dump trace:

[2024-09-02T10:28:39.795, 41000, 41003, ERROR] at fileconnlist.cpp:428 in prepareShmList; REASON='JASSERT(fd != -1) failed'
     (strerror((*__errno_location ()))) = Read-only file system
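
One way to tell a read-only mount apart from a file-permission problem is to try opening the file from inside the container as the affected user and look at the symbolic errno. The message "Read-only file system" is the Linux text for EROFS, whereas a plain permission failure would normally surface as EACCES or EPERM. The sketch below assumes the failing open requests write access (`O_RDWR`), which the log does not confirm, and `probe_open` is a hypothetical helper name.

```python
import errno
import os


def probe_open(path, flags=os.O_RDWR):
    """Try to open path with the given flags; return "OK" or the errno name."""
    try:
        fd = os.open(path, flags)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. EROFS or EACCES.
        return errno.errorcode.get(exc.errno, f"errno {exc.errno}")
    os.close(fd)
    return "OK"


if __name__ == "__main__":
    # Path taken from the area.name line in the report.
    target = "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache"
    print("read-write:", probe_open(target, os.O_RDWR))
    print("read-only: ", probe_open(target, os.O_RDONLY))
```

If the read-write probe reports EROFS while the read-only probe reports OK, the container's root filesystem mount would be the likelier cause, rather than ownership or mode bits on the file.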

@DrDaveD added this to the 1.3.5 milestone Sep 5, 2024
@DrDaveD modified the milestones: 1.3.5, 1.4.0 Oct 30, 2024
@DrDaveD
Contributor

DrDaveD commented Nov 19, 2024

The documentation was updated in apptainer/apptainer-userdocs#300.

@DrDaveD closed this as completed Nov 19, 2024