Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neighbourlist integer size #977

Merged
merged 16 commits into from
Nov 30, 2023

Conversation

Iximiel
Copy link
Member

@Iximiel Iximiel commented Oct 5, 2023

Description

I found that if a user ask a full neighbor list with a more than circa ~1M atoms (meaning that the neighbor list would have nat(nat-1)/2 elements) plumed will crash, this may be a bug due to the integers type on some indexes:
By compiling with --enable-debug --enable-debug-glibcxx I get the following message

/opt/ohpc/pub/compiler/gcc/9.4.0/include/c++/9.4.0/debug/vector:434:
In function:
    std::__debug::vector<_Tp, _Allocator>::const_reference 
    std::__debug::vector<_Tp, 
    _Allocator>::operator[](std::__debug::vector<_Tp, 
    _Allocator>::size_type) const [with _Tp = PLMD::AtomNumber; _Allocator = 
    std::allocator<PLMD::AtomNumber>; std::__debug::vector<_Tp, 
    _Allocator>::const_reference = const PLMD::AtomNumber&; 
    std::__debug::vector<_Tp, _Allocator>::size_type = long unsigned int]

Error: attempt to subscript container with out-of-bounds index -536758121, 
but container only holds 121500 elements.

Objects involved in the operation:
    sequence "this" @ 0x0x2799798 {
      type = std::__debug::vector<PLMD::AtomNumber, std::allocator<PLMD::AtomNumber> >;
    }

the plumed.dat is

cv: COORDINATION GROUPA=1-121500 R_0=1 NOPBC

PRINT ARG=* FILE=colvar_121500 FMT=%8.4f STRIDE=1

I'm setting up this as a draft because I am unsure if the problem is only on the NL or is more "deep"

Target release

I would like my code to appear in release 2.9

Type of contribution
  • changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
  • changes to a module not authored by you
  • new module contribution or edit of a module authored by you
Copyright
  • I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.
  • the module I added or modified contains a COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.
Tests
  • I added a new regtest or modified an existing regtest to validate my changes.
  • I verified that all regtests are passed successfully on GitHub Actions.

@GiovanniBussi
Copy link
Member

@Iximiel writing a regtest that fails with master and fails with this is a bit tricky, because we do not want to store a trajectory with 1M atoms. But you can write a simple awk script that generates such trajectory on the fly (see e.g. basic/rt-fix-226). A single frame is sufficient. Alternatively we might at some point hard code some sample trajectory in the plumed driver, so that they can be directly generated in memory without the need to write a file.

@Iximiel
Copy link
Member Author

Iximiel commented Oct 5, 2023

@Iximiel writing a regtest that fails with master and fails with this is a bit tricky, because we do not want to store a trajectory with 1M atoms. But you can write a simple awk script that generates such trajectory on the fly (see e.g. basic/rt-fix-226). A single frame is sufficient. Alternatively we might at some point hard code some sample trajectory in the plumed driver, so that they can be directly generated in memory without the need to write a file.

I did used the trick from rt-fix-226, so no new trajectories have been pushed.
The tests are based on rt15 (the one that tests mklib), but may or should be rewritten in a more unit-test-like fashion to focus better on the problem (and generate 1M of atoms into memory instead of reading them, after generating them with awk).

@GiovanniBussi
Copy link
Member

Note: the problem should require much less than 1M particles.
sqrt(2**32*2) ~ 100k should be sufficient to trigger the problem.

@GiovanniBussi
Copy link
Member

I am confused. Why are you adding dynamically loaded code? What we need is simply a test that computes the coordination between a group of ~ 100k atoms, and that will crash with master and not crash with the modified version. And it should be a separate test indeed, not using dynamic libraries (otherwise, for instance if we at some point we run on windows this test will not be used).

@GiovanniBussi
Copy link
Member

I would also suggest not to have a test with a small (working) number of atoms. We only need to test the failing case. Coordinations are already tested somewhere else.

@Iximiel
Copy link
Member Author

Iximiel commented Oct 5, 2023

Note: the problem should require much less than 1M particles. sqrt(2**32*2) ~ 100k should be sufficient to trigger the problem.

I'm actually using ~100K atoms, I don't know why I persuaded myself that I am using 1M atoms...

I am confused. Why are you adding dynamically loaded code? What we need is simply a test that computes the coordination between a group of ~ 100k atoms, and that will crash with master and not crash with the modified version. And it should be a separate test indeed, not using dynamic libraries (otherwise, for instance if we at some point we run on windows this test will not be used).

Because I want to test if the problem arises from the neighbor list, and I am testing it in the more Isolated way possible, before testing it in a more complex environment. I'm moving the test to a make-style one to not use dlopen

@Iximiel
Copy link
Member Author

Iximiel commented Oct 5, 2023

@GiovanniBussi
By trying to convert the test into a make style one I think I found the culprit:
The minimum code (in plain C++) you need to reproduce the std::bad_alloc is the following:

#include <vector>
#include <utility>
int main(){ 
    size_t d1=100000;
    size_t dd=(d1*(d1-1))/2;
    std::vector <std::pair<size_t,size_t>> test(dd);
}

That it is similar to what it is happening in the constructor of the NL,

In plumed, without includes and usings would be:

int main(int, char**){
    Pbc pbc{};
    pbc.setBox(PLMD::Tensor({1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0}));
    std::vector<AtomNumber> list0(100000);
    size_t i=0;
    for (auto& an:list0){
      an.setIndex(i);
      ++i;
    }
    bool serial=true;
    bool do_pbc=false;
    Communicator cm{};
    auto nl=NeighborList( list0, serial, do_pbc, pbc,cm);
}

what do you think?

@GiovanniBussi
Copy link
Member

Thanks!

The minimum code (in plain C++) you need to reproduce the std::bad_alloc is the following:

Does this imply that in the original plumed test you received a std::bad_alloc? I thought you had an unexplained crash. If it's a std::bad_alloc it should be reported in the log file. Are you sure it is not?

Then I see that the issue is in storing the neighbor list. If we have more than 232 elements allowed in the list (which is the reason to use long long), and if atoms are stored as 8 bytes integers, this will result on 28(232) = 64 Gb, which is anyway out of reach for most architectures.

@Iximiel
Copy link
Member Author

Iximiel commented Oct 9, 2023

I refined the test more to understand what it is happening:

  • In the master branch you cannot get the std::bad_alloc:
    • reason is in the constructor: members nlist0_, nlist1_ and nallpairs_ are unsigned int, so (with gcc9.4) when you do nallpairs_=nlist0_*nlist1_; you get the narrowed result ((100k*(100k-1))/2 becames '2a052eb0' instead of '12a052eb0' [using hex to show the lost byte]) so in the initialize() method neighbors_ gets 'only' the first 704982704 couples with 100k atoms
  • std::badalloc happens when you change nlist0_, nlist1_ and nallpairs_ to be size_t (that is an unsigned long, accoding to my IDE), so you don't get the narrowing an then initialize() tries to allocate the enormous array of~40 GB (uppercase B) as you said.

I think it is better to risk a bad_alloc (and then construct a gentle way out of the problem) with the dimension stored in size_t variables (while keeping the atom indexes uint32) than risking having a silent error if the users uses more than 65536 atoms (sqrt(2^32-1)~=65535.9).

In this, I do not think I have manage to reproduce the

Error: attempt to subscript container with out-of-bounds index -536758121,
but container only holds 121500 elements.

@GiovanniBussi
Copy link
Member

I agree. indexes should be size_t. The bad_alloc will be reported on the log file. We can to even more: as we know that the critical allocation happens in that constructor, you can put it in a try/catch block.

There's even a way to report this with a nested exception, something like:

try {
  // statement  that might throw
} catch(...) {
  plumed_error_nested() << "An error happened while allocating the neighbor list, please decrease the number of atoms used";
}

commit 1b70cb0
Author: Daniele Rapetti <[email protected]>
Date:   Tue Oct 10 10:56:35 2023 +0200

    Added an exception to catch the memory error

commit 19f37da
Author: Daniele Rapetti <[email protected]>
Date:   Tue Oct 10 10:26:57 2023 +0200

    impreoved readability for NL

commit 6c50341
Author: Daniele Rapetti <[email protected]>
Date:   Tue Oct 10 09:41:14 2023 +0200

    test now check that the results for initialization are the same with the "simpler" algortithm

commit c469d22
Author: Daniele Rapetti <[email protected]>
Date:   Mon Oct 9 16:58:55 2023 +0200

    refactored the NL constructors

commit 3e8e66c
Author: Daniele Rapetti <[email protected]>
Date:   Mon Oct 9 16:07:53 2023 +0200

    reformatted the test

commit 03e5a41
Author: Daniele Rapetti <[email protected]>
Date:   Mon Oct 9 15:12:21 2023 +0200

    integer type change and small optimization of the initialization of the NL

commit 2d5eb2e
Author: Daniele Rapetti <[email protected]>
Date:   Mon Oct 9 14:09:45 2023 +0200

    added a test to control the NL initialization(atomsID)

commit 04f762c
Author: Daniele Rapetti <[email protected]>
Date:   Mon Oct 9 11:45:17 2023 +0200

    added a test to control the NL initialization(size)
@Iximiel
Copy link
Member Author

Iximiel commented Oct 10, 2023

I did some change in the code:

  • the new rt-NeigbourlistInitialization check in a more unit test style that the initialization of the NL work as intended and make it easier to be optimized with more control in the future
  • the new rt-NeigbourlistInitializationError checks that with too many atoms the wanted error is thrown (I used @GiovanniBussi's wording for that error)
  • I changed the initialize to accommodate the try/catch block and to fail faster (if needed) using vector::resize() instead of vector::clear()+vector::push_back(), less initializations, less resizing and only copy-assignments, and now if we want the initialization loop can be #pragma parallel fored
  • Since I was there I did some changes that improves the readability of PLMD::NeighborList
    • I added the aliases using pairIDs=std::pair<unsigned,unsigned> and using pairAtomNumbers=std::pair<PLMD::AtomNumber,PLMD::AtomNumber> I think are clearer, shorter and more meaningful, I put them into PLMD::NeighborList, but maybe can be moved into PLMD namespace, and I'm changing the names if you came with something that communicate better what they are
    • I broke some lines that were too long and also the ones with more than one operation, to have 1 operation per line.
    • I rewrote the constructors, helped by the new tests.

For the initialization performance I wrote this small benchmark:

#include "plumed/tools/AtomNumber.h"
#include "plumed/tools/Communicator.h"
#include "plumed/tools/NeighborList.h"
#include "plumed/tools/Pbc.h"
#include "plumed/tools/Stopwatch.h"
#include <iostream>

using PLMD::AtomNumber;
using PLMD::NeighborList;
using PLMD::Tensor;
using std::vector;

int main(int, char **) {
  PLMD::Stopwatch sw;
  PLMD::Pbc pbc{};
  pbc.setBox(Tensor({1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0}));
  PLMD::Communicator cm{};
  bool serial = true;
  bool do_pbc = false;
  const size_t nat0 = 10000;
  vector<AtomNumber> list0(nat0);
  const size_t nat1 = nat0;
  vector<AtomNumber> list1(nat1);
  {
  auto sww = sw.startStop("setIndex");
  for (size_t i=0;i<nat0;++i) {
    list0[i].setIndex(i);
    list1[i].setIndex(i);
  }
  }
  for (auto rep = 0; rep < 100; ++rep) {
    {
      auto sww = sw.startStop("Single list");
      auto nl = NeighborList(list0, serial, do_pbc, pbc, cm);
    }
    {
      bool do_pair = false;
      auto sww = sw.startStop("Double list, no pairs");
      auto nl = NeighborList(list0, list1, serial, do_pair, do_pbc, pbc, cm);
    }
    {
      bool do_pair = true;
      auto sww = sw.startStop("Double list, with pairs");
      auto nl = NeighborList(list0, list1, serial, do_pair, do_pbc, pbc, cm);
    }
  }
  std::cerr << sw;
}

whose results are:

  • for the master branch
                                              Cycles        Total      Average      Minimum      Maximum
Double list, no pairs                            100    51.821229     0.518212     0.506896     0.572455
Double list, with pairs                          100     0.007873     0.000079     0.000072     0.000103
Single list                                      100    43.047529     0.430475     0.421441     0.464802
  • for this branch
Double list, no pairs                            100    40.824107     0.408241     0.391286     0.466891
Double list, with pairs                          100     0.009438     0.000094     0.000085     0.000147
Single list                                      100    37.022844     0.370228     0.355214     0.448128

@@ -19,7 +19,7 @@
You should have received a copy of the GNU Lesser General Public License
along with plumed. If not, see <http://www.gnu.org/licenses/>.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
#ifndef __PLUMED_tools_NeighborList_h
#ifndef __PLUMED_tools_NeighborList_h //{

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@Iximiel
Copy link
Member Author

Iximiel commented Oct 10, 2023

In local the catch(...) works, here in the CI I get the stack dump along with the error message, I try to workaround that with a regtest_after.
Moreover I am hiding the result of plumed_here that might be trigger a false positive in the future.

@Iximiel
Copy link
Member Author

Iximiel commented Nov 7, 2023

I added the "optional" memory error if NL is compiled on a mac system.
I am not so sure to communicate to the user that they can ignore that with an environment variable, and what is the best wording to do so.

The reason of these modification, I am not 100% sure about this: looks like that on mac if you finish the RAM you start to get the memory allocated in a page file, so you are not getting the memory error I had set up previously, and you get a very heavy performance hit (I think that this is the reason of the tests that getting more than 6 hrs to fail).

@codecov-commenter
Copy link

codecov-commenter commented Nov 7, 2023

Codecov Report

Attention: 698 lines in your changes are missing coverage. Please review.

Comparison is base (d19008f) 85.18% compared to head (daef70d) 84.10%.
Report is 95 commits behind head on master.

❗ Current head daef70d differs from pull request most recent head 58eeab5. Consider uploading reports for the commit 58eeab5 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #977      +/-   ##
==========================================
- Coverage   85.18%   84.10%   -1.09%     
==========================================
  Files         600      602       +2     
  Lines       55555    56104     +549     
==========================================
- Hits        47325    47184     -141     
- Misses       8230     8920     +690     
Files Coverage Δ
src/bias/PBMetaD.cpp 88.03% <ø> (-0.06%) ⬇️
src/cltools/Driver.cpp 90.05% <100.00%> (-0.02%) ⬇️
src/cltools/SimpleMD.cpp 100.00% <ø> (ø)
src/core/ActionRegister.h 100.00% <100.00%> (ø)
src/core/ActionWithArguments.cpp 89.70% <100.00%> (-1.03%) ⬇️
src/core/CLTool.cpp 84.94% <100.00%> (ø)
src/core/PlumedMain.cpp 89.86% <100.00%> (+0.22%) ⬆️
src/mapping/PathTools.cpp 98.31% <100.00%> (ø)
src/opes/OPESmetad.cpp 95.61% <100.00%> (-0.02%) ⬇️
src/tools/FileBase.cpp 100.00% <100.00%> (ø)
... and 12 more

... and 277 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Iximiel Iximiel marked this pull request as ready for review November 13, 2023 11:07
@carlocamilloni carlocamilloni merged commit 635a877 into plumed:master Nov 30, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants