Neighbourlist integer size #977

Iximiel · 2023-10-05T13:50:52Z

Description

I found that if a user ask a full neighbor list with a more than circa ~1M atoms (meaning that the neighbor list would have nat(nat-1)/2 elements) plumed will crash, this may be a bug due to the integers type on some indexes:
By compiling with --enable-debug --enable-debug-glibcxx I get the following message

/opt/ohpc/pub/compiler/gcc/9.4.0/include/c++/9.4.0/debug/vector:434:
In function:
    std::__debug::vector<_Tp, _Allocator>::const_reference 
    std::__debug::vector<_Tp, 
    _Allocator>::operator[](std::__debug::vector<_Tp, 
    _Allocator>::size_type) const [with _Tp = PLMD::AtomNumber; _Allocator = 
    std::allocator<PLMD::AtomNumber>; std::__debug::vector<_Tp, 
    _Allocator>::const_reference = const PLMD::AtomNumber&; 
    std::__debug::vector<_Tp, _Allocator>::size_type = long unsigned int]

Error: attempt to subscript container with out-of-bounds index -536758121, 
but container only holds 121500 elements.

Objects involved in the operation:
    sequence "this" @ 0x0x2799798 {
      type = std::__debug::vector<PLMD::AtomNumber, std::allocator<PLMD::AtomNumber> >;
    }

the plumed.dat is

cv: COORDINATION GROUPA=1-121500 R_0=1 NOPBC

PRINT ARG=* FILE=colvar_121500 FMT=%8.4f STRIDE=1

I'm setting up this as a draft because I am unsure if the problem is only on the NL or is more "deep"

Target release

I would like my code to appear in release 2.9

Type of contribution

changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
changes to a module not authored by you
new module contribution or edit of a module authored by you

Copyright

I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.

the module I added or modified contains a COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.

Tests

I added a new regtest or modified an existing regtest to validate my changes.
I verified that all regtests are passed successfully on GitHub Actions.

GiovanniBussi · 2023-10-05T14:06:06Z

@Iximiel writing a regtest that fails with master and fails with this is a bit tricky, because we do not want to store a trajectory with 1M atoms. But you can write a simple awk script that generates such trajectory on the fly (see e.g. basic/rt-fix-226). A single frame is sufficient. Alternatively we might at some point hard code some sample trajectory in the plumed driver, so that they can be directly generated in memory without the need to write a file.

Iximiel · 2023-10-05T14:21:37Z

@Iximiel writing a regtest that fails with master and fails with this is a bit tricky, because we do not want to store a trajectory with 1M atoms. But you can write a simple awk script that generates such trajectory on the fly (see e.g. basic/rt-fix-226). A single frame is sufficient. Alternatively we might at some point hard code some sample trajectory in the plumed driver, so that they can be directly generated in memory without the need to write a file.

I did used the trick from rt-fix-226, so no new trajectories have been pushed.
The tests are based on rt15 (the one that tests mklib), but may or should be rewritten in a more unit-test-like fashion to focus better on the problem (and generate 1M of atoms into memory instead of reading them, after generating them with awk).

GiovanniBussi · 2023-10-05T14:23:26Z

Note: the problem should require much less than 1M particles.
sqrt(2**32*2) ~ 100k should be sufficient to trigger the problem.

GiovanniBussi · 2023-10-05T14:26:11Z

I am confused. Why are you adding dynamically loaded code? What we need is simply a test that computes the coordination between a group of ~ 100k atoms, and that will crash with master and not crash with the modified version. And it should be a separate test indeed, not using dynamic libraries (otherwise, for instance if we at some point we run on windows this test will not be used).

GiovanniBussi · 2023-10-05T14:30:12Z

I would also suggest not to have a test with a small (working) number of atoms. We only need to test the failing case. Coordinations are already tested somewhere else.

Iximiel · 2023-10-05T15:12:23Z

Note: the problem should require much less than 1M particles. sqrt(2**32*2) ~ 100k should be sufficient to trigger the problem.

I'm actually using ~100K atoms, I don't know why I persuaded myself that I am using 1M atoms...

I am confused. Why are you adding dynamically loaded code? What we need is simply a test that computes the coordination between a group of ~ 100k atoms, and that will crash with master and not crash with the modified version. And it should be a separate test indeed, not using dynamic libraries (otherwise, for instance if we at some point we run on windows this test will not be used).

Because I want to test if the problem arises from the neighbor list, and I am testing it in the more Isolated way possible, before testing it in a more complex environment. I'm moving the test to a make-style one to not use dlopen

Iximiel · 2023-10-05T17:28:04Z

@GiovanniBussi
By trying to convert the test into a make style one I think I found the culprit:
The minimum code (in plain C++) you need to reproduce the std::bad_alloc is the following:

#include <vector>
#include <utility>
int main(){ 
    size_t d1=100000;
    size_t dd=(d1*(d1-1))/2;
    std::vector <std::pair<size_t,size_t>> test(dd);
}

That it is similar to what it is happening in the constructor of the NL,

In plumed, without includes and usings would be:

int main(int, char**){
    Pbc pbc{};
    pbc.setBox(PLMD::Tensor({1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0}));
    std::vector<AtomNumber> list0(100000);
    size_t i=0;
    for (auto& an:list0){
      an.setIndex(i);
      ++i;
    }
    bool serial=true;
    bool do_pbc=false;
    Communicator cm{};
    auto nl=NeighborList( list0, serial, do_pbc, pbc,cm);
}

what do you think?

GiovanniBussi · 2023-10-08T23:22:16Z

Thanks!

The minimum code (in plain C++) you need to reproduce the std::bad_alloc is the following:

Does this imply that in the original plumed test you received a std::bad_alloc? I thought you had an unexplained crash. If it's a std::bad_alloc it should be reported in the log file. Are you sure it is not?

Then I see that the issue is in storing the neighbor list. If we have more than 232 elements allowed in the list (which is the reason to use long long), and if atoms are stored as 8 bytes integers, this will result on 28(232) = 64 Gb, which is anyway out of reach for most architectures.

Iximiel · 2023-10-09T08:25:17Z

I refined the test more to understand what it is happening:

In the master branch you cannot get the std::bad_alloc:
- reason is in the constructor: members nlist0_, nlist1_ and nallpairs_ are unsigned int, so (with gcc9.4) when you do nallpairs_=nlist0_*nlist1_; you get the narrowed result ((100k*(100k-1))/2 becames '2a052eb0' instead of '12a052eb0' [using hex to show the lost byte]) so in the initialize() method neighbors_ gets 'only' the first 704982704 couples with 100k atoms
std::badalloc happens when you change nlist0_, nlist1_ and nallpairs_ to be size_t (that is an unsigned long, accoding to my IDE), so you don't get the narrowing an then initialize() tries to allocate the enormous array of~40 GB (uppercase B) as you said.

I think it is better to risk a bad_alloc (and then construct a gentle way out of the problem) with the dimension stored in size_t variables (while keeping the atom indexes uint32) than risking having a silent error if the users uses more than 65536 atoms (sqrt(2^32-1)~=65535.9).

In this, I do not think I have manage to reproduce the

Error: attempt to subscript container with out-of-bounds index -536758121,
but container only holds 121500 elements.

GiovanniBussi · 2023-10-09T14:58:05Z

I agree. indexes should be size_t. The bad_alloc will be reported on the log file. We can to even more: as we know that the critical allocation happens in that constructor, you can put it in a try/catch block.

There's even a way to report this with a nested exception, something like:

try {
  // statement  that might throw
} catch(...) {
  plumed_error_nested() << "An error happened while allocating the neighbor list, please decrease the number of atoms used";
}

commit 1b70cb0 Author: Daniele Rapetti <[email protected]> Date: Tue Oct 10 10:56:35 2023 +0200 Added an exception to catch the memory error commit 19f37da Author: Daniele Rapetti <[email protected]> Date: Tue Oct 10 10:26:57 2023 +0200 impreoved readability for NL commit 6c50341 Author: Daniele Rapetti <[email protected]> Date: Tue Oct 10 09:41:14 2023 +0200 test now check that the results for initialization are the same with the "simpler" algortithm commit c469d22 Author: Daniele Rapetti <[email protected]> Date: Mon Oct 9 16:58:55 2023 +0200 refactored the NL constructors commit 3e8e66c Author: Daniele Rapetti <[email protected]> Date: Mon Oct 9 16:07:53 2023 +0200 reformatted the test commit 03e5a41 Author: Daniele Rapetti <[email protected]> Date: Mon Oct 9 15:12:21 2023 +0200 integer type change and small optimization of the initialization of the NL commit 2d5eb2e Author: Daniele Rapetti <[email protected]> Date: Mon Oct 9 14:09:45 2023 +0200 added a test to control the NL initialization(atomsID) commit 04f762c Author: Daniele Rapetti <[email protected]> Date: Mon Oct 9 11:45:17 2023 +0200 added a test to control the NL initialization(size)

Iximiel · 2023-10-10T09:28:01Z

I did some change in the code:

the new rt-NeigbourlistInitialization check in a more unit test style that the initialization of the NL work as intended and make it easier to be optimized with more control in the future
the new rt-NeigbourlistInitializationError checks that with too many atoms the wanted error is thrown (I used @GiovanniBussi's wording for that error)
I changed the initialize to accommodate the try/catch block and to fail faster (if needed) using vector::resize() instead of vector::clear()+vector::push_back(), less initializations, less resizing and only copy-assignments, and now if we want the initialization loop can be #pragma parallel fored
Since I was there I did some changes that improves the readability of PLMD::NeighborList
- I added the aliases using pairIDs=std::pair<unsigned,unsigned> and using pairAtomNumbers=std::pair<PLMD::AtomNumber,PLMD::AtomNumber> I think are clearer, shorter and more meaningful, I put them into PLMD::NeighborList, but maybe can be moved into PLMD namespace, and I'm changing the names if you came with something that communicate better what they are
- I broke some lines that were too long and also the ones with more than one operation, to have 1 operation per line.
- I rewrote the constructors, helped by the new tests.

For the initialization performance I wrote this small benchmark:

#include "plumed/tools/AtomNumber.h"
#include "plumed/tools/Communicator.h"
#include "plumed/tools/NeighborList.h"
#include "plumed/tools/Pbc.h"
#include "plumed/tools/Stopwatch.h"
#include <iostream>

using PLMD::AtomNumber;
using PLMD::NeighborList;
using PLMD::Tensor;
using std::vector;

int main(int, char **) {
  PLMD::Stopwatch sw;
  PLMD::Pbc pbc{};
  pbc.setBox(Tensor({1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0}));
  PLMD::Communicator cm{};
  bool serial = true;
  bool do_pbc = false;
  const size_t nat0 = 10000;
  vector<AtomNumber> list0(nat0);
  const size_t nat1 = nat0;
  vector<AtomNumber> list1(nat1);
  {
  auto sww = sw.startStop("setIndex");
  for (size_t i=0;i<nat0;++i) {
    list0[i].setIndex(i);
    list1[i].setIndex(i);
  }
  }
  for (auto rep = 0; rep < 100; ++rep) {
    {
      auto sww = sw.startStop("Single list");
      auto nl = NeighborList(list0, serial, do_pbc, pbc, cm);
    }
    {
      bool do_pair = false;
      auto sww = sw.startStop("Double list, no pairs");
      auto nl = NeighborList(list0, list1, serial, do_pair, do_pbc, pbc, cm);
    }
    {
      bool do_pair = true;
      auto sww = sw.startStop("Double list, with pairs");
      auto nl = NeighborList(list0, list1, serial, do_pair, do_pbc, pbc, cm);
    }
  }
  std::cerr << sw;
}

whose results are:

for the master branch

                                              Cycles        Total      Average      Minimum      Maximum
Double list, no pairs                            100    51.821229     0.518212     0.506896     0.572455
Double list, with pairs                          100     0.007873     0.000079     0.000072     0.000103
Single list                                      100    43.047529     0.430475     0.421441     0.464802

for this branch

Double list, no pairs                            100    40.824107     0.408241     0.391286     0.466891
Double list, with pairs                          100     0.009438     0.000094     0.000085     0.000147
Single list                                      100    37.022844     0.370228     0.355214     0.448128

src/tools/NeighborList.h

@@ -19,7 +19,7 @@
   You should have received a copy of the GNU Lesser General Public License
   along with plumed.  If not, see <http://www.gnu.org/licenses/>.
 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
-#ifndef __PLUMED_tools_NeighborList_h
+#ifndef __PLUMED_tools_NeighborList_h //{


Iximiel · 2023-10-10T12:58:00Z

In local the catch(...) works, here in the CI I get the stack dump along with the error message, I try to workaround that with a regtest_after.
Moreover I am hiding the result of plumed_here that might be trigger a false positive in the future.

Iximiel · 2023-11-07T12:16:03Z

I added the "optional" memory error if NL is compiled on a mac system.
I am not so sure to communicate to the user that they can ignore that with an environment variable, and what is the best wording to do so.

The reason of these modification, I am not 100% sure about this: looks like that on mac if you finish the RAM you start to get the memory allocated in a page file, so you are not getting the memory error I had set up previously, and you get a very heavy performance hit (I think that this is the reason of the tests that getting more than 6 hrs to fail).

… branch

codecov-commenter · 2023-11-07T13:34:57Z

Codecov Report

Attention: 698 lines in your changes are missing coverage. Please review.

Comparison is base (d19008f) 85.18% compared to head (daef70d) 84.10%.
Report is 95 commits behind head on master.

❗ Current head daef70d differs from pull request most recent head 58eeab5. Consider uploading reports for the commit 58eeab5 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #977      +/-   ##
==========================================
- Coverage   85.18%   84.10%   -1.09%     
==========================================
  Files         600      602       +2     
  Lines       55555    56104     +549     
==========================================
- Hits        47325    47184     -141     
- Misses       8230     8920     +690

Files	Coverage Δ
src/bias/PBMetaD.cpp	`88.03% <ø> (-0.06%)`	⬇️
src/cltools/Driver.cpp	`90.05% <100.00%> (-0.02%)`	⬇️
src/cltools/SimpleMD.cpp	`100.00% <ø> (ø)`
src/core/ActionRegister.h	`100.00% <100.00%> (ø)`
src/core/ActionWithArguments.cpp	`89.70% <100.00%> (-1.03%)`	⬇️
src/core/CLTool.cpp	`84.94% <100.00%> (ø)`
src/core/PlumedMain.cpp	`89.86% <100.00%> (+0.22%)`	⬆️
src/mapping/PathTools.cpp	`98.31% <100.00%> (ø)`
src/opes/OPESmetad.cpp	`95.61% <100.00%> (-0.02%)`	⬇️
src/tools/FileBase.cpp	`100.00% <100.00%> (ø)`
... and 12 more

... and 277 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Iximiel added 4 commits October 5, 2023 12:20

set up test for the NL size problem

2a0f0e3

updated the test

9f4adf8

added a test for 1M atoms

dc59a39

trying to change to size_t the neighbour list

2076f50

forgot the override on the calculate() function

4eafea1

Iximiel added 3 commits October 10, 2023 10:59

removed old regtest

78bdcd7

changed to one line per thing also int NL.h

fcd9b7f

github-advanced-security bot found potential problems Oct 10, 2023

View reviewed changes

Now the test will ignore the stackdump if it is plotted

5228d76

Iximiel added 4 commits November 7, 2023 12:43

changed how the "too much memory" for NL on mac works

feb323f

updated the changelog

e93bc42

added a comment to explain the mac exception

c1d72f6

some more explanation on the arbitrary stop

88d32c7

Iximiel added 3 commits November 7, 2023 13:18

changed the changelog hoping it fast forward to the one in the master…

228ef2f

… branch

changed the changelog hoping it fast forward to the one in the master…

f551b9f

… branch

corrected plumed_merror usage in NL

58eeab5

Iximiel marked this pull request as ready for review November 13, 2023 11:07

carlocamilloni merged commit 635a877 into plumed:master Nov 30, 2023
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Neighbourlist integer size #977

Neighbourlist integer size #977

Iximiel commented Oct 5, 2023 •

edited

Loading

GiovanniBussi commented Oct 5, 2023

Iximiel commented Oct 5, 2023

GiovanniBussi commented Oct 5, 2023

GiovanniBussi commented Oct 5, 2023

GiovanniBussi commented Oct 5, 2023

Iximiel commented Oct 5, 2023

Iximiel commented Oct 5, 2023 •

edited

Loading

GiovanniBussi commented Oct 8, 2023

Iximiel commented Oct 9, 2023

GiovanniBussi commented Oct 9, 2023

Iximiel commented Oct 10, 2023 •

edited

Loading

Iximiel commented Oct 10, 2023

Iximiel commented Nov 7, 2023

codecov-commenter commented Nov 7, 2023 •

edited

Loading

Neighbourlist integer size #977

Neighbourlist integer size #977

Conversation

Iximiel commented Oct 5, 2023 • edited Loading

Description

Target release

Type of contribution

Copyright

Tests

GiovanniBussi commented Oct 5, 2023

Iximiel commented Oct 5, 2023

GiovanniBussi commented Oct 5, 2023

GiovanniBussi commented Oct 5, 2023

GiovanniBussi commented Oct 5, 2023

Iximiel commented Oct 5, 2023

Iximiel commented Oct 5, 2023 • edited Loading

GiovanniBussi commented Oct 8, 2023

Iximiel commented Oct 9, 2023

GiovanniBussi commented Oct 9, 2023

Iximiel commented Oct 10, 2023 • edited Loading

Iximiel commented Oct 10, 2023

Iximiel commented Nov 7, 2023

codecov-commenter commented Nov 7, 2023 • edited Loading

Codecov Report

Iximiel commented Oct 5, 2023 •

edited

Loading

Iximiel commented Oct 5, 2023 •

edited

Loading

Iximiel commented Oct 10, 2023 •

edited

Loading

codecov-commenter commented Nov 7, 2023 •

edited

Loading