
explore filehash package as a means to use large PVs in a memory-efficient manner #23

Open
kostrzewa opened this issue Oct 21, 2019 · 12 comments

Comments

@kostrzewa
Member

I would like to play around with the filehash package to see whether we can use it to lower the memory consumption of pv objects. I've hit a point where I've performed some simple one-parameter fits and end up with a 22 GB object, and that's just for the test ensemble (to be fair, there are many variations, but I'm still a bit perplexed about the actual size...).

side remark: There also seems to be something fishy going on in that the size of the saved object seems to depend on current memory load, as if the entire environment were being saved to disc rather than just the object in question.

@martin-ueding
Contributor

We might want to change it from load and save to readRDS and saveRDS. It could actually be that I store the whole environment without realizing it because of all the non-local variable magic.

@kostrzewa
Member Author

> We might want to change it from load and save to readRDS and saveRDS. It could actually be that I store the whole environment without realizing it because of all the non-local variable magic.

Hmm, could be... The write part is easy, the load part might be a bit trickier...
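
For reference, a minimal sketch of what that swap looks like (the pv name is just illustrative): save/load restore named objects into an environment, whereas saveRDS/readRDS handle exactly one object and hand it back as a return value, which is why the load side needs slightly more care.

```r
# save()/load(): stores named objects and silently recreates them by name
# in the calling environment when loaded.
save(pv, file = "result.RData")
load("result.RData")            # 'pv' reappears in the current environment

# saveRDS()/readRDS(): stores a single object without its name; the caller
# decides what to call it when reading it back.
saveRDS(pv, "result.rds")
pv2 <- readRDS("result.rds")
```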

@kostrzewa
Member Author

I thought a bit more about a usable piecewise serialisation and it's actually not too difficult to make this happen. The trick is to "instantiate" actual data only inside functions, such that one can let it go out of scope, truly freeing the associated memory. I fear, however, that it doesn't scale particularly well...

The main problem occurs when left-folding the inner_outer_join. In the implementation part, after the parameter data frames have been joined, iteration would proceed as is, but instead of directly accessing the value elements of a and b, a function would be called which reads value[[id_a]] and value[[id_b]] from disc, concatenates the two and writes the result to a new file (indexed with the output index i); see the sketch below. As the three variables go out of scope, memory would be freed immediately and one can move on to the next value to be created. There might have to be some caching of filenames or something like that, I'm not sure about the details yet.
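
A minimal sketch of how such a join step could look; the helper names and the value_%06d.rds naming scheme are placeholders, not existing paramvalf code.

```r
value_path <- function(dir, id) {
  file.path(dir, sprintf("value_%06d.rds", id))
}

join_values_on_disc <- function(id_a, id_b, i, dir) {
  a <- readRDS(value_path(dir, id_a))   # load the two input value elements
  b <- readRDS(value_path(dir, id_b))
  saveRDS(c(a, b), value_path(dir, i))  # concatenate and write under the output index i
  invisible(NULL)                       # a and b go out of scope here, freeing their memory
}
```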

After as many calls of inner_outer_join_impl as necessary, the param of the fully joined PV resides in memory while all its values are on disc (with some suitable filename to be decided, perhaps derived from the names of the objects passed). This is of course where the scaling problem originates, as the same data (from different files) will be loaded again and again while the list of PVs to join is processed and the value list grows in length.

If this object was created during a pv_call, each value element will now be read from disc again (in a utility function, such that we can let it go out of scope afterwards) and func will be called on the param_row and the newly loaded value_row.

The value part of the result of func is immediately written to disc and goes out of scope before the next element is processed. At this stage some reference to the filename might be returned instead (I'm not sure about the details yet, or how to handle the actual post_process call...).

The return value of the pv_call (whether through parent.frame() magic or the actual return value) is a PV which contains just the param and some kind of list of references to the files containing the data.
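
Putting the last few paragraphs together, a disc-backed pv_call might look roughly like the following sketch; pv_call_on_disc and value_path are hypothetical names, and post_process is left out entirely.

```r
pv_call_on_disc <- function(func, param, value_files, dir) {
  refs <- lapply(seq_len(nrow(param)), function(i) {
    value_row <- readRDS(value_files[[i]])  # read one value element from disc
    res <- func(param[i, ], value_row)      # apply the user function to it
    out_file <- value_path(dir, i)
    saveRDS(res, out_file)                  # write the result straight back to disc
    out_file                                # keep only a file reference in memory
  })                                        # value_row and res are freed after each call
  list(param = param, value = refs)         # a PV with param in memory and values on disc
}
```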

This all means that unless pv_call is called on just one pv, tons and tons of disc I/O will be generated. Even then, all elements of this one object will be read from disc and discarded right after.

I'm honestly not sure if I even want to attempt this given the foreseeable issues...

@kostrzewa
Member Author

Finally, pv_save would attempt to somehow clean up the mess, storing the param element as is done now and renaming the value files sensibly.

@kostrzewa
Member Author

Maybe all of this can be realised without too much hassle using filehash, I'm not quite sure yet...
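
For reference, the filehash primitives that would be involved look roughly like this; the database name and keys are only examples.

```r
library(filehash)

dbCreate("pv_values", type = "RDS")      # type "RDS" keeps one file per key in a directory
db <- dbInit("pv_values", type = "RDS")

some_value <- rnorm(1e6)                 # stand-in for a large value element
dbInsert(db, "value_0001", some_value)   # store it under a key
x <- dbFetch(db, "value_0001")           # read it back only when it is actually needed
dbDelete(db, "value_0001")               # drop it again once it is no longer required
```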

@kostrzewa
Member Author

Alright, the 22 GB object above was storing 72 x 24 = 1728 return values of bootstrap.nlsfit; there is no way that can possibly amount to that much, right? (Even for a crazy fit with many parameters.)

After extracting just t0 and t and computing an error on the single parameter, I'm left with 21 MB. The problem is that when the bootstrap.nlsfit return value is serialised, it pulls the entire environment with it.

@martin-ueding
Contributor

Oh, perhaps because the bootstrap fit contains a closure at some point? Perhaps that pulls in the full environment, because R does not know which part of the environment is needed for the closure to function. In principle the closure could make up variable names and do parent.frame()[[some_var_name]]. So whenever we generate a closure as a fit function, we might have this problem. I guess there is a reason why other languages have explicit lambda capture syntax.
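
A small, self-contained illustration of that effect (a made-up example, not the actual fit code): the closure's enclosing environment, including data the function never touches, ends up in the serialised output.

```r
make_fit_fn <- function() {
  big <- rnorm(1e7)               # roughly 80 MB that the returned function never uses
  function(par, x) par[1] * x     # closure whose enclosing environment still holds 'big'
}

fn <- make_fit_fn()
length(serialize(fn, NULL))       # far larger than the function itself, because the
                                  # whole enclosing environment is serialised along with it
```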

@kostrzewa
Member Author

Yup, see HISKP-LQCD/hadron#187

@urbach
Member

urbach commented Jun 10, 2020

Sorry, I don't fully understand the problem with the function and the scope yet. Where is the whole environment stored, and why?

Concerning the memory problem: can one decide in paramvalf at which point the real value is needed?

@urbach
Member

urbach commented Jun 10, 2020

I mean, assuming we store only the param part and some hashes for the values, how do we decide when we need the real value?

@urbach
Member

urbach commented Jun 11, 2020

I think I understand the scope problem now... What about putting the function in its own new environment?
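
For illustration, putting the function into its own fresh environment could look roughly like this (again a made-up example); any variables the closure genuinely needs have to be copied over explicitly.

```r
make_fit_fn <- function() {
  big <- rnorm(1e7)                     # large and not needed by the returned function
  offset <- 0.5                         # small and actually used
  function(par, x) par[1] * x + offset
}

fn <- make_fit_fn()
e <- new.env(parent = globalenv())
e$offset <- environment(fn)$offset      # copy over only what the closure really uses
environment(fn) <- e                    # the fresh environment no longer contains 'big'
length(serialize(fn, NULL))             # now small
```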

@martin-ueding
Contributor

The way I tried the value serialization was that instead of the actual value, an S3 object lazy_value would be stored, which is just a filename with a class attribute. In its current form, the pv_call function basically does this:

```r
for (i in 1:nrow(param)) {
  result$value[[i]] <- func(param[i, ], value[[i]])
}
```

With the value serialization it would check whether value[[i]] was such a lazy value or the actual thing. If it was a lazy value, it would load it right there, call the given function and also serialize the result if its size was significant, roughly as in the sketch below.
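
A rough sketch of that branch; the lazy_value constructor, the materialise helper and size_threshold are illustrative names rather than the actual implementation.

```r
lazy_value <- function(path) {
  structure(list(path = path), class = "lazy_value")
}

materialise <- function(x) {
  if (inherits(x, "lazy_value")) readRDS(x$path) else x
}

size_threshold <- 1e6                         # illustrative cut-off in bytes

for (i in seq_len(nrow(param))) {
  v <- materialise(value[[i]])                # load from disc only if it is a lazy value
  res <- func(param[i, ], v)
  if (object.size(res) > size_threshold) {    # push large results back out to disc
    path <- tempfile(fileext = ".rds")
    saveRDS(res, path)
    res <- lazy_value(path)
  }
  result$value[[i]] <- res
}
```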

The problem is that even when calling rm() and gc(), somehow not everything was cleaned up. Also one has to be careful with multiprocessing as some environments get copied over to the other processes. This likely means that currently the memory scaling is rather bad. Loading only certain things in isolated environments might be able to get this under control.
