-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unweighted sampling #21
Comments
On this note: you are already aware of this, but this directly rebukes our method for sampling random 32-bit integers. It might be worth asking R core to expose the Mersenne Twister stream directly (i.e., give us a function that just yields 32-bit integers). This would only be fair if they're not letting us use |
I would not say "rebukes our method", since sampling in R core and generating random seeds have slightly different constraints: The different RNGs in R produce floating point numbers in If one where to multiply a floating point number generated this way with 2^32, this should be no problem when MT (and probably WH) is used. All possible In our case, I think it is not so bad. For the default MT (and probably WH) the full state space of the RNGs can be reached. For most other RNGs, there are only a few states (between one and about three hundred) that cannot be reached. For the two RNGs from TAOCP the possible state space would be significantly reduced, e.g. 2^60 instead of 2^64 possible states. For the task at hand this still seems to be sufficient from my point of view. Anyway, there are still some thing one could do:
I think 1. would be a good idea and 3. seems unnecessary. I am undecided when it comes to 2. |
If the linked PDF is using MT, we would expect a maximum possible bias of ~2-fold in terms of the ratio between the frequencies of the most observed and least observed 32-bit integer. This seems... pretty bad. Another approach would be to assume that R core will fix |
I am not sure where one would get a bias in our case. MT produces any 32 bit integer with equal probability. This gets converted to a double by R, and we convert it back to a 32 bit integer. This conversion is exact, as demonstrated by the following program, which tries the conversion for all possible 32 bit integers: #include <cstdint>
#include <iostream>
double to_double(uint32_t i) {
return (double)i * 2.3283064365386963e-10; // c.f. MT_genrand
}
uint32_t to_uint(double d) { // c.f. R_random_u32
constexpr double upper_limit = 4294967296;
double val = d * upper_limit;
if (val >= upper_limit) { val=0; } // Absolutely avoid overflow.
return static_cast<uint32_t>(val);
}
int main() {
uint32_t i = 0;
do {
double d = to_double(i);
uint32_t j = to_uint(d);
if (i != j) {
std::cout << i << " and " << j << " are not the same" << std::endl;
return 1;
}
if ( i % (1024*1024*512) == 0 )
std::cout << "Reached: " << i << std::endl;
++i;
} while (i != 0);
std::cout << "All good" << std::endl;
return 0;
} Output:
The bias that results from the float to int conversion comes into play when the multiplicative constant does not evenly divide the number of different floating point numbers, c.f. https://blog.daqana.com/en/dqsample-a-bias-free-alternative-to-rs-sample-function/.
That's an excellent idea! |
If you do decide to go with |
I have asked on r-devel concerning |
Prerequisite for #18
The text was updated successfully, but these errors were encountered: