Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unweighted sampling #21

Closed
rstub opened this issue Mar 14, 2019 · 7 comments
Closed

unweighted sampling #21

rstub opened this issue Mar 14, 2019 · 7 comments
Labels
enhancement New feature or request

Comments

@rstub
Copy link
Member

rstub commented Mar 14, 2019

Prerequisite for #18

@rstub rstub added the enhancement New feature or request label Mar 14, 2019
@LTLA
Copy link
Contributor

LTLA commented Mar 15, 2019

On this note: you are already aware of this, but this directly rebukes our method for sampling random 32-bit integers. It might be worth asking R core to expose the Mersenne Twister stream directly (i.e., give us a function that just yields 32-bit integers). This would only be fair if they're not letting us use <random>.

@rstub
Copy link
Member Author

rstub commented Mar 15, 2019

I would not say "rebukes our method", since sampling in R core and generating random seeds have slightly different constraints: The different RNGs in R produce floating point numbers in (0,1) (R does not use [0,1)) in slightly different ways. Most of them generate random integers in [0, MAX] and divide by MAX + 1 ("Wichmann-Hill" is the exception here). For Mersenne-Twister MAX is UINT32_MAX, but for other RNGs MAX is smaller. For the two RNGs from TAOCP MAX is only 2^30-1!

If one where to multiply a floating point number generated this way with 2^32, this should be no problem when MT (and probably WH) is used. All possible unit32_t are generated with equal probability. For the other RNGs, there will be a few uint32_t that cannot be generated. For the two RNGs from TAOCP, many uint32_t cannot be generated. For sampling in R core, which has to work properly with any supported RNG, this would be disastrous. I think that's why they decided to use two random draws to generate 32 bits.

In our case, I think it is not so bad. For the default MT (and probably WH) the full state space of the RNGs can be reached. For most other RNGs, there are only a few states (between one and about three hundred) that cannot be reached. For the two RNGs from TAOCP the possible state space would be significantly reduced, e.g. 2^60 instead of 2^64 possible states. For the task at hand this still seems to be sufficient from my point of view. Anyway, there are still some thing one could do:

  1. Document that best results are achieved when R's default RNG MT is used.
  2. Provide an R wrapper for generateSeedVectors that issues a warning or message when an unsuitable RNG is used.
  3. Use R core's method in R_random_u32, i.e. generate 32 bits from two random draws.

I think 1. would be a good idea and 3. seems unnecessary. I am undecided when it comes to 2.

@LTLA
Copy link
Contributor

LTLA commented Mar 15, 2019

If the linked PDF is using MT, we would expect a maximum possible bias of ~2-fold in terms of the ratio between the frequencies of the most observed and least observed 32-bit integer. This seems... pretty bad.

Another approach would be to assume that R core will fix R_unif_index, and use that instead in our R_random_u32. Then it wouldn't be our problem anymore.

@rstub
Copy link
Member Author

rstub commented Mar 15, 2019

I am not sure where one would get a bias in our case. MT produces any 32 bit integer with equal probability. This gets converted to a double by R, and we convert it back to a 32 bit integer. This conversion is exact, as demonstrated by the following program, which tries the conversion for all possible 32 bit integers:

#include <cstdint>
#include <iostream>

double to_double(uint32_t i) {
    return (double)i * 2.3283064365386963e-10; // c.f. MT_genrand
}

uint32_t to_uint(double d) { // c.f. R_random_u32
    constexpr double upper_limit = 4294967296;
    double val = d * upper_limit;
    if (val >= upper_limit) { val=0; } // Absolutely avoid overflow.
    return static_cast<uint32_t>(val);
}

int main() {
    uint32_t i = 0;
    do {
        double d = to_double(i);
        uint32_t j = to_uint(d);
        if (i != j) {
            std::cout << i << " and " << j << " are not the same" << std::endl;
            return 1;
        }
        if ( i % (1024*1024*512) == 0 )
            std::cout << "Reached: " << i << std::endl;
        ++i;
    } while (i != 0);
    std::cout << "All good" << std::endl;
    return 0;
}

Output:

Reached: 0
Reached: 536870912
Reached: 1073741824
Reached: 1610612736
Reached: 2147483648
Reached: 2684354560
Reached: 3221225472
Reached: 3758096384
All good

The bias that results from the float to int conversion comes into play when the multiplicative constant does not evenly divide the number of different floating point numbers, c.f. https://blog.daqana.com/en/dqsample-a-bias-free-alternative-to-rs-sample-function/.

Another approach would be to assume that R core will fix R_unif_index, and use that instead in our R_random_u32. Then it wouldn't be our problem anymore.

That's an excellent idea! R_unif_index is fixed in R-devel (will be 3.6.0). And for prior versions (or Sample_kind == ROUNDING), it at least eliminates the problem with the two RNGs from TAOCP, since there already two random draws are used.

@LTLA
Copy link
Contributor

LTLA commented Mar 16, 2019

If you do decide to go with R_unif_index, it would be a good idea to check in with R core regarding whether it is an recognized and supported entry point into the C API. I don't see it mentioned in "Writing R extensions", even though it is available from R_ext/Random.h, so it would be best to confirm its (continued) availability. At the very least, R core will know that we're using it, so we'll get on their radar.

@rstub rstub mentioned this issue Mar 22, 2019
2 tasks
@rstub
Copy link
Member Author

rstub commented Mar 25, 2019

I have asked on r-devel concerning R_unif_index. No response so far.

@rstub
Copy link
Member Author

rstub commented Apr 18, 2019

fixed in #23 and #24

BTW, R_unif_index became part of the official API.

@rstub rstub closed this as completed Apr 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants