handle sample types in linear_pool #27

Closed
elray1 opened this issue Aug 4, 2023 · 6 comments
elray1 commented Aug 4, 2023

We pretty much settled on the desired functionality here in discussion on issue #20, but I'm splitting implementation into a separate issue.

elray1 commented Aug 10, 2023

More detailed ideas about handling samples. See comments here and below where we have talked about this before.

Previous decision: This function will not do any validations related to potentially different dependence structures represented by the component models. If a hub cares about enforcing that models use the same dependence structure for their samples, this will be specified in the hub's config, and will be checked at the time that model outputs are submitted to the hub, so we don't need to do validations related to this in this function.

There are two cases.

Case 1: collect component samples

If all three of the conditions below are satisfied, this function simply collects the samples from the component models and updates the sample indices to ensure that they are different for different component models (a sketch follows this list):

  1. equal weights for all models,
  2. the same number of samples from each component model, and
  3. no limit on the number of samples the ensemble is allowed to produce.
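
A minimal sketch of Case 1, assuming component outputs arrive as a hubverse-style data frame with model_id, output_type, output_type_id, and value columns; the function name and the ensemble's model_id below are placeholders, not the actual implementation:

```r
library(dplyr)

# Sketch only: keep the sample rows from all component models and make the
# sample indices distinct across models by prefixing them with the model id.
collect_component_samples <- function(model_out_tbl) {
  model_out_tbl |>
    dplyr::filter(output_type == "sample") |>
    dplyr::mutate(
      output_type_id = paste(model_id, output_type_id, sep = "_"),
      model_id = "linear-pool"  # placeholder name for the ensemble model
    )
}
```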

Case 2: do some sampling

However, if any of those conditions are not satisfied, we have to do something else:

  • Draw a sample of the specified size from the collection of all component model samples. We should try to do this in a way that minimizes the amount of extra Monte Carlo variability introduced by sampling (see the sketch after this list):
    • Ensure that each model is represented in the output according to its model weight. That is, for each component model we should get a number of samples that is (approximately) equal to the model weight times the desired number of ensemble samples. If there is a remainder, it can be distributed among the models at random.
    • To get samples from a component model, we take two steps:
      • If the number of samples we want to get from a model (target_n_component_samples) is larger than the number of samples that model provided (provided_n_component_samples), duplicate/replicate each of the samples from that model floor(target_n_component_samples / provided_n_component_samples) times. For example, if we want 25 samples from model A and model A provided 10 samples, floor(target_n_component_samples / provided_n_component_samples) = 2, so after this step we will have 2 copies of each of the 10 samples provided by that model.
      • Sample without replacement for the remainder. In this example there are 5 more samples to obtain for this model, so we choose 5 distinct samples provided by that model at random, without replacement.
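
A sketch of the two pieces described above, with hypothetical helper names: allocating the total ensemble sample size across models in proportion to their weights, and the replicate-then-sample-without-replacement step for a single model.

```r
# Hypothetical helpers sketching the Case 2 logic described above.

# Split the desired number of ensemble samples `n` across models in proportion
# to their weights, distributing any remainder among models at random.
allocate_sample_counts <- function(weights, n) {
  base_counts <- floor(weights * n)
  leftover <- n - sum(base_counts)
  gets_extra <- sample(seq_along(weights), size = leftover)
  base_counts[gets_extra] <- base_counts[gets_extra] + 1
  base_counts
}

# For one component model: replicate every provided sample index
# floor(target / provided) times, then draw the remainder without replacement.
sample_indices_for_model <- function(provided_idx, target_n) {
  provided_n <- length(provided_idx)
  n_full_copies <- floor(target_n / provided_n)
  remainder <- target_n - n_full_copies * provided_n
  c(
    rep(provided_idx, times = n_full_copies),
    sample(provided_idx, size = remainder, replace = FALSE)
  )
}

# Example from above: 25 samples wanted from a model that provided 10 samples
# yields 2 copies of each of the 10 indices plus 5 distinct indices at random.
# sample_indices_for_model(provided_idx = 1:10, target_n = 25)
```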

Notes about a new function argument related to desired ensemble sample size

The sampling step in Case 2 requires the user to specify how many ensemble samples they want in the output. So we need to add an argument to linear_pool allowing the user to specify this, e.g. n. But we want to allow for n = NULL, to say that if possible, the function should just collect the component model samples as in Case 1 above. Let's set n = NULL as the default, but then throw an error if we end up in Case 2 and the user did not provide an integer value of n.

We can have 3 separate validations related to this (a sketch of these checks follows the list):

  1. n must either be NULL or coercible to an integer
  2. If weights are provided, n must be an integer. Error text: "Component model weights were provided, so a number of ensemble samples n must be provided."
  3. If component models provided a different number of samples within any group defined by a combination of task ids, n must be an integer. Error text: "Component models provided differing numbers of samples within at least one forecast task id group, so a number of ensemble samples n must be provided."
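
A sketch of these checks with a hypothetical helper name and base R errors; the real implementation may structure and word them differently:

```r
# Hypothetical validation helper for the three checks above.
# `n_samples_by_task_group` is assumed to be a list with one integer vector per
# task id group, giving the number of samples each component model provided.
validate_n_samples <- function(n, weights, n_samples_by_task_group) {
  # 1. n must either be NULL or coercible to an integer
  if (!is.null(n) && is.na(suppressWarnings(as.integer(n)))) {
    stop("`n` must be NULL or coercible to an integer.")
  }
  # 2. if weights are provided, n must be an integer
  if (is.null(n) && !is.null(weights)) {
    stop("Component model weights were provided, so a number of ensemble samples n must be provided.")
  }
  # 3. differing sample counts within any task id group also require n
  differing_counts <- any(
    vapply(n_samples_by_task_group, function(x) length(unique(x)) > 1, logical(1))
  )
  if (is.null(n) && differing_counts) {
    stop("Component models provided differing numbers of samples within at least one forecast task id group, so a number of ensemble samples n must be provided.")
  }
  invisible(n)
}
```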

elray1 commented Aug 17, 2023

Couple of additional thoughts about this:

  • We don't really have support for samples built into the schemas or hubUtils yet. That means that right now there's no formal way to tell whether the output_type_id for samples should be an integer or a character. One option could be to work with integer sample indices internally, then check the data type of the output_type_id column in the component model outputs and convert the result back to that type. Or we could, for now, provide an argument to the linear_pool function to say what data type to use for sample output ids. Or maybe we should just hold off on building this functionality until we have support for it in the other tools?
  • One thing we need to do is ensure that the output type ids are distinct for samples from different component models. Probably the simplest way to do that would be to paste the component model_id together with the output_type_id provided by that model; then we can use the as.integer(factor(...)) trick to convert these combinations to distinct integers (see the small example below).
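
For example, that re-indexing trick could look like this:

```r
# Combine each component model's id with its own sample index, then map the
# combinations to distinct integers.
model_id <- c("modelA", "modelA", "modelB", "modelB")
output_type_id <- c("1", "2", "1", "2")
as.integer(factor(paste(model_id, output_type_id, sep = "_")))
#> [1] 1 2 3 4
```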

elray1 commented Apr 24, 2024

I'm closing this issue in favor of #109 and #110

elray1 commented Oct 8, 2024

If we do something like this, maybe we should introduce an argument like method allowing the user to specify how to take the samples: "stratified" (the default?) for sampling stratified by model, "random" for sampling at random across models, possibly with other methods to be added in the future.
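
A hypothetical usage sketch (neither method nor this exact signature is claimed to be the current linear_pool interface):

```r
# Proposed interface sketch, not the current hubEnsembles API:
linear_pool(model_out_tbl, n = 100, method = "stratified")  # stratified by model (possible default)
linear_pool(model_out_tbl, n = 100, method = "random")      # sampled at random across models
```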

elray1 reopened this Oct 17, 2024
elray1 commented Oct 17, 2024

I have re-opened this issue because the functionality it describes would be useful to our group at UMass, both for our internal modeling and for administering the variant nowcast hub.

lshandross commented
Closing this larger issue because it has been partially addressed by #143 (simplest case) and #147 (ability to subset output samples), plus split into smaller issues #151 (add support for non-equal model weights) and #152 (add option for non-stratified sampling)
