handle sample types in linear_pool #27

Closed
elray1 opened this issue Aug 4, 2023 · 6 comments
elray1 commented Aug 4, 2023

We pretty much settled on the desired functionality here in discussion on issue #20, but I'm splitting implementation into a separate issue.

elray1 commented Aug 10, 2023

More detailed ideas about handling samples. See comments here and below where we have talked about this before.

Previous decision: This function will not do any validations related to potentially different dependence structures represented by the component models. If a hub cares about enforcing that models use the same dependence structure for their samples, this will be specified in the hub's config, and will be checked at the time that model outputs are submitted to the hub, so we don't need to do validations related to this in this function.

There are two cases.

Case 1: collect component samples

If all three of the conditions below are satisfied, this function simply collects the samples from the component models and updates the sample indices to ensure that they are different for different component models (a sketch follows this list):

  1. equal weights for all models,
  2. the same number of samples from each component model, and
  3. no limit on the number of samples the ensemble is allowed to produce.
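
A minimal sketch of Case 1, assuming component outputs arrive as a hubverse-style data frame with model_id, output_type, output_type_id, and value columns; the function name and the ensemble's model_id below are placeholders, not the actual implementation:

```r
library(dplyr)

# Sketch only: keep the sample rows from all component models and make the
# sample indices distinct across models by prefixing them with the model id.
collect_component_samples <- function(model_out_tbl) {
  model_out_tbl |>
    dplyr::filter(output_type == "sample") |>
    dplyr::mutate(
      output_type_id = paste(model_id, output_type_id, sep = "_"),
      model_id = "linear-pool"  # placeholder name for the ensemble model
    )
}
```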

Case 2: do some sampling

However, if any of those conditions are not satisfied, we have to do something else:

  • Draw a sample of the specified size from the collection of all component model samples. We should try to do this in a way that minimizes the amount of extra Monte Carlo variability introduced by sampling (see the sketch after this list):
    • Ensure that each model is represented in the output according to its model weight. That is, for each component model we should get a number of samples that is (approximately) equal to the model weight times the desired number of ensemble samples. If there is a remainder, it can be distributed among the models at random.
    • To get samples from a component model, we take two steps:
      • If the number of samples we want to get from a model (target_n_component_samples) is larger than the number of samples that model provided (provided_n_component_samples), duplicate/replicate each of the samples from that model floor(target_n_component_samples / provided_n_component_samples) times. For example, if we want 25 samples from model A and model A provided 10 samples, floor(target_n_component_samples / provided_n_component_samples) = 2, so after this step we will have 2 copies of each of the 10 samples provided by that model.
      • Sample without replacement for the remainder. In this example there are 5 more samples to obtain for this model, so we choose 5 distinct samples provided by that model at random, without replacement.
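
A sketch of the two pieces described above, with hypothetical helper names: allocating the total ensemble sample size across models in proportion to their weights, and the replicate-then-sample-without-replacement step for a single model.

```r
# Hypothetical helpers sketching the Case 2 logic described above.

# Split the desired number of ensemble samples `n` across models in proportion
# to their weights, distributing any remainder among models at random.
allocate_sample_counts <- function(weights, n) {
  base_counts <- floor(weights * n)
  leftover <- n - sum(base_counts)
  gets_extra <- sample(seq_along(weights), size = leftover)
  base_counts[gets_extra] <- base_counts[gets_extra] + 1
  base_counts
}

# For one component model: replicate every provided sample index
# floor(target / provided) times, then draw the remainder without replacement.
sample_indices_for_model <- function(provided_idx, target_n) {
  provided_n <- length(provided_idx)
  n_full_copies <- floor(target_n / provided_n)
  remainder <- target_n - n_full_copies * provided_n
  c(
    rep(provided_idx, times = n_full_copies),
    sample(provided_idx, size = remainder, replace = FALSE)
  )
}

# Example from above: 25 samples wanted from a model that provided 10 samples
# yields 2 copies of each of the 10 indices plus 5 distinct indices at random.
# sample_indices_for_model(provided_idx = 1:10, target_n = 25)
```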

Notes about a new function argument related to desired ensemble sample size

The sampling step in Case 2 requires the user to specify how many ensemble samples they want in the output. So we need to add an argument to linear_pool allowing the user to specify this, e.g. n. But we want to allow for n = NULL, to say that if possible, the function should just collect the component model samples as in Case 1 above. Let's set n = NULL as the default, but then throw an error if we end up in Case 2 and the user did not provide an integer value of n.

We can have 3 separate validations related to this (a sketch of these checks follows the list):

  1. n must either be NULL or coercible to an integer
  2. If weights are provided, n must be an integer. Error text: "Component model weights were provided, so a number of ensemble samples n must be provided."
  3. If component models provided a different number of samples within any group defined by a combination of task ids, n must be an integer. Error text: "Component models provided differing numbers of samples within at least one forecast task id group, so a number of ensemble samples n must be provided."
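
A sketch of these checks with a hypothetical helper name and base R errors; the real implementation may structure and word them differently:

```r
# Hypothetical validation helper for the three checks above.
# `n_samples_by_task_group` is assumed to be a list with one integer vector per
# task id group, giving the number of samples each component model provided.
validate_n_samples <- function(n, weights, n_samples_by_task_group) {
  # 1. n must either be NULL or coercible to an integer
  if (!is.null(n) && is.na(suppressWarnings(as.integer(n)))) {
    stop("`n` must be NULL or coercible to an integer.")
  }
  # 2. if weights are provided, n must be an integer
  if (is.null(n) && !is.null(weights)) {
    stop("Component model weights were provided, so a number of ensemble samples n must be provided.")
  }
  # 3. differing sample counts within any task id group also require n
  differing_counts <- any(
    vapply(n_samples_by_task_group, function(x) length(unique(x)) > 1, logical(1))
  )
  if (is.null(n) && differing_counts) {
    stop("Component models provided differing numbers of samples within at least one forecast task id group, so a number of ensemble samples n must be provided.")
  }
  invisible(n)
}
```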

elray1 commented Aug 17, 2023

Couple of additional thoughts about this:

  • We don't really have support for samples built into the schemas or hubUtils yet. That means that right now there's no formal way to tell whether the output_type_id for samples should be an integer or a character. One option could be to work with integer sample indices internally, then check the data type of the output_type_id column in the component model outputs and convert the result back to that type. Or we could, for now, provide an argument to the linear_pool function to say what data type to use for sample output ids. Or maybe we should just hold off on building this functionality until we have support for it in the other tools?
  • One thing we need to do is ensure that the output type ids are distinct for samples from different component models. Probably the simplest way to do that would be to paste the component model_id together with the output_type_id provided by that model; then we can use the as.integer(factor(...)) trick to convert these combinations to distinct integers (see the small example below).
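
For example, that re-indexing trick could look like this:

```r
# Combine each component model's id with its own sample index, then map the
# combinations to distinct integers.
model_id <- c("modelA", "modelA", "modelB", "modelB")
output_type_id <- c("1", "2", "1", "2")
as.integer(factor(paste(model_id, output_type_id, sep = "_")))
#> [1] 1 2 3 4
```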

elray1 commented Apr 24, 2024

I'm closing this issue in favor of #109 and #110

elray1 commented Oct 8, 2024

If we do something like this, maybe we should introduce an argument like method allowing the user to specify how to take the samples: "stratified" (the default?) for sampling stratified by model, "random" for sampling at random across models, possibly with other methods to be added in the future.
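
A hypothetical usage sketch (neither method nor this exact signature is claimed to be the current linear_pool interface):

```r
# Proposed interface sketch, not the current hubEnsembles API:
linear_pool(model_out_tbl, n = 100, method = "stratified")  # stratified by model (possible default)
linear_pool(model_out_tbl, n = 100, method = "random")      # sampled at random across models
```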

elray1 reopened this Oct 17, 2024
elray1 commented Oct 17, 2024

I have re-opened this issue because the functionality it describes would be useful to our group at UMass, both for our internal modeling and for administering the variant nowcast hub.

lshandross commented
Closing this larger issue because it has been partially addressed by #143 (simplest case) and #147 (ability to subset output samples), plus split into smaller issues #151 (add support for non-equal model weights) and #152 (add option for non-stratified sampling)
