Estimation of alternative allele frequencies with metafounders #182

RosCraddock · 2024-11-12T16:54:42Z

Quite a long list of updates here!

New command -update_alt_allele_prob for re-estimating the alternative allele frequency on each peeling
New function updateMafAfterPeeling() to re-estimate the alternative allele frequencies based on the inferred genotype probabilities of the founders (triggered with new command
Restricted all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs).
Updates to documentation for -update_alt_allele_prob and noted the alternative allele frequency range (0.01 to 0.99).
Updates to functional testing for alternative allele frequency estimation.
New R script file for metafounder simulation (new code replacing the previous approach).
New simulation for accuracy testing of metafounders: one pedigree, one true genotype (dosage, phased, unphased), set of obsv genotypes with ~50% missing rate.
Five additional accuracy tests for metafounders. All only with the single method. I can add more (e.g multi, hybrid), but cautious since the summary table is already quite large.
Additions to the accuracy summary table printed in the terminal to include metafounder (as separate entry as uses different pedigree simulation).

@gregorgorjanc @XingerTang - when you have time, please let me know if you have any comments/questions/changes.

- New command -update_alt_allele_prob - New function updateMafAfterPeeling() triggered with new command - Set all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs). - Updates to documentation - Updates to functional testing. - New file for metafounder simulation (new code replacing previous approach). - New simulation for metafounder accuracy testing - Updates to accuracy tests for metafounders - Updates to terminal summary of accuracy tests to include metafounders

…phaPeel into feat_metafounders

docs/source/usage.rst

gregorgorjanc · 2024-11-13T06:26:43Z

docs/source/usage.rst

@@ -133,7 +136,7 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ

 The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations.

-The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele probabilities after each peeling cycle. This option can be useful if there are a large number of non-genotyped founders. If both ``-alt_allele_prob_file`` and ``-est_alt_allele_prob`` are used, the inputted alternative allele probabilities are used as a starting point for alternative allele probabilities estimation.
+The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used.


The last sentence on providing alternative allele prob file - I don’t fully follow what you mean here. Do you need to say more?

I will rephrase this. Here, I wanted to highlight that -est_alt_allele_prob estimates one alternative allele frequency for the base population using all available genotype data (so it ignores any genetic grouping). Thus, to utilise the metafounders/genetic grouping in the peeling, -update_alt_allele_prob is required as well.

gregorgorjanc · 2024-11-13T06:29:03Z

docs/source/usage.rst

@@ -342,7 +345,7 @@ Example:
 Model parameter files
 =====================

-|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``.
+|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. Here, all values will range from 0.01 to 0.99. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``.


@XingerTang we limit ourselves to 1%-99% to avoid getting “sucked” into 0, where then everything becomes “fixed”. Should we have a bit more lax range? Say, 0.001-0.999 or similar? How to do this well from numerical perspective?

I think we could enforce a more laxed range. I did a quick test with inputted alternative allele frequencies and the estimation via the Newton method, and it seems to be fine. But it would be good to know what @XingerTang thinks!

@RosCraddock I was asking this because some disease allele frequencies in a base population might be very low, much lower than 0.01, but we do want to avoid the "0 trap", which can occur during iteration/peeling cycles. The question now is what limits should we put. @XingerTang I would appreciate your math's input here;)

@RosCraddock one way to check this is to see in your real datasets you work with - if you get allele frequency that is 0.01 for any of the metafounders then we know that we have likely hit this imposed range-limit. That will give us a good test for how to manage these range-limits.

@RosCraddock Have you had a chance to look into this?

gregorgorjanc · 2024-11-13T06:36:48Z

src/tinypeel/Peeling/PeelingUpdates.py

@@ -111,6 +111,34 @@ def addIndividualToUpdate(d, p, LLp, LLpp):
    return LLp, LLpp


+def updateMafAfterPeeling(pedigree, peelingInfo):


@RosCraddock @XingerTang the code loops over metafounders and for each metafounder loops over all pedigree members twice - once to calculate and then to assign anterior probabilities. I am thinking out loud if we can speed this up in any way if we flip iteration around - over individuals and then over metafounders, but we would be doing the same amount of work, so it seems we can not.

Yes, I wondered the same when writing this!

src/tinypeel/tinypeel.py

gregorgorjanc · 2024-11-13T06:40:38Z

src/tinypeel/tinypeel.py

@@ -48,6 +56,10 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False):
                        peelingInfo.nLoci, 0.5, dtype=np.float32
                    )
    if args.est_alt_allele_prob:
+        if args.alt_allele_prob_file is not None:
+            warnings.warn(
+                "-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders."


@RosCraddock say that current implementation overwrites differences. We might improve upon this later …

This is only true if a user uses both —alt_allele_prob_file and —est_alt_allele_prob. For now, I think I will rewrite this warning to recommend using —update_alt_allele_prob instead.

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

gregorgorjanc · 2024-11-13T06:48:38Z

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

+                         pullIbdHaplo(pop = pop))
+  data$genoIBS <- rbind(data$genoIBS,
+                        pullSegSiteGeno(pop = pop))
+  listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)


@RosCraddock do you need this listLoci bit of code? Also, isn't your Genotype the same as genoIBS? I got the later via pullSegSiteGeno() and there is also pullSegSiteHaplo() which will give you haploIBS. That's all you need I think,

gregorgorjanc · 2024-11-13T06:49:57Z

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

+  listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)
+  marker <- ""
+  i <- 1
+  for (i in 1:length(listLoci)){


Here is an example where doing just paste(“1_”, 1:pop@nLoci, sep=“0”) would be a simpler and faster vectorised way of doing this, but you don’t need it at all - just use pullSegSiteHaplo(‘)

gregorgorjanc · 2024-11-13T06:52:24Z

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R

+# Get pedigree and assign metafounders
+pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation)
+colnames(pedigree) <- c("id", "mid", "fid", "population", "generation")
+pedigree <- pedigree[c(401:1400),]


We have hardcoded numbers here but nGen comes into this script externally - will that clash with these hardcoded numbers here?

gregorgorjanc

Very good work @RosCraddock. I left some minor comments.

gregorgorjanc

Looks good @RosCraddock

gregorgorjanc

LGTM - we can deal with 0.01 - 0.99 bit later - will open an issue

gregorgorjanc · 2025-02-28T03:05:48Z

@RosCraddock can you please squash the commits into one and give it commit message "Estimation of alternative allele frequencies with metafounders". Then we merge this in. Thanks!

- Added detail to usage.rst for metafounders. - Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well. - Removal of redundant code and tidying in accuracy simulation for metafounders. - Correct ordering of metafounders in alt_allele_prob output when there are 10 or more. - Added a functional test to check this.

RosCraddock · 2025-03-04T10:42:05Z

I cannot squash the commits further (from 8 to 3 now) due to a complex conflict with a submodule update from the previous merge.

I was not expecting the error in the checks (I only squashed commits, and very carefully too). Looks like the issue is in the pip installation. The whl file is saved as alphapeel-1.1.3 etc, but for installation we search for AlphaPeel-1.1.3 etc. Thus, fails as the "file does not exist". @XingerTang. Do I need to change line 39 in the tests.yml file? Or do I need to change where the whl file is named?

RosCraddock · 2025-03-05T16:42:32Z

I found the cause for the error above - on 25th Feb the setuptools package was updated for all wheel file naming to follow binary distribution specification (i.e all lower case). It states that this is a trial (so they may revert?). I will set up an issue to resolve this - this will be relevant for other branches too.

RosCraddock added 2 commits November 12, 2024 16:25

Merge branch 'feat_metafounders' of https://github.com/RosCraddock/Al…

2f4f4d3

…phaPeel into feat_metafounders

gregorgorjanc reviewed Nov 13, 2024

View reviewed changes

docs/source/usage.rst Show resolved Hide resolved

gregorgorjanc reviewed Nov 13, 2024

View reviewed changes

src/tinypeel/tinypeel.py Show resolved Hide resolved

gregorgorjanc reviewed Nov 13, 2024

View reviewed changes

tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R Show resolved Hide resolved

gregorgorjanc reviewed Nov 13, 2024

View reviewed changes

gregorgorjanc approved these changes Nov 27, 2024

View reviewed changes

gregorgorjanc approved these changes Feb 28, 2025

View reviewed changes

gregorgorjanc mentioned this pull request Feb 28, 2025

Change estimation limits of base population allele frequency from [0.01, 0.99] to something wider #185

Open

RosCraddock force-pushed the feat_metafounders branch from 6ddead0 to 423906e Compare March 4, 2025 10:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Estimation of alternative allele frequencies with metafounders #182

Estimation of alternative allele frequencies with metafounders #182

RosCraddock commented Nov 12, 2024

gregorgorjanc Nov 13, 2024

RosCraddock Nov 14, 2024

gregorgorjanc Nov 13, 2024

RosCraddock Nov 14, 2024

gregorgorjanc Nov 15, 2024

gregorgorjanc Nov 15, 2024

gregorgorjanc Feb 28, 2025

gregorgorjanc Nov 13, 2024

RosCraddock Nov 14, 2024

gregorgorjanc Nov 13, 2024

RosCraddock Nov 14, 2024

gregorgorjanc Nov 13, 2024

gregorgorjanc Nov 13, 2024

gregorgorjanc Nov 13, 2024

gregorgorjanc left a comment

gregorgorjanc left a comment

gregorgorjanc left a comment

gregorgorjanc commented Feb 28, 2025

RosCraddock commented Mar 4, 2025 •

edited

Loading

RosCraddock commented Mar 5, 2025

		@@ -111,6 +111,34 @@ def addIndividualToUpdate(d, p, LLp, LLpp):
		return LLp, LLpp


		def updateMafAfterPeeling(pedigree, peelingInfo):

Estimation of alternative allele frequencies with metafounders #182

Are you sure you want to change the base?

Estimation of alternative allele frequencies with metafounders #182

Conversation

RosCraddock commented Nov 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gregorgorjanc left a comment

Choose a reason for hiding this comment

gregorgorjanc left a comment

Choose a reason for hiding this comment

gregorgorjanc left a comment

Choose a reason for hiding this comment

gregorgorjanc commented Feb 28, 2025

RosCraddock commented Mar 4, 2025 • edited Loading

RosCraddock commented Mar 5, 2025

RosCraddock commented Mar 4, 2025 •

edited

Loading