-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estimation of alternative allele frequencies with metafounders #182
base: feat_metafounders
Are you sure you want to change the base?
Estimation of alternative allele frequencies with metafounders #182
Conversation
- New command -update_alt_allele_prob - New function updateMafAfterPeeling() triggered with new command - Set all alternative allele frequencies to be between 0.01 and 0.99 (including user-inputs). - Updates to documentation - Updates to functional testing. - New file for metafounder simulation (new code replacing previous approach). - New simulation for metafounder accuracy testing - Updates to accuracy tests for metafounders - Updates to terminal summary of accuracy tests to include metafounders
…phaPeel into feat_metafounders
docs/source/usage.rst
Outdated
@@ -133,7 +136,7 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ | |||
|
|||
The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations. | |||
|
|||
The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele probabilities after each peeling cycle. This option can be useful if there are a large number of non-genotyped founders. If both ``-alt_allele_prob_file`` and ``-est_alt_allele_prob`` are used, the inputted alternative allele probabilities are used as a starting point for alternative allele probabilities estimation. | |||
The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence on providing alternative allele prob file - I don’t fully follow what you mean here. Do you need to say more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will rephrase this. Here, I wanted to highlight that -est_alt_allele_prob
estimates one alternative allele frequency for the base population using all available genotype data (so it ignores any genetic grouping). Thus, to utilise the metafounders/genetic grouping in the peeling, -update_alt_allele_prob
is required as well.
@@ -342,7 +345,7 @@ Example: | |||
Model parameter files | |||
===================== | |||
|
|||
|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``. | |||
|Software| outputs four model parameter files: *.alt_allele_prob.txt*, *.seq_error_prob.txt*, *.geno_error_prob.txt*, *.rec_prob.txt*. These give the alternative allele frequency, sequencing error rates, genotyping error rates and the recombination rates used. In the *.alt_allele_prob.txt*, there is a column per metafounder with an alternative allele frequency for each marker. Here, all values will range from 0.01 to 0.99. The other three files contain a single column with an entry for each marker. By default, |Software| will output *.seq_error_prob.txt*, *.geno_error_prob.txt* and *.rec_prob.txt*. The *.alt_allele_prob.txt* will only be outputted with the argument ``-alt_allele_prob``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XingerTang we limit ourselves to 1%-99% to avoid getting “sucked” into 0, where then everything becomes “fixed”. Should we have a bit more lax range? Say, 0.001-0.999 or similar? How to do this well from numerical perspective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could enforce a more laxed range. I did a quick test with inputted alternative allele frequencies and the estimation via the Newton method, and it seems to be fine. But it would be good to know what @XingerTang thinks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock I was asking this because some disease allele frequencies in a base population might be very low, much lower than 0.01, but we do want to avoid the "0 trap", which can occur during iteration/peeling cycles. The question now is what limits should we put. @XingerTang I would appreciate your math's input here;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock one way to check this is to see in your real datasets you work with - if you get allele frequency that is 0.01 for any of the metafounders then we know that we have likely hit this imposed range-limit. That will give us a good test for how to manage these range-limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock Have you had a chance to look into this?
@@ -111,6 +111,34 @@ def addIndividualToUpdate(d, p, LLp, LLpp): | |||
return LLp, LLpp | |||
|
|||
|
|||
def updateMafAfterPeeling(pedigree, peelingInfo): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock @XingerTang the code loops over metafounders and for each metafounder loops over all pedigree members twice - once to calculate and then to assign anterior probabilities. I am thinking out loud if we can speed this up in any way if we flip iteration around - over individuals and then over metafounders, but we would be doing the same amount of work, so it seems we can not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I wondered the same when writing this!
src/tinypeel/tinypeel.py
Outdated
@@ -48,6 +56,10 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False): | |||
peelingInfo.nLoci, 0.5, dtype=np.float32 | |||
) | |||
if args.est_alt_allele_prob: | |||
if args.alt_allele_prob_file is not None: | |||
warnings.warn( | |||
"-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock say that current implementation overwrites differences. We might improve upon this later …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only true if a user uses both —alt_allele_prob_file
and —est_alt_allele_prob
. For now, I think I will rewrite this warning to recommend using —update_alt_allele_prob
instead.
tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R
Show resolved
Hide resolved
pullIbdHaplo(pop = pop)) | ||
data$genoIBS <- rbind(data$genoIBS, | ||
pullSegSiteGeno(pop = pop)) | ||
listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RosCraddock do you need this listLoci bit of code? Also, isn't your Genotype the same as genoIBS? I got the later via pullSegSiteGeno() and there is also pullSegSiteHaplo() which will give you haploIBS. That's all you need I think,
listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1) | ||
marker <- "" | ||
i <- 1 | ||
for (i in 1:length(listLoci)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example where doing just paste(“1_”, 1:pop@nLoci, sep=“0”) would be a simpler and faster vectorised way of doing this, but you don’t need it at all - just use pullSegSiteHaplo(‘)
# Get pedigree and assign metafounders | ||
pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation) | ||
colnames(pedigree) <- c("id", "mid", "fid", "population", "generation") | ||
pedigree <- pedigree[c(401:1400),] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have hardcoded numbers here but nGen comes into this script externally - will that clash with these hardcoded numbers here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good work @RosCraddock. I left some minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @RosCraddock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - we can deal with 0.01 - 0.99 bit later - will open an issue
@RosCraddock can you please squash the commits into one and give it commit message "Estimation of alternative allele frequencies with metafounders". Then we merge this in. Thanks! |
- Added detail to usage.rst for metafounders. - Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well. - Removal of redundant code and tidying in accuracy simulation for metafounders. - Correct ordering of metafounders in alt_allele_prob output when there are 10 or more. - Added a functional test to check this.
6ddead0
to
423906e
Compare
I cannot squash the commits further (from 8 to 3 now) due to a complex conflict with a submodule update from the previous merge. I was not expecting the error in the checks (I only squashed commits, and very carefully too). Looks like the issue is in the pip installation. The whl file is saved as alphapeel-1.1.3 etc, but for installation we search for AlphaPeel-1.1.3 etc. Thus, fails as the "file does not exist". @XingerTang. Do I need to change line 39 in the tests.yml file? Or do I need to change where the whl file is named? |
I found the cause for the error above - on 25th Feb the setuptools package was updated for all wheel file naming to follow binary distribution specification (i.e all lower case). It states that this is a trial (so they may revert?). I will set up an issue to resolve this - this will be relevant for other branches too. |
Quite a long list of updates here!
-update_alt_allele_prob
for re-estimating the alternative allele frequency on each peeling@gregorgorjanc @XingerTang - when you have time, please let me know if you have any comments/questions/changes.