Test differences date by date #206

Open
alexisrosuel opened this issue Mar 19, 2018 · 6 comments
@alexisrosuel

Context

It is very useful when running an A/B test to see how the difference, p-values, credible intervals, etc. evolve over time. For instance, if I start an experiment on 2018-04-01 and finish it on 2018-04-30, I would like to know its state (in terms of p-value, etc.) on each day. This helps to visualize whether the test has "converged" or not.
(Image: chart from the Airbnb article. Source: https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7 )

Proposition

Would it be possible to apply the statistical analysis sequentially, date by date? It could apply the analysis to the sequence [df[df.date <= dt.datetime(2018, 4, 1) + dt.timedelta(days=i)] for i in range(30)] and then report the same JSON, but with a date level at the top. (Maybe there is a much cleaner architecture than this!)
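The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: `analyze_by_date` and `mean_difference` are hypothetical names, the `date`, `variant` and `kpi` columns are invented for the example, and the toy analysis stands in for whatever statistical analysis would really be run on each snapshot.

```python
import datetime as dt

import pandas as pd


def analyze_by_date(df, start, n_days, analyze, date_col="date"):
    """Run `analyze` on cumulative slices of `df`, one slice per day.

    `analyze` is a hypothetical callable standing in for the real
    statistical analysis; it receives a dataframe and returns a dict.
    """
    results = {}
    for i in range(n_days):
        cutoff = start + dt.timedelta(days=i)
        # Everything observed up to and including the cutoff.
        snapshot = df[df[date_col] <= cutoff]
        # The date becomes the top-level key of the report.
        results[cutoff.date().isoformat()] = analyze(snapshot)
    return results


def mean_difference(snapshot):
    """Toy analysis (assumption): difference in means between B and A."""
    means = snapshot.groupby("variant")["kpi"].mean()
    return {"delta": float(means["B"] - means["A"]), "n": int(len(snapshot))}


df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2018-04-01", "2018-04-01", "2018-04-02", "2018-04-02"]
    ),
    "variant": ["A", "B", "A", "B"],
    "kpi": [1.0, 2.0, 3.0, 6.0],
})
report = analyze_by_date(df, dt.datetime(2018, 4, 1), 2, mean_difference)
```

Each entry of `report` is keyed by an ISO date and holds the analysis of all data up to that date, which matches the "date level at the top" idea.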

Thanks

@gbordyugov
Contributor

gbordyugov commented Mar 21, 2018

Dear Alexis @alexisrosuel,

thanks a lot for the suggestion. What you're describing seems to me to be an instance of the 'early stopping' problem, which falls under multiple hypothesis testing: the more often you look at your p-value, the higher the probability of seeing spurious significance by chance.
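A small simulation can illustrate the peeking effect described above. It is a sketch under simplifying assumptions, not ExpAn code: both groups are drawn from the same N(0, 1) distribution (an A/A test), the test is a two-sided z-test with known unit variance, and `false_positive_rate` is a hypothetical helper name.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_days, peek_daily, per_day=20, n_sims=1000,
                        alpha=0.05, seed=0):
    """Share of A/A experiments declared significant at least once.

    Both groups come from the same distribution, so every
    "significant" result is a false positive.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        for day in range(1, n_days + 1):
            sum_a += sum(rng.gauss(0, 1) for _ in range(per_day))
            sum_b += sum(rng.gauss(0, 1) for _ in range(per_day))
            n += per_day
            # Without peeking, test only once, on the last day.
            if not peek_daily and day < n_days:
                continue
            # z statistic for a difference of two means, unit variance each.
            z = (sum_b / n - sum_a / n) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                hits += 1
                break
    return hits / n_sims


single_look = false_positive_rate(30, peek_daily=False)
daily_peeks = false_positive_rate(30, peek_daily=True)
```

With a fixed threshold of 0.05, `daily_peeks` should come out well above `single_look`, which is the inflation a stricter early threshold is meant to counteract.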

ExpAn supports early stopping in a highly experimental mode and tries to mitigate the risk of spurious early stopping by applying a stricter p-value threshold when there is less data than expected. But it always consumes all the data present in the dataframe.

Let me know if I understood your question correctly.

Best,
Grisha

@alexisrosuel
Author

Hi Grisha,

In fact, the idea behind this chart (and the whole Airbnb Medium article) is the opposite. They wanted to point out that the p-value can fluctuate over time: it can drop below the significance threshold and then either stay there or not.

The chart shows this: if you stop the experiment represented here around day 10, you commit a type I error. But if you let the experiment run for a few more days, you see that the p-value in fact "converges" to its true value.

To recap, this does not provide an early stopping criterion. It helps to monitor whether the p-value still behaves erratically (so we can't stop the experiment at this moment) or whether it hasn't changed for a "long time" (to be defined). For me the ideal criterion is:

  • look at the true statistical early stopping criterion (the aim of this package)
  • accept the result iff the p-value graph has converged
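One possible reading of the two-step criterion above, as a sketch only. The thread leaves "converged" undefined, so the definition used here (the p-values of the last `window` days span at most `tol`) is an assumption, and `has_converged` and `accept_result` are hypothetical names.

```python
def has_converged(p_values, window=7, tol=0.01):
    """Assumed definition: the last `window` p-values span at most `tol`."""
    if len(p_values) < window:
        return False
    recent = p_values[-window:]
    return max(recent) - min(recent) <= tol


def accept_result(p_values, thresholds, window=7, tol=0.01):
    """Accept only if the latest p-value passes its threshold and is stable.

    `thresholds` holds the stopping threshold that applied on each day.
    """
    significant = p_values[-1] <= thresholds[-1]
    return significant and has_converged(p_values, window, tol)


erratic = [0.20, 0.04, 0.30, 0.03, 0.12, 0.04, 0.02]
stable = [0.30, 0.021, 0.020, 0.019, 0.020, 0.021, 0.020, 0.020]
erratic_ok = accept_result(erratic, [0.05] * len(erratic))
stable_ok = accept_result(stable, [0.05] * len(stable))
```

Both series end below 0.05, but only the stable one would be accepted under this assumed rule. The window and tolerance are placeholders that would need to be chosen per experiment.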

What do you think of it?

@gbordyugov
Contributor

Please pardon my poor expression. What I meant in my first reply is exactly what you're talking about:

The more often you look at your p-value, the higher the probability to see spurious significance by chance.

Our early stopping logic counteracts effects like this by reducing the alpha threshold at the beginning of the experiment (where you've got less data), so it's not 0.05 but much smaller for the small quantities of data in the first days.

@alexisrosuel
Author

Oh indeed I see your point now too :)

Yes, ExpAn uses some kind of "dynamic p-value threshold", so could we plot this value day by day, along with the observed p-value?

@shansfolder
Contributor

Yes, the "dynamic threshold" is based on the information fraction, which is the ratio of the current sample size to the estimated sample size for the experiment.

Here is the method we use: https://github.com/zalando/expan/blob/master/expan/core/early_stopping.py#L24-L36
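To make the idea of a threshold driven by the information fraction concrete, here is a minimal sketch. It assumes an O'Brien-Fleming-style bound, which is a common choice for this kind of threshold; the linked ExpAn source is not reproduced here and may differ in detail. `dynamic_threshold` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def dynamic_threshold(current_n, estimated_n, alpha=0.05):
    """O'Brien-Fleming-style p-value threshold for the data seen so far."""
    # Information fraction: share of the planned sample already collected.
    t = min(current_n / estimated_n, 1.0)
    if t <= 0:
        return 0.0
    normal = NormalDist()
    z = normal.inv_cdf(1 - alpha / 2)
    # The critical value is inflated by 1/sqrt(t), so the threshold is
    # very strict early on and relaxes to alpha once t reaches 1.
    return 2 * (1 - normal.cdf(z / sqrt(t)))


# Threshold at 10%, 50% and 100% of the planned sample size.
thresholds = [dynamic_threshold(n, 1000) for n in (100, 500, 1000)]
```

Computing this value for each day's cumulative sample size gives the threshold curve that could be plotted next to the observed p-value.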

@shansfolder
Contributor

Whether the analysis is day by day or over other periods depends on how your code calls ExpAn.


3 participants