Systematic Trading research and development, with a flavour of Trend Following
Au.Tra.Sy blog – Automated trading System

Bootstrap – Take 2: Data Mining bias, Code and using geometric mean

August 13th, 2010 · 65 Comments · Backtest, Code


In part 1 of this bootstrap post, we looked at how to apply the method to establish the statistical significance of a single trading rule. In Part 2, we’ll look at how to deal with the data mining bias and at the impact of geometric vs. arithmetic mean returns. The code implementing the bootstrap test is available for download at the bottom of this post.

Dealing with Data Mining Bias

The approach described in the single rule test is not valid when performing data mining (whether testing different rules or different parameter values of the same rule). As per the data mining bias (explained previously), the (best) rule selected from the data mining process will invariably owe a large part of its over-performance to random (good) luck.

The way the bootstrap test deals with the data mining bias is by implementing a concept introduced in White’s Reality Check. The Reality Check derives the sampling distribution appropriate to test the statistical significance of the best rule found by data mining.

In effect, the concept is fairly simple – and similar to the single-rule bootstrap: assuming N rules have been tested in the data mining process, each resample iteration will perform a resample with replacement for each rule and the best mean return will be kept as this resample iteration’s test statistic:

  1. N back-tests are run on detrended data. The mean daily return, based on x observations, is calculated for each back-tested rule.
  2. Each rule’s mean daily return is subtracted from the rule’s set of daily returns (zero-centering). This gives a set of adjusted returns for each rule.
  3. For each “higher-level” resample (to form the sampling distribution of the best-performing rule in a universe of N rules), perform a “lower-level” resample with replacement on every rule. For each rule select x instances of adjusted returns at random and calculate their mean daily return (rule bootstrapped mean). Compare each rule bootstrapped mean and select the highest one: this is the test statistic of this “higher level” resample (bootstrapped best mean).
  4. Perform a large number of “higher level” resamples to generate a large number of bootstrapped best means.
  5. Form the sampling distribution of the best means generated in the step above.
  6. Derive the p-value of the best back-test mean return (non zero-centered) based on the sampling distribution derived above.

In effect: for each iteration, resample each rule, take the best return, keep it as this iteration’s test statistic and move on to the next iteration. The sampling distribution is formed from each iteration’s best return.
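The steps above can be sketched in Python. This is a minimal illustration of my own (not White’s or Aronson’s actual code), using synthetic random return streams as the "rules":

```python
import numpy as np

def reality_check_pvalue(rule_returns, n_resamples=5000, seed=None):
    """Bootstrap version of White's Reality Check, following the steps above.

    rule_returns: array of shape (n_rules, n_obs) with the daily returns
    of each back-tested rule (ideally computed on detrended data)."""
    rng = np.random.default_rng(seed)
    rule_returns = np.asarray(rule_returns)
    n_rules, n_obs = rule_returns.shape

    # Steps 1-2: mean daily return per rule, then zero-centre each rule
    means = rule_returns.mean(axis=1)
    adjusted = rule_returns - means[:, None]

    # Steps 3-5: each "higher-level" resample draws n_obs observations
    # (with replacement) for every rule and keeps the best bootstrapped mean
    best_means = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n_obs, size=n_obs)
        best_means[i] = adjusted[:, idx].mean(axis=1).max()

    # Step 6: p-value of the best observed (non-centred) mean return
    return (best_means >= means.max()).mean()

# 10 synthetic "rules" with no real edge: the best of them looks
# profitable by luck alone, which the Reality Check should expose
rules = np.random.default_rng(42).normal(0.0, 0.01, size=(10, 250))
p = reality_check_pvalue(rules, n_resamples=2000, seed=0)
```

Note that using the same resample indices for every rule within one iteration preserves the joint dependence between the rules, which is part of what makes the Reality Check less conservative than a simple Bonferroni-style correction.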

White Reality Check Related Papers

I have not yet searched hard for White’s original paper on the Reality Check, but I did find the two following papers, which seem to be worth a read:
Stepwise Multiple Testing as Formalized Data Snooping – Romano & Wolf


It is common in econometric applications that several hypothesis tests are carried out at the same time. The problem then becomes how to decide which hypotheses to reject, accounting for the multitude of tests. In this paper, we suggest a stepwise multiple testing procedure which asymptotically controls the familywise error rate at a desired level. Compared to related single-step methods, our procedure is more powerful in the sense that it often will reject more false hypotheses.

Unlike some stepwise methods, our method implicitly captures the joint dependence structure of the test statistics, which results in increased ability to detect alternative hypotheses. We prove our method asymptotically controls the familywise error rate under minimal assumptions. Some simulation studies show the improvements of our methods over previous proposals. We also provide an application to a set of real data.

Re-Examining the Profitability of Technical Analysis with White’s Reality Check and Hansen’s SPA Test

In this paper, we re-examine the profitability of technical analysis using White’s Reality Check and Hansen’s SPA test that correct the data snooping bias. Comparing to previous studies, we study a more complete universe of trading techniques, including not only simple rules but also investor’s strategies, and we test the profitability of these rules and strategies with four main indices. It is found that significantly profitable simple rules and investor’s strategies do exist in the data from relatively young markets (NASDAQ Composite and Russell 2000) but not in the data from relatively mature markets (DJIA and S&P 500). Moreover, after taking transaction costs into account, we find that the best rules for NASDAQ Composite and Russell 2000 outperform the buy-and-hold strategy in most in- and out-of-sample periods. Our results thus suggest that the degree of market efficiency may be related to market maturity. It is also found that investor’s strategies are able to improve on the profits of simple rules and may even generate significant profits from unprofitable simple rules.

Geometric or Arithmetic Mean?

In part 1, I introduced the idea that a positive arithmetic mean return is not equivalent to the strategy being profitable (i.e. it is not a sufficient condition). On the other hand, a positive geometric mean return is a necessary and sufficient condition for the strategy being profitable (i.e. both conditions are equivalent).

Therefore, bootstrapping using the geometric mean return as the test statistic should provide a better evaluation of the statistical significance of the system’s profitability.

I will not go into the detail of how the calculation is done, as it is very similar to the arithmetic mean return but uses the log of returns instead. Note that the geometric mean provides a stricter test than the arithmetic mean (a rule can have a significantly positive arithmetic return but a negative geometric return).
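To make the distinction concrete, here is a small illustration with hypothetical numbers: a volatile return stream with a positive arithmetic mean can still have a negative geometric mean, i.e. it loses money when compounded:

```python
import numpy as np

# Hypothetical return stream: +50% then -40%, repeated ten times.
# Each pair of periods multiplies equity by 1.5 * 0.6 = 0.9, a net loss.
returns = np.array([0.50, -0.40] * 10)

arithmetic_mean = returns.mean()                       # positive: +5% per period
# Geometric mean computed via the log of returns, as described above
geometric_mean = np.exp(np.log1p(returns).mean()) - 1  # negative
```

So this stream would look attractive under the arithmetic test but would rightly fail the stricter geometric one.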

To illustrate the multiple applications of the bootstrapping methodology, I decided to run the test on the track record (set of monthly returns) of one of the Trend Following Wizards. I picked Chesapeake and ran its monthly returns (from 1988 to 2009) through the bootstrap test.

The p-value calculated using the arithmetic mean is 0.000098 (less than 1 chance in 10,000 that this kind of result is due to random luck). Using the geometric mean, the p-value is 0.00022. Both values are extremely low, which is not surprising given Jerry Parker’s 20-year track record, with only one losing year and an average monthly return of 1.7%.

Many people would point out that survivorship bias should be considered, and obviously it depends on how you look at it. The main point of this dual test is that the geometric p-value is higher than the arithmetic p-value, verifying that it is a stricter test of statistical significance.

Bootstrap Code

Finally, here is a tool coded to implement the bootstrap test for a single strategy – available for download. Note that this is distributed “as is”, with no guarantee (but that’s the one I have been using so I still think it does the job…). It should run on any Windows machine with the .Net framework installed (XP or higher should do fine).

It simply takes three parameters (separated by space):

  1. Returns file path and name
  2. Number of resamples
  3. Flag for Arithmetic (A) or Geometric (G) mean calculation

It also generates a file in the same directory with all of the resamples test-statistic values (to draw the histogram).

Simply place the bootstrap.exe in your directory of choice and run it from the command prompt as below:

Run the bootstrap.exe from the command line

Download here:



65 Comments so far ↓

  • prazor

    Very good! Saved me a bunch of hours…


  • Troy S.

    R also has a built in boot() method, which is nice.

    How did you calculate the p-value in your bootstrap.exe tool?

  • Jez

    yeah – I really need to get into R…

    Nothing fancy to calculate the p-value: basically, a counter gets incremented for every resampling iteration where the bootstrapped mean is above the observed one. Divide by the total number of iterations at the end, and you get your p-value.
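    In code, the counter logic described above might look like this (a minimal sketch of my own; the return stream is made up):

```python
import numpy as np

def bootstrap_pvalue(returns, n_resamples=5000, seed=0):
    """Single-rule bootstrap p-value: the fraction of zero-centred
    resampled means that reach the observed mean."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    observed_mean = returns.mean()
    adjusted = returns - observed_mean            # zero-centre the returns
    counter = 0
    for _ in range(n_resamples):
        sample = rng.choice(adjusted, size=returns.size, replace=True)
        if sample.mean() >= observed_mean:        # increment the counter
            counter += 1
    return counter / n_resamples                  # divide by iterations

# Made-up monthly returns with a clearly positive mean
monthly = np.random.default_rng(1).normal(0.02, 0.04, size=240)
p = bootstrap_pvalue(monthly)
```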

  • Troy S.

    Haha that makes a lot more sense than what I had in mind…

    I found this PDF very helpful for getting started with R. Take a look at it when you have the time.

    Thanks for your clear explanations of everything. Looking forward to your next posts on Monte Carlo!

  • Adam


    I stumbled across your blog looking for White’s Reality Check. Excellent explanation, I might add. If you want to find the original paper, just search Google Scholar; the paper should be the first hit.


  • George

    It seems that the code is not downloadable (only the application). Would you mind sharing the code as well? (in a zip file)

  • Jez Liberty

    Well… the exe is “compiled” code ;-)
    I don’t have access to my code repository at the moment but I’ll try and see about that when I’m back…

  • Jonathan Keith

    Jez —

    Excellent site! I’m still working through (struggling with) the calcs involved in detrending, so if you wouldn’t mind posting your source so I could follow an example, that would be wonderful. Thanks for all the great work…

  • Michael Harris


    This is one of the most interesting blogs/websites I have ever come across.

    Regarding the bootstrap tests you performed for Chesapeake, I ran my own proprietary test for randomness on the monthly returns since 2006 and I get a p-value of 0.03. That points to very moderate evidence against randomness.

    However, I believe, and I may be wrong, please correct me in that case, that the bootstrap test will only tell you if the expected real return in the future will be different from zero. It is not a test for randomness in a strict sense.

    Best regards,

    Michael Harris

    P.S. I will add your blog to my blogroll

  • Rick


    You mentioned the detrending of the data series in several places before the backtest is run. I do not understand this. Detrended data are unrealistic, and nobody I know would ever design a system based on such data, because the results are unrealistic. Furthermore, when you used the monthly returns of the fund to get the p-value, you had no way of detrending any data because you had no way of knowing what the data were. So, allow me please to ask the following questions: where did this requirement about detrending data come from, and if detrending is not done, does that affect the bootstrap test results?

    Thank you.

  • Jez Liberty

    Hi Michael,
    Thanks for your comments!
    Yes – the bootstrap tests only for statistical significance of the result being positive.
    Aronson’s logic in the description of the test is that if a trading rule has “predictive power”, its output in terms of performance should be positive (with statistical significance). As this is an “approximation” with a confidence interval, there is no guarantee that the positive output is not random, but you can set the threshold for the p-value and determine the level of certainty you need.

  • Jez Liberty

    Hi Rick,
    Please check this post on detrending:

    My conclusion is that it is not clear-cut – and in my own testing I do not detrend the data.

    Aronson describes detrending as a way of removing a possible position bias (ie Buy and Hold in a bull market would show positive results despite the “rule” having no predictive power. By detrending, the performance would drop to 0)

    PS: also, detrending is done after the backtest (i.e. from that post linked above: “For each trade in the backtest, adjust the trade return by subtracting the daily drift for each day in the trade”).
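    A sketch of that per-trade adjustment, with hypothetical numbers (the daily drift being the average daily return of the underlying over the test period):

```python
import numpy as np

# Hypothetical daily returns of the underlying over the whole test period
underlying_daily = np.array([0.002, -0.001, 0.003, 0.001, -0.002, 0.004])
daily_drift = underlying_daily.mean()

# Hypothetical trades: (raw trade return, holding period in days)
trades = [(0.05, 3), (-0.02, 2), (0.03, 4)]

# Adjust each trade return by subtracting the daily drift
# for each day the trade was held
detrended = [ret - daily_drift * days for ret, days in trades]
```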

  • Michael Harris

    Hi Jez,
    Thanks for the reply.

    What do you think about the other comment I made regarding the Chesapeake performance since 2006? I ran my own proprietary test for randomness using the monthly returns since 2006 and I get a p-value of 0.03. That points to moderate evidence against randomness. It is, I believe, 3 orders of magnitude higher than your number.

    I also ran a bootstrap test on the monthly returns since 2006 and I get a p-value close to 0.20. This points to no evidence against the null hypothesis. What numbers do you get with your algorithms?

    Also, do you think then that after 2006 or so the performance of the fund is getting more random and it is the past (before 2006) that contributes to the high significance of your test?


  • Jez Liberty

    Hi Michael,

    I presume a track record like Chesapeake’s contains periods where random noise is more predominant than the system’s edge, and other periods when the edge of the system is more apparent and random noise manifests less.
    The track record from 2006 is probably a case of the former.

    Assuming the “randomness” level of the system/programme is constant through time, a longer test period should allow for a better evaluation of the system (and probably why I find a much lower p-value). Of course, if the assumption is wrong and system edges actually change through time, periods of increased randomness might actually point to the system losing its edge/becoming more random… (this is a possibility for Chesapeake since 2006)

    Maybe some sort of “rolling” p-value calculation is a good way to see how it changes through time and if a real degradation in system performance seems to be occurring
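    A rolling p-value calculation like the one suggested could be sketched as follows (my own illustration on a synthetic track record whose edge fades halfway through; the window length and step are arbitrary choices):

```python
import numpy as np

def bootstrap_pvalue(returns, n_resamples=2000, seed=0):
    """Single-rule bootstrap p-value (zero-centre, resample, count)."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    observed = returns.mean()
    adjusted = returns - observed
    idx = rng.integers(0, returns.size, size=(n_resamples, returns.size))
    return (adjusted[idx].mean(axis=1) >= observed).mean()

# Synthetic track record: a decent edge that fades in the second half
rng = np.random.default_rng(3)
early = rng.normal(0.02, 0.05, size=120)   # edge present
late = rng.normal(0.0, 0.05, size=120)     # pure noise
track = np.concatenate([early, late])

# p-value over a rolling 60-month window, stepped forward a year at a time
window = 60
rolling_p = [bootstrap_pvalue(track[i:i + window])
             for i in range(0, len(track) - window + 1, 12)]
```

A rising rolling p-value would then suggest the edge degrading (or noise taking over), as discussed above.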

    I would be interested to know what concepts you use in your test for randomness.

  • Michael Harris

    Hi Jez,

    Thank you for the reply.

    I want to make sure I fully understand what you are doing with the single bootstrap test before I compare it to my own test. If I understand what you wrote correctly, you take the original returns, as in the Chesapeake example, and you subtract from each observation the mean of the sample. Then, you bootstrap to generate a distribution. This will be a distribution of the adjusted mean returns. Do I understand this correctly?

    If this is the case, then the statistic you are testing for significance is the mean of the adjusted returns. I have a problem understanding what this statistic reflects. I would think that the objective would be to test the significance of the hypothesis that the mean return is zero or less. Thus, the objective would be, from the sample of returns, to get the distribution of the mean via bootstrapping and see whether the observed mean is significant. I still have a hard time understanding what the centering of observations accomplishes and what the practical value of the resulting statistic is. What exactly is the null hypothesis in that case? Is it, for example, that the mean deviation of returns from the sample mean is zero or less?
    Furthermore, in the code you posted do you actually center the observations before you resample?

    Thank you for the interesting discussion.

  • Jez Liberty

    From Aronson’s book:
    “the bootstrap tests for the null hypothesis that the rule does not have any predictive power”. In the book he equates this to “the null hypothesis that the arithmetic mean return of the rule being back-tested is zero”

    H0: back-tested rule has no predictive power = arithmetic mean return of the rule is zero
    Ha: back-tested rule has predictive power = arithmetic mean return of the rule is positive

    H0 is the hypothesis that the mean return is zero, so you need a distribution with a zero mean.
    What the bootstrap does is build such a distribution using the actual data from the test, in order to have variance/deviations and other distribution characteristics in line with the process being tested.
    After the zero-mean distribution is built, you perform a standard statistical significance test by comparing the data tested (non-adjusted mean return) against the zero-mean distribution. This gives you the p-value, which represents the probability that the observed mean return is drawn from a zero-centered distribution. The lower the p-value, the better.
    This is a non-parametric method, but an analogy would be to calculate how many standard deviations away the mean tested lies.

    The code I posted does center each return before resampling.

    Hope this clarifies things…

    PS: you could also look at the “bootstrap 1” post, where another reader was confused (maybe my explanations are not so clear!) and we exchanged in the comments section for clarification. The diagram on there hopefully helps with understanding.

  • Michael Harris

    Hi Jez,

    All these are understood. I am questioning the procedure. To start with, linearity dictates that you do not have to subtract the sample mean from each observation. You can just subtract the sample mean from each bootstrapped mean. It is easier and faster. Don’t you agree?

    But I still do not agree with the process, and I think Lou, the other poster who I just found out disagreed, was intuitively correct. I do not know Aronson, but in general I have found it best not to take anything for granted.

    In my opinion, the correct procedure is to generate the distribution of plain returns via bootstrap. Then center that distribution. This will give you the bias, if we can call it such. You subtract the bias, i.e., the mean of the distribution, from your sample mean return and then you check the result, which is the sampling error, for significance.

    Example: You have a mean return of 1.5%. You bootstrap the returns and you find that the mean of the distribution you obtain is 0.5%. This is the bias. It means that any random system should attain this return. Anything less will be a losing system or worse than random. Now you center your distribution by subtracting 0.5%. Then, you test the mean return minus the bias for significance, i.e. 1.5% – 0.5% = 1%. This is the return of a system that outperforms random systems. If this is significant, then the system is probably not random to some measure.

    What do you think? I think centering in advance removes the bias and makes the test extra non-conservative. Also, I think that this way, which is in my opinion the correct one, you do not have to detrend data.
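    The two procedures being debated can be compared directly. This sketch (my own illustration, on synthetic returns) shows that, by linearity, the only difference between them is the small bias between the raw sampling distribution’s mean and the sample mean:

```python
import numpy as np

rng = np.random.default_rng(7)
returns = rng.normal(0.015, 0.06, size=240)   # synthetic monthly returns
observed = returns.mean()
n_resamples = 20_000

# One common set of resample indices so both procedures see the same draws
idx = rng.integers(0, returns.size, size=(n_resamples, returns.size))

# Procedure A (centre first): zero-centre the returns, then resample
means_a = (returns - observed)[idx].mean(axis=1)
p_centre_first = (means_a >= observed).mean()

# Procedure B (centre after): resample the raw returns, then shift the
# whole sampling distribution by its own mean
means_b = returns[idx].mean(axis=1)
bias = means_b.mean() - observed              # theory says this is ~0
p_centre_after = (means_b - means_b.mean() >= observed).mean()
```

    Since each entry of means_a equals the corresponding entry of means_b minus the observed mean, the two p-values can only differ through the bias, which shrinks as the number of resamples grows.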

  • Jez Liberty

    I see what you are saying. Intuitively, I would have thought that this would be equivalent (as per my earlier comment to Lou), i.e. the bias from the non-adjusted bootstrapped distribution (your 0.5%) would be equal to the return of the backtest (1.5%). I might very well be wrong, though; this is something I have not tested. I’ll add it to my “stack”, as I do not think we can draw any conclusions without testing this.

    Computationally, subtracting the bias prior to bootstrapping is probably better, as there are typically fewer individual returns than bootstrapping iterations (and therefore fewer subtraction operations).

  • Michael Harris

    Hi Jez,

    Thanks for the reply.

    I think that the centering of the sampling distribution causes very optimistic tests.

    This is what I think is the case. Aronson backtests on detrended data. Thus, he assumes any bias is removed and that centering just removes sampling error. In this case I agree about the centering.

    However, when testing raw returns, as in the case of the Chesapeake performance, you have no way of knowing the markets and data that were used. In this case, the sampling distribution mean may reflect a bias of the market. A conservative test in this case would be to test whether the results are significant in relation to this bias. Thus, you have to center the sampling distribution after you find its mean. This mean you subtract from the sample mean return. The hypothesis stays the same, but the number you are checking for significance is not the original return but the one after the bias is removed.

  • Michael Harris

    Hi Jez,

    I did some reading today and although I think your understanding of the theory is impeccable, in reality there will always be some bias between the sample mean and the bootstrap distribution. Let us call it BIAS.

    I think the correct procedure would be to estimate the bias first. Then, to center the sampling distribution obtained from bootstrap and to test (Mean return – BIAS) for significance.

    I agree with you that theory says the BIAS should be ZERO, i.e. the sampling distribution is centered at the sample mean (law of large numbers). But an estimate of this quantity is crucial, I think, because we cannot know how it might impact the significance test.

    I will wait to hear from you. I think we both understand now the issue (or non-issue). Thanks for a very interesting and stimulating discussion.

  • Jez Liberty

    Hi Mike,
    Enjoying the discussion too, on my side. I completely see your point and it has actually piqued my curiosity.

    As I said, I have a lot on my plate at the moment, but I will try to run the method you suggest on Chesapeake’s return stream and see if there are any divergences.

  • Justin

    Hi Jez,

    The articles on bootstrapping are excellent, as well as the discussion. Kardi Teknomo has an Excel example of bootstrapping on his website:

    Definitely worth a look.


  • Michael Harris

    Hi Jez,

    I finally found some time to study this in more detail. Specifically, using my statistics resampling package, I found the following in the case of the Chesapeake monthly returns from 01/1990 to 10/2010 (100,000 runs, arithmetic mean):

    Case 1: Your algorithm. p-value = 0.00049

    My analysis:

    Case 1 (my implementation): Center distribution, resample 100,000 times. p-value = 0.00066 (method similar to yours)

    The difference in the p-values is about 25% (my calculation vs. yours).

    Case 2: First resample returns, calculate mean, repeat 100,000 times. Then center distribution. p-value = 0.00071

    Case 3: First resample returns, calculate mean, repeat 100,000 times. Find mean of distribution, calculate Bias = mean of distribution – mean of returns. Subtract Bias from mean of returns. p-value = 0.00072 (very small difference from Case 2 above)

  • Jez Liberty

    Hi Mike,
    I found some time to look at this also.
    I ran 500,000 resamples each time (as even with 100,000 I noticed a bit of variation between identical tests).

    So, I re-ran the bootstrap on the Chesapeake results from the original test in the post (1988-2009, so results would be different to yours):

    For case 1: exact same method as in the post:
    p-value = 0.0001024 (slightly different from the result calc’ed in the post: 0.000098)

    For your case 2: the distribution of the resamples was centered on 0.0169914, which is very close to the mean return of the original returns (0.017055). After zero-centering the distribution, the p-value is 0.0001084.

    For case 3, as the bias is very small, the p-value is very similar to case 2: 0.0001151

    In light of your further tests and mine, I am not sure whether the alternative testing methods make much difference – what’s your opinion?

    Any idea why there is a 25% difference between our calculations (with case 1)?

  • Michael Harris

    Hi Jez,

    I copied the returns from the link you gave to the Autumn Gold service. It would be better if we use the same numbers. On that website the returns start from 1990. I am going to copy and then paste the numbers below, and you can edit this post to remove them after saving them to a .txt file for testing purposes (if you cannot edit, you can just delete the post). Again, the returns start at 01/1990.


  • Jez Liberty

    I quickly ran the same tests using the data you pasted above:

    p-value of the method as per the post above: 0.0006038 (that would be case 1)

    p-value after zero-centering the resamples distribution: 0.0006148 (that would be your case 2)
    p-value when taking the bias (mean return – mean resamples distribution) into account: 0.0006424 (that would be your case 3)

    for reference:
    mean return of original return stream: 0.0105868
    mean of 500,000 resamples (of non-adjusted returns) 0.010547

    The values seem very close to each other (with the bias being minimal again) and also fairly close to your values.

  • Michael Harris

    Hi Jez,

    Thanks for the follow-up. I ran a few more tests using different data to eliminate any variability due to the specific sample used (Case 1, 500K runs):

    Hawskbill fund (1/1990 – 10/2010):

    Your p-value: 0.0021 – My p-value: 0.002076

    Altis fund:

    Your p-value: 0.002886 – My p-value: 0.003038

    Chesapeake (1/1990 – 10/2010):

    Your p-value: 0.000604 – My p-value: 0.00057

    I do the resampling in Excel and it is very slow compared to your fast program.

    Final conclusions:

    (1) Values are very close and there are no significant deviations present.

    (2) Resampling and then centering (Case 2) does not cause large deviations and any variability is again not significant.

    (3) Correcting for any bias of the sampling distribution mean wrt to the original mean does not again produce any significant change in the results.

    (4) Any observed deviations in the results, in general, can be attributed to resampling variability. Furthermore, your algorithm is a lot faster and preferable in this respect over Cases 2 and 3.

    A post on my blog, maybe later today, will analyze some aspects of the Chesapeake performance and compare it to the other two funds. I will try to identify the cause of the low p-value for the period 2000 – 2010.

  • Jez Liberty

    Good to see this follow-up data. And somehow it is reassuring that you are coming to very similar values.
    I will look out for the post on your blog.

    ps: credit where credit is due: this is not really “my algorithm” but rather the implementation of the algorithm as described in Aronson’s book…

  • Michael Harris

    Hi Jez,

    Thanks for the discussion regarding the bootstrap method. Here is the link to my related blog post:

  • Keith

    Hi Jez,

    First, thanks very much for providing your excellent blog — and especially your articles on bootstrap and Monte Carlo testing. They have inspired me to code up these tests into our own backtesting regimen.

    David Aronson’s site at provides several interesting additional links — a link to Timothy Master’s excellent 2006 article:
    and a link to a pdf file of “Reader Issues”:

    This was of interest to me because, having done countless backtests, walk-forward tests, etc. on our own trading systems over the last decade, I became concerned when, after coding up and applying the bootstrap and Monte Carlo methods to my backtests, I found that my p-values were always zero, the ideal result! This is similar to one reader’s comment: “My dataset has a return that is tremendously better than random systems generated in a Monte-Carlo Permutation Test, and its p-value is zero. I am sure that I have not overfitted the parameters in my model. Is a p-value of zero reasonable?”

    Part of Aronson’s response is: “The dataset tested must be obtained from COMPLETELY out-of-sample data.” So this suggests to me that the bootstrap and Monte Carlo tests are relatively useless (and might give us unjustified enthusiasm about our systems) when we try to apply them to our backtest periods and would only reveal anything when applied to “walk-forward” or “out-of-sample” data.

    I am curious if you have found anything to support the use of these methods on backtest data.

    Thanks again for publicly disclosing your ideas and work.

  • Jez Liberty

    Thanks Keith, for the comments and pointer to the docs.
    Bearing in mind that I have used the bootstrap tests rather than the MC method, I believe Aronson’s comments apply only to single bootstrap/MC testing, in which case you would definitely need to isolate your out-of-sample data from your research/optimization data.
    However, the later method described in the book for taking data mining into account should allow you to use these methods on back-test data (because this is exactly where you would generate data mining bias). This is my understanding anyway…

  • Michael Harris


    Happy New Year! Glad to see you back with more posting and reports. This is the most interesting blog on trading system analysis and design.


    The strong condition for the bootstrap to have any meaning is that the data are i.i.d. (independent and identically distributed). The weak condition is exchangeability, meaning that future data behave like past data. If you have a sufficiently long history in your backtest, the weak condition may be satisfied. However, that cannot eliminate data mining bias. One way of doing so is through cross-validation (forward testing). Thus, there are two issues here: (A) whether the bootstrap has meaning a priori given the data sample, and (B) whether the model selection involved bias, in which case the bootstrap has no meaning at all.

    IMO, backtesting is a process for hypothesis rejection only. Results of backtesting cannot be used for validation or selection, whether via bootstrap or not. This is in my mind the key point besides statistics.

  • Steven Schmidt

    Jez, great blog and thank you very much for the code – huge help for me! With regard to White’s Reality Check, I was wondering if you have any thoughts on how it can be applied to an intER day trading system. My difficulty is that many of my systems do not generate a daily return and thus each rule will have a different value for “x”. As I understand it, having the same number of observations is a requirement for bootstrapping.

  • Jez Liberty

    Hi Steven
    I simply use the daily open equity change for the daily return.

    If by inter-day you mean that the system can hold position(s) for several days – as I suspect (and as most of the systems I develop/research here do) – this becomes a non-problem, because the positions will still change in value, giving you a daily return (albeit not realized).
    If your system does not trade at all on specific days, then the return is simply 0.

    Hope this helps – Jez

  • leo


    I’m surprised nobody addressed the fact that Aronson did NOT center the mean at 0, because he demonstrated that the p-values would be significantly higher and thus statistically insignificant. He gave a vaguely described formula for adjusting the centering of the mean according to several factors – among them degrees of freedom, amount of data, etc. It would be nice to know exactly how that formula works, since he didn’t go into it, although he did mention it twice in the book.



  • Nitin Gupta

    Hi Jez,
    Can you please post the return file that you are uploading into Bootstrap?

    I am getting an error in Input String… please post the file soon… it will help….

  • Nitin Gupta

    I have got a p-value of 0… is this possible?

  • nikolas

    hi nice website.

    i have the following question:
    if i am testing an intra-day system (1h), how do i use the bootstrap and how should i detrend?

    one way i thought of was to subtract the average monthly return from each trade and then run the bootstrap on the end-of-month balance of my equity

    would that make sense?

  • nikolas

    one more thing. i think the way you described detrending the trades has a small mistake.

    you said you subtract the drift from the result of the trade; that is, if it’s long. for a short i think it should be added.

  • Jez Liberty

    The detrending is done on each trade’s holding period return, by removing the implicit trend contained in the underlying data used for the back-test, not on the raw result of the trades. Check this post for more info:
    For an intra-day system (1h periods), you would use the exact same methodology as for a daily strategy: evaluate the set of each period’s returns for the strategy in question (i.e. every hour) and apply the same bootstrapping methodology: calculate the mean return, remove it from the set of returns, generate n resamples and evaluate the p-value (being the percentage of resamples that exceed the input mean return).
    Hope that helps.
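    That per-period bootstrap can be sketched in a few lines of Python. This is purely illustrative code (the function name and defaults are mine, not the downloadable tool), and it applies unchanged to hourly or daily returns:

```python
import random

def bootstrap_p_value(period_returns, n_resamples=5000, seed=42):
    """Single-rule bootstrap test (sketch): zero-center the period
    returns (hourly, daily, ...), resample with replacement, and count
    how often a resampled mean reaches the observed mean return."""
    rng = random.Random(seed)
    n = len(period_returns)
    observed_mean = sum(period_returns) / n
    # Zero-centering: remove the observed mean from every return
    adjusted = [r - observed_mean for r in period_returns]
    exceed = 0
    for _ in range(n_resamples):
        sample_mean = sum(rng.choice(adjusted) for _ in range(n)) / n
        if sample_mean >= observed_mean:
            exceed += 1
    return exceed / n_resamples  # p-value
```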

  • FranSc

    This is White’s original paper.

  • FranSc

    It seems Dr White passed away:


  • alpha

    hi Jez,

    My understanding is that the bootstrap (and MC) methods are good for addressing the data mining bias (also known as data snooping bias).

    However, as per Aronson’s response to some readers’ questions, he states that you should perform the bootstrap on an out-of-sample backtest.

    Is this correct?

    If so, what is the point of the argument that “the bootstrap addresses the data mining bias”?


    regarding the In-sample vs Out-of-sample backtesting;

  • Jez Liberty

    This gets me confused too, but it would be interesting to see Aronson’s answer in context. In a way, an in-sample/out-of-sample test is another way to guard against the data mining bias. He might simply be referring to the single-rule bootstrap test (as discussed in an earlier post), which does not address the data mining bias (which was described by Aronson like this).

    I could see how you would use:
    1- Optimize a system by testing multiple rules
    2- Apply the “Reality Check” bootstrap test to the selected “best” rule
    3- Test that best rule on out-of-sample data
    4- Run a single-rule bootstrap test on the out-of-sample test results
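    For step 2, the “higher-level / lower-level” resampling described in the post can be sketched in Python as follows. This is an illustrative sketch only; `rule_returns` is a hypothetical dict mapping each mined rule’s name to its list of detrended daily returns:

```python
import random

def reality_check_p_value(rule_returns, n_resamples=2000, seed=1):
    """White's Reality Check (sketch): zero-center each rule's returns,
    then on each 'higher-level' resample draw a bootstrap sample for
    every rule and keep the best bootstrapped mean as the test
    statistic, comparing it against the best observed mean."""
    rng = random.Random(seed)
    # Observed mean of each data-mined rule, and the best of them
    means = {name: sum(r) / len(r) for name, r in rule_returns.items()}
    best_observed = max(means.values())
    # Zero-centering (step 2 of the post)
    adjusted = {name: [x - means[name] for x in r]
                for name, r in rule_returns.items()}
    exceed = 0
    for _ in range(n_resamples):
        best_boot = max(
            sum(rng.choice(adj) for _ in range(len(adj))) / len(adj)
            for adj in adjusted.values())
        if best_boot >= best_observed:
            exceed += 1
    return exceed / n_resamples
```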


  • alpha

    hi Jez,

    It starts to make sense. However, I see less value in using a bootstrap if I have to do an out-of-sample test anyway.

    In a recent interview with Bill Eckhart by FuturesMag, Bill indicated that he addresses the data snooping bias by the following:
    – using fewer than 12 degrees of freedom (to ensure robustness and simplicity of the system)
    – a minimum sample size of 1,800 trades
    – using “proprietary” robust statistical tests to guard against data snooping. He confirmed that he does not use out-of-sample tests, as this “proprietary” test already covers this aspect.

    It sounds like a holy grail to me, lol.

    Do you have any idea what that robust statistical test is?

    PS: in a 1996 event, Bill indicated that he uses the bootstrap technique to derive statistics from his backtest results.

  • Steven Schmidt

    My understanding is that bootstrapping is simply a more accurate way of deriving statistical significance than using a chi-square test. The chi-square test is accurate with very large numbers, but I don’t think the number of trades an average system generates gets remotely close to “large numbers”. The resampling in bootstrapping exploits the law of large numbers to add robustness to the statistical analysis (and p-value), but it does not address the data snooping bias. I’m still working on the implementation, but the way I’ve been thinking about addressing biases is the following:

    1. Evaluate subsets of market data from both bear and bull markets. This helps address position bias without eliminating it. Basically, if a rule does well in a bull/bear market because of position bias, that information is in itself valuable.
    2. Use walk-forward testing to address the data mining bias. Aronson somewhat dismisses this approach, but I think there are definite merits to it.
    3. Assuming there is at least consistency in the results of a system in each bull/bear market period, I then compare those results to an equal number of random trades within each period. By taking a very large number of random trade samples within each period, you can then compute a p-value for each period.

    I’m still fleshing out #3 but would be very curious to hear any thoughts.
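    To make #3 concrete, here is a rough Python sketch (all names and parameters are hypothetical, just to illustrate the idea): draw the same number of random trades within the period, rebuild the total return each time, and see how often random trading beats the system.

```python
import random

def random_trade_p_value(daily_returns, n_trades, hold_days,
                         system_return, n_samples=5000, seed=7):
    """Random-trade benchmark (sketch): draw n_trades random entry
    points in the period, hold each for hold_days, and compare the
    system's total return against the distribution of random outcomes.
    Returns the fraction of random samples that beat the system
    (a small value suggests the system adds value in this period)."""
    rng = random.Random(seed)
    last_entry = len(daily_returns) - hold_days
    beat = 0
    for _ in range(n_samples):
        total = 0.0
        for _ in range(n_trades):
            start = rng.randrange(last_entry + 1)
            total += sum(daily_returns[start:start + hold_days])
        if total >= system_return:
            beat += 1
    return beat / n_samples
```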

  • alpha

    Certainly there is no substitute for walk-forward testing. If anyone dismisses this method, I would doubt their trading experience.

    I personally always forward test, for at least a couple of months or 100 trades, whichever comes first. IMHO, forward tests address all biases: data snooping, look-ahead bias, software bugs, etc.

    Forward tests come at a cost: if you constantly develop trading rules, it will take years to forward test all of them. Hence the need for a robust test.

    The point (3) you mention is interesting. I will test it and post results.

    I did toy with bootstraps a lot recently, and I saw some benefits from the max drawdown confidence interval.

    A system that had a drawdown of 8% in the backtest, with a Sharpe of 3.5, had a max drawdown of 45% at the 95% confidence interval.

    That is too high for my standards. I am looking for a max of 15% at 95% confidence. At least that is what I am taking away from the bootstrap technique at the moment.

    PS: I am still keen to find out about Eckhart’s secret formula.
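    For anyone wanting to reproduce that kind of drawdown confidence interval, here is a minimal Python sketch under the assumption of a list of per-trade returns (my own illustrative code, not alpha’s or the blog’s):

```python
import random

def max_drawdown(returns):
    """Worst peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def drawdown_ci(returns, n_resamples=2000, pct=95, seed=3):
    """Bootstrap the max-drawdown distribution: resample the trade
    returns with replacement, rebuild an equity curve each time, and
    read off the chosen percentile of max drawdown."""
    rng = random.Random(seed)
    n = len(returns)
    dds = sorted(max_drawdown([rng.choice(returns) for _ in range(n)])
                 for _ in range(n_resamples))
    return dds[int(n_resamples * pct / 100) - 1]
```

    Note that simple resampling shuffles away any serial dependence in the returns, so this sketch treats trades as independent draws.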

  • Jez Liberty

    Hi Steven,
    Thanks, you are actually addressing alpha’s question about robust statistical tests. One of the reasons the bootstrap test is more robust is that it is non-parametric: it does not make assumptions about the data distribution (that is why it uses resampling).
    Re: your data snooping question, it is not the bootstrap test directly that addresses it, but rather how it is used in the White Reality Check (detailed in the post) to identify whether the best rule resulting from in-sample optimization really has value, or whether it is more likely the result of random chance/variation.

    alpha, Wikipedia is seriously a good start for robust statistics (a very basic example is mean vs. median).
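    To make the mean vs. median point concrete, here is a toy Python example (made-up numbers, not from the blog’s code): a single catastrophic outlier trade drags the mean negative while the median, a robust statistic, barely moves.

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

# Five ordinary trade returns, then the same set plus one disaster
returns = [0.01, 0.012, -0.008, 0.009, 0.011]
with_outlier = returns + [-0.95]  # one catastrophic trade
```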

  • alpha

    Steven, I did try the method (3) you mention (random sampling of trades from the same period), then compared the system performance to the random distribution and computed a p-value.

    The stats were interesting. However, I tested a system of mine that did not pass the forward testing period, to see if this method would expose its curve-fitted nature, but to no avail: the p-value was still close to zero.

    See attached:
    – p-values of various statistics (on each distribution of a statistic, the dashed green line is the system performance and the two red lines are the 95% confidence interval)
    – equity curve of the system (black), compared to random equity curves (gray), at intervals of 100%, 99%, 98%… down to 50%.

    The next step is to try variations of the random sampling method, i.e. enter at exactly the same time of day and hold the trades for the same holding periods.
