
In Part 1 of this bootstrap post, we looked at how to apply the method to establish the statistical significance of a single trading rule. In Part 2, we’ll look at how to deal with the data mining bias and at the impact of using the geometric vs. the arithmetic mean return. The code implementing the bootstrap test is available for download at the bottom of this post.
Dealing with Data Mining Bias
The approach described in the single rule test is not valid when performing data mining (whether testing different rules or different parameter values of the same rule). As per the data mining bias (explained previously), the (best) rule selected from the data mining process will invariably owe a large part of its over-performance to random (good) luck.
The way the bootstrap test deals with the data mining bias is by implementing a concept introduced in White’s Reality Check. The Reality Check derives the sampling distribution appropriate to test the statistical significance of the best rule found by data mining.
In effect, the concept is fairly simple – and similar to the single-rule bootstrap: assuming N rules have been tested in the data mining process, each resample iteration will perform a resample with replacement for each rule and the best mean return will be kept as this resample iteration’s test statistic:
- N back-tests are run on detrended data. The mean daily return, based on x observations, is calculated for each back-tested rule.
- Each rule’s mean daily return is subtracted from the rule’s set of daily returns (zero-centering). This gives a set of adjusted returns for each rule.
- For each “higher-level” resample (to form the sampling distribution of the best-performing rule in a universe of N rules), perform a “lower-level” resample with replacement on every rule. For each rule, select x instances of adjusted returns at random and calculate their mean daily return (rule bootstrapped mean). Compare the rules’ bootstrapped means and select the highest one: this is the test statistic of this “higher-level” resample (bootstrapped best mean).
- Perform a large number of “higher-level” resamples to generate a large number of bootstrapped best means.
- Form the sampling distribution of the best means generated in the step above.
- Derive the p-value of the best back-test mean return (non zero-centered) based on the sampling distribution derived above.
In effect: for each iteration, resample each rule, take the best return, keep it as this iteration’s test statistic and move on to the next iteration. The sampling distribution is formed of each iteration’s best return.
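The steps above can be sketched in a few lines of Python. This is only an illustration of the procedure as listed, not the code behind the downloadable tool: the function name `reality_check_p_value`, its parameters and the use of the standard `random` module are assumptions made for the example. It also draws each rule's resample independently, as the steps read; reusing the same resampled dates across all rules would be a variation of this sketch.

```python
import random
from statistics import fmean


def reality_check_p_value(rule_returns, n_resamples=2000, seed=0):
    """Bootstrap p-value for the best of N rules (hypothetical helper).

    rule_returns: list of N lists, each holding the x daily returns of one rule.
    """
    rng = random.Random(seed)
    observed_means = [fmean(r) for r in rule_returns]
    best_observed = max(observed_means)  # best back-test mean, not zero-centered

    # Zero-center each rule by subtracting that rule's own mean daily return.
    centered = [[v - m for v in r] for r, m in zip(rule_returns, observed_means)]

    count = 0
    for _ in range(n_resamples):  # "higher-level" resamples
        # "Lower-level" resample: x draws with replacement for every rule,
        # keeping only the best bootstrapped mean as the test statistic.
        best_boot = max(fmean(rng.choices(c, k=len(c))) for c in centered)
        if best_boot >= best_observed:
            count += 1
    # Share of bootstrapped best means at or above the observed best mean.
    return count / n_resamples
```

Calling `reality_check_p_value(list_of_return_lists)` returns a value between 0 and 1; the more rules are tested, the higher the bootstrapped best means tend to be, which is how the data mining bias is accounted for.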
White Reality Check Related Papers
I have not yet searched hard for White’s paper on the Reality Check, but I did find the following two papers, which seem to be worth a read:
Stepwise Multiple Testing as Formalized Data Snooping – Romano & Wolf
Abstract:
It is common in econometric applications that several hypothesis tests are carried out at the same time. The problem then becomes how to decide which hypotheses to reject, accounting for the multitude of tests. In this paper, we suggest a stepwise multiple testing procedure which asymptotically controls the familywise error rate at a desired level. Compared to related single-step methods, our procedure is more powerful in the sense that it often will reject more false hypotheses.
Unlike some stepwise methods, our method implicitly captures the joint dependence structure of the test statistics, which results in increased ability to detect alternative hypotheses. We prove our method asymptotically controls the familywise error rate under minimal assumptions. Some simulation studies show the improvements of our methods over previous proposals. We also provide an application to a set of real data.
In this paper, we re-examine the profitability of technical analysis using White’s Reality Check and Hansen’s SPA test that correct the data snooping bias. Comparing to previous studies, we study a more complete universe of trading techniques, including not only simple rules but also investor’s strategies, and we test the profitability of these rules and strategies with four main indices. It is found that significantly profitable simple rules and investor’s strategies do exist in the data from relatively young markets (NASDAQ Composite and Russell 2000) but not in the data from relatively mature markets (DJIA and S&P 500). Moreover, after taking transaction costs into account, we find that the best rules for NASDAQ Composite and Russell 2000 outperform the buy-and-hold strategy in most in- and out-of-sample periods. Our results thus suggest that the degree of market efficiency may be related to market maturity. It is also found that investor’s strategies are able to improve on the profits of simple rules and may even generate significant profits from unprofitable simple rules.
Geometric or Arithmetic Mean?
In Part 1, I introduced the idea that a positive arithmetic mean return is not equivalent to the strategy being profitable (i.e. it is not a sufficient condition). On the other hand, a positive geometric mean return is a necessary and sufficient condition for the strategy being profitable (i.e. both conditions are equivalent).
Therefore, bootstrapping with the geometric mean return as the test statistic should provide a better evaluation of the statistical significance of the system’s profitability.
I will not go into the detail of the calculation, as it is very similar to the arithmetic mean return but uses the log of returns instead. Note that the geometric mean will be a stricter test than the arithmetic mean (a rule can have a significantly positive arithmetic return but a negative geometric return).
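As a small worked example of that last point, the sketch below computes both means for a made-up return stream. The helper names are hypothetical, and reading "log of returns" as log(1 + r) is an assumption on my part rather than a description of the tool's internals.

```python
import math


def arithmetic_mean_return(returns):
    """Plain average of the periodic returns."""
    return sum(returns) / len(returns)


def mean_log_return(returns):
    """Average of log(1 + r); positive exactly when compounded growth is positive."""
    return sum(math.log1p(r) for r in returns) / len(returns)


def geometric_mean_return(returns):
    """Per-period compounded return implied by the mean log return."""
    return math.expm1(mean_log_return(returns))


# Made-up returns: +50%, -40%, +50%, -40%
returns = [0.50, -0.40, 0.50, -0.40]
arith = arithmetic_mean_return(returns)  # average of the four returns
geo = geometric_mean_return(returns)     # 1.5 * 0.6 * 1.5 * 0.6 = 0.81 compounded

print(f"arithmetic mean: {arith:+.4f}, geometric mean: {geo:+.4f}")
```

The arithmetic mean is positive while the account ends below where it started, so the geometric mean is negative. In a bootstrap, the mean log return can be used as the test statistic in exactly the same way as the arithmetic mean.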
To illustrate the multiple applications of the bootstrapping methodology, I decided to run the test on the track record (a set of monthly returns) of one of the Trend Following Wizards. I picked Chesapeake and ran its monthly returns (from 1988 to 2009) through the bootstrap test.
The p-value calculated using the arithmetic mean is 0.000098 (less than 1 chance in 10,000 that results like these are due to random luck). Using the geometric mean, the p-value is 0.00022. The values are extremely low, which is not surprising given Jerry Parker’s 20-year track record with only one losing year and a monthly average return of 1.7%.
Many people would point out that survivorship bias should be considered, and obviously it depends on how you look at it. The main point of this dual test is that the geometric p-value is higher than the arithmetic p-value, verifying that it is a stricter test of statistical significance.
Bootstrap Code
Finally, here is a tool coded to implement the bootstrap test for a single strategy – available for download. Note that this is distributed “as is”, with no guarantee (but that’s the one I have been using so I still think it does the job…). It should run on any Windows machine with the .NET Framework installed (XP or higher should do fine).
It simply takes three parameters (separated by space):
- Returns file path and name
- Number of resamples
- Flag for Arithmetic (A) or Geometric (G) mean calculation
It also generates a file in the same directory with all of the resamples test-statistic values (to draw the histogram).
Simply place the bootstrap.exe in your directory of choice and run it from the command prompt as below:

[Screenshot: running bootstrap.exe from the command line]
Download here:
bootstrap.exe
Very good! Saved me a bunch of hours…
Thanks!!!
R also has a built in boot() method, which is nice.
How did you calculate the p-value in your bootstrap.exe tool?
yeah – I really need to get into R…
Nothing fancy to calculate the p-value: basically a counter gets incremented for every resampling iteration where the bootstrapped mean is above the observed one. Divide by total number of iterations at the end, and you get your p-value.
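For readers who want to see that counter spelled out, here is a minimal Python sketch of the single-rule test. It is not the source of bootstrap.exe: the function name, the defaults and the use of the standard `random` module are assumptions for the example, and it uses the arithmetic mean only.

```python
import random


def bootstrap_p_value(returns, n_resamples=10000, seed=0):
    """Single-rule bootstrap p-value (hypothetical helper, arithmetic mean)."""
    rng = random.Random(seed)
    n = len(returns)
    observed = sum(returns) / n

    # Zero-center the returns so the resamples come from a zero-mean distribution.
    centered = [r - observed for r in returns]

    above = 0
    for _ in range(n_resamples):
        boot_mean = sum(rng.choices(centered, k=n)) / n
        # Increment the counter when the bootstrapped mean is above the observed one.
        if boot_mean > observed:
            above += 1

    # Counter divided by the total number of iterations.
    return above / n_resamples
```

Because the result is a ratio of counts, its resolution is limited by the number of resamples: with 10,000 iterations the smallest non-zero p-value that can be reported is 0.0001.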
Haha that makes a lot more sense than what I had in mind…
I found this PDF very helpful for getting started with R (cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf). Take a look at it when you have the time.
Thanks for your clear explanations of everything. Looking forward to your next posts on Monte Carlo!
Hi,
I stumbled across your blog looking for White’s reality check. Excellent explanation I might add. If you want to find the original paper just search google scholar, the paper should be the first hit.
Adam
Hi,
It seems that the code is not downloadable (only the application). Would you mind sharing the code as well? (in a zip file)
George
Well… the exe is “compiled” code ;-)
I don’t have access to my code repository at the moment but I’ll try and see about that when I’m back…
Jez –
Excellent site! I’m still working through (struggling with) the calcs involved in detrending, so if you wouldn’t mind posting your source so I could follow an example, that would be wonderful. Thanks for all the great work…
Hello,
This is one of the most interesting blogs/websites I have ever come across.
Regarding the bootstrap tests you performed for Chesapeake, I ran my own proprietary test for randomness on the monthly returns since 2006 and I get a p-value of 0.03. That points to very moderate evidence against randomness. However, I believe (and I may be wrong, so please correct me in that case) that the bootstrap test will only tell you whether the expected real return in the future will be different from zero. It is not a test for randomness in a strict sense.
Best regards,
Michael Harris
http://www.priceactionlab.com/Blog/
P.S. I will add your blog to my blogroll
Hello,
You mentioned the detrending of the data series in several places before the backtest is run. I do not understand this. Detrended data are unrealistic data and nobody that I know will ever design any system based on such data because the results are unrealistic. Furthermore, when you used the monthly returns of the fund to get the p-value you had no way of detrending any data because you had no way of knowing what the data were. So, allow me please to ask the following questions: where does this requirement to detrend the data come from, and if detrending is not done, does that affect the bootstrap test results?
Thank you.
Hi Michael,
Thanks for your comments!
Yes – the bootstrap tests only for statistical significance of the result being positive.
Aronson’s logic in the description of the test is that if a trading rule has “predictive power”, its output in terms of performance should be positive (with statistical significance). As this is an “approximation” with confidence interval, there is no guarantee that the positive output is not random (but you can set the threshold for p-value) and determine the level of certainty you need.
Hi Rick,
Please check this post on detrending:
http://www.automated-trading-system.com/detrending-for-trend-following/
My conclusion is that it is not clear-cut – and in my own testing I do not detrend the data.
Aronson describes detrending as a way of removing a possible position bias (ie Buy and Hold in a bull market would show positive results despite the “rule” having no predictive power. By detrending, the performance would drop to 0)
PS: also, detrending is done after the backtest (i.e. from the post linked above: “For each trade in the backtest, adjust the trade return by subtracting the daily drift for each day in the trade”).
Hi Jez,
Thanks for the reply.
What do you think about the other comment I made regarding the Chesapeake performance since 2006? I ran my own proprietary test for randomness using the monthly returns since 2006 and I get a p-value of 0.03. That points to moderate evidence against randomness. It is, I believe, close to 3 orders of magnitude larger than your number.
I also ran a bootstrap test on the monthly returns since 2006 and I get a p-value close to 0.20. This points to no evidence against the null hypothesis. What numbers do you get with your algorithms?
Also, do you think then that after 2006 or so the performance of the fund is getting more random and it is the past (before 2006) that contributes to the high significance of your test?
Michael
Hi Michael,
I presume a track record like Chesapeake contains periods where random noise is more predominant than the system’s edge and other periods when the edge of the system is more apparent and random noise manifests less.
Probably the track record from 2006 is of the former case.
Assuming the “randomness” level of the system/programme is constant through time, a longer test period should allow for a better evaluation of the system (and probably why I find a much lower p-value). Of course, if the assumption is wrong and system edges actually change through time, periods of increased randomness might actually point to the system losing its edge/becoming more random… (this is a possibility for Chesapeake since 2006)
Maybe some sort of “rolling” p-value calculation is a good way to see how it changes through time and if a real degradation in system performance seems to be occurring
I would be interested to know about what concepts you use in your test for randomness?
Hi Jez,
Thank you for the reply.
I want to make sure I fully understand what you are doing with the single bootstrap test before I compare it to my own test. If I understand what you wrote correctly, you take the original returns, as in the Chesapeake example, and you subtract the mean of the sample from each observation. Then, you bootstrap to generate a distribution. This will be a distribution of the mean of the adjusted returns. Do I understand this correctly?
If this is the case, then the statistic you are testing for significance is the mean of the adjusted returns. I have a problem understanding what this statistic reflects. I would think that the objective would be to test the significance of the hypothesis that the mean return is zero or less. Thus, the objective would be to get the distribution of the mean from the sample of returns via bootstrapping and see whether the observed mean is significant. I still have a hard time understanding what the centering of observations accomplishes and what the practical value of the resulting statistic is. What exactly is the null hypothesis in that case? Is it, for example, that the mean deviation of returns from the sample mean is zero or less?
Furthermore, in the code you posted do you actually center the observations before you resample?
Thank you for the interesting discussion.
Mike,
From Aronson’s book:
“the bootstrap tests for the null hypothesis that the rule does not have any predictive power”. In the book he equates this to “the null hypothesis that the arithmetic mean return of the rule being back-tested is zero”
So
H0: back-tested rule has no predictive power = arithmetic mean return of the rule is zero
Ha: back-tested rule has predictive power = arithmetic mean return of the rule is positive
H0 is the hypothesis that the mean return is zero, so you need a distribution with a zero-mean.
What the bootstrap does is build such distribution using the actual data from the test in order to have variance/deviations and other distribution characteristics in line with the process being tested.
After the zero-mean distribution is built, you perform a standard statistical significance test by comparing the data tested (non-adjusted mean return) against the zero-mean distribution. This gives you the p-value, which represents the probability of observing a mean return at least as high as the one tested if the true mean return were zero. The lower the p-value, the better.
This is a non-parametric method, but an analogy would be to calculate how many standard deviations the mean tested lies at.
The code I posted does center each return before resampling.
Hope this clarifies things…
PS: you could also look at the “bootstrap 1” post where another reader was confused (maybe my explanations are not so clear!) and we exchanged on the comments section for clarification. The diagram on there hopefully helps for understanding
Hi Jez,
All these are understood. I am questioning the procedure. To start with, linearity dictates that you do not have to subtract the sample mean from each observation. You can just subtract the sample mean from each bootstrapped mean. It is easier and faster. Don’t you agree?
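The linearity point can be checked numerically. The sketch below is an illustration on made-up returns (the helper `boot_means` and the sample values are hypothetical): it resamples with the same seed twice, once after centering every observation and once on the raw returns with the sample mean subtracted from each bootstrapped mean afterwards.

```python
import random


def boot_means(values, n_resamples, seed):
    """Means of n_resamples resamples (with replacement) of values."""
    rng = random.Random(seed)
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)]


# Made-up returns, for illustration only.
returns = [0.03, -0.01, 0.02, 0.04, -0.02, 0.01]
sample_mean = sum(returns) / len(returns)

# Option A: center every observation first, then resample.
centered_first = boot_means([r - sample_mean for r in returns], 1000, seed=42)

# Option B: resample the raw returns, then subtract the sample mean
# from each bootstrapped mean.
centered_after = [m - sample_mean for m in boot_means(returns, 1000, seed=42)]

# With identical draws, the two options differ only by floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(centered_first, centered_after))
print(max_gap)
```

Note that this equivalence is about subtracting the sample mean. Subtracting the mean of the bootstrap distribution itself, as proposed later in this comment, is a different operation whenever the two means differ.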
But I still do not agree with the process, and I think Lou, the other poster who I just found out also disagreed, was intuitively correct. I do not know Aronson, but in general I have learned not to take anything for granted.
In my opinion, the correct procedure is to generate the distribution of plain returns via bootstrap. Then center that distribution. This will give you the bias, if we can call it that. You subtract the bias, i.e. the mean of the distribution, from your sample mean return and then you check the result, which is the sampling error, for significance.
Example: You have a mean return of 1.5%. You bootstrap the returns and you find that the mean of the distribution you obtain is 0.5%. This is the bias. It means that any random system should attain this return. Anything less will be a losing system or worse than random. Now you center your distribution by subtracting 0.5%. Then, you test the mean return minus the bias for significance, i.e. 1.5% – 0.5% = 1%. This is the return of a system that outperforms random systems. If this is significant, then the system is probably not random to some measure.
What do you think? I think centering in advance removes the bias and makes the test extra non-conservative. Also, I think that this way, which is in my opinion the correct one, you do not have to detrend the data.
Mike,
I see what you are saying. Intuitively, I would have thought that this would be equivalent (as per my earlier comment to Lou), i.e. the bias from the non-adjusted bootstrapped distribution (your 0.5%) would be equal to the return of the backtest (1.5%). I might very well be wrong, though, and this is something I have not tested. I’ll add this to my “stack”, as I do not think we can draw any conclusions without testing this.
Computationally, subtracting the bias prior to bootstrapping is probably better, as there are typically fewer individual returns than bootstrapping iterations (and therefore fewer subtraction operations).
Jez
Hi Jez,
Thanks for the reply.
I think that the centering of the sampling distribution causes very optimistic tests.
This is what I think is the case. Aronson backtests on detrended data. Thus, he assumes any bias is removed and that centering just removes sampling error. In this case I agree about the centering.
However, when testing raw returns, as in the case of the Chesapeake performance, you have no way of knowing the markets and data that were used. In this case, the sampling distribution mean may reflect a bias of the market. A conservative test in this case would be to test whether the results are significant in relation to this bias. Thus, you have to center the sampling distribution after you find its mean. You then subtract this mean from the sample mean return. The hypothesis stays the same, but the number you are checking for significance is not the original return but the one after the bias is removed.
Hi Jez,
I did some reading today and although I think your understanding of the theory is impeccable, in reality there will always be some bias between the sample mean and the bootstrap distribution. Let us call it BIAS.
I think the correct procedure would be to estimate the bias first. Then, to center the sampling distribution obtained from bootstrap and to test (Mean return – BIAS) for significance.
I agree with you that theory says the BIAS should be ZERO, i.e. the sampling distribution is centered at the sample mean (law of large numbers). But an estimation of this quantity is crucial I think because we cannot know how it can impact the significance test.
I will wait to hear from you. I think we both understand now the issue (or non-issue). Thanks for a very interesting and stimulating discussion.
Hi Mike,
Enjoying the discussion too, on my side. I completely see your point and it actually piqued my curiosity.
As I said, I have a lot on my plate at the moment but I will try to run the method you suggest on Chesapeake’s return stream and see if there are any divergences.
Jez
Hi Jez,
The articles on bootstrapping are excellent, as well as the discussion. Kardi Teknomo has an Excel example of bootstrapping on his website:
http://people.revoledu.com/kardi/tutorial/Bootstrap/examples.htm
Definitely worth a look.
Justin
Hi Jez,
I finally found some time to study this in some more detail. Specifically, using my statistics resampling package, I found the following for the Chesapeake monthly returns from 01/1990 to 10/2010 (100,000 runs, arithmetic mean):
Case 1: Your algorithm. p-value = 0.00049
My analysis:
Case 1: Center distribution, resample 100,000 times. p-value = 0.00066 (method similar to yours)
The difference in the p-values is about 25% (my calculation vs. yours).
Case 2: First resample returns, calculate mean, repeat 100,000 times. Then center distribution. p-value = 0.00071
Case 3: First resample returns, calculate mean, repeat 100,000 times. Find mean of distribution, calculate Bias = mean of distribution – mean of returns. Subtract Bias from mean of returns. p-value = 0.00072 (very small difference from Case 2 above)
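One possible reading of the three cases, written as a Python sketch so the differences are explicit. This is not the code used by either commenter: the function name and defaults are hypothetical, and the interpretation of Case 3 (testing the bias-adjusted mean against the centered distribution from Case 2) is an assumption.

```python
import random


def three_case_p_values(returns, n_resamples=20000, seed=0):
    """Return (p1, p2, p3, bias) for the three cases discussed above."""
    rng = random.Random(seed)
    n = len(returns)
    mean_ret = sum(returns) / n

    # Case 1: center the returns first, then resample.
    centered = [r - mean_ret for r in returns]
    dist1 = [sum(rng.choices(centered, k=n)) / n for _ in range(n_resamples)]
    p1 = sum(m > mean_ret for m in dist1) / n_resamples

    # Case 2: resample the raw returns, then center the resulting distribution
    # on its own mean.
    raw = [sum(rng.choices(returns, k=n)) / n for _ in range(n_resamples)]
    dist_mean = sum(raw) / n_resamples
    dist2 = [m - dist_mean for m in raw]
    p2 = sum(m > mean_ret for m in dist2) / n_resamples

    # Case 3: same centered distribution, but test the mean return minus the bias.
    bias = dist_mean - mean_ret
    p3 = sum(m > mean_ret - bias for m in dist2) / n_resamples

    return p1, p2, p3, bias
```

Since the mean of the bootstrapped means converges to the sample mean as the number of resamples grows, the bias term should be small, and the three p-values should mostly differ by resampling noise.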
Hi Mike,
I found some time to look at this also.
I ran 500,000 resamples each time (as even with 100,000 I noticed a bit of variation between identical tests).
So, I re-ran the bootstrap on the Chesapeake results from the original test in the post (1988-2009, so results would be different to yours):
For case 1: exact same method as in the post:
p-value = 0.0001024 (slightly different from the result calc’ed in the post: 0.000098)
For your case 2: the distribution of the resamples was centered on 0.0169914, which is very close to the mean return of the original returns (0.017055). After zero-centering the distribution, the p-value is 0.0001084.
For case 3, as the bias is very small, the p-value is very similar to case 2: 0.0001151
In light of your further tests and mine, I am not sure whether the alternative testing methods make much difference. What’s your opinion?
Any ideas why there is a 25% difference between our two calculations (with case 1)?
Hi Jez,
I copied the returns from the link you gave to autumn gold service. It would be better if we use the same numbers. In that website the returns start from 1990. I am going to copy and then paste the numbers below and you can edit this post to remove them after saving them to a .txt file for testing purposes (if you cannot edit you can just delete the post). Again, the returns start at 01/1990
0.0049
0.0337
{…}
0.088
0.1095
Mike
I quickly ran the same tests using the data you pasted above:
p-value of the method as per the post above: 0.0006038 (that would be case 1)
p-value after zero-centering resamples distribution: 0.0006148 (that would be your case 2)
p-value when taking the bias (mean return – mean resamples distribution) into account: 0.0006424 (that would be your case 3)
for reference:
mean return of original return stream: 0.0105868
mean of 500,000 resamples (of non-adjusted returns) 0.010547
The values seem very close to each other (with the bias being minimal again) and also fairly close to your values..
Hi Jez,
Thanks for the follow up. I ran a few more tests using different data to eliminate any variability due to the specific sample used (Case 1, 500K runs):
Hawksbill fund (1/1990 – 10/2010):
Your p-value: 0.0021 – My p-value: 0.002076
Altis fund:
Your p-value: 0.002886 – My p-value: 0.003038
Chesapeake (1/1990 – 10/2010):
Your p-value: 0.000604 – My p-value: 0.00057
I do the resampling in Excel and it is very slow compared to your fast program.
Final conclusions:
(1) Values are very close and there are not any significant deviations present.
(2) Resampling and then centering (Case 2) does not cause large deviations and any variability is again not significant.
(3) Correcting for any bias of the sampling distribution mean with respect to the original mean again does not produce any significant change in the results.
(4) Any observed deviations in the results can, in general, be attributed to resampling variability. Furthermore, your algorithm is a lot faster and preferable in this respect over Cases 2 and 3.
A post on my blog, maybe later today, will analyze some aspects of the Chesapeake performance and compare it to the other two funds. I will try to identify the cause of the low p-value for the period 2000 – 2010.
Good to see this follow-up data. And somehow it is reassuring that you are coming to very similar values.
Will look out for the post on your blog.
ps: credit where credit is due: this is not really “my algorithm” but rather the implementation of the algorithm as described in Aronson’s book…
Hi Jez,
Thanks for the discussion regarding the bootstrap method. Here is the link to my related blog post:
http://www.priceactionlab.com/Blog/2010/11/the-bootstrap-method-for-hypothesis-testing/
Hi Jez,
First, thanks very much for providing your excellent blog — and especially your articles on bootstrap and Monte Carlo testing. They have inspired me to code up these tests into our own backtesting regimen.
David Aronson’s site at http://www.evidencebasedta.com/ provides several interesting additional links, including a link to Timothy Masters’ excellent 2006 article: http://www.evidencebasedta.com/MonteDoc12.15.06.pdf
and a link to a pdf file of “Reader Issues”:
http://www.evidencebasedta.com/ReaderIssues.pdf
This was of interest to me because, having done countless backtests, walk-forward tests, etc. on our own trading systems over the last decade, I became concerned when, after coding up and applying the bootstrap and Monte Carlo methods to my backtests, I found that my p-values were always zero, the ideal result! This is similar to one reader’s comment: “My dataset has a return that is tremendously better than random systems generated in a Monte-Carlo Permutation Test, and its p-value is zero. I am sure that I have not overfitted the parameters in my model. Is a p-value of zero reasonable?”
Part of Aronson’s response is: “The dataset tested must be obtained from COMPLETELY out-of-sample data.” So this suggests to me that the bootstrap and Monte Carlo tests are relatively useless (and might give us unjustified enthusiasm about our systems) when we try to apply them to our backtest periods and would only reveal anything when applied to “walk-forward” or “out-of-sample” data.
I am curious if you have found anything to support the use of these methods on backtest data.
Thanks again for publicly disclosing your ideas and work.
Thanks Keith, for the comments and pointer to the docs.
Bearing in mind that I have used the bootstrap tests rather than the MC method, I believe Aronson’s comments apply only to single bootstrap/MC testing, in which case you would definitely need to isolate your out-of-sample data from your research/optimization data.
However the later method described in the book for taking into account data mining should allow you to use the methods on back-test data (because this is exactly where you would generate data mining bias). This is my understanding anyway…
Jez,
Happy New Year! Glad to see you back with more posting and reports. This is the most interesting blog on trading system analysis and design.
Keith,
The strong condition for the bootstrap to have any meaning is that the data are i.i.d. (Independent and identically distributed). The weak condition is that of exchangeability, meaning that future data behave like past. If you have sufficiently long history in your backtest, the weak condition may be satisfied. However, that cannot eliminate data mining bias. One way of doing that is through cross-validation (forward testing). Thus, there are two issues here: (A) whether the bootstrap has meaning a priori given the data sample and (B) whether the model selection involved bias, in which case the bootstrap has no meaning at all.
IMO, backtesting is a process for hypothesis rejection only. Results of backtesting cannot be used for validation or selection, whether via bootstrap or not. This is in my mind the key point besides statistics.
Jez, great blog and thank you very much for the code – huge help for me! With regard to White’s Reality Check, I was wondering if you have any thoughts on how it can be applied to an inter-day trading system. My difficulty is that many of my systems do not generate a daily return and thus each rule will have a different value for “x”. As I understand it, having the same number of observations is a requirement for bootstrapping.
Hi Steven
I simply use the daily open equity change for the daily return.
If by inter-day, you mean that the system can hold position(s) for several days – as I suspect (and how most of the systems I develop/research here do) – this becomes a non-problem because the positions will still change in value, giving you a daily return (albeit not realized)
If your system does not trade at all on specific days, then the return is simply 0.
Hope this helps – Jez
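A minimal sketch of the idea in this reply, on made-up numbers: daily returns are derived from the day-to-day change in marked-to-market (open) equity, so every rule produces one observation per day and days without positions contribute a 0. The function name is hypothetical, and dividing by the prior day's equity is an assumption about how the change is turned into a return.

```python
def daily_returns_from_equity(equity):
    """Daily return = change in open (marked-to-market) equity / prior day's equity."""
    return [(today - prev) / prev for prev, today in zip(equity, equity[1:])]


# Made-up equity curve: flat on the first and last day (no position held),
# marked to market on the days in between.
equity = [100.0, 100.0, 102.0, 101.0, 101.0]
daily_returns = daily_returns_from_equity(equity)

print(daily_returns)
```

With returns built this way, every rule tested over the same period has the same number of observations x, which is what the Reality Check resampling needs.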