In Part 1 of this bootstrap post, we looked at how to apply the method to establish the statistical significance of a single trading rule. In Part 2, we’ll look at how to deal with the data mining bias and at the impact of using the geometric rather than the arithmetic mean return. The code implementing the bootstrap test is available for download at the bottom of this post.
Dealing with Data Mining Bias
The approach described in the single rule test is not valid when performing data mining (whether testing different rules or different parameter values of the same rule). As per the data mining bias (explained previously), the (best) rule selected from the data mining process will invariably owe a large part of its over-performance to random (good) luck.
The way the bootstrap test deals with the data mining bias is by implementing a concept introduced in White’s Reality Check. The Reality Check derives the sampling distribution appropriate to test the statistical significance of the best rule found by data mining.
In effect, the concept is fairly simple – and similar to the single-rule bootstrap: assuming N rules have been tested in the data mining process, each resample iteration will perform a resample with replacement for each rule and the best mean return will be kept as this resample iteration’s test statistic:
- N back-tests are run on detrended data. The mean daily return, based on x observations, is calculated for each back-tested rule.
- Each rule’s mean daily return is subtracted from the rule’s set of daily returns (zero-centering). This gives a set of adjusted returns for each rule.
- For each “higher-level” resample (to form the sampling distribution of the best-performing rule in a universe of N rules), perform a “lower-level” resample with replacement on every rule: for each rule, select x adjusted returns at random and calculate their mean daily return (the rule’s bootstrapped mean). The highest of these bootstrapped means is the test statistic of this “higher-level” resample (the bootstrapped best mean).
- Perform a large number of “higher level” resamples to generate a large number of bootstrapped best means.
- Form the sampling distribution of the best means generated in the step above.
- Derive the p-value of the best back-test mean return (non-zero-centered) from the sampling distribution formed above.
In effect: for each iteration, resample each rule, take the best return, keep it as this iteration’s test statistic and move on to the next iteration. The sampling distribution is formed from each iteration’s best return.
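The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation only (the function and parameter names are mine, not from the tool distributed with this post), and it follows the text literally by resampling each rule independently; note that White’s original Reality Check resamples the same dates for all rules, to preserve the cross-rule dependence structure.

```python
import numpy as np

def reality_check_pvalue(returns, n_resamples=5000, seed=0):
    """Bootstrap p-value for the best of N data-mined rules.

    returns: array of shape (n_rules, n_obs) holding each rule's
    back-tested daily returns on detrended data.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    n_rules, n_obs = returns.shape

    # Steps 1-2: mean daily return per rule, then zero-centre each rule.
    rule_means = returns.mean(axis=1)
    adjusted = returns - rule_means[:, None]  # adjusted (zero-centered) returns

    # Steps 3-4: each "higher-level" resample draws x observations with
    # replacement for every rule and keeps the best bootstrapped mean.
    best_means = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n_obs, size=(n_rules, n_obs))
        boot_means = np.take_along_axis(adjusted, idx, axis=1).mean(axis=1)
        best_means[i] = boot_means.max()

    # Steps 5-6: p-value = fraction of bootstrapped best means that reach
    # or exceed the best (non-centered) back-test mean.
    best_backtest_mean = rule_means.max()
    return float((best_means >= best_backtest_mean).mean())
```

A rule whose edge is genuine will produce a best back-test mean far out in the right tail of the bootstrapped best means, hence a small p-value; pure noise will not.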
White’s Reality Check: Related Papers
I have not yet searched hard for White’s original paper on the Reality Check, but I did find the two following papers, which seem worth a read:
Stepwise Multiple Testing as Formalized Data Snooping – Romano & Wolf
It is common in econometric applications that several hypothesis tests are carried out at the same time. The problem then becomes how to decide which hypotheses to reject, accounting for the multitude of tests. In this paper, we suggest a stepwise multiple testing procedure which asymptotically controls the familywise error rate at a desired level. Compared to related single-step methods, our procedure is more powerful in the sense that it often will reject more false hypotheses.
Unlike some stepwise methods, our method implicitly captures the joint dependence structure of the test statistics, which results in increased ability to detect alternative hypotheses. We prove our method asymptotically controls the familywise error rate under minimal assumptions. Some simulation studies show the improvements of our methods over previous proposals. We also provide an application to a set of real data.
Reexamining the Profitability of Technical Analysis with Data Snooping Checks – Hsu & Kuan
In this paper, we re-examine the profitability of technical analysis using White’s Reality Check and Hansen’s SPA test that correct the data snooping bias. Comparing to previous studies, we study a more complete universe of trading techniques, including not only simple rules but also investor’s strategies, and we test the profitability of these rules and strategies with four main indices. It is found that significantly profitable simple rules and investor’s strategies do exist in the data from relatively young markets (NASDAQ Composite and Russell 2000) but not in the data from relatively mature markets (DJIA and S&P 500). Moreover, after taking transaction costs into account, we find that the best rules for NASDAQ Composite and Russell 2000 outperform the buy-and-hold strategy in most in- and out-of-sample periods. Our results thus suggest that the degree of market efficiency may be related to market maturity. It is also found that investor’s strategies are able to improve on the profits of simple rules and may even generate significant profits from unprofitable simple rules.
Geometric or Arithmetic Mean?
In Part 1, I introduced the idea that a positive arithmetic mean return is not equivalent to the strategy being profitable (i.e. it is not a sufficient condition). On the other hand, a positive geometric mean return is a necessary and sufficient condition for the strategy being profitable (i.e. the two statements are equivalent).
Therefore, bootstrapping with the geometric mean return as the test statistic should provide a better evaluation of the statistical significance of the system’s profitability.
I will not go into the details of the calculation, as it is very similar to the arithmetic mean return but uses log returns instead. Note that the geometric mean gives a stricter test than the arithmetic mean: a rule can have a significantly positive arithmetic return but a negative geometric return.
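A short sketch of the two test statistics, and of why they can disagree (the series below is illustrative, not from the post): the geometric mean is computed through log returns, and a sufficiently volatile series can have a positive arithmetic mean while its compounded equity curve shrinks.

```python
import numpy as np

def arithmetic_mean(simple_returns):
    # plain average of per-period simple returns
    return float(np.mean(simple_returns))

def geometric_mean(simple_returns):
    # mean log return; positive if and only if compounding is profitable
    return float(np.mean(np.log1p(simple_returns)))

# A volatile series alternating +50% and -40%:
r = np.array([0.5, -0.4] * 50)
# arithmetic_mean(r) is approximately +0.05 (positive)
# geometric_mean(r) is log(1.5 * 0.6) / 2, approximately -0.053 (negative):
# each +50%/-40% pair multiplies equity by 0.9, so the strategy loses money.
```

This is why the geometric statistic is the stricter of the two when used in the bootstrap.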
To illustrate the multiple applications of the bootstrapping methodology, I decided to run the test on one of the Trend Following Wizards’ track records (a set of monthly returns). I picked Chesapeake and ran its monthly returns (1988 to 2009) through the bootstrap test.
The p-value calculated using the arithmetic mean is 0.000098 (less than 1 chance in 10,000 that results like these are due to random luck). Using the geometric mean, the p-value is 0.00022. The values are extremely low, which is not surprising given Jerry Parker’s 20-year track record with only one losing year and a monthly average return of 1.7%.
Many people would point out that survivorship bias should be considered, and obviously it depends on how you look at it. The main point of this dual test is that the geometric p-value is higher than the arithmetic p-value, confirming that it is the stricter test of statistical significance.
Finally, here is a tool I coded to implement the bootstrap test for a single strategy – available for download. Note that it is distributed “as is”, with no guarantee (but it is the one I have been using, so I still think it does the job…). It should run on any Windows machine with the .Net framework installed (XP or higher should do fine).
It simply takes three parameters (separated by spaces):
- Returns file path and name
- Number of resamples
- Flag for Arithmetic (A) or Geometric (G) mean calculation
It also generates a file in the same directory with all of the resamples’ test-statistic values (to draw the histogram).
Simply place the bootstrap.exe in your directory of choice and run it from the command prompt as below:
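For example, a hypothetical invocation (the file name and resample count below are illustrative, not from the post) with 5,000 resamples and the arithmetic mean:

```shell
bootstrap.exe returns.txt 5000 A
```

Use G instead of A as the third parameter to run the stricter geometric-mean version of the test.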