Systematic Trading research and development, with a flavour of Trend Following

## Bootstrap – Take 2: Data Mining bias, Code and using geometric mean

#### August 13th, 2010 · 65 Comments · Backtest, Code

In part 1 of this bootstrap post, we looked at how to apply the method to establish the statistical significance of a single trading rule. In Part 2, we’ll look at how to deal with the data mining bias and at the impact of using the geometric rather than the arithmetic mean return. The code implementing the bootstrap test is available for download at the bottom of this post.

### Dealing with Data Mining Bias

The approach described in the single rule test is not valid when performing data mining (whether testing different rules or different parameter values of the same rule). As per the data mining bias (explained previously), the (best) rule selected from the data mining process will invariably owe a large part of its over-performance to random (good) luck.

The way the bootstrap test deals with the data mining bias is by implementing a concept introduced in White’s Reality Check. The Reality Check derives the sampling distribution appropriate to test the statistical significance of the best rule found by data mining.

In effect, the concept is fairly simple – and similar to the single-rule bootstrap: assuming N rules have been tested in the data mining process, each resample iteration will perform a resample with replacement for each rule and the best mean return will be kept as this resample iteration’s test statistic:

1. N back-tests are run on detrended data. The mean daily return, based on x observations, is calculated for each back-tested rule.
2. Each rule’s mean daily return is subtracted from the rule’s set of daily returns (zero-centering). This gives a set of adjusted returns for each rule.
3. For each “higher-level” resample (to form the sampling distribution of the best-performing rule in a universe of N rules), perform a “lower-level” resample with replacement on every rule. For each rule, select x instances of adjusted returns at random and calculate their mean daily return (rule bootstrapped mean). Compare the rules’ bootstrapped means and select the highest one: this is the test statistic of this “higher-level” resample (bootstrapped best mean).
4. Perform a large number of “higher level” resamples to generate a large number of bootstrapped best means.
5. Form the sampling distribution of the best means generated in the step above.
6. Derive the p-value of the best back-test mean return (non zero-centered) based on the sampling distribution derived above.

In effect: for each iteration, resample each rule, take the best return, keep it as this iteration’s test statistic and move on to the next iteration. The sampling distribution is formed of each iteration’s best return.
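The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the downloadable tool: the function name `reality_check_pvalue` and its parameters are hypothetical, each rule's returns are assumed to be a list of equal length, and each iteration draws one set of dates and applies it to every rule (which preserves the correlation between rules; the steps could also be read as an independent draw per rule).

```python
import random

def reality_check_pvalue(rule_returns, n_resamples=1000, seed=None):
    """rule_returns: list of N lists, each holding x daily returns
    (one list per back-tested rule, all run on detrended data)."""
    rng = random.Random(seed)
    n_obs = len(rule_returns[0])
    if any(len(r) != n_obs for r in rule_returns):
        raise ValueError("every rule needs the same number of observations")

    # Step 1: mean daily return of each rule; the best one is what we test.
    means = [sum(r) / n_obs for r in rule_returns]
    best_observed = max(means)

    # Step 2: zero-center each rule's returns.
    centered = [[v - m for v in r] for r, m in zip(rule_returns, means)]

    # Steps 3-4: each iteration resamples every rule with replacement
    # and keeps only the best bootstrapped mean.
    best_means = []
    for _ in range(n_resamples):
        # Assumption: the same resampled dates are used for all rules.
        idx = [rng.randrange(n_obs) for _ in range(n_obs)]
        best_means.append(max(sum(c[i] for i in idx) / n_obs for c in centered))

    # Steps 5-6: p-value = share of bootstrapped best means that are at
    # least as high as the best (non zero-centered) back-test mean.
    exceed = sum(1 for b in best_means if b >= best_observed)
    return exceed / n_resamples, best_means
```

The second return value is the list of bootstrapped best means, which can be used to draw the histogram of the sampling distribution.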

### White Reality Check Related Papers

I have not yet searched hard for White’s paper on the Reality Check but I did find the two following papers which seem to be worth a read:
Stepwise Multiple Testing as Formalized Data Snooping – Romano & Wolf

Abstract:

It is common in econometric applications that several hypothesis tests are carried out at the same time. The problem then becomes how to decide which hypotheses to reject, accounting for the multitude of tests. In this paper, we suggest a stepwise multiple testing procedure which asymptotically controls the familywise error rate at a desired level. Compared to related single-step methods, our procedure is more powerful in the sense that it often will reject more false hypotheses.

Unlike some stepwise methods, our method implicitly captures the joint dependence structure of the test statistics, which results in increased ability to detect alternative hypotheses. We prove our method asymptotically controls the familywise error rate under minimal assumptions. Some simulation studies show the improvements of our methods over previous proposals. We also provide an application to a set of real data.

In this paper, we re-examine the profitability of technical analysis using White’s Reality Check and Hansen’s SPA test that correct the data snooping bias. Comparing to previous studies, we study a more complete universe of trading techniques, including not only simple rules but also investor’s strategies, and we test the profitability of these rules and strategies with four main indices. It is found that significantly profitable simple rules and investor’s strategies do exist in the data from relatively young markets (NASDAQ Composite and Russell 2000) but not in the data from relatively mature markets (DJIA and S&P 500). Moreover, after taking transaction costs into account, we find that the best rules for NASDAQ Composite and Russell 2000 outperform the buy-and-hold strategy in most in- and out-of-sample periods. Our results thus suggest that the degree of market efficiency may be related to market maturity. It is also found that investor’s strategies are able to improve on the profits of simple rules and may even generate significant profits from unprofitable simple rules.

### Geometric or Arithmetic Mean?

In part 1, I introduced the idea that the mean arithmetic return being positive is not equivalent to the strategy being profitable (i.e. it is not a sufficient condition). On the other hand, the mean geometric return being positive is a necessary and sufficient condition for the strategy being profitable (i.e. both conditions are equivalent).
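A two-period example makes the difference concrete. The numbers below are made up for illustration (a +50% period followed by a -40% period), and the helper names are hypothetical:

```python
import math

def arithmetic_mean(returns):
    """Plain average of the per-period returns."""
    return sum(returns) / len(returns)

def geometric_mean(returns):
    """Per-period compound return: (product of growth factors)^(1/n) - 1."""
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (1.0 / len(returns)) - 1.0

# Hypothetical two-period example: +50% followed by -40%.
example = [0.50, -0.40]
print(f"arithmetic mean: {arithmetic_mean(example):+.4f}")  # positive, about +5%
print(f"geometric mean:  {geometric_mean(example):+.4f}")   # negative, about -5.1%
print(f"final equity per 1 invested: {math.prod(1.0 + r for r in example):.2f}")  # 0.90
```

The arithmetic mean is positive, yet the account ends 10% lower: only the geometric mean has the same sign as the final profit or loss.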

Therefore bootstrapping using the mean geometric return as the test statistic should provide a better evaluation of the statistical significance of the system’s profitability.

I will not go into detail of how the calculation is done as it is very similar to the arithmetic mean return, but using log of returns instead. Note that the geometric mean will be a stricter test than the arithmetic mean (a rule can have a significantly positive arithmetic return but a negative geometric return).
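As a rough sketch of what "using log of returns instead" can look like for the single-rule test: the function below is hypothetical (it is not the downloadable tool) and assumes returns are given as fractions greater than -1. In geometric mode it replaces each return r by log(1 + r), whose average is the log of the geometric mean growth factor, and then proceeds exactly as in the arithmetic case.

```python
import math
import random

def bootstrap_pvalue(returns, n_resamples=5000, geometric=False, seed=None):
    """One-sided bootstrap p-value for the mean return of a single rule."""
    rng = random.Random(seed)
    # Geometric mode: work on log growth factors (requires every r > -1).
    values = [math.log1p(r) for r in returns] if geometric else list(returns)
    n = len(values)
    observed = sum(values) / n

    # Zero-center so the resampled world has no edge (null hypothesis).
    centered = [v - observed for v in values]

    exceed = 0
    for _ in range(n_resamples):
        resample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if resample_mean >= observed:
            exceed += 1
    return exceed / n_resamples

# Hypothetical series alternating +50% and -40%: positive arithmetic mean,
# negative geometric mean.
series = [0.50, -0.40] * 50
p_arith = bootstrap_pvalue(series, n_resamples=1000, geometric=False, seed=42)
p_geo = bootstrap_pvalue(series, n_resamples=1000, geometric=True, seed=42)
print(p_arith, p_geo)
```

On a series like this one the geometric test is visibly the stricter of the two, because the observed mean log return is negative while the mean arithmetic return is positive.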

To illustrate the multiple applications of the bootstrapping methodology, I decided to run the test on the track record (set of monthly returns) of one of the Trend Following Wizards. I picked Chesapeake and ran the monthly returns (from 1988 to 2009) through the bootstrap test.

The p-value calculated using the arithmetic mean is 0.000098 (less than 1 chance in 10,000 that results like these are due to random luck). Using the geometric mean, the p-value is 0.00022. The values are extremely low, which is not surprising given Jerry Parker’s 20-year track record with only one losing year and a monthly average return of 1.7%.

Many people would point out that survivorship bias should be considered, and obviously it depends on how you look at it. The main point of this dual test is that the geometric p-value is higher than the arithmetic p-value, which is consistent with it being a stricter test of statistical significance.

### Bootstrap Code

Finally, here is a tool coded to implement the bootstrap test for a single strategy – available for download. Note that this is distributed “as is”, with no guarantee (but that’s the one I have been using so I still think it does the job…). It should run on any Windows machine with the .Net framework installed (XP or higher should do fine).

It simply takes three parameters (separated by space):

1. Returns file path and name
2. Number of resamples
3. Flag for Arithmetic (A) or Geometric (G) mean calculation

It also generates a file in the same directory with all of the resamples test-statistic values (to draw the histogram).

Simply place the bootstrap.exe in your directory of choice and run it from the command prompt as below:

[Screenshot: running bootstrap.exe from the command line]

bootstrap.exe

### 65 Comments so far ↓

• alpha

• alpha – cool charts. What tool do you use to generate them?

• One problem (I can see) is that forward-test (I guess you mean something similar to out-of-sample test) is very different in nature to analysing a back-test for randomness.
I see 4 possible cases:
1- back-test was merely based on “random luck”: forward-test will fail (you’d expect tests like bootstrap to fail too – on the back-test results)
2- back-test had a genuine edge but somehow market conditions change and the strategy/forward-test fails (turtle system for example)
3- back-test had a genuine edge but the period used for the forward-test resulted in negative performance (in the range of statistical expectations based on the back-test results – trend following in 2009 for example)
4- back-test had a genuine edge and forward-test shows that edge preserved.

I believe the bootstrap is only useful for distinguishing case 1 from cases 2, 3 or 4 (but does not identify which of 2, 3 or 4 it is). So the forward / out-of-sample test adds additional value to the whole testing process.
I haven’t really worked that out yet but I believe you’d need some statistical model to help you identify whether you fall in case 2 or 3 (fundamental change vs. random variation).

• alpha

jez;

i use my own java app to do the backtests and random sampling; and use R to produce the charts and to do sophisticated statistics.

i agree with the bootstrap method identifying case 1; and that is what I am after. if I can prove case (1); my mission is accomplished.

let me clarify; the random sampling method I used is what was mentioned in Timothy Masters, Monte-Carlo evaluation of trading systems, 2006; listed on page 4 – 6. It basically pairs the market returns to the system output of +1/-1/0.

btw; I found this paper to be exactly what was mentioned in Aronson’s book; but straight to the point; and it explains everything scientifically (as opposed to Aronson’s book; which was full of fluff; and confusing at times).

I have applied this method to two systems:
1- curve fitted system: (this is the one mentioned in earlier comment). This system’s equity curve was 45% all way long for the last 2 years. however it failed the forward test miserably.
2- real profitable system: this is my live trading system at the moment that I have been trading for few months.

the test using Masters’ random sampling method (aka bootstrap) was not effective in distinguishing luck from skill. It shows both of the above systems beating randomness at 98% confidence. pretty optimistic test.

However; i also implemented another type of bootstrap method that scrambles the equity curve; and preserves the serial correlation by using block sampling; i.e. 5 periods, 10 periods..etc.

I found this latter method to be far more effective than Masters’ method. It did differentiate between the two systems. i.e. the following are the 95% confidence intervals of the total returns:
– system 1: -4% ~ 100%
– system 2: 20% ~ 84%

and the drawdown 95% confidence interval:
– system 1: 10% ~ 38%
– system 2: 6% ~ 18%

clearly system 1 was worthless; at @95%, it has negative total return; and 38% drawdown. straight away to the bin.

system2 did indeed do well in forward test; and also in live trading. it remained profitable till date.

the next step i am pursuing is to try a variety of random sampling methods. I believe Masters’ method can achieve some result if tweaked the right way.

• here is the bootstrap for system 1

and for system 2

PS: i have started a new blog so I can post my comments with illustrations.
liquidalpha.blogspot.com

• well, I am honoured if this blog post inspired you to start a new blog. Liquid Alpha: I like the name!

Thanks for the explanation. Your second method (scrambling the equity curve) sounds a lot like TBB Monte-Carlo simulation – as you note, having more than one tool in the “check for randomness/data-mining” toolbox is definitely a good idea. Will be keen to hear about your findings on tweaking Masters’ method.

• i first learned about TBB Monte-Carlo simulation from Curtis Faith’s book; way of the turtle.

• 2sedated

Bootstrap is a great method to construct an artificial sampling distribution. But there is a problem. When you are using bootstrapping, you are trying in effect to guess what your population parameter is.

This is fine, when you have a stationary problem. For example F-G, from EBTA book. But stock price movement is not stationary. So if you assume stationarity, you are in effect doing the whole thing wrong.

Stock price movement is rw + drift + deterministic trend. Drift and trend change over time, sometimes dramatically.

So my question is, what is the point of a bootstrap proof of TA being profitable over some sample, even if that sample is rather big, if the underlying process can change greatly and therefore destroy the future observed return, making it quite different from the expected return?

Tnx.

• Sarfaraz

Do I need to zero-center the returns before passing them to the bootstrap tool provided here?

If I pass the returns to the bootstrap tool without zero-centering I get a p-value of 0.0020. If I pass the returns after zero-centering them I get a p-value of 1.

any thoughts

• Sarfaraz: Nope, the tool does the zero-centering.

• Sarfaraz Farooqui

Dear Jez,
Thank you for replying, I have one more doubt.
I have a .net program which generates the trades based on the TA rule I feed.
The out put has the following columns

1. Entry Date
2. Entry Price
3. Exit Date
4. Exit price
5. Bars (holding period)
6. Returns % (based on de-trended data)

I am feeding returns % to the bootstrap tool.

for example first trade returned me 2% profit over 20 days.

Is it okay to use these returns for bootstrapping or I should calculate avg per day of each trade and then pass them for bootstrapping.

Regards
Sarfaraz

• SKaRe

Hey,

Since the bootstrap app requires detrended data, how is detrending done for daily returns? The holding period is 1 day for a daily return. The daily returns that I have are from an EQ curve which is additive in nature.

• Sarfaraz, the method described here uses daily returns; but in itself the bootstrap test is simply a statistical tool to approximate a distribution based on a given sample. That distribution is then used to measure the statistical significance of the sample result. So it could theoretically be used with a whole lot of different data types (including trade data).

Basically the question we’re asking when running this sort of test is: assuming that the sample result is drawn from a specific distribution (based on resampling of the sample), what are the odds that the sample result was obtained by “random chance”. If it’s less than 1%/5% we can assume it is statistically significantly not the result of chance…
Hope that helps

• @SKaRe: the raw price data is detrended but not the daily returns (those are zero-centered, but this is done automatically by the app..)
see this post for the concept of detrending: http://www.automated-trading-system.com/detrending-for-trend-following/

• Whale

Hi,

First off, great blog and keep up the good work!

I’m not a big statistics guy but it seems absolutely clear to me that all TA should be evidence-based and therefore (statistically) tested. I read Aronson’s book and enjoyed it very much, even though he sometimes takes 40 pages to explain something that could have been done in 2 pages.

More to the point, what I don’t understand is why you need to simulate (in the case of data mining) all the rules that were tested before deciding on the best rule. If the data becomes essentially random when scrambling it, then why can’t you just test the rule you ended up with (say 5000 times) and use the max of every Nth simulation to calculate the distribution? Then simply compare the CAGR of the actual back-test to the simulated average CAGR.

Does this make sense to anyone? :)