As mentioned in the Evidence-Based Technical Analysis review post, the main value of the book lies in its presentation of two methods for computing the statistical significance of trading strategy results, despite having only a single sample of data:
Both methods solve the problem of estimating the degree of random variation in a test statistic when there is only a single sample of data and, therefore, only a single value of the test statistic.
Today, let’s look at the bootstrap test and a practical application of it.
In very brief terms, the method uses hypothesis testing to verify whether the test statistic (such as the mean return of the back-testing sample) is statistically significant. This is done by establishing the p-value of the test statistic based on its sampling distribution. (Aronson covers the basics of statistical analysis earlier in the book. I have also previously mentioned The Cartoon Guide to Statistics, which covers these concepts too.)
The problem with back-testing is that the results generated represent a single sample, which does not provide any information on the sample statistic’s variability and its sampling distribution. This is where bootstrapping comes in: by systematically and randomly resampling the single available sample many times, it is possible to approximate the shape of the sampling distribution (and therefore calculate the p-value of the test statistic).
In the context of hypothesis testing, the bootstrap tests the null hypothesis that the rule does not have any predictive power. In practical terms, this translates to the population distribution of rule returns having an expected value of zero or less.
The bootstrap uses the daily returns of a back-test (run on detrended data) and performs a resampling with replacement.
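To make the procedure concrete, here is a minimal sketch in Python. The function name, parameters, and the choice of imposing the null hypothesis by centering the returns at zero are my own assumptions for illustration, not Aronson’s exact implementation:

```python
import random

def bootstrap_p_value(daily_returns, n_resamples=5000, seed=42):
    """Estimate the p-value of the observed mean daily return under the
    null hypothesis that the true expected return is zero or less.
    Illustrative sketch only, not Aronson's exact procedure."""
    rng = random.Random(seed)
    n = len(daily_returns)
    observed_mean = sum(daily_returns) / n
    # Impose the null hypothesis by centering the sample at zero,
    # so the resamples are drawn from a zero-mean population.
    centered = [r - observed_mean for r in daily_returns]
    count = 0
    for _ in range(n_resamples):
        # Resampling with replacement: each draw can pick any of the
        # n daily returns, so a given day may appear several times.
        resample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if resample_mean >= observed_mean:
            count += 1
    # Fraction of resampled means at least as large as the observed one.
    return count / n_resamples
```

The p-value is simply the fraction of bootstrap means at least as large as the one actually observed; a small value suggests the observed mean return is unlikely to have arisen by chance under the null.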
To illustrate the concept, we can look at a back-test and apply the bootstrap method to its daily return series. I decided to look at a back-test I presented in Better Trend Following via improved Roll Yield. Remember: a standard 50/20 Moving Average cross-over system applied to Crude Oil was improved by adding a roll yield optimisation process.
In that instance, the benchmark is the standard strategy, and we want to check that the strategy improvement was not the result of random chance. In Aronson’s book, benchmarking is achieved by detrending the data; however, this case is different, as the benchmark is the standard strategy. The improved strategy results can be thought of as the combination of 2 distinct parts: the standard Trend Following component and the additional Roll Yield component.
I therefore generated a composite, “Roll Yield-only” equity curve (by removing from the improved strategy equity curve the returns that could be attributed to the Trend Following component). I then computed the daily returns based on that equity curve.
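One simple way to build such a composite series is to subtract the daily returns of the Trend Following component from those of the improved strategy. This is a sketch under my own assumptions (the post does not spell out the exact attribution mechanics), and the helper name is hypothetical:

```python
def roll_yield_only_returns(improved_returns, trend_following_returns):
    """Hypothetical helper: approximate the daily returns attributable
    to the roll yield optimisation as the difference between the
    improved strategy's and the standard strategy's daily returns.
    Assumes both series are aligned day by day."""
    assert len(improved_returns) == len(trend_following_returns)
    return [a - b for a, b in zip(improved_returns, trend_following_returns)]
```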
Once the p-value is obtained, it is simply a matter of deciding which threshold qualifies for statistical significance. Scientists usually set the statistical significance threshold at 0.05 (i.e. the null hypothesis would be rejected for any p-value less than or equal to 0.05).
As discussed above, the null hypothesis that the rule has no predictive power translates to the arithmetic mean of its returns being zero or less. In the bootstrap method, the null hypothesis is rejected when the mean arithmetic return is statistically significantly positive.
I am no big fan of the arithmetic mean of returns, as it is a flawed indicator of profitability. In effect, a system can have a positive mean arithmetic return and still be unprofitable – think about a return of +50% followed by a return of -40%: the arithmetic mean return is +5%, yet the overall return is -10% (a geometric mean return of roughly -5.1% per period).
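The arithmetic can be checked in a few lines of Python (the numbers are just the worked example above):

```python
# A return of +50% followed by a return of -40%.
returns = [0.50, -0.40]

# Arithmetic mean: (+0.50 - 0.40) / 2 = +0.05, i.e. +5%.
arithmetic_mean = sum(returns) / len(returns)

# Compounded growth: 1.50 * 0.60 = 0.90, i.e. an overall loss of 10%.
growth = 1.0
for r in returns:
    growth *= 1 + r
overall_return = growth - 1

# Geometric mean per period: 0.90 ** (1/2) - 1, roughly -5.13%.
geometric_mean = growth ** (1 / len(returns)) - 1
```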
Proving that the mean arithmetic return is significantly positive, and deducing that the trading system is therefore profitable, is flawed. It is ironically amusing that Aronson, who spends quite a lot of time discussing logical reasoning and the usual traps people fall into, presents a flawed deduction himself. To use an example from the book:
A dog having four legs (a profitable rule having a positive mean arithmetic return) does not imply that any four-legged animal is a dog (i.e. a rule with a positive mean arithmetic return is not necessarily profitable).
On the other hand, any profitable rule has a positive mean geometric return, and any rule with a positive mean geometric return is profitable. On that basis, using the mean geometric return as the test statistic in the bootstrap would be more appropriate.
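As a sketch of what that alternative test statistic could look like (the function below is my own illustration, not code from the book):

```python
def geometric_mean_return(daily_returns):
    """Mean geometric return per period. It is positive exactly when
    the compounded overall return is positive, i.e. when the rule is
    profitable over the sample."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1 + r
    return growth ** (1 / len(daily_returns)) - 1
```

Plugged into the bootstrap in place of the arithmetic mean, this statistic avoids the four-legged-animal trap: it cannot be positive for an unprofitable rule.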
I’ll be running this post in 2 parts, and this concludes part 1…
In part 2, we’ll look at how the bootstrap method can be modified to handle the data mining process and its associated bias. I’ll also make the code used for the practical application above available for download (a simple bootstrap resample tool developed on the .NET platform for Windows). Finally, we’ll explore the idea of using the geometric mean return as the test statistic instead of its arithmetic cousin.