Systematic Trading research and development, with a flavour of Trend Following
Au.Tra.Sy blog – Automated trading System

The Bootstrap Test: How significant are your back-testing results?

August 11th, 2010 · 31 Comments · Backtest, Books


As mentioned in the Evidence-based Technical Analysis review post, the main value of the book lies in the presentation of the two methods allowing for computing the statistical significance of trading strategy results, despite having a single sample of data:

Both methods solve the problem of estimating the degree of random variation in a test statistic when there is only a single sample of data and, therefore, only a single value of the test statistic.

Today, let’s look at the bootstrap test, with a practical application of it.

In very brief terms, the concept uses hypothesis testing to verify whether the test statistic (such as the mean return of the back-testing sample) is statistically significant. This is done by establishing the p-value of the test statistic based on the sampling distribution. (Aronson covers the basics of statistical analysis earlier in the book. I have also previously mentioned The Cartoon Guide to Statistics, which covers these concepts too.)

The problem with back-testing is that the results generated represent a single sample, which does not provide any information on the sample statistic’s variability and its sampling distribution. This is where bootstrapping comes in: by systematically and randomly resampling the single available sample many times, it is possible to approximate the shape of the sampling distribution (and therefore calculate the p-value of the test statistic).

Bootstrap on Single Rule Back-Test

In the context of hypothesis testing, the bootstrap tests for the null hypothesis that the rule does not have any predictive power. In practical terms, this is translated to the population distribution of rule returns having an expected value of zero or less.

The bootstrap uses the daily returns of a back-test (run on detrended data) and performs a resampling with replacement.

In practice:

  1. A back-test is run on detrended data and the mean daily return, based on n observations, is calculated.
  2. The mean daily return is subtracted from each day’s return (zero-centering). This gives a set of adjusted returns.
  3. For each resample, select n instances of adjusted returns, at random (with replacement), and calculate their mean daily return (bootstrapped mean).
  4. Perform a large number of resamples to generate a large number of bootstrapped means.
  5. Form the sampling distribution of the means generated in the step above.
  6. Derive the p-value of the initial back-test mean return (non zero-centered) based on the sampling distribution, i.e. the fraction of bootstrapped means that are greater than or equal to it.
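The six steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the tool used for this post: the function name `bootstrap_p_value`, its parameters and the synthetic returns in the demo are assumptions made for the example.

```python
import random
from statistics import fmean


def bootstrap_p_value(daily_returns, n_resamples=10_000, seed=0):
    """One-sided bootstrap test of H0: expected daily return <= 0.

    Hypothetical helper written for illustration only.
    Returns (observed mean, list of bootstrapped means, p-value).
    """
    rng = random.Random(seed)
    returns = list(daily_returns)
    n = len(returns)

    # Step 1: mean daily return of the back-test.
    observed_mean = fmean(returns)

    # Step 2: zero-centre the returns so that they satisfy H0.
    adjusted = [r - observed_mean for r in returns]

    # Steps 3 and 4: resample n adjusted returns with replacement, many times.
    boot_means = [fmean(rng.choices(adjusted, k=n)) for _ in range(n_resamples)]

    # Steps 5 and 6: the p-value is the share of bootstrapped means that are
    # at least as large as the original (non zero-centred) mean.
    p_value = sum(m >= observed_mean for m in boot_means) / n_resamples
    return observed_mean, boot_means, p_value


if __name__ == "__main__":
    # Synthetic returns, used only to exercise the function.
    demo_rng = random.Random(42)
    fake_returns = [demo_rng.gauss(0.0005, 0.01) for _ in range(1000)]
    mean, _, p = bootstrap_p_value(fake_returns, n_resamples=2000)
    print(f"mean daily return = {mean:.4%}, p-value = {p:.3f}")
```

The p-value is computed as a simple count rather than from a fitted distribution, which keeps the test non-parametric. Fixing the seed makes a run reproducible, at the cost of hiding the (small) resampling noise in the estimate.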

A Practical Application

To illustrate the concept, we can look at a back-test and apply the bootstrap method to its daily return series. I decided to look at a back-test I presented in Better Trend Following via improved Roll Yield. Remember: a standard 50/20 Moving Average cross-over system applied to Crude Oil was improved by adding a roll yield optimisation process.

In that instance, the benchmark is the standard strategy and we want to check that the strategy improvement was not the result of random chance. In Aronson’s book, benchmarking is achieved by detrending the data. However, this case is different as the benchmark is the standard strategy. The improved strategy results can be thought of as two distinct parts:

  • Results from standard Trend Following strategy
  • Results from Roll Yield Optimisation

I therefore generated a composite, “Roll Yield-only” equity curve (by removing from the improved strategy equity curve the returns that could be attributed to the Trend Following component). I then computed the daily returns based on that equity curve.
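One possible way to build such a composite series is sketched below. It is only an assumption about how the attribution could be done, not necessarily the method used for this post: it treats the two components as additive at the daily level, so the roll yield return for a day is the improved strategy’s return minus the standard strategy’s return. The function names and the three sample returns are hypothetical.

```python
def roll_yield_only_returns(improved_returns, standard_returns):
    """Daily returns attributed to roll yield, ASSUMING additive components."""
    if len(improved_returns) != len(standard_returns):
        raise ValueError("both series must cover the same days")
    return [imp - std for imp, std in zip(improved_returns, standard_returns)]


def equity_curve(daily_returns, start=1.0):
    """Compound a list of daily returns into an equity curve."""
    curve = [start]
    for r in daily_returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve


# Made-up daily returns for the two back-tests (illustration only).
improved = [0.012, -0.004, 0.007]
standard = [0.010, -0.005, 0.006]

roll_only = roll_yield_only_returns(improved, standard)
composite_curve = equity_curve(roll_only)
print(roll_only)
print(composite_curve)
```

Other attribution schemes (for example dividing the two equity curves) would give slightly different daily returns, so the choice should be stated alongside any p-value derived from the composite series.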

  1. This set of daily returns is the original sample of 5120 observations, with an arithmetic mean of 0.216%.
  2. Subtracting 0.216% from all 5120 returns adjusts those returns (zero-centering), ready to be picked for resampling.
  3. The 10,000 resamples all pick at random, with replacement, 5120 observations from the zero-centered, adjusted returns. A mean is computed for each resample.
  4. The resample means are then used to form the sampling distribution of the mean return.

  5. The last step is the comparison of the non-adjusted original sample mean (0.216%) to the sampling distribution to establish the p-value, which is 0.006 in this example.

Once the p-value is obtained, it is simply a matter of deciding which threshold qualifies for statistical significance. Scientists usually set the statistical significance threshold at 0.05 (i.e. the null hypothesis would be rejected for any p-value less than or equal to 0.05).

Note on Arithmetic Mean vs. Geometric Mean

As discussed above, the assumption that the rule does not have predictive power is translated to the arithmetic mean of its returns being equal to zero. In the bootstrap method, rejecting the null hypothesis occurs when the mean arithmetic return is statistically significantly positive.

I am usually no big fan of the arithmetic mean of returns, as it is a flawed indicator of profitability. A system can have a positive mean arithmetic return and still be unprofitable – think about a return of 50% followed by a return of -40%: the arithmetic mean return is +5%, yet the overall return is -10% (a geometric mean return of about -5.1% per period).
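The arithmetic for this example can be checked with a short sketch. The two returns compound to an overall loss of 10%, which corresponds to a geometric mean return of about -5.1% per period. The variable names are chosen for this illustration only.

```python
import math

returns = [0.50, -0.40]

# Arithmetic mean: simple average of the two returns (about +5%).
arithmetic_mean = sum(returns) / len(returns)

# Compounded growth factor: 1.5 * 0.6 = 0.9, i.e. an overall return of about -10%.
growth = math.prod(1 + r for r in returns)
compound_return = growth - 1

# Geometric mean return per period: sqrt(0.9) - 1, about -5.13%.
geometric_mean = growth ** (1 / len(returns)) - 1

print(f"arithmetic mean: {arithmetic_mean:+.2%}")
print(f"overall return:  {compound_return:+.2%}")
print(f"geometric mean:  {geometric_mean:+.2%}")
```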

Proving that the mean arithmetic return is significantly positive, and then deducing that the trading system is therefore profitable, is flawed. It is ironic that Aronson spends quite a lot of time discussing logical reasoning and the usual traps people fall into, only to present a flawed deduction himself. To use an example from the book:

A dog having four legs (a profitable rule having a positive mean arithmetic return) does not imply that any four-legged animal is a dog (i.e. a rule with a positive mean arithmetic return is not necessarily profitable).

On the other hand, any profitable rule has a positive mean geometric return, and any rule with positive mean geometric return is profitable. On that basis, using the mean geometric return as the test-statistic in the bootstrap must be more appropriate.

I’ll be running this post in 2 parts, and this concludes part 1…
In part 2, we’ll look at how the bootstrap method can be modified to handle the data mining process and its associated bias. I’ll also make the code used for the practical application above available for download (this will be a simple bootstrap resample tool developed on the .net platform for Windows). Finally we’ll explore the idea of using the geometric mean return as the test-statistic instead of its arithmetic cousin.


31 Comments so far ↓

  • KF

    First off: Great blog!

    For this specific post: What do you think about the de-trending? Always had mixed feelings about that.


  • Jez

    Thanks Kim!
    Re: detrending, I have mixed feelings about it too (see an earlier post).
    Aronson shows that detrending is benchmarking based on position bias (ie. avoid favoring rules with bullish biases during bull markets for example) and therefore allows for comparing different rules on “equal footing”. There is a point there…
    He also quotes a study done by Timothy Masters comparing the bootstrap and Monte Carlo methods and concluding that the two methods give similar results, when both are run on detrended data.
    I have not done enough testing to have a strong opinion on both and I would welcome additional comments from more experienced readers.
    My initial thoughts are that at least you have to be aware of the position bias issue and decide how much it would affect the rules you’re testing (for example a Long-only strategy is likely to outperform a Short-only strategy during a bull market – a Long/Short Trend Following strategy goes equally likely long or short and should not suffer from position bias as much…)

  • KF

    My big question after reading the book is: aren’t I trying to exploit a bias in the data? Sure, omitting short trades when trading only the Nasdaq in the 90’s is over-optimizing. However, if there is a systematic bias in the data (i.e. roll yield), then by detrending you are distorting reality.

  • Paolo

    The idea of bootstrapping the geometric mean returns sounds correct.

    I had the feeling Aronson used the arithmetic mean because permuting daily returns could produce the same arithmetic mean but a different geometric mean return, but that is not the case, as a quick simulation in Excel showed.

    Looking forward to your suggested adjustment then.


  • Motomoto

    While discussing the mathematics is well above my pay grade – I have to agree with KF.
    When asking how significant are the backtesting results, aren’t you really trying to look at how good are the backtesting results in replicating what I could have profited from by following the system in the real world. It seems if you then start chopping up the data, detrending and then running multiple tests, are you complicating things with tests that then distort what actually happens? These are not randomly generated numbers that are based on a population, they are meant to be a ‘live time’ simulation of a series of trades – that may be related to previous trades…. ie, the market trends – it does not revert back to a mean – or average?….. (as I said my statistical knowledge is sub-standard)

  • Jez

    I agree, this can seem counter-intuitive to reshuffle the results like this and I do have mixed feelings about some aspects of the approach – detrending for instance (as I mentioned before – especially for strategies like trend following).

    You also allude to dependency in the results – which is obviously discarded in this sort of approach and might indeed be a weakness of the methodology. Unfortunately I have yet to see a model of trading results dependency – if there were one, it could probably be integrated into that method (i.e. some sort of conditional/random resampling to produce a more sophisticated testing method).

    The problem is that the outcome of a trading strategy is a stochastic process with a large part of randomness and therefore, with only one set of back-testing results (single sample), you cannot assert how much of the performance can be attributed to randomness.

    And I think the bootstrap, amongst other methods, tries (with some weaknesses that you highlight) to address this issue to identify whether the performance is more likely coming from random luck or strategy value. Not the holy grail of trading results significance checking but probably a good tool in the system development box..

    There are long debates on the practice of back-testing and this issue is surely one of them!… ;-)

  • Lou

    Am I missing something here?
    Of course you were able to reject Ho. You built a distribution based on (x – mean) and then contrasted the mean against this distribution.

  • Jez Liberty

    Lou, this can seem confusing at first indeed (but it makes sense, it’s all about checking the variance in the process)…
    H0 is the hypothesis that the mean return is zero, so you need a distribution with a zero-mean.
    What the bootstrap does is build such distribution using the actual data from the test in order to have variance/deviations in line with the process being tested.
    After the zero-mean distribution is built you perform a standard statistical significance test by comparing the data tested (mean return) against the zero-mean distribution. This is a non-parametric method, but an analogy would be to calculate how many standard deviations the mean tested lies at.

    You only reject H0 if the mean return is in the top x%. If the process has high variance, it is likely that H0 will not be rejected (at the x% level) because the mean return will not be “far enough” to the right.

  • Lou

    I think that the test you’ve created only shows that the mean does not come from the bootstrapped sample of (x – mean).

  • Jez Liberty

    Hi Lou,
    Not sure what you mean. I am only describing the bootstrap test as per Aronson’s book. Are you saying that the test is flawed or that there is some mistake in the illustration?

    I was not very clear/accurate in my description/comment above but in effect the test simulates a zero-mean sample process with similar variance to the process (back-)tested. If we run that sample process a large number of times, we know that the expected mean (i.e. mean of means) will be zero but we can also check how far and how frequently the data spreads towards the right/left of the mean (as shown graphically in the distribution).

    If 95% of the means are below the back-test observed mean, there is statistical significance (at 95% confidence level) that the back-test’s profitability was not the result of random variation from a zero-mean process with similar variance (which is H0, which can then be rejected).

    Hope this clarifies things.

  • Lou

    I don’t know anything about Aronson. Maybe you can post or send a link to a relevant excerpt on applying bootstrapping to hypothesis tests.

    Here’s an example of what I was trying to say:
    Given an initial sample of 1000 trees. We find that the mean height of the sample is 15 feet. We cut off the 1st 15 feet of each tree (we just write down the negative values since we can’t have negative trees). Then we create a simulated distribution of the detrended trees via bootstrapping.

    So, now we have a distribution of tree tops (x – mean) and recorded values for negative trees.

    If we want to test a full tree, one that hasn’t been cut (detrended) against the distribution of cut trees then our hypothesis test is Ho: tree(uncut) = tree(cut).

    If we reject Ho we’ve inferred that the uncut tree did not come from the cut tree sample.

    That’s all that you’re doing in your example. You’ve inferred that the mean did not come from the detrended distribution.

  • Jez Liberty

    To stay with your tree analogy:

    The 1000 trees only represent a sample from the total population of trees (for which we do not know the true average height).
    We want to know if the sample average height (15 feet) is statistically significant to infer that the true average height (for all trees) is different from 0 (which is our H0 hypothesis: true average tree height = 0).

    One problem is that we only have one sample forest (our 1000 trees). We need to have many more sample forests to establish the sampling distribution of the average height.
    Moreover, if our assumption is true (H0 = the true tree average height is 0), then the sample forest is skewed/biased upwards. So we need to adjust it (by cutting the tree tops) to “zero-center” it (and meet H0’s condition).

    We can now create many forests/resamples from the adjusted trees and calculate the average height from each resample to establish the bootstrapped sampling distribution of the average height.

    The mean average height will be zero, but some samples’ average height will be 15 feet or over. The number/frequency of these samples with average heights of 15ft+ provides us with an indication of how rare they are. The rarer they are in the sampling distribution, the less likely it is that our initial sample’s average height was due to random variation from a zero-mean population: i.e. the total population is unlikely to have a zero mean given that our sample has a mean of 15 feet.

    It is mostly related to variance within the sample: if all trees are 15 feet +/- 2 inches, it is very likely that the true average height is different from 0. In the bootstrap test, very few or no (zero-centered) resamples will have an average height of 15 feet or more: H0 is rejected.

    If all trees are 15 feet +/- 100 feet (assuming negative trees), we have much less certainty about the true average height being different from 0 (as the 15 feet average could just be the result of random variation). In the bootstrap test, a much larger number of (zero-centered) resamples will have an average height of 15 feet or more: H0 cannot be rejected.

    Apologies as I do not have any good links on this topic to refer you to. I feel maybe we should “branch” out to a discussion offline on this (email if you prefer) if you need…

    I do think Aronson’s book does a good job of explaining the concept (obviously in a longer form) and I thought I managed to put a clear synthesis of the ideas… Maybe not so clear.

  • Lou

    It seems that we’re talking past each other. My issue isn’t with bootstrapping. My issue with your original example is that you are contrasting the mean return with a sample distribution made up of (x – mean) observations and then stating that “…the bootstrap tests for the null hypothesis that the rule does not have any predictive power”.

    This sounds really vague and in any case I’m pretty sure that you haven’t found out anything about “predictive power”. I think it would help to know exactly what Ho: and Ha: are and what they have to do with predictive power.

    Thank you for your response.


  • Jez Liberty

    Yes – sorry that we don’t seem to understand each other…
    “the bootstrap tests for the null hypothesis that the rule does not have any predictive power” is a quote from Aronson’s book. In it he equates this to “the null hypothesis that the arithmetic mean return of the rule being back-tested is zero”

    H0: back-tested rule has no predictive power = arithmetic mean return of the rule is zero
    Ha: back-tested rule has predictive power = arithmetic mean return of the rule is positive

    I suppose you could also see/run this test in a different way:
    generate a sample distribution made up of (x) observations (instead of x – mean) and check how many observations are positive. If the number is high enough (ie 95%), the positive result would be deemed statistically significant.

    This would give the same results as the method described above.
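The two variants can be put side by side with a small sketch. It assumes that “observations” in the alternative version means bootstrapped means, and that the counterpart of the p-value is the share of those means at or below zero. For a roughly symmetric sampling distribution the two estimates should be close, though not necessarily identical. The function name and the synthetic returns are assumptions made for this illustration.

```python
import random
from statistics import fmean


def compare_bootstrap_variants(returns, n_resamples=2000, seed=0):
    """Return (p_centered, p_raw) for the two ways of running the test."""
    rng = random.Random(seed)
    returns = list(returns)
    n = len(returns)
    observed_mean = fmean(returns)

    # Variant A: resample zero-centred returns and count the bootstrapped
    # means at or above the observed mean.
    centered = [r - observed_mean for r in returns]
    means_a = [fmean(rng.choices(centered, k=n)) for _ in range(n_resamples)]
    p_centered = sum(m >= observed_mean for m in means_a) / n_resamples

    # Variant B: resample the raw returns and count the bootstrapped
    # means at or below zero.
    means_b = [fmean(rng.choices(returns, k=n)) for _ in range(n_resamples)]
    p_raw = sum(m <= 0 for m in means_b) / n_resamples

    return p_centered, p_raw


if __name__ == "__main__":
    # Synthetic returns, used only to exercise the function.
    demo_rng = random.Random(7)
    fake_returns = [demo_rng.gauss(0.001, 0.01) for _ in range(300)]
    p_a, p_b = compare_bootstrap_variants(fake_returns)
    print(f"zero-centred variant: {p_a:.3f}, raw variant: {p_b:.3f}")
```

With strongly skewed returns the two numbers can drift apart, because variant A looks at the right tail of the resampled means and variant B at the left tail.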

  • Bootstrap Testing of My Backtest data | MTJC-Capital's Blog

    […] blog – Automated Trading Systems.  His posts discussing bootstrap testing can be found here and […]

  • Mike

    Has anyone tried Aronson’s detrending on a portfolio? It is reasonably straight forward when dealing with a single symbol. But, what about when dealing with trades from multiple symbols? Must we detrend the history of each symbol individually for the calculations of those trades, or is there some way to amalgamate the data streams? Individual streams would likely have too few trades to really be of any value.

    P.S. The mean of your example 50% followed by -40% is 5%, not the 10% stated in your example.

  • Jez Liberty

    Hi Mike – oops, thanks for letting me know about that mistake. Fixed now.
    Re: the detrending question, I am not convinced by the concept of detrending (I discuss it here) and therefore have not researched it too much. Intuitively I would think you need to keep track of trades and daily drift for each individual symbol and make individual adjustments.

  • Lou

    Did you try the way that you proposed (quoted below) and were 95% of them positive? I’m curious to see how that worked out.
    “I suppose you could also see/run this test in a different way:
    generate a sample distribution made up of (x) observations (instead of x – mean) and check how many observations are positive. If the number is high enough (ie 95%), the positive result would be deemed statistically significant.”

    Also, until you stated the null I didn’t realize that you were just testing for Ho = 0. That was really what my original question was about.

    Thx for your help.

  • Jez Liberty

    Lou – glad that we managed somehow to understand each other…
    I have not tried the other option, but from a logical point of view it just seems to be the same thing phrased in a different way.

  • Lou

    If you have the time would you mind trying the other option. I’d really like to see if it works.

    Also, in this instance I don’t think that this was a particularly useful test. My understanding is that you had 2 strategies already optimized in this test and then you did a simple hypothesis test for 1 mean. You’d expect to reject the null in this case.

  • Attila

    Hello, Great blog, I am just catching up with the posts (and the book), but could you shed some light on this: “think about a return of 50% followed by a return of -40%: arithmetic mean return is +5%, yet the overall return is minus 5.1%” How did you get the -5.1% for the overall return? Thanks, Attila

    P.S.: Minor technical issue: reading the post in Google Reader via RSS, it seems it’s still picking up the original “arithmetic mean return is +10%” version.

  • Jez Liberty

    The return from +50% and -40% is -10%. When taking the square root (for geometric average) the average return is -5.1%

    Thanks for mentioning the issue on the rss reader. I’ll try to look into fixing it.

  • Headhorn

    Geometric Return = {[Prod_i(1+R_i)]^(1/n)}-1. So in the example given,

  • Subhash


    Lou is right. There is a logical fallacy in the RC test as suggested by the author.

    This is because the resampled distribution is from the “detrended” return (after subtracting the mean), but we are reading the p-value as the fraction of observations from the resamplings that is above the “initial” mean return. It’s not correct to compare these two because the initial mean return is not detrended.

  • Performance Metric: Pessimistic Gain/MDD « Quanting Dutchman

    […] of the original list of trades (If you want to learn more about bootstrap-testing I highly recommend this post by Jez Liberty). The Pessimistic Gain/MaxDD is then calculated by subtracting 1xSD from the Gain […]

  • ZigZag

    First of all, thanks for a great blog.
    Two basic (but important to me, at least) questions on bootstrapping as applied to estimating the distribution of equity CAGR and MaxDD of a series of back-tested trades P&Ls:
    1. re-sampling with replacement vs without replacement: I have seen plausible arguments for either approach. Do you have a view?
    2. how important it is to resample without destroying (totally or even partially) the original trade P&L series’ degree of randomness (or lack thereof)?
    Hope these issues are relevant to the other readers as well.

  • xi

    Great blog but I have a question about the random re-sampling. Would the random re-sampling totally destroy the correlation structure of the original sample?

  • Jez Liberty

    Yes, most likely, but I do not think it is an issue with this approach/usage, as we are not recreating trading signals off that re-sampled data or even calculating time-sensitive stats such as MaxDD, but only using one composite return stat.

  • Alex

    This is a great blog. I wonder why has anyone paid attention to the stuff that guy Aronson wrote. He repeats the same things over and over again in his book like he is talking to elementary school kids or even more seriously like he is trying to understand it himself. As some people have noted above, his tests are seriously flawed. Why would anyone want to check: “H0: back-tested rule has no predictive power = arithmetic mean return of the rule is zero”?

    In practice this will happen if there is no commission and slippage. But in reality these two can result in serious performance degradation. YOU DO NOT WANT to test the hypothesis before applying cost to your trading. This is preposterous indeed. Then, why bootstrapping at all? Bootstrapping returns WILL NOT subject your system to stress from new market conditions, which is the real issue here. Again, the whole book Aronson wrote promoted some idiosyncratic methods of hypothesis testing that have virtually little to do with real world system trading.

    Jez, thanks anyway, I hope you are doing fine.

  • Rick

    Hello everyone,

    After a few years, have we settled on the following issues?

    1. Does it make sense to subtract the mean and center the distribution on zero? Why not just count the percentage of mean returns above the target return and calculate the p-value accordingly?

    2. BTW, how is sample size determined in re-sampling? Does this have any effect? Or is the trade return order just reshuffled?

