The second method to evaluate the statistical significance of a backtest result presented by Aronson (in EBTA) is the Monte Carlo Permutation. This is an extension of the classic Monte Carlo method, applied to rule testing.
The concept behind the Monte Carlo Permutation is similar to the Bootstrap method:
 Generate multiple random outputs based on the single sample of data from the backtest.
 Compare the random Monte Carlo outputs to the backtest output to evaluate its statistical significance.
The difference lies in how the multiple random outputs are generated. Whereas the bootstrap generates a sampling distribution for the backtested rule return, the Monte Carlo Permutation focuses on the pairing of the rule positions with the instrument's daily returns. Its resampling randomly pairs the rule positions with the market returns, without replacement.
The null hypothesis (H0) of the Monte Carlo Permutation test asserts that the returns of the rule evaluated are a sample from a non-profitable population or, in other words, that any correlation between the rule positions and the market returns is purely random.
Monte Carlo Illustration
Imagine the following backtest result, presented day by day:
Day            1       2       3       4       5       6       7       8
Rule Position  Long    Long    Long    No Pos  No Pos  Short   Short   Short
Market Return  0.54%   -0.32%  1.54%   0.69%   -1.02%  -0.68%  1.20%   -2.50%
Output         0.54%   -0.32%  1.54%   0.00%   0.00%   0.68%   -1.20%  2.50%
Mean Return    0.47%

There are effectively two input time series:
 Rule Positions
 Market Returns
The way these two time series are linked (by date) produces the daily output for the rule return – and a mean return can be calculated.
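As a minimal sketch of that linkage, the snippet below encodes positions as +1 (long), 0 (no position) and -1 (short) and multiplies each by the same day's market return. The encoding and the function names are illustrative assumptions, and the market return signs used here are also an assumption, chosen so that the mean matches the 0.47% of the example.

```python
# Rule positions for days 1..8: +1 = long, 0 = no position, -1 = short.
positions = [1, 1, 1, 0, 0, -1, -1, -1]

# Daily market returns in percent (signs are an assumption, see lead-in).
market_returns = [0.54, -0.32, 1.54, 0.69, -1.02, -0.68, 1.20, -2.50]


def rule_output(positions, market_returns):
    """Daily rule return: position multiplied by that day's market return."""
    return [p * r for p, r in zip(positions, market_returns)]


def mean_return(positions, market_returns):
    """Arithmetic mean of the daily rule returns."""
    output = rule_output(positions, market_returns)
    return sum(output) / len(output)


# Roughly 0.4675, i.e. the 0.47% mean return of the example.
print(f"{mean_return(positions, market_returns):.4f}")
```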
The permutation step of the Monte Carlo method reshuffles one of the two time series to produce random pairings, and therefore a different rule output.
Two examples can be found below. The market return time series has been randomly reshuffled to produce two different sample outputs:
Day            1       2       3       4       5       6       7       8
Rule Position  Long    Long    Long    No Pos  No Pos  Short   Short   Short
Market Return  0.54%   0.69%   1.54%   -0.68%  1.20%   -0.32%  -2.50%  -1.02%
Output         0.54%   0.69%   1.54%   0.00%   0.00%   0.32%   2.50%   1.02%
Mean Return    0.83%

Day            1       2       3       4       5       6       7       8
Rule Position  Long    Long    Long    No Pos  No Pos  Short   Short   Short
Market Return  -2.50%  -1.02%  1.54%   -0.32%  1.20%   -0.68%  0.69%   0.54%
Output         -2.50%  -1.02%  1.54%   0.00%   0.00%   0.68%   -0.69%  -0.54%
Mean Return    -0.32%

The Monte Carlo Permutation produces a large number of these random outputs. The p-value of the original backtesting sample can then be computed: it is equal to the fraction of random rule returns that are equal to or greater than the backtested rule return.
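A rough sketch of this single-rule test is shown below. The function name, the +1/0/-1 position encoding, the default number of permutations and the small tolerance used for ties are all assumptions made for illustration, not something taken from EBTA.

```python
import random


def permutation_p_value(positions, market_returns, n_permutations=10000, seed=0):
    """Return (observed mean, p-value) for one rule.

    positions: +1 (long), 0 (no position), -1 (short), one per day.
    market_returns: daily market returns, same length as positions.
    """
    rng = random.Random(seed)
    n = len(positions)
    observed = sum(p * r for p, r in zip(positions, market_returns)) / n

    shuffled = list(market_returns)
    at_least_as_good = 0
    for _ in range(n_permutations):
        # Shuffling is sampling without replacement: every market return
        # is used exactly once, only its pairing with a position changes.
        rng.shuffle(shuffled)
        permuted = sum(p * r for p, r in zip(positions, shuffled)) / n
        # Tiny tolerance so floating-point noise does not break ties.
        if permuted >= observed - 1e-12:
            at_least_as_good += 1
    return observed, at_least_as_good / n_permutations


example_positions = [1, 1, 1, 0, 0, -1, -1, -1]
example_returns = [0.54, -0.32, 1.54, 0.69, -1.02, -0.68, 1.20, -2.50]
observed_mean, p_value = permutation_p_value(example_positions, example_returns)
print(observed_mean, p_value)
```

With only eight days the p-value is coarse; a real backtest would supply far longer series.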
Note that Aronson once again recommends running the backtest evaluated by the Monte Carlo Permutation on detrended data. He also mentions that Timothy Masters (who came up with the idea of applying the Monte Carlo method to rule testing) has performed tests showing that the bootstrap and Monte Carlo Permutation methods produce similar results when using detrended data.
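Detrending is not spelled out in this post. One common reading, assumed here, is to subtract the average daily market return from every daily return so that the detrended series has a zero mean. The helper name below is hypothetical.

```python
def detrend(market_returns):
    """Subtract the mean daily return from each daily return (assumed method)."""
    mean = sum(market_returns) / len(market_returns)
    return [r - mean for r in market_returns]


detrended = detrend([0.54, -0.32, 1.54, 0.69, -1.02, -0.68, 1.20, -2.50])
print(sum(detrended))  # very close to zero, up to floating-point error
```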
Step by Step with Data Mining Bias Handling
Of course when applying this method to more than one rule, data mining bias comes into play.
The methodology for the Monte Carlo Permutation for data mining backtesting can be broken down as follows:
 1. N backtests are run on detrended data. Both the rule position and market return time series are collected for the backtested rules.
 2. The market return time series is randomly reshuffled and paired with each of the N rule position time series to produce a new daily rule output time series for each rule. The same pairing must be used for all rules so that any correlation structure present among the rules is preserved.
 3. A mean daily return is calculated for each of the N rules. The best mean return is selected as the value for the sampling distribution in this iteration.
 4. Repeat steps 2 and 3 a large number of times.
 5. Form the sampling distribution of the best means generated in the steps above.
 6. Derive the p-value of the best backtest mean return based on the sampling distribution.
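The steps above could be sketched as follows. This is an illustrative outline only: the function name, the +1/0/-1 position encoding, the default iteration count and the tie tolerance are assumptions, and the market returns passed in are assumed to be detrended already.

```python
import random


def best_rule_p_value(rule_positions, market_returns, n_permutations=5000, seed=0):
    """Return (best observed mean, p-value) across N rules.

    rule_positions: list of N position series (+1, 0, -1), one per rule.
    market_returns: detrended daily market returns shared by all rules.
    """
    rng = random.Random(seed)
    n = len(market_returns)

    def best_mean(returns):
        # Mean daily return of each rule; keep only the best one.
        return max(
            sum(p * r for p, r in zip(positions, returns)) / n
            for positions in rule_positions
        )

    observed_best = best_mean(market_returns)

    shuffled = list(market_returns)
    best_means = []
    for _ in range(n_permutations):
        # One shuffle per iteration, applied to every rule, so the
        # correlation structure between rules is preserved.
        rng.shuffle(shuffled)
        best_means.append(best_mean(shuffled))

    # Fraction of the sampling distribution at or above the best backtest mean.
    exceed = sum(1 for m in best_means if m >= observed_best - 1e-12)
    return observed_best, exceed / n_permutations


rules = [
    [1, 1, 1, 0, 0, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [0, 0, 1, 1, 1, 1, 0, 0],
]
returns = [0.54, -0.32, 1.54, 0.69, -1.02, -0.68, 1.20, -2.50]
best, p_best = best_rule_p_value(rules, returns)
print(best, p_best)
```

Because each iteration keeps the best of the N permuted means, the resulting p-value is never lower than what the single-rule test would give the same rule with the same shuffles.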
Some “Criticisms”
Aronson mentions that since the Monte Carlo Permutation does not test a hypothesis about the rule’s mean return (H0 is about random correlation of positions and market returns) it is not possible to use it to derive confidence intervals – as could be done with the bootstrap sampling distribution.
The method also requires access to more information than the bootstrap (which only needs the daily rule returns). This makes it impossible to apply to “black box” systems or programs. For example, the Monte Carlo Permutation method would not enable us to check the statistical significance of a Trend Following Wizard, as was done in bootstrap post #2.
The same remark concerning the use of arithmetic mean return instead of geometric mean return applies here also, but that can be easily modified.
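As one possible way to make that change, the mean used in the test could be swapped for a geometric mean of the daily rule returns. The sketch below assumes returns are expressed as fractions (0.01 for 1%) and that no daily return is -100% or worse; the function name is hypothetical.

```python
import math


def geometric_mean_return(daily_returns):
    """Compounded average daily return: (prod(1 + r)) ** (1 / n) - 1."""
    growth = math.prod(1.0 + r for r in daily_returns)
    return growth ** (1.0 / len(daily_returns)) - 1.0


# A +10% day followed by a -10% day: arithmetic mean is 0,
# but the compounded result is a small loss.
print(geometric_mean_return([0.10, -0.10]))
```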
Finally, the method, as formulated, only considers extremely simple cases of money management with identical size for all positions. The method would need to be adapted to be used for rules with more complex money management strategies.
I’ll let you come to your own conclusions and run your own experiments, but it does seem like the Monte Carlo Permutation method has more weak points than the Bootstrap test.
Craig // Aug 19, 2010 at 1:44 pm
I have done some comparisons of the t-test and the nonparametric tests described in the Aronson book. Basically I found that the results are so close (even in highly non-Gaussian cases) that the extra CPU involved in the nonparametric tests is simply not worth it.
Andrew // Sep 7, 2010 at 6:44 am
I have recently stumbled across a forum posting (almost 6 years old!) that discusses permutation testing.
http://www.nuclearphynance.com/Show%20Post.aspx?PostIDKey=20934
2sedated // Feb 10, 2012 at 4:33 pm
market returns should always be the same,
in the original sample you have 8 different market returns, and in the resampled examples you have some other values.
isn't the procedure: randomly arrange market returns, and then select them one by one and pair each up with the rule output (long or short)?
in the first example you have 0.17% and there is no such market return in the original sample…
why is that?
thanks.
Jez Liberty // Feb 11, 2012 at 4:29 am
2sedated – perfectly correct. The examples did contain some errors (and were not in line with the explanation). This should be fixed now (according to the description and your correct understanding).
Apologies for the confusion.
Jez
2sedated // Feb 11, 2012 at 1:56 pm
@Jez
I am glad to be of help.
BTW you have one of the BEST BLOGS on automated trading, backtesting etc.
Great job,
and keep up the good work.
Kind regards from Croatia.
Jez Liberty // Feb 12, 2012 at 10:48 am
Thanks 2sedated!
2sedated // Feb 12, 2012 at 7:08 pm
@Jez
I wonder if you could clarify one thing for me, plz. I have my personal trading algorithm, that i have backtested on tick data, using a platform that i have built myself. This platform accounts for volume being traded, depth of the market, trading costs etc, and i have very positive returns over many samples (IWM 5yrs data, TNA/TZA 3yrs data etc).
I would like to test my system with bootstrap test, monte carlo, all enhanced with romano and wolf.
But i have a problem. How do i define what is a rule? For example. Let's say that i consider the market to be bullish if SMA200 on a minute chart has a positive angle, and i produce a signal with RSI falling below 25. In this simple example is this only one rule or two? Only one output is given…. i am a little confused on how to define “rule”.
I have read the whole Evidence-Based TA book, but have not managed to clarify to myself how to distinguish what is a single rule, what is a complex rule and when can i take a complex rule and think of it as “just one rule”.
:)
Jez Liberty // Feb 13, 2012 at 2:17 am
2sedated,
First off, the bootstrap test is convenient in that it does not require you to define rules.
For the MC testing, I think of a rule as the whole set of rules in a system that produce the trading positions (i.e. Long/Short/Neutral). Actually, as defined in the article, you don’t even need to know what the rules are. Just what the rule output is for each day.
Hope this helps.
Jez
2sedated // Feb 16, 2012 at 7:21 pm
@ Jez
I was wondering if that was the case, but you confirmed it for me. Thank you very much.
Still,
i have but one more troubling detail :)
My daytrading system produces on average 2-3 trading signals, mostly of the same type in the same day (LLL or SSS or rather LNLNL or SNSNS), so i have difficulty in choosing my time unit for any given output.
In this example given above the unit of division is one day, where the output of the system is either long or short in that particular day. I wonder if my unit of division should then be the minute? I must say that at end of day my system closes its position so there is never overnight holding of assets. I suppose that because my system trades only during the 1-day sessions my time unit of division should be 1 minute. For example in minutes 1 2 3 4 the output was LLLL then in 5 6 7 was NNN etc… and then associate the appropriate market returns for those minutes.
And there is another thing that bothers me a lot, and that is “slippage”. In the example given above we have the following… on the first 3 days the rule output is L L L, and the market performance is given as +0.54, -0.33, +1.54 and those values are then associated with the “return” of the rule in that day. But this approach doesn't take into account when exactly the rule entered long positions on the first day. Was it at the open of the first day? What if the rule generates the long signal in the middle of the day?
What is the the appropriate approach in this problem?
I test my system on extensive tick data, market depth (full order book), and it is VERY VERY important to say that slippage of even 1-2 cents can make or break the system, so i presume i have to resolve this in some logical manner or otherwise this “slippage” would greatly influence my p-values of statistical significance, as it does with my profit line, and thus render bootstrap and MC method totally worthless.
Could you comment on this problem?
Thank you very much.