This is really a post where 1 picture = 1,000 words, so please consider the datasets charted below:
Pretty different graphs, right?
Yet you might be surprised to hear that they are all identical under common summary statistics: mean, variance, (Pearson) correlation and linear regression line.
Here are the exact figures:
| Statistic (identical in each case) | Value |
|---|---|
| Mean of x | 9.0 |
| Variance of x | 11.0 |
| Mean of y | 7.5 |
| Variance of y | 4.12 |
| Correlation between x and y | 0.816 |
| Linear regression line | y = 3 + 0.5x |
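These figures are easy to check numerically. A quick sketch using NumPy, with the dataset values transcribed from Anscombe's 1973 paper (compare them against the appendix at the bottom of this post):

```python
# Verifying Anscombe's quartet summary statistics with NumPy.
import numpy as np

# Point coordinates as published in Anscombe (1973).
x = np.array([10.0, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])   # shared by sets I-III
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8.0, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for xs, ys in [(x, y1), (x, y2), (x, y3), (x4, y4)]:
    slope, intercept = np.polyfit(xs, ys, 1)
    print(f"mean(x)={xs.mean():.1f}  var(x)={xs.var(ddof=1):.1f}  "
          f"mean(y)={ys.mean():.2f}  var(y)={ys.var(ddof=1):.2f}  "
          f"corr={np.corrcoef(xs, ys)[0, 1]:.3f}  "
          f"fit: y = {intercept:.2f} + {slope:.2f}x")
```

All four lines print the same summary statistics (to the quoted precision), despite the very different charts.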
Details of all dataset points are at the bottom of this post if you fancy double-checking this yourself.
Credit for this amusing and quite extraordinary illustration goes to the statistician Francis Anscombe, who constructed the quartet back in 1973 (more info on Wikipedia).
I believe Anscombe’s main point was to show that summary statistics can be misleading (a fact greatly abused by today’s media?) and that outliers can have a strong impact on statistical properties. Both points bear directly on trading system design and on the conclusions you draw from testing.
Exploratory Data Analysis
Anscombe’s quartet is often used to highlight the importance of graphical exploration of the data for analysis. This concept is behind an area of statistics known as Exploratory Data Analysis. From wikipedia:
Exploratory data analysis (EDA) is an approach to analysing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses. It was so named by John Tukey to contrast with Confirmatory Data Analysis, the term used for the set of ideas about hypothesis testing, p-values, confidence intervals etc. which formed the key tools in the arsenal of practising statisticians at the time.
The objectives of EDA are to:
- Suggest hypotheses about the causes of observed phenomena
- Assess assumptions on which statistical inference will be based
- Support the selection of appropriate statistical tools and techniques
- Provide a basis for further data collection through surveys or experiments
In this interview, David Harding, from Winton Capital Management, stresses the importance of research and statistical data analysis in his company. The two books he highlights cover Exploratory Data Analysis (Understanding Robust and Exploratory Data Analysis and Nonparametric Statistical Inference).
Another topic that Anscombe’s quartet illustrates is correlation (and non-linearity).
The most common measure of correlation is the Pearson product-moment correlation coefficient. It captures only the linear dependence between two variables: any non-linear relationship between them will go undetected by the Pearson correlation.
To illustrate this, consider Anscombe’s top-right dataset, which exhibits a perfect (but non-linear) functional relationship between x and y. One might expect a dependence measure of 1, yet the Pearson coefficient reports only 0.816, because it measures linear association alone.
Consider the following more extreme case (of the form y = (x - a)^2 + b), where a 100% deterministic relationship translates into a zero linear (Pearson) correlation:
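A minimal sketch of this case (the values a = 0, b = 1 are arbitrary choices; the key is that x is sampled symmetrically around a):

```python
# A perfect functional relationship with zero Pearson correlation.
import numpy as np

a, b = 0.0, 1.0
x = np.arange(-5, 6, dtype=float)   # symmetric around a
y = (x - a) ** 2 + b                # y is fully determined by x

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0 - the linear measure misses the deterministic link entirely
```

The symmetry makes the positive and negative contributions to the covariance cancel exactly, so Pearson correlation reports no relationship at all.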
Taleb is probably a bit extreme when he says:
Anything that relies on correlation is charlatanism.
But using correlation on market data, typically described as non-linear, can have its pitfalls.
I was re-reading Ralph Vince’s paper on the Leverage Space Model not long ago, in which he describes the fallacy of correlation:
Correlation fails when you are counting on it the most – at the (fat) tails of the distribution. The point is evident throughout this study; big moves in one market amplify the correlation between other markets, and vice versa
In the paper, he explains that the correlation between two instruments becomes much stronger during periods of large standard-deviation moves: precisely when diversification (non-correlation) would be needed most.
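The effect is easy to reproduce in a toy simulation. This is not Vince’s model: it is just an assumed two-regime mixture (frequent calm days with weak correlation, rare high-volatility days with strong correlation), but it shows how the correlation measured on large-move days can dwarf the unconditional figure:

```python
# Illustrative two-regime simulation: correlation conditioned on large moves.
# Regime parameters (10% crisis days, rho/sigma values) are assumptions for
# the sketch, not figures from Vince's paper.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
crisis = rng.random(n) < 0.10                 # 10% of days are "crisis" days

def bivariate(m, rho, sigma):
    """Draw m correlated pairs with the given correlation and volatility."""
    z1, z2 = rng.standard_normal((2, m))
    return sigma * z1, sigma * (rho * z1 + np.sqrt(1 - rho**2) * z2)

x = np.empty(n); y = np.empty(n)
x[~crisis], y[~crisis] = bivariate((~crisis).sum(), rho=0.1, sigma=1.0)
x[crisis],  y[crisis]  = bivariate(crisis.sum(),   rho=0.9, sigma=3.0)

overall = np.corrcoef(x, y)[0, 1]
big = np.abs(x) > 3.0                         # condition on large moves in x
tail = np.corrcoef(x[big], y[big])[0, 1]
print(f"correlation overall: {overall:.2f}, on large-move days: {tail:.2f}")
```

Because the large moves are dominated by the high-correlation regime, the conditional correlation comes out far above the full-sample estimate, echoing Vince’s point about the tails.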
RIP Benoît Mandelbrot (20/11/1924 – 14/10/2010)
Finally, I would like to pay tribute to Benoît Mandelbrot, who passed away last week.
Mandelbrot has been dubbed “The Father of Fractals” and spent a big part of his life trying to understand how markets work. Looking at market data through the prism of fractal geometry, he tried to uncover non-linear, chaotic relationships in the data.
I’ll reiterate my recommendation to read his book, The (Mis)Behavior of Markets, one of the most enjoyable rebuttals of the Efficient Market Hypothesis. Although I have not (yet) found a practical application of the principles described in his book and papers, Mandelbrot’s attempt to define a new paradigm for understanding the markets is interesting, and the book reinforced my belief in and understanding of Trend Following (despite not covering that topic explicitly).
One of his main findings is that price changes in financial markets do not follow a normal (Gaussian) distribution, but rather a fat-tailed one (Lévy, Paretian, or power-law distributions). This is potentially one of the main reasons why Trend Following works (that is my interpretation, at least).
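A quick simulated comparison makes the fat-tail point concrete (the Student-t with 3 degrees of freedom and the 5-unit threshold are illustrative choices of mine, not from the book):

```python
# Tail comparison: Gaussian vs. a fat-tailed (Student-t, 3 df) distribution.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
gauss = rng.standard_normal(n)
# Rescale t(3) to unit variance (its raw variance is df/(df-2) = 3),
# so the comparison is at the same overall scale.
fat = rng.standard_t(df=3, size=n) / np.sqrt(3.0)

# Count "extreme" moves beyond 5 units in either direction.
n_gauss = int(np.sum(np.abs(gauss) > 5))
n_fat = int(np.sum(np.abs(fat) > 5))
print("Gaussian |z| > 5:", n_gauss)
print("Student-t |z| > 5:", n_fat)
```

Even after matching variances, the fat-tailed sample produces extreme moves orders of magnitude more often than the Gaussian, which is exactly the regime where trend followers hope to profit.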
I might actually pick up that book again and read it soon (and write a summary post on it).
Appendix: Dataset point coordinates: