Systematic Trading research and development, with a flavour of Trend Following
Au.Tra.Sy blog – Automated trading System

Data Pitfalls: a true Minefield?

May 4th, 2010 · 2 Comments · Data

Data - image by Pixelsior@flickr

In my other job, at a big investment bank, one of our main focuses on a daily basis is DATA. Making sure that the hundreds of feeds and millions of records get uploaded correctly and contain the right information is key to a smooth, successful day.

There is surely not as much data to interact with in automatic/systematic trading, but getting data right is equally vital to ensure quality trading or backtesting results.

Even with the best data package, there can be many pitfalls to look out for when trading or backtesting a systematic trading strategy. Below is a non-exhaustive list – please let me know in the comments if you think of anything to add.

Data Errors

Despite CSI being deemed one of the most reliable End-of-Day data providers, they still send you bogus data (albeit rarely), and you might want to run additional checks to verify that the data is valid. Some ideas of checks to perform:

  • Open, Low or Close above the High
  • Prices = 0
  • Price/Volume/Open Interest changes above tolerance
  • Gap in the dates
  • etc.

Of course, this will not catch all errors, but it will flag the larger data anomalies.
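The checks above can be sketched in a few lines of pandas. This is a minimal, illustrative validator; the column names, the 20% change tolerance and the 5-day gap threshold are assumptions you would tune to your own data.

```python
import pandas as pd

def validate_bars(df: pd.DataFrame, tolerance: float = 0.2) -> pd.DataFrame:
    """Flag suspicious rows in a daily OHLC DataFrame.

    Expects columns: open, high, low, close (DatetimeIndex).
    Returns a DataFrame of (date, reason) pairs for rows failing a check.
    """
    problems = []

    # Open, Low or Close above the High
    bad_range = df[df[["open", "low", "close"]].max(axis=1) > df["high"]]
    problems += [(ts, "price above high") for ts in bad_range.index]

    # Zero (or negative) prices
    bad_zero = df[(df[["open", "high", "low", "close"]] <= 0).any(axis=1)]
    problems += [(ts, "non-positive price") for ts in bad_zero.index]

    # Close-to-close change above tolerance (default 20%)
    pct = df["close"].pct_change().abs()
    problems += [(ts, "large price jump") for ts in pct[pct > tolerance].index]

    # Gap in the dates (more than 5 calendar days between bars)
    gaps = df.index.to_series().diff()
    problems += [(ts, "date gap") for ts in gaps[gaps > pd.Timedelta(days=5)].index]

    return pd.DataFrame(problems, columns=["date", "reason"])
```

Running this after each download gives you a small exception report to eyeball, rather than trusting the feed blindly.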

Static Data

Back in the bank job, part of the system data which changes infrequently (hence static) is subject to 4-eyes approval (basically enforcing that another pair of eyes double-checks the correctness of the data). The idea is that, as this data drives the calculations in the system, any error would give flawed calculation results, and needs double-checking.

The equivalent in the automatic trading system world is the futures contract specifications and format information (price quote type, decimal places, native currency, point value), contained in the FuturesInfo file in TradingBlox. Imagine an error of a factor of 10 on the contract point value – this could have you enter a trade at 10 times the intended position size!

CSI appears to contain several of these errors (they seem to have less focus on getting this type of data right), and even official exchange websites have been seen to quote incorrect information. Reconciling both would be a starting point to getting the right information.
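Reconciling two sources of static data can be automated as a simple diff. The sketch below compares hypothetical contract-spec records (the symbols, field names and values are made up for illustration) and reports any mismatch – mimicking the factor-of-10 point-value error mentioned above.

```python
# Hypothetical contract specification records, keyed by symbol.
csi_specs = {
    "CL": {"point_value": 1000.0, "currency": "USD", "decimals": 2},
    "GC": {"point_value": 10.0,   "currency": "USD", "decimals": 1},  # deliberate error
}
exchange_specs = {
    "CL": {"point_value": 1000.0, "currency": "USD", "decimals": 2},
    "GC": {"point_value": 100.0,  "currency": "USD", "decimals": 1},
}

def reconcile(a, b):
    """Return a list of (symbol, field, value_a, value_b) mismatches
    for symbols present in both sources."""
    diffs = []
    for symbol in sorted(set(a) & set(b)):
        for field in a[symbol]:
            if a[symbol][field] != b[symbol].get(field):
                diffs.append((symbol, field, a[symbol][field], b[symbol].get(field)))
    return diffs
```

Any non-empty output is a prompt for the "second pair of eyes" – a poor man's 4-eyes check.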

A possible solution to mitigate this problem (when starting trading a new, exotic instrument) is a quick in-and-out trade to double-check the actual information on your broker statement.

Data History and Overrides

The CSI data downloading process first retrieves all new prices to be stored in the Unfair Advantage database. It then applies potential corrections to past prices, in case errors were present. Finally, rollover logic is applied and the full data history file is generated (to be used in the trading/backtesting software).

The change in data resulting from error corrections might change the outcome of past trade results. For example, a data correction in the recent past might trigger a breakout and therefore a trade, which would now be open. How would you handle this?

Additionally, if the historical data contains errors not corrected by your data provider, you might want to apply overrides to correct them yourself. However, since the price history gets re-written every day, you need some form of automated overrides upload.

You might also want to make daily backups of price files to be able to compare when price corrections have taken place (and run your system with the exact data that was available on that day).
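An automated overrides upload can be as simple as re-applying a small corrections file after each daily download. The sketch below assumes a made-up CSV format (symbol, date, field, value) and an in-memory price history; in practice the history would be your generated data files.

```python
import csv
import io

# Hypothetical override file: one correction per line.
OVERRIDES_CSV = """symbol,date,field,value
GC,2010-03-15,close,1105.4
CL,2010-02-02,high,77.85
"""

def apply_overrides(history, overrides_csv):
    """Patch a price history dict {(symbol, date): {field: value}} in place.
    Returns the number of corrections applied."""
    applied = 0
    for row in csv.DictReader(io.StringIO(overrides_csv)):
        key = (row["symbol"], row["date"])
        if key in history:
            history[key][row["field"]] = float(row["value"])
            applied += 1
    return applied
```

Since the provider rewrites the history every day, this step has to run as part of the daily pipeline – a one-off manual fix would be silently undone the next morning.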

Other Data

Data such as historical FX rates, interest rates or margin data is necessary to calculate some of the system statistics (including overall performance).

The FX rates are mostly used to convert the P&L from the native contract currency to the main account currency.

Interest rates would be used to calculate the interest earned on the account, whose impact on overall performance is not negligible for a long-term Trend Following system.

The Margin data is useful to determine the total amount of margin your system requires at any one time, with statistics such as the Margin-to-Equity ratio. Unfortunately, it is fairly hard to get hold of this type of data, which changes fairly often, at the discretion of the exchanges. Most backtesting platforms I have come across use a static margin number, which is fairly inaccurate.

To get more accurate results, you could run your backtesting simulations using historical (real or simulated) margin requirements. A method suggested on various forums to estimate historical margin requirements is to run a regression analysis between margin requirements (known history) and underlying instrument volatility (using different potential measurements such as ATR, High-Low, Bollinger Bands, etc.). Finding a good-fit model would allow you to extrapolate past margin requirements based on volatility.
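The regression idea can be sketched with a simple linear fit of margin against a volatility measure. The numbers below are invented for illustration (and happen to lie on a line); real margin history is noisier and may call for a more careful model.

```python
import numpy as np

# Hypothetical data: known margin requirements and matching ATR readings.
atr    = np.array([1.2, 1.5, 2.0, 2.8, 3.5])              # volatility measure
margin = np.array([2400.0, 3000.0, 4000.0, 5600.0, 7000.0])  # exchange margin, USD

# Fit margin = a * ATR + b over the period where margin history is known.
a, b = np.polyfit(atr, margin, 1)

def estimated_margin(atr_value):
    """Extrapolate a margin requirement from volatility, using the fitted line."""
    return a * atr_value + b
```

With a decent fit, `estimated_margin` can backfill the margin series for dates where only volatility is available, giving a Margin-to-Equity ratio that varies through the backtest instead of a single static number.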

Holidays are also useful data to collect, to be able to know which exchanges are open when (note that exchanges can have different holidays even when they are in the same country).

Backup Provider

Your main data provider might be down or unavailable (hopefully on rare occasions only). A backup, alternative source of market data might come in very handy on those special days.

Back-Adjustments and Rollovers

There is unfortunately no ideal or universal solution to rolling and back-adjusting contracts. As discussed earlier, a proportional back-adjustment allows for a better representation of prices, preserving the historical ratios between all prices. However, backtesting systems usually require point-based back-adjustments in order to calculate the P&L of each trade correctly. You might need to maintain several sets of back-adjusted prices to be used in your simulations.
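The difference between the two adjustment styles is easiest to see at a single roll. In this toy example the expiring contract closes at 100.0 and the next contract at 104.0 on roll day (made-up numbers): point-based adjustment shifts the old history by the gap, proportional adjustment scales it by the ratio.

```python
# Prices of the expiring (front) contract before the roll.
old_history = [95.0, 98.0, 100.0]

roll_gap   = 104.0 - 100.0   # point difference between next and front contract
roll_ratio = 104.0 / 100.0   # price ratio between next and front contract

# Point-based: shift the old history by the gap (preserves per-trade P&L in points).
point_adjusted = [p + roll_gap for p in old_history]

# Proportional: scale the old history by the ratio (preserves percentage moves).
ratio_adjusted = [p * roll_ratio for p in old_history]
```

Note that the point-adjusted series changes the historical percentage moves, while the ratio-adjusted series changes the point distances – hence the need for separate series depending on what the simulation computes.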

Furthermore, different instruments might require different rolling logic. Different delivery months in agricultural futures, for example, might relate to different crops. It might not make as much sense to roll from one to the other in the same way as could be done with equity index futures.

Do you get the rollover information on rollover day?
Let's assume that, on Thursday, you receive data indicating that a roll should have occurred on Tuesday (i.e. a roll based on Open Interest or Volume shifts). Let's also assume that an entry signal was triggered on the Wednesday, and that the high volatility of the front contract would have prevented the system from entering the trade (risk management threshold breached) – whereas the next contract's volatility, being much lower, would have allowed the trade. In real life, Wednesday's trade would not take place (as the system does not yet know that the contract should have rolled on Tuesday). In backtested life, however, Wednesday's trade would take place, as the system would assume the roll happened on Tuesday. These things do actually happen!

You might want to test using the data as you would have received it, instead of a constructed back-adjusted series, potentially containing hindsight information.

LME Data

Some instruments trade in a "strange fashion", and the London Metal Exchange is a perfect example of this. Trading on the LME is done by purchasing a forward contract with a given maturity (or prompt date). For instance, you might purchase a 3-month forward contract, which will run for the next 3 months, gradually maturing to converge to the Cash price (at expiration), including the effect of contango or backwardation. However, the data quoted from CSI (and probably any other data provider) is a new daily quote for a new 3-month forward, which does not reflect real trading.

You would probably need to construct a custom price series with a bespoke algorithm to be able to reflect and backtest real-life trading. This topic was further discussed in a tradingblox forum post.

Need for an ETL layer?

Drawing the parallel to the bank job again, where we use specialised tools/platforms to deal with data (Extract-Transform-Load = ETL), I believe that a similar layer between your data provider(s) and your trading/backtesting system is necessary, used to perform data checks, archiving, rolls and back-adjustments, etc. (all points discussed above). It could also be used to add extra functionality, such as creating spread time series to be traded/backtested as new instruments.
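Such a layer can start very small: three composable stages, each taking and returning the price series. The sketch below is purely illustrative – the stage names and the trivial transforms stand in for the real checks, overrides and back-adjustments discussed above.

```python
def extract(raw):
    """Parse the provider's daily file into an in-memory series (stub)."""
    return list(raw)

def transform(series):
    """Apply checks/overrides/adjustments; here: drop bad prices, normalise."""
    return [round(p, 2) for p in series if p > 0]

def load(series, store):
    """Write the cleaned series to the backtester's data store (stub)."""
    store.extend(series)
    return store

store = []
load(transform(extract([100.123, 0.0, 101.456])), store)
```

The point is the separation of concerns: the trading/backtesting system only ever sees the output of `load`, and every check or correction lives in one place in the `transform` stage, where it runs on every daily cycle.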

The data issue might not be as much of a trading problem as an IT one; nonetheless it is an important one that needs to be addressed – or it could impact your trading and backtesting results.



2 Comments so far ↓

  • Craig

    Another one to look out for is how specific providers timestamp their data. I spent some rather irritating time finding out that the bar timestamps on DTN are the close time, not the open time. Is this mentioned anywhere on their site? Nope…

  • RiskCog

    Great ideas, I am going to take action by adding some code to look for obvious errors every time I load data!
