Data cleaning

The Realized Library is based on underlying high-frequency data, which we obtain through Reuters DataScope Tick History. For commercial reasons we are not in a position to make this base data, or its cleaned version, available, as Reuters owns the copyright to it. Although the raw data is of high quality, it does need to be cleaned so that it is suitable for econometric inference.

Cleaning is an important aspect of computing realised measures. Although realised kernels are somewhat robust to noise, experience suggests that when prices are misrecorded, or when markets hit large amounts of turbulence at the start of a trading day, they may sometimes give false signals. Barndorff-Nielsen, Hansen, Lunde and Shephard (2009) have systematically studied the effect of cleaning on realised kernels, using cleaning methods which build on those documented by Falkenberry (2002) and Brownlees and Gallo (2006). Our data has more variation in structure than that dealt with in Barndorff-Nielsen, Hansen, Lunde and Shephard (2009), so we discuss how our methods use their rules.

Most of the datasets we use are based on indexes, which are updated at distinct frequencies. Some indexes, such as the DAX and the Dow Jones index, are updated every second or every couple of seconds; most are updated every 15 or 60 seconds. The only cleaning we applied to index data is the rule applied to all datasets, P1, given below.

All data

  • P1. Delete entries with a timestamp outside the interval when the exchange is open.
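
As an illustration, P1 amounts to a simple time filter. Below is a minimal sketch in Python/pandas, not the library's own code; it assumes tick data held in a DataFrame with a DatetimeIndex, and the opening hours used are purely hypothetical:

    import pandas as pd

    def apply_p1(ticks: pd.DataFrame, open_time: str = "09:30",
                 close_time: str = "16:00") -> pd.DataFrame:
        # P1: keep only entries time-stamped while the exchange is open.
        # The default opening hours are illustrative only.
        return ticks.between_time(open_time, close_time)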

Quote data only

Quote data for the exchange rates is very plentiful and has the virtue of being unaffected by market closures. We use four rules for cleaning it, given below as Q1-Q4, with Q1 by far the most commonly used; a code sketch implementing the four rules follows the list.

  • Q1. When multiple quotes have the same timestamp, we replace them all with a single entry recording the median bid and median ask price.
  • Q2. Delete entries for which the spread is negative.
  • Q3. Delete entries for which the spread is more than 50 times the median spread on that day.
  • Q4. Delete entries for which the mid-quote deviated by more than 10 mean absolute deviations from a rolling centered median (excluding the observation under consideration) of 50 observations (25 observations before and 25 after).
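
To make these rules concrete, the sketch below implements Q1-Q4 in Python/pandas and NumPy. It is illustrative rather than the library's actual code, and it assumes a DataFrame with a DatetimeIndex and columns named bid and ask; the reading of the mean absolute deviation in Q4 is our own assumption:

    import numpy as np
    import pandas as pd
    from numpy.lib.stride_tricks import sliding_window_view

    def apply_q_rules(quotes: pd.DataFrame) -> pd.DataFrame:
        """Illustrative sketch of Q1-Q4 (not the library's own code).
        Assumes a DatetimeIndex and columns 'bid' and 'ask'."""
        # Q1: collapse quotes sharing a timestamp to the median bid and ask.
        q = quotes.groupby(level=0)[["bid", "ask"]].median()

        # Q2: delete entries with a negative spread.
        q = q[q["ask"] >= q["bid"]]

        # Q3: delete entries whose spread exceeds 50 times the median
        # spread on the same calendar day.
        spread = q["ask"] - q["bid"]
        q = q[spread <= 50 * spread.groupby(q.index.date).transform("median")]

        # Q4: delete entries whose mid-quote is more than 10 mean absolute
        # deviations from the median of the 25 preceding and 25 following
        # mid-quotes (the observation itself is excluded; the MAD is taken
        # over the same window, one natural reading of the rule).
        mid = ((q["bid"] + q["ask"]) / 2).to_numpy()
        k = 25
        if len(mid) > 2 * k:
            win = sliding_window_view(mid, 2 * k + 1)   # one row per full window
            neigh = np.delete(win, k, axis=1)           # drop the centre observation
            med = np.median(neigh, axis=1)
            mad = np.abs(neigh - med[:, None]).mean(axis=1)
            keep = np.ones(len(mid), dtype=bool)        # edge observations are kept
            keep[k:-k] = np.abs(mid[k:-k] - med) <= 10 * mad
            q = q[keep]
        return q

Because Q1 runs first, the rolling window in Q4 sees at most one observation per timestamp.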

In addition, we have made various manual edits to the library when the results were unsatisfactory. Some of these were due to the rebasing of indexes, which had its biggest effect on daily returns. It is the hope of the editors of the library that, as it develops, the number of manual edits will decline.

References

  • Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde and N. Shephard (2009) "Realised kernels in practice: trades and quotes", Econometrics Journal, forthcoming.
  • Brownlees, C. T. and G. M. Gallo (2006) "Financial econometrics at ultra-high frequency: data handling concerns", Computational Statistics and Data Analysis, 51, 2232-2245.
  • Falkenberry, T. N. (2002) "High frequency data filtering", unpublished technical report.