Friday, December 5, 2014

Using R to Test Pairs of Securities for Cointegration

Paul Teetor
August 2009

Ernie Chan's book, Quantitative Trading, explains why cointegrated pairs of securities are useful for constructing mean-reverting trades.  It also explains how to test pairs of securities for cointegration.  Ernie uses Matlab, but some readers may want to use R, the software for statistical computing and graphics.  This note explains how to perform the cointegration test using R.

Let's assume you have the history of daily prices for two stocks, GLD and GDX. You want to know if the prices are cointegrated. Let's also assume the data are captured in two files, GLD.csv andGDX.csv, which are in comma-separated value (CSV) format with seven columns: date, open, high, low, close, volume, and adjusted close.

Two notes before we begin.

    This is not a tutorial on R.  (Sorry.)  You can learn about R from its official web site, from An Introduction to R, from one of the tutorials, or from one of the available books.
    This is also not a tutorial on cointegration.  For that I recommend Ernie Chan's book; or, for the more mathematically inclined, the book by Bernhard Pfaff, Analysis of Integrated and Cointegrated Time Series with R.  Both books are excellent.

Data Representation

You could perform this work using vectors or data frames to represent your time series data, but that would be tedious. I use zoo objects for representation of time series, and I strongly recommend either the zoo package or the xts package.  xts is a super-set of zoo with an extremely fast implementation and many other features. Here, I'll assume you are using zoo.

Once your data are captured in a zoo object, say t, it behaves like a data frame. One zoo object can contain several columns, where each column is a different time series and each row is a set of (simultaneous) observations of those series.  The object provides an additional attribute, index(t), which is the vector of dates, one date per observation.  The first and last dates are available using start(t) and end(t), respectively.
Loading the Data

Loading the CSV files into a zoo object requires a few simple steps.

    Read the two files into two data frames.
    Convert the dates from strings into Date objects.
    Convert the two data frames into two zoo objects.
    Take the intersection of the two zoo objects.  That will create one zoo object with the observations common to both datasets.

Here is the R code.

library(zoo)            # Load the zoo package

# Read the CSV files into data frames
#
gld <- read.csv("GLD.csv", stringsAsFactors=F)
gdx <- read.csv("GDX.csv", stringsAsFactors=F)

# The first column contains dates.  The as.Date
# function can convert strings into Date objects.
#
gld_dates <- as.Date(gld[,1])
gdx_dates <- as.Date(gdx[,1])

# The seventh column contains the adjusted close.
# We use the zoo function to create zoo objects from that data.
# The function takes two arguments: a vector of data and
# a vector of dates.
#
gld <- zoo(gld[,7], gld_dates)
gdx <- zoo(gdx[,7], gdx_dates)

# The merge function can combine two zoo objects,
# computing either their intersection (all=FALSE)
# or union (all=TRUE).
#
t.zoo <- merge(gld, gdx, all=FALSE)

# At this point, t.zoo is a zoo object with two columns: gld and gdx.
# Most statistical functions expect a data frame for input,
# so we create a data frame here.
#
t <- as.data.frame(t.zoo)

# Tell the user what dates are spanned by the data.
#
cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

The as.Date function assumes your date strings are formatted like yyyy-mm-dd, which is the ISO standard format. If your strings use the common US format of mm/dd/yy, include a formatspecification.

gld_dates <- as.Date(gld[,1], format="%m/%d/%y")

gdx_dates <- as.Date(gdx[,1], format="%m/%d/%y")

Constructing the Spread

In Ernie's book, he first tests for cointegration, then he constructs the spread. In R, we do it the other way around: First we construct the spread, then we test the spread for a unit root.  It the spread has a root inside the unit circle, the underlying securities are cointegrated.  (See Bernhard Pfaff's book for a discusion of unit roots and their significance.)

The spread is defined this way.

S = y - (ß × x)

where ß is the hedge ratio, calculated using ordinary least squares (OLS).  Rearranging terms, we want to find the value of ß which best fits this equation.

 y = (-ß) × x

This is a simple linear equation with no y intercept.  In R, the lm function can fit a linear model such as this.

# The lm function builds linear regression models using OLS.
# We build the linear model, m, forcing a zero intercept,
# then we extract the model's first regression coefficient.
#
m <- lm(gld ~ gdx + 0, data=t)
beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

# Now compute the spread
#
sprd <- t$gld - beta*t$gdx

The first argument to lm is a formula, which specifies the linear model.  The formula gld ~ gdx + 0 says the model is

GLDi = ß×GDXi + ei

(If we omit "+ 0" from the formula, R would fit a y intercept, too.)
Testing for Cointegration

The Augmented Dickey-Fuller test is a basic statistical test for a unit root, and several R packages implement that test.  Here, we will use the adf.test function which is implemented in thetseries package.  The function returns an object which contains the test results. In particular, it contains the p-value that we want.

library(tseries)            # Load the tseries package

# Setting alternative="stationary" chooses the appropriate test.
# Setting k=0 forces a basic (not augmented) test.  See the
# documentation for its full meaning.
#
ht <- adf.test(sprd, alternative="stationary", k=0)
cat("ADF p-value is", ht$p-value, "\n")

Setting alternative="stationary" is important.

    To a statistician, it specifies a null hypothesis that the spread is non-stationary or explosive.
    For everyone else, it means that if the p-value is small, the spread is mean-reverting. By "small," I mean less than 0.10 or less than 0.05, depending how fussy you are.  (Smaller is better.)

We can interpret the ADF test results for the user.

# The ht object contains the p-value from the ADF test.
# The p-value is the probability that the spread is NOT
# mean-reverting.  Hence, a small p-value means it is very
# improbable that the spread is NOT mean-reverting
# (got that?).
#
if (ht$p.value < 0.05) {
    cat("The spread is likely mean-reverting.\n")
} else {
    cat("The spread is not mean-reverting.\n")
}

One word of caution:  The adf.test function essentially detrends your data before testing for stationarity.  If your data contains a strong trend, you might be very surprised to learn it is "mean reverting" when it is obvously moving upward or downward.  If this is a problem for you, consider the  fUnitRoots package which contains the adfTest function (note the spelling!).  That function lets you analyze either with or without the trend assumption.
Putting It All Together

OK, I've talked a lot, but it's really not much code. Here it is with some fat removed.

library(zoo)
library(tseries)

gld <- read.csv("GLD.csv", stringsAsFactors=F)
gdx <- read.csv("GDX.csv", stringsAsFactors=F)

gld <- zoo(gld[,7], as.Date(gld[,1]))
gdx <- zoo(gdx[,7], as.Date(gdx[,1]))

t.zoo <- merge(gld, gdx, all=FALSE)
t <- as.data.frame(t.zoo)

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")
m <- lm(gld ~ gdx + 0, data=t)
beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

sprd <- t$gld - beta*t$gdxht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p.value, "\n")
if (ht$p.value < 0.05) {
    cat("The spread is likely mean-reverting\n")
} else {
    cat("The spread is not mean-reverting.\n")
}

When I ran this code on recent data, I got this output.

Date range is 2006-05-23 to 2009-08-10
Assumed hedge ratio is 1.928096
ADF p-value is 0.2609909
The spread is not mean-reverting.
Postscripts

I used many R functions, but did not try to explain all their nuances.  If you have any questions (e.g., "Why did you set stringsAsFactors=FALSE for read.csv?"), use the help facility to learn more about the function:

> ?read.csv

These notes assume your data is captured inside CSV files.  I did that just to mirror the example in Ernie Chan's book.  R can read its data from many places.  To learn more about R's input/output in general, start with the R Data Import/Export manual.  To learn more about loading financial data in particular, visit the web site for the quantmod package, an excellent source of tools for anyone working with financial data.

R's input/output is so flexible, in fact, that you can read data directly from web sites, such as the Yahoo! Finance web site.  Just use a URL instead of a file name, like this.

gld <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=GLD&ignore=.csv", stringsAsFactors=F)
gdx <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=GDX&ignore=.csv", stringsAsFactors=F)

This eliminates the extra step of saving the data to an intermediate file.

I used the adf.test function, above, to test for stationarity.  The ADF test is implemented in several other packages, too, such as the fUnitRoots and urca packages.  Bernhard Pfaff's book contains an entire chapter on tests for unit roots, including tests beyond the original ADF test.

Finally, high-powered statisticians will be offended by my statement that "the p-value [of the ADF test] is the probability that the spread is not mean-reverting." Just for the record, here is the strict interpretation:  "The p-value is the probability that we could have observed the spread data, given the assumption that the spread is not mean-reverting."