## Maximum Likelihood Estimation

Generally, when traders or risk managers use EWMA, they choose the smoothing parameter, λ, arbitrarily. We could also do this for GARCH and arbitrarily set values for the parameters. However, those who believe the variance process is described by a GARCH model generally do not do this, relying instead on maximum likelihood estimation to find the parameters for each underlying. (This is one of the often-stated objections to GARCH: that it is subject to retrospective curve fitting of the parameters. More correctly, this is a possible problem with the common implementation of the model rather than with the model itself.)

Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability distribution. Likelihood differs from probability: probability refers to the chance that a future event occurs, whereas likelihood refers to events that have already been observed. In MLE the parameters are chosen to maximize the probability of the observed outcome actually having happened.

A commonly cited example is estimating the number of cabs in a city. All we know is that each cab is given a unique number and that numbers are issued consecutively, with none skipped. If the first cab we see has number 2,028, what is our MLE estimate? Clearly there can be no fewer than 2,028 cabs. Recall that MLE looks for the parameter that makes the observation most likely. In this case, if there were exactly 2,028 cabs in the city, our chance of spotting that particular cab would be 1/2,028. If there were any more cabs than this, we would have had a smaller chance of seeing exactly that one. So the MLE estimate is 2,028. Note that although this is our best estimate, it could be a long way from the actual number. There could well be 10,000 cabs. But MLE makes the best use of the information available to us.
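The cab argument can be checked numerically. This is an illustrative sketch (the candidate counts other than 2,028 are arbitrary choices of mine): the likelihood of the observation is 1/N for any city with N ≥ 2,028 cabs, which is maximized at exactly N = 2,028.

```python
# If the city has N cabs, each equally likely to be the first one seen,
# the probability of observing cab number 2028 is 1/N (for N >= 2028).

def likelihood(n_cabs, observed=2028):
    """Likelihood of the observation given a total of n_cabs cabs."""
    return 1.0 / n_cabs if n_cabs >= observed else 0.0

candidates = [2027, 2028, 2500, 10000]
likes = {n: likelihood(n) for n in candidates}
best = max(likes, key=likes.get)  # the MLE estimate: 2028
```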

The next example is less trivial, although still somewhat contrived. Let's assume that we toss a coin 10 times. We don't know whether it is a fair coin—in fact, it could be one of three coins. One coin comes up heads one-third of the time, another gives heads half of the time, and the third yields heads two-thirds of the time. If we get six heads in our experiment, what coin is it most likely that we were using?

Let p be the (unknown) probability of throwing a head, so the probability of getting a tail is 1 − p. The result of a series of coin tosses is described by the binomial distribution: the probability of getting h heads from N tosses is

$$P(h) = \binom{N}{h} p^h (1-p)^{N-h}$$

Evaluating this for h = 6 and N = 10 gives the likelihood of each candidate coin:

| p   | Likelihood of six heads in ten tosses |
|-----|---------------------------------------|
| 1/3 | 0.057                                 |
| 1/2 | 0.205                                 |
| 2/3 | 0.228                                 |

So in this case it is most likely that we have been tossing the coin that comes up heads two-thirds of the time.
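The three likelihoods can be reproduced directly from the binomial formula; a minimal sketch:

```python
from math import comb

def binom_likelihood(p, h=6, n=10):
    """Probability of exactly h heads in n tosses given head-probability p."""
    return comb(n, h) * p**h * (1 - p)**(n - h)

likes = {p: binom_likelihood(p) for p in (1/3, 1/2, 2/3)}
best = max(likes, key=likes.get)  # 2/3, matching the values above
```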

This is an example of applying MLE to a discrete distribution. The case of a continuous distribution involves similar reasoning.

The likelihood function for a GARCH(1,1) model, fitted to returns $r_t$ with conditional variances $\sigma_t^2$, is given by

$$L = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\left(-\frac{r_t^2}{2\sigma_t^2}\right)$$

although it is normal to use the equivalent log-likelihood version:

$$\ln L = -\frac{1}{2} \sum_{t=1}^{T} \left( \ln(2\pi) + \ln\sigma_t^2 + \frac{r_t^2}{\sigma_t^2} \right)$$

A spreadsheet is given where we estimate the parameters of a GARCH(1,1) model by maximizing the (log) likelihood function.
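The same estimation can be sketched in Python. This is not the book's spreadsheet: the returns here are synthetic placeholder data, and SciPy's Nelder-Mead solver is my choice of optimizer, standing in for the spreadsheet's solver.

```python
import numpy as np
from scipy.optimize import minimize

def garch_nll(params, returns):
    """Negative Gaussian log-likelihood of a GARCH(1,1) model with
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]."""
    omega, alpha, beta = params
    # reject infeasible parameters (non-positive omega, non-stationarity)
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf
    sigma2 = np.empty_like(returns)
    sigma2[0] = returns.var()  # initialize at the sample variance
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + returns**2 / sigma2)

# Placeholder data: in practice, use the log returns of the price series.
rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(1000)

# Minimizing the negative log-likelihood is equivalent to maximizing
# the log-likelihood above.
res = minimize(garch_nll, x0=[1e-6, 0.05, 0.90], args=(r,), method="Nelder-Mead")
omega, alpha, beta = res.x
```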

After using the spreadsheet to fit a GARCH(1,1) model to a few price series, the reader is likely to discover a problem: the log-likelihood function is very flat. The solver algorithm is likely to have difficulty fitting the model, as there will be very small changes in the likelihood over a wide range of parameters. This can be somewhat addressed by using variance targeting, where the ω term is set equal to the unconditional variance of the sample multiplied by (1 − α − β), and we only vary α and β.
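Variance targeting is a one-line computation; a minimal sketch (the toy return series is hypothetical):

```python
import numpy as np

def variance_targeted_omega(returns, alpha, beta):
    """Variance targeting: fix omega at the sample (unconditional)
    variance times (1 - alpha - beta), so the optimizer only has to
    search over alpha and beta."""
    return np.var(returns) * (1.0 - alpha - beta)

r = np.array([0.011, -0.020, 0.015, -0.005, 0.002])  # toy return series
omega = variance_targeted_omega(r, alpha=0.05, beta=0.90)
```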

Still, it may be necessary to use more sophisticated numerical techniques if we want to use GARCH models extensively. These could also provide statistics giving the goodness of fit of the model. There are a number of reasons why the model may not fit the data well. These include:

• Insufficient data. Generally at least 1,000 data points are required.

• Poor initial values for the parameters.

• Persistent seasonality contained in the data. This is a particular problem with intraday data.

• Wrong model. The data just isn't consistent with the model chosen!

Econometricians have developed a large number of models to address this last issue. I list some of the more common models here, but this list is far from exhaustive. In fact, since the original work by Engle (1982), an enormous number of versions have been developed.1 One paper documents a comparative test of 330 models (Hansen and Lunde 2005).

• Exponential GARCH (EGARCH) models the log of the variance. This means it can incorporate asymmetry, as negative shocks can have a different effect to positive shocks (Nelson 1991).

• GJR-GARCH is another asymmetric model where an extra term is present in the case of downward shocks (Glosten et al. 1993).

• Integrated GARCH (IGARCH) further constrains the parameters so that alpha and beta sum to one.

• Threshold GARCH (TGARCH) has an extra term that applies when shocks are negative, again allowing asymmetry.

• Absolute Value GARCH (AGARCH) directly models volatility instead of variance (Taylor 1986; Schwert 1989).

• Component GARCH (CGARCH) models the variance as the sum of several processes or components. One could be used to capture the short-term response to shock and another to capture the long-term response. This allows the model to capture long memory effects (Engle and Lee 1999; Ding and Granger 1996).

It certainly seems that GARCH does capture some elements of the time evolution of variance. Further, it can be motivated by a fairly simple microstructure-based argument (Sato and Takayasu 2002). However, the fact that so many models exist in the family, with no one being definitively superior, can certainly be seen as a negative. Also, if we estimate the parameters of the model using MLE, then reestimate the model at a later date, we find little persistence in the values of the parameters. This would also seem to indicate that the model was not a particularly good description of reality. Also note that this model is meant to be predictive, not just descriptive. Unlike BSM, which is just a conceptual framework, a volatility forecast really does need to have good correspondence with the future.

1 Engle actually developed ARCH. The GARCH model was first proposed by Bollerslev (1986). ARCH is GARCH(0,1).

## Forecasting the Volatility Distribution

In addition to the GARCH family, there are many other methods for predicting time series. These include neural networks, genetic algorithms, and classical econometric methods such as the ARMA family of models. We don't touch on these here for several reasons: I have seen no conclusive evidence that they are any good at predicting; genetic algorithms and neural nets are very specialized methods that are easy to misuse; and while time series analysis is probably a good thing to know (refer to Taylor 1986), it is doubtful that spending a great deal of time trying to refine a point forecast is worth it.

Even if we agree with the Nobel Prize committee that the GARCH "models have become indispensable tools not only for researchers, but also for analysts on financial markets, who use them in asset pricing and in evaluating portfolio risk,"2 it isn't actually the breakthrough that an option trader needs. A point forecast of volatility just isn't all that useful. We need a forecast of the volatility distribution. Selling one-month implied volatility at 15 percent might seem like a good idea if we have a forecast of 12 percent. It will seem less like a good idea if we know that one-month realized volatility has had a range of 11 to 35 percent. It isn't just the forecast that is necessary. It is putting the forecast into the context of a volatility range. A simple way to do this is through the use of volatility cones.

As stated in the seminal paper on the subject (Burghardt and Lane 1990), "The purpose of a volatility cone is to illustrate the ranges of volatility experience for different trading horizons." Let's look at the example of MSFT volatility for the four years ending April 30, 2007. We calculate volatility (here we use the close-to-close estimator, but there is no reason another estimator could not be used) over nonoverlapping periods of 20 trading days, 40 trading days, 60 trading days, 120 trading days, and 240 trading days. These correspond closely to one calendar month, two months, three months, six months, and one year. The results are displayed in Figure 2.10.
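The calculation can be sketched as follows. This is my own minimal implementation, not the book's: it splits the return series into non-overlapping windows of each length and records the range of annualized close-to-close volatilities (the random-walk data is purely illustrative).

```python
import numpy as np

def vol_cone(prices, windows=(20, 40, 60, 120, 240), trading_days=252):
    """Annualized close-to-close volatility over non-overlapping windows,
    summarized as (min, median, max) for each horizon."""
    logret = np.diff(np.log(prices))
    cone = {}
    for h in windows:
        n_blocks = len(logret) // h  # non-overlapping blocks of length h
        blocks = logret[: n_blocks * h].reshape(n_blocks, h)
        vols = blocks.std(axis=1, ddof=1) * np.sqrt(trading_days)
        cone[h] = (vols.min(), np.median(vols), vols.max())
    return cone

# Illustrative data: a random walk with constant 20 percent volatility.
rng = np.random.default_rng(1)
rets = 0.20 / np.sqrt(252) * rng.standard_normal(1008)
prices = 100.0 * np.exp(np.cumsum(rets))
cone = vol_cone(prices)
```

Even with constant true volatility, the short-horizon estimates fan out far more than the long-horizon ones, which is the sampling-error effect the cone makes visible.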

When displayed graphically, we can see why this method was called a volatility cone. The cone shows the tendency for short-term volatilities to fluctuate more widely than longer-dated volatilities. On the one hand this is obvious, as big moves will be averaged away in the longer term. On the other hand, we also know that volatility measurements are prone to

2 From the press release accompanying the announcement of the 2003 prize for economics, http://nobelprize.org/nobel_prizes/economics/laureates/2003/press.html

FIGURE 2.10 The Volatility Cone for MSFT, Generated from the Four Years of Closing Prices Ending on April 30, 2007

sampling error, and this more dramatically affects shorter measurement periods (refer back to Table 2.2, where we saw how sampling error is dependent on sample size). Further, in order to gain the most information from a given price series, it would generally be necessary to use overlapping data. This clearly will induce an artificial degree of correlation in the estimates of volatility and will bias our results somewhat.

We need to take all of this into account. Specifically, we need to know how much bias is introduced into our volatility estimates by using overlapping data. This problem was studied extensively by Hodges and Tompkins (2002). They find that variance measured from overlapping return series needs to be multiplied by the adjustment factor

$$m = \left(1 - \frac{h}{n} + \frac{h^2 - 1}{3n^2}\right)^{-1}$$

where h is the length of each subseries (for example, 20 days) and n = (T − h) + 1 is the number of distinct subseries available from a total number of observations T.

So for the example given in Table 2.7, where we need to adjust the variance measured over 60-day subperiods from 1,006 data points, the adjustment factor would be 1.06, or approximately 1.03 when applied to the volatility. Using this adjustment factor means we can use rolling windows for estimating volatility, which makes volatility cones a very useful trading tool.
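The worked example can be checked numerically; a minimal sketch of the adjustment-factor calculation for h = 60 and T = 1,006:

```python
def ht_adjustment(h, T):
    """Hodges-Tompkins adjustment for variance estimated from overlapping
    subseries of length h drawn from T total observations."""
    n = T - h + 1  # number of overlapping subseries
    bias = 1.0 - h / n + (h**2 - 1.0) / (3.0 * n**2)
    return 1.0 / bias  # multiply the measured variance by this factor

m_var = ht_adjustment(60, 1006)   # variance adjustment, roughly 1.06
m_vol = m_var ** 0.5              # volatility adjustment, roughly 1.03
```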

TABLE 2.7 The Volatility Cone Numbers for MSFT, Generated from the Four Years of Prices Up to April 30, 2007

|         | 20-Day Volatility | 40-Day Volatility | 60-Day Volatility | 120-Day Volatility |
|---------|-------------------|-------------------|-------------------|--------------------|
| Maximum | 0.465             | 0.352             | 0.287             | 0.258              |
| 75%     | 0.213             | 0.213             | 0.225             | 0.200              |
| Median  | 0.159             | 0.172             | 0.169             | 0.181              |
| 25%     | 0.123             | 0.149             | 0.147             | 0.161              |
| Minimum | 0.062             | 0.096             | 0.091             | 0.136              |

In fact, given that for stocks and many futures the impact of news is far more important than any other factor in forecasting realized volatility, we may generally be best served by comparing implied volatility to the historical volatility distribution given by the volatility cone. Selling one-month implied volatility at 35 percent because this is in the 90th percentile for one-month volatility over the past two years can form the basis of a sensible trading plan. Selling 35 percent because GARCH is forecasting the realized volatility to be 20 percent is less sensible. The point forecast isn't as important as the possible distribution of results.

Obviously, it is also possible to construct volatility cones from the other volatility estimators.

The volatility cone is very useful for placing current market information (realized volatility, implied volatility, and the spread between them) into historical context. But it doesn't place it in the context of the current overall market. This is also something to monitor. If we had a choice of selling Citigroup's implied volatility at 39 percent when its realized is 26 percent, or selling the S&P 500 implied volatility at 24 percent when its realized is 18 percent, we should think carefully. We are not getting much more edge as a percentage for selling the single stock than we are for the index. Consider using the implied/realized spread of the index as a benchmark for the amount of edge you look for in all of your trades. Things change, and this context is always very important.

Volatility cones are often not very helpful to market makers or other very active traders. In fact, they can be very irritating. This is because whenever the volatility cone tells you that implied volatility is historically high, you will already be short it. Generally you will have been selling it all the way up, and will now be sitting on a losing position. But at least you will now know that implied volatility is at an all-time high, and now might not be the best time to cover. Market makers will lose money in these types of situations anyway. Ignorance can't help.

When making forecasts we will generally find that the implied volatility is equal to or significantly above our forecast volatility. The BSM implied volatility is in general an upwardly biased estimate. For example, it is not uncommon for our forecasts to be 30 percent below current implied volatilities, but we practically never see the converse. There are a number of obvious reasons for this:

• By selling implied volatility we are selling insurance. Thus there is a risk premium associated with it.

• Perfectly reasonable things could happen that have never happened before. These will not be taken into account if we only base our forecast on past data.

• Market microstructure encourages implied volatility to be biased high. Market makers make the bulk of their money by collecting the bid/ask spread in the options. They will willingly bias their quotes a little too high to protect their business. In essence, they are buying insurance (slight long volatility exposure, particularly on the downside), as any prudent business owner will do.

The incorrect conclusion to draw from this discussion is that we can always profit by selling implied volatility. Remember that insurance companies profit largely from reinvesting the insurance premiums they collect. This opportunity isn't available to option traders using borrowed funds. The option insurance premium alone isn't a good enough reason to sell options. But we need to acknowledge its existence and debias our forecasts accordingly so that we can tell if an option really is too expensive. So we adjust each forecast by subtracting the usual implied/forecast spread.
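The debiasing step can be sketched as follows. The function name and the numbers are hypothetical illustrations of mine, not values from the text: we estimate the average implied/realized premium from history and subtract it from the quoted implied before comparing with our forecast.

```python
import numpy as np

def debias(implied_now, implied_hist, realized_hist):
    """Subtract the average implied/realized premium from the current
    implied volatility to get a debiased level for comparison with a
    realized-volatility forecast. All inputs in volatility points."""
    premium = np.mean(np.asarray(implied_hist) - np.asarray(realized_hist))
    return implied_now - premium

# Hypothetical numbers: implied has historically averaged 3 points over
# realized, so a quoted implied of 18 debiases to 15.
fair = debias(18.0, implied_hist=[15.0, 18.0, 20.0], realized_hist=[12.0, 15.0, 17.0])
```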

For example, the realized/implied spread for the S&P 500 can be loosely proxied by the spread between the rolling 30-day close-to-close volatility and the Chicago Board Options Exchange Volatility Index (VIX). (The construction and interpretation of the VIX is covered in Appendix A.) This is shown in Figure 2.11. Note that the VIX is almost always above the 30-day rolling volatility. You need to know the average amount of this premium. Remember that you should be looking for things that are out of the


FIGURE 2.11 The VIX and the S&P 500 30-Day Volatility

FIGURE 2.12 The Implied Volatility Premium for the S&P 500

ordinary. The actual spread is shown in Figure 2.12. Here the average value is 3.09. This is the crucial value to remember. (Note also that the period in April 2007 where the spread went negative corresponds to a spike in the 30-day close-to-close volatility. This was largely caused by the problem with historical volatilities being backward-looking, as discussed earlier and shown in Figure 2.8.)

For position traders, context-adjusted volatility cones can be very useful as they get to monitor volatility without having to continuously make markets. They can wait to establish a position as it reaches the highs (and the market makers will generally be eager to lighten up their positions at this point). Position traders have to take advantage of the fact that they can be selective. Trading may look like a competitive activity but it shouldn't be about posturing. There is no need to always have a position. Lions are smart and powerful hunters, but they don't waste energy fighting rhinos. They wait until the wounded antelope gets separated from the herd.
