## Rescaling of Variables for Numerical Accuracy

One important step for avoiding serious numerical problems is to make sure that the units of measurement are similar for y and for all regressors. This may mean rescaling some regressors by multiplying them by a constant such as 0.01, 0.1, 10, 100, or 1000. It can be shown that numerical problems are less severe if all variables are standardized so that they are measured from their means in units of their standard deviations. This scaling is achieved by subtracting the mean and dividing by the standard deviation.
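As a minimal sketch of the standardization just described, the following Python snippet subtracts the mean and divides by the sample standard deviation. The function name `standardize` and the data values are illustrative assumptions, not taken from the text.

```python
from statistics import mean, stdev


def standardize(values):
    """Measure each value from the mean in units of the standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divisor n - 1)
    return [(v - m) / s for v in values]


# Hypothetical regressor recorded in very large units (made-up numbers).
x_raw = [120000.0, 135000.0, 150000.0, 165000.0, 180000.0]
x_std = standardize(x_raw)

# After standardization the variable has mean 0 and standard deviation 1,
# whatever units it was originally recorded in.
print(x_std)
```

Applying the same transformation to y and to every regressor puts all variables on a comparable scale before the regression is run.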

Collinearity and Near-collinearity of Regressors. If one regressor is nearly proportional to another, the estimated regression coefficients will not be reliable at all. More generally, if one regressor is a weighted sum of two or more other regressors, the regression coefficients are likewise unreliable. Many numerical problems can arise in these situations. Good software programs warn the user about the presence of collinearity.
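One simple screening device for the two-regressor case is the sample correlation between the regressors. The sketch below is only an illustration: the data and the 0.99 cutoff are assumptions, and a pairwise correlation cannot detect a regressor that is a weighted sum of several others.

```python
from math import sqrt


def correlation(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


# Made-up regressors: x2 is almost exactly 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]

r = correlation(x1, x2)
nearly_collinear = abs(r) > 0.99  # hypothetical warning threshold

print(r, nearly_collinear)
```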

When the investment of large sums is being considered with the help of a regression equation, it becomes important to use the most numerically accurate software available and to check the results on more than one software platform. The free software R and its commercial cousin S-Plus appear to have good numerical reliability, as does TSP. Excel has important limitations and is generally not recommended for serious work. However, Excel can provide good initial estimates of regression coefficients (subject to care in avoiding badly scaled variables and collinearity) to a researcher in finance who may not be familiar with professional statistical software. Since Microsoft Excel is almost universally available, it offers great convenience for our pedagogical and descriptive purposes here. Our use of Excel should not be interpreted as an endorsement of it for serious research, because of its known numerical inaccuracies.

Column of Ones Instead of the Intercept. For some theoretical and pedagogical purposes it is convenient to change the notation slightly and rewrite the two-regressor model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ without explicit reference to the intercept $\alpha$ by merging it into just another regression coefficient. We change the notation and write the model as $y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$, where the old $\alpha$ is now $\beta_1$ and $x_1$ now denotes a new artificial column of regressor data in which every element is unity. The old $x_1$ becomes $x_2$ and the old $x_2$ becomes $x_3$. Verify within Excel that this change gives the same results, provided one checks the box to the left of "Constant is zero" in the regression menu, so that Excel does not add a second intercept. This box is next to the Labels box of the regression menu under the "Data analysis ..." option.
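The equivalence can also be checked outside Excel. The sketch below assumes ordinary least squares computed from the normal equations. The helper names (`solve`, `least_squares`, `demean`, `average`) and the noise-free data are illustrative assumptions. It fits the model once with an explicit column of ones and once with a separately recovered intercept.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(columns, y):
    """Coefficients solving the normal equations (X'X) b = X'y, one per column."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


def average(values):
    return sum(values) / len(values)


def demean(values):
    m = average(values)
    return [v - m for v in values]


# Made-up, noise-free data generated with intercept 1 and slopes 2 and 3.
old_x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
old_x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(old_x1, old_x2)]

# (a) Column-of-ones form: the intercept is just the coefficient on the ones.
ones = [1.0] * len(y)
with_ones = least_squares([ones, old_x1, old_x2], y)

# (b) Explicit-intercept form: slopes from demeaned data, then the intercept.
slopes = least_squares([demean(old_x1), demean(old_x2)], demean(y))
intercept = average(y) - slopes[0] * average(old_x1) - slopes[1] * average(old_x2)
explicit = [intercept] + slopes

print(with_ones)
print(explicit)
```

Both routes should return the same three coefficients up to rounding error, which is the property the Excel exercise is meant to confirm.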

The column of T data points (say) on the dependent variable y is denoted by the y vector. Note that the software already works with several columns of regressor data as a whole, treating the entire collection of regressor values as one block in a prespecified columnwise format. The set of numbers for the regressors (without the column headings) is called the X matrix. Matrix algebra deals with the rules for operating on (adding, multiplying, dividing) matrices and vectors. The rules depend on the number of rows and the number of columns of each matrix or vector object.

### Multiple Regression in Matrix Notation

The regression model remains the main workhorse in many branches of finance. We have used matrix notation in (9.3.1) for multiple regression models. For the convenience of readers who may not be comfortable with it, this section of the Appendix gives further details of multiple regression in matrix notation. First, it is useful to consider $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$, stated above, in vector notation. We let each regressor variable $x_i$, for $i = 1, 2, \ldots, p$, with $T$ observations be a $T \times 1$ vector. Similarly we let $y$ and the error term $\varepsilon$ be $T \times 1$ vectors, and note that the regression coefficients $\beta_1$ to $\beta_p$ are constants, known as $1 \times 1$ scalars. The expression $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$ can be written in matrix notation as follows, with matrix dimensions indicated as subscripts:

$$y_{(T \times 1)} = X_{(T \times p)}\,\beta_{(p \times 1)} + \varepsilon_{(T \times 1)} \tag{9.A.1}$$

The matrix expression $X_{(T \times p)}\beta_{(p \times 1)}$ represents $\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ by the so-called row-column multiplication. Note that after multiplication the resultant matrix is of dimension $T \times 1$. This is critical, since when one adds matrices on the right-hand side of (9.A.1), each part of the addition has to have exactly the same dimension $T \times 1$.
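To make the row-column multiplication concrete, the small sketch below (made-up numbers, with T = 3 and p = 2) computes the product of the regressor matrix and the coefficient vector row by row, and again as a weighted sum of the regressor columns.

```python
# T = 3 observations, p = 2 regressors (illustrative values only).
x1 = [1.0, 1.0, 1.0]
x2 = [2.0, 4.0, 6.0]
beta = [0.5, 2.0]

# The T x p matrix X has one row per observation and one column per regressor.
X = [list(row) for row in zip(x1, x2)]

# Row-column multiplication: each row of X times the p x 1 vector beta.
X_beta = [sum(x_tj * b_j for x_tj, b_j in zip(row, beta)) for row in X]

# The same T x 1 result written as beta_1 * x1 + beta_2 * x2.
combo = [beta[0] * a + beta[1] * b for a, b in zip(x1, x2)]

print(X_beta)
print(combo)
```

Both computations give a list of length T, which is why the product can be added to the T × 1 error vector.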

For statistical inference, it is customary to assume that regression errors have mean zero, $E\varepsilon = 0$. Let $\varepsilon'$ denote the $1 \times T$ matrix representing the transpose of $\varepsilon$, so that $\varepsilon\varepsilon'$ is a $T \times T$ matrix formed from the outer product of the errors. Upon taking expectations, we have $\Omega$, representing the $T \times T$ variance-covariance matrix of the errors, that is, $E\varepsilon\varepsilon' = \sigma^2\Omega$. If the errors are heteroscedastic, the diagonal elements of $\Omega$ are distinct from each other, and if there is autocorrelation among the errors, $\Omega$ has nonzero off-diagonal elements.
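As an illustration with T = 3, the variance-covariance matrix of the errors (written here as $\sigma^2\Omega$) could take forms such as the three below. The entries $\omega_{ij}$ are notation assumed for this sketch only. The first matrix corresponds to homoscedastic, uncorrelated errors, the second to heteroscedastic errors, and the third to autocorrelated errors.

$$
\Omega = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
\Omega = \begin{pmatrix} \omega_{11} & 0 & 0 \\ 0 & \omega_{22} & 0 \\ 0 & 0 & \omega_{33} \end{pmatrix}, \qquad
\Omega = \begin{pmatrix} 1 & \omega_{12} & \omega_{13} \\ \omega_{12} & 1 & \omega_{23} \\ \omega_{13} & \omega_{23} & 1 \end{pmatrix}
$$

In the second matrix the diagonal entries differ from one another, and in the third at least one off-diagonal entry is nonzero.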

CAPM from Data. When one tries to represent theoretical finance with the help of available data, there are always some practical problems, many of which have been already mentioned in the text above. For example, the representation of the market portfolio by a market index (S&P 500 or Vanguard 500 index fund) for the capital asset pricing model (CAPM) remains controversial. Some of the objections are as follows:

1. The index composition changes over time, and removing failed corporations implies a bias in favor of successful companies.

2. There are many small corporations excluded, so the index is not truly comprehensive.

3. It ignores many classes of asset markets, among these bonds and real estate.

Figure 9.A.1 Scatter plot of excess return for Magellan over TB3 against excess return for the market as a whole (Van 500 minus TB3, axis label $R_m - R_f$), with the line of regression forced to go through the origin; beta is simply the slope of the straight line in this figure.

### Illustration of CAPM Estimation of Wall Street Beta from Data

The data from Table 2.1.4 can be used to estimate the CAPM model and to compute the beta for Fidelity's Magellan fund, provided that we make the following assumptions. We assume that a good proxy for the "market" return is the Vanguard 500 index fund (Van 500), which contains the 500 stocks in the S&P 500 index in proportions close to those in the S&P 500. We also assume that interest rates on three-month Treasury bills (TB3) are a good proxy for the risk-free yield. The simplest method of finding the beta for the risky portfolio represented by Fidelity's Magellan fund is to regress (Magellan fund minus TB3) on a column of ones for the intercept and on (Van 500 minus TB3). This expands the right-hand side of (2.2.9) by a regressor for the intercept. The beta for the risky asset, the Magellan fund, is then simply the regression coefficient of the second regressor (Van 500 minus TB3). This regression equation can be estimated within an Excel workbook.

The theoretical formula does not include any intercept (constant term) in the regression equation. Its estimation is accomplished by checking the box "Constant is zero" in the "Data analysis" part of Excel's "Tools" menu. One may need to use "Add-ins" to make sure that the "Data analysis" option is present in the "Tools" menu. Forcing the intercept to be zero is also described as forcing the regression line through the origin in some statistics textbooks. It is well known that zero-intercept regressions are subject to numerical accuracy problems. For example, $R^2$, the measure of overall fit called the squared multiple correlation, can sometimes become negative and be unreliable. Hence it is advisable to run two regressions, one with and one without the intercept.

Next, we suggest comparing the two estimates of beta (the regression coefficient of Van 500 minus TB3) and the two estimates of $R^2$. If either pair of estimates differs too much, we can assume the presence of serious numerical accuracy problems needing further investigation by software tools not readily available in Excel.
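The two-regression comparison can be sketched in a few lines of Python. The excess-return numbers below are made up for illustration and are not the Table 2.1.4 data. The function names are hypothetical. The $R^2$ for the zero-intercept fit is computed with the centered total sum of squares, which is an assumption: it is the convention under which $R^2$ can turn negative, and software packages differ on this point.

```python
def fit_with_intercept(x, y):
    """Simple regression y = alpha + beta * x; returns (alpha, beta, R^2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    sse = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - ybar) ** 2 for b in y)
    return alpha, beta, 1.0 - sse / sst


def fit_through_origin(x, y):
    """Regression y = beta * x with the intercept forced to zero; returns (beta, R^2)."""
    beta = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    ybar = sum(y) / len(y)
    sse = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - ybar) ** 2 for b in y)  # centered convention (assumption)
    return beta, 1.0 - sse / sst


# Hypothetical monthly excess returns: market (x) and fund (y).
market_excess = [-4.0, -2.0, 0.0, 1.0, 3.0, 5.0]
fund_excess = [-4.3, -1.8, 0.2, 1.1, 3.2, 4.9]

alpha_i, beta_i, r2_i = fit_with_intercept(market_excess, fund_excess)
beta_0, r2_0 = fit_through_origin(market_excess, fund_excess)

# Large gaps between the two betas or the two R^2 values would be a warning sign.
print(beta_i, beta_0, abs(beta_i - beta_0))
print(r2_i, r2_0, abs(r2_i - r2_0))
```

How large a gap counts as "too much" is a judgment call that the text leaves open, so no threshold is hard-coded here.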

For our data, if the extra nonzero intercept is included, $R^2 = 0.948$ and the estimate of $\beta$, denoted by $\hat{\beta}$, is 1.0145, with a Student's t statistic of 23.04. The estimate of the intercept is 4.7734 with a t statistic of 1.4392, which is statistically insignificant. Since one of the claims of CAPM is that the intercept will be zero, an insignificant intercept in the data supports CAPM.

If the intercept (constant term) is forced to be zero, $R^2 = 0.945$ and $\hat{\beta} = 1.0243$, with a Student's t statistic of 23.15. We conclude that the regression fit is excellent; the high t values mean that the beta is statistically highly significant, so we reject the null hypothesis $\beta = 0$. Since the two estimates of $\beta$ and of $R^2$ do not differ much, we can reliably accept the estimate $\hat{\beta} = 1.0243$.

In light of the discussion of equation (2.2.9), we note that in this application we are more interested in testing the null hypothesis $H_0\colon \beta = 1$. That is, we want to know whether the risk of the Magellan fund is as large as the risk of the market portfolio. Since hypothesis testing is more reliable for the model with the intercept, we use the Student's t statistic revised as $(\hat{\beta} - 1)/\mathrm{SE} = 0.3295$, where $\hat{\beta} = 1.0145$ and the standard error (SE) is 0.0440. This is statistically insignificant, suggesting that the beta for Magellan is close enough to unity and that the observed small difference could have arisen from random variation. The simple averages of the returns in Table 2.1.4, given along the bottom row, suggest that the risk premium of the Magellan over TB3 (= 1.948 - 0.411 = 1.537) is not small. However, the Van 500 average return of 1.927 is close to the 1.948 for the Magellan. The usual risk measured by the standard deviation for the Magellan fund ($s_m = 5.2201$) exceeds the risk of the Van 500 ($s_v = 4.9609$).
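The revised t statistic can be reproduced from the reported estimate and standard error. In the sketch below the critical value 2.04 is an added assumption: it is the approximate two-sided 5% value for 31 degrees of freedom (33 observations less two estimated coefficients) and does not come from the text.

```python
beta_hat = 1.0145  # slope estimate from the regression with an intercept
se_beta = 0.0440   # reported standard error of that estimate

# Test H0: beta = 1 by measuring the distance from 1 in standard-error units.
t_beta_eq_1 = (beta_hat - 1.0) / se_beta

critical_5pct = 2.04  # approximate two-sided 5% value, 31 df (assumption)
reject_h0 = abs(t_beta_eq_1) > critical_5pct

# Risk premium of Magellan over TB3 from the reported average returns.
risk_premium = 1.948 - 0.411

print(round(t_beta_eq_1, 4), reject_h0, round(risk_premium, 3))
```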

However, are the two standard deviations statistically significantly different? Could $s_m$ arise from a population with standard deviation $\sigma_0 = 4.9609$ (the estimate based on the Van 500)? The test statistic is $T_m = (n - 1)s_m^2/\sigma_0^2$, where $n = 33$ is the number of observations in the data. Now $T_m = 35.43014$ follows a $\chi^2$ distribution with $(n - 1)$ degrees of freedom, for which the tabulated value for a 5% tail is 46.19426. We conclude that the Magellan fund's standard deviation is not significantly different from 4.9609. Similarly, we can define $T_v = (n - 1)s_v^2/\sigma_0^2 = 28.90195 < 46.19426$, the tabulated value, where the reported number is reproduced by taking $\sigma_0 = 5.2201$, the Magellan estimate. We conclude that the two variances are not statistically significantly different. The point to remember is that the risk measured by beta is distinct from the risk measured by the standard deviation of the distribution of returns.
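The two variance test statistics can be recomputed from the rounded standard deviations quoted above. This is a sketch with a hypothetical function name. Because the inputs are rounded to four decimals, the results agree with the reported 35.43014 and 28.90195 only to about two decimal places. The second call takes the Magellan estimate as the hypothesized standard deviation, which is an inference from the reported number rather than something the text states outright.

```python
n = 33          # number of observations
s_m = 5.2201    # standard deviation of Magellan returns
s_v = 4.9609    # standard deviation of Van 500 returns
critical = 46.19426  # tabulated 5% tail value quoted in the text, n - 1 = 32 df


def variance_test_stat(s, sigma0, n):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2 for H0: sigma = sigma0."""
    return (n - 1) * s ** 2 / sigma0 ** 2


T_m = variance_test_stat(s_m, s_v, n)  # Magellan against the Van 500 value
T_v = variance_test_stat(s_v, s_m, n)  # Van 500 against the Magellan value

print(T_m, T_m < critical)
print(T_v, T_v < critical)
```

Both statistics fall below the tabulated value, in line with the conclusion that the two variances are not significantly different.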
