
### 13.3.3.2 The E-step and M-step of the algorithm

When some observations of the sample are missing and only $Y_{obs}$ is observed, the MLE estimates $\theta_{MLE}$ cannot be obtained directly, because the likelihood function is not defined at the missing data points. To overcome this problem we apply the EM algorithm, which exploits the interdependence between the missing data $Y_{mis}$ and the parameters $\theta_{MLE}$. The $Y_{mis}$ contain information relevant to estimating $\theta_{MLE}$, and $\theta_{MLE}$ in turn helps us to compute probable values of $Y_{mis}$. This relation suggests the following scheme for estimating $\theta_{MLE}$ in the presence of $Y_{obs}$ alone: "fill in" the missing data $Y_{mis}$ based on an initial estimate of $\theta_{MLE}$, re-estimate $\theta_{MLE}$ based on $Y_{obs}$ and the filled-in $Y_{mis}$, and iterate until the estimates converge (see Schafer (1997)). In any missing data case, the distribution of the complete dataset $Y$ can be factored as:

$$P(Y \mid \theta) = P(Y_{obs} \mid \theta) \times P(Y_{mis} \mid Y_{obs}, \theta) \quad (13.1)$$

Considering each term in the above equation as a function of $\theta$, we have:

$$l(\theta \mid Y) = l(\theta \mid Y_{obs}) + \log P(Y_{mis} \mid Y_{obs}, \theta) + c \quad (13.2)$$

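where $l(\theta \mid Y) = \log P(Y \mid \theta)$ indicates the complete data log-likelihood, $l(\theta \mid Y_{obs}) = \log L(\theta \mid Y_{obs})$ the observed data log-likelihood, and $c$ an arbitrary constant. Averaging equation (13.2) over the distribution $P(Y_{mis} \mid Y_{obs}, \theta^{(t)})$, where $\theta^{(t)}$ is the current estimate of the unknown parameter (an initial guess when $t = 0$), we get:

$$Q(\theta \mid \theta^{(t)}) = l(\theta \mid Y_{obs}) + H(\theta \mid \theta^{(t)}) + c \quad (13.3)$$

where

$$Q(\theta \mid \theta^{(t)}) = \int l(\theta \mid Y) \, P(Y_{mis} \mid Y_{obs}, \theta^{(t)}) \, dY_{mis}$$

and

$$H(\theta \mid \theta^{(t)}) = \int \log P(Y_{mis} \mid Y_{obs}, \theta) \, P(Y_{mis} \mid Y_{obs}, \theta^{(t)}) \, dY_{mis}$$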

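A basic conclusion of Dempster et al. (1977) is that if we take $\theta^{(t+1)}$ to be the value of $\theta$ that maximises $Q(\theta \mid \theta^{(t)})$, then $\theta^{(t+1)}$ is an improved estimate compared to $\theta^{(t)}$, in the sense that the observed data log-likelihood does not decrease:

$$l(\theta^{(t+1)} \mid Y_{obs}) \geq l(\theta^{(t)} \mid Y_{obs})$$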
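A short sketch of why this improvement holds may help. It is the standard textbook argument and is not quoted from the chapter. It uses only the definitions of $Q$ and $H$ above, plus Jensen's inequality, which the text does not state.

```latex
% Averaging (13.2) over P(Y_mis | Y_obs, theta^(t)) and rearranging:
l(\theta \mid Y_{obs}) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)}) - c

% Evaluate at theta^(t+1) and at theta^(t), then subtract:
\begin{aligned}
l(\theta^{(t+1)} \mid Y_{obs}) - l(\theta^{(t)} \mid Y_{obs})
  &= \left[ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \right] \\
  &\quad - \left[ H(\theta^{(t+1)} \mid \theta^{(t)}) - H(\theta^{(t)} \mid \theta^{(t)}) \right]
  \;\geq\; 0
\end{aligned}

% First bracket:  >= 0 because theta^(t+1) maximises Q(. | theta^(t)).
% Second bracket: <= 0 because, by Jensen's inequality,
%                 H(theta | theta^(t)) <= H(theta^(t) | theta^(t)) for every theta.
```

The constant $c$ cancels in the difference, so the result does not depend on its value.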

In summary, it is convenient to think of the EM as an iterative algorithm that operates in two steps as follows:

• E-step: The expectation step, in which the function $Q(\theta \mid \theta^{(t)})$ is calculated by averaging the complete data log-likelihood $l(\theta \mid Y)$ over $P(Y_{mis} \mid Y_{obs}, \theta^{(t)})$.

• M-step: The maximisation step, in which $\theta^{(t+1)}$ is calculated by maximising $Q(\theta \mid \theta^{(t)})$.⁵
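The two steps can be sketched on a small illustrative case. The example below assumes a bivariate normal sample in which `x` is always observed and `y` is sometimes missing (marked `None`). The function name, the simulated data and the number of iterations are all hypothetical choices for illustration. It is not the SPSS implementation used in the chapter.

```python
import random


def em_bivariate_normal(pairs, n_iter=200):
    """EM for a bivariate normal where x is always observed and y may be None."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    observed = [(x, y) for x, y in pairs if y is not None]
    # x is fully observed, so its ML mean and variance are fixed from the start.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Starting values theta^(0) for the y-related parameters (complete cases only).
    m = len(observed)
    mu_y = sum(y for _, y in observed) / m
    s_yy = sum((y - mu_y) ** 2 for _, y in observed) / m
    s_xy = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given the current theta^(t).
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in pairs:
            if y is None:
                ey = mu_y + slope * (x - mu_x)   # E[y | x, theta^(t)]
                eyy = ey ** 2 + resid_var        # E[y^2 | x, theta^(t)]
            else:
                ey, eyy = y, y ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: complete-data ML estimates computed from the expected statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy


# Simulated data (illustrative values only): every third y is missing.
rng = random.Random(0)
sample = []
for i in range(300):
    x = rng.gauss(10.0, 3.0)
    y = 2.0 + 0.8 * x + rng.gauss(0.0, 1.0)
    sample.append((x, None if i % 3 == 0 else y))

estimates = em_bivariate_normal(sample)
print("mu_x, mu_y, s_xx, s_xy, s_yy =", estimates)
```

For this missing-data pattern the ML estimates are also available in closed form (a regression of `y` on `x` over the complete cases), which gives a convenient check that the iteration has converged.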

### 13.3.4 The data augmentation algorithm

Data augmentation (DA) is an iterative simulation algorithm, a special kind of Markov chain Monte Carlo (MCMC). As underlined by Schafer (1997), DA is very similar to the EM algorithm, and may be regarded as a stochastic version of EM. In many cases with missing values, the observed data distribution $P(\theta \mid Y_{obs})$ is difficult to define. However, if $Y_{obs}$ is "augmented" by a preliminary value of $Y_{mis}$, the complete data distribution $P(\theta \mid Y_{obs}, Y_{mis})$ can be handled.

5 The EM algorithm was implemented with the SPSS software.

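For a current guess $\theta^{(t)}$ of the parameter, we select a value of the missing data from the conditional distribution of $Y_{mis}$:

$$Y_{mis}^{(t+1)} \sim P(Y_{mis} \mid Y_{obs}, \theta^{(t)}) \quad (13.4)$$

Conditioning on $Y_{mis}^{(t+1)}$, we select a new value of $\theta$ from its complete data distribution:

$$\theta^{(t+1)} \sim P(\theta \mid Y_{obs}, Y_{mis}^{(t+1)}) \quad (13.5)$$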

Alternating (13.4) and (13.5) from an initial value $\theta^{(0)}$, we obtain a stochastic sequence $\{(\theta^{(t)}, Y_{mis}^{(t)}): t = 1, 2, \ldots\}$ that converges to a stationary distribution $P(\theta, Y_{mis} \mid Y_{obs})$, the joint distribution of the missing data and parameters given the observed data. The subsequences $\{\theta^{(t)}: t = 1, 2, \ldots\}$ and $\{Y_{mis}^{(t)}: t = 1, 2, \ldots\}$ have $P(\theta \mid Y_{obs})$ and $P(Y_{mis} \mid Y_{obs})$ as their respective stationary distributions. For a large value of $t$ we can consider $\theta^{(t)}$ as an approximate draw from $P(\theta \mid Y_{obs})$; alternatively, we can regard $Y_{mis}^{(t)}$ as an approximate draw from $P(Y_{mis} \mid Y_{obs})$.

In summary, DA is an iterative algorithm that operates in two steps as follows:

• I-step: The imputation step, in which the missing data are imputed by drawing them from their conditional distribution given the observed data and the current values of the parameters $\theta^{(t)}$.

• P-step: The posterior step, in which the values for the parameters are simulated by drawing them from their complete data distribution given the most recently imputed values $Y_{mis}^{(t+1)}$ for the missing data.
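As a minimal sketch of the I-step and P-step, consider a deliberately simple model that is an assumption of this example and not the chapter's specification: observations are normal with unknown mean $\theta$ and known standard deviation `sigma`, and $\theta$ has a flat prior. The function name, the burn-in length and the simulated data are hypothetical. The chapter itself used Schafer's NORM software.

```python
import math
import random


def data_augmentation(y, sigma=1.0, n_iter=5000, burn_in=500, seed=0):
    """DA draws of the mean of a normal sample with known sigma; None marks a gap."""
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n = len(y)
    n_mis = n - len(observed)
    sum_obs = sum(observed)
    theta = sum_obs / len(observed)          # starting value theta^(0)
    draws = []
    for t in range(n_iter):
        # I-step: draw Y_mis^(t+1) from P(Y_mis | Y_obs, theta^(t)).
        y_mis = [rng.gauss(theta, sigma) for _ in range(n_mis)]
        # P-step: draw theta^(t+1) from P(theta | Y_obs, Y_mis^(t+1)), which under
        # the flat prior is normal around the completed-data mean.
        completed_mean = (sum_obs + sum(y_mis)) / n
        theta = rng.gauss(completed_mean, sigma / math.sqrt(n))
        if t >= burn_in:
            draws.append(theta)
    return draws


# Simulated temperatures (illustrative values only): 10 of 30 readings missing.
rng = random.Random(42)
temps = [rng.gauss(15.0, 1.0) for _ in range(30)]
for i in range(0, 30, 3):
    temps[i] = None

draws = data_augmentation(temps)
posterior_mean = sum(draws) / len(draws)
print("approximate posterior mean of theta:", posterior_mean)
```

In this toy model the retained draws should settle around the mean of the observed values, with variance close to `sigma**2` divided by the number of observed values. The lag-one autocorrelation of the draws equals the share of missing values.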

The convergence performance of the DA algorithm depends on the amount of missing information (how much information about the parameters is contained in the missing part of the data relative to the observed part). High rates of missing information cause successive iterations to be highly correlated, and a large number of cycles will be needed for the algorithm to converge. Low rates of missing information produce low correlation and rapid convergence.⁶

The EM and DA algorithms provide optimal solutions under the assumption that the data are normally distributed. However, weather temperature data deviate from normality. Nevertheless, in many cases the normal model is useful even when the actual data are nonnormal (see Schafer (1997), p. 147). The EM algorithm is used to calculate the missing values on the level of the series. Additionally, it is almost always better to run the EM algorithm before using the DA to impute missing data because running the EM first will provide good starting values for the DA and will help to predict its likely convergence behaviour.

### 13.3.5 The Kalman filter models

The seminal works of Harvey (1989) and Hamilton (1994) have underlined the advantages of using state space modelling for representing dynamic systems where unobserved variables (the so-called "state" variables) can be integrated within an "observable" model. The advantage of handling problems of missing observations with the state space form is that the missing values can be estimated by a filtering and smoothing algorithm such as the Kalman filter. Filtering estimates the expected value of the state vector according to the information available at time $t$, while smoothing also includes the information made available after time $t$.

6 The DA algorithm was implemented using Schafer's stand alone NORM software, which can be downloaded at http://www.stat.psu.edu/~jls/misoftwa.html.

Harvey (1981) has shown that, if the observations are normally distributed, and the present estimator of the state vector is the most accurate, the predictor and the updated estimator will also be the most accurate. In the absence of the normality assumption a similar result holds, but only within the class of estimators and predictors which are linear in the observations.

Several chapters in this book have documented in detail state space models and the Kalman filter so we will avoid giving a new presentation.⁷ Let it suffice to say that, based on the principle of parsimony, our initial Kalman filter model was an AR(1) process with a constant mean, implying that extreme temperatures would show some persistence, but they would eventually return to their mean level for the period under review. Nevertheless, comparing the imputation accuracy of alternative models that we tried, we concluded that the best model for the Tavg series of the entire dataset is an ARMA(1,2) process, while for the "Autumn" dataset it is an ARMA(2,1) and for the "November" dataset an ARMA(1,1).⁸
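To make the treatment of gaps concrete, here is a minimal sketch of a scalar Kalman filter for the initial AR(1)-with-constant-mean specification. It rests on several assumptions. The parameters `mu`, `phi` and `q` are taken as known, whereas the chapter estimated its models in EViews. The optional measurement noise `r` is a hypothetical addition. Only the forward filtering pass is shown, with no smoothing. When an observation is missing the update step is skipped, so the prediction serves as the fill-in value.

```python
def kalman_ar1_fill(y, mu, phi, q, r=0.0):
    """Filter y_t = mu + s_t + eps_t with s_t = phi * s_{t-1} + eta_t.

    q is the variance of eta_t, r the variance of eps_t; None marks a missing value.
    Returns the series with each gap replaced by its one-sided prediction.
    """
    s = 0.0
    p = q / (1.0 - phi ** 2)          # stationary variance of the state
    filled = []
    for obs in y:
        # Prediction step: propagate the state and its variance one period ahead.
        s_pred = phi * s
        p_pred = phi ** 2 * p + q
        if obs is None:
            # Missing observation: no update, keep the prediction.
            s, p = s_pred, p_pred
            filled.append(mu + s)
        else:
            # Update step: correct the prediction with the new observation.
            gain = p_pred / (p_pred + r)
            s = s_pred + gain * (obs - mu - s_pred)
            p = (1.0 - gain) * p_pred
            filled.append(obs)
    return filled


filled = kalman_ar1_fill([12.0, None, None, 11.0], mu=10.0, phi=0.5, q=1.0)
print(filled)  # [12.0, 11.0, 10.5, 11.0]
```

The filled values decay geometrically from the last observed deviation towards the mean `mu`, which is the mean-reverting behaviour the AR(1) specification implies.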

### 13.3.6 The neural networks regression models

Here again, two chapters in this book have documented the use of NNR models for prediction purposes.⁹ In the circumstances, let us just say that the starting point for the NNR models was the linear correlations between the Tmax and Tmin of the index station considered and the explanatory temperature series. While NNR models endeavour to identify nonlinearities, linear correlation analysis can give an indication of which variables should be included. Variable selection was achieved using a backward stepwise neural regression procedure: starting with lagged historical values of the dependent variable and observations for correlated weather stations, we progressively reduced the number of inputs, keeping the network architecture constant. If omitting a variable did not reduce the level of explained variance relative to the previous best model, that input was removed from the pool of explanatory variables (see also Dunis and Jalilov (2002)). The chosen model was then kept for further tests and improvements.
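The backward stepwise logic can be sketched independently of the network itself. In the example below, `score` is a hypothetical callable that would retrain the fixed-architecture model on a given list of inputs and return its explained variance. Here it is replaced by a toy scoring function, and the input names are invented for illustration.

```python
def backward_stepwise(inputs, score):
    """Drop inputs one at a time as long as the score does not deteriorate."""
    best_inputs = list(inputs)
    best_score = score(best_inputs)
    removed_one = True
    while removed_one and len(best_inputs) > 1:
        removed_one = False
        for name in list(best_inputs):
            candidate = [v for v in best_inputs if v != name]
            candidate_score = score(candidate)
            # Keep the smaller model if omitting the variable did not hurt.
            if candidate_score >= best_score:
                best_inputs, best_score = candidate, candidate_score
                removed_one = True
                break
    return best_inputs, best_score


# Toy stand-in for "explained variance": only two inputs carry information.
USEFUL = {"tmax_lag1", "station_b"}


def toy_score(selected):
    return sum(1 for name in selected if name in USEFUL)


start = ["tmax_lag1", "tmax_lag2", "station_a", "station_b"]
kept, kept_score = backward_stepwise(start, toy_score)
print(kept, kept_score)
```

In practice each call to `score` means a full retraining run, so the number of candidate inputs largely determines the cost of the procedure.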

7 See Bentz (2003), Billio and Sartore (2003) and Cassola and Luis (2003).

8 The Kalman filter models were implemented with EViews 4.0. The model for the entire dataset is an ARMA(1,2) process giving a system of four equations:

as the observation equation, plus the three state equations:

@state sv1 = c(5)*sv1(-1) + [var = exp(c(4))]
@state sv2 = sv1(-1)
@state sv3 = sv2(-1)

9 See Dunis and Williams (2003) and Dunis and Huang (2003).

Finally, each of the datasets was partitioned into three subsets, using approximately 2/3 of the data for training the model, 1/6 for testing and the remaining 1/6 for validation. This partition is intended to control the error and to reduce the risk of overfitting. For instance, the final models for the "November" Tmax and Tmin series have one hidden layer with five hidden nodes.¹⁰
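A minimal sketch of such a partition follows. The text does not say how observations were assigned to the three subsets, so the contiguous, chronological split used here is an assumption, and the function name is hypothetical.

```python
def partition(series):
    """Split a sequence into training (about 2/3), test (about 1/6) and validation."""
    n = len(series)
    n_train = round(n * 2 / 3)
    n_test = round(n / 6)
    train = series[:n_train]
    test = series[n_train:n_train + n_test]
    validation = series[n_train + n_test:]   # whatever remains, about 1/6
    return train, test, validation


train, test, validation = partition(list(range(60)))
print(len(train), len(test), len(validation))  # 40 10 10
```

Keeping the subsets contiguous preserves the time ordering of a temperature series, which matters when lagged values are used as inputs.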

### 13.3.7 Principal component analysis

PCA is a standard method for extracting the most significant uncorrelated sources of variation in a multivariate system. The objective of PCA is to reduce dimensionality, so that only the most significant sources of information are used. This approach is very useful in highly correlated systems, like weather temperatures from neighbouring stations, because there will be only a small number of independent sources of variation, and most of the variation can be described by just a few principal components. PCA has numerous applications in financial markets modelling: the main ones concern factor modelling within the Arbitrage Pricing Theory framework (see Campbell et al. (1997)), robust regression analysis in the presence of multicollinearity (again a problem often affecting factor modelling) and yield curve modelling (see Alexander (2001)), although, in this latter case, other techniques seem preferable.¹¹ It may also be used to overcome the problem of missing data, as we shall see next.

### 13.3.7.1 Principal components computation¹²

Assume that the data for which the PCA is to be carried out consist of $M$ variables indexed $j = 1, 2, \ldots, M$ and $N$ observations on each variable, $i = 1, 2, \ldots, N$, generating an $N \times M$ matrix $X$. The input data must be stationary. Furthermore, these stationary data need to be normalised before the analysis, otherwise the first principal component will be dominated by the input variable with the greatest volatility. Thus, we also assume that each of the $M$ columns of the stationary data matrix $X$ has mean $\mu = 0$ and variance $\sigma^2 = 1$. This can be achieved by subtracting the sample mean and dividing by the sample standard deviation for each element $x_{ij}$ of matrix $X$. Consequently we have created a matrix $X$ of standardised mean deviations. We will transform this matrix to a new set of random variables, which are pairwise uncorrelated. Let $z_1$ be the new variable with the maximum variance; then the first column vector $a_1$ of $M$ elements is defined by:

$$z_1 = X \times a_1 \quad (13.6)$$

The new variable $z_1$ ($N \times 1$) is a linear combination of the columns of $X$, weighted by the elements of vector $a_1$. The product $z_1^T \times z_1$ is the sum of the squares of the elements of $z_1$.¹³ Substituting equation (13.6) for $z_1$ we have:

$$z_1^T \times z_1 = (X \times a_1)^T \times (X \times a_1) = a_1^T \times (X^T \times X) \times a_1 \quad (13.7)$$
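The standardisation and the quadratic form in (13.7) can be checked numerically. The sketch below works under stated assumptions: the three "station" series are simulated, the function names are hypothetical, and the weight vector is normalised to unit length, a standard convention that the passage above does not state. The dominant eigenvector of $X^T X$ is found by power iteration, chosen here only to avoid external libraries.

```python
import math


def standardise(X):
    """Subtract each column's sample mean and divide by its sample standard deviation."""
    n, m = len(X), len(X[0])
    cols = []
    for j in range(m):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        cols.append([(v - mean) / sd for v in col])
    return [[cols[j][i] for j in range(m)] for i in range(n)]


def first_component(X, n_iter=500):
    """Return the unit vector a1 maximising a1' (X'X) a1, z1 = X a1, and X'X."""
    n, m = len(X), len(X[0])
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(m)]
           for j in range(m)]
    a = [1.0 / math.sqrt(m)] * m                 # starting vector of unit length
    for _ in range(n_iter):
        b = [sum(xtx[j][k] * a[k] for k in range(m)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in b))
        a = [v / norm for v in b]                # power iteration step
    z = [sum(X[i][j] * a[j] for j in range(m)) for i in range(n)]
    return a, z, xtx


# Three strongly correlated simulated temperature series (illustrative values only).
raw = [[20 + 5 * math.sin(i) + 0.3 * math.cos(3 * i),
        18 + 4 * math.sin(i) + 0.3 * math.sin(2 * i),
        25 + 6 * math.sin(i) + 0.3 * math.cos(5 * i)] for i in range(40)]

Xs = standardise(raw)
a1, z1, xtx = first_component(Xs)
lhs = sum(v * v for v in z1)                                    # z1' z1
rhs = sum(a1[j] * xtx[j][k] * a1[k] for j in range(3) for k in range(3))
print("a1 =", a1)
print("z1'z1 =", lhs, " a1'(X'X)a1 =", rhs)
```

Both sides of (13.7) agree up to rounding. Because `a1` is the dominant eigenvector, their common value is also the largest eigenvalue of $X^T X$.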

10 The NNR models were implemented using the PREVIA software.

11 See Cassola and Luis (2003).

12 Appendix A gives a brief technical reminder on eigenvectors, eigenvalues and PCA.

13 T indicates matrix transposition.
