10  Data Reduction

\[ \renewcommand{\P}{\mathbb{P}} \renewcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}} \newcommand{\var}{\mathrm{Var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\dx}{\,\mathrm{d}x} \newcommand{\dy}{\,\mathrm{d}y} \newcommand{\eps}{\varepsilon} \]

A useful resource for this chapter is Using Multivariate Statistics by B. G. Tabachnick and L. S. Fidell. The material taught in this chapter will be met again, from a machine learning perspective, in MA-M28 Modelling and Machine Learning; see chapter 4 of Essential Math for Data Science if you would like an insight into this.

Factor Analysis (FA) and Principal Component Analysis (PCA) are statistical techniques applied to a (large) set of variables in order to reduce it to subsets of relatively independent variables. Each subset contains variables that are correlated with one another but largely independent of the other subsets, and each subset is combined into a factor (or a component in PCA).

Therefore, the idea of FA and PCA is to summarise the patterns of correlation among the observed variables and then to use this information to reduce a large number of observed variables to a smaller number of factors. A good FA or PCA makes sense, a bad one does not; therefore a good understanding of the data is required.

10.1 (Exploratory) Factor Analysis

In Factor Analysis, the subsets of variables are unobservable latent variables — we cannot measure them directly. Examples of such variables could be intelligence or social class. We could try to measure such concepts indirectly, for example by measuring occupation, salary and value of home for social class.

Mathematically, the technique involves representing the original variables as a linear combination of the “hidden” factors and an error term. If \(Y_1,Y_2,\ldots,Y_n\) represent the \(n\) observed variables with means \(\mu_1,\ldots,\mu_n\), and \(F_1,\ldots,F_m\) represent the “hidden” \(m\) factors, then we may consider the centralised observations \(X_i=Y_i-\mu_i\) as follows: \[ \begin{aligned} X_1=Y_1-\mu_1&=a_{11}F_1+a_{12}F_2+\cdots+a_{1m}F_m+\epsilon_1\\ X_2=Y_2-\mu_2&=a_{21}F_1+a_{22}F_2+\cdots+a_{2m}F_m+\epsilon_2\\ \vdots&\hskip2cm\vdots\hskip2cm\vdots\\ X_n=Y_n-\mu_n&=a_{n1}F_1+a_{n2}F_2+\cdots+a_{nm}F_m+\epsilon_n, \end{aligned} \tag{10.1}\]

where \(a_{ij}\) represents the factor loading of the \(i^{\text{th}}\) variable on the \(j^{\text{th}}\) factor and \(\epsilon_i\) represents the error or unique specific factor. We assume that \(\epsilon_i\) has 0 mean and specific variance \(\psi_i\). In matrix notation, this can be represented as, \[ X=AF+\epsilon. \tag{10.2}\]

Consider the following illustrative example.

Example 10.1 In an experiment, 200 primary school children were psychologically tested. The children were tested on the following (the observed variables):

  • Paragraph comprehension (\(X_1\));

  • Sentence completion (\(X_2\));

  • Word meaning (\(X_3\));

  • Addition (\(X_4\));

  • Counting (\(X_5\)).

A factor analysis gives the following linear combinations: \[ \begin{aligned} X_1&=0.81F_1+0.06F_2+\epsilon_1\\ X_2&=0.72F_1+0.08F_2+\epsilon_2\\ X_3&=0.91F_1+0.01F_2+\epsilon_3\\ X_4&=0.02F_1+0.69F_2+\epsilon_4\\ X_5&=0.11F_1+0.92F_2+\epsilon_5 \end{aligned} \] Clearly, variables \(X_1,X_2\) and \(X_3\) have a high factor loading with \(F_1\) and a low factor loading with \(F_2\). Variables \(X_4\) and \(X_5\) have a low factor loading with \(F_1\) and a high factor loading with \(F_2\). This suggests that \(F_1\) is the factor, or latent variable, literacy skills and \(F_2\) is the factor, or latent variable, numeracy skills.

This gives a general insight into the method. We now consider the finer details of the procedure, in particular, we will investigate the methods of calculating factor loadings and determining factors. We first consider/recall the definition of the covariance of random variables, \[ \begin{aligned} \cov(X,Y)&=\E((X-\E(X))(Y-\E(Y)))\\ &=\E(XY-\E(X)Y-X\E(Y)+\E(X)\E(Y))\\ &=\E(XY)-\E(X)\E(Y)-\E(X)\E(Y)+\E(X)\E(Y)\\ &=\E(XY)-\E(X)\E(Y).\\ \end{aligned} \] For matrices, this generalises to, \[ \Sigma=\E\left[(X-\E(X))(X-\E(X))^T\right]. \tag{10.3}\]

Since the \(X_i\)’s are centralised in our calculations, \(\E(X)=0\) in the linear model (10.2), and we obtain \(\Sigma\), the covariance matrix of the variables \(X_1,\ldots,X_n\), as follows: \[ \Sigma=\E(XX^T), \] by (10.3) with \(\E(X)=0\), and by (10.2), \[ \begin{aligned} \E(XX^T) &= \E((AF+\epsilon)(AF+\epsilon)^T)\\ &=\E((AF+\epsilon)(F^TA^T+\epsilon^T))\\ &=A\E(FF^T)A^T+A\E(F\epsilon^T)+\E(\epsilon F^T)A^T+\E(\epsilon\epsilon^T)\\ &=AIA^T+0+0+\Psi\\ &=AA^T+\Psi, \end{aligned} \] where \(\Psi\) is a diagonal matrix of the specific variances \(\psi_i\). We assume that the factors are uncorrelated with unit variance, hence \(\E(FF^T)=I\) above. Also, the cross-multiplication terms are 0 since we assume that the factors are not correlated with the errors \(\epsilon\). Note that the factors themselves have now dropped out of the calculations. Next we set, \[ R=AA^T=\Sigma-\Psi, \] where \(R\) is known as the adjusted covariance matrix, i.e. the variances of the observations are “adjusted” by subtracting the specific variances. Like \(\Sigma\), \(R\) is a symmetric matrix and hence, by the spectral decomposition from linear algebra, we may write \[ R=VLV^T, \] where \(V\) is a matrix of the eigenvectors of \(R\) and \(L\) a diagonal matrix of the eigenvalues of \(R\). Furthermore, \[ \begin{aligned} R=VLV^T&=V\sqrt{L}\sqrt{L}V^T\\ &=(V\sqrt{L})(\sqrt{L}V^T)\\ &=(V\sqrt{L})(V\sqrt{L})^T\\ &=AA^T, \end{aligned} \] where we used that \(L\) is diagonal, hence \(\sqrt{L}^T=\sqrt{L}\). This implies that \[ A=V\sqrt{L}. \tag{10.4}\] Therefore, once the eigenvectors and eigenvalues of \(R\) are known, the factor loading matrix \(A\) is easily obtained by (10.4).
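To make the procedure concrete, here is a minimal sketch in NumPy; the loading matrix and specific variances are chosen purely for illustration. It simulates data from the model (10.2), estimates \(\Sigma\), forms the adjusted covariance matrix \(R=\Sigma-\Psi\) and recovers the loadings via (10.4).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" loadings (5 variables, 2 factors) and specific variances.
A_true = np.array([[0.8, 0.1],
                   [0.7, 0.1],
                   [0.9, 0.0],
                   [0.0, 0.7],
                   [0.1, 0.9]])
psi = np.array([0.3, 0.4, 0.2, 0.5, 0.2])
n_obs, n_factors = 10_000, 2

# Simulate X = AF + eps with uncorrelated, unit-variance factors.
F = rng.standard_normal((n_obs, n_factors))
eps = rng.standard_normal((n_obs, len(psi))) * np.sqrt(psi)
X = F @ A_true.T + eps                       # rows are observations

# Sample covariance and adjusted covariance R = Sigma - Psi.
Sigma = np.cov(X, rowvar=False)
R = Sigma - np.diag(psi)

# Spectral decomposition R = V L V^T, keeping the two largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(R)         # returned in ascending order
idx = np.argsort(eigvals)[::-1][:n_factors]
V, L = eigvecs[:, idx], np.diag(eigvals[idx])

# Factor loadings A = V sqrt(L), cf. (10.4).
A_hat = V @ np.sqrt(L)
print(np.round(A_hat, 2))
```

With a large sample the recovered loadings should be close to those used in the simulation, up to the sign and ordering of the columns, which are not determined by (10.4).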

Remark 10.2

Note that equation (10.4) is true if all factors, or eigenvalues, are used in the model. However, we only want to consider significant factors (i.e. we may choose to ignore certain factors) and therefore we require methods of extracting and evaluating such factors.

In Python, there are various methods of “extracting” the factors, the main one being Principal Axis Factoring, which finds the smallest number of factors that account for the common variance of a set of variables.
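As a rough sketch of how this might look in code, assuming the third-party factor_analyzer package is installed (pip install factor-analyzer) and that the observed variables sit in a pandas DataFrame (the file name below is hypothetical), principal axis factoring could be requested along the following lines.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical data set: one column per observed variable.
data = pd.read_csv("observations.csv")

fa = FactorAnalyzer(n_factors=2,         # number of factors to extract
                    method="principal",  # principal axis factoring
                    rotation=None)       # leave rotation until later
fa.fit(data)

print(fa.loadings_)              # factor loading matrix A
print(fa.get_communalities())    # communality of each variable
print(fa.get_eigenvalues())      # eigenvalues, used by the criteria below
```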

Evaluating Factors

There are various means of evaluating and extracting the factors, including:

  • Eigenvalues: one method of choosing factors is to consider factors with eigenvalues \(>1\). This is known as the Kaiser criterion (see the sketch after this list).

  • Scree plot: this is a plot of the eigenvalues which can indicate where there is a clear cut-off (an inflexion point) between large and small eigenvalues.

  • Communality: this is the sum of the squared loadings for a variable across factors and it gives the proportion of that variable’s variance accounted for by the factors. The accepted values for communality depend on the sample size. The general rule is as follows:

    • If all communalities \(>0.6\), then this is considered very strong and we may even take relatively small samples in this scenario (\(< 100\));

    • Communalities \(>0.5\) are adequate for samples of size \(100-200\), or more;

    • Smaller communalities may be accepted for larger sample sizes.

  • Factor loadings: we aim for factor loadings to be \(\geq0.4\) for the main factor. We then study the observations with high loadings of a particular factor in order to try to identify the factor. Factor loadings appear in the Pattern Matrix in Python.
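As an illustration of the eigenvalue and scree plot criteria above, here is a minimal sketch using NumPy and Matplotlib; the simulated two-factor data set is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative simulated data: 200 observations of 5 variables driven by
# two underlying factors.
rng = np.random.default_rng(1)
F = rng.standard_normal((200, 2))
loadings = np.array([[0.8, 0.1], [0.7, 0.1], [0.9, 0.0], [0.0, 0.7], [0.1, 0.9]])
X = F @ loadings.T + 0.5 * rng.standard_normal((200, 5))

# Eigenvalues of the correlation matrix, largest first.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors with eigenvalue greater than 1.
n_retained = int(np.sum(eigvals > 1))
print(f"eigenvalues: {np.round(eigvals, 2)}, retain {n_retained} factor(s)")

# Scree plot: look for the inflexion point between large and small eigenvalues.
plt.plot(range(1, len(eigvals) + 1), eigvals, "o-")
plt.axhline(1, linestyle="--", color="grey")   # Kaiser cut-off
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```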

Rotations

In cases where the factor loading matrix \(A\) cannot be interpreted clearly, it may be rotated to try to improve interpretations. The aim is to maximise high correlations between factors and variables and to minimise low ones. This can be performed since the factor loading matrix is not uniquely defined. There are two different types of rotation, orthogonal and oblique.

Orthogonal Rotations

Orthogonal rotations are used when we assume that the factors are uncorrelated. There are various orthogonal rotations possible, with the most common being Varimax, Quartimax and Equamax. The process involves a simple matrix multiplication as follows: \[ A_{\text{rotated}}=A\Lambda, \tag{10.5}\] where \(\Lambda\) is the rotation matrix, \[ \begin{pmatrix} \cos\theta &-\sin\theta\\ \sin\theta&\cos\theta \end{pmatrix}. \tag{10.6}\] For the case where we have 2 factors, a typical orthogonal rotation is illustrated as follows:

Varimax Rotation

Varimax is the most commonly used rotation and involves a variance-maximising procedure. The goal of varimax rotation is to maximise the variance of the factor loadings by making high loadings higher and low ones lower for each factor.

Quartimax Rotation

Quartimax does for variables what varimax does for factors. It simplifies variables by increasing the dispersion of the loadings within variables, across factors.

Equamax Rotation

Equamax rotation is a hybrid between varimax and quartimax that tries simultaneously to simplify the factors and the variables.

In conclusion, varimax rotation simplifies the factors, quartimax the variables and equamax both.

Oblique Rotation

Oblique rotations allow the factors to be correlated. In practice, correlated factors are highly likely; for example, if two of our factors were Achievement and Alcoholism, we would expect there to be a correlation between them. Orthogonal rotations can be viewed as a special case of oblique rotations, namely the case in which the factors are assumed to be uncorrelated. For the case where we have 2 factors, a typical oblique rotation is illustrated as follows:

The two main types of oblique rotations are Direct Oblimin and Promax.

Direct Oblimin is the default oblique rotation we will use in Python.

The Promax method is quicker and is therefore better to use if dealing with large data sets.

It is good practice to first perform an oblique rotation and to change to an orthogonal rotation if correlation between the factors does not seem to exist. In Python, this can be checked by examining the Factor Correlation Matrix \(\Phi\), which is given by, \[ \Phi=\begin{pmatrix} \phi_{11} & \phi_{12} &\cdots&\phi_{1m}\\ \phi_{21}&\phi_{22}&\cdots&\phi_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ \phi_{m1}&\phi_{m2}&\cdots&\phi_{mm} \end{pmatrix}, \] where \(m\) is the number of factors.

The general rule is to use an oblique rotation if \(|\phi_{ij}|> 0.32\) for some \(i,j=1,\ldots,m\), \(i\neq j\). Clearly, we do not include the diagonal terms as these will always be 1 (i.e. the correlation of a factor with itself).
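A minimal sketch of this check, using a purely illustrative factor correlation matrix \(\Phi\):

```python
import numpy as np

# Purely illustrative factor correlation matrix for three factors.
Phi = np.array([[1.00, 0.15, 0.40],
                [0.15, 1.00, 0.10],
                [0.40, 0.10, 1.00]])

# Examine only the off-diagonal entries (the diagonal is always 1).
off_diag = np.abs(Phi[~np.eye(len(Phi), dtype=bool)])

if (off_diag > 0.32).any():
    print("Substantial factor correlation: keep the oblique rotation.")
else:
    print("Factors essentially uncorrelated: switch to an orthogonal rotation.")
```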

If an oblique rotation is found to be suitable, the elements of the Pattern Matrix are reported.

The following example is for illustrative purposes only.

Example 10.3 In an experiment, skiers were asked about their opinions on the cost of a skiing ticket (COST), the speed of the ski lifts (LIFT), the depth of the snow (DEPTH) and the moisture of snow (POWDER). Here is the raw data,

Skier COST LIFT DEPTH POWDER
\(S_1\) 32 64 65 67
\(S_2\) 61 37 62 65
\(S_3\) 59 40 45 43
\(S_4\) 36 62 34 35
\(S_5\) 62 46 43 40

When no limit is placed on the number of factors, we have 4 factors with eigenvalues 2.02, 1.94, 0.04 and 0.00. Using the Kaiser criterion and a scree plot, we keep the eigenvalues 2.02 and 1.94, and then we run the factor analysis again with these 2 factors only. Below is the scree plot that shows a clear distinction between the eigenvalues:

Once we run the analysis keeping only the 2 strong factors, we obtain the communalities under the extraction column of the table below. Clearly, as these communalities are close to 1, a large proportion of the variation in each variable can be accounted for by the factors.

The pattern matrix is given by, \[ A=\begin{pmatrix} -.40 & .90\\ .25&-.95\\ .93&.35\\ .96&.29 \end{pmatrix}, \] where the rows correspond to COST, LIFT, DEPTH and POWDER respectively. We see that DEPTH and POWDER have a high factor loading with Factor 1 and that COST and LIFT have a high factor loading with Factor 2. However, since the remaining factor loadings are not negligible, we will perform rotations with the aim of obtaining a clearer interpretation.

Firstly, let us consider the Direct Oblimin oblique rotation. The Factor Correlation Matrix \(\Phi\) is given below. We can see that \(|\phi_{ij}|\leq 0.32\) for \(i\neq j\). Therefore, we conclude that an oblique rotation is not warranted and instead we use the Varimax orthogonal rotation. This is a rotation in the sense of (10.6) by 0.33 radians (approximately 19 degrees). Using (10.5), we can confirm the rotated factors are given by: \[ \begin{aligned} A_{\text{rotated}}=A\Lambda&=\begin{pmatrix} -.40 & .90\\ .25&-.95\\ .93&.35\\ .96&.29 \end{pmatrix}\begin{pmatrix} \cos0.33 &-\sin0.33\\ \sin0.33 &\cos0.33 \end{pmatrix}\\ &=\begin{pmatrix} -.40 & .90\\ .25&-.95\\ .93&.35\\ .96&.29 \end{pmatrix}\begin{pmatrix} .95 &-.33\\ .33 &.95 \end{pmatrix}\\ &=\begin{pmatrix} -.08&.98\\-.08&-.98\\.99&.02\\1&.05 \end{pmatrix} \end{aligned} \] Here is an illustration of the rotation:

In this example, it is clear that the variables DEPTH and POWDER are associated with a factor concerning snow conditions. The variables COST and LIFT are associated with a factor concerning resort conditions.
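The rotation computation in Example 10.3 is easy to verify numerically, for instance with NumPy; small differences from the figures quoted above are due to the rounding of the inputs.

```python
import numpy as np

# Unrotated pattern matrix from Example 10.3 (rows: COST, LIFT, DEPTH, POWDER).
A = np.array([[-0.40,  0.90],
              [ 0.25, -0.95],
              [ 0.93,  0.35],
              [ 0.96,  0.29]])

theta = 0.33                               # rotation angle in radians (about 19 degrees)
Lambda = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

# Orthogonal rotation of the loadings, cf. (10.5).
A_rotated = A @ Lambda
print(np.round(A_rotated, 2))
```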

10.2 Principal Component Analysis (PCA)

Principal component analysis is similar to factor analysis in that both are used for data reduction, and they often provide similar results. However, in PCA we write the components (factors in FA) as a linear combination of the variables, whereas in FA we write the variables in terms of the factors, see (10.1). This can be expressed as follows: \[ \begin{aligned} C_1&=e_{11}X_1+e_{12}X_2+\cdots+e_{1n}X_n\\ C_2&=e_{21}X_1+e_{22}X_2+\cdots+e_{2n}X_n\\ \vdots&\hskip2cm\vdots\\ C_m&=e_{m1}X_1+e_{m2}X_2+\cdots+e_{mn}X_n, \end{aligned} \] where \(C_1,\ldots, C_m\) represent the components, \(X_1,\ldots, X_n\) are the variables and \(e_{ij}\) are the regression coefficients, or weights, of the variables. Similar to (10.2), we can rewrite this system in matrix form as below, \[ C=EX. \] In factor analysis, the factor loadings were obtained from the eigenvectors and eigenvalues of the adjusted covariance matrix; in principal component analysis, the weights of the variables are given by the eigenvectors and eigenvalues of the covariance matrix itself. The eigenvector associated with the largest eigenvalue gives the first principal component, the eigenvector associated with the second largest eigenvalue gives the second, and so on.

Choosing Principal Components

The principal components are chosen in the following way:

  • The first principal component, \[ C_1=e_{11}X_1+e_{12}X_2+\cdots+e_{1n}X_n, \] is chosen such that it accounts for as much variation in the data as possible, subject to the condition that \(e_{11}^2+e_{12}^2+\cdots+e_{1n}^2=1\).

  • The second, \[ C_2=e_{21}X_1+e_{22}X_2+\cdots+e_{2n}X_n, \] is chosen such that the variance is as high as possible, similarly conditional on \(e_{21}^2+e_{22}^2+\cdots+e_{2n}^2=1\).

  • The second principal component must be chosen such that it is uncorrelated with the first.

  • The \(i^{\text{th}}\) principal component, \[ C_i=e_{i1}X_1+e_{i2}X_2+\cdots+e_{in}X_n, \] again is chosen such that the variance is as high as possible, conditional on \(e_{i1}^2+e_{i2}^2+\cdots+e_{in}^2=1\) and it being uncorrelated with all other principal components.

  • All principal components are uncorrelated with each other.

As previously stated, the weights \(e_{i1},\ldots,e_{in}\) of the \(i^{\text{th}}\) principal component are given by the eigenvector corresponding to the \(i^{\text{th}}\) largest eigenvalue of the covariance matrix.
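A minimal NumPy sketch of this construction, using simulated data for illustration, builds the weight matrix \(E\) from the eigenvectors of the covariance matrix, checks the unit-norm condition and confirms that the component scores are uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: 500 observations of 4 correlated variables.
Y = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
X = Y - Y.mean(axis=0)                            # centre the variables

# Eigendecomposition of the covariance matrix, eigenvalues in decreasing order.
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], eigvecs[:, order].T  # rows of E are the weights e_i

# Each weight vector satisfies e_i1^2 + ... + e_in^2 = 1.
print(np.round((E ** 2).sum(axis=1), 6))

# Component scores C = EX, applied to every (centred) observation at once.
C = X @ E.T

# The components are uncorrelated and their variances equal the eigenvalues.
print(np.round(np.cov(C, rowvar=False), 4))
print(np.round(eigvals, 4))
```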

Remark 10.4

The process of maximising the variance uses the theory of Constrained Optimisation, which, in this case, essentially means maximising the variance for the \(i^{\text{th}}\) principal component conditional on \(e_{i1}^2+e_{i2}^2+\cdots+e_{in}^2=1\).
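As a brief sketch of why the eigenvectors of the covariance matrix arise, consider maximising \(\var(C_i)=e_i^T\Sigma e_i\) subject to \(e_i^Te_i=1\), where \(e_i=(e_{i1},\ldots,e_{in})^T\). Introducing a Lagrange multiplier \(\lambda\), \[ \begin{aligned} \mathcal{L}(e_i,\lambda)&=e_i^T\Sigma e_i-\lambda\left(e_i^Te_i-1\right),\\ \frac{\partial\mathcal{L}}{\partial e_i}&=2\Sigma e_i-2\lambda e_i=0\quad\Longrightarrow\quad\Sigma e_i=\lambda e_i. \end{aligned} \] Hence \(e_i\) must be an eigenvector of \(\Sigma\), and since \(\var(C_i)=e_i^T\Sigma e_i=\lambda\), the variance is maximised by choosing the eigenvector with the largest available eigenvalue, subject to the component being uncorrelated with the earlier ones.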

Number of Components and Rotations

We will use the same guidelines for determining the number of components as we did for factors in factor analysis, i.e. by evaluating eigenvalues, scree plots, communalities and factor loadings. Similarly, we will use rotations in the same way for principal component analysis as we did in factor analysis, i.e. if components are not clear from the unrotated results, we next perform an oblique rotation. If there is not enough correlation between the principal components to warrant the use of an oblique rotation, we then perform an orthogonal rotation.

Deciding between PCA and FA

PCA is used simply to reduce the observed variables to a smaller set of important, independent composite variables (components). FA tends to be used when latent factors (factors that are not directly measurable) are suspected of causing the observed variables.