Difficulties verifying outputted PCA scores

Dear all,

I have had some difficulties verifying the outputted PCA scores - that is, when I put the original raw data into the equations to compute the principal components, I do not duplicate the values generated by the program. Therefore, I must presume that I am an idiot and have missed some elementary algebraic step. Nevertheless, after re-checking my calculations, I seem to arrive at the same discrepancy.

To illustrate my procedure, I will use the example data provided in the help menu (I am using statistiXL v1.6) - I do this simply because everyone here has access to that data set:

The first line of data (which I will assume becomes "Case 1" when it is transformed via PCA) is:

WDIM=15.50; CIRCUM=59.69; FBEYE=21.10; EYEHD=10.30; EARHD=13.40; JAW=12.40

and the average values of each variable over the 60 cases given are:

WDIM*=15.5; CIRCUM*=57.575; FBEYE*=19.807; EYEHD*=10.513; EARHD*=13.575; JAW*=11.873

Let us say that we apply PCA to the correlation matrix; then the component score coefficients for PC 1, as listed in the help menu, are:

a1=0.511 (WDIM); a2=0.561 (CIRCUM); a3=0.462 (FBEYE); a4=0.144 (EYEHD); a5=0.110 (EARHD); a6=0.421 (JAW)

and, for PC2:

b1=-0.008 (WDIM); b2=0.087 (CIRCUM); b3=-0.147 (FBEYE); b4=0.664 (EYEHD); b5=0.644 (EARHD); b6=-0.339 (JAW)

Then, for Case 1, the first principal component score is calculated as

PC 1 = a1(WDIM-WDIM*)+a2(CIRCUM-CIRCUM*)+a3(FBEYE-FBEYE*)+a4(EYEHD-EYEHD*)+a5(EARHD-EARHD*)+a6(JAW-JAW*)

PC 1 = 0.511*(15.50-15.5)+0.561*(59.69-57.575)+0.462*(21.10-19.807)+0.144*(10.30-10.513)+0.110*(13.40-13.575)+0.421*(12.40-11.873)

PC 1 = 1.955826, which matches the value of 1.952 listed in the outputted casewise scores provided (within rounding errors due to the coefficients a1 through a6, of course). So far, so good.

Similarly, the second principal component score of Case 1 is

PC 2 = b1(WDIM-WDIM*)+b2(CIRCUM-CIRCUM*)+b3(FBEYE-FBEYE*)+b4(EYEHD-EYEHD*)+b5(EARHD-EARHD*)+b6(JAW-JAW*)

PC 2 = -0.008*(15.50-15.5)+0.087*(59.69-57.575)-0.147*(21.10-19.807)+0.664*(10.30-10.513)+0.644*(13.40-13.575)-0.339*(12.40-11.873)

PC 2 = -0.43885, which does not match the value of -0.760 listed in the outputted casewise scores for "PCA 2" under Case 1. Given that the PC coefficients are provided to three decimal places, this discrepancy could not be the result of rounding errors.
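
For concreteness, here is the same arithmetic as a short numpy sketch (the array names below are my own, not taken from statistiXL; only mean-centring is applied, exactly as in the formulas above):

    import numpy as np

    # Case 1 raw values and the variable means quoted above (WDIM..JAW order)
    x1   = np.array([15.50, 59.69, 21.10, 10.30, 13.40, 12.40])
    mean = np.array([15.5, 57.575, 19.807, 10.513, 13.575, 11.873])

    a = np.array([0.511, 0.561, 0.462, 0.144, 0.110, 0.421])     # PC 1 coefficients
    b = np.array([-0.008, 0.087, -0.147, 0.664, 0.644, -0.339])  # PC 2 coefficients

    d = x1 - mean      # mean-centred deviations only (no division by SD)
    print(a @ d)       # approx. 1.9558 -> close to the listed 1.952
    print(b @ d)       # approx. -0.4389 -> not the listed -0.760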

Is this an isolated typographical error in the help menu, or am I missing something? I am experiencing multiple similar discrepancies in my own data set, where the outputted casewise scores also do not match the "by hand" calculations.

The algebra seems exceedingly simple and transparent, yet the discrepancy resists my best and repeated attempts to reveal a flaw. Any thoughts?

Sincerely,

Albert Loui, Ph.D.
Lawrence Livermore National Laboratory, U.S.A.

Comments

  • On a slightly separate but related issue:

    After computing the principal component scores for my 6-D data set by hand, and finding no numerical match to the outputted scores, I went ahead and plotted the leading pair of scores and found that the "by hand" plot appears identical to that obtained by plotting the outputted data, except for the scaling of the axes. Taking a ratio of the outputted and "by hand" PC scores reveals a nearly uniform multiplicative scale factor for each of the principal components.

    It seems, then, that I am missing some sort of scale factor per PC axis - my immediate thought was the standardization option ("standardise to S.D. = 1"), but re-doing the PCA without it only produced outputted scores that differed even more from the "by hand" set than with the option enabled. A short sketch of the ratio check described above is appended below.

    Am I on the right track?

    Sincerely,

    Al Loui
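
    A minimal sketch of the ratio check, assuming the outputted and "by hand" casewise scores are held in two equally sized arrays (the function and argument names are hypothetical, not anything from statistiXL):

        import numpy as np

        def per_axis_scale(scores_out, scores_hand):
            """Element-wise ratio of outputted to "by hand" casewise scores.
            If a single multiplicative factor applies per PC axis, each column
            of the ratio matrix should be (nearly) constant."""
            ratios = np.asarray(scores_out, dtype=float) / np.asarray(scores_hand, dtype=float)
            return ratios.mean(axis=0), ratios.std(axis=0)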

  • Hi Albert

    I think you will find that your initial math will work if you apply the PCA to the covariance matrix. When using a correlation matrix, however, in addition to centring the raw variables on their means you should also standardise them to SD = 1 (this isn't necessary for a covariance-based analysis). Your score is therefore calculated as

    PC 1 = a1((WDIM-WDIM*)/WDIM_SD)+a2((CIRCUM-CIRCUM*)/CIRCUM_SD)+a3((FBEYE-FBEYE*)/FBEYE_SD)+a4((EYEHD-EYEHD*)/EYEHD_SD)+a5((EARHD-EARHD*)/EARHD_SD)+a6((JAW-JAW*)/JAW_SD)

    Hope this helps - a short numeric sketch of the calculation is appended below.

    Alan
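
    A sketch of that calculation for Case 1 of the help-menu data, assuming the array sd holds the sample standard deviations of the six variables over the 60 cases (those SDs are not quoted in this thread, so the placeholder below must be filled in from the data set):

        import numpy as np

        x1   = np.array([15.50, 59.69, 21.10, 10.30, 13.40, 12.40])      # Case 1 raw values
        mean = np.array([15.5, 57.575, 19.807, 10.513, 13.575, 11.873])  # variable means
        sd   = np.full(6, np.nan)  # placeholder: per-variable sample SDs over the 60 cases

        a = np.array([0.511, 0.561, 0.462, 0.144, 0.110, 0.421])     # PC 1 coefficients
        b = np.array([-0.008, 0.087, -0.147, 0.664, 0.644, -0.339])  # PC 2 coefficients

        z1 = (x1 - mean) / sd   # standardised deviations, not just mean-centred ones
        print(a @ z1)           # PC 1 score for Case 1
        print(b @ z1)           # PC 2 score for Case 1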
  • Dear Alan,

    Thank you for your quick reply! I feel quite sheepish now about the original post - I knew it would be something simple. The reason it did not occur to me to divide each original variable value by its standard deviation is that I believed this was already done as part of the standardization in forming the correlation matrix before eigenanalysis - that is, standardizing the raw data by subtracting the mean and dividing by the standard deviation of each variable prior to PCA. I therefore did not think the same standardization steps needed to be applied ex post facto of the analysis - but, of course, applying those formulas to the standardized values is the actual linear transformation of the raw data, so it is not ex post facto at all!

    I also realized that I was ignoring the final scaling of the outputted scores to a standard deviation of 1 per principal component, which accounted for the remaining part of the discrepancy between the "by hand" and outputted results from the PCA of my data set (a brief sketch of this last step is appended below).

    While it makes complete sense in retrospect, the standardization procedure for transforming raw data to PC space via diagonalization of the correlation matrix is not explicitly detailed in the help menu; that is, the general formula posted in the help menu entry for PCA of the correlation matrix does not include the standard deviation divisors. If I may make a humble request: would it be possible to update that entry to state this explicitly, just in case another poor ignorant fool like me should stumble across it?

    Thank you again for your swift reply!

    Best regards,

    Albert Loui
    Lawrence Livermore National Laboratory, U.S.A.
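
    A brief sketch of that last scaling step, assuming scores is a hypothetical casewise score matrix (cases in rows, PCs in columns) computed from the standardised data; whether statistiXL uses the n-1 sample SD here is my assumption:

        import numpy as np

        def scale_scores_to_unit_sd(scores):
            """Divide each column (each PC's set of scores) by its own sample SD,
            so that every set of PC scores ends up with SD = 1."""
            scores = np.asarray(scores, dtype=float)
            return scores / scores.std(axis=0, ddof=1)  # ddof=1: sample SD (assumed)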
  • Good point, consider it done!

    Cheers

    Alan
  • Dear Alan,

    Thank you for the correction! I have a related question concerning the "standardise to S.D. = 1" option, which the help menu instructs not to use. Clearly, it scales each set of principal component scores by its own standard deviation; however, what purpose would this serve? That is, it makes sense that raw data comprised of dissimilar subsets (say, heights, weights, blood pressures, etc.) should be standardized to put them on equal statistical footing. But why would one ever want to do this on the resulting principal component scores?

    My thought is that one might want to compare the PC scores of one analysis to the PC scores of another analysis on a similar set of data. But my instinct in such a case is that the scaling factor applied to the leading PC score should be identically applied to all of the successive scores, in order to preserve the relative hierarchy of variances between PC 1 and the remaining scores. Can you give me an example of a scenario where it would be advantageous to scale the S.D. of each set of PC scores to 1? I have decided to follow your help menu advice of not implementing it, but I am still curious about its purpose(s).

    Sincerely,

    Al
  • Hi Albert

    Alan correctly identified your problem - when using the correlation matrix your data are standardised to a mean of zero and an SD of 1, and these standardised data are then used in the PCA. You can either use Alan's formula for calculating the PCA scores, or just standardise your data to a mean of 0 and SD of 1 before you do the PCA; then your calculations will work (a short sketch of this route is appended below).

    We will modify the help file to make it clearer that the PCA scores are for standardised data if the correlation matrix is used.

    Phil Withers
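
    A minimal sketch of that route, assuming the raw data sit in an n-by-6 numeric array X (an illustration only, not statistiXL's internal code; each score column is determined only up to the usual sign ambiguity of its eigenvector):

        import numpy as np

        def pca_scores_from_correlation(X):
            """Standardise X to mean 0 / SD 1, eigen-decompose the correlation
            matrix, and return the casewise scores on the standardised data."""
            X = np.asarray(X, dtype=float)
            Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
            eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
            order = np.argsort(eigvals)[::-1]   # largest eigenvalue first
            return Z @ eigvecs[:, order]        # columns are PC 1, PC 2, ...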
  • Dear Phil,

    Thank you for your reply. I completely understand both Alan's previous post and yours. But I was actually referring to something a bit different - I should have been more operationally specific:

    The standard deviation of each set of PC scores is generally not equal to one, even if the corresponding raw data have already been standardized. When "correlation matrix" is selected, the original raw data are automatically centred on the empirical mean and divided by the standard deviation (as both you and Alan have clearly explained). However, if you select "correlation matrix" and also "standardise to S.D. = 1", you will obtain a set of PC scores that differs from the set obtained without the latter option - obviously, the difference is that each set of PC scores will have an S.D. = 1 when it otherwise would not.

    My point was meant to address this issue: while it makes sense to have the raw data standardized to S.D. = 1 (which is automatic when selecting "correlation matrix" diagonalization), why would one want to perform an additional standardization of S.D. = 1 on the resulting PC scores themselves? The standardization procedure on the raw data has already put the variables on equal statistical footing, so performing an additional standardization on the corresponding PC scores seems superfluous. If it is, perhaps the "S.D. = 1" option should be "locked out" when "correlation matrix" is selected?

    Also, I understand that selecting "covariance matrix" only centers the data on the empirical mean, but if one simultaneously selects "S.D. = 1" (which one has the option to do), doesn't that merely duplicate the result of selecting "correlation matrix" alone? In short, is it necessary to have a separate "S.D. = 1" option at all when it seems to be covered by the two options already available: PCA on either the covariance matrix or the correlation matrix? (A short numerical check of the equivalence I have in mind is appended below.)

    Sorry if it seems as if I am belaboring the point - I am merely curious and am trying to understand the functional purpose of these software options. I have been using statistiXL quite a bit, and I think that it is a fantastic program!

    Sincerely,

    al
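
    A short numerical check of the equivalence referred to above - that the covariance matrix of data standardised to mean 0 / SD 1 is simply the correlation matrix of the original data - for any numeric n-by-p array X (the function name is mine):

        import numpy as np

        def cov_of_standardised_equals_corr(X):
            """Covariance matrix of the standardised data should equal the
            correlation matrix of the original data."""
            X = np.asarray(X, dtype=float)
            Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
            return np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False))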