Principal Component Analysis (PCA)

Latest revision as of 18:18, 8 August 2023

Read First

PCA is a way to create an index from a group of variables that are similar in the information that they provide. This allows us to maximize the information we keep, without using variables that will cause multicollinearity, and without having to choose one variable among many.

Guidelines

The spatial principle of a PCA

In a 3-dimensional space, say with income on x, savings on y, and consumption on z, we have 12 vectors that represent our 12 similar variables measured in the field. Taken together, those vectors form a cloud in 3D. That cloud has 3 principal directions: the first 2 are like the sticks of a kite, and the 3rd stick is at 90 degrees to the first 2. The longest of the "sticks" that represent the cloud is the main (first) principal component.

In practice, our variables span more than 3 dimensions, so the space that contains our vectors, and therefore the cloud, may have 8, 12, or 15 dimensions, or more. You can see this in your results, where several principal components are listed. The same logic applies as in the 3D example: the PCA identifies the direction that represents the cloud best (the longest stick within the multi-dimensional cloud) and how much of the cloud it accounts for.

When PCA is used to build an index, we keep only the one main component, which captures the information that the similar variables have most in common. The other components are not taken into account, so all complementary information (orthogonal to the main component) is lost. We therefore want to use PCA only on variables that have a lot in common, so that this loss is minimized.
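In matrix terms, the "sticks" and the information kept or lost can be written as follows. This is the standard eigenvalue formulation of PCA on standardized variables, offered as a sketch and not as something specific to this article; the symbols R, v_k, \lambda_k, and p are notation introduced here for illustration.

```latex
% R: the p x p correlation matrix of the p standardized variables (assumed notation)
% v_k: direction of the k-th "stick"; \lambda_k: variance of the cloud along it
R\,v_k = \lambda_k\,v_k, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \qquad
v_j^{\top} v_k = 0 \ \text{ for } j \ne k

% Share of the total variance kept by the first component, and the share lost
\text{kept} = \frac{\lambda_1}{\sum_{k=1}^{p} \lambda_k} = \frac{\lambda_1}{p},
\qquad
\text{lost} = 1 - \frac{\lambda_1}{p}
```

The more similar the variables are, the closer \lambda_1 is to p, and the smaller the share that is lost.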

How to do it

First, check the correlations among your similar variables, and keep only the ones that are most highly correlated. Most of the art happens here: the more similar the variables are, the more information the PCA index carries.
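A minimal Stata sketch of this screening step. It assumes Stata's bundled auto example dataset as a stand-in for your own data; the variables weight, length, displacement, and turn are chosen purely for illustration.

```stata
* Load Stata's bundled example data (a stand-in for your own dataset)
sysuse auto, clear

* Pairwise correlations among the candidate variables,
* with the number of observations available for each pair
pwcorr weight length displacement turn, obs

* Same correlations, restricted to observations complete on all variables
correlate weight length displacement turn
```

Variables that correlate only weakly with the rest of the group are the natural candidates to leave out of the PCA.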

Beware of variables with many missing values: the new PCA variable will be set to missing for every observation that is missing any of the input variables.
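One way to see how many observations would be affected, sketched in Stata. As before, the auto example dataset and the variable list are assumptions made for illustration, and n_missing is a hypothetical variable name.

```stata
* Load Stata's bundled example data (a stand-in for your own dataset)
sysuse auto, clear

* Missing-value summary for each candidate variable
misstable summarize weight length displacement rep78

* For each observation, count how many candidate variables are missing
egen n_missing = rowmiss(weight length displacement rep78)

* Observations counted here would end up with a missing PCA variable
count if n_missing > 0
```

If one variable accounts for most of the missing values, dropping it from the PCA may preserve more observations than it costs in information.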

Then the mechanics are simple: in Stata, run pca (or pcamat if you are starting from a correlation or covariance matrix), then predict to generate the component score. Take a close look at the proportion of the variance explained by your first component. You can also run estat kmo, which reports the Kaiser-Meyer-Olkin measure of sampling adequacy, an indication of whether your variables are suited to this kind of analysis.
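Put together, the workflow might look like the following Stata sketch. The auto example dataset, the variable list, and the name index_pc1 are assumptions for illustration, not part of the original guidance.

```stata
* Load Stata's bundled example data (a stand-in for your own dataset)
sysuse auto, clear

* PCA on the correlation matrix (Stata's default), keeping one component;
* the output table lists the proportion of variance per component
pca weight length displacement turn, components(1)

* Kaiser-Meyer-Olkin measure of sampling adequacy
estat kmo

* Store the first-component score as the new index variable
predict index_pc1, score
summarize index_pc1
```

The "Proportion" column in the pca output is the figure to inspect for the first component.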

Use of multiple imputation prior to calculation of a PCA

When a few variables in the group of similar variables have several missing values, one may be tempted to use multiple imputation (MI) before running the PCA.

The problem is that MI is done by regression and is therefore sensitive to multicollinearity, whereas the PCA carries the most information when the variables are most similar. MI before PCA is feasible, but it costs you information in the final PCA, because you will have had to drop some of your most similar variables before running the MI.
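A rough way to gauge how collinear an imputation model would be, sketched in Stata. This is only a diagnostic under stated assumptions, not an MI procedure: it regresses one candidate variable on the others, much as a regression-based imputation model for that variable would, and inspects variance inflation factors. The auto example dataset and the variable list are assumptions for illustration.

```stata
* Load Stata's bundled example data (a stand-in for your own dataset)
sysuse auto, clear

* Regress one candidate variable on the other candidates
regress weight length displacement turn

* Variance inflation factors: large values flag predictors that are
* nearly collinear with one another
estat vif
```

High values here point to the variables you would probably have to drop before imputing, and those are often the same variables that make the PCA informative.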

Back to Parent

This article is part of the topic Data Analysis
