How to Do Principal Components Analysis in Q
Principal Components Analysis (PCA) is a technique for taking a large number of variables and creating a new, smaller set of variables. These aim to capture as much of the variation in the data as possible. The new variables are not correlated with one another.
Market researchers often use PCA to:
- identify the dimensions that underlie people’s answers to attitudinal statements,
- identify redundant questions in a questionnaire,
- and create a smaller number of variables to feed into another analysis.
In this article I show you how to set up your PCA in Q, what the outputs and options are, and how to save the new variables.
Principal Components Analysis always views data numerically. This means that you need to be careful about the Question Type assigned to your variables to ensure the analysis views their numeric values. The variables in a PCA should be part of a Number, Number – Multi, or Pick Any question. Pick Any is appropriate to use when the data are binary.
In most cases, however, you should set your variables up as Number or Number – Multi. The variables do not need to be grouped together. Remember, they could come from different questions, but they should all be on the same scale (that is, don’t mix 5-point scales with binary variables or 10-point scales).
If your variables are not set up this way, you can:
- Locate the variables in the Variables and Questions tab.
- (Optional) Make new copies of the variables by selecting them, right-clicking, and choosing Copy and Paste Variables > Exact Copy.
- Click into the Question Type column and change the selection to either:
- Number, if there’s a single numeric variable,
- Number – Multi, if you have multiple numeric variables that are grouped together, or
- Pick Any, for binary variables.
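As a rough illustration of what "viewing data numerically" means outside Q, the Python sketch below recodes labeled answers into numbers. The label wording, the 1 to 5 coding, and the function names are assumptions made for this example; they are not taken from the Q project or from Q itself.

```python
# Hypothetical label-to-number coding for a 5-point agreement scale.
FIVE_POINT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def to_numeric(answers, coding=FIVE_POINT):
    """Recode text answers to numbers; unknown or missing answers become None."""
    return [coding.get(answer) for answer in answers]


def to_binary(selected_flags):
    """Recode True/False selections (as in a Pick Any question) to 1/0."""
    return [1 if flag else 0 for flag in selected_flags]


answers = ["Agree", "Strongly disagree", "Neither", "No answer"]
numeric = to_numeric(answers)
binary = to_binary([True, False, True])
print(numeric, binary)
```

Keeping every variable on one coding like this is what "the same scale" amounts to in practice.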
In this article, I am using an example of a 5-point scale (called “Q23. Attitudes”), in which we asked respondents to respond to several statements about their mobile phone use. Originally, the variables were set up as a Pick One – Multi question, which is typically how looped scales like this will appear in Q. I made a copy of the question for use in the PCA and set its Question Type to Number – Multi.
Creating the Principal Components Analysis
To create the PCA in Q:
- Select Create > Dimension Reduction > Principal Components Analysis.
- In the Object Inspector on the right side of the screen, choose the variables that you want to analyze in the Variables box.
- Tick Automatic, which ensures the PCA will remain up to date when the data changes or when you change the settings.
The output from the PCA is what is known as a “loadings table”. This table shows one row for each of my original mobile-phone statement variables (there are 23). Each of the 8 new variables identified by the PCA appears in the columns. The cells of the table show figures referred to as “loadings”.
These represent the correlations between the new variables and the old variables. As correlations, they always range between -1 and 1. A value towards 1 indicates a strong positive relationship, a value towards -1 indicates a strong negative relationship, and values closer to 0 indicate weaker or non-existent relationships. The output hides the numbers for smaller correlations, although the bar remains to indicate their value. You can change this by toggling the Suppress small coefficients box.
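To make the "loadings are correlations" point concrete, here is a minimal NumPy sketch of an unrotated PCA on a correlation matrix. The data are simulated, the function name is hypothetical, and this is not how Q computes its output (Q also applies a rotation by default). The sketch only checks that each loading matches the correlation between an original variable and a component score.

```python
import numpy as np


def pca_loadings(data):
    """Unrotated PCA on the correlation matrix: eigenvalues, loadings, scores."""
    # Standardize each column to mean 0 and standard deviation 1.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    # eigh returns eigenvalues in ascending order; put the largest first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Loadings are eigenvectors scaled by the square root of their eigenvalue.
    loadings = eigenvectors * np.sqrt(eigenvalues)
    scores = z @ eigenvectors
    return eigenvalues, loadings, scores


# Simulated data: 200 respondents, 4 variables sharing one underlying factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
data = factor @ np.array([[0.9, 0.8, 0.7, 0.1]]) + rng.normal(size=(200, 4))

eigenvalues, loadings, scores = pca_loadings(data)

# Correlate every original variable with every component score.
combined = np.corrcoef(np.hstack([data, scores]), rowvar=False)
variable_component_corr = combined[:4, 4:]
print(np.allclose(variable_component_corr, loadings))
```

The same `combined` matrix also shows that the component scores are uncorrelated with one another: its lower-right block is the identity matrix up to rounding.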
The table is sorted in a way that makes it easy to work out what the 8 new variables mean. The first variable (“Component 1”) shows a strong correlation with the variables for “Want to view video”, “Want video phone”, “Want to play music”, “Like fast internet on phone”, and “Do mobile banking”. We conducted this study before the age of the smartphone. At the time, these higher-technology features were uncommon in phones.
This new variable thus represents an underlying factor of desire for better technological capabilities in phones. Component 2 strongly correlates with variables that reveal a desire to stay in touch and connected. Component 3 represents an attitude that phones need only make calls or have basic functionality. And so on.
The output also tells us several key facts about the analysis:
- The 8 components represent 51.8% of the original variance in the data. You inevitably lose some information when you reduce variables like this.
- The first variable (“Component 1”) accounts for 13.3% of the variation. The second accounts for 8.88% of the variation, etc. The sort order goes from most variation to the least variation.
- The footer contains additional sample size information and settings info.
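The percentages above can be related to eigenvalues with a small calculation. The sketch below assumes an unrotated PCA on a correlation matrix, where the total variance equals the number of variables, so a component's share is its eigenvalue divided by that number. The eigenvalues are made up for illustration, and the figures Q reports after rotation may be calculated differently.

```python
def variance_explained(eigenvalues, n_variables):
    """Percentage of total variance captured by each component.

    Assumes a PCA on a correlation matrix, so total variance == n_variables.
    """
    return [100.0 * eigenvalue / n_variables for eigenvalue in eigenvalues]


# Made-up eigenvalues for a hypothetical 10-variable analysis.
example_eigenvalues = [3.0, 2.0, 1.5]
shares = variance_explained(example_eigenvalues, n_variables=10)
total_share = sum(shares)

for number, share in enumerate(shares, start=1):
    print(f"Component {number}: {share:.1f}% of the variance")
print(f"Retained components together: {total_share:.1f}%")
```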
In the next few sections, I’ll explain some of the settings that we didn’t change, and how to save the new variables to your data set so you can use them elsewhere.
Determining the number of components
In the analysis above, the PCA automatically generated 8 variables. It did this using a heuristic known as the “Kaiser rule”, which keeps only the components with an eigenvalue greater than 1. This is an option in the Rule for selecting components section. It is a commonly used rule, but you can also choose one of two other methods:
- Number of components. Choose this option if you want to choose the number of components to keep.
- Eigenvalues over. Eigenvalues are numbers associated with each component, and these are listed at the top of each column. This setting lets you specify the cut-off value for components.
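The three selection rules can be sketched in a few lines of Python. The function name, its parameters, and the eigenvalues below are hypothetical and only illustrate the logic; they are not Q's settings or code.

```python
def select_components(eigenvalues, rule="kaiser", n_components=None, cutoff=1.0):
    """Return the eigenvalues of the components kept under a given rule."""
    ordered = sorted(eigenvalues, reverse=True)
    if rule == "kaiser":
        # Kaiser rule: keep components whose eigenvalue is greater than 1.
        return [value for value in ordered if value > 1.0]
    if rule == "eigenvalues_over":
        # Same idea, but with a cut-off chosen by the analyst.
        return [value for value in ordered if value > cutoff]
    if rule == "number":
        if n_components is None:
            raise ValueError("n_components is required for rule='number'")
        # Keep a fixed number of the largest components.
        return ordered[:n_components]
    raise ValueError(f"Unknown rule: {rule}")


# Made-up eigenvalues for illustration.
example = [3.1, 2.0, 1.2, 0.9, 0.5, 0.3]
kept_kaiser = select_components(example)
kept_cutoff = select_components(example, rule="eigenvalues_over", cutoff=1.5)
kept_fixed = select_components(example, rule="number", n_components=4)
print(len(kept_kaiser), len(kept_cutoff), len(kept_fixed))
```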
The analysis above used a technique called a Varimax rotation, Q’s default option in the Rotation method drop-down. The concept of the rotation can be a bit abstract to talk about without getting into the mathematics of the technique. Put simply, the PCA problem has an infinite number of solutions that all capture the same amount of variation in the data. The rotation looks for the solution that is easiest to interpret, which is the one where as many loadings as possible are close to zero or close to 1 (or -1).
If you have a favorite rotation method, the menu contains several other options. They are all described in mathematical terms, so discussing them here would not add much value if you don’t already have a preferred technique. In my experience, Varimax seems to be the most popular.
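For readers who want to see what a rotation does numerically, below is a sketch of a widely circulated SVD-based iteration for Varimax, written with NumPy. It is not taken from Q, and Q's implementation (for example, any normalization it applies) may differ. The loadings matrix is made up. The checks at the end illustrate the point made above: the rotation is an orthogonal transformation, so the amount of variation captured for each variable does not change.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a loadings matrix with an SVD-based Varimax iteration.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    n_rows, n_cols = loadings.shape
    rotation = np.eye(n_cols)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target matrix for the Varimax criterion.
        target = rotated**3 - rotated * (rotated**2).sum(axis=0) / n_rows
        u, singular_values, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        total = singular_values.sum()
        # Stop once the objective no longer improves noticeably.
        if previous != 0 and total < previous * (1 + tol):
            break
        previous = total
    return loadings @ rotation, rotation


# Made-up unrotated loadings: 6 variables, 2 components.
unrotated = np.array([
    [0.7, 0.5],
    [0.6, 0.5],
    [0.8, 0.3],
    [0.5, -0.6],
    [0.4, -0.7],
    [0.3, -0.5],
])

rotated, rotation = varimax(unrotated)

# Variance captured per variable (sum of squared loadings in each row).
communalities_before = (unrotated**2).sum(axis=1)
communalities_after = (rotated**2).sum(axis=1)
print(np.round(rotated, 2))
```

Because every solution related by such a rotation has the same row sums of squared loadings, choosing between them is purely a matter of interpretability.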
To use the results of the PCA in another analysis you need to save the variables into your data set. To do so:
- Have your PCA output selected on the screen.
- Click Create > Dimension Reduction > Save Variables. This will add the variables and show them in a table.
- (Optional) Right-click on the row labels in the table and Rename them, to make the components more recognizable.
Q will show a new table of your components. The table will be full of 0s, indicating that the average score of each component is zero. Don’t be alarmed! This occurs because the variables are standardized (with a mean of zero and a standard deviation of 1), which is the standard technique. If you create a cross-tab with another question, the variation in the scores will become more apparent. For instance, I renamed my components and created a table with the Age groups from the study.
Rather unsurprisingly, the younger people have higher scores on the “Wanting technology” and “Cost-sensitivity” components, and a much lower score on the “Only used the basics” component.
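The reason the overall table shows zeros while the cross-tab does not can be sketched with Python's standard library. The scores and age labels below are invented for illustration and are not the study's data. Standardizing forces the overall mean to 0 and the standard deviation to 1, but group averages are still free to differ.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to have mean 0 and (sample) standard deviation 1."""
    center = mean(values)
    spread = stdev(values)
    return [(value - center) / spread for value in values]


def group_means(scores, groups):
    """Average score within each group label."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    return {group: mean(values) for group, values in by_group.items()}


# Invented component scores and age groups for six respondents.
raw_scores = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
age_groups = ["Younger", "Younger", "Younger", "Older", "Older", "Older"]

z_scores = standardize(raw_scores)
overall_mean = mean(z_scores)
means_by_age = group_means(z_scores, age_groups)

print(round(overall_mean, 6), round(stdev(z_scores), 6))
print({group: round(value, 2) for group, value in means_by_age.items()})
```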
These new variables can be used just like any other in Q. Once you are happy with your new components, go back to the PCA output and untick the Automatic box. This will prevent any changes to the components. If you modify your PCA later on and change the number of components in the solution, you should delete the saved variables and run Create > Dimension Reduction > Save Variables again.
Hopefully, you find that Principal Components Analysis is easy to do in Q, and that by saving the variables you can use it to complement your other analyses. Don’t forget the three main steps: get your data set up correctly, create the analysis output, and use the output to save your new variables. Good luck and happy dimension reducing!
Author: Chris Facer
Chris is the Head of Customer Success at Displayr. Here, and previously at Q (www.q-researchsoftware.com), he has developed a wealth of scripts and tools for helping customers accomplish complex tasks, automate repetitive ones, and generally succeed in their work. Chris has a passion for helping people solve problems, and you’ll probably run into him if you contact Displayr Support. Chris has a PhD in Physics from Macquarie University.