Focusing the Results of Correspondence Analysis in Q
Correspondence analysis is often used to visualize a table of data. The goal is to represent as much information as possible, as accurately as possible. However, there may be circumstances when we are interested in a specific row of the table (usually a brand). Such a point may not be represented well in the standard scatterplot output. In this post we explain a new technique developed by Q for rotating the results of correspondence analysis to focus on a specific point.
Correspondence analysis outputs consist of coordinates (usually plotted on a scatterplot) that explain the most variation across all of the brands. When we are interested in a specific brand, it can be useful to apply the focused rotation explained below. This is a novel technique that we have developed, described in the paper "A brand’s eye view of correspondence analysis", published in the International Journal of Market Research.
Start your engines
The data we are using describes the characteristics that people associate with cars. The input table below is labelled by 14 car brands along the rows. The columns are labelled by characteristics. Each cell indicates the strength of association between a characteristic and a car.
The chart below shows the correspondence analysis resulting from this data. In Q it is created from Create > Dimension Reduction > Correspondence Analysis of a Table. The data is plotted with normalization of principal coordinates. This means that we can compare distances between column labels and distances between row labels, but not the distance between a row and a column label. See this post for a more in-depth discussion about normalization and interpretation of correspondence analysis.
The dimensions output by correspondence analysis are in decreasing order of variance explained. This means that later dimensions explain smaller portions of the variance. The chart shows only the first two dimensions, which, for this example, capture only 53.4% of the variance. The hidden dimensions therefore contain the remaining 46.6%, a substantial amount of information. Importantly, from the plot alone we cannot tell how much information about any given point (brand) is retained.
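To make the idea of variance explained per dimension concrete, here is a minimal sketch of correspondence analysis using the textbook singular value decomposition of standardized residuals. It is written in Python with NumPy rather than Q, the table is hypothetical (it is not the car data from this post), and the function name correspondence_analysis is our own, so treat it as an illustration rather than a description of what Q runs internally.

```python
import numpy as np

def correspondence_analysis(table):
    """Return row coords, column coords (both principal) and inertia per dimension."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(P.shape) - 1                 # number of non-trivial dimensions
    U, sv, Vt = U[:, :k], sv[:k], Vt[:k, :]
    rows = (U * sv) / np.sqrt(r)[:, None]      # row principal coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]   # column principal coordinates
    return rows, cols, sv ** 2

# Hypothetical 4 brands x 4 characteristics table (illustration only)
table = np.array([
    [20,  5,  8, 12],
    [ 6, 18,  9,  4],
    [10,  9, 15,  7],
    [12,  6,  5, 16],
])

rows, cols, inertia = correspondence_analysis(table)
explained = inertia / inertia.sum()      # share of variance per dimension
print("Cumulative variance explained:", np.round(np.cumsum(explained), 3))
```

The second entry of the cumulative output plays the role of the 53.4% quoted above: it is the share of total variance that a two-dimensional plot can show.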
Our first car
As an example, Mini Cooper is relatively close to the origin. This could be because it is poorly represented by the two plotted dimensions. Or it could genuinely be the case that Mini Cooper is close to the origin in all dimensions.
If we were performing this analysis to find the relationship of Mini Cooper to the other cars and characteristics, we could not draw any strong conclusions from this plot. The best we could say is that in the first two dimensions alone, there is little to discriminate it.
Quality of the map
We can create a table showing how much variance is represented in each dimension with Create > Dimension Reduction > Diagnostic > Quality. The resulting table (below) shows the percentage of variance in the first two dimensions alongside the row label of each car. Since Mini Cooper has only 16%, we can now say that the plot above hides much of the information for this brand.
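The quality figure for a point can be sketched as the share of its squared distance from the origin that falls in the plotted dimensions. The snippet below assumes we already have each brand's coordinates in all dimensions. The coordinates, the brand names and the function name quality are hypothetical and chosen only to illustrate the calculation; they are not output from Q.

```python
import numpy as np

def quality(coords, n_dims=2):
    """Share of each row's squared distance from the origin in the first n_dims."""
    sq = np.asarray(coords, dtype=float) ** 2
    return sq[:, :n_dims].sum(axis=1) / sq.sum(axis=1)

# Hypothetical coordinates in four dimensions (illustration only)
brands = ["Brand A", "Brand B"]
coords = np.array([
    [0.60, 0.10, 0.05, 0.02],   # mostly in the first two dimensions
    [0.05, 0.04, 0.30, 0.20],   # mostly in the hidden dimensions
])

q = quality(coords)
for name, value in zip(brands, q):
    print(f"{name}: {value:.0%} of variance in the first two dimensions")
```

Both brands in this sketch could appear near each other on a two-dimensional plot, yet only the first is well represented by it.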
Making a sharp turn
In order to find out more about the Mini, we rotate the results so that all of its variance is in the first dimension. This means that there is no longer any hidden information about this point. We shift the focus of the output onto Mini Cooper.
In Q, this is done by entering Mini Cooper in the box labeled Focus row or column. The effect of the rotation is shown below.
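One way to picture the rotation is as an orthogonal transformation that moves the focus point onto the first axis while leaving all distances between points unchanged. The sketch below uses a Householder reflection, which is strictly a reflection rather than a rotation and is our own choice for illustration; it is not necessarily the construction used in Q or in the paper. The coordinates and the function name focus_on are hypothetical.

```python
import numpy as np

def focus_on(coords, focus):
    """Orthogonally transform coords so that row `focus` lies on the first axis."""
    coords = np.asarray(coords, dtype=float)
    v = coords[focus]
    k = v.size
    target = np.zeros(k)
    target[0] = np.linalg.norm(v)        # same length, all in dimension 1
    w = v - target
    if np.allclose(w, 0):                # already on the first axis
        H = np.eye(k)
    else:
        # Householder reflection mapping v onto target
        H = np.eye(k) - 2.0 * np.outer(w, w) / (w @ w)
    return coords @ H

# Hypothetical coordinates for four brands in four dimensions
coords = np.array([
    [ 0.60,  0.10,  0.05,  0.02],
    [ 0.05,  0.04,  0.30,  0.20],
    [-0.40,  0.25, -0.10,  0.05],
    [-0.10, -0.35,  0.15, -0.20],
])

rotated = focus_on(coords, focus=1)
print(np.round(rotated[1], 3))           # focus row after the transformation
```

This sketch only aligns the focus point with the first axis. It does not reorder the remaining dimensions by the variance they explain.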
In this case, correspondence analysis produces embeddings in 5-dimensional space. If you find this difficult to visualize, join the club. What matters here is that there is no longer any hidden information about Mini Cooper. We can now see that it is more closely related to Fiat 500 than to the other cars. This makes intuitive sense, as they are both small cars. We have gained insight by focusing on what differentiates Mini Cooper from the other cars.
However, note that the chart as a whole explains 46.3% of the variance in contrast to 53.4% in the first chart. The price we pay for the rotation is that the first two dimensions no longer contain as much variance as possible about all of the data. It is no longer the best general representation of all the points.
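The trade-off can be checked numerically: an orthogonal transformation leaves the total variance unchanged but can only reduce (or at best keep) the share held by the first two dimensions, because the original axes are already optimal. The sketch below runs a small correspondence analysis on a hypothetical table, aligns one row with the first axis using a Householder reflection (our illustrative choice, not necessarily Q's construction), and compares the two shares. All names and data are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 5 brands x 4 characteristics table (illustration only)
table = np.array([
    [20,  5,  8, 12],
    [ 6, 18,  9,  4],
    [10,  9, 15,  7],
    [12,  6,  5, 16],
    [ 9, 11, 10, 10],
], dtype=float)

P = table / table.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sv, _ = np.linalg.svd(S, full_matrices=False)
k = min(table.shape) - 1                            # non-trivial dimensions
rows = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]    # row principal coordinates

def share_in_first_two(coords):
    """Mass-weighted share of total variance held by the first two dimensions."""
    inertia = (r[:, None] * coords ** 2).sum(axis=0)
    return inertia[:2].sum() / inertia.sum()

# Align the focus row with the first axis via a Householder reflection
focus = 4
v = rows[focus]
target = np.zeros(k)
target[0] = np.linalg.norm(v)
w = v - target
H = np.eye(k) if np.allclose(w, 0) else np.eye(k) - 2.0 * np.outer(w, w) / (w @ w)
rotated = rows @ H

before = share_in_first_two(rows)
after = share_in_first_two(rotated)
print(f"First two dimensions: {before:.1%} before, {after:.1%} after focusing")
```

The drop from the first figure to the second is the analogue of the fall from 53.4% to 46.3% described above.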
Buying a new car
As another example, let’s rotate to focus on the VW Golf. Notice how the plot below is very similar to the original, except rotated on the page.
This rotation is easier to visualize. We have turned the page clockwise by about 135 degrees and the relationship between VW Golf and the other cars has been closely maintained. The total variance explained has dropped by only 0.1% from the original plot. All of this tells us that VW Golf was well represented originally. This is consistent with the 99% of variance in the first two dimensions shown for VW Golf in the quality table above.
Try it out
The analysis in this post was performed in Q. The flipDimensionReduction package (available on GitHub), which uses the ca package, performed the correspondence analysis. Click here to download the QPack with all the analysis in this post.
The car data is from a latent feature analysis performed in Meulders, M. (2013). An R Package for Probabilistic Latent Feature Analysis of Two-Way Two-Mode Frequencies. Journal of Statistical Software, 54(14), 1-29. This analysis uses data from Van Gysel, E. (2011). Perceptuele analyse van automodellen met probabilistische feature modellen [translation from Dutch: Perceptual analysis of car models with probabilistic feature models]. Master thesis. Hogeschool-Universiteit Brussel.
Author: Jake Hoare
After escaping from physics to a career in banking, then escaping from banking, I decided to go back to BASIC and study computing. This led me to rediscover artificial intelligence and data science. I now get to indulge myself at Displayr working in the Data Science team, sometimes on machine learning.