3 Ways to Quickly Sift Through 3,050 Crosstabs and Find the Magic One
Most market researchers know the pain of having to hunt through huge numbers of crosstabs to find the interesting ones. In my younger days (shhh…), you would have to print out your crosstabs, read through them and using a pencil – or if you were more modern, a yellow highlighter and post-it-note(!) – to mark any interesting results. Believe it or not, some industry professionals still do this, but in Excel! If you’re one of these market researchers, don’t stress – this post will show you three ways you can save an enormous amount of time by automating the whole process.
The three approaches are: automatically deleting uninteresting crosstabs, creating a heatmap that summarizes lots of data, and a technique that we call smart tables. I know you’re raring to go but before diving into these approaches, we need to review some light “theory”, and explore the question of what makes a table interesting.
What makes a table interesting?
Two crosstabs are shown below. Which of these is interesting? If we are going to automate the process of sifting through crosstabs we need to decide what makes a table interesting. The simplest way to do this is to see if there are meaningful differences between the percentages, reading across the rows. But instead of manually having to read them all, you can automate the process using tests of statistical significance. In our example below, I’ve used colors to denote significant results. This can also be accomplished in different ways like having letters indicate which columns are different to which others. Looking at these tables, only one of them jumps out as “interesting” and I bet you picked it right away.
Approach 1: Automatically delete tables that have no significant results
This approach is for those that love deleting things (and efficiency!). All you have to do is have your software automatically scan through all your tables and delete any that do not contain significant results – leaving you with only important ones. I’ve got 3,050 crosstabs. By deleting three-quarters of these, I’m saving myself heaps of time and energy. While this is a great way to quickly narrow down your results, you’ll need to be proficient at writing code to script this in R and SPSS. If the thought of writing code makes you wince, use Q or Displayr. They both automate this – meaning it’s a click away.
Approach 2: Use a heatmap to summarize thousands of crosstabs
While the first approach wins points for effectiveness and ease, it loses points for being binary. In this approach, tables are either black or white, significant or insignificant. There’s no allowance for shades of grey. Luckily, we can enter a technicolor world with an even more powerful approach. Introducing using heatmaps to summarize thousands of crosstabs!
The heatmap below summarizes 3,050 crosstabs. Each colored box shows the degree of statistical significance, where the degree in this case is something called a z-Statistic. Below, I describe a bit more about what the z-Statistic is, but for now, all we need to know is that the darker the box is shaded, the more the underlying table is significant. What can we gleam from this? For example:
- The first column shows how other questions in the study were related to agreeing with Allows to keep in touch. Reading down this column we can quickly see that agreement with this attitude can be predicted by Work status, Occupation, and Age. Reading across the rows we can see that there is no other attitude that is related to these three.
- If we scroll down further you will see a white diagonal line of boxes. This is showing the crosstabs of each attitude with each other. Putting aside the white, note how there are a lot more dark cells here. This tells us that the attitudes are highly correlated with each other. Also note however that there is a lot of variation. Some of the cells are much darker than others, telling us that we can likely group together similar attitudes (e.g., using PCA or cluster analysis).
- Scrolling even further down, you will see that the blue gets very, very, pale for most, but not all, of the variables relating to behavior. We can see two things here. First, the attitudes and behavior in these examples are not closely related. Second, there are a small number of stronger relationships meriting more exploration (i.e., the very dark cells).
Why does the heatmap use z-Statistics?
There are a few issues with the traditional approach of deleting tables that exceed the p-value cutoff for significance (0.05). One of the main issues is that, as they get smaller and closer to 0, it becomes difficult to compare them without having to squint at a lot of decimal places. In the table below, you can see the p-values in the third row of numbers. If you zoom in on the Strongly disagree column, the p-values could be a whole range of number like 0.003, or 0.000000001 – it’s simply impossible to tell.
But wait, there’s a solution! The z-Statistics contain the same information as the p-value, except re-scaled to make comparison easier. Check the table below for a handy guide to converting p-Value to z-Statistic. The key value here is that difference between p-values of 0.0001 and 0.0000001 is much bigger when viewed as a z-Statistic making it much easier to understand practical differences the two.
Approach 3: Smart Tables
The third approach kind of combines the two approaches from above for the best of both. This bit of magic works as follows:
- First, you identify a specific question of interest. For example, if you are wanting to profile a segmentation, then you select the variable that indicates which person is in which segment.
- Select any questions that may be of interest as crosstabs with the question of interest. If you aren’t sure, you just select all the variables.
- Compute statistical significance for each of the crosstabs.
- Delete all the tables that aren’t statistically significant.
- Rank the tables according to statistical significance.
An example of such an output, from Q, is shown below. Like with the heatmap, we end up with an output allows us to quickly identify the crosstabs of interest.
We’ve written two related posts that tell you which buttons to click to run these analyses in Q and Displayr:
- 3 Approaches to Quickly Sifting through 3,050 Crosstabs in Q.
- 2 Approaches to Quickly Sifting through 3,050 Crosstabs in Displayr.
Author: Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist, who has consulted, published academic papers, and won awards, for problems/techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q www.qresearchsoftware.com, a data science product designed for survey research, which is used by all the world’s seven largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.