It can often be difficult and time-consuming to organize raw text data into meaningful insights. Manually coding even a single text question can take several hours, even with relatively small sample sizes. Q has built-in text categorization tools designed to help you quickly categorize your text data to easily find valuable insights. One of these tools is the automatic categorization of text data into a list of items.

The example below uses text data from a survey about Tom Cruise.  An open-ended question asked respondents “What don’t you like about Tom Cruise?”, which yielded a wide range of responses.  Use the automatic categorization tools in Q to generate a list of items to see what main themes are of concern to the survey respondents.

Automatic Categorization of Text Data into a List of Items

Start by first importing your data containing the text variable you want to categorize into Q. With the data loaded, you’re now ready to run the categorization analysis:

  1. Go to Create > Text Analysis > Automatic Categorization > List of  Items.
  2. In the object inspector (the section that opens on the right of the screen), under Inputs > Text variable(s) select the variable that contains the text you want to analyze.
  3. Change the Inputs > Minimum category size to the number of categories you would like to classify the data into.  For this example, we’ll set the number of categories to 2. This means that Displayr will only show categories that contain at least 2 responses.

The output will calculate automatically:

List of items output

Interpreting the Output

The left side of the output lists the automatically generated categories.  The number of individual responses that are put into each category is displayed in the frequency column. The variants indicate how many different specific variations of responses there are in each category. The right side of the output shows the raw (original) text and the normalized or categorized text. The normalized text is the “cleaned” text. In other words, this is the text after spell checking, stemming, word removals and other cleaning functions are performed.

The colored boxes around the word categories are randomly assigned to each category. The only purpose of this is to give specific words a unique style and make them easier to identify.  There is no further meaning to this.  If “Scientologist” is in a blue box, then all occurrences of “Scientologist” will be in a blue box.  This makes it easier to recognize common categories.

How to Save the Categories to your Data Set

You can save the created categories to your data set as a new variable which can then be used in other analyses.  Make sure that the output above is selected and then go to Create > Text Analysis > Advanced > Save Variable(s) > Categories.  A new multi-select variable is created and can be seen in the Variables and Questions tab. You can now use it like any of the other variables in your data to create tables, crosstabs, charts, and visualizations.

Author: Kris Tonthat