# How to Aggregate Data in Q

You have a data set and you need to quickly organize it to perform your data analysis. Where do you start? You could create a table of statistics which summarizes data by aggregating it. In this post, I’ll show you how you can easily aggregate data in Q.

The process involves two stages. First, collate individual cases of raw data together with a grouping variable. Second, perform which calculation you want on each group of cases. These two stages are wrapped into a single function to aggregate data.

To perform aggregation, we need to specify three things in the code:

- The data that we want to aggregate
- The variable to group by within the data
- The calculation to apply to the groups (what you want to find out)

## An example of aggregating data

The raw data shown below consists of one row per case. Each case is an employee at a restaurant.

Aggregation is performed with the following R code. In Q, navigate to **Create > R Output **to type or paste in code.

agg = aggregate(data, by = list(data$Role), FUN = mean)

This produces a table of the average salary and age by role, as below.

## Replicate this table yourself

The first argument to the function is a table. In Q, this could be pasted in with **Create > Tables > Paste Data** or created from a data set with **Create > Tables > Raw Data**.

In Q, the example data can be loaded by running the following in a R output:

library(flipAPI) data = DownloadXLSX("https://wiki.q-researchsoftware.com/images/1/1b/Aggregation_data.xlsx", want.row.names = FALSE, want.data.frame = TRUE)

The *by* argument is a list of variables to group by. This must be a list even if there is only one variable, as in the example.

The *FUN* argument is the function which is applied to all columns (i.e., variables) in the grouped data. Because we cannot calculate the average of categorical variables such as *Name* and *Shif*t, they result in empty columns, which are removed for clarity.

## Aggregating your own data and other aggregation functions

Any function that can be applied to a numeric variable can be used within *aggregate*. Maximum, minimum, count, standard deviation and sum are all popular.

For more specific purposes, it is also possible to write your own function in R and refer to that within *aggregate*. This is demonstrated below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by *Role* and by *Shift* together.

second = function(x) { if (length(x) == 1) return(x) return(sort(x, decreasing = TRUE)[2])} agg = aggregate(data, by = list(data$Role, data$Shift), FUN = second)

## Additional features

The *aggregate* function has a few more features to be aware of:

- Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation.
- Setting
*drop = TRUE*means that any groups with zero count are removed. *na.action*controls the treatment of missing values within the data.

#### Author: Jake Hoare

After escaping from physics to a career in banking, then escaping from banking, I decided to go back to BASIC and study computing. This led me to rediscover artificial intelligence and data science. I now get to indulge myself at Displayr working in the Data Science team, sometimes on machine learning.