In this tutorial, we'll learn about the `apply()` function in R, including when to use it and why it can be more efficient than loops.

The `apply()` function is the basic model of the family of apply functions in R, which includes specific functions like `lapply()`, `sapply()`, `tapply()`, `mapply()`, `vapply()`, `rapply()`, `eapply()`, and others. All of these functions allow us to iterate over a data structure (such as a list, a matrix, an array, a data frame, or a selected slice of a given data structure) and perform the same operation on each element.

Such operations can include aggregation (i.e., calculating summary statistics like mean, max, min, or sum), transformation, or any other function, either simple or complex, built-in or custom. The functions of the apply family differ in the types of input they accept, the types of output they return, and the way they apply the function to the input.

Compared to the more traditional approach of using loop constructs for the same purpose, the `apply()` function and its variations offer compact, one-line syntax instead of a code block that spans multiple lines, and they often run faster than a naively written loop (for example, one that grows its result element by element). This becomes particularly important when working with large datasets.
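To make the comparison concrete, here is a minimal sketch that computes the maximum of each row twice, once with a `for` loop and once with `apply()`. The matrix `m` and the variable names are illustrative assumptions, not part of the tutorial's own examples.

```r
# Hypothetical example data: a 3 x 4 matrix holding the integers 1 to 12.
m <- matrix(1:12, nrow = 3)

# Loop version: preallocate the result, then fill it in row by row.
row_max_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_max_loop[i] <- max(m[i, ])
}

# apply() version: a single call that produces the same values.
row_max_apply <- apply(m, 1, max)

print(row_max_loop)
print(row_max_apply)
```

Both calls print `[1] 10 11 12`; the `apply()` version simply needs no index variable or preallocated result.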

## How to Use the `apply()` Function (and Its Varieties) in R

Let's explore some of the most useful varieties of the apply functions in R.

### `apply()`

We'll start with the main function of the apply group: `apply()`. It takes a data frame, a matrix, or a multi-dimensional array as input and, depending on the input object type and the function passed in, outputs a vector, a list, a matrix, or an array.

The syntax of the `apply()` function is very simple and has three main parameters:

```r
apply(X, MARGIN, FUN)
```

Here `X` is an input object (a data frame, a matrix, or an array), `MARGIN` determines how the function is applied (it can take the values `1`, `2`, or `c(1,2)`, meaning that the function is applied to each row, to each column, or to each individual element, respectively), and `FUN` is the function (built-in or custom) to apply to the input data.
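Beyond these three, any further arguments given to `apply()` are passed on to `FUN`. The sketch below illustrates this with `na.rm = TRUE`; the matrix `m_na` is a made-up example containing one missing value.

```r
# Hypothetical matrix with one missing value (filled column by column):
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]   NA    4    6
m_na <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

# Without extra arguments, the row containing NA sums to NA.
sums_default <- apply(m_na, 1, sum)
print(sums_default)

# Arguments placed after FUN are forwarded to it, so na.rm = TRUE reaches sum().
sums_na_rm <- apply(m_na, 1, sum, na.rm = TRUE)
print(sums_na_rm)
```

The first call prints `[1]  9 NA` and the second `[1]  9 10`.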

Let's look at some examples. We'll use a matrix as an input data structure, but the same principle works for the other possible data structures:

```r
my_matrix <- matrix(1:12, nrow=3)
print(my_matrix)
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
```

For example, we may want to find the maximum value for each row of our matrix. For this purpose, we'll set the `MARGIN` parameter to `1` and pass in the `max` function:

```r
print(apply(my_matrix, 1, max))
```

```
[1] 10 11 12
```

In the code above, we effectively performed an aggregation on the input **matrix** (which is a two-dimensional data structure). As a result, the output is a **vector** (which is a one-dimensional data structure) containing the maximum value of each row.

Now, let's calculate the sum of values of the matrix by column (`MARGIN=2`):

```r
print(apply(my_matrix, 2, sum))
```

```
[1]  6 15 24 33
```

The output data structure is again a vector. The `sum` function is another example of an aggregation function that reduces the dimensionality of an input object by 1.

In some cases, we may need to calculate a cumulative sum by column:

```r
print(apply(my_matrix, 2, cumsum))
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    9   15   21
[3,]    6   15   24   33
```

This time, we obtained a **matrix** of the same size as the input one since the `cumsum` function computes a value for each value of the input matrix.

Note that the last result (the output object being the same size as the input object) isn't always the case for a non-aggregation function. For example, we might want a *range* of values by column:

```r
print(apply(my_matrix, 2, range))
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    6    9   12
```

Here, we also got a matrix as an output object but *of a different size than the input one* (2x4 rather than 3x4).
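The shape of the result follows from the length of what `FUN` returns for each row or column. The following sketch checks the dimensions with `dim()`; the matrix `m` and the variable names are illustrative. It also shows a detail that is easy to miss: when a function returning several values is applied by row, each row's result becomes a *column* of the output, so `t()` is needed to get a row-oriented layout back.

```r
# Hypothetical example data: the same kind of 3 x 4 matrix as above.
m <- matrix(1:12, nrow = 3)

# One value per column: the result is a plain vector, so dim() is NULL.
dim_sum <- dim(apply(m, 2, sum))
print(dim_sum)

# Two values per column: a 2 x 4 matrix, one output column per input column.
dim_range_cols <- dim(apply(m, 2, range))
print(dim_range_cols)

# Two values per row: a 2 x 3 matrix, one output column per input ROW.
dim_range_rows <- dim(apply(m, 1, range))
print(dim_range_rows)

# Row-wise cumulative sums come back transposed (4 x 3); t() restores 3 x 4.
row_cumsum <- t(apply(m, 1, cumsum))
print(row_cumsum)
```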

It is possible to provide any custom function to `apply()`. Let's define a function that calculates the mean of squared values for each input:

```r
mean_squared_vals <- function(x) mean(x^2)
```

Just as we did earlier, we can apply this function by row (`MARGIN=1`):

```r
print(apply(my_matrix, 1, mean_squared_vals))
```

```
[1] 41.5 53.5 67.5
```

We can also apply the function by column (`MARGIN=2`):

```r
print(apply(my_matrix, 2, mean_squared_vals))
```

```
[1]   4.666667  25.666667  64.666667 121.666667
```

Finally, and this is something we haven't tried yet, we can apply it to each individual element by using both margins (`MARGIN=c(1,2)`):

```r
print(apply(my_matrix, c(1,2), mean_squared_vals))
```

```
     [,1] [,2] [,3] [,4]
[1,]    1   16   49  100
[2,]    4   25   64  121
[3,]    9   36   81  144
```

In the last case, we got a matrix where each value is the square of the corresponding value of the input matrix. Since the `mean` component of our user-defined function was applied to only one value at each iteration, it returned that value unchanged. So, in this particular case, the `mean` operation has no effect.
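As a quick check of that claim, the sketch below compares the element-wise `apply()` result with the plain vectorized expression `my_matrix^2`. The names `elementwise` and `vectorized` are illustrative, and the matrix and function are redefined so the snippet runs on its own.

```r
# Redefine the tutorial's matrix and function so this snippet is self-contained.
my_matrix <- matrix(1:12, nrow = 3)
mean_squared_vals <- function(x) mean(x^2)

# Element-wise application: the function is called once per cell.
elementwise <- apply(my_matrix, c(1, 2), mean_squared_vals)

# Vectorized alternative: square the whole matrix in one operation.
vectorized <- my_matrix^2

# all.equal() compares the two matrices value by value.
print(all.equal(elementwise, vectorized))
```

When a vectorized operator like `^` already does the job, it is usually the simpler choice; `MARGIN=c(1,2)` is most useful for functions that only accept a single value at a time.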

### `lapply()`

The `lapply()` function is a variety of `apply()` that takes in a vector, a list, or a data frame as input and always outputs a **list** ("l" in the function name stands for "list"). The specified function is applied to each element of the input object, hence the length of the resulting list is always equal to the input object's length.

The syntax of this function is similar to the syntax of `apply()`, only here there is no need for the `MARGIN` parameter since the function is applied element-wise for lists and vectors and column-wise for data frames:

```r
lapply(X, FUN)
```

Let's see how it works on vectors, lists, and DataFrames. First, we'll create a simple function that adds 1 to an input value:

```r
add_one <- function(x) x+1
```

Let's test it on a vector:

```r
my_vector <- c(1, 2, 3)
print(lapply(my_vector, add_one))
```

```
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4
```

We added 1 to each value of the vector.

Now, we will create a list:

```r
my_list <- list(TRUE, c(1, 2, 3), 10)
print(my_list)
```

```
[[1]]
[1] TRUE

[[2]]
[1] 1 2 3

[[3]]
[1] 10
```

Now we will apply our function to it:

```r
print(lapply(my_list, add_one))
```

```
[[1]]
[1] 2

[[2]]
[1] 2 3 4

[[3]]
[1] 11
```

Since `TRUE` is coerced to 1 in arithmetic, adding 1 to it gives 2 for the first item of the resulting list. In the case of the vector item, 1 was added to each of its values.

Finally, let's use `lapply()` on a data frame:

```r
my_df <- data.frame(a=1:3, b=4:6, c=7:9, d=10:12)
print(my_df)
```

```
  a b c  d
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
```

```r
print(lapply(my_df, add_one))
```

```
$a
[1] 2 3 4

$b
[1] 5 6 7

$c
[1]  8  9 10

$d
[1] 11 12 13
```

As we mentioned earlier, the `lapply()` function is applied column-wise for data frames.
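This column-wise behavior follows from the fact that a data frame is stored as a list of columns. Because `lapply()` returns a plain list, we may want to turn the result back into a data frame. The sketch below shows two common ways to do that; the smaller data frame and the names `df_plus_one` and `df_copy` are illustrative assumptions.

```r
# A smaller, hypothetical data frame and the same helper function as above.
my_df <- data.frame(a = 1:3, b = 4:6)
add_one <- function(x) x + 1

# A data frame is a list of columns, which is why lapply() walks the columns.
print(is.list(my_df))

# Option 1: wrap the resulting list in as.data.frame().
df_plus_one <- as.data.frame(lapply(my_df, add_one))
print(df_plus_one)

# Option 2: assign into df_copy[] so the data frame structure is kept.
df_copy <- my_df
df_copy[] <- lapply(df_copy, add_one)
print(df_copy)
```

Both options give a data frame with columns `a` (2, 3, 4) and `b` (5, 6, 7).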

### `sapply()`

The `sapply()` function is a simplified form of `lapply()` ("s" in the function name stands for "simplified"). It has the same syntax as `lapply()` (i.e., `sapply(X, FUN)`), takes in a vector, a list, or a data frame as input, just as `lapply()` does, and tries to reduce the output object to the simplest possible data structure. By default, it returns a vector when every result has length 1, a matrix when all results share the same length greater than 1, and a list otherwise. For our examples, that means a vector for the vector, a list for the list, and a matrix for the data frame.

Let's try it on our variables `my_vector`, `my_list`, and `my_df` using the same custom function `add_one` as earlier:

```r
print(sapply(my_vector, add_one))
```

```
[1] 2 3 4
```

```r
print(sapply(my_list, add_one))
```

```
[[1]]
[1] 2

[[2]]
[1] 2 3 4

[[3]]
[1] 11
```

```r
print(sapply(my_df, add_one))
```

```
     a b  c  d
[1,] 2 5  8 11
[2,] 3 6  9 12
[3,] 4 7 10 13
```

We can change the default behavior of the `sapply()` function by passing in the optional parameter `simplify=FALSE` (by default, it is `TRUE`). In this case, the `sapply()` function behaves like `lapply()` and always outputs a list for any valid input data structure:

```r
print(typeof(sapply(my_vector, add_one, simplify=FALSE)))
print(typeof(sapply(my_list, add_one, simplify=FALSE)))
print(typeof(sapply(my_df, add_one, simplify=FALSE)))
```

```
[1] "list"
[1] "list"
[1] "list"
```
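What `sapply()` returns with the default `simplify=TRUE` depends on the lengths of the individual results rather than on the input type alone. The following sketch, using made-up anonymous functions and illustrative variable names, shows the three typical outcomes.

```r
# Every call returns one value: sapply() simplifies to a vector.
one_each <- sapply(1:3, function(x) x^2)

# Every call returns two values: sapply() simplifies to a 2 x 3 matrix,
# with one column per input element.
two_each <- sapply(1:3, function(x) c(x, x^2))

# Calls return results of different lengths: nothing to simplify,
# so the result stays a list.
ragged <- sapply(1:3, function(x) seq_len(x))

print(one_each)
print(dim(two_each))
print(class(ragged))
```

This is why `sapply(my_list, add_one)` above returned a list: the three results had lengths 1, 3, and 1.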

### `tapply()`

We use the `tapply()` function for calculating summary statistics (such as mean, median, min, max, or sum) for each level of a factor (i.e., each category). It has the following syntax:

```r
tapply(X, INDEX, FUN)
```

Here `X` is an R object, typically a vector, containing numeric data; `INDEX` is an R object, typically a vector or a list, containing the grouping factor(s); and `FUN` is the function to be applied to `X`.

To see how it works, let's imagine we have information about the salaries of a group of people with data-related jobs: Data Scientist (DS), Data Analyst (DA), and Data Engineer (DE). Using the `tapply()` function, we can calculate the mean salary by job title.

(*Side note:* as a rough guide, here we used the information from Indeed to estimate the mean salary by role in the USA, February 2022.)

```r
salaries <- c(80000, 62000, 113000, 68000, 75000, 79000, 112000, 118000, 65000, 117000)
jobs <- c('DS', 'DA', 'DE', 'DA', 'DS', 'DS', 'DE', 'DE', 'DA', 'DE')
print(tapply(salaries, jobs, mean))
```

```
    DA     DE     DS 
 65000 115000  78000 
```
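One way to picture what `tapply()` does here is as a two-step process: split the values into groups, then apply the function to each group. The sketch below reproduces the same means with `split()` and `sapply()`; this is an illustration of the idea, not a description of how `tapply()` is implemented, and the names `by_tapply` and `by_split` are illustrative.

```r
# Same salary data as in the tutorial, repeated so the snippet runs on its own.
salaries <- c(80000, 62000, 113000, 68000, 75000, 79000, 112000, 118000, 65000, 117000)
jobs <- c('DS', 'DA', 'DE', 'DA', 'DS', 'DS', 'DE', 'DE', 'DA', 'DE')

# One-step version with tapply().
by_tapply <- tapply(salaries, jobs, mean)

# Two-step version: split() groups the salaries by job title,
# then sapply() computes the mean of each group.
by_split <- sapply(split(salaries, jobs), mean)

print(by_split)
```

Both approaches give 65000 for DA, 115000 for DE, and 78000 for DS.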

## Conclusion

To sum up, we learned many things about using the apply functions in R. Now we know the following:

- How to define the `apply()` function in R
- The varieties of the function and which are the most common
- Why the functions of the apply family are more efficient than loops
- When each of the common varieties of the apply function is applicable
- The syntax of each variety
- The types of input each variety takes and how to use it on different types of input