Prerequisites:
dplyr
packageThe apply
family of functions (lapply()
, sapply()
, vapply()
) can be used within bespoke functions or used to apply bespoke functions, not just regular functions.
vector <- c(1, NA, 3, 4, 5)
dataframe <- data.frame(col1 = c(NA, NA, 6, 8),
col2 = c(2, 4, NA, 9))
data_list <- list(vector, dataframe)
## lapply for a bespoke function
cut_in_half <- function(data) {
data / 2
}
cut_in_half(vector)
[1] 0.5 NA 1.5 2.0 2.5
cut_in_half(dataframe)
col1 col2
1 NA 1.0
2 NA 2.0
3 3 NA
4 4 4.5
lapply(data_list, cut_in_half)
[[1]]
[1] 0.5 NA 1.5 2.0 2.5
[[2]]
col1 col2
1 NA 1.0
2 NA 2.0
3 3 NA
4 4 4.5
## lapply inside function
cut_list_in_half <- function(list) {
lapply(list, cut_in_half)
}
cut_list_in_half(data_list)
[[1]]
[1] 0.5 NA 1.5 2.0 2.5
[[2]]
col1 col2
1 NA 1.0
2 NA 2.0
3 3 NA
4 4 4.5
Not all functions need to be named. Small (one-line) functions that are not worth naming are called anonymous functions.
A good rule of thumb is that an anonymous function should fit on one line and shouldn’t need to use {}
.
name height mass hair_color skin_color eye_color
87 46 39 13 31 15
birth_year sex gender homeworld species films
37 5 3 49 38 24
vehicles starships
11 17
# Keeps only non-numeric columns
head(Filter(function(x) !is.numeric(x), starwars))
# A tibble: 6 x 11
name hair_color skin_color eye_color sex gender homeworld species
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Luke~ blond fair blue male mascu~ Tatooine Human
2 C-3PO <NA> gold yellow none mascu~ Tatooine Droid
3 R2-D2 <NA> white, bl~ red none mascu~ Naboo Droid
4 Dart~ none white yellow male mascu~ Tatooine Human
5 Leia~ brown light brown fema~ femin~ Alderaan Human
6 Owen~ brown, gr~ light blue male mascu~ Tatooine Human
# i 3 more variables: films <list>, vehicles <list>, starships <list>
# Integrates x to the power of 2 using limits of 0 and 10
integrate(function(x) x ^ 2, 0, 10)
333.3333 with absolute error < 3.7e-12
The best way to group functions together is using packages.
If you are not going to build a package for your functions, a simple way to group similar/related functions together is to combine them into a list (remember, functions are objects).
You then call them the same way that you would call a list element.
This is not generally recommended.
function_list <- list(
half_num = function(x) x / 2,
double_num = function(x) x * 2
)
function_list$double_num(10)
[1] 20
Using base r
and dplyr
Function Tidy Evaluation
Some variables, commonly column names, can be dependent on some external factor (e.g. user input, data values) and you may want to manipulate them in a generic way in using function arguments. This can commonly happen when you want to choose the column of interest manually in a function or when iterating/looping through.
This can be done with base r
, using variable selection such as square brackets []
and double square brackets [[]]
.
base r
:
colour_data <- data.frame(red = c(1, 2, 3),
blue = c(5, 7, 3),
green = c(9, 8, 7))
# I want to get the mean of one column of my choice.
col_mean <- function(data, col) {
print(paste0(col, ": ", mean(data[[col]])))
}
col_mean(colour_data, "blue")
[1] "blue: 5"
# I want to iterate through the columns using a vector
col_vector <- c("green", "blue", "red")
for (colour in col_vector) {
col_mean(colour_data, colour)
}
[1] "green: 8"
[1] "blue: 5"
[1] "red: 2"
Some people prefer to structure their code following tidyverse
styling.
dplyr
provides a way to dynamically manipulate columns, but it makes writing functions containing dplyr
functions a bit more complicated.
To determine the syntax to use, you need to remember the distinction between data and environmental variables and which dplyr
function uses which.
<-
, can be seen in the RStudio Environment window.Use ?dplyr::function()
to see which type of formatting the variable uses: data masking or tidy selection.
The syntax for dynamic arguments in dplyr
are dependent on if the variable (column name) is quoted (a string) or unquoted (not a string).
It is possible to convert between the two but that gets even more complicated.
df <- data.frame(column_one = c(seq(1, 5, 1)),
column_two = c(seq(6, 10, 1)))
colname <- "column_one"
## Base using the known column name unquoted and quoted
df$column_one
[1] 1 2 3 4 5
df$"column_one"
[1] 1 2 3 4 5
df[, "column_one"]
[1] 1 2 3 4 5
## Base with a variable column name
df[[colname]]
[1] 1 2 3 4 5
df[, colname]
[1] 1 2 3 4 5
# Doesn't work
df$colname
NULL
df$"colname"
NULL
df[["colname"]]
NULL
## dplyr can use unquoted and quoted column names
dplyr::pull(df, column_one)
[1] 1 2 3 4 5
dplyr::pull(df, colname)
[1] 1 2 3 4 5
## But when you use dplyr functions inside bespoke functions
# It works like normal if you statically define the column name
pull_col_one <- function(data) {
dplyr::pull(data, column_one)
}
pull_col_one(df)
[1] 1 2 3 4 5
pull_col_one <- function(data) {
dplyr::pull(data, "column_one")
}
pull_col_one(df)
[1] 1 2 3 4 5
# It becomes more complicated when you want to chose the column with an argument
# This doesn't work
pull_colname <- function(data, column_name) {
dplyr::pull(data, column_name)
}
pull_colname(df, column_one)
Error in `dplyr::pull()`:
Caused by error:
! object 'column_one' not found
Unquoted
The variable is generally unquoted when you know what the column name is (i.e. you aren’t pulling it from somewhere else or creating it somehow). This requires external knowledge of the data.
This is how we use dplyr
functions to access column names.
Quoted
Quoted/string variables are generally useful for when the column name is derived from something else in an automated way (e.g. using paste()
, colnames()
e.t.c.). Column names as strings are much easier to work with than unquoted, but the syntax for creating dynamic arguments is a bit more complicated. These string column names are more flexible as they can be manipulated with normal string manipulation.
When creating dplyr
tidy evaluation functions, the first argument should be data.
Then when you are piping with your custom functions, you do not need to give the data argument.
Further information can be found in dplyr
documentation.
Applies to:
filter()
group_by()
mutate()
summarise()
arrange()
count()
Can use column names (i.e. data variables) as if they were variables in the environment, similar to how dplyr
variables are called.
i.e. you write column_name
, not df$column_name
or df[[column_name]]
Need to double-embrace it with curly brackets {{variable}}
where it is used in the body of the function.
Also, can use glue syntax with :=
to name the resulting variables after the unquoted argument variable contained in double curly brackets {{}}
.
group_count_min_max <- function(df, group_var, summ_var) {
df %>%
dplyr::group_by({{group_var}}) %>%
dplyr::summarise("n_{{summ_var}}" := n(),
"min_{{summ_var}}" := min({{summ_var}}),
"max_{{summ_var}}" := max({{summ_var}}))
}
group_count_min_max(df = mtcars, group_var = cyl, summ_var = mpg)
# A tibble: 3 x 4
cyl n_mpg min_mpg max_mpg
<dbl> <int> <dbl> <dbl>
1 4 11 21.4 33.9
2 6 7 17.8 21.4
3 8 14 10.4 19.2
# If piping, so long as your first argument of your custom function is the data,
# you don't need to specify the data using a full-stop
# Can pipe the custom function along with other dplyr functions
mtcars %>%
dplyr::select(-gear) %>%
group_count_min_max(group_var = cyl, summ_var = mpg)
# A tibble: 3 x 4
cyl n_mpg min_mpg max_mpg
<dbl> <int> <dbl> <dbl>
1 4 11 21.4 33.9
2 6 7 17.8 21.4
3 8 14 10.4 19.2
## This is much more complicated to do in base
If a variable exists as a character string (i.e. as an env-variable), need to indirectly select it from .data
in the body of the function where it is used using [[]]
.
The .data
bit does not vary with different argument names, it is always the .data
as it refers to the data within the pipe level above.
Can also use glue syntax with :=
to name the resulting variables after the quoted argument variable (embraced by single curly brackets {}
).
group_count_min_max <- function(df, group_var, summ_var) {
df %>%
dplyr::group_by(.data[[group_var]]) %>%
dplyr::summarise("n_{summ_var}" := n(),
"min_{summ_var}" := min(.data[[summ_var]]),
"max_{summ_var}" := max(.data[[summ_var]]))
}
mtcars %>%
group_count_min_max(group_var = "cyl", summ_var = "mpg")
# A tibble: 3 x 4
cyl n_mpg min_mpg max_mpg
<dbl> <int> <dbl> <dbl>
1 4 11 21.4 33.9
2 6 7 17.8 21.4
3 8 14 10.4 19.2
Applies to:
select()
across()
rename()
relocate()
pull()
Can easily choose variables based on their position, name or type.
e.g. starts_with("x")
or is.numeric()
Similar to data masking, you can use embracing with double curly brackets {{}}
.
drop_rename_iris <- function(data, drop_var, chosen_var) {
data %>%
dplyr::select(-{{drop_var}}) %>%
dplyr::rename(var_of_interest = {{chosen_var}})
}
iris %>%
drop_rename_iris(drop_var = Sepal.Length,
chosen_var = Species) %>%
head()
Sepal.Width Petal.Length Petal.Width var_of_interest
1 3.5 1.4 0.2 setosa
2 3.0 1.4 0.2 setosa
3 3.2 1.3 0.2 setosa
4 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
6 3.9 1.7 0.4 setosa
Similar to data masking, can use indirect selection from .data
using [[]]
.
drop_rename_iris <- function(data, drop_var, chosen_var) {
data %>%
dplyr::select(-.data[[drop_var]]) %>%
dplyr::rename(var_of_interest = .data[[chosen_var]])
}
iris %>%
drop_rename_iris(drop_var = "Sepal.Length",
chosen_var = "Species") %>%
head()
Sepal.Width Petal.Length Petal.Width var_of_interest
1 3.5 1.4 0.2 setosa
2 3.0 1.4 0.2 setosa
3 3.2 1.3 0.2 setosa
4 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
6 3.9 1.7 0.4 setosa
# As seen in the warning above, it appears that .data[[var]]
# has now been depreciated
drop_rename_iris <- function(data, drop_var, chosen_var) {
data %>%
dplyr::select(-all_of(drop_var)) %>%
dplyr::rename(var_of_interest = .data[[chosen_var]])
}
iris %>%
drop_rename_iris(drop_var = "Sepal.Length",
chosen_var = "Species") %>%
head()
Sepal.Width Petal.Length Petal.Width var_of_interest
1 3.5 1.4 0.2 setosa
2 3.0 1.4 0.2 setosa
3 3.2 1.3 0.2 setosa
4 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
6 3.9 1.7 0.4 setosa
## More advanced example
df <- data.frame(index_partial = sample(0:1, size = 100, replace = TRUE),
index_all = sample(0:1, size = 100, replace = TRUE),
days_partial = sample(x = 0:100, size = 100),
days_all = sample(x = 0:100, size = 100))
mean_day <- function(var_name) {
df %>%
dplyr::filter(.data[[paste0("index_", var_name)]] == 1) %>%
dplyr::summarise("mean_days_{ var_name }" :=
mean(.data[[paste0("days_", var_name)]], na.rm = TRUE))
}
mean_day("partial")
mean_days_partial
1 49.12069
How do you select a variable in a function with data masking?
{var}
quoted: [var]
{var}
unquoted: [var]
{{var}}
quoted: .data[[var]]
{{var}}
unquoted: .data[[var]]
How do you select a variable in a function with data masking?
{var}
quoted: [var]
{var}
unquoted: [var]
{{var}}
quoted: .data[[var]]
{{var}}
unquoted: .data[[var]]
##### Exercise 3 - Dplyr Data Masking vs base -----
## Write 3 functions to keep rows in a starwars dataset (from dplyr) if it
## contains a certain value in a selected column
## 1) using dplyr where the column name is unquoted
## 2) using dplyr where the column name is quoted
## 3) in base r
# Test if your function works by selecting those who are "feminine" in
# the gender column
data("starwars")
##### Exercise 4 - Dplyr Tidy Selection vs base -----
## Write a function to select the winning team (given by an argument)
## Then sort the winning team into alphabetical order
## using both base r and dplyr
## EXTRA: Loop over each of the names and save the output as <team_name>_wins
## using both base r and dplyr
team_names <- c("apple", "banana", "orange", "pear")
school_teams <- data.frame(
apple = c("Willibald", "Kilie", "Anuradha", "Theodora"),
banana = c("Branislav", "Thorbjorn", "Ward", "Silvana"),
orange = c("Seeta", "Yota", "Griet", "Edmund"),
pear = c("Gyula", "Della", "Duru", "Sutekh"))
##### Answer 3 -----
# dplyr - unquoted column name
starwars_filter_unquoted <- function(data,
col_name_filter,
filter_by) {
dplyr::filter(data, {{col_name_filter}} == filter_by)
}
starwars_filter_unquoted(data = starwars,
col_name_filter = gender,
filter_by = "feminine")
# dplyr - quoted column name
starwars_filter_quoted <- function(data,
col_name_filter,
filter_by) {
dplyr::filter(data, .data[[col_name_filter]] == filter_by)
}
starwars_filter_quoted(data = starwars,
col_name_filter = "gender",
filter_by = "feminine")
# Base
starwars_filter_base <- function(data,
col_name_filter,
filter_by) {
data[which(data[[col_name_filter]] == filter_by), ]
}
starwars_filter_base(data = starwars,
col_name_filter = "gender",
filter_by = "feminine")
##### Answer 4 -----
winning_team_alph <- function(data, winning_team) {
data <- data %>%
dplyr::select(winning_team)
data <- data.frame(data[order(data[1]), ])
colnames(data) <- "winning team"
data
}
winning_team_alph(data = school_teams, winning_team = "apple")
winning_team_alph <- function(data, winning_team) {
data <- data %>%
dplyr::select({{winning_team}})
data <- data.frame(data[order(data[1]), ])
colnames(data) <- "winning team"
data
}
winning_team_alph(data = school_teams, winning_team = "apple")