---
title: "Benchmarks"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{benchmarks}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
Using `tidylog` adds a small overhead to each function call. For instance, because tidylog needs
to figure out how many rows were dropped when you use `tidylog::filter`,
this call will be a bit slower than using `dplyr::filter` directly.
The overhead is usually not noticeable, but can be for larger datasets, especially when using joins.
The benchmarks below give some impression of how large the overhead is.
```{r message = FALSE, warning = FALSE}
library("dplyr")
library("tidylog", warn.conflicts = FALSE)
library("bench")
library("knitr")
```
## filter
On a small dataset:
```{r message = FALSE}
bench::mark(
dplyr::filter(mtcars, cyl == 4),
tidylog::filter(mtcars, cyl == 4), iterations = 100
) %>%
dplyr::select(expression, min, median, n_itr) %>%
kable()
```
On a larger dataset:
```{r message = FALSE}
df <- tibble(x = rnorm(100000))
bench::mark(
dplyr::filter(df, x > 0),
tidylog::filter(df, x > 0), iterations = 100
) %>%
dplyr::select(expression, min, median, n_itr) %>%
kable()
```
## mutate
On a small dataset:
```{r message = FALSE}
bench::mark(
dplyr::mutate(mtcars, cyl = as.factor(cyl)),
tidylog::mutate(mtcars, cyl = as.factor(cyl)), iterations = 100
) %>%
dplyr::select(expression, min, median, n_itr) %>%
kable()
```
On a larger dataset:
```{r message = FALSE}
df <- tibble(x = round(runif(10000) * 10))
bench::mark(
dplyr::mutate(df, x = as.factor(x)),
tidylog::mutate(df, x = as.factor(x)), iterations = 100
) %>%
dplyr::select(expression, min, median, n_itr) %>%
kable()
```
## joins
Joins are the most expensive operation, as tidylog has to do
two additional joins behind the scenes.
On a small dataset:
```{r message = FALSE}
bench::mark(
dplyr::inner_join(band_members, band_instruments, by = "name"),
tidylog::inner_join(band_members, band_instruments, by = "name"), iterations = 100
) %>%
dplyr::select(expression, min, median, n_itr) %>%
kable()
```
On a larger dataset (with many row duplications):
```{r message = FALSE}
N <- 1000
df1 <- tibble(x1 = rnorm(N), key = round(runif(N) * 10))
df2 <- tibble(x2 = rnorm(N), key = round(runif(N) * 10))
bench::mark(
dplyr::inner_join(df1, df2, by = "key"),
tidylog::inner_join(df1, df2, by = "key"), iterations = 100
) %>%
dplyr::select(expression, min, median, n_itr) %>%
kable()
```