---
title: "Benchmarks"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{benchmarks}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

Using `tidylog` adds a small overhead to each function call. For instance, because tidylog needs to figure out how many rows were dropped when you use `tidylog::filter`, this call will be a bit slower than using `dplyr::filter` directly. The overhead is usually not noticeable, but can be for larger datasets, especially when using joins. The benchmarks below give some impression of how large the overhead is.

```{r message = FALSE, warning = FALSE}
library("dplyr")
library("tidylog", warn.conflicts = FALSE)
library("bench")
library("knitr")
```

## filter

On a small dataset:

```{r message = FALSE}
bench::mark(
    dplyr::filter(mtcars, cyl == 4),
    tidylog::filter(mtcars, cyl == 4),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
```

On a larger dataset:

```{r message = FALSE}
df <- tibble(x = rnorm(100000))

bench::mark(
    dplyr::filter(df, x > 0),
    tidylog::filter(df, x > 0),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
```

## mutate

On a small dataset:

```{r message = FALSE}
bench::mark(
    dplyr::mutate(mtcars, cyl = as.factor(cyl)),
    tidylog::mutate(mtcars, cyl = as.factor(cyl)),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
```

On a larger dataset:

```{r message = FALSE}
df <- tibble(x = round(runif(10000) * 10))

bench::mark(
    dplyr::mutate(df, x = as.factor(x)),
    tidylog::mutate(df, x = as.factor(x)),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
```

## joins

Joins are the most expensive operation, as tidylog has to do two additional joins behind the scenes.

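To make the cost of those extra joins concrete, here is a minimal sketch of one way such row counts could be obtained. It is an illustration under assumptions, not tidylog's actual implementation: the helper `inner_join_with_counts()` is hypothetical, and using two `dplyr::anti_join()` calls to count unmatched rows is just one plausible approach. The point is that each logged join then runs three joins instead of one.

```r
library("dplyr")

# Hypothetical helper (not part of tidylog): perform the join, then run two
# extra anti joins to count the rows that found no match on either side.
inner_join_with_counts <- function(x, y, by) {
    result <- dplyr::inner_join(x, y, by = by)

    only_in_x <- nrow(dplyr::anti_join(x, y, by = by))  # extra join 1
    only_in_y <- nrow(dplyr::anti_join(y, x, by = by))  # extra join 2

    message(sprintf(
        "rows only in x: %d, rows only in y: %d, rows in result: %d",
        only_in_x, only_in_y, nrow(result)
    ))
    result
}

joined <- inner_join_with_counts(band_members, band_instruments, by = "name")
```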

On a small dataset:

```{r message = FALSE}
bench::mark(
    dplyr::inner_join(band_members, band_instruments, by = "name"),
    tidylog::inner_join(band_members, band_instruments, by = "name"),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
```

On a larger dataset (with many row duplications):

```{r message = FALSE}
N <- 1000
df1 <- tibble(x1 = rnorm(N), key = round(runif(N) * 10))
df2 <- tibble(x2 = rnorm(N), key = round(runif(N) * 10))

bench::mark(
    dplyr::inner_join(df1, df2, by = "key"),
    tidylog::inner_join(df1, df2, by = "key"),
    iterations = 100
) %>%
    dplyr::select(expression, min, median, n_itr) %>%
    kable()
```
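
The tables above report absolute timings. To read the overhead as a ratio instead, `bench::mark()` accepts `relative = TRUE`, which reports each timing as a multiple of the fastest expression. The sketch below reuses the larger `filter` benchmark; wrapping the call in `suppressMessages()` is an assumption about how you may want to silence tidylog's output outside a knitr chunk, and the resulting numbers will vary by machine.

```r
library("dplyr")
library("tidylog", warn.conflicts = FALSE)
library("bench")

df <- tibble(x = rnorm(100000))

# relative = TRUE expresses each timing as a multiple of the fastest row,
# so the fastest expression shows 1 and the other shows its slowdown factor.
res <- suppressMessages(bench::mark(
    dplyr::filter(df, x > 0),
    tidylog::filter(df, x > 0),
    iterations = 100,
    relative = TRUE
))

res %>%
    dplyr::select(expression, min, median, n_itr)
```

A `median` value of, say, 1.5 for the tidylog row would mean the logged call took one and a half times as long as the plain dplyr call in that run.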