Title: | Enhanced 'mutate' |
---|---|
Description: | Provides 'Apache Spark' style window aggregation for R dataframes and remote 'dbplyr' tables via 'mutate' in 'dplyr' flavour. |
Authors: | Srikanth Komala Sheshachala [aut, cre] |
Maintainer: | Srikanth Komala Sheshachala <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.0 |
Built: | 2024-11-03 02:49:10 UTC |
Source: | https://github.com/talegari/tidier |
mutate
Provides supercharged version of mutate
with group_by
, order_by
and aggregation over arbitrary window frame
around a row for dataframes and lazy (remote) tbl
s of class tbl_lazy
.
mutate(x, ..., .by, .order_by, .frame, .index, .complete = FALSE)
mutate(x, ..., .by, .order_by, .frame, .index, .complete = FALSE)
x |
( |
... |
expressions to be passed to |
.by |
(expression, optional: Yes) Columns to group by |
.order_by |
(expression, optional: Yes) Columns to order by |
.frame |
(vector, optional: Yes) Vector of length 2 indicating the
number of rows to consider before and after the current row. When argument
|
.index |
(expression, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only. |
.complete |
(flag, default: FALSE) This will be passed to
|
A window function returns a value for every input row of a dataframe
or lazy_tbl
based on a group of rows (frame) in the neighborhood of the
input row. This function implements computation over groups (partition_by
in SQL) in a predefined order (order_by
in SQL) across a neighborhood of
rows (frame) defined by a (up, down) where
up/down are number of rows before and after the corresponding row
up/down are interval objects (ex: c(days(2), days(1))
).
Interval objects are currently supported for dataframe only. (not
tbl_lazy
)
This implementation is inspired by spark's window API.
Implementation Details:
For dataframe input:
Iteration per row over the window is implemented using the versatile
slider
.
Application of a window aggregation can be optionally run in parallel
over multiple groups (see argument .by
) by setting a
future parallel backend. This
is implemented using furrr
package.
function subsumes regular usecases of mutate
For tbl_lazy
input:
Uses dbplyr::window_order
and dbplyr::window_frame
to translate to
partition_by
and window frame specification.
data.frame
or tbl_lazy
mutate_
library("magrittr") # example 1 (simple case with dataframe) # Using iris dataset, # compute cumulative mean of column `Sepal.Length` # ordered by `Petal.Width` and `Sepal.Width` columns # grouped by `Petal.Length` column iris %>% mutate(sl_mean = mean(Sepal.Length), .order_by = c(Petal.Width, Sepal.Width), .by = Petal.Length, .frame = c(Inf, 0), ) %>% dplyr::slice_min(n = 3, Petal.Width, by = Species) # example 2 (detailed case with dataframe) # Using a sample airquality dataset, # compute mean temp over last seven days in the same month for every row set.seed(101) airquality %>% # create date column dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>% # create gaps by removing some days dplyr::slice_sample(prop = 0.8) %>% dplyr::arrange(date_col) %>% # compute mean temperature over last seven days in the same month tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE), .order_by = Day, .by = Month, .frame = c(lubridate::days(7), # 7 days before current row lubridate::days(-1) # do not include current row ), .index = date_col ) # example 3 airquality %>% # create date column as character dplyr::mutate(date_col = as.character(lubridate::make_date(1973, Month, Day)) ) %>% tibble::as_tibble() %>% # as `tbl_lazy` dbplyr::memdb_frame() %>% mutate(avg_temp = mean(Temp), .by = Month, .order_by = date_col, .frame = c(3, 3) ) %>% dplyr::collect() %>% dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
library("magrittr") # example 1 (simple case with dataframe) # Using iris dataset, # compute cumulative mean of column `Sepal.Length` # ordered by `Petal.Width` and `Sepal.Width` columns # grouped by `Petal.Length` column iris %>% mutate(sl_mean = mean(Sepal.Length), .order_by = c(Petal.Width, Sepal.Width), .by = Petal.Length, .frame = c(Inf, 0), ) %>% dplyr::slice_min(n = 3, Petal.Width, by = Species) # example 2 (detailed case with dataframe) # Using a sample airquality dataset, # compute mean temp over last seven days in the same month for every row set.seed(101) airquality %>% # create date column dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>% # create gaps by removing some days dplyr::slice_sample(prop = 0.8) %>% dplyr::arrange(date_col) %>% # compute mean temperature over last seven days in the same month tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE), .order_by = Day, .by = Month, .frame = c(lubridate::days(7), # 7 days before current row lubridate::days(-1) # do not include current row ), .index = date_col ) # example 3 airquality %>% # create date column as character dplyr::mutate(date_col = as.character(lubridate::make_date(1973, Month, Day)) ) %>% tibble::as_tibble() %>% # as `tbl_lazy` dbplyr::memdb_frame() %>% mutate(avg_temp = mean(Temp), .by = Month, .order_by = date_col, .frame = c(3, 3) ) %>% dplyr::collect() %>% dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
mutate
Provides supercharged version of mutate
with group_by
, order_by
and aggregation over arbitrary window frame
around a row for dataframes and lazy (remote) tbl
s of class tbl_lazy
.
mutate_( x, ..., .by, .order_by, .frame, .index, .desc = FALSE, .complete = FALSE )
mutate_( x, ..., .by, .order_by, .frame, .index, .desc = FALSE, .complete = FALSE )
x |
( |
... |
expressions to be passed to |
.by |
(character vector, optional: Yes) Columns to group by |
.order_by |
(string, optional: Yes) Columns to order by |
.frame |
(vector, optional: Yes) Vector of length 2 indicating the
number of rows to consider before and after the current row. When argument
|
.index |
(string, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only. |
.desc |
(flag, default: FALSE) Whether to order in descending order |
.complete |
(flag, default: FALSE) This will be passed to
|
A window function returns a value for every input row of a dataframe
or lazy_tbl
based on a group of rows (frame) in the neighborhood of the
input row. This function implements computation over groups (partition_by
in SQL) in a predefined order (order_by
in SQL) across a neighborhood of
rows (frame) defined by a (up, down) where
up/down are number of rows before and after the corresponding row
up/down are interval objects (ex: c(days(2), days(1))
).
Interval objects are currently supported for dataframe only. (not
tbl_lazy
)
This implementation is inspired by spark's window API.
Implementation Details:
For dataframe input:
Iteration per row over the window is implemented using the versatile
slider
.
Application of a window aggregation can be optionally run in parallel
over multiple groups (see argument .by
) by setting a
future parallel backend. This
is implemented using furrr
package.
function subsumes regular usecases of mutate
For tbl_lazy
input:
Uses dbplyr::window_order
and dbplyr::window_frame
to translate to
partition_by
and window frame specification.
data.frame
or tbl_lazy
mutate
library("magrittr") # example 1 (simple case with dataframe) # Using iris dataset, # compute cumulative mean of column `Sepal.Length` # ordered by `Petal.Width` and `Sepal.Width` columns # grouped by `Petal.Length` column iris %>% tidier::mutate_(sl_mean = mean(Sepal.Length), .order_by = c("Petal.Width", "Sepal.Width"), .by = "Petal.Length", .frame = c(Inf, 0), ) %>% dplyr::slice_min(n = 3, Petal.Width, by = Species) # example 2 (detailed case with dataframe) # Using a sample airquality dataset, # compute mean temp over last seven days in the same month for every row set.seed(101) airquality %>% # create date column dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>% # create gaps by removing some days dplyr::slice_sample(prop = 0.8) %>% dplyr::arrange(date_col) %>% # compute mean temperature over last seven days in the same month tidier::mutate_(avg_temp_over_last_week = mean(Temp, na.rm = TRUE), .order_by = "Day", .by = "Month", .frame = c(lubridate::days(7), # 7 days before current row lubridate::days(-1) # do not include current row ), .index = "date_col" ) # example 3 airquality %>% # create date column as character dplyr::mutate(date_col = as.character(lubridate::make_date(1973, Month, Day)) ) %>% tibble::as_tibble() %>% # as `tbl_lazy` dbplyr::memdb_frame() %>% mutate_(avg_temp = mean(Temp), .by = "Month", .order_by = "date_col", .frame = c(3, 3) ) %>% dplyr::collect() %>% dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
library("magrittr") # example 1 (simple case with dataframe) # Using iris dataset, # compute cumulative mean of column `Sepal.Length` # ordered by `Petal.Width` and `Sepal.Width` columns # grouped by `Petal.Length` column iris %>% tidier::mutate_(sl_mean = mean(Sepal.Length), .order_by = c("Petal.Width", "Sepal.Width"), .by = "Petal.Length", .frame = c(Inf, 0), ) %>% dplyr::slice_min(n = 3, Petal.Width, by = Species) # example 2 (detailed case with dataframe) # Using a sample airquality dataset, # compute mean temp over last seven days in the same month for every row set.seed(101) airquality %>% # create date column dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>% # create gaps by removing some days dplyr::slice_sample(prop = 0.8) %>% dplyr::arrange(date_col) %>% # compute mean temperature over last seven days in the same month tidier::mutate_(avg_temp_over_last_week = mean(Temp, na.rm = TRUE), .order_by = "Day", .by = "Month", .frame = c(lubridate::days(7), # 7 days before current row lubridate::days(-1) # do not include current row ), .index = "date_col" ) # example 3 airquality %>% # create date column as character dplyr::mutate(date_col = as.character(lubridate::make_date(1973, Month, Day)) ) %>% tibble::as_tibble() %>% # as `tbl_lazy` dbplyr::memdb_frame() %>% mutate_(avg_temp = mean(Temp), .by = "Month", .order_by = "date_col", .frame = c(3, 3) ) %>% dplyr::collect() %>% dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
Remove non-list columns when same are present in a list column
remove_common_nested_columns(df, list_column)
remove_common_nested_columns(df, list_column)
df |
input dataframe |
list_column |
Name or expr of the column which is a list of named lists |
dataframe