Title: | Utilities to Retrieve Rulelists from Model Fits, Filter, Prune, Reorder and Predict on Unseen Data |
---|---|
Description: | Provides a framework to work with decision rules. Rules can be extracted from supported models, augmented with (custom) metrics using validation data, manipulated using standard dataframe operations, reordered and pruned based on a metric, and used to predict on unseen (test) data. Utilities include creating a rulelist manually, exporting a rulelist as a SQL case statement, and so on. The package offers two classes, rulelist and ruleset, both based on dataframe. |
Authors: | Srikanth Komala Sheshachala [aut, cre], Amith Kumar Ullur Raghavendra [aut] |
Maintainer: | Srikanth Komala Sheshachala <[email protected]> |
License: | GPL-3 |
Version: | 0.2.7 |
Built: | 2024-10-27 05:05:45 UTC |
Source: | https://github.com/talegari/tidyrules |
Convert a set of rules in a dataframe to a rulelist
## S3 method for class 'data.frame'
as_rulelist(x, keys = NULL, model_type = NULL, estimation_type, ...)
x | dataframe to be coerced to a rulelist |
keys | (character vector, default: NULL) column names which form the key |
model_type | (string, default: NULL) Name of the model which generated the rules |
estimation_type | (string) One among: 'regression', 'classification' |
... | currently unused |
Input dataframe should contain these columns: rule_nbr, LHS, RHS. Providing other inputs helps augment better.
rulelist object
rulelist, tidy, augment, predict, calculate, prune, reorder
rules_df = tidytable::tidytable(rule_nbr = 1:2,
                                LHS      = c("var_1 > 50", "var_2 < 30"),
                                RHS      = c(2, 1)
                                )
as_rulelist(rules_df, estimation_type = "regression")
Returns a ruleset object
as_ruleset(rulelist)
rulelist | A rulelist |
A ruleset
model_class_party = partykit::ctree(species ~ .,
                                    data = palmerpenguins::penguins
                                    )
as_ruleset(tidy(model_class_party))
augment is a re-export of generics::augment from the tidyrules package. See augment.rulelist.
augment(x, ...)
x | A rulelist |
... | For methods to use |
rulelist, tidy, augment, predict, calculate, prune, reorder
augment outputs a rulelist with an additional column named augmented_stats, based on summary statistics calculated using the attribute validation_data.
## S3 method for class 'rulelist'
augment(x, ...)
x | A rulelist |
... | (expressions) To be sent to tidytable::summarise for custom aggregations. See examples. |
The dataframe-column augmented_stats will have these columns, corresponding to the estimation_type:
For regression: support, IQR, RMSE
For classification: support, confidence, lift
along with custom aggregations.
A rulelist with a new dataframe-column named augmented_stats.
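As a rough illustration of the classification columns, the sketch below computes support, confidence and lift for a single rule in base R. It assumes the usual definitions (support as the number of covered rows, confidence as the fraction of covered rows whose outcome equals RHS, lift as confidence divided by the overall frequency of the RHS class); the data and the rule are made up, and the package's own computation may differ in details such as weighting.

```r
# Toy validation data (hypothetical values, not from the package)
validation_data = data.frame(
  flipper = c(210, 220, 190, 185, 215, 180),
  species = c("Gentoo", "Gentoo", "Adelie", "Adelie", "Adelie", "Gentoo")
)

# One classification rule: LHS is an R-parsable string, RHS the predicted class
lhs = "flipper > 203"
rhs = "Gentoo"

# Rows of the validation data covered by the rule
covered = eval(parse(text = lhs), envir = validation_data)

# support: number of covered rows
support = sum(covered)
# confidence: fraction of covered rows whose outcome equals RHS
confidence = mean(validation_data$species[covered] == rhs)
# lift: confidence relative to the overall frequency of the RHS class
lift = confidence / mean(validation_data$species == rhs)

c(support = support, confidence = confidence, lift = lift)
```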
rulelist, tidy, augment, predict, calculate, prune, reorder
# Examples for augment ------------------------------------------------------
library("magrittr")

# C5 ----
att = modeldata::attrition
set.seed(100)
train_index = sample(c(TRUE, FALSE), nrow(att), replace = TRUE)

model_c5 = C50::C5.0(Attrition ~., data = att[train_index, ], rules = TRUE)
tidy_c5 =
  model_c5 %>%
  tidy() %>%
  set_validation_data(att[!train_index, ], "Attrition")

tidy_c5

augment(tidy_c5) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()

# augment with custom aggregator
augment(tidy_c5, output_counts = list(table(Attrition))) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()

# rpart ----
set.seed(100)
train_index = sample(c(TRUE, FALSE), nrow(iris), replace = TRUE)

model_class_rpart = rpart::rpart(Species ~ ., data = iris[train_index, ])
tidy_class_rpart = tidy(model_class_rpart) %>%
  set_validation_data(iris[!train_index, ], "Species")
tidy_class_rpart

model_regr_rpart = rpart::rpart(Sepal.Length ~ ., data = iris[train_index, ])
tidy_regr_rpart = tidy(model_regr_rpart) %>%
  set_validation_data(iris[!train_index, ], "Sepal.Length")
tidy_regr_rpart

# augment (classification case)
augment(tidy_class_rpart) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()

# augment (regression case)
augment(tidy_regr_rpart) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()

# party ----
pen = palmerpenguins::penguins %>%
  tidytable::drop_na(bill_length_mm)
set.seed(100)
train_index = sample(c(TRUE, FALSE), nrow(pen), replace = TRUE)

model_class_party = partykit::ctree(species ~ ., data = pen[train_index, ])
tidy_class_party = tidy(model_class_party) %>%
  set_validation_data(pen[!train_index, ], "species")
tidy_class_party

model_regr_party = partykit::ctree(bill_length_mm ~ ., data = pen[train_index, ])
tidy_regr_party = tidy(model_regr_party) %>%
  set_validation_data(pen[!train_index, ], "bill_length_mm")
tidy_regr_party

# augment (classification case)
augment(tidy_class_party) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()

# augment (regression case)
augment(tidy_regr_party) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()

# cubist ----
att = modeldata::attrition
set.seed(100)
train_index = sample(c(TRUE, FALSE), nrow(att), replace = TRUE)
cols_att = setdiff(colnames(att), c("MonthlyIncome", "Attrition"))

model_cubist = Cubist::cubist(x = att[train_index, cols_att],
                              y = att[train_index, "MonthlyIncome"]
                              )

tidy_cubist = tidy(model_cubist) %>%
  set_validation_data(att[!train_index, ], "MonthlyIncome")
tidy_cubist

augment(tidy_cubist) %>%
  tidytable::unnest(augmented_stats, names_sep = "__") %>%
  tidytable::glimpse()
calculate metrics for a rulelist
Computes some metrics (based on estimation_type) in cumulative window function style over the rulelist (in the same order), ignoring the keys.
## S3 method for class 'rulelist'
calculate(x, metrics_to_exclude = NULL, ...)
x | A rulelist |
metrics_to_exclude | (character vector) Names of metrics to exclude |
... | Named list of custom metrics. See 'details'. |
These metrics are calculated by default:
cumulative_coverage: For the nth rule in the rulelist, the number of distinct row_nbrs (of new_data) covered by the nth and all preceding rules (in order). In the weighted case, we sum the weights corresponding to the distinct row_nbrs.
cumulative_overlap: Up to the nth rule in the rulelist, the number of distinct row_nbrs (of new_data) already covered by some preceding rule (in order). In the weighted case, we sum the weights corresponding to the distinct row_nbrs.
For classification:
cumulative_accuracy: For the nth rule in the rulelist, the fraction of row_nbrs such that RHS matches the y_name column (of new_data) by the nth and all preceding rules (in order). In the weighted case, weighted accuracy is computed.
For regression:
cumulative_RMSE: For the nth rule in the rulelist, the weighted RMSE of all predictions (RHS) predicted by the nth rule and all preceding rules.
Custom metrics to be computed should be passed as a named list of function(s) in '...'. The custom metric function should take these arguments in the same order: rulelist, new_data, y_name, weight. The custom metric function should return a numeric vector of the same length as the number of rows of the rulelist.
A dataframe of metrics with a rule_nbr column.
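The cumulative definitions can be made concrete with a small base R sketch. It works on a hypothetical logical coverage matrix (rows are row_nbr, columns are rules in rulelist order) and covers only the unweighted case. The reading of cumulative_overlap used here (rows covered by at least two of the first n rules) is an assumption about the definition, not a statement about the package's internals.

```r
# Hypothetical coverage matrix: rows are row_nbr, columns are rules in
# rulelist order. TRUE means the rule's LHS holds for that row.
covers = cbind(
  rule_1 = c(TRUE,  TRUE,  FALSE, FALSE, FALSE),
  rule_2 = c(FALSE, TRUE,  TRUE,  FALSE, FALSE),
  rule_3 = c(FALSE, FALSE, TRUE,  TRUE,  FALSE)
)

n_rules = ncol(covers)

# cumulative_coverage: distinct rows covered by rule n or any preceding rule
cumulative_coverage = sapply(seq_len(n_rules), function(n) {
  sum(rowSums(covers[, seq_len(n), drop = FALSE]) >= 1)
})

# cumulative_overlap (one reading of the definition): distinct rows covered
# by at least two of the first n rules
cumulative_overlap = sapply(seq_len(n_rules), function(n) {
  sum(rowSums(covers[, seq_len(n), drop = FALSE]) >= 2)
})

data.frame(rule_nbr = seq_len(n_rules), cumulative_coverage, cumulative_overlap)
```

In the weighted case described above, the counts would be replaced by sums of the per-row weights over the same sets of rows.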
rulelist, tidy, augment, predict, calculate, prune, reorder
library("magrittr")
model_c5 = C50::C5.0(Attrition ~., data = modeldata::attrition, rules = TRUE)
tidy_c5 = tidy(model_c5) %>%
  set_validation_data(modeldata::attrition, "Attrition") %>%
  set_keys(NULL)

# calculate default metrics (classification)
calculate(tidy_c5)

model_rpart = rpart::rpart(MonthlyIncome ~., data = modeldata::attrition)
tidy_rpart = tidy(model_rpart) %>%
  set_validation_data(modeldata::attrition, "MonthlyIncome") %>%
  set_keys(NULL)

# calculate default metrics (regression)
calculate(tidy_rpart)

# calculate default metrics with a custom metric
# custom function to get cumulative MAE
library("tidytable")
get_cumulative_MAE = function(rulelist, new_data, y_name, weight){

  # position of each rule in the rulelist
  priority_df =
    rulelist %>%
    select(rule_nbr) %>%
    mutate(priority = 1:nrow(rulelist)) %>%
    select(rule_nbr, priority)

  # rule predicted for each row, along with its priority and weight
  pred_df =
    predict(rulelist, new_data) %>%
    left_join(priority_df, by = "rule_nbr") %>%
    mutate(weight = local(weight)) %>%
    select(rule_nbr, row_nbr, weight, priority)

  new_data2 =
    new_data %>%
    mutate(row_nbr = 1:n()) %>%
    select(all_of(c("row_nbr", y_name)))

  # MAE over rows predicted by the first 'rn' rules
  mae_till_rule = function(rn){

    if (is.character(rulelist$RHS)) {
      # RHS is an expression string: evaluate it per row
      inter_df =
        pred_df %>%
        tidytable::filter(priority <= rn) %>%
        left_join(mutate(new_data, row_nbr = 1:n()), by = "row_nbr") %>%
        left_join(select(rulelist, rule_nbr, RHS), by = "rule_nbr") %>%
        nest(.by = c("RHS", "rule_nbr", "row_nbr", "priority", "weight")) %>%
        mutate(RHS = purrr::map2_dbl(RHS,
                                     data,
                                     ~ eval(parse(text = .x), envir = .y)
                                     )
               ) %>%
        unnest(data)
    } else {
      inter_df =
        pred_df %>%
        tidytable::filter(priority <= rn) %>%
        left_join(new_data2, by = "row_nbr") %>%
        left_join(select(rulelist, rule_nbr, RHS), by = "rule_nbr")
    }

    inter_df %>%
      summarise(mae = MetricsWeighted::mae(RHS,
                                           .data[[y_name]],
                                           weight,
                                           na.rm = TRUE
                                           )
                ) %>%
      `[[`("mae")
  }

  res = purrr::map_dbl(1:nrow(rulelist), mae_till_rule)
  return(res)
}

calculate(tidy_rpart,
          metrics_to_exclude = NULL,
          list("cumulative_mae" = get_cumulative_MAE)
          )
Convert an R parsable rule to a python/sql parsable rule
convert_rule_flavor(rule, flavor)
rule | (chr vector) R parsable rule(s) |
flavor | (string) One among: 'python', 'sql' |
(chr vector) of rules
rulelist, tidy, augment, predict, to_sql_case
Other Auxiliary Rulelist Utility: to_sql_case()
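To give a feel for what converting a rule between flavors involves, here is a deliberately naive base R sketch. The helper r_rule_to_sql is hypothetical and is not the package's implementation: it only rewrites a handful of operators with gsub and would also alter `&` or `|` characters that appear inside quoted values.

```r
# Hypothetical helper (not the package's implementation): rewrite a few
# R operators into their SQL counterparts using pattern replacement.
r_rule_to_sql = function(rule) {
  rule = gsub("%in% c\\(", "IN (", rule)        # %in% c(...)  ->  IN (...)
  rule = gsub("==", "=", rule, fixed = TRUE)     # equality
  rule = gsub("&", "AND", rule, fixed = TRUE)    # logical and
  rule = gsub("|", "OR", rule, fixed = TRUE)     # logical or
  rule
}

sql_rule = r_rule_to_sql(
  "( island %in% c('Biscoe') ) & ( flipper_length_mm > 203 )"
)
sql_rule
```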
tidyrules
The tidyrules package provides a framework to work with decision rules. Rules can be extracted from supported models using tidy, augmented using validation data by augment, and manipulated using standard dataframe operations; (modified) rulelists can be used to predict on unseen (test) data. Utilities include creating a rulelist manually (as_rulelist), exporting a rulelist to SQL (to_sql_case), and so on. The package offers two classes, rulelist and ruleset, both based on dataframe.
Maintainer: Srikanth Komala Sheshachala [email protected]
Authors:
Amith Kumar Ullur Raghavendra [email protected]
rulelist, tidy, augment, predict
Plot method for prune_rulelist class
## S3 method for class 'prune_rulelist'
plot(x, ...)
x | A 'prune_rulelist' object |
... | unused |
ggplot2 object (invisibly)
Plots a heatmap with rule_nbrs on the x-side and clusters of row_nbrs on the y-side of a binary matrix with 1 if a rule is applicable for a row.
## S3 method for class 'rulelist'
plot(x, thres_cluster_rows = 1000, dist_metric = "jaccard", ...)
x | A rulelist |
thres_cluster_rows | (positive integer) Maximum number of rows beyond which an x-side dendrogram is not computed |
dist_metric | (string or function, default: "jaccard") Distance metric for y-side ( |
... | Arguments to be passed to pheatmap::pheatmap |
Number of clusters is set to min(number of unique rows in the row_nbr X rule_nbr matrix, thres_cluster_rows).
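The binary matrix behind the heatmap and the cluster-count formula above can be sketched in base R. The matrix values are made up, and stats::dist(method = "binary") is used only as a base R stand-in for a Jaccard distance on 0/1 data; which distance routine the package calls is not stated here.

```r
# Hypothetical binary matrix: rows are row_nbr, columns are rule_nbr;
# 1 means the rule is applicable for the row
rule_matrix = rbind(
  c(1, 0, 0),
  c(1, 0, 0),
  c(1, 1, 0),
  c(0, 1, 1),
  c(0, 0, 1)
)
dimnames(rule_matrix) = list(paste0("row_", 1:5), paste0("rule_", 1:3))

# Number of distinct coverage patterns among the rows
n_unique_rows = nrow(unique(rule_matrix))

# Upper bound on the number of clusters, following the formula above
thres_cluster_rows = 1000
n_clusters = min(n_unique_rows, thres_cluster_rows)

# For 0/1 data, stats::dist(method = "binary") is the Jaccard distance
jaccard = as.matrix(dist(rule_matrix, method = "binary"))
round(jaccard, 2)
```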
library("magrittr")
att = modeldata::attrition
tidy_c5 =
  C50::C5.0(Attrition ~., data = att, rules = TRUE) %>%
  tidy() %>%
  set_validation_data(att, "Attrition") %>%
  set_keys(NULL)

plot(tidy_c5)
predict method for a rulelist
Predicts the rule_nbr applicable (as per the order in the rulelist) for a row_nbr (per key) in new_data
## S3 method for class 'rulelist'
predict(object, new_data, multiple = FALSE, ...)
object | A rulelist |
new_data | (dataframe) |
multiple | (flag, default: FALSE) Whether to output all rule numbers applicable for a row. If FALSE, the first satisfying rule is provided. |
... | unused |
If a row_nbr is covered by more than one rule_nbr per 'keys', then the rule_nbr appearing earlier (as in the row order of the rulelist) takes precedence.
When multiple is FALSE (default), the output is a dataframe with three or more columns: row_nbr (int), columns corresponding to 'keys', rule_nbr (int).
When multiple is TRUE, the output is a dataframe with three or more columns: row_nbr (int), columns corresponding to 'keys', rule_nbr (list column of integers).
If a row number and 'keys' combination is not covered by any rule, then the rule_nbr column has a missing value.
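The precedence rule can be illustrated with a base R sketch that evaluates each LHS string against the data and picks the first applicable rule per row. The rules and data are toy values, keys are ignored, and this is only an analogue of the behaviour described above, not the package's implementation.

```r
# Hypothetical rulelist and new data (toy values)
rules = data.frame(
  rule_nbr = 1:3,
  LHS = c("x > 50", "x > 20", "y < 5"),
  stringsAsFactors = FALSE
)
new_data = data.frame(x = c(60, 30, 10, 10), y = c(1, 9, 2, 8))

# Logical matrix: one row per row_nbr, one column per rule (in rulelist order)
applicable = sapply(rules$LHS, function(lhs) {
  eval(parse(text = lhs), envir = new_data)
})

# multiple = FALSE analogue: first applicable rule per row, NA when none applies
first_rule = apply(applicable, 1, function(hit) {
  if (any(hit)) rules$rule_nbr[which(hit)[1]] else NA_integer_
})

# multiple = TRUE analogue: every applicable rule per row, as a list
all_rules = lapply(seq_len(nrow(new_data)), function(i) {
  rules$rule_nbr[applicable[i, ]]
})

data.frame(row_nbr = seq_len(nrow(new_data)), rule_nbr = first_rule)
```

The first row satisfies all three rules but is assigned rule 1 because it appears first; the last row satisfies none and gets a missing value.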
A dataframe. See Details.
rulelist, tidy, augment, predict, calculate, prune, reorder
model_c5 = C50::C5.0(species ~.,
                     data = palmerpenguins::penguins,
                     trials = 5,
                     rules = TRUE
                     )
tidy_c5 = tidy(model_c5)
tidy_c5

output_1 = predict(tidy_c5, palmerpenguins::penguins)
output_1 # different rules per 'keys' (`trial_nbr` here)

output_2 = predict(tidy_c5, palmerpenguins::penguins, multiple = TRUE)
output_2 # `rule_nbr` is a list-column of integer vectors
predict method for a ruleset
Predicts multiple rule_nbr(s) applicable for a row_nbr (per key) in new_data
## S3 method for class 'ruleset'
predict(object, new_data, ...)
object | A ruleset |
new_data | (dataframe) |
... | unused |
A dataframe with three or more columns: row_nbr (int), columns corresponding to 'keys', rule_nbr (list column of integers). If a row number and 'keys' combination is not covered by any rule, then the rule_nbr column has a missing value.
model_c5 = C50::C5.0(species ~.,
                     data = palmerpenguins::penguins,
                     trials = 5,
                     rules = TRUE
                     )
tidy_c5_ruleset = as_ruleset(tidy(model_c5))
tidy_c5_ruleset

predict(tidy_c5_ruleset, palmerpenguins::penguins)
Print method for prune_rulelist class
## S3 method for class 'prune_rulelist'
print(x, ...)
x | A 'prune_rulelist' object |
... | unused |
Prints rulelist attributes and first few rows.
## S3 method for class 'rulelist'
print(x, banner = TRUE, ...)
x | A rulelist object |
banner | (flag, default: |
... | Passed to |
input rulelist (invisibly)
rulelist, tidy, augment, predict, calculate, prune, reorder
Prints the ruleset object
## S3 method for class 'ruleset'
print(x, banner = TRUE, ...)
x | A rulelist |
banner | (flag, default: |
... | Passed to |
(invisibly) Returns the ruleset object
model_class_party = partykit::ctree(species ~ .,
                                    data = palmerpenguins::penguins
                                    )
as_ruleset(tidy(model_class_party))
prune is a re-export of generics::prune from the tidyrules package. See prune.rulelist.
prune(tree, ...)
tree | A rulelist |
... | See prune.rulelist |
rulelist, tidy, augment, predict, calculate, prune, reorder
prune rules of a rulelist
Prune the rulelist by suggesting to keep the first 'k' rules based on metrics computed by calculate
## S3 method for class 'rulelist'
prune(
  tree,
  metrics_to_exclude = NULL,
  stop_expr_string = "relative__cumulative_coverage >= 0.9",
  min_n_rules = 1,
  ...
)
tree | A rulelist |
metrics_to_exclude | (character vector or NULL) Names of metrics not to be calculated. See calculate for the list of default metrics. |
stop_expr_string | (string, default: "relative__cumulative_coverage >= 0.9") Parsable condition |
min_n_rules | (positive integer) Minimum number of rules to keep |
... | Named list of custom metrics passed to calculate |
1. Metrics are computed using calculate.
2. Relative metrics (prepended by 'relative__') are calculated by dividing each metric by its max value.
3. The first rule in rulelist order which meets the 'stop_expr_string' criteria is stored (say 'pos'). The print method suggests keeping rules until pos.
Object of class 'prune_rulelist' with these components:
1. pruned: ruleset keeping only the first 'pos' rows.
2. n_pruned_rules: pos. If the stop criteria is never met, then pos = nrow(ruleset).
3. n_total_rules: nrow(ruleset).
4. metrics_df: Dataframe with metrics and relative metrics.
5. stop_expr_string
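The pruning steps can be traced by hand on a small table of metrics. The sketch below uses made-up cumulative coverage values and plain base R; treating min_n_rules as a lower bound on the suggested position is an assumption, since its exact handling is not spelled out here.

```r
# Hypothetical metrics, in rulelist order (as calculate might return them)
metrics_df = data.frame(
  rule_nbr            = 1:5,
  cumulative_coverage = c(40, 70, 85, 95, 100)
)

# Relative metric: each value divided by the metric's maximum
metrics_df$relative__cumulative_coverage =
  metrics_df$cumulative_coverage / max(metrics_df$cumulative_coverage)

stop_expr_string = "relative__cumulative_coverage >= 0.9"
min_n_rules      = 1

# Evaluate the parsable condition row by row against the metrics
met = eval(parse(text = stop_expr_string), envir = metrics_df)

# First position meeting the criterion; all rules are kept if it is never met
pos = if (any(met)) which(met)[1] else nrow(metrics_df)
# Assumption: min_n_rules acts as a lower bound on the suggested position
pos = max(pos, min_n_rules)

pos
```

With these values the condition first holds at the fourth rule, so the suggestion would be to keep rules 1 to 4.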
rulelist, tidy, augment, predict, calculate, prune, reorder
library("magrittr")
model_c5 = C50::C5.0(Attrition ~., data = modeldata::attrition, rules = TRUE)
tidy_c5 = tidy(model_c5) %>%
  set_validation_data(modeldata::attrition, "Attrition") %>%
  set_keys(NULL)

# prune with defaults
prune_obj = prune(tidy_c5)
# note that all other metrics are visible in the print output
prune_obj
plot(prune_obj)
prune_obj$pruned

# prune with a different stop_expr_string threshold
prune_obj = prune(tidy_c5,
                  stop_expr_string = "relative__cumulative_coverage >= 0.2"
                  )
prune_obj # as expected, has fewer than 10 rules as compared to default args
plot(prune_obj)
prune_obj$pruned

# prune with a different stop_expr_string metric
st = "relative__cumulative_overlap <= 0.7 & relative__cumulative_overlap > 0"
prune_obj = prune(tidy_c5, stop_expr_string = st)
prune_obj # as expected, has fewer than 10 rules as compared to default args
plot(prune_obj)
prune_obj$pruned
reorder generic for rulelist
reorder(x, ...)
x | A rulelist |
... | See reorder.rulelist |
rulelist, tidy, augment, predict, calculate, prune, reorder
Implements a greedy strategy to add one rule at a time which maximizes/minimizes a metric.
## S3 method for class 'rulelist'
reorder(x, metric = "cumulative_coverage", minimize = FALSE, init = NULL, ...)
x | A rulelist |
metric | (character vector or named list) Name of metrics or a custom function(s). See calculate. The 'n+1'th metric is used when there is a match at 'nth' level, similar to base::order. If there is a match at final level, row order of the rulelist comes into play. |
minimize | (logical vector) Whether to minimize. Either TRUE/FALSE or a logical vector of same length as metric |
init | (positive integer) Initial number of rows after which reordering should begin |
... | passed to calculate |
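A greedy strategy of this kind can be sketched in base R for the default metric. The function greedy_order below is hypothetical: it works on a made-up logical coverage matrix, maximizes unweighted cumulative coverage only, breaks ties by original rule order, and ignores the init argument and secondary metrics.

```r
# Hypothetical coverage matrix: rows are row_nbr, columns are rules
covers = cbind(
  rule_1 = c(TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE),
  rule_2 = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE),
  rule_3 = c(FALSE, FALSE, TRUE,  TRUE,  TRUE,  TRUE)
)

greedy_order = function(covers) {
  chosen    = integer(0)
  remaining = seq_len(ncol(covers))
  covered   = rep(FALSE, nrow(covers))
  while (length(remaining) > 0) {
    # Cumulative coverage obtained by appending each remaining rule
    gain = sapply(remaining, function(j) sum(covered | covers[, j]))
    # which.max returns the first maximum, so ties keep the original order
    best      = remaining[which.max(gain)]
    chosen    = c(chosen, best)
    covered   = covered | covers[, best]
    remaining = setdiff(remaining, best)
  }
  chosen
}

new_order = greedy_order(covers)
new_order
```

Here rule 3 is picked first because it covers the most rows on its own, then rule 2, which adds two uncovered rows, and finally rule 1, which adds none.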
rulelist, tidy, augment, predict, calculate, prune, reorder
library("magrittr")
att = modeldata::attrition
tidy_c5 =
  C50::C5.0(Attrition ~., data = att, rules = TRUE) %>%
  tidy() %>%
  set_validation_data(att, "Attrition") %>%
  set_keys(NULL) %>%
  head(5)

# with defaults
reorder(tidy_c5)

# use 'cumulative_overlap' to break ties (if any)
reorder(tidy_c5, metric = c("cumulative_coverage", "cumulative_overlap"))

# reorder after 2 rules
reorder(tidy_c5, init = 2)
A rulelist is an ordered list of rules stored as a dataframe. Each row specifies a rule (LHS), the expected outcome (RHS) and some other details.
It has these mandatory columns:
rule_nbr: (integer vector) Rule number
LHS: (character vector) A rule is a string that can be parsed using base::parse()
RHS: (character vector or a literal)
| rule_nbr|LHS |RHS | support| confidence| lift|
|--------:|:--------------------------------------------------------------------|:---------|-------:|----------:|--------:|
| 1|( island %in% c('Biscoe') ) & ( flipper_length_mm > 203 ) |Gentoo | 122| 1.0000000| 2.774193|
| 2|( island %in% c('Biscoe') ) & ( flipper_length_mm <= 203 ) |Adelie | 46| 0.9565217| 2.164760|
| 3|( island %in% c('Dream', 'Torgersen') ) & ( bill_length_mm > 44.1 ) |Chinstrap | 65| 0.9538462| 4.825339|
| 4|( island %in% c('Dream', 'Torgersen') ) & ( bill_length_mm <= 44.1 ) |Adelie | 111| 0.9459459| 2.140825|
A rulelist can be created using tidy() on some supported model fits (run: utils::methods(tidy)). It can also be created manually from an existing dataframe using as_rulelist.
Columns identified as 'keys' along with rule_nbr form a unique combination, that is, a group of rules. For example, a rule-based C5 model with multiple trials creates rules per trial_nbr. The predict method understands 'keys' and thereby provides/predicts a rule number (for each row in new data / test data) within the same trial_nbr.
A rulelist has these mandatory attributes:
estimation_type: One among regression, classification
A rulelist has these optional attributes:
keys: (character vector) Names of the columns that form a key.
model_type: (string) Name of the model
A few methods like augment, calculate, prune, reorder require a few additional attributes, which can be set using set_validation_data.
Predict: Given a dataframe (possibly without a dependent variable column aka 'test data'), predicts the first rule (as ordered in the rulelist) per 'keys' that is applicable for each row. When multiple = TRUE, returns all rules applicable for a row (per key).
Augment: Outputs summary statistics per rule over validation data and returns a rulelist with a new dataframe-column.
Calculate: Computes metrics for a rulelist in a cumulative manner such as cumulative_coverage, cumulative_overlap, cumulative_accuracy.
Prune: Suggests pruning a rulelist such that some expectations are met (based on metrics). Example: cumulative_coverage of 80% can be met with the first few rules.
Reorder: Reorders a rulelist in order to maximize a metric.
Rulelists are essentially dataframes. Hence, any dataframe operations which preferably preserve attributes will output a rulelist. as_rulelist and as.data.frame will help in moving back and forth between rulelist and dataframe worlds.
as_rulelist: Create a rulelist from a dataframe with some mandatory columns.
set_keys: Set or unset 'keys' of a rulelist.
to_sql_case: Outputs a SQL case statement for a rulelist.
convert_rule_flavor: Converts R-parsable rule strings to python/SQL parsable rule strings.
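As an illustration of what a SQL case statement for a rulelist could look like, the base R sketch below assembles one from toy rules that are already written in SQL-parsable form. It does not call to_sql_case, and the exact layout that function produces may differ.

```r
# Hypothetical rules already written in SQL-parsable form (toy values)
rules = data.frame(
  LHS = c("island IN ('Biscoe') AND flipper_length_mm > 203",
          "island IN ('Biscoe') AND flipper_length_mm <= 203"),
  RHS = c("Gentoo", "Adelie"),
  stringsAsFactors = FALSE
)

# One WHEN ... THEN ... branch per rule, in rulelist order, so that the
# first matching rule wins (as in a SQL CASE expression)
branches = sprintf("  WHEN %s THEN '%s'", rules$LHS, rules$RHS)
sql_case = paste(c("CASE", branches, "  ELSE NULL", "END"), collapse = "\n")

cat(sql_case, "\n")
```

Because a SQL CASE expression returns the first matching branch, emitting the branches in rulelist order mirrors the first-applicable-rule behaviour of predict.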
rulelist, tidy, augment, predict, calculate, prune, reorder
'keys' are a set of column(s) which identify a group of rules in a rulelist. Methods like predict, augment produce output per key combination.
set_keys(x, keys, reset = FALSE)
x | A rulelist |
keys | (character vector or NULL) |
reset | (flag) Whether to reset the keys to sequential numbers starting with 1 when |
A new rulelist is returned with the attribute keys modified. The input rulelist object is unaltered.
A rulelist object
rulelist, tidy, augment, predict, calculate, prune, reorder
Other Core Rulelist Utility: set_validation_data()
model_c5 = C50::C5.0(Attrition ~., data = modeldata::attrition, rules = TRUE)
tidy_c5 = tidy(model_c5)
tidy_c5 # keys are: "trial_nbr"

tidy_c5[["rule_nbr"]] = 1:nrow(tidy_c5)
new_tidy_c5 = set_keys(tidy_c5, NULL) # remove all keys
new_tidy_c5

new_2_tidy_c5 = set_keys(new_tidy_c5, "trial_nbr") # set "trial_nbr" as key
new_2_tidy_c5

# Note that `tidy_c5` and `new_tidy_c5` are not altered.
tidy_c5
new_tidy_c5
Add validation_data to a rulelist
Returns a rulelist with three new attributes set: validation_data, y_name and weight. Methods such as augment, calculate, prune, reorder require these to be set.
set_validation_data(x, validation_data, y_name, weight = 1)
x | A rulelist |
validation_data | (dataframe) Data to be used for computing some metrics. It is expected to contain |
y_name | (string) Name of the dependent variable column. |
weight | (non-negative numeric vector, default: 1) Weight per observation/row of |
A rulelist with some extra attributes set.
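The idea of returning a copy with extra attributes, while leaving the input untouched, can be shown with plain dataframes in base R. The function set_validation_sketch is a hypothetical stand-in that mimics the idea only; it performs none of the validation the real function may do.

```r
# A plain dataframe standing in for a rulelist (hypothetical toy object)
rules = data.frame(rule_nbr = 1:2, LHS = c("x > 1", "x <= 1"), RHS = c(1, 0))

# Hypothetical setter: attach the three attributes to a copy and return it
set_validation_sketch = function(x, validation_data, y_name, weight = 1) {
  attr(x, "validation_data") = validation_data
  attr(x, "y_name")          = y_name
  attr(x, "weight")          = weight
  x
}

validation = data.frame(x = c(0, 2, 3), y = c(0, 1, 1))
rules_2 = set_validation_sketch(rules, validation, y_name = "y")

# The returned object carries the attributes; the input is untouched because
# R copies an object when it is modified inside a function
attr(rules_2, "y_name")
is.null(attr(rules, "y_name"))
```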
rulelist, tidy, augment, predict, calculate, prune, reorder
Other Core Rulelist Utility: set_keys()
att = modeldata::attrition
set.seed(100)
index = sample(c(TRUE, FALSE), nrow(att), replace = TRUE)
model_c5 = C50::C5.0(Attrition ~., data = att[index, ], rules = TRUE)

tidy_c5 = tidy(model_c5)
tidy_c5

tidy_c5_2 = set_validation_data(tidy_c5,
                                validation_data = att[!index, ],
                                y_name = "Attrition",
                                weight = 1 # default
                                )
tidy_c5_2
tidy_c5 # not altered
tidy
is re-export of generics::tidy from
tidyrules packagetidy
applied on a supported model fit creates a rulelist.
See Also section links to documentation of specific methods.
tidy(x, ...)
tidy(x, ...)
x |
A supported model object |
... |
For model specific implementations to use |
rulelist, tidy, augment, predict, calculate, prune, reorder
Other Core Tidy Utility: tidy.C5.0(), tidy.cubist(), tidy.rpart()
Each row corresponds to a rule per trial_nbr
## S3 method for class 'C5.0'
tidy(x, ...)
x |
C50::C5.0 model fitted with rules = TRUE |
... |
Other arguments (See details) |
The output columns are: rule_nbr, trial_nbr, LHS, RHS, support, confidence, lift.
Rules per trial_nbr are sorted in this order: desc(confidence), desc(lift), desc(support).
Optional named arguments: laplace (flag, default: TRUE) is supported. This computes confidence with Laplace correction as documented under 'Rulesets' here: C5 doc.
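The Laplace correction mentioned above can be illustrated in base R. This is a minimal sketch and not the package's code: it assumes the Laplace ratio (n - m + 1) / (n + 2) described for rulesets in the C5.0 documentation, where n is the number of training cases covered by a rule and m is the number of covered cases that do not belong to the class predicted by the rule. The helper names laplace_confidence and raw_confidence are hypothetical.

```r
# Hypothetical helpers for illustration; not part of tidyrules.
# n: number of training cases covered by the rule
# m: number of covered cases that do not belong to the predicted class
laplace_confidence = function(n, m) {
  stopifnot(all(n >= 0), all(m >= 0), all(m <= n))
  (n - m + 1) / (n + 2)
}

# Uncorrected confidence, for comparison
raw_confidence = function(n, m) (n - m) / n

laplace_confidence(n = 10, m = 2)  # 0.75
raw_confidence(n = 10, m = 2)      # 0.8
```

The correction pulls the estimate toward 0.5, and the pull is strongest for rules that cover few cases.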
A rulelist object
rulelist, tidy, augment, predict, calculate, prune, reorder
Other Core Tidy Utility: tidy(), tidy.cubist(), tidy.rpart()
model_c5 = C50::C5.0(Attrition ~., data = modeldata::attrition, rules = TRUE)
tidy(model_c5)
Each row corresponds to a rule
## S3 method for class 'constparty'
tidy(x, ...)
x |
partykit::party model typically built using partykit::ctree |
... |
Other arguments (currently unused) |
These types of party models are supported: regression (y is numeric), classification (y is factor).
For party classification model:
Output columns are: rule_nbr, LHS, RHS, support, confidence, lift, terminal_node_id.
Rules are sorted in this order: desc(confidence), desc(lift), desc(support).
For party regression model:
Output columns are: rule_nbr, LHS, RHS, support, IQR, RMSE, terminal_node_id.
Rules are sorted in this order: RMSE, desc(support).
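The two sort orders described above can be reproduced on a plain data frame with base R's order(). This is a sketch with made-up metric values, not the package's internal code. A real rulelist would carry either the classification metrics or the regression metrics, not both. They share one toy table here only to keep the example short.

```r
# Toy rule metrics; all values are made up for illustration.
rules = data.frame(
  rule_nbr   = 1:4,
  support    = c(40, 25, 30, 10),
  confidence = c(0.90, 0.95, 0.95, 0.90),
  lift       = c(1.2, 1.5, 1.5, 1.1),
  RMSE       = c(1.0, 1.5, 1.5, 3.0)
)

# Classification order: desc(confidence), desc(lift), desc(support)
class_order = order(-rules$confidence, -rules$lift, -rules$support)
rules[class_order, c("rule_nbr", "confidence", "lift", "support")]

# Regression order: RMSE (ascending), desc(support)
regr_order = order(rules$RMSE, -rules$support)
rules[regr_order, c("rule_nbr", "RMSE", "support")]
```

Rules 2 and 3 tie on confidence and lift (and on RMSE), so support breaks the tie in both orderings.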
A rulelist object
rulelist, tidy, augment, predict, calculate, prune, reorder
pen = palmerpenguins::penguins
model_class_party = partykit::ctree(species ~ ., data = pen)
tidy(model_class_party)
model_regr_party = partykit::ctree(bill_length_mm ~ ., data = pen)
tidy(model_regr_party)
Each row corresponds to a rule per committee
## S3 method for class 'cubist'
tidy(x, ...)
x |
Cubist::cubist model |
... |
Other arguments (currently unused) |
The output columns are: rule_nbr, committee, LHS, RHS, support, mean, min, max, error.
Rules are sorted in this order per committee: error, desc(support).
A rulelist object
rulelist, tidy, augment, predict, calculate, prune, reorder
Other Core Tidy Utility: tidy(), tidy.C5.0(), tidy.rpart()
att = modeldata::attrition
cols_att = setdiff(colnames(att), c("MonthlyIncome", "Attrition"))
model_cubist = Cubist::cubist(x = att[, cols_att],
                              y = att[["MonthlyIncome"]]
                              )
tidy(model_cubist)
Each row corresponds to a rule
## S3 method for class 'rpart'
tidy(x, ...)
x |
rpart::rpart model |
... |
Other arguments (currently unused) |
For rpart rules, build the model without ordered factor variables. We recommend converting ordered factors to factor or integer class before fitting.
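The recommended conversion can be done in base R before fitting the model. This is a small sketch on made-up data. The data frame and its column names are hypothetical.

```r
# Toy data with an ordered factor column (values are made up).
df = data.frame(
  size = factor(c("low", "high", "mid", "low"),
                levels = c("low", "mid", "high"),
                ordered = TRUE),
  y    = c(1.2, 3.4, 2.2, 0.9)
)

# Option 1: drop the ordering but keep the same levels
df$size_factor = factor(df$size, ordered = FALSE)

# Option 2: keep the ordering as integer codes (low = 1, mid = 2, high = 3)
df$size_int = as.integer(df$size)

is.ordered(df$size_factor)  # FALSE
df$size_int                 # 1 3 2 1
```

Option 1 treats the variable as nominal, while option 2 preserves the order and lets the tree split on numeric thresholds.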
For rpart::rpart classification model:
Output columns are: rule_nbr, LHS, RHS, support, confidence, lift.
The rules are sorted in this order: desc(confidence), desc(lift), desc(support).
For rpart::rpart regression (anova) model:
Output columns are: rule_nbr, LHS, RHS, support.
The rules are sorted in this order: desc(support).
A rulelist object
rulelist, tidy, augment, predict, calculate, prune, reorder
Other Core Tidy Utility: tidy(), tidy.C5.0(), tidy.cubist()
model_class_rpart = rpart::rpart(Species ~ ., data = iris)
tidy(model_class_rpart)
model_regr_rpart = rpart::rpart(Sepal.Length ~ ., data = iris)
tidy(model_regr_rpart)
Extract SQL case statement from a rulelist
to_sql_case(rulelist, rhs_column_name = "RHS", output_colname = "output")
rulelist |
A rulelist object |
rhs_column_name |
(string, default: "RHS") Name of the column in the rulelist to be used as RHS (WHEN some_rule THEN rhs) in the SQL case statement |
output_colname |
(string, default: "output") Name of the output column created by the SQL statement (used in case ... AS output_column) |
As a side effect, the SQL statement is printed to stdout using cat. The output contains newline characters.
(string, returned invisibly) SQL case statement
rulelist, tidy, augment, predict, convert_rule_flavor
Other Auxiliary Rulelist Utility:
convert_rule_flavor()
model_c5 = C50::C5.0(Attrition ~., data = modeldata::attrition, rules = TRUE)
tidy(model_c5)
to_sql_case(tidy(model_c5))
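To make the shape of the result concrete, the sketch below builds a CASE statement from a plain data frame of rules in base R. It is not the tidyrules implementation, and the exact text produced by to_sql_case may differ. The helper sketch_sql_case is hypothetical, the ELSE NULL branch is an assumption, and the LHS strings are assumed to already be valid SQL predicates.

```r
# Toy rules; LHS strings are assumed to be valid SQL predicates already.
rules = data.frame(
  rule_nbr = 1:2,
  LHS      = c("var_1 > 50", "var_2 < 30"),
  RHS      = c(2, 1)
)

# Hypothetical helper for illustration; not part of tidyrules.
sketch_sql_case = function(rules,
                           rhs_column_name = "RHS",
                           output_colname = "output") {
  # One WHEN ... THEN ... line per rule, in rulelist order
  when_lines = paste0("  WHEN ", rules$LHS,
                      " THEN ", rules[[rhs_column_name]])
  sql = paste0("CASE\n",
               paste(when_lines, collapse = "\n"),
               "\n  ELSE NULL\nEND AS ", output_colname)
  cat(sql, "\n")  # side effect: print the statement to stdout
  invisible(sql)  # value: the statement as a single string
}

sql = sketch_sql_case(rules)
```

Because the WHEN branches are evaluated in order, the first matching rule determines the output, which mirrors how an ordered rulelist is applied.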