Title: | An Implementation of Isolation Forest |
---|---|
Description: | Isolation forest is an anomaly detection method introduced by the paper Isolation based Anomaly Detection (Liu, Ting and Zhou <doi:10.1145/2133360.2133363>). |
Authors: | Komala Sheshachala Srikanth [aut, cre], David Zimmermann [ctb] |
Maintainer: | Komala Sheshachala Srikanth <[email protected]> |
License: | GPL-3 |
Version: | 1.1.3 |
Built: | 2024-10-31 16:34:46 UTC |
Source: | https://github.com/talegari/solitude |
Check for a single integer
is_integerish(x)
x |
input |
TRUE or FALSE
## Not run: is_integerish(1)
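The kind of check described above can be sketched in base R. This is a minimal illustration, not the package's source: the function name is_single_integerish and the tolerance argument tol are hypothetical and chosen only for this example.

```r
# Hypothetical helper (not part of 'solitude'): TRUE only for a single,
# finite numeric value that is a whole number up to a small tolerance.
is_single_integerish <- function(x, tol = sqrt(.Machine$double.eps)) {
  is.numeric(x) &&
    length(x) == 1L &&
    is.finite(x) &&
    abs(x - round(x)) < tol
}

is_single_integerish(1)        # whole-number double
is_single_integerish(1.5)      # fractional value
is_single_integerish(c(1, 2))  # more than one element
```

The length check is what separates "a single integer" from an integer vector; the tolerance lets doubles such as 3.0 pass even though they are not of type integer.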
The 'solitude' class implements the isolation forest method
introduced by the paper Isolation based Anomaly Detection (Liu, Ting and Zhou
<doi:10.1145/2133360.2133363>). The extremely randomized trees (extratrees)
required to build the isolation forest are grown using the
ranger
function from the ranger package.
$new()
initiates a new 'solitude' object. The
possible arguments are:
sample_size
: (positive integer, default = 256) Number of
observations in the dataset to be used to build each tree in the forest
num_trees
: (positive integer, default = 100) Number of trees
to be built in the forest
replace
: (boolean, default = FALSE) Whether the sample of
observations should be chosen with replacement when sample_size is less
than the number of observations in the dataset
seed
: (positive integer, default = 101) Random seed for the
forest
nproc
: (NULL or a positive integer, default: NULL, means use
all resources) Number of parallel threads to be used by ranger
respect_unordered_factors
: (string, default: "partition") See the
respect.unordered.factors argument in ranger
max_depth
: (positive number, default:
ceiling(log2(sample_size))) See max.depth argument in
ranger
$fit()
fits an isolation forest to the given data frame or sparse matrix, computes
the depths of the terminal nodes of each tree, and stores the anomaly scores
and average depth values in the $scores
object as a data.table
$predict()
returns anomaly scores for new data as a data.table
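To show how an average depth relates to an anomaly score, the sketch below follows the scoring rule from the Liu, Ting and Zhou paper: the score is 2^(-average_depth / c(n)), where c(n) is the average path length of an unsuccessful binary search tree lookup over n points. The function names average_path_length and anomaly_score are hypothetical, and this is an illustration of the published formula rather than a copy of the package's internals.

```r
# c(n) from the paper: 2 * H(n - 1) - 2 * (n - 1) / n for n > 2,
# with H(i) approximated by log(i) + Euler's constant.
average_path_length <- function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649
  2 * harmonic - 2 * (n - 1) / n
}

# Score in (0, 1]: shallow average depth gives a score near 1 (anomalous),
# an average depth equal to c(n) gives exactly 0.5.
anomaly_score <- function(average_depth, sample_size) {
  2^(-average_depth / average_path_length(sample_size))
}

anomaly_score(c(4, 8, 12), sample_size = 256)
```

Observations isolated after only a few splits get shorter average depths and therefore higher scores, which is why the examples in this manual sort by descending anomaly_score.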
Parallelization: ranger
is parallelized and by
default uses all available resources. This is the behavior when nproc is set
to NULL. The process of obtaining the depths of terminal nodes (which is
executed when $fit()
is called) may be parallelized separately by setting up
a future backend.
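A minimal sketch of setting up a future backend before fitting is given below. It assumes the 'future' package is installed; the choice of the multisession strategy, two workers, sample_size = 100 and the iris columns is arbitrary and made only for illustration.

```r
library("solitude")

# Assumption: a multisession backend with two workers; any other
# strategy supported by the 'future' package could be used instead.
future::plan(future::multisession, workers = 2)

# nproc controls ranger's own threads; the future plan covers the
# separate step of computing terminal node depths during $fit().
iso <- isolationForest$new(sample_size = 100, nproc = 2)
iso$fit(iris[, 1:4])
head(iso$scores)

# Return to sequential processing once done.
future::plan(future::sequential)
```

Resetting the plan at the end keeps the parallel workers from lingering in the session after the forest has been fitted.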
new()
isolationForest$new(
  sample_size = 256,
  num_trees = 100,
  replace = FALSE,
  seed = 101,
  nproc = NULL,
  respect_unordered_factors = NULL,
  max_depth = ceiling(log2(sample_size))
)
fit()
isolationForest$fit(dataset)
predict()
isolationForest$predict(data)
clone()
The objects of this class are cloneable with this method.
isolationForest$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run:
library("solitude")
library("tidyverse")
library("mlbench")

data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes

splitter = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test = rsample::testing(splitter)

iso = isolationForest$new()
iso$fit(pima_train)

scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))
scores_train

umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  setNames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))
umap_train

umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))

scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))
scores_test
## End(Not run)
Isolation forest is an anomaly detection method introduced by the paper Isolation based Anomaly Detection (Liu, Ting and Zhou <doi:10.1145/2133360.2133363>).
Srikanth Komala Sheshachala
The depth of each terminal node of all trees in a ranger model is returned as a three-column tibble with column names 'id_tree', 'id_node' and 'depth'. Note that the root node has id_node = 0.
terminalNodesDepth(model)
model |
A ranger model |
This function may be parallelized using a future backend.
A tibble with three columns: 'id_tree', 'id_node', 'depth'.
rf = ranger::ranger(Species ~ ., data = iris, num.trees = 100)
terminalNodesDepth(rf)
Depth of each terminal node of a single tree in a ranger model. Note that the root node has id_node = 0.
terminalNodesDepthPerTree(treelike)
treelike |
Output of 'ranger::treeInfo' |
data.table with two columns: id_node and depth
## Not run:
rf = ranger::ranger(Species ~ ., data = iris)
terminalNodesDepthPerTree(ranger::treeInfo(rf, 1))
## End(Not run)
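To make the notion of terminal node depth concrete, the sketch below walks a small hand-built tree table in base R. It assumes a table shaped like the output of ranger::treeInfo, using only the columns nodeID, leftChild, rightChild and terminal. The function name terminal_depths and the toy tree are hypothetical; this is one possible breadth-first approach, not necessarily how the package computes depths.

```r
# Hypothetical helper: breadth-first walk from the root (node 0),
# assigning each child the depth of its parent plus one.
terminal_depths <- function(tree_df) {
  depth <- rep(NA_integer_, nrow(tree_df))
  names(depth) <- as.character(tree_df$nodeID)
  depth["0"] <- 0L
  queue <- 0
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    row <- tree_df[tree_df$nodeID == node, ]
    if (!row$terminal) {
      children <- c(row$leftChild, row$rightChild)
      depth[as.character(children)] <- depth[as.character(node)] + 1L
      queue <- c(queue, children)
    }
  }
  leaves <- tree_df$nodeID[tree_df$terminal]
  data.frame(id_node = leaves, depth = unname(depth[as.character(leaves)]))
}

# Toy tree: node 0 splits into 1 and 2; node 1 splits into 3 and 4.
toy_tree <- data.frame(
  nodeID     = 0:4,
  leftChild  = c(1, 3, NA, NA, NA),
  rightChild = c(2, 4, NA, NA, NA),
  terminal   = c(FALSE, FALSE, TRUE, TRUE, TRUE)
)
res <- terminal_depths(toy_tree)
res
```

In the toy tree, node 2 is a leaf one split below the root, while nodes 3 and 4 sit two splits below it, so the root's depth of 0 matches the id_node = 0 convention noted above.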