API documentation of skgarden

Table of contents

skgarden.mondrian

skgarden.quantile

skgarden.forest

skgarden.mondrian

skgarden.mondrian.MondrianForestClassifier

A MondrianForestClassifier is an ensemble of MondrianTreeClassifiers.

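The probability $p_j$ of class $j$ is given by $p_j = \frac{1}{N_{est}} \sum_{i=1}^{N_{est}} p_j^i$, where $p_j^i$ is the probability of class $j$ predicted by tree $i$ and $N_{est}$ is the number of trees.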

Parameters

  • n_estimators (integer, optional (default=10))

    The number of trees in the forest.

  • max_depth (integer, optional (default=None))

    The depth to which each tree is grown. If None, the tree is either grown to full depth or is constrained by min_samples_split.

  • min_samples_split (integer, optional (default=2))

    Stop growing the tree if all the nodes have fewer than min_samples_split samples.

  • bootstrap (boolean, optional (default=False))

    If bootstrap is set to False, then all trees are trained on the entire training dataset. Else, each tree is fit on n_samples drawn with replacement from the training dataset.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
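The effect of the bootstrap parameter can be illustrated with a short, library-free sketch. It only mimics the sampling rule described above (full dataset per tree versus n_samples rows drawn with replacement); the helper names are hypothetical and are not part of skgarden.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def tree_training_indices(n_samples, n_estimators, bootstrap, seed=None):
    """Return one list of training-row indices per tree."""
    rng = random.Random(seed)
    if not bootstrap:
        # Every tree is trained on the entire training dataset.
        return [list(range(n_samples)) for _ in range(n_estimators)]
    # Every tree gets its own resample of the training rows.
    return [bootstrap_indices(n_samples, rng) for _ in range(n_estimators)]


full = tree_training_indices(n_samples=5, n_estimators=3, bootstrap=False, seed=0)
boot = tree_training_indices(n_samples=5, n_estimators=3, bootstrap=True, seed=0)
```

With bootstrap=True some rows typically appear more than once in a tree's sample and others not at all, which is what makes the trees differ from each other.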

Methods

MondrianForestClassifier.fit(X, y)

Builds a forest of trees from the training set (X, y).

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs])

    The target values (class labels in classification, real numbers in regression).

  • sample_weight (array-like, shape = [n_samples] or None)

    Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Returns

  • self (object)

    Returns self.

MondrianForestClassifier.partial_fit(X, y, classes=None)

Incremental building of Mondrian Forest Classifiers.

Parameters

  • X (array_like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32.

  • y (array_like, shape = [n_samples])

    Input targets.

  • classes (array_like, shape = [n_classes])

    Ignored for a regression problem. For a classification problem, if not provided, this is inferred from y. It is taken into account only on the first call to partial_fit and ignored on subsequent calls.

Returns

  • self (instance of MondrianForestClassifier)

MondrianForestClassifier.weighted_decision_path(X)

Returns the weighted decision path in the forest.

Each non-zero value in the decision path determines the weight of that particular node while making predictions.

Parameters

  • X (array-like, shape = (n_samples, n_features))

    Input.

Returns

  • decision_path (sparse csr matrix, shape = (n_samples, n_total_nodes))

    Return a node indicator matrix where non zero elements indicate the weight of that particular node in making predictions.

  • est_inds (array-like, shape = (n_estimators + 1,))

    weighted_decision_path[:, est_inds[i]: est_inds[i + 1]] provides the weighted_decision_path of estimator i
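The role of est_inds can be shown on a single dense row. This is a minimal sketch with made-up numbers: the forest sizes, the weights, and the helper name split_by_estimator are assumptions for illustration, and a plain list stands in for one row of the sparse matrix.

```python
def split_by_estimator(row, est_inds):
    """Split one row of the forest-wide decision path into per-tree slices."""
    return [row[est_inds[i]:est_inds[i + 1]] for i in range(len(est_inds) - 1)]


# Hypothetical forest with 2 trees that hold 3 and 5 nodes respectively,
# so est_inds has n_estimators + 1 = 3 entries.
est_inds = [0, 3, 8]

# Node weights of one sample, with the trees concatenated along the columns.
row = [0.2, 0.8, 0.0, 0.1, 0.0, 0.3, 0.0, 0.6]

per_tree = split_by_estimator(row, est_inds)
```

Here per_tree[0] holds the node weights of the first tree and per_tree[1] those of the second.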

skgarden.mondrian.MondrianForestRegressor

A MondrianForestRegressor is an ensemble of MondrianTreeRegressors.

The variance in predictions is reduced by averaging the predictions from all trees.

Parameters

  • n_estimators (integer, optional (default=10))

    The number of trees in the forest.

  • max_depth (integer, optional (default=None))

    The depth to which each tree is grown. If None, the tree is either grown to full depth or is constrained by min_samples_split.

  • min_samples_split (integer, optional (default=2))

    Stop growing the tree if all the nodes have fewer than min_samples_split samples.

  • bootstrap (boolean, optional (default=False))

    If bootstrap is set to False, then all trees are trained on the entire training dataset. Else, each tree is fit on n_samples drawn with replacement from the training dataset.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Methods

MondrianForestRegressor.fit(X, y)

Builds a forest of trees from the training set (X, y).

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs])

    The target values (class labels in classification, real numbers in regression).

  • sample_weight (array-like, shape = [n_samples] or None)

    Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Returns

  • self (object)

    Returns self.

MondrianForestRegressor.partial_fit(X, y)

Incremental building of Mondrian Forest Regressors.

Parameters

  • X (array_like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32.

  • y (array_like, shape = [n_samples])

    Input targets.

Returns

  • self (instance of MondrianForestRegressor)

MondrianForestRegressor.predict(X, return_std=False)

Returns the predicted mean and std.

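The prediction is a GMM drawn from $\sum_{i=1}^{T} w_i N(m_i, \sigma_i)$, where $w_i = \frac{1}{T}$ and $T$ is the number of trees.

The mean $E[Y | X]$ reduces to $\frac{1}{T} \sum_{i=1}^{T} m_i$.

The variance $Var[Y | X]$ is given by

$$Var[Y | X] = E[Y^2 | X] - E[Y | X]^2 = \frac{1}{T} \sum_{i=1}^{T} \left( Var[Y_i | X] + E[Y_i | X]^2 \right) - E[Y | X]^2$$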
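Assuming each tree contributes one Gaussian component with its own mean and standard deviation, and all trees are weighted equally, the forest-level mean and standard deviation follow from the law of total variance. The sketch below works on plain lists; the function name mixture_mean_std is illustrative and not part of skgarden.

```python
import math


def mixture_mean_std(means, stds):
    """Mean and std of an equal-weight mixture of Gaussians (one per tree)."""
    n_trees = len(means)
    mean = sum(means) / n_trees
    # E[Y_i^2] = Var[Y_i] + E[Y_i]^2 for each component; average over trees.
    second_moment = sum(s ** 2 + m ** 2 for m, s in zip(means, stds)) / n_trees
    # Clip at zero to guard against tiny negative values from round-off.
    variance = max(second_moment - mean ** 2, 0.0)
    return mean, math.sqrt(variance)


# Two trees that disagree on the mean: the spread between the tree means
# adds to the per-tree variance.
mean, std = mixture_mean_std(means=[1.0, 3.0], stds=[1.0, 1.0])
```

In this example the mean is 2.0 and the variance is 2.0: one unit comes from the per-tree variance and one unit from the disagreement between the two tree means.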

Parameters

  • X (array-like, shape = (n_samples, n_features))

    Input samples.

  • return_std (boolean, optional (default=False))

    Whether or not to return the standard deviation.

Returns

  • y (array-like, shape = (n_samples,))

    Predictions at X.

  • std (array-like, shape = (n_samples,))

    Standard deviation at X.

MondrianForestRegressor.weighted_decision_path(X)

Returns the weighted decision path in the forest.

Each non-zero value in the decision path determines the weight of that particular node while making predictions.

Parameters

  • X (array-like, shape = (n_samples, n_features))

    Input.

Returns

  • decision_path (sparse csr matrix, shape = (n_samples, n_total_nodes))

    Return a node indicator matrix where non zero elements indicate the weight of that particular node in making predictions.

  • est_inds (array-like, shape = (n_estimators + 1,))

    weighted_decision_path[:, est_inds[i]: est_inds[i + 1]] provides the weighted_decision_path of estimator i

skgarden.mondrian.MondrianTreeClassifier

A Mondrian tree.

The splits in a Mondrian tree classifier differ from those in a standard decision tree in the following ways.

At fit time:

  • Splits are done independently of the labels.
  • The candidate feature is drawn with a probability proportional to the feature range.
  • The candidate threshold is drawn from a uniform distribution with the bounds equal to the bounds of the candidate feature.
  • The time of split is also stored which is proportional to the inverse of the size of the bounding-box.
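The fit-time sampling rules can be sketched without any library. This is an illustration, not skgarden's implementation: the function name is hypothetical, and drawing the waiting time from an exponential distribution whose rate is the sum of the feature ranges is an assumption taken from the usual Mondrian process construction (it makes the expected time inversely proportional to the size of the bounding box).

```python
import random


def sample_mondrian_split(lower, upper, rng):
    """Sample (feature, threshold, waiting_time) for one bounding box.

    `lower` and `upper` hold the per-feature bounds of the box. The labels
    are never consulted. Assumes at least one feature has a non-zero range.
    """
    ranges = [u - l for l, u in zip(lower, upper)]
    # Feature index drawn with probability proportional to its range.
    feature = rng.choices(range(len(ranges)), weights=ranges, k=1)[0]
    # Threshold drawn uniformly between the bounds of the chosen feature.
    threshold = rng.uniform(lower[feature], upper[feature])
    # Assumed: exponential waiting time with rate equal to the sum of ranges,
    # so larger boxes tend to split sooner.
    waiting_time = rng.expovariate(sum(ranges))
    return feature, threshold, waiting_time


rng = random.Random(0)
feature, threshold, waiting_time = sample_mondrian_split([0.0, 0.0], [1.0, 3.0], rng)
```

With these bounds the second feature has three times the range of the first, so it is chosen about three times as often over many draws.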

At prediction time:

  • Every node in the path from the root to the leaf is given a weight while making predictions.
  • At each node, the probability of an unseen sample splitting from that node is calculated. The farther the sample is away from the bounding box, the more probable that it will split away.
  • For every node, the probability that an unseen sample has not split before reaching that node and the probability that it will split away at that particular node are multiplied to give a weight.
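The prediction-time weighting can be sketched as follows. Several details here are assumptions rather than statements about skgarden's code: the split probability is taken to be 1 - exp(-delta * eta), where delta is the time elapsed between a node and its parent and eta is the distance of the sample outside the node's bounding box, and the leaf is assumed to receive whatever probability mass remains. The function name and the dictionary layout are hypothetical.

```python
import math


def path_weights(x, path):
    """Weights of the nodes on a root-to-leaf path for one sample x.

    `path` is a list of dicts with keys 'lower' and 'upper' (bounding box)
    and 'delta' (split time of the node minus that of its parent).
    """
    weights = []
    p_not_split_yet = 1.0
    for depth, node in enumerate(path):
        # Distance of x outside the bounding box, summed over features.
        eta = sum(max(xi - u, 0.0) + max(l - xi, 0.0)
                  for xi, l, u in zip(x, node["lower"], node["upper"]))
        p_split = 1.0 - math.exp(-node["delta"] * eta)
        if depth == len(path) - 1:
            # Assumed: the leaf keeps the remaining probability mass.
            weights.append(p_not_split_yet)
        else:
            weights.append(p_not_split_yet * p_split)
            p_not_split_yet *= 1.0 - p_split
    return weights


path = [
    {"lower": [0.0], "upper": [1.0], "delta": 1.0},  # root
    {"lower": [0.0], "upper": [0.4], "delta": 2.0},  # leaf
]
inside = path_weights([0.2], path)   # sample inside every bounding box
outside = path_weights([2.0], path)  # sample one unit outside the root box
```

A sample inside all bounding boxes puts its full weight on the leaf, while a sample far from the training data shifts weight towards nodes closer to the root.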

Parameters

  • max_depth (int or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
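The two accepted forms of min_samples_split can be resolved to an absolute sample count as in the sketch below. It follows the rule stated in the parameter description; the helper name resolve_min_samples_split is hypothetical and not part of skgarden.

```python
import math


def resolve_min_samples_split(min_samples_split, n_samples):
    """Turn an int or float min_samples_split into an absolute count."""
    if isinstance(min_samples_split, int):
        # An int is used as the minimum number directly.
        return min_samples_split
    # A float is interpreted relative to the number of training samples.
    return math.ceil(min_samples_split * n_samples)


as_int = resolve_min_samples_split(2, n_samples=10)
as_float = resolve_min_samples_split(0.25, n_samples=10)
```

With 10 samples, a value of 0.25 resolves to ceil(2.5) = 3 samples.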

Methods

MondrianTreeClassifier.apply(X, check_input=True)

Returns the index of the leaf that each sample is predicted as.

.. versionadded:: 0.17

Parameters

  • X (array_like or sparse matrix, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • X_leaves (array_like, shape = [n_samples,])

    For each datapoint x in X, return the index of the leaf x ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.

MondrianTreeClassifier.decision_path(X, check_input=True)

Returns the decision path in the tree.

.. versionadded:: 0.18

Parameters

  • X (array_like or sparse matrix, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • indicator (sparse csr array, shape = [n_samples, n_nodes])

    Returns a node indicator matrix where non-zero elements indicate that the sample goes through the corresponding node.

MondrianTreeClassifier.fit(X, y, sample_weight=None, check_input=True, X_idx_sorted=None)

MondrianTreeClassifier.partial_fit(X, y, classes=None)

Incremental building of Mondrian Tree Classifiers.

Parameters

  • X (array_like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32.

  • y (array_like, shape = [n_samples])

    Input targets.

  • classes (array_like, shape = [n_classes])

    Ignored for a regression problem. For a classification problem, if not provided, this is inferred from y. It is taken into account only on the first call to partial_fit and ignored on subsequent calls.

Returns

  • self (instance of MondrianTree)

MondrianTreeClassifier.predict(X, check_input=True, return_std=False)

Predict class or regression value for X.

For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

  • return_std (boolean, (default=False))

    Whether or not to return the standard deviation.

Returns

  • y (array of shape = [n_samples] or [n_samples, n_outputs])

    The predicted classes, or the predicted values.

MondrianTreeClassifier.predict_proba(X, check_input=True)

Predicts the probability of each class label given X.

Parameters

  • X (array-like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • y_prob (array of shape = [n_samples, n_classes])

    Predicted probabilities for each class.

MondrianTreeClassifier.weighted_decision_path(X, check_input=True)

Returns the weighted decision path in the tree.

Each non-zero value in the decision path determines the weight of that particular node in making predictions.

Parameters

  • X (array_like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • indicator (sparse csr array, shape = [n_samples, n_nodes])

    Return a node indicator matrix where non zero elements indicate the weight of that particular node in making predictions.

skgarden.mondrian.MondrianTreeRegressor

A Mondrian tree.

The splits in a Mondrian tree regressor differ from those in a standard regression tree in the following ways.

At fit time:

  • Splits are done independently of the labels.
  • The candidate feature is drawn with a probability proportional to the feature range.
  • The candidate threshold is drawn from a uniform distribution with the bounds equal to the bounds of the candidate feature.
  • The time of split is also stored which is proportional to the inverse of the size of the bounding-box.

At prediction time:

  • Every node in the path from the root to the leaf is given a weight while making predictions.
  • At each node, the probability of an unseen sample splitting from that node is calculated. The farther the sample is away from the bounding box, the more probable that it will split away.
  • For every node, the probability that an unseen sample has not split before reaching that node and the probability that it will split away at that particular node are multiplied to give a weight.

Parameters

  • max_depth (int or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Methods

MondrianTreeRegressor.apply(X, check_input=True)

Returns the index of the leaf that each sample is predicted as.

.. versionadded:: 0.17

Parameters

  • X (array_like or sparse matrix, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • X_leaves (array_like, shape = [n_samples,])

    For each datapoint x in X, return the index of the leaf x ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.

MondrianTreeRegressor.decision_path(X, check_input=True)

Returns the decision path in the tree.

.. versionadded:: 0.18

Parameters

  • X (array_like or sparse matrix, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • indicator (sparse csr array, shape = [n_samples, n_nodes])

    Returns a node indicator matrix where non-zero elements indicate that the sample goes through the corresponding node.

MondrianTreeRegressor.fit(X, y, sample_weight=None, check_input=True, X_idx_sorted=None)

MondrianTreeRegressor.partial_fit(X, y)

Incremental building of Mondrian Tree Regressors.

Parameters

  • X (array_like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32.

  • y (array_like, shape = [n_samples])

    Input targets.

Returns

  • self (instance of MondrianTree)

MondrianTreeRegressor.predict(X, check_input=True, return_std=False)

Predict class or regression value for X.

For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

  • return_std (boolean, (default=False))

    Whether or not to return the standard deviation.

Returns

  • y (array of shape = [n_samples] or [n_samples, n_outputs])

    The predicted classes, or the predicted values.

MondrianTreeRegressor.weighted_decision_path(X, check_input=True)

Returns the weighted decision path in the tree.

Each non-zero value in the decision path determines the weight of that particular node in making predictions.

Parameters

  • X (array_like, shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • indicator (sparse csr array, shape = [n_samples, n_nodes])

    Return a node indicator matrix where non zero elements indicate the weight of that particular node in making predictions.

skgarden.quantile

skgarden.quantile.DecisionTreeQuantileRegressor

A decision tree regressor that provides quantile estimates.

Parameters

  • criterion (string, optional (default="mse"))

    The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.

    .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion.

  • splitter (string, optional (default="best"))

    The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

  • max_features (int, float, string or None, optional (default=None))

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.
    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    • If "auto", then max_features=n_features.
    • If "sqrt", then max_features=sqrt(n_features).
    • If "log2", then max_features=log2(n_features).
    • If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.

  • max_depth (int or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

    .. versionchanged:: 0.18 Added float values for percentages.

  • min_samples_leaf (int, float, optional (default=1))

    The minimum number of samples required to be at a leaf node:

    • If int, then consider min_samples_leaf as the minimum number.
    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

    .. versionchanged:: 0.18 Added float values for percentages.

  • min_weight_fraction_leaf (float, optional (default=0.))

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_leaf_nodes (int or None, optional (default=None))

    Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • presort (bool, optional (default=False))

    Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
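The max_features options listed above can be resolved to a concrete number of features as in this minimal sketch. The helper name resolve_max_features is hypothetical, and truncating the square root and logarithm to an integer (with a floor of 1) is an assumption of the sketch, since the description only gives sqrt(n_features) and log2(n_features).

```python
import math


def resolve_max_features(max_features, n_features):
    """Map a max_features setting to the number of features to consider."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        # Assumed: truncate to an int and consider at least one feature.
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # A float is interpreted as a fraction of the available features.
    return int(max_features * n_features)


resolved = {
    setting: resolve_max_features(setting, n_features=16)
    for setting in (None, "auto", "sqrt", "log2", 3, 0.5)
}
```

For 16 features, "sqrt" and "log2" both resolve to 4, while 0.5 resolves to 8.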

Attributes

  • feature_importances_ (array of shape = [n_features])

    The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [4]_.

  • max_features_ (int)

    The inferred value of max_features.

  • n_features_ (int)

    The number of features when fit is performed.

  • n_outputs_ (int)

    The number of outputs when fit is performed.

  • tree_ (Tree object)

    The underlying Tree object.

  • y_train_ (array-like)

    Train target values.

  • y_train_leaves_ (array-like)

    Cache of the leaf nodes that each training sample falls into. y_train_leaves_[i] is the leaf that y_train_[i] ends up at.

Methods

DecisionTreeQuantileRegressor.predict(X, quantile=None, check_input=False)

Predict regression value for X.

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • quantile (int, optional)

    Value ranging from 0 to 100. By default, the mean is returned.

  • check_input (boolean, (default=False))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • y (array of shape = [n_samples])

    If quantile is set to None, then return E(Y | X). Else return y such that F(Y=y | x) = quantile.
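One way to picture the quantile prediction is to look at the training targets that share a leaf with the query sample, using arrays shaped like y_train_ and y_train_leaves_. The sketch below is an illustration under stated assumptions, not skgarden's implementation: the function name leaf_quantile is hypothetical, and the linear interpolation between neighbouring ranks is a choice made here (the library may compute the percentile differently).

```python
def leaf_quantile(y_train, y_train_leaves, leaf, quantile=None):
    """Mean or quantile (0 to 100) of the training targets in one leaf."""
    values = sorted(y for y, l in zip(y_train, y_train_leaves) if l == leaf)
    if quantile is None:
        # No quantile requested: return the mean of the leaf.
        return sum(values) / len(values)
    # Assumed: linear interpolation between the two closest ranks.
    position = (len(values) - 1) * quantile / 100.0
    low = int(position)
    high = min(low + 1, len(values) - 1)
    return values[low] + (values[high] - values[low]) * (position - low)


# Made-up training targets and the leaf index each of them falls into.
y_train = [1.0, 2.0, 3.0, 4.0, 10.0]
y_train_leaves = [3, 3, 3, 3, 7]

mean_in_leaf = leaf_quantile(y_train, y_train_leaves, leaf=3)
median_in_leaf = leaf_quantile(y_train, y_train_leaves, leaf=3, quantile=50)
upper_in_leaf = leaf_quantile(y_train, y_train_leaves, leaf=3, quantile=100)
```

Requesting a low and a high quantile for the same sample gives a rough prediction interval around the mean.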

skgarden.quantile.ExtraTreeQuantileRegressor

An extremely randomized tree regressor.

Extra-trees differ from classic decision trees in the way they are built. When looking for the best split to separate the samples of a node into two groups, random splits are drawn for each of the max_features randomly selected features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally random decision tree.

Warning: Extra-trees should only be used within ensemble methods.

Parameters

  • criterion (string, optional (default="mse"))

    The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.

    .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion.

  • splitter (string, optional (default="best"))

    The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

  • max_depth (int or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

    .. versionchanged:: 0.18 Added float values for percentages.

  • min_samples_leaf (int, float, optional (default=1))

    The minimum number of samples required to be at a leaf node:

    • If int, then consider min_samples_leaf as the minimum number.
    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

    .. versionchanged:: 0.18 Added float values for percentages.

  • min_weight_fraction_leaf (float, optional (default=0.))

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float, string or None, optional (default=None))

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.
    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    • If "auto", then max_features=n_features.
    • If "sqrt", then max_features=sqrt(n_features).
    • If "log2", then max_features=log2(n_features).
    • If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • min_impurity_decrease (float, optional (default=0.))

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following::

    N_t / N * (impurity - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)
    

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

    .. versionadded:: 0.19

  • min_impurity_split (float)

    Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

    .. deprecated:: 0.19 min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use min_impurity_decrease instead.

  • max_leaf_nodes (int or None, optional (default=None))

    Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
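The weighted impurity decrease given under min_impurity_decrease can be evaluated directly. The sketch below transcribes that equation into a function; the function name and the example numbers are illustrative only.

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """Weighted impurity decrease of a candidate split at one node."""
    return N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)


# A node holding 40 of 100 samples, split into 10 (left) and 30 (right).
decrease = weighted_impurity_decrease(
    N=100, N_t=40, N_t_L=10, N_t_R=30,
    impurity=0.5, left_impurity=0.2, right_impurity=0.4,
)

# The node is split only if the decrease reaches the configured threshold.
min_impurity_decrease = 0.05
should_split = decrease >= min_impurity_decrease
```

Here the decrease is 0.4 * (0.5 - 0.3 - 0.05) = 0.06, so the split is accepted with a threshold of 0.05.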

See also ExtraTreeClassifier, ExtraTreesClassifier, ExtraTreesRegressor

Notes

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

Methods

ExtraTreeQuantileRegressor.predict(X, quantile=None, check_input=False)

Predict regression value for X.

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

  • quantile (int, optional)

    Value ranging from 0 to 100. By default, the mean is returned.

  • check_input (boolean, (default=False))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • y (array of shape = [n_samples])

    If quantile is set to None, then return E(Y | X). Else return y such that F(Y=y | x) = quantile.

skgarden.quantile.ExtraTreesQuantileRegressor

An extra-trees regressor that provides quantile estimates.

This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Parameters

  • n_estimators (integer, optional (default=10))

    The number of trees in the forest.

  • criterion (string, optional (default="mse"))

    The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.

    .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion.

  • max_features (int, float, string or None, optional (default="auto"))

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.
    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    • If "auto", then max_features=n_features.
    • If "sqrt", then max_features=sqrt(n_features).
    • If "log2", then max_features=log2(n_features).
    • If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.

  • max_depth (integer or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

    .. versionchanged:: 0.18 Added float values for percentages.

  • min_samples_leaf (int, float, optional (default=1))

    The minimum number of samples required to be at a leaf node:

    • If int, then consider min_samples_leaf as the minimum number.
    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

    .. versionchanged:: 0.18 Added float values for percentages.

  • min_weight_fraction_leaf (float, optional (default=0.))

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_leaf_nodes (int or None, optional (default=None))

    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • bootstrap (boolean, optional (default=False))

    Whether bootstrap samples are used when building trees.

  • oob_score (bool, optional (default=False))

    Whether to use out-of-bag samples to estimate the R^2 on unseen data.

  • n_jobs (integer, optional (default=1))

    The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, optional (default=0))

    Controls the verbosity of the tree building process.

  • warm_start (bool, optional (default=False))

    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, fit a whole new forest.
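
The max_features options listed above can be sketched as a small helper that maps each accepted value to a feature count. This is only an illustration of the documented rules: the name resolve_max_features is hypothetical, and clamping the result to at least one feature is an assumption rather than documented behaviour.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return how many features are inspected per split (illustrative only)."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # Assumption: never go below one feature.
        return max(1, int(max_features * n_features))
    raise ValueError("unsupported max_features: %r" % (max_features,))


# Resolve every documented option for a dataset with 16 features.
resolved = {
    option: resolve_max_features(option, 16)
    for option in (None, "auto", "sqrt", "log2", 3, 0.5)
}
```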

Attributes

  • estimators_ (list of ExtraTreeQuantileRegressor)

    The collection of fitted sub-estimators.

  • feature_importances_ (array of shape = [n_features])

    The feature importances (the higher, the more important the feature).

  • n_features_ (int)

    The number of features when fit is performed.

  • n_outputs_ (int)

    The number of outputs when fit is performed.

  • oob_score_ (float)

    Score of the training dataset obtained using an out-of-bag estimate.

  • oob_prediction_ (array of shape = [n_samples])

    Prediction computed with out-of-bag estimate on the training set.

  • y_train_ (array-like, shape=(n_samples,))

    Cache the target values at fit time.

  • y_weights_ (array-like, shape=(n_estimators, n_samples))

    y_weights_[i, j] is the weight given to sample j while estimator i is fit. If bootstrap is set to False, this reduces to a 2-D array of ones.

  • y_train_leaves_ (array-like, shape=(n_estimators, n_samples))

    y_train_leaves_[i, j] provides the leaf node that y_train_[j] ends up in when estimator i is fit. If y_train_[j] is given a weight of zero when estimator i is fit, then the value is -1.
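
As an illustration of how per-estimator sample weights can arise under bootstrapping, the following sketch counts how often each training sample is drawn with replacement. The helper name bootstrap_counts is hypothetical, it uses Python's random module rather than skgarden's random number generator, and it omits any normalisation skgarden may apply to the stored weights. Samples with a count of zero are the out-of-bag samples for that estimator.

```python
import random
from collections import Counter


def bootstrap_counts(n_samples, seed):
    """Count how often each sample index is drawn in one bootstrap sample."""
    rng = random.Random(seed)
    # Draw n_samples indices with replacement.
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    counts = Counter(drawn)
    return [counts.get(j, 0) for j in range(n_samples)]


# One row of counts per (hypothetical) estimator, seeded differently.
counts_per_estimator = [bootstrap_counts(6, seed=s) for s in range(3)]
# Indices never drawn for the first estimator are out of bag for it.
out_of_bag = [j for j, c in enumerate(counts_per_estimator[0]) if c == 0]
```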

References

[1] Nicolai Meinshausen, "Quantile Regression Forests", http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf

Methods

ExtraTreesQuantileRegressor.fit(X, y)

Build a forest from the training set (X, y).

Parameters

  • X (array-like or sparse matrix, shape = [n_samples, n_features])

    The training input samples. Internally, the input is converted to dtype=np.float32, and a sparse matrix, if provided, is converted to a sparse csc_matrix.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs])

    The target values (real numbers).

  • sample_weight (array-like, shape = [n_samples] or None)

    Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

  • X_idx_sorted (array-like, shape = [n_samples, n_features], optional)

    The indexes of the sorted training input samples. If many trees are grown on the same dataset, this allows the ordering to be cached between trees. If None, the data will be sorted here. Don't use this parameter unless you know what you are doing.

Returns

  • self (object)

    Returns self.

ExtraTreesQuantileRegressor.predict(X, quantile=None)

Predict regression value for X.

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The input samples. Internally, the input is converted to dtype=np.float32, and a sparse matrix, if provided, is converted to a sparse csr_matrix.

  • quantile (int, optional)

    Value ranging from 0 to 100. By default, the mean is returned.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • y (array of shape = [n_samples])

    If quantile is None, return E(Y | X). Otherwise, return the value y such that F(y | X) = quantile / 100, where F is the conditional cumulative distribution function of Y.
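
To make the quantile return value concrete, here is a minimal sketch of reading a quantile off a weighted empirical distribution of training targets. The function name weighted_percentile is hypothetical, and the rule used (no interpolation, smallest target whose cumulative weight share reaches quantile / 100) is an assumption, not necessarily what skgarden does internally.

```python
def weighted_percentile(values, weights, quantile):
    """Smallest value whose cumulative weight share reaches quantile / 100."""
    if not 0 <= quantile <= 100:
        raise ValueError("quantile must lie in [0, 100]")
    # Zero-weight targets do not contribute to the distribution.
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    if not pairs:
        raise ValueError("at least one positive weight is required")
    total = sum(w for _, w in pairs)
    target = quantile / 100 * total
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= target:
            return value
    return pairs[-1][0]


y_train = [1.0, 2.0, 3.0, 4.0, 10.0]
weights = [1, 1, 1, 1, 0]  # the last target carries no weight
median = weighted_percentile(y_train, weights, 50)
upper = weighted_percentile(y_train, weights, 100)
```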

skgarden.quantile.RandomForestQuantileRegressor

A random forest regressor that provides quantile estimates.

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).

Parameters

  • n_estimators (integer, optional (default=10))

    The number of trees in the forest.

  • criterion (string, optional (default="mse"))

    The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error. The "mae" criterion was added in version 0.18.

  • max_features (int, float, string or None, optional (default="auto"))

    The number of features to consider when looking for the best split:

    - If int, then consider max_features features at each split.
    - If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    - If "auto", then max_features=n_features.
    - If "sqrt", then max_features=sqrt(n_features).
    - If "log2", then max_features=log2(n_features).
    - If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features.

  • max_depth (integer or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    - If int, then consider min_samples_split as the minimum number.
    - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

    Float values were added in version 0.18.

  • min_samples_leaf (int, float, optional (default=1))

    The minimum number of samples required to be at a leaf node:

    - If int, then consider min_samples_leaf as the minimum number.
    - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

    Float values were added in version 0.18.

  • min_weight_fraction_leaf (float, optional (default=0.))

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_leaf_nodes (int or None, optional (default=None))

    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • bootstrap (boolean, optional (default=True))

    Whether bootstrap samples are used when building trees.

  • oob_score (bool, optional (default=False))

    Whether to use out-of-bag samples to estimate the R^2 on unseen data.

  • n_jobs (integer, optional (default=1))

    The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, optional (default=0))

    Controls the verbosity of the tree building process.

  • warm_start (bool, optional (default=False))

    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, fit a whole new forest.
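
The int and float forms of min_samples_split and min_samples_leaf described above can be sketched as follows. The helper name resolve_min_samples is hypothetical; it only illustrates the ceil(value * n_samples) rule and does not include any validation or lower bounds the library may enforce.

```python
import math


def resolve_min_samples(value, n_samples):
    """Turn an int or float setting into an absolute sample count."""
    if isinstance(value, int):
        # An int is used directly as the minimum number of samples.
        return value
    # A float is a fraction of the training set size, rounded up.
    return math.ceil(value * n_samples)


as_int = resolve_min_samples(2, 200)
as_fraction = resolve_min_samples(0.25, 200)
rounded_up = resolve_min_samples(0.05, 30)
```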

Attributes

  • estimators_ (list of DecisionTreeQuantileRegressor)

    The collection of fitted sub-estimators.

  • feature_importances_ (array of shape = [n_features])

    The feature importances (the higher, the more important the feature).

  • n_features_ (int)

    The number of features when fit is performed.

  • n_outputs_ (int)

    The number of outputs when fit is performed.

  • oob_score_ (float)

    Score of the training dataset obtained using an out-of-bag estimate.

  • oob_prediction_ (array of shape = [n_samples])

    Prediction computed with out-of-bag estimate on the training set.

  • y_train_ (array-like, shape=(n_samples,))

    Cache the target values at fit time.

  • y_weights_ (array-like, shape=(n_estimators, n_samples))

    y_weights_[i, j] is the weight given to sample j while estimator i is fit. If bootstrap is set to False, this reduces to a 2-D array of ones.

  • y_train_leaves_ (array-like, shape=(n_estimators, n_samples))

    y_train_leaves_[i, j] provides the leaf node that y_train_[j] ends up in when estimator i is fit. If y_train_[j] is given a weight of zero when estimator i is fit, then the value is -1.
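
One way to picture how y_train_leaves_ and y_weights_ might be combined at prediction time is the weighting idea from Meinshausen's paper: each training sample is weighted by how often it shares a leaf with the query point, averaged over estimators. The sketch below is an assumption-laden illustration; the function name query_weights, the per-leaf normalisation, and the input layout (lists indexed as [estimator][sample]) are hypothetical and not taken from skgarden's code.

```python
def query_weights(y_train_leaves, y_weights, query_leaves):
    """Average, over estimators, each sample's share of the query's leaf."""
    n_estimators = len(y_train_leaves)
    n_samples = len(y_train_leaves[0])
    totals = [0.0] * n_samples
    for leaves, weights, query_leaf in zip(y_train_leaves, y_weights, query_leaves):
        # Keep only the weight of samples that share the query's leaf.
        in_leaf = [w if leaf == query_leaf else 0.0
                   for leaf, w in zip(leaves, weights)]
        leaf_total = sum(in_leaf)
        if leaf_total == 0:
            continue
        for j, w in enumerate(in_leaf):
            totals[j] += w / leaf_total / n_estimators
    return totals


# Two estimators, three training samples; -1 marks a zero-weight sample.
leaves = [[1, 1, 2], [3, -1, 3]]
weights = [[1, 1, 1], [2, 0, 1]]
sample_weights = query_weights(leaves, weights, query_leaves=[1, 3])
```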

References

[1] Nicolai Meinshausen, "Quantile Regression Forests", http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf

Methods

RandomForestQuantileRegressor.fit(X, y)

Build a forest from the training set (X, y).

Parameters

  • X (array-like or sparse matrix, shape = [n_samples, n_features])

    The training input samples. Internally, the input is converted to dtype=np.float32, and a sparse matrix, if provided, is converted to a sparse csc_matrix.

  • y (array-like, shape = [n_samples] or [n_samples, n_outputs])

    The target values (real numbers).

  • sample_weight (array-like, shape = [n_samples] or None)

    Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

  • X_idx_sorted (array-like, shape = [n_samples, n_features], optional)

    The indexes of the sorted training input samples. If many trees are grown on the same dataset, this allows the ordering to be cached between trees. If None, the data will be sorted here. Don't use this parameter unless you know what you are doing.

Returns

  • self (object)

    Returns self.

RandomForestQuantileRegressor.predict(X, quantile=None)

Predict regression value for X.

Parameters

  • X (array-like or sparse matrix of shape = [n_samples, n_features])

    The input samples. Internally, the input is converted to dtype=np.float32, and a sparse matrix, if provided, is converted to a sparse csr_matrix.

  • quantile (int, optional)

    Value ranging from 0 to 100. By default, the mean is returned.

  • check_input (boolean, (default=True))

    Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

Returns

  • y (array of shape = [n_samples])

    If quantile is None, return E(Y | X). Otherwise, return the value y such that F(y | X) = quantile / 100, where F is the conditional cumulative distribution function of Y.

skgarden.forest

skgarden.forest.ExtraTreesRegressor

ExtraTreesRegressor that supports conditional standard deviation.

Parameters

  • n_estimators (integer, optional (default=10))

    The number of trees in the forest.

  • criterion (string, optional (default="mse"))

    The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.

  • max_features (int, float, string or None, optional (default="auto"))

    The number of features to consider when looking for the best split:

    - If int, then consider max_features features at each split.
    - If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    - If "auto", then max_features=n_features.
    - If "sqrt", then max_features=sqrt(n_features).
    - If "log2", then max_features=log2(n_features).
    - If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features.

  • max_depth (integer or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    - If int, then consider min_samples_split as the minimum number.
    - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

  • min_samples_leaf (int, float, optional (default=1))

    The minimum number of samples required to be at a leaf node:

    - If int, then consider min_samples_leaf as the minimum number.
    - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, optional (default=0.))

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_leaf_nodes (int or None, optional (default=None))

    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_decrease (float, optional (default=0.))

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:

        N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

  • bootstrap (boolean, optional (default=True))

    Whether bootstrap samples are used when building trees.

  • oob_score (bool, optional (default=False))

    Whether to use out-of-bag samples to estimate the R^2 on unseen data.

  • n_jobs (integer, optional (default=1))

    The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, optional (default=0))

    Controls the verbosity of the tree building process.

  • warm_start (bool, optional (default=False))

    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, fit a whole new forest.
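
The weighted impurity decrease used by min_impurity_decrease can be worked through with concrete numbers. The function below transcribes the documented formula directly; its name and the example values are made up for illustration.

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """Evaluate the documented weighted impurity decrease for one split."""
    return n_t / n * (
        impurity
        - n_t_r / n_t * right_impurity
        - n_t_l / n_t * left_impurity
    )


# Hypothetical node: 40 of 100 samples, split into 10 (left) and 30 (right).
decrease = weighted_impurity_decrease(
    n=100, n_t=40, n_t_l=10, n_t_r=30,
    impurity=2.0, left_impurity=0.5, right_impurity=1.0,
)
# The node is split only if the decrease reaches the threshold.
would_split = decrease >= 0.4
```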

Attributes

  • estimators_ (list of DecisionTreeRegressor)

    The collection of fitted sub-estimators.

  • feature_importances_ (array of shape = [n_features])

    The feature importances (the higher, the more important the feature).

  • n_features_ (int)

    The number of features when fit is performed.

  • n_outputs_ (int)

    The number of outputs when fit is performed.

  • oob_score_ (float)

    Score of the training dataset obtained using an out-of-bag estimate.

  • oob_prediction_ (array of shape = [n_samples])

    Prediction computed with out-of-bag estimate on the training set.
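
Since oob_score is described as an estimate of R^2, the relationship between oob_prediction_ and oob_score_ can be sketched as a plain R^2 computation between the training targets and their out-of-bag predictions. The helper name r2 and the toy numbers are hypothetical, and the sketch assumes a single output with every sample having an out-of-bag prediction.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for a single output."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_train = [1.0, 2.0, 3.0, 4.0]
oob_prediction = [1.5, 2.0, 2.5, 4.0]  # made-up out-of-bag predictions
oob_score = r2(y_train, oob_prediction)
```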

Notes

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.

References

[1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

Methods

ExtraTreesRegressor.predict(X, return_std=False)

Predict continuous output for X.

Parameters

  • X (array-like of shape=(n_samples, n_features))

    Input data.

  • return_std (boolean)

    Whether or not to return the standard deviation.

Returns

  • predictions (array-like of shape=(n_samples,))

    Predicted values for X. If criterion is set to "mse", then predictions[i] ~= mean(y | X[i]).

  • std (array-like of shape=(n_samples,))

    Standard deviation of y at X. If criterion is set to "mse", then std[i] ~= std(y | X[i]).
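
One common way to obtain a conditional standard deviation from a forest is the law of total variance: combine each tree's leaf mean and leaf variance into an overall mean and variance. The sketch below shows that calculation for a single query point. Whether skgarden computes std exactly this way is not stated in this section, and the function name combine_tree_estimates is hypothetical.

```python
import math


def combine_tree_estimates(tree_means, tree_variances):
    """Combine per-tree leaf means and variances for one query point."""
    n_trees = len(tree_means)
    mean = sum(tree_means) / n_trees
    # Average second moment across trees: E[var + mean^2].
    second_moment = sum(
        v + m ** 2 for m, v in zip(tree_means, tree_variances)
    ) / n_trees
    # Guard against tiny negative values caused by rounding.
    variance = max(second_moment - mean ** 2, 0.0)
    return mean, math.sqrt(variance)


prediction, std = combine_tree_estimates([1.0, 3.0], [1.0, 1.0])
```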

skgarden.forest.RandomForestRegressor

RandomForestRegressor that supports conditional std computation.

Parameters

  • n_estimators (integer, optional (default=10))

    The number of trees in the forest.

  • criterion (string, optional (default="mse"))

    The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.

  • max_features (int, float, string or None, optional (default="auto"))

    The number of features to consider when looking for the best split:

    - If int, then consider max_features features at each split.
    - If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    - If "auto", then max_features=n_features.
    - If "sqrt", then max_features=sqrt(n_features).
    - If "log2", then max_features=log2(n_features).
    - If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features.

  • max_depth (integer or None, optional (default=None))

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split (int, float, optional (default=2))

    The minimum number of samples required to split an internal node:

    - If int, then consider min_samples_split as the minimum number.
    - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

  • min_samples_leaf (int, float, optional (default=1))

    The minimum number of samples required to be at a leaf node:

    - If int, then consider min_samples_leaf as the minimum number.
    - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, optional (default=0.))

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_leaf_nodes (int or None, optional (default=None))

    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_decrease (float, optional (default=0.))

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:

        N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

  • bootstrap (boolean, optional (default=True))

    Whether bootstrap samples are used when building trees.

  • oob_score (bool, optional (default=False))

    Whether to use out-of-bag samples to estimate the R^2 on unseen data.

  • n_jobs (integer, optional (default=1))

    The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

  • random_state (int, RandomState instance or None, optional (default=None))

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, optional (default=0))

    Controls the verbosity of the tree building process.

  • warm_start (bool, optional (default=False))

    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, fit a whole new forest.
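
The min_weight_fraction_leaf parameter above is expressed relative to the total sample weight. A minimal sketch of turning it into an absolute minimum leaf weight follows; the helper name min_leaf_weight is hypothetical, and the sketch simply applies the documented rule that unweighted samples count as weight one each.

```python
def min_leaf_weight(min_weight_fraction_leaf, n_samples, sample_weight=None):
    """Absolute weight a leaf must hold, given the fraction setting."""
    if sample_weight is None:
        # Without sample_weight, every sample has equal (unit) weight.
        total = float(n_samples)
    else:
        total = float(sum(sample_weight))
    return min_weight_fraction_leaf * total


unweighted = min_leaf_weight(0.25, n_samples=40)
weighted = min_leaf_weight(0.25, n_samples=4, sample_weight=[1, 1, 2, 4])
```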

Attributes

  • estimators_ (list of DecisionTreeRegressor)

    The collection of fitted sub-estimators.

  • feature_importances_ (array of shape = [n_features])

    The feature importances (the higher, the more important the feature).

  • n_features_ (int)

    The number of features when fit is performed.

  • n_outputs_ (int)

    The number of outputs when fit is performed.

  • oob_score_ (float)

    Score of the training dataset obtained using an out-of-bag estimate.

  • oob_prediction_ (array of shape = [n_samples])

    Prediction computed with out-of-bag estimate on the training set.

Notes

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
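
The tie-breaking behaviour described in these notes can be illustrated with a toy split search: features are scanned in a random order and the first one reaching the best improvement wins, so tied features can be chosen differently unless the seed is fixed. The function pick_best_feature and the improvement values are hypothetical, and Python's random module stands in for the library's random number generator.

```python
import random


def pick_best_feature(improvements, seed):
    """Scan features in a seeded random order; ties go to the first seen."""
    rng = random.Random(seed)
    order = list(range(len(improvements)))
    rng.shuffle(order)
    best = order[0]
    for feature in order[1:]:
        # Strict comparison: a later tie never replaces the current best.
        if improvements[feature] > improvements[best]:
            best = feature
    return best


improvements = [0.3, 0.7, 0.7, 0.1]  # features 1 and 2 are tied
first_run = pick_best_feature(improvements, seed=0)
second_run = pick_best_feature(improvements, seed=0)
```

With the same seed both runs pick the same feature; with different seeds either tied feature may be selected.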

References

[1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

Methods

RandomForestRegressor.predict(X, return_std=False)

Predict continuous output for X.

Parameters

  • X (array of shape = (n_samples, n_features))

    Input data.

  • return_std (boolean)

    Whether or not to return the standard deviation.

Returns

  • predictions (array-like of shape = (n_samples,))

    Predicted values for X. If criterion is set to "mse", then predictions[i] ~= mean(y | X[i]).

  • std (array-like of shape=(n_samples,))

    Standard deviation of y at X. If criterion is set to "mse", then std[i] ~= std(y | X[i]).

Properties