NGBoost
Overview
NGBoost learns (approximates) the distribution function of target values, rather than individual point estimates.
Knowing the parametric form of a distribution function opens up unique opportunities. For example, predicting quantiles (eg. "predict the target along with its q5 and q95 quantiles") or prediction intervals (eg. "predict the target with a range that includes the true value with 90% confidence").
Classical Scikit-Learn estimators can achieve similar objectives.
For example, predict quantiles using
QuantileRegressor or
GradientBoostingRegressor (with the pinball loss), then wrap the elementary models into a single deployable unit using
MultiOutputRegressor.
However, NGBoost outperforms such setups in many ways. It produces smaller and computationally more efficient models and, most importantly, its predictions are heteroskedastic rather than homoskedastic.
To elaborate, NGBoost does not predict the target value for a sample directly. It predicts the parameters of the distribution function for a sample, constructs the corresponding distribution instance, and evaluates it for the mean (q50).
All samples share the same distribution family, but they each parameterize it differently. Specifically, a sample's confidence is reflected in its estimated dispersion parameter value(s). Samples from higher-density (ie. more typical) regions have lower dispersion, which naturally leads to narrower prediction intervals.
Workflow
NGBoost estimators follow Scikit-Learn API conventions, making them usable in standalone mode as well as when embedded in pipelines. Adopt the pipeline approach proactively, because real-life datasets almost always require some form of transformation before they can be passed to the final estimator.
Data pre-processing requirements depend on the NGBoost base estimator (the
Base parameter).
The default base estimator is
sklearn.tree.DecisionTreeRegressor, which accepts continuous features as-is, but requires categorical features to be encoded:
from ngboost.learners import default_tree_learner
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer([
("cat", OneHotEncoder(sparse_output = False), cat_cols),
("cont", "passthrough", cont_cols)
])
base_estimator = default_tree_learner
Adjust the data pre-processor and base estimator in tandem when tackling more complex problems.
For example, if the dataset is sparse and contains many high-cardinality categorical features, then consider using
xgboost.XGBRegressor as the base estimator:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor
transformer = ColumnTransformer([
("cat", OrdinalEncoder(), cat_cols),
("cont", "passthrough", cont_cols)
])
feature_types = ["c"] * len(cat_cols) + ["q"] * len(cont_cols)
base_estimator = XGBRegressor(n_estimators = 3, max_depth = 2, feature_types = feature_types)
Some words of caution and advice.
First, prefer the most basic and backwards-compatible solution whenever several technical approaches are feasible.
For example,
XGBRegressor accepts categorical features either in ordinally-encoded or
pandas.CategoricalDtype forms.
At the time of writing this, NGBoost only supports the former.
This limitation stems from the fact that NGBoost casts all incoming data containers to NumPy arrays, which discards or disfigures any specialized third-party data representations (eg. Pandas' categoricals).
Second, NGBoost estimator classes deviate from the core Scikit-Learn convention of storing fitted state in attributes ending with the underscore character (
_).
Scikit-Learn 1.8.0 started checking the fitted state of pipeline steps during prediction.
This change causes all NGBoost pipelines to fail with a
sklearn.exceptions.NotFittedError, whether actually justified or not.
As a quick fix, set an indicator attribute:
pipeline = ...
pipeline.fit(X, y)
# Mark the regressor as fitted
pipeline["regressor"].fitted_ = True
Alternatively, exclude the NGBoost estimator from the pipeline, and emulate the transform-then-predict sequence manually.
Apply the same pattern when calling the non-standard
NGBoost.pred_dist(X) method:
pipeline = ...
pipeline.fit(X, y)
# Steps up until the final step
Xt = pipeline[:-1].transform(X)
# The final step, with a non-standard predict method
y_dist = pipeline._final_estimator.pred_dist(Xt)
Export to PMML
Convert NGBoost artifacts to PMML as usual:
from sklearn2pmml import sklearn2pmml
sklearn2pmml(pipeline, "NGBoostPipeline.pmml")
By default, the converter generates PMML that replicates the point-prediction behaviour of
predict(X) and
predict_proba(X) methods.
Activate the generation of NGBoost-specific prediction aspects on a per-aspect basis.
Extending the model schema with more output fields is easy. However, do not get carried away. Adding unnecessary or undocumented fields is likely a net negative, because it increases complexity for model consumers.
Supported configurations:
| Estimator | Distribution | Parameters | Prediction intervals |
|---|---|---|---|
NGBRegressor |
Normal |
loc,
scale |
Yes |
NGBRegressor |
LogNormal |
loc,
scale |
Yes |
NGBRegressor |
Poisson |
loc |
No |
NGBClassifier |
Bernoulli |
loc |
No |
NGBClassifier |
k_categorical |
loc(0),
loc(1), ...,
loc(<k-1>) |
No |
Distribution parameters
The predicted parameter values are available internally under pre-defined names.
For example, consider the scoring of log-normal distribution models.
The first two child models predict the
loc and
scale parameter values.
The last child model retrieves them by name, and calculates the log-normal mean of the target value as
exp(loc + exp(2.0 * scale) / 2.0).
If the application wants to reconstruct and perform custom work with the log-normal distribution instance on its side, then it needs the same
loc and
scale parameter values.
When exposing internal fields, give the final
OutputField elements more meaningful names, and apply relevant transformations.
This spares application developers from having to re-discover and re-implement the underlying statistics.
For example, expose the
loc parameter directly as
dist(mu), but exponentiate the
scale parameter before exposing it as
dist(sigma).
The SkLearn2PMML post-processing mechanism is the easiest way to generate and append custom output fields straight to the final model element, especially when working with complex model ensembles:
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.cross_reference import Recaller
from sklearn2pmml.decoration import Alias
from sklearn2pmml.preprocessing import ExpressionTransformer
def make_branch(in_name, expr, out_name):
recaller = Recaller(memory = None, names = [in_name])
transformer = Alias(ExpressionTransformer(expr), name = out_name, prefit = True)
return make_pipeline(recaller, transformer)
dist_output = FeatureUnion([
("loc", make_branch("loc", "X['loc'] + 0.0", "dist(mu)")),
("scale", make_branch("scale", "numpy.exp(X['scale'])", "dist(sigma)"))
])
pipeline.predict_transformer = dist_output
sklearn2pmml(pipeline, "NGBoostPipeline-dist.pmml")
Prediction intervals
For NGBoost regressors, the most common use of the
scale parameter is for calculating prediction intervals.
The SkLearn2PMML package supports the
confidence_level conversion option, which generates a pair of output fields (upper and lower bound) in one go.
Set conversion options using the
PMMLPipeline.configure(**pmml_options) convenience method:
pipeline = PMMLPipeline(...)
pipeline.fit(X, y)
pipeline.configure(confidence_level = ...)
Conversion options must be set after fitting, never before. The reason is that Scikit-Learn pipelines clone the steps on each fit, and the default cloning algorithm discards any custom attributes, including PMML conversion options.
Strictly speaking, the
confidence_level conversion option relates to total coverage (probability) via a symmetric two-sided prediction interval.
For example,
confidence_level = 0.90 represents a prediction interval from the 5th to the 95th percentile (90% coverage, 5% tails on each side).
Translation to quantiles:
confidence_level = 0.90
lower_quantile = (1 - confidence_level) / 2 # 0.05
upper_quantile = 1 - (1 - confidence_level) / 2 # 0.95
The converter handles
confidence_level values as follows:
| Value | Bounds | Input schema | Output schema |
|---|---|---|---|
boolean
False |
None | None | None |
float
0.0 < x < 1.0 |
Static | None |
lower(<target>) and
upper(<target>) |
boolean
True |
Dynamic |
Optional
ci field
|
Optional
lower(<target>) and
upper(<target>) |
| string | Dynamic | Optional user-named field |
Optional
lower(<target>) and
upper(<target>) |
No bounds
Set
confidence_level to the
False boolean value:
pipeline.configure(confidence_level = False)
The converter does not generate any extra output fields.
The
scale parameter is functionally redundant in normal distribution models.
If the application does not need it, then delete the corresponding component from the model chain.
Expect the size of the PMML document to be cut in half and, consequently, its scoring throughput to double.
Static bounds
Set
confidence_level to a float value between
0 and
1 (probability):
pipeline.configure(confidence_level = 0.90)
sklearn2pmml(pipeline, "NGBoostPipeline-static.pmml")
The converter hard-codes this value into critical z-value calculations.
The model always outputs valid
lower(<target>) and
upper(<target>) values.
Dynamic bounds
Set
confidence_level to the
True boolean value, or a non-empty string value.
In fact, setting
True is equivalent to setting
"ci":
pipeline.configure(confidence_level = "ci")
sklearn2pmml(pipeline, "NGBoostPipeline-dynamic.pmml")
The converter generates a
DataField element with that name, and references it in all critical z-value calculations.
The bounds respond dynamically to the confidence level value in model scoring arguments.
The model outputs missing
lower(<target>) and
upper(<target>) values if the input is missing.
Scoring with dynamic bounds:
from jpmml_evaluator import make_evaluator
evaluator = make_evaluator("NGBoostPipeline-dynamic.pmml") \
.verify()
pmml_X = X.copy()
# Same coverage for all samples
pmml_X["ci"] = 0.90
pmml_y = evaluator.evaluateAll(pmml_X)
print(pmml_y)
Applications that need even more dynamism should fall back to exporting distribution parameters. This allows reconstructing distribution instances locally, with full flexibility for querying and calculations.