Python

Dataset

DMatrix

The most suitable data container for data preparation work is Pandas' dataframe.

Group columns by their intended type into subsets, and cast them using the astype(dtype) method.

For continuous columns, specify the "smallest" appropriate NumPy or Pandas' data type to reclaim memory. Nullable Pandas' data types such as pandas.Float32DType and pandas.Int32Dtype shine with mixed-type sparse datasets, because they ensure consistent representation of missing values as pandas.NA.

For categorical columns, specify the Pandas' categorical data type.

X = X[float_cols + int_cols + cat_cols]

X[float_cols] = X[float_cols].astype("Float32")
X[int_cols] = X[int_cols].astype("Int32")
X[cat_cols] = X[cat_cols].astype("category")

Prepare a feature_types helper parameter to freeze the interpretation of feature data.

feature_types = ["float"] * len(float_cols) + ["int"] * len(int_cols) + ["c"] * len(cat_cols)

Categorical targets are similar to categorical features. However, to impose a custom order, specify the data type as a fully initialized pandas.CategoricalDtype object instead of a "category" shorthand.

from pandas import CategoricalDtype

y = y.astype(CategoricalDtype(categories = ["zero", "one"]))

Combining everything into a DMatrix object:

from xgboost import DMatrix

dmat = DMatrix(data = X, label = y.cat.codes, feature_types = feature_types, enable_categorical = True)

Workflow

Training:

import xgboost

booster = xgboost.train(params = {"objective" : "binary:logistic"}, dtrain = dmat)
booster.save_model("Booster.json")

Export to PMML

Convert booster objects to PMML using the sklearn2pmml package.

The sklearn2pmml.sklearn2pmml(obj, pmml_path) supports a small number of non-SkLearn artifacts, with LightGBM and XGBoost boosters being prime examples.

A booster object embeds a simplified data schema. If the dataset contains categorical features, then override it with an external feature map so that the PMML converter can translate category level indices back to original values.

The sklearn2pmml.xgboost.make_feature_map(X) utility function can generate various types of feature maps. The generation of "pre-categorical" feature maps is activated by passing category_to_indicator = True.

from sklearn2pmml.xgboost import make_feature_map

fmap = make_feature_map(X, category_to_indicator = True)
fmap.save("Booster.fmap.tsv")

Attach the feature map to the booster using the fmap attribute:

booster.fmap = fmap

Converting:

from sklearn2pmml import sklearn2pmml

sklearn2pmml(booster, "Booster.pmml")

The sklearn2pmml package currently does not support customizing the target field.

Resources

Notebook: View or Download