
Integration testing

Overview

JPMML software allows the deployment of non-Java ML framework artifacts on the Java/JVM platform. Naturally, for business-critical applications, there needs to be ample proof that the quality of predictions did not deteriorate during the transition.

Such guarantees can be given and maintained long-term via integration testing.

The original ML artifact makes predictions on a reference dataset, and these results are stored as "correct answers". This artifact is then converted to the PMML representation. The PMML artifact also makes predictions on the same reference dataset, and the results are compared against the "correct answers" on a per-sample basis (ie. not on a population statistics basis). The test passes if there are no differences beyond the stated acceptance criteria.

JPMML-Evaluator includes a testing harness, which implements this idea by asserting that a CSV input file scored with a PMML model file yields the expected result CSV file:

<Input CSV> + <PMML> == <Result CSV>

Workflow

Apache Maven configuration

Create a new directory, and initialize a tests-only Apache Maven project into it:

touch pom.xml
mkdir -p src/test/java
mkdir -p src/test/resources/csv
mkdir -p src/test/resources/pmml

Correct directory layout:

./
├── src/
│   └── test/
│       ├── java/
│       └── resources/
│           ├── csv/
│           └── pmml/
└── pom.xml

Configure the Project Object Model (POM) file to handle the test lifecycle phase. This typically involves declaring a JUnit dependency, as well as the maven-compiler-plugin and maven-surefire-plugin plugins.
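For reference, a minimal manual configuration might look like the following sketch (placed inside the <project> element; the dependency and plugin versions shown here are illustrative placeholders and should be replaced with current releases):

<dependencies>
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter</artifactId>
        <version>5.10.2</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.13.0</version>
            <configuration>
                <!-- Match the Java version required by the chosen JPMML-Evaluator version -->
                <release>11</release>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>3.2.5</version>
        </plugin>
    </plugins>
</build>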

To get started quickly, inherit everything from the org.jpmml:jpmml-parent parent POM (substituting the ${jpmml-parent.version} placeholder with a concrete jpmml-parent release version):

<?xml version="1.0" ?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.jpmml</groupId>
        <artifactId>jpmml-parent</artifactId>
        <version>${jpmml-parent.version}</version>
    </parent>

    <groupId>com.mycompany</groupId>
    <artifactId>myapp-integration-testing</artifactId>
    <version>1.0-SNAPSHOT</version>
</project>

The only "real" dependencies to declare are org.jpmml:pmml-evaluator-metro and org.jpmml:pmml-evaluator-testing. Pin their version(s) to match the JPMML-Evaluator version that serves the actual business case.

<properties>
    <jpmml-evaluator.version>1.7.7</jpmml-evaluator.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.jpmml</groupId>
        <artifactId>pmml-evaluator-metro</artifactId>
        <version>${jpmml-evaluator.version}</version>
    </dependency>
    <dependency>
        <groupId>org.jpmml</groupId>
        <artifactId>pmml-evaluator-testing</artifactId>
        <version>${jpmml-evaluator.version}</version>
    </dependency>
</dependencies>

Consider extracting dependency versions into POM properties to make them overrideable using -D<property>=value command-line options. For example, this pattern allows running the test suite with different JPMML-Evaluator versions without touching any code or configuration:

mvn -Djpmml-evaluator.version=1.7.7 test
mvn -Djpmml-evaluator.version=1.7.0 test
mvn -Djpmml-evaluator.version=1.6.11 test

Resource files

Generate CSV and PMML files into src/test/resources subdirectories:

./
├── src/
│   └── test/
│       ├── java/
│       └── resources/
│           ├── csv/
│           │   ├── Iris.csv
│           │   ├── DecisionTreeIris.csv
│           │   └── LogisticRegressionIris.csv
│           └── pmml/
│               ├── DecisionTreeIris.pmml
│               └── LogisticRegressionIris.pmml
└── pom.xml

Each test case is identified by a two-part name ${algorithm}${dataset}. The ${algorithm} part reflects the most characteristic algorithm of the ML artifact (eg. decision tree, logistic regression), while the ${dataset} part identifies the dataset.

Input CSV

Store the input dataset in the csv/${dataset}.csv CSV file. All JPMML testing harnesses expect the comma (,) as the field separator, and treat N/A (Pandas/Python style) and NA (R style) as placeholders for missing values.

The input dataset is meant to be shared between related test cases. For maximum clarity, it should contain only active feature columns, and never the target column.

Generating the "Iris" input CSV file:

from pandas import DataFrame, Series
from sklearn.datasets import load_iris

iris = load_iris()

X = DataFrame(iris.data, columns = iris.feature_names)
y = Series(iris.target, name = "Species").map(lambda x: iris.target_names[x])

def store_csv(df, name):
    df.to_csv("csv/" + name + ".csv", header = True, index = False, sep = ",", na_rep = "N/A")

store_csv(X, "Iris")

PMML

Train the ML artifact. Convert and store its PMML representation in the pmml/${algorithm}${dataset}.pmml PMML file. It should go without saying that the test case should mimic the actual business case as closely as possible.

Generating the "DecisionTreeIris" PMML file:

from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml

def store_pmml(obj, name):
    sklearn2pmml(obj, "pmml/" + name + ".pmml")

classifier = DecisionTreeClassifier(random_state = 42)
classifier.fit(X, y)

store_pmml(classifier, "DecisionTreeIris")

Result CSV

Use the trained ML artifact to make predictions on the input dataset. Store the results dataset in the csv/${algorithm}${dataset}.csv CSV file.

The test case should cover all predictive functionality. For example, for probabilistic classifiers, it should check both class labels and their estimated probabilities.

The bigger and "more numeric" the results dataset is, the more effective it is at detecting and diagnosing potential error conditions. For example, checking a class label alone would only catch a fraction of mis-predicted samples (ie. those that cross the decision boundary). If the dataset is small and the logic error behind mis-predictions is subtle enough, then there might not be any visible signal at all.

Generating the "DecisionTreeIris" CSV file:

yt = DataFrame()
yt[y.name] = classifier.predict(X)
yt[["probability({})".format(cls) for cls in classifier.classes_]] = classifier.predict_proba(X)

store_csv(yt, "DecisionTreeIris")

Java test class

Create a Java source file in the src/test/java subdirectory:

./
├── src/
│   └── test/
│       ├── java/
│       │   └── SkLearnTest.java
│       └── resources/
└── pom.xml

The class name can be anything, as long as it ends with Test or Tests. The class must extend the org.jpmml.evaluator.testing.SimpleArchiveBatchTest abstract base class, and declare a no-argument constructor that invokes the SimpleArchiveBatchTest(com.google.common.base.Equivalence equivalence) super-constructor with the desired equivalence strategy object:

import org.jpmml.evaluator.testing.PMMLEquivalence;
import org.jpmml.evaluator.testing.SimpleArchiveBatchTest;

public class SkLearnTest extends SimpleArchiveBatchTest {

    public SkLearnTest(){
        super(new PMMLEquivalence(1e-13, 1e-13));
    }
}

Equivalence strategies differ in their handling of continuous floating-point values (ie. continuous+float and continuous+double). All other types of values are simply checked for equality.

JPMML-Evaluator provides two equivalence strategies:

  • org.jpmml.evaluator.testing.PMMLEquivalence(double precision, double zeroThreshold). Compares values against the zero threshold parameter to determine zeroness. If both values are non-zero, calculates the difference between them, and compares it against the precision parameter. This procedure is described in full detail in the Model verification chapter of the PMML specification.
  • org.jpmml.evaluator.testing.RealNumberEquivalence(int tolerance). Calculates the difference between two values in Unit of Least Precision (ULP) terms, and compares it against the tolerance parameter.

Most test cases should work fine with PMMLEquivalence with both parameters set in the 1e-12 to 1e-14 range. This configuration is guaranteed to catch all functionally significant differences, while still permitting some noise that stems from low-level floating-point operations and functions.
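If ULP-based comparison is preferred instead, the constructor can pass a RealNumberEquivalence. A minimal sketch (the tolerance of 4 ULPs is an arbitrary choice for illustration):

import org.jpmml.evaluator.testing.RealNumberEquivalence;
import org.jpmml.evaluator.testing.SimpleArchiveBatchTest;

public class SkLearnTest extends SimpleArchiveBatchTest {

    public SkLearnTest(){
        // Accept differences of up to 4 Units of Least Precision (ULP)
        super(new RealNumberEquivalence(4));
    }
}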

The test class should declare a separate method for each test case:

import org.junit.jupiter.api.Test;

public class SkLearnTest extends SimpleArchiveBatchTest {

    @Test
    public void evaluateDecisionTreeIris() throws Exception {
        evaluate("DecisionTree", "Iris");
    }

    @Test
    public void evaluateLogisticRegressionIris() throws Exception {
        evaluate("LogisticRegression", "Iris");
    }
}

Test methods must be marked with the org.junit.jupiter.api.Test annotation for automated discovery.

The method name can be anything. However, it is advisable to use meaningful and consistent names that relate directly to test cases (ie. containing ${algorithm}${dataset} in some form), because Apache Maven prints failing method names to the console.

The method body invokes the most appropriate ArchiveBatchTest#evaluate(String algorithm, String dataset) method. There are overloaded variants available, which enable tightening or loosening class-level configuration on a per-method basis, such as excluding some result fields from checking, or using a different equivalence strategy.

For example, with probabilistic classifiers, there are two or more estimated probability fields whose values sum to 1.0 by definition. It may be beneficial to exclude the column with the smallest magnitude values (because relative errors become less stable as the operands approach zero), in order to tighten the acceptance criteria for the remaining column(s).

public class SkLearnTest extends SimpleArchiveBatchTest {

    public SkLearnTest(){
        super(new PMMLEquivalence(1e-13, 1e-13));
    }

    @Test
    public void evaluateLogisticRegressionIris() throws Exception {
        evaluate("LogisticRegression", "Iris", excludeColumns("probability(virginica)"), new PMMLEquivalence(1e-14, 1e-14));
    }
}

Test methods are able to locate and load the backing CSV and PMML resource files based on the algorithm and dataset parameter values. There is no extra wiring needed.
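For example, following the naming conventions above, the evaluate("DecisionTree", "Iris") call resolves these test classpath resources:

csv/Iris.csv                 (input dataset)
pmml/DecisionTreeIris.pmml   (PMML model)
csv/DecisionTreeIris.csv     (expected results)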

Execution

Navigate to the root of the project directory (where the POM file is) and execute the test lifecycle phase:

mvn clean test

Apache Maven will collect, compile and run all Java test classes, and print a short summary to the console when done. Additional build side-artifacts are available in the target directory. Specifically, the maven-surefire-plugin plugin (the default handler for the test lifecycle phase) generates detailed TXT and XML reports into the target/surefire-reports directory.

During the first run, make sure that all the intended test fixtures are discovered.

This example project has one test class (SkLearnTest) with two test methods (evaluateDecisionTreeIris and evaluateLogisticRegressionIris). The print-out confirms that they were all executed successfully:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running SkLearnTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.556 s -- in SkLearnTest
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
[INFO] 

The same testing harness can be re-purposed from integration testing to basic performance testing. Simply increase the complexity of the algorithms and the size of the datasets so that the JPMML-Evaluator "working time" outweighs the Apache Maven and JUnit "working times" by a sufficient margin. The current elapsed time of 1.556 seconds is almost all framework overhead.

For demonstration purposes, consider breaking the build by over-tightening the equivalence strategy from PMMLEquivalence(1e-13, 1e-13) to PMMLEquivalence(1e-15, 1e-15).

The print-out becomes:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running SkLearnTest
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.606 s <<< FAILURE! -- in SkLearnTest
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   SkLearnTest.evaluateLogisticRegressionIris:18->ArchiveBatchTest.evaluate:41->ArchiveBatchTest.evaluate:63->BatchTest.evaluate:37->BatchTest.checkConflicts:48 Found 64 conflict(s)
[INFO] 
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0
[INFO] 

The evaluateDecisionTreeIris test method keeps passing, but the evaluateLogisticRegressionIris test method fails after having detected 64 conflicts.

A test class can install a custom conflict handler by overriding the BatchTest#checkConflicts(List<Conflict> conflicts) method. The default handler prints conflicts to the system error stream, and then fails by throwing an org.opentest4j.AssertionFailedError.
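For illustration, a sketch of such an override (assuming that the Conflict class resides in the org.jpmml.evaluator.testing package; the summary line is arbitrary):

import java.util.List;

import org.jpmml.evaluator.testing.Conflict;
import org.jpmml.evaluator.testing.PMMLEquivalence;
import org.jpmml.evaluator.testing.SimpleArchiveBatchTest;

public class SkLearnTest extends SimpleArchiveBatchTest {

    public SkLearnTest(){
        super(new PMMLEquivalence(1e-13, 1e-13));
    }

    @Override
    public void checkConflicts(List<Conflict> conflicts){
        // Print a one-line summary before delegating to the default handler,
        // which prints individual conflicts and throws an AssertionFailedError
        System.err.println("Found " + conflicts.size() + " conflict(s)");

        super.checkConflicts(conflicts);
    }
}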

Sanitize the Apache Maven build output by redirecting all conflicts to a separate file:

mvn clean test 2> conflicts.txt

Resources