In this example notebook, we will go step-by-step through the process of training and deploying an XGBoost fraud detection model using Triton's new FIL backend. Along the way, we'll show how to analyze the performance of a model deployed in Triton and optimize its performance based on specific SLA targets or other considerations.
While this notebook includes some code to train an XGBoost model, the focus is primarily on how to use that model with Triton. For more general information on XGBoost model training, see the official XGBoost tutorials.
This notebook assumes that you have Docker and libb64 installed, along with a few Python dependencies. To install all of the Python dependencies in a conda environment, you can use the following conda environment file:
---
name: triton_example
channels:
- conda-forge
- nvidia
- rapidsai
dependencies:
- cudatoolkit=11.4
- cudf=21.12
- cuml=21.12
- cupy
- jupyter
- kaggle
- matplotlib
- numpy
- pandas
- pip
- python=3.8
- scikit-learn
- pip:
  - tritonclient[all]
  - xgboost>=1.5,<1.6
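Assuming you save the file above as environment.yml (the filename is arbitrary), the environment can be created and activated with:
conda env create -f environment.yml
conda activate triton_example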
The tritonclient Python package requires that the libb64 library be available; it is typically installed via the system package manager.
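On Debian or Ubuntu systems, for example, it can typically be installed with a command like the following (the package name may differ on other distributions):
sudo apt-get install -y libb64-dev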
Note that due to a change in XGBoost's JSON serialization format, Triton will not be able to load JSON-serialized models from XGBoost 1.6 until Triton version 22.07.
Categorical variable support was added to the Triton FIL backend in release 21.12 and to XGBoost in release 1.5. If you would like to use an earlier version of either of these, or if you simply wish to see how the same workflow would go without explicit categorical variable support, you may set the USE_CATEGORICAL variable in the following cell to False. Otherwise, by leaving it as True, you can take advantage of categorical variable support.
Please note that categorical variable support is still considered experimental in XGBoost 1.5.
USE_CATEGORICAL = True
TRITON_IMAGE = 'nvcr.io/nvidia/tritonserver:21.12-py3'
!docker pull {TRITON_IMAGE}
For this example, we will make use of data from the IEEE-CIS Fraud Detection Kaggle competition. You may fetch the data for this competition with the Kaggle command line client using the following commands.
NOTE: You will need to make sure that your Kaggle credentials are available either through a kaggle.json file or via environment variables.
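For example, the Kaggle client also honors the KAGGLE_USERNAME and KAGGLE_KEY environment variables; a minimal sketch with placeholder values:
import os
# Placeholder credentials: substitute your own Kaggle API username and key
os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_api_key'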
!kaggle competitions download -c ieee-fraud-detection
!unzip -u ieee-fraud-detection.zip
train_csv = 'train_transaction.csv'
While the IEEE-CIS Kaggle competition focused on a more sophisticated problem involving analysis of both fraudulent transactions and the users linked to those transactions, we will use a simpler version of that problem (identifying fraudulent transactions only) to build our example model. In the following steps, we make use of cuML's preprocessing tools to clean the data and then train two example models using XGBoost. Note that we will be making use of the new categorical feature support in XGBoost 1.5. If you wish to use an earlier version of XGBoost, you will need to perform a label encoding on the categorical features.
import cudf
import cupy as cp
from cuml.preprocessing import SimpleImputer
if not USE_CATEGORICAL:
    from cuml.preprocessing import LabelEncoder
# Due to an upstream bug, cuML's train_test_split function is
# currently non-deterministic. We will therefore use sklearn's
# train_test_split in this example to obtain more consistent
# results.
from sklearn.model_selection import train_test_split
SEED=0
# Load data from CSV files into cuDF DataFrames
data = cudf.read_csv(train_csv)
# Replace NaNs in data
nan_columns = data.columns[data.isna().any().to_pandas()]
float_nan_subset = data[nan_columns].select_dtypes(include='float64')
imputer = SimpleImputer(missing_values=cp.nan, strategy='mean')
data[float_nan_subset.columns] = imputer.fit_transform(float_nan_subset)
obj_nan_subset = data[nan_columns].select_dtypes(include='object')
data[obj_nan_subset.columns] = obj_nan_subset.fillna('UNKNOWN')
# Convert string columns to categorical or perform label encoding
cat_columns = data.select_dtypes(include='object')
if USE_CATEGORICAL:
    data[cat_columns.columns] = cat_columns.astype('category')
else:
    for col in cat_columns.columns:
        data[col] = LabelEncoder().fit_transform(data[col])
# Split data into training and testing sets
X = data.drop('isFraud', axis=1)
y = data.isFraud.astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X.to_pandas(), y.to_pandas(), test_size=0.3, stratify=y.to_pandas(), random_state=SEED
)
# Copy data to avoid slowdowns due to fragmentation
X_train = X_train.copy()
X_test = X_test.copy()
import xgboost as xgb
# Define model training function
def train_model(num_trees, max_depth):
    model = xgb.XGBClassifier(
        tree_method='gpu_hist',
        enable_categorical=USE_CATEGORICAL,
        use_label_encoder=False,
        predictor='gpu_predictor',
        eval_metric='aucpr',
        objective='binary:logistic',
        max_depth=max_depth,
        n_estimators=num_trees
    )
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_test, y_test)]
    )
    return model
# Train a small model with just 500 trees and a maximum depth of 3
small_model = train_model(500, 3)
# Train a large model with 5000 trees and a maximum depth of 12
large_model = train_model(5000, 12)
# Free up some room on the GPU by explicitly deleting dataframes
import gc
del data
del nan_columns
del float_nan_subset
del imputer
del obj_nan_subset
del cat_columns
del X
del y
gc.collect()
Now that we have two example models to work with, let's actually deploy them for real-time serving using Triton. In order to do so, we will need to first serialize the models in the directory structure that Triton expects and then add configuration files to tell Triton exactly how we wish to use these models.
Triton models can be stored locally on disk or in S3, Google Cloud Storage, or Azure Storage. For this example, we will stick to local storage, but information about using cloud storage solutions can be found here. Each model has a dedicated directory within a main model repository directory. Multiple versions of a model can also be served by Triton, as indicated by numbered directories (see below).
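For reference, the repository we are about to build will look roughly like this (directory and file names match the code that follows):
model_repository/
    small_model/
        config.pbtxt
        1/
            xgboost.json
    small_model-cpu/
        config.pbtxt
        1/
            xgboost.json
    ...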
import os
# Create the model repository directory. The name of this directory is arbitrary.
REPO_PATH = os.path.abspath('model_repository')
os.makedirs(REPO_PATH, exist_ok=True)
def serialize_model(model, model_name):
    # The name of the model directory determines the name of the model as reported
    # by Triton
    model_dir = os.path.join(REPO_PATH, model_name)
    # We can store multiple versions of the model in the same directory. In our
    # case, we have just one version, so we will add a single directory, named '1'.
    version_dir = os.path.join(model_dir, '1')
    os.makedirs(version_dir, exist_ok=True)
    # The default filename for XGBoost models saved in json format is 'xgboost.json'.
    # It is recommended that you use this filename to avoid having to specify a
    # name in the configuration file.
    model_file = os.path.join(version_dir, 'xgboost.json')
    model.save_model(model_file)
    return model_dir
We will be deploying two copies of each of our example models: one on CPU and one on GPU. We will use these separate instances to demonstrate the performance differences between GPU and CPU execution later on.
small_model_dir = serialize_model(small_model, 'small_model')
small_model_cpu_dir = serialize_model(small_model, 'small_model-cpu')
large_model_dir = serialize_model(large_model, 'large_model')
large_model_cpu_dir = serialize_model(large_model, 'large_model-cpu')
The configuration file associated with a model tells Triton a little bit about the model itself and how you would like to use it. You can read about all generic Triton configuration options here and about configuration options specific to the FIL backend here, but we will focus on just a few of the most common and relevant options in this example. Below are general descriptions of these options:
- max_batch_size: the maximum number of samples Triton will process in a single batch
- input / output: the name, datatype, and dimensions of the model's input and output tensors
- instance_group: whether a model instance should run on CPU (KIND_CPU) or GPU (KIND_GPU)
- model_type: the serialization format of the model ("xgboost_json" for JSON-serialized XGBoost models)
- predict_proba: whether to return per-class probability scores rather than just a class prediction
- output_class: whether the model is a classifier (true) or a regressor (false)
- threshold: the score threshold used to assign class labels
- storage_type: the in-memory format FIL uses to store the forest ("AUTO" lets FIL choose)
- dynamic_batching / max_queue_delay_microseconds: how long the server may hold a request in order to batch it with others
Based on this information, let's set up configuration files for our models.
# Maximum size in bytes for input and output arrays. If you are
# using Triton 21.11 or higher, all memory allocations will make
# use of Triton's memory pool, which has a default size of
# 67_108_864 bytes. This can be increased using the
# `--cuda-memory-pool-byte-size` option when the server is
# started, but this notebook should work fine with default
# settings.
MAX_MEMORY_BYTES = 60_000_000
features = X_test.shape[1]
num_classes = cp.unique(y_test).size
bytes_per_sample = (features + num_classes) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample
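If you do need a larger CUDA memory pool than the default, it can be set when the server is launched; a hedged example of the flag syntax (the 256 MiB value and GPU index 0 are purely illustrative, not values this notebook requires):
tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:268435456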
def generate_config(model_dir, deployment_type='gpu', storage_type='AUTO'):
    if deployment_type.lower() == 'cpu':
        instance_kind = 'KIND_CPU'
    else:
        instance_kind = 'KIND_GPU'
    config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [
{{
name: "input__0"
data_type: TYPE_FP32
dims: [ {features} ]
}}
]
output [
{{
name: "output__0"
data_type: TYPE_FP32
dims: [ {num_classes} ]
}}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
{{
key: "model_type"
value: {{ string_value: "xgboost_json" }}
}},
{{
key: "predict_proba"
value: {{ string_value: "true" }}
}},
{{
key: "output_class"
value: {{ string_value: "true" }}
}},
{{
key: "threshold"
value: {{ string_value: "0.5" }}
}},
{{
key: "storage_type"
value: {{ string_value: "{storage_type}" }}
}}
]
dynamic_batching {{
max_queue_delay_microseconds: 100
}}"""
    config_path = os.path.join(model_dir, 'config.pbtxt')
    with open(config_path, 'w') as file_:
        file_.write(config_text)
    return config_path
generate_config(small_model_dir, deployment_type='gpu')
generate_config(small_model_cpu_dir, deployment_type='cpu')
generate_config(large_model_dir, deployment_type='gpu')
generate_config(large_model_cpu_dir, deployment_type='cpu')
With valid models and configuration files in place, we can now start the server. Below, we do so, use the Python client to wait for it to come fully online, and then check the logs to make sure we didn't get any unexpected warnings or errors while loading the models.
!docker run --gpus all -d -p 8000:8000 -p 8001:8001 -p 8002:8002 -v {REPO_PATH}:/models --name tritonserver {TRITON_IMAGE} tritonserver --model-repository=/models
import time
import tritonclient.grpc as triton_grpc
from tritonclient import utils as triton_utils
HOST = 'localhost'
PORT = 8001
TIMEOUT = 60
client = triton_grpc.InferenceServerClient(url=f'{HOST}:{PORT}')
# Wait for server to come online
server_start = time.time()
while True:
    try:
        if client.is_server_ready() or time.time() - server_start > TIMEOUT:
            break
    except triton_utils.InferenceServerException:
        pass
    time.sleep(1)
!docker logs tritonserver
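You can also confirm programmatically that each model loaded successfully using the same gRPC client (the model names match the directories created earlier):
for name in ['small_model', 'small_model-cpu', 'large_model', 'large_model-cpu']:
    print(name, 'ready:', client.is_model_ready(name))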
With our models now deployed on a running Triton server, let's confirm that we get the same results from the deployed model as we get locally. Note that we will occasionally see slight divergences due to floating point errors during parallel execution, but otherwise, results should match.
If your model uses categorical features, a certain amount of care must be taken when submitting inference requests, just as when executing the model locally. Both XGBoost and LightGBM rely on the input dataframe to convert categories into numeric values, so if data is later submitted from a dataframe containing a different subset of categories, that numeric conversion will not be handled consistently. In this example, we submit data from the same dataframe we used for testing, so we need not worry about this; otherwise, we would need to record the mapping used for the .codes attribute of each categorical feature in the training dataframe and make sure the same codes were used when submitting inference requests.
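For illustration, here is a minimal sketch of how the training-time categories could be recorded and reused to encode a hypothetical new dataframe (new_df) consistently:
import pandas as pd
# Record the category ordering used at training time (pandas categorical dtype)
train_categories = {
    col: X_train[col].cat.categories
    for col in X_train.select_dtypes('category').columns
}
# Re-encode a new dataframe with the same codes; unseen categories become -1
for col, categories in train_categories.items():
    new_df[col] = pd.Categorical(new_df[col], categories=categories).codes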
import pandas as pd
def convert_to_numpy(df):
    df = df.copy()
    cat_cols = df.select_dtypes('category').columns
    for col in cat_cols:
        df[col] = df[col].cat.codes
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df.values
np_data = convert_to_numpy(X_test).astype('float32')
def triton_predict(model_name, arr):
    triton_input = triton_grpc.InferInput('input__0', arr.shape, 'FP32')
    triton_input.set_data_from_numpy(arr)
    triton_output = triton_grpc.InferRequestedOutput('output__0')
    response = client.infer(model_name, model_version='1', inputs=[triton_input], outputs=[triton_output])
    return response.as_numpy('output__0')
triton_result = triton_predict('small_model', np_data[0:5])
local_result = small_model.predict_proba(X_test[0:5])
print("Result computed on Triton: ")
print(triton_result)
print("\nResult computed locally: ")
print(local_result)
cp.testing.assert_allclose(triton_result, local_result, rtol=1e-6, atol=1e-6)
Triton offers several tools to help tune your model deployment parameters and optimize your target metrics, whether that be throughput, latency, device utilization, or some other measure of performance. Some of these optimizations depend on expected server load and whether inference requests will be submitted in batches or one at a time from clients. As we shall see, Triton's performance analysis tools allow you to test performance based on a wide range of anticipated scenarios and modify deployment parameters accordingly.
For this example, we will make use of Triton's perf_analyzer
tool, which allows us to quickly measure throughput and latency based on different batch sizes and request concurrency. We'll start with a basic comparison of the performance of our large model deployed on CPU vs GPU with batch size 1 and no concurrency.
All of the specific performance numbers here were obtained on a DGX-1 with 8 V100s and Triton 21.11, but your numbers may vary depending on available hardware and whether or not you chose to enable categorical features.
# Analyze performance of our large model on CPU.
# By default, perf_analyzer uses batch size 1 and concurrency 1.
!perf_analyzer -m large_model-cpu
# Let's now get the same performance numbers for GPU execution
!perf_analyzer -m large_model
Already, we can see that GPU execution offers substantially improved throughput at lower latency for this complex model, but let's see what happens when we look at higher batch sizes or request load.
# Measure performance with batch size 6 and a concurrency of 6 for
# request submissions
!perf_analyzer -m large_model-cpu -b 6 --concurrency-range 6:6
!perf_analyzer -m large_model -b 6 --concurrency-range 6:6
As we can see, deployed on CPU, the model was able to offer a somewhat increased throughput at higher load, but latency increased dramatically. Meanwhile, the same model deployed on the GPU significantly increased its throughput with only a slight increase in latency.
In order to maintain a tight latency budget on a CPU-only server under high request load, we would have to turn to a significantly less sophisticated model. Let's imagine that we were trying to keep our p99 latency under 2 ms on the DGX machine referred to above. With the small model on CPU, we can just barely stay under that budget with a batch size of 6 and a concurrency of 6. Deploying the same model on GPU with the same parameters, we can keep our p99 latency under 0.7 ms and offer 3.5X the throughput.
!perf_analyzer -m small_model-cpu -b 6 --concurrency-range 6:6
!perf_analyzer -m small_model -b 6 --concurrency-range 6:6
Let's see how far we can push our large model on GPU while staying within our 2 ms latency budget.
!perf_analyzer -m large_model -b 80 --concurrency-range 8:8
On the GPU, this larger model can achieve 20X the throughput of the smaller model on CPU, allowing us to handle a substantially higher load. But of course throughput performance is only part of the picture. If our latency budget forces us to use a smaller model on CPU, how much worse will we do at actually detecting fraud? Let's compute results for the entire test dataset using the large and small models and then compare their precision-recall curves to see how much we may be losing by resorting to the smaller model for CPU deployments.
import numpy as np
import cuml
GPU_COUNT = 8
def create_batches(arr):
    # Determine how many chunks are needed to keep size <= max_batch_size
    chunks = (
        arr.shape[0] // max_batch_size +
        int(bool(arr.shape[0] % max_batch_size) or arr.shape[0] < max_batch_size)
    )
    return np.array_split(arr, max(GPU_COUNT, chunks))
%time large_model_results = np.concatenate([triton_predict('large_model', chunk) for chunk in create_batches(np_data)])
%time small_model_results = np.concatenate([triton_predict('small_model-cpu', chunk) for chunk in create_batches(np_data)])
Note that we can process the full dataset more quickly on GPU, even with a significantly more sophisticated model than the one used for our CPU deployment. As an interesting point of comparison, thanks to the optimized inference performance of the RAPIDS Forest Inference Library (FIL) used by the Triton backend and Triton's inherent ability to parallelize over available GPUs, it is even faster to submit these samples to Triton than to process them locally with XGBoost for the larger model, despite the overhead of data transfer. For information about invoking FIL directly in Python without Triton, see the FIL documentation.
%time large_model.predict_proba(X_test)
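For comparison, here is a hedged sketch of loading the same serialized model directly into FIL via cuML's ForestInference class (the file path and parameters are assumptions based on the files written by serialize_model above):
from cuml import ForestInference
# Load the JSON-serialized XGBoost model written earlier and return per-class probabilities
fil_model = ForestInference.load(
    os.path.join(large_model_dir, '1', 'xgboost.json'),
    model_type='xgboost_json',
    output_class=True
)
fil_probs = fil_model.predict_proba(np_data)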
We now return to evaluating the benefit of the larger model for accurately detecting fraud by computing precision-recall curves for both the small and large models.
large_precision, large_recall, _ = cuml.metrics.precision_recall_curve(y_test, large_model_results[:, 1])
small_precision, small_recall, _ = cuml.metrics.precision_recall_curve(y_test, small_model_results[:, 1])
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(small_precision, small_recall, color='#0071c5')
plt.plot(large_precision, large_recall, color='#76b900')
plt.title('Precision vs Recall for Small and Large Models')
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.show()
As we can see, the larger, more sophisticated model dominates the smaller model all along this curve. By deploying our model on GPU, we can identify a far greater proportion of actual fraud incidents with fewer false positives, all without going over our latency budget.
# Shut down the server
!docker rm -f tritonserver
In this example notebook, we showed how to deploy an XGBoost model in Triton using the new FIL backend. While it is possible to deploy these models on both CPU and GPU in Triton, GPU-deployed models offer far higher throughput at lower latency. As a result, we can deploy more sophisticated models on the GPU for any given latency budget and thereby obtain far more accurate results.
While we have focused on XGBoost in this example, FIL also natively supports LightGBM's text serialization format as well as Treelite's checkpoint format. Thus, the same general steps can be used to serve LightGBM models and any Treelite-convertible model (including Scikit-Learn and cuML forest models). With the new FIL backend, Triton is now ready to serve forest models of all kinds in production, whether on their own or in concert with any of the deep-learning models supported by Triton.