Distributed Fraud Detection with XGBoost and Dask on Amazon SageMaker

Train an XGBoost fraud detection model at scale using distributed multi-GPU training with Dask on the SageMaker XGBoost Deep Learning Container in Algorithm mode.

Overview

Fraud detection systems process millions of transactions and must retrain frequently as attack patterns evolve. This tutorial demonstrates how to use SageMaker's built-in XGBoost algorithm with Dask-based distributed GPU training to handle large-scale, imbalanced fraud datasets efficiently.

What you'll learn:
- Use the SageMaker XGBoost DLC in Algorithm mode (no custom training script needed)
- Generate a realistic synthetic fraud dataset with class imbalance
- Run distributed multi-GPU training with Dask
- Handle class imbalance with scale_pos_weight
- Partition data correctly for Dask-based training

Why distributed GPU training?
- Train on datasets with millions of rows in minutes instead of hours
- Dask utilizes all GPUs across one or more instances
- Cost-effective: faster training can lower total compute cost despite a higher hourly rate
- Available since XGBoost 1.5-1 on SageMaker

Prerequisites

AWS account with SageMaker permissions
AWS CLI configured
Python 3.8+ with boto3, sagemaker, pandas, scikit-learn installed
An S3 bucket for training data and model artifacts

Files

run_tutorial.py - End-to-end orchestration: synthetic data generation, training, deployment, inference, cleanup

Quick Start

1. Install Dependencies

pip install boto3 sagemaker pandas scikit-learn

2. Set Environment Variables

export SAGEMAKER_ROLE="arn:aws:iam::<account-id>:role/<SageMakerExecutionRole>"
export S3_BUCKET="<your-s3-bucket>"

3. Run the Tutorial

# Single multi-GPU instance (recommended starting point)
python run_tutorial.py \
  --role "$SAGEMAKER_ROLE" \
  --bucket "$S3_BUCKET" \
  --instance-type ml.g5.12xlarge \
  --instance-count 1

# Scale out: 2 multi-GPU instances
python run_tutorial.py \
  --role "$SAGEMAKER_ROLE" \
  --bucket "$S3_BUCKET" \
  --instance-type ml.g5.12xlarge \
  --instance-count 2 \
  --num-samples 2000000

Command Line Options

--role - SageMaker execution role ARN (required)
--bucket - S3 bucket for data and artifacts (required)
--region - AWS region (default: us-west-2)
--image-uri - XGBoost container image URI (default: auto-generated for region)
--instance-type - Training instance type (default: ml.g5.12xlarge)
--instance-count - Number of training instances (default: 1)
--deploy-instance-type - Endpoint instance type (default: ml.m5.large)
--num-samples - Number of synthetic transactions (default: 500000)
--fraud-rate - Fraction of fraudulent transactions (default: 0.02)
--num-round - Number of XGBoost boosting rounds (default: 200)
--max-depth - Maximum tree depth (default: 8)
--skip-deploy - Skip deployment and inference
--skip-cleanup - Skip endpoint cleanup
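
The options above map naturally onto an argparse parser. The following is a hypothetical sketch of how such a parser could look, not the actual contents of run_tutorial.py. The function name build_parser and the placeholder role ARN and bucket are assumptions; the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(description="Distributed XGBoost fraud detection tutorial")
    p.add_argument("--role", required=True, help="SageMaker execution role ARN")
    p.add_argument("--bucket", required=True, help="S3 bucket for data and artifacts")
    p.add_argument("--region", default="us-west-2")
    p.add_argument("--image-uri", default=None, help="None means: derive from the region")
    p.add_argument("--instance-type", default="ml.g5.12xlarge")
    p.add_argument("--instance-count", type=int, default=1)
    p.add_argument("--deploy-instance-type", default="ml.m5.large")
    p.add_argument("--num-samples", type=int, default=500_000)
    p.add_argument("--fraud-rate", type=float, default=0.02)
    p.add_argument("--num-round", type=int, default=200)
    p.add_argument("--max-depth", type=int, default=8)
    p.add_argument("--skip-deploy", action="store_true")
    p.add_argument("--skip-cleanup", action="store_true")
    return p


# Placeholder values for the two required arguments.
args = build_parser().parse_args(
    ["--role", "arn:aws:iam::123456789012:role/Example", "--bucket", "example-bucket"]
)
print(args.instance_type, args.instance_count, args.num_samples)
```

Only --role and --bucket are required; every other option falls back to the documented default.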

Step-by-Step Walkthrough

Step 1: Generate Synthetic Fraud Data

The script generates a realistic imbalanced dataset mimicking credit card fraud:
- 30 numerical features (transaction amount, velocity, distance, etc.)
- ~2% fraud rate (configurable)
- Default: 500K transactions, scalable to millions

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500_000,
    n_features=30,
    n_informative=15,
    n_redundant=5,
    weights=[0.98, 0.02],  # 2% fraud rate
    random_state=42,
)
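
Step 3 passes both a train and a validation channel, so the generated data has to be split first. One way to do that is a stratified split with scikit-learn's train_test_split, sketched below. The 80/20 ratio, the reduced sample count, and the variable names are assumptions made for illustration, not necessarily what run_tutorial.py does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Smaller sample count than the tutorial default so the sketch runs quickly.
X, y = make_classification(
    n_samples=20_000,
    n_features=30,
    n_informative=15,
    n_redundant=5,
    weights=[0.98, 0.02],
    random_state=42,
)

# stratify=y keeps the fraud rate nearly identical in both splits, which
# matters when positives are this rare.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train fraud rate: {y_train.mean():.4f}")
print(f"val fraud rate:   {y_val.mean():.4f}")
```

The printed rates may sit slightly above the nominal 2%, because make_classification flips a small fraction of labels by default (its flip_y parameter).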

Step 2: Partition Data for Dask

Dask reads each file as a partition, with one Dask worker per GPU. The number of data files should exceed the total GPU count.

# For ml.g5.12xlarge (4 GPUs) × 2 instances = 8 GPUs
gpus_per_instance = 4
instance_count = 2
num_gpus = gpus_per_instance * instance_count
# Create 16 partitions (2× GPU count)
num_partitions = num_gpus * 2

Important: Dask distributed training only supports CSV and Parquet formats. LIBSVM and PROTOBUF will cause the training job to fail.
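
The partitioning step can be sketched as follows. This is a minimal illustration using NumPy, assuming the CSV layout the SageMaker XGBoost algorithm expects for training data (label in the first column, no header row). The helper name write_partitions, the file-name pattern, and the toy data are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_partitions(X, y, out_dir, num_partitions, prefix="part"):
    """Write label-first, header-less CSV files of roughly equal size."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Label goes in the first column, followed by the features.
    data = np.column_stack([y, X])
    paths = []
    for i, chunk in enumerate(np.array_split(data, num_partitions)):
        path = out_dir / f"{prefix}-{i:05d}.csv"
        # %.6g keeps six significant digits and writes labels as 0 or 1.
        np.savetxt(path, chunk, delimiter=",", fmt="%.6g")
        paths.append(path)
    return paths


# Toy data standing in for the synthetic fraud set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 30))
y = (rng.random(1_000) < 0.02).astype(int)

gpus_per_instance, instance_count = 4, 2  # 2 x ml.g5.12xlarge
num_partitions = gpus_per_instance * instance_count * 2

with tempfile.TemporaryDirectory() as tmp:
    files = write_partitions(X, y, tmp, num_partitions)
    rows_per_file = [len(f.read_text().splitlines()) for f in files]

print(len(files), min(rows_per_file), max(rows_per_file))
```

np.array_split keeps the partition sizes within one row of each other, which matches the advice to aim for equal-sized files. The resulting files would then be uploaded under a single S3 prefix that serves as the training channel.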

Step 3: Launch Distributed GPU Training

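from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# xgboost_image_uri, role, train_s3_uri, and val_s3_uri are assumed to be defined earlier.
estimator = Estimator(
    image_uri=xgboost_image_uri,  # XGBoost 3.0-5
    role=role,
    instance_count=2,
    instance_type="ml.g5.12xlarge",
    hyperparameters={
        "objective": "binary:logistic",
        "num_round": 200,
        "max_depth": 8,
        "eta": 0.1,
        "tree_method": "gpu_hist",
        "scale_pos_weight": 49,  # ratio of negatives to positives
        "eval_metric": "auc",
        "use_dask_gpu_training": "true",
    },
)

# FullyReplicated - Dask handles data distribution internally.
# content_type="text/csv" assumes the partitions were written as CSV files.
train_input = TrainingInput(s3_data=train_s3_uri, content_type="text/csv", distribution="FullyReplicated")
val_input = TrainingInput(s3_data=val_s3_uri, content_type="text/csv", distribution="FullyReplicated")
estimator.fit({"train": train_input, "validation": val_input})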

Key hyperparameters for distributed GPU training:
- tree_method: gpu_hist - enables GPU-accelerated histogram-based training
- use_dask_gpu_training: "true" - enables Dask multi-GPU coordination
- scale_pos_weight: 49 - compensates for the 2% fraud rate (negatives / positives = 98 / 2 = 49)
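
Rather than hard-coding 49, the weight can be derived from the training labels, since the observed class ratio may differ from the nominal one. A minimal sketch of that calculation follows; the helper name compute_scale_pos_weight is hypothetical.

```python
import numpy as np


def compute_scale_pos_weight(y) -> float:
    """Return (number of negatives) / (number of positives) for binary labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples; scale_pos_weight is undefined")
    return n_neg / n_pos


# 98 legitimate transactions for every 2 fraudulent ones.
labels = np.array([0] * 98 + [1] * 2)
print(compute_scale_pos_weight(labels))  # 49.0
```

The result can be passed as the scale_pos_weight hyperparameter in place of the literal value.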

Step 4: Deploy and Test

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",  # CPU is fine for inference
)
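
With the binary:logistic objective, the endpoint returns a fraud probability for each input row. The sketch below shows one way to build a request payload and interpret a response. It assumes the endpoint accepts text/csv rows without a label column and returns scores as comma- or newline-separated text. The helper names, the sample values, and the 0.5 threshold are all assumptions.

```python
def to_csv_payload(rows) -> str:
    """Serialize feature rows (no label column) as CSV text, one row per line."""
    return "\n".join(",".join(repr(float(v)) for v in row) for row in rows)


def parse_scores(body: str):
    """Parse a response body of comma- or newline-separated probabilities."""
    tokens = body.replace("\n", ",").split(",")
    return [float(t) for t in tokens if t.strip()]


def flag_fraud(scores, threshold=0.5):
    """Turn probabilities into 0/1 fraud flags at the given threshold."""
    return [int(s >= threshold) for s in scores]


payload = to_csv_payload([[12.5, 0.0, 3.0], [980.0, 1.0, 42.0]])
scores = parse_scores("0.03,0.91\n")  # stand-in for a real endpoint response
print(payload)
print(flag_fraud(scores))
```

With the SageMaker Python SDK, a payload like this would typically be sent through predictor.predict after setting predictor.serializer to a CSVSerializer from sagemaker.serializers. For a dataset this imbalanced, the threshold is usually tuned on the validation set rather than left at 0.5.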

Step 5: Clean Up

predictor.delete_endpoint()

Instance Selection Guide

Instance	GPUs	Total GPU Memory	Best For
ml.g5.xlarge	1 × A10G	24 GB	Small datasets, testing
ml.g5.12xlarge	4 × A10G	96 GB	Medium datasets (recommended)
ml.g5.24xlarge	4 × A10G	96 GB	Large datasets, more CPU/RAM

XGBoost 3.0-5 note: P3 instances are not supported. Use G4dn or G5 family.

Dask Training Best Practices

File count: Create more files than total GPUs (instance_count × GPUs per instance). Too few files underutilize the GPUs; too many degrade performance.
File format: Use CSV or Parquet only. Parquet column names must be strings.
Distribution: Set distribution="FullyReplicated" or omit it. Do not use ShardedByS3Key.
No pipe mode: Dask does not support pipe mode input.
File sizes: Aim for roughly equal-sized partitions for balanced GPU utilization.
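
Several of these rules can be checked before a training job is launched. The following sketch validates a list of S3 object keys against the rules above. The function name check_dask_inputs and its parameters are hypothetical, and it does not test for "too many" files because no threshold is given here.

```python
from pathlib import PurePosixPath

SUPPORTED_SUFFIXES = {".csv", ".parquet"}


def check_dask_inputs(file_keys, instance_count, gpus_per_instance,
                      distribution=None, input_mode="File"):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    total_gpus = instance_count * gpus_per_instance

    # File format: CSV or Parquet only.
    suffixes = {PurePosixPath(key).suffix.lower() for key in file_keys}
    unsupported = sorted(suffixes - SUPPORTED_SUFFIXES)
    if unsupported:
        problems.append(f"unsupported file types: {unsupported}")
    # File count: more files than total GPUs.
    if len(file_keys) <= total_gpus:
        problems.append(
            f"{len(file_keys)} files for {total_gpus} GPUs; create more files than GPUs"
        )
    # Distribution: FullyReplicated or unset.
    if distribution not in (None, "FullyReplicated"):
        problems.append(f"distribution={distribution!r} is not supported")
    # No pipe mode.
    if input_mode == "Pipe":
        problems.append("Pipe input mode is not supported")
    return problems


keys = [f"train/part-{i:05d}.csv" for i in range(16)]
print(check_dask_inputs(keys, instance_count=2, gpus_per_instance=4))
print(check_dask_inputs(keys[:8], 2, 4, distribution="ShardedByS3Key"))
```

The first call reports no problems. The second reports two: eight files do not exceed eight GPUs, and ShardedByS3Key is not an allowed distribution.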

XGBoost Version Comparison

Feature	1.5-1	1.7-1	3.0-5
Dask multi-GPU	✅	✅	✅
GPU instance support	P2, P3, G4dn, G5	P3, G4dn, G5	G4dn, G5
SageMaker Debugger	✅	✅	❌

Cost Estimate

Configuration	Total GPUs	Training Time (500K rows)	Approximate Cost
1 × ml.g5.xlarge	1 GPU	~8 min	~$0.14
1 × ml.g5.12xlarge	4 GPUs	~3 min	~$0.28
2 × ml.g5.12xlarge	8 GPUs	~2 min	~$0.37

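GPU training is faster than CPU training and is often more cost-effective despite the higher per-instance price. Among the GPU configurations above, adding GPUs shortens training time but raises the total cost.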
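
The cost figures follow from simple arithmetic: cost = hourly rate × instance count × minutes / 60. The sketch below inverts that formula to show the per-instance hourly rate each table row implies. The function names are hypothetical, and the resulting rates are only as accurate as the approximate table values; they are not current pricing.

```python
def training_cost(hourly_rate, instance_count, minutes):
    """Cost of a job billed per instance for its wall-clock duration."""
    return hourly_rate * instance_count * minutes / 60


def implied_hourly_rate(cost, instance_count, minutes):
    """Per-instance hourly rate implied by a total cost and duration."""
    return cost * 60 / (instance_count * minutes)


# (configuration, instance count, minutes, approximate cost) from the table above.
rows = [
    ("1 x ml.g5.xlarge", 1, 8, 0.14),
    ("1 x ml.g5.12xlarge", 1, 3, 0.28),
    ("2 x ml.g5.12xlarge", 2, 2, 0.37),
]
for name, count, minutes, cost in rows:
    rate = implied_hourly_rate(cost, count, minutes)
    print(f"{name}: ~${rate:.2f} per instance-hour")
```

Plugging a current on-demand rate into training_cost gives an updated estimate for any of the configurations.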