SageMaker Model Monitoring

Deploys SageMaker Model Monitor schedules for any of the four monitoring types (data quality, model quality, model bias, and model explainability) against a deployed real-time endpoint. Each enabled monitor runs as a scheduled processing job that compares live inference traffic captured from the endpoint against baseline statistics and constraints, publishing violations to S3 and CloudWatch. Use this module when you need continuous monitoring of a production inference endpoint to detect data drift, model degradation, bias drift, or changes in feature attribution.

Deployed Resources

This module deploys and integrates the following resources:

SageMaker Data Quality Monitoring Schedule (Optional) - Scheduled processing job that compares incoming request data against baseline data statistics and constraints.

SageMaker Model Quality Monitoring Schedule (Optional) - Scheduled processing job that evaluates model prediction accuracy against ground truth labels.

SageMaker Model Bias Monitoring Schedule (Optional) - Scheduled processing job that detects bias drift in model predictions using SageMaker Clarify.

SageMaker Model Explainability Monitoring Schedule (Optional) - Scheduled processing job that tracks changes in feature attribution using SageMaker Clarify.

Baseline Processing Job (Optional) - One-time processing job that generates baseline statistics and constraints from a representative dataset.

Amazon S3 Output Bucket - Stores monitoring output reports, constraint violations, and baseline artifacts.

AWS KMS Key - Customer-managed encryption key for S3 output bucket and processing job storage volumes.

AWS IAM Monitoring Role - Execution role for monitoring processing jobs with permissions to read endpoint data capture and write results to S3.
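
Once the module is deployed, the schedules listed above can be inspected through the SageMaker API. The following is a minimal sketch, not part of the module itself: the helper name `list_endpoint_monitors` is hypothetical, and the `client` argument is assumed to behave like `boto3.client("sagemaker")`, whose `list_monitoring_schedules` call accepts an `EndpointName` filter and returns `MonitoringScheduleSummaries`.

```python
from typing import Any, Dict, List


def list_endpoint_monitors(client: Any, endpoint_name: str) -> List[Dict[str, str]]:
    """Return name, type, and status for each monitoring schedule on an endpoint.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    monitors: List[Dict[str, str]] = []
    kwargs: Dict[str, Any] = {"EndpointName": endpoint_name}
    while True:
        page = client.list_monitoring_schedules(**kwargs)
        for summary in page.get("MonitoringScheduleSummaries", []):
            monitors.append(
                {
                    "name": summary["MonitoringScheduleName"],
                    # MonitoringType is one of DataQuality, ModelQuality,
                    # ModelBias, or ModelExplainability.
                    "type": summary.get("MonitoringType", "Unknown"),
                    "status": summary["MonitoringScheduleStatus"],
                }
            )
        # Follow pagination until the API stops returning a token.
        token = page.get("NextToken")
        if not token:
            return monitors
        kwargs["NextToken"] = token


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   for m in list_endpoint_monitors(boto3.client("sagemaker"), "test-endpoint"):
#       print(m["type"], m["name"], m["status"])
```

Passing the client in as an argument keeps the helper free of credentials handling and makes it easy to exercise with a stub.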

Related Modules

SageMaker Endpoint - Deploys the real-time inference endpoint that this module monitors for drift and quality degradation.

SageMaker MLOps - Provides the model artifacts and model package group used to establish monitoring baselines.

SageMaker Studio Domain - Provides SageMaker domain tagging context for resource governance.

Security/Compliance Details

This module is designed in alignment with MDAA security/compliance principles and CDK nag rulesets. Review it against your organization-specific compliance requirements before deploying to production.

Encryption at Rest:
- S3 output bucket encrypted with customer-managed KMS key
- Processing job storage volumes encrypted with KMS
- Baseline artifacts encrypted at rest

Encryption in Transit:
- All S3 access enforced over HTTPS via bucket policy
- Processing job containers communicate over TLS

Least Privilege:
- Monitoring role scoped to specific endpoint, S3 paths, and KMS key
- KMS key policy restricts usage to the monitoring role and admin principals

Network Isolation:
- Monitoring processing jobs support VPC configuration with security groups and subnets
- Optional network isolation mode prevents containers from making outbound network calls
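
To make the "HTTPS via bucket policy" control concrete, the sketch below builds the deny statement commonly used for this purpose, keyed on the `aws:SecureTransport` condition. The policy this module actually generates is not shown in this document and may differ. The function name and the example bucket ARN are hypothetical.

```python
import json
from typing import Any, Dict


def https_only_statement(bucket_arn: str) -> Dict[str, Any]:
    """Build a deny statement that rejects any S3 request not sent over TLS."""
    return {
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        # Cover both the bucket itself and every object inside it.
        "Resource": [bucket_arn, f"{bucket_arn}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


if __name__ == "__main__":
    # Hypothetical bucket ARN, for illustration only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [https_only_statement("arn:aws:s3:::example-monitoring-output")],
    }
    print(json.dumps(policy, indent=2))
```

Comparing a statement like this against the deployed bucket policy is one way to confirm the control during a compliance review.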

Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

sagemaker-model-monitoring: # Module Name can be customized
  module_path: '@aws-mdaa/sagemaker-model-monitoring' # Must match module NPM package name
  module_configs:
    - ./sagemaker-model-monitoring.yaml # Filename/path can be customized

Module Config Samples and Variants

Copy the contents of the relevant sample config below into the ./sagemaker-model-monitoring.yaml file referenced in the MDAA config snippet above.

Minimal Configuration

Start here for a single data quality monitor on an existing endpoint with default schedule and instance settings.

sample-config-minimal.yaml

# Minimal config for the SageMaker Model Monitoring module.
# Contains only the required properties for basic data quality
# monitoring on a SageMaker endpoint.
#
# NOTE: ECR image URIs below are region-specific (us-east-1). Replace the account ID
# and region with values appropriate for your deployment region. See:
# https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

# Name of the SageMaker endpoint to monitor
# Often created by the SageMaker Endpoint module.
# Example SSM: ssm:/{{org}}/{{domain}}/<endpoint_module_name>/endpoint-name
endpointName: test-endpoint

# Monitor configurations — at least one type must be enabled
monitors:
  dataQuality:
    enabled: true
    schedule: "cron(0 * ? * * *)"
    instanceType: ml.m5.xlarge
    imageUri: "156813124566.dkr.ecr.us-east-1.amazonaws.com/sagemaker-model-monitor-analyzer"
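
Because the image URI is region-specific, it can help to assemble it from its parts instead of editing the string by hand. The helper below is a sketch under stated assumptions: `ecr_image_uri` is a hypothetical name, the account ID for your region must be looked up in the registry-paths page linked in the sample, and the default `amazonaws.com` suffix applies to the standard commercial partition only.

```python
import re


def ecr_image_uri(account_id: str, region: str, repository: str,
                  tag: str = "", dns_suffix: str = "amazonaws.com") -> str:
    """Assemble an ECR image URI from its parts."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"account ID must be 12 digits: {account_id!r}")
    uri = f"{account_id}.dkr.ecr.{region}.{dns_suffix}/{repository}"
    # Append the tag only when one is given; the sample above uses none.
    return f"{uri}:{tag}" if tag else uri


if __name__ == "__main__":
    # Values copied from the sample config above (us-east-1).
    print(ecr_image_uri("156813124566", "us-east-1", "sagemaker-model-monitor-analyzer"))
```

The account ID is an explicit argument so that no per-region mapping is guessed; only the us-east-1 value shown in the sample is taken from this document.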

Comprehensive Configuration

Use this as a reference when you need all four monitor types, custom baseline generation, VPC isolation, and per-monitor schedule and instance configuration.

sample-config-comprehensive.yaml

# Comprehensive config for the SageMaker Model Monitoring module.
# Deploys all four monitor types (data quality, model quality,
# model bias, model explainability) with VPC isolation, KMS
# encryption, and automated baselining.
#
# NOTE: ECR image URIs below are region-specific (us-east-1). Replace the account ID
# and region with values appropriate for your deployment region. See:
# https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

# Name of the SageMaker endpoint to monitor
# Often created by the SageMaker Endpoint module.
# Example SSM: ssm:/{{org}}/{{domain}}/<endpoint_module_name>/endpoint-name
endpointName: test-endpoint

# Monitor configurations — at least one type must be enabled
monitors:
  dataQuality:
    enabled: true
    schedule: "cron(0 * ? * * *)"
    instanceType: ml.m5.xlarge
    instanceCount: 1
    volumeSizeInGb: 30
    maxRuntimeInSeconds: 3600
    imageUri: "156813124566.dkr.ecr.us-east-1.amazonaws.com/sagemaker-model-monitor-analyzer"
    # (Optional) Pre-computed baseline references
    baselineDatasetUri: s3://test-model-bucket/baselines/data-quality/dataset.csv
    baselineConstraintsUri: s3://test-model-bucket/baselines/data-quality/constraints.json
    baselineStatisticsUri: s3://test-model-bucket/baselines/data-quality/statistics.json
  modelQuality:
    enabled: true
    schedule: "cron(0 */6 ? * * *)"
    instanceType: ml.m5.xlarge
    imageUri: "156813124566.dkr.ecr.us-east-1.amazonaws.com/sagemaker-model-monitor-analyzer"
    problemType: BinaryClassification
    groundTruthS3Uri: s3://test-model-bucket/ground-truth/
    inferenceAttribute: prediction
    probabilityAttribute: probability
    probabilityThreshold: 0.5
  modelBias:
    enabled: true
    schedule: "cron(0 0 ? * MON *)"
    instanceType: ml.m5.xlarge
    imageUri: "246618743249.dkr.ecr.us-east-1.amazonaws.com/sagemaker-clarify-processing:1.0"
    groundTruthS3Uri: s3://test-model-bucket/ground-truth/
    featuresAttribute: features
  modelExplainability:
    enabled: true
    schedule: "cron(0 0 ? * MON *)"
    instanceType: ml.m5.xlarge
    imageUri: "246618743249.dkr.ecr.us-east-1.amazonaws.com/sagemaker-clarify-processing:1.0"
    featuresAttribute: features

# (Optional) VPC ID for monitoring jobs
# Often created by your VPC/networking stack.
# Example SSM: ssm:/path/to/vpc/id
vpcId: vpc-0123456789abcdef0

# (Optional) Subnet IDs for monitoring jobs
# Often created by your VPC/networking stack.
# Example SSM: ssm:/path/to/subnet/id
subnetIds:
  - subnet-0123456789abcdef0
  - subnet-0123456789abcdef1

# (Optional) Security group IDs for monitoring jobs
# Often created by your VPC/networking stack.
# Example SSM: ssm:/path/to/security-group/id
securityGroupIds:
  - sg-0123456789abcdef0

# (Optional) S3 bucket ARN for model artifacts access
modelBucketArn: arn:{{partition}}:s3:::test-model-bucket

# (Optional) Enable network isolation for monitoring jobs
# (default: false)
networkIsolation: false

# (Optional) KMS key ARN for encryption. If omitted, a new
# customer-managed key is created.
kmsKeyArn: arn:{{partition}}:kms:{{region}}:{{account}}:key/test-key-id

# (Optional) Automated baselining configuration
baselineTrainingDataS3Uri: s3://test-model-bucket/baselines/training-data.csv
baselineOutputDataS3Uri: s3://test-model-bucket/baselines/output/
baselineSchedule: "cron(0 0 1 * ? *)"
baselineDatasetFormat: '{"csv": {"header": true}}'
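
Before deploying, a quick sanity check of the module config can catch simple mistakes. The sketch below is not the module's own validation, and `check_config` is a hypothetical helper. It assumes the YAML has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`) and checks only what the samples above state or show: `endpointName` is present, at least one monitor type is enabled, any `schedule` given uses the six-field `cron(...)` form, and `baselineDatasetFormat` is a JSON string.

```python
import json
import re
from typing import Any, Dict, List

MONITOR_TYPES = ("dataQuality", "modelQuality", "modelBias", "modelExplainability")
# Matches the cron(...) wrapper with exactly six whitespace-separated fields.
CRON_PATTERN = re.compile(r"^cron\((\S+\s+){5}\S+\)$")


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    if not config.get("endpointName"):
        problems.append("endpointName is missing")

    monitors = config.get("monitors") or {}
    enabled = [t for t in MONITOR_TYPES if (monitors.get(t) or {}).get("enabled")]
    if not enabled:
        problems.append("at least one monitor type must be enabled")

    for monitor_type in enabled:
        # A schedule is only checked when present, since it may be defaulted.
        schedule = monitors[monitor_type].get("schedule")
        if schedule is not None and not CRON_PATTERN.match(schedule):
            problems.append(f"{monitor_type}.schedule is not a six-field cron(...) expression")

    dataset_format = config.get("baselineDatasetFormat")
    if dataset_format is not None:
        try:
            json.loads(dataset_format)
        except (TypeError, ValueError):
            problems.append("baselineDatasetFormat is not valid JSON")
    return problems


if __name__ == "__main__":
    sample = {
        "endpointName": "test-endpoint",
        "monitors": {"dataQuality": {"enabled": True, "schedule": "cron(0 * ? * * *)"}},
        "baselineDatasetFormat": '{"csv": {"header": true}}',
    }
    print(check_config(sample))
```

The check is deliberately shallow: it does not verify which cron expressions SageMaker accepts or which per-monitor properties are mandatory, since this document does not specify them.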