Skip to content

SageMaker Ground Truth

Note: This documentation is also available in a rendered format here.

Deploys a fully automated, continuous data labeling pipeline using SageMaker Ground Truth. Data uploaded to S3 is automatically detected, batched, and sent to human labelers via Ground Truth labeling jobs. Labeled results are persisted to a SageMaker Feature Group for downstream ML training pipelines. Use this module when you need continuous, hands-off data labeling with quality verification and Feature Store integration.


Deployed Resources

This module deploys and integrates the following resources:

Upload S3 Bucket — Data landing zone with EventBridge notifications enabled. Upload data objects (images, text) here to trigger the labeling pipeline. Deployed using the MdaaBucket construct.

Output S3 Bucket — Stores Ground Truth labeling job output manifests and annotations. Deployed using the MdaaBucket construct.

SQS Queue + Dead Letter Queue — Tracks uploaded data objects pending labeling. S3 EventBridge notifications route to this queue. Items that repeatedly fail labeling are moved to the DLQ with CloudWatch alarms.

Feature Store Bucket + SageMaker Feature Group — Persists labeled data in SageMaker Feature Store for consumption by ML training pipelines. The Feature Group stores label metadata (source reference, label, confidence, job name, timestamp).

Step Functions State Machine — Orchestrates the end-to-end labeling pipeline: 1. Poll SQS for new data objects 2. Build manifest and create Ground Truth labeling job 3. Wait for labeling completion 4. (Optional) Create verification job for quality review 5. Write labeled results to Feature Store 6. Return rejected/unlabeled messages to queue for retry

Lambda Functions — 5-6 functions powering the state machine steps (poll SQS, run labeling job, run verification job, update Feature Store, return messages to queue).

EventBridge Scheduler — Triggers the state machine on a configurable schedule (default: daily at noon UTC).

KMS Encryption — All resources (S3 buckets, SQS queues, Feature Store, Lambda environment variables) encrypted with customer-managed KMS keys via MdaaKmsKey.


Architecture

Why EventBridge + SQS + Step Functions?

A SageMaker Ground Truth labeling job takes a static manifest file as input — it is a one-shot batch operation, not a continuously running service. There is no way to "connect" a labeling job directly to an S3 bucket and have it automatically pick up new files.

This architecture solves that limitation:

  • Continuous ingestion — S3 EventBridge notifications detect new uploads automatically. SQS buffers them, decoupling upload rate from labeling capacity.
  • Batching — The Step Function runs on a configurable schedule, polls SQS, and batches all pending items into a single labeling job. This is more cost-efficient than creating one job per file.
  • Quality loop — When verification is enabled, failed items are returned to the SQS queue for relabeling. This retry/re-queue pattern is not possible with a single direct labeling job.
  • Feature Store integration — Labeled results are automatically written to a SageMaker Feature Group, making them immediately available for downstream model training pipelines.
  • Observability — A dead-letter queue with CloudWatch alarms catches items that repeatedly fail labeling or verification, enabling operational monitoring.

In short: a single labeling job is batch and stateless. This architecture makes it continuous, self-healing, and integrated into the ML pipeline.

How It Works

S3 Upload → EventBridge → SQS Queue
              EventBridge Scheduler (e.g. every 5 min)
                     Step Functions State Machine
                       ├── Poll SQS
                       ├── Build manifest
                       ├── CreateLabelingJob → Human workers label data
                       ├── (Optional) CreateVerificationJob → Workers verify labels
                       ├── Write results to Feature Store
                       └── Delete processed SQS messages

Data can be uploaded at any time. The scheduler periodically checks for pending items and processes them in batches. If the queue is empty, the state machine exits cleanly with no cost.

State Machine Outcomes

Outcome Meaning
SUCCEEDED (no labeling job) Queue was empty — nothing to process.
SUCCEEDED (with labeling job) Labeling job completed, results written to Feature Store.
FAILED (TotalLabeled = 0) Workers did not label items within the task availability window. Messages returned to SQS for retry.
FAILED (labeling job error) SageMaker labeling job failed (e.g. permission issue). Messages returned to SQS for retry.

Note: A FAILED execution does not mean data is lost. The error handling path returns all unprocessed messages to the SQS queue so they are picked up on the next scheduled run. Items that repeatedly fail will eventually move to the dead-letter queue.



Security/Compliance Details

This module is designed in alignment with MDAA security/compliance principles and CDK Nag rulesets. Additional review is recommended prior to production deployment, ensuring organization-specific compliance requirements are met.

  • Encryption at Rest:
    • All S3 buckets (upload, output, Feature Store) encrypted with customer-managed KMS keys
    • SQS queues and dead-letter queues encrypted with KMS
    • Lambda environment variables encrypted with KMS
    • Feature Group encrypted with KMS (offline store)
  • Encryption in Transit:
    • S3 bucket policies enforce aws:SecureTransport
    • SQS queue policies enforce TLS
  • Least Privilege:
    • Separate IAM roles for state machine, Lambda functions, SageMaker labeling, and Feature Store access
    • Lambda functions have only the permissions needed for their specific step
    • SageMaker labeling role scoped to upload/output buckets and specific task types
    • Feature Store role has PutRecord only on the specific Feature Group
  • Network Isolation:
    • Lambda functions can optionally run in VPC (configured via L3 construct)
    • S3 bucket policies restrict access to specific IAM roles
  • Observability:
    • CloudWatch alarms on dead-letter queue depth
    • Step Functions execution history for audit trail
    • Lambda CloudWatch log groups with configurable retention

Prerequisites

The following resources must be created before deploying this module:

Labeling (required)

Resource Config Field How to Create
SageMaker Workteam labelingTaskConfig.workteamArn Create via SageMaker Ground Truth console or aws sagemaker create-workteam. Private workteams require a Cognito user pool.
Label Categories File labelingTaskConfig.categoriesS3Uri Upload a JSON file to S3 with label categories. See categories file format.
Labeling UI Template labelingTaskConfig.templateS3Uri Upload a Liquid HTML template to S3. Required for image_bounding_box and image_semantic_segmentation (SageMaker does not support built-in HumanTaskUiArn for these task types). Optional for classification and NER tasks which use AWS-managed templates. See custom templates.

Verification (optional — only if verification block is configured)

Resource Config Field How to Create
Verification Workteam verification.workteamArn Same as labeling workteam. Can be the same or a different team.
Verification UI Template verification.templateS3Uri Optional. Custom template for verification UI.
Verification Categories File verification.categoriesS3Uri Optional. The first label should be the "pass" label; other labels indicate validation failures.

Supported Task Types

Task Type Media
image_bounding_box Image
image_semantic_segmentation Image
image_single_label_classification Image
image_multi_label_classification Image
text_single_label_classification Text
text_multi_label_classification Text
named_entity_recognition Text

Configuration

Sample Configurations

Key Configuration Fields

Field Required Description
jobName Yes Unique name for the labeling workflow
taskType Yes One of the supported task types above
labelingTaskConfig.workteamArn Yes ARN of the SageMaker private workteam
labelingTaskConfig.categoriesS3Uri Yes S3 URI of the label categories JSON file
labelingTaskConfig.taskTitle Yes Title shown to labelers in the UI
labelingTaskConfig.taskDescription Yes Description shown to labelers
verification No Enable verification step with a second workteam
workflowSchedule No EventBridge cron expression (default: cron(0 12 * * ? *))
featureGroupName No Custom Feature Group name (default: auto-generated)

SSM Outputs

Parameter Description
upload-bucket-name S3 bucket for uploading data objects
output-bucket-name S3 bucket for labeling job results
feature-group-name SageMaker Feature Group name
state-machine-arn Step Functions state machine ARN
upload-queue-url SQS queue URL for data notifications