Data Quality
Deploys AWS Glue Data Quality rulesets for automated validation and monitoring of data in Glue Catalog tables. Supports both structured rule objects and raw DQDL (Data Quality Definition Language) strings. Use this module when you need to enforce data quality checks such as completeness, uniqueness, or custom validation rules on tables in your Glue Catalog.
Deployed Resources
This module deploys and integrates the following resources:
- Glue Data Quality Ruleset(s) — Rulesets created for each specification in the config. Supports structured rule objects and raw DQDL strings for flexible validation patterns.
- SSM Parameters — Ruleset names and target table information stored in Parameter Store for cross-module reference.
Related Modules
- DataOps Project — Deploy the shared project infrastructure (databases, KMS keys) that data quality rulesets target
- Crawlers — Deploy crawlers that create the Glue tables targeted by data quality rulesets
- ETL Jobs — Trigger data quality evaluations from within Glue ETL jobs
Security/Compliance Details
This module is designed in alignment with MDAA security/compliance principles and CDK nag rulesets. Additional review is recommended prior to production deployment, ensuring organization-specific compliance requirements are met.
- Least Privilege:
  - Ruleset management governed by IAM policies
  - SSM parameters for ruleset metadata use least-privilege access patterns
Configuration
MDAA Config
Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:
dataops-data-quality: # Module Name can be customized
  module_path: '@aws-mdaa/dataops-data-quality' # Must match module NPM package name
  module_configs:
    - ./dataops-data-quality.yaml # Filename/path can be customized
Module Config Samples and Variants
Copy the contents of the relevant sample config below into the ./dataops-data-quality.yaml file referenced in the MDAA config snippet above.
Minimal Configuration
Only required properties are included. Start here for a single data quality ruleset targeting one Glue table within an existing DataOps project.
# Minimal configuration for DataOps Data Quality module.
# Only required properties are included.
# Map of ruleset names to Glue Data Quality ruleset definitions.
rulesets:
  customer-data-quality:
    targetTable:
      databaseName: project:databaseName/customer-data
      tableName: customers
    ruleset:
      - ruleType: IsComplete
        column: customer_id
Comprehensive Configuration
Exercises all non-excluded schema properties at full depth. Defines Glue Data Quality rulesets for customer and order data validation, wired to a DataOps project for resource resolution. Start here when evaluating all available options for structured rules, raw DQDL strings, and multi-table validation patterns.
sample-config-comprehensive.yaml
# Comprehensive config for the DataOps Data Quality module.
# Exercises ALL non-excluded schema properties at full depth.
# Defines Glue Data Quality rulesets for customer and order data
# validation, wired to a DataOps project for resource resolution.
# DataOps project name for data quality ruleset integration and naming.
projectName: test-dataops-project
# SNS topic ARN for job notifications and workflow alerts.
# Auto-resolved from project when projectName is set.
notificationTopicArn: arn:{{partition}}:sns:{{region}}:{{account}}:test-topic
# Map of ruleset names to Glue Data Quality ruleset definitions for automated table validation.
rulesets:
  # Ruleset exercising structured rule array with all DataQualityRule properties
  customer-data-quality:
    # Description explaining the purpose and scope of the ruleset.
    description: Validate customer data completeness and uniqueness
    # Target table specifying which Glue Catalog table to validate.
    targetTable:
      # AWS account ID for cross-account Glue Catalog access.
      catalogId: '{{account}}'
      # Glue database name containing the target table.
      databaseName: project:databaseName/customer-data
      # Glue table name to validate with data quality rules.
      tableName: customers
    # Ruleset as an array of structured rule objects.
    ruleset:
      # IsComplete rule: checks column has no nulls
      - ruleType: IsComplete
        # Column name for column-specific rules.
        column: customer_id
      # Uniqueness rule: percentage-based threshold check
      - ruleType: Uniqueness
        column: email
        # Comparison operator for threshold and value-based rules.
        comparisonOperator: '>'
        # Threshold value (0.0–1.0) for percentage-based rules.
        threshold: 0.95
      # RowCount rule: numeric value comparison
      - ruleType: RowCount
        comparisonOperator: '>'
        # Numeric value for count and statistical rules.
        value: 100
      # ColumnValues rule: allowed values list with 'in' operator
      - ruleType: ColumnValues
        column: status
        comparisonOperator: in
        # Allowed values list for ColumnValues rule with 'in' operator.
        values:
          - active
          - inactive
          - pending
      # ColumnDataType rule: validates column data type
      - ruleType: ColumnDataType
        column: created_at
        # Expected data type for ColumnDataType rule.
        dataType: DATE
      # DataFreshness rule: validates data recency
      - ruleType: DataFreshness
        column: updated_at
        # Duration specifying maximum data age.
        duration: '24 hours'
      # CustomSql rule: custom SQL-based validation
      - ruleType: CustomSql
        # SQL query for CustomSql rule, must return a single numeric value.
        sql: 'SELECT COUNT(*) FROM customers WHERE customer_id IS NULL'
        comparisonOperator: '='
        value: 0
      # Rule with WHERE clause: conditional validation
      - ruleType: IsComplete
        column: phone_number
        # SQL WHERE clause to filter rows before applying the rule.
        where: "country = 'US'"
  # Ruleset exercising raw DQDL string form
  order-data-quality:
    description: Validate order data freshness and values
    targetTable:
      databaseName: project:databaseName/order-data
      tableName: orders
    # Ruleset as a raw DQDL string.
    ruleset: |
      Rules = [
        IsComplete "order_id",
        ColumnValues "status" in ["pending", "completed", "cancelled"],
        RowCount > 0
      ]
  # Ruleset with Redshift source metadata and SMUS asset mapping
  redshift-inventory-quality:
    description: Validate inventory data from Redshift
    targetTable:
      databaseName: project:databaseName/inventory-data
      tableName: inventory
    # Source configuration describing where the data lives.
    source:
      sourceType: redshift
      connectionName: project:connections/redshift-jdbc
      redshiftTable: public.inventory
    # DataZone asset ID for SMUS publishing.
    smusAssetId: asset-abc-123
    ruleset:
      - ruleType: IsComplete
        column: product_id
  # Recommendation-based ruleset (no explicit rules)
  auto-recommended-rules:
    description: Auto-generated rules from Glue DQ recommendations
    targetTable:
      databaseName: project:databaseName/customer-data
      tableName: customers
    # Glue Data Quality recommendation run ID.
    recommendationRunId: dqrun-abc-123-def
# Dynamic targets for runtime table discovery.
dynamicTargets:
  - name: raw-parquet-data
    s3DirUri: s3://my-data-lake/raw/parquet/
    source:
      sourceType: s3
      s3Format: parquet
# SMUS publishing configuration for DataZone integration.
smusPublishing:
  domainId: dzd_my_domain
  accountId: '{{account}}'
  region: '{{region}}'
  # roleArn: arn:{{partition}}:iam::{{account}}:role/dq-publisher
  # domainKmsKeyArn: arn:{{partition}}:kms:{{region}}:{{account}}:key/abc-123
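The comprehensive sample expresses rules both as structured objects and as a raw DQDL string. As a rough illustration of how the two forms correspond, the sketch below renders a few structured rule objects into DQDL text. It is not the module's actual implementation: `render_rule` and `render_ruleset` are hypothetical helpers, and they handle only the `column`, `comparisonOperator`, `threshold`, `value`, and `values` properties shown in the sample.

```python
def render_rule(rule: dict) -> str:
    """Render one structured rule object as a DQDL rule string.

    Illustrative only: handles the column, comparisonOperator, threshold,
    value, and values properties; any other property is ignored.
    """
    parts = [rule["ruleType"]]
    if "column" in rule:
        # DQDL quotes column names.
        parts.append(f'"{rule["column"]}"')
    operator = rule.get("comparisonOperator")
    if operator == "in":
        # Allowed-values list, e.g. in ["a", "b"].
        quoted = ", ".join(f'"{value}"' for value in rule["values"])
        parts.append(f"in [{quoted}]")
    elif operator is not None:
        # Percentage rules carry a threshold, count rules carry a value.
        operand = rule["threshold"] if "threshold" in rule else rule["value"]
        parts.append(f"{operator} {operand}")
    return " ".join(parts)


def render_ruleset(rules: list) -> str:
    """Wrap the rendered rules in a DQDL 'Rules = [...]' block."""
    body = ",\n".join("    " + render_rule(rule) for rule in rules)
    return "Rules = [\n" + body + "\n]"


# Structured equivalent of the rules in the order-data-quality sample.
structured = [
    {"ruleType": "IsComplete", "column": "order_id"},
    {"ruleType": "ColumnValues", "column": "status", "comparisonOperator": "in",
     "values": ["pending", "completed", "cancelled"]},
    {"ruleType": "RowCount", "comparisonOperator": ">", "value": 0},
]
print(render_ruleset(structured))
```

The printed text contains the same three rules as the raw DQDL string in the `order-data-quality` sample, which is one way to sanity-check a structured ruleset before deploying it.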
Standalone Configuration (No Project)
Demonstrates standalone data quality rulesets with explicit KMS, bucket, deployment role, and security configuration (no projectName). Use this when deploying outside of a DataOps project, providing infrastructure references directly.
# Sample config for the DataOps Data Quality module.
# Demonstrates standalone data quality rulesets with explicit KMS,
# bucket, deployment role, and security configuration (no projectName).
# KMS key ARN for encrypting DataOps resources and data.
kmsArn: arn:{{partition}}:kms:{{region}}:{{account}}:key/test-key-id
# S3 bucket name for project storage (scripts, artifacts, temp files).
bucketName: test-dq-bucket
# IAM role ARN for deployment operations and resource management.
deploymentRoleArn: arn:{{partition}}:iam::{{account}}:role/test-deploy-role
# Glue security configuration name for job encryption.
securityConfigurationName: test-security-config
# SNS topic ARN for job notifications and workflow alerts.
notificationTopicArn: arn:{{partition}}:sns:{{region}}:{{account}}:test-topic
# Map of ruleset names to Glue Data Quality ruleset definitions for automated table validation.
rulesets:
  customer-data-quality:
    # Description explaining the purpose and scope of the ruleset.
    description: Validate customer data completeness and uniqueness
    # Target table specifying which Glue Catalog table to validate.
    targetTable:
      # Glue database name containing the target table.
      databaseName: project:databaseName/customer-data
      # Glue table name to validate with data quality rules.
      tableName: customers
    # Ruleset as an array of structured rule objects.
    ruleset:
      - ruleType: IsComplete
        column: customer_id
      - ruleType: Uniqueness
        column: email
        comparisonOperator: '>'
        threshold: 0.95
      - ruleType: RowCount
        comparisonOperator: '>'
        value: 100
  order-data-quality:
    description: Validate order data freshness and values
    targetTable:
      databaseName: project:databaseName/order-data
      tableName: orders
    # Ruleset as a raw DQDL string.
    ruleset: |
      Rules = [
        IsComplete "order_id",
        ColumnValues "status" in ["pending", "completed", "cancelled"],
        RowCount > 0
      ]
Important Notes
- Tables Must Exist: The target table must exist in the Glue Catalog before the ruleset can be evaluated. Rulesets can be created before tables exist, but evaluation will fail until the table is created (typically by a crawler).
- Deployment Order: This module should be deployed AFTER:
  - dataops-project-app (creates databases)
  - dataops-crawler-app (creates crawlers)
  - Running crawlers to create tables
- Project References: Use the project: prefix to reference resources from the DataOps project:
  - project:databaseName/my-database resolves to the project's database SSM parameter
- Evaluation: Creating a ruleset does not automatically evaluate it. To evaluate it, use one of the following:
  - Run a Glue Data Quality evaluation job
  - Configure evaluation in a Glue ETL job
  - Use EventBridge to trigger evaluations
- DQDL vs Structured Rules: You can use either:
  - Raw DQDL strings (more flexible, requires DQDL knowledge)
  - Structured rule objects (type-safe, easier to maintain)
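As the Evaluation note says, a deployed ruleset does nothing until an evaluation run is started. One possible way to start a run on demand is the Glue `StartDataQualityRulesetEvaluationRun` API, sketched below with boto3. The ruleset name and role ARN are hypothetical placeholders: the deployed ruleset name may differ from the key used in the config, so it is safer to read it from the SSM parameter this module writes, and the role must be one that Glue can assume with access to the target table.

```python
def build_evaluation_request(database_name, table_name, ruleset_names, role_arn):
    """Build the parameters for a Glue Data Quality evaluation run."""
    return {
        # The Glue Catalog table to evaluate.
        "DataSource": {
            "GlueTable": {"DatabaseName": database_name, "TableName": table_name}
        },
        # IAM role that Glue assumes for the run.
        "Role": role_arn,
        # Names of the deployed rulesets to evaluate against the table.
        "RulesetNames": list(ruleset_names),
    }


def start_evaluation(request):
    """Start the run and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue")
    response = glue.start_data_quality_ruleset_evaluation_run(**request)
    return response["RunId"]


# Hypothetical names for illustration only.
request = build_evaluation_request(
    database_name="customer-data",
    table_name="customers",
    ruleset_names=["example-customer-data-quality"],
    role_arn="arn:aws:iam::123456789012:role/example-dq-evaluation-role",
)
```

Calling `start_evaluation(request)` submits the run. The call will fail if the target table does not yet exist in the Glue Catalog, which is why the crawlers need to have run first.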