Project

The DataOps Project CDK application is used to deploy the resources required to support and perform data operations on top of a Data Lake, primarily using Glue Crawlers and Glue Jobs.

Deployed Resources and Compliance Details

dataops-project

Project KMS Key - Used to encrypt all project information at rest across all project resources.

Usage access granted to project data engineer and execution roles (by key policy)
Usage/Admin access granted to data admin role (by key policy)

Project S3 Bucket - A storage location for project activities (scratch and temporary).

Read/write access granted (by prefix) to project data engineer, execution, and data admin roles (by bucket policy)
Used as the temp location for all project Glue jobs
Used to deploy/stage all Glue job code
Can be used to store project-related derived data for downstream processing

Glue Databases - A Glue Catalog database will be created for each project database specified in the config.

Can be used by project crawlers and jobs to store crawled/generated tables

LakeFormation Grants - Grant access to project Glue databases and tables

Data lake location and read/write data lake permission grants can be automatically created for project execution and engineer roles
Data lake permission grants (read or write) can be configured on a per database (and optionally table) basis for additional principals
If using LakeFormation across accounts, database resource links and resource link describe grants can be created across accounts (required for cross account access)
When cross-account resource links are created, consumer accounts need KMS decrypt permissions on the Glue catalog KMS key in order to read encrypted database metadata. The dataops-project will attempt to grant these permissions automatically, but this only works if the KMS key is managed within the same stack. If the key is managed by an external stack (e.g., glue-catalog-app), you must add the consumer account IDs to the kmsKeyConsumerAccounts configuration in that stack.
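
As a rough illustration of what "KMS decrypt permissions" for consumer accounts amounts to, the sketch below builds a key policy statement that allows a list of consumer accounts to call kms:Decrypt. This is not the statement MDAA generates (the source does not show it); the function name, the Sid, and the account IDs are hypothetical placeholders.

```python
import json


def consumer_decrypt_statement(consumer_account_ids, partition="aws"):
    """Build a key policy statement granting kms:Decrypt to consumer accounts.

    Illustrative only: the Sid and the use of account root principals are
    assumptions, not the exact output of any MDAA module.
    """
    principals = [
        f"arn:{partition}:iam::{account_id}:root"
        for account_id in consumer_account_ids
    ]
    return {
        "Sid": "AllowConsumerAccountsDecrypt",  # hypothetical statement id
        "Effect": "Allow",
        "Principal": {"AWS": principals},
        "Action": ["kms:Decrypt"],
        "Resource": "*",  # in a key policy, "*" refers to the key itself
    }


# Placeholder account ID, used only to show the shape of the statement.
print(json.dumps(consumer_decrypt_statement(["111122223333"]), indent=2))
```

A statement of this shape would have to live on the key that actually encrypts the Glue catalog, which is why the configuration belongs in whichever stack owns that key.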

Project Glue Security Config - Security config which will be used by all jobs under the project

Ensures all job output, logging, and bookmark data is encrypted with the project KMS key
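
To make the three encryption targets concrete, the sketch below builds the EncryptionConfiguration structure that the Glue CreateSecurityConfiguration API accepts, pointing job output (S3), logging (CloudWatch), and job bookmarks at a single key. It is an approximation of what the project security config expresses, not the module's actual implementation; the function name and the key ARN are hypothetical.

```python
def build_encryption_configuration(kms_key_arn):
    """Return a Glue EncryptionConfiguration that uses one KMS key everywhere."""
    return {
        # Job output written to S3
        "S3Encryption": [
            {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
        ],
        # Job logs written to CloudWatch Logs
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": kms_key_arn,
        },
        # Job bookmark state
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS",
            "KmsKeyArn": kms_key_arn,
        },
    }


# Placeholder ARN for illustration only.
config = build_encryption_configuration(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
)

# With boto3 available, the structure could be passed along like this:
#   boto3.client("glue").create_security_configuration(
#       Name="example-security-config", EncryptionConfiguration=config)
```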

Project Glue Security Groups - Security groups which can be used by Glue Connections or other project resources

All egress permitted by default
Self-referencing ingress rule added by default (allows all traffic within security group, required by Glue)
All other ingress traffic denied by default

Glue Connections - Glue connections for reuse across project jobs and crawlers

Network connections for VPC access
- Can use either a project Security Group or an existing security group
JDBC connections for RDBMS access
- Credentials should be stored in a secret and referenced using dynamic references
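- Note that secret rotation will break this configuration, because dynamic references are resolved at deployment time and a rotated value is not picked up by the connection. Instead, use a Network/VPC connection and consume the credentials directly from the secret in Glue job code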
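
The recommendation to read credentials from the secret inside the job could look roughly like the sketch below. It assumes the secret stores a JSON document with "username" and "password" keys (a common layout, but an assumption here) and takes the Secrets Manager client as a parameter so the function stays easy to test. The function name and secret id are hypothetical.

```python
import json


def load_db_credentials(secrets_client, secret_id):
    """Fetch a JSON secret at job run time and return (username, password).

    Because the secret is read on every run, a rotated password is picked
    up automatically on the next job execution.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # Assumed key names; adjust to match how the secret is actually stored.
    return secret["username"], secret["password"]


# Inside a Glue job, where boto3 is available, usage might look like:
#   import boto3
#   client = boto3.client("secretsmanager")
#   user, password = load_db_credentials(client, "example/project/db-credentials")
```

The job's execution role needs permission to read the secret, and the Network connection provides the VPC path to the database.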

Glue Custom Classifiers - Glue classifiers for reuse across project crawlers

DataZone/SageMaker Project and Data Sources - Allows the DataOps project resources to be registered as a DataZone/SageMaker project, data sources, and assets.

Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

dataops-project: # Module Name can be customized
  module_path: '@aws-mdaa/dataops-project' # Must match module NPM package name
  module_configs:
    - ./dataops-project.yaml # Filename/path can be customized

Module Config (./dataops-project.yaml)

Config Schema Docs

Simple Project Config

# ARNs for IAM roles which will be authoring code within the project
dataEngineerRoles:
  - arn: arn:{{partition}}:iam::{{account}}:role/sample-org-dev-instance1-roles-data-engineer

# ARNs for IAM roles which will be provided access to the project's resources (e.g. the bucket)
dataAdminRoles:
  - name: Admin

projectExecutionRoles:
  - arn: ssm:/sample-org/instance1/generated-role/glue-role/arn
  - id: generated-role-id:databrew

# DataOps failure notifications.
# For jobs, this includes state changes of "FAILED", "TIMEOUT", and "STOPPED".
# For crawlers, this includes state changes of "Failed".
failureNotifications:
  email:
    - user1@example.com
    - user2@example.com

# A list of security groups which will be created for
# use by various project resources (such as Lambda functions, Glue jobs, etc)
securityGroupConfigs:
  test-security-group:
    # The id of the VPC on which the SG will be used
    vpcId: test-vpcid
    # Optional - The list of custom egress rules which will be added to the SG.
    # If not specified, the SG will allow all egress traffic by default.
    securityGroupEgressRules:
      ipv4:
        - cidr: 10.10.10.0/24
          protocol: TCP
          port: 443
      sg:
        - sgId: sg-12312412123
          protocol: TCP
          port: 443

# Optional - The ARN of the KMS key which will encrypt all S3 outputs of Jobs run under this project.
# If not specified, the project key will be used.
s3OutputKmsKeyArn: ssm:/sample-org/instance1/datalake/kms/id

# Optional - The Arn of the KMS key used to encrypt the Glue Catalog. Specific access to this key
# will be granted to Glue executor roles for the purpose of decrypting
# Glue connections. If not specified, the standard SSM param created
# by the Glue Catalog Settings module will be used.
glueCatalogKmsKeyArn: ssm:/sample-org/shared/glue-catalog/kms/arn

# Project-level Lake Formation configuration
# Defines tag vocabulary that can be used across all databases in the project
lakeFormation:
  # Define Lake Formation tags with all possible values (tag vocabulary)
  # These tags are created at the account level and can be used by any database
  lfTags:
    - tagKey: environment
      tagValues: [dev, test, prod]
    - tagKey: data_tier
      tagValues: [bronze, silver, gold]
    - tagKey: data_classification
      tagValues: [public, internal, confidential]

# (optional)  Definitions for custom classifiers. Referred to by name in the crawler configuration files.
classifiers:
  # (optional)  Example of a CSV Classifier.  See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-classifier-csvclassifier.html
  classifierCsv:
    classifierType: 'csv'
    configuration:
      csvClassifier:
        allowSingleColumn: false
        containsHeader: 'PRESENT'
        delimiter: '~'
        disableValueTrimming: false
        header:
          - columnA
          - columnB
        quoteSymbol: '^'
  # (optional)  Example of a Grok Classifier.  See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-classifier-grokclassifier.html
  classifierGrok:
    classifierType: 'grok'
    configuration:
      grokClassifier:
        classification: special-logs
        customPatterns: 'MESSAGEPREFIX .*-.*-.*-.*-.*'
        grokPattern: '%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}'
  # (optional) Example of a JSON Classifier.  See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-classifier-jsonclassifier.html
  classifierJson:
    classifierType: 'json'
    configuration:
      jsonClassifier:
        jsonPath: '$[*]'
  # (optional) Example of an XML Classifier.  See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-classifier-xmlclassifier.html
  classifierXml:
    classifierType: 'xml'
    configuration:
      xmlClassifier:
        classification: xml-data
        rowTag: '<row item_a="A" item_b="B"></row>'

# (optional)  Definitions for crawler connections. Referred to by name in the crawler configuration files.
connections:
  # (optional)  Example of a Network Connection.  See: https://docs.aws.amazon.com/glue/latest/webapi/API_Connection.html
  connectionVpc:
    connectionType: NETWORK
    description: VPC Connection Example
    physicalConnectionRequirements:
      availabilityZone: '{{region}}a'
      subnetId: subnet-123abc456def
      securityGroupIdList:
        - sg-890abc123asc
  # (optional)  Example of a Network Connection which uses the SG produced in the project config
  connectionVpcWithProjectSG:
    connectionType: NETWORK
    description: VPC Connection Example
    physicalConnectionRequirements:
      availabilityZone: '{{region}}a'
      subnetId: subnet-09ba402b76a346ffb
      projectSecurityGroupNames:
        - test-security-group
  # (optional)  Example of a JDBC Connection.
  connectionJdbc:
    connectionType: JDBC
    # To understand the supported values in connectionProperties see: https://docs.aws.amazon.com/glue/latest/webapi/API_Connection.html
    connectionProperties:
      JDBC_CONNECTION_URL: 'jdbc:awsathena://AwsRegion=[REGION];UID=[ACCESS KEY];PWD=[SECRET KEY];S3OutputLocation=[LOCATION]'
      JDBC_ENFORCE_SSL: true
    description: JDBC Connection Example
    physicalConnectionRequirements:
      availabilityZone: '{{region}}a'
      subnetId: subnet-123abc456def
      securityGroupIdList:
        - sg-890abc123asc

# (Optional) - Generate a SageMaker Project for this DataOps Project
sagemaker:
  # The SSM Parameter containing domain config details for a SageMaker Domain created by the MDAA SageMaker module
  domainConfigSSMParam: /sample-org/shared/sagemaker/domain/test-domain/config
  # Optional - if true, the project data admin role will be added as an owner of the SageMaker Unified Studio (SMUS) project.
  # This requires the domain to either be in AUTOMATIC user assignment, or for the data admin role
  # to already be added to the domain as a user profile.
  createDataAdminOwners: false
  project:
    profileName: test-profile
    # Optional - Add data sources for Glue databases not created by this DataOps Project.
    # Note - Data sources can be created for project databases using the createSagemakerDatasource database property.
    dataSources:
      # Data Source name
      test-source:
        # The name of a database not created by this DataOps Project
        databaseName: non-project-database
# (Optional) - Generate a DataZone Project for this DataOps Project
# datazone:
#   project:
#     # The SSM Parameter containing domain config details for a DataZone Domain created by the MDAA DataZone module
#     domainConfigSSMParam: /sample-org/shared/datazone/domain/test-domain/config

# (Optional) List of Databases to create. Referred to by name in the crawler configuration files.
databases:
  test-database1:
    description: Test Database 1
    locationBucketName: some-bucket-name
    locationPrefix: data/test1
    lakeFormation:
      # If true (default false), LakeFormation read/write/super grants will be automatically created
      # for the database for project data admin roles
      createSuperGrantsForDataAdminRoles: true

      # If true (default false), LakeFormation read grants will be automatically created
      # for the database for project data engineer roles
      createReadGrantsForDataEngineerRoles: true

      # If true (default false), LakeFormation read/write grants will be automatically created
      # for the database and its S3 Location for project execution roles
      createReadWriteGrantsForProjectExecutionRoles: true

      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"

      # Optional - the name of the resource links to be generated
      # If not specified, defaults to the database name
      createCrossAccountResourceLinkName: 'testing'
      grants:
        # Each grant is keyed with a name which is unique within the context
        # of the database
        example_read_grant:
          # (Optional) Specify the database permissions level ("read", "write", "super")
          # Defaults to "read"
          databasePermissions: read
          # (Optional) Specify the table permissions level ("read", "write", "super")
          # Defaults to "read"
          tablePermissions: read
          # (Optional) - List of tables for which to create grants
          # If not specified, permissions are granted to all tables in the database.
          tables:
            - test-table
          # List of principal references in the "principals" section to which the permissions will be granted
          principals:
            # Each principal (principalArns key) must be named uniquely within the context of the database
            principalA:
              # Arn of IAM SAML IDP
              federationProviderArn: some-federation-provider-arn
              # Federated username
              federatedUser: some-user-name
            principalB:
              federationProviderArn: some-federation-provider-arn
              # Federated group
              federatedGroup: some-group-name
          # Can directly specify the principalArn.
          principalArns:
            principalC: some-other-role-arn

  # Condensed DB config
  test-database2:
    description: Test Database 2
    locationBucketName: some-bucket-name
    locationPrefix: data/test2
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadGrantsForDataEngineerRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      grants:
        example_condensed_read_grant:
          principalArns:
            principalA: arn:{{partition}}:iam::{{account}}:role/cross-account-role

  # A Database which will also create a DataZone Datasource (requires the DataZone project to be configured in this config)
  test-database3:
    description: Test Datazone Datasource
    locationPrefix: data/test-database3
    createDatazoneDatasource: true

  # Verbatim DB Name Config
  test-database4:
    description: Test Database 4
    verbatimName: true
    locationBucketName: some-bucket-name
    locationPrefix: data/test4
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadGrantsForDataEngineerRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      grants:
        example_condensed_read_grant:
          principalArns:
            principalA: arn:{{partition}}:iam::{{account}}:role/cross-account-role

  # Iceberg Compliant DB Name Config
  test-database5:
    description: Test Database 5
    icebergCompliantName: true
    locationBucketName: some-bucket-name
    locationPrefix: data/test5
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadGrantsForDataEngineerRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true
      # Removing cross-account resource links for testing
      # createCrossAccountResourceLinkAccounts:
      #   - "12312412"
      grants:
        example_condensed_read_grant:
          principalArns:
            principalA: arn:{{partition}}:iam::{{account}}:role/cross-account-role

  # Tag-Based Access Control Database Config
  test-database6:
    description: Test Database with Tag-Based Access Control
    locationBucketName: some-bucket-name
    locationPrefix: data/test6
    lakeFormation:
      createSuperGrantsForDataAdminRoles: true
      createReadWriteGrantsForProjectExecutionRoles: true

      # Assign specific tag values to this database
      databaseTagValues:
        - tagKey: environment
          tagValues: [dev]
        - tagKey: data_tier
          tagValues: [bronze]
        - tagKey: data_classification
          tagValues: [public]

      # Define tag-based grants using LF-Tag expressions
      tagBasedGrants:
        # Grant for development environment access
        dev_access:
          principalArns:
            dev-role: arn:{{partition}}:iam::{{account}}:role/dev-data-user
          permissions: [DESCRIBE, SELECT]
          resourceType: TABLE
          lfTagExpression:
            environment: [dev]
            data_tier: [bronze, silver]

        # Grant for production read access to public/internal data
        prod_read_access:
          principalArns:
            prod-reader: arn:{{partition}}:iam::{{account}}:role/prod-data-reader
          permissions: [DESCRIBE, SELECT]
          resourceType: TABLE
          lfTagExpression:
            environment: [prod]
            data_classification: [public, internal]

  # SageMaker Project Data Source
  test-database-sus:
    description: Test SageMaker Database
    # Creates a datasource in the SageMaker Project associated with this DataOps Project
    createSagemakerDatasource: true
    locationBucketName: some-bucket-name
    locationPrefix: data/test-sus

  # DataZone Project Data Source
  # test-database-datazone:
  #   description: Test DataZone Database
  #   # Creates a datasource in the DataZone Project associated with this DataOps Project
  #   createDatazoneDatasource: true
  #   locationBucketName: some-bucket-name
  #   locationPrefix: data/test-sus
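
The tag-based grants for test-database6 can be reasoned about with a small matching function. In Lake Formation, an LF-Tag expression matches a resource when, for every key in the expression, the resource carries at least one of the listed values (keys are combined with AND, values with OR). The sketch below applies that rule to the sample config, assuming the tables inherit the database's tag values; the function and variable names are illustrative only.

```python
def matches_lf_tag_expression(resource_tags, expression):
    """Return True if the resource satisfies every key in the expression.

    resource_tags and expression both map a tag key to a list of values.
    """
    return all(
        any(value in allowed_values for value in resource_tags.get(key, []))
        for key, allowed_values in expression.items()
    )


# Tag values assigned to test-database6 in the sample config.
database_tags = {
    "environment": ["dev"],
    "data_tier": ["bronze"],
    "data_classification": ["public"],
}

# lfTagExpression values from the two sample grants.
dev_access = {"environment": ["dev"], "data_tier": ["bronze", "silver"]}
prod_read_access = {
    "environment": ["prod"],
    "data_classification": ["public", "internal"],
}

print(matches_lf_tag_expression(database_tags, dev_access))        # True
print(matches_lf_tag_expression(database_tags, prod_read_access))  # False
```

Under these assumptions, the dev_access grant applies to tables in test-database6, while prod_read_access does not, because the database is tagged environment=dev rather than prod.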