Data Lake

This Data Lake CDK application is used to configure and deploy the resources required to define a secure S3-based Data Lake on AWS.


Deployed Resources and Compliance Details

DataLake

Data Lake KMS Key - This key will be used to encrypt all Data Lake resources which support encryption at rest (including the Data Lake S3 Buckets).

  • Key usage access granted to all data lake roles (via key policy)
  • Encrypt access granted to S3 service to allow S3 Inventory data to be written (via key policy)
  • Additional permissions may be granted via IAM policy
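To make the S3 Inventory bullet above more concrete, here is a minimal sketch of the kind of key policy statement AWS documents for letting the S3 service encrypt inventory output with a customer managed key. The function name, the Sid, the example account id and bucket ARN, and the choice of `kms:GenerateDataKey` as the only action are assumptions for illustration; the exact statement generated by this module may differ.

```python
import json


def s3_inventory_key_statement(account_id: str, source_bucket_arn: str) -> dict:
    """Build a KMS key policy statement that lets the S3 service encrypt inventory output."""
    return {
        "Sid": "AllowS3InventoryEncrypt",  # hypothetical Sid
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": ["kms:GenerateDataKey"],  # assumed minimal action set
        "Resource": "*",  # inside a key policy, "*" refers to the key itself
        "Condition": {
            # Restrict use to inventory jobs coming from our own account and bucket
            "StringEquals": {"aws:SourceAccount": account_id},
            "ArnLike": {"aws:SourceArn": source_bucket_arn},
        },
    }


statement = s3_inventory_key_statement("111122223333", "arn:aws:s3:::example-raw-bucket")
print(json.dumps(statement, indent=2))
```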

Data Lake S3 Buckets - These buckets will be deployed to form the persistence basis of the Data Lake.

  • Bucket versioning will be automatically enabled
    • Only super user level access may delete object versions (permanent deletion)
    • Write access otherwise only allows creation of delete markers
  • Bucket will be encrypted by default using the Data Lake KMS Key
    • By default, exclusive use of Data Lake KMS key will be enforced via bucket policy
    • If exclusive use is not enforced, an alternative KMS key may be specified
    • BucketKey feature enabled to minimize impact on KMS Service during high volume read/write operations
  • Bucket policy will enforce use of SSL
  • Access policy statements (in the bucket policy) are configured per prefix, and can grant read, write, or super user level access
  • By default, a defaultDeny bucket policy statement will be added to deny bucket read/write actions to any role not specified in the config
    • Configurable by bucket
  • Each bucket may have S3 Inventory enabled to automatically produce inventory reports, written either to the bucket itself or to an external bucket
  • Each bucket may have S3 Lifecycle rules attached
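The bucket policy behaviors listed above can be sketched as plain policy documents. The following is an illustration only: the function names, Sids, and the action lists per access level are assumptions, not the statements this module actually emits. It relies on standard S3 behavior: on a versioned bucket `s3:DeleteObject` only creates a delete marker, while `s3:DeleteObjectVersion` permanently removes a version.

```python
def ssl_only_statement(bucket_arn: str) -> dict:
    """Deny every S3 action on the bucket when the request is not sent over TLS."""
    return {
        "Sid": "DenyInsecureTransport",  # hypothetical Sid
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [bucket_arn, f"{bucket_arn}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


def exclusive_kms_statement(bucket_arn: str, key_arn: str) -> dict:
    """Deny uploads that explicitly name a KMS key other than the Data Lake key."""
    return {
        "Sid": "DenyOtherKmsKeys",  # hypothetical Sid
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": f"{bucket_arn}/*",
        # IfExists lets requests without the header fall back to bucket default encryption
        "Condition": {
            "StringNotEqualsIfExists": {
                "s3:x-amz-server-side-encryption-aws-kms-key-id": key_arn
            }
        },
    }


# Assumed action tiers for illustration; the module's real action lists may differ.
ACTIONS_BY_LEVEL = {
    "read": ["s3:GetObject", "s3:GetObjectVersion"],
    # On a versioned bucket, s3:DeleteObject only adds a delete marker.
    "write": ["s3:PutObject", "s3:DeleteObject"],
    # s3:DeleteObjectVersion permanently removes a specific object version.
    "super": ["s3:DeleteObjectVersion"],
}
LEVEL_ORDER = ["read", "write", "super"]


def actions_for(level: str) -> list[str]:
    """Return the cumulative actions for a level (each level includes the ones below it)."""
    allowed = LEVEL_ORDER[: LEVEL_ORDER.index(level) + 1]
    return [action for name in allowed for action in ACTIONS_BY_LEVEL[name]]
```

Treating the levels as cumulative is a design assumption made here for brevity; it mirrors the way the super user level is described as the only one able to delete object versions.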

S3 Lifecycle Rules - A set of lifecycle rule configurations which can be applied across data lake buckets
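Before attaching a lifecycle configuration to a bucket, it can help to sanity check each rule. The sketch below uses the field names from the sample config further down this page (`Transitions`, `ExpirationDays`, `ExpiredObjectDeleteMarker`, and so on). The function `check_lifecycle_rule` is hypothetical and not part of the module. The first check comes from the sample config's note that `ExpiredObjectDeleteMarker` cannot be combined with `ExpirationDays`; the ordering checks are assumed conventions.

```python
def check_lifecycle_rule(rule: dict) -> list[str]:
    """Return a list of human-readable problems found in one lifecycle rule."""
    problems = []

    # The sample config notes that these two settings cannot be combined.
    if "ExpirationDays" in rule and rule.get("ExpiredObjectDeleteMarker"):
        problems.append("ExpiredObjectDeleteMarker cannot be set when ExpirationDays is set")

    # Assumed convention: each transition step happens no earlier than the previous one.
    for key in ("Transitions", "NoncurrentVersionTransitions"):
        days = [step["Days"] for step in rule.get(key, [])]
        if days != sorted(days):
            problems.append(f"{key}: Days should not decrease from one step to the next")

    # Assumed check: expiring an object before its last transition makes that step pointless.
    current_days = [step["Days"] for step in rule.get("Transitions", [])]
    if current_days and "ExpirationDays" in rule:
        if rule["ExpirationDays"] <= max(current_days):
            problems.append("ExpirationDays should be later than the last transition")

    return problems


sample_rule = {
    "Status": "Enabled",
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 60, "StorageClass": "GLACIER_IR"},
    ],
    "ExpirationDays": 270,
    "ExpiredObjectDeleteMarker": True,  # conflicts with ExpirationDays
}
for problem in check_lifecycle_rule(sample_rule):
    print(problem)
```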

Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

          datalake: # Module Name can be customized
            module_path: "@aws-mdaa/datalake" # Must match module NPM package name
            module_configs:
              - ./datalake.yaml # Filename/path can be customized

Module Config (./datalake.yaml)

Config Schema Docs

roles:
  DataAdmin: # The Logical Config Role name
    # A list of role references (ARN, name, id, or SSM param) which specify which physical roles will be bound to the Logical Config Role
    - arn: arn:{{partition}}:iam::{{account}}:role/Admin
    - name: Admin
    - id: AROA1234567890
  DataUser:
    - id: ssm:/sample-org/instance1/generated-role/test-role/id
    - arn: ssm:/sample-org/instance1/generated-role/data-scientist/arn
    - id: generated-role-id:test-role
    - id: generated-role-id:data-scientist
    # Resolves to the role generated by IAM Identity Center/SSO for the 'data_scientist' permission_set
    - name: data_scientist
      sso: true

# Definitions of access policies which grant access to S3 paths for specified Logical Config Roles.
# These Access Policies can then be applied to Data Lake buckets (they will be injected into the corresponding bucket policies.)
accessPolicies:
  Root: # A friendly name for the access policy
    rule:
      prefix: / # The S3 prefix path to which policy will be applied in the bucket policies.
      # A list of Logical Config Roles which will be provided ReadWriteSuper access.
      # ReadWriteSuper access allows reading, writing, and permanent data deletion.
      ReadWriteSuperRoles:
        - DataAdmin
  Data: # A friendly name for the access policy
    rule:
      prefix: /data
      ReadRoles:
        - DataUser

lifecycleConfigurations:
  SampleConfiguration1: # A friendly name for the lifecycle transition rules configuration.
    SampleRule1:
      Status: Enabled # Enabled or disabled
      Prefix: test_prefix # (Optional) Prefix within S3 bucket to which the rule applies.
      ObjectSizeGreaterThan: 500 # (Optional)
      ObjectSizeLessThan: 10000 # (Optional)
      AbortIncompleteMultipartUploadAfter: 2 # (Optional) Number of days after initiation of a multipart upload.
      Transitions: # (Optional) Storage classes to move the current version of objects to after object creation.
        - Days: 30 # (Optional) Number of days after object creation.
          StorageClass: STANDARD_IA # (Optional) Storage class to move the object to
        - Days: 60
          StorageClass: GLACIER_IR
        - Days: 150
          StorageClass: GLACIER
        - Days: 240
          StorageClass: DEEP_ARCHIVE
      ExpirationDays: 270 # (Optional) Number of days. The current version of an object will expire this many days after object creation.
      # ExpiredObjectDeleteMarker: True # Remove expired object delete markers (delete markers with no remaining noncurrent versions). Cannot be set if ExpirationDays is set
      NoncurrentVersionTransitions: # (Optional) Storage classes to move previous (noncurrent) versions of objects to.
        - Days: 30 # (Optional) Number of days after the version becomes noncurrent.
          StorageClass: STANDARD_IA # (Optional) Storage class to move the object to
          NewerNoncurrentVersions: 1 # (Optional) Number of latest non-current versions to retain.
        - Days: 60
          StorageClass: GLACIER_IR
          NewerNoncurrentVersions: 2
        - Days: 150
          StorageClass: GLACIER
          NewerNoncurrentVersions: 3
        - Days: 240
          StorageClass: DEEP_ARCHIVE
          NewerNoncurrentVersions: 4
      NoncurrentVersionExpirationDays: 270 # (Optional) Number of days. A noncurrent version will expire this many days after it becomes noncurrent.
      NoncurrentVersionsToRetain: 5 # (Optional) Number of latest non-current versions to retain.
    SampleRule2:
      Status: Enabled # Enabled or disabled
      Prefix: test_prefix # (Optional) Prefix within S3 bucket to which the rule applies.
      ObjectSizeGreaterThan: 500 # (Optional)
      ObjectSizeLessThan: 10000 # (Optional)
      AbortIncompleteMultipartUploadAfter: 2 # (Optional) Number of days after initiation of a multipart upload.
      Transitions:
        - Days: 30
          StorageClass: STANDARD_IA
        - Days: 60
          StorageClass: GLACIER_IR
        - Days: 150
          StorageClass: GLACIER
        - Days: 240
          StorageClass: DEEP_ARCHIVE
      ExpiredObjectDeleteMarker: True
      NoncurrentVersionTransitions:
        - Days: 30
          StorageClass: STANDARD_IA
          NewerNoncurrentVersions: 1
        - Days: 60
          StorageClass: GLACIER_IR
          NewerNoncurrentVersions: 2
        - Days: 150
          StorageClass: GLACIER
          NewerNoncurrentVersions: 3
        - Days: 240
          StorageClass: DEEP_ARCHIVE
          NewerNoncurrentVersions: 4
      NoncurrentVersionExpirationDays: 270 # Number of days. A noncurrent version will expire this many days after it becomes noncurrent.
      NoncurrentVersionsToRetain: 5
  SampleConfiguration2: # A friendly name for the lifecycle transition rules configuration.
    SampleRule1:
      Status: Enabled # Enabled or disabled
      Prefix: test_prefix # (Optional) Prefix within S3 bucket to which the rule applies.
      Transitions: # (Optional) Storage classes to move the current version of objects to after object creation.
        - Days: 30 # (Optional) Number of days after object creation.
          StorageClass: STANDARD_IA # (Optional) Storage class to move the object to
    SampleRule2:
      Status: Enabled # Enabled or disabled
      Prefix: test_prefix # (Optional) Prefix within S3 bucket to which the rule applies.
      NoncurrentVersionTransitions:
        - Days: 30
          StorageClass: STANDARD_IA
          NewerNoncurrentVersions: 1

# The set of S3 buckets which will be created, and the access policies which will be applied.
buckets:
  raw:
    defaultDeny: false
    #enableEventBridgeNotifications: true
    createFolderSkeleton: false
    # Inventory data will be written for each listed name/prefix under /inventory/<name>
    inventories:
      all-data:
        prefix: data
    accessPolicies:
      - Root
      - Data
    lifecycleConfiguration: SampleConfiguration1

  transformed:
    createFolderSkeleton: true
    enableEventBridgeNotifications: true
    lakeFormationLocations:
      all-data:
        prefix: data
      all-data2:
        prefix: data2
    accessPolicies:
      - Root
      - Data
    lifecycleConfiguration: SampleConfiguration2
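
The config above chains three levels of names: buckets reference access policies, and access policies reference Logical Config Roles. As a rough sketch of how those references resolve, the snippet below expands a bucket's `accessPolicies` into (prefix, role, level) grants. The config is written as a Python dict to keep the example dependency free. The helper `grants_for_bucket` and the level labels are hypothetical, and only the two role list keys shown in the sample (`ReadRoles`, `ReadWriteSuperRoles`) are handled.

```python
# A trimmed-down copy of the sample config, expressed as a Python dict.
config = {
    "roles": {"DataAdmin": [{"name": "Admin"}], "DataUser": [{"name": "data_scientist", "sso": True}]},
    "accessPolicies": {
        "Root": {"rule": {"prefix": "/", "ReadWriteSuperRoles": ["DataAdmin"]}},
        "Data": {"rule": {"prefix": "/data", "ReadRoles": ["DataUser"]}},
    },
    "buckets": {
        "raw": {"accessPolicies": ["Root", "Data"]},
        "transformed": {"accessPolicies": ["Root", "Data"]},
    },
}

# Role list keys seen in the sample config, mapped to illustrative level labels.
ROLE_LIST_LEVELS = {"ReadRoles": "read", "ReadWriteSuperRoles": "super"}


def grants_for_bucket(config: dict, bucket_name: str) -> list[tuple[str, str, str]]:
    """Expand a bucket's access policy names into (prefix, logical role, level) tuples."""
    grants = []
    for policy_name in config["buckets"][bucket_name].get("accessPolicies", []):
        if policy_name not in config["accessPolicies"]:
            raise ValueError(f"bucket {bucket_name!r} references unknown access policy {policy_name!r}")
        rule = config["accessPolicies"][policy_name]["rule"]
        for list_key, level in ROLE_LIST_LEVELS.items():
            for role in rule.get(list_key, []):
                if role not in config["roles"]:
                    raise ValueError(f"access policy {policy_name!r} references unknown role {role!r}")
                grants.append((rule["prefix"], role, level))
    return grants


for prefix, role, level in grants_for_bucket(config, "raw"):
    print(f"{prefix}: {role} -> {level}")
```

A check like this catches misspelled policy or role names before deployment, which is where a dangling reference is cheapest to fix.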