
Basic Terraform Data Lake

This basic S3 Data Lake sample illustrates how to create an S3 data lake on AWS. Access to the data lake may be granted to IAM and federated principals, and is controlled on a coarse-grained basis only (using S3 bucket policies). This sample uses Terraform module implementations.

This architecture may be suitable when:

  • Data is primarily unstructured and will not be consumed via Athena.
  • User access to the data lake does not need to be governed by fine-grained access controls.

Figure: Basic Terraform Datalake


Deployment Instructions

The following instructions assume you have bootstrapped your target account with CDK and cloned the MDAA source repo locally. More predeployment information and procedures are available in PREDEPLOYMENT.

  1. Copy the sample configurations into the directory structure specified under Config Directory Structure (or obtain them from the MDAA repo under sample_configs/basic_terraform_datalake).

  2. Edit the mdaa.yaml to specify an organization name. This must be a globally unique name, as it is used in the naming of all deployed resources, some of which are globally named (such as S3 buckets).

  3. If required, edit the mdaa.yaml to specify context: values specific to your environment.

  4. Ensure you are authenticated to your target AWS account.

  5. Optionally, run <path_to_mdaa_repo>/bin/mdaa ls from the directory containing mdaa.yaml to understand what stacks will be deployed.

  6. Optionally, run <path_to_mdaa_repo>/bin/mdaa synth from the directory containing mdaa.yaml and review the produced templates.

  7. Run <path_to_mdaa_repo>/bin/mdaa deploy from the directory containing mdaa.yaml to deploy all modules.

Additional MDAA deployment commands/procedures can be reviewed in DEPLOYMENT.
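
Steps 4 through 7 can be sketched as a small shell script. This is a minimal sketch, not part of MDAA: the `MDAA_REPO` and `CONFIG_DIR` variables, their default values, the `DRY_RUN` switch, and the `run_mdaa` helper are all assumptions introduced for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
set -eu

# Assumption: absolute path to your local clone of the MDAA repo (hypothetical default).
MDAA_REPO="${MDAA_REPO:-/opt/mdaa}"
# Assumption: directory containing mdaa.yaml.
CONFIG_DIR="${CONFIG_DIR:-.}"
# Dry run is on by default; set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: run (or print) an mdaa command from the config directory.
run_mdaa() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "cd $CONFIG_DIR && $MDAA_REPO/bin/mdaa $*"
  else
    (cd "$CONFIG_DIR" && "$MDAA_REPO/bin/mdaa" "$@")
  fi
}

# Step 4: confirm you are authenticated to the target account (skipped in dry run).
if [ "$DRY_RUN" != "1" ]; then
  aws sts get-caller-identity
fi

run_mdaa ls      # Step 5 (optional): list what will be deployed
run_mdaa synth   # Step 6 (optional): produce templates for review
run_mdaa deploy  # Step 7: deploy all modules
```

Running the script with `DRY_RUN=0` executes the three commands in order and stops at the first failure because of `set -e`.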


Configurations

The sample configurations for this architecture are provided below. They are also available under sample_configs/basic_terraform_datalake within the MDAA repo.

Config Directory Structure

basic_datalake
│   mdaa.yaml
├───datalake
│   └───main.tf
└───glue-catalog
    └───main.tf
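
Before deploying, it can help to confirm that the three files in this structure are in place. The following is a minimal sketch; the `check_layout` function is a hypothetical helper, not an MDAA command, and it assumes you pass it the path of your configuration directory.

```shell
# Hypothetical helper: report any file from the sample layout that is missing
# under the given root directory. Returns non-zero if anything is missing.
check_layout() {
  root="$1"
  missing=0
  for f in mdaa.yaml datalake/main.tf glue-catalog/main.tf; do
    if [ ! -f "$root/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (the directory name is whatever you chose for your config root):
# check_layout ./basic_datalake
```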

mdaa.yaml

This configuration specifies the global, domain, env, and module configurations required to configure and deploy this sample architecture.

Note: Before deployment, populate the mdaa.yaml with appropriate organization and context values for your environment.

# Contents available in mdaa.yaml
# All resources will be deployed to the default region specified in the environment or AWS configurations.
# Optionally, specify a specific AWS Region name.
region: <your-aws-region-name>

## Pre-Deployment Instructions

# TODO: Set an appropriate, unique organization name
# Failure to do so may result in global naming conflicts.
organization: <your-org-name>

# TODO: If using an S3 Terraform backend, set the backend S3 bucket and DynamoDB table names below.
# If this block is not configured, local state tracking will be used.
terraform:
  override:
    terraform:
      backend:
        s3:
          bucket: <your-tf-state-bucket-name>
          dynamodb_table: <your-tf-state-lock-ddb-table>

# One or more domains may be specified. Domain name will be incorporated by default naming implementation
# to prefix all resource names.
domains:
  # The name of the domain. In this case, we are building a 'shared' domain.
  shared:
    # One or more environments may be specified, typically along the lines of 'dev', 'test', and/or 'prod'
    environments:
      # The environment name will be incorporated into resource names by the default naming implementation.
      dev:
        use_bootstrap: false
        # The target deployment account can be specified per environment.
        # If 'default' or not specified, the account configured in the environment will be assumed.
        account: default
        # The list of modules which will be deployed. A module points to a specific MDAA CDK App, and
        # specifies a deployment configuration file if required.
        modules:
          # This module will configure Glue Catalog settings
          # (it consumes the MDAA glue-catalog-setting Terraform module).
          glue-catalog:
            module_type: tf
            module_path: ./glue-catalog/
          # This module will deploy the S3 data lake buckets.
          # Coarse-grained access may be granted directly to S3 for certain roles.
          datalake1:
            module_type: tf
            module_path: ./datalake/
            mdaa_compliant: true
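
The sample mdaa.yaml above, like the Terraform files shown later on this page, uses `<your-...>` placeholders for values you must supply. A quick way to spot any that were left unreplaced is sketched below; `find_placeholders` is a hypothetical helper built on standard `grep`, not an MDAA command, and the pattern assumes placeholders follow the `<your-...>` form used in these samples.

```shell
# Hypothetical helper: print every unreplaced <your-...> placeholder,
# prefixed with the file name and line number. Prints nothing if none remain.
find_placeholders() {
  grep -H -n -o '<your-[a-z-]*>' "$@" || true
}

# Example usage from the directory containing mdaa.yaml:
# find_placeholders mdaa.yaml datalake/main.tf glue-catalog/main.tf
```

Note that the `region` and Terraform backend entries are optional, so a remaining placeholder there may simply mean the entry should be removed rather than filled in.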

datalake/main.tf

This Terraform module consumes the MDAA DataLake TF module to create a data lake, and the MDAA athena-workgroup TF module to create a Data Engineer Athena workgroup.

# Contents available in datalake/main.tf
# Copyright © Amazon.com and Affiliates: This deliverable is considered Developed Content as defined in the AWS Service Terms.

variable "region" {
  description = "The region to be deployed to"
  type        = string
}

variable "org" {
  description = "The org name used in the naming convention"
  type        = string
}

variable "domain" {
  description = "The domain name used in the naming convention"
  type        = string
}

variable "env" {
  description = "The env name used in the naming convention"
  type        = string
}

variable "module_name" {
  description = "The module name used in the naming convention"
  type        = string
}

variable "force_destroy" {
  description = "If true, the resources will be force destroyed"
  type        = bool
  default     = false
}

locals {
  # Sample Roles
  data_admin_role_arn     = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/Admin"
  data_engineer_role_arn  = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/DataEngineer"
  data_scientist_role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/DataScientist"
}

module "mdaa_datalake" {
  # checkov:skip=CKV_TF_1:Ensure Terraform module sources use a commit hash:Not required.
  # checkov:skip=CKV_TF_2:Ensure Terraform module sources use a tag with a version number:Not required.

  # TODO: Point to the MDAA Terraform Git Repo
  # If using Git SSH, be sure to use the git::ssh://<url> syntax. Otherwise TF might download the module, but checkov will fail to.
  source        = "<your-git-url>//modules/datalake"
  force_destroy = var.force_destroy
  module_name   = var.module_name
  bucket_definitions = {
    # RAW BUCKET
    "raw" = {
      base_name = "raw"
      access_policies = {
        "root" = {
          READWRITESUPER = {
            role_arns = [local.data_admin_role_arn],
          }
        }
        "data" = {
          READ = {
            role_arns = [local.data_scientist_role_arn],
          }
          READWRITE = {
            role_arns = [local.data_engineer_role_arn],
          }
        }
      }
    }
    # CURATED BUCKET
    "curated" = {
      base_name = "curated"
      access_policies = {
        "root" = {
          READWRITESUPER = {
            role_arns = [local.data_admin_role_arn],
          }
        }
        "data-product-A" = {
          READWRITE = {
            role_arns = [local.data_engineer_role_arn, local.data_scientist_role_arn],
          }
        }
        "data-product-B" = {
          READ = {
            role_arns = [local.data_scientist_role_arn],
          }
          READWRITE = {
            role_arns = [local.data_engineer_role_arn],
          }
        }
      }
    }
  }
}

data "aws_caller_identity" "current" {}

# Creates a Data Engineer Athena Workgroup
module "example_workgroup" {
  # checkov:skip=CKV_TF_1:Ensure Terraform module sources use a commit hash:Not required.
  # checkov:skip=CKV_TF_2:Ensure Terraform module sources use a tag with a version number:Not required.

  # TODO: Point to the MDAA Terraform Git Repo
  # If using Git SSH, be sure to use the git::ssh://<url> syntax. Otherwise TF might download the module, but checkov will fail to.
  source                         = "<your-git-url>//modules/athena-workgroup"
  module_name                    = var.module_name
  base_name                      = "data-engineer"
  force_destroy                  = var.force_destroy
  bytes_scanned_cutoff_per_query = 10000000000
  data_admin_role_arn            = local.data_admin_role_arn
  service_execution_role_arns    = [local.data_engineer_role_arn, local.data_scientist_role_arn]
}

glue-catalog/main.tf

This Terraform module consumes the MDAA GlueCatalog TF module to configure Glue Catalog settings.

# Contents available in glue-catalog/main.tf
# Copyright © Amazon.com and Affiliates: This deliverable is considered Developed Content as defined in the AWS Service Terms.

variable "region" {
  description = "The region to be deployed to"
  type        = string
}

variable "org" {
  description = "The org name used in the naming convention"
  type        = string
}

variable "domain" {
  description = "The domain name used in the naming convention"
  type        = string
}

variable "env" {
  description = "The env name used in the naming convention"
  type        = string
}

variable "module_name" {
  description = "The module name used in the naming convention"
  type        = string
}

variable "force_destroy" {
  description = "If true, the resources will be force destroyed"
  type        = bool
  default     = false
}

module "glue-catalog" {
  # checkov:skip=CKV_TF_1:Ensure Terraform module sources use a commit hash:Not required.
  # checkov:skip=CKV_TF_2:Ensure Terraform module sources use a tag with a version number:Not required.

  # TODO: Point to the MDAA Terraform Git Repo
  # If using Git SSH, be sure to use the git::ssh://<url> syntax. Otherwise TF might download the module, but checkov will fail to.
  source      = "<your-git-url>//modules/glue-catalog-setting"
  module_name = var.module_name
}