
Basic Terraform Data Science Platform

This sample illustrates how to deploy a basic Data Science platform on AWS using Terraform modules that consume MDAA. The platform provisions team IAM roles, a SageMaker Studio domain with IAM Identity Center (SSO) authentication, and scoped access to Amazon Bedrock.

This architecture may be suitable when:

  • Data science teams require a governed SageMaker environment with team-scoped execution roles.
  • User authentication is managed centrally through IAM Identity Center (SSO).

Data Science


Deployment Instructions

The following instructions assume you have CDK bootstrapped your target account and that the MDAA source repo is cloned locally. More pre-deployment information and procedures are available in PREDEPLOYMENT.

  1. Copy the sample configurations below into the directory structure shown under Config Directory Structure (or obtain them from the MDAA repo under sample_configs/basic_terraform_datascience_platform).

  2. Edit the mdaa.yaml to specify an organization name. This must be a globally unique name, as it is used in the naming of all deployed resources, some of which are globally named (such as S3 buckets).

  3. If required, edit the mdaa.yaml to specify context: values specific to your environment.

  4. Ensure you are authenticated to your target AWS account.

  5. Optionally, run <path_to_mdaa_repo>/bin/mdaa ls from the directory containing mdaa.yaml to understand what stacks will be deployed.

  6. Optionally, run <path_to_mdaa_repo>/bin/mdaa synth from the directory containing mdaa.yaml and review the produced templates.

  7. Run <path_to_mdaa_repo>/bin/mdaa deploy from the directory containing mdaa.yaml to deploy all modules.

Additional MDAA deployment commands/procedures can be reviewed in DEPLOYMENT.


Configurations

The sample configurations for this architecture are provided below. They are also available under sample_configs/basic_terraform_datascience_platform within the MDAA repo.

Config Directory Structure

basic_terraform_datascience_platform
│   mdaa.yaml
│   tags.yaml
│   roles.yaml
│
├───datascience
│       main.tf
│       providers.tf
│
└───roles
        main.tf
        providers.tf

mdaa.yaml

This configuration specifies the global, domain, env, and module configurations required to configure and deploy this sample architecture.

Note - Before deployment, populate the mdaa.yaml with appropriate organization and context values for your environment.

# Contents available in mdaa.yaml
# All resources will be deployed to the default region specified in the environment or AWS configurations.
# A specific AWS region name can optionally be specified.
region: default

## Pre-Deployment Instructions

# TODO: Set an appropriate, unique organization name
# Failure to do so may result in global naming conflicts.
organization: <your-org-name>

# TODO: If using an S3 Terraform backend, uncomment these lines and set the backend S3 bucket and DynamoDB table names.
# If not configured, local state tracking will be used.
terraform:
  override:
    terraform:
      backend:
        s3:
          bucket: <your-tf-state-bucket-name>
          dynamodb_table: <your-tf-state-lock-ddb-table>

# One or more domains may be specified. The domain name will be incorporated by the default
# naming implementation to prefix all resource names.
domains:
  # The name of the domain. In this case, we are building a 'shared' domain.
  shared:
    # One or more environments may be specified, typically along the lines of 'dev', 'test', and/or 'prod'
    environments:
      # The environment name will be incorporated into resource names by the default naming implementation.
      dev:
        use_bootstrap: false
        # The target deployment account can be specified per environment.
        # If 'default' or not specified, the account configured in the environment will be assumed.
        account: default
        # The list of modules which will be deployed. A module points to a specific MDAA CDK App, and
        # specifies a deployment configuration file if required.
        modules:
          # A roles module deployment will be used to generate IAM roles
          roles:
            module_type: tf
            module_path: ./roles/
          # A Data Science Team module will deploy the resources required for the
          # data science platform.
          example-team:
            module_type: tf
            module_path: ./datascience/
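
For reference, the terraform override in mdaa.yaml above is assumed to inject a standard S3 backend configuration into each Terraform module, along these lines (the state key layout shown is an illustrative assumption, not MDAA's documented behavior):

```hcl
# Sketch only: an S3 backend block like the one the override above would inject.
# The key layout is an assumed example; MDAA defines the real one.
terraform {
  backend "s3" {
    bucket         = "<your-tf-state-bucket-name>"
    dynamodb_table = "<your-tf-state-lock-ddb-table>"
    key            = "shared/dev/roles/terraform.tfstate" # assumed layout
    encrypt        = true
  }
}
```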

roles/main.tf

A Terraform module which will deploy the IAM roles required for the Data Science platform.

# Contents available in roles/main.tf
# Create Sagemaker Team Execution Role

// Variables
variable "region" {
  description = "The region to be deployed to"
  type        = string
}

variable "org" {
  description = "The org name used in the naming convention"
  type        = string
}

variable "domain" {
  description = "The domain name used in the naming convention"
  type        = string
}

variable "env" {
  description = "The env name used in the naming convention"
  type        = string
}

variable "module_name" {
  description = "The module name used in the naming convention"
  type        = string
}

data "aws_caller_identity" "current" {}

locals {
  account   = data.aws_caller_identity.current.account_id
  team_name = "ds-team-one" # Provide a suitable name for your team
}

module "role_name" {
  # checkov:skip=CKV_TF_1:Ensure Terraform module sources use a commit hash:Not required.
  # checkov:skip=CKV_TF_2:Ensure Terraform module sources use a tag with a version number:Not required.

  source      = "<your-git-url>/naming_convention"
  base_name   = "${local.team_name}-exec-role"
  module_name = var.module_name
}
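
# Note: the naming_convention module defines the authoritative name format;
# base_resource_name is assumed to compose the org/domain/env/module inputs
# with the base_name, roughly equivalent to:
#   lower("${var.org}-${var.domain}-${var.env}-${var.module_name}-${local.team_name}-exec-role")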

resource "aws_iam_policy" "datascience_user_policy" {
  name        = "${module.role_name.base_resource_name}-exec-policy"
  path        = "/"
  description = "Provides basic service access to the team execution role"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowBedrockAccess"
        Effect = "Allow"
        Action = [
          "bedrock:ListFoundationModels*",
          "bedrock:ListCustomModels*",
          "bedrock:InvokeModel",
          "bedrock:InvokeModelWithResponseStream",
          "bedrock:GetFoundationModel*",
          "bedrock:GetGuardrail"
        ]
        Resource = [
          "arn:aws:bedrock:${var.region}::*-model",
          "arn:aws:bedrock:${var.region}:${local.account}:*"
        ]
      },
      {
        Sid      = "SageMakerLaunchProfileAccess"
        Effect   = "Allow"
        Action   = ["sagemaker:CreatePresignedDomainUrl"]
        Resource = "*"
        Condition = {
          StringEquals = {
            # $${aws:userid} escapes Terraform interpolation, so the rendered
            # policy contains the literal IAM policy variable ${aws:userid}
            "sagemaker:ResourceTag/userid" = "$${aws:userid}"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role" "sagemaker_team_exec_role" {
  name = module.role_name.base_resource_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = [
            "sagemaker.amazonaws.com",
            "bedrock.amazonaws.com",
            "ec2.amazonaws.com"
          ]
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "datascience_user_policy_attachment" {
  role       = aws_iam_role.sagemaker_team_exec_role.name
  policy_arn = aws_iam_policy.datascience_user_policy.arn
}

datascience/main.tf

A Terraform module which will deploy the Data Science platform by consuming the MDAA Data Science module.

# Contents available in datascience/main.tf
# Copyright © Amazon.com and Affiliates: This deliverable is considered Developed Content as defined in the AWS Service Terms

// Variables
variable "region" {
  description = "The region to be deployed to"
  type        = string
}

variable "org" {
  description = "The org name used in the naming convention"
  type        = string
}

variable "domain" {
  description = "The domain name used in the naming convention"
  type        = string
}

variable "env" {
  description = "The env name used in the naming convention"
  type        = string
}

variable "module_name" {
  description = "The module name used in the naming convention"
  type        = string
}


data "aws_caller_identity" "current" {}

locals {

  account_id = data.aws_caller_identity.current.account_id
  team_name  = "ds-team-one"

  # Roles
  data_admin_role_arn = "arn:aws:iam::${local.account_id}:role/Admin"
  sso_group_name      = "DataScientist"
  team_user_role_arn  = "arn:aws:iam::${local.account_id}:role/DataScientist"
  team_exec_role_arn  = "arn:aws:iam::${local.account_id}:role/<your-team-exec-role>" # Team Execution Role 

  # Define a policy prefix that works across environments. 
  # These policies may be attached to Permission sets
  policy_prefix = lower(format(
    "%s-%s-%s-%s",
    var.org,
    var.domain,
    var.module_name,
    local.team_name
  ))

  ## Provide information about Identity Store ID and Network configuration where the Sagemaker Domain will be created
  identity_store_id = "<Identity Store Id>" # Example: "d-012345abcd"
  network = {
    vpc_id  = "<vpc id>"
    subnets = ["<subnet-id1>", "<subnet-id2>"],
    # (Optional) S3 prefix list IDs for various regions.
    s3_prefix_list = {
      "ca-central-1" = "pl-7da54014",
      "us-east-1"    = "pl-63a5400a"
    }
  }
}
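
# Optional variation (not part of this sample): the S3 gateway prefix list for
# the current region can be looked up at plan time instead of hardcoded above:
#
# data "aws_prefix_list" "s3" {
#   name = "com.amazonaws.${var.region}.s3"
# }
#
# data.aws_prefix_list.s3.id could then replace local.network.s3_prefix_list[var.region].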

module "mdaa_ds_team" {
  # checkov:skip=CKV_TF_1:Ensure Terraform module sources use a commit hash:Not required.
  # checkov:skip=CKV_TF_2:Ensure Terraform module sources use a tag with a version number:Not required.
  source                      = "<your-git-url>/datascience-team"
  module_name                 = var.module_name
  base_name                   = local.team_name
  data_admin_role_arn         = local.data_admin_role_arn
  team_user_role_arn          = local.team_user_role_arn
  team_exec_role_arn          = local.team_exec_role_arn
  verbatim_policy_name_prefix = local.policy_prefix
  sagemaker_domain_config = {
    auth = {
      mode                   = "SSO",
      identity_store_id      = local.identity_store_id
      assign_sso_group_names = [local.sso_group_name]

    }
    vpc_id         = local.network.vpc_id
    app_subnet_ids = local.network.subnets

    # Provide Ingress/Egress rules based on your network configuration
    # Below Sample Security Group Ingress/Egress rules allow the following traffic:
    # - Ingress: Traffic from the VPC CIDR block and within the SG
    # - Egress: Traffic to the VPC CIDR block, S3 prefix list CIDRs, and within the SG
    security_group_ingress_egress_rules = {
      # Ingress Rules
      cidr_block_ingress_rules = [
        {
          description = "Traffic originating from the VPC CIDR block"
          from_port   = 443,
          to_port     = 443,
          protocol    = "tcp",
          cidr_blocks = ["10.0.0.0/16"]
        }
      ]
      self_ingress_rules = [
        {
          description = "Self-Ref: Traffic from within the SG"
          from_port   = 0,
          to_port     = 0,
          protocol    = "all",
        }
      ]
      # Outbound Rules
      cidr_block_egress_rules = [
        {
          description = "Outbound: Traffic to the VPC CIDR block",
          from_port   = 443,
          to_port     = 443,
          protocol    = "tcp",
          cidr_blocks = ["10.0.0.0/16"]
        }
      ]

      self_egress_rules = [
        {
          description = "Self-Ref: Traffic from within the SG",
          from_port   = 0,
          to_port     = 0,
          protocol    = "all",
        }
      ]
      prefix_list_egress_rules = [{
        description     = "Outbound to s3 prefix list CIDRs"
        from_port       = 443,
        to_port         = 443,
        protocol        = "tcp",
        prefix_list_ids = [local.network.s3_prefix_list[var.region]]
      }]
    }

  }

}
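
The providers.tf files listed in the config directory structure are not reproduced in this sample. A minimal sketch, assuming only the AWS provider is required and that the region is passed in via the region variable:

```hcl
# Minimal providers.tf sketch (assumed; not part of the published sample).
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}
```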