Skip to content

Data Science Team

Note: This documentation is also available in a rendered format here.

Provisions a complete data science team environment including SageMaker AI Studio domains with user profiles, Athena workgroups, team S3 data buckets, KMS encryption, Lake Formation permissions, and IAM role management. Supports both IAM and SSO authentication modes with configurable security guardrails. Use this module when you need to onboard a data science team with a ready-to-use ML workspace, shared storage, and governed data access in a single deployment.


Deployed Resources

This module deploys and integrates the following resources:

Team Mini Lake and KMS Key - An S3-based mini data lake which the team can use as a persistence layer for their activities. Deployed using the Datalake KMS and Buckets L3 Construct.

Team Athena Workgroup and Results Bucket - An Athena Workgroup for use by the team, with a dedicated S3 bucket for query results. Deployed using the Athena Workgroup L3 Construct.

SageMaker AI Studio Domain and User Profiles - SageMaker AI Studio Domain configured to use the Team Execution Role, with optional user-specific User Profiles. Deployed using the Studio Domain L3 Construct.

SageMaker Read/Write Team Managed Policies - IAM managed policies providing read, write, and guardrail access to SageMaker. Policies are automatically added to team execution and mutable team user roles; immutable roles (such as IAM Identity Center/SSO roles) require manual policy binding via SSO permission set.

data-science-team


  • SageMaker Studio — Deploy a standalone Studio domain when you need Studio without the full data science team environment
  • SageMaker Notebooks — Deploy standalone SageMaker notebook instances as an alternative to Studio
  • Data Lake — Deploy data lake buckets that the team can access via Lake Formation permissions
  • Athena Workgroup — Deploy standalone Athena workgroups; this module provisions team workgroups automatically
  • Roles — Create IAM roles that can be referenced as team user or data admin roles
  • Lake Formation Access Control — Manage fine-grained Lake Formation grants for team access to data lake resources

Security/Compliance Details

This module is designed in alignment with MDAA security/compliance principles and CDK nag rulesets. Additional review is recommended prior to production deployment, ensuring organization-specific compliance requirements are met.

  • Encryption at Rest:
    • Team S3 buckets encrypted with team-specific customer-managed KMS key
    • SageMaker resources encrypted with team KMS key
    • Athena query results encrypted with team KMS key
  • Encryption in Transit:
    • All SageMaker and Athena communications use TLS
  • Least Privilege:
    • Separate read, write, and guardrail managed policies for SageMaker
    • Team bucket access restricted to team execution role, data admin roles, and team user roles via bucket policy
    • Results bucket access limited to team execution role, team user roles, and data admin roles
    • Workgroup access granted via IAM managed policy to team execution role and mutable team user roles
  • Separation of Duties:
    • Guardrail policy enforces security parameters on SageMaker resource creation
    • Team execution role requires explicit SageMaker service trust
    • Lake Formation permissions control team access to data lake resources
  • Network Isolation:
    • SageMaker AI Studio domain is VPC-bound with configurable security group ingress/egress rules
    • Direct internet access disabled

AWS Service Endpoints

The following VPC endpoints may be required if public AWS service endpoint connectivity is unavailable (e.g., private subnets without NAT gateway, firewalled environments, or PrivateLink-only architectures):

AWS Service Endpoint Service Name Type
SageMaker API com.amazonaws.{region}.sagemaker.api Interface
SageMaker Runtime com.amazonaws.{region}.sagemaker.runtime Interface
SageMaker Studio com.amazonaws.{region}.studio Interface
Athena com.amazonaws.{region}.athena Interface
Glue com.amazonaws.{region}.glue Interface
KMS com.amazonaws.{region}.kms Interface
S3 com.amazonaws.{region}.s3 Gateway
CloudWatch Logs com.amazonaws.{region}.logs Interface
STS com.amazonaws.{region}.sts Interface
SSM Parameter Store com.amazonaws.{region}.ssm Interface
Lake Formation com.amazonaws.{region}.lakeformation Interface
EFS com.amazonaws.{region}.elasticfilesystem Interface

Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

datascience-team: # Module Name can be customized
  module_path: '@aws-mdaa/datascience-team' # Must match module NPM package name
  module_configs:
    - ./datascience-team.yaml # Filename/path can be customized

Module Config Samples and Variants

Copy the contents of the relevant sample config below into the ./datascience-team.yaml file referenced in the MDAA config snippet above.

Minimal Configuration

Deploys a SageMaker Studio domain with IAM auth, team S3 data lake, and Athena workgroup. Start here for a quick team environment with sensible defaults before customizing user profiles or lifecycle configs.

sample-config-minimal.yaml

# Contents available via above link
# Minimal config for the Data Science Team module.
# Deploys a SageMaker Studio domain with IAM auth, team S3 data
# lake, and Athena workgroup.

team:
  # See CONFIGURATION.md for role reference options (name, arn, id).
  # Admin roles granted access to team resources including KMS keys,
  # S3 buckets, and SageMaker resources.
  dataAdminRoles:
    - name: Admin

  # Execution role for SageMaker workloads. Must have
  # sagemaker.amazonaws.com service trust.
  teamExecutionRole:
    name: team-execution-role

  # (Optional) SageMaker Studio domain — the module's primary
  # resource. Without this, only S3/Athena are deployed.
  studioDomainConfig:
    # Authentication mode (enum: IAM, SSO)
    authMode: IAM
    # VPC ID for Studio domain deployment
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/vpc/id
    vpcId: vpc-id
    # Subnet IDs for Studio user applications
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetIds:
      - subnet-id
    # Admin roles for domain management
    dataAdminRoles:
      - name: Admin
    # At least one user profile
    userProfiles:
      example-user-id:
        userRole:
          name: team-execution-role

Comprehensive Configuration

Provisions a SageMaker Studio domain with IAM auth, user profiles, lifecycle configs, custom images, S3 mini data lake with inventory, and team access controls for collaborative ML development. Use this as a reference when you need full control over user profiles, lifecycle scripts, storage policies, and team-level access controls.

sample-config-comprehensive.yaml

# Contents available via above link
# Sample config for the Data Science Team module.
# Provisions a SageMaker Studio domain with IAM auth, user profiles,
# lifecycle configs, custom images, S3 mini data lake with inventory,
# and team access controls for collaborative ML development.
#
# This is the comprehensive configuration demonstrating all compatible
# properties with IAM authentication mode. For SSO auth mode, see
# sample-config-sso.yaml.

# Complete data science team infrastructure configuration. Defines
# SageMaker Studio domain, S3 mini data lake, Athena workgroup,
# execution roles, and user profiles.
team:
  # See CONFIGURATION.md for role reference options (name, arn, id).
  # Admin roles granted access to team resources including KMS keys,
  # S3 buckets, and SageMaker resources. Roles can be referenced by
  # name, arn, or id.
  dataAdminRoles:
    - name: Admin

  # (Optional) Team member roles for accessing shared resources like
  # data lake, SageMaker Studio, and collaborative tools.
  teamUserRoles:
    - id: generated-role-id:data-scientist
    # Immutable roles (e.g. SSO roles) are provided access only to
    # the team bucket and KMS key via resource policies.
    - name: AWSReservedSSO_datascientist_abcdefg
      immutable: true

  # Execution role for SageMaker workloads including training jobs,
  # endpoints, and notebooks. Must have sagemaker.amazonaws.com
  # service trust with sts:AssumeRole and sts:SetSourceIdentity.
  teamExecutionRole:
    id: generated-role-id:team-execution-role

  # (Optional) Custom policy name prefix for portable naming across
  # accounts with SSO integration. When set, uses this prefix
  # instead of the naming module for policy names.
  verbatimPolicyNamePrefix: 'some-prefix'

  # (Optional) S3 inventory configurations for team data lake
  # bucket content analysis and governance.
  inventories:
    team-inventory:
      # S3 prefix to include in the inventory report
      prefix: 'data/'
      # (Optional) Destination bucket for inventory reports.
      # Defaults to the source bucket under /inventory prefix.
      destinationBucket: 'test-inventory-bucket'
      # (Optional) S3 prefix within the destination bucket for
      # inventory report storage
      destinationPrefix: 'inventory-reports/'
      # (Optional) AWS account ID owning the destination bucket
      # for cross-account inventory delivery
      destinationAccount: '{{context:account-2}}'
  # (Optional) SageMaker Studio domain configuration for the team's
  # collaborative ML development environment.
  studioDomainConfig:
    # Authentication mode (enum: IAM, SSO)
    authMode: IAM
    # VPC ID for Studio domain deployment
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/vpc/id
    vpcId: vpc-id
    # Subnet IDs for Studio user applications
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetIds:
      - subnet-id
    # (Optional) KMS key ARN for EFS encryption
    kmsKeyArn: 'arn:{{partition}}:kms:{{region}}:{{account}}:key/test-efs-key'
    # (Optional) Memory limit in MB for lifecycle asset deployment
    # Lambda
    assetDeploymentMemoryLimitMB: 512
    # (Optional) S3 prefix for lifecycle asset storage
    assetPrefix: 'lifecycle-assets/'
    # (Optional) Default execution role for Studio applications
    defaultExecutionRole:
      id: generated-role-id:team-execution-role
    # (Optional) Admin roles for domain management
    dataAdminRoles:
      - arn: 'arn:{{partition}}:iam::{{account}}:role/DomainAdmin'
    # (Optional) Security group ingress rules
    securityGroupIngress:
      # (Optional) IPv4 CIDR block rules for security group traffic
      # control defining IP address-based access restrictions
      ipv4:
        # CIDR block specification for network access control
        - cidr: 10.0.0.0/24
          # (Optional) Description for the rule
          description: Allow HTTPS from internal network
          port: 443
          protocol: tcp
          # (Optional) Ending port number for port range rules
          toPort: 443
      # (Optional) Security group rules for cross-security group
      # traffic control
      sg:
        # Security group identifier for SG-based access control
        - sgId: ssm:/ml/sm/sg/id
          port: 443
          protocol: tcp
      # (Optional) Prefix list rules for security group traffic
      # control defining managed prefix list-based access
      # restrictions
      prefixList:
        - prefixList: pl-test-ingress
          description: Ingress from managed prefix list
          protocol: tcp
          port: 443
    # (Optional) Security group egress rules
    securityGroupEgress:
      prefixList:
        # Prefix list identifier for managed IP range access control
        - prefixList: pl-4ea54027
          description: prefix list for com.amazonaws.{{region}}.dynamodb
          protocol: tcp
          port: 443
        - prefixList: pl-7da54014
          description: prefix list for com.amazonaws.{{region}}.s3
          protocol: tcp
          port: 443
          # (Optional) Ending port number for port range rules
          toPort: 443
      ipv4:
        - cidr: 0.0.0.0/0
          port: 443
          protocol: tcp
          # (Optional) Description for the rule
          description: Allow outbound HTTPS
      sg:
        - sgId: ssm:/ml/sm/sg/id
          port: 443
          protocol: tcp

    # (Optional) S3 prefix for shared notebook storage
    notebookSharingPrefix: notebooks

    # (Optional) Named user profiles for Studio domain. Key is the
    # user identifier: Session Name portion of aws:userid (IAM mode).
    userProfiles:
      example-user-id:
        # (Optional) Required for IAM AuthMode. The role from which
        # the user will launch the user profile in Studio.
        userRole:
          id: generated-role-id:data-scientist

    # (Optional) Default user settings for Studio applications
    defaultUserSettings:
      # (Optional) The kernel gateway app settings
      kernelGatewayAppSettings:
        # (Optional) A list of custom SageMaker images configured
        # to run as a KernelGateway app
        customImages:
          # The name of the AppImageConfig
          - appImageConfigName: 'appImageConfigName'
            # The name of the CustomImage
            imageName: 'imageName'
            # (Optional) The version number of the CustomImage
            imageVersionNumber: 1
        # (Optional) The default instance type and SageMaker image
        # ARN used by the KernelGateway app
        defaultResourceSpec:
          # (Optional) The instance type
          instanceType: 'ml.t3.medium'
          # (Optional) The ARN of the SageMaker image
          sageMakerImageArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image/test-image'
          # (Optional) The ARN of the image version
          sageMakerImageVersionArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image-version/test-image/1'
          # (Optional) The ARN of the Lifecycle Configuration
          lifecycleConfigArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-lcc'
        # (Optional) The ARN of the Lifecycle Configurations
        # attached to the user profile or domain
        lifecycleConfigArns:
          - 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-kernel-lcc'
      # (Optional) The JupyterLab app settings
      jupyterLabAppSettings:
        # (Optional) Indicates whether idle shutdown is activated
        # for JupyterLab applications
        appLifecycleManagement:
          # (Optional) Settings related to idle shutdown of Studio
          # applications
          idleSettings:
            # (Optional) The time that SageMaker waits after the
            # application becomes idle before shutting it down
            idleTimeoutInMinutes: 60
            # (Optional) Indicates whether idle shutdown is
            # activated for the application type
            lifecycleManagement: ENABLED
            # (Optional) The maximum value in minutes that custom
            # idle shutdown can be set to by the user
            maxIdleTimeoutInMinutes: 120
            # (Optional) The minimum value in minutes that custom
            # idle shutdown can be set to by the user
            minIdleTimeoutInMinutes: 30
        # (Optional) The lifecycle configuration that runs before
        # the default lifecycle configuration
        builtInLifecycleConfigArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-builtin-lcc'
        # (Optional) A list of Git repositories that SageMaker
        # automatically displays to users for cloning
        codeRepositories:
          # The URL of the Git repository
          - repositoryUrl: 'https://github.com/example/repo.git'
        # (Optional) A list of custom SageMaker images configured
        # to run as a JupyterLab app
        customImages:
          - appImageConfigName: 'jupyterLabAppImageConfig'
            imageName: 'jupyterLabImage'
            # (Optional) The version number of the CustomImage
            imageVersionNumber: 1
        # (Optional) The default instance type and SageMaker image
        # ARN used by the JupyterLab app
        defaultResourceSpec:
          # (Optional) The instance type
          instanceType: 'ml.t3.medium'
          # (Optional) The ARN of the SageMaker image
          sageMakerImageArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image/test-jupyterlab-image'
          # (Optional) The ARN of the image version
          sageMakerImageVersionArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image-version/test-jupyterlab-image/1'
          # (Optional) The ARN of the Lifecycle Configuration
          lifecycleConfigArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-jupyterlab-default-lcc'
        # (Optional) The ARN of the lifecycle configurations
        # attached to the user profile or domain
        lifecycleConfigArns:
          - 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-jupyterlab-lcc'
      # (Optional) The Jupyter server's app settings
      jupyterServerAppSettings:
        # (Optional) The default instance type and SageMaker image
        # ARN used by the JupyterServer app
        defaultResourceSpec:
          # (Optional) The instance type (JupyterServer apps only
          # support the 'system' value)
          instanceType: 'system'
          # (Optional) The ARN of the SageMaker image
          sageMakerImageArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image/test-jupyter-server-image'
          # (Optional) The ARN of the image version
          sageMakerImageVersionArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image-version/test-jupyter-server-image/1'
          # (Optional) The ARN of the Lifecycle Configuration
          lifecycleConfigArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-jupyter-server-default-lcc'
        # (Optional) The ARN of the Lifecycle Configurations
        # attached to the JupyterServerApp
        lifecycleConfigArns:
          - 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-jupyter-server-lcc'
      # (Optional) A collection of settings that configure the
      # RSessionGateway app
      rSessionAppSettings:
        # (Optional) A list of custom SageMaker images configured
        # to run as a RSession app
        customImages:
          - appImageConfigName: 'rSessionAppImageConfig'
            imageName: 'rSessionImage'
            # (Optional) The version number of the CustomImage
            imageVersionNumber: 1
        # (Optional) Specifies the ARNs of a SageMaker image and
        # image version, and the instance type
        defaultResourceSpec:
          # (Optional) The instance type
          instanceType: 'ml.t3.medium'
          # (Optional) The ARN of the SageMaker image
          sageMakerImageArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image/test-rsession-image'
          # (Optional) The ARN of the image version
          sageMakerImageVersionArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:image-version/test-rsession-image/1'
          # (Optional) The ARN of the Lifecycle Configuration
          lifecycleConfigArn: 'arn:{{partition}}:sagemaker:{{region}}:{{account}}:studio-lifecycle-config/test-rsession-default-lcc'
      # (Optional) A collection of settings that configure user
      # interaction with the RStudioServerPro app
      rStudioServerProAppSettings:
        # (Optional) Indicates whether the current user has access
        # to the RStudioServerPro app
        accessStatus: 'ENABLED'
        # (Optional) The level of permissions that the user has
        # within the RStudioServerPro app (default: User)
        userGroup: 'R_STUDIO_ADMIN'
      # (Optional) The security groups for the VPC that Studio uses
      # for communication
      securityGroups:
        - 'sg-test-default-user-sg'
      # (Optional) Specifies options for sharing SageMaker Studio
      # notebooks
      sharingSettings:
        # (Optional) Whether to include the notebook cell output
        # when sharing the notebook (default: Disabled)
        notebookOutputOption: 'Allowed'
        # (Optional) KMS encryption key ID used to encrypt the
        # notebook cell output in S3
        s3KmsKeyId: 'arn:{{partition}}:kms:{{region}}:{{account}}:key/test-sharing-key'
        # (Optional) The S3 bucket used to store the shared
        # notebook snapshots
        s3OutputPath: 's3://test-sharing-bucket/notebooks/'
      # (Optional) Studio web portal state (enum: ENABLED, DISABLED)
      studioWebPortal: ENABLED

    # (Optional) Lifecycle configurations for Studio apps
    lifecycleConfigs:
      # (Optional) Lifecycle config for the main Jupyter App. Runs
      # each time the main Jupyter app container is launched.
      jupyter:
        # (Optional) Assets staged in S3, then copied to SageMaker
        # container before lifecycle commands run. Available under
        # $ASSETS_DIR/<asset_name>/
        assets:
          testing:
            # Local file or directory path to deploy
            sourcePath: ./testing_asset_dir
            # (Optional) Glob patterns to exclude from asset
            # packaging
            exclude:
              - '*.pyc'
              - '__pycache__'
        # Lifecycle commands to execute
        cmds:
          - echo "testing jupyter"
          - sh $ASSETS_DIR/testing/test.sh

      # (Optional) Kernel gateway app lifecycle config. Runs each
      # time a kernel gateway container is launched.
      kernel:
        assets:
          testing:
            sourcePath: ./testing_asset_dir
        cmds:
          - echo "testing kernel"
          - sh $ASSETS_DIR/testing/test.sh

      # (Optional) JupyterLab lifecycle script (Studio Latest).
      # Runs each time a JupyterLab app container is launched.
      jupyterLab:
        assets:
          testing:
            sourcePath: ./testing_asset_dir
        cmds:
          - echo "testing jupyterLab"
          - sh $ASSETS_DIR/testing/test.sh

SSO Authentication Configuration

Demonstrates SSO auth mode with an existing security group and existing domain bucket. Choose this variant when your organization uses AWS IAM Identity Center (SSO) for federated user access to SageMaker Studio.

sample-config-sso.yaml

# Contents available via above link
# Sample config for the Data Science Team module with SSO authentication.
# Demonstrates SSO auth mode with an existing security group and existing
# domain bucket. Use this approach when integrating with AWS IAM Identity
# Center (SSO) for federated user access to SageMaker Studio.
#
# For IAM auth mode with full property coverage, see sample-config-comprehensive.yaml.

# Complete data science team infrastructure configuration with SSO auth.
team:
  # See CONFIGURATION.md for role reference options (name, arn, id).
  # Admin roles granted access to team resources including KMS keys,
  # S3 buckets, and SageMaker resources.
  dataAdminRoles:
    - arn: 'arn:{{partition}}:iam::{{account}}:role/Admin'

  # (Optional) Team member roles for accessing shared resources.
  teamUserRoles:
    - name: AWSReservedSSO_datascientist_abcdefg
      # (Optional) Flag indicating the role should be resolved as
      # an AWS SSO auto-generated role
      sso: true

  # Execution role for SageMaker workloads including training jobs,
  # endpoints, and notebooks.
  teamExecutionRole:
    name: team-execution-role

  # (Optional) SageMaker Studio domain configuration with SSO auth
  # and existing security group / domain bucket.
  studioDomainConfig:
    # Authentication mode (enum: IAM, SSO)
    authMode: SSO
    # VPC ID for Studio domain deployment
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/vpc/id
    vpcId: vpc-id
    # Subnet IDs for Studio user applications
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetIds:
      - subnet-id
    # (Optional) Existing security group ID. Use this instead of
    # securityGroupIngress/securityGroupEgress when you have a
    # pre-existing security group to attach to the Studio domain.
    securityGroupId: sg-existing-sg-id
    # (Optional) Domain bucket configuration for shared storage.
    # Use this when you have a pre-existing S3 bucket for the
    # domain instead of letting the module create one.
    domainBucket:
      # S3 bucket name to use as the domain bucket
      domainBucketName: 'existing-domain-bucket'
      # Role used to deploy lifecycle assets. Must be assumable by
      # Lambda with write access to the domain bucket.
      assetDeploymentRole:
        name: asset-deployment-role
    # (Optional) Named user profiles for Studio domain. Key is the
    # user identifier: SSO User ID (SSO mode).
    userProfiles:
      sso-user-id:
        # (Optional) The role from which the user will launch the
        # user profile in Studio.
        userRole:
          name: AWSReservedSSO_datascientist_abcdefg
          sso: true
    # (Optional) Default user settings for Studio applications
    defaultUserSettings:
      # (Optional) Studio web portal state (enum: ENABLED, DISABLED)
      studioWebPortal: DISABLED

Config Schema Docs