DataSync

Note: This documentation is also available in a rendered format here.

Deploys AWS DataSync agents, storage locations (S3, NFS, SMB, Object Storage), and transfer tasks for automated data movement between on-premises and AWS storage services, or between AWS storage services. Common scenarios include migrating large datasets from on-premises NFS or SMB shares to S3, synchronizing data between AWS regions, or scheduling recurring transfers from network-attached storage into your data lake.


Deployed Resources

This module deploys and integrates the following resources:

  • DataSync Agent Activation: Registers agents with your AWS account. Agents read/write data at on-premises locations. Deploy multiple agents in different AZs/subnets for resiliency. Agents must be deployed before activation — refer to AWS DataSync agent requirements.
  • DataSync Locations: Endpoints for tasks. Supports S3, NFS, SMB, and Object Storage (cloud-based) location types. Locations requiring credentials (SMB, Object Storage) must have credentials pre-stored in Secrets Manager.
  • DataSync Tasks: Configurations for data transfer and synchronization between two locations, with scheduling, filtering, and transfer options.
  • EC2 Security Group: Security group for DataSync agent-to-service data transfer.
  • KMS Encryption Key: Encrypts DataSync execution logs.
  • CloudWatch Log Group: Task execution logging.

DataSync Deployment Architecture

[Figure: DataSync deployment architecture]


  • Data Lake — DataSync can transfer data to and from data lake S3 buckets
  • SFTP Server — Deploy an SFTP server as an alternative ingestion method for data transfer
  • Roles — Create IAM roles for DataSync S3 location access

Security/Compliance Details

This module is designed in alignment with MDAA security/compliance principles and CDK nag rulesets. Additional review is recommended prior to production deployment, ensuring organization-specific compliance requirements are met.

  • Encryption at Rest:
    • DataSync task execution logs encrypted with customer-managed KMS key
    • S3 locations use IAM role-based access with bucket encryption
  • Encryption in Transit:
    • DataSync transfers data over TLS
  • Least Privilege:
    • S3 locations use dedicated IAM roles with scoped bucket access
    • SMB and Object Storage credentials stored in Secrets Manager
    • Agent activation keys are time-limited (30 minutes)
  • Network Isolation:
    • Agents connect via VPC endpoints (PrivateLink)
    • Security group controls ENI traffic for data transfer (port 443) and control traffic (TCP 1024-1064)
    • No public internet access required

AWS Service Endpoints

The following VPC endpoints may be required if public AWS service endpoint connectivity is unavailable (e.g., private subnets without NAT gateway, firewalled environments, or PrivateLink-only architectures):

AWS Service        Endpoint Service Name                    Type
DataSync           com.amazonaws.{region}.datasync          Interface
KMS                com.amazonaws.{region}.kms               Interface
S3                 com.amazonaws.{region}.s3                Gateway
CloudWatch Logs    com.amazonaws.{region}.logs              Interface
Secrets Manager    com.amazonaws.{region}.secretsmanager    Interface
STS                com.amazonaws.{region}.sts               Interface
EC2                com.amazonaws.{region}.ec2               Interface
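
The naming pattern in the table can be expanded for a concrete region with a few lines of Python. This is only an illustrative sketch: the `ENDPOINT_TYPES` mapping and the `endpoint_service_names` helper are hypothetical names, and the pattern is taken directly from the table rather than from any partition-specific lookup.

```python
# Endpoint types exactly as listed in the table above.
ENDPOINT_TYPES = {
    "datasync": "Interface",
    "kms": "Interface",
    "s3": "Gateway",
    "logs": "Interface",
    "secretsmanager": "Interface",
    "sts": "Interface",
    "ec2": "Interface",
}


def endpoint_service_names(region: str) -> dict[str, tuple[str, str]]:
    """Map each service suffix to (endpoint service name, endpoint type)."""
    return {
        service: (f"com.amazonaws.{region}.{service}", endpoint_type)
        for service, endpoint_type in ENDPOINT_TYPES.items()
    }


if __name__ == "__main__":
    # Print one line per endpoint for an example region.
    for name, kind in endpoint_service_names("us-east-1").values():
        print(f"{kind:<10}{name}")
```

The output can serve as a checklist when comparing against the VPC endpoints that already exist in the target VPC.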

Prerequisite and Pre-deployment Tasks

Prerequisite

  • A VPC endpoint for the DataSync service. The security group of the VPC endpoint must allow control traffic from the DataSync agent on TCP ports 1024-1064. Refer to Network requirements for VPC endpoints for detailed network requirements.
  • A security group for DataSync tasks. While a task is running, DataSync agents transfer data to the DataSync service through ENIs over TLS on port 443. The security group must allow inbound TCP traffic on port 443 from the agent hosts.
  • For SMB and cloud-based object storage location types, a secret in Secrets Manager that stores the credentials. The secret value must use the format below:
    • For SMB location: {"user":"<username>","password":"<pwd>"}
    • For cloud-based object storage: {"accessKey":"<access_key>","secretKey":"<secret_key>"}
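
To avoid quoting mistakes in the secret value, the string can be generated rather than typed by hand. The sketch below assumes the secret is stored as a JSON object with the key names listed above. The helper names are hypothetical and are not part of MDAA.

```python
import json


def smb_secret_string(user: str, password: str) -> str:
    """Build the secret value for an SMB location (keys: user, password)."""
    return json.dumps({"user": user, "password": password})


def object_storage_secret_string(access_key: str, secret_key: str) -> str:
    """Build the secret value for an object storage location
    (keys: accessKey, secretKey)."""
    return json.dumps({"accessKey": access_key, "secretKey": secret_key})


if __name__ == "__main__":
    # Placeholder values only; never hard-code real credentials.
    print(smb_secret_string("<username>", "<pwd>"))
    print(object_storage_secret_string("<access_key>", "<secret_key>"))
```

Using `json.dumps` also escapes any quotes or backslashes that appear in a password.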

Note: If you want MDAA to handle the above security group requirement, two-stage deployment is required.

  1. Fill in the connection: section. Add the agents: configuration, but do not specify the activationKey: parameter for any agent (refer to the example for agent1: further below).
  2. Run the first-pass MDAA deployment. MDAA deploys the security group and the required ingress rules.
  3. Retrieve the agent activation key(s) and add them to the agents: configuration, one for each agent.
  4. Run the second-pass MDAA deployment. MDAA registers the agent(s) and deploys the other DataSync resources.
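
Before the second pass, it can help to confirm that every agent now has an activation key. The following is a minimal sketch that assumes the module configuration has already been parsed into a Python dictionary shaped like the samples further below. The function name is hypothetical, and MDAA itself may or may not perform a similar check.

```python
def agents_missing_activation_key(config: dict) -> list[str]:
    """Return the names of agents that have no activationKey set."""
    agents = config.get("agents") or {}
    return sorted(
        name
        for name, agent in agents.items()
        if not (agent or {}).get("activationKey")
    )


if __name__ == "__main__":
    # Illustrative configuration: agent1 is still in its first-pass state.
    example = {
        "agents": {
            "agent1": {"subnetId": "subnet-a", "agentIpAddress": "1.1.1.1"},
            "agent2": {
                "activationKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
                "subnetId": "subnet-b",
                "agentIpAddress": "1.1.1.2",
            },
        }
    }
    print(agents_missing_activation_key(example))
```

An empty result means every agent in the configuration is ready to be registered in the second pass.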

Pre-deployment Tasks

This process must be completed prior to DataSync deployment using MDAA.

[Figure: Pre-deployment task]

  1. Deploy the DataSync agent on the platform of your choice. You may deploy it on EC2 using the DataSync AMI, or on another hypervisor platform. Refer to Deploy your AWS DataSync agent for detailed guidelines.

  2. Gather the information needed to retrieve the agent activation key in the next step:

    • The IP address of the DataSync agent host (deployed in step 1)
    • The IP address of the VPC endpoint for the DataSync service

  3. Retrieve the agent activation key from a host or workstation with connectivity to the DataSync agent on port 80. The activation key can be retrieved using the CLI or the AWS Management Console.

    • Using CLI: curl "http://<agent-ip-address>/?gatewayType=SYNC&activationRegion=<aws-region>&privateLinkEndpoint=<IP address from the same subnet/AZ of VPC endpoint>&endpointType=PRIVATE_LINK&no_redirect"

      Refer to step 4 of Creating an AWS DataSync agent with the AWS CLI for more information.

  4. Put the activation key retrieved in the previous step into the activationKey parameter of the DataSync module configuration file. Activation keys expire after 30 minutes, so run the deployment promptly.
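
The activation URL used in step 3 can also be assembled programmatically, which helps when several agents need keys. This sketch only builds the URL string with the same query parameters as the curl command above. The function name and the example IP addresses are hypothetical.

```python
from urllib.parse import urlencode


def activation_url(agent_ip: str, region: str, private_link_ip: str) -> str:
    """Build the activation-key URL for an agent behind a PrivateLink endpoint."""
    query = urlencode(
        {
            "gatewayType": "SYNC",
            "activationRegion": region,
            "privateLinkEndpoint": private_link_ip,
            "endpointType": "PRIVATE_LINK",
        }
    )
    # no_redirect is a bare flag without a value, so it is appended manually.
    return f"http://{agent_ip}/?{query}&no_redirect"


if __name__ == "__main__":
    # Example addresses are placeholders for the values gathered in step 2.
    print(activation_url("10.0.0.10", "us-east-1", "10.0.1.25"))
```

The resulting URL can be passed to curl from a host that can reach the agent on port 80, as described in step 3.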


Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

datasync: # Module Name can be customized
  module_path: '@aws-mdaa/datasync' # Must match module NPM package name
  module_configs:
    - ./datasync.yaml # Filename/path can be customized

Module Config Samples and Variants

Copy the contents of the relevant sample config below into the ./datasync.yaml file referenced in the MDAA config snippet above.

Minimal Configuration

Deploys VPC networking, an agent, two S3 locations, and a transfer task between them. Start here for a basic S3-to-S3 data transfer setup with a single agent.

sample-config-minimal.yaml

# Contents available via above link
# Minimal DataSync module configuration.
# Deploys VPC networking, an agent, two S3 locations, and a
# transfer task between them.

# (Optional) VPC configuration for DataSync agent deployment.
vpc:
  # ID of the VPC for DataSync deployment
  # Often created by your VPC/networking stack.
  # Example SSM: ssm:/path/to/vpc/id
  vpcId: vpc-009ce5ec1cff75fx6
  # CIDR block of the VPC for security group rules
  vpcCidrBlock: 10.0.0.0/8

# (Optional) Map of agent names to DataSync agent configurations.
agents:
  agent1:
    # Subnet ID for data transfer ENIs
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetId: subnet-0c27f330c0ea98xx5
    # IP address of the DataSync agent host
    agentIpAddress: 1.1.1.1

# (Optional) DataSync locations organized by storage protocol type.
locations:
  s3:
    source-location:
      # S3 bucket ARN
      # Often created by the Data Lake module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<datalake_module_name>/bucket/<zone_name>/arn
      s3BucketArn: arn:{{partition}}:s3:::source-bucket
      # IAM role ARN for DataSync S3 access
      # Often created by the Roles module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<roles_module_name>/role/<role_name>/arn
      bucketAccessRoleArn: arn:{{partition}}:iam::{{account}}:role/datasync-s3-role
    destination-location:
      # S3 bucket ARN
      # Often created by the Data Lake module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<datalake_module_name>/bucket/<zone_name>/arn
      s3BucketArn: arn:{{partition}}:s3:::destination-bucket
      # IAM role ARN for DataSync S3 access
      # Often created by the Roles module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<roles_module_name>/role/<role_name>/arn
      bucketAccessRoleArn: arn:{{partition}}:iam::{{account}}:role/datasync-s3-role

# (Optional) Map of task names to DataSync task configurations.
tasks:
  my-task:
    # Name of an MDAA-generated source location
    sourceLocationName: source-location
    # Name of an MDAA-generated destination location
    destinationLocationName: destination-location
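
Because tasks refer to locations by name, a typo in sourceLocationName or destinationLocationName is an easy mistake to make. The sketch below shows one way to cross-check those references before deployment. It assumes the configuration has been parsed into a Python dictionary with the same layout as the sample above. The function name is hypothetical, and this check is not claimed to be part of MDAA.

```python
def undefined_task_locations(config: dict) -> list[tuple[str, str, str]]:
    """Return (task, key, name) for every location name a task references
    that is not defined under locations."""
    defined: set[str] = set()
    # Location names live one level below the protocol type (s3, smb, ...).
    for by_name in (config.get("locations") or {}).values():
        defined.update(by_name or {})

    problems = []
    for task_name, task in (config.get("tasks") or {}).items():
        for key in ("sourceLocationName", "destinationLocationName"):
            ref = (task or {}).get(key)
            if ref is not None and ref not in defined:
                problems.append((task_name, key, ref))
    return problems


if __name__ == "__main__":
    example = {
        "locations": {"s3": {"source-location": {}, "destination-location": {}}},
        "tasks": {
            "my-task": {
                "sourceLocationName": "source-location",
                "destinationLocationName": "destination-location",
            }
        },
    }
    print(undefined_task_locations(example))
```

Tasks that use sourceLocationArn or destinationLocationArn instead of names are ignored by this check, since those refer to locations managed outside the module.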

Comprehensive Configuration

Transfers data between on-premises storage and AWS using DataSync agents, locations (S3, SMB, NFS, object storage), and tasks with scheduling, filtering, and transfer options. Start here when evaluating all available options for location types, multi-agent resiliency, scheduling, and transfer filtering.

sample-config-comprehensive.yaml

# Contents available via above link
# DataSync module configuration.
# Transfers data between on-premises storage and AWS using DataSync
# agents, locations (S3, SMB, NFS, object storage), and tasks with
# scheduling, filtering, and transfer options.

# (Optional) VPC configuration for DataSync agent deployment. MDAA
# creates a security group and VPC endpoint for the DataSync service.
vpc:
  # ID of the VPC for DataSync deployment
  # Often created by your VPC/networking stack.
  # Example SSM: ssm:/path/to/vpc/id
  vpcId: vpc-009ce5ec1cff75fx6
  # CIDR block of the VPC for security group rules
  vpcCidrBlock: 10.0.0.0/8

# (Optional) Map of agent names to DataSync agent configurations.
# Agents must be deployed externally before activation.
agents:
  # Agent without activation key (first-pass deployment creates
  # VPC endpoint and security group only)
  agent1:
    # Subnet ID for data transfer ENIs
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetId: subnet-0c27f330c0ea98xx5
    # IP address of the DataSync agent host
    agentIpAddress: 1.1.1.1
  # Agent with activation key (second-pass deployment registers
  # the agent)
  agent2:
    # (Optional) Agent activation key (expires in 30 minutes)
    activationKey: XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetId: example-subnet
    agentIpAddress: 1.1.1.2
  # Agent with externally managed security group and VPC endpoint
  agent3:
    activationKey: XXXXX-YYYYY-XXXXX-YYYYY-XXXXX
    # Often created by your VPC/networking stack.
    # Example SSM: ssm:/path/to/subnet/id
    subnetId: '{{resolve:ssm:/path/to/subnet-id-ssm}}'
    # (Optional) Existing VPC endpoint ID (if omitted, MDAA creates
    # one)
    vpcEndpointId: '{{resolve:ssm:/path/to/vpce-ssm}}'
    agentIpAddress: 1.1.1.3
    # (Optional) Existing security group ID (if omitted, MDAA
    # creates one)
    securityGroupId: sg-123456

# (Optional) DataSync locations organized by storage protocol type.
locations:
  # S3 locations
  s3:
    s3location1:
      # S3 bucket ARN (or dynamic reference)
      # Often created by the Data Lake module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<datalake_module_name>/bucket/<zone_name>/arn
      s3BucketArn: arn:{{partition}}:s3:::example-bucket-name1
      # IAM role ARN for DataSync S3 access
      # Often created by the Roles module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<roles_module_name>/role/<role_name>/arn
      bucketAccessRoleArn: '{{resolve:ssm:/path/to/role-arn-ssm}}'
      # (Optional) Subdirectory prefix within the bucket
      subdirectory: /some/prefix
    s3location2:
      # Often created by the Data Lake module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<datalake_module_name>/bucket/<zone_name>/arn
      s3BucketArn: arn:{{partition}}:s3:::example-bucket-name2
      subdirectory: /some/prefix
      # Often created by the Roles module.
      # Example SSM: ssm:/{{org}}/{{domain}}/<roles_module_name>/role/<role_name>/arn
      bucketAccessRoleArn: some-role-arn
      # (Optional) S3 storage class for transferred files
      # (enum: DEEP_ARCHIVE, GLACIER, INTELLIGENT_TIERING,
      # ONEZONE_IA, OUTPOSTS, STANDARD, STANDARD_IA)
      s3StorageClass: INTELLIGENT_TIERING

  # SMB locations
  smb:
    smb-loc1:
      # Names of MDAA-generated agents (mutually exclusive with
      # agentArns)
      agentNames:
        - agent2
      # Secrets Manager secret name storing SMB credentials
      # ({user, password})
      secretName: some-secret-name
      # (Optional) Active Directory domain name
      domain: some-ad-domain-name
      # SMB server hostname or IP address
      serverHostname: some.smbserver.hostname
      # SMB share subdirectory path
      subdirectory: /some/subdirectory
      # (Optional) SMB protocol version
      # (enum: AUTOMATIC, SMB2, SMB3; default: AUTOMATIC)
      smbVersion: AUTOMATIC
    smb-loc2:
      # ARNs of externally registered DataSync agents (mutually
      # exclusive with agentNames)
      agentArns:
        - arn:{{partition}}:datasync:{{region}}:{{account}}:agent/existing-agent-id
      secretName: some-secret-name
      domain: some-ad-domain-name
      serverHostname: some.smbserver.hostname
      subdirectory: /some/subdirectory

  # NFS locations
  nfs:
    nfs_loc1:
      agentNames:
        - agent2
        - agent3
      # NFS server hostname or IP address
      serverHostname: some.nfsserver.hostname
      # NFS export path
      subdirectory: /some/subdirectory
      # (Optional) NFS protocol version
      # (AUTOMATIC, NFS3, NFSv4_0, NFSv4_1; default: AUTOMATIC)
      nfsVersion: NFSv4_0
    nfs_loc2:
      # ARNs of externally registered agents
      agentArns:
        - arn:{{partition}}:datasync:{{region}}:{{account}}:agent/existing-agent-id
      serverHostname: another.nfsserver.hostname
      subdirectory: /another/subdirectory

  # Object storage locations (e.g. Google Cloud Storage)
  objectStorage:
    gcp1:
      agentNames:
        - agent2
      # Object storage bucket name
      bucketName: some-object-storage-name
      # Object storage server endpoint
      serverHostname: some-object-storage.endpoint.hostname
      # Secrets Manager secret storing credentials
      # ({accessKey, secretKey})
      secretName: some-secret-name
      # (Optional) Server port (default: 443)
      serverPort: 443
      # (Optional) Subdirectory prefix
      subdirectory: /some/prefix
      # (Optional) Server protocol (default: HTTPS)
      serverProtocol: HTTPS
    gcp2:
      agentArns:
        - arn:{{partition}}:datasync:{{region}}:{{account}}:agent/existing-agent-id
      bucketName: another-object-storage
      serverHostname: another.endpoint.hostname
      secretName: another-secret-name

# (Optional) Map of task names to DataSync task configurations.
tasks:
  # Task using MDAA-generated location names with scheduling and
  # include filters
  mytask1:
    # Name of an MDAA-generated source location
    sourceLocationName: s3location1
    # Name of an MDAA-generated destination location
    destinationLocationName: s3location2
    # (Optional) Schedule for periodic execution
    schedule:
      # Cron or rate expression
      scheduleExpression: cron(0 * * * ? *)
      # (Optional) Enable or disable the schedule
      status: ENABLED
    # (Optional) Transfer options
    options:
      # (Optional) Preserve or ignore file access time
      # (BEST_EFFORT, NONE; default: BEST_EFFORT)
      atime: BEST_EFFORT
      # (Optional) Bandwidth limit in bytes per second
      bytesPerSecond: 1048576
      # (Optional) Group ID handling
      # (INT_VALUE, NAME, NONE; default: INT_VALUE)
      gid: INT_VALUE
      # (Optional) CloudWatch log level
      # (BASIC, TRANSFER, OFF)
      logLevel: TRANSFER
      # (Optional) Preserve or ignore file modification time
      # (PRESERVE, NONE; default: PRESERVE)
      mtime: PRESERVE
      # (Optional) Object tag handling
      # (PRESERVE, NONE; default: PRESERVE)
      objectTags: PRESERVE
      # (Optional) Overwrite behavior at destination
      # (ALWAYS, NEVER; default: ALWAYS)
      overwriteMode: ALWAYS
      # (Optional) POSIX permission handling
      # (PRESERVE, NONE; default: PRESERVE)
      posixPermissions: PRESERVE
      # (Optional) Preserve deleted files at destination
      # (PRESERVE, REMOVE; default: PRESERVE)
      preserveDeletedFiles: PRESERVE
      # (Optional) Block/character device metadata handling
      # (NONE, PRESERVE; default: NONE)
      preserveDevices: NONE
      # (Optional) SMB security descriptor copy flags
      # (OWNER_DACL, OWNER_DACL_SACL, NONE; default: OWNER_DACL)
      securityDescriptorCopyFlags: OWNER_DACL
      # (Optional) Task queueing behavior (ENABLED, DISABLED;
      # default: ENABLED)
      taskQueueing: ENABLED
      # (Optional) Transfer mode (CHANGED, ALL)
      transferMode: CHANGED
      # (Optional) User ID handling
      # (INT_VALUE, NAME, NONE; default: INT_VALUE)
      uid: INT_VALUE
      # (Optional) Data integrity verification mode
      # (ONLY_FILES_TRANSFERRED, POINT_IN_TIME_CONSISTENT, NONE)
      verifyMode: ONLY_FILES_TRANSFERRED
    # (Optional) Include filter rules (one member max)
    includes:
      - # Filter type (SIMPLE_PATTERN)
        filterType: SIMPLE_PATTERN
        # Pipe-delimited patterns (must begin with /)
        value: '/data*|/ingestion*'
  # Task using external location ARNs with exclude filters
  mytask2:
    # ARN of an existing source location
    sourceLocationArn: '{{resolve:ssm:/path/to/source/location/arn}}'
    # ARN of an existing destination location
    destinationLocationArn: '{{resolve:ssm:/path/to/destination/location/arn}}'
    # (Optional) KMS key ARN for CloudWatch log group encryption
    # (if omitted, MDAA creates a new KMS key)
    logGroupEncryptionKeyArn: arn:{{partition}}:kms:{{region}}:{{account}}:key/test-log-key
    options:
      transferMode: CHANGED
      verifyMode: ONLY_FILES_TRANSFERRED
    # (Optional) Exclude filter rules (one member max)
    excludes:
      - filterType: SIMPLE_PATTERN
        # Pipe-delimited patterns for exclusion
        value: '*.tmp|*.temp'
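
The include filter in mytask1 packs several patterns into one pipe-delimited value, and the sample comment notes that include patterns must begin with /. A small helper can assemble that value and catch malformed patterns early. This is a sketch under those stated assumptions, and the function name is hypothetical.

```python
def include_filter(patterns: list[str]) -> dict[str, str]:
    """Build a single SIMPLE_PATTERN include filter from a list of patterns."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    for pattern in patterns:
        if not pattern.startswith("/"):
            raise ValueError(f"include pattern must begin with '/': {pattern!r}")
        if "|" in pattern:
            raise ValueError(f"pattern must not contain '|': {pattern!r}")
    # All patterns share one filter entry, joined with the pipe delimiter.
    return {"filterType": "SIMPLE_PATTERN", "value": "|".join(patterns)}


if __name__ == "__main__":
    print(include_filter(["/data*", "/ingestion*"]))
```

The returned dictionary corresponds to the single list member allowed under includes: in the task configuration.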

Config Schema Docs