
Lambda CSV Parquet Transformer

This blueprint illustrates how to use an EventBridge-triggered DataOps Lambda function to transform small CSV files into Parquet as they are uploaded to an S3 data lake.

This blueprint may be suitable when:

  • Small- to medium-sized CSV files are regularly uploaded to the data lake and need to be quickly transformed into Parquet, perhaps into a standardized zone of the lake.

While the blueprint doesn't handle partitioning or additional transformation out of the box, the Lambda function can be easily extended to provide these capabilities.
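As an illustration of such an extension, a handler along these lines could parse the EventBridge event and use the AWS SDK for pandas (`awswrangler`, provided by the AWSSDKPandas Lambda layer) to rewrite the object. This is a sketch, not the blueprint's actual source: the `dest_key` helper and the return value are illustrative.

```python
import os
import posixpath

def dest_key(src_key: str, dest_prefix: str) -> str:
    """Map a source CSV object key to a destination Parquet key.

    Keeps the base file name, swaps the extension, and places the
    object under the configured destination prefix.
    """
    stem, _ = posixpath.splitext(posixpath.basename(src_key))
    return posixpath.join(dest_prefix, stem + ".parquet")

def lambda_handler(event, context):
    # EventBridge "Object Created" events carry the bucket and key in event["detail"]
    src_bucket = event["detail"]["bucket"]["name"]
    src_key = event["detail"]["object"]["key"]

    # awswrangler is supplied by the AWSSDKPandas layer; imported lazily so the
    # pure key-mapping logic above stays testable without the layer installed.
    import awswrangler as wr

    df = wr.s3.read_csv(f"s3://{src_bucket}/{src_key}")
    # Extension point: add transformations here (derived columns, type casts),
    # or pass partition_cols to to_parquet() for partitioned output.
    out_key = dest_key(src_key, os.environ["DEST_PREFIX"])
    out_path = f"s3://{os.environ['DEST_BUCKET_NAME']}/{out_key}"
    wr.s3.to_parquet(df=df, path=out_path)
    return {"transformed": out_path}
```

The destination key mapping is kept in a separate pure function so transformation logic can be unit tested without S3 access.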

Lambda CSV Parquet Transformer


Usage Instructions

The following instructions assume you have already deployed your Data Lake (possibly using MDAA). If already using MDAA, you can merge these sample blueprint configs into your existing mdaa.yaml.

  1. Deploy sample configurations into the specified directory structure (or obtain from the MDAA repo under sample_blueprints/lambda_csv_parquet).

  2. Edit the mdaa.yaml to specify an organization name to replace <unique-org-name>. This must be a globally unique name, as it is used in the naming of all deployed resources, some of which are globally named (such as S3 buckets).

  3. Edit the mdaa.yaml to specify a project name which is unique within your organization, replacing <your-project-name>.

  4. Edit the mdaa.yaml to specify appropriate context values for your environment.

  5. Optionally, edit lambda_csv_parquet/lambda_csv_parquet/src/lambda/lambda_csv_parquet/lambda_csv_parquet.py to handle additional transformation and partitioning.

  6. Ensure you are authenticated to your target AWS account.

  7. Optionally, run <path_to_mdaa_repo>/bin/mdaa ls from the directory containing mdaa.yaml to understand what stacks will be deployed.

  8. Optionally, run <path_to_mdaa_repo>/bin/mdaa synth from the directory containing mdaa.yaml and review the produced templates.

  9. Run <path_to_mdaa_repo>/bin/mdaa deploy from the directory containing mdaa.yaml to deploy all modules.

  10. Before loading CSV files, you will need to grant the generated lambda-etl role access to your data lake bucket(s). Additionally, the source bucket must have EventBridge integration enabled.
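EventBridge integration on the source bucket can be enabled with the AWS CLI; the bucket name below is the same placeholder used in the configs and must be replaced with your actual bucket name.

```shell
# Enable EventBridge notifications on the source data lake bucket
aws s3api put-bucket-notification-configuration \
  --bucket <your-src-datalake-bucket-name> \
  --notification-configuration '{"EventBridgeConfiguration": {}}'
```

Note that this call replaces the bucket's entire notification configuration, so if the bucket already has SNS/SQS/Lambda notifications configured, include them in the JSON as well.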

Additional MDAA deployment commands/procedures can be reviewed in DEPLOYMENT.


Configurations

The sample configurations for this blueprint are provided below. They are also available under sample_blueprints/lambda_csv_parquet within the MDAA repo.

Config Directory Structure

lambda_csv_parquet
│   mdaa.yaml
│   tags.yaml
└───lambda_csv_parquet
        roles.yaml
        project.yaml
        lambda.yaml

mdaa.yaml

This configuration specifies the global, domain, env, and module configurations required to configure and deploy this sample architecture.

Note - Before deployment, populate the mdaa.yaml with appropriate organization and context values for your environment

# Contents available in mdaa.yaml
# All resources will be deployed to the default region specified in the environment or AWS configurations.
# Optionally, a specific AWS Region name can be specified.
region: default

# One or more tag files containing tags which will be applied to all deployed resources
tag_configs:
  - ./tags.yaml

## Pre-Deployment Instructions

# TODO: Set an appropriate, unique organization name, likely matching the org name used in other MDAA configs.
# Failure to do so may result in global naming conflicts.
organization: <unique-org-name>

# One or more domains may be specified. Domain name will be incorporated by default naming implementation
# to prefix all resource names.
domains:
  # TODO: Set an appropriate domain name. This domain name should be unique within the organization.
  <your domain name>:
    # One or more environments may be specified, typically along the lines of 'dev', 'test', and/or 'prod'
    environments:
      # The environment name will be incorporated into resource name by the default naming implementation.
      dev:
        # The target deployment account can be specified per environment.
        # If 'default' or not specified, the account configured in the environment will be assumed.
        account: default
        #TODO: Set context values appropriate to your env
        context:
          # The arn of a role which will be granted admin privileges on dataops resources
          data_admin_role_arn : <your-data-admin-role-arn>
          # The name of the datalake S3 bucket where the csv files will be uploaded
          datalake_src_bucket_name: <your-src-datalake-bucket-name>
          # The prefix on the datalake S3 bucket where the csv files will be uploaded
          datalake_src_prefix: <your/path/to/csv>
          # The name of the datalake S3 bucket where the parquet files will be written
          datalake_dest_bucket_name: <your-dest-datalake-bucket-name>
          # The prefix on the datalake S3 bucket where the parquet files will be written
          datalake_dest_prefix: <your/path/to/parquet>
          # The arn of the KMS key used to encrypt the datalake bucket
          datalake_kms_arn: <your-datalake-kms-key-arn>
          # The arn of the KMS key used to encrypt the Glue Catalog
          glue_catalog_kms_arn: <your-datalake-kms-key-arn>
        # The list of modules which will be deployed. A module points to a specific MDAA CDK App, and
        # specifies a deployment configuration file if required.
        modules:
          # This module will create all of the roles required for the lambda function
          roles:
            module_path: "@aws-mdaa/roles"
            module_configs:
              - ./lambda_csv_parquet/roles.yaml
          # This module will create DataOps Project resources which can be shared
          # across multiple DataOps modules
          project:
            module_path: "@aws-mdaa/dataops-project"
            module_configs:
              - ./lambda_csv_parquet/project.yaml
          # This module will create the CSV to Parquet Lambda function
          lambda-csv-parquet:
            module_path: "@aws-mdaa/dataops-lambda"
            module_configs:
              - ./lambda_csv_parquet/lambda.yaml

tags.yaml

This configuration specifies the tags to be applied to all deployed resources.

# Contents available in tags.yaml
tags:
  costcentre: '123456'
  project: data-ecosystem

lambda_csv_parquet/roles.yaml

This configuration will be used by the MDAA Roles module to deploy IAM roles and Managed Policies required for this sample architecture.

# Contents available in roles.yaml
# The list of roles which will be generated
generateRoles:
  lambda-etl:
    trustedPrincipal: service:lambda.amazonaws.com
    # A list of AWS managed policies which will be added to the role
    awsManagedPolicies:
      - service-role/AWSLambdaBasicExecutionRole
    suppressions:
      - id: "AwsSolutions-IAM4"
        reason: "AWSLambdaBasicExecutionRole approved for usage"

lambda_csv_parquet/project.yaml

This configuration will create a DataOps Project which can be used to support a wide variety of data ops activities. Specifically, this configuration will create a number of Glue Catalog databases and apply fine-grained access control to them.

# Contents available in dataops/project.yaml
# Arns of IAM roles which will be granted admin access to the project's resources (i.e. bucket)
dataAdminRoles:
  # This is an arn which will be resolved first to a role ID for inclusion in the bucket policy.
  # Note that this resolution will require iam:GetRole against this role arn for the role executing CDK.
  - arn: "{{context:data_admin_role_arn}}"

# List of roles which will be used to execute dataops processes using project resources
projectExecutionRoles:
  - id: generated-role-id:lambda-etl

s3OutputKmsKeyArn: "{{context:datalake_kms_arn}}"
glueCatalogKmsKeyArn: "{{context:glue_catalog_kms_arn}}"

lambda_csv_parquet/lambda.yaml

This configuration will create the transformation Lambda function using the DataOps Lambda module.

# Contents available in dataops/lambda.yaml
# The name of the dataops project this function will be created within.
# The dataops project name is the MDAA module name for the project.
projectName: project

# List of function definitions
functions:
  # Required function parameters
  - functionName: lambda_csv_parquet # Function name. Must be unique within the config.

    layers:
      # See https://aws-sdk-pandas.readthedocs.io/en/latest/install.html#aws-lambda-layer
      - "arn:aws:lambda:{{region}}:336392948345:layer:AWSSDKPandas-Python313:1"

    # (Optional) Function Description
    description: Transforms CSVs into Parquet

    # Function source code directory
    srcDir: ./src/lambda/lambda_csv_parquet

    # Code path to the Lambda handler function.
    handler: lambda_csv_parquet.lambda_handler

    # The runtime for the function source code.
    runtime: python3.13

    # The role with which the Lambda function will be executed
    roleArn: generated-role-arn:lambda-etl

    # Number of times (0-2) Lambda will retry before the invocation event
    # is sent to DLQ.
    retryAttempts: 2
    # The max age of an invocation event before it is sent to DLQ, either due to
    # failure, or insufficient Lambda capacity.
    maxEventAgeSeconds: 3600

    # (Optional) Number of seconds after which the function will time out.
    # Setting to 300s (5 min) to allow time for transformation. May need to increase to accommodate larger files.
    timeoutSeconds: 300

    environment:
      DEST_BUCKET_NAME: "{{context:datalake_dest_bucket_name}}"
      DEST_PREFIX: "{{context:datalake_dest_prefix}}"

    # (Optional) Size of function execution memory in MB
    # Default is 128MB
    memorySizeMB: 512

    # (Optional) Size of function ephemeral storage in MB
    # Default is 1024MB
    ephemeralStorageSizeMB: 1024

    # Integration with EventBridge for the purpose
    # of triggering this function with EventBridge rules
    eventBridge:
      # Number of times EventBridge will attempt to trigger this function
      # before sending the event to DLQ. Note that EventBridge Lambda invocation
      # is async, so Lambda function execution errors will generally be handled
      # on the Lambda side itself.
      retryAttempts: 10
      # The max age of an event before EventBridge sends it to DLQ.
      maxEventAgeSeconds: 3600
      # List of S3 buckets and prefixes which will be monitored via EventBridge in order to trigger this function.
      # Note that the S3 bucket must have EventBridge notifications enabled.
      s3EventBridgeRules:
        staging-update-event:
          # The bucket producing event notifications
          buckets:
            - "{{context:datalake_src_bucket_name}}"
          # Optional - The S3 prefix to match events on
          prefixes:
            - "{{context:datalake_src_prefix}}"
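For reference when extending the Lambda function, an S3 "Object Created" event delivered through EventBridge has roughly the following shape (abridged; the bucket name and key are illustrative placeholders matching the context values above):

```json
{
  "detail-type": "Object Created",
  "source": "aws.s3",
  "detail": {
    "bucket": { "name": "<your-src-datalake-bucket-name>" },
    "object": {
      "key": "<your/path/to/csv>/orders.csv",
      "size": 1024
    }
  }
}
```

The function handler reads the source bucket and key from event["detail"], which differs from the classic S3 notification format (event["Records"]).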