# Glue CSV Parquet Transformer
This blueprint illustrates how to use a scheduled Glue Workflow to trigger a DataOps Glue ETL Job that transforms large CSV files into Parquet.
This blueprint may be suitable when:

- Files are regularly uploaded to the data lake and need to be quickly transformed into Parquet, perhaps in a standardized zone of the lake.
- Medium- to large-sized CSV files in the data lake need to be transformed into Parquet.
While the blueprint doesn't immediately handle partitioning or additional transformations, the Glue ETL Job can easily be extended to provide these capabilities.
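As a minimal sketch of the kind of partitioning logic the job could be extended with: derive Hive-style partition prefixes from a date column, then write each group to a separate Parquet location. The column name `event_date` and the `year=/month=` layout are illustrative assumptions, not part of the blueprint.

```python
import csv
import io
from collections import defaultdict

def partition_rows_by_date(csv_text: str, date_column: str) -> dict:
    """Group CSV rows into Hive-style partition prefixes (year=YYYY/month=MM)
    derived from an ISO date column. Returns {prefix: [row, ...]}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    partitions = defaultdict(list)
    for row in reader:
        year, month, _day = row[date_column].split("-")
        partitions[f"year={year}/month={month}"].append(row)
    return dict(partitions)

sample = "id,event_date,value\n1,2024-01-15,a\n2,2024-01-20,b\n3,2024-02-01,c\n"
parts = partition_rows_by_date(sample, "event_date")
# In an extended job, each prefix would become a separate Parquet write location,
# e.g. <dest_prefix>/year=2024/month=01/ and <dest_prefix>/year=2024/month=02/
```

In the actual Glue job this grouping would typically be expressed as a partitioned DataFrame write rather than row-by-row Python, but the derived prefixes are the same.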

## Usage Instructions
The following instructions assume you have already deployed your Data Lake (possibly using MDAA). If already using MDAA, you can merge these sample blueprint configs into your existing `mdaa.yaml`.
- Deploy the sample configurations into the specified directory structure (or obtain them from the MDAA repo under `sample_blueprints/glue_csv_parquet`).
- Edit the `mdaa.yaml` to specify an organization name, replacing `<unique-org-name>`. This must be a globally unique name, as it is used in the naming of all deployed resources, some of which are globally named (such as S3 buckets).
- Edit the `mdaa.yaml` to specify a project name which is unique within your organization, replacing `<your-project-name>`.
- Edit the `mdaa.yaml` to specify appropriate context values for your environment.
- Optionally, edit `glue_csv_parquet/glue_csv_parquet/src/glue/glue_csv_parquet/glue_csv_parquet.py` to handle additional transformation and partitioning.
- Ensure you are authenticated to your target AWS account.
- Optionally, run `<path_to_mdaa_repo>/bin/mdaa -l ls` from the directory containing `mdaa.yaml` to understand what stacks will be deployed.
- Optionally, run `<path_to_mdaa_repo>/bin/mdaa -l synth` from the directory containing `mdaa.yaml` and review the produced templates.
- Run `<path_to_mdaa_repo>/bin/mdaa -l deploy` from the directory containing `mdaa.yaml` to deploy all modules.
- Before loading CSV files, you will need to provide the generated `glue-etl` role with access to your data lake bucket(s).
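For that final step, one option is to extend the generated policy in `glue_csv_parquet/roles.yaml` with S3 permissions. The statement below is a hypothetical sketch, not part of the blueprint: the bucket placeholders match the context values used elsewhere in this sample, and your data lake may instead manage access through its own bucket policies (and may additionally require KMS permissions).

```yaml
# Hypothetical additional statement for GlueJobPolicy in roles.yaml,
# granting the glue-etl role read/write access to the data lake buckets.
- Sid: DatalakeAccess
  Effect: Allow
  Resource:
    - "arn:{{partition}}:s3:::<your-src-datalake-bucket-name>/*"
    - "arn:{{partition}}:s3:::<your-dest-datalake-bucket-name>/*"
  Action:
    - s3:GetObject
    - s3:PutObject
```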
Additional MDAA deployment commands/procedures can be reviewed in DEPLOYMENT.
## Configurations
The sample configurations for this blueprint are provided below. They are also available under `sample_blueprints/glue_csv_parquet` within the MDAA repo.
### Config Directory Structure

```text
glue_csv_parquet
│   mdaa.yaml
│   tags.yaml
│
└───glue_csv_parquet
        roles.yaml
        project.yaml
        jobs.yaml
```
### mdaa.yaml
This configuration specifies the global, domain, env, and module configurations required to configure and deploy this sample architecture.
**Note:** Before deployment, populate the `mdaa.yaml` with appropriate organization and context values for your environment.
```yaml
# Contents available in mdaa.yaml
# All resources will be deployed to the default region specified in the environment or AWS configurations.
# Can optionally specify a specific AWS Region name.
region: default

# One or more tag files containing tags which will be applied to all deployed resources
tag_configs:
  - ./tags.yaml

## Pre-Deployment Instructions
# TODO: Set an appropriate, unique organization name, likely matching the org name used in other MDAA configs.
# Failure to do so may result in global naming conflicts.
organization: test-glue-blueprint # <unique-org-name>

# One or more domains may be specified. The domain name will be incorporated by the default naming implementation
# to prefix all resource names.
domains:
  # TODO: Set an appropriate project name. This project name should be unique within the organization.
  <your-project-name>:
    # One or more environments may be specified, typically along the lines of 'dev', 'test', and/or 'prod'
    environments:
      # The environment name will be incorporated into resource names by the default naming implementation.
      dev:
        # The target deployment account can be specified per environment.
        # If 'default' or not specified, the account configured in the environment will be assumed.
        account: default
        # TODO: Set context values appropriate to your env
        context:
          # The ARN of a role which will be provided admin privileges to dataops resources
          data_admin_role_arn: <your-data-admin-role-arn>
          # The name of the data lake S3 bucket where the CSV files will be uploaded
          datalake_src_bucket_name: <your-src-datalake-bucket-name>
          # The prefix on the data lake S3 bucket where the CSV files will be uploaded
          datalake_src_prefix: <your/path/to/csv>
          # The name of the data lake S3 bucket where the Parquet files will be written
          datalake_dest_bucket_name: <your-dest-datalake-bucket-name>
          # The prefix on the data lake S3 bucket where the Parquet files will be written
          datalake_dest_prefix: <your/path/to/parquet>
          # The ARN of the KMS key used to encrypt the data lake bucket
          datalake_kms_arn: <your-datalake-kms-key-arn>
          # The ARN of the KMS key used to encrypt the Glue Catalog
          glue_catalog_kms_arn: <your-datalake-kms-key-arn>
        # The list of modules which will be deployed. A module points to a specific MDAA CDK App, and
        # specifies a deployment configuration file if required.
        modules:
          # This module will create all of the roles required for the Glue ETL Job
          roles:
            module_path: "@aws-mdaa/roles"
            module_configs:
              - ./glue_csv_parquet/roles.yaml
          # This module will create DataOps Project resources which can be shared
          # across multiple DataOps modules
          project:
            module_path: "@aws-mdaa/dataops-project"
            module_configs:
              - ./glue_csv_parquet/project.yaml
          # This module will create the CSV to Parquet Glue ETL Job
          jobs:
            module_path: "@aws-mdaa/dataops-job"
            module_configs:
              - ./glue_csv_parquet/jobs.yaml
          # This module will create an AWS Glue Workflow which will schedule the CSV to Parquet Glue ETL Job
          workflow:
            module_path: "@aws-mdaa/dataops-workflow"
            tag_configs:
              - ./tags.yaml
            module_configs:
              - ./glue_csv_parquet/workflow.yaml
```
### tags.yaml
This configuration specifies the tags to be applied to all deployed resources.
### glue_csv_parquet/roles.yaml
This configuration will be used by the MDAA Roles module to deploy IAM roles and Managed Policies required for this sample architecture.
```yaml
# Contents available in roles.yaml
# The list of policies which will be generated
generatePolicies:
  GlueJobPolicy:
    policyDocument:
      Statement:
        - Sid: GlueCloudwatch
          Effect: Allow
          Resource:
            - "arn:{{partition}}:logs:{{region}}:{{account}}:log-group:/aws-glue/*"
          Action:
            - logs:CreateLogStream
            - logs:AssociateKmsKey
            - logs:CreateLogGroup
            - logs:PutLogEvents
    suppressions:
      - id: "AwsSolutions-IAM5"
        reason: "Glue log group name not known at deployment time."
# The list of roles which will be generated
generateRoles:
  glue-etl:
    trustedPrincipal: service:glue.amazonaws.com
    # A list of AWS managed policies which will be added to the role
    awsManagedPolicies:
      - service-role/AWSGlueServiceRole
    generatedPolicies:
      - GlueJobPolicy
    suppressions:
      - id: "AwsSolutions-IAM4"
        reason: "AWSGlueServiceRole approved for usage"
```
### glue_csv_parquet/project.yaml
This configuration will create a DataOps Project which can be used to support a wide variety of data ops activities. Specifically, this configuration will create a number of Glue Catalog databases and apply fine-grained access control to them.
```yaml
# Contents available in dataops/project.yaml
# ARNs for IAM roles which will be provided admin access to the project's resources (i.e. the project bucket)
dataAdminRoles:
  # This is an ARN which will be resolved first to a role ID for inclusion in the bucket policy.
  # Note that this resolution will require iam:GetRole against this role ARN for the role executing CDK.
  - arn: "{{context:data_admin_role_arn}}"
# List of roles which will be used to execute dataops processes using project resources
projectExecutionRoles:
  - id: generated-role-id:glue-etl
s3OutputKmsKeyArn: "{{context:datalake_kms_arn}}"
glueCatalogKmsKeyArn: "{{context:glue_catalog_kms_arn}}"
```
### glue_csv_parquet/jobs.yaml
This configuration will create the transformation Glue ETL Job using the DataOps Glue Job module.
```yaml
# Contents available in dataops/jobs.yaml
# The name of the dataops project this job will be created within.
# The dataops project name is the MDAA module name for the project.
projectName: project
templates:
  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  glue-csv-parquet-template:
    # (required) Command definition for the glue job
    command:
      name: "glueetl"
      pythonVersion: "3"
    # (required) Description of the Glue Job
    description: Template to create a Job that transforms CSVs into Parquet
    defaultArguments:
      --job-bookmark-option: job-bookmark-disable
      --raw_bucket: "{{context:datalake_src_bucket_name}}"
      --raw_bucket_prefix: "{{context:datalake_src_prefix}}"
      --transformed_bucket: "{{context:datalake_dest_bucket_name}}"
      --transformed_bucket_prefix: "{{context:datalake_dest_prefix}}"
      --enable-glue-datacatalog: "True"
      --region_name: "{{region}}"
    # (optional) Maximum concurrent runs. See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-ExecutionProperty
    executionProperty:
      maxConcurrentRuns: 1
    # (optional) Glue version to use, as a string. See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: "4.0"
    maxRetries: 0
    # (optional) Number of minutes to wait before sending a job run delay notification.
    notificationProperty:
      notifyDelayAfter: 1
    # (optional) Number of workers to provision
    # numberOfWorkers: 1
    # (optional) Number of minutes to wait before considering the job timed out
    timeout: 60
    # (optional) Worker type to use. Any of: "Standard" | "G.1X" | "G.2X"
    # Use maxCapacity or workerType, not both.
    # workerType: "G.1X"
    executionRoleArn: generated-role-arn:glue-etl
    # Viewing real-time logs provides you with a better perspective on the running job.
    # https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html
    continuousLogging:
      # For allowed values, refer to https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_logs.RetentionDays.html
      # Possible values are: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653, and 0.
      logGroupRetentionDays: 14
jobs:
  glue_csv_parquet:
    template: "glue-csv-parquet-template" # Reference a job template
    command:
      scriptLocation: ./src/glue/glue_csv_parquet/glue_csv_parquet.py
    allocatedCapacity: 2
    description: Job to transform CSVs into Parquet
```
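The script referenced by `scriptLocation` is not reproduced here, but the `defaultArguments` above reach it as `--key value` pairs on the command line (in a real Glue job they would typically be read with `awsglue.utils.getResolvedOptions`). The following stdlib-only sketch illustrates how those parameters map to script inputs; the argument names match the template above, while the function and sample values are illustrative assumptions.

```python
def resolve_job_args(argv, expected):
    """Minimal stand-in for awsglue.utils.getResolvedOptions: scan argv
    for '--name value' pairs and return the expected ones as a dict."""
    args = {}
    for i, token in enumerate(argv):
        if token.startswith("--") and token[2:] in expected and i + 1 < len(argv):
            args[token[2:]] = argv[i + 1]
    missing = [name for name in expected if name not in args]
    if missing:
        raise ValueError(f"Missing required job arguments: {missing}")
    return args

# Simulated argv, shaped like the defaultArguments in the template above
argv = [
    "glue_csv_parquet.py",
    "--raw_bucket", "my-src-bucket",
    "--raw_bucket_prefix", "path/to/csv",
    "--transformed_bucket", "my-dest-bucket",
    "--transformed_bucket_prefix", "path/to/parquet",
]
job_args = resolve_job_args(argv, [
    "raw_bucket", "raw_bucket_prefix",
    "transformed_bucket", "transformed_bucket_prefix",
])
# The job would read CSVs from src and write Parquet to dest
src = f"s3://{job_args['raw_bucket']}/{job_args['raw_bucket_prefix']}"
dest = f"s3://{job_args['transformed_bucket']}/{job_args['transformed_bucket_prefix']}"
```

Keeping the bucket and prefix values in `defaultArguments` (rather than hard-coding them in the script) is what lets the same script be reused across environments by changing only the `mdaa.yaml` context values.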