ETL Jobs

The Data Ops Job CDK application deploys the resources required to run data operations on top of a Data Lake using Glue Jobs.

Deployed Resources and Compliance Details

dataops-job

Glue Jobs - A Glue Job is created for each job specification in the configs

Automatically configured to use project security config
Can optionally be VPC bound (via Glue connection)
Automatically configured to use project bucket as temp location
Can use job templates to promote reuse/minimize config duplication
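
The template reuse mentioned in the last item can be pictured as a merge of a template with per-job overrides. The Python sketch below illustrates the idea only. It is not the module's implementation, and it assumes that nested maps such as defaultArguments are merged key by key, which the module may or may not do. The helper names merge_config and resolve_jobs are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(template: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict holding the template values with overrides applied on top.

    Assumption of this sketch: nested dicts are merged key by key, and any
    other override value replaces the template value outright.
    """
    merged = deepcopy(template)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_jobs(templates: Dict[str, Dict[str, Any]],
                 jobs: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Expand each job by merging it over the template it names, if any."""
    resolved = {}
    for name, job in jobs.items():
        job = dict(job)  # shallow copy so the caller's dict keeps its "template" key
        template_name = job.pop("template", None)
        base = templates[template_name] if template_name else {}
        resolved[name] = merge_config(base, job)
    return resolved


# Values abbreviated from the sample job config in this README.
templates = {
    "ExamplePythonTemplate": {
        "command": {"name": "glueetl", "pythonVersion": "3"},
        "defaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
        "timeout": 60,
    }
}
jobs = {
    "PythonJobOne": {
        "template": "ExamplePythonTemplate",
        "defaultArguments": {"--Input": "s3://some-bucket/some-location1"},
    }
}

resolved = resolve_jobs(templates, jobs)
```

Under these assumptions, PythonJobOne ends up with both the template's bookmark argument and its own --Input argument, while the template itself is left unchanged and is never treated as a job.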

Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

dataops-job: # Module Name can be customized
  module_path: '@aws-mdaa/dataops-job' # Must match module NPM package name
  module_configs:
    - ./dataops-job.yaml # Filename/path can be customized

Module Config (./dataops-job.yaml)

Config Schema Docs

Sample Job Config

Job configs can be templated so that a single definition is reused across multiple jobs that differ in only a few parameters (such as input/output paths). Templates can be stored separately from job configs, or together with them in the same file.

# (Optional) Name of the Data Ops Project whose resources will be used by this job.
# Other resources within the project can be referenced in the job config using
# the "project:" prefix on the config value.
projectName: dataops-project-test

# Alternatively, if projectName is not provided, you can supply parameters directly:
# kmsArn: arn:aws:kms:region:account:key/key-id  # KMS key for encrypting job artifacts and logs
# deploymentRoleArn: arn:aws:iam::account:role/role-name  # IAM role for Glue job execution
# bucketName: my-project-bucket  # S3 bucket for storing job scripts and temporary data
# securityConfigurationName: my-security-config  # Glue security configuration for encryption settings
# notificationTopicArn: arn:aws:sns:region:account:topic-name  # SNS topic for job status notifications

templates:
  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  ExamplePythonTemplate:
    executionRoleArn: some-arn
    # (required) Command definition for the glue job
    command:
      # (required) Either of "glueetl" | "pythonshell"
      name: 'glueetl'
      # (optional) Python version.  Either "2" or "3"
      pythonVersion: '3'
      # (required) Path to a .py file relative to the configuration.
      scriptLocation: ./src/glue/python/job.py
    # (required) Description of the Glue Job
    description: Example of a Glue Job using an inline script
    # (optional) List of connections for the glue job to use.  Each entry refers to a connection name defined in the 'connections:' section of the project.yaml
    connections:
      - project:connections/connectionVpc
    # (optional) key: value pairs for the glue job to use.  see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
    defaultArguments:
      --job-bookmark-option: job-bookmark-enable
    # (optional) maximum concurrent runs.  See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-ExecutionProperty
    executionProperty:
      maxConcurrentRuns: 1
    # (optional) Glue version to use as a string.  See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: '2.0'
    # (optional) Maximum capacity.  See: MaxCapacity Section: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html
    # Use maxCapacity or workerType.  Not both.
    #maxCapacity: 1
    # (optional) Maximum number of times to retry the job after a failed run.  See the MaxRetries field of the Job structure in the AWS Glue API reference.
    maxRetries: 3
    # (optional) Number of minutes to wait before sending a job run delay notification.
    notificationProperty:
      notifyDelayAfter: 1
    # (optional) Number of workers to provision
    #numberOfWorkers: 1
    # (optional) Number of minutes to wait before considering the job timed out
    timeout: 60
    # (optional) Worker type to use.  Any of: "Standard" | "G.1X" | "G.2X"
    # Use maxCapacity or workerType.  Not both.
    #workerType: Standard

  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  ExampleScalaTemplate:
    executionRoleArn: some-arn
    # The command definition is omitted here and supplied by each job that references this template.
    # (optional) key: value pairs for the glue job to use.  See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
    defaultArguments:
      --job-language: scala
    # (optional) Glue version to use as a string.  See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: '5.0'

jobs:
  # Job definitions below
  PythonJobOne: # Job Name
    template: 'ExamplePythonTemplate' # Reference a job template.
    defaultArguments:
      --Input: s3://some-bucket/some-location1
    allocatedCapacity: 2
    continuousLogging:
      # For allowed values, refer to https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_logs.RetentionDays.html
      # Possible values are: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653, and 0.
      logGroupRetentionDays: 3

  PythonJobTwo:
    template: 'ExamplePythonTemplate' # Reference a job template.
    defaultArguments:
      --Input: s3://some-bucket/some-location2
      --enable-spark-ui: 'true'
      --spark-event-logs-path: s3://some-bucket/spark-event-logs-path/JobTwo/
    allocatedCapacity: 20
    # (Optional) List of helper scripts referenced in the main Glue ETL script.
    # Helper scripts are grouped by their immediate parent directory, and each group is packaged into its own zip.
    # After deployment, they sit alongside the main script, so the main script must import them directly by file name.
    # Example (main.py):
    # from core import core_function1, core_function2
    # from helper_etl import helper_function1, helper_function2
    additionalScripts:
      - ./src/glue/python/helper_etl.py
      - ./src/glue/python/utils/core.py
    # (Optional) List of additional files which will be available to the Glue Job next to the main script
    additionalFiles:
      - ./src/glue/scala/extra_file.txt

  ScalaJobOne: # Job Name
    template: 'ExampleScalaTemplate' # Reference a job template.
    description: testing
    defaultArguments:
      --class: some.java.package.App
    allocatedCapacity: 2
    command:
      # (required) Either of "glueetl" | "pythonshell"
      name: 'glueetl'
      # (required) Path to a script file relative to the configuration.
      scriptLocation: ./src/glue/scala/App.scala
    # (Optional) List of additional files which will be available to the Glue Job next to the main script
    additionalFiles:
      - ./src/glue/scala/extra_file.txt
    # (Optional) List of additional jars which will be loaded into the Spark driver and executor JVMs for use
    # within the Scala script
    additionalJars:
      - ./src/glue/scala/lib/extra.jar
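
The additionalScripts comments in PythonJobTwo describe two rules: helper scripts are grouped by their immediate parent directory, and the main script imports them by bare file name. The Python sketch below illustrates both rules on the sample paths. It is only an illustration: how the module actually names and uploads the zips is not covered here, and the function names group_by_parent and import_names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def group_by_parent(paths: List[str]) -> Dict[str, List[str]]:
    """Group script file names by their immediate parent directory.

    Each resulting group corresponds to one zip in the description above.
    PurePosixPath drops the leading "./", so keys are normalized paths.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        groups[str(path.parent)].append(path.name)
    return dict(groups)


def import_names(paths: List[str]) -> List[str]:
    """Return the module names the main script would import (file name minus .py)."""
    return sorted(PurePosixPath(p).stem for p in paths)


# Paths taken from the PythonJobTwo sample.
additional_scripts = [
    "./src/glue/python/helper_etl.py",
    "./src/glue/python/utils/core.py",
]

groups = group_by_parent(additional_scripts)
modules = import_names(additional_scripts)

for parent, files in groups.items():
    print(f"{parent}: {files}")
print("import as:", modules)
```

The two sample scripts live in different directories, so they fall into two groups, yet both are imported without any package prefix (from core import ..., from helper_etl import ...). One consequence worth keeping in mind is that helper file names should be unique across directories, since a bare-name import cannot distinguish two files that share a name.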