ETL Jobs

The Data Ops Job CDK application deploys the resources required to support and perform data operations on top of a Data Lake using Glue Jobs.


Deployed Resources and Compliance Details

dataops-job

Glue Jobs - A Glue Job is created for each job specification in the configs

  • Automatically configured to use project security config
  • Can optionally be VPC bound (via Glue connection)
  • Automatically configured to use project bucket as temp location
  • Can use job templates to promote reuse/minimize config duplication

Configuration

MDAA Config

Add the following snippet to your mdaa.yaml under the modules: section of a domain/env in order to use this module:

dataops-job: # Module Name can be customized
  module_path: '@aws-mdaa/dataops-job' # Must match module NPM package name
  module_configs:
    - ./dataops-job.yaml # Filename/path can be customized

Module Config (./dataops-job.yaml)

Config Schema Docs

Sample Job Config

Job configs can be templated so that one job definition is reused across multiple jobs that differ in only a few parameters (such as input/output paths). Templates can be stored separately from job configs, or together with them in the same file.
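As a rough mental model of how a job might pick up values from a template, the Python sketch below layers a job's settings on top of a template's. The `merge_config` helper and its merge rule (job values win, and nested maps such as `defaultArguments` are combined key by key) are assumptions made for illustration, not the module's documented behavior. The dictionaries are trimmed-down stand-ins for the sample config that follows.

```python
from copy import deepcopy


def merge_config(template: dict, overrides: dict) -> dict:
    """Return a new dict with `overrides` layered on top of `template`.

    Assumed rule (hypothetical, for illustration only): nested dicts are
    merged key by key; any other override value replaces the template value.
    """
    merged = deepcopy(template)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative, shortened versions of a template and a job that references it.
template = {
    "command": {"name": "glueetl", "pythonVersion": "3"},
    "defaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
    "timeout": 60,
}
job = {
    "template": "ExamplePythonTemplate",
    "defaultArguments": {"--Input": "s3://some-bucket/some-location1"},
    "allocatedCapacity": 2,
}

# The "template" key only selects which template to use, so drop it before merging.
job_overrides = {k: v for k, v in job.items() if k != "template"}
effective = merge_config(template, job_overrides)
print(effective["defaultArguments"])
```

Under this assumed rule, the job keeps the template's bookmark option and adds its own `--Input` argument. To see what the module actually produces, inspect the synthesized CloudFormation for a deployed job.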

# (Optional) Name of the Data Ops Project
# Name of the project whose resources will be used by this job.
# Other resources within the project can be referenced in the job config using
# the "project:" prefix on the config value.
projectName: dataops-project-test

# Alternatively, if projectName is not provided, you can supply parameters directly:
# kmsArn: arn:aws:kms:region:account:key/key-id  # KMS key for encrypting job artifacts and logs
# deploymentRoleArn: arn:aws:iam::account:role/role-name  # IAM role for Glue job execution
# bucketName: my-project-bucket  # S3 bucket for storing job scripts and temporary data
# securityConfigurationName: my-security-config  # Glue security configuration for encryption settings
# notificationTopicArn: arn:aws:sns:region:account:topic-name  # SNS topic for job status notifications

templates:
  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  ExamplePythonTemplate:
    executionRoleArn: some-arn
    # (required) Command definition for the glue job
    command:
      # (required) Either of "glueetl" | "pythonshell"
      name: 'glueetl'
      # (optional) Python version.  Either "2" or "3"
      pythonVersion: '3'
      # (required) Path to a .py file relative to the configuration.
      scriptLocation: ./src/glue/python/job.py
    # (required) Description of the Glue Job
    description: Example of a Glue Job using an inline script
    # (optional) List of connections for the glue job to use.  Reference back to the connection name in the 'connections:' section of the project.yaml
    connections:
      - project:connections/connectionVpc
    # (optional) key: value pairs for the glue job to use.  see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
    defaultArguments:
      --job-bookmark-option: job-bookmark-enable
    # (optional) maximum concurrent runs.  See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-ExecutionProperty
    executionProperty:
      maxConcurrentRuns: 1
    # (optional) Glue version to use as a string.  See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: '2.0'
    # (optional) Maximum capacity.  See the MaxCapacity section: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html
    # Use either maxCapacity or workerType, not both.
    #maxCapacity: 1
    # (optional) Maximum retries.  See the MaxRetries section of the Glue Job API page linked above.
    maxRetries: 3
    # (optional) Number of minutes to wait before sending a job run delay notification.
    notificationProperty:
      notifyDelayAfter: 1
    # (optional) Number of workers to provision
    #numberOfWorkers: 1
    # (optional) Number of minutes to wait before considering the job timed out
    timeout: 60
    # (optional) Worker type to use.  Any of: "Standard" | "G.1X" | "G.2X"
    # Use either maxCapacity or workerType, not both.
    #workerType: Standard

  # An example job template. Can be referenced from other jobs. Will not itself be deployed.
  ExampleScalaTemplate:
    executionRoleArn: some-arn
    # The (required) command definition is omitted here; jobs using this template (such as ScalaJobOne) supply their own.
    # (optional) key: value pairs for the glue job to use.  see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
    defaultArguments:
      --job-language: scala
    # (optional) Glue version to use as a string.  See: https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
    glueVersion: '5.0'

jobs:
  # Job definitions below
  PythonJobOne: # Job Name
    template: 'ExamplePythonTemplate' # Reference a job template.
    defaultArguments:
      --Input: s3://some-bucket/some-location1
    allocatedCapacity: 2
    continuousLogging:
      # For allowed values, refer https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_logs.RetentionDays.html
      # Possible values are: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653, and 0.
      logGroupRetentionDays: 3

  PythonJobTwo:
    template: 'ExamplePythonTemplate' # Reference a job template.
    defaultArguments:
      --Input: s3://some-bucket/some-location2
      --enable-spark-ui: 'true'
      --spark-event-logs-path: s3://some-bucket/spark-event-logs-path/JobTwo/
    allocatedCapacity: 20
    # (Optional) List of helper scripts referenced by the main Glue ETL script.
    # Helper scripts are grouped by their immediate parent directory, and each group is packaged into its own zip.
    # After deployment they sit alongside the main script, so import them by file name directly from the main Glue script.
    # Example (main.py):
    # from core import core_function1, core_function2
    # from helper_etl import helper_function1, helper_function2
    additionalScripts:
      - ./src/glue/python/helper_etl.py
      - ./src/glue/python/utils/core.py
    # (Optional) List of additional files which will be available to the Glue Job next to the main script
    additionalFiles:
      - ./src/glue/scala/extra_file.txt

  # Job definitions below
  ScalaJobOne: # Job Name
    template: 'ExampleScalaTemplate' # Reference a job template.
    description: testing
    defaultArguments:
      --class: some.java.package.App
    allocatedCapacity: 2
    command:
      # (required) Either of "glueetl" | "pythonshell"
      name: 'glueetl'
      # (required) Path to a script file relative to the configuration.
      scriptLocation: ./src/glue/scala/App.scala
    # (Optional) List of additional files which will be available to the Glue Job next to the main script
    additionalFiles:
      - ./src/glue/scala/extra_file.txt
    # (Optional) List of additional jars which will be loaded into the Spark driver and executor JVMs for use
    # within the Scala script
    additionalJars:
      - ./src/glue/scala/lib/extra.jar
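
The `additionalScripts` comments in PythonJobTwo say that helper scripts are grouped by their immediate parent directory and then imported by file name. The Python sketch below illustrates that grouping for the two paths in the sample. The `group_by_parent` helper is hypothetical and only mirrors the description in the comments; it is not the module's packaging code, and the names of the resulting zips are not shown because the source does not state them.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_parent(script_paths: list[str]) -> dict[str, list[str]]:
    """Group script paths by immediate parent directory.

    Returns a mapping of parent directory -> importable module names
    (file names without the .py extension). Hypothetical helper.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for path in script_paths:
        p = PurePosixPath(path)  # normalizes away the leading "./"
        groups[str(p.parent)].append(p.stem)
    return dict(groups)


# The additionalScripts entries from PythonJobTwo.
additional_scripts = [
    "./src/glue/python/helper_etl.py",
    "./src/glue/python/utils/core.py",
]

groups = group_by_parent(additional_scripts)
for parent, modules in groups.items():
    # Each parent directory would correspond to one zip in the description above.
    print(parent, modules)
```

The two scripts sit in different directories, so they land in two separate groups. The main script still imports them as `helper_etl` and `core`, without any package prefix.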