
Migration

This section provides best practice guides and tools to migrate data processing applications from self-managed environments to Amazon EMR.

  1. Amazon EMR Migration Guide: A comprehensive technical document that provides best practices and steps for migrating Apache Spark and Apache Hadoop workloads from on-premises environments to AWS.

  2. Migrating to Apache HBase on Amazon S3 on Amazon EMR: This whitepaper provides an overview of Apache HBase on Amazon S3 and guides data engineers and software developers through migrating an on-premises or HDFS-backed Apache HBase cluster to Apache HBase on Amazon S3. It offers a migration plan with detailed steps for each stage, covering data migration, performance tuning, and operational guidance.

  3. Data Migration: We recommend using AWS DataSync for migrating data from HDFS to Amazon S3. Check this blog post to review DataSync capabilities and how to get started with data migrations.

  4. Data Pipeline Migration: The following tools can be useful in migrating your current data pipelines to AWS:

    1. Oozie to MWAA
    2. Oozie to AWS Step Functions
  5. Data Governance: The following tools can be helpful in migrating your current data catalogs to AWS:

    1. Migrate metadata between Hive metastore and AWS Glue Data Catalog
    2. Hive Glue Catalog Sync Agent

For further assistance, reach out to aws-bdms-emr@amazon.com.