Data Migration

This document describes possible migrations paths you can follow when migrating data from an existing HBase cluster (e.g. on premise, or self-managed cluster on EC2) to Amazon EMR.

HBase snapshots

This is the most straight forward approach that doesn't require a complex setup and can easily be achieved using simple bash scripts. This approach is suitable if your data does not change frequently or when you can tolerate downtimes in your production systems to perform the data migration.

Below a list of steps that can be used to create a HBase Snapshot and transfer it to an Amazon S3 bucket. Please note that you can use the same approach to store snapshots on an HDFS cluster. If this is the case, replace the S3 target path in the following commands with the destination HDFS path (e.g. hdfs://NN_TARGET:8020/user/hbase) where you want to store the snapshots.

Create a snapshot of a HBase table

When creating a snapshot, it’s good practice to also add an identifier in the snapshot name to have a reference date of when the snapshot was created. Before launching this command please replace the variable TABLE_NAME with the corresponding table you want to generate the snapshot for. If the table is in a namespace different from default use the following convention NAMESPACE:TABLE_NAME. From the SOURCE cluster submit the following commands:

DATE=`date +"%Y%m%d"`
TABLE_NAME="YOUR_TABLE_NAME"
hbase snapshot create -n "${TABLE_NAME/:/_}-$DATE" -t ${TABLE_NAME}

To verify the snapshot just created, use the following command

hbase snapshot info -list-snapshots

Copy the snapshot to an Amazon S3 bucket

Note When migrating from an on premise cluster, make sure that you have Hadoop YARN installed in your cluster, as the commands rely on MR jobs to perform the copy to S3. Besides, you need to make sure that your Hadoop installation provides the hadoop-aws module that is required to communicate with Amazon S3.

Note If you're planning to use HBase with Amazon S3 as storage layer, you should use as TARGET_BUCKET the same S3 path that will be used as HBase S3 Root Directory while launching the EMR cluster. This minimize copies on S3 that are required when restoring the snapshots, thus reducing the restore time of your tables. To avoid any conflict during the snapshot copy, you should not start the EMR cluster (if using Amazon S3 as storage layer) before the end of the snapshot copy.

TARGET_BUCKET="s3://BUCKET/PREFIX/"
hbase snapshot export -snapshot ${TABLE_NAME/:/_}-$DATE -copy-to $TARGET_BUCKET

Restore Table when using Amazon S3

If you followed the notes in the previous step, you'll find the snapshot already available in HBase after launching the cluster.

Note If your snapshot was created from a namespace different from the default one, make sure to pre create it, to avoid failures while restoring the snapshot. From the EMR master node:

# Verify snapshot availability
HBASE_CMD="sudo -u hbase hbase"
$HBASE_CMD snapshot info -list-snapshots

# Review snapshot info and details
SNAPSHOT_NAME="YOUR_SNAPSHOT_NAME"
$HBASE_CMD snapshot info -snapshot $SNAPSHOT_NAME -size-in-bytes -files -stats -schema

# Optional - Create namespaces required by the snapshot
echo "create_namespace \"$NAMESPACE_NAME\"" | $HBASE_CMD shell

# Restore table from snapshot
echo "restore_snapshot \"$SNAPSHOT_NAME\"" | $HBASE_CMD shell

Resources

The following scripts allows you to migrate and restore HBase tables an namespaces using the snapshot procedure previously described.

Snapshot export - Generate HBase snapshots for all the tables stored in all the namespaces, and copy them on an Amazon S3 bucket.
Snapshot import - Restore all the snapshots stored in an Amazon S3 bucket.

Snapshots with Incremental Export

This approach might help in those situations where you want to migrate your data but at the same time you cannot tolerate much downtime in your production system. This approach helps to perform an initial bulk migration using the HBase snapshot procedure previously described, and then reconcile data received after the HBase snapshot generating incremental exports from the SOURCE table.

This approach works when the volume of ingested data is not high, as the procedure to reconcile the data in the DESTINATION cluster might require multiple iterations to synchronize the two clusters, along with the fact that might be error prone. The following highlights the overall migration procedure.

In the SOURCE cluster:

Create a snapshot of the HBase table you want to migrate. Collect the epoch time when the snapshot was taken, as this will be used to determine new data ingested in the cluster.
Export the snapshot on Amazon S3 org.apache.hadoop.hbase.snapshot.ExportSnapshot

In the DESTINATION cluster:

Import the snapshot in the cluster and restore the table

In the SOURCE cluster:

Generate an incremental export to S3 for data arrived in the cluster after taking the snapshot using the HBase utility org.apache.hadoop.hbase.mapreduce.Export

In the DESTINATION cluster:

Restore the missing data in the destination cluster using the HBase utility org.apache.hadoop.hbase.mapreduce.Import

Example Export Script

## Configurations
HBASE_CMD="sudo -u hbase hbase"
BUCKET_NAME="YOUR_BUCKET_NAME"
SNAPSHOT_PATH="s3://$BUCKET_NAME/hbase-snapshots/"
TABLE_NAME="TestTable"

# ==============================================================================
# (Simulate) Create TestTable with 1000 rows
# ==============================================================================
$HBASE_CMD pe --table=$TABLE_NAME --rows=1000 --nomapred sequentialWrite 1

# ==============================================================================
# Take initial table snapshot and copy it to S3
# ==============================================================================
DATE=`date +"%Y%m%d"`
EPOCH_MS=`date +%s%N | cut -b1-13`
LABEL="$DATE-$EPOCH_MS"

# snapshot creation
# Note: HBase performs a FLUSH by default when creating a snapshot
#       You can change this behaviour specifying the -s parameter
$HBASE_CMD snapshot create -n "${LABEL}-${TABLE_NAME}" -t $TABLE_NAME

# copy to S3
$HBASE_CMD org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot "${LABEL}-${TABLE_NAME}" -copy-to $SNAPSHOT_PATH

# ==============================================================================
# (Simulate) Data mutations to simulate data arrived after taking the snapshot
# ==============================================================================
# overwrite the first 100 elements of the table
$HBASE_CMD pe --rows=100 --nomapred sequentialWrite 1
# check first 100 rows will have an higher timestamp compared to the 101 element
echo "scan '$TABLE_NAME', {LIMIT => 101}" | $HBASE_CMD shell

# ==============================================================================
# Generate incremental data export
# ==============================================================================
# Retrieve the epoch time from the snapshot name that was previously created.
# This allow us to only export data modified since that moment in time.
$HBASE_CMD snapshot info -list-snapshots

# Incremental updates
LATEST_SNAPSHOT_EPOCH="$EPOCH_MS"
NEW_EPOCH_MS=`date +%s%N | cut -b1-13`
INCREMENTAL_PATH="s3://$BUCKET_NAME/hbase-delta/${TABLE_NAME}/${NEW_EPOCH_MS}"
$HBASE_CMD org.apache.hadoop.hbase.mapreduce.Export ${TABLE_NAME} $INCREMENTAL_PATH 1 $LATEST_SNAPSHOT_EPOCH

Example Import Script

## Configurations
HBASE_CMD="sudo -u hbase hbase"
BUCKET_NAME="YOUR_BUCKET_NAME"
SNAPSHOT_PATH="s3://$BUCKET_NAME/hbase-snapshots/"

HBASE_CONF="/etc/hbase/conf/hbase-site.xml"
HBASE_ROOT=$(xmllint --xpath "//configuration/property/*[text()='hbase.rootdir']/../value/text()" $HBASE_CONF)

# ==============================================================================
# Import and Restore HBase snapshot
# ==============================================================================

## List Snapshots on S3 and take note of the snapshot you want to restore
$HBASE_CMD snapshot info -list-snapshots -remote-dir $SNAPSHOT_PATH
SNAPSHOT_NAME="SNAPSHOT_NAME" # e.g. "20220817-1660726018359-TestTable"

## Copy snapshot on the cluster
$HBASE_CMD snapshot export \
    -D hbase.rootdir=$SNAPSHOT_PATH \
    -snapshot $SNAPSHOT_NAME \
    -copy-to $HBASE_ROOT

# Restore initial snapshot
echo "restore_snapshot '$SNAPSHOT_NAME'" | $HBASE_CMD shell

# ==============================================================================
# Replay incremental updates
# ==============================================================================
TABLE_NAME=$(echo $SNAPSHOT_NAME | awk -F- '{print $3}')
INCREMENTAL_PATH="s3://$BUCKET_NAME/hbase-delta/${TABLE_NAME}/${NEW_EPOCH_MS}"
$HBASE_CMD org.apache.hadoop.hbase.mapreduce.Import ${TABLE_NAME} ${INCREMENTAL_PATH}

Snapshots with HBase Replication

This approach describes how to migrate data using the HBase cluster replication feature that allows you to establish a peering between two (or more) HBase clusters so that they can replicate incoming data depending on how the peering was established.

In order to use this approach, a network connection between the SOURCE and DESTINATION cluster should be present. If you're transferring data from an on premise cluster and you have large volumes of data to replicate, you might establish the connection between the two clusters using AWS Direct Connect or you can establish a VPN connection if this is a one time migration.

The below section highlight the overall procedure to establish the replication.

In the SOURCE cluster, create a HBase peering with the DESTINATION cluster and then disable the peering so that data is accumulated in the HBase WALs.
In the SOURCE cluster, take a snapshot of the table you want to migrate and export it to S3.
In the DESTINATION cluster, import and restore the snapshot. This creates the metadata (table description) required for the replication and also restore the data present in the snapshot.
In the SOURCE cluster, re-enable the HBase peering with the DESTINATION cluster, so that data modified up to that moment will start to be replicated in the DESTINATION cluster.
Monitor the replication process from the HBase shell to verify the lag of replication before completely switch on the DESTINATION cluster, and shutdown the SOURCE cluster.

Create one-way peering: SOURCE → DESTINATION

Note The configuration for the replication should be enabled by default in HBase. To double check, verify hbase.replication is set to true in the hbase-site.xml in the SOURCE cluster.

To create the HBase peering, you need to know the DESTINATION ip or hostname of the node where the Zookeeper ensemble used by HBase is located. If the destination cluster is an Amazon EMR cluster this coincides with the EMR master node.

Once collected this information, from the SOURCE cluster execute the following commands to enable the peering with the destination cluster and start accumulating new data in the HBase WALs:

# The HBase command might be different in your Hadoop environment depending on
# how HBase was installed and which user is used to properly launch the cli.
# In most installations, it's sufficient to use the `hbase` command only.
HBASE_CMD="sudo -u hbase hbase"
MASTER_IP="**YOUR_MASTER_IP**" # e.g. ip-xxx-xx-x-xx.eu-west-1.compute.internal
PEER_NAME="aws"
TABLE_NAME="**YOUR_TABLE_NAME**"

## Create peering with the destination cluster
echo "add_peer '$PEER_NAME', CLUSTER_KEY => '$MASTER_IP:2181:/hbase'" | $HBASE_CMD shell

## List peers in the source cluster
echo "list_peers" | $HBASE_CMD shell

## Disable the peer just created, so that we can keep new data in the LOG (HBase WALs) until the snapshots are restored in the DESTINATION cluster
echo "disable_peer '$PEER_NAME'" | $HBASE_CMD shell

## enable replication for the tables to replicate
echo "enable_table_replication '$TABLE_NAME'" | $HBASE_CMD shell

Now you can switch to the DESTINATION cluster and restore the initial snapshot taken for the table. Once the restore is complete, switch again on the SOURCE cluster and enable the HBase peering to start replicating new data ingested in the SOURCE cluster since the initial SNAPSHOT was taken.

HBASE_CMD="sudo -u hbase hbase"
PEER_NAME="aws"
echo "enable_peer '$PEER_NAME'" | $HBASE_CMD shell

To monitor the replication status you could use the hbase command status 'replication' from the HBase shell on the SOURCE cluster.

Migrate HBase 1.x to HBase 2.x

When using HDFS

The migration path from HBase 1.x to HBase 2.x, can be accomplished using HBase snapshots if you're using HDFS as storage layer. In this case you can take a snapshot on the HBase 1.x cluster and then restore it on the HBase 2.x one. Although it is highly recommended to migrate to the latest version of HBase 1.4.x before migrating to HBase 2.x, it is still possible to migrate from older version of the 1.x branch (1.0.x, 1.1.x, 1.2.x, etc).

When using Amazon S3

When HBase is backed by Amazon S3, the migration process depends on the target EMR version. The schema and metadata structure differ significantly between HBase 1.x and 2.x, and certain HBase 2.x releases (starting with EMR 6.4.0) introduce breaking changes to the hbase:meta table.

Migration from EMR 5.x to EMR ≤ 6.3.1

If your target EMR release includes HBase ≤ 2.2.x (e.g., EMR 6.3.1), you can migrate directly from any EMR 5.x cluster running HBase 1.x. These versions maintain backward compatibility and automatically handle metadata evolution during cluster initialization.

Steps

Shut down the HBase 1.x cluster gracefully, flushing and disabling all the tables to prevent any data loss.
Launch a new EMR cluster (version ≤ 6.3.1) using the same S3 bucket for storage.
Allow HBase to initialize. The system will perform automatic schema upgrades as needed.
Verify access to tables.

Migration from EMR 5.x to EMR ≥ 6.4.0 or 7.x

From EMR 6.4.0 onward, HBase introduces additional metadata column families and schema validations. Direct migration from older HBase 1.x clusters can lead to startup failures and schema incompatibilities.

Attempting a direct migration from an older HBase 1.x cluster, will fail to initialize the HMaster service with the following exception:

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family table does not exist in region hbase:meta ...

Option 1: Staged Migration (Recommended for EMR 6.4.x – 7.3.x)

This two-phase migration approach leverages EMR 6.3.1 (HBase 2.2.6) as an intermediary step to safely handle schema evolution.

Steps

Stage 1 — Source Cluster Preparation (EMR 5.x)
- Flush and disable all tables to prevent any data loss.
- Terminate the cluster to prevent having two active clusters.
Stage 2 — Intermediate Migration (EMR 6.3.1)
- Launch an EMR 6.3.1 cluster pointing to the same S3 location.
- Allow HBase 2.2.6 to initialize and add missing column families automatically.
- Once stable, disable all tables again.
Stage 3 — Final Migration (EMR ≥ 6.4.x or 7.x)
- Launch the target EMR cluster version.
- Re-enable all tables.
- Validate data.

Option 2: Direct Migration with Schema Fix (For EMR ≥ 7.4)

Steps

Step 1 — Source Cluster Preparation (EMR 5.x)
- Flush and disable all tables to prevent any data loss.
- Terminate the cluster to prevent having two active clusters.

Step 2 — Configure Store File Tracking

Launch the target cluster to leverage the DefaultStoreFileTracker implementation by adding the following to your EMR configurations:

{
    "Classification": "hbase-site",
    "Properties": {
        "hbase.store.file-tracker.impl": "org.apache.hadoop.hbase.regionserver.storefiletracker.DefaultStoreFileTracker"
    }
}

Step 3 — Initialize HMaster
- On the first startup, HMaster will fail after automatically adding missing column families.
- It will abort with a PleaseRestartMasterException.
- Systemd will restart HMaster automatically, and subsequent initialization will succeed.
Step 4 — Post-Migration Actions
- Once HMaster is stable, enable all tables.
- Convert tables to the new Store File Tracker format if required. For example:
```
change_sft 'table_name','FILE'
```

For additional guidance on this procedure, see Migrating to Amazon EMR version 7.4.0 or later in the Amazon EMR documentation.

Summary

Approach	When to use?	Complexity
Batch - HBase Snapshots	Data doesn't change frequently or when you can tolerate high service downtime	Easy
Incremental - HBase Snapshots + Export	The data doesn't change frequently and you have large tables	Medium
Online - HBase Snapshots + Replication	Data changes frequently and high service downtime cannot be tolerated	Advanced

HBase snapshots​

Create a snapshot of a HBase table​

Copy the snapshot to an Amazon S3 bucket​

Restore Table when using Amazon S3​

Resources​

Snapshots with Incremental Export​

Example Export Script​

Example Import Script​

Snapshots with HBase Replication​

Create one-way peering: SOURCE → DESTINATION​

Migrate HBase 1.x to HBase 2.x​

When using HDFS​

When using Amazon S3​

Migration from EMR 5.x to EMR ≤ 6.3.1​

Migration from EMR 5.x to EMR ≥ 6.4.0 or 7.x​

Option 1: Staged Migration (Recommended for EMR 6.4.x – 7.3.x)​

Option 2: Direct Migration with Schema Fix (For EMR ≥ 7.4)​

Summary​

HBase snapshots

Create a snapshot of a HBase table

Copy the snapshot to an Amazon S3 bucket

Restore Table when using Amazon S3

Resources

Snapshots with Incremental Export

Example Export Script

Example Import Script

Snapshots with HBase Replication

Create one-way peering: SOURCE → DESTINATION

Migrate HBase 1.x to HBase 2.x

When using HDFS

When using Amazon S3

Migration from EMR 5.x to EMR ≤ 6.3.1

Migration from EMR 5.x to EMR ≥ 6.4.0 or 7.x

Option 1: Staged Migration (Recommended for EMR 6.4.x – 7.3.x)

Option 2: Direct Migration with Schema Fix (For EMR ≥ 7.4)

Summary