
ETL in the Cloud – MERGE statements

As an ETL (Extract, Transform, Load) developer on a traditional relational database, the MERGE process for managing dimension tables has always been a fairly straightforward one to me. Nearly every database platform offers some version of the MERGE statement that lets you Update, Delete, or Insert depending on which values have changed and whether the record already exists. ACID transactions on dimension tables are crucial for managing data accurately; they are table stakes for data warehouse development.
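As a point of reference, a typical RDBMS merge for a Type 1 dimension looks something like the sketch below. This is generic ANSI-style SQL; the dim_customer and stg_customer table and column names are hypothetical, not from any particular platform.

    -- Upsert staged changes into a dimension table (hypothetical names)
    MERGE INTO dim_customer AS tgt
    USING stg_customer AS src
        ON tgt.customer_id = src.customer_id
    WHEN MATCHED AND (tgt.address <> src.address OR tgt.email <> src.email) THEN
        UPDATE SET address = src.address, email = src.email
    WHEN NOT MATCHED THEN
        INSERT (customer_id, address, email)
        VALUES (src.customer_id, src.address, src.email);

One statement handles both the update and the insert paths atomically, which is exactly the behavior that proves hard to replicate in the big data frameworks discussed below.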
So, what are the options when migrating these processes to the Cloud? Ideally, you’re moving away from an RDBMS stored-procedure process and into a big data framework like Spark, Hive, or Pig. AWS offers these in various forms as a service, so you only pay for what you use; Glue and EMR mainly fill this role. However, the famed UPSERT is largely a neglected feature in big data frameworks, so you can quickly find yourself at the perfect intersection of ACID and BASE transactions. These differences in transaction models exist for good reason (a full table scan over terabytes of data just to update one value is one of them), but paying close attention to this gap when migrating your ETL to the Cloud is extremely important.
Here I outline a few of the popular big data frameworks many teams turn to first when migrating from SQL-based stored procedures to the Cloud, and their support for MERGE statements as of this writing. Many of these feature gaps can be closed with more advanced engineering tricks that are out of scope for this article.

Apache Spark

SparkSQL claims that you can run SQL statements to query unstructured data, which is absolutely true. However, it falls short on SQL updates: there is no support for UPDATE or MERGE statements at all. This removes it as a candidate for a Cloud equivalent of your on-premises merge process unless you’re willing to manage your dimensions elsewhere and truncate and load your final target tables regularly (e.g., daily).
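If you do go the truncate-and-load route, the closest Spark SQL idiom is INSERT OVERWRITE: union the current dimension with the day’s changes, keep the newest row per key, and overwrite the target in one pass. The sketch below uses hypothetical table and column names (dim_customer, stg_changes, load_ts); note that Spark will not let you overwrite a table it is simultaneously reading from, so in practice you write to a staging copy and swap.

    -- Rebuild the dimension: newest row per key wins (hypothetical names)
    INSERT OVERWRITE TABLE dim_customer_next
    SELECT customer_id, address, email
    FROM (
        SELECT customer_id, address, email, load_ts,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY load_ts DESC
               ) AS rn
        FROM (
            SELECT customer_id, address, email, load_ts FROM dim_customer
            UNION ALL
            SELECT customer_id, address, email, load_ts FROM stg_changes
        )
    )
    WHERE rn = 1

This is a full rebuild every run, which is workable for modest dimensions but exactly the kind of compute cost the ACID-versus-BASE gap implies for very large ones.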

Databricks Delta Lake

The release of Databricks Delta Lake goes a long way toward solving the ACID transaction problem by treating it as a data lake problem instead of a relational database one. Delta Lake now offers MERGE INTO to accomplish an UPSERT, simulating ACID transactions against a “table” in your data lake (more on what tables are in Delta Lake here). For example, applying the latest change per user to a users table:

    MERGE INTO users
    USING (
        SELECT userId, latest.address AS address, latest.deleted AS deleted FROM (
            SELECT userId, MAX(struct(time, address, deleted)) AS latest
            FROM changes GROUP BY userId
        )
    ) latestChange
    ON latestChange.userId = users.userId
    WHEN MATCHED AND latestChange.deleted = TRUE THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET address = latestChange.address
    WHEN NOT MATCHED AND latestChange.deleted = FALSE THEN
        INSERT (userId, address) VALUES (latestChange.userId, latestChange.address)

Managing your SCD (Slowly Changing Dimension) as a flat file veiled as a database table has a lot of benefits, but there is an important consideration if you are migrating a conformed dimension: do the other teams in your organization who use this dimension know how to interact with this data effectively? If not, you are still left with the final target-database update to solve for.

Apache Hive

As of Hive 2.2, a MERGE statement is available and shows some promise. While a Hive transactional table is not the relational database table we’re ultimately aiming for, this offers another way to handle SCD UPSERTs automatically if you’re willing to truncate and load your final database table.
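Hive’s syntax closely mirrors the classic RDBMS form, provided the target is a transactional (ACID) table. The sketch below uses hypothetical table and column names; note that Hive’s MERGE INSERT clause takes VALUES for the full column list of the target.

    -- Target must be a transactional, ORC-backed table, e.g.:
    -- CREATE TABLE dim_customer (customer_id BIGINT, address STRING, email STRING)
    --     STORED AS ORC TBLPROPERTIES ('transactional'='true');
    MERGE INTO dim_customer AS t
    USING stg_customer AS s
        ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
        UPDATE SET address = s.address, email = s.email
    WHEN NOT MATCHED THEN
        INSERT VALUES (s.customer_id, s.address, s.email);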
The downside to this process for extremely large dimension tables is that you’re rebuilding the table in Hive, which is not trivial in its compute needs. Additionally, these MERGE statements essentially produce separate insert, delete, and update operations for every row, so there is certainly more latency compared to non-ACID Hive tables. Further, this process requires the underlying files to be in ORC format.
In conclusion, if you are planning to move your ETL to the Cloud, the future state of your dimension tables is worth some heavy consideration. In some organizations, a heavily curated conformed dimension can span many business units, so it is imperative that the technology you choose to maintain your data is matched by enough technical aptitude across those teams. ACID transactions are still a very important part of many legacy data warehouses.