Stop burning cash on AWS crawlers

Real-time partition discovery at a fraction of the cost

A typical data lake setup in AWS is structured more or less like this:

  1. Something writes data to a landing zone in S3

  2. A Glue crawler scans the landing zone, looking for new partitions and schema changes

  3. The Glue data catalog is updated, reflecting the latest changes

You don't need the second step if your table schemas are stable.

Or better yet, you don't need to do it with crawlers; there's a faster and cheaper way.

Premise

One recommended way to update the Glue catalog when new partitions are created is to run a crawler. The same goes for schema evolution or new tables.

But it's far from the only way. For instance, if you use Glue ETL jobs, you can configure them to automatically update the data catalog without rerunning the crawlers.
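
As a minimal sketch of that option (assuming a standard Glue PySpark job; the database, table, and partition keys below are placeholders), the enableUpdateCatalog and partitionKeys options on the catalog sink are what tell Glue to register new partitions as the job writes:

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source data from the catalog (names are placeholders).
frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_source_table"
)

# Writing back through the catalog with enableUpdateCatalog registers any
# new partitions in the Glue data catalog without a crawler run.
glueContext.write_dynamic_frame_from_catalog(
    frame=frame,
    database="my_database",
    table_name="my_table",
    additional_options={
        "enableUpdateCatalog": True,
        "partitionKeys": ["year", "month", "day"],
    },
)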

What if your landing zone is populated by something other than Glue jobs?

  • Use crawlers for a simple (but costly) approach to automatic schema evolution.

  • If schema evolution is already handled for you, for example by dlthub, keep reading.

Partition discovery with AWS Lambda

You can use a combination of AWS Lambda + S3 Event Notifications to update the data catalog with new partitions. The cost is negligible, and updates land in near real time.

Step 1 - Setup

I'm assuming you already have a bunch of partitioned tables periodically updated by crawlers.

For starters, you need a way to map S3 paths to their respective tables in the Glue catalog. It can be a naming convention, a pattern, or a record in DynamoDB. How you do it is irrelevant as long as it's reliable.
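
For illustration, here's a minimal sketch of the naming-convention approach, assuming a hypothetical <prefix>/<database>/<table>/... layout in the landing zone; swap it for your own convention or a DynamoDB lookup:

def resolve_table(object_key):
    # Hypothetical convention: <prefix>/<database>/<table>/<partitions...>/<file>
    # e.g. "landing/sales/orders/year=2024/month=01/part-0000.parquet"
    _, database_name, table_name, *_ = object_key.split("/")
    return database_name, table_name

# resolve_table("landing/sales/orders/year=2024/month=01/part-0000.parquet")
# -> ("sales", "orders")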

Step 2 - Create a Lambda function

This is the first pillar of our solution. This function will update the tables' metadata in the Glue catalog.

The following code will manage the partition discovery:

import os
import boto3
from urllib.parse import unquote_plus

glue = boto3.client("glue")


class PartitionManager:
    def __init__(self, database_name, table_name, record):
        self.database_name = database_name
        self.table_name = table_name
        self.record = record
        self.storage_descriptor = glue.get_table(
            DatabaseName=self.database_name, Name=self.table_name
        )["Table"]["StorageDescriptor"]

    def update_partitions(self):
        # S3 event notifications URL-encode object keys;
        # `unquote_plus` decodes them (including "+" back into spaces).
        partition_path = unquote_plus(os.path.dirname(self.record["s3"]["object"]["key"]))
        # For the HIVE partitioning strategy: keep only the `key=value` path
        # segments and extract their values, preserving their order.
        partitions = [p for p in partition_path.split("/") if "=" in p]
        partition_values = [p.split("=")[1] for p in partitions]

        try:
            glue.get_partition(
                DatabaseName=self.database_name,
                TableName=self.table_name,
                PartitionValues=partition_values,
            )
        except glue.exceptions.EntityNotFoundException:
            # Build the partition location from the table's base location plus
            # the Hive `key=value` segments of the new object's key, so the
            # landing-zone prefix isn't duplicated in the path.
            base_location = self.storage_descriptor["Location"].rstrip("/")
            partition_location = base_location + "/" + "/".join(partitions)
            storage_descriptor = self.storage_descriptor.copy()
            storage_descriptor["Location"] = partition_location

            glue.create_partition(
                DatabaseName=self.database_name,
                TableName=self.table_name,
                PartitionInput={
                    "Values": partition_values,
                    "StorageDescriptor": storage_descriptor,
                }
            )

Then, for the Lambda handler:

def handler(event, context):
    records = event["Records"]

    for record in records:
        # 👉 Add your S3 path <-> Table mapping logic here 👈
        # It must resolve `database_name` and `table_name` for this record.

        manager = PartitionManager(database_name, table_name, record)
        manager.update_partitions()
💡 The assumption is that your data is partitioned in HIVE format (which I recommend). If it isn't, you need to adapt the update_partitions function.
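
Once the mapping logic is in place, you can sanity-check the function by invoking the handler with a synthetic S3 event (bucket and key below are made up):

fake_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-landing-bucket"},
                "object": {
                    "key": "landing/sales/orders/year=2024/month=01/part-0000.parquet"
                },
            }
        }
    ]
}

handler(fake_event, None)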

Step 3 - Enable S3 Notifications

Once you finish the Lambda function, you must configure its trigger. This is the second and last pillar of our solution.

Go to your S3 landing zone bucket and configure S3 Notifications to trigger your newly created function on s3:ObjectCreated:Put events. You can even add filtering rules to allow only specific prefixes or suffixes.
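
If you'd rather script that than click through the console, the same configuration can be applied with boto3 (bucket name, function ARN, and filter values are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-landing-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:partition-discovery",
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "landing/"},
                            {"Name": "suffix", "Value": ".parquet"},
                        ]
                    }
                },
            }
        ]
    },
)

Note that S3 must also be allowed to invoke the function; the Lambda console adds that resource-based permission automatically when you create the trigger from that side, otherwise grant it yourself (for example with the AddPermission API).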

Final notes

The above code is minimal, and you may want to add some guardrails, such as retries or a dead-letter queue (DLQ). Nevertheless, once you implement the mapping logic, it's almost ready to go.
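
As one example of such a guardrail (function name and queue ARN are placeholders), you can cap the retries for asynchronous invocations and ship failed events to an SQS dead-letter destination:

import boto3

lambda_client = boto3.client("lambda")

# S3 invokes the function asynchronously, so failure handling lives on the
# function's async invoke config: retry twice, then send the failed event
# to an SQS queue for inspection and replay.
lambda_client.put_function_event_invoke_config(
    FunctionName="partition-discovery",
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:eu-west-1:123456789012:partition-discovery-dlq"
        }
    },
)

The function's execution role then needs permission to send messages to that queue.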

Not sure where to get started? Drop us an email, and we'll see how we can help!
