<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[DataChef]]></title><description><![CDATA[DataChef is a big data consultancy based in Amsterdam. We help modern marketing and sales teams by demystifying data and simplifying information.]]></description><link>https://blog.datachef.co</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1704892077649/5jPx7BiLp.png</url><title>DataChef</title><link>https://blog.datachef.co</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 22 Apr 2026 21:24:13 GMT</lastBuildDate><atom:link href="https://blog.datachef.co/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How Product Organizations Scale Without Splitting Strategy from Execution]]></title><description><![CDATA[Quick recap from my previous article: In high-performing product organizations, strategy and execution aren't split between different roles. Every product team needs someone who owns both the "why" an]]></description><link>https://blog.datachef.co/how-product-organizations-scale-without-splitting-strategy-from-execution</link><guid isPermaLink="true">https://blog.datachef.co/how-product-organizations-scale-without-splitting-strategy-from-execution</guid><category><![CDATA[Product Management]]></category><category><![CDATA[Organization Design]]></category><category><![CDATA[team topologies]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Wed, 08 Apr 2026 15:05:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/673ca28eaf27dbd59d38eb71/be9d6136-eb48-40fa-b975-5138a377222f.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Quick recap from my</strong> <a href="https://blog.datachef.co/product-owner-product-manager-strategy-execution"><strong>previous article</strong></a><strong>:</strong> In high-performing product organizations, strategy and execution aren't split between different roles. Every product team needs someone who owns both the "why" and accountability for what is built: someone with strong commercial instinct, ownership of outcomes, and direct connection to execution.</p>
</blockquote>
<h2>"But does this scale?"</h2>
<p>Now let me address the inevitable objection: "Sure, but that integrated model might work at startup scale. Once you have hundreds of teams and complex enterprise environments, you <em>need</em> to split these responsibilities for scalability."</p>
<p>And while the scale at which you operate can influence your organizational design, a good principle is good at any size, and a bad principle is bad at any size.</p>
<p>There are several empirical examples of companies that scaled to massive size without splitting strategy from execution. Google doesn't have Product Owners. Neither do Cloudflare or Netflix. They scale not by fragmenting the role, but by managing scope: ensuring each Product Manager has a clear, bounded mandate where they can make meaningful impact without drowning in dependencies.</p>
<h2>A sustainable model to scale Product teams</h2>
<p>If you shouldn't split strategy from execution, how do you scale product management?</p>
<p>The answer is <strong>vertical, not horizontal.</strong> You scale by expanding scope, not by fragmenting responsibilities:</p>
<p><strong>Product Manager</strong> → Owns a single product or module with clear boundaries. Responsible for both strategy and execution. Measured on business outcomes and impact. Works with a squad sized and staffed appropriately to the scope. Following <a href="https://blog.datachef.co/fast-flow-conf-2025-team-topologies-language-of-flow">Team Topologies</a> principles, the squad must be optimized for fast flow and own the end-to-end value generation of a slice of the problem space. In startups, that slice will be very big. In huge corporates, that slice will be very small.</p>
<p><strong>Group Product Manager / Head of Product</strong> → Owns a portfolio of products or a complex product with multiple independent modules. Sets portfolio-level goals. Acts as people manager for Product Managers. Provides coordination across products and platforms.</p>
<p><strong>Director of Product / VP / CPO</strong> → Owns product organization vision, culture, tooling, and structure. Ensures product management practices scale effectively. Builds systems that empower PMs to make impact. In smaller companies that don’t need 2+ management layers, this role is usually merged with the previous one. In startups, this is usually the CTO (heading both the Engineering and Product functions).</p>
<p>In this structure, the Group PM isn't "doing strategy" while PMs "execute." Instead, the Group PM works at a higher level of abstraction—setting portfolio goals and ensuring alignment—while individual Product Managers maintain full ownership of both strategy <em>and</em> execution within their scope.</p>
<p>The Group PM asks: "Are we working on the right set of products to achieve our business objectives?"</p>
<p>The PM asks: "Is my product's roadmap delivering maximum impact on those objectives?"</p>
<p>Both are strategic. Both are connected to execution. The difference is scope, not separation of concerns.</p>
<img src="https://cdn.hashnode.com/uploads/covers/673ca28eaf27dbd59d38eb71/ac428878-4dac-436a-9330-7c72dbfa1868.png" alt="A metaphorical representation of the different scope managed by Product Managers, Group PMs, and Directors of Product." style="display:block;margin:0 auto" />

<h2>Assigning the right scope</h2>
<p><strong>The key to scaling is assigning the right scope to each Product Manager.</strong> Not so broad that they're drowning in dependencies and can't make meaningful decisions. Not so narrow that they lack autonomy or can't see how their work connects to business outcomes.</p>
<p>When you get the scope right and each PM has a clear mandate and a squad-sized team to execute it, you can scale this model from dozens to hundreds of product teams without introducing artificial role splits.</p>
<h3>What does "right scope" look like?</h3>
<ul>
<li><p><strong>Clear boundaries:</strong> The PM can make most decisions without constant cross-team coordination. There are no debates about ownership and no handoffs between teams in the value chain.</p>
</li>
<li><p><strong>Measurable impact:</strong> Success can be tied to specific business outcomes, not just feature completion.</p>
</li>
<li><p><strong>The right team:</strong> The PM works with a squad that can execute the roadmap without being stretched too thin or sitting idle. The squad’s combined skills allow the team to ship meaningful work without having to delegate work to another team.</p>
</li>
<li><p><strong>Strategic autonomy</strong>: The PM has enough freedom to experiment and adjust course based on learnings. The squad’s ability to ship new features is not constrained by heavy dependencies on other teams.</p>
</li>
</ul>
<h2>The takeaway: scaling is about scope, not splitting</h2>
<p>Scaling is possible even with a single person managing both product strategy and execution. The most important thing is setting the scope right and using the management chain to coordinate overarching topics.</p>
<p>If you’ve been following the broader “Product Owner vs Product Manager” debate from the <a href="https://blog.datachef.co/product-owner-product-manager-strategy-execution">previous article</a>, it can be tempting to treat scaling as a question of titles and role boundaries.</p>
<p>But the pattern underneath is the same: organizations split “strategy” from “execution” when they’re trying to compensate for a lack of true outcome ownership.</p>
<p>So the leadership takeaway is this: <strong>companies get the product management they incentivize.</strong> If you measure people on throughput, compliance to process, and “staying on plan,” you’ll get backlog managers and ceremony owners, regardless of whether you call them PMs or POs. If you give clear scope and accountability for impact, you’ll get product leaders who can hold the whole loop: problem → delivery → learning → outcomes.</p>
<p>When you get <em>that</em> right—scope and incentives aligned—your product organization can scale to hundreds of teams without ever needing to split product strategy from execution.</p>
]]></content:encoded></item><item><title><![CDATA[Legacy Migration Starts with Understanding, not Inventory]]></title><description><![CDATA[The Default Playbook
Legacy migration has an almost universal playbook:

Step 1 - Asset Discovery: Make an export of all assets in the environment to see what is there to migrate. The output is usuall]]></description><link>https://blog.datachef.co/legacy-migration-starts-with-understanding-not-inventory</link><guid isPermaLink="true">https://blog.datachef.co/legacy-migration-starts-with-understanding-not-inventory</guid><dc:creator><![CDATA[Shahin]]></dc:creator><pubDate>Tue, 07 Apr 2026 12:56:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6193e4c293892e4586936e3d/2bd9ce90-ebf7-4024-b6d8-0bcced7d1dde.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Default Playbook</h2>
<p>Legacy migration has an almost universal playbook:</p>
<ul>
<li><p><strong>Step 1 - Asset Discovery:</strong> Make an export of all assets in the environment to see what is there to migrate. The output is usually a massive spreadsheet or dashboard. The leadership sees the number and it becomes the anchor: We have X thousand resources to migrate.</p>
</li>
<li><p><strong>Step 2 - Categorize:</strong> Tag the assets based on their domain, owner and the type of migration they potentially require. This is usually done manually.</p>
</li>
<li><p><strong>Step 3 - Assign:</strong> Now that we know what we have, and who owns it, it's time to assign each slice to a team and say: These are yours. Assess them, decide what to do with each one, and report back.</p>
</li>
<li><p><strong>Step 4 - Teams Investigate and Act:</strong> Each team is expected to review their assigned assets, determine what's still needed, plan the migration, rewrite what's valuable, and flag the rest for deletion.</p>
</li>
<li><p><strong>Step 5 - Track and Report Progress:</strong> A program manager tracks completion rates. Dashboards show how many assets have been categorized, how many migrated, how many deleted, and the progress is measured as a percentage of the original inventory.</p>
</li>
</ul>
<h2>The Hidden Assumption</h2>
<p>Every step in this playbook sounds reasonable on its own. Together, they rest on assumptions that rarely hold:</p>
<h3>Every asset has an owner</h3>
<p>The plan assumes that for each resource, someone in the organization knows what it is, why it exists, and whether it still matters. In practice, organic environments accumulate resources that outlive the teams or individuals who created them. The owner left, the project ended, but the assets are still running and still incurring costs.</p>
<h3>Assets map to products or business domains</h3>
<p>The plan assumes you can draw lines from resources to business capabilities:</p>
<ul>
<li><p>these tables belong to marketing analytics.</p>
</li>
<li><p>those belong to finance reporting.</p>
</li>
<li><p>...</p>
</li>
</ul>
<p>In reality, only a small percentage of assets map cleanly to such groups. Most are organized around pipelines, ad-hoc SQL, and intermediate computation steps, not around products.</p>
<h3>Inventory equals understanding</h3>
<p>This is the deepest assumption. Listing everything feels like progress. But any environment painful enough (in cost or maintenance) to justify migration has accumulated significant scale. The inventory will contain hundreds of thousands of assets with no inherent grouping. It feels like the first logical step. It's actually the first illusion of control.</p>
<h3>Teams can self-serve classification</h3>
<p>The plan assumes you can hand teams a filtered list and they'll sort it out: keep, migrate, delete. But this requires that:</p>
<ol>
<li><p>Teams actually recognize the assets, which usually means reverse-engineering their history.</p>
</li>
<li><p>The rest of the organization, dependent on those resources, holds still. But what if they change their process during the migration, and with it, their requirements?</p>
</li>
</ol>
<h3>Effort scales with volume</h3>
<p>The intuition is: twice as many assets, twice as much work. But in reality, effort scales with ambiguity. A thousand well-structured, well-known assets might be easier to migrate than a handful of orphaned, unnamed ones. The cost is in the investigation and understanding, not the count.</p>
<h2>What You Actually Find</h2>
<p>I recently faced this exact situation, and the reality was eye-opening. A data warehouse that had accumulated tables, data pipelines, and scheduled queries over the past 10 years. The tables alone numbered over half a million.</p>
<p>No reasonable grouping would make that number workable. Not even if every technical member focused solely on migration and parked all other work.</p>
<p>So I asked a different question: do we really need to migrate all of this? Asking teams wasn't an option, for all the reasons above. But I could observe the environment directly. Are any of these tables actually being read?</p>
<p>And the result was surprising. Over 99% of the tables hadn't been accessed in the past 90 days. A large portion had never been accessed at all. Of those that were accessed, only a small percentage showed consistent, ongoing usage.</p>
<p>We stopped asking teams to review the full inventory. The migration was not about moving half a million assets to a new platform. It was about finding the ones that were still alive.</p>
<h2>Why Understanding IS the Migration</h2>
<p>The default playbook frames migration as logistics: X thousand assets need to move from A to B, and progress is the percentage that have moved. This treats every asset as a unit of work. But the real unit of work is not the asset. It is the decision: does this asset still matter?</p>
<p>That reframing changes the shape of the project. A logistics project scales with volume: twice the assets, twice the effort. A decision project scales with ambiguity. And ambiguity is not evenly distributed. In our case, one query against 90 days of usage data made the decision for 99% of the environment. No team meetings, no reverse-engineering, no spreadsheets. The remaining 1% was the only part that needed human judgment at all.</p>
<p>This is why inventory-first feels right but leads to pointless activities. An inventory gives leadership a number they can put on a slide: 500,000 assets to migrate. But that number tells you nothing about how many decisions you actually face. It treats an orphaned table untouched for five years the same as a pipeline feeding a daily business report. Usage data makes them obviously different. The inventory makes them equal.</p>
<p>The default playbook also misidentifies what a legacy environment is. It assumes a system: something designed, something you can pick up and relocate. But a ten-year-old data warehouse is not a system. It is an accumulation: layers of decisions made by people who are no longer around, solving problems that may no longer exist. You do not relocate an accumulation. You find what is still alive inside it and build forward from there. Martin Fowler describes the general pattern as a <a href="https://martinfowler.com/bliki/StranglerFigApplication.html">Strangler Fig</a>: the new system grows around the old one while the old one atrophies. But in a mature legacy environment, most of the atrophy has already happened. The tables are already dead. The pipelines already stopped. You just haven't confirmed it yet.</p>
<p>Understanding the environment does not prepare you for the migration. It is the migration. Once you know what is alive, the rest is cleanup.</p>
<h2>An Alternative: Quarantine Before You Classify</h2>
<p>The alternative requires a different starting point: let usage tell you what matters, instead of asking teams to figure it out from a spreadsheet.</p>
<p>Start with what you can act on immediately. Usage data (who accessed what, and when) separates the living parts of the environment from the dead ones, without requiring anyone to understand what each asset is.</p>
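<p>As a minimal sketch of that separation, assuming you can export a per-asset last-access timestamp from your platform's audit or access logs (the asset names and field names below are invented for illustration):</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone

# Hypothetical export from the warehouse's access logs:
# one row per asset, with the timestamp of its last read (None = never read).
assets = [
    {"name": "sales.daily_orders", "last_read": datetime(2026, 4, 1, tzinfo=timezone.utc)},
    {"name": "tmp.scratch_2019_q3", "last_read": None},
]

cutoff = datetime.now(timezone.utc) - timedelta(days=90)

# "Alive" means read within the window; everything else is a quarantine candidate.
alive = [a for a in assets if a["last_read"] and a["last_read"] &gt;= cutoff]
candidates = [a for a in assets if a not in alive]

print(f"{len(alive)} live assets, {len(candidates)} quarantine candidates")
</code></pre>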
<p>Once you have that separation, introduce a quarantine window. If a resource hasn't been accessed for a defined period (say 90 days), block access to it and notify the teams. If nobody requests restoration within another 90 days, back it up and delete it. If someone does need it, you've just identified a genuinely valuable asset: route it to the appropriate team and start a proper migration lifecycle for it.</p>
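<p>The quarantine rule itself is simple enough to express as a small state function. Here is a sketch, with both 90-day windows as assumptions to tune for your environment:</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone

QUIET = timedelta(days=90)  # no reads for this long: enter quarantine
GRACE = timedelta(days=90)  # quarantined for this long: back up, then delete

def lifecycle_state(last_read, quarantined_at, now=None):
    """Classify an asset as 'active', 'quarantined', or 'deletable'.
    last_read may be None for assets that were never accessed."""
    now = now or datetime.now(timezone.utc)
    if last_read is not None and now - last_read &lt; QUIET:
        return "active"       # genuinely in use: route to a proper migration
    if quarantined_at is None or now - quarantined_at &lt; GRACE:
        return "quarantined"  # block access, notify teams, await restore requests
    return "deletable"        # nobody asked for it back
</code></pre>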
<p>The important assets identify themselves. Instead of teams sifting through hundreds of thousands of items, the environment surfaces its own priorities through actual usage. The noise falls away on its own.</p>
<p>Postpone permanent deletion as long as practically possible. Data loss is the one mistake you can't reverse. Quarantine simulates deletion before you commit to it, and catches seasonal patterns that a 90-day snapshot would miss.</p>
<p>With this model, you can report meaningful progress monthly: deadwood percentage, reduction in active resources, cleanup projections. Metrics an executive can act on, without requiring teams to manually work through irrelevant assets.</p>
<h2>The Question to Ask About Your Own Environment</h2>
<p>Before committing to the default playbook, ask one question about your legacy environment: what percentage of your assets have been accessed in the last 90 days?</p>
<p>If the answer is low, and it may be far lower than you expect, you are not facing an inventory problem. You are facing an archaeology problem. Archaeology does not start with a catalog. It starts with a question: what here is still alive?</p>
]]></content:encoded></item><item><title><![CDATA[How we use Claude to write code]]></title><description><![CDATA[Claude is a tool. It helps us think faster and write code faster. But the developer is still in charge. The goal is simple: better code, clear thinking, and full responsibility.
This document explains]]></description><link>https://blog.datachef.co/how-we-use-claude-to-write-code</link><guid isPermaLink="true">https://blog.datachef.co/how-we-use-claude-to-write-code</guid><category><![CDATA[claude]]></category><category><![CDATA[AI]]></category><category><![CDATA[#ai-tools]]></category><dc:creator><![CDATA[Farbod Ahmadian]]></dc:creator><pubDate>Thu, 26 Mar 2026 14:11:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6621274e5f2317db9857f023/2b69739c-f0b7-4a60-ba02-d77aece62f51.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Claude is a tool. It helps us think faster and write code faster. But the developer is still in charge. The goal is simple: better code, clear thinking, and full responsibility.</p>
<p>This document explains how we use Claude at DataChef.</p>
<hr />
<h2>1. Always commit your <code>claude.md</code></h2>
<p>Every project that uses Claude must have a <code>claude.md</code>.</p>
<p>This file explains how Claude should behave in the project.</p>
<p>It should include things like:</p>
<ul>
<li><p>coding style</p>
</li>
<li><p>architecture rules</p>
</li>
<li><p>libraries we prefer</p>
</li>
<li><p>things we never do</p>
</li>
<li><p>how tests should look</p>
</li>
<li><p>how commits should look</p>
</li>
</ul>
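<p>For illustration, a minimal <code>claude.md</code> could look like this (the specific rules are invented examples; yours will reflect your own project):</p>
<pre><code class="language-markdown"># Project conventions for Claude

## Style
- Python 3.12, formatted with ruff, type hints on all public functions

## Architecture
- Business logic lives in core/; core/ never imports from api/

## Never do
- No new dependencies without discussing them in the PR first
- Never commit secrets or .env files

## Tests
- Every new function gets a unit test, including at least one failure case

## Commits
- Conventional Commits (feat:, fix:, refactor:), imperative mood
</code></pre>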
<p>Why this matters:</p>
<p>Claude works best when it has context. Without context it guesses. With context it becomes consistent.</p>
<p>Your <code>claude.md</code> is the memory of the project.</p>
<p>Commit it to the repository so everyone works with the same rules.</p>
<hr />
<h2>2. Put your important prompts in the pull request</h2>
<p>When Claude helps write code, the reviewer should know how that code was created.</p>
<p>If a prompt had a big influence on the result, include it in the pull request description.</p>
<p>This helps reviewers understand:</p>
<ul>
<li><p>the intent of Claude</p>
</li>
<li><p>the reasoning of Claude</p>
</li>
<li><p>what Claude was asked to do</p>
</li>
</ul>
<p>It also makes it easier to reproduce or improve the result later.</p>
<p>Transparency builds trust.</p>
<hr />
<h2>3. Teach Claude not to sound like AI</h2>
<p>Add a new <a href="https://support.claude.com/en/articles/12512176-what-are-skills">skill</a> that teaches Claude not to sound like AI, avoiding these patterns:</p>
<p><a href="https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing">https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing</a></p>
<p>AI text often looks like:</p>
<ul>
<li><p>overly formal language</p>
</li>
<li><p>repetitive structures</p>
</li>
<li><p>too many bullet points</p>
</li>
<li><p>vague explanations</p>
</li>
<li><p>generic transitions</p>
</li>
</ul>
<p>We want writing that sounds human and direct.</p>
<p>Short sentences. Clear thinking. No filler.</p>
<hr />
<h2>4. Never use <code>--dangerously-skip-permissions</code> in production</h2>
<p>Do not start Claude with <code>--dangerously-skip-permissions</code>.</p>
<p>Always review what Claude wants to do.</p>
<p>Read each prompt and tool request before approving it.</p>
<p>You should always know:</p>
<ul>
<li><p>what files Claude reads</p>
</li>
<li><p>what files Claude changes</p>
</li>
<li><p>what commands it runs</p>
</li>
</ul>
<hr />
<h2>5. Claude writes code. You own the code.</h2>
<p>Claude can generate code, but you are responsible for it.</p>
<p>Always:</p>
<ul>
<li><p>read the code</p>
</li>
<li><p>understand the code</p>
</li>
<li><p>question the code</p>
</li>
</ul>
<p>If you cannot explain a change to another developer, do not merge it.</p>
<hr />
<h2>6. Ask for small steps</h2>
<p>Do not ask Claude to build a whole system in one prompt.</p>
<p>Work in small steps.</p>
<p>Example flow:</p>
<ol>
<li><p>ask Claude to design the approach</p>
</li>
<li><p>review the plan</p>
</li>
<li><p>implement one part</p>
</li>
<li><p>review again</p>
</li>
<li><p>continue</p>
</li>
</ol>
<p>Small steps reduce mistakes.</p>
<hr />
<h2>7. Prefer editing over generating</h2>
<p>If a file already exists, ask Claude to improve or refactor it.</p>
<p>Do not ask it to rewrite everything.</p>
<p>Large rewrites often introduce hidden problems.</p>
<p>Good prompts look like:</p>
<ul>
<li><p>"simplify this function"</p>
</li>
<li><p>"remove duplication"</p>
</li>
<li><p>"add tests for this logic"</p>
</li>
<li><p>"explain the edge cases"</p>
</li>
</ul>
<hr />
<h2>8. Always ask for tests</h2>
<p>If Claude writes logic, it should also suggest tests.</p>
<p>Tests help verify that the code does what we expect.</p>
<p>Good prompts include:</p>
<ul>
<li><p>"write unit tests for this function"</p>
</li>
<li><p>"add edge cases"</p>
</li>
<li><p>"show failure scenarios"</p>
</li>
</ul>
<p>You can add this as part of your <code>claude.md</code> to make sure you never forget.</p>
<hr />
<h2>9. Ask Claude to explain its reasoning</h2>
<p>Before accepting a change, ask Claude questions.</p>
<p>Examples:</p>
<ul>
<li><p>why is this approach better</p>
</li>
<li><p>what edge cases exist</p>
</li>
<li><p>what could break</p>
</li>
<li><p>what are the performance risks</p>
</li>
</ul>
<p>Claude is good at surfacing hidden issues when asked directly.</p>
<hr />
<h2>10. Keep prompts simple and direct</h2>
<p>Claude works best with clear instructions.</p>
<p>Bad prompt:</p>
<p>"Can you improve this in a robust scalable architecture that follows best practices?"</p>
<p>Better prompt:</p>
<p>"Reduce complexity in this function. Do not change the behavior."</p>
<p>Clarity produces better results.</p>
<hr />
<h2>11. Use Claude as a thinking partner</h2>
<p>Claude is not just for writing code.</p>
<p>It is useful for:</p>
<ul>
<li><p>debugging</p>
</li>
<li><p>reading unfamiliar code</p>
</li>
<li><p>designing APIs</p>
</li>
<li><p>writing migrations</p>
</li>
<li><p>reviewing pull requests</p>
</li>
<li><p>explaining errors</p>
</li>
</ul>
<p>Treat it like a second developer who thinks fast.</p>
<p>But remember: you are the final reviewer.</p>
<hr />
<h2>12. Leave the codebase better</h2>
<p>Every Claude-assisted change should improve the codebase.</p>
<p>Examples:</p>
<ul>
<li><p>clearer naming</p>
</li>
<li><p>better structure</p>
</li>
<li><p>fewer lines</p>
</li>
<li><p>stronger tests</p>
</li>
<li><p>simpler logic</p>
</li>
</ul>
<p>Speed is helpful. Quality is the goal.</p>
<hr />
<h2>Final rule</h2>
<p>Claude is powerful.</p>
<p>But good engineering still comes from:</p>
<ul>
<li><p>careful thinking</p>
</li>
<li><p>good reviews</p>
</li>
<li><p>clear communication</p>
</li>
<li><p>responsibility for the code</p>
</li>
</ul>
<p>Use Claude to move faster.</p>
<p>Do not use it to stop thinking.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing DANA: Conversational AI for Legacy System Modernization]]></title><description><![CDATA[Imagine your data expert could answer every team's questions at once, without a single meeting. That's DANA. Built by DataChef, DANA is a conversational AI that learns from your domain experts, unders]]></description><link>https://blog.datachef.co/introducing-dana-conversational-ai-for-legacy-system-modernization</link><guid isPermaLink="true">https://blog.datachef.co/introducing-dana-conversational-ai-for-legacy-system-modernization</guid><dc:creator><![CDATA[Alireza Ebrahimkhani]]></dc:creator><pubDate>Mon, 09 Mar 2026 13:08:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/689894d630ab7b2b509d0ee7/02218bf5-2cd6-44a0-a8c5-ed3ee4dfe8c8.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine your data expert could answer every team's questions at once, without a single meeting. That's DANA. Built by DataChef, DANA is a conversational AI that learns from your domain experts, understands your schemas, and gives your teams instant, reliable answers about your data: how it maps, what it means, and how to use it.</p>
<p>No more bottlenecks. No more waiting. Just ask DANA (Data and AI knowledge Assistant).</p>
<h2>The Problem: One Expert, Ten Teams, a Hundred Questions</h2>
<p>Large organizations undergoing system migrations or platform integrations know this scenario well. A domain expert owns the knowledge about how the data model works. Business analysts across multiple downstream teams need answers to move forward:</p>
<ul>
<li><p><em>"I'm consuming a field from the old system. What's the equivalent in the new data model?"</em></p>
</li>
<li><p><em>"Can this field be empty, or is it always required?"</em></p>
</li>
<li><p><em>"What does this status code mean and how should I use it?"</em></p>
</li>
<li><p><em>"What are the right fields to get the official name at different hierarchy levels?"</em></p>
</li>
</ul>
<p>These are the kinds of questions that come up dozens of times a day during a migration. The answers exist, but they're scattered across Confluence pages, Excel mappings, design documents, and most critically, in the domain expert’s head.</p>
<p>The architect spends their days answering repetitive questions instead of doing architectural work. Analysts wait hours or days for responses. Migration timelines slip. And when the architect is unavailable, progress stops entirely.</p>
<blockquote>
<p>💡 This is not a tooling problem. It's a knowledge bottleneck. And you can't solve it by buying another documentation tool that nobody will maintain.</p>
</blockquote>
<h2>Why Existing Tools Don't Solve This</h2>
<p>You might think: "We already have a data catalog" or "We documented everything in Confluence." The reality is that most organizations have tried at least one of these approaches, and they all fall short in the same way.</p>
<p><strong>Data catalogs</strong> are great for storing metadata, and many support field descriptions, glossary links, and even column-level lineage. But that information still needs to be written and maintained by someone. And when it comes to cross-system mappings during a migration, how a field in the old model translates to the new one, under what conditions, and with what edge cases, catalogs typically don't capture that level of contextual knowledge.</p>
<p><strong>Documentation wikis</strong> (Confluence, SharePoint, Notion) become outdated the moment they're written. Maintaining them requires manual effort that nobody has time for. After a few months, teams stop trusting them and go back to asking the domain expert directly.</p>
<p><strong>Schema extraction tools</strong> can pull table structures and column types from a database. But <code>CUST_FLG_02</code> is still just <code>CUST_FLG_02</code>. Without a business context, raw schema data is only marginally useful.</p>
<p><strong>Documentation sprints</strong> work in theory: you get everyone in a room, write everything down, and publish it. In practice, the knowledge is stale before the sprint is over, and you've burned weeks of your domain expert's time in the process.</p>
<p>The common thread? All of these approaches treat knowledge capture as a one-time event. But domain knowledge evolves constantly, especially during migrations. You need something that learns continuously and stays up to date.</p>
<h2>What is DANA?</h2>
<p>DANA is a conversational AI that sits between your domain experts and your teams. Instead of trying to replace the architect, DANA <strong>learns from them</strong> through natural conversation, and then makes that knowledge available to everyone on demand.</p>
<p>Think of it as a colleague who has perfect memory. The domain expert explains something once, and DANA remembers it, structures it, and can repeat it accurately to anyone who asks.</p>
<p>But unlike a wiki page, DANA doesn't just sit there waiting to go stale. It actively keeps itself up to date. When team members ask questions that reveal gaps, DANA flags those gaps and prompts the domain expert to fill them. When new schemas are detected or data models change, DANA identifies what's outdated and initiates a review. The architect doesn't need to remember to update DANA; DANA tells the domain experts what needs attention.</p>
<p>DANA combines three things that are usually separate: <strong>schema extraction</strong> (understanding the technical structure of your data), <strong>knowledge enrichment</strong> (adding business context and meaning), and a <strong>conversational AI interface</strong> (so humans can teach it and query it naturally). Together, these create a knowledge base that isn't just a static document, but a living system that grows smarter with every interaction.</p>
<h2>How DANA Works</h2>
<h3>1. The Architect Teaches, DANA Remembers</h3>
<p>A data expert talks to DANA the same way they'd talk to a colleague. They explain how the data model works, clarify field meanings, describe mappings between old and new systems, and flag edge cases. DANA captures these explanations, structures them, and stores them as validated knowledge.</p>
<p>No documentation writing. No spreadsheet filling. Just conversation.</p>
<p>Here's what a session might look like:</p>
<blockquote>
<p><strong>Architect:</strong> "The product code field in the legacy system maps to a different field name in the new domain model. It's always populated for active records, but can be null for discontinued items."</p>
<p><strong>DANA:</strong> "Got it. I've recorded the mapping between those two fields. Populated for active records, nullable for discontinued. Is there a specific status field I should check to determine if a record is discontinued?"</p>
<p><strong>Architect:</strong> "Yes, there's a lifecycle status field. If it's set to 'DISCONTINUED', the product code may be null."</p>
<p><strong>DANA:</strong> "Understood. I've linked this to the lifecycle status field as a dependency. <strong>Changed:</strong> product code mapping documented with nullability rule. <strong>Gaps:</strong> I still don't have mappings for 3 other attributes in the pricing group. <strong>Next:</strong> Want to work on those, or switch to a different area?"</p>
</blockquote>
<p>DANA doesn't just record what you say. It asks follow-up questions, connects related facts, tracks what's still missing, and tells you what to work on next.</p>
<blockquote>
<p>Rather than uploading large volumes of raw internal data, the architect teaches DANA the concepts and mappings it needs. This keeps sensitive client data secure while still building a rich, contextual knowledge base.</p>
</blockquote>
<h3>2. Schema Understanding</h3>
<p>DANA also ingests schemas directly from your databases and warehouses. It connects to systems like Oracle, SQL Server, Teradata, and PostgreSQL, and extracts the technical picture: tables, columns, data types, primary and foreign keys, stored procedures, and views.</p>
<p>But DANA goes a step further than typical schema extraction. It also analyzes <strong>query execution history</strong> to understand which parts of the data are actually in use. Which tables are queried daily? Which ones haven't been touched in years? Which columns show up in the most joins?</p>
<p>This usage profiling helps teams prioritize. Instead of trying to document every table in a legacy system with thousands of objects, you start with the ones that actually matter to the business.</p>
<h3>3. The Team Asks, DANA Answers</h3>
<p>Once DANA has learned from the architect, anyone on the team can ask questions directly through the conversational AI interface. And they get real answers, not just links to documents.</p>
<p>Here's what that looks like in practice:</p>
<blockquote>
<p><strong>Data Engineer:</strong> "I'm consuming the product code from the legacy system. What is the equivalent attribute in the new domain model?"</p>
<p><strong>DANA:</strong> "The equivalent field in the new model is the product identifier. Note: this field can be null for discontinued records (based on the lifecycle status). For active records, it's always populated.</p>
<p><em>Source:</em></p>
<ul>
<li><p><em>domain architect name</em></p>
</li>
<li><p>relevant document links"</p>
</li>
</ul>
</blockquote>
<p>DANA provides answers with references to where the knowledge came from.</p>
<p>Domain experts teach DANA what they know. Downstream teams consume that knowledge on demand. DANA tracks what it has learned, what is still missing, and what needs validation, creating a living knowledge base that improves with every interaction.</p>
<h3>4. DANA Evolves With Your Data</h3>
<p>DANA's knowledge isn't a snapshot; it evolves as your data model evolves. Every question from the team that can't be fully answered becomes a signal: a gap that DANA tracks and raises with the architect in the next session. When schemas change, or new source systems are added, DANA detects what's new and flags what needs to be reviewed or remapped. This creates a continuous feedback loop: the team's questions drive the architect's priorities, and the architect's answers expand what the team can self-serve. Over time, DANA covers more ground, answers more questions, and requires less input from the architect.</p>
<hr />
<h2>What Your Teams Get</h2>
<ul>
<li><p><strong>Instant answers</strong>: No more waiting for the architect's calendar to open up. Teams get reliable, referenced answers through the conversational AI interface whenever they need them.</p>
</li>
<li><p><strong>Consistent knowledge</strong>: Everyone gets the same validated information. When the architect clarifies a mapping with DANA, that clarification is immediately available to all teams.</p>
</li>
<li><p><strong>Reduced key-person risk</strong>: Domain knowledge is no longer locked in one person's head. It's captured, structured, and accessible, even when key people move on.</p>
</li>
<li><p><strong>Faster migrations</strong>: Impact analysis that used to take weeks of back-and-forth can happen in days when teams can self-serve.</p>
</li>
<li><p><strong>Data contracts</strong>: DANA generates YAML-based data contracts following the <a href="http://datacontracts.com">datacontracts.com</a> specification. A data contract defines the structure, meaning, ownership, and rules of a dataset in a machine-readable format. Think of it as an API contract, but for data. These contracts can be versioned in Git, published to your data catalog, and used to enforce data quality standards across teams. A sketch of what such a contract can look like follows this list.</p>
</li>
</ul>
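<p>To give a flavor of the output, here is a minimal sketch in the spirit of that specification. The id, field names, and enum values are invented, and the exact schema should be checked against the specification itself:</p>
<pre><code class="language-yaml">dataContractSpecification: 1.1.0
id: urn:datacontract:checkout:orders   # invented example id
info:
  title: Orders
  version: 1.0.0
  owner: checkout-team                 # who to contact about this dataset
models:
  orders:
    type: table
    fields:
      order_id:
        type: string
        required: true
        description: Stable unique identifier, consistent across systems.
      lifecycle_status:
        type: string
        enum: [ACTIVE, DISCONTINUED]   # product identifier may be null when DISCONTINUED
</code></pre>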
<hr />
<h2>Who is DANA For?</h2>
<p>DANA is built for organizations where data knowledge is the bottleneck:</p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>How DANA Helps</th>
</tr>
</thead>
<tbody><tr>
<td><strong>System Migrations</strong></td>
<td>Teams understand field mappings and model differences before they start building, not after. DANA captures the "why" behind the mapping, not just the "what."</td>
</tr>
<tr>
<td><strong>Platform Onboarding</strong></td>
<td>Data contracts and catalog entries are generated from validated knowledge, not guesswork. New platforms get clean, documented data from day one.</td>
</tr>
<tr>
<td><strong>Data Governance</strong></td>
<td>Schema documentation is explainable, reviewable, and traceable to its source. Auditors can see where every description came from.</td>
</tr>
<tr>
<td><strong>Knowledge Preservation</strong></td>
<td>Institutional knowledge is captured through conversational AI before it walks out the door. When key people move on, the knowledge stays.</td>
</tr>
<tr>
<td><strong>Scaling Data Teams</strong></td>
<td>New team members get up to speed by asking DANA instead of requiring hours of the architect's time. Onboarding becomes self-service.</td>
</tr>
</tbody></table>
<hr />
<h2>Conclusion</h2>
<p>Your domain experts shouldn't be answering the same questions over and over. And your teams shouldn't be waiting in line for answers that already exist somewhere.</p>
<p>DANA turns conversations into knowledge, and knowledge into self-service. It learns continuously, stays consistent, and scales to every team that needs it.</p>
<p>That's conversational AI applied to data. And it changes how organizations approach legacy modernization.</p>
<p>Interested in DANA or facing a similar challenge? Reach out to us at <a href="https://datachef.co/contact">datachef.co/contact</a> or connect with us on <a href="https://www.linkedin.com/company/datachefco">LinkedIn</a>.</p>
<p>#conversational-ai #data-engineering #data-contracts #legacy-modernization</p>
]]></content:encoded></item><item><title><![CDATA[The Missing Right Side of Your dbt DAG]]></title><description><![CDATA[If you maintain a dbt project long enough, you end up with a familiar problem: you can see what’s upstream of a model, but it is surprisingly hard to answer what is downstream in the real world. Which]]></description><link>https://blog.datachef.co/the-missing-right-side-of-your-dbt-dag</link><guid isPermaLink="true">https://blog.datachef.co/the-missing-right-side-of-your-dbt-dag</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[documentation]]></category><category><![CDATA[dbt]]></category><dc:creator><![CDATA[Boris Morel]]></dc:creator><pubDate>Mon, 09 Mar 2026 10:18:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68baa704fe509259a5bd6efa/0142cc58-7dde-47f2-b0be-f7b203c2140b.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you maintain a dbt project long enough, you end up with a familiar problem: you can see what’s upstream of a model, but it is surprisingly hard to answer what is <em>downstream</em> in the real world. Which dashboards, reports, ML jobs, or extracts depend on this thing, and who owns them? That uncertainty makes refactors risky, and it makes the impact of data incidents much harder to assess quickly.</p>
<p>dbt <strong>exposures</strong> are a lightweight way to document those consumers <em>as code</em>, right next to the models and sources they depend on. They turn “tribal knowledge” into something reviewable, queryable, and much harder to accidentally forget.</p>
<h1><strong>Why “docs outside the repo” drift</strong></h1>
<p>Those who have worked with me might know that I take a very careful approach when it comes to documentation, especially when it is decoupled from the codebase. Reasons include:</p>
<ul>
<li><p>When someone is reading a piece of code, they might not think of looking elsewhere (Confluence, SharePoint) for extra insight.</p>
</li>
<li><p>You can be certain that shortly after the documentation is published, it will become outdated, because code contributors forget or do not prioritize documentation updates.</p>
</li>
</ul>
<p>On the other hand, context about logic someone went through the trouble to implement provides enormous value. In the case of dbt models: why do we have a model? Who or what consumes it? What are the consequences if it breaks, or if we remove or alter it? Who should we reach out to before we clean up a legacy model?</p>
<p>An intermediate model (i.e. one that is upstream of another model) has an obvious purpose. Its usage is self-documented: just look at its child models. But what about all the models <em>dangling</em> at the far right end of the DAG in your dbt project?</p>
<p>Most users are aware that models and data sources can be enriched with documentation and descriptions. But dbt also offers a powerful way to document the <em>usage</em> of models as code. This feature may not be well known, so in this post I will explain it and (hopefully) convince more dbt practitioners to adopt it.</p>
<h1><strong>Defining exposures in YAML</strong></h1>
<p>In your dbt project, you can add one or more YAML files that declare <strong>exposures</strong>. Let’s first look at how to do it (slightly modified example from the <a href="https://docs.getdbt.com/docs/build/exposures">official documentation</a>):</p>
<pre><code class="language-yaml">exposures:

  - name: weekly_jaffle_metrics
    label: Jaffles by the Week
    type: dashboard
    maturity: high
    url: https://bi.tool/dashboards/1
    description: &gt;
      Did someone say "exponential growth"?

    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')

    owner:
      name: Callum McData
      email: data@jaffleshop.com
</code></pre>
<p>Once you add exposures and generate docs (<code>dbt docs generate</code>), they appear as first-class entities in the dbt documentation website: you can browse them, see their descriptions, owners, and links (for example, to the dashboard), and navigate their lineage to the upstream models and sources they depend on. In other words, they show up alongside models and sources in the docs UI, instead of living in a separate wiki.</p>
<h1><strong>Keeping documentation from drifting</strong></h1>
<p>An advantage of dbt exposures over, for example, a Confluence page is that they live close to the code. If a developer comes across a model, they can find the exposure easily because it sits in the same repository as the models. Thanks to this proximity alone, it is less likely to go out of sync. But another characteristic of exposures makes them even less likely to drift away from reality:</p>
<p>If a model referenced by an exposure (in this case <code>fct_orders</code> or <code>dim_customers</code>) is deleted or renamed, the dbt project won't be able to compile:</p>
<pre><code class="language-shell">% dbt test

Encountered an error:

Compilation Error

Exposure 'exposure.weekly_jaffle_metrics' (models/exposures.yml) depends on a node named 'fct_orders' which was not found
</code></pre>
<p>This way, the contributor will be alerted that this model is being used and that care should be taken before going ahead with the change. This is because exposures are nodes in the graph of the dbt project, just like models and sources. If you reference a missing node, the project can’t compile.</p>
<p>So far, this probably sounds like exposures “solve” the problem. They help a lot, but they are not a silver bullet.</p>
<h1><strong>Limitations (and what exposures are not)</strong></h1>
<p>Exposures are useful, but it helps to be explicit about their limits:</p>
<ul>
<li><p>They do not catch all breaking changes. A column rename or semantic logic change can still break a dashboard even if the exposure compiles.</p>
</li>
<li><p>They document <em>known</em> consumers. The absence of an exposure does <strong>not</strong> prove that a model is unused. It may simply mean the consumer has not been documented yet.</p>
</li>
</ul>
<p>In other words, the developer’s vigilance is still required. In the next sections, I will show how exposures <em>augment</em> that vigilance by letting you query the graph for impact analysis (starting from an upstream model or source) and for investigations that start from a consumer (like a dashboard).</p>
<h1><strong>Impact analysis: find downstream consumers</strong></h1>
<p>If a source or model with many nodes downstream has an issue, or needs to be modified, we can query the dbt project’s graph to find all exposures downstream of that node, and therefore potentially affected. This gives you the impact of a change, plus a list of consumers and contact people to reach out to.</p>
<p>Let’s say the <code>raw_payments</code> source table (here a seed) has an issue. A <code>raw_payments+</code> selection shows the whole lineage down to the affected exposures:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68baa704fe509259a5bd6efa/636e0750-6eae-44b0-9a60-1a59a32b4de7.png" alt="" style="display:block;margin:0 auto" />

<p>If you prefer the CLI, this is a command that lists all exposures downstream of the <code>raw_payments</code> node:</p>
<pre><code class="language-shell">% dbt ls --select raw_payments+ --resource-type exposure
exposure:jaffle_shop.customer_health_dashboard
exposure:jaffle_shop.customer_segmentation_ml
exposure:jaffle_shop.weekly_revenue_report
</code></pre>
<h1><strong>When the alert comes from a dashboard</strong></h1>
<p>Not every incident starts at the source. Sometimes a business user reports that a KPI in a dashboard looks off, or that a scheduled report has stopped refreshing. In that moment, the first question is usually: what are all the upstream models and sources that could explain this symptom?</p>
<p>If you declare dashboards (and other downstream assets) as exposures, they become a natural entry point for that investigation. Because each exposure lists the dbt nodes it depends on, you can quickly get an initial shortlist of models to inspect first.</p>
<p>For example, to find the nodes declared as dependencies of a given exposure (the same lineage can also be explored in the dbt docs UI):</p>
<pre><code class="language-shell"># List the dependencies of a specific exposure

% dbt ls --select +exposure:jaffle_shop.customer_health_dashboard
</code></pre>
<p>From there, you can keep traversing upstream, or combine it with other selectors to narrow your search. The key point is that you can start from “the thing that looks wrong” and move left through the graph, instead of guessing where to begin.</p>
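<p>For instance, combining the exposure selector with a resource-type filter narrows the shortlist to just the raw inputs:</p>
<pre><code class="language-shell"># Only the sources feeding this dashboard, skipping intermediate models

% dbt ls --select +exposure:jaffle_shop.customer_health_dashboard --resource-type source
</code></pre>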
<h1><strong>Review workflow: keep consumers involved</strong></h1>
<p>When introducing or modifying an exposure file, it's crucial to open the pull request like any other code change and to request an owner of the consumer described in the exposure file as a reviewer. That way, that person can verify that we understood how the model is used and that the contact information is correct.</p>
<p>This review step is also the best moment to make the <code>description</code> field truly useful. The consumer can help make it complete by adding the business meaning (what question this dashboard answers), important caveats (filtering rules, known limitations, edge cases), and freshness expectations (how often it should refresh, what “stale” means, and what to do when it is late). Over time, this turns exposures into a lightweight shared contract between producers and consumers.</p>
<p>This can be enforced with GitHub if the exposures for a group of consumers (a team) are defined together in one file: set that team as a code owner on the file, and their approval will be required whenever it is modified.</p>
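<p>A single line in the repository's <code>CODEOWNERS</code> file is enough (the path and team name here are invented):</p>
<pre><code># .github/CODEOWNERS

models/exposures/marketing.yml @acme/marketing-analytics
</code></pre>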
<h1><strong>Wrap-up</strong></h1>
<p>I hope this post has convinced you that exposures can benefit your dbt setup. They help you document model usage, catch some mistakes early, and keep a complete view of impacted consumers in case of data issues or code changes, all while staying up to date.</p>
<p>If you want a low-effort way to get started, pick one critical dashboard (the one people will notice within minutes if it breaks) and add a single exposure for it.</p>
<ul>
<li><p>Link to the dashboard.</p>
</li>
<li><p>Add a real owner (a person or a team).</p>
</li>
<li><p>Write a <code>description</code> that captures the business meaning, caveats, and what “fresh” means.</p>
</li>
</ul>
<p>Once that is in place, you have a reliable starting point for impact analysis, and for investigations that begin from a consumer (“this dashboard looks wrong”).</p>
]]></content:encoded></item><item><title><![CDATA[Don't Split Product Strategy from Execution]]></title><description><![CDATA[Mention the Product Owner role to someone from a startup or scale-up background, and you'll likely get a blank stare. They'll probably google it later to figure out what you meant.
On the contrary, pe]]></description><link>https://blog.datachef.co/product-owner-product-manager-strategy-execution</link><guid isPermaLink="true">https://blog.datachef.co/product-owner-product-manager-strategy-execution</guid><category><![CDATA[Product Management]]></category><category><![CDATA[productowner]]></category><category><![CDATA[product strategy]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Fri, 20 Feb 2026 15:13:19 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/673ca28eaf27dbd59d38eb71/a2a82a69-ec01-4f12-9648-9452f7fd34ea.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Mention the Product Owner role to someone from a startup or scale-up background, and you'll likely get a blank stare. They'll probably google it later to figure out what you meant.</p>
<p>By contrast, people who've spent their entire career in large enterprises treat it as second nature. “You can't have a product team without a Product Owner!”, I've heard several times. They might have more doubts about the Product Manager role, though, assuming they've ever worked with one.</p>
<p>So, is this just another article debating the differences between the two roles and whether you need one or both? Not quite. For me, <strong>the Product Manager vs Product Owner debate is asking the wrong question.</strong></p>
<p>Yes, there's plenty of literature on the definition for each role and the right terminology to use. But here's the thing: in any organization that draws a hard line between these roles and wastes time debating where to draw it, <strong>the distinction itself is a symptom of organizational dysfunction, not a solution.</strong></p>
<p>The real question isn't "what's the difference between these roles?" It's "why do so many enterprises struggle to build truly product-driven organizations?"</p>
<h2>How we got here</h2>
<p>Let's start with the facts. Product Owner is a role defined by Scrum, a specific methodology with a specific scope. In the Scrum framework, the Product Owner manages the backlog, prioritizes work, plans sprints, and communicates delivery status to stakeholders. Within its intended scope (execution within a squad) this is a perfectly valid role.</p>
<p>The Product Owner role proliferated alongside the adoption of scaled agile frameworks like SAFe, particularly in large enterprises trying to bring structure to dozens or hundreds of product teams. It became a way to ensure every squad had someone accountable for "what gets built next."</p>
<p>The role serves a purpose for companies that embrace Scrum, but it has significant limitations. With such a strong emphasis on operational excellence and delivery, Product Owners tend to focus on <strong>outputs rather than outcomes</strong>. They become skilled at managing sprints, refining user stories, and keeping stakeholders informed, but they rarely step back to ask whether the work delivers real value to customers or moves key business metrics. That strategic component is often neglected or taken for granted.</p>
<h2>When things go south: the anti-patterns</h2>
<p>Many enterprises tried to address the Product Owner role's limitations by introducing a top-down approach to "inject" that strategic component into teams. Since this is a gray area in Scrum methodologies, I've seen at least two major anti-patterns, each dysfunctional in its own way.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/673ca28eaf27dbd59d38eb71/7b3d1e1e-2b65-4a80-b9b7-080fd9b05502.jpg" alt="The strategist sits on an ivory tower, while people on the ground look at a strategy plan confused." style="display:block;margin:0 auto" />

<h3>Anti-pattern #1: The Role Proliferation</h3>
<p>Some enterprises create a two-tier system: Product Managers who set strategy, and Product Owners who execute it. On paper, this sounds like a clean division of labor. In practice, it's often a disaster.</p>
<p><strong>The strategist loses touch with reality.</strong> When you're not involved in day-to-day prioritization and delivery, you lose the essential feedback loop that tells you whether your strategy actually works. You miss the technical constraints, the user journey friction, the implementation edge cases that should inform strategic decisions. For more complex products, you might even lose touch with the product itself. You are too distant from its users to develop empathy with them.</p>
<p><strong>The executor loses agency.</strong> When you're handed a strategy from above and told to "just execute it," you become an order-taker. You can't make intelligent trade-offs because you don't fully understand the commercial reasoning behind the work. You're managing a backlog, not owning outcomes.</p>
<h3>Anti-pattern #2: The Project Management Masquerade</h3>
<p>Other companies go a different route: they only have Product Owners, and these Product Owners report to technical leadership, such as engineering directors, rather than product leaders.</p>
<p>This usually happens in IT teams within organizations undergoing a "transformation" program of some kind. In these teams, there is typically no <a href="https://blog.datachef.co/event-storming-context-mapping-data-ai-product-managers">product discovery practice</a>, and work originates from strategic initiatives handed down from above. In reality, these are <strong>glorified project management organizations</strong> where the Product Owner's job is simply to keep projects on track.</p>
<p>The Product Owner is never in charge of setting direction. They don't measure value, they measure compliance to a plan laid out by managers or, in the worst cases, by "the business."</p>
<p>And here's the telltale sign of this dysfunction: <strong>when someone in your IT organization refers to another part of the company as "the business," you have a problem.</strong> Because the Product Owner is supposed to <em>be</em> the business. They should understand the business's interests, contribute to them, and be evaluated on business outcomes, not (just) on timely delivery.</p>
<p>In this model, the Product Owner becomes a translator between stakeholders and engineers, a schedule tracker, a meeting organizer. They're valuable, but they're not product managers. And the organization doesn't have anyone truly owning the "why."</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/673ca28eaf27dbd59d38eb71/6282406b-9e59-40b0-9d7b-51061eb2c049.jpg" alt="Stakeholders give a list of projects to the Product Owner, who is coordinating the execution and monitoring the timelines." style="display:block;margin:0 auto" />

<h2>The model that actually works</h2>
<p>Both anti-patterns stem from the same flawed assumption: <strong>splitting product strategy from execution</strong> and delegating strategy to a Product Manager, the management team, or another part of the organization.</p>
<p><strong>Strategy and execution are not separate jobs. They are two sides of the same coin.</strong> The person making strategic bets needs to be close enough to implementation to course-correct. The person prioritizing daily work needs enough strategic context to make smart trade-offs without having to escalate every decision.</p>
<p>The split is artificial and creates more problems than it solves. It's far more natural for every product team to have one person who owns the "why" while staying accountable for what gets built. That role is usually called Product Manager, but the name doesn't matter—what matters is clarity about their skill set and responsibilities. Here's what they should bring to the table:</p>
<p><strong>Strong commercial instinct.</strong> They're obsessed with customers and users, but also with business outcomes. They constantly ask: Is this work helping us capture new customers? Retain existing ones? Cut our operational costs?</p>
<p><strong>Ownership of the "why."</strong> They own the problem space. What problem are we solving next? Why is this the most important thing? How will this impact the customer journey? And how will that translate to business outcomes? They can <a href="https://blog.datachef.co/product-management-engineering-yes">engineer a 'Yes'</a> when a critical opportunity surfaces mid-sprint, finding creative ways to pivot without creating chaos.</p>
<p><strong>Direct connection to execution.</strong> They work closely with engineers, understand technical constraints, and get their hands dirty by looking at data and analytics themselves. Not to micromanage, but because you can't make good strategic decisions without this ground truth.</p>
<p>All of the above is non-negotiable. Whether you're building data products, AI products, or consumer applications, this function must exist.</p>
<p>But it’s just as important to set boundaries and define what this person should <em>not</em> be burdened with:</p>
<ul>
<li><p>Writing detailed user stories (let tech leads translate PRDs into engineering work instead).</p>
</li>
<li><p>Managing team logistics and sprint ceremonies.</p>
</li>
<li><p>Tracking vanity metrics like velocity and story points.</p>
</li>
<li><p>Managing people, hiring, writing performance reviews, etc.</p>
</li>
</ul>
<p>These responsibilities dilute focus from what matters: understanding customer problems and driving business impact.</p>
<h2>Outcomes, not backlogs</h2>
<p><strong>Stop debating whether to call people Product Managers or Product Owners. Start asking whether they own outcomes or just backlogs.</strong></p>
<p>If someone on your team is responsible for a product but spends most of their time managing sprint logistics and tracking velocity, you have a process coordinator. Regardless of title.</p>
<p>If someone is setting product strategy but hasn't talked to a customer or used their own product in weeks, you have a strategist in an ivory tower. Regardless of title.</p>
<p>If someone takes orders from "the business" and measures success by on-time delivery and budget rather than impact, you have a project manager. Regardless of title.</p>
<p>What every product team needs is someone who owns the complete loop: from problem identification to solution delivery to impact measurement. Someone who thinks commercially, understands user journeys, and has the courage to say yes to the right opportunities even when it disrupts the plan.</p>
<p>Call them Product Manager. Call them Product Owner. Call them Chief Problem Solver.</p>
<p><strong>Just don't split the job in half and expect either piece to succeed.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Spec-Driven Development: Ship Features with AI Guardrails]]></title><description><![CDATA[Something has fundamentally shifted in how we build software.
A year ago, if you told us we'd ship a production web application, 8 major features, 11 releases, 17,000 lines of TypeScript across a full]]></description><link>https://blog.datachef.co/spec-driven-development-ship-features-with-ai-guardrails</link><guid isPermaLink="true">https://blog.datachef.co/spec-driven-development-ship-features-with-ai-guardrails</guid><category><![CDATA[Spec-Driven-Development]]></category><category><![CDATA[AI coding]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Mohsen Hasani]]></dc:creator><pubDate>Thu, 12 Feb 2026 11:51:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771244526916/ed49fcae-2810-42f2-864e-e0420a452c32.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Something has fundamentally shifted in how we build software.</strong></p>
<p>A year ago, if you told us we'd ship a production web application (8 major features, 11 releases, 17,000 lines of TypeScript across a full-stack React + Express codebase) with AI writing virtually all the code, we'd have smiled politely and moved on. But that's exactly what happened. And the interesting part isn't the AI. It's what we put around the AI.</p>
<h2>The old way is gone (and that's okay)</h2>
<p>Let's be honest: writing every line of code by hand is no longer the most productive way to build software. AI coding assistants have gotten remarkably good. They can scaffold components, implement API endpoints, refactor patterns across files, and even reason about architecture. We've reached a point where the bottleneck isn't typing code, it's knowing what to type.</p>
<p>And that's where most AI-assisted development goes sideways.</p>
<p>Hand an LLM a vague prompt like "build me a dashboard," and you'll get something. It might even look impressive for five minutes. But it won't match your design system. It won't follow your project's conventions. It won't handle the edge cases your users will inevitably find. And good luck maintaining it three months later when nobody (including the AI) remembers why certain decisions were made.</p>
<p>We needed a better approach. We needed to trust the AI, but with guardrails.</p>
<h2>Enter spec-driven development</h2>
<p>The idea is simple: don't start with code. Start with a specification.</p>
<p>Before a single line of code gets written, every feature goes through a structured pipeline:</p>
<ol>
<li><p><strong>Specify:</strong> Describe what you want in plain language. The system generates a formal specification with user stories, functional requirements, success criteria, and acceptance scenarios.</p>
</li>
<li><p><strong>Clarify:</strong> The spec gets challenged. Are the requirements testable? Are there ambiguities? Edge cases? Up to 5 targeted questions are asked and the answers get encoded back into the spec.</p>
</li>
<li><p><strong>Plan:</strong> Technical architecture gets designed. Research is conducted on the best patterns for each problem. Data models, API contracts, and integration scenarios are documented. Every decision gets a rationale and alternatives considered.</p>
</li>
<li><p><strong>Tasks:</strong> The plan gets broken down into a precise, dependency-ordered task list. Each task has an ID, exact file paths, parallel execution markers, and belongs to a specific user story. Sequential vs. parallel execution is explicitly defined.</p>
</li>
<li><p><strong>Implement:</strong> Tasks are executed phase by phase, respecting dependencies. Parallel tasks run simultaneously. Each completed task gets checked off. TypeScript compilation and tests are verified at the end.</p>
</li>
</ol>
<p>This is what <a href="https://github.com/github/spec-kit">spec-kit</a> does. It's an open-source specification framework that turns natural language into structured, validated, implementation-ready specifications, designed specifically for AI-assisted development. We adopted it at DataChef for one of our projects, and the results surprised us.</p>
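<p>To make the Tasks output concrete, here is a rough sketch, in TypeScript, of the information such a task list carries. spec-kit's real artifacts are markdown documents, and every name and file path below is our own illustration rather than spec-kit's actual schema; the point is only the structure: IDs, file paths, dependencies, and parallel markers.</p>
<pre><code class="lang-typescript">// Conceptual model of a dependency-ordered task list.
// NOTE: spec-kit's real artifacts are markdown files; every name
// here is our own illustration, not spec-kit's actual schema.
interface SpecTask {
  id: string;          // e.g. "T001"
  userStory: string;   // the user story this task belongs to
  files: string[];     // exact file paths the task touches
  dependsOn: string[]; // task IDs that must finish first
  parallel: boolean;   // safe to run alongside other parallel tasks
}

const exampleTasks: SpecTask[] = [
  { id: "T001", userStory: "US1", files: ["shared/types.ts"], dependsOn: [], parallel: false },
  { id: "T002", userStory: "US1", files: ["src/hooks/useOrders.ts"], dependsOn: ["T001"], parallel: true },
  { id: "T003", userStory: "US1", files: ["src/pages/Orders.tsx"], dependsOn: ["T001"], parallel: true },
];

// Group tasks into phases: a task is ready once all its dependencies are done.
function phases(all: SpecTask[]): SpecTask[][] {
  const done = new Set&lt;string&gt;();
  const result: SpecTask[][] = [];
  let remaining = [...all];
  while (remaining.length &gt; 0) {
    const ready = remaining.filter((t) =&gt; t.dependsOn.every((d) =&gt; done.has(d)));
    if (ready.length === 0) throw new Error("Cycle in task dependencies");
    result.push(ready);
    for (const t of ready) done.add(t.id);
    remaining = remaining.filter((t) =&gt; !done.has(t.id));
  }
  return result;
}
</code></pre>
<p>Running <code>phases(exampleTasks)</code> yields two phases: the shared-types change first, then the two independent component updates together. That ordering is exactly the information the implement step consumes.</p>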
<h2>What this looks like in practice</h2>
<p>Here's a real example from our project. We needed to replace pagination with infinite scrolling across 5 different table views. Instead of diving into code, we started with:</p>
<blockquote>
<p>"Remove the paginations and load all data in every table, but keep the scroll inside the table, not for the whole page."</p>
</blockquote>
<p>That single sentence went through the spec-kit pipeline and produced:</p>
<ul>
<li><p>A specification with 2 user stories, 8 functional requirements, and 6 success criteria</p>
</li>
<li><p>A research document with 5 technical decisions (CSS strategies for sticky headers, flex layout patterns, data volume analysis)</p>
</li>
<li><p>API contracts documenting exactly how 4 endpoints would change</p>
</li>
<li><p>A task list of 14 tasks across 3 phases, with explicit parallel execution groups</p>
</li>
<li><p>A quickstart checklist with 30+ manual verification items</p>
</li>
</ul>
<p>The AI then executed all 14 tasks (removing pagination from shared types, simplifying React Query hooks, updating 5 page components with sticky headers and scroll containers) and produced zero TypeScript errors and zero test failures.</p>
<p>The whole feature, from English sentence to working code, followed a traceable path where every decision was documented.</p>
<h2>Why guardrails matter more than prompts</h2>
<p>Here's what we learned after shipping 8 features this way:</p>
<p><strong>Specifications are the real prompt engineering</strong>. A well-structured spec with clear acceptance criteria gives AI everything it needs to generate correct code. We stopped tweaking prompts and started improving specifications.</p>
<p><strong>Research prevents costly mistakes</strong>. For one feature, the research phase discovered that our data volumes (under 200 records per table) didn't justify the complexity of incremental loading. That decision (documented with rationale) saved us from over-engineering. Without the research step, the AI would have happily built an unnecessarily complex infinite scroll system.</p>
<p><strong>Parallel task execution is a superpower</strong>. Because spec-kit identifies which tasks touch different files, multiple AI agents can work simultaneously. In one phase, 5 agents updated 5 different page components in parallel. What would have been sequential 20-minute work completed in under 3 minutes.</p>
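<p>A minimal sketch of how that dispatch can work, reusing the hypothetical <code>SpecTask</code> shape from the earlier sketch; <code>runTask</code> is a stand-in for handing a task to a coding agent:</p>
<pre><code class="lang-typescript">// Phase-by-phase execution with parallel dispatch.
// `runTask` is a hypothetical stand-in for invoking a coding agent.
declare function runTask(t: SpecTask): Promise&lt;void&gt;;

async function executePhases(plan: SpecTask[][]): Promise&lt;void&gt; {
  for (const phase of plan) {
    const parallel = phase.filter((t) =&gt; t.parallel);
    const sequential = phase.filter((t) =&gt; !t.parallel);
    // Parallel-marked tasks touch disjoint files, so agents can run them together.
    await Promise.all(parallel.map(runTask));
    // Everything else runs one at a time, in dependency order.
    for (const t of sequential) await runTask(t);
  }
}
</code></pre>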
<p><strong>The spec is the documentation</strong>. Every feature has a specs/ directory with its specification, plan, research decisions, API contracts, and task history. Three months from now, when someone asks "why does the dashboard scroll differently from the tables?", the answer is in specs/008-table-infinite-scroll/research.md, decision R-002.</p>
<h2>The numbers</h2>
<p>Across our project, spec-driven development with AI produced:</p>
<table>
<thead>
<tr>
<th><strong>Metric</strong></th>
<th><strong>Count</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Features shipped</td>
<td>8</td>
</tr>
<tr>
<td>Production releases</td>
<td>11</td>
</tr>
<tr>
<td>Completed tasks</td>
<td>190</td>
</tr>
<tr>
<td>Spec artifacts generated</td>
<td>60</td>
</tr>
<tr>
<td>Application source code</td>
<td>17,000 lines</td>
</tr>
<tr>
<td>Languages/frameworks</td>
<td>TypeScript, React 19, Express 5, TailwindCSS 4</td>
</tr>
</tbody></table>
<p>Every one of those 190 tasks was derived from a specification, planned with documented rationale, and executed with explicit dependencies. Not a single feature started with "just write the code."</p>
<h2>Trust, but verify</h2>
<p>We're not saying AI writes perfect code. It doesn't. During this project we caught edge cases, fixed styling inconsistencies, and debugged issues that the AI missed. But the spec-driven approach changed when we caught those problems.</p>
<p>Instead of discovering architectural mistakes after 500 lines of code, the specification and planning phases surface them before any code exists. Instead of untangling spaghetti from a freeform AI session, every change maps to a task that maps to a user story that maps to a requirement.</p>
<p>The AI is the engine. The specification is the steering wheel.</p>
<h2>What's next</h2>
<p>Spec-kit is open source and designed to work with any AI coding assistant: Claude Code, Cursor, Copilot, or whatever comes next. We'll keep using it and sharing what we learn along the way.</p>
<p>If you're building software with AI and finding that the output is unpredictable, hard to maintain, or doesn't match what you actually needed, the problem probably isn't the AI. It's the input.</p>
<p>Give the AI a specification, not a wish.</p>
<h2>Other tools in this space</h2>
<p>Spec-driven development is picking up momentum. If you're exploring the approach, here are the tools worth looking at:</p>
<p><a href="https://github.com/github/spec-kit">spec-kit</a>: The one we used. GitHub's open-source toolkit that provides templates, a CLI, and prompts to move work through specify → plan → tasks → implement. Agent-agnostic, works with Claude Code, Copilot, Gemini, and others.</p>
<p><a href="https://kiro.dev">Kiro</a>: An AI IDE by AWS (VS Code fork) with spec-driven development built in. You describe requirements in natural language, Kiro generates user stories, technical design docs, and implementation tasks. Great for teams that want the workflow embedded in the editor itself.</p>
<p><a href="https://tessl.io">Tessl</a>: An agent enablement platform with a CLI that doubles as an MCP server. Its registry indexes 1,000+ reusable skills and docs for 10,000+ packages, keeping agent context version-matched to your dependencies. Focuses on making coding agents more effective through structured, versioned context.</p>
<p><a href="https://github.com/Fission-AI/OpenSpec">OpenSpec</a>: A lightweight, fluid alternative that works with 20+ AI assistants. Uses an action-based workflow (proposal → specs → design → tasks → implement) with no rigid phase gates, you can update any artifact at any time.</p>
<p>The tooling is still young, but the pattern is clear: the teams that give AI structured input will ship faster and more reliably than those prompting from scratch.</p>
]]></content:encoded></item><item><title><![CDATA[Event Storming and Context Mapping for Data & AI Product Managers]]></title><description><![CDATA[Discovery and requirements gathering are among the most critical phases of data product management. Yet they're also among the most challenging. Too often, data teams fall into the trap of being order-takers, responding to ad-hoc requests rather than...]]></description><link>https://blog.datachef.co/event-storming-context-mapping-data-ai-product-managers</link><guid isPermaLink="true">https://blog.datachef.co/event-storming-context-mapping-data-ai-product-managers</guid><category><![CDATA[#Domain-Driven-Design]]></category><category><![CDATA[Data Products]]></category><category><![CDATA[Product Management]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Fri, 30 Jan 2026 14:32:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769776647356/6fb3751a-d6dd-47bd-b2f0-040cfb0dd7e3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Discovery and requirements gathering are among the most critical phases of data product management. Yet they're also among the most challenging. Too often, data teams fall into the trap of being order-takers, responding to ad-hoc requests rather than proactively shaping a strategic roadmap. To truly unlock the value of data as a product, you need to reverse this narrative.</p>
<p>But how do you discover data products when stakeholders themselves might struggle to articulate what they need? How do you gather requirements and design solutions when everyone has a different mental model of how things work?</p>
<p>This is where <strong>EventStorming</strong> and <strong>Context Mapping</strong>—two powerful tools from the Domain-Driven Design community—become invaluable for data product managers.</p>
<h2 id="heading-the-discovery-and-requirements-challenge">The Discovery and Requirements Challenge</h2>
<p>Many stakeholders are accustomed to treating data as a commodity, something they simply request when needed. They may not have the vocabulary or mental models to think about data products strategically. As a data product manager, you need tools that bridge this gap and enable collaborative discovery and requirements gathering.</p>
<p>These are the key ingredients for running effective sessions that actually uncover the needs of your consumers:</p>
<ul>
<li><p><strong>Collaboration</strong> across diverse teams and functions</p>
</li>
<li><p>Deep understanding of how the organization actually operates</p>
</li>
<li><p><strong>A shared language</strong> to discuss complex processes and systems</p>
</li>
<li><p>Frameworks that help stakeholders articulate their needs</p>
</li>
</ul>
<h2 id="heading-enter-eventstorming-and-context-mapping">Enter EventStorming and Context Mapping</h2>
<p>While EventStorming and Context Mapping are distinct techniques, they share powerful principles that make them effective for data product discovery and design:</p>
<p><strong>Bring the right people together.</strong> Not managers or directors, but the people who do the actual work: those who feel the pain of inefficient processes and possess deep operational knowledge. These are your domain experts, even if they don't carry that title.</p>
<p><strong>Create a collaborative space.</strong> Whether it's a physical whiteboard or a virtual canvas, you need a shared modeling space where everyone can contribute freely. The key is psychological safety: there are no wrong answers, and all perspectives are valued.</p>
<p><strong>Build shared understanding.</strong> By visualizing processes, systems, and interactions together, teams develop a common language and mental model that transcends organizational silos.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769776056483/ae2761d7-5c6b-442e-813a-0e8740fcd6d7.png" alt="A cartoon depicts a collaborative work environment. In the first panel, diverse workers at a table think &quot;We do the work!&quot; while an overseer observes. In the second panel, individuals discuss ideas on a board under a sign reading &quot;No wrong answers,&quot; with a light bulb symbolizing creative thinking." class="image--center mx-auto" /></p>
<h2 id="heading-context-mapping-in-action-displacing-legacy-and-discovering-future-state">Context Mapping in Action: Displacing Legacy and Discovering Future State</h2>
<p>We recently worked with a global retail company headquartered in the Netherlands undergoing a massive transformation: implementing a new data model to support their product lifecycle management. Dozens of systems across their IT landscape consumed data from the old model, each with custom point-to-point integrations built at different times using different technologies.</p>
<p>The challenge was understanding a complex web of dependencies, many of which existed only in the minds of developers (some of whom had left the company). Knowledge gaps were everywhere.</p>
<p><strong>Context Mapping became our guide.</strong></p>
<p>We ran sessions with each team managing individual downstream systems, bringing together:</p>
<ul>
<li><p>End users who understood how the tools were actually used</p>
</li>
<li><p>Developers who had worked on the integrations</p>
</li>
<li><p>Anyone with contextual knowledge about edge cases and workarounds</p>
</li>
</ul>
<p>Together, we mapped the current landscape:</p>
<ul>
<li><p>How each integration was built and how it works today</p>
</li>
<li><p>Relationships between systems: upstream versus downstream, who conforms to whose standards, who introduces transformations, and where they happen</p>
</li>
<li><p>Boundaries between different contexts: what happens when something goes wrong? Who notices the problem, and who's responsible for fixing it?</p>
</li>
</ul>
<p>But Context Mapping isn't just about documenting the present. It helped us envision the future state: how would the introduction of the new product data model reshape this landscape? Who would be responsible for implementing each piece of the solution?</p>
<p>The sessions created something invaluable: <strong>shared ownership.</strong> Everyone who would play a role in the migration gained buy-in on the solution. We emerged with a clear understanding of implications, impacts, and responsibilities for successful adoption.</p>
<h2 id="heading-eventstorming-in-action-bringing-the-bottleneck-into-the-picture">EventStorming in Action: Bringing the Bottleneck into the Picture</h2>
<p>Within the same program, we hit a roadblock. The design of the new data model struggled to handle a specific edge case in the product lifecycle—an exception to the happy path that threatened to derail implementation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769776114222/fac9efc6-bc1c-49a2-8ab1-48ba12a22efd.png" alt="A cartoon shows four figures having an EventStorming session in front of a board. They appear confused, with question marks in speech bubbles. Another figure stands apart, looking worried, holding a puzzle piece labeled &quot;Critical Gap.&quot;" class="image--center mx-auto" /></p>
<p>This is precisely where <strong>EventStorming</strong> shines.</p>
<p>Same principle: bring in people who do the operational work day-to-day. But rather than focusing on systems and bounded contexts, start with a seed event to kick-start the conversation. No constraints on participants in terms of time and space, just a huge whiteboard on which to collaboratively build a timeline of events.</p>
<p>When disagreements emerge about what happens between events, or the correct sequencing, that's your signal as facilitator to dig deeper:</p>
<ul>
<li><p>What are the <strong>policies</strong> that govern these transitions?</p>
</li>
<li><p>Which <strong>actors and</strong> <strong>systems</strong> are involved?</p>
</li>
<li><p>What can happen in parallel and what cannot?</p>
</li>
</ul>
<p>Once we had business experts and engineers in the same room, it was easy to construct an accurate timeline of events that allowed everyone to acknowledge the critical gap in the design of the data model. EventStorming helped us see the big picture first and then model the chaos, with the support of a standardized grammar that ensured everybody was using the same language.</p>
<p>The process also served another crucial purpose: the engineers who would implement the solution gained both technical and business knowledge about the purpose and processes behind what they were building. This deep understanding would prove invaluable during development.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769776255294/6e27fb60-d117-48f6-8e85-ca0c5d21c925.png" alt="The same figures are having an EventStorming session and the &quot;Critical Gap&quot; piece of the puzzle is now on the board as everyone looks more relaxed." class="image--center mx-auto" /></p>
<h2 id="heading-why-these-tools-matter-for-data-product-managers">Why These Tools Matter for Data Product Managers</h2>
<p>As data product managers, our success hinges on our ability to:</p>
<ul>
<li><p><strong>Build empathy</strong> with users and understand their Job-To-Be-Done</p>
</li>
<li><p><strong>Identify bottlenecks</strong> and pain points in existing processes</p>
</li>
<li><p><strong>Unite the team</strong> around the "why" behind what we're building</p>
</li>
<li><p><strong>Deliver solutions</strong> that create real business value</p>
</li>
</ul>
<p>EventStorming and Context Mapping are strategic tools that give you structured, collaborative frameworks to achieve all of this. They transform discovery from a vague, frustrating exercise into a concrete, energizing process.</p>
<p>Team members will enter the room as strangers (or in the worst cases, adversaries!) and leave with a shared understanding of what needs to be done next.</p>
<h2 id="heading-getting-started-your-path-to-collaborative-discovery">Getting Started: Your Path to Collaborative Discovery</h2>
<p>Ready to transform your discovery process? Here's how to begin:</p>
<ol>
<li><p><strong>Start small.</strong> Pick one complex process or migration that would benefit from collaborative mapping. Use it as a pilot to demonstrate value.</p>
</li>
<li><p><strong>Invest in facilitation skills.</strong> These workshops require skilled facilitation to create psychological safety and guide productive conversations. Consider bringing in experienced facilitators for your first sessions.</p>
</li>
<li><p><strong>Focus on the right participants.</strong> Remember: doers over observers.</p>
</li>
<li><p><strong>Embrace messiness.</strong> The most valuable insights often emerge from disagreements and confusion—these are signs you're uncovering hidden complexity that needed to be surfaced.</p>
</li>
<li><p><strong>Document, but don't over-formalize.</strong> The real value is in the shared understanding built during the session, not just the artifacts produced.</p>
</li>
</ol>
<p>Discovery doesn't have to be a shot in the dark. With the right collaborative tools and facilitation, you can turn it into your competitive advantage.</p>
<p><strong>Want to go deeper?</strong> We're launching <strong>DataChef Academy</strong>, where you'll learn how to run effective EventStorming and Context Mapping workshops for data products. <a target="_blank" href="https://www.datachef.academy/">Join our waitlist</a> to be the first to know when these hands-on workshops become available.</p>
]]></content:encoded></item><item><title><![CDATA[Product Managers and the Art of Engineering a 'Yes']]></title><description><![CDATA[If you spend any time reading about product management online, you'll frequently encounter a piece of advice: saying no is the product manager's superpower. There are countless guides on how to say no, why to say no, and frameworks for saying no with...]]></description><link>https://blog.datachef.co/product-management-engineering-yes</link><guid isPermaLink="true">https://blog.datachef.co/product-management-engineering-yes</guid><category><![CDATA[Product Management]]></category><category><![CDATA[product]]></category><category><![CDATA[Roadmap]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Mon, 19 Jan 2026 15:02:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768834480384/702b1c86-fb6a-4b50-ba1f-0d3bf0141056.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you spend any time reading about product management online, you'll frequently encounter a piece of advice: <strong>saying no is the product manager's superpower</strong>. There are countless guides on how to say no, why to say no, and frameworks for saying no with empathy. The narrative has become so dominant that many aspiring PMs believe their primary job is to be the gatekeeper who protects the team from distractions.</p>
<p>This isn't entirely wrong: saying no <em>is</em> part of the job, of course. But framing it as the defining skill of product management is dangerous. Worse, it creates a culture where product managers optimize for the wrong outcomes: process compliance, roadmap rigidity, and risk aversion.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768824499767/263a39d3-6cb1-425f-936d-831910cbd858.png" alt="Illustration of a &quot;PM&quot; sitting at a desk with a roadmap, saying &quot;No&quot; to a line of people with feature requests. Another group of people (the team and management) appear happy, thinking &quot;Thanks for the focus!&quot;" class="image--center mx-auto" /></p>
<h2 id="heading-why-saying-no-feels-like-a-superpower">Why saying "no" feels like a superpower</h2>
<p>Let's be honest: saying no is actually easy. You point at a roadmap. You cite committed priorities. You deflect by adding stuff to the backlog. Done. It's a defensive move that requires little courage and no creativity. Why do I think it’s easy? For at least a couple of reasons.</p>
<p><strong>You’re protected by the process.</strong> Especially in organizations that operate with rigid methodologies and strong commitments to roadmaps and planning cycles. When you have a well-defined plan and established priorities, declining new requests becomes a routine exercise that is actually encouraged by organizational principles.</p>
<p><strong>It provides immediate social rewards.</strong> Your engineering team appreciates that you're shielding them from constant context switching and scope creep. Your manager values that you're staying disciplined, following the process, and delivering what you had committed to deliver. There's a pleasant sense of being the responsible adult in the room, protecting everyone from chaos.</p>
<p>But here's the uncomfortable truth: <strong>there’s nothing special about saying “no”</strong>. In most product management careers, there will be far more instances of saying no than yes. The opportunities you <em>don't</em> pursue will always outnumber the ones you do.</p>
<h2 id="heading-the-pm-who-always-says-no-is-perceived-as-an-obstacle">The PM who always says “no” is perceived as an obstacle</h2>
<p>Worse, encouraging PMs to default to "no" enables a dangerous pattern: product managers who reject everything outside the plan train stakeholders to work around them.</p>
<p>When business stakeholders or customers present something urgent, they're not trying to derail your carefully crafted plans. They're bringing you a problem they believe is important. In many cases, they're right. They're closest to the customer pain, the market dynamics, the competitive threats. They have information you might not have.</p>
<p>Great product managers recognize this and treat it as valuable input, not an annoyance to be deflected. They ask questions: Why is this urgent now? What happens if we don't address this? Who is affected and how severely? What outcome are you trying to achieve? After asking questions, they ponder and offer options.</p>
<p>When product managers consistently demonstrate the <strong>willingness to find a way forward</strong>, they build tremendous stakeholder confidence. Stakeholders learn that this person is thinking first and foremost about solving the right problems, not blinded by process for the sake of process. They will see them as a partner in achieving business outcomes.</p>
<p>Conversely, product managers who shut down conversations about unplanned work will be seen as obstacles. Stakeholders lose patience after hearing for the third time that their customer pain point is "in the backlog." They escalate to executives, who then mandate the work anyway. In the worst cases, stakeholders might even secure budget to build shadow solutions outside your product. In the process, you lose the chance to own the problem space. Worse, you lose credibility and authority, neither of which is easily regained.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768824511250/4bf5c107-0b29-4968-8693-96f781c3d295.png" alt="Comic illustration with two stick figure characters. One says they need a &quot;shadow&quot; solution without informing the PM. The other hands over money and says, &quot;Here's the budget. Bypass them.&quot;" class="image--center mx-auto" /></p>
<h2 id="heading-what-actually-defines-great-product-leaders">What actually defines great product leaders</h2>
<p>Great product management is about <strong>understanding the "why" before the "what" and "how."</strong> A product manager's core responsibility is to own the “why”, i.e., to maintain focus on business outcomes and end-user needs.</p>
<p>This means that the real differentiator of their success is not <em>how often</em> they say no, but <strong>which opportunities they say yes to</strong> and the impact they will generate on the business.</p>
<p>The challenge—and the art—lies in spotting the right opportunities that make you go: <em>"Yes, we're going to change our roadmap. We've found something that is actually more important and more impactful to do."</em></p>
<p>This shouldn't happen every day. If you're constantly changing direction, you create chaos and undermine trust. But when the right opportunity presents itself, great product leaders (at any level: ICs, directors, C-level) have the instinct to recognize it and the courage to act on it.</p>
<h2 id="heading-the-anatomy-of-engineering-a-yes">The anatomy of engineering a "yes"</h2>
<p>So, what does it mean to <em>engineer</em> a “yes”? At DataChef, we believe it involves three things that are part of the core skill set of a good product manager.</p>
<p><strong>Rapid validation.</strong> When a potentially important opportunity surfaces, skilled product managers quickly validate whether it's worth pursuing. They have a strong instinct that helps them uncover the underlying problem, even when the pain point is poorly articulated or buried in raw feedback. They spot patterns and connect the dots, tying new requests to problems that surfaced earlier or recognizing similarities with already-prioritized work.</p>
<p><strong>Creative solution design.</strong> Here's where the <em>engineering</em> comes in. Saying yes doesn't mean accepting the proposed solution at face value or dropping everything else. It means coming up with options to address the high-impact problem while managing constraints. For example, descoping less critical work, negotiating scope to deliver 80% of the value with 20% of the effort, or identifying overlaps with existing priorities that address the same need.</p>
<p><strong>Stakeholder alignment without creating uncertainty.</strong> The hardest part of engineering a “yes” is doing so without creating chaos in the team or losing stakeholder confidence. This requires clear communication about why this opportunity matters more than what was planned and transparency about what's being deprioritized.</p>
<h2 id="heading-its-on-leaders-to-build-a-culture-of-options">It’s on leaders to build a culture of options</h2>
<p>If you’re a Director of Product, or a Chief Product Officer, here’s my advice to you. Companies get the product management they incentivize. If you reward product managers for sticking to the plan, minimizing stakeholder requests, and keeping teams busy with predetermined work, you'll get gatekeepers and coordinators. In many organizations, especially in Europe, that is the status quo.</p>
<p>If instead you reward product managers for delivering measurable business outcomes, solving high-impact customer problems, and building strong stakeholder relationships, you'll get product managers who think creatively and take calculated risks while driving real value.</p>
<p>The real trap is confusing process compliance with product excellence. Leaders should be careful about the metrics they pick to measure success and the stories they promote as virtuous examples.</p>
<p>And when performance review time comes, remember: it takes courage to engineer a “yes”. The product managers who get invited to strategic conversations, the ones who build loyal followings among stakeholders and teams, are the ones who figure out how to say yes to the right things.</p>
<p>So the next time you hear someone say "the product manager's superpower is saying no," push back. <strong>Because at the end of the day, nobody remembers the product manager who was really good at saying no. They remember the product manager who found a way to solve the most important problems, even when it wasn't easy.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Copilot VS. Custom LLM: Navigating the Generative BI Landscape]]></title><description><![CDATA[For years, Business Intelligence has followed a familiar, often frustrating pattern. An executive has a question, an analyst hunts for the answer, engineers build data pipelines, and eventually—days or weeks later—a dashboard appears. By the time the...]]></description><link>https://blog.datachef.co/copilot-vs-custom-llm-navigating-the-generative-bi-landscape</link><guid isPermaLink="true">https://blog.datachef.co/copilot-vs-custom-llm-navigating-the-generative-bi-landscape</guid><category><![CDATA[Generative BI]]></category><category><![CDATA[conversational bi]]></category><category><![CDATA[PowerBI]]></category><category><![CDATA[copilot]]></category><dc:creator><![CDATA[Ali Mohammadzadeh]]></dc:creator><pubDate>Tue, 16 Dec 2025 09:01:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765549252488/4f94da5b-bf49-4524-af3d-57e900504de3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For years, Business Intelligence has followed a familiar, often frustrating pattern. An executive has a question, an analyst hunts for the answer, engineers build data pipelines, and eventually—days or weeks later—a dashboard appears. By the time the insights arrive, the decision window has often closed.</p>
<p>We are now entering a new stage of analytics. We are shifting away from static dashboards and moving toward <strong>conversational intelligence</strong>: unlocking an organization's institutional memory through natural language.</p>
<p>The question is no longer <em>if</em> you should adopt <a target="_blank" href="https://hashnode.com/post/cmira5zvy000502l764k49fv4">Generative BI</a>, but <em>how</em>. Do you lean into the Microsoft ecosystem with Copilot, or does your context justify a custom architecture running on your own terms?</p>
<h3 id="heading-copilot-vs-custom-quick-comparison">Copilot vs Custom: Quick Comparison</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Dimension</strong></td><td><strong>Copilot in Power BI</strong></td><td><strong>Custom Open-Source Stack</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Time to value</td><td>Fast if you already run on Fabric/Premium</td><td>Medium; depends on infra readiness and team</td></tr>
<tr>
<td>Governance</td><td>Native Entra ID, RLS/OLS, Purview lineage</td><td>Can align with existing infra, IAM, and compliance needs; governance, logging, and lineage must be designed</td></tr>
<tr>
<td>Cost model</td><td>Fixed capacity + licenses</td><td>Variable; infra + pay-per-inference</td></tr>
<tr>
<td>Model choice</td><td>Microsoft-supported models</td><td>Bring-your-own; swap models freely</td></tr>
<tr>
<td>Data access</td><td>Optimized for Power BI semantic models</td><td>Direct-to-warehouse and docs via RAG/agents</td></tr>
<tr>
<td>Residency &amp; privacy</td><td>Cloud with tenant and regional controls</td><td>Private VPC or on‑prem, air‑gapped possible</td></tr>
<tr>
<td>Flexibility</td><td>Tight Microsoft ecosystem integration</td><td>High; full control of components and routing</td></tr>
<tr>
<td>Operations</td><td>Managed; lower operational cost</td><td>Self-managed; higher operational cost because you run everything yourself</td></tr>
</tbody>
</table>
</div><p><strong>Note:</strong> A custom open-source stack typically combines components like LangChain/LlamaIndex for orchestration, open-source models (Llama/Mistral), a vector store (Elasticsearch/pgvector/Pinecone), and your existing warehouse/lake, all deployed within your current infrastructure (e.g., Kubernetes, VPC, existing IAM).</p>
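<p>As a rough sketch of the query path such a stack implements, here is the skeleton in TypeScript. Every declared function below is a hypothetical placeholder for a component you would wire in yourself (embedding model, vector store, LLM endpoint, warehouse client); none of them are real library APIs.</p>
<pre><code class="lang-typescript">// Skeleton of the question-to-answer path in a custom Generative BI stack.
// All four declared functions are hypothetical placeholders for your own
// components (embedding model, vector store, LLM endpoint, warehouse client).
declare function embed(text: string): Promise&lt;number[]&gt;;
declare function vectorSearch(v: number[], opts: { topK: number }): Promise&lt;string[]&gt;;
declare function llm(prompt: string): Promise&lt;string&gt;;
declare function runWarehouseQuery(sql: string): Promise&lt;unknown[]&gt;;

async function answer(question: string): Promise&lt;string&gt; {
  // 1. Retrieve grounding context: schema docs, metric definitions, examples.
  const context = await vectorSearch(await embed(question), { topK: 5 });

  // 2. Have the model write SQL grounded in that context.
  const sql = await llm(
    "Given these table and metric definitions:\n" +
      context.join("\n") +
      "\nWrite a SQL query answering: " +
      question
  );

  // 3. Execute against the warehouse and summarize for the user.
  const rows = await runWarehouseQuery(sql);
  return llm("Summarize for a business audience:\n" + JSON.stringify(rows));
}
</code></pre>
<p>In a production design you would add guardrails around step 2 (query validation, allow-listed schemas) and log every step, which is where the governance patterns later in this article come in.</p>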
<h2 id="heading-the-promise-power-bi-with-copilot"><strong>The Promise: Power BI with Copilot</strong></h2>
<p>Microsoft has integrated Copilot into Power BI, and for many organizations this is the most direct path into Generative BI. The vision is to remove technical barriers, allowing decision-makers to access data as easily as sending a message.</p>
<p>Imagine a sales director who needs to prepare for a quarterly review. Instead of filtering through slicers on a complex report, they type:</p>
<blockquote>
<p>“Show me sales performance for Q3 compared with Q2. Break it down by region and point out the three weakest products.”</p>
</blockquote>
<p>Copilot prepares a visual and a short explanation of why performance changed.</p>
<p>They follow up with:</p>
<blockquote>
<p>"Why did our gross margin drop in October?"</p>
</blockquote>
<p>Copilot scans the dataset and explains:</p>
<blockquote>
<p><em>Gross margin fell by four percent because supply chain costs for Product Line X rose by fifteen percent.</em></p>
</blockquote>
<p>This is the promise: simple questions, instant answers, automatic visuals. No need to open a ticket or wait for the next sprint.</p>
<h2 id="heading-where-copilot-shines"><strong>Where Copilot Shines</strong></h2>
<p>It is important to be clear where Microsoft has built a real advantage. If your data strategy is centered on Azure, Copilot offers benefits that are hard to match with a purely custom stack:</p>
<ul>
<li><p><strong>Seamless integration:</strong> It lives where your users already work: inside Power BI, Teams, and Excel. That matters for adoption.</p>
</li>
<li><p><strong>Identity &amp; Governance:</strong> It leverages <strong>Entra ID</strong> (formerly Azure AD) for robust Role-Based Access Control (RBAC). If a user is restricted by row-level security, Copilot will not see that data either. You get consistent access rules from reports down into conversational queries.</p>
</li>
<li><p><strong>Auditability:</strong> With <strong>Microsoft Purview</strong> integration, you get built-in lineage, cataloging, and sensitivity labeling. This goes beyond just security. Analysts can trace where an answer came from, compliance teams can reason about risk, and executives can trust that numbers are not appearing from nowhere.</p>
</li>
</ul>
<h2 id="heading-the-decision-matrix-choosing-your-path"><strong>The Decision Matrix: Choosing Your Path</strong></h2>
<p>While Copilot is powerful, it is not a universal answer. Implementing Generative BI via the standard Microsoft route comes with specific infrastructure requirements and constraints that may not fit every agile or cost-conscious business.</p>
<p>Here is a framework to help you decide which approach fits your organization.</p>
<h3 id="heading-when-copilot-power-bi-is-the-right-answer">When Copilot + Power BI is the Right Answer</h3>
<p>Staying with the official Microsoft platform is often the best choice if:</p>
<ol>
<li><p><strong>You are a "Microsoft Shop":</strong> Your organization already uses Entra ID, Purview, and Power BI Premium or Fabric capacity. Your BI teams are comfortable with Power BI as the main semantic layer.</p>
</li>
<li><p><strong>You have solid semantic models:</strong> Your semantic models are well maintained. Copilot relies heavily on the quality of the underlying semantic layer; it struggles with messy or unstructured data.</p>
</li>
<li><p><strong>Governance and auditability are top priorities:</strong> You want audit logs, sensitivity labels, and lineage in a form that security and compliance already understand. You would rather inherit this from Entra + Purview than rebuild it.</p>
</li>
<li><p><strong>You accept the Fabric/Premium commitment:</strong> You are willing to pay the "Fabric Tax" (committing to Fabric or Premium capacities) for the ease of a fully managed service.</p>
</li>
</ol>
<h3 id="heading-when-a-custom-open-source-solution-makes-sense">When a Custom Open-Source Solution Makes Sense</h3>
<p>A custom Generative BI layer, built with open-source components and your own warehouse, becomes compelling when:</p>
<ol>
<li><p><strong>You have strict data-residency or “offline” needs:</strong> In industries like defense, healthcare, or finance, sending data to public cloud-hosted LLMs is not acceptable. You may need to run open-source models like Llama 3 or Mistral on-premise or in a private VPC where data never leaves your control.</p>
</li>
<li><p><strong>You care deeply about cost control and routing:</strong> Capacity-based billing for Fabric means you pay whether Copilot runs one query or one million. With a custom solution, you have control over model routing, allowing you to leverage low-cost models for most use cases and reserve more expensive reasoning models for complex tasks.</p>
</li>
<li><p><strong>Your “truth” lives in the warehouse, not only in Power BI:</strong> You want your conversational interface to talk directly to gold tables in Databricks, Snowflake, or other warehouses, and to combine that with unstructured documents via RAG and agents. Copilot is optimized for Power BI semantic models; a custom agent can be designed around your broader data estate.</p>
</li>
<li><p><strong>You want vendor flexibility:</strong> You want the freedom to swap the "brain" of your operation without changing your entire platform. If a new, faster model is released tomorrow, you want to plug it in immediately without waiting for a vendor update.</p>
</li>
</ol>
<h2 id="heading-the-friction-why-copilot-isnt-for-everyone">The Friction: Why Copilot Isn’t for Everyone</h2>
<p>While Copilot’s capabilities are undeniably impressive, adopting it via the standard route often comes with substantial prerequisites and constraints — a reality that can be a barrier for agile or cost-sensitive organizations.</p>
<ul>
<li><p><strong>“Capacity” costs required</strong>: To enable Copilot in Power BI (or the larger Microsoft Fabric ecosystem), your workspace must be hosted on a <strong>paid capacity</strong> — either Fabric capacity (F-SKU) or Power BI Premium capacity (P1 or higher). It’s not sufficient to only have a free or trial license. <a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-enable-power-bi">Microsoft Learn</a></p>
</li>
<li><p><strong>Fixed overhead, even if usage is low</strong>: Because capacity-based billing applies regardless of actual usage, you may incur infrastructure costs even if few Copilot queries are made. This can be a heavy commitment — especially for organizations with sporadic or unpredictable usage.</p>
</li>
<li><p><strong>Ecosystem lock-in</strong>: Copilot is deeply tied to Microsoft infrastructure, APIs, and model choices. That is a benefit if you are all‑in on the stack. It is a constraint if you want to experiment with open-source models, alternative warehouses, or a multi‑cloud architecture.</p>
</li>
<li><p><strong>Compliance &amp; data-residency restrictions</strong>: For industries with strict regulatory requirements (finance, healthcare, etc.), using Copilot via cloud-based LLMs can raise concerns about data residency, governance, and compliance. While there are regional and tenant-level settings in Fabric, they require careful configuration. <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/fundamentals/how-copilot-works">Microsoft Learn</a></p>
</li>
</ul>
<h2 id="heading-the-datachef-approach-co-designing-your-architecture"><strong>The DataChef Approach: Co-Designing Your Architecture</strong></h2>
<p>At DataChef, we do not sell a boxed “GenBI product” that competes with Microsoft. We help you <strong>design and implement the architecture that fits your constraints</strong>, which often includes Copilot.</p>
<p>For some clients, that means tightening their Power BI models, access rules, and Purview setup so Copilot can actually be trusted. For others, it means designing a custom “institutional memory” engine that runs in their own environment, side‑by‑side with existing BI.</p>
<p>Here are a few patterns we tend to design for:</p>
<h3 id="heading-1-specialized-security-amp-air-gapped-intelligence">1. Specialized Security &amp; "Air-Gapped" Intelligence</h3>
<p>When zero data leakage is non-negotiable:</p>
<ul>
<li><p>Deploy open-source models fully within your private cloud or on‑prem environment.</p>
</li>
<li><p>Fine‑tune models on your own jargon and metrics definitions so answers reflect how <em>your</em> organization speaks and reasons.</p>
</li>
<li><p>Keep all prompts, logs, and embeddings inside your own perimeter.</p>
</li>
</ul>
<h3 id="heading-2-flexible-intelligence-through-model-routing">2. Flexible Intelligence Through Model Routing</h3>
<p>When cost and performance both matter (see the sketch after this list):</p>
<ul>
<li><p>Route simple, high‑volume queries to small, local models.</p>
</li>
<li><p>Reserve top‑tier hosted models (e.g. Claude, GPT‑4 class) for complex reasoning, planning, or edge cases.</p>
</li>
<li><p>Evaluate and swap models as the landscape evolves, without changing your front‑end experience.</p>
</li>
</ul>
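<p>A minimal sketch of such a router, assuming two already-deployed endpoints (both placeholders here) and a deliberately naive complexity heuristic:</p>
<pre><code class="lang-typescript">// Cost-aware model routing. `localModel` and `hostedModel` are
// hypothetical placeholders for your deployed endpoints.
declare function localModel(prompt: string): Promise&lt;string&gt;;
declare function hostedModel(prompt: string): Promise&lt;string&gt;;

// Deliberately naive heuristic; in practice this could be a small
// classifier model or rules keyed to query type.
function needsHeavyReasoning(prompt: string): boolean {
  return prompt.length &gt; 500 || /forecast|why|explain|compare/i.test(prompt);
}

async function route(prompt: string): Promise&lt;string&gt; {
  // Simple, high-volume lookups go to the cheap local model;
  // complex reasoning is reserved for the expensive hosted one.
  return needsHeavyReasoning(prompt) ? hostedModel(prompt) : localModel(prompt);
}
</code></pre>
<p>Because the routing decision lives in your own code, swapping either endpoint for a newer model changes one declaration, not the front-end experience.</p>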
<h3 id="heading-3-governance-by-design">3. Governance by Design</h3>
<p>When governance must span both Copilot and custom stacks:</p>
<ul>
<li><p>Integrate agents with your existing identity and access management (for example Entra or other IAM), so the same rules apply everywhere.</p>
</li>
<li><p>Log questions, answers, and underlying queries in your environment (a minimal sketch follows this list) so you can:</p>
<ul>
<li><p>See what people are actually asking.</p>
</li>
<li><p>Identify gaps in your semantic models or warehouse.</p>
</li>
<li><p>Improve definitions and data products over time.</p>
</li>
</ul>
</li>
</ul>
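<p>A minimal sketch of what such a log record can capture; the field names are ours, to be aligned with your own governance requirements:</p>
<pre><code class="lang-typescript">// Minimal audit record for conversational BI. Field names are
// illustrative; align them with your own governance requirements.
interface GenBiAuditRecord {
  timestamp: string;      // ISO-8601
  userId: string;         // from your IAM (e.g. an Entra object ID)
  question: string;       // what the user asked
  generatedQuery: string; // the SQL/DAX the agent produced
  answer: string;         // what was shown to the user
  model: string;          // which model answered (useful for routing audits)
  datasets: string[];     // tables or semantic models touched
}

// Append-only for simplicity; in production this could be a warehouse
// table feeding the gap analysis described in the list above.
const auditLog: GenBiAuditRecord[] = [];

function recordInteraction(entry: GenBiAuditRecord): void {
  auditLog.push(entry);
}
</code></pre>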
<h2 id="heading-conclusion-partnering-for-the-future"><strong>Conclusion: Partnering for the Future</strong></h2>
<p>The shift to conversational BI is underway. The exact tool you use (Copilot, a custom agent, or both) matters less than whether you have:</p>
<ul>
<li><p>A clear view of your current BI and data platform.</p>
</li>
<li><p>An honest assessment of your governance and residency constraints.</p>
</li>
<li><p>A cost model that fits how you actually plan to use Generative BI.</p>
</li>
</ul>
<p>For many Microsoft‑centric organizations, the right move is to <strong>start with Copilot in Power BI</strong>, provided the semantic models and governance are in good shape. As soon as you hit residency, multi‑cloud, or advanced routing needs, it becomes worth designing a <strong>custom Generative BI lane</strong> alongside it.</p>
<p><strong>How we help.</strong> If you are wrestling with this choice, a good first step is to take one or two concrete use cases - rather than your entire BI estate - and ask:</p>
<ul>
<li><p>Can Copilot, on top of our current models and governance, serve this well?</p>
</li>
<li><p>Where do residency, cost, or multi‑platform needs push us beyond what Copilot can reasonably cover?</p>
</li>
</ul>
<p>This is the kind of work we do with clients: we map your current BI setup, security requirements, and cost constraints, then co‑design the smallest viable next step - whether that is getting more value out of Copilot, adding a focused custom GenBI slice, or combining both into a coherent architecture.</p>
]]></content:encoded></item><item><title><![CDATA[Is Your Organization Ready for Data Mesh? A Practical Readiness Check]]></title><description><![CDATA[You are being asked to make a long‑term bet on data architecture.
Most conversations frame this as a choice between data lake and data mesh. Vendors, internal teams, and reference architectures encourage you to pick a side.

💡
“Data lake vs data mes...]]></description><link>https://blog.datachef.co/is-your-organization-ready-for-data-mesh-a-practical-readiness-check</link><guid isPermaLink="true">https://blog.datachef.co/is-your-organization-ready-for-data-mesh-a-practical-readiness-check</guid><category><![CDATA[Data Mesh]]></category><category><![CDATA[Organization Design]]></category><category><![CDATA[Data Products]]></category><category><![CDATA[team topologies]]></category><dc:creator><![CDATA[Bram Elfrink]]></dc:creator><pubDate>Thu, 11 Dec 2025 13:51:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/7hA2wqBcSF8/upload/45a11c230528bac1361708e3e96b1b64.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You are being asked to make a long‑term bet on data architecture.</p>
<p>Most conversations frame this as a choice between <strong>data lake</strong> and <strong>data mesh</strong>. Vendors, internal teams, and reference architectures encourage you to pick a side.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">“Data lake vs data mesh” often mixes up technology with operating model. You can run a lake or a warehouse in a centralized or decentralized way. In this article, we use “lake” as shorthand for a <strong>more centralized</strong> model, and “mesh” as shorthand for a <strong>more decentralized</strong> model, because that is how most organizations experience them in practice.</div>
</div>

<p>The more useful question is:</p>
<blockquote>
<p><strong><em>Does your organization have the structure, skills, and governance to operate the architecture you are buying into for the next 3-5 years?</em></strong></p>
</blockquote>
<p>If the answer is no, the right pattern on paper will not help in practice. You are making an operating‑model choice, not just a technology choice. If the model does not fit how your organization works, you lock in extra cost and rework, slower decisions, and a credibility problem for the data team.</p>
<p>If the match is roughly right, the effect is the opposite. You shorten time‑to‑change for important decisions, reduce shadow solutions, and build an architecture that supports the ambitions of the organization instead of fighting them.</p>
<p>This article is about that match. We will not argue for lake or mesh as abstract patterns. Instead, we will help you:</p>
<ul>
<li><p>Judge whether your organization is genuinely ready for a more decentralized model.</p>
</li>
<li><p>See what you can already apply from data mesh thinking on top of the central lake or warehouse you run today.</p>
</li>
</ul>
<p>By the end, you should be able to answer a practical question: <em>what level of decentralization can we operate safely now, and what can we grow into without breaking the business?</em></p>
<h2 id="heading-what-a-centralized-platform-is-and-when-it-fits">What a Centralized Platform Is (and When It Fits)</h2>
<p>In most organizations, the “data lake vs data mesh” decision is really a question of how centralized your platform is. A centralized platform — typically your data lake or warehouse — means one team owns the shared environment. Data lands and is modeled in one place, under one set of standards. Most meaningful changes flow through a single backlog and roadmap.</p>
<p>This is often the right starting point. You are moving from chaos to order. Data ownership in domains is still fuzzy. Your main pain is fragmentation and duplication, not yet a central bottleneck. In that context, a central platform behaves like a good monolith: one place to reason about data, one set of contracts, one team accountable when something breaks.</p>
<p>The trade‑off is clear. You concentrate capability, governance, and decision‑making in one place, and accept that many new use cases will queue behind the same team. At limited scale, this is usually a good deal. It lets you enforce standards, simplify tooling, and ship visible wins, especially if you are coming from scattered spreadsheets and side systems.</p>
<p>As volume, teams, and use cases grow, the same design can quietly become the throttle for change rather than the enabler. The central team’s backlog starts to define how quickly the business can move.</p>
<p>The question is not “is centralization bad?”, but “given how we actually work today, are we still at the scale where a single team can safely sit in the middle of everything?”</p>
<h2 id="heading-what-a-decentralized-platform-is-and-when-it-fits">What a Decentralized Platform Is (and When It Fits)</h2>
<p>At the other end of the spectrum is a decentralized platform — data mesh in practice. Instead of one central team owning most pipelines and models, stable domain teams own their <a target="_blank" href="https://blog.datachef.co/data-set-data-product-management">data products</a> end to end. They design the models, run the pipelines, and are accountable for quality and contracts. A central group still exists, but focuses on platform, guardrails, and enablement.</p>
<p>This model makes sense once your main problem is no longer “we have nothing consistent,” but “the central team is slowing everyone down.” Domains already move fast in their own products and services. They want similar control over the data that represents their part of the business, and they have enough engineering and analytics capability to carry that responsibility.</p>
<p>In that situation, a single central backlog has become the limit for how quickly the business can change. A decentralized model changes that constraint: you keep a strong shared platform, but move more decisions, design, and implementation into the domains that know the business best.</p>
<p>Moving to a decentralized model is like moving from a monolith to well‑designed services. You keep a shared foundation, but push ownership and change closer to where domain knowledge lives. When it fits, this gives you faster decisions, tighter alignment between data and reality, and fewer shadow solutions, because teams no longer need to route around a central bottleneck to get work done.</p>
<p>Here too, the question is not “are we doing data mesh?”, but “do we have the structure, skills, and trust to let domains own real data products on top of a shared platform, without collapsing into chaos?”</p>
<h2 id="heading-antipatterns-when-each-model-fails-in-practice">Antipatterns: When Each Model Fails in Practice</h2>
<p>Centralized and decentralized platforms both work when they match how the organization actually operates. Even when the technology is sound, the wrong match with your organization creates failure patterns that look very similar across companies.</p>
<h3 id="heading-forcing-data-mesh-into-an-unready-organization">Forcing Data Mesh into an Unready Organization</h3>
<p>You are likely forcing mesh too early if you recognize several of these signals:</p>
<ul>
<li><p><strong>Org shape vs. design do not match</strong></p>
<p>  You still run mainly as functions or projects (BI team, central data, “the business”), but your target diagram is full of tidy domains and data products. In practice, work and budgets still route through the old structure.</p>
</li>
<li><p><strong>Domains “own” data only on paper</strong></p>
<p>  You have named data product owners, but they have no time, no engineers, and no real mandate. At quarter‑end, they still send SQL snippets or requests back to the central team to “fix the numbers.”</p>
</li>
<li><p><strong>Everyone is rebuilding the basics</strong></p>
<p>  Each domain is trying to figure out its own ingestion, testing, and monitoring. There is no clear golden path from the platform. Teams ask each other “how did you solve X?” because nothing is shared.</p>
</li>
<li><p><strong>Platform and practices are still fluid</strong></p>
<p>  Tooling, environments, and standards are changing under people’s feet. New ways of doing things keep appearing before the old ones have settled. The idea of “letting every domain choose” feels like multiplying today’s instability.</p>
</li>
<li><p><strong>Big‑bang thinking instead of small, contained pilots</strong></p>
<p>  The transformation is framed as “we are doing data mesh” across the whole estate. There is no small, end‑to‑end domain where you can point and say: “this is working, here is how, here is what we learned.”</p>
</li>
</ul>
<p>If several of these resonate, the problem is not that mesh is a bad idea. It is that you are asking the organization to run an operating model it does not yet have the structure, skills, or platform to support.</p>
<h3 id="heading-staying-centralized-when-the-lake-team-is-the-bottleneck">Staying Centralized When the Lake Team Is the Bottleneck</h3>
<p>On the other side, you may be clinging to a central lake model after it has clearly outlived its scale. Common signals:</p>
<ul>
<li><p><strong>The central backlog is the throttle for business change</strong></p>
<p>  Most meaningful analytics or data changes must pass through one team. Lead times are counted in weeks or months. Leaders talk about “waiting for the data team” as a standard part of delivery.</p>
</li>
<li><p><strong>Shadow data and parallel stacks keep popping up</strong></p>
<p>  Teams run their own extracts, spreadsheets, and side databases “just for now.” Some units experiment with their own BI tools or cloud accounts because they see no other way to move.</p>
</li>
<li><p><strong>Trust in “official” data is eroding</strong></p>
<p>  People compare dashboards from different sources in meetings. Arguments start with “that’s not what my numbers say.” The central platform is still branded as “single source of truth,” but behaviour says otherwise.</p>
</li>
<li><p><strong>The platform is seen as something done <em>to</em> domains</strong></p>
<p>  Business teams feel they have little say in schemas or priorities and experience the central team as a gate, not a partner. Even when they use the official path, it feels misaligned with their reality, so they increasingly disengage from the central solution.</p>
</li>
<li><p><strong>Central teams are fighting symptoms, not causes</strong></p>
<p>  Most effort goes into patching fragile pipelines and firefighting incidents in unfamiliar domains. Root‑cause fixes require deep business context that sits with local teams, but ownership has never really moved there.</p>
</li>
</ul>
<p>If these signals are familiar, your issue is no longer “do we need more standards?” or “do we need a better tool?”. The central lake itself is now the limit for how fast you can change. That is usually the moment to start pushing ownership and accountability closer to domains, on top of a strong shared platform, instead of adding yet another layer of process around the same bottleneck.</p>
<h2 id="heading-a-readiness-lens-for-your-organization">A Readiness Lens for Your Organization</h2>
<p>Moving from a centralized approach to a more decentralized one is not a simple on–off switch. It needs a deliberate roadmap and preparation.</p>
<p>The questions below give you four key lenses to assess whether your organization is ready for that step. If your answer is “no” for one of them, the “<strong>Action</strong>” under that lens shows where to invest next so you move closer to a decentralized model you can actually run.</p>
<h3 id="heading-ownership">Ownership</h3>
<p>Ask:</p>
<ul>
<li><p>Have you already identified your domains?</p>
</li>
<li><p>Do domains already own services, APIs, or key analytics for their area?</p>
</li>
<li><p>Can you name accountable data owners with enough time and mandate, not just a title?</p>
</li>
</ul>
<p>If ownership is diffuse and always rolls back to “the data team”, you are not ready for a broad mesh.</p>
<p><strong>Action</strong>: Start by clarifying ownership inside a central lake model.</p>
<h3 id="heading-platform-maturity">Platform Maturity</h3>
<p>Ask:</p>
<ul>
<li><p>Do you have a shared platform for cross-cutting concerns, especially security and observability?</p>
</li>
<li><p>Or would each domain have to assemble its own solutions in order to build data products?</p>
</li>
</ul>
<p>If each domain would be rebuilding basics on a moving foundation, decentralization just multiplies instability.</p>
<p><strong>Action</strong>: Stabilize the central platform and define a “golden path” before pushing ownership out.</p>
<h3 id="heading-governance-and-trust">Governance and Trust</h3>
<p>When we talk about governance here, we mean the practical rules for how data is created, changed, and used across the organization. That includes data governance topics like definitions, quality, access, and lineage, and also the decision rights around who can approve changes, who owns which datasets, and how exceptions are handled. In other words: the minimal set of shared rules and decision paths that keep data trustworthy and compliant.</p>
<p>Ask:</p>
<ul>
<li><p>Are standards followed mainly because they are useful, or because a committee enforces them?</p>
</li>
<li><p>Can you define a small, clear set of rules and trust domains to operate within them?</p>
</li>
</ul>
<p>If every change requires escalation and exception handling, added autonomy will create more variance, not more value.</p>
<p>You need governance that works as <strong>guardrails</strong>, not as a gate for every decision.</p>
<p><strong>Action</strong>: Make governance work as guardrails with pre-approved paths that teams can adopt without asking permission.</p>
<h3 id="heading-people-and-skills">People and Skills</h3>
<p>Ask:</p>
<ul>
<li><p>Do domain teams have engineers or analytics engineers who can own pipelines end to end?</p>
</li>
<li><p>Do you have an enablement team that helps domains adopt shared practices and technology — coaching them, providing templates, and pairing on real work?</p>
</li>
</ul>
<p>If these skills sit only in one central team, you will either overload it or hand responsibility to people who cannot carry it.</p>
<p><strong>Action</strong>: Invest in skills and enablement before shifting significant lifecycle ownership into domains.</p>
<h2 id="heading-its-not-a-binary-choice-borrow-the-best-ideas"><strong>It’s Not a Binary Choice: Borrow the Best Ideas</strong></h2>
<p>You do not have to pick a side and live with it forever. Most healthy organizations end up with a central platform and varying degrees of decentralized ownership on top.</p>
<h3 id="heading-apply-mesh-ideas-on-a-central-platform">Apply Mesh Ideas on a Central Platform</h3>
<p>You do not need to re‑architect everything to benefit from data mesh thinking. Many of the ideas work well on top of a central lake or warehouse you already run.</p>
<p>In practice, the pattern that works is:</p>
<ul>
<li><p><strong>Keep the platform centralized.</strong></p>
<p>  One place to manage infrastructure, security, governance, and the golden path.</p>
</li>
<li><p><strong>Decentralize data products.</strong></p>
<p>  Domains own the tables, models, and APIs that represent their part of the business, on top of that shared platform.</p>
</li>
</ul>
<p>On that foundation, apply “mesh” ideas inside your lake or warehouse:</p>
<ul>
<li><p>Treat important datasets as <strong>products</strong>: give them clear owners, roadmaps, and SLAs.</p>
</li>
<li><p>Make contracts explicit: schemas, refresh cadence, and rules for breaking changes.</p>
</li>
<li><p>Bring domain experts into design and prioritization, instead of letting a central team guess what they need.</p>
</li>
</ul>
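<p>To make the contract idea concrete, here is a minimal sketch of what an explicit contract could look like if a domain encoded it next to its data product. Every name and value below is illustrative, not a standard:</p>
<pre><code class="language-python">from dataclasses import dataclass

@dataclass
class DataContract:
    """Illustrative contract for one domain-owned data product."""
    product: str              # e.g. "orders.daily_revenue"
    owner: str                # the accountable domain team
    schema: dict              # column name to type
    refresh_cadence: str      # when consumers can expect fresh data
    freshness_slo_hours: int  # maximum acceptable staleness
    breaking_change_policy: str  # how schema changes are announced

orders_revenue = DataContract(
    product="orders.daily_revenue",
    owner="orders-domain-team",
    schema={"order_date": "date", "region": "string", "revenue": "decimal"},
    refresh_cadence="daily by 06:00 UTC",
    freshness_slo_hours=24,
    breaking_change_policy="30-day notice; previous schema kept as versioned table",
)
</code></pre>
<p>The exact medium matters less than the habit: consumers can read exactly what they can rely on, and producers know what counts as a breaking change.</p>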
<p>You get leverage from a shared foundation, while pushing accountability closer to where decisions and knowledge live.</p>
<h3 id="heading-sequence-the-journey">Sequence the Journey</h3>
<p>You do not need, and likely do not want, a big‑bang transition.</p>
<p>A pragmatic sequence:</p>
<ol>
<li><p><strong>Stabilize a central lake or warehouse.</strong></p>
<p> Get reliability, governance, and observability to an acceptable baseline.</p>
</li>
<li><p><strong>Introduce product thinking and ownership.</strong></p>
<p> Name owners for key domains and define clear contracts.</p>
</li>
<li><p><strong>Gradually decentralize where it adds leverage.</strong></p>
<p> Let mature domains take on more of the lifecycle, on top of the shared platform.</p>
</li>
</ol>
<p>At each step, ask a simple question: <strong>can your organization operate this level of decentralization without creating chaos?</strong></p>
<h2 id="heading-you-dont-have-to-make-this-choice-alone"><strong>You Don’t Have to Make This Choice Alone</strong></h2>
<p>At DataChef, we have helped organizations design and evolve data platforms across warehouses, lakes, lakehouses and meshes. The patterns that work in practice are always the ones that start from the business.</p>
<p>The shift from a centralized to a more decentralized model does not happen overnight. You need a clear view of your structure, skills, and governance, and a roadmap that links operating model and technology to outcomes. That is why we look beyond tools and architectures and focus equally on organizational design and team boundaries, using approaches like Team Topologies.</p>
<p>If you are considering a move towards data mesh or rethinking your central platform, we would be happy to help you assess where you are today and design a path you can truly own.</p>
]]></content:encoded></item><item><title><![CDATA[Beyond Dashboards: An Executive Introduction to Generative BI]]></title><description><![CDATA[For years, leaders have relied on dashboards and BI teams to understand what is happening in their organization. The rhythm has been familiar: you notice something unusual, ask someone to “pull a report,” and wait while the request moves through a ba...]]></description><link>https://blog.datachef.co/beyond-dashboards-an-executive-introduction-to-generative-bi</link><guid isPermaLink="true">https://blog.datachef.co/beyond-dashboards-an-executive-introduction-to-generative-bi</guid><category><![CDATA[#GenerativeBI ]]></category><category><![CDATA[#ConversationalAnalytics]]></category><category><![CDATA[BUSINESS INTELLIGENCE ]]></category><dc:creator><![CDATA[Bram Elfrink]]></dc:creator><pubDate>Thu, 04 Dec 2025 10:15:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764840878007/620cf556-554a-45fb-8f27-f3b407f54d2d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For years, leaders have relied on dashboards and BI teams to understand what is happening in their organization. The rhythm has been familiar: you notice something unusual, ask someone to “pull a report,” and wait while the request moves through a backlog. Days or weeks later, you receive the chart you asked for. Sometimes it helps. Sometimes the moment has already passed.</p>
<p>A quiet shift is now underway. It is changing how leaders work with data, how quickly they make decisions, and how often they validate their intuition. That shift is <strong>Generative BI</strong>.</p>
<p>Despite the futuristic name, it is not a tool. It is a capability.</p>
<h2 id="heading-from-dashboards-to-conversation"><strong>From Dashboards to Conversation</strong></h2>
<hr />
<p>The simplest way to understand Generative BI is to compare two worlds. Imagine you are reviewing last month’s performance and notice that margin dipped in one region. In the old world, you would email your BI lead, ask for a breakdown by product and channel, and wait a few days for a new report. By the time it arrives, the team has already moved on to other priorities. In the new world, you ask: <em>“Why did margin fall in Region North last month?”</em> You immediately see the answer broken down by product line. You follow up with <em>“Show me the top three customers that changed the most compared to the previous quarter”</em> and <em>“How much of this is due to discounts?”</em> You get to a decision in minutes, while the topic is still fresh in your next leadership meeting.</p>
<p>That is Generative BI: a conversational layer on top of the data and models you already have, designed to unlock faster thinking and faster decisions. This does not replace analytics. Dashboards remain essential for recurring KPIs and long‑term performance tracking. What changes is the interface. Executives and domain experts no longer need to file tickets for every ad‑hoc question. They can simply ask and explore.</p>
<h2 id="heading-why-executives-should-care"><strong>Why Executives Should Care</strong></h2>
<hr />
<p>The value of Generative BI has little to do with AI as a buzzword. Its impact is felt in the pace and quality of decision‑making.</p>
<p><strong>Faster, More Accountable Decisions</strong></p>
<p>Instead of waiting for someone to build or modify a report, leaders can get to the <em>why</em> in real time, directly in conversation with their data. This is powerful in areas where timing compounds results: pricing, campaign optimization, supply chain adjustments, store or channel operations. Being able to ask “what changed?” at the moment you notice something creates a fundamentally different decision cadence.</p>
<p>At the same time, when the path from question to answer is transparent – which data was used, which definitions were applied, which assumptions were made – it becomes easier to challenge, refine and document decisions. Generative BI does not just speed up decision‑making. It makes the link between narrative, numbers and ownership much clearer.</p>
<p><strong>A Better Return on Data Investments</strong></p>
<p>Most organizations have already invested heavily in warehouses, modeling and BI tools. Generative BI does not replace that work. It helps people finally use more of what they already have, by making it far easier to get from question to insight.</p>
<p><strong>Relief for the BI Backlog</strong></p>
<p>BI teams are excellent at producing repeatable dashboards. What they cannot cover is the long tail of everyday questions that never get prioritized, simply because nobody has time to build a view for them. Generative BI absorbs that long tail. It answers the quick, exploratory questions that today either never get asked or arrive as “just one more request” for the analytics team.</p>
<p><strong>More Curiosity Across the Organization</strong></p>
<p>When people do not need to open tickets or wait for sprints, they ask more questions. They explore more scenarios. They validate more hypotheses. Many of the best ideas in an organization start with someone noticing a small anomaly and digging deeper. Generative BI makes that investigation immediate, instead of dependent on the next reporting cycle.</p>
<h2 id="heading-the-honest-limitations"><strong>The Honest Limitations</strong></h2>
<hr />
<p>Like any powerful capability, Generative BI has real constraints. Being clear about them is the difference between a valuable system and a disappointing pilot.</p>
<p><strong>Your Data Still Needs to Be Right</strong></p>
<p>If definitions are inconsistent or tables do not line up, the system will give you answers quickly, but they may be confidently wrong. Generative BI exposes weak data foundations. It does not repair them.</p>
<p><strong>Nuanced Business Concepts Are Not Obvious to a Model</strong></p>
<p>The system can easily answer structured questions such as trends, comparisons or top‑N performers. It struggles when underlying concepts are ambiguous or live inside people’s heads. Terms like “active customer,” “at‑risk account,” or “adjusted margin” often require shared understanding across teams. Unless those definitions are made explicit, a model cannot apply them reliably.</p>
<p><strong>People Still Need to Trust the Numbers</strong></p>
<p>Executives will always ask where a number came from. A good Generative BI setup must show how an answer was produced: which tables it touched, which metrics it used and what logic it applied. Without that transparency, leaders will treat the system as a toy rather than a decision support tool.</p>
<p><strong>Governance Matters More, Not Less</strong></p>
<p>A conversational interface makes it easy for anyone to ask anything. Without the right guardrails, people may access information they should not see or draw conclusions from partial context. Thoughtful governance, access control and guardrails are what turn Generative BI from a risk into a trusted capability.</p>
<p>Put simply: Generative BI will not magically fix weak data foundations or misaligned definitions. It amplifies whatever you already have, good or bad.</p>
<h2 id="heading-making-generative-bi-work-in-the-real-world"><strong>Making Generative BI Work in the Real World</strong></h2>
<hr />
<p>The hard part of Generative BI is rarely the interface. Most demos look magical. The real challenge is making it work reliably inside the messy, nuanced reality of a business.</p>
<p>This is where DataChef focuses: turning promising generative BI demos into a durable capability on top of your real data.</p>
<p><strong>From Demo to Capability</strong></p>
<p>Real data is messy. Business rules clash. Metrics drift over time. DataChef works with both domain leaders and your data/BI teams to get a complete view: not just how the data is modeled, but how the business actually makes decisions. We use focused workshops (for example, event storming, value stream mapping, and Wardley mapping) with a multi‑disciplinary group of client stakeholders. Together, we identify the critical decisions you want to support in conversation, then shape the underlying data and definitions around those questions. The goal is not a one‑off proof of concept, but a capability that executives can return to every day.</p>
<p><strong>Designing a Safe Environment for Exploration</strong></p>
<p>We work with you to decide which domains, metrics and tables should be accessible first. Together, we make explicit which sources are in scope, which definitions are authoritative, and which caveats must always be surfaced. On top of that, we put access controls and audit in place. The result is a “safe playground” where executives and domain experts can explore without risking misinterpretation or exposure of sensitive data.</p>
<p><strong>An Iterative, Low-Risk Approach</strong></p>
<p>You do not need a big‑bang transformation. We prefer a <em>thin vertical slice (or tracer bullet)</em> approach: prove the value end‑to‑end as quickly as possible by taking a narrow slice, for example a specific sub‑domain. That slice includes everything from the data model and definitions to the conversational experience and governance. This way, you get feedback on every part of the solution early, and you can expand based on real usage and learning rather than on abstract requirements.</p>
<h2 id="heading-what-this-means-for-your-organization"><strong>What This Means for Your Organization</strong></h2>
<hr />
<p>Generative BI is not the end of dashboards, and it is not a replacement for your analytics team. It is a new capability that changes how quickly leaders can think, explore ideas and act with confidence.</p>
<p>For organizations ready to embrace it, the shift is profound. It allows decision‑makers to move at the speed of conversation, supported by data they already own.</p>
<p>If you are curious what this could look like for your company, DataChef helps organizations design that first step: a secure, trusted conversational layer that unlocks meaningful, fast insights for leaders.</p>
]]></content:encoded></item><item><title><![CDATA[How We Turned Onboarding Into Something Magical Using n8n]]></title><description><![CDATA[Bringing a new colleague into DataChef has always felt a bit like welcoming someone into our kitchen. You want the place to feel warm and ready, and you want their first steps to feel smooth. For a long time, our onboarding process did not give that ...]]></description><link>https://blog.datachef.co/how-we-turned-onboarding-into-something-magical-using-n8n</link><guid isPermaLink="true">https://blog.datachef.co/how-we-turned-onboarding-into-something-magical-using-n8n</guid><category><![CDATA[n8n]]></category><category><![CDATA[workflow]]></category><category><![CDATA[onboarding]]></category><category><![CDATA[automation]]></category><dc:creator><![CDATA[Mohsen Hasani]]></dc:creator><pubDate>Mon, 01 Dec 2025 15:32:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/4NQEvxW2_4w/upload/bd6bafd24b907b85f61744e020c8ac15.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Bringing a new colleague into <strong>DataChef</strong> has always felt a bit like welcoming someone into our kitchen. You want the place to feel warm and ready, and you want their first steps to feel smooth. For a long time, our onboarding process did not give that feeling. It was slow, manual, and full of tiny tasks that could be missed. Creating emails, preparing pages, writing messages, setting up channels, checking access. None of it was wrong, but it did not match the friendly and organized culture we wanted to show.</p>
<p>At some point we realized something simple. We spend a lot of time welcoming people, but the work behind that welcome is heavy and invisible. We wanted to keep the warm feeling while removing the heavy lifting around it. So we built our own onboarding workflow in n8n, and the difference was bigger than we expected. Today the entire onboarding process runs in seconds. It feels consistent, personal, and honestly a little magical.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764601120582/0e2329e2-66db-4fb7-992a-dbddf1445abc.webp" alt class="image--center mx-auto" /></p>
<p><strong>Screenshot: The full onboarding workflow in n8n</strong></p>
<h2 id="heading-1-a-tiny-slack-form-that-starts-everything"><strong>1. A tiny Slack form that starts everything</strong></h2>
<p>It all begins when someone from PeopleOps fills in a short Slack form. It is simple and quick, containing only the basic details like name, position, and start date. Behind this small step, the workflow wakes up and starts preparing everything in the background. It feels like ringing a small bell in the kitchen and watching everything fall into place without any stress. This one form replaces hours of careful setup and long checklists.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764600990502/19ac19f4-58f7-4fb2-9c7b-6915f255c343.webp" alt class="image--center mx-auto" /></p>
<p><strong>Screenshot: Slack shortcut trigger</strong></p>
<h2 id="heading-2-the-new-colleagues-digital-identity-appears-like-magic"><strong>2. The new colleague’s digital identity appears like magic</strong></h2>
<p>Right after the form is submitted, n8n starts building the new colleague’s digital identity. It checks for an available email address, chooses one, creates it, generates a temporary password, and connects their personal email. Before automation, this took careful typing and double checking. Now the process finishes almost instantly. Watching it happen feels like someone quietly preparing the perfect setup behind the scenes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764601024563/e763f8ed-a79a-4060-8937-00104c10b838.webp" alt class="image--center mx-auto" /></p>
<p><strong>Screenshot: Email creation section</strong></p>
<h2 id="heading-3-their-notion-space-grows-around-them"><strong>3. Their Notion space grows around them</strong></h2>
<p>Notion is our central place for documentation and shared knowledge. We wanted new colleagues to arrive and instantly feel that everything was ready for them. The workflow creates a DOP (DataChef Orientation Program) page to guide their first days and a One Team page that stores their basic information. Both pages come filled with the right layout and details. There are no empty pages or forgotten fields. Everything looks prepared with intention and care.</p>
<h2 id="heading-4-a-warm-welcome-message-lands-in-their-inbox"><strong>4. A warm welcome message lands in their inbox</strong></h2>
<p>Before they even join Slack, the new colleague receives a friendly welcome email. It includes their company email, a temporary password, and simple next steps. This is often the first moment of surprise for them. It shows that the company has already prepared a space, and all they need to do is walk in.</p>
<h2 id="heading-5-n8n-patiently-waits-for-them-to-join-slack"><strong>5. n8n patiently waits for them to join Slack</strong></h2>
<p>This part has a calm and almost playful feeling. The workflow does not rush or create channels before the person actually joins Slack. Instead, it waits and checks from time to time whether the new account has appeared through SSO. When the new colleague finally signs in, n8n continues the process instantly, almost as if saying, “Welcome. Let us continue.”</p>
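<p>For the curious, the logic behind that patience is a simple poll. This is not the actual n8n workflow, just a Python sketch of the equivalent check using the official <code>slack_sdk</code> client; the token, email, and polling interval are placeholders:</p>
<pre><code class="language-python">import time

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token="xoxb-placeholder-token")  # hypothetical bot token

def wait_for_slack_account(email: str, interval_seconds: int = 900) -> str:
    """Poll Slack until the SSO-provisioned account for this email appears."""
    while True:
        try:
            # users_lookupByEmail succeeds once the account exists in the workspace
            response = client.users_lookupByEmail(email=email)
            return response["user"]["id"]
        except SlackApiError as error:
            if error.response["error"] != "users_not_found":
                raise  # a real API problem, not just "has not joined yet"
            time.sleep(interval_seconds)  # wait calmly and check again

user_id = wait_for_slack_account("new.colleague@datachef.co")
</code></pre>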
<h2 id="heading-6-their-slack-channels-appear-fully-ready"><strong>6. Their Slack channels appear fully ready</strong></h2>
<p>Once the new colleague is active in Slack, things start to appear around them. Their DOP channel is created, a progress review channel appears, clear instructions are posted, reminders are sent, and the right people are invited. Before automation, these steps were done one by one and not always on time. Now everything is ready exactly when it should be. This gives the team time to focus on the human side of onboarding, sending greetings and welcoming the new colleague into the kitchen.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764601063826/78e72a8a-e879-4132-9639-9dfa6424b4e9.webp" alt class="image--center mx-auto" /></p>
<p><strong>Screenshot: Slack channel automation</strong></p>
<h2 id="heading-7-engineers-get-github-access-without-delay"><strong>7. Engineers get GitHub access without delay</strong></h2>
<p>For engineers, the workflow includes one more useful step. It asks them for their GitHub username directly inside Slack. They respond, and n8n updates our GitHub Terraform repo and gives them the correct access. There are no delays and no forgotten requests. It is fast and reliable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764601074244/182de229-8404-4adf-bc52-740a17578322.webp" alt class="image--center mx-auto" /></p>
<p><strong>Screenshot: GitHub automation</strong></p>
<h2 id="heading-8-peopleops-finishes-with-a-human-touch"><strong>8. PeopleOps finishes with a human touch</strong></h2>
<p>Even with all this automation, we still keep an important human moment at the end. PeopleOps receives a review form that covers hardware, internal tools, and a few things that still need personal attention. This keeps the balance right. Automation takes care of the predictable work, and people focus on the meaningful parts of onboarding.</p>
<h2 id="heading-the-result-feels-peaceful-simple-and-very-datachef"><strong>The result feels peaceful, simple, and very DataChef!</strong></h2>
<p>What used to be a long list of steps is now a calm and clear experience. New colleagues feel supported from the first moment. The team feels more relaxed. PeopleOps has time for the warm, personal parts of onboarding instead of repeating small tasks. Onboarding used to be something we had to manage carefully. Now it is something we enjoy watching. It feels like the company itself reaches out and says, “Welcome to the kitchen 🧑‍🍳, everything is ready for you!”</p>
]]></content:encoded></item><item><title><![CDATA[Fast Flow Conf 2025: Team Topologies and the Language of Flow]]></title><description><![CDATA[Consider this paradox for a second: imagine studying grammar for a language that doesn’t exist. You know all the rules in theory, yet you can’t communicate with anyone. After attending Fast Flow Conference in London earlier this week, that metaphor c...]]></description><link>https://blog.datachef.co/fast-flow-conf-2025-team-topologies-language-of-flow</link><guid isPermaLink="true">https://blog.datachef.co/fast-flow-conf-2025-team-topologies-language-of-flow</guid><category><![CDATA[team topologies]]></category><category><![CDATA[flow]]></category><category><![CDATA[team collaboration]]></category><category><![CDATA[organization]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Fri, 17 Oct 2025 12:22:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/BVyNlchWqzs/upload/502fa02008d2832471fc93b01d640c8f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Consider this paradox for a second: imagine studying grammar for a language that doesn’t exist. You know all the rules in theory, yet you can’t communicate with anyone. After attending Fast Flow Conference in London earlier this week, that metaphor came to my mind for those organizations that try to introduce Team Topologies as a silver bullet without fully assessing the sociotechnical system around them.</p>
<p>Leaders with reductionist tendencies are always looking for a formula, especially in the world of technology teams. Some of them may have thought they found one in the rigid, comforting set of options in the first book published by Skelton and Pais — a grammar, indeed, to label teams and their interactions.</p>
<p>Team Topologies practitioners use those team types and interaction modes both in a descriptive way (to depict an organization’s current state) and in a prescriptive way (to design its intended state). But if those reductionist leaders had been in the room with us at Fast Flow Conference, they would have realized that such patterns were never meant to be used in isolation.</p>
<p>Over the years, the Team Topologies community has expanded its toolkit. It now includes old and new methodologies, critical to develop a much deeper understanding of the dynamics of how organizations evolve over time. I call this toolkit the <strong>Language of Flow</strong>. Team types and interaction modes constitute the grammar of that language.</p>
<p>Over two days of exciting talks, I heard success stories highlighting the importance of EventStorming, Wardley mapping, value stream mapping, and systems thinking — a set of collaborative techniques that complement each other by putting humans at the center of the stage. I saw a strong emphasis on the role of enabling teams and on the principle of sharing knowledge across silos. I attended workshops where platform engineers were given discovery tools drawn directly from the best product management practices. And I saw two key figures in the Team Topologies world, Matthew Skelton and Joao Rosa, delivering talks that broadened everyone’s perspective by zooming out from the narrow focus on engineering teams to discuss strategic topics such as budgeting, economies of scale vs economies of ‘empowerment’, compliance, and more.</p>
<p>None of that felt forced or like a hard sell, and that’s precisely because all the speakers share the same Language of Flow, a toolkit that can reshape organizations by thinking human-first. The original patterns of Team Topologies are a core part of that language, acting almost like an orchestrator, but they lose their transformative power when they are introduced in an organization that is not fluent in the Language of Flow.</p>
<p>If you thought applying the grammar rules to your teams was enough to change how your organization works, you've missed the bigger picture. You need to introduce the Language of Flow as a whole.</p>
<p>DataChef has a proven history of being an agent of change in companies such as PostNL, CarNext, Rituals, and Allseas. Reach out if you need to accelerate your journey.</p>
]]></content:encoded></item><item><title><![CDATA[Agent/Non-Agent based monitoring & Distributed tracing]]></title><description><![CDATA[Introduction
When monitoring applications and infrastructure, businesses usually choose between agent-based and non-agent based monitoring solutions. Some tools, such as Datadog or Splunk, can use agent-based approaches, while both also support non-a...]]></description><link>https://blog.datachef.co/agentnon-agent-based-monitoring-and-distributed-tracing</link><guid isPermaLink="true">https://blog.datachef.co/agentnon-agent-based-monitoring-and-distributed-tracing</guid><category><![CDATA[monitoring]]></category><category><![CDATA[distributed tracing]]></category><dc:creator><![CDATA[Farbod Ahmadian]]></dc:creator><pubDate>Wed, 20 Aug 2025 12:40:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754389379096/b4b1a08c-2606-4d9d-892e-8e65f41a75fd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>When monitoring applications and infrastructure, businesses usually choose between agent-based and non-agent-based monitoring solutions. Tools such as Datadog and Splunk support both approaches. In this article, we’ll use Datadog as an example of agent-based monitoring and Splunk as an example of non-agent-based monitoring, to show the differences in practice. Each approach has its strengths and challenges, depending on your needs, environments, and monitoring goals.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Disclaimer:</strong> Both Datadog and Splunk support <strong>agent-based</strong> and <strong>non-agent-based</strong> monitoring approaches. The examples in this article use Datadog primarily to illustrate agent-based monitoring and Splunk to illustrate non-agent-based monitoring. This is for clarity in comparing the concepts, not to suggest that one tool is limited to only one method.</div>
</div>

<h2 id="heading-agent-based-monitoring">Agent-Based Monitoring</h2>
<p><strong>Agent-based monitoring</strong> means installing a small program (an <em>agent</em>) on each server or host that you want to monitor.</p>
<h3 id="heading-how-it-works">How it works</h3>
<ul>
<li><p>An agent runs continuously on the host machine.</p>
</li>
<li><p>It collects metrics like CPU, memory, disk usage, application logs, and more.</p>
</li>
<li><p>Data is sent to a central server (like DataDog's cloud) for analysis.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datadog <span class="hljs-keyword">import</span> initialize, statsd

<span class="hljs-comment"># Initialize (if sending directly to local agent, defaults are ok)</span>
options = {<span class="hljs-string">"statsd_host"</span>:<span class="hljs-string">"localhost"</span>, <span class="hljs-string">"statsd_port"</span>:<span class="hljs-number">8125</span>}
initialize(**options)

<span class="hljs-comment"># Send a gauge metric (e.g., app.queue.size=5)</span>
statsd.gauge(<span class="hljs-string">"app.queue.size"</span>, <span class="hljs-number">5</span>)
</code></pre>
<h3 id="heading-pros">Pros</h3>
<ul>
<li><p><strong>Rich Data Collection:</strong> Agents can collect detailed performance metrics and logs directly from the machine or container.</p>
</li>
<li><p><strong>Real-Time Monitoring:</strong> Since agents run locally, they can send data almost instantly.</p>
</li>
<li><p><strong>Custom Checks:</strong> You can configure agents to do extra checks, like custom scripts.</p>
</li>
<li><p><strong>Easy Auto-Discovery:</strong> Agents can sometimes auto-detect services and start monitoring them automatically.</p>
</li>
</ul>
<h3 id="heading-cons">Cons</h3>
<ul>
<li><p><strong>Deployment and Maintenance:</strong> Every machine or container needs an agent installed and kept up to date.</p>
</li>
<li><p><strong>Resource Usage:</strong> Agents use a bit of the host’s CPU and memory, though this is usually small.</p>
</li>
<li><p><strong>Compatibility:</strong> Some environments (like highly restricted or legacy systems) may not allow agent installation.</p>
</li>
</ul>
<h2 id="heading-non-agent-based-monitoring">Non-Agent Based Monitoring</h2>
<p><strong>Non-agent based monitoring</strong> collects data without installing anything on the host. A central system pulls in data, often by receiving logs or metrics through APIs, syslog, or other protocols.</p>
<h3 id="heading-how-it-works-1">How it works</h3>
<ul>
<li><p>Systems send their log files, events, or performance data to a central collector (like Splunk).</p>
</li>
<li><p>No agents run on the host; configuration often happens on log shippers or using built-in system protocols.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> json

splunk_url = <span class="hljs-string">"https://splunk-server:8088/services/collector"</span>
headers = {
    <span class="hljs-string">"Authorization"</span>: <span class="hljs-string">"Splunk YOUR_HEC_TOKEN"</span>
}
data = {
    <span class="hljs-string">"event"</span>: <span class="hljs-string">"metric"</span>,  <span class="hljs-comment"># event type</span>
    <span class="hljs-string">"fields"</span>: {
        <span class="hljs-string">"metric_name:app.queue.size"</span>: <span class="hljs-number">5</span>
    }
}

<span class="hljs-comment"># Send the data to Splunk</span>
requests.post(splunk_url, headers=headers, data=json.dumps(data), verify=<span class="hljs-literal">False</span>)
</code></pre>
<h3 id="heading-pros-1">Pros</h3>
<ul>
<li><p><strong>No Agent Management:</strong> Nothing to install or update on the monitored systems.</p>
</li>
<li><p><strong>Good for Legacy/Restricted Systems:</strong> Useful where you cannot install extra software.</p>
</li>
<li><p><strong>Centralized Control:</strong> All settings and updates occur on the collector’s side.</p>
</li>
</ul>
<h3 id="heading-cons-1">Cons</h3>
<ul>
<li><p><strong>Limited Data:</strong> Sometimes, only basic metrics or logs are available unless the system supports rich exports.</p>
</li>
<li><p><strong>Slower or Batch Updates:</strong> Data may arrive in batches, so insights may lag behind real-time.</p>
</li>
<li><p><strong>Harder Customization:</strong> Custom health checks or metrics are harder to set up.</p>
</li>
</ul>
<h2 id="heading-distributed-tracing"><strong>Distributed Tracing</strong></h2>
<h3 id="heading-what-is-distributed-tracing">What is distributed tracing?</h3>
<p><strong>Distributed tracing</strong> helps developers follow the journey of a user or API request as it moves through different services in a system. Each step the request takes is called a “span.” Each span gets a <em>span ID</em> (unique identifier).</p>
<h3 id="heading-where-is-it-used">Where is it used?</h3>
<ul>
<li><p><strong>Microservices:</strong> When applications have many small services talking to each other.</p>
</li>
<li><p><strong>Serverless and Cloud-Native Apps:</strong> Where requests touch multiple services.</p>
</li>
<li><p><strong>Debugging Performance Issues:</strong> To find bottlenecks in big, complex systems.</p>
</li>
</ul>
<h3 id="heading-how-is-it-used">How is it used?</h3>
<ul>
<li><p>When a request starts, a trace ID and a span ID are created.</p>
</li>
<li><p>As the request travels, new spans and IDs are made for each new service or step.</p>
</li>
<li><p>All the spans are grouped together under the single trace ID.</p>
</li>
<li><p>This lets you see the entire path—the trace—of a request, how long each step takes, and where failures happen.</p>
</li>
</ul>
<h3 id="heading-span-id">Span ID</h3>
<ul>
<li><p>Every operation in the trace has its own <em>span ID</em>.</p>
</li>
<li><p>The span ID helps to track and organize all steps within a single trace.</p>
</li>
<li><p>By analyzing span IDs, you can see how long each segment took and how services relate to each other.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754389681548/cb428bd5-6a96-4ec7-b9cf-5eb0ba55df3c.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-when-to-use-each-monitoring-solution">When to use each monitoring solution</h2>
<ul>
<li><p><strong>Visualization:</strong> DataDog has easy-to-use dashboards to view full traces and performance of each span.</p>
</li>
<li><p><strong>Friendly for Developers:</strong> Many programming languages and frameworks are supported with minimal manual work.</p>
</li>
<li><p><strong>Built-In Agent Support:</strong> DataDog’s agent natively collects distributed traces together with logs and metrics, which makes setup simpler and the resulting integration deeper.</p>
</li>
<li><p><strong>Real-Time Tracing:</strong> The agent sends tracing data in real time, which helps with quick debugging.</p>
</li>
<li><p><strong>Automatic Context Linking:</strong> Traces, logs, and metrics are tied together, making it easier to investigate issues.</p>
</li>
<li><p><strong>More Control</strong>: If you want more control over the OpenTelemetry abstraction, Splunk can be a better option, although the <a target="_blank" href="https://github.com/signalfx/splunk-otel-python#readme">OpenTelemetry library</a> is still under development.</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">✅</div>
<div data-node-type="callout-text">Whether agent-based or non-agent-based monitoring is a better fit depends on your environment, security requirements, and operational trade-offs.</div>
</div>

<p>By contrast, while Splunk can handle traces through external plugins or integrations, it often requires more manual setup and may not offer real-time or tightly integrated tracing experiences out-of-the-box.</p>
<p>Simple Python example of how distributed tracing can be done in DataDog:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> ddtrace <span class="hljs-keyword">import</span> tracer, patch_all

patch_all()

<span class="hljs-meta">@tracer.wrap()</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">say_hello</span>():</span>
    print(<span class="hljs-string">"Hello from DataDog tracing!"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    say_hello()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754392182257/2376096d-f182-45c7-a18e-d5dd5a129711.png" alt class="image--center mx-auto" /></p>
<p>Same example with Splunk:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> opentelemetry <span class="hljs-keyword">import</span> trace
<span class="hljs-keyword">from</span> opentelemetry.sdk.trace <span class="hljs-keyword">import</span> TracerProvider
<span class="hljs-keyword">from</span> opentelemetry.sdk.trace.export <span class="hljs-keyword">import</span> BatchSpanProcessor
<span class="hljs-keyword">from</span> opentelemetry.exporter.otlp.proto.http.trace_exporter <span class="hljs-keyword">import</span> OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(endpoint=<span class="hljs-string">"&lt;https://ingest.us1.signalfx.com/v2/trace&gt;"</span>, headers={
    <span class="hljs-string">"X-SF-TOKEN"</span>: <span class="hljs-string">"your-splunk-access-token"</span>
})

trace.set_tracer_provider(TracerProvider())
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

tracer = trace.get_tracer(__name__)

<span class="hljs-keyword">with</span> tracer.start_as_current_span(<span class="hljs-string">"splunk-example-span"</span>):
    print(<span class="hljs-string">"Hello from Splunk tracing!"</span>)
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<ul>
<li><p>Both Datadog and Splunk offer agent-based and agentless options. The choice is less about the tool and more about your monitoring strategy: agent-based for richer, real-time detail, and non-agent for environments where agents aren’t possible.</p>
</li>
<li><p>For distributed tracing, Datadog offers strong out-of-the-box support, while Splunk emphasizes flexibility and standards like OpenTelemetry.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Why your LLM needs its own version of Monitoring]]></title><description><![CDATA[TL;DR: LLMs don’t behave like normal services. They’re creative, a bit moody, and change under the hood without warning. Left alone, they surprise you at the worst time. You need LLMOps Monitoring: evals to measure quality, tracing to debug runs, and...]]></description><link>https://blog.datachef.co/why-llm-monitoring-llmops</link><guid isPermaLink="true">https://blog.datachef.co/why-llm-monitoring-llmops</guid><category><![CDATA[llm]]></category><category><![CDATA[#llmops]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Ali Yazdizadeh]]></dc:creator><pubDate>Fri, 15 Aug 2025 12:42:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755261654052/4bac505e-f76d-405b-a9f1-a616654784a8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> LLMs don’t behave like normal services. They’re creative, a bit moody, and change under the hood without warning. Left alone, they surprise you at the worst time. You need <strong>LLMOps Monitoring</strong>: evals to measure quality, tracing to debug runs, and clear views on cost and latency. Do this and you stop learning about problems from angry users.</p>
<p>A web API is (mostly) predictable. An LLM is not. Tiny changes in prompt, model version, or context produce different outcomes. Vendors swap model weights. Your knowledge base drifts. Here are some examples of how things can go wrong with LLMs in spectacular ways!</p>
<p><strong>“The Friday Rollout That Looked Fine in Staging”  
</strong>You upgrade a prompt and bump the temperature from 0.2 → 0.4 to make your assistant feel friendlier. Unit tests still pass. By Monday, support volume is up 40% because:</p>
<ul>
<li><p>The assistant started adding confident-but-wrong “extra context.”</p>
</li>
<li><p>Token counts rose ~25% due to chattier answers.</p>
</li>
<li><p>Latency crossed your SLO during peak hours.</p>
</li>
</ul>
<p>Here’s another classic:</p>
<p><strong>“Temporally Confused Bot”</strong></p>
<p>Your chatbot answers tax questions. A user asks: “What’s the current VAT rate in the UK?” The model reads an old PDF from 2019 and replies with a past rate. The user posts the wrong answer on social. Support wakes you up. Fun times.</p>
<p>But how do you avoid the “user report → war room → roll back” drama?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755261423288/ada530b7-8ac7-409a-a2b4-8f7b183639f3.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-evals-measure-quality-in-plain-numbers"><strong>Evals: measure quality in plain numbers</strong></h2>
<p>Evals turn “feels right” into metrics you can track. Do both <strong>offline</strong> (pre-deployment) and <strong>online</strong> (shadow tests or sampled live traffic).</p>
<p><strong>Factual Accuracy  
</strong>Test if answers match your provided docs. Build small Q&amp;A sets per collection and score with a simple rubric: “fully supported,” “partially,” “not supported.” Log which passages the model cites.</p>
<p><strong>Temporal Accuracy  
</strong>Test if the model gives the <em>current</em> truth. Tag questions as time-sensitive. Keep a small table of ground truth with dates (e.g., “VAT = 20% as of 2025-01-01”). If the answer is old or hedges, it fails.</p>
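<p>A minimal sketch of such a check, with a placeholder ground-truth table rather than real tax data, can be a few lines of Python:</p>
<pre><code class="language-python">from datetime import date

# Placeholder ground truth: fact maps to (current value, effective date)
GROUND_TRUTH = {
    "uk_vat_rate": ("20%", date(2011, 1, 4)),
}

def temporal_eval(fact: str, answer: str) -> bool:
    """Pass only if the answer states the currently effective value."""
    current_value, _effective_since = GROUND_TRUTH[fact]
    return current_value in answer

assert temporal_eval("uk_vat_rate", "The current UK VAT rate is 20%.")
assert not temporal_eval("uk_vat_rate", "The UK VAT rate is 17.5%.")
</code></pre>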
<p>Here are some other evals that can be useful:</p>
<ul>
<li><p><strong>Format Validity:</strong> Is the JSON valid? Does it match your schema? (See the sketch after this list.)</p>
</li>
<li><p><strong>Safety &amp; PII</strong>: Refuses risky requests; redacts emails/IDs when required.</p>
</li>
<li><p><strong>RAG Faithfulness:</strong> Is the answer supported by retrieved text? Penalize made-up facts.</p>
</li>
</ul>
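<p>As an example, the format-validity check can be done with the standard library alone; the required keys below are an assumed output schema, not a fixed standard:</p>
<pre><code class="language-python">import json

REQUIRED_KEYS = {"answer", "sources", "confidence"}  # assumed output schema

def format_valid(raw_output: str) -> bool:
    """Check that the model returned parseable JSON with the expected keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)

assert format_valid('{"answer": "42", "sources": [], "confidence": 0.9}')
assert not format_valid("The answer is 42.")
</code></pre>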
<h2 id="heading-tracing"><strong>Tracing:</strong></h2>
<ul>
<li><h2 id="heading-make-every-run-debuggable-logging"><strong>Make every run debuggable (Logging)</strong></h2>
</li>
</ul>
<p>When something goes odd, you need a clear trail from input to output. This goes beyond classic logging, since LLM runs add new requirements, such as recording which model configuration was used.</p>
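<p>Before the full checklist, here is a rough sketch of what one request’s record could look like as a plain Python dict. Every field name and value is illustrative, not taken from any particular tool:</p>
<pre><code class="language-python">trace_record = {
    "prompt_version": "support-answer-v12",  # which prompt template ran
    "model": {"name": "example-model", "temperature": 0.2, "max_tokens": 512},
    "retrieval": {"query": "uk vat rate", "top_k": 5, "chunk_ids": ["kb-041"]},
    "tool_calls": [{"name": "tax_lookup", "args": {"country": "UK"}, "error": None}],
    "guardrails": {"pii_check": "passed"},
    "tokens": {"prompt": 1420, "completion": 210},
    "latency_ms": {"retrieval": 80, "model": 950, "total": 1100},
    "cache_hit": False,
}
</code></pre>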
<p>Here is a suggested list of what to record on every request:</p>
<ul>
<li><p>Prompt version and all variables filled in</p>
</li>
<li><p>Model name, temperature, max tokens</p>
</li>
<li><p>Retrieval query, top-K, and the <strong>actual</strong> chunks shown to the model</p>
</li>
<li><p>Every tool call: name, args, result, errors</p>
</li>
<li><p>Guardrail checks and why they passed/failed</p>
</li>
<li><p>Token counts (prompt, completion)</p>
</li>
<li><p>Latency per step and end-to-end</p>
</li>
<li><p>Caching, retries, and fallbacks taken</p>
</li>
<li><h2 id="heading-no-surprises-on-the-bill-cost-monitoring"><strong>No surprises on the bill (Cost Monitoring)</strong></h2>
</li>
</ul>
<p>LLM cost is mostly tokens and tool calls. It can spike fast.</p>
<p>Track spend by <strong>model, endpoint, feature, team, and user</strong>. Watch:</p>
<ul>
<li><p>Tokens per request, and per successful task</p>
</li>
<li><p>Cost per 1K requests and per solved ticket</p>
</li>
<li><p>Tool call density (some chains spam tools)</p>
</li>
<li><p>Cache hit rate (misses are expensive)</p>
</li>
</ul>
<p>Add budget alerts and soft limits. Route simple tasks to a smaller model. Use a “tiny prefilter → big model on hard cases” path. Compress context, prune long histories, and store reusable summaries.</p>
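<p>The arithmetic itself is trivial once you log token counts per request; the per-1K prices below are placeholders for your provider’s actual rates:</p>
<pre><code class="language-python"># Placeholder prices per 1K tokens; substitute your provider's actual rates
PRICE_PER_1K = {"prompt": 0.0025, "completion": 0.0100}

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request, from the token counts in its trace."""
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] + (
        completion_tokens / 1000
    ) * PRICE_PER_1K["completion"]

# 1,420 prompt tokens and 210 completion tokens cost about $0.0057 here
print(round(request_cost(1420, 210), 4))
</code></pre>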
<h2 id="heading-fast-models-feel-smarter-latency"><strong>Fast models feel smarter (Latency)</strong></h2>
<p>Speed is part of quality. Users judge the answer and the wait.</p>
<p>Measure latency at each hop:</p>
<ul>
<li><p>Retrieval time</p>
</li>
<li><p>Model time</p>
</li>
<li><p>Tool time (each call)</p>
</li>
<li><p>End-to-end time, with p50/p95/p99</p>
</li>
</ul>
<p>Tune by caching hot results, prefetching facts, lowering top-K, streaming tokens to the UI, and moving slow tools off the critical path. Set alerts on p95/p99, not just averages.</p>
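<p>Percentiles are cheap to compute from the latencies you already log. A standard-library sketch, with made-up sample data:</p>
<pre><code class="language-python">from statistics import quantiles

# Sample end-to-end latencies in milliseconds, as pulled from your traces
latencies_ms = [120, 135, 140, 150, 160, 165, 170, 180, 900, 2500]

def percentile(values, p):
    """p-th percentile via 100 quantile cut points (p between 1 and 99)."""
    return quantiles(values, n=100)[p - 1]

# The average hides the slow tail; p95 and p99 expose it
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
</code></pre>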
<h2 id="heading-tooling-options"><strong>Tooling Options:</strong></h2>
<p>You can build from scratch, but most teams mix in ready tools for LLMOps. Here are some popular picks:</p>
<table><tbody><tr><td><p><strong>Tool</strong></p></td><td><p><strong>OSS</strong></p></td><td><p><strong>Focus</strong></p></td><td><p><strong>Hosting</strong></p></td><td><p><strong>Why teams pick it</strong></p></td></tr><tr><td><p><strong>Langfuse</strong></p></td><td><p><strong>Yes</strong></p></td><td><p><strong>Eval + Tracing</strong></p></td><td><p>Self-host &amp; Cloud</p></td><td><p>Popular, sharp trace UI, good SDKs, easy prompt/version tracking.</p></td></tr><tr><td><p><strong>Helicone</strong></p></td><td><p><strong>Yes</strong></p></td><td><p><strong>Eval + Tracing</strong></p></td><td><p>Self-host &amp; Cloud</p></td><td><p>Proxy-style drop-in; strong cost/latency views across providers.</p></td></tr><tr><td><p><strong>Humanloop</strong></p></td><td><p><strong>Yes</strong></p></td><td><p><strong>Eval-centric</strong></p></td><td><p>Self-host &amp; Cloud</p></td><td><p>Great for dataset curation, rubric design, human review loops.</p></td></tr><tr><td><p><strong>LangSmith</strong></p></td><td><p>Not fully OSS</p></td><td><p><strong>Eval + Tracing</strong></p></td><td><p>Cloud (plus enterprise options)</p></td><td><p>Deep integration with LangChain pipelines and tools.</p></td></tr></tbody></table>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755261245921/eb3516e1-f814-4aa0-97e9-c79c55291ae0.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Tip: Most teams start with <strong>Langfuse</strong> (or <strong>Helicone</strong>) for traces + basic evals, add <strong>Humanloop</strong> for richer human-in-the-loop workflows, and integrate <strong>LangSmith</strong> if they’re already heavy on LangChain.</p>
</blockquote>
<h2 id="heading-wrap-up"><strong>Wrap-up</strong></h2>
<p>LLMs are powerful, but they wander. LLMOps Monitoring keeps them in bounds. With evals, tracing, and clear cost/latency views, you find issues before your users do. Start small, measure what matters, and keep your model on a short, friendly leash.</p>
]]></content:encoded></item><item><title><![CDATA[Designing the "Blank Slate" of an organization]]></title><description><![CDATA[The first time you open an application you just installed, you might see several frames and components that look empty. These will fill up with your data as you start using the application. Designers call this initial state a "Blank Slate."
It's now ...]]></description><link>https://blog.datachef.co/designing-the-blank-slate-organization</link><guid isPermaLink="true">https://blog.datachef.co/designing-the-blank-slate-organization</guid><category><![CDATA[organization]]></category><category><![CDATA[leadership]]></category><category><![CDATA[management]]></category><category><![CDATA[transformation]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Mon, 14 Jul 2025 14:45:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/x9WZ3aMFKME/upload/4a1f4e09c447b259e071d555f8ee4ed0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first time you open an application you just installed, you might see several frames and components that look empty. These will fill up with your data as you start using the application. Designers call this initial state a "Blank Slate."</p>
<p>It's now widely recognized that neglecting to design the blank slate is a big mistake because it's the user's <strong>first impression</strong> of your product. You might have the best user interface when the app is filled with data, but users may never reach that stage if it's not clear how to interact with the application when it's empty.</p>
<p>This is a fascinating concept, and as I was reflecting on how organizations evolve, I thought that many business leaders could take a page from the designer’s book about the importance of “blank slates” in their plans. Sure, when you step into a new role, tasked with transforming a department, it’s crucial to have a vision for how teams will interact in the future, which roles to introduce, which profiles to hire, and who will be responsible for what. However, a leader should not bypass the design of the “blank slate”, as in, the snapshot of what the organization looks like at the beginning of this transformation.</p>
<p>Too often, leaders can’t resist the temptation of dictating their vision from an ivory tower, while hiring a layer of smart middle-managers and delegating the execution of the plan to them. But <strong>even the most brilliant hires will get lost in the mess</strong> if they are shown a picture that only exists in the head of their enlightened boss. Imagine welcoming a new engineer into a maze of legacy services nobody remembers how to sunset. She receives a visionary slide describing a “data mesh” utopia, yet her onboarding checklist is full of monoliths that refuse to die.</p>
<p>In the gulf between present reality and promised future, talent grows frustrated, momentum stalls, and <strong>politics</strong> fill the gaps. Legacy services keep getting more and more consumers without a clear horizon for deprecation. Teams and people who struggle to see themselves in the future picture fight to stay relevant, trying to expand their area of influence until someone with a mandate tells them to stop.</p>
<p><strong>Designing that organizational blank slate means scripting the first moves, not just day‑three‑hundred outcomes.</strong> Leaders should map out temporary but explicit ownership zones, identify what must be kept running at all costs, and label the placeholders that are expected to disappear. Much like the “Add your first project” card in an empty dashboard, these markers tell every newcomer, “Here’s where to start making an impact, and here’s when this chunk of work should gracefully sunset.”</p>
]]></content:encoded></item><item><title><![CDATA[Pandas vs. PySpark: When Bigger Isn’t Always Better]]></title><description><![CDATA[Introduction
In the data engineering world, bigger usually means better. More power, more scalability, and more buzzwords! PySpark, for instance, is typically our trusty steed when tackling massive da]]></description><link>https://blog.datachef.co/pandas-vs-pyspark-when-bigger-isnt-always-better</link><guid isPermaLink="true">https://blog.datachef.co/pandas-vs-pyspark-when-bigger-isnt-always-better</guid><category><![CDATA[spark]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[pandas]]></category><dc:creator><![CDATA[Soheil Sheybani]]></dc:creator><pubDate>Wed, 26 Mar 2025 15:23:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742904506527/2b19daa5-e500-453c-b343-14fbaed0109a.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2>
<p>In the data engineering world, bigger usually means better. More power, more scalability, and more buzzwords! PySpark, for instance, is typically our trusty steed when tackling massive datasets on distributed clusters like Databricks. But what if, in our race to scale everything to infinity, we've overlooked simpler, faster solutions?</p>
<p>Here's a story where simplicity won big—and PySpark learned a lesson in humility.</p>
<hr />
<h2>The Great Excel Report Battle</h2>
<p>I was tasked with creating an Excel report based on a moderately sized dataset—around 100,000 rows. My initial weapon of choice? PySpark DataFrames running in Databricks. After all, PySpark thrives on handling large datasets, and our environment was already set up. Easy choice, right?</p>
<p>Well, not exactly.</p>
<p>The report involved filtering data by certificate types, applying various transformations and sorting criteria, and finally writing each filtered set to separate Excel sheets (132 sheets in total). Here's a glimpse of the PySpark code I confidently started with:</p>
<pre><code class="language-python">def write_dataframe_to_excel(writer, df, certificate):
    df = df.withColumn(
        "Status Priority",
        F.when(F.col("Status") == "Non-Comply", 0)
        .when(F.col("Status") == "Warning", 1)
        .when(F.col("Status") == "In-Future", 2)
        .otherwise(3)
    ).withColumn(
        "Sorted Expiry Date",
        F.coalesce(
            F.to_date(F.col("Certificate Expiry Date"), "dd/MM/yyyy"),
            F.to_date(F.lit("01/01/1900"), "dd/MM/yyyy")
        )
    ).orderBy("Status Priority", "Sorted Expiry Date")
    df = df.drop("certificate_code", "certificate_name", "Compliance Status", "Status Priority", "Sorted Expiry Date")
    df.toPandas().to_excel(writer, index=False, sheet_name=certificate, startrow=2)

with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
    for certificate in main_certificate_codes:
        df = compliance_df.filter(F.col("certificate_code") == certificate)
        if certificate in ["BOSIET", "HUET"]:
            df = add_caebs_columns(df)
        df = df.filter(F.col("Compliance Status").isin(["Non-Comply", "Warning", "In-Future"]))
        write_dataframe_to_excel(writer, df, certificate)
</code></pre>
<p>Initially, it looked perfect. But execution took nearly 30 minutes! How was PySpark, the champion of big data, struggling with a modest Excel report?</p>
<h2>Enter Pandas: The David to PySpark’s Goliath</h2>
<p>I started suspecting that I might be over-engineering my solution. After all, was distributed computing even necessary for a dataset of just 100,000 rows?</p>
<p>So, I decided to rewrite the logic using Pandas DataFrames, bypassing PySpark’s overhead entirely. The results were nothing short of astonishing:</p>
<pre><code class="language-python">def write_pd_dataframe_to_excel(writer, df, certificate):
    df["Status Priority"] = df["Status"].map({
        "Non-Comply": 0,
        "Warning": 1,
        "In-Future": 2
    }).fillna(3)
    df["Sorted Expiry Date"] = pd.to_datetime(
        df["Certificate Expiry Date"], 
        format="%d/%m/%Y", 
        errors="coerce"
    ).fillna(pd.to_datetime("1900-01-01"))
    df = df.sort_values(by=["Status Priority", "Sorted Expiry Date"])
    df.drop(columns=[
        "certificate_code", "certificate_name", "Compliance Status", "Status Priority", "Sorted Expiry Date"
    ], inplace=True)
    df.to_excel(writer, index=False, sheet_name=certificate, startrow=2)

compliance_df = add_caebs_columns(compliance_df)
compliance_df = compliance_df.filter(
    F.col("Compliance Status").isin(["Non-Comply", "Warning", "In-Future"])
)
pd_df = compliance_df.toPandas()
pd_df["PIN"] = pd.to_numeric(pd_df["PIN"], errors="coerce")
with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
    for certificate in main_certificate_codes:
        df = pd_df[pd_df["certificate_code"] == certificate]
        if certificate not in ["BOSIET", "HUET"]:
            df = df.drop(columns=["CAEBS Issue Date", "CAEBS Expiry Date"])
        write_pd_dataframe_to_excel(writer, df, certificate)
</code></pre>
<p>This Pandas solution reduced execution time from 30 minutes to just 2 minutes! No distributed cluster, no overhead, just straightforward Python and Pandas magic.</p>
<hr />
<h2>Why Pandas Crushed PySpark in This Case</h2>
<p>Why did Pandas outperform PySpark by such a massive margin? The key lies in the nature of the data and the overhead associated with distributed computing.</p>
<ul>
<li><p><strong>Overhead of Distributed Processing</strong>: PySpark is designed for distributed computing across multiple nodes, involving tasks like data serialization/deserialization, task distribution, and network communication. For smaller datasets like ours, this overhead becomes a significant burden, drastically affecting performance (see the timing sketch after this list).</p>
</li>
<li><p><strong>Absence of JVM Overhead</strong>: PySpark's engine runs on the Java Virtual Machine (JVM), so data and commands must cross the Python-JVM boundary, which adds serialization overhead and complexity. Pandas stays entirely in the Python process and avoids this round trip.</p>
</li>
<li><p><strong>In-Memory Operations</strong>: Pandas operates entirely within memory on a single node, making computations significantly faster.</p>
</li>
<li><p><strong>Data Locality</strong>: With all the data on a single machine, Pandas avoids the shuffling and network latency that come with PySpark's distributed design, which speeds up processing significantly.</p>
</li>
</ul>
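<p>To make the overhead point concrete, here is a minimal, hypothetical timing sketch (the data, column names, and local Spark session are illustrative assumptions, and absolute numbers will vary with your environment):</p>
<pre><code class="language-python">import time

import pandas as pd
from pyspark.sql import SparkSession

# Hypothetical 100,000-row dataset, roughly the size from the story above.
pdf = pd.DataFrame({
    "certificate_code": ["BOSIET"] * 50_000 + ["HUET"] * 50_000,
    "value": range(100_000),
})

t0 = time.perf_counter()
pdf.groupby("certificate_code")["value"].sum()
print(f"Pandas:  {time.perf_counter() - t0:.3f}s")

spark = SparkSession.builder.master("local[*]").getOrCreate()
sdf = spark.createDataFrame(pdf)

t0 = time.perf_counter()
sdf.groupBy("certificate_code").sum("value").collect()  # collect() forces execution
print(f"PySpark: {time.perf_counter() - t0:.3f}s")
</code></pre>
<p>Even on a powerful machine, the Spark run pays for session startup, query planning, and serialization before it touches the data; for a few hundred thousand rows, that fixed cost dominates.</p>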
<hr />
<h2>Has Pandas Been Abused In Data Engineering?</h2>
<p>Well, no! However, while Pandas won by a wide margin in this scenario, it isn't always the optimal solution. Pandas has some breaking points that might make you reach for other players like Polars or PySpark:</p>
<ul>
<li><strong>Memory Constraints</strong>:  Pandas loads all data into RAM. If your dataset size exceeds available memory, Pandas can break or severely slow down performance.</li>
<li><strong>Complex Joins on Large Data</strong>: Performing complex joins or operations on large datasets might exceed Pandas' capabilities, causing performance degradation.</li>
<li><strong>Parallel and Distributed Processing</strong>: Pandas is single-threaded, and if parallel or distributed computation is necessary (like ETL on massive datasets), it doesn't scale well.</li>
<li><strong>Large-scale I/O Operations</strong>: Pandas struggles with massive read/write operations, particularly from distributed sources or formats like Parquet, Avro, or ORC.</li>
<li><strong>Real-time Data Processing</strong>: Pandas is not optimized for streaming or real-time processing, where PySpark excels.</li>
<li><strong>Columnar Optimization</strong>: Unlike Polars or PySpark, Pandas doesn't have built-in support for optimized columnar operations, limiting speed in certain operations. If you have lots of columnar operations on a large dataset, Pandas is not a good choice.</li>
</ul>
<hr />
<h2>Lessons Learned: Choose Your Tools Wisely</h2>
<p>This isn’t a "Pandas always beats PySpark on small datasets" scenario—it’s about picking the right tool for the job. Before going for a tool, you need to consider these factors holistically:</p>
<ul>
<li><strong>Data Volume</strong>: Does your dataset comfortably fit in RAM?</li>
<li><strong>Computation Type</strong>: Are you performing complex transformations or simple operations?</li>
<li><strong>Latency Requirements</strong>: Is real-time or near-real-time performance critical?</li>
<li><strong>Infrastructure Available</strong>: Do you have access to distributed computing resources?</li>
<li><strong>Complexity of Joins and Aggregations</strong>: Are complex operations frequent?</li>
</ul>
<p>Use Pandas for quick, ad-hoc tasks on small-to-medium datasets. Opt for Polars for medium-large datasets on a single powerful machine. Choose PySpark for distributed computing, large datasets, or real-time analytics.</p>
<p>Below is a quick comparison table:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Pandas</th>
<th>PySpark</th>
<th>Polars</th>
</tr>
</thead>
<tbody><tr>
<td>Memory Efficiency</td>
<td>Moderate</td>
<td>High</td>
<td>Very High</td>
</tr>
<tr>
<td>Parallelism</td>
<td>No</td>
<td>Distributed</td>
<td>Multi-threaded</td>
</tr>
<tr>
<td>Dataset Size Handling</td>
<td>Small-Medium</td>
<td>Large-Very Large</td>
<td>Medium-Large</td>
</tr>
<tr>
<td>I/O Performance</td>
<td>Moderate</td>
<td>High</td>
<td>High</td>
</tr>
<tr>
<td>Real-Time Processing</td>
<td>Poor</td>
<td>Excellent</td>
<td>Moderate</td>
</tr>
<tr>
<td>Ease of Use</td>
<td>Excellent</td>
<td>Moderate</td>
<td>Good</td>
</tr>
<tr>
<td>Columnar Optimization</td>
<td>Limited</td>
<td>Good</td>
<td>Excellent</td>
</tr>
</tbody></table>
<p>So, next time you're faced with a data engineering task, pause before automatically reaching for the biggest tool in your toolkit. Sometimes, the simplest approach is not just sufficient—it's superior!</p>
]]></content:encoded></item><item><title><![CDATA[Data Set or Data Product, That is the Question]]></title><description><![CDATA[Data is one of the most valuable assets an organization has, yet it is often treated as an IT byproduct rather than a strategic asset. Many companies collect a vast amount of data, store it in isolated systems, and expect that insights will somehow e...]]></description><link>https://blog.datachef.co/data-set-data-product-management</link><guid isPermaLink="true">https://blog.datachef.co/data-set-data-product-management</guid><category><![CDATA[data management]]></category><category><![CDATA[Data Mesh]]></category><category><![CDATA[data]]></category><category><![CDATA[Data Products]]></category><category><![CDATA[analytics]]></category><category><![CDATA[dataset]]></category><dc:creator><![CDATA[Davide Rovati]]></dc:creator><pubDate>Thu, 13 Feb 2025 09:08:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hpjSkU2UYSU/upload/de37295bb9824cb72fe5fd6ced981616.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data is one of the most valuable assets an organization has, yet it is often treated as an <strong>IT byproduct</strong> rather than a <strong>strategic asset</strong>. Many companies collect a vast amount of data, store it in isolated systems, and expect that insights will somehow emerge from these raw numbers.</p>
<p>Simply storing data isn’t enough. To truly unlock its value, organizations need to treat <strong>data as a product</strong>: something that is deliberately designed and made usable for specific business needs.</p>
<p>At first glance, this may sound like additional overhead. Another layer of governance? That’s going to slow our initiatives down! But in reality, the opposite is true. Consider the amount of time spent questioning data sets when they are dumped in a centralized system. How many times have you asked yourself questions like: Is it really the data set we need? Oh wait, this column actually means something else in our data model. The data in this table seems incomplete. I can’t understand what it represents. The schema changed again, now our system is broken in production…</p>
<p>Modern frameworks have proposed alternative ways of managing data sets. But it all stems from a different way of <strong>thinking</strong> about your data. In this post, we’ll break down the key concepts behind <strong>data as a product</strong>, the difference between <strong>data sets and data products</strong>, and how this shift improves <strong>usability, accessibility, and governance</strong>.</p>
<hr />
<h2 id="heading-understanding-the-basics-data-data-sets-and-products"><strong>Understanding the Basics: Data, Data Sets, and Products</strong></h2>
<p>Before diving into <strong>data as a product</strong>, let’s clarify some fundamental terms:</p>
<ul>
<li><p><strong>Data:</strong> The raw facts and figures collected by an organization—numbers, text, timestamps, sensor readings, etc.</p>
</li>
<li><p><strong>Data Set:</strong> A structured collection of related data points, typically stored in formats like <strong>tables, CSV files, or databases</strong>. Examples include a table of <strong>customer transactions, machine sensor logs, or financial records</strong>.</p>
</li>
<li><p><strong>Product:</strong> Something deliberately designed and built to provide value to a user, whether physical (a phone) or digital (a mobile app).</p>
</li>
<li><p><strong>Data Product:</strong> A combination of <strong>data set(s), domain model, and user experience,</strong> <strong>designed for a specific use case</strong>. Examples include a fraud detection model based on transaction data, a self-service analytics dashboard for sales teams, or an API that provides real-time customer insights.</p>
</li>
</ul>
<hr />
<h2 id="heading-data-set-vs-data-product-whats-the-difference"><strong>Data Set vs. Data Product: What’s the Difference?</strong></h2>
<p>Let’s start with an example and consider the sales data in an e-commerce business.</p>
<ul>
<li><p>A <strong>data set</strong> might contain <strong>raw customer transactions</strong> with thousands of line items. Without context, a user might struggle to extract insights. The transactions don’t serve any purpose by themselves: they are simply a collection of information.</p>
</li>
<li><p>A <strong>data product</strong> could be a <strong>"Monthly Sales Performance Dashboard"</strong>, which combines one or more data sets (the raw transactions), the domain model (aggregation into revenue trends, regional breakdowns, …) and the user experience (the graphic UI of the dashboard, its access policies, the ability to drill down in a report, …).</p>
</li>
</ul>
<p>A data set is a fundamental component of a data product. But a data product is much more than that. It must include:</p>
<p>✅ <strong>Defined ownership</strong> – Who is responsible for maintaining the data?</p>
<p>✅ <strong>Clear documentation</strong> – What does the data mean? How should it be used?</p>
<p>✅ <strong>Usability</strong> – Is the data structured in a way that users can access and understand?</p>
<p>✅ <strong>Ongoing updates &amp; maintenance</strong> – Is the data kept fresh and accurate?</p>
<h3 id="heading-business-domains-define-data-products"><strong>Business Domains Define Data Products</strong></h3>
<p>Organizations don’t need just data. They need <strong>data that serves a purpose</strong>. That’s why defining data products should always start from the <strong>business domain</strong>:</p>
<p>🔹 <strong>Finance teams</strong> may need a <strong>revenue forecasting data product</strong> based on historical transactions.</p>
<p>🔹 <strong>Marketing teams</strong> may need a <strong>customer segmentation data product</strong> that enriches demographic data with purchase behavior.</p>
<p>🔹 <strong>Operations teams</strong> may need a <strong>real-time logistics dashboard</strong> that tracks shipment statuses and delays.</p>
<p>By starting from business domain needs, companies can design data products that are immediately useful, rather than dumping raw data into a central repository and expecting teams to figure it out on their own.</p>
<hr />
<h2 id="heading-what-does-it-mean-to-treat-data-as-a-product"><strong>What Does It Mean to Treat Data as a Product?</strong></h2>
<p>A data product must be <strong>designed, maintained, improved, and eventually retired when no longer needed</strong>. This approach ensures that data remains valuable and does not become obsolete. The whole lifecycle should revolve around the following key principles that augment the world of “data sets” by introducing product thinking as a new dimension.</p>
<h3 id="heading-usability-amp-customer-centricity"><strong>Usability &amp; Customer-Centricity</strong></h3>
<p>🔹 <strong>Who needs this data? How will they use it?</strong></p>
<p>A well-designed data product is intuitive and built for the end user. This means:</p>
<p>✅ <strong>Well-documented definitions</strong> so users understand the data.</p>
<p>✅ <strong>Consistent structure and formatting</strong> to avoid confusion.</p>
<p>✅ <strong>Version-controlled updates</strong> to prevent disruptions in workflows.</p>
<p>Unlike traditional data management, which prioritizes storage and availability, a product mindset ensures that <strong>data is packaged for real-world usage</strong>, just like a consumer-facing app or tool.</p>
<h3 id="heading-accessibility"><strong>Accessibility</strong></h3>
<p>🔹 <strong>How easily can users find and interact with the data?</strong></p>
<p>Many organizations struggle with <strong>data locked away in silos</strong>, making it difficult for teams to access and use. A <strong>data product should be discoverable and well-integrated</strong>, for example it should have:</p>
<p>✅ <strong>Self-service data catalogs</strong> to remove unnecessary friction between teams and avoid questions such as “Could you tell me which data sets are available?”</p>
<p>✅ <strong>Role-based permissions</strong> to manage security and compliance.</p>
<p>✅ <strong>Standardized formats</strong> to allow smooth integration with other tools and platforms.</p>
<p>By ensuring <strong>discoverability and controlled access</strong>, data becomes a <strong>trusted, reusable resource</strong> instead of a hidden asset that requires manual extraction.</p>
<hr />
<h2 id="heading-pragmatic-governance-with-data-as-a-product"><strong>Pragmatic Governance with Data as a Product</strong></h2>
<p>One of the biggest challenges with traditional data management is poor governance—unclear ownership, inconsistencies, and compliance risks. Treating data as a product <strong>solves many governance issues by design</strong>.</p>
<h3 id="heading-1-clear-ownership-amp-accountability"><strong>1. Clear Ownership &amp; Accountability</strong></h3>
<ul>
<li><p>Every data product has a <strong>designated owner</strong> (often a business or data team) responsible for its accuracy and updates.</p>
</li>
<li><p>Unlike traditional IT-driven data models, ownership is <strong>distributed across business domains</strong>. This ensures, for example, that data issues are addressed by people who have a deep understanding of the business processes that generate the data itself.</p>
</li>
</ul>
<h3 id="heading-2-built-in-data-quality"><strong>2. Built-in Data Quality</strong></h3>
<ul>
<li><p>Errors and inconsistencies are addressed <strong>before reaching users</strong>, thanks to holistic observability.</p>
</li>
<li><p><strong>Data contracts</strong> ensure that data sources follow a standard schema and remain compatible across different systems (see the sketch after this list).</p>
</li>
</ul>
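<p>To make the data-contract idea concrete, here is a minimal sketch in plain Python (the field names and rules are hypothetical; in practice you would typically use a schema registry or a validation library):</p>
<pre><code class="language-python">from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TransactionContract:
    """Schema that producers of a (hypothetical) transactions data set agree to honor."""
    transaction_id: str
    amount_eur: float
    transaction_date: date

def validate_record(record: dict) -&gt; TransactionContract:
    # Fail fast on the producer side, before bad data reaches any consumer.
    return TransactionContract(
        transaction_id=str(record["transaction_id"]),
        amount_eur=float(record["amount_eur"]),
        transaction_date=date.fromisoformat(record["transaction_date"]),
    )

validate_record({
    "transaction_id": "t-1",
    "amount_eur": "19.99",
    "transaction_date": "2025-01-31",
})  # raises immediately if a field is missing or malformed
</code></pre>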
<h3 id="heading-3-stronger-security"><strong>3. Stronger Security</strong></h3>
<ul>
<li><p><strong>Access controls</strong> are built into data products, preventing unauthorized access while facilitating access for authorized users.</p>
</li>
<li><p>Automated <strong>audit logs</strong> track data usage, making compliance with GDPR, CCPA, and other regulations easier.</p>
</li>
</ul>
<p>📌 <strong>According to a 2024 Gartner survey (Evolution of Data Management), the top investment trends for the next three years are AI-ready Data Initiatives and Data quality &amp; Governance</strong>. Treating data as a product supports both: AI models require structured, well-maintained data, and governance naturally improves when ownership and usability are prioritized.</p>
<hr />
<h2 id="heading-data-as-a-product-is-a-transformation-not-just-a-definition"><strong>Data as a Product is a Transformation, Not Just a Definition</strong></h2>
<p>Shifting to data as a product is not just a matter of changing terminology—it requires a fundamental transformation in how an organization <strong>creates, manages, and uses data</strong>. This shift impacts business processes, technology, governance, and culture.</p>
<p>The transition can be challenging, especially in organizations with legacy systems and siloed data ownership. More importantly, the biggest hurdle is a <strong>culture</strong> that treats data as an IT responsibility rather than a business enabler.</p>
<p>If you're looking to <strong>adopt a data-as-a-product approach</strong> and need guidance on where to start, <strong>get in touch</strong>. We can help you design a strategy that fits your business and unlocks the full value of your data. 🚀</p>
<p>Here’s a <a target="_blank" href="https://blog.datachef.co/data-as-a-product-data-mesh-team-topologies">story on the transformative journey of Data as a Product</a> at one of our past clients.</p>
]]></content:encoded></item><item><title><![CDATA[Improving Information Retrieval with Knowledge Graphs: Comparing VectorDB RAG vs Graph-Powered RAG on AWS]]></title><description><![CDATA[Introduction
Large Language Models (LLMs) have become ubiquitous, but their tendency to “hallucinate” non-factual details remains a challenge. Retrieval Augmented Generation (RAG) was introduced to ground LLM responses in actual data by retrieving re...]]></description><link>https://blog.datachef.co/improving-information-retrieval-with-knowledge-graphs-comparing-vectordb-rag-vs-graph-powered-rag-on-aws</link><guid isPermaLink="true">https://blog.datachef.co/improving-information-retrieval-with-knowledge-graphs-comparing-vectordb-rag-vs-graph-powered-rag-on-aws</guid><category><![CDATA[RAG ]]></category><category><![CDATA[knowledge graph]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Ali Yazdizadeh]]></dc:creator><pubDate>Mon, 10 Feb 2025 15:48:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739217965006/9ea94265-4852-442d-b55d-525e43964513.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p>Large Language Models (LLMs) have become ubiquitous, but their tendency to “hallucinate” non-factual details remains a challenge. Retrieval Augmented Generation (RAG) was introduced to ground LLM responses in actual data by retrieving relevant facts and incorporating them into prompts. However, when your data is highly interconnected—as in codebases, where functions call one another, or scientific documents, which cite one another—traditional vector-based search can miss key relationships.</p>
<p>In this post, we’ll explore how integrating Knowledge Graphs into the RAG pipeline can boost the relevance and factual accuracy of the output. We’ll compare a conventional VectorDB RAG (using <a target="_blank" href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a> with <a target="_blank" href="https://aws.amazon.com/opensearch-service/">OpenSearch</a>) with a Graph-Powered RAG (using Amazon Bedrock alongside a graph database like <a target="_blank" href="https://aws.amazon.com/neptune/">Amazon Neptune</a> or <a target="_blank" href="https://neo4j.com/">Neo4j</a>).</p>
<p><strong>Summary YouTube Video:</strong></p>
<ul>
<li><strong>Will be Published Soon!</strong></li>
</ul>
<h2 id="heading-problem-statement"><strong>Problem Statement</strong></h2>
<p>While RAG helps mitigate LLM hallucinations by providing factual context, its typical implementation using vector similarity search has a critical flaw: it looks for similar chunks of documents to add to the context, and <strong>“similar” does not always mean “relevant”!</strong> For instance, when asking, “What are the possible causes of obesity?” a vector search might return descriptive text about obesity rather than pinpointing factors like “hypothyroidism” or “bad diet”—the actual causes.</p>
<p>Various techniques have been tried to solve this issue, most notably re-ranking the RAG results with another LLM based on relevancy. But one solution we are interested in is using Knowledge Graphs. We wrote a series of blog posts on Knowledge Graphs: <a target="_blank" href="https://blog.datachef.co/knowledge-graphs-1">Our Blog Series</a>. Here we want to see if adding them to the LLM pipeline can improve the quality of the generation.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeCySUxpma0x7dspuy9harOZ9mi2KY9t2eIiF0BcY6XcyPomBmjcyO3gDBJrKVQmsyV0QYVSVFeAbntZ7RzAa---fAwDlitJZXLjifLigJtrDsG2RY0hD30xi_Rt3JZJLCE0izXiQ?key=WN8P_mxwsUWTSOTkoKG1EeBW" alt /></p>
<p><strong><em>Fig 1. Knowledge Graph-Powered RAG vs Vector RAG</em></strong></p>
<h2 id="heading-use-case-example-code-refactoring"><strong>Use Case Example: Code Refactoring</strong></h2>
<p>Consider a codebase, here the <a target="_blank" href="https://github.com/psf/requests">Python Requests Library</a>, where functions are scattered across multiple files and are highly interconnected. Suppose you need to refactor a function (say, <code>utils.urldefragauth</code>), but you’re unsure which other functions depend on it. A simple vector search might return the function definition only. However, a Knowledge Graph that maps function dependencies can retrieve all related functions—giving you the necessary context to ensure a safe refactor.</p>
<h2 id="heading-knowledge-graph-powered-rag-how-does-it-work"><strong>Knowledge Graph-Powered RAG: How Does It Work?</strong></h2>
<p>Traditional RAG uses vector embeddings to find similar text chunks. In contrast, <strong>Knowledge</strong> <strong>Graph-Powered RAG</strong> leverages a Knowledge Graph to capture explicit relationships among data points (e.g., function calls in code). This approach uses a query language—such as Cypher—to ask precise questions about how entities are connected.</p>
<p>For example, a valid Cypher query to find which functions call a particular function might be:</p>
<pre><code class="lang-sql">MATCH (caller:Function)-[:CALLS]-&gt;(target:Function {name: 'utils.urldefragauth'})

RETURN caller.name AS CallingFunction, caller.code AS CallingFunctionCode

LIMIT 5;
</code></pre>
<p>Here I used the <a target="_blank" href="https://www.llamaindex.ai/">LlamaIndex</a> package, which provides, among other things, the <a target="_blank" href="https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/#texttocypherretriever">TextToCypherRetriever</a>, which does the heavy lifting of turning a text query into a Cypher query compatible with your knowledge graph. What happens underneath, though, is basically begging an LLM to turn the query and the knowledge graph schema into a valid Cypher query!</p>
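<p>If you want to run the generated Cypher against the graph yourself, a minimal sketch with the official <code>neo4j</code> Python driver could look like this (the connection details are placeholders for your own instance):</p>
<pre><code class="lang-python">from neo4j import GraphDatabase

# Placeholder URI and credentials: point these at your Neo4j/AuraDB instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (caller:Function)-[:CALLS]-&gt;(target:Function {name: $name})
RETURN caller.name AS CallingFunction, caller.code AS CallingFunctionCode
LIMIT 5
"""

with driver.session() as session:
    for record in session.run(query, name="utils.urldefragauth"):
        print(record["CallingFunction"])

driver.close()
</code></pre>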
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc2EOn5rBk923b6Nq0UtJyt06J8yJ3kDVkrK6eeARXySE_waoULUmnIDrKOvgUiLcVjQpR7r_WLXpn_5Tz0q4aqvfHj_g08BygBJfMrv4N1z7wtg74B7l7T340Pl5AZ1o45MVmd?key=WN8P_mxwsUWTSOTkoKG1EeBW" alt /></p>
<h2 id="heading-how-to-prepare-the-code-graph"><strong>How to Prepare the Code Graph?</strong></h2>
<p>Code repositories are a perfect example of highly interconnected data. Modules import functions from various other modules, and those functions are in turn called from many other places.</p>
<p>A crucial step in our workflow is preparing a fact-based Knowledge Graph—in this case, a graph of functions and their relationships, specifically which function calls which. Here, I used Python’s <code>ast</code> library to parse the Python files of the popular <code>requests</code> package.</p>
<p>Here are some code snippets:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_python_file</span>(<span class="hljs-params">file_path</span>):</span>

    <span class="hljs-string">"""Parse a Python file to extract functions, classes, dependencies, imports, and code."""</span>

    <span class="hljs-keyword">with</span> open(file_path, <span class="hljs-string">"r"</span>) <span class="hljs-keyword">as</span> f:

        source = f.read()

        source_lines = source.split(<span class="hljs-string">'\n'</span>)

        tree = ast.parse(source, filename=file_path)


    visitor = FunctionVisitor(source_lines)

    visitor.visit(tree)

    <span class="hljs-keyword">return</span> visitor.functions, visitor.imports, visitor.classes
</code></pre>
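<p>The post doesn’t show <code>FunctionVisitor</code> itself; a minimal sketch of what such a visitor might look like follows (the attribute names mirror the snippet above, but the details are assumptions):</p>
<pre><code class="lang-python">import ast

class FunctionVisitor(ast.NodeVisitor):
    """Collect function definitions, their source code, and the names they call."""

    def __init__(self, source_lines):
        self.source_lines = source_lines
        self.functions = {}  # name -&gt; {"code": ..., "calls": [...]}
        self.imports = []
        self.classes = []

    def visit_Import(self, node):
        self.imports.extend(alias.name for alias in node.names)
        self.generic_visit(node)

    def visit_ClassDef(self, node):
        self.classes.append(node.name)
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        # Slice the original source so each node keeps its exact code.
        code = "\n".join(self.source_lines[node.lineno - 1:node.end_lineno])
        calls = []
        for n in ast.walk(node):
            if isinstance(n, ast.Call):
                if isinstance(n.func, ast.Name):
                    calls.append(n.func.id)       # e.g. urldefragauth(...)
                elif isinstance(n.func, ast.Attribute):
                    calls.append(n.func.attr)     # e.g. utils.urldefragauth(...)
        self.functions[node.name] = {"code": code, "calls": calls}
        self.generic_visit(node)
</code></pre>
<p>Each caller/callee pair extracted this way can then be written to the graph as a <code>CALLS</code> relationship between <code>Function</code> nodes.</p>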
<p>Below is a part of the final graph (stored in Neo4j AuraDB):</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXei7x8Mg-qxnfmyZxCWeFpAkvvVYF9S2qhQ2S3o_JCd7dr6XJbh_RiSj3vEImcgne74XgfBcnZIYI2YT6VEVGjcRzj_KRIuDEUUcG3VBOOW-zmVmN-TCzbSTMsbwNh9KUMt5sYHxg?key=WN8P_mxwsUWTSOTkoKG1EeBW" alt /></p>
<h2 id="heading-graph-powered-rag-vs-vector-database-rag-example"><strong>Graph-Powered RAG vs Vector Database RAG: Example</strong></h2>
<p>To show the difference between a vector database approach and a knowledge graph approach, I set up an AWS Bedrock Knowledge Base, which uses OpenSearch Service for vector search, and compared its results with the Knowledge Graph hosted on a Neo4j database (alternatively, I could have used Amazon Neptune for the graph with some small changes in the code!).</p>
<p>I asked both of these workflows (or you can call them AI agents!) what happens if I change a function. This question clearly shows the difference the graph approach makes: to answer it, you need to know which other functions call the function in question, and those callers can live in any other file!</p>
<p>Here is the input to both of these agents:</p>
<blockquote>
<p>I want to refactor the utils.urldefragauth function. What other functions call it and might get affected? Show the CODE to those function.</p>
</blockquote>
<h2 id="heading-graph-powered-rag-answer-generate-using-amazon-bedrock"><strong>Graph-Powered RAG Answer (Generate using Amazon Bedrock):</strong></h2>
<p>Based on the provided information, the function that calls <code>utils.urldefragauth</code> is <code>adapters.HTTPAdapter.request_url</code>. This function might be affected if you refactor <code>utils.urldefragauth</code>. Here's the code for <code>adapters.HTTPAdapter.request_url</code>:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">request_url</span>(<span class="hljs-params">self, request, proxies</span>):</span>

    <span class="hljs-string">"""Obtain the url to use when making the final request.
    If the message is being sent through a HTTP proxy, the full URL has to
    be used. Otherwise, we should only use the path portion of the URL.
    This should not be called from user code, and is only exposed for use
    when subclassing the
    :class:`HTTPAdapter &lt;requests.adapters.HTTPAdapter&gt;`.
    :param request: The :class:`PreparedRequest &lt;PreparedRequest&gt;` being sent.
    :param proxies: A dictionary of schemes or schemes and hosts to proxy URLs.
    :rtype: str
    """</span>
    proxy = select_proxy(request.url, proxies)
    scheme = urlparse(request.url).scheme
    is_proxied_http_request = proxy <span class="hljs-keyword">and</span> scheme != <span class="hljs-string">"https"</span>
    using_socks_proxy = <span class="hljs-literal">False</span>

    <span class="hljs-keyword">if</span> proxy:
        proxy_scheme = urlparse(proxy).scheme.lower()
        using_socks_proxy = proxy_scheme.startswith(<span class="hljs-string">"socks"</span>)

    url = request.path_url
    <span class="hljs-keyword">if</span> url.startswith(<span class="hljs-string">"//"</span>):  <span class="hljs-comment"># Don't confuse urllib3</span>
        url = <span class="hljs-string">f"/<span class="hljs-subst">{url.lstrip(<span class="hljs-string">'/'</span>)}</span>"</span>

    <span class="hljs-keyword">if</span> is_proxied_http_request <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> using_socks_proxy:
        url = urldefragauth(request.url)

    <span class="hljs-keyword">return</span> url
</code></pre>
<p>This function uses <code>urldefragauth</code> when dealing with proxied HTTP requests that are not using a SOCKS proxy. Any changes to the behavior or signature of <code>utils.urldefragauth</code> could potentially impact this function, so you'll need to ensure that your refactoring doesn't break the existing functionality or interface expected by <code>adapters.HTTPAdapter.request_url</code>.</p>
<h2 id="heading-vector-database-rag-answer-generate-using-amazon-bedrock"><strong>Vector Database RAG Answer (Generate using Amazon Bedrock):</strong></h2>
<p>I apologize, but I cannot provide an answer to your question based on the given context information. The provided code snippet does not contain any reference to a <code>utils.urldefragauth</code> function or any functions that call it. The code appears to be a collection of utility functions related to networking, proxies, and HTTP requests, but it doesn't include the specific function you're asking about or any functions that call it. Without more relevant context about the <code>utils.urldefragauth</code> function, I'm unable to identify the functions that might be affected by refactoring it or show you their code.</p>
<h2 id="heading-comparison">Comparison:</h2>
<p>As you can see, because the vector database cannot see the function-call relationships, it was unable to answer the question, while the Knowledge Graph approach successfully found the relationship and answered it!</p>
<h2 id="heading-table-summary"><strong>Table Summary:</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<th><strong>Approach</strong></th><th><strong>Pros</strong></th><th><strong>Cons</strong></th></tr>
</thead>
<tbody>
<tr>
<td><strong>VectorDB RAG</strong></td><td>Simple setup; fast semantic search</td><td>May miss inter-entity relationships; less context-aware</td></tr>
<tr>
<td><strong>Graph-Powered RAG</strong></td><td>Captures complex relationships; provides contextual insights</td><td>Requires additional overhead; more complex to maintain; higher cost</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>For applications where factual accuracy and interconnected data are critical, a Graph-Powered RAG approach shows considerable promise. Although it demands a higher upfront investment in terms of graph preparation and system complexity, the improved contextual accuracy can be invaluable—especially for domains like code refactoring or multi-entity document retrieval.</p>
<p>For those interested in experimenting further, the full code repository is available here:<br /><a target="_blank" href="https://github.com/DataChefHQ/BlogProjects/tree/main/CodeGraphRAG">GitHub Repository: CodeGraphRAG</a></p>
]]></content:encoded></item></channel></rss>