Unlocking Observability in your Data Stack

Last month, you may have seen the updated MAD Landscape (Machine Learning, Artificial Intelligence, and Data) published by Matt Turck and team. With 1,416 tools listed (and many others still missing), it's no wonder teams are confused about which tools to select and how everything fits together. As someone whose job it is to keep track of the landscape, I was only aware of ~20% of these tools.

The Modern Data Stack is currently crumbling under the weight of its own complexity. While we've gained best-in-class tools that are exceptional at performing one specific step of the data operations lifecycle, we've lost the ability to accurately identify and troubleshoot every touchpoint our data is going through.

The industry claims to want observability into the quality of its data but isn't as adamant about wanting the same level of observability into the quality of its processes. As a result, we've started leaning on data quality tools that treat the symptom of bad data, not the cause.

If we want to get to a point where we can trust the data being delivered is accurate and on-time, we have to start placing market pressure on the ecosystem to build tools with interoperability in mind.

The Modern (Brittle) Data Stack

The current state of data stacks reminds me of those videos where a robot arm serves ice cream or hot dogs. The goal is to run through a series of steps to serve someone a particular food item. 99% of the time, these services probably work.

However, as soon as any of these steps goes wrong, the entire process falls apart and someone gets delivered food that's either unrecognizable or nonexistent. Despite this, the poor little robot arm keeps chugging along. It doesn't have context into the work that's being done. It just follows a series of instructions it was programmed to perform.

While we may not be dishing out data with robot arms, we can easily see the parallels with most data stacks. In order to put your data to work, you rely on disparate systems, each with its own scheduling rules, to get your data from initial source to final destination. Each tool only cares about the steps it's responsible for, without a care in the world for how that might affect the end product.

We have to ask ourselves: are we really OK running our data operations systems like these food preparation robots?

The Standard Flow of Data

Here's an overly simplistic example of a data pipeline that many teams build out:

6:00AM - Ingest your data
7:00AM - Transform your data
8:00AM - Run ML Models and store results
8:30AM - Add ML scores back to your database
9:00AM - Refresh your dashboards
9:30AM - Augment SaaS tools with warehouse data

Each one of these bullet points represents the data passing through a different tool in your system. Mentally, you're aware of these six independent processes. But individually, each process is running on its own tool, with its own scheduling system, with its own potential for failure.

Every new tool you pass your data through is unaware of the bigger picture, which is a surefire way to invite disaster.

Bad Data Delivery

What happens when something inevitably goes wrong in one of the tools?

Data takes longer to load than normal
Transformation steps encounter unexpected data that results in errors
Your ML package fails due to package dependency issues
Your dashboard service is temporarily down

These types of issues all result in incomplete or bad data. The problem? Each tool isn't aware of the flow of data. It just knows "I need to perform this task on the data at this time". The tool is only aware of its own unique objective. The end result is a mess of bad data getting deployed through each of your systems.
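To make this failure mode concrete, here's a minimal Python sketch of a clock-driven schedule. It isn't how any particular tool works; the step names, the durations, and the run_on_clock function are all made up for illustration. Each step starts at its scheduled time no matter what happened upstream, and once one step runs on incomplete data, everything after it inherits the problem.

```python
from datetime import datetime, timedelta

# Hypothetical schedule mirroring the example pipeline: each step starts at a
# fixed clock time and knows nothing about the step before it.
# The durations are invented for illustration.
DAY = datetime(2023, 1, 1)
SCHEDULE = [
    ("ingest", DAY.replace(hour=6), timedelta(minutes=45)),
    ("transform", DAY.replace(hour=7), timedelta(minutes=40)),
    ("ml_models", DAY.replace(hour=8), timedelta(minutes=20)),
]


def run_on_clock(schedule, delays=None):
    """Return the steps that ended up working with incomplete or bad data."""
    delays = delays or {}
    affected = []
    tainted = False
    previous_end = None
    for name, start, duration in schedule:
        if previous_end is not None and start < previous_end:
            # The clock said "go" before the upstream step had finished.
            tainted = True
        if tainted:
            # Bad input flows through to every later step as well.
            affected.append(name)
        previous_end = start + duration + delays.get(name, timedelta(0))
    return affected


print(run_on_clock(SCHEDULE))  # a normal day
print(run_on_clock(SCHEDULE, {"ingest": timedelta(minutes=30)}))  # slow load
```

On a normal day nothing is affected. With a 30-minute ingest delay, the transform step starts on schedule anyway, and the ML step that follows it is compromised too.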

Artificial Bottlenecks

While delivering bad data is not great in its own right, there are some less severe side effects that can happen with this setup as well.

What happens when something goes right with your data?

The data is available to be loaded 1 hour earlier than normal
The data finishes transforming 30 minutes faster than expected

When you rely 100% on disparate scheduling systems that aren't aware of the larger state of things, the end result is data delivery that's slower than required. You're creating artificial bottlenecks by building your process around likely completion times.

You end up with a process that always takes about 4 hours to complete, even when it could have finished in 2 hours and potentially even started earlier than normal.
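A quick back-of-the-envelope calculation shows how much time the clock can cost you. The sketch below uses the start times from the example schedule, but the durations are numbers I made up, and the function names are hypothetical. It compares when the last step finishes under fixed start times versus when each step starts as soon as the previous one ends.

```python
from datetime import datetime, timedelta

DAY = datetime(2023, 1, 1)
# (step name, scheduled start, assumed duration in minutes).
# Start times come from the example schedule; durations are illustrative.
STEPS = [
    ("ingest", DAY.replace(hour=6, minute=0), 20),
    ("transform", DAY.replace(hour=7, minute=0), 25),
    ("ml_models", DAY.replace(hour=8, minute=0), 15),
    ("write_scores", DAY.replace(hour=8, minute=30), 10),
    ("dashboards", DAY.replace(hour=9, minute=0), 10),
    ("saas_sync", DAY.replace(hour=9, minute=30), 20),
]


def clock_finish(steps):
    """Finish time when every step waits for its own scheduled start.

    Assumes no step overruns into the next one's start time.
    """
    _name, start, minutes = steps[-1]
    return start + timedelta(minutes=minutes)


def event_driven_finish(steps):
    """Finish time when each step starts as soon as the previous one ends."""
    current = steps[0][1]  # the first step still starts at its scheduled time
    for _name, _start, minutes in steps:
        current += timedelta(minutes=minutes)
    return current


first_start = STEPS[0][1]
print(clock_finish(STEPS) - first_start)         # 3:50:00
print(event_driven_finish(STEPS) - first_start)  # 1:40:00
```

With these assumed durations, the same work takes 3 hours 50 minutes on the clock and 1 hour 40 minutes when steps are chained. The difference is pure waiting.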

The root cause of both these issues is the lack of interoperability between tools.

The Case for Interoperability

Think back to when you were a little kid. If you wanted to build something from scratch, what toy did you reach for?

For most, that answer is probably LEGO® (or some off-brand variation). This toy gave you the freedom to build whatever you could imagine. The bricks were durable enough that your idealized construction could sit up on your shelf... but if you ever wanted to build something else, you could easily take it apart and reconfigure it into something new.

LEGO® won the war on children's construction material with a simple patent. Every piece they produced could interlock and connect with any other piece.

Every piece has equally sized protrusions on top. This allows any piece to "give" itself to any other piece.
Every piece has equally sized indentations on the bottom. This allows any piece to "receive" any other piece.

The important lesson here is that LEGO® won by allowing every piece to have an input and an output. This simple decree gives more flexibility to build anything you can imagine, while giving every brick the portability to be used in multiple different ways. Most of all, it makes the act of building and experimenting fun.

So what can we learn from this strategy for the Data Stack?

Building with Interlocking Blocks

Data interoperability is dependent on two things:

The ability to be kicked off at a moment's notice (input)
The ability to provide context about the work that was performed (output)

If you want to increase the interoperability of your data stack, you need to change how you scope out the tools that you use. Getting the job done isn't good enough. You need to choose tools that have the ability to talk to each other through API Access and Webhooks.

API Access

Whether you're the one accessing it or not, the data tools you choose should have an API. The lack of an API is a clear sign that the data tool is not prioritizing the ability to talk with and work alongside other tools in the data stack.

Initially, you want to make sure the API lets you:

Kick off a job automatically
Check the status of individually executed jobs
Export individual job metadata

The goal with API access is to ensure that the tool can be run at a specified time, after a specified event, or on an ad hoc basis. In an ideal world, your entire data operations lifecycle would be fully event-driven, with the successful completion of one tool kicking off the next.
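As a rough sketch of what those three capabilities look like from the caller's side, here's a hypothetical interface in Python. The JobAPI method names are my own invention, not any vendor's API, and InMemoryTool is a stand-in so the example runs without a network. Real tools will name and shape these endpoints differently.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


class JobAPI(Protocol):
    """Hypothetical interface covering the three capabilities listed above."""

    def trigger_job(self, job_name: str) -> str: ...    # kick off a job
    def get_status(self, run_id: str) -> str: ...       # check one run's status
    def get_metadata(self, run_id: str) -> dict: ...    # export run metadata


@dataclass
class InMemoryTool:
    """Stand-in for a real tool; every run 'succeeds' immediately."""

    runs: dict = field(default_factory=dict)

    def trigger_job(self, job_name: str) -> str:
        run_id = f"{job_name}-{len(self.runs) + 1}"
        self.runs[run_id] = {"job": job_name, "status": "success"}
        return run_id

    def get_status(self, run_id: str) -> str:
        return self.runs[run_id]["status"]

    def get_metadata(self, run_id: str) -> dict:
        return dict(self.runs[run_id])


def run_then_trigger(upstream: JobAPI, downstream: JobAPI) -> Optional[str]:
    """Kick off the downstream job only if the upstream run succeeded."""
    run_id = upstream.trigger_job("ingest")
    if upstream.get_status(run_id) != "success":
        return None
    return downstream.trigger_job("transform")


ingest_tool, transform_tool = InMemoryTool(), InMemoryTool()
print(run_then_trigger(ingest_tool, transform_tool))
```

The point of the sketch is the shape, not the details: if a tool exposes these three operations in some form, another system can start it, watch it, and react to what it did.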

Being event-driven means relying on context from previous steps through the analysis of metadata. At a bare minimum, a tool's metadata should include start_time, end_time, and status. These three data points tell you whether and when something finished running, so you can trigger downstream steps.

However, events can be driven by much more than just status. If the tool can include metadata like schema of data processed, number of rows processed, number of successes vs errors, etc. then you open up a world of possibilities that your data pipeline can react to.
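Here's one way that reaction could look, as a hedged Python sketch. The start_time, end_time, and status fields come from the list above; rows_processed, rows_failed, the should_trigger_downstream function, and the 1% error threshold are assumptions I've added for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobMetadata:
    # The three minimum fields.
    start_time: datetime
    end_time: Optional[datetime]  # None while the job is still running
    status: str
    # Richer context; these field names are assumptions.
    rows_processed: int = 0
    rows_failed: int = 0


def should_trigger_downstream(meta: JobMetadata, max_error_rate: float = 0.01) -> bool:
    """Decide whether the next step may run, using only upstream metadata."""
    if meta.status != "success" or meta.end_time is None:
        return False
    if meta.rows_processed == 0:
        return False  # finished, but there is nothing new to work with
    return meta.rows_failed / meta.rows_processed <= max_error_rate


clean_run = JobMetadata(
    start_time=datetime(2023, 1, 1, 6, 0),
    end_time=datetime(2023, 1, 1, 6, 20),
    status="success",
    rows_processed=1000,
    rows_failed=5,
)
print(should_trigger_downstream(clean_run))
```

A status-only check would treat a "successful" run with zero rows, or with half its rows rejected, the same as a clean one. The extra metadata is what lets the pipeline tell them apart.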

Webhooks

While API access can get you most of what you need, webhooks can really streamline the process of stringing independent tools together.

When a tool offers webhooks, it gives you the ability to send and receive an HTTP request after a specific event occurs (typically when a job finishes). This is a crucial step to ensure that the tool can be run immediately after another tool finishes, while simultaneously making it easier to kick off other external tools immediately.

While APIs help you get access to the data you want, they're often bottlenecked by the need to continuously poll your tools to check the metadata and verify the status. In many cases, this means that you end up checking the status every 5 minutes, hoping that the job is completed. If not, you wait 5 more minutes and try again.
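In code, that polling pattern looks something like the Python sketch below. The wait_for_job function and its parameters are hypothetical, and get_status stands in for whatever status call your tool's API provides. The sleep function is passed in only so the example can run instantly.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 300,  # "check every 5 minutes"
    max_polls: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or we give up."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("success", "failed"):
            return status
        sleep(poll_seconds)  # idle time between checks
    return "timed_out"


# Simulated job that reports "running" twice before succeeding.
statuses = iter(["running", "running", "success"])
waited = []  # record the waits instead of actually sleeping
print(wait_for_job(lambda: next(statuses), sleep=waited.append))
print(sum(waited))
```

Notice that the job may have finished seconds after a check, yet the caller only finds out at the next poll. Shorter intervals reduce the lag but mean more requests against the tool's API.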

Webhooks ensure that the next job gets kicked off instantly after an event occurs. There's no downtime. No waiting periods. It's an insanely streamlined workflow that ensures the output of one tool can directly impact another tool.
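On the receiving end, a webhook is just an HTTP request carrying a small payload. The Python sketch below shows only the handling logic, without the web server around it. The payload shape, the NEXT_JOB mapping, and the handle_webhook function are all assumptions; each tool defines its own webhook format.

```python
import json
from typing import Callable, Optional

# Hypothetical mapping: which job to start when a given job succeeds.
NEXT_JOB = {"ingest": "transform", "transform": "ml_models"}


def handle_webhook(body: bytes, trigger: Callable[[str], None]) -> Optional[str]:
    """Handle a 'job finished' callback and start the next job right away.

    Assumes a JSON payload like {"job": "ingest", "status": "success"}.
    """
    event = json.loads(body)
    if event.get("status") != "success":
        return None  # do not pass bad or incomplete data downstream
    next_job = NEXT_JOB.get(event.get("job"))
    if next_job is not None:
        trigger(next_job)
    return next_job


started = []  # stands in for a real "kick off a job" API call
handle_webhook(b'{"job": "ingest", "status": "success"}', started.append)
handle_webhook(b'{"job": "transform", "status": "failed"}', started.append)
print(started)
```

The successful ingest starts the transform immediately, while the failed transform starts nothing. No polling loop is involved; the upstream tool tells you the moment it's done.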

Going Further with Orchestration

While API access and webhooks are important for creating data interoperability, it's still not wise to rely on independent tools to call each other. You need somewhere to see the big picture of which tools you're running in your stack, in what order. Additionally, you need the ability to retry tools when something breaks, react to different statuses that occur, and troubleshoot the process if something goes wrong.

Going back to our LEGO® example, if you wanted to show someone how to build a house with all of the blocks, would you just hand them the blocks needed and hope for the best? Or would you instead create a manual that details out the step-by-step order that the pieces should be connected together?

That's where data orchestration comes in. It serves as the manual for your data operations. It's a mission critical part of ensuring that your data operations run smoothly day in and day out, while giving you a high level of observability into the overall process that generates and delivers your data. It helps you immediately identify problems with the process, or with the data itself, to prevent subsequent steps from running.
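To show the idea at its smallest, here's a toy orchestrator in Python. It's a sketch of the concept, not how Shipyard, Airflow, or any other orchestrator is implemented; the orchestrate function and the step names are hypothetical. It runs steps in order, retries failures, and skips everything downstream of a step that never succeeds.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def orchestrate(steps: List[Step], max_attempts: int = 3) -> Dict[str, str]:
    """Run steps in order and return a per-step report of what happened."""
    report: Dict[str, str] = {}
    blocked = False
    for name, task in steps:
        if blocked:
            report[name] = "skipped"
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                task()
                report[name] = f"success (attempt {attempt})"
                break
            except Exception as exc:  # broad on purpose for the sketch
                report[name] = f"failed: {exc}"
        else:
            # Every attempt failed; protect the downstream steps.
            blocked = True
    return report


attempts = {"count": 0}


def flaky_ingest() -> None:
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("temporary outage")


def broken_transform() -> None:
    raise ValueError("unexpected schema")


print(orchestrate([
    ("ingest", flaky_ingest),
    ("transform", broken_transform),
    ("dashboards", lambda: None),
]))
```

The report is the observability piece: one place that tells you the ingest needed a retry, the transform failed and why, and the dashboards were deliberately not refreshed with bad data.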

But data orchestration can only work if the tools you choose are purpose-built to talk to each other.

If you've ensured that your data stack is already interoperable, low-code tools like Shipyard or high-code tools like Airflow allow you to flexibly connect the dots and orchestrate your team's data workflows end-to-end.

Creating the Interoperable Data Stack

The current Modern Data Stack is experiencing challenges due to its complexity and the lack of interoperability between the tools it employs, but it doesn't have to be that way. Companies must strive to build more efficient and reliable data operations by prioritizing orchestration and demanding tools that can seamlessly communicate with one another.

Arguments about bundling and unbundling the data stack don't really matter. Our needs for data will always be changing. We'll always see people attempt to use all-in-one tools. We'll always see people build best-in-class tools. But the one thing we'll always need, no matter the state of the data stack, is interoperability.

By learning from the example of LEGO®'s interlocking blocks, we can create data stacks that are more flexible, portable, and resilient. Adopting a more interoperable and orchestrated approach to building data stacks will help companies avoid the pitfalls of bad data delivery and artificial bottlenecks, and will ultimately lead to more accurate and timely data-driven insights.
