Modern Data Stack Explained: Use Cases & Components (2023)

Ethan
CEO, Portable
The Modern Data Stack

What is the Modern Data Stack?

The Modern Data Stack (MDS) is an ecosystem of data tools that emerged as a result of the rise of the cloud data warehouse.

For basic use cases, the MDS allows data teams to replicate data into a data warehouse, transform the data, and visualize insights for data-driven decision-making.

Given the modular nature of the MDS architecture, the Modern Data Stack can also power far more complicated data pipelines: machine learning models, real-time production systems, and even client-facing products for end-users.

For both startups and enterprises, the opportunities are limitless if you have the right technology and people in place.

Use Cases for the Modern Data Stack

There are 3 use cases for the Modern Data Stack:

1. Data Analytics

2. Process Automation

3. Product Development

Data Analytics

Centralize data to empower business users with dashboards for better decision-making

Process Automation

Save time by automating time-consuming tasks and business processes for end-users

Product Development

Turn raw data into valuable data products that clients can purchase

How Is the Modern Data Stack Different From a Traditional Data Stack?

Here is a side-by-side comparison of a traditional data stack and a Modern Data Stack:

| Capability | Traditional Data Stack | Modern Data Stack |
| --- | --- | --- |
| Deployment | On-premises | Cloud-native |
| Scalability | Limited by hardware | Big data (elastic, near-unlimited scalability) |
| Architecture | Storage and compute are coupled | Separation of storage and compute |
| Use Cases | Analytics | Analytics, automation, machine learning, data products |
| Users | Engineers, data engineers | Engineers, data engineers, data scientists, data analysts, analytics engineers, business analysts |

Components of the Modern Data Stack

The Modern Data Stack includes the following components:

| MDS Component | Use Case | Example Tools |
| --- | --- | --- |
| ETL / ELT | Extract data from databases and applications | Fivetran, Portable |
| Data collection | Create data from sites and mobile apps | Snowplow, Segment |
| Real-time | Move and process data in real time | Confluent, Striim |
| Data processing | Provide processing power for data pipelines | Snowflake, Databricks |
| Data transformation | Version control and structure complex query logic | dbt, Coalesce |
| Orchestration | Schedule jobs and handle dependencies | Airflow, Dagster |
| Reverse ETL | Sync data from warehouses to business apps | Hightouch, Census |
| Data visualization | Turn raw data into dashboards | Power BI, Tableau |
| Data governance | Measure and improve data quality | Collibra, Monte Carlo |

Let's walk through each piece of the tech stack in more detail.

ETL / ELT

Job to be done: Data ingestion solutions (ETL, ELT) include connectors that extract data from data sources (e.g. PostgreSQL, LinkedIn) and load the data into a data warehouse. Rather than requiring you to write the code yourself, no-code solutions aim to offer more reliable, scalable, and simpler data pipelines.

ETL / ELT tools: Portable, Fivetran, Stitch, Hevo Data, CData, Matillion, Airbyte, Integrate.io, Blendo, Data Virtuality, Etleap, Precisely, Gathr, Skyvia, Dataddo, Kleene.ai, Rivery
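
To make the extract-and-load step concrete, here is a minimal sketch of what a connector does under the hood. It is an illustration only: the source rows are hard-coded, an in-memory SQLite database stands in for the warehouse, and the names `SOURCE_ROWS`, `raw_customers`, and `run_sync` are assumptions, not part of any vendor's product.

```python
import sqlite3

# Hypothetical source records, standing in for rows pulled from an API or database.
SOURCE_ROWS = [
    {"id": 1, "email": "ada@example.com", "plan": "pro"},
    {"id": 2, "email": "lin@example.com", "plan": "free"},
]


def extract():
    """Extract step: return raw records from the (simulated) source."""
    return list(SOURCE_ROWS)


def load(conn, rows):
    """Load step: upsert raw rows into a warehouse table keyed on id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_customers ("
        "id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
    )
    # INSERT OR REPLACE keeps the sync idempotent: re-running it does not duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO raw_customers (id, email, plan) "
        "VALUES (:id, :email, :plan)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM raw_customers").fetchone()[0]


def run_sync(conn):
    """Run one extract-and-load cycle and return the row count in the target table."""
    return load(conn, extract())
```

Calling `run_sync(sqlite3.connect(":memory:"))` twice leaves the same two rows in place, which is the property you generally want from a recurring sync. A production connector also has to handle pagination, rate limits, schema changes, and retries, which is much of what the tools above take off your plate.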

Data Collection

Job to be done: Data collection tools make it simple to collect or create data from websites and mobile apps. Collection tools typically create schematized event streams that are delivered to your warehouse or data storage location (e.g. AWS S3, GCS) in real time. When your data analysts need data from your first-party platforms, it's probably time to evaluate data collection solutions.

Data collection tools: Snowplow Analytics, mParticle, RudderStack, Segment, Freshpaint, Heap, Piwik PRO, Amplitude, Tealium, Rakam, SnowcatCloud
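
A "schematized event" simply means every event is checked against a known shape before it is delivered. The sketch below shows the idea in plain Python. The `PageViewEvent` schema, its field names, and the helper functions are hypothetical and do not mirror any specific vendor's tracking API.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PageViewEvent:
    """Hypothetical schema for a page-view event."""
    user_id: str
    url: str
    occurred_at: str  # ISO 8601 timestamp


REQUIRED_FIELDS = {"user_id", "url"}


def build_event(payload):
    """Validate a raw payload and coerce it into the schema, or raise ValueError."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return PageViewEvent(
        user_id=str(payload["user_id"]),
        url=str(payload["url"]),
        # Stamp the event with the current UTC time if the client did not send one.
        occurred_at=payload.get("occurred_at")
        or datetime.now(timezone.utc).isoformat(),
    )


def to_json_line(event):
    """Serialize an event as one JSON line, ready to append to a stream or file."""
    return json.dumps(asdict(event), sort_keys=True)
```

Rejecting malformed payloads at collection time is the design choice that matters here: it keeps bad records out of the warehouse instead of leaving analysts to clean them up later.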

Real-Time

Job to be done: Real-time data platforms transfer information from one system to another in a matter of milliseconds instead of minutes or hours. With stream processing, aggregations, joins, and advanced processing can take place while data is in motion.

Real-time processing tools: Confluent, Estuary, HVR, Materialize, Striim, Meroxa, StreamSets, Decodable, Popsink, Qlik Replicate, IBM InfoSphere, Amazon Kinesis, AWS DMS, AWS Glue, Google Cloud Dataflow, Talend, Oracle GoldenGate, Arcion, Gravity Data, Skippr, IOblend, Attunity, DeltaStream, Upsolver, Timeplus, Debezium, Kafka, Apache NiFi, Maxwell's Daemon, Streamkap
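
As a rough illustration of aggregating data while it is in motion, the sketch below counts events per key in fixed (tumbling) time windows, emitting each window as soon as it closes. It is plain Python over an iterator, not the API of any streaming platform, and it assumes events arrive in timestamp order, which real systems cannot take for granted.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Yield (window_start, counts_by_key) each time a window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed to
    arrive in non-decreasing timestamp order.
    """
    current_start = None
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp to the start of its window.
        start = ts - (ts % window_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a later window, so emit the finished one.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = start
        counts[key] += 1
    # Flush the final, still-open window when the stream ends.
    if current_start is not None:
        yield current_start, dict(counts)
```

Because the function is a generator, results are produced incrementally rather than after the whole dataset has been read, which is the essential difference between stream and batch processing. Late or out-of-order events, state recovery, and exactly-once delivery are the hard parts that dedicated platforms address.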

Data Processing

Job to be done: Data warehouses (as well as data lakes and lakehouse architectures) do the heavy lifting for your Modern Data Stack. While it is possible to power analytics without a data warehouse (by connecting a data visualization tool directly to a production database), most teams that are serious about becoming data-driven will put in place a data warehouse immediately.

Data warehouse tools: Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse, Databricks, Firebolt, ClickHouse, Dremio, Starburst, Onehouse, Qubole

Data Transformation

Job to be done: Every data stack needs some way to turn raw data into insights. Typically a data transformation tool is introduced as your data processing requirements increase, as the number of data models becomes unwieldy, or as your SQL queries become illegible. Whether you use an open-source transformation provider or a cloud solution, these tools can help you stay organized.

Data transformation tools: dbt, Coalesce, Narrator, Matillion, Mozart Data, Google Dataform, Datameer, SqlDBM, Reconfigured, Retable
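
The core idea behind these tools is that each transformation is a named, version-controlled SELECT statement that gets materialized as a table in the warehouse. Here is a minimal sketch of that pattern, using SQLite in place of a warehouse. The table `raw_orders`, the model `daily_revenue`, and the `build_model` helper are assumptions made for illustration and are not any tool's actual interface.

```python
import sqlite3

# A hypothetical "model": one SELECT statement that would live in version control.
DAILY_REVENUE_SQL = """
SELECT order_date, SUM(amount_cents) / 100.0 AS revenue
FROM raw_orders
WHERE status = 'paid'
GROUP BY order_date
"""


def build_model(conn, name, select_sql):
    """Materialize a model by rebuilding its table from the SELECT statement.

    `name` is interpolated into the SQL, so it must come from trusted code,
    never from user input.
    """
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")
    conn.commit()
```

With a `raw_orders` table loaded, `build_model(conn, "daily_revenue", DAILY_REVENUE_SQL)` produces a clean table that dashboards can query directly. Dropping and rebuilding is the simplest materialization strategy; incremental builds, tests, and documentation are where dedicated tools earn their keep.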

Orchestration

Job to be done: Data stacks are complex. As you add more components, you need to keep everything running seamlessly. Orchestration tools tie into APIs from other pieces of the Modern Data Stack. They kick off work, manage dependencies, and track the lineage of data through your pipelines.

Orchestration tools: Airflow, Dagster, Prefect, Astronomer, Argo, Luigi, Temporal, Mage
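
Dependency management is the heart of orchestration: a task should only run once everything upstream of it has finished. The sketch below shows that idea with Python's standard-library `graphlib` module (Python 3.9+). The `run_pipeline` helper and the task names in the usage note are hypothetical, and this is not how Airflow or Dagster are configured.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run every task after its upstream dependencies and return the run order.

    `tasks` maps a task name to a zero-argument callable.
    `dependencies` maps a task name to the set of task names it depends on.
    """
    # Build a graph of task -> upstream tasks; tasks with no entry have none.
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    order = []
    # static_order() yields upstream tasks first and raises CycleError on a cycle.
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        order.append(name)
    return order
```

For example, with tasks named `ingest`, `transform`, and `report`, where `transform` depends on `ingest` and `report` depends on `transform`, the helper runs them in exactly that order no matter how the dictionary is arranged. Real orchestrators add scheduling, retries, parallelism, and monitoring on top of this ordering logic.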

Reverse ETL

Job to be done: Reverse ETL solutions convert your data warehouse from an analytics engine (only powering dashboards) into an operational system of record. With off-the-shelf data integrations activating data from your warehouse to downstream business applications (e.g. Salesforce and other SaaS applications), you can use Reverse ETL to automate business workflows.

Reverse ETL tools: Hightouch, Census, MessageGears, Omnata, Octolis, Lytics, Polytomic, RudderStack, SeekWell, Rivery, Weld, Twilio Segment
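
One way to picture Reverse ETL is as a diff: read the latest values from a warehouse table, compare them with what was last pushed, and send only the records that changed. The sketch below plans such a sync. The `customer_metrics` table, its columns, and the `plan_sync` helper are assumptions made for illustration, and the final step of calling a business application's API is deliberately left out.

```python
import sqlite3


def plan_sync(conn, already_synced):
    """Return the warehouse records that need to be pushed downstream.

    `already_synced` maps a customer id to the payload sent on the last run.
    Only new or changed records are returned, which keeps API usage low.
    """
    rows = conn.execute(
        "SELECT id, email, lifetime_value FROM customer_metrics ORDER BY id"
    ).fetchall()
    to_push = []
    for customer_id, email, lifetime_value in rows:
        payload = {"email": email, "lifetime_value": lifetime_value}
        # Skip records whose payload is identical to what was last sent.
        if already_synced.get(customer_id) != payload:
            to_push.append({"id": customer_id, **payload})
    return to_push
```

Each returned record would then be mapped onto the fields of the destination application. Pushing only the changes matters because many SaaS APIs enforce rate limits.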

Data Visualization

Job to be done: Data visualization tools are typically one of the first, and most important, components of a Modern Data Stack to be introduced. They turn raw data into metrics, metrics into dashboards, and dashboards into insights that help your company make better strategic decisions. Business intelligence teams cannot live without a great data visualization tool.

Data visualization tools: Astrato, Bloom AI, Canvas, Columns, Datawrapper, Domo, GoodData, Google Data Studio, Glean, Graphext, Hex, Holistics, Hyperquery, IBM Cognos Analytics, Infogram, Knowi, Lightdash, Logi Analytics, Looker, Metabase, Microsoft Power BI, Mode, Observable, Omni, Plotly, PopSQL, Preset, Qlik, Retool, SAP Lumira, Sigma, Sisense, SQL Server Reporting Services, Streamlit, Superset, Tableau, ThoughtSpot, TIBCO Spotfire, Toucan Toco, Veezoo, Zepl, Zing Data, Zoho Analytics, Zoomdata, Whaly

Data Governance

Job to be done: The newest (and currently most talked about) aspect of the Modern Data Stack is data governance. There are quite a few subcomponents here (data catalogs, policy enforcement, data observability, lineage, etc.), but they all revolve around a focus on data quality. These tools are typically introduced later in the data lifecycle.

Data governance tools: Immuta, Metaplane, Monte Carlo, Castor, Bigeye, Atlan, data.world, Alation, Secoda, Privitar, Telmai, Kensu, Select Star, Ataccama, Collibra, Amundsen, DataHub, OpenMetadata, Labellerr, Anomalo, Great Expectations, Sifflet, re_data, BigID
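
To show what "a focus on data quality" can mean in practice, here is a small sketch of two common checks: duplicate keys and missing required values. The `check_quality` function and its report format are hypothetical, and the tools listed above cover far more (freshness, volume anomalies, lineage, access policies).

```python
def check_quality(rows, key, required):
    """Run basic quality checks on a list of row dictionaries.

    `key` is the column expected to be unique.
    `required` lists the columns that must not be None.
    """
    total = len(rows)
    keys = [row.get(key) for row in rows]
    # Any gap between row count and distinct key count indicates duplicates.
    duplicate_keys = total - len(set(keys))
    null_counts = {
        column: sum(1 for row in rows if row.get(column) is None)
        for column in required
    }
    return {
        "row_count": total,
        "duplicate_keys": duplicate_keys,
        "null_counts": null_counts,
        "passed": duplicate_keys == 0 and not any(null_counts.values()),
    }
```

A check like this would typically run after each load or transformation, with a failed result alerting the team or halting downstream jobs before bad data reaches a dashboard.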

Setting Up Your First Data Technology Stack?

In addition to the modular components listed above, there are also end-to-end data platforms like Mozart Data that offer a solution encompassing many of the components you need.

For teams that are beginning their data journey, bundled solutions and data consultants can be a great way to get started quickly.

End-to-end data platforms: Mozart Data, Keboola, Nexla, Y42, 5x, Untitled Firm, Actiondesk, Panoply, Canvas, Selfr, DataDrive, Datacoves, CorralData, IOMETE

Data consultants: Slalom, The Seattle Data Guy, Brooklyn Data Co., Upright Analytics, Bytecode IO, Leit Data, Meru, Big Time Data, Data Captains, On the Mark Data, Ternary Data, 4 Mile Analytics, Revolt BI, Analytics8, phData, 3pillar Global, Kubrick Group, FluenFactors, Deepskydata

Want To Learn More About the Modern Data Stack?

Here are some of my favorite free resources to learn about the Modern Data Stack:

How To Get Started With a Modern Data Stack (Start Today)

Portable is a cloud-based ETL / ELT tool that replicates data to Snowflake, BigQuery, Amazon Redshift, PostgreSQL, and MySQL.

We build the no-code ETL / ELT connectors that aren't supported by other platforms: the niche tools and industry-specific applications that every data team needs at some point.

Pricing is simple. Manually triggered syncs are free. Recurring data flows are $200 a month.

To get started with your Modern Data Stack, Portable is a no-brainer. Explore our 300+ connectors today!