The Modern Data Stack (MDS) is an ecosystem of data tools that emerged as a result of the rise of the cloud data warehouse.
For basic use cases, the MDS allows data teams to replicate data into a data warehouse, transform the data, and visualize insights for data-driven decision-making.
Because the MDS architecture is modular, it can also power far more complex data pipelines: machine learning models, real-time production systems, and even client-facing products for end-users.
For both startups and enterprises, the opportunities are limitless if you have the right technology and people in place.
There are three use cases for the Modern Data Stack:
1. Data Analytics: Centralize data to empower business users with dashboards for better decision-making
2. Process Automation: Save time by automating time-consuming tasks and business processes for end-users
3. Product Development: Turn raw data into valuable data products that clients can purchase
Here is a side-by-side comparison of a traditional data stack and a Modern Data Stack:
| Capability | Traditional Data Stack | Modern Data Stack |
|---|---|---|
| Scalability | Limited by hardware | Elastic (scales with cloud resources) |
| Architecture | Storage and compute are coupled | Storage and compute are separated |
| Use Cases | Analytics | Analytics, automation, machine learning, data products |
| Users | Engineers, data engineers | Engineers, data engineers, data scientists, data analysts, analytics engineers, business analysts |
The Modern Data Stack includes the following components:
| MDS Component | Use Case | Example Tools |
|---|---|---|
| ETL / ELT | Extract data from databases and applications | Fivetran, Portable |
| Data collection | Collect event data from websites and mobile apps | Snowplow, Segment |
| Real-time | Move and process data in real time | Confluent, Striim |
| Data processing | Provide processing power for data pipelines | Snowflake, Databricks |
| Data transformation | Version control and structure complex query logic | dbt, Coalesce |
| Orchestration | Schedule jobs and handle dependencies | Airflow, Dagster |
| Reverse ETL | Sync data from warehouses to business apps | Hightouch, Census |
| Data visualization | Turn raw data into dashboards | Power BI, Tableau |
| Data governance | Measure and improve data quality | Collibra, Monte Carlo |
Let's walk through each piece of the tech stack in more detail.
Job to be done: Data ingestion solutions (ETL, ELT) include connectors that extract data from data sources (e.g. PostgreSQL, LinkedIn) and load the data into a data warehouse. Compared with writing and maintaining the code yourself, no-code solutions can give you simpler, more reliable, and more scalable data pipelines.
ETL / ELT tools: Portable, Fivetran, Stitch, Hevo Data, CData, Matillion, Airbyte, Integrate.io, Blendo, Data Virtuality, Etleap, Precisely, Gathr, Skyvia, Dataddo, Kleene.ai, Rivery
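To make the extract-and-load step concrete, here is a minimal sketch of what a connector does under the hood. It is an illustration only, not how any of the tools above are implemented: it assumes two in-memory SQLite databases standing in for a source application and a warehouse, and the `extract_and_load` function, the `customers` table, and the `raw_` prefix are all hypothetical.

```python
import sqlite3


def extract_and_load(source, warehouse, table):
    """Copy every row of `table` from source to warehouse (full refresh)."""
    cursor = source.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # Land the data untransformed; in ELT, modeling happens later in the warehouse.
    warehouse.execute(f"DROP TABLE IF EXISTS raw_{table}")
    warehouse.execute(f"CREATE TABLE raw_{table} ({column_list})")
    warehouse.executemany(
        f"INSERT INTO raw_{table} ({column_list}) VALUES ({placeholders})", rows
    )
    warehouse.commit()
    return len(rows)


# Hypothetical source application database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Hypothetical warehouse.
warehouse = sqlite3.connect(":memory:")
loaded = extract_and_load(source, warehouse, "customers")
```

A production connector would also handle incremental syncs, schema changes, retries, and API rate limits, which is most of what you pay an ETL vendor for.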
Job to be done: Data collection tools make it simple to collect or create data from websites and mobile apps. Collection tools typically create schematized event streams that are delivered to your warehouse or data storage location (e.g. AWS S3, GCS) in real time. When your data analysts need data from your first-party platforms, it's probably time to evaluate data collection solutions.
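As a rough picture of what "schematized" means here, the sketch below validates an event against a schema before wrapping it in an envelope. Everything in it is an assumption for illustration: `PAGE_VIEW_SCHEMA`, `build_event`, and the envelope fields are hypothetical and do not reflect the format of any specific collection tool.

```python
import json
import time
import uuid

# Hypothetical schema: field name -> required Python type.
PAGE_VIEW_SCHEMA = {"user_id": str, "url": str}


def build_event(event_type, properties, schema):
    """Validate properties against a schema and wrap them in an event envelope."""
    for field, expected_type in schema.items():
        if field not in properties:
            raise ValueError(f"missing field: {field}")
        if not isinstance(properties[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "collected_at": time.time(),
        # Keep only the fields the schema declares.
        "properties": {field: properties[field] for field in schema},
    }


event = build_event("page_view", {"user_id": "u-1", "url": "/pricing"}, PAGE_VIEW_SCHEMA)
line = json.dumps(event)  # one JSON line, ready to append to a stream or file
```

Rejecting malformed events at collection time keeps bad records out of the warehouse, where they are far more expensive to clean up.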
Job to be done: Real-time data platforms transfer information from one system to another in a matter of milliseconds instead of minutes or hours. With stream processing, aggregations, joins, and advanced processing can take place while data is in motion.
Real-time processing tools: Confluent, Estuary, HVR, Materialize, Striim, Meroxa, StreamSets, Decodable, Popsink, Qlik Replicate, IBM Infosphere, Amazon Kinesis, AWS DMS, AWS Glue, Google Cloud Dataflow, Talend, Oracle Golden Gate, Arcion, Gravity Data, Skippr, IOblend, Attunity, DeltaStream, Upsolver, Timeplus, Debezium, Kafka, Apache Nifi, Maxwell's Daemon, Streamkap
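To show what "aggregations while data is in motion" can look like, here is a small sketch of a tumbling-window count. It assumes events arrive in timestamp order as `(timestamp, key)` pairs; the `tumbling_window_counts` function and the sample events are hypothetical, and real stream processors also deal with late and out-of-order data.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts_by_key) as each window closes.

    Assumes events arrive ordered by timestamp as (timestamp, key) pairs.
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        if current_window is not None and window_start != current_window:
            # The previous window is complete, so emit it and start a new one.
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        yield current_window, dict(counts)


events = [(0.5, "click"), (3.0, "view"), (9.9, "click"), (10.1, "click"), (25.0, "view")]
windows = list(tumbling_window_counts(events, window_seconds=10))
```

Because the function is a generator, results are emitted as soon as a window closes rather than after the whole dataset has been read, which is the essential difference from a batch job.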
Job to be done: Data warehouses (as well as data lakes and lakehouse architectures) do the heavy lifting for your Modern Data Stack. While it is possible to power analytics without a data warehouse (by connecting a data visualization tool directly to a production database), most teams that are serious about becoming data-driven will put in place a data warehouse immediately.
Job to be done: Every data stack needs some way to turn raw data into insights. Typically a data transformation tool is introduced as your data processing requirements increase, as the number of data models becomes unwieldy, or as your SQL queries become illegible. Whether you use an open-source transformation provider or a cloud solution, these tools can help you stay organized.
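One way to picture how transformation tools keep query logic organized is as a set of named SQL models layered on top of raw tables. The sketch below assumes an in-memory SQLite database as the warehouse; the `MODELS` dictionary, the `raw_orders` table, and the model names are hypothetical, and this is not how any particular transformation tool works internally.

```python
import sqlite3

# Hypothetical models, listed in dependency order: staging first, then the mart.
MODELS = {
    "stg_orders": """
        SELECT id AS order_id, customer_id, amount_cents / 100.0 AS amount
        FROM raw_orders
        WHERE status = 'paid'
    """,
    "customer_revenue": """
        SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM stg_orders
        GROUP BY customer_id
    """,
}

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer_id INTEGER, amount_cents INTEGER, status TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, 10, 2500, "paid"), (2, 10, 1500, "paid"), (3, 11, 900, "refunded")],
)

# Materialize each model as a view so downstream models can select from it.
for name, select_sql in MODELS.items():
    warehouse.execute(f"CREATE VIEW {name} AS {select_sql}")

revenue = warehouse.execute(
    "SELECT customer_id, orders, revenue FROM customer_revenue ORDER BY customer_id"
).fetchall()
```

Keeping each model small and named means the definitions can live in version control and be reviewed like any other code.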
Job to be done: Data stacks are complex. As you add more components, you need to keep everything running seamlessly. Orchestration tools tie into APIs from other pieces of the Modern Data Stack. They kick off work, manage dependencies, and track the lineage of data through your pipelines.
Orchestration tools: Airflow, Dagster, Prefect, Astronomer, Argo, Luigi, Temporal, Mage
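Dependency management boils down to running tasks in an order that respects a dependency graph. The sketch below uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) to do that; the task names in `DEPENDENCIES` and the `run_pipeline` helper are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_revenue": {"extract_orders", "extract_customers"},
    "sync_to_crm": {"transform_revenue"},
    "refresh_dashboard": {"transform_revenue"},
}


def run_pipeline(dependencies, run_task):
    """Run every task after all of its upstream tasks; return the run order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        run_task(task)
    return order


completed = []
order = run_pipeline(DEPENDENCIES, completed.append)
```

If the graph contains a cycle, `static_order()` raises `graphlib.CycleError`, which is the same class of mistake orchestration tools reject when you define a pipeline.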
Job to be done: Reverse ETL solutions convert your data warehouse from an analytics engine (only powering dashboards) into an operational system of record. With off-the-shelf data integrations activating data from your warehouse to downstream business applications (e.g. Salesforce and other SaaS applications), you can use Reverse ETL to automate business workflows.
Reverse ETL tools: Hightouch, Census, MessageGears, Omnata, Octolis, Lytics, Polytomic, RudderStack, SeekWell, Rivery, Weld, Twilio Segment
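A core piece of a Reverse ETL sync is working out which warehouse rows are new or changed since the last run, so that only those are pushed to the business application. The sketch below illustrates one simple approach using row fingerprints. The `plan_sync` and `row_fingerprint` helpers, the `email` key, and the sample rows are hypothetical, and the sketch stops short of calling any real SaaS API.

```python
import hashlib
import json


def row_fingerprint(row):
    """Stable hash of a row, used to detect changes between syncs."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()


def plan_sync(warehouse_rows, last_synced, key="email"):
    """Return the rows that are new or changed since the previous sync."""
    to_send = []
    for row in warehouse_rows:
        fingerprint = row_fingerprint(row)
        if last_synced.get(row[key]) != fingerprint:
            to_send.append(row)
            # Simplification: a real sync would record this only after
            # the downstream API call succeeds.
            last_synced[row[key]] = fingerprint
    return to_send


state = {}  # in practice this would be persisted between runs
rows = [
    {"email": "ada@example.com", "lifetime_value": 120},
    {"email": "grace@example.com", "lifetime_value": 80},
]
first_run = plan_sync(rows, state)   # both rows are new
rows[1]["lifetime_value"] = 95
second_run = plan_sync(rows, state)  # only the changed row
```

Sending only the difference keeps the sync within the rate limits that most business applications impose on their APIs.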
Job to be done: Data visualization tools are typically one of the first, and most important, components of a Modern Data Stack to be introduced. They turn raw data into metrics, metrics into dashboards, and dashboards into insights that help your company make better strategic decisions. Business intelligence teams cannot live without a great data visualization tool.
Data visualization tools: Astrato, Bloom AI, Canvas, Columns, Datawrapper, Domo, GoodData, Google Data Studio, Glean, Graphext, Hex, Holistics, Hyperquery, IBM Cognos Analytics, Infogram, Knowi, Lightdash, Logi Analytics, Looker, Metabase, Microsoft Power BI, Mode, Observable, Omni, Plotly, PopSQL, Preset, Qlik, Retool, SAP Lumira, Sigma, Sisense, SQL Server Reporting Services, Streamlit, Superset, Tableau, ThoughtSpot, TIBCO Spotfire, Toucan Toco, Veezoo, Zepl, Zing Data, Zoho Analytics, Zoomdata, Whaly
Job to be done: The newest (and currently most talked about) aspect of the Modern Data Stack is data governance. There are quite a few subcomponents here (data catalogs, policy enforcement, data observability, lineage, and more), but they all revolve around a focus on data quality. These tools are typically introduced later in the data lifecycle.
Data governance tools: Immuta, Metaplane, Monte Carlo, Castor, Bigeye, Atlan, data.world, Alation, Secoda, Privitar, Telmai, Kensu, Select Star, Ataccama, Collibra, Amundsen, DataHub, OpenMetadata, Labellerr, Anomalo, Great Expectations, Sifflet, re_data, BigID
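To give a flavor of what measuring data quality involves, here is a minimal sketch of two common checks: a not-null check and a uniqueness check. The `check_not_null` and `check_unique` helpers and the sample `orders` rows are hypothetical and are not the API of any tool listed above.

```python
def check_not_null(rows, column):
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def check_unique(rows, column):
    """Return the values of `column` that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": None},
    {"order_id": 2, "customer_id": 11},
]
report = {
    "null_customer_id": check_not_null(orders, "customer_id"),
    "duplicate_order_id": check_unique(orders, "order_id"),
}
```

Running checks like these on a schedule, and alerting when they fail, is the basic idea behind data observability.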
In addition to the modular components listed above, there are also end-to-end data platforms like Mozart Data that offer a solution encompassing many of the components you need.
For teams that are beginning their data journey, bundled solutions and data consultants can be a great way to get started quickly.
Data consultants: Slalom, The Seattle Data Guy, Brooklyn Data Co., Upright Analytics, Bytecode IO, Leit Data, Meru, Big Time Data, Data Captains, On the Mark Data, Ternary Data, 4 Mile Analytics, Revolt BI, Analytics8, phData, 3pillar Global, Kubrick Group, FluenFactors, Deepskydata
Here are some of my favorite free resources to learn about the Modern Data Stack:
Portable is a cloud-based ETL / ELT tool that replicates data to Snowflake, BigQuery, Amazon Redshift, PostgreSQL, and MySQL.
We build the no-code ETL / ELT connectors that aren't supported by other platforms: the niche tools and industry-specific applications that every data team needs at some point.
Pricing is simple. Manually triggered syncs are free. Recurring data flows are $200 a month.
To get started with your Modern Data Stack, Portable is a no-brainer. Explore our 300+ connectors today!