ETL stands for Extract, Transform, Load.
The need to move data across computers has existed since the mainframe era, but the productization of the ETL process began with IBM researchers Devlin and Murphy, who coined the concept of the data warehouse in the late 1980s.
Devlin and Murphy saw that banks needed to develop business intelligence on financial data stored in transaction-based OLTP systems, but in order to do so, they needed to merge data from other data sources and also find a data environment better suited for data analysis. Hence, the data warehouse was born.
Since then, businesses have become significantly more data-driven. All kinds of data of varying data quality are now available across different data sources. From subscription data to shipping, there are more and more surfaces for data to be exposed, often in the form of APIs and ending up in data lakes.
With data spread across so many different relational databases and data sources, there's been a burgeoning of ETL processes and ETL tools to help companies move data from one or multiple data sources and load it into popular data warehouses (Snowflake, Azure Synapse, Redshift). ETL is now the backbone of the Modern Data Stack.
With so many ETL tools and data integrations in-market, how should you think about designing an ETL architecture for your company and its use cases?
Before looking into various ETL tools and frameworks, why use ETL at all? Can't you extract, transform, and load data from one database to another using just SQL and Python?
SQL is not an ETL tool.
SQL is useful for manipulating and managing data within relational databases (usually MySQL or PostgreSQL) but lacks the data transformation abilities required to merge and aggregate data across databases, especially with inconsistent schemas.
SQL also fails when you need to extract data from APIs and transform data in XML or JSON format into tabular data. While it's possible to write your own ETL connector using Python, this path is strewn with technical complexity and generally flies in the face of code reusability.
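To make the complexity concrete, here is a minimal sketch of the kind of hand-rolled connector logic a data engineer ends up writing: flattening a nested JSON payload into tabular rows. The payload shape and field names are hypothetical stand-ins for what a subscription API might return; a real connector would also handle HTTP calls, pagination, retries, and schema drift.

```python
import json

# Hypothetical API response: nested JSON, as a subscription API might return it.
raw = json.loads("""
{
  "subscriptions": [
    {"id": 1, "plan": {"name": "pro", "price": 49}, "user": {"email": "a@example.com"}},
    {"id": 2, "plan": {"name": "free", "price": 0}, "user": {"email": "b@example.com"}}
  ]
}
""")

def flatten(record: dict) -> dict:
    """Flatten one nested subscription record into a flat, tabular row."""
    return {
        "id": record["id"],
        "plan_name": record["plan"]["name"],
        "plan_price": record["plan"]["price"],
        "user_email": record["user"]["email"],
    }

rows = [flatten(r) for r in raw["subscriptions"]]
```

Every source API has its own nesting, naming, and pagination quirks, so this flattening code rarely transfers between connectors, which is exactly the reusability problem ETL tools solve.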
ETL tools are useful because they allow data engineers to focus on what matters: data analysis and business intelligence.
ETL tools make it easy to extract, transform, and load data from different data sources into data warehouses/data lakes at scale, but there's nuance in selecting which ETL processes are best for your business.
Every business designs products to serve customer needs. Your data stack is no different. How you choose your ETL architecture is dependent on the needs and use cases arising out of your business:
- What data volume are you working with?
- How frequently should your data warehouse be refreshed with new data?
- Will you process data for a business intelligence use case? Are you training a machine learning model?
- How much compute will you need to handle big data?
- Will you work with sensitive customer data requiring GDPR or SOC 2 data management policies?
When designing ETL architecture, you should think of your data stack as a product. What are the data requirements and business rules for your product or company? What data jobs need to be done and which workflows should, therefore, be built around such use cases?
Extract refers to the process of copying data out of a data source such as an API, data lake, or relational database. The challenges in data extraction are usually around dealing with different schemas and rules across various source systems. For a data engineer, this practically means reading incomplete API documentation and following tutorials on how to extract data from various apps and sources (MySQL, Postgres, SQL Server).
Data transformation is the process of cleansing, deduplicating, and oftentimes standardizing data into a tabular structure. Because data models vary across data sources, complex workflows are often required to "wrangle" data into a form suitable for analysis.
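A minimal sketch of a transform step, using plain Python, might cleanse, deduplicate, and standardize raw rows like so. The field names and normalization rules here are illustrative assumptions, not a prescribed schema:

```python
def transform(rows):
    """Cleanse, deduplicate, and standardize raw rows into analysis-ready records."""
    seen = set()
    out = []
    for row in rows:
        email = (row.get("email") or "").strip().lower()  # standardize casing/whitespace
        if not email:          # cleanse: drop incomplete records
            continue
        if email in seen:      # deduplicate on the natural key
            continue
        seen.add(email)
        out.append({"email": email, "plan": row.get("plan", "unknown")})
    return out

raw = [
    {"email": "A@Example.com ", "plan": "pro"},
    {"email": "a@example.com", "plan": "pro"},   # duplicate after normalization
    {"email": None, "plan": "free"},             # incomplete record
]
clean = transform(raw)  # only one row survives
```

Real pipelines layer on type coercion, referential checks, and business rules, but the pattern is the same: normalize first, then filter and dedupe.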
The Load phase refers to loading data into the environment in which data analysis is performed: the data warehouse or data lake.
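At its simplest, the load step is a bulk insert into a destination table. The sketch below uses an in-memory SQLite database as a stand-in for a warehouse connection (a real load would target Snowflake, Redshift, or similar via their bulk-load interfaces); the table and rows are hypothetical:

```python
import sqlite3

# Hypothetical rows produced by the transform step.
rows = [("a@example.com", "pro"), ("b@example.com", "free")]

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse connection
conn.execute("CREATE TABLE subscriptions (email TEXT PRIMARY KEY, plan TEXT)")
conn.executemany(
    "INSERT OR REPLACE INTO subscriptions (email, plan) VALUES (?, ?)",  # idempotent upsert
    rows,
)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM subscriptions").fetchone()[0]
```

Making the load idempotent (upsert rather than blind insert) matters because pipelines get retried, and a re-run shouldn't duplicate rows.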
Factors to consider here are:
Completeness: Businesses often want a unified view of their data, but this usually comes with tradeoffs. Does it make sense to perform computations on data sets with billions of rows? Or will a smaller sample data set suffice?
Latency: How frequently does data need to be refreshed? Is it business critical to have a true real-time view of data workflows, or will it suffice to have a delayed, albeit accurate, understanding of current business intelligence? How often does the load process need to run in order to meet this guarantee?
Cost: When choosing a cloud-based data lake or data warehouse like Snowflake, Amazon Redshift, or Azure, which offerings are going to be most compatible and cost-effective with your current data stack?
| | ETL | ELT | Reverse ETL |
|---|---|---|---|
| Workflow | Data is extracted from business-facing data sources like a CRM database or a consumer-facing application (web or mobile app), transformed, and loaded into a data warehouse. | Data is extracted from a business-facing data source and loaded into a data warehouse or data lake. Transformations are applied within the data warehouse, not before. | Data is extracted from a data warehouse and transformed to match the schemas of business-facing databases, like a reporting database on Postgres or an API consumed by a customer-facing application. |
| Benefits | Data is unified and loaded into a cloud-based environment suited for business intelligence. | Data does not need to be transformed prior to loading; data is made available within the data warehouse, so ad-hoc analysis can occur at any point in time without upfront data engineering. | Data can be used to power business applications. Business intelligence feeds product in a virtuous loop. |
| Disadvantages | Requires upfront data manipulation and planning for how to compute fields for analysis. | When storing data sets with no transformation, data duplication can occur and storage can be costly. | Generally requires data to be refreshed frequently (or in real time) in order to be useful. Can be technologically complex; event-driven data pipelines may be a better option. |
When designing an ETL process, it's worth considering how often your data needs to be refreshed through the lens of the end user. Traditionally, data pipelines have operated in batch, where new data is loaded incrementally into a data warehouse.
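A common way to load new data incrementally in batch is a high-water-mark query: each run fetches only rows newer than the last ID (or timestamp) seen. The sketch below uses an in-memory SQLite source with a hypothetical `events` table to illustrate the pattern:

```python
import sqlite3

# Hypothetical source table standing in for an OLTP database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
source.executemany("INSERT INTO events (id, payload) VALUES (?, ?)",
                   [(1, "a"), (2, "b"), (3, "c")])

def incremental_extract(conn, last_seen_id):
    """Fetch only rows newer than the high-water mark from the previous batch run."""
    cur = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_seen_id,)
    )
    return cur.fetchall()

first_batch = incremental_extract(source, 0)           # picks up all three rows
watermark = max(row[0] for row in first_batch)         # persist this between runs
second_batch = incremental_extract(source, watermark)  # nothing new yet
```

The watermark must be persisted between runs (in a state store or metadata table); losing it forces a full reload.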
However, as marketing and personalization use cases demand that businesses process data at higher frequency, real-time solutions are becoming more popular.
Real-time or event-driven architecture is typically suited for use cases where new data needs to be surfaced with low latency. Certain types of data reporting, for example, may warrant a real-time data model. Some platforms for real-time data architecture are Apache Kafka and Estuary.
Designing an ETL architecture suitable for your business needs comes with challenges.
Data engineers often run into issues with incompatible data types or experience load processes with unexpected latency. Data can often be "dirty," with incorrect or incomplete information, or duplicated, leading to high costs and latency.
When attempting to extract data from a data store, data engineers may find that documentation might be limited, or that there's no existing data connector.
Last but not least, managing multiple data pipelines comes with its own IT and DevOps-related concerns as well.
When deciding on how to build data pipelines between source systems and destination data warehouses, you should consider the benefits of automation.
Human-in-the-loop workflows are expensive; where possible, you should optimize your ETL systems for automated performance. ETL pipelines that allow you to schedule and manage recurring data flows can help with cost savings and provide a more complete view of your data. Moreover, ETL tools that come with SOC 2 or related compliance out of the box will save you headaches and unneeded IT involvement.
One of the key advantages of a modern data stack is automation and orchestration. Take advantage of tooling to save time and money.
Whether you need robust data replication or a framework for loading data in real-time, Portable has done the research on it. Check out side-by-side comparisons of dozens of ETL tools and decide which ones will meet your data needs.
Portable has 300+ long-tail source connectors to the most common data warehouses - Snowflake, Amazon Redshift, Google BigQuery, PostgreSQL.
Start automating workflows today. Try Portable!