ETL testing validates that extracting, transforming, and loading data from one data store to another remains consistent with business rules, even though the process has multiple points of failure. For adequate coverage, multiple test types help keep the data at the end of an ETL complete, standardized, and backwards-compatible.
Since an ETL has multiple stages, different types of tests are used to verify the process from its smallest unit up to the entire system, and to guarantee that the data remains consistent across changes.
These tests target small pieces of code, such as a function that transforms one set of data into another, with the goal of guaranteeing that the same input to that function always produces the same output.
Because they target individual functions, unit tests are quick to build; the downside is that they don't cover a full process.
For example, a unit test focuses on a particular phase of the data migration, such as the transformation phase, rather than the full ETL, which is more adequately covered by an integration test.
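As a minimal sketch, a unit test for a single transformation function might look like the following. The function name and normalization rules are hypothetical, chosen only to illustrate the "same input, same output" guarantee:

```python
def transform_customer(record: dict) -> dict:
    """Illustrative transformation step: trim names, lowercase emails."""
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }

def test_transform_customer_is_deterministic():
    raw = {"name": "  Ada Lovelace ", "email": " ADA@Example.COM "}
    expected = {"name": "Ada Lovelace", "email": "ada@example.com"}
    # The same input must always yield the same output.
    assert transform_customer(raw) == expected
    assert transform_customer(raw) == expected

test_transform_customer_is_deterministic()
```

Because the test exercises one function in isolation, it runs in milliseconds and can be written alongside the transformation code itself.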
The goal of integration tests, also known as end-to-end (E2E) tests, is to get as close as possible to a real ETL running in production.
Their usual setup involves generating data that resembles the source being extracted from and verifying that the full set of ETL transformation rules loads it into the new storage with the expected structure, syntax, and values.
Unlike unit tests, integration tests cover the full process, which makes them slower to build and run.
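A rough sketch of this setup, using two in-memory SQLite databases as stand-ins for the source and target systems; the table names, columns, and dollars-to-cents rule are assumptions for illustration only:

```python
import sqlite3

def run_etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Extract rows, apply a transformation, load into the target."""
    rows = source.execute("SELECT id, amount FROM orders").fetchall()
    transformed = [(i, round(a * 100)) for i, a in rows]  # dollars -> cents
    target.executemany("INSERT INTO orders_cents VALUES (?, ?)", transformed)
    return len(transformed)

# Generate source data that resembles the real origin system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 0.5)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_cents (id INTEGER, cents INTEGER)")

# Run the full process and check structure and values end to end.
loaded = run_etl(source, target)
result = dict(target.execute("SELECT id, cents FROM orders_cents").fetchall())
assert loaded == 2 and result == {1: 999, 2: 50}
```

A production E2E test would point at a staging copy of the real systems rather than SQLite, but the shape is the same: seed realistic data, run the whole pipeline, assert on the loaded result.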
These tests are even broader than integration tests; their objective is to validate the performance and robustness of each unit of an ETL when running on a server.
For example, if an ETL runs across different servers, system tests validate that those servers can communicate with each other, which allows for fast detection of bottlenecks.
Since systems can be distributed, and the tests need to replicate production ETLs with a clear ETL mapping, these tests are costly to maintain and hard to build.
When changes are introduced to an ETL, its unit, integration, and system tests can all pass while the change still creates unintended side effects in the way the data is stored -- for example, the length of a field can be reduced, resulting in the final value being only partially stored.
Regression tests check the consistency of the data between changes -- they're especially useful in areas of an ETL that change frequently or are often problematic.
While these tests can be done manually, the more frequently an ETL changes the more often they have to be run, which makes automation particularly useful for frequent and extensive test cases.
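An automated regression check can be sketched as running a transformation over a fixed dataset and comparing against a stored snapshot, so that an unintended change (such as the field-length reduction described above) surfaces immediately. The rule and snapshot below are illustrative, not from a real ETL:

```python
def truncate_name(record: dict, max_len: int = 10) -> dict:
    """Illustrative transformation rule: cap the "name" field at max_len."""
    return {**record, "name": record["name"][:max_len]}

# Fixed input and stored snapshot. A change that alters the rule (for
# example, raising the cap to 11 characters) makes the comparison fail.
FIXED_INPUT = [{"name": "Alexandria!"}, {"name": "Bo"}]
SNAPSHOT = [{"name": "Alexandria"}, {"name": "Bo"}]

assert [truncate_name(r) for r in FIXED_INPUT] == SNAPSHOT
```

Running this on every change turns a silent data regression into an immediate, visible test failure.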
A data warehouse is the target database whose data is consumed to create reports.
ETL testing gets data from one source into another in the format we desire.
These tests have a wide coverage since multiple data sources can be used which can require multiple transformations.
With database testing, the goal is to ensure that the final normalized data is accessible and usable.
An ETL test can verify whether data values are duplicated in the target system.
Meanwhile, for the data warehouse, the more adequate test is to check whether any required data is missing, by doing report testing. These tests complement each other to verify the full system's reliability, which helps business intelligence.
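The two perspectives can be sketched side by side: an ETL-side check for duplicated values in the target, and a warehouse-side check for required values missing from a report table. SQLite and the table/column names stand in for a real target system:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (email TEXT)")
db.executemany("INSERT INTO customers VALUES (?)",
               [("a@x.com",), ("b@x.com",), ("a@x.com",)])

# ETL test: no value should appear more than once in the target.
dupes = db.execute(
    "SELECT email, COUNT(*) FROM customers GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

# Warehouse/report test: every required value must be present.
required = {"a@x.com", "b@x.com", "c@x.com"}
present = {row[0] for row in db.execute("SELECT DISTINCT email FROM customers")}
missing = required - present

assert dupes == [("a@x.com", 2)]   # duplicate caught by the ETL test
assert missing == {"c@x.com"}      # gap caught by the warehouse test
```

Each check catches a class of problem the other would miss, which is why the two test types complement each other.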
Related Read: 50+ Best ETL Tools List
Understanding where the data is extracted from and where it will end up is imperative to ensure an ETL works as expected, since different sources might store the same values as different types.
These different types will likely be converted and stored as a single type in the target storage. For example, an ETL might be fetching a number from both a SQL database and a spreadsheet.
On the SQL database side, the query returns the value properly stored as a number, whereas on the spreadsheet it is stored as a string of characters.
The tester has to be aware of the different data types so they can be converted and ultimately saved as a string, number, or other format in the target platform, keeping the data types in the target tables consistent.
While testing, the format of the data also needs to be taken into account, as the type can be the same but the content different: for example, some source systems might have encoded URLs while others have them decoded.
Another factor to take into account is that the structure of the sources will likely differ from each other and from the target system, as when one source system calls an identifier "id", another calls it "_id", and we want to store it as "identifier".
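The three mismatches above can be normalized in one small step, sketched here with hypothetical field names ("id"/"_id"/"identifier", "amount", "url") chosen to mirror the examples in the text:

```python
from urllib.parse import unquote

ID_ALIASES = ("id", "_id", "identifier")

def normalize(record: dict) -> dict:
    """Map differing source structures, types, and formats to one shape."""
    ident = next(record[k] for k in ID_ALIASES if k in record)
    return {
        "identifier": str(ident),
        "amount": float(record["amount"]),   # spreadsheet may send "42.5"
        "url": unquote(record["url"]),       # always store URLs decoded
    }

sql_row = {"id": 1, "amount": 42.5, "url": "https://example.com/a b"}
sheet_row = {"_id": "1", "amount": "42.5", "url": "https://example.com/a%20b"}

# Two structurally different sources converge on the same target record.
assert normalize(sql_row) == normalize(sheet_row)
```

A test like the final assertion is a cheap way to prove that every source, whatever its native types, lands in the target tables with consistent types and formats.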
ETL testing verifies if the data at the end of an ETL is complete.
ETL engineers check for data completeness by testing exclusions: for example, if the source data needs to be filtered to contain certain values.
Sometimes, ETL engineers maintain a data mapping document for the target system that lists the expected values.
Data mapping helps quickly identify any issues with the ETL process.
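A completeness check against such a mapping document can be sketched as follows, assuming the document has been expressed as a dict of allowed filter values per field; all names here are illustrative:

```python
# Hypothetical mapping document: values the target is allowed to contain.
MAPPING = {"status": {"active", "pending"}}

def check_completeness(source_rows: list, target_rows: list) -> list:
    """Return a list of completeness issues (empty means the load is complete)."""
    expected = [r for r in source_rows if r["status"] in MAPPING["status"]]
    issues = []
    if len(target_rows) != len(expected):
        issues.append(f"expected {len(expected)} rows, got {len(target_rows)}")
    bad = [r for r in target_rows if r["status"] not in MAPPING["status"]]
    if bad:
        issues.append(f"{len(bad)} rows violate the mapping filter")
    return issues

source = [{"status": "active"}, {"status": "deleted"}, {"status": "pending"}]
target = [{"status": "active"}, {"status": "pending"}]
assert check_completeness(source, target) == []  # filtered load is complete
```

Because the mapping lives in one place, a change to the business rules only needs to be reflected once, and every completeness test picks it up.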
Once the target data format and type are decided for the target system, data engineers can feed fake source data through the ETL to run data tests.
The aim of this practice is to have tests whose output is always known: given a fixed dataset, any data anomaly is easy to spot by comparing against the expected outcome.
Data transformation testing spots errors, with the added bonus that the tests are easily expandable by adding more fake source data.
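One way to sketch this expandability is a table of fabricated inputs paired with their known outputs, so extending coverage is just appending a case. The date-parsing rule below is made up for illustration:

```python
from datetime import date

def parse_signup_date(raw: str) -> date:
    """Illustrative transformation: ISO date string -> date object."""
    return date.fromisoformat(raw.strip())

# Each entry is (fake source value, known expected output). Adding
# coverage for a newly discovered anomaly means appending one tuple.
CASES = [
    ("2023-01-15", date(2023, 1, 15)),
    (" 2023-01-15 ", date(2023, 1, 15)),   # stray whitespace anomaly
    ("2024-02-29", date(2024, 2, 29)),     # leap-day edge case
]

for raw, expected in CASES:
    assert parse_signup_date(raw) == expected
```

Since every output is known in advance, any deviation is immediately attributable to the transformation, not to noise in the data.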
Tests should include different user roles attempting to load new data into a target database.
This helps highlight whether certain roles are missing permissions, or whether there are security gaps where a role has more permissions than it should.
It's also good to add tests in which multiple users attempt to change a target system at the same time, as these surface concurrency errors that can affect the loading process and even data integrity.
The primary objective of testing data integrity is to ensure that the structure of a target database remains the same.
This can be done by checking units of data for duplicates when it's known that no duplicates should exist in the target database.
The schemas of the target database serve as the data model, while column ranges and data types can be checked to guarantee consistency between values. These tests can spot whether a field's data has been truncated or its size increased.
As an example, if a field "name" must have a length of 10 and an ETL is changed so that 11 characters are now being stored, a test can easily spot the change and highlight a potential regression. The same goes for data types, where a field might have been changed from a number to a string unintentionally.
The structure of the database tables can also be checked, for example by validating primary key and foreign key relationships to verify that they remain intact.
For this, ETL tests can be put in place that break if one element of a relationship is removed, making it easy to spot issues in real time from the failing tests.
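The three checks above (field length, data type, and key relationships) can be sketched as queries against the target schema; SQLite and the users/orders tables stand in for a real target database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""CREATE TABLE orders (id INTEGER PRIMARY KEY,
              user_id INTEGER REFERENCES users(id))""")
db.execute("INSERT INTO users VALUES (1, 'Ada')")
db.execute("INSERT INTO orders VALUES (10, 1)")

# Length check: an ETL regression that stores 11 characters in the
# 10-character "name" field makes this count nonzero.
too_long = db.execute(
    "SELECT COUNT(*) FROM users WHERE LENGTH(name) > 10").fetchone()[0]
assert too_long == 0

# Type check: "name" must always be stored as text, never as a number.
bad_type = db.execute(
    "SELECT COUNT(*) FROM users WHERE TYPEOF(name) != 'text'").fetchone()[0]
assert bad_type == 0

# Relationship check: every order must reference an existing user.
orphans = db.execute("""SELECT COUNT(*) FROM orders o
                        LEFT JOIN users u ON o.user_id = u.id
                        WHERE u.id IS NULL""").fetchone()[0]
assert orphans == 0
```

If a user row referenced by an order were removed, the orphan query would return a nonzero count and the test would fail immediately, flagging the broken relationship.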
When an ETL process is in place, running it end to end can take a long time. This might be fine while the dataset is small, but as it grows it can create bottlenecks.
One way to check whether the process is efficient is to log the start and end time of a run along with how many records were fetched and loaded into the target database; this gives an indication of how many records are handled in a given period of time.
Knowing the processing time allows limits to be set that, when exceeded, cause a performance test case to fail.
For example, if a process for 10 million records is consistently measured at 10 minutes to process -- a test can be created to fail if it suddenly takes 15 minutes.
The thresholds to set vary between ETLs, so it's important to measure each ETL individually instead of assuming that a value that works for one process will work for all.
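The measure-and-threshold pattern can be sketched as a small harness: time the run, count the records, derive a throughput, and fail when it drops below the threshold chosen for that particular ETL. The numbers and the trivial stand-in process are illustrative:

```python
import time

def run_with_metrics(process, records):
    """Run the process, logging elapsed time and record count."""
    start = time.monotonic()
    loaded = process(records)
    elapsed = time.monotonic() - start
    return loaded, elapsed

def assert_throughput(loaded, elapsed, min_records_per_sec):
    """Fail the performance test if throughput falls below the threshold."""
    rate = loaded / elapsed if elapsed > 0 else float("inf")
    assert rate >= min_records_per_sec, f"throughput {rate:.0f}/s below threshold"

# Threshold chosen per ETL from its own measured baseline, not shared.
loaded, elapsed = run_with_metrics(lambda rs: len(rs), list(range(10_000)))
assert_throughput(loaded, elapsed, min_records_per_sec=1_000)
```

With the 10-million-records example above, a baseline of 10 minutes implies roughly 16,000 records per second; a run that suddenly takes 15 minutes would fall under that rate and fail the test.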
When an ETL should only be able to write to or read from certain databases or tables, it's good to test that the role permissions are set as expected.
Data analysts can create multiple types of users for testing purposes:
Admins: have access to everything. An admin should be able to execute every operation, be it a write, read, delete, or update.
Maintainers: have access to read and write data, but not to delete it.
Read-only: read-only roles should receive an error when attempting to write or delete; if no error is raised, the test must fail to indicate a mistake in the role's permissions.
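These role checks can be sketched against a hypothetical permission model; a real test would instead issue the operations against the target database under each role's actual credentials and assert on the errors returned:

```python
# Hypothetical permission model mirroring the three roles above.
PERMISSIONS = {
    "admin": {"read", "write", "update", "delete"},
    "maintainer": {"read", "write", "update"},
    "read_only": {"read"},
}

def can(role: str, operation: str) -> bool:
    return operation in PERMISSIONS[role]

# Admin can do everything.
assert all(can("admin", op) for op in ("read", "write", "update", "delete"))
# Maintainer can write but must be denied deletes.
assert can("maintainer", "write") and not can("maintainer", "delete")
# Read-only must be denied everything but reads; a missing denial here
# is itself the bug the test exists to catch.
assert can("read_only", "read") and not can("read_only", "delete")
```

The key point is that the tests assert on denials as much as on grants: a role with too many permissions is as much a failure as one with too few.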
These tests go hand in hand with performance tests: as performance degrades, so does the ability to process more data with fewer resources.
One type of test that can help identify scalability issues is to increase the data volume to be processed and measure how that affects the server memory, CPU, and database.
Depending on whether memory or CPU becomes the constraint, either improving the system hardware or optimizing the ETL might suffice to guarantee that not all of the server's resources are consumed.
For the database, multiple ETLs can be run at the same time to see how the extract, transform, and load phases behave; if a database lacks indexes, response times slow down, which is a good indicator that the system is unable to scale to larger datasets.
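A scalability probe can be sketched as processing growing volumes of fake records while recording elapsed time and peak memory for each, so non-linear growth stands out; the trivial stand-in "ETL" is an assumption for illustration:

```python
import time
import tracemalloc

def etl(rows):
    """Trivial stand-in for a real extract-transform-load run."""
    return [r * 2 for r in rows]

measurements = []
for volume in (1_000, 10_000, 100_000):
    tracemalloc.start()
    start = time.monotonic()
    etl(list(range(volume)))
    elapsed = time.monotonic() - start
    _, peak = tracemalloc.get_traced_memory()  # peak bytes during the run
    tracemalloc.stop()
    measurements.append((volume, elapsed, peak))

# Memory use should grow with volume; a sudden blow-up at a larger
# volume is the scalability signal these tests exist to catch.
assert measurements[-1][2] > measurements[0][2]
```

In practice the probe would run against the real servers and also watch database metrics, but the shape (increase volume, measure, compare) is the same.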
ETLs are not always operations where the amount of data extracted equals the amount of data loaded. This can happen for multiple reasons, one of which is invalid data, such as a field that can't be converted from a date format to a number.
In these scenarios, the failed attempt is logged so that it can later be seen why some data was not migrated from source to target, and even retried after the transformation phase is changed to handle the conversion.
One test that can be done is to force invalid values into the process and check that a log entry or error is produced when invalid data reaches the process, stopping it from continuing.
The hard part of these tests is coming up with the different ways source data can be invalid enough to break the ETL; on the other hand, they're easily extendable, since when a new error occurs all that needs to be done is add one more entry to the fake dataset.
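A sketch of forcing invalid values through a transformation and asserting they are logged and excluded rather than crashing the run; the conversion rule (date string to year number) is made up for illustration:

```python
import logging
from datetime import date

log = logging.getLogger("etl")

def to_year(raw: str):
    """Convert a date string to its year; log and skip invalid values."""
    try:
        return date.fromisoformat(raw).year
    except ValueError:
        log.error("invalid date, skipping: %r", raw)
        return None  # excluded from the load, available for later retry

# Deliberately include an invalid row in the fake dataset.
rows = ["2023-01-15", "not-a-date", "2024-02-29"]
converted = [y for y in map(to_year, rows) if y is not None]

assert converted == [2023, 2024]  # invalid row was logged, not loaded
```

When a new kind of invalid value is discovered in production, extending coverage is just appending that value to the fake dataset.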
This type of test is very important to have in place, as its main goal is to check whether, in case of catastrophic failure, the system can be recovered to a workable state with, ideally, no data loss.
For example, if multiple ETL processes are running and the server suddenly crashes due to a power outage or network issue, there has to be a guarantee that when the system is running again no data was lost, the ETLs can continue from where they left off, and there's no data corruption in the target databases.
Tests to guarantee that a system can recover include periodically restoring backups of the database to see whether the data is still loadable and how long it would take to get the system running again if issues arise.
Another step is to shut ETL processes down in the middle of a run and see whether they're built to recover from the last valid record in the target database, or whether they restart the extraction from scratch, potentially leaving duplicate data in the target system. ETL tools can help with this type of testing, as they usually have recovery built in.
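The shut-down-midway check can be sketched as follows: stop the load partway through, then verify the ETL resumes from the last loaded record instead of restarting and duplicating data. The checkpointing scheme here is an assumption, not a description of any particular tool:

```python
def load(source_rows, target, checkpoint, stop_after=None):
    """Load rows after the checkpoint; return the new checkpoint index."""
    done = 0
    for i, row in enumerate(source_rows):
        if i <= checkpoint:
            continue  # already loaded before the "crash"
        target.append(row)
        checkpoint = i
        done += 1
        if stop_after is not None and done >= stop_after:
            break  # simulate a crash mid-run
    return checkpoint

source, target = ["a", "b", "c", "d"], []
cp = load(source, target, checkpoint=-1, stop_after=2)  # crash after 2 rows
cp = load(source, target, checkpoint=cp)                # recover and resume

assert target == ["a", "b", "c", "d"]  # no data loss, no duplicates
```

An ETL that restarted from scratch instead of using the checkpoint would leave `["a", "b", "a", "b", "c", "d"]` in the target, and the final assertion would catch the duplication.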
Related Read: Data Pipelines Explained
When an ETL process is covered by these different types of testing, the risk of ending up with a data warehouse or database containing inconsistent, incomplete, or outright wrong data drops significantly, since both the smallest units and the full system are covered by tests.
The constant reiteration required by regression test coverage and data quality tests means the same process is tested many times. This repeated scrutiny increases the chance that an existing error is spotted in subsequent runs: a manual regression tester may be more alert on some days than others, while automation makes clear which gaps in the tests need to be addressed.
When data is consistently correct it requires less verification when a data analyst or engineer needs to load it or transform it for other use cases.
Testing also provides fast feedback when a mistake is made: a regression or error can be spotted early, while a change is being made, instead of ending up in the production system.
Once in place, testing significantly reduces the number of mistakes that end up in the production system, reducing the risk of inconveniencing clients or damaging the data they rely on to operate.
As coverage spreads, it becomes easier to identify where performance bottlenecks started showing up, which can lead to early detection and mitigation before they become a larger concern.
It also helps identify issues from external factors. For example, if the system is healthy but ETL tests consistently spot trouble in the extraction phase, it can mean a data provider has faulty data or has recently made changes that invalidate the ETL implementation.
Pinpointing problems this way improves the system's reliability and reduces the risk of spending countless hours fixing part of a process that might not be a problem.
An ETL, with its multiple stages, is complex; as such, doing ETL testing properly requires a good understanding of what the data source is, what the objective of each transformation is, and how the data will be stored.
This requires not only good developer skills but also a data-oriented approach, as testing without understanding can lead to a false sense of security by covering a process with tests that would never fail.
Taking regression testing as an example: these tests require at least one person dedicated to the role of ETL tester, manually testing at least the critical success and failure paths of a process and constantly updating the manual tests to cover new ETL developments and changes. As development grows, so does the work required to create and maintain such tests, even when done manually.
This cost can be offset by automating the tests, which is itself time-consuming for an ETL tester, especially during the first setup. Aside from the increasing labor cost, there are also system costs to bear in mind: integration tests rely on a close copy of the production environment to run, which, depending on the scale of the system and the reach of the tests, can mean doubling costs to maintain the testing environment.
Unit tests can be quick to put in place in the transformation phase, as they test small sections of the process, such as checking data for null values and exercising the transformation logic. Integration and system tests, on the other hand, can prove challenging to set up, as there are likely multiple data sources and target tables that vary wildly depending on the ETL implementation.
Especially on the initial setup, some of these tests can take weeks or even months to put in place, as a system keeps changing over time and the tests require constant adaptation to meet new business requirements.