With the advent of composable technology stacks, we are seeing an explosion in the number of applications each company is using to run its business.
While integration patterns and platforms can help your enterprise better organize the sprawl of enterprise data into a Modern Data Stack, rarely are best practices for data integration shared across company boundaries.
Even within a single company, data integration practices can live in silos - stuck in specific functions or business units.
After thousands of conversations with business intelligence teams, automation teams, and product teams, I wanted to summarize a complete guide to data integration best practices.
Let's get started.
Focus on Business Outcomes
Document Your Data Sources, Destinations, and Pipelines
Invest in an Enterprise Data Model
Minimize the Divergence of Analytics, Automation, and Product Pipelines
Always Expect To Add New Systems
Decouple Process Logic From Business Systems
Monitor Your Pipelines and Data Sets
Never Compromise on Data Security
Pick the Right Data Integration Software for the Job
The worst mistake you can make is to prioritize technical initiatives before understanding business needs.
Before getting started with any data integration projects, you need to make sure you are working on the highest-value business problems.
Set up a time to interview executive stakeholders, speak with business users, and understand how data will influence decision-making and workflows.
Create a top 10 list of your company's business problems that can be solved with data, and then start thinking about the best approach to data integration with those goals in mind.
For example, if 'reducing customer churn with analytics' is a top company priority, you'll probably want to focus your time on customer data integration to build dashboards instead of integrating employee data to automate internal tasks.
If you start with the wrong initiatives, everything else will be wasted time.
If you don't know what data exists, and what pipelines are moving data, you can end up duplicating efforts or missing critical data flows.
To identify the apps and systems that are in use across your company, I recommend asking your IT team. They typically have an inventory of the databases, business applications, and even legacy systems / on-premises tools that your company is leveraging.
Once you know the data sources and destinations that are available across the company, take an inventory of the data integration solutions that might already be in place.
It's possible business users already have data pipelines moving datasets from system to system for mission-critical workflows.
Before you lift a finger on a new initiative, make sure you understand the systems and data flows that are already live.
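If no inventory exists yet, even a lightweight, version-controlled catalog of your pipelines goes a long way. Here's a rough sketch in Python - the system names, owners, and schedules are placeholders, not a prescribed standard:

```python
from dataclasses import dataclass

# A minimal, version-controlled inventory of data flows.
# System names, owners, and schedules are illustrative placeholders.
@dataclass
class Pipeline:
    name: str
    source: str          # system the data is extracted from
    destination: str     # system the data is loaded into
    owner: str           # team accountable for the pipeline
    schedule: str        # e.g. "hourly", "daily", "real-time"

PIPELINES = [
    Pipeline("crm_to_warehouse", "Salesforce", "Snowflake", "analytics", "hourly"),
    Pipeline("pos_to_warehouse", "PointOfSale", "Snowflake", "analytics", "daily"),
    Pipeline("warehouse_to_ads", "Snowflake", "AdPlatform", "marketing", "daily"),
]

if __name__ == "__main__":
    # Quick sanity check: which systems are only read from, and which are only written to?
    sources = {p.source for p in PIPELINES}
    destinations = {p.destination for p in PIPELINES}
    print("Systems we only read from:", sorted(sources - destinations))
    print("Systems we only write to:", sorted(destinations - sources))
```

Even a catalog this small makes it obvious which systems are covered, who owns each flow, and where the gaps are.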
Once you know the top company priorities, the systems in place across your company, and the pipelines that are moving data, you should create an enterprise data model. With a shared model in place:
You will know where master data lives
You can create standard definitions of common business concepts
You can outline entity relationships to reduce data integration challenges down the road
For instance, if you use a point-of-sale system, a ticketing system, and a CRM, it might be unclear where organization-to-customer relationships are defined.
By defining an enterprise data model, you can explicitly answer these questions and create a simple framework that will inform your data integration strategy and allow your team to focus on driving value.
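To make that concrete, here's a minimal sketch of what the model for the example above could look like in Python - the field names, and the choice of the CRM as the system of record, are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative enterprise data model for the point-of-sale / ticketing / CRM
# example above. The field names and the choice of the CRM as the system of
# record are assumptions made for the sake of the sketch.

@dataclass
class Organization:
    org_id: str                              # canonical ID, mastered in the CRM
    name: str
    crm_account_id: Optional[str] = None     # ID in the CRM
    pos_merchant_id: Optional[str] = None    # ID in the point-of-sale system
    ticketing_org_id: Optional[str] = None   # ID in the ticketing system

@dataclass
class Customer:
    customer_id: str                 # canonical ID, mastered in the CRM
    email: str
    org_id: Optional[str] = None     # the organization-to-customer relationship
                                     # is defined here, once, not per system
```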
We all know that there are three ways to create value from data: 1) data analytics, 2) process automation, and 3) product development.
One of the worst things you can do is to create different pipelines for the three use cases when you don't need to. Unfortunately, it is common for different teams to own the different data use cases, and for each team to create a siloed technology stack.
If you use one pipeline to power business decisions (likely an ETL process combined with a data warehousing solution), but a different pipeline to connect APIs for process automation (likely via an iPaaS solution or CDP), and a third home-grown system with real-time processing to power your external data products, a few things will happen:
Your three pipelines will produce different answers to the same question
You will do the same work multiple times
Policy enforcement, security, and privacy will be 3x as difficult
And more
First document all the use cases, and then find the simplest way to power them in a scalable manner.
Change is a fact of life in the technology world.
If you think about the world through the lens of the systems you have in place today - the data sources, the destinations, and the data pipelines - your architecture will immediately become out of date.
When you pick tools, don't just think about the applications, data warehouses, or cloud environments that you are using today. You need to maintain optionality for the future; otherwise, your integration tools will dictate the systems your company can procure.
We are seeing an interesting pattern in our client base on this topic. Data teams are no longer on the sidelines, and data integrations aren't being evaluated after the fact. Instead, data teams are directly involved in the procurement process, and they ask Portable to build integrations before they even purchase the tool.
That way the data team knows for sure that the new system can be integrated into their architecture and business processes before the tool is chosen.
This best practice alone will save your team months of effort.
It's very common for companies to set up workflows based on the tools they have in place. Dashboards are built on Salesforce data. Workflows are automated around Amazon Ads. Marketing pipelines are directly coupled with Oracle products.
If you build dashboards, pipelines, or products that are coupled to your specific data applications, it will be a nightmare to rip out an application when you need to. Time-consuming is an understatement. You're effectively locking your company into specific tools by creating a web of integrations connecting your disparate systems directly.
Instead of building on top of application-specific interfaces, I recommend a few things:
Create an enterprise data model
Load data into a centralized environment (like a data warehouse, data lake, or ERP system)
Map the data from your source-specific schemas into the enterprise data model
Create the logic for your dashboards, procedures, or products on top of the source-agnostic data model
By creating logic on top of a source-agnostic data model, you maintain the ability to add new sources, migrate from one tool to another, or upgrade a system without needing to recreate all of your queries or automations.
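Here's a hypothetical sketch of that mapping layer in Python - the source field names are made up, but the structure is the point: each source gets one small translation function, and everything downstream depends only on the canonical shape:

```python
# Hypothetical mapping layer: source-specific records are translated into the
# enterprise data model, and all downstream logic depends only on the
# canonical fields. Field names are illustrative.

def from_crm(record: dict) -> dict:
    """Map a CRM-specific contact record into the canonical customer shape."""
    return {
        "customer_id": record["Id"],
        "email": record["Email"],
        "org_id": record.get("AccountId"),
    }

def from_pos(record: dict) -> dict:
    """Map a point-of-sale buyer record into the same canonical shape."""
    return {
        "customer_id": record["buyer_ref"],
        "email": record["buyer_email"],
        "org_id": record.get("merchant_ref"),
    }

def churn_risk(customer: dict) -> bool:
    """Example downstream logic: written once, against canonical fields only."""
    return customer.get("org_id") is None  # placeholder rule for illustration
```

Swapping out the CRM later means rewriting one mapping function, not every dashboard query or automation.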
Your work isn't done once data consolidation or the initial data integration process is complete.
Once data is flowing effectively and you're creating material business value, it's important to think about data quality.
I recommend adding two types of monitoring to ensure data is syncing correctly:
First, make sure you have alerting and monitoring set up for your data pipelines
Second, monitor the actual data to detect anomalies (volumes of data, data types, etc.)
While data quality and observability do not create value by themselves, they help to ensure your company gets the most value out of the pipelines you build.
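As a minimal sketch of the second type of monitoring - the trailing seven-day baseline and the 50% threshold are assumptions you'd tune to your own pipelines - a data-volume check can be as simple as:

```python
from statistics import mean

# Minimal data-volume anomaly check. The seven-day trailing baseline and the
# 50% tolerance are assumptions chosen for illustration; tune them to your
# own pipelines and data sets.

def volume_anomaly(daily_row_counts: list[int], today_count: int,
                   tolerance: float = 0.5) -> bool:
    """Return True if today's row count deviates too far from the recent norm."""
    baseline = mean(daily_row_counts[-7:])
    if baseline == 0:
        return today_count > 0
    return abs(today_count - baseline) / baseline > tolerance

if __name__ == "__main__":
    history = [10_200, 9_800, 10_050, 10_400, 9_950, 10_100, 10_300]
    if volume_anomaly(history, today_count=4_200):
        print("ALERT: today's volume is far outside the recent range")
```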
My first rule of big data: Do not compromise on data security, privacy, or ethics.
There are many aspects of data governance that are critical to think through when integrating data:
The credentials that are used to access the data (How are they protected?)
The privileges granted to a person or system (What is the minimum level required?)
The fields that are being moved (Are they sensitive?)
The location of data storage (Do country-specific laws apply?)
The access controls (Who should be able to see the information?)
The uptime and availability of the system (How critical is this information to client workflows?)
There are plenty of other items to consider when handling structured and unstructured data. I recommend starting from the top - with your values, culture, and company policies. Then define data management procedures, and then dive into the specific controls necessary to protect your data.
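As one small, hypothetical example of the first two items on that list: keep credentials out of source code and request only the access your pipeline actually needs. The variable name and read-only scope below are illustrative:

```python
import os

# Illustrative handling of credentials and privileges: secrets never live in
# source code, and the pipeline asks for the minimum privilege it needs.
# The environment variable name and the read-only scope are assumptions.

def load_source_credentials() -> dict:
    """Fetch API credentials from the environment (or a secrets manager)."""
    token = os.environ.get("SOURCE_API_TOKEN")
    if token is None:
        raise RuntimeError("SOURCE_API_TOKEN is not set; refusing to run")
    return {
        "token": token,
        "scope": "read-only",  # least privilege: extraction never needs write access
    }
```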
There are quite a few data integration tools on the market.
Most integration platforms are tailored to a specific type of data, use case, or buyer persona. It's important to understand your specific needs and to evaluate the functionality of each platform accordingly.
When selecting a data integration tool, you should evaluate:
Data connectors
Extensibility
Pricing model
Support
Ease of use vs. customizability
On-premises vs. cloud-based
Proprietary vs. open-source
Batch vs. real-time processing
The landscape of data integration tools is complex. For instance, in the ETL category alone, there are specific tools that are great for Snowflake, BigQuery, Redshift, or PostgreSQL.
If you ever need advice, or want a second set of eyes when evaluating the various data integration techniques, we're happy to help.