You're ready to choose a data ingestion tool, but the options are overwhelming.
What features matter most for easy data integration? What's the right data ingestion process? And what's the difference between data ingestion, ETL, and ELT?
In this guide, you'll learn which features matter most for data ingestion, how to choose the right tool, and what to do if your data sources aren't supported.
Data ingestion is the process of aggregating big data from several sources and moving it into cloud-based storage, like a data warehouse, where it's easier to understand and analyze.
Organizations need data ingestion to find actionable insights in the data they generate. Without ingestion, information stays in its source application. It won't be in a central repository for data engineers to analyze for business intelligence, machine learning, or other uses.
There are several benefits to implementing an automated data ingestion process, especially if you're currently using a manual process or not extracting your data at all.
A data ingestion process will make your data usable outside of its source, immediately opening up new possibilities for your data strategy based on what you've collected.
Data ingestion collects data from several sources into a single repository. This improves your data analytics abilities and lets you discover insights more quickly than if you had kept data in its original location.
Compiled data is easier to analyze and thus more actionable, letting you reach decisions faster through data science.
The sources that generate your data usually aren't great for analyzing and understanding that data. Ingestion can help you move it into a place where you can analyze and interpret it more effectively.
While data ingestion will help most businesses, it comes with several challenges of its own.
Today's organizations collect data from an ever-growing variety of sources, from SaaS apps to IoT devices to internal databases and even spreadsheets. Because these sources use different formats, some may require custom pipeline design.
The APIs that connect data sources and destinations change or break frequently, resulting in broken pipelines that must be repaired manually.
Privacy laws like GDPR, CCPA, HIPAA, and other legislation add complex requirements that change frequently. A single organization may need separate policies and procedures for the same data across different jurisdictions.
Different sources apply different standards and metrics to the same enterprise data. These differences may go unnoticed in silos, but ingestion can surface incomplete, duplicate, or even conflicting datasets.
The landscape of cybersecurity threats is constantly changing, meaning the data protections that work today may not be effective tomorrow.
The two main types of data ingestion are batch processing, which ingests data in groups, and stream processing, which ingests real-time data as it arrives. When selecting a data ingestion tool and process, you'll need to decide which method is best.
Batch processing is the most common and affordable method. The ingestion layer collects new data and processes it all in a single group. Common batch processes might be hourly, daily, or weekly. It might also be triggered ad hoc or when the workload reaches a set threshold.
Batch processing is best for broad enterprise use cases that don't need up-to-the-minute data. Examples include affiliate payouts, payment statuses, or attendance metrics.
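As a sketch of the batch approach, the following collects everything newer than the last run's watermark and loads it as one group. The in-memory source and field names are hypothetical stand-ins for a real application database:

```python
# Hypothetical in-memory "source" standing in for a real application database.
SOURCE_ROWS = [
    {"id": 1, "updated_at": "2023-01-01T00:00:00", "amount": 40},
    {"id": 2, "updated_at": "2023-01-02T00:00:00", "amount": 55},
    {"id": 3, "updated_at": "2023-01-03T00:00:00", "amount": 70},
]

def run_batch(last_watermark: str, destination: list) -> str:
    """Collect every row newer than the watermark and load it as one group."""
    batch = [r for r in SOURCE_ROWS if r["updated_at"] > last_watermark]
    destination.extend(batch)  # one bulk load per scheduled run
    # Advance the watermark so the next run only picks up newer rows.
    return max((r["updated_at"] for r in batch), default=last_watermark)

warehouse = []
watermark = run_batch("2023-01-01T12:00:00", warehouse)
```

In a real deployment, a scheduler (hourly, daily, weekly) would call a job like this, and the watermark would be persisted between runs.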
Stream processing is best for real-time streaming data. New incoming data triggers the process using a system like change data capture. Because stream processing is "always on," so to speak, it's more resource-intensive than using batches.
Stream processing is best for applications that need data accurate to the second or even the millisecond. Examples include threat alerts, stock prices, or social media feeds.
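A stream-style pipeline inverts the batch control flow: instead of running on a schedule, a handler fires once per incoming change. The event shape below is hypothetical, not a real change data capture wire format:

```python
# Minimal sketch of stream-style ingestion: each change event is handled the
# moment it arrives instead of waiting for a scheduled batch.
destination = []

def on_change_event(event: dict) -> None:
    """Called once per incoming change; ingests immediately."""
    if event["op"] in ("insert", "update"):
        destination.append(event["row"])  # append-only: updates land as new versions
    elif event["op"] == "delete":
        destination[:] = [r for r in destination if r["id"] != event["row"]["id"]]

# Simulated change feed; a real system would receive these from a log or queue.
for ev in [
    {"op": "insert", "row": {"id": 1, "price": 101.5}},
    {"op": "update", "row": {"id": 1, "price": 102.0}},
]:
    on_change_event(ev)
```

The "always on" cost mentioned above comes from keeping this handler and its upstream connection running continuously, even when no events arrive.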
Micro-batch processing is a batch processing method that simulates stream processing but without the associated costs. It breaks data into very small batches that are processed in near-real-time, usually within a minute. Micro-batching is the method behind some "streaming" tools like Apache Spark Streaming.
Micro-batch processing is best for use cases that need low-latency data ingestion and replication but not at the precision of true streaming. Examples include website traffic, incoming call volume, or inventory.
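Micro-batching sits between the two methods: events are buffered and flushed in tiny groups. A minimal sketch, with an illustrative flush size (real tools also flush on a timer, typically within a minute):

```python
# Micro-batching sketch: events accumulate in a small buffer and are loaded
# together in tiny groups, approximating streaming without per-event overhead.
FLUSH_SIZE = 3  # illustrative; real tools flush on size or elapsed time
buffer, destination = [], []

def ingest(event: dict) -> None:
    buffer.append(event)
    if len(buffer) >= FLUSH_SIZE:
        destination.append(list(buffer))  # one small batch loaded together
        buffer.clear()

for i in range(7):
    ingest({"seq": i})
```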
Data ingestion is not the same as ETL, though the processes are similar.
Data ingestion is a broad term for collecting and moving data to a separate platform, whereas ETL is a specific pipeline typically used with data warehouses or data lakes.
ETL can form part of a data ingestion pipeline, but not all data ingestion uses the extract, transform, and load data flow that defines ETL.
The separate acronyms ELT and ETL can complicate the terminology even more.
ETL (extract, transform, load) was the original process earlier systems used to load data from a source (like a SaaS platform, ERP database, or similar), transform it on local machines, then load it into a data warehouse.
But today's hardware is no match for the near-infinite computing power of the cloud. That's why the modern data management workflow switches the last two steps for the ELT process: first extracting, then loading to the cloud, then transforming the loaded data.
Most of today's data tools actually use this modern ELT process, including many that still call themselves ETL tools. There's a good chance that no matter what a platform calls itself, it's an ELT tool under the hood.
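The ELT ordering can be sketched with SQLite standing in for a cloud warehouse (table and column names here are hypothetical): raw values are loaded untouched, then transformed with SQL using the destination's own compute.

```python
import sqlite3

# ELT sketch: extract, load the raw data untransformed, then transform it
# inside the destination with SQL. sqlite3 stands in for a cloud warehouse.
raw_rows = [("a", "19.99"), ("b", "5.00")]  # extracted as raw text

db = sqlite3.connect(":memory:")

# Load: raw values land in the warehouse exactly as extracted.
db.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)

# Transform: runs last, using the warehouse's own compute.
db.execute("""
    CREATE TABLE orders AS
    SELECT customer, CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

A legacy ETL pipeline would instead cast and clean the rows on local machines before anything reached the warehouse.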
When choosing the best data ingestion tool---or stack of tools---you'll need to consider a few factors.
You'll first need to understand the amount of data you'll be processing and the frequency at which you need to process it. Can your tool support high-volume, low-volume, or fluctuating data needs? Will batch data or real-time ingestion be best? Can you use low-latency micro-batch processing instead?
The type of data you ingest can be structured, semi-structured, or unstructured.
Structured data is already organized and sorted, like a financial spreadsheet. This kind of information is relatively easy to extract and categorize since the data's schema, including names and field types, is built into how it's stored, as with file formats like Avro or Parquet.
Semi-structured data has some level of organization, like a collection of emails. To convert semi-structured data into a usable data format typically requires some transformations, like parsing nested JSON.
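For example, a nested JSON record can be flattened into flat, column-like keys before loading. This is a minimal sketch with hypothetical field names:

```python
import json

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested JSON objects into dot-separated, column-like keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))  # recurse into nested objects
        else:
            flat[name] = value
    return flat

raw = json.loads('{"id": 7, "sender": {"name": "Ada", "domain": "example.com"}}')
row = flatten(raw)
```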
Unstructured data has little or no organization, like saved audio files. Unstructured data can be hard to sort and search, and schema needs to be applied to make it usable. You'll need to connect data ingestion to a complex ELT pipeline to transform unstructured data. It's common to load unstructured data as-is into a data lake for later transformation and analysis.
What regulations apply to your business or industry? Look for a tool that complies with applicable regulations or certifications like HIPAA, SOC 2, GDPR, and more.
Most data pipeline tools only support a few hundred data sources at best, compared with tens of thousands of SaaS platforms and a near-infinite number of internal data sources.
Cloud-based data sources are usually easier to integrate since they're accessible from anywhere in the world. They most commonly expose APIs, which provide an accessible and robust way to extract data without reprogramming the integration for every software update.
On-premise data sources are usually harder to integrate since the data lives on a server in a physical location. A tool will need to connect to your servers, likely using a VPN and point-to-point connector, to extract data. If the software changes, the integration might break and need to be rebuilt from scratch.
Your data workflow likely needs to integrate with a data warehouse, data lake, or database with its own formats and requirements. You'll need to ensure your tool is compatible with your chosen cloud platform, like AWS or Azure, or open-source solution, like PostgreSQL.
No matter how impressive a data ingestion tool looks, it will be unusable without the right connectors. Anyone who tells you otherwise is probably trying to sell you something.
If you choose a data platform that only connects to some data sources, your insights will stagnate.
You won't have a complete picture of your enterprise.
You won't be able to answer problems specific to your business units.
You won't be able to automate KPIs from less-common data sources.
Your product database and CRM system might get you 60% of the data, but you'll struggle to solve critical business problems if you work with incomplete data sets.
Instead of selecting a tool based on the number of connectors alone, look for a tool that supports the specific data sources you need. But that's not as easy as it might sound.
To start, you'll want to look for a data ingestion tool that supports the biggest, most important data sources in your ecosystem. If you're not sure which applications your organization uses, you're not alone. Research suggests the average company uses 80 SaaS applications.
You can ask each department for lists of the tools they use, and also leverage tools like SaaS management and network management software to get a more complete list.
These are the biggest categories to look for:
Product and internal databases like MySQL, PostgreSQL, Microsoft SQL Server, SAP HANA, or Oracle Database.
Customer relationship management (CRM) platforms like Salesforce, HubSpot, and Microsoft Dynamics 365.
Enterprise resource planning (ERP) software like Oracle NetSuite, Sage Intacct, and SAP S/4HANA.
Human resources information systems (HRIS) like ADP, BambooHR, and Workday.
Industry-specific SaaS platforms, including marketing automation, eCommerce, customer service, event tracking, analytics tools, collaboration software, and more.
As you review compatibility, make sure the connectors are ready to use out-of-the-box on the plan you're considering. Some ETL tools list "supported" data sources with connectors that are still in development. Others wall off specific data sources only for enterprise customers.
Once you have a good idea of which data sources are mission-critical, you'll need to match compatibility with the destination where you'll be storing extracted data and, in many cases, performing data transformations.
These could be data warehouses or data lakes. Thankfully, there are far fewer options, but not all are supported by ingestion and ETL tools. Pay special attention to product names: a tool that supports warehouse platforms like BigQuery and Redshift doesn't necessarily support other Google Cloud and AWS services.
Cloud data warehouse options include Azure Synapse Analytics, Snowflake, Databricks, Google BigQuery, Amazon Redshift, and IBM Db2.
Data lakes like Amazon S3 and Microsoft Azure Data Lake are becoming more popular, especially for unstructured data, but not all tools support them.
It can take a long time to review tool compatibility across 80+ SaaS applications. And if you're like most organizations, you'll find that no single tool supports all your data sources.
If that describes your data team, you'll need to consider custom connectors to integrate your entire stack.
Almost every team runs into this problem---you have at least one long-tail data source that's mission-critical for your workflow, but isn't supported by the major data ingestion tools. If you've researched, evaluated vendors, searched integration catalogs, and still can't find a prebuilt connection, you'll probably need a custom solution.
To integrate a long-tail source into your ingestion and ETL data pipeline, here are your options:
Most ETL tools let you request a custom connector. But don't count on them to fix the problem. Vendors have huge backlogs and bigger customers than you in line. For example, Fivetran has been around since 2012 but only has 200+ connectors.
A third party will probably do a great job developing a custom connector, but there are two major downsides. First, consultants are prohibitively expensive. And second, they're only around as long as you hire them---if they leave and your data source's API changes, you'll need to wait until they're available again or fix it yourself.
You can also develop the connector in-house. Years ago, the only option was to code it from scratch, but today's open-source frameworks can help speed up the process. That said, it's still a long process with a steep learning curve, and you'll be on the hook for maintenance. Most organizations are best served to focus on growth, not building connectors.
While "give up and move on" doesn't sound like an attractive option, it's one of the most common. A surprising number of teams just accept they won't be able to use long-tail sources. Others revert to exporting data into a CSV file and importing it manually.
Today, there's a new option to connect with long-tail data sources: companies like Portable that focus on smaller data sources the major ETL platforms overlook. Portable can work in tandem with another ETL tool and help you import the long-tail data sources you need.
Portable offers 250+ pre-built connectors for long-tail data sources and builds them upon request in days or even hours. In addition to development, Portable handles all monitoring, maintenance, alerting, troubleshooting, and support so your team can focus on the business with reliable data management that just works.
If you're struggling to find an ETL or data ingestion tool that supports all the data sources you need, Portable can offer a solution. It's simple. Just reach out to [email protected].
We'll get back to you quickly, you can test things out, and we'll handle all the development, maintenance, and troubleshooting of the long-tail connectors you need.
Portable's clients come from all backgrounds and rely on our team to deliver reliable connections so they can focus on their business:
A top eSports company uses Portable to load data from their various forums and engagement tools into their warehouse to analyze user engagement.
An AI company uses Portable to load data from a long-tail HR system into their warehouse to gain better insight into who is on vacation.
A high-growth eCommerce company syncs data from their inventory platform, shipping platform, and affiliate marketing platform into their warehouse to better optimize internal operations.
If you're looking to start gleaning insights from your data, you need an ETL tool that lets you gather from every source. Unfortunately, most platforms don't support lesser-known tools, which can leave you stuck with unusable data.
With Portable, you can reconnect those data sources to your destination of choice and start gathering actionable, data-driven insights into your long-tail data sources. And all without worrying about development, maintenance, or support.