Amazon Redshift is one of the most popular cloud data warehouses, but it's only as valuable as the data it contains. To get that data in reliably, you'll need an effective ETL pipeline in place.
Today, we'll cover strategies for using ETL on Redshift, which tools are best for different use cases, and more.
Amazon Redshift is a fully managed cloud-based data warehouse. It's part of Amazon Web Services (AWS) and offers essentially unlimited scaling for big data at an affordable price.
Redshift organizes its compute resources into nodes, which are grouped into clusters, with each cluster running on an engine.
Every cluster has a leader node and anywhere from one to 128 compute nodes, classified as either dense storage (HDD) or dense compute (SSD).
Redshift uses column-oriented, OLAP (online analytical processing) databases, making it a great tool for analytics and business intelligence.
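To see why a column-oriented layout suits analytics, here's a toy sketch in plain Python. It isn't how Redshift stores data internally; the sample rows and the `to_columns` helper are made up for illustration. The point is only that an aggregate over one field needs to read a single column rather than every full row.

```python
# Illustrative only: a toy comparison of row-oriented and column-oriented layouts.
# The sample data and helper name are hypothetical.
rows = [
    {"order_id": 1, "region": "us", "amount": 20.0},
    {"order_id": 2, "region": "eu", "amount": 35.5},
    {"order_id": 3, "region": "us", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full row is visited just to sum one field.
row_total = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read.
column_total = sum(columns["amount"])
```

Both totals are the same. The difference is how much data has to be touched to compute them, which matters more as tables get wider.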
For teams that need OLTP (online transaction processing), Amazon RDS is a better option.
You can interact with Redshift data using the Redshift Data API, a web interface like Amazon Redshift Query Editor V2, or a command-line tool like Amazon Redshift RSQL.
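As a rough sketch of the Data API route, the snippet below builds a request for the `execute_statement` call in boto3's `redshift-data` client. The cluster name, database, user, and query are placeholders, and `run_statement` assumes boto3 is installed and AWS credentials are configured, so it's defined but not called here.

```python
def build_statement_request(cluster_id, database, db_user, sql):
    """Build the keyword arguments for the Data API's execute_statement call."""
    return {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }


def run_statement(request):
    """Submit the request. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("redshift-data")
    response = client.execute_statement(**request)
    return response["Id"]  # statement ID, used later to poll for results


# Placeholder values: swap in your own cluster, database, and user.
request = build_statement_request(
    cluster_id="example-cluster",
    database="dev",
    db_user="awsuser",
    sql="SELECT COUNT(*) FROM sales;",
)
```

The Data API is asynchronous, so the returned statement ID is what you'd pass to follow-up calls such as `describe_statement` and `get_statement_result`.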
Redshift has several competitors in the data warehouse space, including Snowflake and Microsoft Azure Synapse Analytics. But there are a few reasons why AWS Redshift has remained an industry leader for almost a decade.
Redshift prices start at around $0.25/hour but vary depending on region, compute density, vCPU, and more.
Redshift offers big discounts for Reserved Instances.
Redshift offers a two-month free trial for new users.
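For a back-of-the-envelope estimate, here's a small sketch using the $0.25/hour starting rate above. It assumes a 30-day month with the cluster running around the clock; the node counts are hypothetical, and real bills depend on node type, region, and any reserved pricing.

```python
HOURLY_RATE_USD = 0.25      # starting on-demand rate quoted above
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month, cluster always on


def monthly_cost(node_count, hourly_rate=HOURLY_RATE_USD, hours=HOURS_PER_MONTH):
    """Estimate the monthly on-demand cost for a cluster of identical nodes."""
    return node_count * hourly_rate * hours


single_node = monthly_cost(1)  # 1 * 0.25 * 720 = 180.0
four_nodes = monthly_cost(4)   # 4 * 0.25 * 720 = 720.0
```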
Likely the most affordable option on the market, with zero upfront cost, generous free trial volumes, and large savings when you pay for reserved nodes in advance.
Speeds up to 10x as fast as competitors, with under-the-hood innovations like query distribution, machine learning optimizations, and massively parallel processing architecture.
Granular security permissions including access management, cluster encryption, SSL connections, and more.
Near-infinite capacity to scale, from a few gigabytes to a few petabytes.
Easy-to-use platform that relies on basic SQL.
Deep integration in the AWS ecosystem, with seamless connections to tools like Athena, Amazon S3, Database Migration Service, and many more.
AWS Lambda Utility Runner to automate some monitoring processes.
Unlike Snowflake and BigQuery, doesn't separate storage and compute, which can result in delays.
Only supports AWS cloud infrastructure.
Limited support for JSON functions in SQL.
Elasticity depends on the "Elastic Resize" feature, which is restrictive and requires downtime.
Large administrative overhead with granular settings for everything, instead of default settings that work in most use cases.
Redshift is a cost-effective platform that's best for enterprise clients with precise, granular needs and the technical know-how to implement those needs themselves.
Amazon Redshift isn't an ETL tool but has built-in ETL capabilities.
It's also compatible with several ETL tools, including others in the AWS ecosystem and third-party tools.
You can use Redshift for both ETL and ELT, but as a modern data warehouse, it's best served with ELT workflows. This process---extract, load, transform---leverages Redshift's cloud scalability for data transformations.
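Here's a minimal sketch of the ELT order of operations. It uses Python's built-in `sqlite3` as a stand-in for the warehouse so it runs anywhere. The table names and sample rows are made up, and Redshift's SQL dialect differs in places. The pattern is what matters: land the raw data first, then transform it with SQL inside the database.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (made-up sample data).
raw_orders = [
    ("1001", "us", "19.50"),
    ("1002", "eu", "5.00"),
    ("1003", "us", "10.50"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: land the data untouched in a staging table, every column as text.
conn.execute("CREATE TABLE staging_orders (order_id TEXT, region TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw_orders)

# Transform: cast and aggregate inside the database using SQL.
conn.execute(
    """
    CREATE TABLE revenue_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS revenue
    FROM staging_orders
    GROUP BY region
    """
)

result = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
```

In an ETL workflow, the casting and aggregation would happen in the pipeline before the load. With ELT, that work moves into the warehouse, where the compute already lives.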
Building a native ETL pipeline in AWS may sound tempting, but it's not for the faint of heart. You'll own the entire system and completely control each step, but there are detailed recommendations to follow, including:
Using a single COPY command instead of several when loading multiple files into a target table
Using workload management to define queues dedicated to separate workloads
Using BEGIN and END statements to run several transformation steps in a single commit, rather than committing each step separately
Because Redshift doesn't automate as many maintenance processes as competitors like BigQuery, you'll need to include these best practices in your scripts.
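As a rough sketch of what those recommendations look like in a script, the helpers below generate the SQL text: one COPY for a whole S3 prefix, a `SET query_group` line to route the work to a dedicated WLM queue, and a BEGIN ... END wrapper. The function names, table names, bucket, IAM role ARN, and queue label are all hypothetical, and the queue routing only works if a matching query group is configured in your WLM settings.

```python
def build_copy(table, s3_prefix, iam_role):
    """One COPY for every file under an S3 prefix, instead of one COPY per file.

    Plain string formatting is used for brevity; don't pass untrusted input.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )


def build_transaction(statements, query_group=None):
    """Wrap statements in BEGIN ... END so they commit together."""
    lines = []
    if query_group:
        # Routes this session's queries to the WLM queue tied to the label.
        lines.append(f"SET query_group TO '{query_group}';")
    lines.append("BEGIN;")
    lines.extend(statements)
    lines.append("END;")
    return "\n".join(lines)


# Hypothetical names throughout: adjust to your own schema, bucket, and role.
copy_sql = build_copy(
    table="staging.orders",
    s3_prefix="s3://example-bucket/orders/2024-01-01/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)

script = build_transaction(
    [
        copy_sql,
        "INSERT INTO analytics.orders SELECT * FROM staging.orders;",
    ],
    query_group="etl",
)
```

The resulting `script` string can be run with whichever client you already use, such as RSQL or the Data API.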
Most data teams want to focus their time on analytics, not managing custom-built ETL processes. If that sounds like your team, you'll want to choose a third-party tool.
AWS has several ETL and data pipeline tools, including AWS Glue, AWS Data Pipeline, and Amazon Kinesis. Each has a different use case, but they're not necessarily the best options for most businesses.
Redshift's popularity means there are dozens of ETL tools from third parties, which are usually more flexible than AWS tools. Here's what to look for.
Any tool can load data from major sources like Oracle or MySQL databases. But what about the less-popular apps that deliver mission-critical data? Look for a tool that integrates all the data sources you'll need, not just the biggest ones.
Your data workloads in a few years will be different than they are today. Will your ETL tool adapt? Look for a tool that makes it easy to connect new data sources and destinations.
Just because a data integration tool is compatible doesn't mean it's built for the unique optimizations Redshift offers. Look for a tool that's designed to leverage the speed built into Redshift's unique architecture.
When something goes wrong, you need a team who's able to help. API changes and other connector-breaking changes are a common occurrence, so look for a tool with a team that responds quickly.
Portable is the ideal Redshift ETL tool for hard-to-find data sources. It already has 300+ long-tail ETL connectors, with more delivered every month, all focused on long-tail apps that the big platforms ignore.
Portable also delivers new connectors within days or even hours, at no extra cost. If you've started using a new app and need a custom Redshift connector, the Portable team can have it ready to use in less time than it'd take to build it yourself.
And maintenance, alerting, and reporting are handled for you. That means if your app's API changes, the Portable team will make the necessary adjustments, so it keeps working. No more missing data because a connector broke without warning.
Portable offers a free plan for manual data workflows with no caps on volume, connectors, or destinations.
For automated data flows, Portable charges a flat fee of $200/month.
For enterprise requirements and SLAs, contact sales.
300+ data connectors meant for long-tail applications.
New data source connectors developed within days or hours at no extra cost.
Ongoing connector maintenance included at no cost.
White glove customer support on all plans.
Portable only delivers long-tail data sources and doesn't have connections for enterprise applications like Salesforce or Oracle.
No support for data lakes.
Only available in the United States.
Portable is best for data teams with less-common data sources who want to spend their time on analysis, not data processing.
AWS Glue is one of Amazon's ETL tools. It uses a serverless architecture and makes it easy to integrate data from other Amazon tools, like S3, into your Redshift cluster. It's built on Apache Spark.
Seamless integration with Redshift and other AWS properties.
Automations that learn the schema and metadata of a dataset and generate ETL scripts to import new data.
Autoscaling based on workload.
Limited to AWS properties, with no support for third-party platforms.
Difficult to combine stream and batch data processes.
Reliance on Apache Spark means developers need to be familiar with yet another tool for effective scripts.
AWS Glue is the best option for teams that need to integrate data from other services in the AWS ecosystem into Redshift.
Stitch is a data pipeline tool that's part of Talend. It manages data extraction and simple transformations using a built-in GUI or Python, Java, or SQL.
Standard plan starting at $100/month for up to 5 million active rows per month, one destination, and 10 sources (limited to "Standard" sources)
Advanced plan at $1,250/month for up to 100 million rows and three destinations
Premium plan at $2,500/month for up to 1 billion rows and five destinations
14-day free trial available
130+ data sources supported.
Integration with other Talend tools.
Intuitive platform with a GUI for transformations.
Automations including monitoring and alerts.
Transformation options are limited to the basics required to import data.
Limits on destinations and sources for every plan.
No on-premise deployment option.
Stitch is designed for teams that use common data sources and need a simple tool for basic Redshift data ingestion.
Hevo is a platform with 150+ data connectors that supports ETL, ELT, and Reverse ETL workflows. It's codeless and includes real-time data loading, replication, and transformations.
Free: Up to one million events (limited to 50+ data sources)
Starter: Starting at $239/month
Business: Custom quote
150+ data connectors (free plan limited to 50+).
Real-time data migration.
Robust Python-based data transformation.
24/7 live support.
Limited control over ingestion and loading timeframes.
Not all schema updates carry over to other tools automatically, and some may require manual loading.
Hevo is best for data teams with well-known data sources looking for a no-code platform with flexibility for Python code.
Blendo is a no-code ELT cloud data platform that's part of Rudderstack. It uses automation scripts to speed up the setup process so you can start importing Redshift data quickly.
Free plan limited to three sources
Pro plan starts at $750/month and includes transformations
Enterprise plans available with custom pricing
45+ data sources supported.
Easy-to-use platform that doesn't require programming knowledge.
Built-in features including monitoring and alerts.
Very limited number of supported data sources.
Limited feature set for data transformations.
Teams can't connect new data sources to Blendo on their own.
Blendo is best for data teams with a small selection of data sources that are looking for a no-code platform.
We couldn't fit every Redshift data tool in this list. Here are some other ETL tools that we didn't feature, but that might work based on your data needs.
AWS Data Pipeline and Kinesis are two other AWS data tools, in addition to Glue. AWS Data Pipeline helps import data, but doesn't have as many transformation options as Glue. Kinesis is designed for real-time and streaming data.
Fivetran supports 160+ data sources, mostly focused on major applications. It's a robust platform designed for enterprises.
Integrate is a codeless platform with 200+ built-in integrations, focused on eCommerce. It uses templates to speed up new data integrations.
Matillion is an all-in-one ELT data warehouse tool that supports 110+ data sources. It offers cloud-based and on-premise deployments.
Redshift is a deservedly popular choice for data warehousing. It's fast, scalable, and affordable---and integrates with the AWS ecosystem better than anything else out there. With the right ETL system for importing Amazon Redshift data, you'll build a robust and effective data analytics solution.
But Redshift's built-in ETL tools leave much to be desired. They'll increase the amount of time and work your data team has to spend collecting and transforming data. Portable eliminates that busywork and lets your team focus on insights, not troubleshooting.
Looking for the best Redshift ETL tool? Give Portable a try.