Amazon Redshift is one of the best-known cloud data warehouses, but it's only as valuable as the data it contains. To keep that data flowing, you'll need a strong ETL pipeline in place.
This guide explains how to use ETL on Redshift, the best Amazon Redshift ETL tools for various use cases, and more.
Amazon Redshift is a fully managed cloud-based data warehouse. It's part of Amazon Web Services (AWS) and offers essentially unlimited scaling for big data at an affordable price.
Your processes on Redshift run on nodes grouped into clusters, with each Redshift cluster running on an engine.
Every cluster has a leader node and 1–128 compute nodes, configured as either dense storage (HDD) or dense compute (SSD).
Redshift uses a column-oriented, OLAP (online analytical processing) database design, making it a great tool for analytics and business intelligence.
For teams that need OLTP (online transaction processing), Amazon RDS is a better option.
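OLAP workloads aggregate a few columns across many rows, which is exactly what columnar storage accelerates. The sketch below runs one such query against an in-memory SQLite database purely to illustrate the query shape (SQLite is row-oriented; Redshift would store `region` and `amount` as separate columns and scan only those). The table and data are illustrative placeholders.

```python
import sqlite3

# Build a tiny sales table to demonstrate a typical BI aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "widget", 10.0), ("east", "gadget", 5.0),
     ("west", "widget", 7.5), ("west", "widget", 2.5)],
)

# OLAP-style query: total revenue per region, touching only two columns.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 15.0), ('west', 10.0)]
```

An OLTP system, by contrast, is optimized for many small single-row reads and writes, which is why Amazon points transactional workloads at RDS instead.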
You can interact with Redshift tables using the Redshift Data API, a web interface like Amazon Redshift Query Editor, or a command-line tool like Amazon Redshift RSQL.
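Of those options, the Redshift Data API is the most script-friendly, since it submits SQL over HTTPS without managing a persistent connection. Here's a minimal sketch: the cluster name, database, and user are placeholder assumptions, and the keyword arguments are built as a plain dict so the call can be inspected without an AWS connection.

```python
def execute_statement_args(cluster_id: str, database: str,
                           db_user: str, sql: str) -> dict:
    """Keyword arguments for the Redshift Data API's ExecuteStatement call."""
    return {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }

# Placeholder cluster/database/user values for illustration only.
args = execute_statement_args("my-cluster", "dev", "awsuser",
                              "SELECT COUNT(*) FROM sales")

# With AWS credentials configured, you would submit it like so:
#   import boto3
#   client = boto3.client("redshift-data")
#   response = client.execute_statement(**args)
#   # then poll client.describe_statement(Id=response["Id"]) until it finishes
print(args["Sql"])
```

The Data API runs statements asynchronously, so production code polls `describe_statement` (or subscribes to an EventBridge event) rather than blocking on the call.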
Redshift has several competitors in the data warehouse space, including Snowflake and Microsoft Azure Synapse Analytics. But there are a few reasons why AWS Redshift has remained an industry leader for almost a decade.
Redshift prices start at around $0.25/hour but vary depending on region, compute density, vCPU, and more.
Redshift offers big discounts for Reserved Instances.
Redshift offers a two-month free trial for new users.
Likely the most affordable option on the market, with zero upfront cost, generous free trial volumes, and large savings for paying for reserved nodes in advance.
Runs multiple SQL queries in a single transaction to minimize data processing costs.
Runs up to 10x faster than competitors, thanks to under-the-hood innovations like query distribution, machine learning optimizations, and a massively parallel processing architecture.
Granular security permissions including access management, cluster encryption, SSL connections, and more.
Near-infinite capacity to scale, from a few gigabytes to a few petabytes.
| Redshift Advantages | Redshift Disadvantages |
| --- | --- |
| Easy-to-use serverless platform that relies on basic SQL. | Doesn't separate storage and compute like Snowflake and BigQuery, resulting in delays. |
| Deep integration with AWS services, with seamless connections to tools like Athena, Amazon S3, Database Migration Service, AWS Glue, and many more. | Only supports Amazon Web Services (AWS) connections natively. |
| AWS Lambda Utility Runner to automate some monitoring processes. | Limited support for JSON functions in SQL. |
| Attractive pricing, with a generous free tier and inexpensive scaling costs. | Despite being based on PostgreSQL, Amazon Redshift doesn't support many native PostgreSQL functions or data types. |
| Ideal for setting up data lakes and archiving less frequently accessed data into Amazon S3. | Large administrative overhead, with granular settings for everything instead of defaults that fit most use cases. |
Redshift is a cost-effective platform that's best for enterprise clients with precise, granular needs and the technical know-how to implement those needs themselves.
You can use AWS Redshift for both ETL and ELT, but as a modern cloud data warehouse, it's best served by ELT workflows, where transformations run inside Redshift and leverage its cloud scalability.
While Amazon Redshift isn't an ETL tool, it has built-in ETL capabilities.
It's also compatible with several ETL tools, including others in the AWS ecosystem and third-party tools.
Performing ETL with AWS services alone may sound tempting, but it's not for the faint of heart. You'll own the entire system and completely control each step, and there are detailed recommendations to follow, including:
Using a single COPY command to load data from multiple files into a target table
Using workload management to define queues dedicated to separate workloads
Using BEGIN and END statements to execute several transformations in each commit, rather than sequential steps on transformed data
Because Redshift doesn't automate as many maintenance processes as competitors like BigQuery, you'll need to include these best practices in your scripts.
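The first and third practices above can be baked into a script generator. This is a hedged sketch: a single COPY from a manifest ingests many files in one command, and the follow-up transformations run inside one BEGIN/END block so they commit together. The table, bucket, and IAM role names are illustrative placeholders, and the DELETE/INSERT transform stands in for whatever cleanup your pipeline actually needs.

```python
def build_load_script(table: str, manifest_url: str, iam_role: str) -> str:
    """Generate a Redshift load script following the COPY and transaction
    best practices: one COPY for all files, one transaction for transforms."""
    # One COPY command loads every file listed in the S3 manifest in parallel.
    copy_stmt = (
        f"COPY {table} FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "MANIFEST FORMAT AS CSV;"
    )
    # Transformations share a single BEGIN/END block, so they commit once
    # instead of paying per-statement commit overhead.
    transform = (
        "BEGIN;\n"
        f"DELETE FROM {table}_clean;\n"
        f"INSERT INTO {table}_clean SELECT DISTINCT * FROM {table};\n"
        "END;"
    )
    return copy_stmt + "\n\n" + transform

# Placeholder bucket, manifest, and role ARN for illustration only.
script = build_load_script(
    "staging_sales",
    "s3://my-bucket/manifests/sales.manifest",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(script)
```

The second practice, workload management, is configured on the cluster (or in `wlm_json_configuration`) rather than in the load script itself, by defining separate queues for ETL and BI queries.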
Most data teams want to focus their time on analytics, not managing custom-built ETL processes across various Redshift clusters. If that sounds like your team, you'll want to choose a third-party tool.
AWS has several ETL and data pipeline tools including AWS Glue, AWS Data Pipeline, and AWS Kinesis. Each has a different use case, but they're not necessarily the best options for most businesses.
Redshift's popularity means there are dozens of ETL tools from third parties, which are usually more flexible than AWS tools. Here's what to look for.
Any tool can load data from major sources like Oracle or MySQL databases. But what about the less popular apps that deliver mission-critical data? Look for a tool that integrates all the data sources you'll need, not just the biggest ones.
Your data workloads in a few years will be different than they are today. Will your ETL tool adapt? Look for a tool that makes it easy to ingest new data sets.
Just because a data integration tool is compatible doesn't mean it's built for the unique optimizations Redshift offers. Look for a tool that's designed to leverage the speed built into AWS Redshift's unique architecture.
When something goes wrong, you need a team who's able to help. API changes and other connector-breaking changes are common, so look for a tool with a team that responds quickly.
Portable is the best Amazon Redshift ETL tool for long-tail sources. It already has 450+ ETL connectors, with more being added each month. They're focused on connecting apps that the big platforms ignore.
Portable also builds new connectors within days or even hours, at no extra cost. If you've started using a new app and need a custom AWS Redshift connector, the Portable team can have it ready in less time than it'd take to script it yourself.
And maintenance, alerting, and reporting are handled for you. That means if your app's API changes, the Portable team will make the necessary adjustments, so your ETL pipeline keeps working. No more missing data because an integration broke without warning.
Portable has a free plan for manual data workflows with no limits on volume, connectors, or destinations.
For automated data synchronization, it's only $200 monthly.
For enterprise requirements and SLAs, contact sales.
450+ data connectors focused on long-tail applications, alongside mainstream ones like Salesforce and HubSpot.
Integrates with many popular cloud data warehouses
New sources are developed within days or hours at no extra cost.
Ongoing connector maintenance included at no cost.
White-glove customer support on all plans.
Portable doesn't have connections to some enterprise applications like Oracle or SAP.
No support for data lakes.
Only available in the United States.
Portable is best for data teams who want to spend their time on data analysis, not data processing.
AWS Glue is one of Amazon's ETL tools. It uses a serverless architecture and makes it easy to integrate data from other Amazon services, like Amazon S3, into your Redshift cluster. It's built on Apache Spark.
Seamless integration with Redshift and other AWS properties.
Automations that learn the schema and metadata of a dataset and generate ETL scripts to import new data.
Autoscaling based on workload.
Limited to AWS services, with no support for third-party platforms.
Difficult to combine stream and batch data processes.
Reliance on Apache Spark means developers need to learn yet another tool to write effective scripts.
AWS Glue is the best option for teams that are looking for data integration from other platforms in the AWS ecosystem into Redshift.
Stitch is a data pipeline tool built on the open-source Singer framework and now part of Talend. It manages data extraction and simple transformations using a built-in GUI or Python, Java, or SQL.
Standard plan starting at $100/month for up to 5 million active rows per month, one destination, and 10 sources (limited to "Standard" sources)
Advanced plan at $1,250/month for up to 100 million rows and three destinations
Premium plan at $2,500/month for up to 1 billion rows and five destinations
14-day free trial available
130+ data sources supported.
Integration with other Talend tools.
Intuitive platform with a GUI for transformations.
Automations including monitoring and alerts.
Transformation options are limited to the basics required to import data.
Limits on destinations and sources for every plan.
No on-premise deployment option.
Stitch is designed for teams that use common data sources and need a simple tool for basic Redshift data ingestion.
Hevo is a platform with 150+ data connectors that supports ETL, ELT, and Reverse ETL workflows. It's codeless and includes real-time data loading, replication, and transformations.
Free: Up to one million events (limited to 50+ data sources)
Starter: Starting at $239/month
Business: Custom quote
150+ data connectors (free plan limited to 50+).
Real-time data migration.
Robust Python-based data transformation.
24/7 live support.
Limited control over ingestion and loading timeframes.
Schema updates don't always propagate to other tools automatically and may require manual loading.
Hevo is best for data analysis teams with well-known sources looking for a no-code platform with flexibility for Python code.
Blendo is a no-code ELT cloud data platform that's part of Rudderstack. It uses automation scripts to speed up the setup process so you can start importing Redshift data quickly.
Free plan limited to three sources
Pro plan starts at $750/month and includes transformations
Enterprise plans available with custom pricing
45+ data sources supported.
Easy-to-use platform that doesn't require programming knowledge to use.
Built-in features including monitoring and alerts.
Very limited number of supported data sources.
You have to whitelist Blendo's IP addresses for AWS connectivity.
Limited featureset for data transformation.
Teams can't add data connections to Blendo on their own.
Blendo is best for data science teams with a small selection of sources that are looking for a no-code platform.
We couldn't fit every AWS Redshift data tool in this list. Here are some other ETL tools that we didn't feature, but that might work based on your data needs.
These are two other AWS data tools, in addition to Glue. AWS Data Pipeline helps import data, but doesn't have as many transformation options as Glue. Kinesis is designed for real-time and streaming data.
Fivetran supports 160+ sources, mostly focused around mainstream applications. It's a robust platform designed for enterprises.
Integrate.io is a codeless platform with 200+ built-in integrations, focused around eCommerce. It uses templates to speed up new data integrations.
Matillion is an all-in-one ELT data warehouse tool that supports 110+ source systems. It offers cloud-based and on-premise deployments.
Redshift is a deservedly popular choice for data warehousing. It's fast, scalable, and affordable, and it integrates with the AWS ecosystem better than anything else out there.
With the right ETL system for importing Amazon Redshift data, you'll build a robust and effective modern data stack.
However, Redshift's built-in ETL tools leave much to be desired. They'll increase the amount of time and work your data team has to spend collecting and transforming data. Portable eliminates that busywork and lets your team focus on insights, not troubleshooting.
Looking for the best Redshift ETL tool? Give Portable a try.