ETL using Fivetran with AWS Lambda

CEO, Portable

What is AWS Lambda used for?

AWS Lambda is Amazon's event-driven, serverless compute service. It is used to run code in response to an event or a series of events (event-driven) without procuring and managing servers or containers (serverless).

Code runs in Lambda functions, a common building block in data integration. Lambda functions can be used to build and run applications and services triggered by events like data changes in AWS S3 buckets or HTTP requests to an API Gateway.
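As an illustration, here is a minimal Python sketch of a Lambda handler for an S3 "object created" event. The event shape follows the standard S3 notification format; the bucket name and the processing step are placeholders.

```python
import urllib.parse

def handler(event, context):
    """Collect the bucket/key pairs from an S3 ObjectCreated event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # S3 delivers object keys URL-encoded (e.g. spaces become '+')
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
        # A real function would now fetch and process each object, e.g.
        # boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return {"processed": objects}
```

In AWS you would attach this handler to the bucket's event notification configuration; locally you can exercise it with a hand-built event.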

AWS Lambda allows users to add custom functions or connectors to AWS resources such as DynamoDB tables and S3 buckets, letting you process data efficiently as it moves into or through the cloud.

Notable use cases for AWS Lambda include real-time file processing, data analytics, and web applications. You can also use the compute service to manage IoT, web and mobile backends, and processing streams.

AWS Lambda supports several languages through runtimes. It handles capacity provisioning, automatic scaling, and logging for you, and it is straightforward to configure and use.

What is the Difference Between Serverless and Scalable?

A serverless compute service is a system that allows you to run code and build applications on-demand without having to provide, manage, or scale any infrastructure.

Serverless doesn't necessarily mean that there are no servers. Instead, it is a relatively new paradigm that abstracts away managing servers from the application developer.

AWS Lambda is a perfect example of serverless; with AWS Lambda, application developers do not have to manage backend infrastructure.

On the other hand, scalable in the computing world means the ability of an application or service to grow or shrink to meet changing needs. A scalable application continues to function optimally despite changes in data volume or load.

While the two terms are different in concept, serverless computing models are remarkably scalable. Any service with a serverless architecture, such as AWS Lambda, is inherently scalable, because backend infrastructure is managed by AWS, not by independent dev teams.

Benefits of Serverless Architecture

Serverless computing comes with several benefits that lead data engineers to prefer it over server-centric or traditional cloud-based infrastructure.

It offers more flexibility, scalability, and faster turnarounds, at reduced cost.

Benefits of serverless computing architecture:

  • No need to procure, manage, and maintain servers: While serverless computing still runs on servers, data engineers have nothing to do with them. You do not need to purchase or maintain any backend servers; the vendor does all that. As a result, DevOps teams can reduce expenses and free up time and resources.

  • Scalability features: Unlike traditional computing architectures, serverless architectures can automatically respond to size or volume variations of applications. For example, a data warehouse hosted on a serverless architecture will be able to handle any unexpected increase in data.

  • You only pay for what you use: In serverless computing, you only pay for the compute you use. Your code runs only when your backend functions are invoked. Consequently, DevOps teams enjoy reduced costs.

  • Decreased latency: Another advantage of this computing architecture is its decreased latency. You can run your code on servers closer to you from wherever you are. Thus, your requests do not have to travel to the origin server.

  • Faster updates and deployments: You do not need time to configure any backend infrastructure, and you do not have to upload the entire codebase to the servers to ship a change. You can deploy small, independent bits of code, making it easy to fix, patch, update, or add features to your application.

What is Fivetran?

Fivetran is a reliable, automated, production-grade ELT solution. It is popular in data warehousing for moving data into, out of, or across cloud data platforms such as Google Cloud Platform, IBM Bluemix, and Microsoft Azure.

The cloud-based data integration platform automates the most time-consuming and repetitive parts of the ELT process. It helps to streamline data pipelines, increase accuracy and reduce effort.

Fivetran is a popular ETL tool for loading data into Google BigQuery, Snowflake, PostgreSQL, or SQL Server.

Fivetran Downsides

But while everything sounds rosy, there are some downsides. For instance, Fivetran does not maintain custom connectors for developers.

Instead, they provide you with a platform to build your custom integrations. This involves writing code, which can be complex to maintain.

Developers typically pair Fivetran with other cloud-based infrastructure. This may include creating an account with AWS Lambda or Google Cloud Functions.

How does Fivetran integrate with AWS Lambda?

Fivetran can integrate with AWS Lambda to allow you to set up custom functions that can respond to Lambda triggers in your Fivetran data pipeline. This integration enables you to perform custom validations, transformations, and manipulations on data before you load it into your data warehouse.

You can write the Lambda functions in a language of your choice, such as Java, Node.js, or Python. You can customize these functions to respond to triggers, such as a new item landing in a source table.
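To make the contract concrete, here is a minimal Python sketch in the style of Fivetran's function connectors. At a high level, Fivetran invokes the function with the cursor it saved on the previous sync ("state") and expects rows to upsert plus a new cursor back; the table and field names here are hypothetical, so check Fivetran's AWS Lambda connector documentation for the exact request and response shape.

```python
def lambda_handler(request, context):
    """Sketch of a Fivetran function-connector response."""
    # Resume from the cursor Fivetran saved after the last sync
    cursor = (request or {}).get("state", {}).get("cursor", 0)
    # Hypothetical source: in practice you would call an external API here
    rows = [{"id": cursor + 1, "name": "example"}]
    return {
        "state": {"cursor": cursor + len(rows)},   # saved for the next sync
        "insert": {"my_table": rows},              # rows to upsert, keyed by table
        "schema": {"my_table": {"primary_key": ["id"]}},
        "hasMore": False,                          # True asks Fivetran to call again
    }
```

This is also where the custom validations and transformations mentioned above would live, applied to `rows` before they are returned.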

Does Fivetran run on AWS?

Yes. Fivetran runs on AWS.

It uses Amazon Web Services as its infrastructure vendor. It uses an array of AWS services, such as Lambda, EC2, and S3, to build a scalable and secure data integration platform.

Moreover, you can integrate it with Snowflake and Redshift to load and store data in the data warehouses.

Using AWS Lambda to Create a Custom Data Source

You can create a custom data source in AWS Lambda with the short setup guide below:

  1. Create a Lambda function: The first step is to write code in any language with a supported AWS Lambda runtime, such as Python, Java, or Node.js. The primary job of the code is to fetch data from your external data source and return it in a format Amazon QuickSight can consume.

  2. Create an S3 bucket: You need an Amazon S3 bucket to store your data. So, create one.

  3. Create an IAM role: You should not use your primary AWS account credentials for creating custom data sources. Instead, create an Identity and Access Management (IAM) role and configure it with sufficient permissions. Ensure the role can read data from the Amazon S3 bucket and invoke Lambda functions.

  4. Create and configure an Amazon QuickSight dataset: This step requires you to use the Lambda function and the IAM role you created in the previous steps. You will use both as the data source for the Amazon QuickSight data set.

  5. Create a visual: Finally, create visualizations once your data is successfully loaded by leveraging the Lambda connector. You can try out the integration with sample functions.

That's it! You have successfully created a custom data source for your workflow using Amazon QuickSight and AWS Lambda.
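Steps 1 and 2 above can be sketched in Python. The row data and bucket name are hypothetical, and the boto3 upload is left as a comment because it needs real AWS credentials; the CSV serialization is the part QuickSight-friendly output actually depends on.

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize fetched records to CSV, a format QuickSight can read from S3."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def handler(event, context):
    # Hypothetical fetch from an external data source
    rows = [{"id": 1, "region": "us-east-1"}, {"id": 2, "region": "eu-west-1"}]
    body = rows_to_csv(rows, ["id", "region"])
    # In the real function, upload to the bucket created in step 2, e.g.:
    # boto3.client("s3").put_object(Bucket="my-quicksight-bucket",
    #                               Key="exports/data.csv", Body=body)
    return {"bytes_written": len(body)}
```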

Serverless ETL with Fivetran

Fivetran enables serverless Extract, Transform, and Load (ETL) processing. You can use it to automate data ingestion from databases such as MongoDB or DynamoDB and SaaS applications.

To achieve this, you begin by connecting your data sources using Fivetran's pre-built connectors. You can then configure your data pipelines, schedule a sync for your data, and monitor data quality.

You can use the setup to analyze your data in the data warehouse or data lake and perform advanced business intelligence tasks as you deem appropriate.

Building your own Custom Connector with Fivetran

Fivetran allows you to build custom Fivetran connectors to integrate data from sources not natively supported by the platform.

This short tutorial will help you build your custom connector with Fivetran:

  1. Define the data structure: Define the data structure of the source you want to connect to. You will also need to map it to the desired schema in your data warehouse or data lake.

  2. Write the connector code: Write the code to extract data from the source and transform it into the desired format. You can use any programming language that works with the source API, such as Python or Java. You may also want to host your code in a version control system such as GitHub.

  3. Package the connector: Package the connector code and dependencies into a deployable artifact. For example, you can package it as a Docker image.

  4. Deploy the connector: Deploy the connector to a hosting environment accessible by Fivetran. You can use a cloud-based environment, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or an on-premise environment.

  5. Configure the connector: Configure the connector in Fivetran and test it to ensure it works correctly.

  6. Integrate the connector: Integrate the connector into your data pipeline and schedule data sync to run at the desired frequency.

Building a custom connector with Fivetran requires technical expertise and a good understanding of the source API and data structure. However, it allows you to integrate data from any source into your data warehouse or data lake.
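As a sketch of steps 1 and 2, the following Python shows one extract-and-map pass over a hypothetical paginated source. `fetch_page` stands in for whatever HTTP client your source API requires, and the field mapping mirrors the schema definition from step 1; the returned state lets the next scheduled sync resume where this one stopped.

```python
def extract_page(fetch_page, state):
    """One sync step for a hypothetical paginated source API.

    `fetch_page(offset)` must return (rows, has_more); it is a placeholder
    for the source-specific HTTP call.
    """
    offset = state.get("offset", 0)
    rows, has_more = fetch_page(offset)
    # Map source fields onto the warehouse schema defined in step 1
    mapped = [{"id": r["id"], "name": r.get("name", "")} for r in rows]
    new_state = {"offset": offset + len(rows)}
    return mapped, new_state, has_more
```

Keeping the pagination cursor in an explicit `state` dict is what makes the connector safe to package (step 3) and re-run on a schedule (step 6) without re-extracting everything.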

Is Fivetran the Best Choice?

Even after the technical code-writing process to build custom Fivetran connectors, Fivetran will only cover some of your needs. Its connector catalog is missing many data sources; in particular, its directory does not include most long-tail connectors.

Sources that require unstandardized connectors, custom APIs, or unusual authentication schemes (OAuth2 versus API keys, for example) may not work well with Fivetran.

Dedicated ETL Tools for Long-Tail Data Source Connectors

1. Portable

Portable is the best data integration tool for data engineers or teams working with long-tail connectors.

Portable builds long-tail connectors on demand in minutes, hours, or days.

You can compare features, including pricing, in our Portable vs. Fivetran guide.

Key Features

  • More than 300 built-in data connectors

  • Free on-demand connector development

  • Hands-on technical data support

  • Fast turnaround for new customers

Best suited for

Portable is ideal for building long-tail data source connectors. It is an incredible ETL tool for teams looking to connect several data sources into their data pipelines. With Portable, you do not need to develop and maintain data pipelines; we do it for you.

2. Microsoft

Microsoft is one of the largest platforms in the data processing industry. It features various data services, including Microsoft Flow, Microsoft SQL Server Integration Services (SSIS), and Azure Logic Apps.

Key Features

  • Built-in support for Azure Data Factory and Microsoft SQL Server

  • Easy mapping without complex tutorials

  • Scalable and offers pay-as-you-go service

Best suited for

Microsoft data integration tools are ideal for companies that have deep networks with Azure and other data services from Microsoft.