Using S3 as a Destination with Parquet Files

Denis
Staff Software Engineer

Amazon Simple Storage Service (Amazon S3) is a popular cloud storage service for saving data in Parquet format. Data stored there can be consumed by many data tools, such as Databricks, Apache Spark, Apache Hive, Apache Drill, Presto, AWS Glue, Amazon Redshift Spectrum, Google BigQuery, Microsoft Azure Data Lake Storage, Dremio, and Snowflake. We have introduced a new feature that lets your data team use any of these tools with data stored on S3.

Getting Started with AWS S3 and Parquet

After logging in to the platform, navigate to Destinations.

Portable Destinations Page

Locate the AWS S3 destination and click on the Create button to configure a new destination. This is what it looks like:

A new AWS S3 destination

As you can see, configuration checks flag any missing settings or credential errors. You will need the Bucket name, AWS region, Access key, and Secret key to configure your destination. If you have a specific AWS endpoint or a temporary access token, you can also set these up in this window, but they are not mandatory fields.
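If you want to double-check those credentials before saving, a quick boto3 call against the bucket will confirm they can reach it. This is a minimal sketch; the bucket name, region, and key values below are placeholders for whatever you enter in the destination form.

    import boto3
    from botocore.exceptions import ClientError

    # Placeholder values; substitute the same ones you enter in the destination form.
    s3 = boto3.client(
        "s3",
        region_name="us-east-1",
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    try:
        # head_bucket succeeds only if the bucket exists and the credentials can reach it.
        s3.head_bucket(Bucket="your-destination-bucket")
        print("Bucket is reachable with these credentials.")
    except ClientError as error:
        print(f"Check failed: {error.response['Error']['Code']}")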

In AWS, you need to create an S3 bucket and, using IAM, create a user with privileges to read from and write to that bucket. We recommend giving this user only the minimal access it needs. After creating the user, you can generate an access key and secret key, which you can use in our system. All configuration data is stored securely.
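As a rough illustration of what "minimal access" can look like, the sketch below attaches an inline policy that grants only listing and object-level permissions on a single bucket. The bucket and user names are placeholders, and the exact set of actions your setup needs may differ, so treat this as a starting point rather than a prescribed policy.

    import json
    import boto3

    # Placeholder names; replace with your own bucket and IAM user.
    BUCKET = "your-destination-bucket"
    USER = "portable-s3-writer"

    # Scope the policy to one bucket: list it, and read/write/delete objects in it.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{BUCKET}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
            },
        ],
    }

    iam = boto3.client("iam")
    iam.put_user_policy(
        UserName=USER,
        PolicyName="portable-s3-minimal-access",
        PolicyDocument=json.dumps(policy),
    )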

When you are done configuring your destination, click on the Save button. Your checks should all be green, like this:

Configured S3 destination

Using Your AWS S3 Destination

With the destination properly configured, you are now ready to start your ETL process. In this example, we will show how to set up a flow that extracts data from the National Park Service connector and loads it into AWS S3 in Parquet format. (You can check out all the Portable connectors here.)

First, navigate to Sources and, in the search box, type "national" to narrow down the list of connectors. Click on the National Parks connector.

National Parks Connector

After selecting it, configure it with your API key, and you will be ready to create your flow from National Parks to AWS S3. You can see the configured source here:

Configured Source

With a source and a destination created and configured, we're now ready to create a flow. To do that, navigate to Flows and click on the New button. You will be prompted to select a source and a destination, and after that you can set a scheduled frequency and save and run the flow. See that here:

Creating a flow with the National Parks source
Creating a flow with an AWS S3 destination
Flow ready to be created

After you create the flow, you will land on the flow details page, where you can select a frequency and run the flow. I selected "manual" for the frequency and clicked on the Save and Run button. The flow status and details about the run are then updated on the page. See that here:

Flow just finished

What Is Stored in AWS S3

In the AWS S3 bucket, the ETL process stores the extracted resources as Parquet files. The file names follow this schema:

    <connector name>_<resource>_<flow id>_<timestamp>

Here is an example of the S3 bucket's content after our National Parks flow finished:

Data on S3
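Once the files land in the bucket, any Parquet-aware tool can read them. As a small sketch, the snippet below lists the objects the flow wrote and loads one of them into a pandas DataFrame. The bucket name is a placeholder, the object key should be copied from the listing (it follows the naming schema above), and it assumes your AWS credentials are available in the environment along with the s3fs and pyarrow packages.

    import boto3
    import pandas as pd

    # Placeholder bucket name; use the bucket you configured as the destination.
    BUCKET = "your-destination-bucket"

    # List the Parquet files the flow wrote to the bucket.
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=BUCKET)
    for obj in response.get("Contents", []):
        print(obj["Key"])

    # Pick one key from the listing above (it follows the
    # <connector name>_<resource>_<flow id>_<timestamp> schema) and load it.
    key = "<paste a key from the listing here>"
    df = pd.read_parquet(f"s3://{BUCKET}/{key}")  # requires the s3fs and pyarrow packages
    print(df.head())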

Conclusion

Importing data into Databricks, Apache Spark, Apache Hive, Apache Drill, Presto, AWS Glue, Amazon Redshift Spectrum, Google BigQuery, Microsoft Azure Data Lake Storage, and Dremio is now possible using AWS S3 with Parquet files. With just an S3 bucket, you can use any of the more than 1,400 connectors on our platform.