Using Azure Blob as a Destination with Parquet Files

Denis
Staff Software Engineer

Azure Blob Storage is a widely used cloud storage solution offering scalability, security, and cost-effectiveness. In this guide, we'll walk you through how to configure Azure Blob Storage as a destination in Portable to store your data in Parquet format, so you can work efficiently with structured data and easily integrate it into your analytics workflows.

Why Azure Blob Storage?

Azure Blob Storage provides a reliable, scalable, and highly available service for storing large volumes of unstructured data. Whether you're looking to archive data, manage backups, or store logs, Azure Blob Storage is a versatile solution. Portable allows you to leverage this storage while converting and loading your data in Parquet format, making it optimal for analytics and querying.

Benefits of using Parquet

Parquet is a columnar storage format optimized for performance and efficiency when handling large datasets. By storing your data in Parquet format, you can:

  • Reduce storage costs: Parquet compresses data efficiently, shrinking the size of your files.
  • Speed up queries: Parquet lets engines read only the columns they need instead of scanning the entire dataset (see the example below).
  • Stay compatible: Parquet is supported by popular analytics engines such as Azure Data Lake Analytics, Databricks, and Synapse.
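For example, with pandas (assuming pyarrow or another Parquet engine is installed; the file and column names below are placeholders), you can read just the columns you need:

    import pandas as pd

    # Read only the columns required for the analysis instead of the full dataset.
    # "orders.parquet" and the column names are placeholder values.
    df = pd.read_parquet("orders.parquet", columns=["order_id", "total_amount"])
    print(df.head())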

Prerequisites

Before setting up Azure Blob Storage as a destination in Portable, ensure the following:

  • You have an active Azure account.
  • You have created a Blob Storage container.
  • You have your storage account access key on hand.

Step 1: Create Your Azure Blob Storage Account and Container

If you haven’t done so already, set up your Azure Blob Storage account:

  1. Sign in to the Azure Portal.
  2. Navigate to Storage Accounts and click Create.
  3. Configure the storage account basics (subscription, resource group, account name, region, and redundancy), then review and create it.
  4. Once the account is created, locate it in the portal and create a new container for storing your Parquet files (a scripted alternative is sketched below).
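If you prefer to script the container setup, here is a minimal sketch using the azure-storage-blob Python SDK. The account name, access key, and container name are placeholders; replace them with your own values.

    from azure.storage.blob import BlobServiceClient
    from azure.core.exceptions import ResourceExistsError

    # Placeholder values -- substitute your own storage account name, access key, and container.
    account_name = "mystorageaccount"
    account_key = "<your-access-key>"
    container_name = "parquet-exports"

    service = BlobServiceClient(
        account_url=f"https://{account_name}.blob.core.windows.net",
        credential=account_key,
    )

    # Create the container that will hold the Parquet files; skip if it already exists.
    try:
        service.create_container(container_name)
    except ResourceExistsError:
        pass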

Retrieve your Access Key

  1. Sign in to the Azure Portal.
  2. Navigate to Storage Accounts, open your storage account, and expand Security + networking.
  3. Click on Access Keys.
  4. Note your account name, container name, and access key; you will need them when configuring your destination in Portable.
Azure Security
Access Keys
Container

Step 2: Configure Azure Blob Storage as a Destination in Portable

  1. Log in to your Portable dashboard.
  2. Navigate to the Destinations tab and click Add Destination.
  3. Select Azure Blob Storage from the list of available destinations.
  4. Enter your Account Name, Account Key, Container Name, and Upload Path from your Azure Blob Storage setup.
  5. Upload Path is optional: if you specify one, any folders in the path that do not already exist will be created automatically so the upload succeeds.
Destination Config

You should now see your configured destination in the destinations list.

Destinations

Step 3: Create Your Workflow

With your Destination and your Source configured, you’re ready to start the data pipeline:

  1. Go to Flows in Portable and create a new Flow, selecting your Azure Blob destination and your configured Source.
  2. On the flow detail page you can choose how to run your Flow: manually, on a set frequency, or on a cron schedule (see the scheduling note below). Select your preferred option, then save and run the flow.
  3. Monitor the flow in Portable's recent runs table to check the status of the data transfer.
Azure flow
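If you choose the cron option, the schedule is expressed as a cron-style string; assuming Portable accepts standard five-field syntax, an expression such as 0 6 * * * would run the flow every day at 06:00 (check the flow settings for the exact format and timezone it uses).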

Step 4: Verify Your Data in Azure Blob Storage

Once the workflow is complete, navigate to your Azure Blob Storage container:

  1. In the Azure Portal, navigate to your storage account.
  2. Go to Containers, select your container, and verify that your Parquet files have been uploaded (you can also check from code, as sketched below).
Azure container content
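If you would rather spot-check from code than from the portal, here is a minimal sketch using the azure-storage-blob SDK and pandas. The account name, access key, and container name are placeholders; replace them with the values from your own setup.

    import io

    import pandas as pd
    from azure.storage.blob import ContainerClient

    # Placeholder values -- substitute your own storage account name, access key, and container.
    container = ContainerClient(
        account_url="https://mystorageaccount.blob.core.windows.net",
        container_name="parquet-exports",
        credential="<your-access-key>",
    )

    # List the Parquet files that were uploaded by the flow.
    parquet_blobs = [b.name for b in container.list_blobs() if b.name.endswith(".parquet")]
    print(parquet_blobs)

    # Download one file and load it into a DataFrame to verify the contents.
    if parquet_blobs:
        data = container.download_blob(parquet_blobs[0]).readall()
        df = pd.read_parquet(io.BytesIO(data))
        print(df.head())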

Conclusion

By setting up Azure Blob Storage as a destination in Portable, you can efficiently export data in Parquet format for optimized storage and analysis. Azure Blob Storage is a scalable and secure solution, while Portable simplifies the process of integrating data across platforms. This configuration allows for seamless data workflows, helping you store, manage, and analyze large datasets with ease.