BigQuery ELT: Best Practices for Extract, Load, Transform

Ethan
CEO, Portable

Google BigQuery Overview

  • Google BigQuery is one of the leading cloud-based data warehousing solutions on the market today.

  • 8,000+ companies across the world use Google BigQuery as their data warehousing tool. It is known for its powerful machine learning capabilities and big data features, making it an apt choice for handling very large data volumes and demanding analytical applications.

  • Google BigQuery is fully managed and provides excellent performance, in-memory caching, columnar storage, and other advanced features. All these features make it one of the best data warehousing solutions for businesses of all sizes.

BigQuery Benefits with ELT

Excellent performance and speed

  • BigQuery uses a distributed, serverless computing model and various replication techniques to process large datasets and run complex queries quickly.

  • BigQuery's query execution engine, called Dremel, is exceptional at scaling up on demand.

  • Dremel can scan trillions of rows in mere seconds.

  • BigQuery also employs advanced systems like the Colossus distributed file system and the Jupiter network connecting Colossus storage to Dremel, all designed for performance optimization and seamless scalability.

Ease of use

  • Google BigQuery lets you run queries using standard SQL.

  • It has an easy-to-use interface. You need not have expensive hardware or extensive coding knowledge to work with BigQuery.

Flexible pricing models

  • The most common pricing model in BigQuery is pay-per-use (on-demand), where you pay only for the resources you actually consume.

  • BigQuery also offers a flat-rate alternative, where you pay a fixed price for a reserved amount of capacity.

  • The flat-rate model is best suited for companies with predictable storage needs and a reliable estimate of the queries and operations they will run on the data warehouse.

Fast real-time analytics

With all the performance-enhancing features BigQuery provides, running powerful analytical applications becomes much easier. You get quick insights from real-time data through sleek dashboard visualizations.

Optimized storage and reduced costs

  • BigQuery reduces redundant data storage. It also separates storage from compute, decoupling where data lives from the queries and operations that run on top of it.

  • This helps reduce operating costs, since you can store data without paying processing costs until you actually query it.

Fine-grained access control

  • BigQuery lets you define access control at a fine granularity, from the field level up to the project level; a minimal example follows.

  • This helps you stay compliant with data standards and regulations such as HIPAA and GDPR.
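One way to manage coarse-grained access programmatically is through dataset access entries in the Python client. This is a hedged sketch with hypothetical project, dataset, and group names; column-level control uses policy tags and is not shown here.

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.mydataset")  # hypothetical dataset

# Grant read access to an analyst group at the dataset level.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="analysts@example.com",  # hypothetical group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```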

Integration with the wider Google Cloud Services

  • Google offers a wide range of productivity tools and internet services, all of which can be easily integrated with your BigQuery data warehouse.

  • Data from Google Sheets, Gmail, Data Studio, and so on can be brought into BigQuery with little extra effort; the sketch below shows one example.
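For instance, a Google Sheet can be exposed as an external BigQuery table and queried in place. This is a minimal sketch using the Python client, assuming a hypothetical sheet URL and table name (the client's credentials also need the Drive scope).

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define an external table backed directly by a Google Sheet.
external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = ["https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID"]
external_config.autodetect = True  # infer the schema from the sheet

table = bigquery.Table("my-project.mydataset.sheet_data")  # hypothetical
table.external_data_configuration = external_config
client.create_table(table)
```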

Concerns when using BigQuery

Having covered BigQuery's strengths, you should also be aware of certain concerns and challenges you might face. Understanding them will help you design better data pipelines and data solutions that work around these issues.

Limited data type support

BigQuery supports ARRAY and STRUCT types, but it has no native MAP type, so map-like data has to be remodeled, typically as an array of key-value structs, to fit your analysis requirements. The sketch below shows one approach.
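This is a minimal sketch of the key-value-struct pattern, with hypothetical dataset and table names, using the Python client to run standard SQL.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Model a "map" as repeated key-value pairs on each row.
client.query("""
    CREATE TABLE IF NOT EXISTS mydataset.events (
        event_id STRING,
        attributes ARRAY<STRUCT<key STRING, value STRING>>
    )
""").result()

# Look up a single map entry by unnesting the key-value pairs.
rows = client.query("""
    SELECT event_id, attr.value AS country
    FROM mydataset.events, UNNEST(attributes) AS attr
    WHERE attr.key = 'country'
""").result()
for row in rows:
    print(row.event_id, row.country)
```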

Dealing with unstructured data

You must factor in additional optimization tasks or transformation steps when dealing with unstructured data in BigQuery. BigQuery is well-optimized for structured and semi-structured data; however, things can get complicated with unstructured data.

Complicated workflow

Beginners might find it difficult to get started with BigQuery workflows, especially if they are unfamiliar with data analysis fundamentals such as basic SQL.

Limited support for modifying or deleting individual rows

BigQuery is an analytical store, not a transactional database. Row-level changes are possible through DML statements (UPDATE, DELETE, MERGE), but each statement rewrites the affected data, so frequent single-row modifications perform poorly; batch your changes into fewer, larger statements instead. A minimal example follows.
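This is a hedged sketch of batched DML via the Python client, with hypothetical table and column names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Update many rows in a single DML statement instead of one statement
# per row; each DML statement rewrites the affected storage.
client.query("""
    UPDATE mydataset.customers
    SET status = 'inactive'
    WHERE last_seen < DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
""").result()
```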

Serial operations

BigQuery is well-optimized for running large, parallel queries, but you may find it less performant when running serial operations.

Daily table update limit

The default limit for updating a table is around 1,000 times per day. If you need a higher number of updates, you will have to request a quota increase.

Building an ELT data pipeline for BigQuery

The basic process of creating an ELT pipeline applies to BigQuery as well. You identify your data sources, set up the connectors, and build the Extract and Load operations which store the data in your BigQuery data warehouse.

  • ELT stands for Extract, Load, and Transform, which is exactly what an ELT data pipeline executes in a sequence.

  • Data is extracted from its source and loaded onto the target, the BigQuery data warehouse in this case.

  • Then, depending on requirements, the loaded data can be transformed to suit the analytical applications and business intelligence (BI) tools that run on top of it.

  • There are many reasons why an ELT data pipeline serves a cloud-based data warehouse better than the traditional ETL method and ETL tools.

  • ELT lets you take full advantage of the cloud-native environment: you avoid costly, time-consuming client-side transformations and simply move or replicate the data into the cloud data store.

  • ELT data pipelines can be easily automated, making them an efficient option for repetitive tasks and scheduled data integration tasks. This helps large organizations maintain an up-to-date and accurate data management system.
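To make the sequence concrete, here is an illustrative end-to-end sketch using the Python client, under hypothetical table names: extract a CSV from a source system, load it raw into a staging table, then transform it with SQL inside the warehouse.

```python
import io

from google.cloud import bigquery

client = bigquery.Client()

# Extract: in practice this comes from an API, database, or file export.
raw_csv = io.BytesIO(b"order_id,amount\n1,9.99\n2,24.50\n")

# Load: land the raw data untouched in a staging table.
client.load_table_from_file(
    raw_csv,
    "my-project.mydataset.raw_orders",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
).result()

# Transform: reshape inside BigQuery, after loading.
client.query("""
    CREATE OR REPLACE TABLE mydataset.orders AS
    SELECT order_id, CAST(amount AS NUMERIC) AS amount
    FROM mydataset.raw_orders
""").result()
```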

In BigQuery, the loading step can be executed in several ways:

Streaming inserts

In this approach, you create your BigQuery datasets and tables, then stream data in using the tabledata.insertAll API (wrapped by the client libraries). The sketch below shows the idea.
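This is a minimal sketch with a hypothetical table and rows; insert_rows_json is the Python client's wrapper around the streaming API.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.mydataset.page_views"  # hypothetical table

rows = [
    {"user_id": "u1", "url": "/home", "ts": "2024-01-01T00:00:00Z"},
    {"user_id": "u2", "url": "/pricing", "ts": "2024-01-01T00:00:05Z"},
]

# Streamed rows become queryable within seconds.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Some rows failed to stream:", errors)
```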

Batch imports

In this approach, you use the BigQuery command-line tool (bq) or a load job to load data from a file. The source file can be stored in the cloud or on a local storage medium; see the sketch below.
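A hedged sketch of a load job from Cloud Storage, with hypothetical bucket, file, and table names. Avro files are self-describing, so no explicit schema is needed.

```python
from google.cloud import bigquery

client = bigquery.Client()

job = client.load_table_from_uri(
    "gs://my-bucket/exports/orders-*.avro",  # hypothetical source files
    "my-project.mydataset.orders",           # hypothetical destination
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
    ),
)
job.result()  # wait for the load job to finish
```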

Exported data

BigQuery also provides several API methods for loading data that has been exported from third-party services and apps.

While most of these loading mechanisms depend on your particular use case and data type, you still need to design and build a proper ELT pipeline to ensure you get valid and accurate data from your sources.

Well-designed ELT pipelines load data seamlessly into your BigQuery data warehouse. You must identify the data sources, use the proper data connectors, and ensure the loading process is efficient and automated.

The loaded data can then be transformed using SQL queries or data operations as required.

The common stages involved in a BigQuery ELT pipeline are listed below:

  • Uploading the data from local storage, file systems, or any other medium

  • Loading data into GCP (Google Cloud Platform) services

  • Loading data into BigQuery

  • Data transformation carried out via SQL queries, procedures, or methods

Data transformation with BigQuery can be achieved with two different approaches:

Making use of Data Transfer Service

This approach uses GCP native services to load data into BigQuery, and then transformation tasks are carried out by SQL.

Making use of GCS

In this approach, data is loaded into a GCS (Google Cloud Storage) bucket with the help of utilities like DistCp, Sqoop, Spark jobs, gsutil, and the like; see the sketch below. SQL then carries out the transformation, just as in the first approach.
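A minimal sketch of staging a local extract in a GCS bucket programmatically, the equivalent of a gsutil cp; the bucket, object path, and local file are hypothetical.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-staging-bucket")  # hypothetical bucket

# Upload the local extract; a BigQuery load job can then read it from
# gs://my-staging-bucket/staging/orders.avro.
blob = bucket.blob("staging/orders.avro")
blob.upload_from_filename("/tmp/orders.avro")
```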

Best practices for ELT with BigQuery

In an ELT pipeline you don't have to transform data before it lands, so you can load it directly into BigQuery using the Data Transfer Service. GCS is often used as a staging area, which also makes it easy to transform data before loading when an ETL-style step is needed.

Try to compress your data before loading to improve data transfer speed. Here are some pointers to help with data compression:

  • Use Avro binary format whenever possible, as it is the most efficient format for loading data.

  • Other efficient formats include Parquet and ORC.

  • If your data is already in CSV or JSON files, you can load them directly, as these formats load faster when they are uncompressed.

  • Make use of streaming inserts to load data without delay. You can use SDKs and services like Dataflow to perform streaming inserts.

Here are some factors to consider:

  1. Once data is streamed into a table, it becomes available for querying within a few seconds.

  2. However, it can take up to about 90 minutes before that data is available for copy and export operations.

  3. When streaming to a partitioned table, certain fields (such as the partition-time pseudo-column) can be NULL while the data sits in the streaming buffer. Take care to account for these values.

  4. Similarly, a partitioned table's DATE and TIMESTAMP fields must fall within an acceptable range; data outside this range is rejected during streaming.

  • Data scientists should use bulk updates wisely. As mentioned earlier, BigQuery has an upper limit on the number of update operations per table per day, with the default set to around 1,000; if you need more, you must request a quota increase. Also remember that each of these updates can affect a large number of rows without degrading performance.

  • Transforming the loaded data is easily done with SQL. Common patterns are INSERT INTO ... SELECT and CREATE TABLE ... AS SELECT; see the sketch after this list.

  • Use the default encryption and access control mechanisms provided by Google BigQuery.

  • BigQuery supports direct querying of data in certain external locations, called federated data sources or tables: Cloud Bigtable, GCS, and Google Drive. While the performance and consistency may not match queries on native BigQuery storage, this provides a convenient way to integrate data from the wider Google Cloud services.

  • Finally, you should understand when it is a good time to use ELT with BigQuery. While it reduces loading times and helps optimize real-time data analysis, it cannot be applied to every use case: you might have to design complex transformations and handle the quality and security issues that come with storing raw data directly in the BigQuery data warehouse. Use the right SaaS tools to help manage data quality, governance, and security concerns.
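As referenced above, here is a minimal sketch of the INSERT INTO ... SELECT transformation pattern, assuming hypothetical tables whose schemas already match:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Append cleaned rows to a curated table, transforming as we select.
client.query("""
    INSERT INTO mydataset.clean_events (event_id, event_ts)
    SELECT event_id, TIMESTAMP(ts) AS event_ts
    FROM mydataset.raw_events
    WHERE ts IS NOT NULL
""").result()
```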

ELT Tools for BigQuery

  • ELT tools are essential to set up efficient ELT data pipelines. Without these tools, setting up the various data connectors, staging, loading, and transformation tasks can be quite cumbersome.

  • These tools can greatly reduce time spent on coding and help your data engineers focus on building efficient data management systems that deliver value quickly.

  • The basic purpose of an ELT tool is to replicate data from multiple sources into your centralized repository, which could be a data lake or a cloud data warehouse.

  • They also help you automate the data pipelines, minimize errors, and ensure your data quality is maintained throughout the process.

  • ELT tools also offer the advantage of ensuring your data complies with data standards and regulations.

  • They can also take care of your security concerns and provide complete data tracking at every stage of the pipeline.