ETL for Big Data

Ethan
CEO, Portable

Everyone talks about Big Data: the idea that you can collect data points from every system, piece of hardware, or individual to create curated, personalized experiences.

If you're trying to process massive amounts of data, how should you extract, transform, and load (ETL) that data? Read on to find out.

Should your approach to ETL be different for big data vs. small data?

Syncing data is much simpler at small scale. If your goal is to move small amounts of data from one place to another, your infrastructure and your data pipelines don't really have to worry about running out of space.

But when you're moving massive amounts of data (the scale of data typically associated with Big Data), you need to make sure that every step in the process is able to handle the volume.

What considerations are specific to ETLing large data sets?

  • Will the hardware you leverage run out of memory?
  • Can you horizontally scale your ETL job (i.e. add more machines or workers if the volume of data increases)?
  • Can you vertically scale your ETL job (i.e. give a single machine more memory or CPU)? And if a specific ETL job will take a long time, can your system withstand errors along the way?
  • Do you replicate all of the data every time you want it synced, or do you incrementally replicate data that has changed since the last sync?
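To make the memory and incremental-replication questions concrete, here is a minimal sketch in Python. It assumes each source row carries an `updated_at` value that grows whenever the row changes; the function names, the row shape, and the in-memory `destination` dict are all hypothetical stand-ins for a real source and warehouse, not any particular tool's API.

```python
from typing import Dict, Iterable, Iterator, List


def read_in_batches(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield fixed-size batches so only one batch is held in memory at a time."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def incremental_sync(source_rows: Iterable[Dict], destination: Dict,
                     cursor: int, batch_size: int = 2) -> int:
    """Copy only rows changed after `cursor`; return the new cursor value."""
    # Assumption: "updated_at" is a hypothetical change marker on every row.
    changed = (row for row in source_rows if row["updated_at"] > cursor)
    new_cursor = cursor
    for batch in read_in_batches(changed, batch_size):
        for row in batch:
            destination[row["id"]] = row  # upsert keyed on the primary key
        new_cursor = max(new_cursor, max(row["updated_at"] for row in batch))
        # A real pipeline might persist the cursor here so a failed job can
        # resume; that is only safe if the source returns rows ordered by
        # "updated_at".
    return new_cursor


rows = [
    {"id": 1, "updated_at": 10, "name": "a"},
    {"id": 2, "updated_at": 20, "name": "b"},
    {"id": 3, "updated_at": 30, "name": "c"},
]
warehouse = {}
cursor = incremental_sync(rows, warehouse, cursor=0)  # first sync copies everything
```

On the next run, passing the returned cursor back in means only rows changed since the last sync are read and written, instead of replicating the full data set again.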

Should you use ETL or ELT to replicate big data sets?

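There are great reasons to use the ETL paradigm, and there are great reasons to use the ELT paradigm for data loading. The biggest difference between ETL and ELT is when data transformation takes place in the data pipeline: with ETL, data is transformed before it is loaded into the destination; with ELT, raw data is loaded first and transformed inside the destination.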

Here's a simple framework:

  1. If the destination for the data can handle large amounts of information, use ELT

  2. If the destination needs specific data points, or a small scale of data, use ETL
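The difference between the two paradigms can be sketched with Python's built-in `sqlite3` module standing in for the destination. This is only an illustration under assumed data: the `RAW_EVENTS` rows, table names, and function names are hypothetical, and a real warehouse would replace the in-memory database.

```python
import sqlite3

# Hypothetical source data: (customer, kind, amount in cents).
RAW_EVENTS = [
    ("alice", "purchase", 1250),
    ("bob", "refund", -300),
    ("alice", "purchase", 700),
]


def run_etl(conn, events):
    """ETL: transform inside the pipeline, then load only the curated result."""
    totals = {}
    for customer, kind, cents in events:
        if kind == "purchase":
            totals[customer] = totals.get(customer, 0) + cents
    conn.execute(
        "CREATE TABLE etl_purchase_totals (customer TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO etl_purchase_totals VALUES (?, ?)", sorted(totals.items())
    )


def run_elt(conn, events):
    """ELT: load the raw rows first, then transform with SQL in the destination."""
    conn.execute("CREATE TABLE raw_events (customer TEXT, kind TEXT, cents INTEGER)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)
    conn.execute(
        """
        CREATE TABLE elt_purchase_totals AS
        SELECT customer, SUM(cents) AS total_cents
        FROM raw_events
        WHERE kind = 'purchase'
        GROUP BY customer
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_EVENTS)
run_elt(conn, RAW_EVENTS)
```

Both paths end with the same purchase totals. The ETL path sends the destination only the small, curated table, while the ELT path also leaves the raw events in the destination, which only makes sense when the destination can store and query that volume.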

What tools should you use for big data ETL?

Most cloud-native ELT solutions are purpose-built to replicate massive volumes of data. We've outlined the top 5 ELT tools in this article to help you think through your options.

At Portable, we specialize in ETL for big data, and we're experts in connecting your long-tail business applications to your data warehouse.


Want to learn more? Book time for a discussion or a demo directly on my calendar.