Learn the difference between ETL and ELT, and which is best suited for you.
ETL and ELT are two ways to get data from one or more sources into a destination storage system. The goal of both of these processes is to ingest raw data, and output data that can be used for analysis. ETL stands for extract, transform, load, whereas ELT stands for extract, load, transform. The primary difference between ELT and ETL is the order in which the transform and load steps take place.
Let's start with some definitions.
- Extract: The extract step refers to pulling data from one or more sources.
- Transform: The transform step refers to changing the information to a desired format or performing operations on the data.
- Load: The load step refers to saving the data to a destination storage system.
ETL and ELT apply the same steps in different orders, which results in each method having strengths and weaknesses.
ETL (Extract, Transform, Load)
ETL is a process that has been around for decades. The process kicks off by extracting raw data from the source into a staging area. In the staging area, transformations are performed on the raw data, and the transformed data is loaded into the storage system.
Some examples of transformations that would be done before loading are: removing/adding columns, changing types, replacing values, removing outliers or null values, merging columns, etc.
There are a few benefits of transforming data before it reaches its destination.
- By performing the transformation step before loading the data, the destination storage system (usually a data warehouse) will only receive data in a pre-specified structure. Since all the data in the storage system fits the specified schema, it will be much easier to work with later on.
- Sensitive data can be cleaned and secured before the data is stored, which allows for better privacy and compliance.
- ETL reduces resource bloating by allowing users to decide what data will be persisted, ensuring that unnecessary raw data will not be stored.
Historically, ETL was complex - having data engineers code the process, performing transformations locally, updating transformation steps to handle different sources, and dealing with conflicts required a lot of time and money. Modern cloud solutions help address some of these issues, however, ETL still has some pitfalls.
- The main issue with ETL is that if a transformation step fails, the data will never reach its destination. Modern tools are trying to solve this problem with smart conflict resolution and alerts, but there is still lots of work to be done. This point is one of the reasons why some are switching over to ELT, as the data is loaded immediately.
- Although ETL transforms data before it reaches the storage system, additional transformations might need to be applied after loading. This leads some people to argue that it is better to perform all the transformations after loading, as is the case with ELT.
ELT (Extract, Load, Transform)
ELT loads the data directly from the source to the storage system, and transformations are applied within the storage system (often a data lake or data warehouse). The storage system must be capable of storing structured and unstructured data, as there is no cleaning step before data is loaded.
ELT has gained a lot of popularity with cloud data storage becoming so widely available and easily scalable. When working with huge amounts of data that needs to be loaded quickly, ELT is extremely powerful, however, it does come at a higher cost and complexity.
ELT is growing because it has some key advantages.
- All source data is ingested almost immediately since data is extracted from the source and loaded directly into the storage system. Users can load the data immediately and figure out what to do with it later on.
- By performing transformations on data that is already persisted, unnecessary transformations can be avoided. Aside from saving on computational usage, transforming the data at the end of the process is more flexible, as the transform can be tailored to the specific use case. If you notice an error in your transformation pipeline, you can easily fix it and re-run the transformations to get the right output.
- Transformations can be written in SQL, which allows both engineers and analysts to work with the data to get their desired output.
Although more people are using ELT than ever, there are still some downsides.
Because ELT is so flexible, users tend to load a ton of data from different sources, which leads to more tables that can easily become difficult to manage. Over time, the storage system can get bloated, leading to higher storage costs. To avoid this, tools or processes are needed to clean unused data.
Since the storage system is full of raw data, some additional cleaning steps will need to be done before data analysis as well. ETL does these transformations before loading the data instead of after.
Now that we have an understanding of the differences between ETL and ELT, let's talk about what situations they each excel in.
- Data scrubbing: If you are dealing with sensitive data, ETL allows you to add additional security measures before your data reaches the storage system.
- Lots of raw data that won't be needed later on: If you know you will not need some of your data, ETL allows you to remove it immediately so that you can reduce your storage costs.
- Consistent source data: If your source data always comes in the same formats, you can set the transformations and forget about them. ETL allows you to apply the same transformations every time you receive new data with no additional effort.
- Huge amounts of data, both structured and unstructured: If you are dealing with a lot of data coming from different sources, ELT is probably for you. Instead of transforming data upon ingestion, you can transform it when you need it.
- Want data loaded as fast as possible: If you need data to be loaded as quickly as possible, ELT is a great solution. Data goes right into your storage system and transformations can be applied later on.
- Heavy data analysis: ELT is great for data analysis. You can load everything in and figure out how to analyze it later on. Also, you can apply all your transformations in one step, unlike with ETL.
Both ETL and ELT can be used to ingest data and output it in the desired schema, however, there are certain situations where one process outperforms the other.
Although the transform step in ETL results in more time for the data to reach its destination, once the data is loaded, it will be much easier to work with. ETL is also simpler and enables users to add additional security measures (e.g. encryption) before persisting sensitive data.
For situations where loading data quickly is important, or you are dealing with huge amounts of data, ELT might be preferable. Users also have more freedom with ELT, as they can load all their data and decide how to transform it later on.