ETL Batch vs Data Streams
In this article, we are going to look at two different ETL approaches: Batch vs. Data Streams.
ETL (Extract, Transform, Load) is a data transportation process: extracting data from a source system, transforming its values, structure, or format, and loading the data into a target system. In some cases, such as when a data staging area is used, the transformation step is performed last, after the data has been loaded. So it is not a matter of ETL versus Streams, since streaming pipelines perform ETL as part of their processing; it is a matter of Batch versus Real-time streams.
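As a minimal, self-contained sketch of the three steps (the function names and the in-memory data here are purely illustrative, not any specific tool's API):

```python
# A minimal sketch of the three ETL steps.

def extract():
    # Pull raw records from a source system (a CRM, accounting app, etc.).
    return [{"name": "acme corp", "revenue": "1200.50"}]

def transform(records):
    # Conform values, structure, and format to the target schema.
    return [
        {"name": r["name"].title(), "revenue": float(r["revenue"])}
        for r in records
    ]

def load(records, target):
    # Write the conformed records into the target system.
    target.extend(records)

warehouse = []  # stand-in for a data warehouse table
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Acme Corp', 'revenue': 1200.5}]
```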
Batch
In the corporate enterprise, source systems are typically end-user applications such as CRMs (Salesforce, Microsoft Dynamics 365, SugarCRM, Zoho, etc.), accounting software, and supply chain applications. The data is manipulated through a transformation process that maps it to the target system and applies any structural or format changes necessary to conform to the target schema. The data is then loaded into the target system (a Data Lake, Data Warehouse, Data Mart, or analytics application). Batch ETL, which collects and processes the data accumulated over a set period of time, is typically executed one or more times a day.

The main issue with Batch ETL is balancing the amount of data the system must handle against the time constraints imposed by business expectations. It is usually too inefficient to load every source system record into the target system on each run; instead, the ETL process should handle only the incremental changes that have occurred since the last time the process began to run. Incremental changes can be detected by a record's creation date or last modification date, provided the Database Administrator and/or Application Developer included those fields in the design. A sketch of this incremental approach follows below.

The other issue with Batch ETL is fall-behind time. If a Batch ETL process takes 15 minutes to complete and you put it in a continual loop, your target system will potentially be 15 minutes behind the source system whenever the source system is changing. For this reason, Batch ETL processes are typically run during closed or slow business hours. The more historical data pulled, the more stress is placed on the ETL system, which can lead to greater fall-behind times. So, a couple of guiding principles: ETL Administrators should never be forced to repeatedly pull an entire source system's data into the target system on every run, and there should be a constant goal of reducing Batch ETL run times, both to shrink the window in which a structural change in the source system can break the ETL process and to speed up detection of an ETL process failure.
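Here is a rough sketch of incremental batch extraction using a last-modified watermark. The table and column names, the SQLite connections, and the assumption that the target table already exists are all illustrative, not the behavior of any particular ETL tool:

```python
# Incremental batch ETL sketch: pull only rows changed since the
# watermark (the time the previous run began), transform, and load.
import sqlite3
from datetime import datetime, timezone

def run_incremental_batch(source_db, target_db, watermark):
    # Capture the start time first; the next run uses it as the new
    # watermark, so changes made while this run executes are not lost.
    run_started = datetime.now(timezone.utc).isoformat()

    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)

    # Extract: only records modified since the last run began.
    rows = src.execute(
        "SELECT id, name, revenue, last_modified FROM crm_accounts "
        "WHERE last_modified > ?",
        (watermark,),
    ).fetchall()

    # Transform: conform values to the target schema
    # (normalize names, coerce revenue to a number).
    cleaned = [(r[0], r[1].upper(), float(r[2]), r[3]) for r in rows]

    # Load: upsert into the (assumed pre-existing) target table.
    tgt.executemany(
        "INSERT OR REPLACE INTO dw_accounts "
        "(id, name, revenue, last_modified) VALUES (?, ?, ?, ?)",
        cleaned,
    )
    tgt.commit()
    return run_started  # persist this as the next run's watermark
```

Persisting the returned watermark between runs is what keeps each batch small; losing it is what forces the full-reload scenario described above.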
Stream
Now let's look at data streaming, which is the continuous movement of data from the source system to the target system. Data streaming still requires ETL; however, the transformation is performed in flight, during the streaming process. Data streaming is used by target applications that need real-time or near-real-time data, such as system performance monitoring dashboards, stock market applications, and web analytics. So, if data streaming is real-time and eliminates the issues associated with Batch, what is the drawback? You give up accuracy and the ability to validate against the source system: in real time, you cannot validate the accuracy of the data before it is consumed. As a real-world example, a sales rep once entered a telephone number into the anticipated revenue field in Salesforce. A nightly Batch ETL process ran, picked up the change, and loaded it into the Data Warehouse. Fortunately, quality checks run first thing the next morning found the issue before any official sales reports were generated, so auditing the data was critical; in a pure streaming pipeline, that bad value would have reached consumers immediately. Also, there can be a substantial cost increase to use a streaming tool versus a traditional Batch ETL tool.
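As a rough sketch of transform-in-flight streaming (this uses the kafka-python client; the topic name, field names, and the rejection handling are all assumptions for illustration, not a prescribed design):

```python
# Streaming ETL sketch: each record is transformed and validated in
# flight, because there is no later batch window to catch bad data.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "crm-account-changes",            # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value

    # Transform in flight: this check is the only validation the
    # record gets before it reaches real-time consumers.
    try:
        record["anticipated_revenue"] = float(record["anticipated_revenue"])
    except (KeyError, ValueError, TypeError):
        # A phone number typed into a revenue field lands here
        # instead of flowing straight into dashboards.
        print(f"rejected record {record.get('id')}: bad revenue value")
        continue

    # Load: hand the conformed record to the target system
    # (a dashboard feed, warehouse sink, etc.).
    print("loaded", record)
```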
Conclusion
So, when it comes to ETL there are two choices, Batch or Stream, each with its own advantages and disadvantages. Picking the right ETL solution is extremely important, and if you find your organization struggling to choose or implement the correct solution, let Explait’s expertise help you.