The ETL process is an integral part of analytical and reporting efforts in an organization and keeping in shape is an essential part of delivering reliable, timely and quality data. Internet has a plethora of articles on building a good ETL but very few turn an eye on maintaining and optimizing a real-world corporate one. In this text I offer my inflation-adjusted two cents on the topic filling a small part of this void. The subject of optimization is the duration of the process and I would like to share with you some points of on making ETLs run faster based on experience and thoughts on the subject. It is a long read but everyone involved in this topic has the patience and stamina to endure lengthy and boring texts. My text is about ETL, but a lot here applies for the ELT as well.
What does real-world corporate ETL look like? The first and most obvious feature is the high number of steps in the process, easily in the hundreds and thousands. A typical corporate ETL processes data for all activities in the organization as well as data from various external sources. The large data volume is another distinct feature. Last but not least characteristic is the complexity - the process steps execute in parallel, are highly interdependent and feed all sort of targets with variety of data transformation. Complexity and data volume grow over the time with the business development, new and evolving rules, the growing appetite for dat. A factoris also sub-optimal development management and lack of regular audit. Increased complexity and volumes directly translate into longer execution times. Other factor are contextual. For example, an US company expanding to Asia sets new requirements for shorter load duration because of the time difference.
Setting a target
The very first step in the optimization effort is establishing the target duration. Keep in mind the difference between total duration of the load and downtime of the target system, DWH or other. All data consumers should be interviewed to establish their limits - the musts and the nice-to-have. Time point of picking the data is also something to consider. Agreeing on a target duration is essential all aspects of the project - managing expectations, budgeting, measuring success, etc.
Setting up the team
Apart from the data engineers, a successful team should also include a good database administrators and BI experts. The last two may not be involved full-time but their input is indispensable.
Before any optimization started
Along with creating a test environment, prepare scripts for resetting the target. The ETL will be executed many times during the project and you have to make sure it does so under same conditions. For example, not resetting the target would skip insert operations because there will not be any new data to insert after first run. A scrip for automated execution may also come handy because for some tests the ETL may need to be run multiple times. Another thing to consider is the test instance differs from the production one because typically there is no activity in it and the advancements in test are lesser in production. The optimization project also needs a fast, accurate and reliable way to measure execution time of the process and its steps. Most of platforms and software packages have this feature but some manual processes do not have that built-in.
Is your service good enough
This is the lowest hanging fruit. Review the "hardware" and its settings. In the case of cloud environment, analyze the adequacy of compute resources, cores, parallelism, etc. The data/compute services could be outdated and if the organization has a budget for upgrade, the target duration could be easily achieved at this very first step. Testing environments in the cloud is relatively easy and cheap so you can play with few setups to balance parameters and cost. Options for testing are more restricted in the case of on-premises servers. This step also includes analyzing the optimality of the settings of environment and the tools. Every software and service has multiple settings and there is a combination providing the best performance within the system parameters. The software vendors and numerous gurus on the Internet have good advice to follow on this one. However, the best setup may not be clear for your specific case because the setup parameters could have compound effect that is difficult to figure out. This case could be tackled by a grid search across the ranges of parameters values. The automated execution script helps a lot in this scenario as you can analyze the best configuration while doing your tai chi.
General approach - Theory of Constraints
The approach of the optimization follows classical Theory of Constraints. In essence, find the bottleneck, improve it until it is not a bottleneck anymore, then move to the next one. Do this in repeat until you hit the duration target or your options for improvement are depleted. In the case of ETL, bottleneck is the process taking the longest execution time.
The Worst Offenders list
Some bottlenecks are too difficult to improve and return of invested time is not good enough. Instead of jumping to fixing the longest running step, a better option is to start with compiling a list of worst performing steps. Analyze the efforts to improve and weight in weight in overall duration for the top 10 and set your priorities. Setting these helps in making steady progress, keeping motivation high and showing progress to the project owner. The "worst offenders" list also useful in assigning the tasks to the team members.
The steps above set the project and the team could dive in the gory details of each step toward getting to the desired target. Below are some of the common methods for improvement.
Turn off steps loading unnecessary or redundant data
For an ETL that has been around for some time it is very likely some of processed data not to be used any more. A case like this may also come from out-of-box ETL of some pre-packaged solution. These usually do not consider the company specifics resulting in processing unnecessary data. The data lineage, tracking data from source to report/analysis, is the right tool for identifying the redundant steps. Make sure to turn off the step execution rather than deleting it because in case the step is required in the future all you have to do is turn it back on. Analyzing the data lineage is not easy but worth it - you can avoid massive amount of hard work on optimization.
Process only the data you need
It sounds obvious and simple enough but nevertheless, make sure only the necessary data is extracted. For example, a source table may be small at ETL development phase and then grow over the years to cause longer load times. As another example, a step could use some hard-coded values or obsolete parameters to cause extraction of a much greater data volume. A step could also be set for full instead of incremental load. The rule of getting in only the new/updated data is basic for ETL development but it is the overlooking of basics that causes the trouble.
Combine steps sharing same source
It is likely the sub-optimal development process and documentation to introduce steps loading data from same source. Usually there are no great differences between these steps - one loading most of data and the others getting few additional data elements. Find these steps and combine them into one. If it is not not possible for some reason, you can use a temp data structure to pre-load and use in both steps.
The next steps could be defined by one word - simplify.
Simplify
ETL steps involving many tables, views, procedures, and functions along with complex transformations are very likely to have a heavy punch on performance. They offer the additional "benefits" of being hard to read, debug and analyze. Simpler and smaller steps provide faster execution, better readability, optimal use of resources and easier performance analysis. One way to simplify is to split the step into smaller ones. Another is to replace sub-selects and views with temp tables loaded before the actual step. Use of functions is also an option to explore.
Replace views and sub-selects as source with temp tables
The candidates are the complex steps involving many sources and/or producing large data volume. Apart from the benefits of simplification mentioned above, temp table allow indexing that could further speeding up the processes. The temp tables are loaded before the step using the data. Explore the option to do that in parallel with some other task, saving even more time.
Optimize indexes and their usage
This goes in few different ways. One is making sure the tables are properly indexed and two, proper indexes are used. Review indexes of all tables to remove redundant and create new ones if necessary. The impact of that is during use of tables but also in rebuild/refresh of indexes. The ETL code has to be reviewed for using hints because hinting an obsolete or sub-optimal index hurts the performance much more than not using index at all. If a hint is used, try the code without it - often the database engines do very good optimization job and improvements are immediate. Another check point is if indexes are deactivated during the ETL operations. Some take the approach to completely drop the indexes before record operations and re-create after. Disable/drop indexes is a fast and easy way to save execution time but balance that with time lost on refresh/rebuild.
Parallel vs consequential execution
Executing steps in parallel is a great way for reducing the execution time. However, handle with caution. A proper setup should take into account the parallelism capabilities and settings of the database/engine and the ETL tool as well as the required resources. Make sure processes running in parallel do not have same target or source table.
Optimize the target tables size
Target tables could be enormous. Sometimes it is for a reason, sometimes it is because their size was neglected over the years. Why would you keep data for 12 years back when all ever needed is 5 years back? Reducing the size of target tables will benefit in three ways - faster data insert/update, faster index rebuild/refresh and faster querying. One option is moving unused data to a historical table. The historical data has to be available to BI and other analytical tool and this requires additional development work. Another way to deal with the table size is to use partitioning.
Optimize custom code
Custom code is not unheard-of in data processing and is very likely to be a weak link in the process. Review this code for opportunities for improvement, usually there are plenty. For example, SQL UPDATE could be replaced by a MERGE operation. Also, consider removing the custom code and including its operations in a standard
ETL steps.
Take a step back and use common sense
The common sense is a lost art. We tend to rush into Googling and get drawn in the millions of pages with advice, magic command-lines and hacks. The answers sometimes are just before our eyes. This applies to optimization of ETL. Always keep in mind project circumstances, general business context and other specifics. Let me illustrate with an example from a project I did some time ago. It was back in times when BI access had to be shut down during refreshinig data from the sources. After weeks of efforts, the total execution time was significantly lowered and there were no obvious candidates for further optimization I was out of moves. However, at some point I realized there was no reason to kick out users before the initial staging phase (loading data from source into staging tables). A simple move of the shut-down step after staging phase saved a third of total downtime. It is not a process optimization step per se, but it served the greater goal of reducing total BI downtime. The temptation to dive in improving steps has to be balanced by a sober look at the project as a whole.
Instead of Conclusion
A slow ETL frustrates the business users and damages the image of the IT department. The costs of that and the cost of an optimization project can be avoided by introducing better practices in ETL maintanance and development. These practices include monitoring for steps duration trends and promptly acting on undesireable ones, keeping proper documentation, better development and change management, regular audits, avoiding custom code. ETL optimization is a challenge but now you are equipped with some ideas and a plan how to tackle it.
Apr 5, 2023
Optimizing an ETL
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment