Apr 5, 2023

Optimizing an ETL

The ETL process is an integral part of analytical and reporting efforts in an organization, and keeping it in shape is essential for delivering reliable, timely and quality data. The Internet has a plethora of articles on building a good ETL but very few turn an eye on maintaining and optimizing a real-world corporate one. In this text I offer my inflation-adjusted two cents on the topic, filling a small part of this void. The subject of optimization is the duration of the process, and I would like to share some points on making ETLs run faster, based on experience and thoughts on the subject. It is a long read, but everyone involved in this topic has the patience and stamina to endure lengthy and boring texts. My text is about ETL, but a lot here applies to ELT as well.

What does a real-world corporate ETL look like? The first and most obvious feature is the high number of steps in the process, easily in the hundreds and thousands. A typical corporate ETL processes data for all activities in the organization as well as data from various external sources. The large data volume is another distinct feature. The last but not least characteristic is the complexity - the process steps execute in parallel, are highly interdependent and feed all sorts of targets with a variety of data transformations. Complexity and data volume grow over time with the business development, new and evolving rules and the growing appetite for data. Sub-optimal development management and lack of regular audits are also factors. Increased complexity and volumes directly translate into longer execution times. Other factors are contextual. For example, a US company expanding to Asia faces new requirements for shorter load duration because of the time difference.

Setting a target
The very first step in the optimization effort is establishing the target duration. Keep in mind the difference between the total duration of the load and the downtime of the target system, DWH or other. All data consumers should be interviewed to establish their limits - the musts and the nice-to-haves. The point in time at which the data is picked up is also something to consider. Agreeing on a target duration is essential for all aspects of the project - managing expectations, budgeting, measuring success, etc.

Setting up the team
Apart from the data engineers, a successful team should also include good database administrators and BI experts. The last two may not be involved full-time but their input is indispensable.

Before any optimization starts
Along with creating a test environment, prepare scripts for resetting the target. The ETL will be executed many times during the project and you have to make sure it does so under the same conditions. For example, not resetting the target would skip insert operations because there will not be any new data to insert after the first run. A script for automated execution may also come in handy because for some tests the ETL may need to be run multiple times. Another thing to consider is that the test instance differs from the production one - typically there is no user activity in it, so the gains measured in test may be smaller in production. The optimization project also needs a fast, accurate and reliable way to measure the execution time of the process and its steps. Most platforms and software packages have this feature but some manual processes do not have it built-in.
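For the manual cases, a minimal timing sketch along these lines may help, assuming a SQL Server-style dialect; the log table etl_step_log and the step load_customers are hypothetical names used only for illustration.

```sql
-- Hypothetical log table for step timings.
CREATE TABLE etl_step_log (
    step_name   VARCHAR(200),
    started_at  DATETIME2,
    finished_at DATETIME2
);

-- Wrap each manual step with a start/finish record.
DECLARE @start DATETIME2 = SYSDATETIME();

EXEC load_customers;   -- the ETL step being measured (hypothetical)

INSERT INTO etl_step_log (step_name, started_at, finished_at)
VALUES ('load_customers', @start, SYSDATETIME());
```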

Is your service good enough
This is the lowest hanging fruit. Review the "hardware" and its settings. In the case of a cloud environment, analyze the adequacy of compute resources, cores, parallelism, etc. The data/compute services could be outdated and, if the organization has a budget for an upgrade, the target duration could be easily achieved at this very first step. Testing environments in the cloud is relatively easy and cheap, so you can play with a few setups to balance parameters and cost. Options for testing are more restricted in the case of on-premises servers. This step also includes analyzing whether the settings of the environment and the tools are optimal. Every software and service has multiple settings and there is a combination providing the best performance within the system parameters. The software vendors and numerous gurus on the Internet have good advice to follow on this one. However, the best setup may not be clear for your specific case because the parameters could have a compound effect that is difficult to figure out. This case could be tackled by a grid search across the ranges of parameter values. The automated execution script helps a lot in this scenario, as you can analyze the best configuration while doing your tai chi.

General approach - Theory of Constraints
The approach of the optimization follows the classical Theory of Constraints. In essence, find the bottleneck, improve it until it is not a bottleneck anymore, then move to the next one. Repeat until you hit the duration target or your options for improvement are depleted. In the case of an ETL, the bottleneck is the step taking the longest execution time.

The Worst Offenders list
Some bottlenecks are too difficult to improve and the return on invested time is not good enough. Instead of jumping into fixing the longest running step, a better option is to start by compiling a list of the worst performing steps. Analyze the effort to improve and the weight in overall duration for the top 10 and set your priorities. Setting these priorities helps in making steady progress, keeping motivation high and showing progress to the project owner. The "worst offenders" list is also useful in assigning tasks to the team members.
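If step timings are collected in a log table like the sketch earlier, the worst offenders list can be pulled straight from it. This is only an illustration - etl_step_log is the hypothetical table from that example.

```sql
-- Top 10 steps by average duration, with the worst single run for context.
SELECT TOP 10
    step_name,
    AVG(DATEDIFF(SECOND, started_at, finished_at)) AS avg_duration_sec,
    MAX(DATEDIFF(SECOND, started_at, finished_at)) AS max_duration_sec
FROM etl_step_log
GROUP BY step_name
ORDER BY avg_duration_sec DESC;
```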

The steps above set up the project, and the team can dive into the gory details of each step on the way to the desired target. Below are some of the common methods for improvement.

Turn off steps loading unnecessary or redundant data
For an ETL that has been around for some time, it is very likely that some of the processed data is not used any more. A case like this may also come from the out-of-the-box ETL of some pre-packaged solution. These usually do not consider the company specifics, resulting in processing unnecessary data. Data lineage, tracking data from source to report/analysis, is the right tool for identifying the redundant steps. Make sure to turn off the step execution rather than deleting it, because in case the step is required in the future all you have to do is turn it back on. Analyzing the data lineage is not easy but worth it - you can avoid a massive amount of hard work on optimization.

Process only the data you need
It sounds obvious and simple enough but nevertheless, make sure only the necessary data is extracted. For example, a source table may be small at the ETL development phase and then grow over the years to cause longer load times. As another example, a step could use hard-coded values or obsolete parameters that cause extraction of a much greater data volume than needed. A step could also be set for a full instead of an incremental load. The rule of getting in only the new/updated data is basic for ETL development, but it is the overlooking of basics that causes the trouble.
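As an illustration of the incremental rule, here is a minimal watermark-based extraction sketch, assuming the source table carries a reliable modified_at column; all table and column names are hypothetical.

```sql
-- Load only rows added or changed since the last run.
DECLARE @last_loaded DATETIME2;

SELECT @last_loaded = COALESCE(MAX(modified_at), '1900-01-01')
FROM dwh.orders;          -- high-water mark taken from the target

INSERT INTO staging.orders (order_id, customer_id, amount, modified_at)
SELECT order_id, customer_id, amount, modified_at
FROM source.orders
WHERE modified_at > @last_loaded;   -- only new/updated rows, not a full load
```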

Combine steps sharing same source
A sub-optimal development process and documentation are likely to introduce steps loading data from the same source. Usually there are no great differences between these steps - one loads most of the data and the others get a few additional data elements. Find these steps and combine them into one. If that is not possible for some reason, you can pre-load a temp data structure and use it in both steps.
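A sketch of the temp-structure variant, assuming a SQL Server-style dialect; the source and target names are made up for the example.

```sql
-- Extract from the shared source once...
SELECT order_id, customer_id, amount, region
INTO #orders_src
FROM source.orders;

-- ...then feed both former steps from the temp table.
INSERT INTO dwh.fact_orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM #orders_src;

INSERT INTO dwh.sales_by_region (region, amount)
SELECT region, SUM(amount)
FROM #orders_src
GROUP BY region;
```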

The next steps could be defined by one word - simplify.

Simplify
ETL steps involving many tables, views, procedures and functions along with complex transformations are very likely to take a heavy toll on performance. They offer the additional "benefits" of being hard to read, debug and analyze. Simpler and smaller steps provide faster execution, better readability, optimal use of resources and easier performance analysis. One way to simplify is to split the step into smaller ones. Another is to replace sub-selects and views with temp tables loaded before the actual step. The use of functions is also an option to explore.

Replace views and sub-selects as source with temp tables
The candidates are the complex steps involving many sources and/or producing a large data volume. Apart from the benefits of simplification mentioned above, temp tables allow indexing, which could further speed up the process. The temp tables are loaded before the step that uses the data. Explore the option to do that in parallel with some other task, saving even more time.
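A hedged sketch of the idea, assuming SQL Server-style temp tables; the simple join and the names only stand in for a real complex view or sub-select.

```sql
-- Materialize the complex source once, before the step that uses it.
SELECT o.order_id, o.amount, c.customer_id, c.segment
INTO #enriched_orders
FROM source.orders o
JOIN source.customers c ON c.customer_id = o.customer_id;

-- Unlike a view or sub-select, the temp table can be indexed.
CREATE INDEX ix_enriched_orders_customer ON #enriched_orders (customer_id);

-- The actual step now reads from a pre-loaded, indexed structure.
INSERT INTO dwh.fact_orders (order_id, customer_id, segment, amount)
SELECT order_id, customer_id, segment, amount
FROM #enriched_orders;
```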

Optimize indexes and their usage
This goes a few different ways. One is making sure the tables are properly indexed, and two, that the proper indexes are used. Review the indexes of all tables, remove redundant ones and create new ones if necessary. The impact shows up both when the tables are used and when the indexes are rebuilt/refreshed. The ETL code has to be reviewed for the use of hints, because hinting an obsolete or sub-optimal index hurts the performance much more than not using an index at all. If a hint is used, try the code without it - often the database engines do a very good optimization job and improvements are immediate. Another checkpoint is whether indexes are deactivated during the ETL operations. Some take the approach of completely dropping the indexes before record operations and re-creating them after. Disabling/dropping indexes is a fast and easy way to save execution time, but balance that with the time lost on refresh/rebuild.
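The disable-load-rebuild pattern could look roughly like this, assuming SQL Server syntax and hypothetical object names; whether it pays off depends on how long the rebuild takes.

```sql
-- Disable a nonclustered index before the bulk insert
-- (disabling the clustered index would make the table inaccessible).
ALTER INDEX ix_fact_orders_customer ON dwh.fact_orders DISABLE;

INSERT INTO dwh.fact_orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM staging.orders;

-- Rebuild after the load; weigh the time saved against the rebuild cost.
ALTER INDEX ix_fact_orders_customer ON dwh.fact_orders REBUILD;
```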

Parallel vs sequential execution
Executing steps in parallel is a great way to reduce the execution time. However, handle it with caution. A proper setup should take into account the parallelism capabilities and settings of the database/engine and the ETL tool as well as the required resources. Make sure processes running in parallel do not share the same target or source table.

Optimize the target tables size
Target tables could be enormous. Sometimes it is for a reason, sometimes it is because their size was neglected over the years. Why would you keep data for 12 years back when all that is ever needed is 5 years back? Reducing the size of target tables will benefit in three ways - faster data insert/update, faster index rebuild/refresh and faster querying. One option is moving unused data to a historical table. The historical data has to be available to BI and other analytical tools, and this requires additional development work. Another way to deal with the table size is to use partitioning.
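A minimal sketch of the historical-table option, assuming a 5-year retention window agreed with the consumers and SQL Server syntax; the table and column names are hypothetical.

```sql
BEGIN TRANSACTION;

-- Move rows older than the retention window to a history table...
INSERT INTO dwh.fact_orders_history (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date
FROM dwh.fact_orders
WHERE order_date < DATEADD(YEAR, -5, GETDATE());

-- ...and remove them from the active target table.
DELETE FROM dwh.fact_orders
WHERE order_date < DATEADD(YEAR, -5, GETDATE());

COMMIT TRANSACTION;
```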

Optimize custom code
Custom code is not unheard-of in data processing and is very likely to be a weak link in the process. Review this code for opportunities for improvement - usually there are plenty. For example, a SQL UPDATE could be replaced by a MERGE operation. Also, consider removing the custom code and including its operations in standard ETL steps.
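A sketch of the UPDATE-to-MERGE replacement, assuming a dialect with MERGE support (SQL Server, Oracle, etc.); the dimension and staging tables are hypothetical.

```sql
-- One MERGE handles both the update of existing rows and the insert of new ones.
MERGE dwh.dim_customer AS tgt
USING staging.customers AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET tgt.name    = src.name,
               tgt.segment = src.segment
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, segment)
    VALUES (src.customer_id, src.name, src.segment);
```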

Take a step back and use common sense
Common sense is a lost art. We tend to rush into Googling and get drawn into the millions of pages with advice, magic command lines and hacks, while the answers sometimes are right before our eyes. This applies to ETL optimization too. Always keep in mind the project circumstances, the general business context and other specifics. Let me illustrate with an example from a project I did some time ago. It was back in the times when BI access had to be shut down while data was being refreshed from the sources. After weeks of effort, the total execution time was significantly lower and there were no obvious candidates for further optimization - I was out of moves. However, at some point I realized there was no reason to kick out the users before the initial staging phase (loading data from the source into staging tables). A simple move of the shut-down step to after the staging phase saved a third of the total downtime. It is not a process optimization step per se, but it served the greater goal of reducing total BI downtime. The temptation to dive into improving steps has to be balanced by a sober look at the project as a whole.

Instead of Conclusion
A slow ETL frustrates the business users and damages the image of the IT department. The costs of that and the cost of an optimization project can be avoided by introducing better practices in ETL maintenance and development. These practices include monitoring step duration trends and promptly acting on undesirable ones, keeping proper documentation, better development and change management, regular audits, and avoiding custom code. ETL optimization is a challenge, but now you are equipped with some ideas and a plan for how to tackle it.

Aug 9, 2016

Motor Vehicle Market in Bulgaria - Some Data Revealed

A couple of years ago I got interested in the second-hand car market in Bulgaria and was not surprised that there was no data available for it. I tried to proxy it with data from mobile.bg - the largest car sale site in the country - see the post here. The approach had some downsides but still delivered some data. However, it was a partial effort, as a proper estimation would require regular data extraction and comparison and I had not had the time to do that. The good news is, thanks to the EU Open Data effort, the Bulgarian government has recently published data for registered motor vehicles. Please find it here, in Bulgarian only. It is a good step forward and I welcome that. The data is missing some important details such as mileage and engine type (diesel, petrol, electric) and could have been better organized, but it is way better than no data at all. I was not happy to confirm some of my conclusions about the age of the car park, as it is a sign the economy is not where it should be after more than 20 years of free market.

I hope the Open Data initiative will not die and the government will keep publishing data on a regular basis to satisfy the data-curious minds.

Apr 22, 2016

Is There Anything Wrong with Year 2016

The pop king Prince passed away yesterday at the age of 57. He joined a seemingly long list of celebrities we lost this year. Posts on social media go along the lines of "2016 did it again", "First Bowie, now Prince" and "Isn't it the worst year ever?". One might really think there is something going on this year. But is it really?

Apr 12, 2016

Sunken Costs and Sausages

Last week I went to a client conveniently located in a fantastic tourist destination and I had a great opportunity to add some weight while having a good time. I was recommended a restaurant serving fantastic, one-of-a-kind sausages and one evening I took the walk to get there. It was a bit off the tourist and business tracks and soon I got to a part of the city where nothing much was happening, so I was expecting a quiet and tasty dinner. However, to my surprise, the restaurant was as crowded as an Apple store on the release of a new iPhone! People were standing with a beverage and a plate of sausage so close to each other that even basic activities such as checking Facebook seemed impossible. My companions were all "See how good this place is!". But was it really the food that kept people in?

Mar 29, 2016

The Problem with Excel Spreadsheets in Organizations

The death of Excel has been announced many times, but it seems to be tougher than Officer McClane and still saves the day for many organizations. However, its implementation is riddled with problems and dangers that could have a detrimental effect on careers or businesses. I have long developed and fought for standards and procedures to minimize the risks and pains coming from badly placed, designed and developed spreadsheets, and I am not alone in that, as there are numerous evangelists of good Excel practices. However, their impact is limited and there are larger opposing forces at play. They come from the organizations and from the accessibility of Excel itself.

Dec 9, 2015

Quantitative Models in Business War Games

Many things look very cool but turn out to be not that exciting when one tries using them. Like voice-recognition technology or Communism or electric cars. Quantitative simulation in business war games proudly goes on this list as well. A war game in business is a simulation of moves and counter-moves by opponents in a commercial setting, according to Wikipedia, and is part of the large family of exercises that aim at leveling up the strategies and management methods in modern companies. In a nutshell, a war game has two or more teams competing for some resources - usually market share, sales or other tangible outcomes. These exercises could be very beneficial for a company. In the general case there is no measure for the impact of a strategy, and some thought of introducing a suitable quantitative model to deal with that. And it is a good idea. Except it is not, as it usually does not work.

Nov 18, 2015

From Start up to F**k up - What I Have Learned

I wonder when and why startup businesses became such a big thing. Maybe with the easy access to high tech and the available funding, more people got convinced they could develop their own product to kill the market and buy the Hollywood life. However, despite all the books and gurus on the matter, starting a company from scratch is a bit more difficult than installing an app on your phone. I am not a doctor but you can trust me on that - a year and a half ago I co-founded a company and went through almost the full spectrum of problems and emotions that come with it. The company is focused on retail analytical services and on providing a BI solution that addresses the specific challenges in that industry. I have recently terminated my tenure there and have the time to reflect and share some of the lessons I learned. After all, isn't it a crime if you are in a startup and the world does not know about it?

Oct 1, 2014

Some Great Data Science and Big Data Links


The analytical hub Data Science Central did extensive research on the sites and blogs liked or mentioned among their member base. The result is a comprehensive list of the best data science sources. Please find the list here.

If you are looking to expand your list of regularly visited blogs check out the similar list of 50 Data Science and Statistics Blogs Worth Reading. There are some true gems.

In case you need some large data to sharpen your skills or for any other purpose go to 20 Big Data Repositories You Should Check Out.

Of course, you can always find some original content and random analytical-related thoughts here.

Sep 24, 2014

Used Car Market in Bulgaria - Where is The Data?


A couple of weeks ago I shared some difficulties that come with forecasting the new car market. This market is interesting to marketers for obvious reasons, but it forms the smaller portion of the total car market. The second-hand car market got my interest and I looked around for some data. I focused my curiosity on my home country and I would like to put some popular myths against hard data. It turned out that there is virtually no data at all and I had to do some digging for details.

Sep 9, 2014

The Winding Roads of Car Sales Forecasting


Forecasting the sales of a car dealer is a tough business, much tougher than predicting the total market. Winning a car race is a matter of the right combination of engine tuning, tire type and pressure, the quality of petrol and the other fluids, the race track parameters, the weather conditions, the pilot's mental and physical status and many others, as well as all of these for the competitors. Similarly, sales results depend on a plethora of interconnected factors, which makes forecasting them a Heracles-grade labour. However, the challenges in predicting car sales are not unique, and they are a good illustration of some common problems.