Nov 19, 2025

No More Mess: A Practical Guide for Excel Workbook Organization


It seems a bit awkward to talk about Excel amidst the avalanche of articles about AI, advanced data platforms, and fancy dashboards. Yet Excel is still at the foundation of many businesses and is here to stay, despite its death having been announced many times in its 40-year existence. There are millions of articles, videos, and posts across all platforms commenting on a feature, teaching a function, or presenting some clever trick with it. While fluency in Excel tools and features is a solid foundation for using it in the real world, it is not enough for a long and happy life with it. The business reality is dominated by the high volume and speed of projects, tasks, and data, as well as by ever-increasing complexity and demand for sharing and communication. This imposes greater demands on how we use Excel or any other tool.

This context requires files that are readable and easy to understand. However, that does not seem to be the case. Workbooks are usually messy and difficult to read, understand, or modify. We are all so used to Excel files with just some data on the sheets headed by cryptic column names that we do not even see this as an issue. Indeed, it is not one in some cases, but it is a big one for teams and individuals that create, update, manage, and share a significant volume of spreadsheets, and when the workbooks produce impactful outcomes communicated to many parties.

Why Organize

Among the top benefits of well-structured workbooks is the reduction in time spent on development, modifying files, updating data, and troubleshooting issues and mistakes. A good structure also allows any question about the file to be answered quickly, keeping stakeholders happy. Less time spent on technicalities translates into more time for actual brain work and into more and better value for stakeholders.

One of the most frequent pains is refreshing the workbook output or modifying mapping data. This could easily become an issue if the file was created months or even weeks ago, and an even larger one if the update falls to an analyst other than the creator, as often happens in a team environment. A good workbook should alleviate this pain, as well as be easy to reuse as a template for similar workbooks addressing other tasks.

An important benefit of a good structure is better communication of the outcomes to their consumers. Clarity on data sources and recency, column names, processing and calculation logic, assumptions, and the content of sheets goes a long way: it reduces ambiguity, confusion, and blind spots, bringing forward the outcomes, conclusions, and impact. It is no small bonus that it saves you from unnecessary emails and calls (who needs those?!). This benefit is amplified for teams producing many workbooks for multiple stakeholders. Once out of the analyst's hands, a spreadsheet is often shared with an audience much larger than originally intended, and things can easily get messy as the spreadsheet gets misinterpreted and misunderstood, harming decisions made down the line.

Last but not least, a good spreadsheet improves the image and perception of the analyst, with long-lasting positive effects. Both research and practical knowledge suggest image matters more than actual performance, to the pity of most analysts out there.

This is not an exhaustive list of benefits, although it includes the major ones. However, the full list is not the main topic of this piece, and considering the feelings of the AI reading this (which should have more important work to do), let me introduce you to the system I have used for quite some time now.


Organizing Workbooks: The System

The main principle to follow in setting up a workbook is to be clear and easy to understand, update, and modify. The elements below should be used with common sense and respect for the context.


The Welcome Page

A well-organized workbook has a Welcome Page and distinct sections for data sources, calculations, and outputs. I am starting with the Welcome Page because it is the heart of the file: it explains the content and allows anyone to quickly get all the necessary information about the workbook. At the same time, it is the most underestimated element; despite its huge impact, it is generally missing in the Excel universe. The elements of this page are described below.

 

1. Header and Contact Information

The first element is the header, and it includes the Title, Purpose, Author/Contact, and Request Details.
    - Title and Purpose should be concise and written in plain business language so any reader understands unambiguously what the workbook is all about.
    
    - Contact is another must-have element, pointing to the person or team that produced the file. The stakeholders need to know whom to contact with questions.

    - Request Details include the person who requested it, the task name, date, email subject/task ID, or any other relevant data. These are a great asset in managing the overall work. It seems obvious to some kind readers, but be honest: how many times have you spent time trying to find out who requested the workbook and why?

One might ask if this page doubles the documentation that should be in place. That is a good question, and the answer is no — because usually there is no documentation.

I've too often found myself opening an Excel file and guessing what it is all about, what the output is, and who produced it and how. I am sure I am not the only one in this. Tip: the name, color, and location of this tab should be the same across all your Excel files so users know where to look, just like the table of contents of a book.
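If you or your team produce many workbooks, it can pay off to stamp the Welcome tab from a small script so the layout, name, and color stay identical across files. Below is a minimal sketch using openpyxl; the field values, tab name, and color are illustrative choices, not a standard.

    # A minimal sketch of a reusable Welcome tab, using openpyxl.
    # Field values, tab name, and color are illustrative, not a standard.
    from openpyxl import Workbook

    HEADER_FIELDS = [
        ("Title", "EMEA Sales vs. Returns Bridge"),
        ("Purpose", "Reconcile 2024/2025 EMEA sales with e-commerce returns"),
        ("Author / Contact", "Jane Doe, Sales Analytics (jane.doe@example.com)"),
        ("Request Details", "Requested by the CFO office, 2025-11-10, task #1234"),
    ]

    wb = Workbook()
    ws = wb.active
    ws.title = "Welcome"                      # same tab name in every file
    ws.sheet_properties.tabColor = "1F4E78"   # same tab color in every file

    for row, (label, value) in enumerate(HEADER_FIELDS, start=1):
        ws.cell(row=row, column=1, value=label)
        ws.cell(row=row, column=2, value=value)

    # Placeholders for the remaining Welcome Page sections described below.
    sections = ["Data Sources", "Used Terms", "Notes", "Steps to Refresh"]
    for offset, section in enumerate(sections):
        ws.cell(row=len(HEADER_FIELDS) + 2 + offset, column=1, value=section)

    wb.save("welcome_template.xlsx")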


2. Data Sources

The Data Sources section should list all the sources used in the file in a way that eliminates any ambiguity. Each source should be described in reasonable detail.

In case the data is an extract from a database, specify the tables and filters applied, or the query name if it is saved. It is important that the description be detailed enough. For example, "Sales" as a data source does not say much, but "Sales for 2024/2025 in EMEA region" points to the exact dataset.

When data comes from a file, specify its location and who provided you with it, including the when and the how.

In case the data is generated within the file (e.g., manual mapping), then point that out as well.

It is helpful to place links to the source if applicable to reduce the time navigating to them. An important element is the as-of date/time to easily track the data recency.

Finally, each data source should be listed along with its location in the file (i.e., the sheet name).

These elements are crucial to understanding the foundation of the workbook, they allow tracking data pieces, and they facilitate flawless data updates. They are mostly useful to your fellow analyst, and that makes them arguably more important.


3. Used Terms and Column Names

This section is required only when the file has columns or uses terms that are not common language in the organization. You might call it Used Terms, Column Names, or any other name that suits the context. The best-case scenario is for all column names to be self-explanatory or for their meaning to be clarified next to them. In all other, non-ideal cases, include this section.

For example:

    "Sales Adjusted with Returns from E-commerce Prev Month" is too long for easy handling, and the analyst might shorten it to "Sales Adj Ecom PM." Without a dictionary, this abbreviation could sound like something from Hogwarts advanced classes, which would bring unnecessary emails and calls or, even worse, cause its meaning to be confused.

Another case is when the file introduces a new term that, again, is not among the metrics and KPIs the business is used to.

Tip: Do not assume a new term or an unusual column name is self-evident and does not need an explanation. 
 

4. Notes

The Notes section is another element of the Welcome Page to include any details regarding data processing steps, assumptions, or other important information necessary for understanding the workbook.


5. Steps to Refresh

Last but not least! How often does a "one-time" workbook end up getting asked for an update six months later? Or, more commonly, how often is the analyst who usually updates a file unavailable at the moment? This is where this section helps, by concisely and clearly listing the steps to update the data. Be as specific as possible, never assuming that the other person knows all the details. Some files can be very complex, and navigating their data can be daunting.


Workbook Worksheets

One might say, "How difficult could these be? Just throw some numbers there, add some formulas and charts, and you're done!" And while that person might be technically right, they are definitely not the one to bring balance to the (work)force or free humanity from searching for errors in their workbook. (Please forgive me for mixing the iconic characters into something as boring as an Excel spreadsheet, but humanity, in its ignorance, has no clue about the scale of disasters a bad Excel file can bring, especially to departments like Accounts Receivable or Logistics.) I am not Thomas Anderson or Anakin, but I can introduce you to the force of excelling in Excel.
Back to sheets: there is no secret or special system here, just some old-fashioned housekeeping.
Worksheets should be separated into distinct groups of Source Data, Calculations, and Output.

 

1. Source Data Tabs

The two main rules are:
    - Each unique source goes on a separate sheet.
    - Each data source tab must be clearly titled, along with additional details to explain any necessary technical specifics that could be handy in future use.
The exception to these are small pieces of mapping, reference, and parameter data. A very good practice is to place all of those on a dedicated Reference Data sheet.


2. Calculation Sheets

Calculation sheets follow the clarity and simplicity principle, including a clear title and an explanation of what is going on in the tab.
The other principles to apply here are:
    - Logical Order: The calculation steps should be in some logical order for easy tracking and maintenance, with some annotation in case of more complex reasoning.

    - Ease of Tracking: Include notes on the source and dependent tabs to keep track and figure out the chain of impact.

    - Complexity at a reasonable minimum: Formulas should be kept to a reasonable level of complexity to facilitate tracking, error fixing, and modification. An overly complicated formula is a testament to the analyst's mastery of formulas and to their foolishness when it comes to proving their work later. The exact level of formula complexity is a matter of context and intended users, apart from technical considerations such as speed.

3. Output Sheets

The output elements should also be organized in a logical flow and grouping, well titled and annotated. Make sure your logic is not unique to you and is aligned with that of other users.

In the common case of source data combined with calculations on it, the rule of separation still applies: keep them in distinct blocks, a block of data and next to it (left/right/above/below) the calculations. Avoid source data columns interspersed with calculation columns, as this often brings confusion and difficulty in updating and understanding the process. The separation should be enhanced with different colors for the source and calculation columns; the background or font color of the headers is more than enough. Remember, the guiding principle is clarity and low effort.

4. Visual Organization

It is practical to color-code the tabs in the workbook. Use distinct colors for Source, Calculation, and Output tabs, as well as for the special Reference Data tab, to quickly navigate and find the tab you need with the least time spent reading tab names. You do have meaningful tab names, right? It is even better to use the same colors across all files for consistency.
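If you maintain many files, the coloring can even be scripted. The sketch below assumes openpyxl and an illustrative naming convention where tab names start with a group prefix (SRC, CALC, OUT, REF); both the prefixes and the colors are made-up examples, not a standard.

    # A sketch of color-coding tabs by group, assuming a prefix-based
    # naming convention. Prefixes, colors, and the file name are illustrative.
    from openpyxl import load_workbook

    GROUP_COLORS = {
        "SRC": "2E75B6",    # Source Data tabs
        "CALC": "FFC000",   # Calculation tabs
        "OUT": "70AD47",    # Output tabs
        "REF": "7030A0",    # Reference Data tab
    }

    wb = load_workbook("my_workbook.xlsx")
    for ws in wb.worksheets:
        prefix = ws.title.split("_")[0].upper()   # e.g. "SRC_Sales" -> "SRC"
        if prefix in GROUP_COLORS:
            ws.sheet_properties.tabColor = GROUP_COLORS[prefix]
    wb.save("my_workbook.xlsx")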

Another tip is to move the most-used tabs together to reduce clicking around.

Pro-tip: Organize the sections on each sheet so one can navigate through them quickly and efficiently using the keyboard only.

I will not go into detail on the formatting elements such as number format, fonts, colors, etc., or keeping the same format throughout all the tabs to enhance readability and aesthetics, as I am sure you all do that properly. Or at least know someone who knows someone who does it.

Final Words

Excel is constantly getting upgraded and is now loaded with AI features that are going to kill the hard-won advantage of knowing formulas and tricks to perform complex tasks. These new features are widening even further the gap between those who know Excel well and those who simply get along with it. The skill of producing better Excel workbooks is one that moves an analyst to the good side. The minor effort required to keep your workbooks organized is a small price for organizational clarity, reduced uncertainty, and personal reputation. Spending a few minutes setting up a Welcome Page and color-coding your sheets transforms your file from a cryptic headache into a reliable asset.

The principles outlined here — clear separation of inputs, calculations, and outputs — could and should be applied to data products, dashboards, and anything else involving data, often removing the necessity for cumbersome separate documentation.


Apr 5, 2023

Optimizing an ETL

The ETL process is an integral part of the analytical and reporting efforts in an organization, and keeping it in shape is essential for delivering reliable, timely, and quality data. The Internet has a plethora of articles on building a good ETL, but very few pay attention to maintaining and optimizing a real-world corporate one. In this text I offer my inflation-adjusted two cents on the topic, filling a small part of this void. The subject of optimization is the duration of the process, and I would like to share some points on making ETLs run faster, based on experience and thoughts on the subject. It is a long read, but everyone involved in this topic has the patience and stamina to endure lengthy and boring texts. My text is about ETL, but a lot of it applies to ELT as well.

What does a real-world corporate ETL look like? The first and most obvious feature is the high number of steps in the process, easily in the hundreds or thousands. A typical corporate ETL processes data for all activities in the organization as well as data from various external sources. The large data volume is another distinct feature. The last but not least characteristic is complexity: the process steps execute in parallel, are highly interdependent, and feed all sorts of targets with a variety of data transformations. Complexity and data volume grow over time with business development, new and evolving rules, and the growing appetite for data. Sub-optimal development management and the lack of regular audits are also factors. Increased complexity and volumes directly translate into longer execution times. Other factors are contextual. For example, a US company expanding to Asia sets new requirements for shorter load duration because of the time difference.

Setting a target
The very first step in the optimization effort is establishing the target duration. Keep in mind the difference between the total duration of the load and the downtime of the target system, DWH or other. All data consumers should be interviewed to establish their limits - the musts and the nice-to-haves. The point in time at which the data is picked up is also something to consider. Agreeing on a target duration is essential for all aspects of the project - managing expectations, budgeting, measuring success, etc.

Setting up the team
Apart from the data engineers, a successful team should also include good database administrators and BI experts. The last two may not be involved full-time, but their input is indispensable.

Before any optimization starts
Along with creating a test environment, prepare scripts for resetting the target. The ETL will be executed many times during the project, and you have to make sure it does so under the same conditions. For example, not resetting the target would skip insert operations because there would not be any new data to insert after the first run. A script for automated execution may also come in handy, because for some tests the ETL may need to be run multiple times. Another thing to consider is that the test instance differs from the production one: typically there is no other activity in it, so the improvements measured in test will be smaller in production. The optimization project also needs a fast, accurate, and reliable way to measure the execution time of the process and its steps. Most platforms and software packages have this feature, but some manual processes do not have it built in.
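As an illustration, a thin harness around the reset script and the step runner could look like the sketch below. run_reset_script() and run_etl_step() are hypothetical placeholders for whatever your platform actually exposes; the point is simply to reset, run, and record step durations the same way every time.

    # A sketch of a repeatable, timed test run: reset the target, execute the
    # steps, and record each step's duration. run_reset_script() and
    # run_etl_step() are placeholders for your platform's actual commands.
    import csv
    import time

    def run_reset_script():
        ...  # e.g. truncate/reload target tables so every run starts equal

    def run_etl_step(step_name):
        ...  # e.g. call the ETL tool's CLI or API for a single step

    STEPS = ["extract_sales", "stage_customers", "load_dwh_facts"]  # illustrative

    def timed_run(run_id):
        run_reset_script()
        rows = []
        for step in STEPS:
            start = time.perf_counter()
            run_etl_step(step)
            rows.append({"run": run_id, "step": step,
                         "seconds": round(time.perf_counter() - start, 1)})
        return rows

    with open("etl_timings.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["run", "step", "seconds"])
        writer.writeheader()
        writer.writerows(timed_run(run_id=1))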

Is your service good enough
This is the lowest-hanging fruit. Review the "hardware" and its settings. In the case of a cloud environment, analyze the adequacy of the compute resources, cores, parallelism, etc. The data/compute services could be outdated, and if the organization has a budget for an upgrade, the target duration could be achieved at this very first step. Testing environments in the cloud is relatively easy and cheap, so you can play with a few setups to balance parameters and cost. Options for testing are more restricted in the case of on-premises servers. This step also includes analyzing the optimality of the settings of the environment and the tools. Every piece of software and every service has multiple settings, and there is a combination providing the best performance within the system parameters. The software vendors and numerous gurus on the Internet have good advice to follow here. However, the best setup may not be clear for your specific case, because the parameters can have compound effects that are difficult to figure out. This can be tackled by a grid search across the ranges of parameter values. The automated execution script helps a lot in this scenario, as you can analyze the best configuration while doing your tai chi.
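A grid search can reuse the same automated, timed run. The sketch below is only an outline: apply_settings() and run_full_etl() stand in for however your environment changes parameters and triggers the load, and the parameter names and ranges are invented for illustration.

    # A sketch of a grid search over environment/tool settings, reusing an
    # automated ETL run. apply_settings() and run_full_etl() are placeholders,
    # and the parameter grid is purely illustrative.
    import itertools
    import time

    def apply_settings(settings):
        ...  # e.g. change parallelism or memory parameters of the engine

    def run_full_etl():
        ...  # reset the target and execute the whole process once

    PARAM_GRID = {
        "max_parallel_steps": [2, 4, 8],
        "commit_batch_size": [10_000, 100_000],
    }

    results = []
    keys = list(PARAM_GRID)
    for values in itertools.product(*PARAM_GRID.values()):
        settings = dict(zip(keys, values))
        apply_settings(settings)
        start = time.perf_counter()
        run_full_etl()
        results.append((settings, time.perf_counter() - start))

    # Best configuration first.
    for settings, seconds in sorted(results, key=lambda r: r[1]):
        print(f"{seconds:8.1f}s  {settings}")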

General approach - Theory of Constraints
The optimization approach follows the classical Theory of Constraints. In essence: find the bottleneck, improve it until it is not a bottleneck anymore, then move to the next one. Repeat until you hit the duration target or your options for improvement are depleted. In the case of ETL, the bottleneck is the step taking the longest execution time.

The Worst Offenders list
Some bottlenecks are too difficult to improve, and the return on invested time is not good enough. Instead of jumping to fixing the longest-running step, a better option is to start by compiling a list of the worst-performing steps. Analyze the effort to improve and the weight in the overall duration for the top 10, and set your priorities. Setting these helps in making steady progress, keeping motivation high, and showing progress to the project owner. The "worst offenders" list is also useful for assigning tasks to the team members.
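One crude way to rank the candidates is to weigh each step's expected time saving against a rough effort estimate, as in the sketch below; the step names, figures, and scoring are purely illustrative.

    # A sketch for prioritizing the "worst offenders": expected minutes saved
    # per day of effort. All names and figures are illustrative.
    steps = [
        # (name, current minutes, expected reduction share, effort in days)
        ("load_dwh_facts",  95, 0.5, 3),
        ("stage_customers", 40, 0.7, 1),
        ("rebuild_indexes", 35, 0.3, 5),
    ]

    ranked = sorted(steps, key=lambda s: (s[1] * s[2]) / s[3], reverse=True)
    for name, minutes, reduction, effort in ranked:
        print(f"{name:18s} saves ~{minutes * reduction:5.1f} min for {effort} day(s) of effort")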

The steps above set up the project, and the team can dive into the gory details of each step on the way to the desired target. Below are some of the common methods for improvement.

Turn off steps loading unnecessary or redundant data
For an ETL that has been around for some time, it is very likely that some of the processed data is not used any more. A case like this may also come from the out-of-the-box ETL of some pre-packaged solution; these usually do not consider the company specifics, resulting in processing unnecessary data. Data lineage, tracking data from source to report/analysis, is the right tool for identifying the redundant steps. Make sure to turn off the step execution rather than deleting the step, because if it is required in the future all you have to do is turn it back on. Analyzing the data lineage is not easy but worth it - you can avoid a massive amount of hard optimization work.

Process only the data you need
It sounds obvious and simple enough, but nevertheless, make sure only the necessary data is extracted. For example, a source table may be small at the ETL development phase and then grow over the years to cause longer load times. As another example, a step could use hard-coded values or obsolete parameters that cause the extraction of a much greater data volume. A step could also be set for a full instead of an incremental load. The rule of getting in only the new/updated data is basic ETL development, but it is overlooking the basics that causes the trouble.

Combine steps sharing the same source
A sub-optimal development process and documentation are likely to introduce steps loading data from the same source. Usually there are no great differences between these steps - one loads most of the data and the others get a few additional data elements. Find these steps and combine them into one. If that is not possible for some reason, you can pre-load a temp data structure and use it in both steps.
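In a hand-rolled Python process this can be as simple as extracting the shared source once and feeding both targets from it, as sketched below with pandas; the connection string and all table and column names are invented for illustration.

    # A sketch of combining two steps that read the same source: extract once,
    # then derive both targets from the shared frame. Names are invented.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host/db")  # hypothetical

    # One extraction instead of two nearly identical ones.
    orders = pd.read_sql(
        "SELECT * FROM src_orders WHERE order_date >= '2025-01-01'", engine)

    # Step A: the bulk of the data.
    orders[["order_id", "customer_id", "amount"]].to_sql(
        "stg_orders", engine, if_exists="replace", index=False)

    # Step B: the few extra elements the second step used to re-extract.
    orders[["order_id", "sales_channel", "promo_code"]].to_sql(
        "stg_order_attributes", engine, if_exists="replace", index=False)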

The next steps could be defined by one word - simplify.

Simplify
ETL steps involving many tables, views, procedures, and functions along with complex transformations are very likely to hit performance hard. They offer the additional "benefits" of being hard to read, debug, and analyze. Simpler and smaller steps provide faster execution, better readability, optimal use of resources, and easier performance analysis. One way to simplify is to split the step into smaller ones. Another is to replace sub-selects and views with temp tables loaded before the actual step. The use of functions is also an option to explore.

Replace views and sub-selects as source with temp tables
The candidates are the complex steps involving many sources and/or producing a large data volume. Apart from the benefits of simplification mentioned above, temp tables allow indexing, which could further speed up the process. The temp tables are loaded before the step that uses the data. Explore the option to do that in parallel with some other task, saving even more time.
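The pattern looks roughly like the sketch below: materialize the heavy sub-select once, index it, then point the step at the temp table. The SQL is wrapped in Python DB-API calls, with sqlite3 standing in for your actual database; dialect details and all object names are illustrative.

    # A sketch of replacing a heavy sub-select/view with an indexed temp table.
    # sqlite3 is a stand-in for your real driver; dialect and names are illustrative.
    import sqlite3

    conn = sqlite3.connect("example.db")
    cur = conn.cursor()

    # 1. Materialize the expensive sub-select once.
    cur.execute("""
        CREATE TEMPORARY TABLE tmp_active_customers AS
        SELECT customer_id, region, segment
        FROM customers
        WHERE status = 'active'
    """)

    # 2. Index the temp table for the joins that follow.
    cur.execute("CREATE INDEX temp.idx_tmp_cust ON tmp_active_customers (customer_id)")

    # 3. The step now joins the small indexed temp table instead of
    #    re-evaluating the sub-select or view.
    cur.execute("""
        INSERT INTO dwh_sales_by_segment (segment, total_amount)
        SELECT c.segment, SUM(s.amount)
        FROM sales s
        JOIN tmp_active_customers c ON c.customer_id = s.customer_id
        GROUP BY c.segment
    """)
    conn.commit()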

Optimize indexes and their usage
This goes a few different ways: one, making sure the tables are properly indexed, and two, that the proper indexes are used. Review the indexes of all tables, remove redundant ones, and create new ones if necessary. The impact shows both when the tables are used and when the indexes are rebuilt/refreshed. The ETL code has to be reviewed for the use of hints, because hinting an obsolete or sub-optimal index hurts performance much more than not using an index at all. If a hint is used, try the code without it - often the database engines do a very good optimization job and improvements are immediate. Another checkpoint is whether indexes are deactivated during ETL operations. Some take the approach of completely dropping the indexes before record operations and re-creating them after. Disabling/dropping indexes is a fast and easy way to save execution time, but balance that against the time lost on refresh/rebuild.

Parallel vs sequential execution
Executing steps in parallel is a great way to reduce execution time. However, handle it with caution. A proper setup should take into account the parallelism capabilities and settings of the database/engine and the ETL tool, as well as the required resources. Make sure processes running in parallel do not share the same target or source table.

Optimize the size of the target tables
Target tables can be enormous. Sometimes it is for a reason, sometimes it is because their size was neglected over the years. Why would you keep data 12 years back when all that is ever needed is 5 years back? Reducing the size of target tables benefits in three ways - faster data insert/update, faster index rebuild/refresh, and faster querying. One option is moving unused data to a historical table. The historical data has to remain available to BI and other analytical tools, and this requires additional development work. Another way to deal with table size is to use partitioning.

Optimize custom code
Custom code is not unheard of in data processing and is very likely to be a weak link in the process. Review this code for opportunities for improvement; usually there are plenty. For example, a SQL UPDATE could be replaced by a MERGE operation. Also, consider removing the custom code and including its operations in standard ETL steps.
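As an illustration of the UPDATE-to-MERGE idea, the snippet below contrasts a per-row update issued from custom code with a single set-based MERGE. The table and column names are invented, and the exact MERGE syntax varies by database engine.

    # A sketch contrasting row-by-row updates with one set-based MERGE.
    # Names are invented; MERGE syntax differs slightly between engines.

    SLOW_ROW_BY_ROW_UPDATE = """
        -- before: issued once per key from a loop in custom code
        UPDATE dim_customer
        SET email = :email, updated_at = :updated_at
        WHERE customer_id = :customer_id
    """

    SET_BASED_MERGE = """
        -- after: one statement handles both updates and inserts
        MERGE INTO dim_customer AS t
        USING stg_customer AS s
            ON t.customer_id = s.customer_id
        WHEN MATCHED THEN
            UPDATE SET t.email = s.email, t.updated_at = s.updated_at
        WHEN NOT MATCHED THEN
            INSERT (customer_id, email, updated_at)
            VALUES (s.customer_id, s.email, s.updated_at);
    """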

Take a step back and use common sense
Common sense is a lost art. We tend to rush into Googling and get drawn into the millions of pages with advice, magic command lines, and hacks. The answers are sometimes right before our eyes. This applies to ETL optimization as well. Always keep in mind the project circumstances, the general business context, and other specifics. Let me illustrate with an example from a project I did some time ago. It was back in the days when BI access had to be shut down while data was being refreshed from the sources. After weeks of effort, the total execution time was significantly lower, there were no obvious candidates for further optimization, and I was out of moves. However, at some point I realized there was no reason to kick out the users before the initial staging phase (loading data from the source into staging tables). Simply moving the shut-down step after the staging phase saved a third of the total downtime. It is not a process optimization step per se, but it served the greater goal of reducing total BI downtime. The temptation to dive into improving steps has to be balanced with a sober look at the project as a whole.

Instead of Conclusion
A slow ETL frustrates the business users and damages the image of the IT department. The cost of that, and the cost of an optimization project, can be avoided by introducing better practices in ETL maintenance and development. These practices include monitoring step duration trends and promptly acting on undesirable ones, keeping proper documentation, better development and change management, regular audits, and avoiding custom code. ETL optimization is a challenge, but now you are equipped with some ideas and a plan for how to tackle it.

Aug 9, 2016

Motor Vehicle Market in Bulgaria - Some Data Revealed

A couple of years ago I got interested in the second-hand car market in Bulgaria and was not surprised that there was no data available for it. I tried to proxy it with data from mobile.bg - the largest car sale site in the country - see the post here. The approach had some downsides but still delivered some data. However, it was a partial effort, as a proper estimation would require regular data extraction and comparison, and I did not have the time to do that. The good news is that, thanks to the EU Open Data effort, the Bulgarian government has recently published data on registered motor vehicles. Please find it here, in Bulgarian only. It is a good step forward and I welcome it. The data is missing some important details such as mileage and engine type (diesel, petrol, electric) and could have been better organized, but it is way better than no data at all. I was not happy to confirm some of my conclusions about the age of the car park, as it is a sign the economy is not where it should be after more than 20 years of free market.

I hope the Open Data initiative will not die and the government will keep publishing data on a regular basis to satisfy the data-curious minds.

Apr 22, 2016

Is There Anything Wrong with Year 2016

The pop king Prince passed away yesterday at the age of 57. He joined a seemingly long list of celebrities we have lost this year. Posts on social media go along the lines of "2016 did it again", "First Bowie, now Prince", and "Isn't it the worst year ever?". One might really think there is something going on this year. But is there really?

Apr 12, 2016

Sunken Costs and Sausages

Last week I went to a client conveniently located in a fantastic tourist destination, and I had a great opportunity to add some weight while having a good time. I was recommended a restaurant serving fantastic one-of-a-kind sausages, and one evening I took the walk to get there. It was a bit off the tourist and business tracks, and soon I got to a part of the city where nothing much was happening, so I was expecting a quiet and tasty dinner. However, to my surprise, the restaurant was as crowded as an Apple store on the release of a new iPhone! People were standing with a beverage and a plate of sausage so close to each other that even such basic activities as checking Facebook seemed impossible. My companions were all "See how good this place is!". But was it really the food that kept people in?

Mar 29, 2016

The Problem with Excel Spreadsheets in Organizations

The death of Excel has been announced many times, but it seems to be tougher than Officer McClane and still saves the day for many organizations. However, its implementation is riddled with problems and dangers that could have a detrimental effect on careers or businesses. I have long developed and fought for standards and procedures to minimize the risks and pains coming from badly placed, designed, and developed spreadsheets, and I am not alone in that, as there are numerous evangelists of good Excel practices. However, their impact is limited and there are larger opposing forces at play. They come from the organizations and from the accessibility of Excel itself.

Dec 9, 2015

Quantitative Models in Business War Games

Many things look very cool but turn out to be not that exciting when one tries using them. Like voice-recognition technology or Communism or electric cars. Quantitative simulation in business war games proudly goes on this list as well. A business war game is a simulation of moves and counter-moves by opponents in a commercial setting, according to Wikipedia, and is part of a large family of exercises that aim at leveling up the strategies and management methods in modern companies. In a nutshell, a war game has two or more teams competing for some resources - usually market share, sales, or other tangible outcomes. These exercises can be very beneficial for a company. In the general case there is no measure for the impact of a strategy, and some have thought of introducing a suitable quantitative model to deal with that. And it is a good idea. Except it is not, as it usually does not work.

Nov 18, 2015

From Start up to F**k up - What I Have Learned

I wonder when and why startup businesses became such a big thing. Maybe with easy access to high tech and available funding, more people got convinced they could develop their own product, kill the market, and buy the Hollywood life. However, despite all the books and gurus on the matter, starting a company from scratch is a bit more difficult than installing an app on your phone. I am not a doctor, but you can trust me on that - a year and a half ago I co-founded a company and went through almost the full spectrum of problems and emotions that come with it. The company is focused on retail analytical services and on providing a BI solution that addresses the specific challenges in that industry. I have recently ended my tenure there and have the time to reflect and share some of the lessons I learned. After all, isn't it a crime if you are in a startup and the world does not know about it?

Oct 1, 2014

Some Great Data Science and Big Data Links


The analytical hub Data Science Central did extensive research on the sites and blogs liked or mentioned among their member base. The result is a comprehensive list of the best data science sources. Please find the list here.

If you are looking to expand your list of regularly visited blogs check out the similar list of 50 Data Science and Statistics Blogs Worth Reading. There are some true gems.

In case you need some large datasets to sharpen your skills, or for any other purpose, go to 20 Big Data Repositories You Should Check Out.

Of course, you can always find some original content and random analytical-related thoughts here.

Sep 24, 2014

Used Car Market in Bulgaria - Where is The Data?


A couple of weeks ago I shared some difficulties that come with forecasting the new car market. This market is interesting for marketers for obvious reasons, but it forms the smaller portion of the total car market. The second-hand car market got my interest, and I looked around for some data. I focused my curiosity on my home country and wanted to put some popular myths against hard data. It turned out that there was virtually no data at all, and I had to do some digging for details.