Sep 23, 2013

Excel on the Edge of Reason

Every now and then clients throw requests that challenge the usual way we do things. One challenge that comes up not that rarely is an Excel workbook working as a database. A recent request included a spreadsheet with over a thousand rows and about 250 columns, with an extensive set of extracts and charts based on it. The picture was completed by the prospect of growth in both the data size and the analytics on it. The task challenged the usual methods for working with spreadsheet data and imposed a deeper and more careful selection of techniques. I would like to share some experience with such challenges.

A suitable solution
A thousand rows and 250 columns of data is not a typical task for Excel. It is not impossible if it is a matter of mere data storage and a small number of simple operations. However, the presence of a huge selection of conditional data extracts and aggregations on them changes the methods to apply, as reasonable performance is a must for any solution. Large data and good performance usually call for a database with a BI tool on top of it, as that provides the scalability, development speed and flexibility a project like this needs. There is a good set of free or cheap databases and BI tools that could be used.

... that was dismissed
However, there are various reasons for not selecting the obvious solution - the manager in charge may not be familiar with these tools, the IT department may object, policy may restrict them, there may be no time to adopt new skills, etc. Also, if only one tool is data and analytics intensive and all the others are not, it is probably worth the extra effort and inconvenience of applying Excel to the task.

The usual methods are not that good
The first thing that comes to mind when finding data on a spreadsheet is mentioned is usually the VLOOKUP function. Unfortunately, lookup and index formulas over large ranges of data result in slow spreadsheets and some unexpected side effects - like messages popping up such as "Excel cannot complete the task with available resources" or "Not enough resources to display completely". Some users will not have the latest and greatest equipment, and this will practically prevent them from using the tool.
Implementing a good level of dynamics on the spreadsheet is also a challenge. When I first read the requirements for the project, I was thinking about using some fancy-consultant array formulas to tackle the requested dynamics and impress the customer. However, array formulas are the last thing to use for processing large data - they are slow, resource intensive and difficult to trace and debug.
A large number of columns is also a challenge: navigating through 250 columns is time consuming, there is a higher chance of making an error when referencing them in formulas, and it is plainly boring.

What does work
A good portion of the problems coming with large data in Excel can be solved by using tables (available in Excel 2007 and later). Data access is much faster and lighter on resources. The structured references that come with tables are very helpful for writing and tracing formulas, to say the least. A table also removes the hassle of dynamic formula ranges, as its ranges adjust automatically when new rows of data are added.
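To illustrate (the table and column names here are invented for the example): with the data formatted as a table named Sales, formulas reference columns by name and automatically cover new rows:

```
Total over the whole column:
=SUM(Sales[Amount])

Conditional total - no manual range updates needed as rows are added:
=SUMIF(Sales[Region], "East", Sales[Amount])
```

Compare that with maintaining a reference like B2:B1048 by hand every time the data grows.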
In some cases the data can be split into parts and/or columns turned into rows, reducing the size of the ranges the formulas work on.
For extracting data I found it very useful to use an INDEX formula on a column together with a MATCH on the specific criteria instead of lookups. It is faster and easier on resources.
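As a sketch (again with invented table and column names): instead of a VLOOKUP scanning the whole block, MATCH finds the row position once on the key column and INDEX pulls the value from just the one column needed:

```
Instead of:
=VLOOKUP("Acme", Sales[#All], 37, FALSE)

Use:
=INDEX(Sales[Amount], MATCH("Acme", Sales[Customer], 0))
```

The INDEX/MATCH pair also survives column insertions, while the hard-coded column number 37 in the VLOOKUP would silently go wrong.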
Complex searches and aggregations on a set of criteria are better organized by introducing additional filter columns - one for each specific criterion, plus a summarizing one that flags rows meeting all criteria, for easier reference.
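A hypothetical sketch of such filter columns, added to the same illustrative Sales table, with one 0/1 flag per criterion and a summary column multiplying the flags:

```
[IsEast]   =--([@Region]="East")
[Is2013]   =--(YEAR([@Date])=2013)
[AllMet]   =[@IsEast]*[@Is2013]

Aggregations then reference only the single summary column:
=SUMPRODUCT(Sales[AllMet], Sales[Amount])
```

Each criterion is checked exactly once per row and can be inspected on the sheet, which makes the logic far easier to trace than one long nested condition.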
The problem of dynamic searches, usually solved by array formulas, is better addressed with a set of simpler operations: e.g. use INDEX/MATCH to extract the subset of data, simple formulas to perform some operations on it, and then an array formula only if necessary. It may seem like more work and more calculations to perform, but it pays back in transparency, traceability and speed.
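For example (names invented again), rather than one Ctrl+Shift+Enter array formula such as {=MAX(IF(Sales[Region]="East", Sales[Amount]))}, a helper column can do the conditional step row by row and a plain formula the aggregation:

```
Helper column in the table:
[EastAmount]  =IF([@Region]="East", [@Amount], "")

Plain aggregation, no array entry needed:
=MAX(Sales[EastAmount])
```

MAX simply ignores the text values in the non-matching rows, and every intermediate result is visible on the sheet for debugging.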
Some more specific tasks benefit from being performed on a smaller subset of the data. For purposes like that it can be a good idea to extract a subset of the data onto a separate sheet with a macro and use it for the specific task. The smaller data size gives more freedom in the methods and techniques to apply, without the restrictions of machine resources.
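A minimal macro sketch for such an extract - the sheet names, field number and criterion are all assumptions for the example - could use AutoFilter and copy the visible rows to a scratch sheet:

```vba
' Copy rows matching one criterion to a scratch sheet for further work.
' Sheet names, the filtered field and the criterion are illustrative.
Sub ExtractSubset()
    Dim src As Worksheet, dst As Worksheet
    Set src = Worksheets("Data")
    Set dst = Worksheets("Extract")

    dst.Cells.Clear
    ' Filter the data block on its second column for the value "East"
    src.Range("A1").CurrentRegion.AutoFilter Field:=2, Criteria1:="East"
    ' Copy only the visible (matching) rows, headers included
    src.Range("A1").CurrentRegion.SpecialCells(xlCellTypeVisible).Copy _
        Destination:=dst.Range("A1")
    src.AutoFilterMode = False
End Sub
```

All the heavy formulas can then point at the small Extract sheet instead of the full data block.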

In conclusion
Excel is not particularly good at storing and processing large data. However, sometimes there is no other option, and we had better be prepared for the challenges.
