Dan Gardev on Analytics, Data and Everything: The Reference Data

Reference data is list of country names, department names, equipment type, products, customers, etc. Setting it right is a key for a good and flexible report or analysis. For example you would not like to have an analysis of sales per country with USA appearing there as USA and as United States. The problem is not acute when all the data comes from one source. However, do not underestimate the possibility for duplicate names and do the necessary to check them - Exlce has really nice tool for that. The problem is much more acute when analysis pulls data from few data sources. I would like to share few simple steps to deal with that and save time.

Make list of all reference data from the different sources.
Process each list to remove duplicates
Check the level of granularity and create a reference master data list. The master list has the data as you want it to appear in the analysis. For example, a source could report by each European country while in other there is simply total Europe figure. The report could be at the highest existing level of granularity - in this case - Europe. Of course, master list could be the reference data as used in one of the sources.
Create a data dictionary. This dictionary is the relation between the master list and the ones in the sources.
Enrich the reference data in sources with the corresponding values from the dictionary (if possible). I find it is a good practice as it makes very easy to perform any operations on the source data - sum, average, look-up, etc.
Use the master list to create the analysis and formulas to refer to the sources

Once the master list is set up it is flexible and could easily be updated to include any new piece of data or requirement. As in the example with the countries, it would be easy to prepare the analysis per group of countries in Europe if the total Europe figure is replaced by figures for DACH, IIG, etc. The dictionary has to be updated with the relations between a country name and the group it belongs to. All the formulas that extract or aggregate the source data as well as the ones enriching the sources would remain the same and the analysis has be extended to include the new reference data easily.

I have used this approach for some time now for reports and models that need regular update of data and requirements and with saved some frequently used dictionaries it significantly reduces the data processing and setup time.

Dan Gardev on Analytics, Data and Everything

Apr 2, 2013

The Reference Data - A Perfect Mess No More

No comments:

Post a Comment