Mar 20, 2014

The Do’s and Don’ts of Data Mining

Checklists and business-intelligence-style "Ten Things You Need to Know" articles on huge, general topics tend to have dubious value or impact. This is probably because we read them quickly, squeezed between other tasks, and because we all feel we already know everything in them. Sometimes, however, it is really worth pausing to reflect on some of the points. KDnuggets recently published an article about the do’s and don’ts of data mining that is a fine example of this. As you might expect, a practitioner will find a lot of "I know that" there, and often it is hard-won knowledge. Still, it is worth stopping to think about these points for a while, because under daily hurry and pressing deadlines we tend to fall into a routine and veer off the best course of action.

The do's, with my two cents on each, are as follows:

Do plan for data to be messy.
Please do that! Data quality tends to get filed under "we'll deal with that if we find a problem". There is always a problem with the data, and not planning for it is a serious mistake.
Do create a clearly-defined, measurable objective for every project.
This is often overlooked, even though every PM has produced tons of paper stating goals, objectives and so on. Time and again the real objective of a DM project is obscured by too much managerial lingo, when it could be stated clearly as a set of questions or business problems we need to tackle.

Do ask questions.
I could not agree more! A closer look at, and an intimate understanding of, the problem at hand and its context, as well as the data, is what makes a real difference. Just analysing the data and reading articles about the problem is not good enough; one needs to get down and dirty with the data and understand the context and the problem. Our image as guru-level consultants will not suffer if we keep asking questions.

Do simplify the solution to increase your chances of success.
The question of complexity is widely misunderstood. In analytics, forecasting and data mining, a simple solution often performs comparably to a complex one. Simplicity brings lower costs, robustness, traceability and other benefits, but the main factors to consider are probably the business objective and the real value of the model outputs and their accuracy.

Do cross-check data coming out of the ETL process with the original values, and with project stakeholders.
This is a good one. Sometimes we trust our ability to manipulate vast amounts of data so much that we tend not to question the outcomes of ETL. The probability of a mistake grows exponentially with the size of the data and the number of transformations applied to it, and it grows further the less we know about the business domain and the specific sources.
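One way to make this cross-check routine is to compare a few cheap summary figures between the source extract and the ETL output. The sketch below is only an illustration of the idea: the record layout (`customer_id`, `amount`), the sample rows and the function name `reconcile` are all hypothetical and do not come from the original article.

```python
from math import isclose


def reconcile(source_rows, etl_rows, key="customer_id", amount="amount"):
    """Compare row counts, key sets and a control total between two datasets.

    Both inputs are lists of dicts; the column names are illustrative.
    Returns a dict describing every discrepancy found (empty if none).
    """
    issues = {}

    # 1. Row counts should match unless the ETL is meant to filter rows.
    if len(source_rows) != len(etl_rows):
        issues["row_count"] = (len(source_rows), len(etl_rows))

    # 2. Every key in the source should survive, and no new keys should appear.
    source_keys = {row[key] for row in source_rows}
    etl_keys = {row[key] for row in etl_rows}
    if source_keys - etl_keys:
        issues["missing_in_etl"] = sorted(source_keys - etl_keys)
    if etl_keys - source_keys:
        issues["unexpected_in_etl"] = sorted(etl_keys - source_keys)

    # 3. A control total catches truncated, duplicated or rescaled values.
    source_total = sum(row[amount] for row in source_rows)
    etl_total = sum(row[amount] for row in etl_rows)
    if not isclose(source_total, etl_total, abs_tol=1e-6):
        issues["control_total"] = (source_total, etl_total)

    return issues


# Made-up source extract.
source = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 250.5},
    {"customer_id": 3, "amount": 75.0},
]
# Simulated ETL output that silently dropped customer 3.
etl = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 250.5},
]

print(reconcile(source, etl))
```

Automated checks like these only cover the "original values" half of the advice. The figures still need to be shown to project stakeholders, who can tell whether a total that reconciles is also plausible for the business.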
Do use more than one technique/algorithm.
This is not very common, as each method takes time, which is a luxury, but for models with high-impact outcomes it is a must. Data mining is more of an experimental science, and there is no recipe for every problem, which makes testing different approaches a must.

Do be informed.
I do not believe a DM practitioner underestimates the peril of not following the advances in methods and software, but, as always, it is a matter of time. Younger, single practitioners have plenty of time outside business hours, but being "married with children" changes this dramatically, and then updating skills and staying informed requires planning and consideration.

The don'ts of the article, with my short comments on each, are as follows:
Do not ever underestimate the power of good data preparation.
I would rephrase Steve Ballmer's crazy "Developers, developers, developers..." moment as "Data preparation, data preparation, data preparation...". I would skip the sweatshirt and the eyes of a madman, though. Data is the foundation of data mining and its importance cannot be overstated.
Don’t use the default model accuracy metric.
We tend to use the metrics that we learned at university or the ones that software packages promote. A better grasp of model quality and behavior comes from a better understanding of accuracy measures and their applicability to the specific case.
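A small, made-up example shows why the default metric can mislead. Assume a dataset of 1,000 cases in which only 10 are positive (say, fraud cases). The numbers and both "models" below are invented for illustration; the metric definitions are the standard ones.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from true and predicted labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision and recall are undefined when their denominator is zero;
    # reporting 0.0 in that case is a convention chosen for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# 10 positives followed by 990 negatives (a 1% base rate).
y_true = [1] * 10 + [0] * 990

# "Model" A never flags anything.
always_negative = [0] * 1000

# "Model" B catches 8 of the 10 positives at the cost of 30 false alarms.
flags_some = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960

metrics_a = summarize(y_true, always_negative)
metrics_b = summarize(y_true, flags_some)
print("A:", metrics_a)
print("B:", metrics_b)
```

Model A wins on accuracy (99% against 96.8%) while finding none of the positive cases, and model B finds 80% of them. Which one is "better" depends on the cost of a missed case versus a false alarm, which is a business question rather than a software default.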

Don't forget to document all modeling steps and underlying data!
I would add the assumptions to the documentation, as they are key in an analytical project of any kind. Detailed documentation of every piece of data and every modeling step may look like too much work, but try not doing it properly in a project that lasts more than a few weeks!

Don't overfit...with Big Data it is easy to find patterns even in random data.
Overfitting is easily confused with, and disguised as, higher model accuracy, and accuracy is what the customer wants to see. It is also very easy to do, so it has to be on the "alert" list. The problem is cured by proper testing and validation design.
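To see how overfitting masquerades as accuracy, here is a minimal sketch with purely random data, so by construction there is nothing to learn. The "model" is a hand-written 1-nearest-neighbour classifier, picked only because it memorizes its training set; the dataset sizes and the seed are arbitrary choices for the illustration.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest-neighbour)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def accuracy(train_x, train_y, xs, ys):
    """Share of points in (xs, ys) that the memorizing model labels correctly."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(xs, ys)
    )
    return hits / len(ys)


rng = random.Random(42)  # fixed seed so the run is repeatable


def random_dataset(n, n_features=5):
    """Random features and random 0/1 labels: the labels carry no signal."""
    xs = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


train_x, train_y = random_dataset(300)
test_x, test_y = random_dataset(300)

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

The training accuracy is perfect because each training point is its own nearest neighbour, while the held-out accuracy hovers around chance level. Only the second number says anything about how the model will behave on new data, which is why the validation design matters more than the headline figure.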

Do not just collect a pile of data and “toss it into the big data mining engine” to see what comes out.
A big problem this is, the data mining Yoda would say, and it is one of my favorites. There are whole series of articles and books about spurious correlations as "perks" of vast amounts of data. We have to be very careful here if we do not want to explain to a CEO why the phase of the Moon affects server downtime.
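The Moon-phase effect is easy to reproduce. The sketch below generates a random target (think of it as monthly downtime) and 1,000 equally random candidate features, then looks for the strongest Pearson correlation. Everything here is synthetic noise; the sizes, the seed and the 0.5 threshold are arbitrary choices for the illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs = 20          # few observations ...
n_features = 1000   # ... and many candidate explanations

# Target and features are independent Gaussian noise.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
features = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_features)]

correlations = [pearson(feature, target) for feature in features]
strongest = max(correlations, key=abs)
n_strong = sum(abs(r) > 0.5 for r in correlations)

print(f"strongest correlation with the target: {strongest:+.2f}")
print(f"features with |r| > 0.5: {n_strong} of {n_features}")
```

With this many candidates and so few observations, some features correlate noticeably with the target by chance alone. Tossing everything into the engine and reporting the top hits will reliably "discover" such relationships, so a hypothesis stated up front and a check on fresh data are the practical defenses.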

Do not ascribe them mystical powers and wrongly think “it’s all about the algorithms”.
It is all about the people who apply the algorithms and their understanding of the problem. Human intelligence is the key factor here, not the software outputs.

Do not underestimate the power of a simpler-to-understand solution that is slightly less accurate.
This is related to a point made earlier. A model can have high business value despite its simplicity or a relatively large error.
Do not blindly trust assumptions made to satisfy frequency statistics, as well as p-values and AIC.

You can now read the original article, with a detailed description of each topic, here.
