Oct 29, 2013
The Almighty Linear Regression
Linear regression is easy to understand and even easier to apply: popular spreadsheet tools such as Excel and OpenOffice deliver results with a few simple mouse operations. The simple idea behind it and the simple math needed to implement it set a very low bar for the knowledge and skill required to apply it. The method is also taught in practically every college and university statistics course, which gives it a sort of universal acceptance. No matter who the recipient of an analysis is, there is a very high chance that she will understand it. Linear regression seems to be a very good answer to a good portion of the problems businesses face, where actionable results do not demand great accuracy and it is safe enough to consider only a few driving factors. Another good reason for the pervasiveness of this method is that a good portion of its results go no further than an inconsequential PowerPoint presentation or a spreadsheet.
It is not a panacea, though. Setting aside the problems common to all statistical methods, the most serious drawback is encoded in its name: "linear". As very few relations in nature and society are linear, its proper applications are limited. An additional restriction comes from the ever-changing drivers behind processes and states. For example, the drivers of current market growth are likely to change within 5 years, and that automatically renders dubious a 10-year outlook produced by a linear regression. The reasoning usually goes "what would it be if nothing changes", and that is perfectly fine. However, proper application calls for keeping in mind that things do change. Neglecting this produces a lot of borderline idiotic statements, made most frequently by the media, such as that the population of Japan will consist entirely of retirees in 50 years (I am speculating with this particular example to illustrate the way of thinking). It will not, and that is for sure. The limits within which linear regression applies, and the confidence level of its results, should be considered carefully.
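To make the extrapolation problem concrete, here is a minimal sketch in Python with NumPy. The logistic curve and all of its parameters are assumptions made up for illustration, not real market data: they stand in for a process whose growth slows down as it approaches a ceiling. A straight line is fitted to the early years only and then pushed far beyond them.

```python
import numpy as np

def logistic(t, cap=100.0, rate=0.8, midpoint=5.0):
    """Hypothetical saturating process: growth slows as it nears `cap`."""
    return cap / (1.0 + np.exp(-rate * (t - midpoint)))

# Observe years 0..5 only, while growth has not yet started to slow.
t_obs = np.arange(0, 6, dtype=float)
y_obs = logistic(t_obs)

# Ordinary least-squares straight line through the observed window.
slope, intercept = np.polyfit(t_obs, y_obs, deg=1)

def linear_forecast(t):
    """The 'what would it be if nothing changes' projection."""
    return slope * t + intercept

# Compare the projection with the process itself at longer horizons.
for year in (5, 10, 20):
    print(f"year {year:2d}: linear={linear_forecast(year):7.1f}  "
          f"actual={logistic(year):7.1f}")
```

Within the observed window the line looks like a reasonable summary. Far enough outside it, the projection passes the ceiling that the process itself can never exceed, which is the same kind of error as the all-retirees forecast.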
Another problem with linear regression is the selection of the drivers. Simply searching for the most influential factors among many candidates is not the best course of action in the common case. The amount of available data makes the probability of spurious correlations very high, and the regression could end up with factors that are odd at best. A nice story can be told about any set of drivers, of course, but its usefulness is doubtful. For example, GDP often comes out as a significant driver in regressions on economic data, and this is OK, but how useful is that? Everything is related to GDP one way or another, and in my view that makes it useless.
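The spurious correlation risk is easy to reproduce. The sketch below, in Python with NumPy, is an illustration under made-up assumptions: the target and every candidate driver are independent random noise, and the sizes (20 observations, 1000 candidates) are arbitrary. Screening the candidates for the highest correlation still turns up a "driver" that looks convincing.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_obs = 20           # e.g. 20 yearly observations (assumed)
n_candidates = 1000  # number of candidate drivers screened (assumed)

# Target and candidates are pure noise: no real relation exists.
target = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_candidates, n_obs))

def pearson_with_target(x, y):
    """Pearson correlation of every row of `x` with the vector `y`."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    return (xc @ yc) / (np.linalg.norm(xc, axis=1) * np.linalg.norm(yc))

r = pearson_with_target(candidates, target)

# "Search for the most influential factor" among the candidates.
best = int(np.argmax(np.abs(r)))
print(f"best candidate: #{best}, correlation = {r[best]:+.2f}")
print(f"typical |correlation| (median): {np.median(np.abs(r)):.2f}")
```

The winning candidate is selected precisely because it happened to line up with the target by chance, so its correlation says nothing about a real relationship. Checking the chosen driver on data that was not used for the selection is one common way to expose this.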
No doubt, linear regression is a great tool. And like any tool, it should be applied with careful consideration of the purpose, the range and the target system or process. I hope this post opens a discussion on this topic.