May 14, 2013

How Different Is Big Data from Just Data

One of the latest buzzwords is "big data". Success stories about it are abundant, and many people have suddenly become big data experts and "data scientists". But where is this big data, and how different is it from the data we know?

There are a few definitions of what big data is, and of course hardly anyone has checked them; most people rely on their own "embedded" understanding of it. For clarity, please check the Wikipedia definition.

It should not come as a surprise that big data is very far from most of us. We could be subjects of it, but few are actual users. Big data is collected by big organizations - some governments, government agencies, scientific organizations, a few companies and some websites. Outside of this narrow group, big data is just data. The tools and the methods to deal with it are abundant and well known. Sometimes the data is large and is referred to as "big". Well, an Excel sheet with 1,000,000 rows of data is not, strictly speaking, big data even if it looks, well, big. It is just data that is not properly stored. Low performance in storing and processing the data is a signal that the employed tools and hardware are not suitable, not that the data is that big.

Big data is just data, but in much higher doses. It seems to be different because it imposes much higher requirements for storage, access and processing. It all comes down to its sheer size and the requirement to get results in a reasonable time. The SQL queries and procedures, the hardware and the pieces of software have to be tuned, optimized or specifically built for dealing with huge data in order to produce results in a reasonable time. For example, the same code that produces a covariance matrix out of 10,000 records would do it for 1,000,000 records as well, only it would take much longer.
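As a minimal sketch of that point, here is a hypothetical timing comparison in Python with NumPy. The record counts and the number of variables are illustrative choices, not measurements from any real system; the point is only that the method stays the same while the run time grows with the data.

```python
import time
import numpy as np

def covariance_timing(n_records, n_variables=20, seed=0):
    """Time the covariance matrix of a random table with n_records rows."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_records, n_variables))  # stand-in for a real table
    start = time.perf_counter()
    cov = np.cov(data, rowvar=False)                  # same call regardless of size
    elapsed = time.perf_counter() - start
    return cov.shape, elapsed

# The method does not change with size; only the run time does.
for n in (10_000, 1_000_000):
    shape, seconds = covariance_timing(n)
    print(f"{n:>9} records -> covariance matrix {shape}, {seconds:.3f} s")
```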

The methods to transform big data into information are not much different either. They deal with a much finer granularity, more dimensions and a greater number of possible interconnections, but these are the same methods employed on "regular" data, not some mystical ones unknown to an experienced analyst.

The rise of big data has two related components. One is data availability and the other is cheap computing power and storage. It is not a new type of data with some new set of properties - it is the same data that has been of interest for ages; we just got a lot of it. So the story of big data seems to be more a story of data availability. However, I would not change the name - "big data" sits much better in headlines and presentations.

However, due to its nature, big data calls for better analytical processes. That includes a good definition of the questions to answer, sound statistical methodology, proper methods for building hypotheses and common sense in interpreting the results. The bigger the data, the bigger the mistakes we could make, as Mr Taleb points out and explains in his article for Wired. Among the reasons for that are the abundance of spurious relations and the cherry picking of results. Check the article for more.
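To illustrate the spurious relations point, here is a small Python sketch assuming purely random data with many columns. The sizes and the threshold are arbitrary; the only point is that "interesting" correlations show up by chance once enough pairs of variables are examined.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 1_000, 200                 # arbitrary sizes, just for illustration
data = rng.normal(size=(n_rows, n_cols))    # pure noise, no real relationships

corr = np.corrcoef(data, rowvar=False)      # correlation of every column pair
upper = corr[np.triu_indices(n_cols, k=1)]  # unique pairs only

# With ~20,000 pairs of noise columns, some correlations look "interesting".
threshold = 0.1
print(f"pairs tested: {upper.size}")
print(f"pairs with |r| > {threshold}: {np.sum(np.abs(upper) > threshold)}")
print(f"largest spurious correlation: {np.max(np.abs(upper)):.3f}")
```

Cherry picking the strongest of those correlations and reporting it as a finding is exactly the mistake that gets easier as the data grows.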

Another danger coming from big data is that it is seen as holding the answer to many questions, which makes us overconfident in the outcome. We could be blinded by its abundance, and this could lead us to work in the wrong direction and arrive at wrong conclusions, because we could fail to see causes that come from somewhere else, not described in the data at hand. This fallacy applies to any data, of course, but it is much more pronounced with big data. I see the solution in adhering to good analytical practices, common sense and more critical thinking throughout the process.

Big data is just rich and extensive data. It requires a stronger focus on the way we store, extract and process it, while the analytical methods we apply are almost the same, and I think it is not a big step for an analyst or an organization to move from normal data to the big one.
