May 9, 2014

Big Data Deja Vu?

I was running a series of machine learning algorithms on a few huge files the other day in search of some meaningful information. I was enjoying all the fun that comes with large volumes of data - painfully long processing times, slow response to any data operation, and a loud laptop cooler, to mention a few. As I was optimizing memory usage and calculation time I had a déjà vu about a time long ago, in a lab far, far away. Back then I was calculating a big set of parameters from hundreds of physics experiments. The PCs had less computing and storage power than an entry-level smartphone, and all the data operations and calculations had to be performed in a clever way in order to get something meaningful in your lifetime. Back then nobody talked about Big Data, probably because quite often the data simply was big. Of course, the ability to collect large volumes of data was galaxies away from the powers we have today, but still, there were many domains that accumulated large volumes of data.

It got me thinking. Going even further back made me realize that large data sets have been with us since the beginning of the computer era. Big Data is defined in many ways (see Defining Big Data), but if we adopt the simplest definition we see that it has always been around. It seems our ability to generate data is always one step ahead of our ability to process all of it.


It may seem far-fetched, but think about it. The problems with computing back in the 80s and 90s are remarkably similar to those related to Big Data these days. If we peel off the layers of salesman talk, the problems with large data sets boil down to:
- Long times for manipulating stored data (save, load, update);
- Long times for extracting and processing the data.
("Long" is defined differently for each application, but in general, time is in shorter and shorter supply and in greater and greater demand these days.)
Overcoming these problems required careful work with memory, estimating execution time, and selecting good calculation methods. Back then, all of these problems were eventually solved by a more powerful processor and larger, faster storage - fortunately, new and better models were coming out every six months! The scale and speed of data today are so great that radically new technologies are required. It does not come as a surprise that the core of modern Big Data technologies was invented a long time ago and is now simply being revived and updated.
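To make the "careful work with memory" part concrete, here is a minimal sketch (my own illustration, not the actual workflow from the lab or from my laptop session) of one old trick that still works: stream a file too large to fit in memory and keep only running aggregates. The file name "measurements.csv" and its "value" column are made up for the example.

    import csv

    # Stream a huge CSV one row at a time and keep only running totals,
    # so memory use stays constant no matter how large the file is.
    # The file name and column name below are assumptions for this sketch.
    def running_mean(path, column="value"):
        count, total = 0, 0.0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row[column])
                count += 1
        return total / count if count else float("nan")

    print(running_mean("measurements.csv"))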

History repeats itself, and Big Data will be just data sooner than we think. We are at the peak of the hype curve, and shortly we will stop hearing much about Big Data and probably stop being very careful about memory usage and processing time again. At least for a while. Until we again hit the ceiling of our ability to deal with the data we generate. Then will come the time of what? Mega Data?
