Mar 10, 2014

Defining Big Data

Would you be surprised if I told you there is no single, broadly accepted definition of "big data"? Would you be surprised if I told you 99.99% of people using the term have never even tried to find a definition for it? I bet you wouldn't. I know some people who use "big data" for spreadsheets with more than a thousand rows. Others use it every time they talk about anything related to data without knowing what they are talking about, which actually seems to be the common case. Relying on gut feeling can be misleading. Two computer science students attempted to catalog all the definitions out there. You can read the article at arXiv:1309.5821.

The definitions include:
  • Gartner Group: The “Four V’s” definition: volume, velocity, variety, and veracity (Gartner's original formulation named the first three; veracity is a later addition).
  • Oracle: The derivation of value from traditional relational database-driven business decision-making, augmented with new sources of unstructured data such as blogs, social media, sensor networks, and image data.
  • Intel: Organizations generating a median of 300 terabytes of data weekly. This includes business transactions stored in relational databases, documents, e-mail, sensor data, blogs, and social media.
  • Microsoft: The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.
  • The application definition (arrived at by analyzing the Google Trends results for “big data”): Large volumes of unstructured and/or highly variable data that require the use of several different analysis tools and methods, including text mining, natural language processing, statistical programming, machine learning, and information visualization.
  • The Method for an Integrated Knowledge Environment (MIKE2.0) definition: A high degree of permutation and interaction within a dataset, rather than the size of the dataset. “Big Data can be very small, and not all large datasets are Big.”
  • NIST: Data that exceeds the capacity or capability of current or conventional [analytic] methods and systems.
The online magazine Information Management adds to the list something much simpler and easier to understand: "More data than you're used to--some people deal with petabytes and it's easy, but if you're a small practice, just your own data is more data than you're used to."
