In the past couple of weeks I attended two events in Washington where the question of “Big Data” was circulating. The National Academies event was solely dedicated to Big Data, while at the Library of Congress Digital Preservation 2012 conference it was a somewhat peripheral consideration.
Big Data seems to have been ascribed volume, velocity, and variety from an IBM marketing junket, or rather a META (now a part of Gartner) report written by Doug Laney. For marketing lingo it's not all that bad; it's pretty good, actually. Volume = lots of data. Velocity = rapidly growing amount of data. Variety = heterogeneous formats, sizes, and confusing provenance trails. The triple works relatively well, but it does little to alleviate concerns about how to deal with Big Data; rather, it just enhances the disorientation that a nebulous concept like Big Data implies.
Lately, I have read the volume, velocity, variety triple over and over, turning it in my mind like a caveman with a shiny pebble, growing increasingly Hulk-like in my self-narrated loop: “Data BIG!”, “Data GROW FAST!”, “Data DIFFERENT!” Occasional frustration aside, I think we can all agree that the three V’s are certainly better than likening Big Data to the “new oil”. Thanks to my friend Jefferson Bailey for noticing that on the interwebs. I’m thinking Forbes is the likely culprit.
Let’s say we can agree on what size constitutes Big Data. Do we then need agreement on medium, small, tiny, and micro data? Would a division of size like this even be useful?
Patricia Cruse and her colleagues at the University of California Curation Center, California Digital Library think so. At Digital Preservation 2012, Trisha talked about a forthcoming service targeted at scientists called DataUp. Essentially, the service aims to help scientists prepare their tabular data (i.e., data held in Excel files) for archiving and sharing. Trisha showed a really cool graphic that highlighted the vast number of small datasets produced by scientists relative to the number of entities we characterize as Big Data. Hands down, there is a whole lot of small data being produced out there.
At first glance DataUp appears to eschew the Big Data conversation by focusing on small datasets. But the data contained in each little Excel file could eventually become part of a massive dataset. Heck, it could be reborn as Big Data. If this happened, the DataUp-cleaned data would be much easier to work with. Kudos to CDL. The rationale behind the project provides a useful case study in segmenting by target audience and dataset size, usefully relating the parts to the whole. Forward thinking.
But what about infrastructure? You know, the nuts, bolts, magnetic substrates, fiber optic cables, etc…
In a somewhat awkward moment at Digital Preservation 2012, David Rosenthal took a couple of members of the Big Data panel to task for what he judged to be insufficient attention paid to the problem of data storage. He basically said that all of the talk about the wonderful things that can be done with Big Data is rendered moot if we don’t dedicate enough research to more cost-effective storage solutions. Valid point. We can talk analytics potential galore, but that potential cannot be realized without storage infrastructure available at a cost an organization can feasibly bear.