As a student in Data Science, one of the most hyped terms I hear is “Big Data.” Nearly as ubiquitous as cloud computing references, it’s a term that gets thrown around a lot—vaguely understood, poorly defined, and potentially elusive. Broadly, one can categorize Big Data based on the three V’s:
- Volume: the sheer quantity of data
- Variety: the range of data types and sources: structured data from relational databases, semi-structured or unstructured data from social media, streams of sensor data from gadgets and mobile devices, and so on
- Velocity: the speed of data, with constantly changing data and data sources demanding quick response times
To be truly classified as “big” data, data must meet all three of these conditions. However, I’ve seen the term used much more broadly than this. So when I was offered the opportunity to attend NYC Data Week, run in conjunction with the Strata Conference/Hadoop World, I suspected that at last, all would become clear.
Boy, was I wrong.
The combination of Strata/Hadoop and NYC Data Week was an interesting exercise in the multiple interpretations of “Big Data.” The technologists’ presentations focused heavily on the technology components used in “Big Data” scenarios: data storage and data analytics. Each presentation seemed to prioritize one or two of the three V’s.
Big Data vs. Raw Data
For example, Amy O’Connor, Senior Director of Big Data at Nokia, spoke about the velocity of data requests and the variety of data provided by sensors in devices.
Jason Wisdom of Greenplum focused more on the volume of data and on how big data tools arose as a response to the challenge of making traditional databases more scalable. Roger Baga, SQL Server Engineering at Microsoft, focused on the ways that velocity impacts volume. All of these presentations thus stayed true to the three V’s of Big Data and focused on the tools needed to tackle the problem.
But the panels co-sponsored by the City of New York’s Department of Information Technology & Telecommunications (DoITT) brought into stark contrast the difference between the IT world’s understanding of the term and the business world’s understanding of it. At a closing panel titled Data Innovation Across the City (DIAC), seven panelists from different industries spoke about their understanding of Big Data and how they were attempting to leverage it. With representatives from government, beauty, e-commerce, media, development, biotech, and digital industries, the theme here was not Big Data as the three V’s, but Big Data as a Big Problem.
These panelists spoke of the more “traditional” data problems with which information management students are generally quite familiar. These include uncoordinated data gathering efforts, wildly different data needs and terminology across lines of business, uncertainty as to how existing trusted data can be leveraged, and the struggle to get IT and business to work together to solve these problems. As Amy Shriber of NBC Universal noted, “I don’t have a big data problem; I have a raw data problem.” This sentiment was often echoed across the panel.
This difference was brought into even starker relief at a talk given at Syracuse University on November 5, 2012 by Jennifer Gibbs, Director of InfoSphere Master Data Management Development at IBM, supported by both the Data Science program and the Global Enterprise Technology (GET) Speaker Series.
Master Data Management (MDM) is, in essence, what most of the DIAC panelists were actually talking about: the need to be able to see information across business units and to devise ways to leverage this information productively. MDM addresses this concern by engaging both IT and the business to solve it. It uses software and business/IT collaboration to pull scattered key information together into a common master data solution.
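To make that idea concrete, here is a toy sketch of the kind of consolidation MDM performs: merging duplicate records from different lines of business into a single “golden record.” The field names and the simple match-on-email rule are my own illustrative assumptions; real MDM platforms use far richer matching and survivorship rules.

```python
# Toy MDM-style "golden record" consolidation (hypothetical fields and
# match rule, for illustration only).

def consolidate(records, key="email"):
    """Merge records sharing a normalized key, keeping the first
    non-empty value seen for each field."""
    master = {}
    for rec in records:
        k = rec[key].strip().lower()   # normalize the match key
        merged = master.setdefault(k, {})
        for field, value in rec.items():
            # Survivorship rule: first non-empty value wins
            if value and not merged.get(field):
                merged[field] = value
    return list(master.values())

# The same customer, captured by two lines of business with gaps
crm    = {"email": "Ann@Example.com", "name": "Ann Lee", "phone": ""}
orders = {"email": "ann@example.com", "name": "",        "phone": "555-0100"}

golden = consolidate([crm, orders])
# One master record combining the complete fields from both systems
```

The point of the sketch is the collaboration it implies: the matching key and the survivorship rule ("first non-empty value wins") are business decisions, not purely technical ones, which is exactly why MDM requires IT and the business to work together.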
Comparing Big Data and MDM
MDM is not a new field, per se, but the concerns it addresses are still faced by many organizations around the world, as was evident in the DIAC panel. Their sentiments echoed a comment made by Rory Crosby of Heineken’s Global Business Services this summer during the GET Eurotech trip: that Heineken couldn’t begin to think about big data issues because it was still learning all the ways it could leverage its existing information.
Heineken has begun its own MDM strategy, and many members of the DIAC panel seem to be moving in that direction, though they generally call this Big Data instead of MDM. This seems to reflect a larger disconnect between IT’s use of the term and business’s use of the term.
What does all this mean for Big Data? Well, it first provides a warning: Big Data remains an elusive, vaguely defined concept. If you get into a serious discussion with someone about Big Data, be sure to ask them to clarify what they mean so you know whether you’re dealing with true Big Data or an MDM issue.
But more importantly, the hype around Big Data has made all types of organizations aware of the power of data—tracking it, storing it, consolidating it, analyzing it, and creating procedures to make that analysis actionable. It may mean that people mistakenly use the term Big Data for non-Big Data problems, but the fact that people from all industries are now seriously examining their data sources and their potential uses is a good thing. It can only increase the use of tools that allow for data-driven decision-making.
And for those of us irrepressible data geeks, this is indeed good news.