Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.
In a 2001 research report and related conference presentations, then META Group (nowGartner) analyst, Doug Laney, defined data growth challenges (and opportunities) as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources). Gartner continues to use this model for describing big data.
Whether through blogs, twitter, or technical articles, you’ve probably heard about Big Data, and a recognition that organizations need to look beyond the traditional databases to achieve the most cost effective storage and processing of extremely large data sets, unstructured data, and/or data that comes in too fast. As the prevalence and importance of such data increases, many organizations are looking at how to leverage technologies such as those in the Apache Hadoop ecosystem. Recognizing one size doesn’t fit all, we began detailing our approach to Big Data at the PASS Summit last October. Microsoft’s goal for Big Data is to provide insights to all users from structured or unstructured data of any size. While very scalable, accommodating, and powerful, most Big Data solutions based on Hadoop require highly trained staff to deploy and manage. In addition, the benefits are limited to few highly technical users who are as comfortable programming their requirements as they are using advanced statistical techniques to extract value. For those of us who have been around the BI industry for a few years, this may sound similar to the early 90s where the benefits of our field were limited to a few within the corporation through the Executive Information Systems.
Analysis on Hadoop for Everyone
Enabling end users to merge data stored in a Hadoop deployment with data from other systems or with their own personal data is a natural next step. In fact, we also introduced Hive ODBC driver, currently in Community Technology Preview, at the PASS Summit in October. This driver allows connectivity to Apache Hive, which in turn facilitates querying and managing large datasets residing in distributed storage by exposing them as a data warehouse.