Let us discuss the factors driving the explosion of Big Data.
What is Big Data? What makes data “big”, and how does this explosion drive technology and opportunities in today’s enterprise?
3 Big Data Vs Drive Apache Hadoop
Companies like Cloudera and Hortonworks are not driving the case for Hadoop; data Volume, Velocity, and Variety are driving the need for Hadoop. Hadoop is the industry-recognized backplane for Connected Data Platforms.
Next up? 1,000 exabytes = 1 zettabyte
What Makes Big Data Big?
Hadoop does not just work on data; it was specifically designed to work on Big Data.
What makes data big? Where did the phrase Big Data come from and what does it mean? The term Big Data came from the computational sciences. Specifically, it is used to describe scenarios where the volume and variety of data types overwhelm the existing tools to store and process it.
In 2001, the industry analyst Doug Laney described Big Data using the three V’s of volume, velocity, and variety.
NOTE: Instead of “unstructured”, we now tend to talk in terms of “schema on read”.
Volume refers to the amount of data being generated. Think in terms of gigabytes, terabytes, and petabytes. Many systems and applications are just not able to store, let alone ingest or process, that much data.
Many factors contribute to the increase in data volume including transaction-based data stored for years, unstructured data streaming in from social media, and the ever-increasing amounts of sensor and machine data being produced and collected.
Volume issues include:
- Storage cost
- Filtering and finding relevant and valuable information in large quantities of data, much of which is not valuable
- The ability to analyze data quickly enough in order to maximize business value today and not just next quarter or next year
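To make the volume units mentioned above concrete (gigabytes, terabytes, petabytes, and beyond), here is a minimal Python sketch of the conversions. It assumes decimal (SI) prefixes, where each step up is a factor of 1,000. The names UNITS, to_bytes, and convert are illustrative only and are not part of any Hadoop API.

```python
# Decimal (SI) byte units; each step up is a factor of 1,000.
# Assumption: SI prefixes, not binary (1,024-based) prefixes.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value from one byte unit to another."""
    return to_bytes(value, from_unit) / 1000 ** UNITS.index(to_unit)


print(convert(1000, "EB", "ZB"))  # 1.0: 1,000 exabytes = 1 zettabyte
print(convert(1, "PB", "TB"))     # 1000.0: 1 petabyte = 1,000 terabytes
```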
Velocity refers to the rate at which new data is created. Think in terms of megabytes per second and gigabytes per second.
Data is streaming in at unprecedented speed and must be dealt with in a timely manner in order to extract maximum value from the data. Sources of this data include logs, social media, RFID tags, sensors, and smart metering.
Velocity issues include:
- Not reacting quickly enough to benefit from the data
For example, data could be used to create a dashboard that could warn of imminent failure or a security breach – failure to react in time could lead to service outages
- Data flows tend to be highly inconsistent, with daily, seasonal, or event-triggered changes in peak loads
For example, a change in political leadership could cause a spike in social media activity
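Velocity and volume are linked: a sustained ingest rate quickly adds up to a large stored volume. The following Python sketch shows the arithmetic, assuming decimal units and a constant rate, which real data flows rarely have. The function name daily_volume_tb and the 100 MB/s figure are hypothetical examples.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_mb_per_s):
    """Terabytes accumulated in one day at a constant rate in MB/s.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


# Hypothetical stream arriving at a steady 100 MB/s.
print(daily_volume_tb(100))
```

At that hypothetical rate, a single stream would accumulate several terabytes every day, which is why ingest speed and storage capacity have to be planned together.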
Variety refers to the number of types of data being generated. Data can be gathered from databases, XML or JSON files, text documents, email, video, audio, stock ticker data, and financial transactions.
Varieties of data include:
- Unstructured (schema on read)
Variety issues include:
- How to gather, link, match, cleanse, and transform data across systems
- How to connect and correlate data relationships and hierarchies in order to extract business value from the data
The Information Explosion
The world’s data used to double every century; now it doubles every two years. This explosion is driven by the Internet of Things (IoT), by mobile devices, and by our ability to generate more digital content than ever before.
The digital universe will grow from 4 zettabytes of data in 2013 to 44 zettabytes in 2020.
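As a quick consistency check, the growth figures quoted above (4 to 44 zettabytes between 2013 and 2020) can be turned into an implied doubling time. This is a rough sketch that assumes steady exponential growth over the whole period; the function name doubling_time_years is illustrative.

```python
import math


def doubling_time_years(start, end, years):
    """Years needed for the quantity to double.

    Assumes steady exponential growth from start to end over the period.
    """
    return years * math.log(2) / math.log(end / start)


# Figures from the text: 4 ZB in 2013, 44 ZB in 2020 (a 7-year span).
print(round(doubling_time_years(4, 44, 7), 2))
```

The result comes out at roughly two years, which agrees with the claim that the world's data now doubles every two years.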
Existing data architectures make data inaccessible, incomplete, irrelevant, and expensive.
This is the largest business innovation cycle in history, and these changes threaten existing data strategies. Many companies have big plans for big data, but existing data architectures make our data inaccessible, incomplete, irrelevant, and expensive. As data streams in at accelerating rates, the cost to store, reformat, and retrieve it grows more quickly than the value it may provide.
We know that big data holds big value, but we also know that we are at risk of being left behind if our competitors capture that value before we do.
Apache™ Hadoop® transforms business, making Big Data easily accessible for advanced analytic applications.
Companies of every size are using new big data opportunities to transform their businesses and the lives of their customers.
Hadoop can help:
- Pharmaceutical manufacturers make better vaccines that save lives
- Doctors prescribe treatments based on data from all previous patients
- Car insurance companies keep drivers safe
- Mobile providers reduce call center wait times
There isn’t a single organization that could not benefit from better insight into its data, but most are unable to store or even make use of all the data they have.
What is Apache Hadoop:
So what is Apache Hadoop?
It is a scalable, fault-tolerant, open-source framework for the distributed storage and processing of large sets of data on commodity hardware.
But what does all that mean?
Hadoop clusters can range from as few as one machine to literally thousands of machines.
Hadoop services become fault tolerant through redundancy. For example, the Hadoop Distributed File System, called HDFS, by default replicates each data block to three separate machines, assuming that your cluster has at least three machines in it. Many other Hadoop services are replicated, too, in order to avoid any single points of failure.
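The cost and benefit of this replication can be sketched with simple arithmetic. The Python below is a simplified model, not HDFS code: it assumes every block gets the same number of copies, each on a different machine, and it ignores real placement policies. All function names are illustrative.

```python
def replicas_per_block(replication_factor, machines):
    """Copies of each block that can sit on separate machines.

    Simplified model: one copy per machine, so a cluster smaller than
    the replication factor cannot hold the full number of copies.
    """
    return min(replication_factor, machines)


def raw_storage_tb(logical_tb, replication_factor=3, machines=3):
    """Raw disk space consumed when every block is replicated."""
    return logical_tb * replicas_per_block(replication_factor, machines)


def tolerated_machine_failures(replication_factor=3, machines=3):
    """Machines that can fail before some block loses its last copy."""
    return replicas_per_block(replication_factor, machines) - 1


# Hypothetical cluster: 100 TB of data, replication factor 3, 10 machines.
print(raw_storage_tb(100, 3, 10))          # 300
print(tolerated_machine_failures(3, 10))   # 2
```

In this model, threefold replication triples the raw storage requirement in exchange for surviving the loss of any two machines.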
Hadoop development is a community effort governed by the licensing of the Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software bugs, or improving performance and scalability.
Distributed Storage and Processing
Large datasets are automatically split into smaller chunks, called blocks, and distributed across the cluster machines. Not only that, but each machine processes its local block of data. This means that processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of memory.
All of this occurs on commodity hardware, which reduces the original purchase price and potentially reduces support costs as well.
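The idea of splitting a dataset into blocks and letting each machine process its own blocks can be illustrated with a toy Python sketch. It runs in a single process and only imitates the pattern: the tiny 16-byte block size, the round-robin placement, and the node names are assumptions made for illustration (real HDFS blocks are far larger, 128 MB by default in recent releases, and real placement is more sophisticated).

```python
from collections import defaultdict


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, machines):
    """Assign blocks to machines round-robin (a simplification)."""
    placement = defaultdict(list)
    for index, block in enumerate(blocks):
        placement[machines[index % len(machines)]].append(block)
    return placement


def count_newlines_locally(placement):
    """Each machine counts newlines in its own blocks only."""
    return {
        machine: sum(block.count(b"\n") for block in blocks)
        for machine, blocks in placement.items()
    }


# Toy dataset: 40 lines, 230 bytes in total.
data = b"alpha\nbeta\ngamma\ndelta\n" * 10
blocks = split_into_blocks(data, block_size=16)
placement = place_blocks(blocks, ["node1", "node2", "node3"])

# Local work on each machine, then a cheap final combine step.
partial_counts = count_newlines_locally(placement)
total = sum(partial_counts.values())
print(len(blocks), total)  # 15 40
```

Counting newlines was chosen because the result does not depend on where a block boundary falls, so the partial counts can simply be added together. Operations that span block boundaries, such as counting whole words, need extra care in a real system.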
Hadoop Core = Storage + Compute
At the most granular level, Hadoop is an engine that provides storage via HDFS and compute via YARN.
Open Enterprise Hadoop:
Open Enterprise Hadoop is a new paradigm that scales with the demands of big data applications. It is supported by a rich and growing partner ecosystem that enables enterprises to meet the unique demands of their industries. By making governance, security, and operations an integral part of the platform, Open Enterprise Hadoop opens the door for integration with existing enterprise architectures.
All of this is possible because Open Enterprise Hadoop maximizes community innovation by collaborating with developers in open source and within an open community environment.