Apache Hive Overview

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad-hoc query, and analysis of large datasets. It projects structure onto the data in Hadoop and queries that data using a SQL-like language called HiveQL (HQL).

Tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data.

Within any particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.

Hive Alignment with SQL

Hive offers semantics similar to RDBMS tools.
Hive Query Process

The steps called out in the above process are maintained by HiveServer 2, but in reality, all data processing is happening on the worker nodes in the cluster.

Hive Query Process

Hive SQL is converted to Map/Reduce jobs and run on Hadoop. The SQL is optimized, just as in an RDBMS, for best performance.

Query Submission Tools

Query Submission Tools

Use familiar command-line and SQL GUI tools just as with “normal” RDBMS technologies.

Beeline

HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2. Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.

GUI Tools

Open source SQL tools used to query Hive include:

Ambari Hive View

Ambari Hive View

Ambari includes a built-in set of Views that are pre-deployed for you to use with your cluster. The Hive View is designed to help you, author, execute, understand, and debug Hive queries.

From the Ambari Hive View you can:

  • Browse databases
  • Write and execute queries
  • Manage query execution jobs and history

Performing Queries in Hive

Hive still thinks in MapReduce terms – even with Tez.

Performing Queries in Hive

Defining a Hive-Managed Table

Defining a Hive-Managed Table

The main difference between Hive and SQL table creation is the specification of how it is stored, such as each row’s field delimiter. If the table definition is dropped, the underlying data is lost.

Defining an External Table

Defining an External Table

An external data uses existing Hadoop data; it makes existing data look like a relational table. If the table definition is dropped, the underlying data will remain.

Defining a Table LOCATION

Defining a Table LOCATION

If a LOCATION is not supplied, the table’s data will reside in
/apps/hive/warehouse/db_name/table_name.
LOCATION commonly mapped to EXTERNAL tables but can be used for other types.

Loading Data into Hive

Loading Data into Hive

Data can be loaded into Hive tables from external files or from other tables. The table definition (FIELDS TERMINATED BY ‘,’) allows proper identification of each column.

The OVERWRITE clause replaces all existing table data instead of appending to it.
The last example shows the population of a table with the results of a SELECT statement using data from an existing table.

A pair of CREATE TABLE and INSERT statements can be combined into one Create Table As Select statement.

Performing a Query

Performing a Query
HIVE SELECT statements are just like those in SQL.

Ambari Hive Views

Ambari Hive Views

A view is a “synthetic” table. It is essentially a predefined SELECT statement, referenced as a table.

Hive views are not “materialized”, that is, they are created by running the SELECT action when they are accessed, so they always have current data. 

Defining a Hive View

Defining a Hive View

A view is defined using the CREATE VIEW statement.
You can run the DESCRIBE command on a view to see its schema, and also see it in the output of ‘show tables;
 “

Using Views

When you access a view, HIVE will parse, optimize and execute the query just as if a physical table accessed, substituting the view’s definition into the query plan in place of the table name. Hive will determine the best way to convert the above command into one or more MapReduce jobs at runtime.

Performance Improvements

Both Stinger and Tez have increased the performance of Hive queries.

The Stinger Initiative enables Hive to support an even broader range of use cases at truly Big Data scale: bringing it beyond its Batch roots to support interactive queries – all with a common SQL access layer.

Tez improves the MapReduce paradigm by dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data.

Hive has always been the defacto standard for SQL in Hadoop and these advances will surely accelerate the production deployment of Hive across a much wider array of scenarios.

Tez

apache tez

Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.

Stinger.next

Stinger.next is a continuation of the Stinger initiative focused on even further enhancing the speed, scale and breadth of SQL support to enable truly real-time access in Hive while also bringing support for transactional capabilities. Similar to original Stinger initiative, a 3-phase delivery schedule will take care of this. Open Apache Hive community will work towards it.

Stinger.next Goals

  • Speed: Deliver sub-second query response times.
  • Scale: The only SQL interface to Hadoop designed for queries that scale from Gigabytes to Terabytes and Petabytes.
  • SQL: Enable transactions and SQL:2011 Analytics for Hive.

Summary

  • Hive, a data warehouse system for Hadoop, uses the familiar table and SQL metaphors similar to classic RDBMS solutions
  • Hive can create, populate and query tables
  • Views supported, but not materialized
  • Significant performance improvements have surfaced from the Stinger initiative including the use of the ORC file format and Tez as the execution engine 

LEAVE A REPLY

Please enter your comment!
Please enter your name here