Apache Hive Overview
Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad hoc querying, and analysis of large datasets. It projects structure onto the data stored in Hadoop and queries that data using a SQL-like language called HiveQL (HQL).
Tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases comprise tables, which in turn are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
Within any particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
Hive supports all the common primitive data types, such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.
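As a sketch, a table definition might combine primitive and complex types like this (the table and column names are illustrative):

```sql
-- Hypothetical table mixing primitive and complex types
CREATE TABLE employees (
  name    STRING,
  salary  DECIMAL(10,2),
  skills  ARRAY<STRING>,                                      -- complex type: array
  phones  MAP<STRING, STRING>,                                -- complex type: map
  address STRUCT<street:STRING, city:STRING, zip:STRING>      -- complex type: struct
);
```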
Hive Alignment with SQL
Hive offers semantics similar to RDBMS tools.
Hive Query Process
The steps called out in the above process are coordinated by HiveServer2, but in reality, all data processing happens on the worker nodes in the cluster.
Hive SQL is converted to MapReduce jobs and run on Hadoop. The SQL is optimized, just as in an RDBMS, for best performance.
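You can inspect the plan Hive produces for a query with the EXPLAIN statement, which shows the stages (MapReduce or Tez tasks) Hive will run. The table name here is illustrative:

```sql
-- Show the execution plan Hive generates for a query
EXPLAIN
SELECT state, COUNT(*) AS cnt
FROM customers
GROUP BY state;
```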
Query Submission Tools
Use familiar command-line and SQL GUI tools just as with “normal” RDBMS technologies.
HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. The older Hive CLI is now deprecated in favor of Beeline, because it lacks the multi-user, security, and other capabilities of HiveServer2. Beeline is started with the JDBC URL of the HiveServer2 instance, which depends on the address and port where HiveServer2 was started. By default this is localhost:10000, so the URL will look like jdbc:hive2://localhost:10000.
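A typical Beeline invocation against the default address looks like this (the username is illustrative; adjust the host, port, and credentials for your cluster):

```shell
# Connect Beeline to a HiveServer2 instance using its JDBC URL
beeline -u jdbc:hive2://localhost:10000 -n hiveuser
```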
Open source SQL tools used to query Hive include:
Ambari Hive View
Ambari includes a built-in set of Views that are pre-deployed for you to use with your cluster. The Hive View is designed to help you author, execute, understand, and debug Hive queries.
From the Ambari Hive View you can:
- Browse databases
- Write and execute queries
- Manage query execution jobs and history
Performing Queries in Hive
Hive still thinks in MapReduce terms – even with Tez.
Defining a Hive-Managed Table
The main difference between Hive and SQL table creation is the specification of how the table is stored, such as each row's field delimiter. Because the table is managed by Hive, if the table definition is dropped, the underlying data is also lost.
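A minimal sketch of a Hive-managed table definition, with the storage clauses that distinguish it from standard SQL DDL (table and column names are illustrative):

```sql
-- Hive-managed table; the ROW FORMAT clause tells Hive how fields are delimited
CREATE TABLE customers (
  id    INT,
  name  STRING,
  state STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```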
Defining an External Table
An external table uses existing Hadoop data; it makes existing data look like a relational table. If the table definition is dropped, the underlying data will remain.
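A sketch of an external table definition (names are illustrative); the EXTERNAL keyword is the only change from the managed case, and dropping this table leaves the underlying files intact:

```sql
-- External table over data that already exists in Hadoop
CREATE EXTERNAL TABLE web_logs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
```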
Defining a Table LOCATION
If a LOCATION is not supplied, the table's data will reside in the Hive warehouse directory in HDFS (by default, /user/hive/warehouse).
The LOCATION clause is most commonly used with EXTERNAL tables, but it can be used with other table types as well.
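A sketch of LOCATION in use, pointing an external table at an existing HDFS directory (the path and names are illustrative):

```sql
-- External table mapped onto an existing HDFS directory via LOCATION
CREATE EXTERNAL TABLE web_logs_archive (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/archive';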
Loading Data into Hive
Data can be loaded into Hive tables from external files or from other tables. The table definition (FIELDS TERMINATED BY ',') allows proper identification of each column.
The OVERWRITE clause replaces all existing table data instead of appending to it.
The last example shows the population of a table with the results of a SELECT statement using data from an existing table.
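The loading options described above can be sketched as follows (file paths and table names are illustrative):

```sql
-- Load a file into a table, appending to any existing data
LOAD DATA INPATH '/data/customers.csv' INTO TABLE customers;

-- OVERWRITE replaces all existing table data instead of appending
LOAD DATA INPATH '/data/customers.csv' OVERWRITE INTO TABLE customers;

-- Populate a table with the results of a SELECT on an existing table
INSERT OVERWRITE TABLE customers_ca
SELECT * FROM customers WHERE state = 'CA';
```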
A pair of CREATE TABLE and INSERT statements can be combined into one Create Table As Select statement.
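A sketch of the combined form (names are illustrative):

```sql
-- Create Table As Select: define and populate the table in one statement
CREATE TABLE customers_ca AS
SELECT * FROM customers WHERE state = 'CA';
```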
Performing a Query
Hive SELECT statements are just like those in SQL.
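For example, a familiar aggregate query works as it would in any SQL database (names are illustrative):

```sql
-- Standard SQL-style query expressed in HiveQL
SELECT state, COUNT(*) AS num_customers
FROM customers
GROUP BY state
ORDER BY num_customers DESC
LIMIT 10;
```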
Hive Views
A view is a “synthetic” table. It is essentially a predefined SELECT statement, referenced as a table.
Hive views are not "materialized"; that is, the view's SELECT statement is executed each time the view is accessed, so views always return current data.
Defining a Hive View
A view is defined using the CREATE VIEW statement.
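A minimal sketch (table, column, and view names are illustrative):

```sql
-- Define a view over an existing table; no data is stored for the view itself
CREATE VIEW ca_customers AS
SELECT id, name FROM customers WHERE state = 'CA';

-- The view is then referenced like any table
SELECT * FROM ca_customers;
```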
You can run the DESCRIBE command on a view to see its schema, and the view also appears in the output of SHOW TABLES.
When you access a view, Hive will parse, optimize, and execute the query just as if a physical table were accessed, substituting the view's definition into the query plan in place of the table name. Hive determines the best way to convert the command into one or more MapReduce jobs at runtime.
Both Stinger and Tez have increased the performance of Hive queries.
The Stinger Initiative enables Hive to support an even broader range of use cases at truly Big Data scale, bringing it beyond its batch roots to support interactive queries, all with a common SQL access layer.
Tez improves the MapReduce paradigm by dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data.
Hive has long been the de facto standard for SQL in Hadoop, and these advances will surely accelerate the production deployment of Hive across a much wider array of scenarios.
Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.
Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.
Stinger.next is a continuation of the Stinger initiative, focused on further enhancing the speed, scale, and breadth of SQL support to enable truly real-time access in Hive, while also adding support for transactional capabilities. As with the original Stinger initiative, the work will be delivered in three phases by the open Apache Hive community.
- Speed: Deliver sub-second query response times.
- Scale: The only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes.
- SQL: Enable transactions and SQL:2011 Analytics for Hive.
- Hive, a data warehouse system for Hadoop, uses the familiar table and SQL metaphors similar to classic RDBMS solutions
- Hive can create, populate and query tables
- Views are supported, but not materialized
- Significant performance improvements have surfaced from the Stinger initiative including the use of the ORC file format and Tez as the execution engine