Apache Hive vs MySQL - What are the key differences?

If you’ve read “A Beginner’s Guide to
Hadoop
,”
you have an idea of what Hadoop is and how it works. But Hadoop is really
just the foundation for a big data platform – there are a number of
tools that work with Hadoop to enhance and build upon the core platform.
One such technology is Apache Hive.

A project of the Apache Software Foundation,
Apache Hive is a query engine that acts as an interface
into Hadoop MapReduce (among other execution engines like Tez). With hive it is possible to perform data analysis on large datasets via SQL,
or rather HiveQL, which is very similar. A primary purpose of Apache Hive is to allow users to structure and
query data to obtain useful analytics.

MySQL, for its part, is an open-source relational database platform. MySQL forms the database foundation of
a LAMP (Linux-Apache-MySQL-PHP) stack for web and application
development, but can also be very useful as a database for analytics.

Hive’s Basic Architecture

As an overview, the major architectural components of Hive include the
following:

  • Hadoop Distributed File System (HDFS): Hive uses the Hadoop HDFS
    system for storage.
  • MapReduce / YARN: As part of the Hadoop umbrella of projects, Hive uses MapReduce for
    parallel processing functions. It can also use Tez
  • Hive Driver / Compiler: The Hive Driver executes HiveQL queries which are
    parsed by the compiler to generate an execution plan by
    examining both the query blocks and expressions. The query is also compared
    against the metadata from the metastore to make sure it is valid. The resulting execution plan
    is executed on the Hadoop cluster via MapReduce or Tez.
  • Hive Thrift Server: Clients access Hive via the Hive Thrift
    Server, which allows any JDBC or ODBC-compliant application to
    access Hive. Along with programming language bindings for languages like PHP and Python a
    command line interface is also available.
  • Hive Metastore: The metastore contains information about the
    partitions and tables in the warehouse, data necessary to perform
    read and write functions, and HDFS file and data locations.

You can find a full explanation of the Hive architecture on the Apache
Wiki
.

Hive vs. MySQL

While each tool performs a similar general action, retrieving data, each
does it in a very different way. Whereas Hive is intended as a
convenience/interface for querying data stored in HDFS, MySQL is
intended for online operations requiring many reads and writes.

One good example of this difference in action is in forming table schemas. Hive uses a method of querying data known as “schema on
read,” which allows a user to redefine tables to match the data without touching the data.
Hive has serialization and deserialization adapters to let the user do this, so it isn’t intended for online tasks requiring heavy read/write
traffic. On the flip side, MySQL utilizes “schema on write”, which means you define table schemas before
you can add data to the store. This allows MySQL to store it in an optimal way for fast reading and writing. These differing
approaches are a good example of how these two technologies differ. You can read more about schema on read vs schema on write on the
Marklogic
blog
.

When to Use Hive

  • If you have large (think terabytes/petabytes) datasets to query: Hive is
    designed specifically for analytics on large datasets and works well
    for a range of complex queries. Hive is the most approachable way to quickly (relatively)
    query and inspect datasets already stored in Hadoop.
  • If extensibility is important: Hive has a range of user function APIs that can be used to build
    custom behavior in to the query engine. Check out my guide to Hive functions
    if you’d like to learn more.

When to Use MySQL

  • If performance is key: If you need to pull data frequently and
    quickly, such as to support an application that uses online
    analytical processing (OLAP), MySQL performs much better. Hive isn’t
    designed to be an online transactional platform, and thus performs
    much more slowly than MySQL.
  • If your datasets are relatively small (gigabytes): Hive works very well in
    large datasets, but MySQL performs much better with
    smaller datasets and can be optimized in a range of ways.
  • If you need to update and modify a large number of records
    frequently:
    MySQL does this kind of activity all day long. Hive,
    on the other hand, doesn’t really do this well (or at
    all, depending). And if you need an interactive experience, use
    MySQL.

Through this summary of the differences between Hive and MySQL, I hope
I’ve helped provide some direction on which platform to use in different
applications and environments. For additional points of comparison,
check out this post on the Hadoop Tutorial website.

Matthew Rathbone's Picture

Matthew Rathbone

CEO of Beekeeper Data. British. Data Nerd. Lucky husband and father. More about me

Need More Hadoop Reading?

I've collected a list of the top Hadoop books on the market

Join the discussion

comments powered by Disqus