Use-cases/Project
7 min read

Enabling Hive on Spark on CDH 5.14 — a few problems (and solutions)

1YkseCzHNQ9Sxsi4BHnoCOQ

Recently I’ve had an opportunity to configure CDH 5.14 Hadoop cluster of one of GetInData’s customers to make it possible to use Hive on Spark — execute Hive queries using Spark engine to vastly improve performance over standard MapReduce execution. While this operation proven to be tricky due to non-standard configuration of the cluster, the problems reported later by the client are worth sharing as well.

Coexistence with Spark 2

CDH 5.x provides Spark out of the box, but it belongs to the older, no longer actively developed 1.x family. As many developers want to be able to benefit from all the improvements which come with Spark 2, CDS (Cloudera’s Distribution of Spark) is often chosen as an easy way to have the newer version installed on the Cloudera cluster. Instead of overriding Spark 1.x, it provides a new Spark 2 service with its own executables, for example, spark2-shell and pyspark2, which work as Spark 2 equivalent to CDH-provided spark-shell and pyspark, respectively.

Our client’s cluster was configured in a different way, though, as the default Spark version was changed to 2.x using alternatives mechanism available in Linux, just like described in Cloudera documentation. This change made Spark 1 effectively unavailable - both its executables and libraries became symlinks to their Spark 2 counterparts. As a result, Hive on Spark refused to run, as in CDH 5.x it can only work with Spark 1.x.

One of the possible approaches to this problem was to rollback the change of default Spark version. We wanted to avoid that for a number of reasons, though, with the most important one being a risk of breaking existing applications in the production environment. However, we were able to find a way to make Spark 1.x libraries available again — just for Hive, without changing behavior of any other applications.

Even with executables and configuration directories of Spark being just symlinks to Spark 2, the configuration files and libraries of the older version still exist on the disk. For example, Spark 1.x configuration files are stored in /etc/spark/conf.cloudera.spark_on_yarn directory (and that’s what /etc/spark/conf points at under normal circumstances). It turned out that just setting SPARK_CONF_DIR environment variable to Spark 1.x configuration directory before running Hive CLI is enough to make Hive use Spark 1.x. Then, switching Hive execution engine to Spark just worked like a charm!

While it was now possible to run Hive CLI with Spark 1.x libraries loaded, there were still two problems to solve. First, it required setting environment variables manually before starting Hive CLI. Second, it didn’t work for Beeline, as it’s just a thin JDBC client connecting to HiveServer 2. The first one was solved by setting the following property in Hive configuration in Cloudera Manager:

Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh:
SPARK_CONF_DIR=/etc/spark/conf.cloudera.spark_on_yarn/

After redeploying client configuration, SPARK_CONF_DIR variable is always set to the correct value. The second problem was solved in a similar way, by setting the same variable, but for HiveServer2:

HiveServer2 Environment Advanced Configuration Snippet (Safety Valve): 
SPARK_CONF_DIR=/etc/spark/conf.cloudera.spark_on_yarn/

When the change was applied and HiveServer2 was restarted, Spark execution engine became available via JDBC. Our initial testing confirmed that it works just fine, but later that day we received several issue reports from the client.

FileNotFoundException thrown for some queries

First problem noticed was with Spark being unable to execute some of the queries. When Spark was set as an execution engine for a Hive session, it was throwing the following error:

Error while compiling statement: FAILED: SemanticException java.io.FileNotFoundException: File hdfs://nameservice1/user/hive/warehouse/test.db/test_table/test_column=201709 does not exist.

The same queries were working just fine when executed on MapReduce.

A thorough investigation led us to the cause of the problem. Before executing the query, Spark checks if the partitions of queried tables exist in HDFS. In this case, there were multiple partitions present in Hive Metastore which were missing from HDFS:

hive> show partitions test.test_table;
OK
test_column=201709
test_column=201710
test_column=201711
test_column=201712
test_column=201801

# hdfs dfs -ls /user/hive/warehouse/test.db/test_table
Found 2 items
drwxrwxrwt   - hive hive          0 2018-11-28 16:57 /user/hive/warehouse/test.db/test_table/test_column=201712
drwxrwxrwt   - hive hive          0 2018-11-28 16:57 /user/hive/warehouse/test.db/test_table/test_column=201801

Probably their directories were just deleted by someone when the data was not needed anymore. While it has no consequences for MapReduce engine, which simply ignores nonexistent partitions, Spark turned out to be more strict about this. To correct the issue and prevent it from happening in the future, partitions should be deleted through Hive, using ALTER TABLE table_name DROP PARTITION (partition_specification) statement.

Random query failures

The next problem report was about some queries randomly failing with the following error message:

ERROR : Failed to monitor Job[ 0] with exception 'java.lang.IllegalStateException(RPC channel is closed.)'

(...)

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

When executed again, they completed just fine. Failures seemed to be random and not related to the specific query or table.

While HiveServer logs did not reveal any useful information about the problem, a quick look at the ResourceManager UI made it clear that the Spark application was in a failed state. Diagnostic logs made the reason of failure obvious:

Application application_1528230206654_23647 failed 1 times due to AM Container for appattempt_1528230206654_23647_000001 exited with exitCode: -104

(...)

Diagnostics: Container [pid=31224,containerID=container_e92_1528230206654_23647_01_000001] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 3.2 GB of 2.1 GB virtual memory used. Killing container.

(...)

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.

Spark driver was apparently getting killed due to lack of memory. The solution was to increase the memory allocation for the driver. This can be set globally, in Hive on Spark configuration in Cloudera Manager, or per session. As the driver memory consists of heap and off-heap memory (Cloudera recommends off-heap to be set to 10–15% of total driver memory), the following example makes driver use 4GB:

set spark.driver.memory=3645m;
set spark.yarn.driver.memoryOverhead=450;

If you are interested in Hive on Spark tuning, you can find more information in Cloudera documentation.

Only one query at a time can be executed in Hue

The last reported problem involved Hue and running multiple queries as the same user, but in a different web browser tabs. When one query was running in the one tab and then the other one was executed in another tab, the first query failed with the following exception:

ERROR : Failed to monitor Job[ 1] with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException)'
org.apache.hadoop.hive.ql.metadata.HiveException: java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException
at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getSparkJobInfo(RemoteSparkJobStatus.java:153)
at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getState(RemoteSparkJobStatus.java:82)

(...)

ERROR : Failed to monitor Job[ 1] with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException)'

(...)

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

By default, Hue spawns just one Hive session per user and single session is able to handle only one Spark query. Trying to execute multiple queries in one session leads to failures with error messages like shown above. However, per-user parallelism can be easily increased by setting max_number_of_sessions parameter in Hue configuration file.

A lesson learned from this project: there are many dependencies between Hadoop ecosystem components which might not be visible at a first glance, but sometimes they break things supposed to work out of the box.

big data
hadoop
hive
spark
Cloudera
5 February 2019

Want more? Check our articles

datagenerationobszar roboczy 1 4
Tutorial

Data online generation for event stream processing

In a lot of business cases that we solve at Getindata when working with our clients, we need to analyze sessions: a series of related events of actors…

Read more
bloggcpobszar roboczy 1 4
Tutorial

Data isolation in tenant architecture on the Google Cloud Platform (GCP)

Multi-tenant architecture, also known as multi-tenancy, is a software architecture in which a single instance of software runs on a server and serves…

Read more
radiodataquantum
Radio DaTa Podcast

Data Journey with Yetunde Dada & Ivan Danov (QuantumBlack) – Kedro (an open-source MLOps framework) – introduction, benefits, use-cases, data & insights used for its development

In this episode of the RadioData Podcast, Adam Kawa talks with Yetunde Dada & Ivan Danov  about QuantumBlack, Kedro, trends in the MLOps landscape e.g…

Read more
radiodatawilla
Radio DaTa Podcast

Data Journey with Arunabh Singh (Willa) – Building robust ML & Analytics capability very early with FinTech, skills & competencies for data scientists with ML/AI predictions for the next decades.

In this episode of the RadioData Podcast, Adama Kawa talks with Arunabh Singh about Willa use cases (​ FinTech): the most important ML models…

Read more
play casestudy
Success Stories

Play Case: Migrating a Petabyte Hadoop Cluster to Kubernetes Using Open Source. Cloud-agnostic and Fully Open Source Data Platform.

Can you build a data platform that offers scalability, reliability, ease of management and deployment that will allow multiple teams to work in…

Read more
getindata big data blog apache spark iceberg
Tutorial

Apache Spark with Apache Iceberg - a way to boost your data pipeline performance and safety

SQL language was invented in 1970 and has powered databases for decades. It allows you not only to query the data, but also to modify it easily on the…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.


What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy