Skip to main content
Blogs

Impala Performance Tuning – Best Practices

By May 15, 2022May 30th, 2022No Comments
sql Edgematics

The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations. This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously.

Recommendations for these configurations include:

  • Ensuring that the timestamps between Hive and Impala match, set the below two startup flags to true

-use_local_tz_for_unix_timestamp_conversions startup flag

-convert_legacy_hive_parquet_utc_timestamps startup flag

  • Always setting the –idle_session_timeout, –idle_query_timeout timeouts for the Impala daemon (impalad).
  • Ensuring that the setting for idle_session_timout is less than the setting for the timeout set for your load balancer.
  • Making the maximum number of concurrent client connections allowed.

Set the –fe_service_threads startup option for the Impala daemon (impalad) to 256.

Improving the metadata loading performance

Increase the –num_metadata_loading_threads startup option to 64.

Detecting & Avoiding block skews –

The most convenient way to detect a block skew or a “slow-host” issue is to compare the Avg Time to the Max Time in the execution summary section of the query plan.

For each phase of the query, there is an Avg Time and a Max Time value, along with #Hosts, which indicates how many hosts are involved in that phase of the query. For all query phases involving more than one host, look for cases where the maximum time is substantially greater than the average time.

 

Slow hosts can be caused by a variety of reasons. To avoid slow hosts:

  • Ensure that all Impala nodes have the same configuration.
  • Ensure that other workloads, which are not run by Impala, are evenly distributed by setting the yarn.scheduler.fair.assignmultiple property to false in the yarn-site.xml.
  • Ensure that HDFS data is balanced by running the HDFS balancer utility during the off-peak hours and then run the INVALIDATE METADATA on Impala metadata after you perform the rebalancing to refresh the metadata in the Hive metastore.
  • Ensure that the HDFS block sizes are optimal. For example, use 256MB blocks and use snappy compression in a splittable block, which is preferable to GZip.

Avoiding small files –

To reduce the amount of memory used by the Catalog for metadata, avoid creating many small files in HDFS. Small files in HDFS can be caused by either having partitions that are too granular or by performing data ingestion too frequently. Cloudera recommends that you regularly compact small files. In Hive, you can compact small files with the following SQL commands:

SET hive.merge.mapfiles = true;

SET hive.merge.mapredfiles = true;

SET hive.merge.size.per.task = 256000000;

SET hive.merge.smallfiles.avgsize = 134217728;

SET hive.exec.compress.output = true;

SET parquet.compression = snappy;

INSERT OVERWRITE TABLE db_name.table_name SELECT * FROM db_name.table_name;

Run <Refresh Table> in impala after the Hive job finishes.

Appropriate file formats –

For BI queries, the Parquet file format performs best because of its combination of columnar storage layout, compression, and encoding. The default setting for COMPRESSION_CODEC is snappy compression, but GZip compression is also supported.

Impala also supports reading ORC file formats from version 2.12 and onwards, however, expect query performance with ORC tables to be slower than it is with Parquet tables.

Text formats can be used when all columns are retrieved from a table. However, because compression on text is lower, HDFS I/O could be longer than when you use the Parquet file format.

Partitioning granularity recommendations –

  • Choose a partitioning strategy that ensures there is at least 256 MB of data in each partition.
  • Over-partitioning causes query planning to take longer than necessary because Impala prunes the unnecessary partitions, which results in small files in each partition.
  • Cloudera recommends that you keep the number of partitions in tables under 30,000.
  • Always use integer data types for partition key columns.
  • Partition key values are turned into HDFS directory names so you can minimize memory usage by using numeric values for common partition key fields such as YEAR, MONTH, and DAY.
  • Use the smallest integer data type that holds the appropriate range of values. Typically, TINYINT for MONTH and DAY, and SMALLINT for YEAR. Use the EXTRACT() function to pull out individual date and time fields from a TIMESTAMP value, and CAST() the return value to the appropriate integer data type.

Join query performance tuning –

Join queries are tuned by using the join order and using the appropriate type of join.

Join order:

  • Make sure that table and column statistics are collected on joined columns. This allows Impala to create the most optimal join plan.
  • If table or column statistics are not available, join the largest table first. If table or column statistics are not available for some tables in a join, Impala reorders the tables. Impala places tables with statistics on the left side of the join order in descending order of cost, based on overall size and cardinality. Tables without statistics are treated as “zero-size,” which means they are always placed on the right side of the join order.

Join types:

  • Broadcast joins are the default join type. In this join type the right-hand table is considered to be smaller than the left-hand table, and its contents are sent to all the other nodes involved in the query.
  • Partitioned joins are more suitable for large tables of roughly equal size. With this technique, portions of each table are sent to other appropriate nodes where those subsets of rows can be processed in parallel.
  • The choice of broadcast or partitioned joins also depends on statistics being available for all tables in the join.
  • Join types can be set manually using the following query options:

SET DEFAULT_JOIN_DISTRIBUTION_MODE=shuffle;

SET DEFAULT_JOIN_DISTRIBUTION_MODE=broadcast;

This article describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously.

References:

https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_performance.html

About The Author

ABDUR RASHID ALEEM |  Big Data & Solutions Engagement Architect
Abdur Rashid, has 18 years of experience in the IT industry. He is working with us as a Solutions Architect for Cloudera & AWS projects. He has worked extensively in Big Data, Cloudera, Hortonworks. He has Designed Architected several Cloudera, AWS projects for multiple clients from Minnesota, UK, Australia, South Africa & Middle East and has shown accountability for our multiple clients and proved his technical skills. Abdur is certified from Cloudera, Amazon (AWS) for his skills in Professional Solutions Architect. He loves to coach and train the team for new skills in his area.

Image: Edgematics Technologies LLC.