How to Accelerate Hive UDFs by Caching Large Geo Data: A 140× Speed Boost

To dramatically improve Hive UDF performance when converting coordinates to administrative districts, this article compares two implementation strategies, details the technical challenges of repeatedly loading a 157 MB Geo data file, and presents a static‑cached solution that reduces query time from seconds to milliseconds, achieving roughly a 140‑fold speed increase.

Huolala Tech

Aug 4, 2020

1. Background

The mapping team needed a custom Hive UDF to convert a longitude/latitude coordinate pair to an administrative district. Example usage:

hive> select MatchDistrict("113.2222,24.33333", "formattedAddress")
-- returns: 中华人民共和国-广东省-肇庆市-四会市

2. Preliminary Research Plan

Two possible solutions were identified:

Provide a conversion service and let the custom UDF call this service.

Encapsulate the conversion method directly inside the UDF.

Comparison of the two schemes:

Scheme | Advantages | Disadvantages
Scheme 1 | Simple packaging, fast implementation, smaller UDF JAR. | Depends on an external service; a service outage makes the UDF unavailable; network overhead reduces performance.
Scheme 2 | High performance, no external dependencies. | Complex implementation; technical challenges in handling offline resources.

For a platform, stability and performance are paramount; therefore Scheme 2 was chosen.

3. Implementation Method

4. Technical Difficulties

The mapping team provides the conversion class GeoRTreeData.java, which requires loading an external resource, lbs_geo_data.json (157.4 MB). Loading this file locally takes about 5 seconds. If a query uses the UDF twice, total latency exceeds 10 seconds: each occurrence of the UDF in the query is initialized separately and reloads the resource, so the same file is parsed repeatedly for no benefit.

5. Final Effect

After optimization, the resource is loaded only on the first UDF call; subsequent calls reuse the cached data. The first query takes about 6 seconds, while later queries take roughly 70 ms, a performance improvement of about 140×, allowing 300 million rows to be processed in just over 30 seconds.
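The reported figures can be sanity-checked with simple arithmetic. The sketch below assumes that the 140× figure compares the cached latency (about 70 ms) with the uncached query that loads the file twice (just over 10 seconds, from Section 4). That baseline is an inference from the numbers, not something the article states outright. The class and method names are illustrative.

```java
final class SpeedupCheck {
    /** Ratio of uncached to cached latency, both in milliseconds. */
    static double speedup(double uncachedMs, double cachedMs) {
        return uncachedMs / cachedMs;
    }

    public static void main(String[] args) {
        double cachedMs = 70.0;         // later queries, from the article
        double firstQueryMs = 6_000.0;  // first query after optimization, from the article
        double twoLoadsMs = 10_000.0;   // assumed baseline: lower bound for two uncached loads

        System.out.printf("first query vs. cached query: %.0fx%n", speedup(firstQueryMs, cachedMs));
        System.out.printf("two uncached loads vs. cached: %.0fx%n", speedup(twoLoadsMs, cachedMs));
    }
}
```

Under these assumptions the ratios come out at about 86× and 143×, so the quoted 140× is consistent with the two-load baseline rather than with the 6-second first query.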

6. Solution

Define GeoRTreeData as a static object in the UDF class and load the resource lazily in initialize, only if it has not been loaded yet. The implementation is straightforward, though the underlying principle is more complex.

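import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
// GeoRTreeData is supplied by the mapping team; import it from its own package.

public class DistrictMatch extends GenericUDF {
  // Shared by every DistrictMatch instance in the same JVM, so the file is parsed once.
  public static volatile GeoRTreeData rTreeData;

  // Synchronized so that concurrent initialize() calls cannot load the file twice.
  private static synchronized void initRTreeData() {
    if (rTreeData != null) {
      return;
    }
    InputStream in = DistrictMatch.class.getClassLoader().getResourceAsStream("lbs_geo_data.json");
    if (in == null) {
      throw new IllegalStateException("lbs_geo_data.json not found on the classpath");
    }
    GeoRTreeData data = new GeoRTreeData();
    // Assumption: the JSON file is UTF-8 encoded (the original relied on the platform default).
    data.init(new InputStreamReader(in, StandardCharsets.UTF_8));
    rTreeData = data; // publish only after loading has finished
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentException("MatchDistrict needs exactly 2 parameters");
    }
    if (!arguments[0].getCategory().equals(ObjectInspector.Category.PRIMITIVE)) {
      throw new UDFArgumentException("expected String, but got: " + arguments[0].getTypeName());
    }
    if (!arguments[1].getCategory().equals(ObjectInspector.Category.PRIMITIVE)) {
      throw new UDFArgumentException("expected String, but got: " + arguments[1].getTypeName());
    }
    if (rTreeData == null) {
      initRTreeData(); // lazy load: only the first initialize() in this JVM pays the cost
    }
    return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
  }

  // evaluate(...) and getDisplayString(...) are required by GenericUDF but are not shown in the article.
}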
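The effect of the static cache can be reproduced without a Hive cluster. The following is a minimal sketch, not the team's code: FakeGeoData is a hypothetical stand-in for GeoRTreeData that only counts how often it is constructed, and CachedUdf is a hypothetical stand-in for the UDF class that models nothing but the caching logic.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for GeoRTreeData; counts how often the "file" is loaded. */
final class FakeGeoData {
    static final AtomicInteger LOADS = new AtomicInteger();

    FakeGeoData() {
        LOADS.incrementAndGet(); // in the real UDF this would be the ~5 s JSON parse
    }
}

/** Hypothetical stand-in for the UDF class; only the caching logic is modeled. */
final class CachedUdf {
    private static volatile FakeGeoData data; // shared by every instance in this JVM

    private static synchronized void initData() {
        if (data == null) { // re-check under the lock
            data = new FakeGeoData();
        }
    }

    /** Mirrors the lazy check in initialize: load only if nobody has loaded yet. */
    void initialize() {
        if (data == null) {
            initData();
        }
    }

    FakeGeoData data() {
        return data;
    }
}

final class StaticCacheDemo {
    /** Initializes the given number of UDF objects on separate threads; returns the load count. */
    static int loadsAfter(int instances) {
        Thread[] threads = new Thread[instances];
        for (int i = 0; i < instances; i++) {
            threads[i] = new Thread(() -> new CachedUdf().initialize());
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return FakeGeoData.LOADS.get();
    }

    public static void main(String[] args) {
        System.out.println("loads = " + loadsAfter(8)); // prints "loads = 1"
    }
}
```

Eight instances are initialized concurrently, yet the expensive constructor runs once. Without the synchronized re-check, two threads that both see null could each start a load.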

7. Theoretical Support

The solution is explained from three perspectives:

HQL execution flow: how HQL is transformed into MapReduce tasks.

Generic UDF loading process: registration and usage of GenericUDF.

JVM static variable loading: why static fields enable sharing of the large Geo data across UDF instances.

1. HQL Execution Flow

Hive translates HQL into one or more MapReduce jobs. The process includes parsing (ANTLR), semantic analysis, logical plan generation, logical optimization (e.g., SimpleFetchOptimizer, MapJoinProcessor, BucketMapJoinOptimizer, GroupByOptimizer, ReduceSinkDeDuplication, PredicatePushDown, CorrelationOptimizer, ColumnPruner), physical plan generation, physical optimization, and finally execution on the Hadoop cluster.

Example

EXPLAIN
SELECT a.name,
       MatchDistrict(a.max_x||','||a.max_y,'formattedAddress','wgs84') as point,
       b.area_id
FROM parsed_area a
JOIN (SELECT name, area_id FROM parsed_area2) b
  ON a.name = b.name;

This query produces a three‑stage plan that contains two MapReduce jobs (Stage-4 and Stage-3). Stage-4 builds the hash table for the join and does not invoke the UDF. Stage-3 performs the map‑side join and passes the UDF parameters as an array to the output operator, where the UDF is finally executed.

2. Generic UDF Loading Process

UDFs are registered via CREATE TEMPORARY FUNCTION or CREATE FUNCTION. Registration creates a FunctionInfo by reflecting the GenericUDF class and storing it in the session’s registry. During query execution, each expression node creates an ExprNodeEvaluator that holds the UDF instance; its initialize method is called once per evaluator.
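The lifecycle described above can be modeled in a few lines. This is a deliberately simplified sketch with hypothetical classes (SketchUdf, SketchRegistry, CountingUdf); it does not use or reproduce Hive's FunctionInfo or ExprNodeEvaluator, and only illustrates the idea that the registry stores a class while each expression node gets its own instance that is initialized once.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, much-simplified model of a UDF; not Hive's GenericUDF. */
abstract class SketchUdf {
    abstract void initialize();
}

/** Counts constructions and initialize() calls so the lifecycle is observable. */
final class CountingUdf extends SketchUdf {
    static int instancesCreated = 0;
    static int initializeCalls = 0;

    CountingUdf() {
        instancesCreated++;
    }

    @Override
    void initialize() {
        initializeCalls++;
    }
}

/** Simplified session registry: function name -> implementing class. */
final class SketchRegistry {
    private final Map<String, Class<? extends SketchUdf>> functions = new HashMap<>();

    void register(String name, Class<? extends SketchUdf> udfClass) {
        functions.put(name, udfClass);
    }

    /** One evaluator per expression node: a new instance, initialized exactly once. */
    SketchUdf newEvaluator(String name) {
        Class<? extends SketchUdf> udfClass = functions.get(name);
        if (udfClass == null) {
            throw new IllegalArgumentException("unknown function: " + name);
        }
        try {
            SketchUdf udf = udfClass.getDeclaredConstructor().newInstance();
            udf.initialize();
            return udf;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + udfClass, e);
        }
    }
}

final class LifecycleDemo {
    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();
        registry.register("MatchDistrict", CountingUdf.class);

        // A query that references the function twice gets two evaluators.
        registry.newEvaluator("MatchDistrict");
        registry.newEvaluator("MatchDistrict");

        System.out.println(CountingUdf.instancesCreated + " instances, "
                + CountingUdf.initializeCalls + " initialize calls");
    }
}
```

Run on its own, the demo reports two instances and two initialize calls for a single registration. Any state kept in instance fields is therefore rebuilt for every occurrence of the function, which is why an expensive resource held per instance is loaded repeatedly.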

3. JVM Static Property Loading

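Static fields belong to the class rather than to any instance (conceptually, they live with the class data in the method area). Once the UDF class has been loaded in a JVM, every UDF instance and every thread in that JVM sees the same static GeoRTreeData, so the file is normally parsed once per JVM rather than once per use of the function. The sharing does not extend across JVMs: each task JVM that runs the UDF loads its own copy. Lazy loading in initialize avoids unnecessary resource consumption when the function is registered but never used.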
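The JVM also initializes a class lazily, at most once per class loader, and under its own lock. An alternative to the explicit null check is therefore the initialization-on-demand holder idiom, sketched below. This is an illustration of the JVM behavior rather than the approach the article uses, and GeoIndex is a hypothetical stand-in for the parsed Geo data.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class HolderIdiomDemo {
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical stand-in for the parsed Geo data. */
    static final class GeoIndex {
        GeoIndex() {
            LOADS.incrementAndGet(); // the expensive load would happen here
        }
    }

    /** The JVM does not initialize this class until get() first reads INSTANCE. */
    private static final class Holder {
        static final GeoIndex INSTANCE = new GeoIndex();
    }

    static GeoIndex get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println("loads before first use: " + LOADS.get()); // 0
        GeoIndex first = get();
        GeoIndex second = get();
        System.out.println("loads after two uses: " + LOADS.get());   // 1
        System.out.println("same object: " + (first == second));      // true
    }
}
```

The idiom needs no volatile field or synchronized block. Its drawback is error handling: if the load throws inside the static initializer, the holder class is marked as failed and later calls get NoClassDefFoundError, whereas the explicit null check can simply retry.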

8. Horizontal Extension

Future UDFs that need to load files or request network resources should preload them into static objects to avoid repeated loading.
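One way to apply this across several UDFs is a small shared helper instead of a hand-written static field in each class. The sketch below is a suggestion under assumptions: StaticResourceCache and its getOrLoad method are hypothetical names, and the loader in the demo merely simulates reading a file or calling a service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: JVM-wide cache of expensive resources, keyed by name. */
final class StaticResourceCache {
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    private StaticResourceCache() {
    }

    /**
     * Returns the cached value for key, running loader at most once per key.
     * Callers must use one value type per key, because the cast is unchecked.
     */
    @SuppressWarnings("unchecked")
    static <T> T getOrLoad(String key, Function<String, T> loader) {
        return (T) CACHE.computeIfAbsent(key, loader);
    }
}

final class ResourceCacheDemo {
    static int loads = 0;

    /** Stands in for reading a file or calling a remote service. */
    static String loadDictionary(String resourceName) {
        loads++;
        return "contents of " + resourceName;
    }

    public static void main(String[] args) {
        String a = StaticResourceCache.getOrLoad("dict.txt", ResourceCacheDemo::loadDictionary);
        String b = StaticResourceCache.getOrLoad("dict.txt", ResourceCacheDemo::loadDictionary);
        // prints "contents of dict.txt / loads = 1 / same = true"
        System.out.println(a + " / loads = " + loads + " / same = " + (a == b));
    }
}
```

ConcurrentHashMap.computeIfAbsent applies the loader atomically per key, so concurrent callers for the same resource wait for a single load. Cached entries are never evicted in this sketch, so it suits read-only reference data that fits comfortably in the task heap.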

9. Remarks

Testing showed that setting Hive’s heap size to 1 GB prevents OOM when loading the UDF; the heap size must not exceed the Hadoop client’s -Xmx limit.

For MapReduce, set the heap size of each task JVM via

<property><name>mapred.child.java.opts</name><value>-Xmx512m</value></property>

On Tez, avoid packaging Hive‑exec dependencies inside the UDF JAR to prevent runtime issues.

10. Future Outlook

Potential extensions include Hive on Tez, Hive on Spark, and developing UDAFs/UDTFs.

11. References

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateFunction

https://cwiki.apache.org/confluence/display/Hive/HivePlugins#HivePlugins-DeployingJarsforUserDefinedFunctionsandUserDefinedSerDes

https://blog.csdn.net/moon_yang_bj/article/details/31744381

https://www.jianshu.com/p/660fd157c5eb

https://www.cnblogs.com/nashiyue/p/5751102.html
