How to Accelerate Hive UDFs by Caching Large Geo Data: A 140× Speed Boost
This article compares two implementation strategies for a Hive UDF that converts coordinates to administrative districts, explains why repeatedly loading a 157 MB Geo data file made queries slow, and presents a static-cache solution that reduces query time from seconds to milliseconds, roughly a 140-fold speed increase.
1. Background
The mapping team needed a custom Hive UDF to convert latitude/longitude coordinates to administrative districts. Example usage:
hive> select MatchDistrict("113.2222,24.33333", "formattedAddress")
-- returns: 中华人民共和国-广东省-肇庆市-四会市 (People's Republic of China, Guangdong Province, Zhaoqing City, Sihui City)
2. Preliminary Research Plan
Two possible solutions were identified:
Scheme 1: Provide a conversion service and let the custom UDF call this service.
Scheme 2: Encapsulate the conversion method directly inside the UDF.
Comparison of the two schemes:
Scheme | Advantages | Disadvantages
Scheme 1 | Simple packaging, fast implementation, smaller UDF JAR. | Depends on an external service; a service outage makes the UDF unavailable; network overhead reduces performance.
Scheme 2 | High performance, no external dependencies. | Complex implementation; technical challenges in handling offline resources.
For a platform, stability and performance are paramount; therefore Scheme 2 was chosen.
3. Implementation Method
4. Technical Difficulties
The mapping team provides the conversion method in GeoRTreeData.java, which requires loading an external resource, lbs_geo_data.json (157.4 MB). Loading this file locally takes about 5 seconds. If a query uses the UDF twice, total latency exceeds 10 seconds, because the resource is initialized again for every use of the UDF and the same file is loaded repeatedly for no benefit.
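The repeated-loading pattern described above can be reproduced with a small, self-contained simulation. This is only a sketch: `FakeGeoData`, `NaiveUdf`, and `NaiveLoadingDemo` are hypothetical stand-ins (not the mapping team's GeoRTreeData or Hive's GenericUDF), and a counter replaces the 5-second file load so that the effect is deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive resource such as lbs_geo_data.json. */
class FakeGeoData {
    static final AtomicInteger LOADS = new AtomicInteger();

    FakeGeoData() {
        // In the real UDF, the slow file parse would happen here.
        LOADS.incrementAndGet();
    }
}

/** Naive UDF shape: every instance loads its own copy in initialize(). */
class NaiveUdf {
    private FakeGeoData data; // instance field, so nothing is shared

    void initialize() {
        data = new FakeGeoData();
    }
}

class NaiveLoadingDemo {
    /** Simulates a query that references the UDF `calls` times. */
    static int loadsFor(int calls) {
        FakeGeoData.LOADS.set(0);
        for (int i = 0; i < calls; i++) {
            new NaiveUdf().initialize(); // one UDF instance per reference
        }
        return FakeGeoData.LOADS.get();
    }

    public static void main(String[] args) {
        System.out.println("loads for 2 UDF references: " + loadsFor(2));
    }
}
```

Running `main` prints `loads for 2 UDF references: 2`: the number of loads grows with the number of UDF references, which matches the latency growth observed with the real 157.4 MB file.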
5. Final Effect
After optimization, the resource is loaded only on the first UDF call; subsequent calls reuse the cached data. The first query takes about 6 seconds, while later queries take roughly 70 ms, a performance improvement of about 140×, allowing 300 million rows to be processed in just over 30 seconds.
6. Solution
Define GeoRTreeData as a static field in the UDF class and load the resource lazily in initialize, only if it has not been loaded yet. The implementation is straightforward, though the underlying principle is more complex.
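import java.io.InputStreamReader;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
// GeoRTreeData is the mapping team's class; its package is not given in the source.

public class DistrictMatch extends GenericUDF {
    // Shared by every DistrictMatch instance in the same JVM.
    // volatile makes the loaded object safely visible to other threads.
    public static volatile GeoRTreeData rTreeData;

    // synchronized so that concurrent initialize() calls load the file only once.
    private static synchronized void initRTreeData() {
        if (rTreeData != null) {
            return;
        }
        InputStreamReader is = new InputStreamReader(DistrictMatch.class.getClassLoader()
                .getResourceAsStream("lbs_geo_data.json"));
        GeoRTreeData data = new GeoRTreeData();
        data.init(is);
        rTreeData = data; // publish only after loading has finished
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentException("method need 2 params");
        }
        if (!arguments[0].getCategory().equals(ObjectInspector.Category.PRIMITIVE)) {
            throw new UDFArgumentException("expected String, but got: " + arguments[0].getTypeName());
        }
        if (!arguments[1].getCategory().equals(ObjectInspector.Category.PRIMITIVE)) {
            throw new UDFArgumentException("expected String, but got: " + arguments[1].getTypeName());
        }
        // Lazy load: only the first initialize() in this JVM pays the loading cost.
        if (rTreeData == null) {
            initRTreeData();
        }
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }
    // evaluate(...) and getDisplayString(...) are not shown in the source.
}
7. Theoretical Support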
The solution is explained from three perspectives:
HQL execution flow: how HQL is transformed into MapReduce tasks.
Generic UDF loading process: registration and usage of GenericUDF.
JVM static variable loading: why static fields enable sharing of the large Geo data across UDF instances.
1. HQL Execution Flow
Hive translates HQL into one or more MapReduce jobs. The process includes parsing (ANTLR), semantic analysis, logical plan generation, logical optimization (e.g., SimpleFetchOptimizer, MapJoinProcessor, BucketMapJoinOptimizer, GroupByOptimizer, ReduceSinkDeDuplication, PredicatePushDown, CorrelationOptimizer, ColumnPruner), physical plan generation, physical optimization, and finally execution on the Hadoop cluster.
Example
EXPLAIN SELECT a.name, MatchDistrict(a.max_x||','||a.max_y,'formattedAddress','wgs84') as point, b.area_id FROM parsed_area a JOIN (SELECT name, area_id FROM parsed_area2) b ON a.name = b.name;
This produces a three-stage plan with two MapReduce jobs (Stage4 and Stage3). Stage4 performs a hash-table build without invoking the UDF; Stage3 performs a map-side join and passes the UDF parameters as an array to the output operator, where the UDF is finally executed.
2. Generic UDF Loading Process
UDFs are registered via CREATE TEMPORARY FUNCTION or CREATE FUNCTION. Registration creates a FunctionInfo by reflecting the GenericUDF class and storing it in the session’s registry. During query execution, each expression node creates an ExprNodeEvaluator that holds the UDF instance; its initialize method is called once per evaluator.
3. JVM Static Property Loading
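Static fields belong to the class itself rather than to any instance (the JVM specification places class data in the method area), so within one JVM they are shared by all threads and by every UDF instance created from the same loaded class. This is what allows multiple UDF instances to reuse the same GeoRTreeData. Lazy loading in initialize avoids unnecessary resource consumption when the function is registered but never used.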
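Because a static field is visible to every thread in the JVM, the lazy load should also be safe when several UDF instances are initialized at once. The following is a minimal sketch of one common way to do that (double-checked locking on a volatile field). `SharedGeoData`, `CachedUdf`, and `StaticCacheDemo` are hypothetical names used only for illustration, and a counter stands in for the real file load.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the expensive Geo data; counts how often it is built. */
class SharedGeoData {
    static final AtomicInteger LOADS = new AtomicInteger();

    SharedGeoData() {
        LOADS.incrementAndGet();
    }
}

class CachedUdf {
    // volatile is required for double-checked locking to be safe.
    private static volatile SharedGeoData data;

    static SharedGeoData data() {
        SharedGeoData local = data;
        if (local == null) {                     // first check, no lock
            synchronized (CachedUdf.class) {
                local = data;
                if (local == null) {             // second check, under the lock
                    local = new SharedGeoData();
                    data = local;
                }
            }
        }
        return local;
    }

    void initialize() {
        data(); // every instance ends up with the same shared object
    }
}

class StaticCacheDemo {
    /** Starts `threads` threads that each initialize their own UDF instance. */
    static int loadsWithThreads(int threads) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                new CachedUdf().initialize();
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return SharedGeoData.LOADS.get();
    }

    public static void main(String[] args) {
        System.out.println("loads with 8 threads: " + loadsWithThreads(8));
    }
}
```

Running `main` prints `loads with 8 threads: 1`. A plain `synchronized` static method gives the same guarantee with less code; since initialize runs only once per evaluator, the extra locking cost is unlikely to matter.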
8. Horizontal Extension
Future UDFs that need to load files or request network resources should preload them into static objects to avoid repeated loading.
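One way to reuse this pattern across several UDFs is a small keyed cache held in a static map. The sketch below is an assumption about how such a helper could look, not part of the original implementation: `StaticResourceCache`, `ResourceCacheDemo`, and `loadDictionary` are hypothetical names, and the loader only builds a string instead of reading a file or calling a network service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: loads each named resource at most once per JVM. */
final class StaticResourceCache {
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    private StaticResourceCache() {
    }

    @SuppressWarnings("unchecked")
    static <T> T get(String key, Function<String, T> loader) {
        // ConcurrentHashMap.computeIfAbsent applies the loader at most once per key.
        return (T) CACHE.computeIfAbsent(key, loader);
    }
}

class ResourceCacheDemo {
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Stand-in for an expensive load (file parse, network request, ...). */
    static String loadDictionary(String name) {
        LOADS.incrementAndGet();
        return "contents of " + name;
    }

    public static void main(String[] args) {
        String a = StaticResourceCache.get("dict.txt", ResourceCacheDemo::loadDictionary);
        String b = StaticResourceCache.get("dict.txt", ResourceCacheDemo::loadDictionary);
        System.out.println((a == b) + ", loads=" + LOADS.get());
    }
}
```

Running `main` prints `true, loads=1`. Cached objects stay in memory for the lifetime of the JVM, so this approach suits read-only data that fits comfortably in the configured heap.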
9. Remarks
Testing showed that setting Hive’s heap size to 1 GB prevents OOM when loading the UDF; the heap size must not exceed the Hadoop client’s -Xmx limit.
For MapReduce, configure the JVM options (including the maximum heap) of each task via the mapred.child.java.opts property:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
On Tez, avoid packaging hive-exec dependencies inside the UDF JAR to prevent runtime issues.
10. Future Outlook
Potential extensions include Hive on Tez, Hive on Spark, and developing UDAFs/UDTFs.
